diff --git a/.github/workflows/cla.yml b/.github/workflows/cla.yml
new file mode 100644
index 000000000..b1b31a476
--- /dev/null
+++ b/.github/workflows/cla.yml
@@ -0,0 +1,29 @@
+name: "DCO Assistant"
+on:
+  issue_comment:
+    types: [created]
+  pull_request_target:
+    types: [opened,closed,synchronize]
+
+permissions:
+  actions: write
+  contents: write
+  pull-requests: write
+  statuses: write
+
+jobs:
+  DCOAssistant:
+    runs-on: ubuntu-latest
+    steps:
+      - name: "DCO Assistant"
+        if: (github.event.comment.body == 'recheck' || github.event.comment.body == 'I have read the DCO Document and I hereby sign the DCO') || github.event_name == 'pull_request_target'
+        uses: contributor-assistant/github-action@v2.3.0
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        with:
+          path-to-signatures: '.github/dco/signatures.json'
+          path-to-document: 'https://developercertificate.org/'
+          branch: 'dco-do-not-remove'
+          allowlist: user1,bot*
+          use-dco-flag: true
+          custom-notsigned-prcomment: '
+            Thank you for your submission. Before we can accept your contribution, please sign our [Developer Certificate of Origin](https://developercertificate.org) by posting a comment with the content exactly as below.
+            '
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/Dockerfile b/DGLPyTorch/DrugDiscovery/SE3Transformer/Dockerfile
index c9d19da03..5f3f2c4fd 100644
--- a/DGLPyTorch/DrugDiscovery/SE3Transformer/Dockerfile
+++ b/DGLPyTorch/DrugDiscovery/SE3Transformer/Dockerfile
@@ -24,7 +24,7 @@
# run docker daemon with --default-runtime=nvidia for GPU detection during build
# multistage build for DGL with CUDA and FP16
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.08-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:23.01-py3
FROM ${FROM_IMAGE_NAME} AS dgl_builder
@@ -33,7 +33,7 @@ RUN apt-get update \
&& apt-get install -y git build-essential python3-dev make cmake \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /dgl
-RUN git clone --branch 0.9.0 --recurse-submodules --depth 1 https://github.com/dmlc/dgl.git .
+RUN git clone --branch 1.0.0 --recurse-submodules --depth 1 https://github.com/dmlc/dgl.git .
WORKDIR build
RUN export NCCL_ROOT=/usr \
&& cmake .. -GNinja -DCMAKE_BUILD_TYPE=Release \
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/README.md b/DGLPyTorch/DrugDiscovery/SE3Transformer/README.md
index ea0265038..60e95d65d 100644
--- a/DGLPyTorch/DrugDiscovery/SE3Transformer/README.md
+++ b/DGLPyTorch/DrugDiscovery/SE3Transformer/README.md
@@ -252,9 +252,9 @@ The following section lists the requirements that you need to meet in order to s
### Requirements
-This repository contains a Dockerfile which extends the PyTorch 21.07 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
+This repository contains a Dockerfile which extends the PyTorch 23.01 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-- PyTorch 21.07+ NGC container
+- PyTorch 23.01+ NGC container
- Supported GPUs:
- [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
- [NVIDIA Turing architecture](https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/)
@@ -285,12 +285,12 @@ To train your model using mixed or TF32 precision with Tensor Cores or FP32, per
3. Start an interactive session in the NGC container to run training/inference.
```
mkdir -p results
- docker run -it --runtime=nvidia --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 --rm -v ${PWD}/results:/results se3-transformer:latest
+ docker run -it --runtime=nvidia --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 --rm -v ${PWD}/results:/workspace/se3-transformer/results se3-transformer:latest
```
4. Start training.
```
- bash scripts/train.sh
+ bash scripts/train.sh # or scripts/train_multi_gpu.sh
```
5. Start inference/predictions.
@@ -474,7 +474,7 @@ The following sections provide details on how we achieved our performance and ac
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
-Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 21.07 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
+Our results were obtained by running the `scripts/train.sh` and `scripts/train_multi_gpu.sh` training scripts in the PyTorch 23.01 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
| GPUs | Batch size / GPU | Absolute error - TF32 | Absolute error - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (mixed precision to TF32) |
|:----:|:----------------:|:---------------------:|:--------------------------------:|:--------------------:|:-------------------------------:|:-----------------------------------------------:|
@@ -484,7 +484,7 @@ Our results were obtained by running the `scripts/train.sh` training script in t
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
-Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
+Our results were obtained by running the `scripts/train.sh` and `scripts/train_multi_gpu.sh` training scripts in the PyTorch 23.01 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
| GPUs | Batch size / GPU | Absolute error - FP32 | Absolute error - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (mixed precision to FP32) |
|:----:|:----------------:|:---------------------:|:--------------------------------:|:--------------------:|:-------------------------------:|:-----------------------------------------------:|
@@ -497,14 +497,14 @@ Our results were obtained by running the `scripts/train.sh` training script in t
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
-Our results were obtained by running the `scripts/benchmark_train.sh` and `scripts/benchmark_train_multi_gpu.sh` benchmarking scripts in the PyTorch 21.07 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.
+Our results were obtained by running the `scripts/benchmark_train.sh` and `scripts/benchmark_train_multi_gpu.sh` benchmarking scripts in the PyTorch 23.01 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.
| GPUs | Batch size / GPU | Throughput - TF32 [mol/ms] | Throughput - mixed precision [mol/ms] | Throughput speedup (mixed precision - TF32) | Weak scaling - TF32 | Weak scaling - mixed precision |
|:----------------:|:-------------------:|:--------------------------:|:-------------------------------------:|:-------------------------------------------:|:-------------------:|:------------------------------:|
-| 1 | 240 | 2.61 | 3.35 | 1.28x | | |
-| 1 | 120 | 1.94 | 2.07 | 1.07x | | |
-| 8 | 240 | 18.80 | 23.90 | 1.27x | 7.20 | 7.13 |
-| 8 | 120 | 14.10 | 14.52 | 1.03x | 7.27 | 7.01 |
+| 1 | 240 | 2.59 | 3.23 | 1.25x | | |
+| 1 | 120 | 1.89 | 1.89 | 1.00x | | |
+| 8 | 240 | 18.38 | 21.42 | 1.17x | 7.09 | 6.63 |
+| 8 | 120 | 13.23 | 13.23 | 1.00x | 7.00 | 7.00 |
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
@@ -512,14 +512,14 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
-Our results were obtained by running the `scripts/benchmark_train.sh` and `scripts/benchmark_train_multi_gpu.sh` benchmarking scripts in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.
+Our results were obtained by running the `scripts/benchmark_train.sh` and `scripts/benchmark_train_multi_gpu.sh` benchmarking scripts in the PyTorch 23.01 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.
| GPUs | Batch size / GPU | Throughput - FP32 [mol/ms] | Throughput - mixed precision [mol/ms] | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|:----------------:|:--------------------:|:--------------------------:|:--------------------------------------:|:-------------------------------------------:|:-------------------:|:------------------------------:|
-| 1 | 240 | 1.33 | 2.12 | 1.59x | | |
-| 1 | 120 | 1.11 | 1.45 | 1.31x | | |
-| 8 | 240 | 9.32 | 13.40 | 1.44x | 7.01 | 6.32 |
-| 8 | 120 | 6.90 | 8.39 | 1.22x | 6.21 | 5.79 |
+| 1 | 240 | 1.23 | 1.91 | 1.55x | | |
+| 1 | 120 | 1.01 | 1.23 | 1.22x | | |
+| 8 | 240 | 8.44 | 11.28 | 1.34x | 6.8 | 5.90 |
+| 8 | 120 | 6.06 | 7.36 | 1.21x | 6.00 | 5.98 |
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
@@ -530,23 +530,23 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic
##### Inference performance: NVIDIA DGX A100 (1x A100 80GB)
-Our results were obtained by running the `scripts/benchmark_inference.sh` inferencing benchmarking script in the PyTorch 21.07 NGC container on NVIDIA DGX A100 with 1x A100 80GB GPU.
+Our results were obtained by running the `scripts/benchmark_inference.sh` inferencing benchmarking script in the PyTorch 23.01 NGC container on NVIDIA DGX A100 with 1x A100 80GB GPU.
AMP
| Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|:----------:|:-----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
-| 1600 | 13.54 | 121.44 | 118.07 | 119.00 | 366.64 |
-| 800 | 12.63 | 64.11 | 63.78 | 64.37 | 68.19 |
-| 400 | 10.65 | 37.97 | 39.02 | 39.67 | 42.87 |
+| 1600 | 9.71 | 175.2 | 190.2 | 191.8 | 432.4 |
+| 800 | 7.90 | 114.5 | 134.3 | 135.8 | 140.2 |
+| 400 | 7.18 | 75.49 | 108.6 | 109.6 | 113.2 |
TF32
| Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|:----------:|:-----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
-| 1600 | 8.97 | 180.85 | 178.31 | 178.92 | 375.33 |
-| 800 | 8.86 | 90.76 | 90.77 | 91.11 | 92.96 |
-| 400 | 8.49 | 47.42 | 47.65 | 48.15 | 50.74 |
+| 1600 | 8.19 | 198.2 | 206.8 | 208.5 | 377.0 |
+| 800 | 7.56 | 107.5 | 119.6 | 120.5 | 125.7 |
+| 400 | 6.97 | 59.8 | 75.1 | 75.7 | 81.3 |
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
@@ -554,23 +554,23 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
-Our results were obtained by running the `scripts/benchmark_inference.sh` inferencing benchmarking script in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU.
+Our results were obtained by running the `scripts/benchmark_inference.sh` inferencing benchmarking script in the PyTorch 23.01 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU.
AMP
| Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|:----------:|:-----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
-| 1600 | 6.59 | 248.02 | 242.11 | 242.62 | 674.60 |
-| 800 | 6.38 | 126.49 | 125.96 | 126.31 | 127.72 |
-| 400 | 5.90 | 68.24 | 68.53 | 69.02 | 70.87 |
+| 1600 | 5.39 | 306.6 | 321.2 | 324.9 | 819.1 |
+| 800 | 4.67 | 179.8 | 201.5 | 203.8 | 213.3 |
+| 400 | 4.25 | 108.2 | 142.0 | 143.0 | 149.0 |
FP32
| Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|:----------:|:-----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
-| 1600 | 3.33 | 482.20 | 483.50 | 485.28 | 754.84 |
-| 800 | 3.35 | 239.09 | 242.21 | 243.13 | 244.91 |
-| 400 | 3.27 | 122.68 | 123.60 | 124.18 | 125.85 |
+| 1600 | 3.14 | 510.9 | 518.83 | 521.1 | 808.0 |
+| 800 | 3.10 | 258.7 | 269.4 | 271.1 | 278.9 |
+| 400 | 2.93 | 137.3 | 147.5 | 148.8 | 151.7 |
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
@@ -580,6 +580,10 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic
### Changelog
+February 2023:
+- Upgraded base container
+- Fixed benchmarking code
+
August 2022:
- Slight performance improvements
- Upgraded base container
@@ -604,3 +608,4 @@ August 2021
### Known issues
If you encounter `OSError: [Errno 12] Cannot allocate memory` during the Dataloader iterator creation (more precisely during the `fork()`), this is most likely due to the use of the `--precompute_bases` flag. If you cannot add more RAM or Swap to your machine, it is recommended to turn off bases precomputation by removing the `--precompute_bases` flag or using `--precompute_bases false`.
+
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/scripts/benchmark_train.sh b/DGLPyTorch/DrugDiscovery/SE3Transformer/scripts/benchmark_train.sh
index 5bcd707a9..fa7d89786 100755
--- a/DGLPyTorch/DrugDiscovery/SE3Transformer/scripts/benchmark_train.sh
+++ b/DGLPyTorch/DrugDiscovery/SE3Transformer/scripts/benchmark_train.sh
@@ -8,7 +8,7 @@ AMP=${2:-true}
CUDA_VISIBLE_DEVICES=0 python -m se3_transformer.runtime.training \
--amp "$AMP" \
--batch_size "$BATCH_SIZE" \
- --epochs 6 \
+ --epochs 16 \
--use_layer_norm \
--norm \
--save_ckpt_path model_qm9.pth \
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/scripts/benchmark_train_multi_gpu.sh b/DGLPyTorch/DrugDiscovery/SE3Transformer/scripts/benchmark_train_multi_gpu.sh
index fc371490b..632dc04e9 100755
--- a/DGLPyTorch/DrugDiscovery/SE3Transformer/scripts/benchmark_train_multi_gpu.sh
+++ b/DGLPyTorch/DrugDiscovery/SE3Transformer/scripts/benchmark_train_multi_gpu.sh
@@ -9,7 +9,7 @@ python -m torch.distributed.run --nnodes=1 --nproc_per_node=gpu --max_restarts 0
se3_transformer.runtime.training \
--amp "$AMP" \
--batch_size "$BATCH_SIZE" \
- --epochs 6 \
+ --epochs 16 \
--use_layer_norm \
--norm \
--save_ckpt_path model_qm9.pth \
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/model/layers/convolution.py b/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/model/layers/convolution.py
index fc46961a8..69d7b6f02 100644
--- a/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/model/layers/convolution.py
+++ b/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/model/layers/convolution.py
@@ -113,7 +113,7 @@ def __init__(
nn.Linear(mid_dim, num_freq * channels_in * channels_out, bias=False)
]
- self.net = nn.Sequential(*[m for m in modules if m is not None])
+ self.net = torch.jit.script(nn.Sequential(*[m for m in modules if m is not None]))
def forward(self, features: Tensor) -> Tensor:
return self.net(features)
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/model/layers/norm.py b/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/model/layers/norm.py
index d1dd1a7da..ba83aee06 100644
--- a/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/model/layers/norm.py
+++ b/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/model/layers/norm.py
@@ -32,6 +32,15 @@
from se3_transformer.model.fiber import Fiber
+@torch.jit.script
+def clamped_norm(x, clamp: float):
+    return x.norm(p=2, dim=-1, keepdim=True).clamp(min=clamp)
+
+@torch.jit.script
+def rescale(x, norm, new_norm):
+    return x / norm * new_norm
+
+
class NormSE3(nn.Module):
"""
Norm-based SE(3)-equivariant nonlinearity.
@@ -63,7 +72,7 @@ def forward(self, features: Dict[str, Tensor], *args, **kwargs) -> Dict[str, Ten
output = {}
if hasattr(self, 'group_norm'):
# Compute per-degree norms of features
- norms = [features[str(d)].norm(dim=-1, keepdim=True).clamp(min=self.NORM_CLAMP)
+ norms = [clamped_norm(features[str(d)], self.NORM_CLAMP)
for d in self.fiber.degrees]
fused_norms = torch.cat(norms, dim=-2)
@@ -73,11 +82,11 @@ def forward(self, features: Dict[str, Tensor], *args, **kwargs) -> Dict[str, Ten
# Scale features to the new norms
for norm, new_norm, d in zip(norms, new_norms, self.fiber.degrees):
- output[str(d)] = features[str(d)] / norm * new_norm
+ output[str(d)] = rescale(features[str(d)], norm, new_norm)
else:
for degree, feat in features.items():
- norm = feat.norm(dim=-1, keepdim=True).clamp(min=self.NORM_CLAMP)
+ norm = clamped_norm(feat, self.NORM_CLAMP)
new_norm = self.nonlinearity(self.layer_norms[degree](norm.squeeze(-1)).unsqueeze(-1))
- output[degree] = new_norm * feat / norm
+ output[degree] = rescale(feat, norm, new_norm)
return output
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/runtime/arguments.py b/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/runtime/arguments.py
index 2ae115e37..d16e1617c 100644
--- a/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/runtime/arguments.py
+++ b/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/runtime/arguments.py
@@ -33,7 +33,7 @@
paths = PARSER.add_argument_group('Paths')
paths.add_argument('--data_dir', type=pathlib.Path, default=pathlib.Path('./data'),
help='Directory where the data is located or should be downloaded')
-paths.add_argument('--log_dir', type=pathlib.Path, default=pathlib.Path('/results'),
+paths.add_argument('--log_dir', type=pathlib.Path, default=pathlib.Path('./results'),
help='Directory where the results logs should be saved')
paths.add_argument('--dllogger_name', type=str, default='dllogger_results.json',
help='Name for the resulting DLLogger JSON file')
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/runtime/callbacks.py b/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/runtime/callbacks.py
index 112906a27..a7c5a9f48 100644
--- a/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/runtime/callbacks.py
+++ b/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/runtime/callbacks.py
@@ -133,6 +133,7 @@ def __init__(self, logger, batch_size: int, warmup_epochs: int = 1, mode: str =
def on_batch_start(self):
if self.epoch >= self.warmup_epochs:
+ torch.cuda.synchronize()
self.timestamps.append(time.time() * 1000.0)
def _log_perf(self):
@@ -153,7 +154,7 @@ def on_fit_end(self):
def process_performance_stats(self):
timestamps = np.asarray(self.timestamps)
deltas = np.diff(timestamps)
- throughput = (self.batch_size / deltas).mean()
+ throughput = self.batch_size / deltas.mean()
stats = {
f"throughput_{self.mode}": throughput,
f"latency_{self.mode}_mean": deltas.mean(),
diff --git a/JAX/Classification/README.md b/JAX/Classification/README.md
new file mode 100644
index 000000000..83427543b
--- /dev/null
+++ b/JAX/Classification/README.md
@@ -0,0 +1,46 @@
+# Image Classification
+
+Image classification is the task of categorizing an image into one of several predefined classes, often also giving a probability of the input belonging to a certain class. This task is crucial in understanding and analyzing images, and it comes quite effortlessly to human beings with our complex visual systems. The most powerful image classification models today are built using some form of Convolutional Neural Networks (CNNs), which are also the backbone of many other tasks in Computer Vision.
+
+
+
+[Source](https://github.com/NVlabs/stylegan)
+
+In this overview, we will cover
+- Types of Image Classification
+- How does it work?
+- How is the performance evaluated?
+- Use cases and applications
+- Where to get started
+
+---
+## Types of Image Classification
+Image Classification can be broadly divided into either Binary or Multi-class problems depending on the number of categories. Binary image classification problems entail predicting one of two classes. An example of this would be to predict whether an image is that of a dog or not. A subtly different problem is that of single-class (one vs all) classification, where the goal is to recognize data from one class and reject all others. This is beneficial when there is an overabundance of data from one of the classes, also called a class imbalance.
+
+
+
+In Multi-class classification problems, models categorize instances into one of three or more categories. Multi-class models often also return confidence scores (or probabilities) of an image belonging to each of the possible classes. This should not be confused with multi-label classification, where a model assigns multiple labels to an instance.
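+
+As a rough illustration (not code from this repository), the difference shows up in how raw scores are turned into probabilities: a softmax makes the classes compete for a single label, while independent sigmoids let several labels be accepted at once.
+
+```python
+import numpy as np
+
+logits = np.array([2.0, 0.5, -1.0])           # raw scores for three classes
+
+# Multi-class: softmax forces the classes to compete; probabilities sum to 1.
+softmax = np.exp(logits) / np.exp(logits).sum()
+
+# Multi-label: one independent sigmoid per class; each label is thresholded on its own.
+sigmoid = 1 / (1 + np.exp(-logits))
+
+print(softmax.round(2))   # [0.79 0.18 0.04] -> pick the single best class
+print(sigmoid.round(2))   # [0.88 0.62 0.27] -> accept every class above a threshold
+```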
+
+---
+## How is the performance evaluated?
+Image Classification performance is often reported as Top-1 or Top-5 scores. In top-1 score, classification is considered correct if the top predicted class (with the highest predicted probability) matches the true class for a given instance. In top-5, we check if one of the top 5 predictions matches the true class. The score is just the number of correct predictions divided by the total number of instances evaluated.
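+
+As a minimal sketch (not tied to any framework in this repository), top-k accuracy can be computed directly from a matrix of per-class scores:
+
+```python
+import numpy as np
+
+def topk_accuracy(scores, labels, k=1):
+    """Fraction of samples whose true class is among the k highest-scoring predictions."""
+    topk = np.argsort(scores, axis=1)[:, -k:]            # indices of the k largest scores per sample
+    hits = [label in row for row, label in zip(topk, labels)]
+    return sum(hits) / len(labels)
+
+# Toy scores for 3 images over 4 classes; the true classes are 2, 0 and 3.
+scores = np.array([[0.1, 0.2, 0.6, 0.1],
+                   [0.3, 0.4, 0.2, 0.1],
+                   [0.5, 0.3, 0.15, 0.05]])
+labels = [2, 0, 3]
+print(topk_accuracy(scores, labels, k=1))   # 0.333... (only the first image is correct)
+print(topk_accuracy(scores, labels, k=2))   # 0.666... (the second image is recovered at top-2)
+```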
+
+---
+## Use cases and applications
+### Categorizing Images in Large Visual Databases
+Businesses with visual databases may accumulate large amounts of images with missing tags or metadata. Unless there is an effective way to organize such images, they may not be of much use at all; worse, they may hog precious storage space. Automated image classification algorithms can sort such untagged images into predefined categories, sparing businesses expensive manual labor.
+
+A related task is that of Image Organization in smart devices like mobile phones. With Image Classification techniques, images and videos can be organized for improved accessibility.
+
+### Visual Search
+Visual Search, or image-based search, has risen in popularity over recent years. Many prominent search engines already provide this feature, letting users search for visual content similar to a provided image. This has many applications in the e-commerce and retail industry, where users can snap and upload a photo of a product they are interested in purchasing. This makes the shopping experience much more efficient for customers and can increase sales for businesses.
+
+
+### Healthcare
+Medical Imaging is about creating visual images of internal body parts for clinical purposes. This includes health monitoring, medical diagnosis, treatment, and keeping organized records. Image Classification algorithms can play a crucial role in Medical Imaging by assisting medical professionals in detecting the presence of illness and bringing consistency to clinical diagnosis.
+
+---
+## Getting started
+NVIDIA provides examples for JAX models on [Rosetta](https://github.com/NVIDIA/JAX-Toolbox/tree/main/rosetta/rosetta/projects). These examples provide you with easy-to-consume and highly optimized scripts for both training and inference. The quick start guide in our GitHub repository will help you set up the environment using NGC Docker images, download pre-trained models from NGC, and adapt the model training and inference for your application/use-case.
+
+These models are tested and maintained by NVIDIA, leveraging mixed precision using tensor cores on our latest GPUs for faster training times while maintaining accuracy.
diff --git a/JAX/Classification/ViT/README.md b/JAX/Classification/ViT/README.md
new file mode 100644
index 000000000..32dff1c47
--- /dev/null
+++ b/JAX/Classification/ViT/README.md
@@ -0,0 +1,2 @@
+# ViT on GPUs
+Please refer to [Rosetta ViT](https://github.com/NVIDIA/JAX-Toolbox/tree/main/rosetta/rosetta/projects/vit), NVIDIA's project that enables seamless training of LLMs, CV models and multimodal models in JAX, for information about running Vision Transformer models and experiments on GPUs.
diff --git a/JAX/LanguageModeling/PAXML/README.md b/JAX/LanguageModeling/PAXML/README.md
new file mode 100644
index 000000000..96b1dceac
--- /dev/null
+++ b/JAX/LanguageModeling/PAXML/README.md
@@ -0,0 +1,4 @@
+Paxml (aka Pax) is a framework for training LLMs. It allows for advanced and configurable experimentation and parallelization. It is based on [JAX](https://github.com/google/jax) and [Praxis](https://github.com/google/praxis).
+
+# PAXML on GPUs
+Please refer to [Rosetta PAXML](https://github.com/NVIDIA/JAX-Toolbox/tree/main/rosetta/rosetta/projects/pax), NVIDIA's project that enables seamless training of LLMs, CV models and multimodal models in JAX, for information about running models and experiments on GPUs in PAXML.
diff --git a/JAX/LanguageModeling/README.md b/JAX/LanguageModeling/README.md
new file mode 100644
index 000000000..d8f57e0af
--- /dev/null
+++ b/JAX/LanguageModeling/README.md
@@ -0,0 +1,90 @@
+# Language Modeling
+
+
+Language modeling (LM) is a natural language processing (NLP) task that determines the probability of a given sequence of words occurring in a sentence.
+
+In an era where computers, smartphones and other electronic devices increasingly need to interact with humans, language modeling has become an indispensable technique for teaching devices how to communicate in natural languages in human-like ways.
+
+But how does language modeling work? And what can you build with it? What are the different approaches, what are its potential benefits and limitations, and how might you use it in your business?
+
+In this guide, you’ll find answers to all of those questions and more. Whether you’re an experienced machine learning engineer considering implementation, a developer wanting to learn more, or a product manager looking to explore what’s possible with natural language processing and language modeling, this guide is for you.
+
+Here’s a look at what we’ll cover:
+
+- Language modeling – the basics
+- How does language modeling work?
+- Use cases and applications
+- Getting started
+
+
+## Language modeling – the basics
+
+### What is language modeling?
+
+"*Language modeling is the task of assigning a probability to sentences in a language. […]
+Besides assigning a probability to each sequence of words, the language models also assign a
+probability for the likelihood of a given word (or a sequence of words) to follow a sequence
+of words.*" Source: Page 105, [Neural Network Methods in Natural Language Processing](http://amzn.to/2wt1nzv), 2017.
+
+
+### Types of language models
+
+There are primarily two types of Language Models:
+
+- Statistical Language Models: These models use traditional statistical techniques like N-grams, Hidden Markov Models (HMM), and certain linguistic rules to learn the probability distribution of words.
+- Neural Language Models: These models use different kinds of neural networks to model language and have surpassed statistical language models in effectiveness.
+
+"*We provide ample empirical evidence to suggest that connectionist language models are
+superior to standard n-gram techniques, except their high computational (training)
+complexity.*" Source: [Recurrent neural network based language model](http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf), 2010.
+
+Given the superior performance of neural language models, we include in the container two popular state-of-the-art neural language models: BERT and Transformer-XL.
+
+### Why is language modeling important?
+
+Language modeling is fundamental in modern NLP applications. It enables machines to understand qualitative information, and enables people to communicate with machines in the natural languages that humans use to communicate with each other.
+
+Language modeling is used directly in a variety of industries, including tech, finance, healthcare, transportation, legal, military, government, and more -- in fact, you have probably already interacted with a language model today, whether through Google search, a voice assistant, or text autocomplete.
+
+
+## How does language modeling work?
+
+The roots of modern language modeling can be traced back to 1948, when Claude Shannon
+published a paper titled "A Mathematical Theory of Communication", laying the foundation for information theory and language modeling. In the paper, Shannon detailed the use of a stochastic model called the Markov chain to create a statistical model for the sequences of letters in English text. Markov models, along with n-grams, are still among the most popular statistical language models today.
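+
+To make the idea concrete, here is a toy sketch (not code from this repository) of a character-level Markov chain estimated from a short string, in the spirit of Shannon's letter-sequence model:
+
+```python
+from collections import Counter, defaultdict
+import random
+
+text = "the cat sat on the mat and the dog sat on the rug"
+
+# Estimate P(next character | current character) from the raw sequence.
+transitions = defaultdict(Counter)
+for current, nxt in zip(text, text[1:]):
+    transitions[current][nxt] += 1
+
+def sample_next(ch):
+    counts = transitions[ch]
+    chars, weights = zip(*counts.items())
+    return random.choices(chars, weights=weights)[0]
+
+# Generate 40 characters of vaguely English-looking text from the chain.
+ch, generated = "t", ["t"]
+for _ in range(40):
+    ch = sample_next(ch)
+    generated.append(ch)
+print("".join(generated))
+```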
+
+However, simple statistical language models have serious drawbacks in scalability and fluency because of their sparse representation of language. By representing language units (e.g., words, characters) as a non-linear, distributed combination of weights in continuous space, neural language models overcome this problem and can learn to approximate words without being misled by rare or unknown values.
+
+Therefore, as mentioned above, we introduce two popular state-of-the-art neural language models, BERT and Transformer-XL, in TensorFlow and PyTorch. More details can be found in the [NVIDIA Deep Learning Examples GitHub repository](https://github.com/NVIDIA/DeepLearningExamples).
+
+
+## Use cases and applications
+
+### Speech Recognition
+
+Imagine speaking a phrase to the phone, expecting it to convert the speech to text. How does
+it know if you said "recognize speech" or "wreck a nice beach"? Language models help figure it out
+based on the context, enabling machines to process and make sense of speech audio.
+
+
+### Spelling Correction
+
+Language-model-enabled spellcheckers can point out spelling errors and possibly suggest alternatives.
+
+
+### Machine translation
+
+Imagine you are translating the Chinese sentence "我在开车" into English. Your translation system gives you several choices:
+
+- I at open car
+- me at open car
+- I at drive
+- me at drive
+- I am driving
+- me am driving
+
+A language model tells you which translation sounds the most natural.
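+
+As a toy sketch of that reranking step (with a tiny made-up corpus standing in for a real language model, and no connection to the code in this repository), a smoothed bigram scorer already prefers the fluent candidate:
+
+```python
+from collections import Counter
+from math import log
+
+# A tiny English corpus stands in for the data a real language model is trained on.
+corpus = "i am driving to work . she is driving home . i am walking".split()
+unigrams = Counter(corpus)
+bigrams = Counter(zip(corpus, corpus[1:]))
+
+def lm_score(sentence, alpha=1.0):
+    """Sum of smoothed log bigram probabilities; higher means 'sounds more natural'."""
+    words = sentence.lower().split()
+    vocab = len(unigrams)
+    return sum(
+        log((bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab))
+        for prev, word in zip(words, words[1:])
+    )
+
+candidates = ["I at open car", "me at open car", "I am driving", "me am driving"]
+print(max(candidates, key=lm_score))   # -> "I am driving"
+```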
+
+## Getting started
+NVIDIA provides examples for JAX models on [Rosetta](https://github.com/NVIDIA/JAX-Toolbox/tree/main/rosetta/rosetta/projects). These examples provide you with easy-to-consume and highly optimized scripts for both training and inference. The quick start guide in our GitHub repository will help you set up the environment using NGC Docker images, download pre-trained models from NGC, and adapt the model training and inference for your application/use-case.
+
+These models are tested and maintained by NVIDIA, leveraging mixed precision using tensor cores on our latest GPUs for faster training times while maintaining accuracy.
diff --git a/JAX/LanguageModeling/T5X/README.md b/JAX/LanguageModeling/T5X/README.md
new file mode 100644
index 000000000..285717459
--- /dev/null
+++ b/JAX/LanguageModeling/T5X/README.md
@@ -0,0 +1,5 @@
+T5X is a framework for training, evaluation, and inference of sequence models (starting with language). It is based on [JAX](https://github.com/google/jax) and [Flax](https://github.com/google/flax). To learn more, see the [T5X Paper](https://arxiv.org/abs/2203.17189).
+
+# T5X on GPUs
+
+Please refer to [Rosetta T5X](https://github.com/NVIDIA/JAX-Toolbox/tree/main/rosetta/rosetta/projects/t5x), NVIDIA's project that enables seamless training of LLMs, CV models and multimodal models in JAX, for information about running models and experiments on GPUs in T5X.
diff --git a/JAX/MultiModal/Imagen/README.md b/JAX/MultiModal/Imagen/README.md
new file mode 100644
index 000000000..0de87b58b
--- /dev/null
+++ b/JAX/MultiModal/Imagen/README.md
@@ -0,0 +1,2 @@
+# Imagen on GPUs
+Please refer to [Rosetta Imagen](https://github.com/NVIDIA/JAX-Toolbox/tree/main/rosetta/rosetta/projects/imagen), NVIDIA's project that enables seamless training of LLMs, CV models and multimodal models in JAX, for information about running Imagen models and experiments on GPUs.
diff --git a/MxNet/Classification/RN50v1.5/dali.py b/MxNet/Classification/RN50v1.5/dali.py
index 13d044a56..6cd02f99f 100644
--- a/MxNet/Classification/RN50v1.5/dali.py
+++ b/MxNet/Classification/RN50v1.5/dali.py
@@ -31,7 +31,7 @@ def add_dali_args(parser):
group.add_argument('--dali-validation-threads', type=int, default=10, help="number of threads" +\
"per GPU for DALI for validation")
group.add_argument('--dali-prefetch-queue', type=int, default=5, help="DALI prefetch queue depth")
- group.add_argument('--dali-nvjpeg-memory-padding', type=int, default=256, help="Memory padding value for nvJPEG (in MB)")
+ group.add_argument('--dali-nvjpeg-memory-padding', type=int, default=64, help="Memory padding value for nvJPEG (in MB)")
group.add_argument('--dali-fuse-decoder', type=int, default=1, help="0 or 1 whether to fuse decoder or not")
group.add_argument('--dali-nvjpeg-width-hint', type=int, default=5980, help="Width hint value for nvJPEG (in pixels)")
diff --git a/MxNet/Classification/RN50v1.5/fit.py b/MxNet/Classification/RN50v1.5/fit.py
index e396606f3..8952b64e0 100644
--- a/MxNet/Classification/RN50v1.5/fit.py
+++ b/MxNet/Classification/RN50v1.5/fit.py
@@ -483,11 +483,6 @@ def fit(args, model, data_loader):
# select gpu for horovod process
if 'horovod' in args.kv_store:
args.gpus = [args.gpus[hvd.local_rank()]]
- ctx = mx.gpu(hvd.local_rank())
-
- tensor1 = mx.nd.zeros(shape=(1,), dtype='float32', ctx=ctx)
- tensor2 = mx.nd.zeros(shape=(1,), dtype='float32', ctx=ctx)
- tensor1, tensor2 = hvd.grouped_allreduce([tensor1,tensor2])
if args.amp:
amp.init()
@@ -579,6 +574,11 @@ def fit(args, model, data_loader):
params = model.collect_params()
if params is not None:
hvd.broadcast_parameters(params, root_rank=0)
+ ctx = mx.gpu(hvd.local_rank())
+ tensor1 = mx.nd.zeros(shape=(1,), dtype='float32', ctx=ctx)
+ tensor2 = mx.nd.zeros(shape=(1,), dtype='float32', ctx=ctx)
+ tensor1, tensor2 = hvd.grouped_allreduce([tensor1,tensor2])
+
global_metrics = CompositeMeter()
if args.mode in ['train_val', 'train']:
global_metrics.register_metric('train.loss', MinMeter())
diff --git a/PaddlePaddle/Classification/RN50v1.5/Dockerfile b/PaddlePaddle/Classification/RN50v1.5/Dockerfile
index 44aad1329..932dca3c6 100644
--- a/PaddlePaddle/Classification/RN50v1.5/Dockerfile
+++ b/PaddlePaddle/Classification/RN50v1.5/Dockerfile
@@ -1,4 +1,4 @@
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/paddlepaddle:22.05-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/paddlepaddle:23.12-py3
FROM ${FROM_IMAGE_NAME}
ADD requirements.txt /workspace/
diff --git a/PaddlePaddle/Classification/RN50v1.5/README.md b/PaddlePaddle/Classification/RN50v1.5/README.md
index 76a75104e..1e981db73 100644
--- a/PaddlePaddle/Classification/RN50v1.5/README.md
+++ b/PaddlePaddle/Classification/RN50v1.5/README.md
@@ -17,6 +17,8 @@ achieve state-of-the-art accuracy. The content of this repository is tested and
* [Enabling TF32](#enabling-tf32)
* [Automatic SParsity](#automatic-sparsity)
* [Enable Automatic SParsity](#enable-automatic-sparsity)
+ * [Quantization aware training](#quantization-aware-training)
+ * [Enable quantization aware training](#enable-quantization-aware-training)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
@@ -26,6 +28,7 @@ achieve state-of-the-art accuracy. The content of this repository is tested and
* [Dataset guidelines](#dataset-guidelines)
* [Training process](#training-process)
* [Automatic SParsity training process](#automatic-sparsity-training-process)
+ * [Quantization aware training process](#quantization-aware-training-process)
* [Inference process](#inference-process)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
@@ -128,6 +131,7 @@ This model supports the following features:
|[DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) | Yes |
|[Paddle AMP](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/performance_improving/amp_en.html) | Yes |
|[Paddle ASP](https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/static/sparsity/decorate_en.html) | Yes |
+|[PaddleSlim QAT](https://paddleslim.readthedocs.io/en/latest/quick_start/quant_aware_tutorial_en.html) | Yes |
|[Paddle-TRT](https://github.com/PaddlePaddle/Paddle-Inference-Demo/blob/master/docs/optimize/paddle_trt_en.rst) | Yes |
#### Features
@@ -139,7 +143,9 @@ with the DALI library. For more information about DALI, refer to the [DALI produ
- Paddle ASP is a PaddlePaddle built-in module that provides functions to enable automatic sparsity workflow with only a few code line insertions. The full APIs can be found in [Paddle.static.sparsity](https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/static/sparsity/calculate_density_en.html). Paddle ASP support, currently, static graph mode only (Dynamic graph support is under development). Refer to the [Enable Automatic SParsity](#enable-automatic-sparsity) section for more details.
-- Paddle-TRT is a PaddlePaddle inference integration with [TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html). It selects subgraph to be accelerated by TensorRT, while leaving the rest of the operations to be executed natively by PaddlePaddle. Refer to the [Inference with TensorRT](#inference-with-tensorrt) section for more details.
+- PaddleSlim is a set of tools based on PaddlePaddle for model acceleration, quantization, pruning, and knowledge distillation. For model quantization, PaddleSlim offers simple and user-friendly APIs for quantization aware training. The full APIs can be found in [Quantization aware training](https://paddleslim.readthedocs.io/en/latest/api_en/index_en.html). PaddleSlim currently supports updating gradients and scales simultaneously during quantization aware training (Training with fixed scales is still under development). Refer to the [Enable quantization aware training](#enable-quantization-aware-training) section for more details.
+
+- Paddle-TRT is a PaddlePaddle inference integration with [TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html). It selects subgraphs to be accelerated by TensorRT, while leaving the rest of the operations to be executed natively by PaddlePaddle. Refer to the [Inference with TensorRT](#inference-with-tensorrt) section for more details.
### DALI
@@ -147,7 +153,7 @@ We use [NVIDIA DALI](https://github.com/NVIDIA/DALI),
which speeds up data loading when the CPU becomes a bottleneck.
DALI can use CPU or GPU and outperforms the PaddlePaddle native data loader.
-For data loader, we only support DALI as data loader for now.
+For data loading, we currently support only DALI.
### Mixed precision training
@@ -225,6 +231,30 @@ Moreover, ASP is also compatible with mixed precision training.
Note that currently ASP only supports static graphs (Dynamic graph support is under development).
+### Quantization Aware Training
+Quantization aware training (QAT) is a technique for training models with awareness of the quantization process. Quantization refers to reducing the precision of numerical values in a model, typically from floating point to lower-bit fixed-point representations. In QAT, the model is trained to accommodate the effects of quantization, enabling it to maintain performance even when deployed with reduced precision.
+Through PaddleSlim QAT, we can quantize models by the following steps:
+- quantize and dequantize the weights and inputs before feeding them into weighted layers (e.g., Convolution and FullyConnected)
+- record the scale of each tensor for use in low precision inference
+
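+A rough, framework-agnostic sketch of that quantize-dequantize ("fake quantization") step, not the PaddleSlim implementation, might look as follows:
+
+```python
+import numpy as np
+
+def fake_quantize(x, num_bits=8):
+    """Quantize to signed INT8 and immediately dequantize, keeping the scale."""
+    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
+    scale = np.abs(x).max() / qmax            # symmetric, per-tensor scale
+    q = np.clip(np.round(x / scale), -qmax, qmax)
+    return q * scale, scale                   # float values used during training, scale kept for INT8 inference
+
+weights = np.array([0.31, -1.20, 0.07, 0.85], dtype=np.float32)
+w_q, scale = fake_quantize(weights)
+print(w_q, scale)
+```
+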
+For more information, refer to
+- [INTEGER QUANTIZATION FOR DEEP LEARNING INFERENCE: PRINCIPLES AND EMPIRICAL EVALUATION](https://arxiv.org/pdf/2004.09602.pdf)
+
+#### Enable Quantization Aware Training
+PaddlePaddle integrates some QAT modules from PaddleSlim, a toolkit for deep learning model compression, to enable quantization aware training.
+The APIs can quantize a training program and also convert it into an INT8 inference model.
+
+```python
+quant_program = quanter.quant_aware(program)
+...
+quant_infer_program = quanter.convert(quant_program)
+```
+
+The detailed information on QAT API can be found in [quantization_aware_tutorial](https://paddleslim.readthedocs.io/en/latest/quick_start/quant_aware_tutorial_en.html).
+
+Moreover, QAT is also compatible with mixed precision training.
+
+
## Setup
The following section lists the requirements you need to meet to start training the ResNet50 model.
@@ -233,7 +263,7 @@ The following section lists the requirements you need to meet to start training
This repository contains a Dockerfile that extends the CUDA NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PaddlePaddle 22.05-py3 NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/paddlepaddle) or newer
+* [PaddlePaddle 23.12-py3 NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/paddlepaddle) or newer
* Supported GPUs:
* [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
@@ -289,13 +319,13 @@ docker build . -t nvidia_resnet50
### 4. Start an interactive session in the NGC container to run training/inference.
```bash
-nvidia-docker run --rm -it -v :/imagenet --ipc=host nvidia_resnet50
+nvidia-docker run --rm -it -v :/imagenet --ipc=host -e FLAGS_apply_pass_to_program=1 nvidia_resnet50
```
### 5. Start training
To run training for a standard configuration (DGXA100, AMP/TF32),
-use one of scripts in `scripts/training` to launch training. (Please ensure ImageNet is mounted in the `/imagenet` directory.)
+use one of the scripts in `scripts/training` to launch training. (Please ensure ImageNet is mounted in the `/imagenet` directory.)
Example:
```bash
@@ -303,7 +333,7 @@ Example:
bash scripts/training/train_resnet50_TF32_90E_DGXA100.sh
# For AMP and 8 GPUs training in 90 epochs
-bash scripts/training/train_resnet50_TF32_90E_DGXA100.sh
+bash scripts/training/train_resnet50_AMP_90E_DGXA100.sh
```
Or you can manually launch training by `paddle.distributed.launch`. `paddle.distributed.launch` is a built-in module in PaddlePaddle that spawns up multiple distributed training processes on each of the training nodes.
@@ -390,7 +420,8 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py
### Command-line options:
To find the full list of available options and their descriptions, use the `-h` or `--help` command-line option, for example:
-`python [train.py|export_model.py|inference.py] -h`
+
+`python train.py -h`
```bash
PaddlePaddle RN50v1.5 training script
@@ -398,9 +429,11 @@ PaddlePaddle RN50v1.5 training script
optional arguments:
-h, --help show this help message and exit
-Global:
- --output-dir OUTPUT_DIR
- A path to store trained models. (default: ./output/)
+General:
+ --checkpoint-dir CHECKPOINT_DIR
+ A path to store trained models. (default: ./checkpoint)
+ --inference-dir INFERENCE_DIR
+ A path to store inference model once the training is finished. (default: ./inference/)
--run-scope {train_eval,train_only,eval_only}
Running scope. It should be one of {train_eval, train_only, eval_only}. (default: train_eval)
--epochs EPOCHS The number of epochs for training. (default: 90)
@@ -410,11 +443,9 @@ Global:
The iteration interval to test trained models on a given validation dataset. Ignored when --run-scope is
train_only. (default: 1)
--print-interval PRINT_INTERVAL
- The iteration interval to show training/evaluation message. (default: 10)
+ The iteration interval to show a training/evaluation message. (default: 10)
--report-file REPORT_FILE
- A file in which to store JSON experiment report. (default: ./report.json)
- --data-layout {NCHW,NHWC}
- Data format. It should be one of {NCHW, NHWC}. (default: NCHW)
+ A file in which to store JSON experiment reports. (default: ./train.json)
--benchmark To enable benchmark mode. (default: False)
--benchmark-steps BENCHMARK_STEPS
Steps for benchmark run, only be applied when --benchmark is set. (default: 100)
@@ -431,7 +462,7 @@ Global:
--last-epoch-of-checkpoint LAST_EPOCH_OF_CHECKPOINT
The epoch id of the checkpoint given by --from-checkpoint. It should be None, auto or integer >= 0. If it is set
as None, then training will start from 0-th epoch. If it is set as auto, then it will search largest integer-
- convertable folder --from-checkpoint, which contains required checkpoint. Default is None. (default: None)
+ convertible folder --from-checkpoint, which contains the required checkpoint. Default is None. (default: None)
--show-config SHOW_CONFIG
To show arguments. (default: True)
--enable-cpu-affinity ENABLE_CPU_AFFINITY
@@ -448,7 +479,7 @@ Dataset:
--dali-random-seed DALI_RANDOM_SEED
The random seed for DALI data loader. (default: 42)
--dali-num-threads DALI_NUM_THREADS
- The number of threads applied to DALI data loader. (default: 4)
+ The number of threads applied to the DALI data loader. (default: 4)
--dali-output-fp16 Output FP16 data from DALI data loader. (default: False)
Data Augmentation:
@@ -472,6 +503,8 @@ Model:
The model architecture name. It should be one of {ResNet50}. (default: ResNet50)
--num-of-class NUM_OF_CLASS
The number classes of images. (default: 1000)
+ --data-layout {NCHW,NHWC}
+ Data format. It should be one of {NCHW, NHWC}. (default: NCHW)
--bn-weight-decay Apply weight decay to BatchNorm shift and scale. (default: False)
Training:
@@ -479,16 +512,16 @@ Training:
The ratio of label smoothing. (default: 0.1)
--optimizer OPTIMIZER
The name of optimizer. It should be one of {Momentum}. (default: Momentum)
- --momentum MOMENTUM The momentum value of optimizer. (default: 0.875)
+ --momentum MOMENTUM The momentum value of an optimizer. (default: 0.875)
--weight-decay WEIGHT_DECAY
The coefficient of weight decay. (default: 3.0517578125e-05)
--lr-scheduler LR_SCHEDULER
- The name of learning rate scheduler. It should be one of {Cosine}. (default: Cosine)
+ The name of the learning rate scheduler. It should be one of {Cosine}. (default: Cosine)
--lr LR The initial learning rate. (default: 0.256)
--warmup-epochs WARMUP_EPOCHS
The number of epochs for learning rate warmup. (default: 5)
--warmup-start-lr WARMUP_START_LR
- The initial learning rate for warmup. (default: 0.0)
+ The initial learning rate for warm up. (default: 0.0)
Advanced Training:
--amp Enable automatic mixed precision training (AMP). (default: False)
@@ -497,36 +530,50 @@ Advanced Training:
--use-dynamic-loss-scaling
Enable dynamic loss scaling in AMP training, only be applied when --amp is set. (default: False)
--use-pure-fp16 Enable pure FP16 training, only be applied when --amp is set. (default: False)
+ --fuse-resunit Enable CUDNNv8 ResUnit fusion, only be applied when --amp is set. (default: False)
--asp Enable automatic sparse training (ASP). (default: False)
--prune-model Prune model to 2:4 sparse pattern, only be applied when --asp is set. (default: False)
--mask-algo {mask_1d,mask_2d_greedy,mask_2d_best}
The algorithm to generate sparse masks. It should be one of {mask_1d, mask_2d_greedy, mask_2d_best}. This only
be applied when --asp and --prune-model is set. (default: mask_1d)
+ --qat Enable quantization aware training (QAT). (default: False)
+```
+`python inference.py -h`
+```sh
Paddle-TRT:
- --trt-inference-dir TRT_INFERENCE_DIR
- A path to store/load inference models. export_model.py would export models to this folder, then inference.py
- would load from here. (default: ./inference)
- --trt-precision {FP32,FP16,INT8}
+ --device DEVICE_ID
+ The GPU device id for Paddle-TRT inference. (default: 0)
+ --inference-dir INFERENCE_DIR
+ A path to load inference models. (default: ./inference)
+ --batch-size BATCH_SIZE
+ The batch size for Paddle-TRT. (default: 256)
+ --image-shape IMAGE_SHAPE
+ The image shape. Its shape should be [channel, height, width]. (default: [4, 224, 224])
+ --data-layout {NCHW,NHWC}
+ Data format. It should be one of {NCHW, NHWC}. (default: NCHW)
+ --precision {FP32,FP16,INT8}
The precision of TensorRT. It should be one of {FP32, FP16, INT8}. (default: FP32)
- --trt-workspace-size TRT_WORKSPACE_SIZE
+ --workspace-size WORKSPACE_SIZE
The memory workspace of TensorRT in MB. (default: 1073741824)
- --trt-min-subgraph-size TRT_MIN_SUBGRAPH_SIZE
+ --min-subgraph-size MIN_SUBGRAPH_SIZE
The minimal subgraph size to enable PaddleTRT. (default: 3)
- --trt-use-static TRT_USE_STATIC
+ --use-static USE_STATIC
Fix TensorRT engine at first running. (default: False)
- --trt-use-calib-mode TRT_USE_CALIB_MODE
+ --use-calib-mode USE_CALIB_MODE
Use the PTQ calibration of PaddleTRT int8. (default: False)
- --trt-export-log-path TRT_EXPORT_LOG_PATH
- A file in which to store JSON model exporting report. (default: ./export.json)
- --trt-log-path TRT_LOG_PATH
- A file in which to store JSON inference report. (default: ./inference.json)
- --trt-use-synthat TRT_USE_SYNTHAT
+ --report-file REPORT_FILE
+ A file in which to store JSON experiment report. (default: ./inference.json)
+ --use-synthetic USE_SYNTHAT
Apply synthetic data for benchmark. (default: False)
+ --benchmark-steps BENCHMARK_STEPS
+ Steps for benchmark run, only be applied when --benchmark is set. (default: 100)
+ --benchmark-warmup-steps BENCHMARK_WARMUP_STEPS
+ Warmup steps for benchmark run, only be applied when --benchmark is set. (default: 100)
+ --show-config SHOW_CONFIG
+ To show arguments. (default: True)
```
-Noted that arguments in Paddle-TRT are only available to `export_model.py` or `inference.py`.
-
### Dataset guidelines
To use your own dataset, divide it in directories as in the following scheme:
@@ -537,15 +584,17 @@ To use your own dataset, divide it in directories as in the following scheme:
If the number of classes in your dataset is not 1000, you need to specify it to `--num-of-class`.
### Training process
-The model will be stored in the directory specified with `--output-dir` and `--model-arch-name`, including three files:
+The checkpoint will be stored in the directory specified with `--checkpoint-dir` and `--model-arch-name`, including three files:
- `.pdparams`: The parameters contain all the trainable tensors and will save to a file with the suffix “.pdparams”.
-- `.pdopts`: The optimizer information contains all the Tensors used by the optimizer. For Adam optimizer, it contains beta1, beta2, momentum, and so on. All the information will be saved to a file with suffix “.pdopt”. (If the optimizer has no Tensor need to save (like SGD), the file will not be generated).
+- `.pdopts`: The optimizer information contains all the Tensors used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, and so on. All the information will be saved to a file with the suffix “.pdopt”. (If the optimizer has no Tensors that need to be saved (like SGD), the file will not be generated.)
- `.pdmodel`: The network description is the description of the program. It’s only used for deployment. The description will save to a file with the suffix “.pdmodel”.
-The prefix of model files is specified by `--model-prefix`, which default value is `resnet_50_paddle`. Model of each epoch would be stored in directory `./output/ResNet50/epoch_id/` with three files by default, including `resnet_50_paddle.pdparams`, `resnet_50_paddle.pdopts`, `resnet_50_paddle.pdmodel`. Note that `epoch_id` is 0-based, which means `epoch_id` is from 0 to 89 for a total of 90 epochs. For example, the model of the 89th epoch would be stored in `./output/ResNet50/89/resnet_50_paddle`
+The prefix of model files is specified by `--model-prefix`, whose default value is `resnet_50_paddle`. The model of each epoch is stored in the directory `./checkpoint/ResNet50/epoch_id/` with three files by default, including `resnet_50_paddle.pdparams`, `resnet_50_paddle.pdopts`, and `resnet_50_paddle.pdmodel`. Note that `epoch_id` is 0-based, which means `epoch_id` is from 0 to 89 for a total of 90 epochs. For example, the model of the 89th epoch would be stored in `./checkpoint/ResNet50/89/resnet_50_paddle`.
+
+When the training phase is done, the inference model will be stored in the directory specified with `--inference-dir` and `--model-arch-name`, and it includes two files: `.pdmodel` and `.pdparams`.
Assume you want to train the ResNet50 for 90 epochs, but the training process aborts during the 50th epoch due to infrastructure faults. To resume training from the checkpoint, specify `--from-checkpoint` and `--last-epoch-of-checkpoint` with following these steps:
-- Set `./output/ResNet50/49` to `--from-checkpoint`.
+- Set `./checkpoint/ResNet50/49` to `--from-checkpoint`.
- Set `--last-epoch-of-checkpoint` to `49`.
Then rerun the training to resume training from the 50th epoch to the 89th epoch.
@@ -559,11 +608,11 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
--use-dynamic-loss-scaling \
--data-layout NHWC \
--model-prefix resnet_50_paddle \
- --from-checkpoint ./output/ResNet50/49 \
+ --from-checkpoint ./checkpoint/ResNet50/49 \
--last-epoch-of-checkpoint 49
```
-We also provide automatic searching for the checkpoint from last epoch. You can enable this by set `--last-epoch-of-checkpoint` as `auto`. Noted that if enable automatic searching, `--from-checkpoint` should be a folder contains chekcpoint files or `/`. In previous example, it should be `./output/ResNet50`.
+We also provide automatic searching for the checkpoint from the last epoch. You can enable this by setting `--last-epoch-of-checkpoint` to `auto`. Note that if you enable automatic searching, `--from-checkpoint` should be a folder containing checkpoint files or `/`. In the previous example, it should be `./checkpoint/ResNet50`.
Example:
```bash
@@ -575,11 +624,11 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
--use-dynamic-loss-scaling \
--data-layout NHWC \
--model-prefix resnet_50_paddle \
- --from-checkpoint ./output/ResNet50 \
+ --from-checkpoint ./checkpoint/ResNet50 \
--last-epoch-of-checkpoint auto
```
-To start training from pretrained weights, set `--from-pretrained-params` to `./output/ResNet50//<--model-prefix>`.
+To start training from pretrained weights, set `--from-pretrained-params` to `./checkpoint/ResNet50//<--model-prefix>`.
Example:
```bash
@@ -591,7 +640,7 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
--use-dynamic-loss-scaling \
--data-layout NHWC \
--model-prefix resnet_50_paddle \
- --from-pretrained-params ./output/ResNet50/
+ --from-pretrained-params ./checkpoint/ResNet50/
```
Make sure:
@@ -603,7 +652,7 @@ The difference between those two is that `--from-pretrained-params` contain only
`--from-checkpoint` is suitable for dividing the training into parts, for example, in order to divide the training job into shorter stages, or restart training after infrastructure faults.
-`--from-pretrained-params` can be used as a base for finetuning the model to a different dataset or as a backbone to detection models.
+`--from-pretrained-params` can be used as a base for fine-tuning the model to a different dataset or as a backbone for detection models.
Metrics gathered through both training and evaluation:
- `[train|val].loss` - loss
@@ -619,24 +668,24 @@ Metrics gathered through both training and evaluation:
### Automatic SParsity training process:
-To enable automatic sparsity training workflow, turn on `--amp` and `--prune-mode` when training launches. Refer to [Command-line options](#command-line-options)
+To enable the automatic sparsity training workflow, turn on `--asp` and `--prune-model` when training launches. Refer to [Command-line options](#command-line-options).
Note that automatic sparsity (ASP) requires a pretrained model to initialize parameters.
You can apply `scripts/training/train_resnet50_AMP_ASP_90E_DGXA100.sh` we provided to launch ASP + AMP training.
```bash
-# Default path to pretrained parameters is ./output/ResNet50/89/resnet_50_paddle
+# Default path to pretrained parameters is ./checkpoint/ResNet50/89/resnet_50_paddle
bash scripts/training/train_resnet50_AMP_ASP_90E_DGXA100.sh
```
Or following steps below to manually launch ASP + AMP training.
-First, set `--from-pretrained-params` to a pretrained model file. For example, if you have trained the ResNet50 for 90 epochs following [Training process](#training-process), the final pretrained weights would be stored in `./output/ResNet50/89/resnet_50_paddle.pdparams` by default, and set `--from-pretrained-params` to `./output/ResNet50/89`.
+First, set `--from-pretrained-params` to a pretrained model file. For example, if you have trained the ResNet50 for 90 epochs following [Training process](#training-process), the final pretrained weights would be stored in `./checkpoint/ResNet50/89/resnet_50_paddle.pdparams` by default, and set `--from-pretrained-params` to `./checkpoint/ResNet50/89`.
Then run the following command to launch AMP + ASP training:
```bash
python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
- --from-pretrained-params ./output/ResNet50/89 \
+ --from-pretrained-params ./checkpoint/ResNet50/89 \
--model-prefix resnet_50_paddle \
--epochs 90 \
--amp \
@@ -648,14 +697,43 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
--mask-algo mask_1d
```
+## Quantization Aware Training Process
+Quantization aware training (QAT) requires a fine-tuned model. Quantize/dequantize OPs are inserted into the model, and a small number of additional training epochs are run to update the model parameters.
+
+To enable the quantization aware training workflow, turn on `--qat` when launching training. Refer to [Command-line options](#command-line-options).
+
+You can use the provided script `scripts/training/train_resnet50_AMP_QAT_10E_DGXA100.sh` to launch AMP + QAT training.
+```bash
+# Default path to pretrained parameters is ./output/ResNet50/89/resnet_50_paddle
+bash scripts/training/train_resnet50_AMP_QAT_10E_DGXA100.sh
+```
+
+Or follow the steps below to manually launch AMP + QAT training.
+
+First, set `--from-pretrained-params` to a pretrained model file. For example, if you have trained ResNet50 for 90 epochs following [Training process](#training-process), the final pretrained weights are stored in `./output/ResNet50/89/resnet_50_paddle.pdparams` by default; in that case, set `--from-pretrained-params` to `./output/ResNet50/89`.
+
+Then run the following command to launch AMP + QAT training:
+```bash
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+ --from-pretrained-params ./output/ResNet50/89 \
+ --model-prefix resnet_50_paddle \
+ --epochs 10 \
+ --amp \
+ --scale-loss 128.0 \
+ --use-dynamic-loss-scaling \
+ --data-layout NHWC \
+ --qat
+```
+
+
### Inference process
#### Inference on your own datasets.
To run inference on a single example with pretrained parameters,
1. Set `--from-pretrained-params` to your pretrained parameters.
-2. Set `--image-root` to the root folder of your own dataset.
- - Note that validation dataset should be in `image-root/val`.
+2. Set `--image-root` to the root folder of your own dataset.
+ - Note that the validation dataset should be in `image-root/val`.
3. Set `--run-scope` to `eval_only`.
```bash
# For single GPU evaluation
@@ -672,17 +750,27 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
```
#### Inference with TensorRT
-To run inference with TensorRT for the best performance, you can apply the scripts in `scripts/inference`.
+For inference with TensorRT, we provide two benchmark scopes: with and without data preprocessing.
+
+The default scripts in `scripts/inference` use synthetic input to run inference without data preprocessing.
For example,
1. Run `bash scripts/inference/export_resnet50_AMP.sh ` to export an inference model.
- - The default path of checkpoint is `./output/ResNet50/89`.
+ - The default path of the checkpoint is `./output/ResNet50/89`.
2. Run `bash scripts/inference/infer_resnet50_AMP.sh` to infer with TensorRT.
Or you could manually run `export_model.py` and `inference.py` with specific arguments; refer to [Command-line options](#command-line-options).
Note that arguments passed to `export_model.py` and `inference.py` should be the same as the arguments used in training.
+To run inference with data preprocessing, set `--use-synthetic` to `False` and `--image-root` to the path of your own dataset. For example,
+
+```bash
+python inference.py --inference-dir \
+ --image-root \
+ --use-synthetic False
+```
+
## Performance
The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).
@@ -748,32 +836,32 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
##### Benchmark with TensorRT
-To benchmark the inference performance with TensorRT on a specific batch size, run:
+To benchmark the inference performance with TensorRT for a specific batch size, run `inference.py` with `--use-synthetic True`. The benchmark uses synthetic input without data preprocessing.
* FP32 / TF32
```bash
python inference.py \
- --trt-inference-dir \
- --trt-precision FP32 \
+ --inference-dir \
+ --precision FP32 \
--batch-size \
--benchmark-steps 1024 \
- --benchmark-warmup-steps 16
+ --benchmark-warmup-steps 16 \
+ --use-synthetic True
```
* FP16
```bash
python inference.py \
- --trt-inference-dir \
- --trt-precision FP16 \
+ --inference-dir \
+ --precision FP16 \
    --batch-size \
--benchmark-steps 1024 \
- --benchmark-warmup-steps 16
+ --benchmark-warmup-steps 16 \
+ --use-synthetic True
```
Note that arguments passed to `inference.py` should be the same as the arguments used in training.
-The benchmark uses the validation dataset by default, which should be put in `--image-root/val`.
-For the performance benchmark of the raw model, a synthetic dataset can be used. To use synthetic dataset, add `--trt-use-synthat True` as a command line option.
### Results
@@ -793,7 +881,7 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic
##### Example plots
-The following images show the 90 epochs configuration on a DGX-A100.
+The following images show the 90-epoch configuration on a DGX-A100.


@@ -815,8 +903,8 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic
| **GPUs** | **Throughput - TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **TF32 Scaling** | **Mixed Precision Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Training Time (90E)** |
|:--------:|:------------:|:-------------:|:------------:|:------:|:--------:|:--------:|:--------:|
-| 1 | 993 img/s | 2711 img/s | 2.73 x | 1.0 x | 1.0 x | ~13 hours| ~40 hours|
-| 8 | 7955 img/s | 20267 img/s | 2.54 x | 8.01 x | 7.47 x | ~2 hours | ~4 hours |
+| 1 | 1024 img/s | 2897 img/s | 2.83 x | 1.0 x | 1.0 x | ~13 hours| ~40 hours|
+| 8 | 8013 img/s | 23874 img/s | 2.98 x | 7.83 x | 8.24 x | ~2 hours | ~4 hours |
##### Training performance of Automatic SParsity: NVIDIA DGX A100 (8x A100 80GB)
| **GPUs** | **Throughput - mixed precision** | **Throughput - mixed precision+ASP** | **Overhead** |
@@ -825,7 +913,7 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic
| 8 | 20267 img/s | 20144 img/s | 0.6% |
-Note that the `train.py` would enable CPU affinity binding to GPUs by default, that is designed and guaranteed being optimal for NVIDIA DGX-series. You could disable binding via launch `train.py` with `--enable-cpu-affinity false`.
+Note that `train.py` enables CPU affinity binding to GPUs by default, which is designed and guaranteed to be optimal for NVIDIA DGX-series systems. You can disable binding by launching `train.py` with `--enable-cpu-affinity false`.
### Inference performance results
@@ -866,96 +954,143 @@ Our results were obtained by running the applicable training script with `--run-
#### Paddle-TRT performance: NVIDIA DGX A100 (1x A100 80GB)
Our results for Paddle-TRT were obtained by running the `inference.py` script on NVIDIA DGX A100 with (1x A100 80G) GPU.
+Note that the benchmark does not include data preprocessing. Refer to [Benchmark with TensorRT](#benchmark-with-tensorrt).
+
**TF32 Inference Latency**
|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
-| 1 | 716.49 img/s | 1.40 ms | 1.96 ms | 2.20 ms | 3.01 ms |
-| 2 | 1219.98 img/s | 1.64 ms | 2.26 ms | 2.90 ms | 5.04 ms |
-| 4 | 1880.12 img/s | 2.13 ms | 3.39 ms | 4.44 ms | 7.32 ms |
-| 8 | 2404.10 img/s | 3.33 ms | 4.51 ms | 5.90 ms | 10.39 ms |
-| 16 | 3101.28 img/s | 5.16 ms | 7.06 ms | 9.13 ms | 15.18 ms |
-| 32 | 3294.11 img/s | 9.71 ms | 21.42 ms | 26.94 ms | 35.79 ms |
-| 64 | 4327.38 img/s | 14.79 ms | 25.59 ms | 30.45 ms | 45.34 ms |
-| 128 | 4956.59 img/s | 25.82 ms | 33.74 ms | 40.36 ms | 56.06 ms |
-| 256 | 5244.29 img/s | 48.81 ms | 62.11 ms | 67.56 ms | 88.38 ms |
+| 1 | 969.11 img/s | 1.03 ms | 1.03 ms | 1.13 ms | 1.14 ms |
+| 2 | 1775.33 img/s | 1.13 ms | 1.13 ms | 1.22 ms | 1.23 ms |
+| 4 | 3088.02 img/s | 1.29 ms | 1.30 ms | 1.39 ms | 1.40 ms |
+| 8 | 4552.29 img/s | 1.76 ms | 1.76 ms | 1.85 ms | 1.87 ms |
+| 16 | 6059.48 img/s | 2.64 ms | 2.64 ms | 2.73 ms | 2.75 ms |
+| 32 | 7264.92 img/s | 4.40 ms | 4.41 ms | 4.49 ms | 4.52 ms |
+| 64 | 8022.82 img/s | 7.98 ms | 8.03 ms | 8.05 ms | 8.11 ms |
+| 128 | 8436.27 img/s | 15.17 ms | 15.20 ms | 15.27 ms | 15.30 ms |
+| 256 | 8623.08 img/s | 29.69 ms | 29.82 ms | 29.86 ms | 29.97 ms |
**FP16 Inference Latency**
|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
-| 1 | 860.90 img/s | 1.16 ms | 1.81 ms | 2.06 ms | 2.98 ms |
-| 2 | 1464.06 img/s | 1.37 ms | 2.13 ms | 2.73 ms | 4.76 ms |
-| 4 | 2246.24 img/s | 1.78 ms | 3.17 ms | 4.20 ms | 7.39 ms |
-| 8 | 2457.44 img/s | 3.25 ms | 4.35 ms | 5.50 ms | 9.98 ms |
-| 16 | 3928.83 img/s | 4.07 ms | 6.26 ms | 8.50 ms | 15.10 ms |
-| 32 | 3853.13 img/s | 8.30 ms | 19.87 ms | 25.51 ms | 34.99 ms |
-| 64 | 5581.89 img/s | 11.46 ms | 22.32 ms | 30.75 ms | 43.35 ms |
-| 128 | 6846.77 img/s | 18.69 ms | 25.43 ms | 35.03 ms | 50.04 ms |
-| 256 | 7481.19 img/s | 34.22 ms | 40.92 ms | 51.10 ms | 65.68 ms |
+| 1 | 1306.28 img/s | 0.76 ms | 0.77 ms | 0.86 ms | 0.87 ms |
+| 2 | 2453.18 img/s | 0.81 ms | 0.82 ms | 0.91 ms | 0.92 ms |
+| 4 | 4295.75 img/s | 0.93 ms | 0.95 ms | 1.03 ms | 1.04 ms |
+| 8 | 7036.09 img/s | 1.14 ms | 1.15 ms | 1.23 ms | 1.25 ms |
+| 16 | 10376.70 img/s | 1.54 ms | 1.56 ms | 1.64 ms | 1.66 ms |
+| 32 | 13078.23 img/s | 2.45 ms | 2.45 ms | 2.54 ms | 2.56 ms |
+| 64 | 14992.88 img/s | 4.27 ms | 4.27 ms | 4.36 ms | 4.38 ms |
+| 128 | 16386.96 img/s | 7.81 ms | 7.83 ms | 7.89 ms | 7.93 ms |
+| 256 | 17363.79 img/s | 14.74 ms | 14.80 ms | 14.82 ms | 14.90 ms |
+
+**INT8 Inference Latency**
+
+|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
+|--------------|------------------|---------------|---------------|---------------|---------------|
+| 1 | 1430.17 img/s | 0.70 ms | 0.70 ms | 0.79 ms | 0.80 ms |
+| 2 | 2683.75 img/s | 0.74 ms | 0.75 ms | 0.84 ms | 0.85 ms |
+| 4 | 4792.51 img/s | 0.83 ms | 0.84 ms | 0.93 ms | 0.94 ms |
+| 8 | 8366.92 img/s | 0.96 ms | 0.96 ms | 1.05 ms | 1.06 ms |
+| 16 | 13083.56 img/s | 1.22 ms | 1.22 ms | 1.32 ms | 1.33 ms |
+| 32 | 18171.90 img/s | 1.76 ms | 1.76 ms | 1.86 ms | 1.87 ms |
+| 64 | 22578.08 img/s | 2.83 ms | 2.84 ms | 2.93 ms | 2.95 ms |
+| 128 | 25730.51 img/s | 4.97 ms | 4.98 ms | 5.07 ms | 5.08 ms |
+| 256 | 27935.10 img/s | 9.16 ms | 9.26 ms | 9.30 ms | 9.34 ms |
#### Paddle-TRT performance: NVIDIA A30 (1x A30 24GB)
Our results for Paddle-TRT were obtained by running the `inference.py` script on NVIDIA A30 with (1x A30 24G) GPU.
+Note that the benchmark does not include data preprocessing. Refer to [Benchmark with TensorRT](#benchmark-with-tensorrt).
+
**TF32 Inference Latency**
|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
-| 1 | 672.79 img/s | 1.49 ms | 2.01 ms | 2.29 ms | 3.04 ms |
-| 2 | 1041.47 img/s | 1.92 ms | 2.49 ms | 2.87 ms | 4.13 ms |
-| 4 | 1505.64 img/s | 2.66 ms | 3.43 ms | 4.06 ms | 6.85 ms |
-| 8 | 2001.13 img/s | 4.00 ms | 4.72 ms | 5.54 ms | 9.51 ms |
-| 16 | 2462.80 img/s | 6.50 ms | 7.71 ms | 9.32 ms | 15.54 ms |
-| 32 | 2474.34 img/s | 12.93 ms | 21.61 ms | 25.76 ms | 34.69 ms |
-| 64 | 2949.38 img/s | 21.70 ms | 29.58 ms | 34.63 ms | 47.11 ms |
-| 128 | 3278.67 img/s | 39.04 ms | 43.34 ms | 52.72 ms | 66.78 ms |
-| 256 | 3293.10 img/s | 77.74 ms | 90.51 ms | 99.71 ms | 110.80 ms |
+| 1 | 860.08 img/s | 1.16 ms | 1.16 ms | 1.27 ms | 1.29 ms |
+| 2 | 1422.02 img/s | 1.40 ms | 1.41 ms | 1.52 ms | 1.53 ms |
+| 4 | 2058.41 img/s | 1.94 ms | 1.94 ms | 2.06 ms | 2.10 ms |
+| 8 | 2748.94 img/s | 2.91 ms | 2.93 ms | 3.03 ms | 3.22 ms |
+| 16 | 3329.39 img/s | 4.80 ms | 4.90 ms | 4.93 ms | 5.09 ms |
+| 32 | 3729.45 img/s | 8.58 ms | 8.68 ms | 8.74 ms | 8.84 ms |
+| 64 | 3946.74 img/s | 16.21 ms | 16.34 ms | 16.41 ms | 16.51 ms |
+| 128 | 4116.98 img/s | 31.09 ms | 31.26 ms | 31.38 ms | 31.43 ms |
+| 256 | 4227.52 img/s | 60.55 ms | 60.93 ms | 61.01 ms | 61.25 ms |
**FP16 Inference Latency**
|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
-| 1 | 804.56 img/s | 1.24 ms | 1.81 ms | 2.15 ms | 3.07 ms |
-| 2 | 1435.74 img/s | 1.39 ms | 2.05 ms | 2.48 ms | 3.86 ms |
-| 4 | 2169.87 img/s | 1.84 ms | 2.72 ms | 3.39 ms | 5.94 ms |
-| 8 | 2395.13 img/s | 3.34 ms | 4.46 ms | 5.11 ms | 9.49 ms |
-| 16 | 3779.82 img/s | 4.23 ms | 5.83 ms | 7.66 ms | 14.44 ms |
-| 32 | 3620.18 img/s | 8.84 ms | 17.90 ms | 22.31 ms | 30.91 ms |
-| 64 | 4592.08 img/s | 13.94 ms | 24.00 ms | 29.38 ms | 41.41 ms |
-| 128 | 5064.06 img/s | 25.28 ms | 31.73 ms | 37.79 ms | 53.01 ms |
-| 256 | 4774.61 img/s | 53.62 ms | 59.04 ms | 67.29 ms | 80.51 ms |
+| 1 | 1195.76 img/s | 0.83 ms | 0.84 ms | 0.95 ms | 0.96 ms |
+| 2 | 2121.44 img/s | 0.94 ms | 0.95 ms | 1.05 ms | 1.10 ms |
+| 4 | 3498.59 img/s | 1.14 ms | 1.14 ms | 1.26 ms | 1.30 ms |
+| 8 | 5139.91 img/s | 1.55 ms | 1.56 ms | 1.67 ms | 1.72 ms |
+| 16 | 6322.78 img/s | 2.53 ms | 2.54 ms | 2.64 ms | 2.83 ms |
+| 32 | 7093.70 img/s | 4.51 ms | 4.61 ms | 4.64 ms | 4.70 ms |
+| 64 | 7682.36 img/s | 8.33 ms | 8.44 ms | 8.48 ms | 8.58 ms |
+| 128 | 8072.73 img/s | 15.85 ms | 15.98 ms | 16.04 ms | 16.14 ms |
+| 256 | 8393.37 img/s | 30.50 ms | 30.67 ms | 30.70 ms | 30.84 ms |
+
+**INT8 Inference Latency**
+|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
+|--------------|------------------|---------------|---------------|---------------|---------------|
+| 1 | 1346.83 img/s | 0.74 ms | 0.74 ms | 0.85 ms | 0.87 ms |
+| 2 | 2415.06 img/s | 0.83 ms | 0.83 ms | 0.94 ms | 0.99 ms |
+| 4 | 4152.29 img/s | 0.96 ms | 0.97 ms | 1.07 ms | 1.11 ms |
+| 8 | 6684.53 img/s | 1.20 ms | 1.20 ms | 1.31 ms | 1.37 ms |
+| 16 | 9336.11 img/s | 1.71 ms | 1.72 ms | 1.82 ms | 1.89 ms |
+| 32 | 11544.88 img/s | 2.77 ms | 2.77 ms | 2.88 ms | 3.09 ms |
+| 64 | 12954.16 img/s | 4.94 ms | 5.04 ms | 5.08 ms | 5.23 ms |
+| 128 | 13914.60 img/s | 9.20 ms | 9.27 ms | 9.34 ms | 9.45 ms |
+| 256 | 14443.15 img/s | 17.72 ms | 17.87 ms | 17.92 ms | 18.00 ms |
#### Paddle-TRT performance: NVIDIA A10 (1x A10 24GB)
Our results for Paddle-TRT were obtained by running the `inference.py` script on NVIDIA A10 with (1x A10 24G) GPU.
+Note that the benchmark does not include data preprocessing. Refer to [Benchmark with TensorRT](#benchmark-with-tensorrt).
+
**TF32 Inference Latency**
|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
-| 1 | 372.04 img/s | 2.69 ms | 3.64 ms | 4.20 ms | 5.28 ms |
-| 2 | 615.93 img/s | 3.25 ms | 4.08 ms | 4.59 ms | 6.42 ms |
-| 4 | 1070.02 img/s | 3.74 ms | 3.90 ms | 4.35 ms | 7.48 ms |
-| 8 | 1396.88 img/s | 5.73 ms | 6.87 ms | 7.52 ms | 10.63 ms |
-| 16 | 1522.20 img/s | 10.51 ms | 12.73 ms | 13.84 ms | 17.84 ms |
-| 32 | 1674.39 img/s | 19.11 ms | 23.23 ms | 24.63 ms | 29.55 ms |
-| 64 | 1782.14 img/s | 35.91 ms | 41.84 ms | 44.53 ms | 48.94 ms |
-| 128 | 1722.33 img/s | 74.32 ms | 85.37 ms | 89.27 ms | 94.85 ms |
-| 256 | 1576.89 img/s | 162.34 ms | 181.01 ms | 185.92 ms | 194.42 ms |
+| 1 | 601.39 img/s | 1.66 ms | 1.66 ms | 1.82 ms | 1.85 ms |
+| 2 | 962.31 img/s | 2.08 ms | 2.13 ms | 2.23 ms | 2.38 ms |
+| 4 | 1338.26 img/s | 2.99 ms | 3.04 ms | 3.14 ms | 3.32 ms |
+| 8 | 1650.56 img/s | 4.85 ms | 4.93 ms | 5.01 ms | 5.14 ms |
+| 16 | 2116.53 img/s | 7.56 ms | 7.64 ms | 7.71 ms | 7.84 ms |
+| 32 | 2316.43 img/s | 13.81 ms | 14.00 ms | 14.07 ms | 14.26 ms |
+| 64 | 2477.26 img/s | 25.83 ms | 26.05 ms | 26.15 ms | 26.35 ms |
+| 128 | 2528.92 img/s | 50.61 ms | 51.24 ms | 51.37 ms | 51.72 ms |
+| 256 | 2576.08 img/s | 99.37 ms | 100.45 ms | 100.66 ms | 101.05 ms |
**FP16 Inference Latency**
|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
|--------------|------------------|---------------|---------------|---------------|---------------|
-| 1 | 365.38 img/s | 2.74 ms | 3.94 ms | 4.35 ms | 5.64 ms |
-| 2 | 612.52 img/s | 3.26 ms | 4.34 ms | 4.80 ms | 6.97 ms |
-| 4 | 1018.15 img/s | 3.93 ms | 4.95 ms | 5.55 ms | 9.16 ms |
-| 8 | 1924.26 img/s | 4.16 ms | 5.44 ms | 6.20 ms | 11.89 ms |
-| 16 | 2477.49 img/s | 6.46 ms | 8.07 ms | 9.21 ms | 15.05 ms |
-| 32 | 2896.01 img/s | 11.05 ms | 13.56 ms | 15.32 ms | 21.76 ms |
-| 64 | 3165.27 img/s | 20.22 ms | 24.20 ms | 25.94 ms | 33.18 ms |
-| 128 | 3176.46 img/s | 40.29 ms | 46.36 ms | 49.15 ms | 54.95 ms |
-| 256 | 3110.01 img/s | 82.31 ms | 93.21 ms | 96.06 ms | 99.97 ms |
+| 1 | 1109.59 img/s | 0.90 ms | 0.90 ms | 1.06 ms | 1.08 ms |
+| 2 | 1901.53 img/s | 1.05 ms | 1.05 ms | 1.22 ms | 1.23 ms |
+| 4 | 2733.20 img/s | 1.46 ms | 1.48 ms | 1.62 ms | 1.65 ms |
+| 8 | 3494.23 img/s | 2.29 ms | 2.32 ms | 2.44 ms | 2.48 ms |
+| 16 | 4113.53 img/s | 3.89 ms | 3.99 ms | 4.10 ms | 4.17 ms |
+| 32 | 4714.63 img/s | 6.79 ms | 6.98 ms | 7.14 ms | 7.30 ms |
+| 64 | 5054.70 img/s | 12.66 ms | 12.78 ms | 12.83 ms | 13.08 ms |
+| 128 | 5261.98 img/s | 24.32 ms | 24.58 ms | 24.71 ms | 24.96 ms |
+| 256 | 5397.53 img/s | 47.43 ms | 47.83 ms | 47.95 ms | 48.17 ms |
+
+**INT8 Inference Latency**
+
+|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
+|--------------|------------------|---------------|---------------|---------------|---------------|
+| 1 | 1285.15 img/s | 0.78 ms | 0.78 ms | 0.93 ms | 0.95 ms |
+| 2 | 2293.43 img/s | 0.87 ms | 0.88 ms | 1.03 ms | 1.05 ms |
+| 4 | 3508.39 img/s | 1.14 ms | 1.15 ms | 1.29 ms | 1.32 ms |
+| 8 | 5907.02 img/s | 1.35 ms | 1.36 ms | 1.51 ms | 1.60 ms |
+| 16 | 7416.99 img/s | 2.16 ms | 2.19 ms | 2.31 ms | 2.36 ms |
+| 32 | 8337.02 img/s | 3.84 ms | 3.91 ms | 4.01 ms | 4.14 ms |
+| 64 | 9039.71 img/s | 7.08 ms | 7.24 ms | 7.40 ms | 7.66 ms |
+| 128 | 9387.23 img/s | 13.63 ms | 13.84 ms | 13.92 ms | 14.11 ms |
+| 256 | 9598.97 img/s | 26.67 ms | 27.12 ms | 27.24 ms | 27.48 ms |
## Release notes
@@ -975,6 +1110,11 @@ Our results for Paddle-TRT were obtained by running the `inference.py` script on
* Updated README
* A100 convergence benchmark
+3. December 2023
+ * Added quantization aware training
+ * Added INT8 inference for Paddle-TRT
+ * Simplified the inference process
+
### Known issues
* Allreduce causes issues with top1 and top5 accuracy in evaluation. Workaround: use `build_strategy.fix_op_run_order = True` for the eval program. (refer to [Paddle-issue-39567](https://github.com/PaddlePaddle/Paddle/issues/39567) for details)
diff --git a/PaddlePaddle/Classification/RN50v1.5/dali.py b/PaddlePaddle/Classification/RN50v1.5/dali.py
index 3f4a4def8..e1f99ce16 100644
--- a/PaddlePaddle/Classification/RN50v1.5/dali.py
+++ b/PaddlePaddle/Classification/RN50v1.5/dali.py
@@ -12,10 +12,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
+import ctypes
import os
from dataclasses import dataclass
+from cuda import cudart
import paddle
+import numpy as np
+from nvidia.dali.backend import TensorListCPU
import nvidia.dali.ops as ops
+import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin.paddle import DALIGenericIterator
@@ -236,3 +241,54 @@ def build_dataloader(args, mode):
"""
assert mode in Mode, "Dataset mode should be in supported Modes (train or eval)"
return dali_dataloader(args, mode, paddle.device.get_device())
+
+
+def dali_synthetic_dataloader(args, device):
+ """
+ Define a dali dataloader with synthetic data.
+
+ Args:
+ args(Namespace): Arguments obtained from ArgumentParser.
+ device(int): Id of GPU to load data.
+ Outputs:
+ DALIGenericIterator(nvidia.dali.plugin.paddle.DALIGenericIterator)
+ Iteratable outputs of DALI pipeline,
+ including "data" in type of Paddle's Tensor.
+ """
+ assert "gpu" in device, "gpu training is required for DALI"
+
+ device_id = int(device.split(':')[1])
+
+ batch_size = args.batch_size
+ image_shape = args.image_shape
+ output_dtype = types.FLOAT16 if args.dali_output_fp16 else types.FLOAT
+ num_threads = args.dali_num_threads
+
+ class ExternalInputIterator(object):
+ def __init__(self, batch_size, image_shape):
+ n_bytes = int(batch_size * np.prod(image_shape) * 4)
+ err, mem = cudart.cudaMallocHost(n_bytes)
+ assert err == cudart.cudaError_t.cudaSuccess
+ mem_ptr = ctypes.cast(mem, ctypes.POINTER(ctypes.c_float))
+ self.synthetic_data = np.ctypeslib.as_array(mem_ptr, shape=(batch_size, *image_shape))
+ self.n = args.benchmark_steps
+
+ def __iter__(self):
+ self.i = 0
+ return self
+
+ def __next__(self):
+ if self.i >= self.n:
+ self.__iter__()
+ raise StopIteration()
+ self.i += 1
+ return TensorListCPU(self.synthetic_data, is_pinned=True)
+
+ eli = ExternalInputIterator(batch_size, image_shape)
+ pipe = Pipeline(batch_size=batch_size, num_threads=num_threads, device_id=device_id)
+ with pipe:
+ images = fn.external_source(source=eli, no_copy=True, dtype=output_dtype)
+ images = images.gpu()
+ pipe.set_outputs(images)
+ pipe.build()
+ return DALIGenericIterator([pipe], ['data'])
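
For context, a minimal sketch (not part of the change) of how the new `dali_synthetic_dataloader` might be consumed. The field names mirror options already defined in `utils/config.py`; the concrete values and the `gpu:0` device string are illustrative assumptions. The helper allocates one pinned host buffer via `cudart.cudaMallocHost` and hands it to DALI through `fn.external_source(..., no_copy=True)`, so each step copies the same pinned buffer to the GPU.

```python
# Sketch only: driving dali_synthetic_dataloader added above.
# args fields mirror utils/config.py options; values are illustrative.
from types import SimpleNamespace

from dali import dali_synthetic_dataloader

args = SimpleNamespace(
    batch_size=32,
    image_shape=[3, 224, 224],   # assumed NCHW shape; NHWC works as well
    dali_output_fp16=False,
    dali_num_threads=4,
    benchmark_steps=8,           # the iterator stops after this many batches
)

dali_iter = dali_synthetic_dataloader(args, 'gpu:0')

n_batches = 0
for dali_data in dali_iter:      # one entry per DALI pipeline
    for data in dali_data:
        images = data['data']    # batch already resident on the GPU
        n_batches += 1
print(f'produced {n_batches} synthetic batches')
```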
diff --git a/PaddlePaddle/Classification/RN50v1.5/export_model.py b/PaddlePaddle/Classification/RN50v1.5/export_model.py
deleted file mode 100644
index dac24d3e8..000000000
--- a/PaddlePaddle/Classification/RN50v1.5/export_model.py
+++ /dev/null
@@ -1,75 +0,0 @@
-# Copyright (c) 2022 NVIDIA Corporation. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-import logging
-import paddle
-import program
-from dali import build_dataloader
-from utils.mode import Mode
-from utils.save_load import init_ckpt
-from utils.logger import setup_dllogger
-from utils.config import parse_args, print_args
-
-
-def main(args):
- '''
- Export saved model params to paddle inference model
- '''
- setup_dllogger(args.trt_export_log_path)
- if args.show_config:
- print_args(args)
-
- eval_dataloader = build_dataloader(args, Mode.EVAL)
-
- startup_prog = paddle.static.Program()
- eval_prog = paddle.static.Program()
-
- eval_fetchs, _, eval_feeds, _ = program.build(
- args,
- eval_prog,
- startup_prog,
- step_each_epoch=len(eval_dataloader),
- is_train=False)
- eval_prog = eval_prog.clone(for_test=True)
-
- device = paddle.set_device('gpu')
- exe = paddle.static.Executor(device)
- exe.run(startup_prog)
-
- path_to_ckpt = args.from_checkpoint
-
- if path_to_ckpt is None:
- logging.warning(
- 'The --from-checkpoint is not set, model weights will not be initialize.'
- )
- else:
- init_ckpt(path_to_ckpt, eval_prog, exe)
- logging.info('Checkpoint path is %s', path_to_ckpt)
-
- save_inference_dir = args.trt_inference_dir
- paddle.static.save_inference_model(
- path_prefix=os.path.join(save_inference_dir, args.model_arch_name),
- feed_vars=[eval_feeds['data']],
- fetch_vars=[eval_fetchs['label'][0]],
- executor=exe,
- program=eval_prog)
-
- logging.info('Successully export inference model to %s',
- save_inference_dir)
-
-
-if __name__ == '__main__':
- paddle.enable_static()
- main(parse_args(including_trt=True))
diff --git a/PaddlePaddle/Classification/RN50v1.5/inference.py b/PaddlePaddle/Classification/RN50v1.5/inference.py
index fe2e0c812..bad6ccac9 100644
--- a/PaddlePaddle/Classification/RN50v1.5/inference.py
+++ b/PaddlePaddle/Classification/RN50v1.5/inference.py
@@ -22,14 +22,14 @@
from paddle.fluid import LoDTensor
from paddle.inference import Config, PrecisionType, create_predictor
-from dali import dali_dataloader
+from dali import dali_dataloader, dali_synthetic_dataloader
from utils.config import parse_args, print_args
from utils.mode import Mode
from utils.logger import setup_dllogger
def init_predictor(args):
- infer_dir = args.trt_inference_dir
+ infer_dir = args.inference_dir
assert os.path.isdir(
infer_dir), f'inference_dir = "{infer_dir}" is not a directory'
pdiparams_path = glob.glob(os.path.join(infer_dir, '*.pdiparams'))
@@ -40,8 +40,8 @@ def init_predictor(args):
f'There should be only 1 pdmodel in {infer_dir}, but there are {len(pdmodel_path)}'
predictor_config = Config(pdmodel_path[0], pdiparams_path[0])
predictor_config.enable_memory_optim()
- predictor_config.enable_use_gpu(0, 0)
- precision = args.trt_precision
+ predictor_config.enable_use_gpu(0, args.device)
+ precision = args.precision
max_batch_size = args.batch_size
assert precision in ['FP32', 'FP16', 'INT8'], \
'precision should be FP32/FP16/INT8'
@@ -54,12 +54,17 @@ def init_predictor(args):
else:
raise NotImplementedError
predictor_config.enable_tensorrt_engine(
- workspace_size=args.trt_workspace_size,
+ workspace_size=args.workspace_size,
max_batch_size=max_batch_size,
- min_subgraph_size=args.trt_min_subgraph_size,
+ min_subgraph_size=args.min_subgraph_size,
precision_mode=precision_mode,
- use_static=args.trt_use_static,
- use_calib_mode=args.trt_use_calib_mode)
+ use_static=args.use_static,
+ use_calib_mode=args.use_calib_mode)
+ predictor_config.set_trt_dynamic_shape_info(
+ {"data": (1,) + tuple(args.image_shape)},
+ {"data": (args.batch_size,) + tuple(args.image_shape)},
+ {"data": (args.batch_size,) + tuple(args.image_shape)},
+ )
predictor = create_predictor(predictor_config)
return predictor
@@ -106,14 +111,14 @@ def benchmark_dataset(args):
"""
predictor = init_predictor(args)
- dali_iter = dali_dataloader(args, Mode.EVAL, 'gpu:0')
+ dali_iter = dali_dataloader(args, Mode.EVAL, 'gpu:' + str(args.device))
# Warmup some samples for the stable performance number
batch_size = args.batch_size
image_shape = args.image_shape
- image = np.zeros((batch_size, *image_shape)).astype(np.single)
+ images = np.zeros((batch_size, *image_shape)).astype(np.float32)
for _ in range(args.benchmark_warmup_steps):
- predict(predictor, [image])[0]
+ predict(predictor, [images])[0]
total_images = 0
correct_predict = 0
@@ -127,8 +132,8 @@ def benchmark_dataset(args):
label = np.asarray(data['label'])
total_images += label.shape[0]
label = label.flatten()
- image = data['data']
- predict_label = predict(predictor, [image])[0]
+ images = data['data']
+ predict_label = predict(predictor, [images])[0]
correct_predict += (label == predict_label).sum()
batch_end_time_step = time.perf_counter()
batch_latency = batch_end_time_step - last_time_step
@@ -140,7 +145,7 @@ def benchmark_dataset(args):
quantile = np.quantile(latency, [0.9, 0.95, 0.99])
statistics = {
- 'precision': args.trt_precision,
+ 'precision': args.precision,
'batch_size': batch_size,
'throughput': total_images / (end - start),
'accuracy': correct_predict / total_images,
@@ -152,29 +157,33 @@ def benchmark_dataset(args):
return statistics
-def benchmark_synthat(args):
+def benchmark_synthetic(args):
"""
- Benchmark on the synthatic data and bypass all pre-processing.
+ Benchmark on the synthetic data and bypass all pre-processing.
The host to device copy is still included.
    This is used to find the upper throughput bound when tuning the full input pipeline.
"""
predictor = init_predictor(args)
+ dali_iter = dali_synthetic_dataloader(args, 'gpu:' + str(args.device))
+
batch_size = args.batch_size
image_shape = args.image_shape
- image = np.random.random((batch_size, *image_shape)).astype(np.single)
+ images = np.random.random((batch_size, *image_shape)).astype(np.float32)
latency = []
# warmup
for _ in range(args.benchmark_warmup_steps):
- predict(predictor, [image])[0]
+ predict(predictor, [images])[0]
# benchmark
start = time.perf_counter()
last_time_step = time.perf_counter()
- for _ in range(args.benchmark_steps):
- predict(predictor, [image])[0]
+ for dali_data in dali_iter:
+ for data in dali_data:
+ images = data['data']
+ predict(predictor, [images])[0]
batch_end_time_step = time.perf_counter()
batch_latency = batch_end_time_step - last_time_step
latency.append(batch_latency)
@@ -185,7 +194,7 @@ def benchmark_synthat(args):
quantile = np.quantile(latency, [0.9, 0.95, 0.99])
statistics = {
- 'precision': args.trt_precision,
+ 'precision': args.precision,
'batch_size': batch_size,
'throughput': args.benchmark_steps * batch_size / (end - start),
'eval_latency_avg': np.mean(latency),
@@ -195,14 +204,13 @@ def benchmark_synthat(args):
}
return statistics
-
def main(args):
- setup_dllogger(args.trt_log_path)
+ setup_dllogger(args.report_file)
if args.show_config:
print_args(args)
- if args.trt_use_synthat:
- statistics = benchmark_synthat(args)
+ if args.use_synthetic:
+ statistics = benchmark_synthetic(args)
else:
statistics = benchmark_dataset(args)
@@ -210,4 +218,4 @@ def main(args):
if __name__ == '__main__':
- main(parse_args(including_trt=True))
+ main(parse_args(script='inference'))
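
As a reading aid, here is a self-contained sketch of the Paddle Inference configuration that the renamed options in `init_predictor` now assemble, including the new dynamic-shape registration. The model path, the `data` input name, the NHWC shape, and the batch size are assumptions for illustration; in the script they come from `--inference-dir`, `--image-shape`, and `--batch-size`.

```python
# Sketch only: the Paddle-TRT predictor setup built by init_predictor.
# Paths, shapes and the "data" input name are illustrative assumptions.
from paddle.inference import Config, PrecisionType, create_predictor

batch_size = 256
image_shape = (224, 224, 3)          # NHWC, as used by the AMP scripts

config = Config('./inference_amp/ResNet50.pdmodel',
                './inference_amp/ResNet50.pdiparams')
config.enable_memory_optim()
config.enable_use_gpu(0, 0)          # (initial pool size in MB, device id)
config.enable_tensorrt_engine(
    workspace_size=1 << 30,
    max_batch_size=batch_size,
    min_subgraph_size=3,
    precision_mode=PrecisionType.Half,
    use_static=False,
    use_calib_mode=False)
# New in this change: TensorRT dynamic-shape ranges for the "data" input.
config.set_trt_dynamic_shape_info(
    {"data": (1,) + image_shape},             # minimum shape
    {"data": (batch_size,) + image_shape},    # maximum shape
    {"data": (batch_size,) + image_shape})    # optimal shape
predictor = create_predictor(config)
```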
diff --git a/PaddlePaddle/Classification/RN50v1.5/program.py b/PaddlePaddle/Classification/RN50v1.5/program.py
index 6dcba59a2..ec16c727d 100644
--- a/PaddlePaddle/Classification/RN50v1.5/program.py
+++ b/PaddlePaddle/Classification/RN50v1.5/program.py
@@ -12,26 +12,25 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-import time
import logging
-
+import time
from profile import Profiler
+
+import dllogger
+import models
import numpy as np
-from optimizer import build_optimizer
from lr_scheduler import build_lr_scheduler
+from optimizer import build_optimizer
from utils.misc import AverageMeter
from utils.mode import Mode, RunScope
from utils.utility import get_num_trainers
-import models
-
-import dllogger
import paddle
import paddle.nn.functional as F
from paddle.distributed import fleet
from paddle.distributed.fleet import DistributedStrategy
-from paddle.static import sparsity
from paddle.distributed.fleet.meta_optimizers.common import CollectiveHelper
+from paddle.incubate import asp as sparsity
def create_feeds(image_shape):
@@ -45,11 +44,13 @@ def create_feeds(image_shape):
key (string): Name of variable to feed.
Value (tuple): paddle.static.data.
"""
- feeds = dict()
+ feeds = {}
feeds['data'] = paddle.static.data(
- name="data", shape=[None] + image_shape, dtype="float32")
+ name="data", shape=[None] + image_shape, dtype="float32"
+ )
feeds['label'] = paddle.static.data(
- name="label", shape=[None, 1], dtype="int64")
+ name="label", shape=[None, 1], dtype="int64"
+ )
return feeds
@@ -70,7 +71,7 @@ def create_fetchs(out, feeds, class_num, label_smoothing=0, mode=Mode.TRAIN):
key (string): Name of variable to fetch.
Value (tuple): (variable, AverageMeter).
"""
- fetchs = dict()
+ fetchs = {}
target = paddle.reshape(feeds['label'], [-1, 1])
if mode == Mode.TRAIN:
@@ -78,8 +79,7 @@ def create_fetchs(out, feeds, class_num, label_smoothing=0, mode=Mode.TRAIN):
loss = F.cross_entropy(out, target)
else:
label_one_hot = F.one_hot(target, class_num)
- soft_target = F.label_smooth(
- label_one_hot, epsilon=label_smoothing)
+ soft_target = F.label_smooth(label_one_hot, epsilon=label_smoothing)
soft_target = paddle.reshape(soft_target, shape=[-1, class_num])
log_softmax = -F.log_softmax(out, axis=-1)
loss = paddle.sum(log_softmax * soft_target, axis=-1)
@@ -94,19 +94,23 @@ def create_fetchs(out, feeds, class_num, label_smoothing=0, mode=Mode.TRAIN):
acc_top1 = paddle.metric.accuracy(input=out, label=target, k=1)
acc_top5 = paddle.metric.accuracy(input=out, label=target, k=5)
- metric_dict = dict()
+ metric_dict = {}
metric_dict["top1"] = acc_top1
metric_dict["top5"] = acc_top5
for key in metric_dict:
if mode != Mode.TRAIN and paddle.distributed.get_world_size() > 1:
paddle.distributed.all_reduce(
- metric_dict[key], op=paddle.distributed.ReduceOp.SUM)
- metric_dict[key] = metric_dict[
- key] / paddle.distributed.get_world_size()
+ metric_dict[key], op=paddle.distributed.ReduceOp.SUM
+ )
+ metric_dict[key] = (
+ metric_dict[key] / paddle.distributed.get_world_size()
+ )
- fetchs[key] = (metric_dict[key], AverageMeter(
- key, '7.4f', need_avg=True))
+ fetchs[key] = (
+ metric_dict[key],
+ AverageMeter(key, '7.4f', need_avg=True),
+ )
return fetchs
@@ -127,13 +131,16 @@ def create_strategy(args, is_train=True):
exec_strategy = paddle.static.ExecutionStrategy()
exec_strategy.num_threads = 1
- exec_strategy.num_iteration_per_drop_scope = (10000 if args.amp and
- args.use_pure_fp16 else 10)
-
- paddle.set_flags({
- 'FLAGS_cudnn_exhaustive_search': True,
- 'FLAGS_conv_workspace_size_limit': 4096
- })
+ exec_strategy.num_iteration_per_drop_scope = (
+ 10000 if args.amp and args.use_pure_fp16 else 10
+ )
+
+ paddle.set_flags(
+ {
+ 'FLAGS_cudnn_exhaustive_search': True,
+ 'FLAGS_conv_workspace_size_limit': 4096,
+ }
+ )
if not is_train:
build_strategy.fix_op_run_order = True
@@ -143,6 +150,8 @@ def create_strategy(args, is_train=True):
build_strategy.fuse_elewise_add_act_ops = True
build_strategy.fuse_bn_add_act_ops = True
build_strategy.enable_addto = True
+ if args.fuse_resunit and is_train:
+ build_strategy.fuse_resunit = True
return build_strategy, exec_strategy
@@ -175,10 +184,11 @@ def dist_optimizer(args, optimizer):
dist_strategy.amp_configs = {
"init_loss_scaling": args.scale_loss,
"use_dynamic_loss_scaling": args.use_dynamic_loss_scaling,
- "use_pure_fp16": args.use_pure_fp16
+ "use_pure_fp16": args.use_pure_fp16,
}
dist_strategy.asp = args.asp
+ dist_strategy.qat = args.qat
optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
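
For orientation, a condensed sketch of the fleet strategy that `dist_optimizer` assembles, with the new `qat` switch next to the existing `amp`/`asp` ones. The optimizer and numeric values are placeholders, and `dist_strategy.qat` assumes the Paddle version shipped in the updated NGC container.

```python
# Sketch only: how the strategy flags handled by dist_optimizer fit together.
# Values are placeholders; requires paddle static mode and an initialized fleet.
import paddle
from paddle.distributed import fleet
from paddle.distributed.fleet import DistributedStrategy

paddle.enable_static()
fleet.init(is_collective=True)

dist_strategy = DistributedStrategy()
dist_strategy.amp = True
dist_strategy.amp_configs = {
    "init_loss_scaling": 128.0,
    "use_dynamic_loss_scaling": True,
    "use_pure_fp16": False,
}
dist_strategy.asp = True   # 2:4 structured sparsity
dist_strategy.qat = True   # new flag wired up in this change

optimizer = paddle.optimizer.Momentum(learning_rate=0.1, momentum=0.9)
optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
```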
@@ -221,14 +231,16 @@ def build(args, main_prog, startup_prog, step_each_epoch, is_train=True):
input_image_channel=input_image_channel,
data_format=data_format,
use_pure_fp16=use_pure_fp16,
- bn_weight_decay=bn_weight_decay)
+ bn_weight_decay=bn_weight_decay,
+ )
out = model(feeds["data"])
fetchs = create_fetchs(
- out, feeds, class_num, args.label_smoothing, mode=mode)
+ out, feeds, class_num, args.label_smoothing, mode=mode
+ )
if args.asp:
- sparsity.set_excluded_layers(main_prog, [model.fc.weight.name])
+ sparsity.set_excluded_layers(main_program=main_prog, param_names=[model.fc.weight.name])
lr_scheduler = None
optimizer = None
@@ -242,10 +254,13 @@ def build(args, main_prog, startup_prog, step_each_epoch, is_train=True):
# This is a workaround to "Communicator of ring id 0 has not been initialized.".
# Since Paddle's design, the initialization would be done inside train program,
# eval_only need to manually call initialization.
- if args.run_scope == RunScope.EVAL_ONLY and \
- paddle.distributed.get_world_size() > 1:
+ if (
+ args.run_scope == RunScope.EVAL_ONLY
+ and paddle.distributed.get_world_size() > 1
+ ):
collective_helper = CollectiveHelper(
- role_maker=fleet.PaddleCloudRoleMaker(is_collective=True))
+ role_maker=fleet.PaddleCloudRoleMaker(is_collective=True)
+ )
collective_helper.update_startup_program(startup_prog)
return fetchs, lr_scheduler, feeds, optimizer
@@ -268,22 +283,22 @@ def compile_prog(args, program, loss_name=None, is_train=True):
build_strategy, exec_strategy = create_strategy(args, is_train)
compiled_program = paddle.static.CompiledProgram(
- program).with_data_parallel(
- loss_name=loss_name,
- build_strategy=build_strategy,
- exec_strategy=exec_strategy)
+ program, build_strategy=build_strategy
+ )
return compiled_program
-def run(args,
- dataloader,
- exe,
- program,
- fetchs,
- epoch,
- mode=Mode.TRAIN,
- lr_scheduler=None):
+def run(
+ args,
+ dataloader,
+ exe,
+ program,
+ fetchs,
+ epoch,
+ mode=Mode.TRAIN,
+ lr_scheduler=None,
+):
"""
Execute program.
@@ -310,11 +325,11 @@ def run(args,
if fetchs[k][1] is not None:
metric_dict[k] = fetchs[k][1]
- metric_dict["batch_time"] = AverageMeter(
- 'batch_time', '.5f', postfix=" s,")
+ metric_dict["batch_time"] = AverageMeter('batch_time', '.5f', postfix=" s,")
metric_dict["data_time"] = AverageMeter('data_time', '.5f', postfix=" s,")
metric_dict["compute_time"] = AverageMeter(
- 'compute_time', '.5f', postfix=" s,")
+ 'compute_time', '.5f', postfix=" s,"
+ )
for m in metric_dict.values():
m.reset()
@@ -326,8 +341,7 @@ def run(args,
batch_size = None
latency = []
- total_benchmark_steps = \
- args.benchmark_steps + args.benchmark_warmup_steps
+ total_benchmark_steps = args.benchmark_steps + args.benchmark_warmup_steps
dataloader.reset()
while True:
@@ -359,11 +373,12 @@ def run(args,
batch_size = batch[0]["data"].shape()[0]
feed_dict = batch[0]
- with profiler.profile_tag(idx, "Training"
- if mode == Mode.TRAIN else "Evaluation"):
- results = exe.run(program=program,
- feed=feed_dict,
- fetch_list=fetch_list)
+ with profiler.profile_tag(
+ idx, "Training" if mode == Mode.TRAIN else "Evaluation"
+ ):
+ results = exe.run(
+ program=program, feed=feed_dict, fetch_list=fetch_list
+ )
for name, m in zip(fetchs.keys(), results):
if name in metric_dict:
@@ -380,15 +395,16 @@ def run(args,
tic = time.perf_counter()
if idx % args.print_interval == 0:
- log_msg = dict()
+ log_msg = {}
log_msg['loss'] = metric_dict['loss'].val.item()
log_msg['top1'] = metric_dict['top1'].val.item()
log_msg['top5'] = metric_dict['top5'].val.item()
log_msg['data_time'] = metric_dict['data_time'].val
log_msg['compute_time'] = metric_dict['compute_time'].val
log_msg['batch_time'] = metric_dict['batch_time'].val
- log_msg['ips'] = \
+ log_msg['ips'] = (
batch_size * num_trainers / metric_dict['batch_time'].val
+ )
if mode == Mode.TRAIN:
log_msg['lr'] = metric_dict['lr'].val
log_info((epoch, idx), log_msg, mode)
@@ -404,10 +420,10 @@ def run(args,
logging.info("Begin benchmark at step %d", idx + 1)
if idx == total_benchmark_steps:
- benchmark_data = dict()
- benchmark_data[
- 'ips'] = batch_size * num_trainers / metric_dict[
- 'batch_time'].avg
+ benchmark_data = {}
+ benchmark_data['ips'] = (
+ batch_size * num_trainers / metric_dict['batch_time'].avg
+ )
if mode == mode.EVAL:
latency = np.array(latency) * 1000
quantile = np.quantile(latency, [0.9, 0.95, 0.99])
@@ -420,15 +436,19 @@ def run(args,
logging.info("End benchmark at epoch step %d", idx)
return benchmark_data
- epoch_data = dict()
+ epoch_data = {}
epoch_data['loss'] = metric_dict['loss'].avg.item()
epoch_data['epoch_time'] = metric_dict['batch_time'].total
- epoch_data['ips'] = batch_size * num_trainers * \
- metric_dict["batch_time"].count / metric_dict["batch_time"].sum
+ epoch_data['ips'] = (
+ batch_size
+ * num_trainers
+ * metric_dict["batch_time"].count
+ / metric_dict["batch_time"].sum
+ )
if mode == Mode.EVAL:
epoch_data['top1'] = metric_dict['top1'].avg.item()
epoch_data['top5'] = metric_dict['top5'].avg.item()
- log_info((epoch, ), epoch_data, mode)
+ log_info((epoch,), epoch_data, mode)
return epoch_data
@@ -443,7 +463,7 @@ def log_info(step, metrics, mode):
mode(utils.Mode): Train or eval mode.
"""
prefix = 'train' if mode == Mode.TRAIN else 'val'
- dllogger_iter_data = dict()
+ dllogger_iter_data = {}
for key in metrics:
dllogger_iter_data[f"{prefix}.{key}"] = metrics[key]
dllogger.log(step=step, data=dllogger_iter_data)
diff --git a/PaddlePaddle/Classification/RN50v1.5/requirements.txt b/PaddlePaddle/Classification/RN50v1.5/requirements.txt
index 3a6cbc400..66e17d3a0 100644
--- a/PaddlePaddle/Classification/RN50v1.5/requirements.txt
+++ b/PaddlePaddle/Classification/RN50v1.5/requirements.txt
@@ -1 +1,2 @@
git+https://github.com/NVIDIA/dllogger@v1.0.0#egg=dllogger
+cuda-python==12.0.0
diff --git a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_AMP.sh b/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_AMP.sh
index 2d59953ff..7dd68dc40 100644
--- a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_AMP.sh
+++ b/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_AMP.sh
@@ -14,8 +14,9 @@
python inference.py \
--data-layout NHWC \
- --trt-inference-dir ./inference_amp \
- --trt-precision FP16 \
+ --inference-dir ./inference_amp \
+ --precision FP16 \
--batch-size 256 \
--benchmark-steps 1024 \
- --benchmark-warmup-steps 16
+ --benchmark-warmup-steps 16 \
+ --use-synthetic True
diff --git a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_TF32.sh b/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_QAT.sh
similarity index 71%
rename from PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_TF32.sh
rename to PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_QAT.sh
index 107b1f4f9..bb2858eb7 100644
--- a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_TF32.sh
+++ b/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_QAT.sh
@@ -12,10 +12,11 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-CKPT=${1:-"./output/ResNet50/89"}
-MODEL_PREFIX=${2:-"resnet_50_paddle"}
-
-python -m paddle.distributed.launch --gpus=0 export_model.py \
- --trt-inference-dir ./inference_tf32 \
- --from-checkpoint $CKPT \
- --model-prefix ${MODEL_PREFIX}
+python inference.py \
+ --data-layout NHWC \
+ --inference-dir ./inference_qat \
+ --precision INT8 \
+ --batch-size 256 \
+ --benchmark-steps 1024 \
+ --benchmark-warmup-steps 16 \
+ --use-synthetic True
diff --git a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_TF32.sh b/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_TF32.sh
index 559677133..6e55fd0be 100644
--- a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_TF32.sh
+++ b/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_TF32.sh
@@ -13,9 +13,10 @@
# limitations under the License.
python inference.py \
- --trt-inference-dir ./inference_tf32 \
- --trt-precision FP32 \
+ --inference-dir ./inference_tf32 \
+ --precision FP32 \
--dali-num-threads 8 \
--batch-size 256 \
--benchmark-steps 1024 \
- --benchmark-warmup-steps 16
+ --benchmark-warmup-steps 16 \
+ --use-synthetic True
diff --git a/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_90E_DGXA100.sh b/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_90E_DGXA100.sh
index a9badfefe..23c4a4991 100644
--- a/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_90E_DGXA100.sh
+++ b/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_90E_DGXA100.sh
@@ -17,4 +17,6 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
--amp \
--scale-loss 128.0 \
--use-dynamic-loss-scaling \
- --data-layout NHWC
+ --data-layout NHWC \
+ --fuse-resunit \
+ --inference-dir ./inference_amp
diff --git a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_AMP.sh b/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_QAT_10E_DGXA100.sh
similarity index 69%
rename from PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_AMP.sh
rename to PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_QAT_10E_DGXA100.sh
index b1c5676b9..0e7c8f104 100644
--- a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_AMP.sh
+++ b/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_QAT_10E_DGXA100.sh
@@ -15,9 +15,14 @@
CKPT=${1:-"./output/ResNet50/89"}
MODEL_PREFIX=${2:-"resnet_50_paddle"}
-python -m paddle.distributed.launch --gpus=0 export_model.py \
- --amp \
- --data-layout NHWC \
- --trt-inference-dir ./inference_amp \
- --from-checkpoint ${CKPT} \
- --model-prefix ${MODEL_PREFIX}
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+ --from-pretrained-params ${CKPT} \
+ --model-prefix ${MODEL_PREFIX} \
+ --epochs 10 \
+ --amp \
+ --scale-loss 128.0 \
+ --use-dynamic-loss-scaling \
+ --data-layout NHWC \
+ --qat \
+ --lr 0.00005 \
+ --inference-dir ./inference_qat
diff --git a/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_TF32_90E_DGXA100.sh b/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_TF32_90E_DGXA100.sh
index 65c87b752..0c5ea7988 100644
--- a/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_TF32_90E_DGXA100.sh
+++ b/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_TF32_90E_DGXA100.sh
@@ -12,4 +12,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py --epochs 90
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py --epochs 90 --inference-dir ./inference_tf32
diff --git a/PaddlePaddle/Classification/RN50v1.5/train.py b/PaddlePaddle/Classification/RN50v1.5/train.py
index d469534de..28e985135 100644
--- a/PaddlePaddle/Classification/RN50v1.5/train.py
+++ b/PaddlePaddle/Classification/RN50v1.5/train.py
@@ -12,20 +12,23 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-import os
import logging
-import paddle
-from paddle.distributed import fleet
-from paddle.static import sparsity
-from paddle.fluid.contrib.mixed_precision.fp16_utils import rewrite_program
-from paddle.fluid.contrib.mixed_precision.fp16_lists import AutoMixedPrecisionLists
+import os
+
from dali import build_dataloader
+from utils.affinity import set_cpu_affinity
from utils.config import parse_args, print_args
from utils.logger import setup_dllogger
-from utils.save_load import init_program, save_model
-from utils.affinity import set_cpu_affinity
from utils.mode import Mode, RunScope
+from utils.save_load import init_program, save_model
+
+import paddle
import program
+from paddle.distributed import fleet
+from paddle.static.amp.fp16_lists import AutoMixedPrecisionLists
+from paddle.static.amp.fp16_utils import cast_model_to_fp16
+from paddle.incubate import asp as sparsity
+from paddle.static.quantization.quanter import quant_aware
class MetricSummary:
@@ -35,18 +38,24 @@ def __init__(self):
def update(self, new_metrics):
if not self.is_updated:
- self.metric_dict = dict()
+ self.metric_dict = {}
for key in new_metrics:
if key in self.metric_dict:
# top1, top5 and ips are "larger is better"
if key in ['top1', 'top5', 'ips']:
- self.metric_dict[key] = new_metrics[key] if new_metrics[
- key] > self.metric_dict[key] else self.metric_dict[key]
+ self.metric_dict[key] = (
+ new_metrics[key]
+ if new_metrics[key] > self.metric_dict[key]
+ else self.metric_dict[key]
+ )
# Others are "Smaller is better"
else:
- self.metric_dict[key] = new_metrics[key] if new_metrics[
- key] < self.metric_dict[key] else self.metric_dict[key]
+ self.metric_dict[key] = (
+ new_metrics[key]
+ if new_metrics[key] < self.metric_dict[key]
+ else self.metric_dict[key]
+ )
else:
self.metric_dict[key] = new_metrics[key]
@@ -89,7 +98,8 @@ def main(args):
train_prog,
startup_prog,
step_each_epoch=train_step_each_epoch,
- is_train=True)
+ is_train=True,
+ )
eval_dataloader = None
eval_prog = None
@@ -98,12 +108,13 @@ def main(args):
eval_step_each_epoch = len(eval_dataloader)
eval_prog = paddle.static.Program()
- eval_fetchs, _, _, _ = program.build(
+ eval_fetchs, _, eval_feeds, _ = program.build(
args,
eval_prog,
startup_prog,
step_each_epoch=eval_step_each_epoch,
- is_train=False)
+ is_train=False,
+ )
# clone to prune some content which is irrelevant in eval_prog
eval_prog = eval_prog.clone(for_test=True)
@@ -113,23 +124,38 @@ def main(args):
init_program(
args,
exe=exe,
- program=train_prog if train_prog is not None else eval_prog)
+ program=train_prog if train_prog is not None else eval_prog,
+ )
if args.amp:
if args.run_scope == RunScope.EVAL_ONLY:
- rewrite_program(eval_prog, amp_lists=AutoMixedPrecisionLists())
+ cast_model_to_fp16(
+ eval_prog,
+ AutoMixedPrecisionLists(),
+ use_fp16_guard=False,
+ level='O1',
+ )
else:
optimizer.amp_init(
device,
scope=paddle.static.global_scope(),
test_program=eval_prog,
- use_fp16_test=True)
+ use_fp16_test=True,
+ )
if args.asp and args.prune_model:
logging.info("Pruning model to 2:4 sparse pattern...")
sparsity.prune_model(train_prog, mask_algo=args.mask_algo)
logging.info("Pruning model done.")
+ if args.qat:
+ if args.run_scope == RunScope.EVAL_ONLY:
+ eval_prog = quant_aware(eval_prog, device, for_test=True, return_program=True)
+ else:
+ optimizer.qat_init(
+ device,
+ test_program=eval_prog)
+
if eval_prog is not None:
eval_prog = program.compile_prog(args, eval_prog, is_train=False)
@@ -138,28 +164,44 @@ def main(args):
for epoch_id in range(args.start_epoch, args.epochs):
# Training
if train_prog is not None:
- metric_summary = program.run(args, train_dataloader, exe,
- train_prog, train_fetchs, epoch_id,
- Mode.TRAIN, lr_scheduler)
+ metric_summary = program.run(
+ args,
+ train_dataloader,
+ exe,
+ train_prog,
+ train_fetchs,
+ epoch_id,
+ Mode.TRAIN,
+ lr_scheduler,
+ )
train_summary.update(metric_summary)
# Save a checkpoint
if epoch_id % args.save_interval == 0:
- model_path = os.path.join(args.output_dir,
- args.model_arch_name)
+ model_path = os.path.join(args.checkpoint_dir, args.model_arch_name)
save_model(train_prog, model_path, epoch_id, args.model_prefix)
# Evaluation
- if (eval_prog is not None) and \
- (epoch_id % args.eval_interval == 0):
- metric_summary = program.run(args, eval_dataloader, exe, eval_prog,
- eval_fetchs, epoch_id, Mode.EVAL)
+ if (eval_prog is not None) and (epoch_id % args.eval_interval == 0):
+ metric_summary = program.run(
+ args,
+ eval_dataloader,
+ exe,
+ eval_prog,
+ eval_fetchs,
+ epoch_id,
+ Mode.EVAL,
+ )
eval_summary.update(metric_summary)
if train_summary.is_updated:
- program.log_info(tuple(), train_summary.metric_dict, Mode.TRAIN)
+ program.log_info((), train_summary.metric_dict, Mode.TRAIN)
if eval_summary.is_updated:
- program.log_info(tuple(), eval_summary.metric_dict, Mode.EVAL)
+ program.log_info((), eval_summary.metric_dict, Mode.EVAL)
+
+ if eval_prog is not None:
+ model_path = os.path.join(args.inference_dir, args.model_arch_name)
+ paddle.static.save_inference_model(model_path, [eval_feeds['data']], [eval_fetchs['label'][0]], exe, program=eval_prog)
if __name__ == '__main__':
diff --git a/PaddlePaddle/Classification/RN50v1.5/utils/config.py b/PaddlePaddle/Classification/RN50v1.5/utils/config.py
index 3987084e6..3b4b46494 100644
--- a/PaddlePaddle/Classification/RN50v1.5/utils/config.py
+++ b/PaddlePaddle/Classification/RN50v1.5/utils/config.py
@@ -100,7 +100,8 @@ def print_args(args):
args_for_log = copy.deepcopy(args)
# Due to dllogger cannot serialize Enum into JSON.
- args_for_log.run_scope = args_for_log.run_scope.value
+ if hasattr(args_for_log, 'run_scope'):
+ args_for_log.run_scope = args_for_log.run_scope.value
dllogger.log(step='PARAMETER', data=vars(args_for_log))
@@ -150,13 +151,19 @@ def check_and_process_args(args):
args.eval_interval = 1
-def add_global_args(parser):
- group = parser.add_argument_group('Global')
+def add_general_args(parser):
+ group = parser.add_argument_group('General')
group.add_argument(
- '--output-dir',
+ '--checkpoint-dir',
type=str,
- default='./output/',
+ default='./checkpoint/',
help='A path to store trained models.')
+ group.add_argument(
+ '--inference-dir',
+ type=str,
+ default='./inference/',
+ help='A path to store inference model once the training is finished.'
+ )
group.add_argument(
'--run-scope',
default='train_eval',
@@ -188,13 +195,8 @@ def add_global_args(parser):
group.add_argument(
'--report-file',
type=str,
- default='./report.json',
+ default='./train.json',
help='A file in which to store JSON experiment report.')
- group.add_argument(
- '--data-layout',
- default='NCHW',
- choices=('NCHW', 'NHWC'),
- help='Data format. It should be one of {NCHW, NHWC}.')
group.add_argument(
'--benchmark', action='http://github.com/store_true', help='To enable benchmark mode.')
group.add_argument(
@@ -276,7 +278,10 @@ def add_advance_args(parser):
'--use-pure-fp16',
action='http://github.com/store_true',
help='Enable pure FP16 training, only be applied when --amp is set.')
-
+ group.add_argument(
+ '--fuse-resunit',
+ action='http://github.com/store_true',
+ help='Enable CUDNNv8 ResUnit fusion, only be applied when --amp is set.')
# ASP
group.add_argument(
'--asp',
@@ -295,6 +300,11 @@ def add_advance_args(parser):
'{mask_1d, mask_2d_greedy, mask_2d_best}. This only be applied ' \
'when --asp and --prune-model is set.'
)
+ # QAT
+ group.add_argument(
+ '--qat',
+ action='http://github.com/store_true',
+ help='Enable quantization aware training (QAT).')
return parser
@@ -392,6 +402,11 @@ def add_model_args(parser):
type=int,
default=1000,
help='The number classes of images.')
+ group.add_argument(
+ '--data-layout',
+ default='NCHW',
+ choices=('NCHW', 'NHWC'),
+ help='Data format. It should be one of {NCHW, NHWC}.')
group.add_argument(
'--bn-weight-decay',
action='http://github.com/store_true',
@@ -445,72 +460,105 @@ def add_training_args(parser):
def add_trt_args(parser):
+ def int_list(x):
+ return list(map(int, x.split(',')))
+
group = parser.add_argument_group('Paddle-TRT')
group.add_argument(
- '--trt-inference-dir',
+ '--device',
+ type=int,
+ default='0',
+ help='The GPU device id for Paddle-TRT inference.'
+ )
+ group.add_argument(
+ '--inference-dir',
type=str,
default='./inference',
- help='A path to store/load inference models. ' \
- 'export_model.py would export models to this folder, ' \
- 'then inference.py would load from here.'
+ help='A path to load inference models.'
)
group.add_argument(
- '--trt-precision',
+ '--data-layout',
+ default='NCHW',
+ choices=('NCHW', 'NHWC'),
+ help='Data format. It should be one of {NCHW, NHWC}.')
+ group.add_argument(
+ '--precision',
default='FP32',
choices=('FP32', 'FP16', 'INT8'),
help='The precision of TensorRT. It should be one of {FP32, FP16, INT8}.'
)
group.add_argument(
- '--trt-workspace-size',
+ '--workspace-size',
type=int,
default=(1 << 30),
help='The memory workspace of TensorRT in MB.')
group.add_argument(
- '--trt-min-subgraph-size',
+ '--min-subgraph-size',
type=int,
default=3,
help='The minimal subgraph size to enable PaddleTRT.')
group.add_argument(
- '--trt-use-static',
+ '--use-static',
type=distutils.util.strtobool,
default=False,
help='Fix TensorRT engine at first running.')
group.add_argument(
- '--trt-use-calib-mode',
+ '--use-calib-mode',
type=distutils.util.strtobool,
default=False,
help='Use the PTQ calibration of PaddleTRT int8.')
group.add_argument(
- '--trt-export-log-path',
- type=str,
- default='./export.json',
- help='A file in which to store JSON model exporting report.')
- group.add_argument(
- '--trt-log-path',
+ '--report-file',
type=str,
default='./inference.json',
help='A file in which to store JSON inference report.')
group.add_argument(
- '--trt-use-synthat',
+ '--use-synthetic',
type=distutils.util.strtobool,
default=False,
help='Apply synthetic data for benchmark.')
+ group.add_argument(
+ '--benchmark-steps',
+ type=int,
+ default=100,
+ help='Steps for benchmark run, only be applied when --benchmark is set.'
+ )
+ group.add_argument(
+ '--benchmark-warmup-steps',
+ type=int,
+ default=100,
+ help='Warmup steps for benchmark run, only be applied when --benchmark is set.'
+ )
+ group.add_argument(
+ '--show-config',
+ type=distutils.util.strtobool,
+ default=True,
+ help='To show arguments.')
return parser
-def parse_args(including_trt=False):
+def parse_args(script='train'):
+ assert script in ['train', 'inference']
parser = argparse.ArgumentParser(
- description="PaddlePaddle RN50v1.5 training script",
+ description=f'PaddlePaddle RN50v1.5 {script} script',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
- parser = add_global_args(parser)
- parser = add_dataset_args(parser)
- parser = add_model_args(parser)
- parser = add_training_args(parser)
- parser = add_advance_args(parser)
-
- if including_trt:
+ if script == 'train':
+ parser = add_general_args(parser)
+ parser = add_dataset_args(parser)
+ parser = add_model_args(parser)
+ parser = add_training_args(parser)
+ parser = add_advance_args(parser)
+ args = parser.parse_args()
+ check_and_process_args(args)
+ else:
parser = add_trt_args(parser)
+ parser = add_dataset_args(parser)
+ args = parser.parse_args()
+ # Process image layout and channels
+ args.image_channel = args.image_shape[0]
+ if args.data_layout == "NHWC":
+ args.image_shape = [
+ args.image_shape[1], args.image_shape[2], args.image_shape[0]
+ ]
- args = parser.parse_args()
- check_and_process_args(args)
return args
diff --git a/PaddlePaddle/LanguageModeling/BERT/Dockerfile b/PaddlePaddle/LanguageModeling/BERT/Dockerfile
index de3f7feb1..d7f0d43f5 100644
--- a/PaddlePaddle/LanguageModeling/BERT/Dockerfile
+++ b/PaddlePaddle/LanguageModeling/BERT/Dockerfile
@@ -1,15 +1,20 @@
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/paddlepaddle:22.08-py3
-
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/paddlepaddle:23.06-py3
FROM ${FROM_IMAGE_NAME}
-
RUN apt-get update && apt-get install -y pbzip2 pv bzip2 cabextract
ENV BERT_PREP_WORKING_DIR /workspace/bert/data
-ADD requirements.txt /workspace/
+
WORKDIR /workspace/
-RUN pip install --no-cache-dir -r requirements.txt
-RUN git clone https://github.com/attardi/wikiextractor.git && cd wikiextractor && git checkout 6408a430fc504a38b04d37ce5e7fc740191dee16 && cd ..
-RUN git clone https://github.com/soskek/bookcorpus.git
-ADD . /workspace/bert
WORKDIR /workspace/bert
+RUN pip install --no-cache-dir \
+ tqdm boto3 requests six ipdb h5py nltk progressbar 'tokenizers>=0.7' \
+ git+https://github.com/NVIDIA/dllogger wget
+
+RUN apt-get install -y iputils-ping
+
+COPY . .
+
+RUN apt-get install -y libjemalloc-dev
+RUN pip install git+https://github.com/NVIDIA/lddl.git
+RUN python -m nltk.downloader punkt
diff --git a/PaddlePaddle/LanguageModeling/BERT/README.md b/PaddlePaddle/LanguageModeling/BERT/README.md
index 69ff2cc86..b7b059c76 100644
--- a/PaddlePaddle/LanguageModeling/BERT/README.md
+++ b/PaddlePaddle/LanguageModeling/BERT/README.md
@@ -20,7 +20,8 @@ This repository provides a script and recipe to train the BERT model for PaddleP
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Pre-training parameters](#pre-training-parameters)
- * [Fine tuning parameters](#fine-tuning-parameters)
+ * [Fine tuning parameters](#fine-tuning-parameters)
+ * [Multi-node](#multi-node)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
@@ -43,6 +44,7 @@ This repository provides a script and recipe to train the BERT model for PaddleP
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
* [Pre-training NVIDIA DGX A100 (8x A100 80GB)](#pre-training-nvidia-dgx-a100-8x-a100-80gb)
+ * [Pre-training NVIDIA DGX A100 (8x A100 80GB) Multi-node Scaling](#pre-training-nvidia-dgx-a100-8x-a100-80gb-multi-node-scaling)
* [Fine-tuning NVIDIA DGX A100 (8x A100 80GB)](#fine-tuning-nvidia-dgx-a100-8x-a100-80gb)
* [Inference performance results](#inference-performance-results)
* [Inference performance: NVIDIA DGX A100 (1x A100 80GB)](#inference-performance-nvidia-dgx-a100-1x-a100-80gb)
@@ -105,13 +107,17 @@ The following features are supported by this model.
| [Paddle AMP](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/performance_improving/amp_en.html) | Yes |
| [Paddle Fleet](https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/distributed/fleet/Fleet_en.html#fleet) | Yes |
| [LAMB](https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/optimizer/Lamb_en.html) | Yes |
+| [LDDL](https://github.com/NVIDIA/LDDL) | Yes |
+| Multi-node | Yes |
#### Features
[Fleet](https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/distributed/fleet/Fleet_en.html#fleet) is a unified API for distributed training of PaddlePaddle.
[LAMB](https://arxiv.org/pdf/1904.00962.pdf) stands for Layerwise Adaptive Moments based optimizer, which is a large batch optimization technique that helps accelerate the training of deep neural networks using large minibatches. It allows using a global batch size of 65536 and 32768 on sequence lengths 128 and 512, respectively, compared to a batch size of 256 for [Adam](https://arxiv.org/pdf/1412.6980.pdf). The optimized implementation accumulates 1024 gradient batches in phase 1 and 4096 steps in phase 2 before updating weights once. This results in a 15% training speedup. On multi-node systems, LAMB allows scaling up to 1024 GPUs resulting in training speedups of up to 72x in comparison to Adam. Adam has limitations on the learning rate that can be used since it is applied globally on all parameters, whereas LAMB follows a layerwise learning rate strategy.
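For readers new to the layerwise strategy, the following minimal NumPy sketch shows a single LAMB update. It is illustrative only: bias correction and clipping are omitted, and the hyperparameter values are placeholders rather than the ones used in this recipe.

```python
# Minimal NumPy sketch of one LAMB update for a single weight tensor.
# Illustrative only: bias correction and clipping are omitted, and the
# hyperparameters below are placeholders, not the values used in this recipe.
import numpy as np

def lamb_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6, wd=0.01):
    m = beta1 * m + (1 - beta1) * grad              # first moment (Adam-style)
    v = beta2 * v + (1 - beta2) * grad * grad       # second moment (Adam-style)
    update = m / (np.sqrt(v) + eps) + wd * w        # update direction with weight decay
    # Layerwise trust ratio: scale the step by ||w|| / ||update|| so each layer
    # effectively gets its own learning rate, which is what lets LAMB stay
    # stable at very large global batch sizes.
    trust_ratio = np.linalg.norm(w) / (np.linalg.norm(update) + eps)
    return w - lr * trust_ratio * update, m, v

w = np.random.randn(1024)
m = np.zeros_like(w)
v = np.zeros_like(w)
w, m, v = lamb_step(w, np.random.randn(1024), m, v)
```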
-
+
+[LDDL](https://github.com/NVIDIA/LDDL) is a library that enables scalable data preprocessing and loading. LDDL is used by this PaddlePaddle BERT example.
+
### Mixed precision training
@@ -193,7 +199,7 @@ The following section lists the requirements you need to meet to start training
This repository contains a Dockerfile that extends the CUDA NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PaddlePaddle 22.08-py3 NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/paddlepaddle) or newer
+* [PaddlePaddle 22.12-py3 NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/paddlepaddle) or newer
* Supported GPUs:
* [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
@@ -204,7 +210,11 @@ DGX Documentation:
* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
For those unable to use the PaddlePaddle NGC container, to set up the required environment or create your own container, refer to the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html).
+
+For multi-node, the sample provided in this repository requires [Enroot](https://github.com/NVIDIA/enroot) and [Pyxis](https://github.com/NVIDIA/pyxis) set up on a [SLURM](https://slurm.schedmd.com) cluster.
+More information on how to set up and launch can be found in the [Multi-node Documentation](https://docs.nvidia.com/ngc/multi-node-bert-user-guide).
+
## Quick Start Guide
@@ -218,7 +228,10 @@ cd DeepLearningExamples/PaddlePaddle/LanguageModeling/BERT
```
2. Download the NVIDIA pre-trained checkpoint.
-Pre-trained checkpoints link is coming soon.
+If you want to use a pre-trained checkpoint, visit [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_large_paddle_ckpt_mode-pretrain/files). This pre-trained checkpoint is used to fine-tune on SQuAD. Ensure you unzip the downloaded file and place the checkpoint in the `checkpoints/` folder. For a checkpoint already fine-tuned for QA on SQuAD v1.1 visit [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dle/models/bert_large_paddle_ckpt_mode-qa_ds-squad11/files).
+
+
+
3. Build BERT on top of the NGC container.
```
@@ -235,36 +248,23 @@ By default:
- Paddle native logs are stored in the `log/` folder.
- DLLogger's outputs are stored in the `results/` folder.
-5. Download and preprocess the dataset.
+5. Download the dataset.
This repository provides scripts to download, verify, and extract the following datasets:
-- [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) (fine-tuning for question answering)
-- Wikipedia (pre-training)
-- BookCorpus (pre-training)
+- [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) (fine-tuning for question answering)
+- Wikipedia (pre-training)
+
-To download, verify, extract the datasets, and create the shards in `.hdf5` format, run:
+To download, verify, and extract the datasets, run:
```shell
bash data/create_datasets_from_start.sh
```
-Note: For fine tuning only, Wikipedia and Bookscorpus dataset download and preprocessing can be skipped by commenting it out.
+Note: For fine-tuning only, downloading the Wikipedia dataset can be skipped by commenting it out.
-- Download Wikipedia only for pretraining
-
-The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpus server, most of the time, gets overloaded and contains broken links resulting in HTTP 403 and 503 errors. Hence, it is recommended to skip downloading BookCorpus data by running:
-```shell
-bash data/create_datasets_from_start.sh wiki_only
-```
-
-- Download Wikipedia and BookCorpus
-
-Users are welcome to download BookCorpus from other sources to match our accuracy or repeatedly try our script until the required number of files are downloaded by running the following:
-```shell
-bash data/create_datasets_from_start.sh wiki_books
-```
-
-Note: Ensure a complete Wikipedia download. If, in any case, the download breaks, remove the output file `wikicorpus_en.xml.bz2` and start again. If a partially downloaded file exists, the script assumes a successful download, which causes the extraction to fail. Not using BookCorpus can potentially change the final accuracy on a few downstream tasks.
+Note: Ensure a complete Wikipedia download. If the LDDL download fails,
+remove the output directory `data/wikipedia/` and start over.
6. Start pre-training.
@@ -276,16 +276,18 @@ bash scripts/run_pretraining.sh
The default hyperparameters are set to run on 8x A100 80G cards.
+To run on multiple nodes, refer to the [Multi-node](#multi-node) section.
+
7. Start fine-tuning with the SQuAD dataset.
The above pre-trained BERT representations can be fine-tuned with just one additional output layer for a state-of-the-art question answering system. Running the following script launches fine-tuning for question answering with the SQuAD dataset.
```
-bash scripts/run_squad.sh
+bash scripts/run_squad.sh /workspace/bert/checkpoints/
```
8. Start validation/evaluation.
-For SQuAD, validation can be performed with the `bash scripts/run_squad.sh `, setting `mode` to `eval` in `scripts/run_squad.sh` as follows:
+For SQuAD, validation can be performed with the `bash scripts/run_squad.sh /workspace/bert/checkpoints/`, setting `mode` to `eval` in `scripts/run_squad.sh` as follows:
```
mode=${12:-"eval"}
@@ -293,7 +295,7 @@ mode=${12:-"eval"}
9. Start inference/predictions.
-Inference can be performed with the `bash scripts/run_squad.sh `, setting `mode` to `prediction` in `scripts/run_squad.sh` as follows:
+Inference can be performed with the `bash scripts/run_squad.sh /workspace/bert/checkpoints/`, setting `mode` to `prediction` in `scripts/run_squad.sh` as follows:
```
mode=${12:-"prediction"}
@@ -366,6 +368,8 @@ The complete list of the available parameters for the `run_pretraining.py` scrip
Global:
--input-dir INPUT_DIR
The input data directory. Should be specified by users and contain .hdf5 files for the task. (default: None)
+ --vocab-file VOCAB_FILE
+ Vocabulary mapping/file BERT was pretrained on. (default: None)
--output-dir OUTPUT_DIR
The output directory where the model checkpoints will be written. Should be specified by users. (default: None)
--bert-model {bert-base-uncased,bert-base-cased,bert-large-uncased,bert-large-cased,custom}
@@ -433,6 +437,7 @@ Advanced Training:
--use-dynamic-loss-scaling
Enable dynamic loss scaling in AMP training, only applied when --amp is set. (default: False)
--use-pure-fp16 Enable pure FP16 training, only applied when --amp is set. (default: False)
+ --fuse-mha Enable multihead attention fusion. Requires cuDNN version >= 8.9.1.
```
@@ -459,6 +464,7 @@ Default arguments are listed below in the order `scripts/run_squad.sh` expects:
- Enable benchmark - The default is `false`.
- Benchmark steps - The default is `100`.
- Benchmark warmup steps - The default is `100`.
+- Enable MHA fusion - The default is `true`.
The script saves the final checkpoint to the `/results/bert-large-uncased/squad` folder.
@@ -466,6 +472,24 @@ Note:
- For SQuAD fine-tuning, `<--max-steps>` is not required since it's usually trained for two or three epochs. If `<--max-steps>` is not set or set to -1, it will be trained for `<--epochs>` epochs. If `<--max-steps>` is set to a positive number, the total training steps is calculated by: `total_steps = min(max_steps, epochs * steps_per_epoch)`.
- For pre-training, `<--max-steps>` is required and `<--epochs>` is deprecated, because we typically train for a specified number of steps rather than epochs.
+#### Multi-node
+Multi-node runs can be launched on a pyxis/enroot Slurm cluster (refer to [Requirements](#requirements)) with the `run.sub` script with the following command for a 4-node DGX-A100 example for both phase 1 and phase 2:
+
+```
+TRAIN_BATCH_SIZE=256 GRADIENT_ACCUMULATION_STEPS=8 PHASE=1 sbatch -N4 run.sub
+TRAIN_BATCH_SIZE=32 GRADIENT_ACCUMULATION_STEPS=32 PHASE=2 sbatch -N4 run.sub
+```
+
+Checkpoints after phase 1 will be saved in `checkpointdir` specified in `run.sub`. The checkpoint will be automatically picked up to resume training on phase 2. Note that phase 2 should be run after phase 1.
+
+
+The environment variables `TRAIN_BATCH_SIZE`, `GRADIENT_ACCUMULATION_STEPS`, and `PHASE` refer to the Python arguments `--batch-size`, `--gradient-merge-steps`, and `--phase1/--phase2`, respectively.
+
+Note that the `run.sub` script is a starting point that has to be adapted depending on the environment. In particular, variables such as `datadir` handle the location of the files for each phase.
+
+Refer to the file’s contents to find the full list of variables to adjust for your system.
+
+
### Command-line options
To view the full list of available options and their descriptions, use the `-h` or `--help` command-line option, for example:
@@ -477,27 +501,25 @@ To view the full list of available options and their descriptions, use the `-h`
Detailed descriptions of command-line options can be found in the [Parameters](#parameters) section.
### Getting the data
-For pre-training BERT, we use the concatenation of Wikipedia (2500M words) and BookCorpus (800M words). For Wikipedia, we extract only the text passages and ignore headers, lists, and tables. BERT requires that datasets are structured as a document-level corpus rather than a shuffled sentence-level corpus because it is critical to extract long contiguous sentences.
-
-The preparation of the pre-training dataset is described in the `bertPrep.py` script found in the `data/` folder. The component steps in the automated scripts to prepare the datasets are as follows:
-
-1. Data download and extract - the dataset is downloaded and extracted.
-
-2. Clean and format - document tags, and so on. are removed from the dataset.
-
-3. Sentence segmentation - the corpus text file is processed into separate sentences.
-
-4. Sharding - the sentence segmented corpus file is split into a number of uniformly distributed smaller text documents.
-
-5. `hdf5` file creation - each text file shard is processed by the `create_pretraining_data.py` script to produce a corresponding `hdf5` file. The script generates input data and labels for masked language modeling and sentence prediction tasks for the input text shard.
-
-The tools used for preparing the BookCorpus and Wikipedia datasets can be applied to prepare an arbitrary corpus. The `create_datasets_from_start.sh` script in the `data/` directory applies sentence segmentation, sharding, and `hdf5` file creation given an arbitrary text file containing a document-separated text corpus.
-
-For fine-tuning a pre-trained BERT model for specific tasks, by default this repository prepares the following dataset:
+
+For pre-training BERT, we use the Wikipedia (2500M words) dataset. We extract
+only the text passages and ignore headers, lists, and tables. BERT requires that
+datasets are structured as a document-level corpus rather than a shuffled
+sentence-level corpus because it is critical to extract long contiguous
+sentences. `data/create_datasets_from_start.sh` uses the LDDL downloader to
+download the Wikipedia dataset, and `scripts/run_pretraining.sh` uses the LDDL
+preprocessor and load balancer to preprocess the Wikipedia dataset into Parquet
+shards which are then streamed during the pre-training by the LDDL data loader.
+Refer to [LDDL's README](https://github.com/NVIDIA/LDDL/blob/main/README.md) for more
+information on how to use LDDL. Depending on the speed of your internet
+connection, downloading and extracting the Wikipedia dataset takes a few hours,
+and running the LDDL preprocessor and load balancer takes half an hour on a
+single DGX A100 node.
+
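As a quick sanity check after preprocessing, the hedged sketch below opens one of the Parquet shards with pyarrow. The shard directory is an assumption (point it at the path that `scripts/run_pretraining.sh` hands to the LDDL load balancer on your system), and the exact schema depends on the LDDL version.

```python
# Hedged sketch: inspect the Parquet shards written by the LDDL preprocessor.
# The path below is an assumption; replace it with the shard directory used
# on your system. Requires pyarrow, which ships in the NGC container.
import glob
import pyarrow.parquet as pq

shard_dir = '/workspace/bert/data/pretrain/phase1'   # hypothetical location
shards = sorted(glob.glob(f'{shard_dir}/*.parquet'))
print(f'found {len(shards)} shards')
if shards:
    table = pq.read_table(shards[0])
    print(table.schema)                 # columns produced by the preprocessor
    print(table.num_rows, 'samples in the first shard')
```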
+For fine-tuning a pre-trained BERT model for specific tasks, by default, this repository prepares the following dataset:
- [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/): for question answering
-
-Depending on the speed of your internet connection, this process takes about a day to complete. The BookCorpus server could sometimes get overloaded and also contain broken links resulting in HTTP 403 and 503 errors. You can either skip the missing files or retry downloading at a later time.
+
#### Dataset guidelines
@@ -511,8 +533,6 @@ BERT pre-training optimizes for two unsupervised classification tasks. The first
The second task is next sentence prediction. One training instance of BERT pre-training is two sentences (a sentence pair). A sentence pair may be constructed by simply taking two adjacent sentences from a single document or by pairing up two random sentences with equal probability. The goal of this task is to predict whether or not the second sentence followed the first in the original document.
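To make the sentence-pair construction concrete, here is a small illustrative Python sketch. It mirrors the description above rather than the actual LDDL implementation, and the helper name is hypothetical.

```python
# Illustrative sketch of building one next-sentence-prediction instance.
# This mirrors the description above; it is not the LDDL implementation.
import random

def make_nsp_instance(document, all_documents):
    """document: list of sentences; all_documents: list of such lists."""
    i = random.randrange(len(document) - 1)
    sent_a = document[i]
    if random.random() < 0.5:
        sent_b = document[i + 1]        # adjacent sentence -> "is next"
        is_next = 1
    else:
        # A production implementation would also avoid sampling the same document.
        other = random.choice(all_documents)
        sent_b = random.choice(other)   # random sentence -> "not next"
        is_next = 0
    return sent_a, sent_b, is_next
```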
-The `create_pretraining_data.py` script takes in raw text and creates training instances for both pre-training tasks.
-
### Training process
@@ -522,7 +542,7 @@ The training process consists of two steps: pre-training and fine-tuning.
Pre-training is performed using the `run_pretraining.py` script along with parameters defined in the `scripts/run_pretraining.sh`.
-The `run_pretraining.sh` script runs a job on a single node that trains the BERT-large model from scratch using Wikipedia and BookCorpus datasets as training data using the LAMB optimizer. By default, the training script runs two phases of training with a hyperparameter recipe specific to 8x A100 80G cards:
+The `run_pretraining.sh` script runs a job on a single node that trains the BERT-large model from scratch, using the Wikipedia dataset as training data with the LAMB optimizer. By default, the training script runs two phases of training with a hyperparameter recipe specific to 8x A100 80G cards:
Phase 1: (Maximum sequence length of 128)
- Runs on 8 GPUs with a training batch size of 256 per GPU.
@@ -565,10 +585,18 @@ bash run_pretraining.sh \
\
\
\
+ \
+ \
+ \
+ \
+ \
+ \
+ \
\
\
\
-
+ \
+
```
Where:
@@ -593,8 +621,16 @@ Where:
- `` is the root path to bert code.
- `` is the path to the checkpoint to start the pretraining routine on (Usually a BERT pre-trained checkpoint).
+- `wikipedia_source` is the path to the 'source' subdirectory for the Wikipedia corpus.
+- `num_dask_workers` is the number of Dask workers used to preprocess the BERT dataset.
+- `num_shards_per_workers` is the number of output Parquet/txt shards per worker.
+- `num_workers` is the number of workers for data loading.
+- `sample_ratio` is the fraction of articles/documents sampled from each corpus.
+- `phase2_bin_size` is the bin size (the stride of the sequence length for each bin) for phase 2.
+- `masking` selects static or dynamic masking. Refer to [LDDL's README](https://github.com/NVIDIA/LDDL/blob/main/README.md) for more information.
- `` is the path to the bert config file.
- `` a flag to enable benchmark. The train process will warmup for `` and then measure the throughput of the following ``.
+- `` a flag to enable cuDNN MHA fusion.
Note that:
- If users follow [Quick Start Guide](#quick-start-guide) to set up container and dataset, there is no need to set any parameters. For example:
@@ -609,7 +645,10 @@ bash scripts/run_pretraining.sh \
/path/to/dataset/phase1 \
/path/to/dataset/phase2 \
/workspace/bert \
- None None false
+ None \
+ /path/to/wikipedia/source \
+ 32 128 4 0.9 64 static \
+ None false
```
To run the pre-training routine on an initial checkpoint, point the `from-checkpoint` variable to the location of the checkpoint folder in `scripts/run_pretraining.sh`.
@@ -622,6 +661,7 @@ python3 -m paddle.distributed.launch \
--gpus="0,1,2,3,4,5,6,7" \
./run_pretraining.py \
--input-dir=/path/to/dataset/phase1 \
+ --vocab-file=vocab/bert-large-uncased-vocab.txt \
--output-dir=./results \
--bert-model=bert-large-uncased \
--from-checkpoint=./results/bert-large-uncased/phase1 \
@@ -634,6 +674,7 @@ python3 -m paddle.distributed.launch \
--max-predictions-per-seq=20 \
--gradient-merge-steps=32 \
--amp \
+ --fuse-mha \
--use-dynamic-loss-scaling \
--optimizer=Lamb \
--phase1 \
@@ -733,7 +774,8 @@ bash scripts/run_squad.sh \
\
\
\
-
+ \
+
```
By default, the `mode` argument is set to `train eval`. Refer to the [Quick Start Guide](#quick-start-guide) for explanations of each positional argument.
@@ -773,7 +815,10 @@ bash scripts/run_pretraining.sh \
/path/to/dataset/phase1 \
/path/to/dataset/phase2 \
/workspace/bert \
- None None true 10 10
+ None \
+ /path/to/wikipedia/source \
+ 32 128 4 0.9 64 static \
+ None true 10 10 true
```
To benchmark the training performance on a specific batch size for SQuAD, refer to [Fine-tuning](#fine-tuning) and turn on the `` flags. An example call to run training for 200 steps (100 steps for warmup and 100 steps to measure), and generate throughput numbers:
@@ -786,7 +831,7 @@ bash scripts/run_squad.sh \
results/checkpoints \
train \
bert_configs/bert-large-uncased.json \
- -1 true 100 100
+ -1 true 100 100 true
```
#### Inference performance benchmark
@@ -802,7 +847,8 @@ bash scripts/run_squad.sh \
\
eval \
\
-
+ \
+
```
An example call to run inference and generate throughput numbers:
@@ -815,7 +861,7 @@ bash scripts/run_squad.sh \
results/checkpoints \
eval \
bert_configs/bert-large-uncased.json \
- -1 true 100 100
+ -1 true 100 100 true
```
@@ -831,8 +877,8 @@ Our results were obtained by running the `scripts/run_squad.sh` and `scripts/run
| DGX System | GPUs / Node | Precision | Accumulated Batch size / GPU (Phase 1 and Phase 2) | Accumulation steps (Phase 1 and Phase 2) | Final Loss | Time to train(hours) | Time to train speedup (TF32 to mixed precision) |
|--------------------|-------------|-----------|----------------------------------------------------|------------------------------------------|-------------------|----------------------|-------------------------------------------------|
-| 1 x DGX A100 80GB | 8 | AMP | 256 and 32 | 32 and 128 | 1.409 | ~ 50 hours | 1.72 |
-| 1 x DGX A100 80GB | 8 | TF32 | 128 and 16 | 64 and 256 | 1.421 | ~ 86 hours | 1 |
+| 32 x DGX A100 80GB | 8 | AMP | 256 and 128 | 1 and 4 | 1.409 | ~ 1.1 hours | 2.27 |
+| 32 x DGX A100 80GB | 8 | TF32 | 128 and 16 | 2 and 8 | 1.421 | ~ 2.5 hours | 1 |
##### Pre-training loss curves
@@ -869,16 +915,34 @@ Training stability with 8 GPUs, FP16 computations, batch size of 32:
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
-Our results were obtained by running the script `run_pretraining.sh` in the PaddlePaddle:22.08-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in sequences per second) were averaged over a few training iterations.
+Our results were obtained by running the script `run_pretraining.sh` in the PaddlePaddle:22.12-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in sequences per second) were averaged over a few training iterations.
###### Pre-training NVIDIA DGX A100 (8x A100 80GB)
| GPUs | Batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Throughput - TF32(sequences/sec) | Throughput - mixed precision(sequences/sec) | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|------|----------------------------------|------------------------------------|-----------------|----------------------------------|---------------------------------------------|---------------------------------------------|---------------------|--------------------------------|
-| 1 | 8192 and 8192 | 64 and 32 | 128 | 304 | 529 | 1.74 | 1.00 | 1.00 |
-| 8 | 8192 and 8192 | 64 and 32 | 128 | 2410 | 4200 | 1.74 | 7.93 | 7.94 |
-| 1 | 4096 and 4096 | 256 and 128 | 512 | 59 | 103 | 1.75 | 1.00 | 1.00 |
-| 8 | 4096 and 4096 | 256 and 128 | 512 | 469 | 823 | 1.75 | 7.95 | 7.99 |
+| 1 | 8192 and 8192 | 64 and 32 | 128 | 307 | 694 | 2.26 | 1.00 | 1.00 |
+| 8 | 8192 and 8192 | 64 and 32 | 128 | 2428 | 5541 | 2.28 | 7.91 | 7.98 |
+| 1 | 4096 and 4096 | 256 and 128 | 512 | 107 | 264 | 2.47 | 1.00 | 1.00 |
+| 8 | 4096 and 4096 | 256 and 128 | 512 | 851 | 2109 | 2.48 | 7.95 | 7.99 |
+
+
+###### Pre-training NVIDIA DGX A100 (8x A100 80GB) Multi-node Scaling
+
+| Nodes | GPUs / node | Batch size / GPU (TF32 and FP16) | Accumulated Batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Mixed Precision Throughput | Mixed Precision Strong Scaling | TF32 Throughput | TF32 Strong Scaling | Speedup (Mixed Precision to TF32) |
+|-------|-------------|----------------------------------|------------------------------------|-----------------|----------------------------|--------------------------------|-----------------|---------------------|-----------------------------------|-----|
+| 1 | 8 | 128 and 256 | 8192 and 8192 | 64 and 32 | 128 | 5541 | 1 | 2428 | 1 | 2.28 |
+| 2 | 8 | 128 and 256 | 4096 and 4096 | 32 and 16 | 128 | 10646 | 1.92 | 4638 | 1.91 | 2.29 |
+| 4 | 8 | 128 and 256 | 2048 and 2048 | 16 and 8 | 128 | 21389 | 3.86 | 9445 | 3.89 | 2.26 |
+| 8 | 8 | 128 and 256 | 1024 and 1024 | 8 and 4 | 128 | 41681 | 7.52 | 18335 | 7.55 | 2.27 |
+| 16 | 8 | 128 and 256 | 512 and 512 | 4 and 2 | 128 | 79023 | 14.26 | 35526 | 14.63 | 2.22 |
+| 32 | 8 | 128 and 256 | 256 and 256 | 2 and 1 | 128 | 157952 | 28.51 | 69701 | 28.71 | 2.27 |
+| 1 | 8 | 16 and 32 | 4096 and 4096 | 256 and 128 | 512 | 2109 | 1 | 851 | 1 | 2.48 |
+| 2 | 8 | 16 and 32 | 2048 and 2048 | 128 and 64 | 512 | 4051 | 1.92 | 1601 | 1.88 | 2.53 |
+| 4 | 8 | 16 and 32 | 1024 and 1024 | 64 and 32 | 512 | 7972 | 3.78 | 3240 | 3.81 | 2.46 |
+| 8 | 8 | 16 and 32 | 512 and 512 | 32 and 16 | 512 | 15760 | 7.47 | 6329 | 7.44 | 2.49 |
+| 16 | 8 | 16 and 32 | 256 and 256 | 16 and 8 | 512 | 31129 | 14.76 | 12273 | 14.42 | 2.54 |
+| 32 | 8 | 16 and 32 | 128 and 128 | 8 and 4 | 512 | 60206 | 28.55 | 24047 | 28.26 | 2.50 |
###### Fine-tuning NVIDIA DGX A100 (8x A100 80GB)
@@ -887,8 +951,8 @@ Our results were obtained by running the script `run_pretraining.sh` in the Padd
| GPUs | Batch size / GPU (TF32 and FP16) | Throughput - TF32(sequences/sec) | Throughput - mixed precision(sequences/sec) | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|------|----------------------------------|----------------------------------|---------------------------------------------|---------------------------------------------|---------------------|--------------------------------|
-| 1 | 32 and 32 | 83 | 120 | 1.45 | 1.00 | 1.00 |
-| 8 | 32 and 32 | 629 | 876 | 1.39 | 7.59 | 7.30 |
+| 1 | 32 and 32 | 83 | 123 | 1.48 | 1.00 | 1.00 |
+| 8 | 32 and 32 | 629 | 929 | 1.48 | 7.59 | 7.55 |
#### Inference performance results
@@ -912,7 +976,11 @@ The inference performance metrics used were items/second.
## Release notes
### Changelog
-
+
August 2022
- Pre-training support with LAMB optimizer.
- Updated Data download and Preprocessing.
@@ -922,6 +990,13 @@ August 2022
- SQuAD finetune support with AdamW optimizer.
- Updated accuracy and performance tables tested on A100.
- Initial release.
+
+March 2023
+- Pre-training using [Language Datasets and Data Loaders (LDDL)](https://github.com/NVIDIA/LDDL)
+- Binned pretraining for phase2 with LDDL using a bin size of 64
+
+July 2023
+- Optimize AMP training with cuDNN fused dot product attention kernel.
### Known issues
diff --git a/PaddlePaddle/LanguageModeling/BERT/data/create_datasets_from_start.sh b/PaddlePaddle/LanguageModeling/BERT/data/create_datasets_from_start.sh
index 72557ed7e..5caff41e2 100644
--- a/PaddlePaddle/LanguageModeling/BERT/data/create_datasets_from_start.sh
+++ b/PaddlePaddle/LanguageModeling/BERT/data/create_datasets_from_start.sh
@@ -13,36 +13,5 @@
# limitations under the License.
#Download
-to_download=${1:-"wiki_only"}
-
-#Download
-if [ "$to_download" = "wiki_books" ] ; then
- python3 /workspace/bert/data/bertPrep.py --action download --dataset bookscorpus
-fi
-
-python3 /workspace/bert/data/bertPrep.py --action download --dataset wikicorpus_en
+download_wikipedia --outdir ${BERT_PREP_WORKING_DIR}/wikipedia/
python3 /workspace/bert/data/bertPrep.py --action download --dataset squad
-
-# Properly format the text files
-if [ "$to_download" = "wiki_books" ] ; then
- python3 /workspace/bert/data/bertPrep.py --action text_formatting --dataset bookscorpus
-fi
-python3 /workspace/bert/data/bertPrep.py --action text_formatting --dataset wikicorpus_en
-
-if [ "$to_download" = "wiki_books" ] ; then
- DATASET="books_wiki_en_corpus"
-else
- DATASET="wikicorpus_en"
- # Shard the text files
-fi
-
-# Shard the text files
-python3 /workspace/bert/data/bertPrep.py --action sharding --dataset $DATASET
-
-# Create HDF5 files Phase 1
-python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset $DATASET --max_seq_length 128 \
---max_predictions_per_seq 20 --vocab_file /workspace/bert/vocab/bert-large-uncased-vocab.txt --do_lower_case 1
-
-# Create HDF5 files Phase 2
-python3 /workspace/bert/data/bertPrep.py --action create_hdf5_files --dataset $DATASET --max_seq_length 512 \
---max_predictions_per_seq 80 --vocab_file /workspace/bert/vocab/bert-large-uncased-vocab.txt --do_lower_case 1
diff --git a/PaddlePaddle/LanguageModeling/BERT/loss.py b/PaddlePaddle/LanguageModeling/BERT/loss.py
index 6d8a6c529..73594ccdf 100644
--- a/PaddlePaddle/LanguageModeling/BERT/loss.py
+++ b/PaddlePaddle/LanguageModeling/BERT/loss.py
@@ -13,7 +13,6 @@
# limitations under the License.
import paddle
-import paddle.nn.functional as F
class CrossEntropyLossForSQuAD(paddle.nn.Layer):
@@ -53,7 +52,7 @@ def __init__(self, vocab_size):
self.vocab_size = vocab_size
def forward(self, prediction_scores, seq_relationship_score,
- masked_lm_labels, next_sentence_labels, masked_lm_scale):
+ masked_lm_labels, next_sentence_labels):
"""
Args:
prediction_scores(Tensor):
@@ -80,12 +79,11 @@ def forward(self, prediction_scores, seq_relationship_score,
Its data type should be float32 and its shape is [1].
"""
with paddle.static.amp.fp16_guard():
- masked_lm_loss = F.cross_entropy(
- prediction_scores,
- masked_lm_labels,
- reduction='none',
- ignore_index=-1)
- masked_lm_loss = masked_lm_loss / masked_lm_scale
- next_sentence_loss = F.cross_entropy(
- seq_relationship_score, next_sentence_labels, reduction='none')
- return paddle.sum(masked_lm_loss) + paddle.mean(next_sentence_loss)
+ masked_lm_labels_flat = masked_lm_labels.reshape([-1])
+ mlm_labels = masked_lm_labels_flat[masked_lm_labels_flat != -1]
+ masked_lm_loss = self.loss_fn(prediction_scores, mlm_labels)
+ if next_sentence_labels.ndim == 1:
+ next_sentence_labels = next_sentence_labels.unsqueeze(axis=-1)
+ next_sentence_loss = self.loss_fn(seq_relationship_score,
+ next_sentence_labels)
+ return masked_lm_loss + next_sentence_loss
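The rewritten criterion above drops the explicit `masked_lm_scale` divisor: once positions whose label equals the ignore index (-1) are filtered out, a mean-reduced cross entropy over the remaining labels performs the same normalization. A tiny NumPy illustration of that filtering, with made-up label values:

```python
# Toy illustration of the label filtering used in the criterion above.
# Labels equal to -1 mark unmasked positions and are excluded, so averaging
# the per-token losses over the kept positions replaces the old
# masked_lm_scale bookkeeping. The label values here are made up.
import numpy as np

masked_lm_labels = np.array([[-1, 5, -1, 7],
                             [ 2, -1, -1, 9]])
flat = masked_lm_labels.reshape(-1)
kept = flat[flat != -1]
print(kept)          # [5 7 2 9], the only targets that contribute to the loss
print(len(kept))     # the effective normalizer for a mean-reduced cross entropy
```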
diff --git a/PaddlePaddle/LanguageModeling/BERT/modeling.py b/PaddlePaddle/LanguageModeling/BERT/modeling.py
index a8650694e..423542d63 100644
--- a/PaddlePaddle/LanguageModeling/BERT/modeling.py
+++ b/PaddlePaddle/LanguageModeling/BERT/modeling.py
@@ -89,17 +89,15 @@ def __init__(self, bert_config):
self.layer_norm = nn.LayerNorm(bert_config.hidden_size, epsilon=1e-12)
self.dropout = nn.Dropout(bert_config.hidden_dropout_prob)
- def forward(self, input_ids, token_type_ids=None, position_ids=None):
+ def forward(self, input_ids, token_type_ids=None):
"""
Args:
See class `BertModel`.
"""
- if position_ids is None:
- ones = paddle.ones_like(input_ids, dtype="int64")
- seq_length = paddle.cumsum(ones, axis=-1)
-
- position_ids = seq_length - ones
- position_ids.stop_gradient = True
+ ones = paddle.ones_like(input_ids, dtype="int64")
+ seq_length = paddle.cumsum(ones, axis=-1)
+ position_ids = seq_length - ones
+ position_ids.stop_gradient = True
if token_type_ids is None:
token_type_ids = paddle.zeros_like(input_ids, dtype="int64")
@@ -175,17 +173,13 @@ def __init__(self, bert_config):
activation=bert_config.hidden_act,
attn_dropout=bert_config.attention_probs_dropout_prob,
act_dropout=0,
- enable_cudnn=False)
+ fuse_qkv=bert_config.fuse_mha)
self.encoder = nn.TransformerEncoder(encoder_layer,
bert_config.num_hidden_layers)
self.pooler = BertPooler(bert_config.hidden_size)
- def forward(self,
- input_ids,
- token_type_ids=None,
- position_ids=None,
- attention_mask=None):
+ def forward(self, input_ids, token_type_ids=None, attention_mask=None):
"""
Args:
input_ids(Tensor):
@@ -198,11 +192,6 @@ def forward(self,
to a `sentence A` and type 1 corresponds to a `sentence B` token.
(see BERT paper for more details). Its data type should be `int64`
Defaults: None, which means we don't add segment embeddings.
- position_ids(Tensor, optional):
- An optional Tensor of shape [batch_size, num_tokens] with the position
- indices of each input sequence tokens in the position embeddings.
- Selected in the range [0, max_position_embeddings - 1].
- Its data type should be `int64`. Defaults: None.
attention_mask(Tensor, optional):
An optional Tensor of shape [batch_size, sequence_length] with indices of
mask used in multi-head attention to avoid performing attention on to some
@@ -234,9 +223,7 @@ def forward(self,
attention_mask = attention_mask.unsqueeze(axis=[1, 2])
embedding_output = self.embeddings(
- input_ids=input_ids,
- position_ids=position_ids,
- token_type_ids=token_type_ids)
+ input_ids=input_ids, token_type_ids=token_type_ids)
if self.fuse:
encoder_output = embedding_output
@@ -263,11 +250,7 @@ def __init__(self, bert_config):
self.bert = BertModel(bert_config)
self.classifier = nn.Linear(bert_config.hidden_size, 2)
- def forward(self,
- input_ids,
- token_type_ids=None,
- position_ids=None,
- attention_mask=None):
+ def forward(self, input_ids, token_type_ids=None, attention_mask=None):
"""
Args:
See class `BertModel`.
@@ -282,7 +265,6 @@ def forward(self,
encoder_output, _ = self.bert(
input_ids,
token_type_ids=token_type_ids,
- position_ids=position_ids,
attention_mask=attention_mask)
logits = self.classifier(encoder_output)
@@ -322,13 +304,7 @@ def __init__(self,
self.decoder_bias = self.create_parameter(
shape=[vocab_size], dtype=self.decoder_weight.dtype, is_bias=True)
- def forward(self, hidden_states, masked_positions=None):
- if masked_positions is not None:
- hidden_states = paddle.reshape(hidden_states,
- [-1, hidden_states.shape[-1]])
- hidden_states = paddle.tensor.gather(hidden_states,
- masked_positions)
- # gather masked tokens might be more quick
+ def forward(self, hidden_states):
hidden_states = self.transform(hidden_states)
hidden_states = self.activation(hidden_states)
hidden_states = self.layer_norm(hidden_states)
@@ -362,7 +338,7 @@ def __init__(self,
activation, embedding_weights)
self.seq_relationship = nn.Linear(hidden_size, 2)
- def forward(self, encoder_output, pooled_output, masked_positions=None):
+ def forward(self, encoder_output, pooled_output, masked_lm_labels):
"""
Args:
sequence_output(Tensor):
@@ -384,7 +360,12 @@ def forward(self, encoder_output, pooled_output, masked_positions=None):
A Tensor of shape [batch_size, 2] with the scores of next sentence prediction.
Its data type should be float32.
"""
- prediction_scores = self.predictions(encoder_output, masked_positions)
+
+ sequence_flattened = paddle.index_select(
+ encoder_output.reshape([-1, encoder_output.shape[-1]]),
+ paddle.nonzero(masked_lm_labels.reshape([-1]) != -1).squeeze(),
+ axis=0)
+ prediction_scores = self.predictions(sequence_flattened)
seq_relationship_score = self.seq_relationship(pooled_output)
return prediction_scores, seq_relationship_score
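The `paddle.index_select` call above gathers only the hidden states at masked positions before the vocabulary projection, so the large `[hidden, vocab]` matmul is not computed for every token in the batch. A minimal NumPy sketch of the same idea, with illustrative shapes:

```python
# NumPy sketch of gathering masked positions before the vocab projection.
# Shapes are illustrative; the point is that logits are computed only for
# rows whose label is not the ignore index (-1).
import numpy as np

batch, seq, hidden, vocab = 4, 6, 8, 16
encoder_output = np.random.randn(batch, seq, hidden)
masked_lm_labels = np.random.randint(-1, vocab, size=(batch, seq))
proj = np.random.randn(hidden, vocab)

flat = encoder_output.reshape(-1, hidden)
masked_rows = np.nonzero(masked_lm_labels.reshape(-1) != -1)[0]
logits = flat[masked_rows] @ proj        # scores only for the masked tokens
print(logits.shape)                      # (num_masked_tokens, vocab)
```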
@@ -406,18 +387,13 @@ def __init__(self, bert_config):
bert_config.hidden_act,
embedding_weights=self.bert.embeddings.word_embeddings.weight)
- def forward(self,
- input_ids,
- token_type_ids=None,
- position_ids=None,
- attention_mask=None,
- masked_positions=None):
+ def forward(self, input_ids, token_type_ids, attention_mask,
+ masked_lm_labels):
"""
Args:
input_ids(Tensor): See class `BertModel`.
token_type_ids(Tensor, optional): See class `BertModel`.
- position_ids(Tensor, optional): See class `BertModel`.
attention_mask(Tensor, optional): See class `BertModel`.
- masked_positions(Tensor, optional): See class `BertPretrainingHeads`.
+ masked_lm_labels(Tensor): See class `BertPretrainingHeads`.
@@ -434,9 +410,8 @@ def forward(self,
outputs = self.bert(
input_ids,
token_type_ids=token_type_ids,
- position_ids=position_ids,
attention_mask=attention_mask)
sequence_output, pooled_output = outputs[:2]
prediction_scores, seq_relationship_score = self.cls(
- sequence_output, pooled_output, masked_positions)
+ sequence_output, pooled_output, masked_lm_labels)
return prediction_scores, seq_relationship_score
diff --git a/PaddlePaddle/LanguageModeling/BERT/pretraining_dataset.py b/PaddlePaddle/LanguageModeling/BERT/pretraining_dataset.py
deleted file mode 100644
index 66e69b66a..000000000
--- a/PaddlePaddle/LanguageModeling/BERT/pretraining_dataset.py
+++ /dev/null
@@ -1,169 +0,0 @@
-# Copyright (c) 2022 NVIDIA Corporation. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import random
-import h5py
-import numpy as np
-import paddle
-from paddle.io import DataLoader, Dataset
-from utils.collate import Stack
-
-
-def create_pretraining_dataset(args,
- input_file,
- data_holders,
- worker_init=None,
- places=None):
- train_data = PretrainingDataset(
- input_file=input_file, max_pred_length=args.max_predictions_per_seq)
- train_batch_sampler = paddle.io.BatchSampler(
- train_data, batch_size=args.batch_size, shuffle=True)
-
- def _collate_data(data, stack_fn=Stack()):
- num_fields = len(data[0])
- out = [None] * num_fields
- [
- input_ids, segment_ids, input_mask, masked_lm_positions,
- masked_lm_labels, next_sentence_labels, masked_lm_scale
- ] = [0, 1, 2, 3, 4, 5, 6]
- for i in (input_ids, segment_ids, input_mask, next_sentence_labels):
- out[i] = stack_fn([x[i] for x in data])
- _, seq_length = out[input_ids].shape
- size = sum(len(x[masked_lm_positions]) for x in data)
- if size % 8 != 0:
- size += 8 - (size % 8)
- out[masked_lm_positions] = np.full(size, 0, dtype=np.int32)
- out[masked_lm_labels] = np.full([size, 1], -1, dtype=np.int64)
- mask_token_num = 0
- for i, x in enumerate(data):
- for j, pos in enumerate(x[masked_lm_positions]):
- out[masked_lm_positions][mask_token_num] = i * seq_length + pos
- out[masked_lm_labels][mask_token_num] = x[masked_lm_labels][j]
- mask_token_num += 1
- # The value of masked_lm_scale is equal to mask_token_num,
- # which would be used to compute average masked_lm_loss.
- out.append(np.asarray([mask_token_num], dtype=np.float32))
- if args.amp and args.use_pure_fp16:
- #out[input_mask] = out[input_mask].astype(np.float16)
- out[masked_lm_scale] = out[masked_lm_scale].astype(np.float16)
- return out
-
- train_data_loader = DataLoader(
- dataset=train_data,
- places=places,
- feed_list=data_holders,
- batch_sampler=train_batch_sampler,
- collate_fn=_collate_data,
- num_workers=0,
- worker_init_fn=worker_init,
- return_list=False)
-
- return train_data_loader
-
-
-def create_pretraining_data_holder():
- input_ids = paddle.static.data(
- name="input_ids", shape=[-1, -1], dtype="int64")
- segment_ids = paddle.static.data(
- name="segment_ids", shape=[-1, -1], dtype="int64")
- input_mask = paddle.static.data(
- name="input_mask", shape=[-1, 1, 1, -1], dtype="int64")
- masked_lm_positions = paddle.static.data(
- name="masked_lm_positions", shape=[-1], dtype="int32")
- masked_lm_labels = paddle.static.data(
- name="masked_lm_labels", shape=[-1, 1], dtype="int64")
- next_sentence_labels = paddle.static.data(
- name="next_sentence_labels", shape=[-1, 1], dtype="int64")
- masked_lm_scale = paddle.static.data(
- name="masked_lm_scale", shape=[-1, 1], dtype="float32")
- return [
- input_ids, segment_ids, input_mask, masked_lm_positions,
- masked_lm_labels, next_sentence_labels, masked_lm_scale
- ]
-
-
-def select_dataset_file_for_each_worker(files, f_start_id, num_trainers,
- trainer_id):
- """
- Spliting the train file according to the worker index.
- """
- num_files = len(files)
- if num_trainers > num_files:
- remainder = num_trainers % num_files
- data_file = files[(
- f_start_id * num_trainers + trainer_id + remainder * f_start_id) %
- num_files]
- else:
- data_file = files[(f_start_id * num_trainers + trainer_id) % num_files]
- return data_file
-
-
-class WorkerInitObj:
- "Construct the object with different seed, and the Dataloader will generate the data "
- "with different seed in each worker."
-
- def __init__(self, seed):
- self.seed = seed
-
- def __call__(self, pid):
- np.random.seed(seed=self.seed + pid)
- random.seed(self.seed + pid)
-
-
-class PretrainingDataset(Dataset):
- def __init__(self, input_file, max_pred_length):
- self.input_file = input_file
- self.max_pred_length = max_pred_length
- f = h5py.File(input_file, "r")
- keys = [
- 'input_ids', 'input_mask', 'segment_ids', 'masked_lm_positions',
- 'masked_lm_ids', 'next_sentence_labels'
- ]
- self.inputs = [np.asarray(f[key][:]) for key in keys]
- f.close()
-
- def __len__(self):
- 'Denotes the total number of samples'
- return len(self.inputs[0])
-
- def __getitem__(self, index):
- # convert next_sentence_labels (index=5) to np.ndarray type
- [
- input_ids, input_mask, segment_ids, masked_lm_positions,
- masked_lm_ids, next_sentence_labels
- ] = [
- input[index].astype(np.int64)
- if indice < 5 else np.asarray(input[index].astype(np.int64))
- for indice, input in enumerate(self.inputs)
- ]
- # input_mask = (1 - np.reshape(
- # input_mask.astype(np.float32), [1, 1, input_mask.shape[0]])) * -1e4
- input_mask = np.reshape(input_mask, [1, 1, input_mask.shape[0]])
-
- index = self.max_pred_length
- padded_mask_indices = (masked_lm_positions == 0).nonzero()[0]
- if len(padded_mask_indices) != 0:
- index = padded_mask_indices[0].item()
- else:
- index = self.max_pred_length
- masked_lm_labels = masked_lm_ids[:index]
- masked_lm_positions = masked_lm_positions[:index]
- # softmax_with_cross_entropy enforce last dim size equal 1
- masked_lm_labels = np.expand_dims(masked_lm_labels, axis=-1)
- next_sentence_labels = np.expand_dims(next_sentence_labels, axis=-1)
-
- return [
- input_ids, segment_ids, input_mask, masked_lm_positions,
- masked_lm_labels, next_sentence_labels
- ]
diff --git a/PaddlePaddle/LanguageModeling/BERT/program.py b/PaddlePaddle/LanguageModeling/BERT/program.py
index 4c1d17afa..2e46d01e3 100644
--- a/PaddlePaddle/LanguageModeling/BERT/program.py
+++ b/PaddlePaddle/LanguageModeling/BERT/program.py
@@ -12,29 +12,44 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-from concurrent.futures import ThreadPoolExecutor
import os
import time
import logging
import shutil
-import numpy as np
import paddle
import paddle.distributed.fleet as fleet
from modeling import BertForPretraining, BertConfig
from loss import BertPretrainingCriterion
from utils.save_load import save_model
-from utils.utility import get_num_trainers, get_trainer_id
+from utils.utility import get_trainer_id
from lr_scheduler import build_lr_scheduler
from optimizer import build_optimizer
-from pretraining_dataset import create_pretraining_dataset, select_dataset_file_for_each_worker, WorkerInitObj
import dllogger
-def create_strategy(use_distributed_fused_lamb=False):
+def create_pretraining_data_holder():
+ input_ids = paddle.static.data(
+ name="input_ids", shape=[-1, -1], dtype="int64")
+ token_type_ids = paddle.static.data(
+ name="token_type_ids", shape=[-1, -1], dtype="int64")
+ attention_mask = paddle.static.data(
+ name="attention_mask", shape=[-1, 1, 1, -1], dtype="int64")
+ next_sentence_labels = paddle.static.data(
+ name="next_sentence_labels", shape=[-1, 1], dtype="int64")
+ masked_lm_labels = paddle.static.data(
+ name="masked_lm_labels", shape=[-1, -1], dtype="int64")
+ return [
+ input_ids, token_type_ids, attention_mask, next_sentence_labels,
+ masked_lm_labels
+ ]
+
+
+def create_strategy(args, use_distributed_fused_lamb=False):
"""
Create paddle.static.BuildStrategy and paddle.static.ExecutionStrategy with arguments.
Args:
+ args(Namespace): Arguments obtained from ArgumentParser.
use_distributed_fused_lamb(bool, optional): Whether to use distributed fused lamb.
Returns:
build_strategy(paddle.static.BuildStrategy): A instance of BuildStrategy.
@@ -44,6 +59,9 @@ def create_strategy(use_distributed_fused_lamb=False):
exec_strategy = paddle.static.ExecutionStrategy()
build_strategy.enable_addto = True
+ if args.amp:
+ build_strategy.fuse_gemm_epilogue = True
+ build_strategy.fuse_dot_product_attention = args.fuse_mha
if use_distributed_fused_lamb:
build_strategy.fuse_all_reduce_ops = False
@@ -69,7 +87,8 @@ def dist_optimizer(args, optimizer):
optimizer(fleet.distributed_optimizer): A distributed optimizer.
"""
use_distributed_fused_lamb = True if args.optimizer == 'DistributedFusedLamb' else False
- build_strategy, exec_strategy = create_strategy(use_distributed_fused_lamb)
+ build_strategy, exec_strategy = create_strategy(args,
+ use_distributed_fused_lamb)
dist_strategy = fleet.DistributedStrategy()
if use_distributed_fused_lamb:
@@ -111,45 +130,47 @@ def dist_optimizer(args, optimizer):
return optimizer
-def build(args, main_prog, startup_prog, feeds, is_train=True):
+def build(args, main_prog, startup_prog, is_train=True):
"""
- Build a executable paddle.static.Program via following 3 steps:
+ Build an executable paddle.static.Program via the following 4 steps:
- 1. Create model.
- 2. Create loss.
- 3. Create optimizer if is_train==True.
+ 1. Create feeds.
+ 2. Create model.
+ 3. Create loss.
+ 4. Create optimizer if is_train==True.
Args:
args(Namespace): Arguments obtained from ArgumentParser.
main_prog(paddle.static.Program):The main program.
startup_prog(paddle.static.Program):The startup program.
- feeds(dict): A dict of mapping variables' names to their values
is_train(bool, optional): Whether the main programe created is for training. Default: True.
Returns:
model(paddle.nn.Layer): An instance of BERT Model defined in modeling.py.
lr_scheduler(paddle.optimizer.lr.LRScheduler): A learning rate scheduler.
optimizer(Optimizer): An optimizer with distributed/AMP strategy.
loss(variable): The output variable of loss function.
+ feeds(dict): A dict of mapping variables' names to their values
"""
with paddle.static.program_guard(main_prog, startup_prog):
with paddle.utils.unique_name.guard():
+ feeds = create_pretraining_data_holder()
[
- input_ids, segment_ids, input_mask, masked_lm_positions,
- masked_lm_labels, next_sentence_labels, masked_lm_scale
+ input_ids, token_type_ids, attention_mask,
+ next_sentence_labels, masked_lm_labels
] = feeds
bert_config = BertConfig.from_json_file(args.config_file)
if bert_config.vocab_size % 8 != 0:
bert_config.vocab_size += 8 - (bert_config.vocab_size % 8)
+ bert_config.fuse_mha = args.fuse_mha
model = BertForPretraining(bert_config)
criterion = BertPretrainingCriterion(bert_config.vocab_size)
prediction_scores, seq_relationship_score = model(
input_ids=input_ids,
- token_type_ids=segment_ids,
- attention_mask=input_mask,
- masked_positions=masked_lm_positions)
+ token_type_ids=token_type_ids,
+ attention_mask=attention_mask,
+ masked_lm_labels=masked_lm_labels)
loss = criterion(prediction_scores, seq_relationship_score,
- masked_lm_labels, next_sentence_labels,
- masked_lm_scale)
+ masked_lm_labels, next_sentence_labels)
lr_scheduler = None
optimizer = None
@@ -158,10 +179,16 @@ def build(args, main_prog, startup_prog, feeds, is_train=True):
optimizer = build_optimizer(args, lr_scheduler)
optimizer = dist_optimizer(args, optimizer)
optimizer.minimize(loss)
- return model, lr_scheduler, optimizer, loss
+ return model, lr_scheduler, optimizer, loss, feeds
-def run(exe, program, args, lr_scheduler, loss, feeds, progress=None):
+def run(exe,
+ program,
+ args,
+ lr_scheduler,
+ loss,
+ train_dataloader,
+ progress=None):
"""
Execute program.
@@ -172,20 +199,14 @@ def run(exe, program, args, lr_scheduler, loss, feeds, progress=None):
lr_scheduler(paddle.optimizer.lr.LRScheduler): A learning rate scheduler.
Default: None.
loss(variable): The output variable of loss function.
- feeds(dict): A dict of mapping variables' names to their values
progress(dict, optional): A dict to record the training progress of checkpoint.
Returns:
global_step(int): Final step id of this run.
loss_return(float): Final loss of this run.
train_time_raw(float): Time to train of this run.
"""
- pool = ThreadPoolExecutor(1)
-
- num_trainers = get_num_trainers()
trainer_id = get_trainer_id()
- worker_init = WorkerInitObj(args.seed + trainer_id)
-
batch_size_per_gpu = args.batch_size
log_steps = args.log_freq
save_steps = args.num_steps_per_checkpoint
@@ -195,115 +216,88 @@ def run(exe, program, args, lr_scheduler, loss, feeds, progress=None):
last_step = args.last_step_of_checkpoint
train_iter = 0
epoch = 0
- resume_from_ckpt = False
+ train_time_raw = 0
if progress is None:
progress = dict()
else:
- resume_from_ckpt = True
- last_step = progress.get('global_step', 0)
epoch = progress.get('epoch', 0)
global_step = 0 + last_step
logging.info(f"Training will start at the {last_step+1}th step")
max_steps = args.max_steps
+ steps_this_run = max_steps
if args.steps_this_run is not None:
if args.steps_this_run + last_step > max_steps:
logging.info(
f"Only {max_steps - last_step} steps will be performed in this run due to the limit of --max-steps."
)
else:
- max_steps = args.steps_this_run + last_step
+ steps_this_run = args.steps_this_run
+ max_steps = steps_this_run + last_step
logging.warning(
- f"{args.steps_this_run} steps will be performed in this run.")
+ f"{steps_this_run} steps will be performed in this run.")
+
+ if args.benchmark:
+ max_steps = args.benchmark_warmup_steps + args.benchmark_steps + last_step
+
+
total_samples = 0
+ raw_train_start = time.time()
step_start = time.time()
- raw_train_start = None
+ avg_loss = 0
while True:
- input_dir = args.input_dir
- if not resume_from_ckpt or progress.get('files', None) is None:
- files = [
- os.path.join(input_dir, f) for f in os.listdir(input_dir)
- if os.path.isfile(os.path.join(input_dir, f)) and "training" in
- f
- ]
- files.sort()
- np.random.shuffle(files)
- f_start_id = 0
- else:
- f_start_id = progress['f_id']
- files = progress['files']
- resume_from_ckpt = False
-
- # Select one file for each worker and create the DataLoader for the file
- data_file = select_dataset_file_for_each_worker(
- files, f_start_id, num_trainers, trainer_id)
- train_data_loader = create_pretraining_dataset(
- args, data_file, feeds, worker_init, paddle.static.cuda_places())
-
- for f_id in range(f_start_id + 1, len(files)):
- data_file = select_dataset_file_for_each_worker(
- files, f_id, num_trainers, trainer_id)
- dataset_future = pool.submit(create_pretraining_dataset, args,
- data_file, feeds, worker_init,
- paddle.static.cuda_places())
-
- if raw_train_start is None:
+ for batch in train_dataloader:
+
+ train_iter += 1
+ loss_return = exe.run(program, feed=batch, fetch_list=[loss])
+ total_samples += batch_size_per_gpu
+ avg_loss += loss_return[0].item()
+
+ lr = lr_scheduler.get_lr()
+
+ if train_iter % (log_steps * gradient_merge_steps) == 0:
+ step_cost = time.time() - step_start
+ dllogger_it_data = {
+ 'loss': avg_loss / gradient_merge_steps,
+ 'learning_rate': lr,
+ 'step_cost': step_cost,
+ 'step_samples': total_samples,
+ 'seqs_per_sec': total_samples / step_cost,
+ }
+ dllogger.log((epoch, global_step + 1), data=dllogger_it_data)
+ total_samples = 0
+ step_start = time.time()
+
+ if train_iter % gradient_merge_steps == 0:
+ global_step += 1
+ lr_scheduler.step()
+ avg_loss = 0
+
+ if args.benchmark and train_iter == (args.benchmark_warmup_steps *
+ gradient_merge_steps):
raw_train_start = time.time()
- for batch in train_data_loader:
- train_iter += 1
- loss_return = exe.run(program, feed=batch, fetch_list=[loss])
- total_samples += batch_size_per_gpu
-
- lr = lr_scheduler.get_lr()
- if train_iter % gradient_merge_steps == 0:
- global_step += 1
- lr_scheduler.step()
-
- if train_iter % (log_steps * gradient_merge_steps) == 0:
- step_cost = time.time() - step_start
- dllogger_it_data = {
- 'loss': loss_return[0].item(),
- 'learning_rate': lr,
- 'step_cost': step_cost,
- 'step_samples': total_samples,
- 'seqs_per_sec': total_samples / step_cost,
+ if train_iter % (save_steps * gradient_merge_steps
+ ) == 0 or global_step >= max_steps:
+ train_time_raw = time.time() - raw_train_start
+ if trainer_id == 0:
+ model_path = os.path.join(
+ args.output_dir, args.bert_model, "phase1"
+ if args.phase1 else "phase2", f"{global_step}")
+ progress = {
+ 'epoch': epoch,
+ 'global_step': global_step,
+ 'phase': 1 if args.phase1 else 2,
}
- dllogger.log((epoch, global_step), data=dllogger_it_data)
- total_samples = 0
- step_start = time.time()
-
- if args.benchmark and train_iter == (
- args.benchmark_warmup_steps * gradient_merge_steps):
- raw_train_start = time.time()
-
- if train_iter % (save_steps * gradient_merge_steps
- ) == 0 or global_step >= max_steps:
- if trainer_id == 0:
- model_path = os.path.join(
- args.output_dir, args.bert_model, "phase1"
- if args.phase1 else "phase2", f"{global_step}")
- progress = {
- 'files': files,
- 'epoch': epoch,
- 'global_step': global_step,
- 'f_id': f_id,
- 'phase': 1 if args.phase1 else 2,
- }
- save_model(program, model_path, args.model_prefix,
- progress)
- most_recent_ckpts_paths.append(model_path)
- if len(most_recent_ckpts_paths) > 3:
- ckpt_to_be_removed = most_recent_ckpts_paths.pop(0)
- shutil.rmtree(ckpt_to_be_removed)
- if (global_step >= max_steps) or (
- args.benchmark and global_step >=
- args.benchmark_steps + args.benchmark_warmup_steps):
- train_time_raw = time.time() - raw_train_start
- del train_data_loader
- return global_step, loss_return[0].item(), train_time_raw
- del train_data_loader
- train_data_loader = dataset_future.result(timeout=None)
+ save_model(program, model_path, args.model_prefix,
+ progress)
+ most_recent_ckpts_paths.append(model_path)
+ if len(most_recent_ckpts_paths) > 3:
+ ckpt_to_be_removed = most_recent_ckpts_paths.pop(0)
+ shutil.rmtree(ckpt_to_be_removed)
+ if global_step >= max_steps:
+ actual_steps_this_run = global_step - last_step
+ return global_step, actual_steps_this_run, loss_return[0].item(), train_time_raw
epoch += 1
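As a reading aid for the rewritten loop above, the toy sketch below shows how `train_iter`, `gradient_merge_steps`, and `global_step` relate. The step counts are arbitrary.

```python
# Toy sketch of the accumulation bookkeeping in run(): the executor runs once
# per micro-batch (train_iter), the optimizer/global step advances every
# gradient_merge_steps micro-batches, and logging happens every log_steps
# optimizer steps. The numbers below are arbitrary.
gradient_merge_steps = 4
log_steps = 2
global_step = 0
for train_iter in range(1, 17):          # 16 micro-batches
    if train_iter % (log_steps * gradient_merge_steps) == 0:
        print(f'log at micro-batch {train_iter}, global step {global_step + 1}')
    if train_iter % gradient_merge_steps == 0:
        global_step += 1                 # one optimizer update
```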
diff --git a/PaddlePaddle/LanguageModeling/BERT/requirements.txt b/PaddlePaddle/LanguageModeling/BERT/requirements.txt
deleted file mode 100644
index 3b7de667a..000000000
--- a/PaddlePaddle/LanguageModeling/BERT/requirements.txt
+++ /dev/null
@@ -1,4 +0,0 @@
-nltk
-h5py
-tqdm
-git+https://github.com/NVIDIA/dllogger#egg=dllogger
diff --git a/PaddlePaddle/LanguageModeling/BERT/run.sub b/PaddlePaddle/LanguageModeling/BERT/run.sub
new file mode 100644
index 000000000..dd520a5a4
--- /dev/null
+++ b/PaddlePaddle/LanguageModeling/BERT/run.sub
@@ -0,0 +1,268 @@
+#!/bin/bash
+#SBATCH --exclusive
+#SBATCH --mem=0
+#SBATCH --overcommit
+#SBATCH --parsable
+
+# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -eux
+
+#
+# Job Configurations
+#
+# Tag to the built image.
+IMAGE_VERSION=${IMAGE_VERSION:-"22.12-py3"}
+# Number of processes per node used for the LDDL preprocessor.
+DASK_TASKS_PER_NODE=${DASK_TASKS_PER_NODE:-128}
+# 1 or 2.
+PHASE=${PHASE:-1}
+# An integer that specifies the pretraining seed.
+SEED=${SEED:-42}
+# The fraction of the articles from the Wikipedia dataset to sample and use
+# for pretraining. 0 < ${SAMPLE_RATIO} < 1.0
+SAMPLE_RATIO=${SAMPLE_RATIO:-0.9}
+# Number of GPUs per node. 0 < ${GPUS} <= 8.
+GPUS=${GPUS:-"8"}
+# The bin size for binned LDDL data loading. 'none' or an integer that divides
+# 128 (for Phase1) or 512 (for Phase2).
+BIN_SIZE=${BIN_SIZE:-"none"}
+# Number of parquet shards per LDDL data loader worker process. 'none' or
+# an integer.
+NUM_SHARDS_PER_WORKER=${NUM_SHARDS_PER_WORKER:-"none"}
+# Number of LDDL data loader worker processes per rank.
+NUM_WORKERS=${NUM_WORKERS:-4}
+# Should we rerun the LDDL preprocessor every time? 'true' or 'false'.
+RERUN_DASK=${RERUN_DASK:-"true"}
+# 'static' or 'dynamic'.
+MASKING=${MASKING:-"static"}
+# Should we use jemalloc for the LDDL preprocessor? 'true' or 'false'.
+USE_JEMALLOC=${USE_JEMALLOC:-"true"}
+# 'fp16' or 'tf32'.
+PRECISION=${PRECISION:-"fp16"}
+# The path to the initial checkpoint (from Phase1) used to start Phase2. 'none'
+# or an absolute path.
+INIT_CHECKPOINT=${INIT_CHECKPOINT:-"none"}
+# The per-rank batch size before being divided by the gradient accumulation
+# steps.
+TRAIN_BATCH_SIZE=${TRAIN_BATCH_SIZE:-"256"}
+# The gradient accumulation steps.
+GRADIENT_ACCUMULATION_STEPS=${GRADIENT_ACCUMULATION_STEPS:-"32"}
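+# Illustrative arithmetic: with TRAIN_BATCH_SIZE=256 and
+# GRADIENT_ACCUMULATION_STEPS=32, each rank runs micro-batches of 256 / 32 = 8
+# sequences and accumulates gradients over 32 micro-batches per optimizer step.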
+
+#
+# Static Configurations
+#
+# Container URL.
+# Replace this with the URL of the docker image that you build
+# with scripts/docker/build.sh .
+readonly docker_image="bert:${IMAGE_VERSION}"
+# Where the datasets are stored on the system.
+readonly host_datadir="/home/${USER}/datasets"
+readonly container_datadir="/datasets"
+# Replace these with the path to the 'source' subdirectory of the LDDL Wikipedia
+# dataset.
+readonly host_wikipedia_source="${host_datadir}/wikipedia/source"
+readonly container_wikipedia_source="${container_datadir}/wikipedia/source"
+readonly wikipedia_mount="${host_wikipedia_source}:${container_wikipedia_source}"
+# Replace these with where you want to store the Parquet shards in case
+# ${RERUN_DASK} is 'false'.
+readonly host_pretrain="${host_datadir}/pretrain"
+readonly container_pretrain="${container_datadir}/pretrain"
+readonly pretrain_mount="${host_pretrain}:${container_pretrain}"
+# Replace these with where you want to store the pretrained checkpoints on
+# the system.
+readonly host_output="$PWD/results/${SLURM_JOB_ID}"
+mkdir -p "${host_output}"
+readonly container_output="/results"
+readonly output_mount="${host_output}:${container_output}"
+# If INIT_CHECKPOINT is 'none', infer INIT_CHECKPOINT based on job dependency.
+if [ "${INIT_CHECKPOINT}" == "none" ] && [ "${PHASE}" == "2" ] ; then
+ INIT_CHECKPOINT="$PWD/results/${SLURM_JOB_DEPENDENCY}/bert-large-uncased/phase1/7038"
+fi
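+# Illustrative usage (hypothetical job IDs): submit Phase1 first, then submit
+# Phase2 with a SLURM dependency on the Phase1 job, so that the value of
+# ${SLURM_JOB_DEPENDENCY} resolves the Phase1 results directory above. This
+# assumes the dependency string expands to the bare Phase1 job ID.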
+# Define mounts.
+mounts="${PWD}:/workspace/bert,${wikipedia_mount},${pretrain_mount},${output_mount}"
+# Add the mount path of the initial checkpoint for Phase2.
+if [ "${PHASE}" == "1" ]; then
+ echo "No init. mounted for Phase1!"
+ readonly container_init_checkpoint=""
+elif [ "${PHASE}" == "2" ]; then
+ if [ ! -f "${INIT_CHECKPOINT}" ]; then
+ echo "No init. checkpoint found for Phase2!"
+ exit 1
+ else
+ mounts="${mounts},$(dirname "${INIT_CHECKPOINT}"):/checkpoints"
+ readonly container_init_checkpoint="/checkpoints"
+ fi
+else
+ echo "\${PHASE} = ${PHASE} unknown!"
+ exit 1
+fi
+# Determine where the parquet shards should be stored.
+if [ "${RERUN_DASK}" == "true" ]; then
+ # Always rerun the dask pipeline. Therefore, use the output directory to store
+ # the parquets.
+ readonly host_pretrain_parquet="${host_output}/parquet"
+ readonly container_pretrain_parquet="${container_output}/parquet"
+elif [ "${RERUN_DASK}" == "false" ]; then
+ echo "Use existing parquets if they exists."
+ if [ "${BIN_SIZE}" == "none" ]; then
+ readonly host_pretrain_parquet="${host_pretrain}/phase${PHASE}/unbinned/parquet"
+ readonly container_pretrain_parquet="${container_pretrain}/phase${PHASE}/unbinned/parquet"
+ else
+ readonly host_pretrain_parquet="${host_pretrain}/phase${PHASE}/bin_size_${BIN_SIZE}/parquet"
+ readonly container_pretrain_parquet="${container_pretrain}/phase${PHASE}/bin_size_${BIN_SIZE}/parquet"
+ fi
+else
+ echo "\${RERUN_DASK} = ${RERUN_DASK} unknown!"
+ exit 1
+fi
+
+readonly PHASE1="\
+ --learning-rate=6e-3 \
+ --warmup-proportion=0.2843 \
+ --phase1 \
+ --max-seq-length=128 \
+ --max-predictions-per-seq=20 \
+ --max-steps=7038 \
+ --num-steps-per-checkpoint=2500 \
+ "
+
+readonly PHASE2="\
+ --learning-rate=4e-3 \
+ --warmup-proportion=0.128 \
+ --phase2 \
+ --max-seq-length=512 \
+ --max-predictions-per-seq=80 \
+ --max-steps=1563 \
+ --num-steps-per-checkpoint=1000 \
+ --from-pretrained-params=${container_init_checkpoint} \
+ "
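+# Illustrative arithmetic: the warmup proportions above correspond to roughly
+# 0.2843 * 7038 ~= 2000 warmup steps for Phase1 and 0.128 * 1563 ~= 200 warmup
+# steps for Phase2.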
+
+# Arguments for fp16.
+if [ "${PRECISION}" == "fp16" ]; then
+ readonly fp16_flags="--amp --use-dynamic-loss-scaling --scale-loss=1048576"
+elif [ "${PRECISION}" == "tf32" ]; then
+ readonly fp16_flags=""
+else
+ echo "\${PRECISION} = ${PRECISION} unknown!"
+ exit 1
+fi
+
+# Get the ip address of all nodes.
+IP_CMD="hostname -i"
+IP_STR=$(srun --mpi=pmix --ntasks-per-node=1 bash -c "${IP_CMD}")
+IP_STR=$(echo $IP_STR | sed 's/ /,/g')
+echo "\${IP_STR} = ${IP_STR}"
+
+# Get the actual pretraining command.
+readonly PHASES=( "$PHASE1" "$PHASE2" )
+readonly BERT_CMD="\
+ python -m paddle.distributed.launch \
+ --gpus=0,1,2,3,4,5,6,7 \
+ --ips="${IP_STR}" \
+ /workspace/bert/run_pretraining.py \
+ ${PHASES[$((PHASE - 1))]} \
+ --batch-size=${TRAIN_BATCH_SIZE} \
+ --input-dir=${container_pretrain_parquet} \
+ --output-dir=${container_output} \
+ --vocab-file=/workspace/bert/vocab/bert-large-uncased-vocab.txt \
+ --bert-model=bert-large-uncased \
+ --config-file=/workspace/bert/bert_configs/bert-large-uncased.json \
+ --gradient-merge-steps=${GRADIENT_ACCUMULATION_STEPS} \
+ --log-freq=1 \
+ --seed=12345 \
+ --optimizer=Lamb \
+ ${fp16_flags} "
+
+echo "nodes: ${SLURM_JOB_NUM_NODES}, TRAIN_BATCH_SIZE: ${TRAIN_BATCH_SIZE}, GRADIENT_ACCUMULATION_STEPS: ${GRADIENT_ACCUMULATION_STEPS}"
+
+#
+# Running the LDDL preprocessor and load balancer.
+#
+# Determine the number of parquet shards in total.
+if [ "${NUM_SHARDS_PER_WORKER}" == "none" ]; then
+ readonly num_blocks=4096
+else
+ readonly num_blocks=$((NUM_SHARDS_PER_WORKER * $(( NUM_WORKERS > 0 ? NUM_WORKERS : 1 )) * SLURM_JOB_NUM_NODES * GPUS))
+fi
+echo "num_blocks: ${num_blocks}"
+# Run the LDDL preprocessor and load balancer only when there are no files
+# where the parquets are supposed to be stored.
+if [ ! -d "${host_pretrain_parquet}" ] || [ -z "$(ls -A "${host_pretrain_parquet}")" ]; then
+ # The sequence length is 128 for Phase1, but 512 for Phase2.
+ if [ "${PHASE}" == "1" ]; then
+ readonly target_seq_len_flag=""
+ elif [ "${PHASE}" == "2" ]; then
+ readonly target_seq_len_flag="--target-seq-length 512"
+ else
+ echo "\${PHASE} = ${PHASE} unknown!"
+ exit 1
+ fi
+ # Should we use sequence binning?
+ if [ "${BIN_SIZE}" == "none" ]; then
+ readonly bin_size_flag=""
+ else
+ readonly bin_size_flag="--bin-size ${BIN_SIZE}"
+ fi
+ # Static masking or dynamic masking?
+ if [ "${MASKING}" == "dynamic" ]; then
+ readonly masking_flag=""
+ elif [ "${MASKING}" == "static" ]; then
+ readonly masking_flag="--masking"
+ else
+ echo "\${MASKING} = ${MASKING} unknown!"
+ exit 1
+ fi
+ # Should we use jemalloc for the LDDL preprocessor?
+ if [ "${USE_JEMALLOC}" == "true" ]; then
+ readonly use_jemalloc_flag="--export=ALL,LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so"
+ elif [ "${USE_JEMALLOC}" == "false" ]; then
+ readonly use_jemalloc_flag=""
+ else
+ echo "\${USE_JEMALLOC} = ${USE_JEMALLOC} unknown!"
+ exit 1
+ fi
+ # Run the LDDL preprocessor.
+ srun -l \
+ --mpi=pmix \
+ --container-image="${docker_image}" \
+ --container-mounts="${mounts}" \
+ --ntasks-per-node="${DASK_TASKS_PER_NODE}" \
+ ${use_jemalloc_flag} \
+ preprocess_bert_pretrain \
+ --schedule mpi \
+ ${target_seq_len_flag} \
+ --wikipedia ${container_wikipedia_source} \
+ --sink "${container_pretrain_parquet}" \
+ --vocab-file /workspace/bert/vocab/bert-large-uncased-vocab.txt \
+ --num-blocks "${num_blocks}" \
+ --sample-ratio "${SAMPLE_RATIO}" \
+ ${bin_size_flag} \
+ ${masking_flag} \
+ --seed "${SEED}"
+ # Run the LDDL load balancer.
+ srun -l \
+ --mpi=pmix \
+ --container-image="${docker_image}" \
+ --container-mounts="${mounts}" \
+ --ntasks-per-node="${DASK_TASKS_PER_NODE}" \
+ balance_dask_output \
+ --indir "${container_pretrain_parquet}" \
+ --num-shards "${num_blocks}"
+fi
+
+#
+# Run pretraining.
+#
+srun -l --mpi=pmix --container-image="${docker_image}" --container-mounts="${mounts}" --ntasks-per-node=1 bash -c "${BERT_CMD}"
\ No newline at end of file
diff --git a/PaddlePaddle/LanguageModeling/BERT/run_pretraining.py b/PaddlePaddle/LanguageModeling/BERT/run_pretraining.py
index b66f89a54..d6f9f4adc 100644
--- a/PaddlePaddle/LanguageModeling/BERT/run_pretraining.py
+++ b/PaddlePaddle/LanguageModeling/BERT/run_pretraining.py
@@ -12,11 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.
+import os
import time
+import logging
import paddle
import paddle.distributed.fleet as fleet
-from pretraining_dataset import create_pretraining_data_holder
from utils.config import parse_args, print_args
from utils.save_load import init_program
from utils.logger import setup_loggers
@@ -24,6 +25,7 @@
from utils.utility import set_seed, get_trainer_id, get_num_trainers
import program
import dllogger
+from lddl.paddle import get_bert_pretrain_data_loader
def main():
@@ -42,12 +44,11 @@ def main():
if args.show_config:
print_args(args)
+ device = paddle.set_device('gpu')
fleet.init(is_collective=True)
if args.enable_cpu_affinity:
set_cpu_affinity()
- device = paddle.set_device('gpu')
-
# Create the random seed for the worker
set_seed(args.seed + get_trainer_id())
@@ -60,30 +61,44 @@ def main():
main_program = paddle.static.default_main_program()
startup_program = paddle.static.default_startup_program()
- feeds = create_pretraining_data_holder()
-
- model, lr_scheduler, optimizer, loss = program.build(
- args, main_program, startup_program, feeds)
+ model, lr_scheduler, optimizer, loss, feeds = program.build(
+ args, main_program, startup_program)
exe = paddle.static.Executor(device)
exe.run(startup_program)
progress = init_program(args, program=main_program, exe=exe, model=model)
+ train_dataloader = get_bert_pretrain_data_loader(
+ args.input_dir,
+ vocab_file=args.vocab_file,
+ data_loader_kwargs={
+ 'batch_size': args.batch_size,
+ 'num_workers': args.num_workers,
+ 'persistent_workers': True,
+ 'feed_list': feeds
+ },
+ base_seed=args.seed,
+ log_dir=None if args.output_dir is None else
+ os.path.join(args.output_dir, 'lddl_log'),
+ log_level=logging.WARNING,
+ start_epoch=0 if progress is None else progress.get("epoch", 0),
+ sequence_length_alignment=64)
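+    # Note (descriptive assumption): sequence_length_alignment=64 asks the LDDL
+    # loader to pad sequence lengths up to a multiple of 64; the exact padding
+    # and binning behavior is defined by lddl.paddle.get_bert_pretrain_data_loader.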
if args.amp:
optimizer.amp_init(device)
- global_step, final_loss, train_time_raw = program.run(
- exe, main_program, args, lr_scheduler, loss, feeds, progress)
+ global_step, actual_steps_this_run, final_loss, train_time_raw = program.run(
+ exe, main_program, args, lr_scheduler, loss, train_dataloader,
+ progress)
if get_trainer_id() == 0:
e2e_time = time.time() - now
if args.benchmark:
training_perf = args.batch_size * args.gradient_merge_steps * (
- global_step - args.benchmark_warmup_steps
+ actual_steps_this_run - args.benchmark_warmup_steps
) * get_num_trainers() / train_time_raw
else:
- training_perf = args.batch_size * args.gradient_merge_steps * global_step * get_num_trainers(
+ training_perf = args.batch_size * args.gradient_merge_steps * actual_steps_this_run * get_num_trainers(
) / train_time_raw
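+                # training_perf estimates sequences/second: per-rank batch size
+                # x gradient accumulation steps x optimizer steps x number of
+                # ranks, divided by the measured wall-clock training time.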
dllogger.log(step=tuple(),
data={
diff --git a/PaddlePaddle/LanguageModeling/BERT/run_squad.py b/PaddlePaddle/LanguageModeling/BERT/run_squad.py
index 601676c68..48d2694a5 100644
--- a/PaddlePaddle/LanguageModeling/BERT/run_squad.py
+++ b/PaddlePaddle/LanguageModeling/BERT/run_squad.py
@@ -186,9 +186,11 @@ def main(args):
with paddle.static.program_guard(main_program, startup_program):
bert_config = BertConfig.from_json_file(args.config_file)
+ bert_config.fuse_mha = args.fuse_mha
if bert_config.vocab_size % 8 != 0:
bert_config.vocab_size += 8 - (bert_config.vocab_size % 8)
+
model = BertForQuestionAnswering(bert_config)
criterion = CrossEntropyLossForSQuAD()
logits = model(input_ids=input_ids, token_type_ids=segment_ids)
diff --git a/PaddlePaddle/LanguageModeling/BERT/scripts/configs/pretrain_config.sh b/PaddlePaddle/LanguageModeling/BERT/scripts/configs/pretrain_config.sh
index 6300e6412..e2c76de35 100644
--- a/PaddlePaddle/LanguageModeling/BERT/scripts/configs/pretrain_config.sh
+++ b/PaddlePaddle/LanguageModeling/BERT/scripts/configs/pretrain_config.sh
@@ -30,14 +30,22 @@ dgxa100-80g_8gpu_amp ()
warmup_proportion_phase2="0.128"
train_steps_phase2=1563
gradient_accumulation_steps_phase2=128
- DATASET=hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en # change this for other datasets
+ DATASET=pretrain/phase1/unbinned/parquet # change this for other datasets
DATA_DIR_PHASE1="$BERT_PREP_WORKING_DIR/${DATASET}/"
- DATASET2=hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en # change this for other datasets
+ DATASET2=pretrain/phase2/bin_size_64/parquet # change this for other datasets
DATA_DIR_PHASE2="$BERT_PREP_WORKING_DIR/${DATASET2}/"
CODEDIR=/workspace/bert
init_checkpoint="None"
+ VOCAB_FILE=vocab/bert-large-uncased-vocab.txt
RESULTS_DIR=$CODEDIR/results
CHECKPOINTS_DIR=$RESULTS_DIR
+ wikipedia_source=$BERT_PREP_WORKING_DIR/wikipedia/source/
+ num_dask_workers=128
+ num_shards_per_worker=128
+ num_workers=4
+ sample_ratio="0.9"
+ phase2_bin_size=64
+ masking=static
BERT_CONFIG=bert_configs/bert-large-uncased.json
enable_benchmark="false"
benchmark_steps=10 # It takes effect only after the enable_benchmark is set to true
@@ -45,9 +53,11 @@ dgxa100-80g_8gpu_amp ()
echo $train_batch_size $learning_rate $precision $num_gpus \
$warmup_proportion $train_steps $save_checkpoint_steps \
$create_logfile $gradient_accumulation_steps $seed $job_name \
- $train_batch_size_phase2 $learning_rate_phase2 \
+ $train_batch_size_phase2 $learning_rate_phase2 \
$warmup_proportion_phase2 $train_steps_phase2 $gradient_accumulation_steps_phase2 \
$DATA_DIR_PHASE1 $DATA_DIR_PHASE2 $CODEDIR $init_checkpoint \
+ $wikipedia_source $num_dask_workers $num_shards_per_worker $num_workers \
+ $sample_ratio $phase2_bin_size $masking \
$BERT_CONFIG $enable_benchmark $benchmark_steps $benchmark_warmup_steps
}
@@ -69,14 +79,22 @@ dgxa100-80g_8gpu_tf32 ()
warmup_proportion_phase2="0.128"
train_steps_phase2=1563
gradient_accumulation_steps_phase2=256
- DATASET=hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en # change this for other datasets
+ DATASET=pretrain/phase1/unbinned/parquet # change this for other datasets
DATA_DIR_PHASE1="$BERT_PREP_WORKING_DIR/${DATASET}/"
- DATASET2=hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en # change this for other datasets
+ DATASET2=pretrain/phase2/bin_size_64/parquet # change this for other datasets
DATA_DIR_PHASE2="$BERT_PREP_WORKING_DIR/${DATASET2}/"
CODEDIR=/workspace/bert
init_checkpoint="None"
+ VOCAB_FILE=vocab/bert-large-uncased-vocab.txt
RESULTS_DIR=$CODEDIR/results
CHECKPOINTS_DIR=$RESULTS_DIR
+ wikipedia_source=$BERT_PREP_WORKING_DIR/wikipedia/source/
+ num_dask_workers=128
+ num_shards_per_worker=128
+ num_workers=4
+ sample_ratio="0.9"
+ phase2_bin_size=64
+ masking=static
BERT_CONFIG=bert_configs/bert-large-uncased.json
enable_benchmark="false"
benchmark_steps=10 # It takes effect only after the enable_benchmark is set to true
@@ -84,8 +102,10 @@ dgxa100-80g_8gpu_tf32 ()
echo $train_batch_size $learning_rate $precision $num_gpus \
$warmup_proportion $train_steps $save_checkpoint_steps \
$create_logfile $gradient_accumulation_steps $seed $job_name \
- $train_batch_size_phase2 $learning_rate_phase2 \
+ $train_batch_size_phase2 $learning_rate_phase2 \
$warmup_proportion_phase2 $train_steps_phase2 $gradient_accumulation_steps_phase2 \
$DATA_DIR_PHASE1 $DATA_DIR_PHASE2 $CODEDIR $init_checkpoint \
+ $wikipedia_source $num_dask_workers $num_shards_per_worker $num_workers \
+ $sample_ratio $phase2_bin_size $masking \
$BERT_CONFIG $enable_benchmark $benchmark_steps $benchmark_warmup_steps
}
diff --git a/PaddlePaddle/LanguageModeling/BERT/scripts/docker/build.sh b/PaddlePaddle/LanguageModeling/BERT/scripts/docker/build.sh
index cda6c187c..dd5ef4374 100644
--- a/PaddlePaddle/LanguageModeling/BERT/scripts/docker/build.sh
+++ b/PaddlePaddle/LanguageModeling/BERT/scripts/docker/build.sh
@@ -1,6 +1,6 @@
#!/bin/bash
-# Copyright (c) 2022 NVIDIA Corporation. All rights reserved.
+# Copyright (c) 2023 NVIDIA Corporation. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -14,4 +14,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-docker build --network=host . --rm --pull --no-cache -t bert
+URL=${1:-"bert"}
+PUSH=${2:-"none"} # 'push' or 'none'
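+# Illustrative usage (placeholder image URL):
+#   bash scripts/docker/build.sh my.registry.example/bert:23.01 push
+# builds the image, tags it with the given URL, and pushes it; passing 'none'
+# (or omitting the second argument) keeps the image local.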
+
+set -e
+
+docker build \
+ --network=host \
+ --rm \
+ --pull \
+ --no-cache \
+ -t ${URL} \
+ .
+
+if [ "${PUSH}" == "push" ]; then
+ docker push ${URL}
+elif [ "${PUSH}" == "none" ]; then
+ echo "Keep the built image locally."
+else
+ echo "Invalid \${PUSH} option: ${PUSH} !"
+ exit 1
+fi
diff --git a/PaddlePaddle/LanguageModeling/BERT/scripts/run_pretraining.sh b/PaddlePaddle/LanguageModeling/BERT/scripts/run_pretraining.sh
index 6ab426d07..bd6da1240 100644
--- a/PaddlePaddle/LanguageModeling/BERT/scripts/run_pretraining.sh
+++ b/PaddlePaddle/LanguageModeling/BERT/scripts/run_pretraining.sh
@@ -32,25 +32,88 @@ warmup_proportion_phase2=${14:-"0.128"}
train_steps_phase2=${15:-1563}
gradient_accumulation_steps_phase2=${16:-128}
#change this for other datasets
-DATASET=hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en
+DATASET=pretrain/phase1/unbinned/parquet
DATA_DIR_PHASE1=${17:-$BERT_PREP_WORKING_DIR/${DATASET}/}
#change this for other datasets
-DATASET2=hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en
+DATASET2=pretrain/phase2/bin_size_64/parquet
DATA_DIR_PHASE2=${18:-$BERT_PREP_WORKING_DIR/${DATASET2}/}
CODEDIR=${19:-"/workspace/bert"}
init_checkpoint=${20:-"None"}
+VOCAB_FILE=vocab/bert-large-uncased-vocab.txt
RESULTS_DIR=$CODEDIR/results
CHECKPOINTS_DIR=$RESULTS_DIR
-BERT_CONFIG=${21:-"None"}
-enable_benchmark=${22:-"false"}
-benchmark_steps=${23:-"10"}
-benchmark_warmup_steps=${24:-"10"}
+wikipedia_source=${21:-$BERT_PREP_WORKING_DIR/wikipedia/source/}
+num_dask_workers=${22:-$(nproc)}
+num_shards_per_worker=${23:-128}
+num_workers=${24:-4}
+num_nodes=1
+sample_ratio=${25:-0.9}
+phase2_bin_size=${26:-64}
+masking=${27:-static}
+BERT_CONFIG=${28:-"None"}
+enable_benchmark=${29:-"false"}
+benchmark_steps=${30:-"10"}
+benchmark_warmup_steps=${31:-"10"}
+fuse_mha=${32:-"true"}
+
+# Calculate the total number of shards.
+readonly num_blocks=$((num_shards_per_worker * $(( num_workers > 0 ? num_workers : 1 )) * num_nodes * num_gpus))
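+# Illustrative arithmetic: with the defaults num_shards_per_worker=128,
+# num_workers=4, num_nodes=1 and num_gpus=8, this gives 128 * 4 * 1 * 8 = 4096
+# parquet shards.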
+
+if [ "${phase2_bin_size}" == "none" ]; then
+ readonly phase2_bin_size_flag=""
+elif [[ "${phase2_bin_size}" =~ ^(32|64|128|256|512)$ ]]; then
+ readonly phase2_bin_size_flag="--bin-size ${phase2_bin_size}"
+else
+ echo "Error! phase2_bin_size=${phase2_bin_size} not supported!"
+ return -1
+fi
+
+if [ "${masking}" == "static" ]; then
+ readonly masking_flag="--masking"
+elif [ "${masking}" == "dynamic" ]; then
+ readonly masking_flag=""
+else
+ echo "Error! masking=${masking} not supported!"
+ return -1
+fi
mkdir -p $CHECKPOINTS_DIR
-if [ ! -d "$DATA_DIR_PHASE1" ] ; then
- echo "Warning! $DATA_DIR_PHASE1 directory missing. Training cannot start"
+if [ ! -d "${DATA_DIR_PHASE1}" ] || [ -z "$(ls -A ${DATA_DIR_PHASE1})" ]; then
+ echo "Warning! ${DATA_DIR_PHASE1} directory missing."
+ if [ ! -d "${wikipedia_source}" ] || [ -z "$(ls -A ${wikipedia_source})" ]; then
+ echo "Error! ${wikipedia_source} directory missing. Training cannot start!"
+ return -1
+ fi
+ preprocess_cmd=" \
+ mpirun \
+ --oversubscribe \
+ --allow-run-as-root \
+ -np ${num_dask_workers} \
+ -x LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so \
+ preprocess_bert_pretrain \
+ --schedule mpi \
+ --vocab-file ${VOCAB_FILE} \
+ --wikipedia ${wikipedia_source} \
+ --sink ${DATA_DIR_PHASE1} \
+ --num-blocks ${num_blocks} \
+ --sample-ratio ${sample_ratio} \
+ ${masking_flag} \
+ --seed ${seed}"
+ echo "Running ${preprocess_cmd} ..."
+ ${preprocess_cmd}
+
+ balance_load_cmd=" \
+ mpirun \
+ --oversubscribe \
+ --allow-run-as-root \
+ -np ${num_dask_workers} \
+ balance_dask_output \
+ --indir ${DATA_DIR_PHASE1} \
+ --num-shards ${num_blocks}"
+ echo "Running ${balance_load_cmd} ..."
+ ${balance_load_cmd}
fi
if [ ! -d "$RESULTS_DIR" ] ; then
echo "Error! $RESULTS_DIR directory missing."
@@ -68,8 +131,12 @@ if [ "$BERT_CONFIG" != "None" ] ; then
fi
PREC=""
+FUSE_MHA=""
if [ "$precision" = "amp" ] ; then
PREC="--amp --use-dynamic-loss-scaling --scale-loss=1048576"
+ if [ "$fuse_mha" = "true" ] ; then
+ FUSE_MHA="--fuse-mha"
+ fi
elif [ "$precision" = "fp32" ] ; then
PREC=""
elif [ "$precision" = "tf32" ] ; then
@@ -119,6 +186,7 @@ echo $DATA_DIR_PHASE1
INPUT_DIR=$DATA_DIR_PHASE1
CMD=" $CODEDIR/run_pretraining.py"
CMD+=" --input-dir=$DATA_DIR_PHASE1"
+CMD+=" --vocab-file=$VOCAB_FILE"
CMD+=" --output-dir=$CHECKPOINTS_DIR"
CMD+=" $CONFIG "
CMD+=" --bert-model=bert-large-uncased"
@@ -134,6 +202,7 @@ CMD+=" --log-freq=1"
CMD+=" --optimizer=Lamb"
CMD+=" --phase1"
CMD+=" $PREC"
+CMD+=" $FUSE_MHA"
CMD+=" $ACCUMULATE_GRADIENTS"
CMD+=" $INIT_CHECKPOINT"
CMD+=" $BENCH"
@@ -180,11 +249,49 @@ fi
ACCUMULATE_GRADIENTS="--gradient-merge-steps=$gradient_accumulation_steps_phase2"
+if [ ! -d "${DATA_DIR_PHASE2}" ] || [ -z "$(ls -A ${DATA_DIR_PHASE2})" ]; then
+ echo "Warning! ${DATA_DIR_PHASE2} directory missing."
+ if [ ! -d "${wikipedia_source}" ] || [ -z "$(ls -A ${wikipedia_source})" ]; then
+ echo "Error! ${wikipedia_source} directory missing. Training cannot start!"
+ return -1
+ fi
+ preprocess_cmd=" \
+ mpirun \
+ --oversubscribe \
+ --allow-run-as-root \
+ -np ${num_dask_workers} \
+ -x LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so \
+ preprocess_bert_pretrain \
+ --schedule mpi \
+ --vocab-file ${VOCAB_FILE} \
+ --wikipedia ${wikipedia_source} \
+ --sink ${DATA_DIR_PHASE2} \
+ --target-seq-length 512 \
+ --num-blocks ${num_blocks} \
+ --sample-ratio ${sample_ratio} \
+ ${phase2_bin_size_flag} \
+ ${masking_flag} \
+ --seed ${seed}"
+ echo "Running ${preprocess_cmd} ..."
+ ${preprocess_cmd}
+
+ balance_load_cmd=" \
+ mpirun \
+ --oversubscribe \
+ --allow-run-as-root \
+ -np ${num_dask_workers} \
+ balance_dask_output \
+ --indir ${DATA_DIR_PHASE2} \
+ --num-shards ${num_blocks}"
+ echo "Running ${balance_load_cmd} ..."
+ ${balance_load_cmd}
+fi
echo $DATA_DIR_PHASE2
INPUT_DIR=$DATA_DIR_PHASE2
PHASE1_END_CKPT_DIR="${CHECKPOINTS_DIR}/bert-large-uncased/phase1/${train_steps}"
CMD=" $CODEDIR/run_pretraining.py"
CMD+=" --input-dir=$DATA_DIR_PHASE2"
+CMD+=" --vocab-file=$VOCAB_FILE"
CMD+=" --output-dir=$CHECKPOINTS_DIR"
CMD+=" $CONFIG "
CMD+=" --bert-model=bert-large-uncased"
diff --git a/PaddlePaddle/LanguageModeling/BERT/scripts/run_pretraining_p1.sh b/PaddlePaddle/LanguageModeling/BERT/scripts/run_pretraining_p1.sh
index 18e237ec6..efc77ba06 100644
--- a/PaddlePaddle/LanguageModeling/BERT/scripts/run_pretraining_p1.sh
+++ b/PaddlePaddle/LanguageModeling/BERT/scripts/run_pretraining_p1.sh
@@ -15,7 +15,8 @@
python3 -m paddle.distributed.launch \
--gpus="0,1,2,3,4,5,6,7" \
./run_pretraining.py \
---input-dir=./data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en \
+--input-dir=pretrain/phase1/unbinned/parquet \
+--vocab-file=vocab/bert-large-uncased-vocab.txt \
--output-dir=./results/checkpoints \
--bert-model=bert-large-uncased \
--from-checkpoint=./results/checkpoints/bert-large-uncased/phase1 \
@@ -30,6 +31,7 @@ python3 -m paddle.distributed.launch \
--amp \
--use-dynamic-loss-scaling \
--optimizer=Lamb \
+--fuse-mha \
--phase1 \
--scale-loss=1048576 \
--learning-rate=6e-3 \
diff --git a/PaddlePaddle/LanguageModeling/BERT/scripts/run_pretraining_p2.sh b/PaddlePaddle/LanguageModeling/BERT/scripts/run_pretraining_p2.sh
index f0e788cf2..76a9398b7 100644
--- a/PaddlePaddle/LanguageModeling/BERT/scripts/run_pretraining_p2.sh
+++ b/PaddlePaddle/LanguageModeling/BERT/scripts/run_pretraining_p2.sh
@@ -15,7 +15,8 @@
python3 -m paddle.distributed.launch \
--gpus="0,1,2,3,4,5,6,7" \
./run_pretraining.py \
---input-dir=./data/hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en \
+--input-dir=pretrain/phase2/bin_size_64/parquet \
+--vocab-file=vocab/bert-large-uncased-vocab.txt \
--output-dir=./results/checkpoints \
--bert-model=bert-large-uncased \
--from-checkpoint=./results/checkpoints/bert-large-uncased/phase2 \
@@ -31,6 +32,7 @@ python3 -m paddle.distributed.launch \
--amp \
--use-dynamic-loss-scaling \
--optimizer=Lamb \
+--fuse-mha \
--phase2 \
--scale-loss=1048576 \
--learning-rate=4e-3 \
diff --git a/PaddlePaddle/LanguageModeling/BERT/scripts/run_squad.sh b/PaddlePaddle/LanguageModeling/BERT/scripts/run_squad.sh
index 4d0d46da0..5234f16c1 100644
--- a/PaddlePaddle/LanguageModeling/BERT/scripts/run_squad.sh
+++ b/PaddlePaddle/LanguageModeling/BERT/scripts/run_squad.sh
@@ -31,6 +31,7 @@ max_steps=${14:-"-1"}
enable_benchmark=${15:-"false"}
benchmark_steps=${16:-"100"}
benchmark_warmup_steps=${17:-"100"}
+fuse_mha=${18:-"true"}
echo "out dir is $OUT_DIR"
@@ -41,9 +42,13 @@ if [ ! -d "$OUT_DIR" ]; then
fi
amp=""
+FUSE_MHA=""
if [ "$precision" = "amp" ] ; then
echo "amp activated!"
amp=" --amp --use-dynamic-loss-scaling --scale-loss=128.0"
+ if [ "$fuse_mha" = "true" ] ; then
+ FUSE_MHA="--fuse-mha"
+ fi
fi
CONFIG=""
@@ -119,6 +124,7 @@ CMD+=" --max-steps=$max_steps "
CMD+=" --optimizer=AdamW "
CMD+=" --log-freq=100 "
CMD+=" $amp "
+CMD+=" $FUSE_MHA "
CMD+=" $BENCH "
CMD+=" --report-file $OUT_DIR/dllogger_${num_gpus}_${precision}.json "
diff --git a/PaddlePaddle/LanguageModeling/BERT/utils/config.py b/PaddlePaddle/LanguageModeling/BERT/utils/config.py
index 2b1564fb0..8a402a291 100644
--- a/PaddlePaddle/LanguageModeling/BERT/utils/config.py
+++ b/PaddlePaddle/LanguageModeling/BERT/utils/config.py
@@ -18,6 +18,7 @@
import distutils.util
import logging
import dllogger
+import paddle
from utils.task import Task
from utils.save_load import _PDOPT_SUFFIX, _PDPARAMS_SUFFIX, _PROGRESS_SUFFIX
@@ -27,7 +28,7 @@
'bert-large-uncased': './bert_configs/bert-large-uncased.json',
'bert-large-cased': './bert_configs/bert-large-cased.json',
'bert-base-uncased': './bert_configs/bert-base-uncased.json',
- 'bert-base-cased': './bert_configs/bert-base-cased.json'
+ 'bert-base-cased': './bert_configs/bert-base-cased.json',
}
@@ -41,28 +42,34 @@ def _check_file_exist(path_with_prefix):
pdparams_path = path_with_prefix + _PDPARAMS_SUFFIX
progress_path = path_with_prefix + _PROGRESS_SUFFIX
found = False
- if os.path.exists(pdopt_path) and os.path.exists(
- pdparams_path) and os.path.exists(progress_path):
+ if (
+ os.path.exists(pdopt_path)
+ and os.path.exists(pdparams_path)
+ and os.path.exists(progress_path)
+ ):
found = True
return found, pdopt_path, pdparams_path, progress_path
if not os.path.exists(args.from_checkpoint):
logging.warning(
- f"Start training from scratch since no checkpoint is found.")
+            "Starting training from scratch since no checkpoint was found."
+ )
args.from_checkpoint = None
args.last_step_of_checkpoint = 0
return
- target_from_checkpoint = os.path.join(args.from_checkpoint,
- args.model_prefix)
+ target_from_checkpoint = os.path.join(
+ args.from_checkpoint, args.model_prefix
+ )
if args.last_step_of_checkpoint is None:
args.last_step_of_checkpoint = 0
elif args.last_step_of_checkpoint == _AUTO_LAST_EPOCH:
folders = os.listdir(args.from_checkpoint)
args.last_step_of_checkpoint = 0
for folder in folders:
- tmp_ckpt_path = os.path.join(args.from_checkpoint, folder,
- args.model_prefix)
+ tmp_ckpt_path = os.path.join(
+ args.from_checkpoint, folder, args.model_prefix
+ )
try:
folder = int(folder)
@@ -72,23 +79,32 @@ def _check_file_exist(path_with_prefix):
)
continue
- if folder > args.last_step_of_checkpoint and \
- _check_file_exist(tmp_ckpt_path)[0]:
+ if (
+ folder > args.last_step_of_checkpoint
+ and _check_file_exist(tmp_ckpt_path)[0]
+ ):
args.last_step_of_checkpoint = folder
- step_with_prefix = os.path.join(str(args.last_step_of_checkpoint), args.model_prefix) \
- if args.last_step_of_checkpoint > 0 else args.model_prefix
- target_from_checkpoint = os.path.join(args.from_checkpoint,
- step_with_prefix)
+ step_with_prefix = (
+ os.path.join(str(args.last_step_of_checkpoint), args.model_prefix)
+ if args.last_step_of_checkpoint > 0
+ else args.model_prefix
+ )
+ target_from_checkpoint = os.path.join(
+ args.from_checkpoint, step_with_prefix
+ )
else:
try:
args.last_step_of_checkpoint = int(args.last_step_of_checkpoint)
except ValueError:
- raise ValueError(f"The value of --last-step-of-checkpoint should be None, {_AUTO_LAST_EPOCH}" \
- f" or integer >= 0, but receive {args.last_step_of_checkpoint}")
+ raise ValueError(
+ f"The value of --last-step-of-checkpoint should be None, {_AUTO_LAST_EPOCH}"
+ f" or integer >= 0, but receive {args.last_step_of_checkpoint}"
+ )
args.from_checkpoint = target_from_checkpoint
found, pdopt_path, pdparams_path, progress_path = _check_file_exist(
- args.from_checkpoint)
+ args.from_checkpoint
+ )
if not found:
args.from_checkpoint = None
args.last_step_of_checkpoint = 0
@@ -98,19 +114,28 @@ def _check_file_exist(path_with_prefix):
def _get_full_path_of_pretrained_params(args, task=Task.pretrain):
- if args.from_pretrained_params is None and args.from_phase1_final_params is None:
+ if (
+ args.from_pretrained_params is None
+ and args.from_phase1_final_params is None
+ ):
args.last_step_of_checkpoint = 0
return
- if task == Task.pretrain and args.from_phase1_final_params is not None and args.last_step_of_checkpoint == 0:
+ if (
+ task == Task.pretrain
+ and args.from_phase1_final_params is not None
+ and args.last_step_of_checkpoint == 0
+ ):
args.from_pretrained_params = args.from_phase1_final_params
- args.from_pretrained_params = os.path.join(args.from_pretrained_params,
- args.model_prefix)
+ args.from_pretrained_params = os.path.join(
+ args.from_pretrained_params, args.model_prefix
+ )
pdparams_path = args.from_pretrained_params + _PDPARAMS_SUFFIX
if not os.path.exists(pdparams_path):
args.from_pretrained_params = None
logging.warning(
- f"Cannot find {pdparams_path}, disable --from-pretrained-params.")
+ f"Cannot find {pdparams_path}, disable --from-pretrained-params."
+ )
args.last_step_of_checkpoint = 0
@@ -121,20 +146,28 @@ def print_args(args):
def check_and_process_args(args, task=Task.pretrain):
if task == Task.pretrain:
- assert not (args.from_checkpoint is not None and \
- args.from_pretrained_params is not None), \
- "--from-pretrained-params and --from-checkpoint should " \
- "not be set simultaneously."
- assert not (args.phase1 and args.phase2), \
- "--phase1 and --phase2 should not be set simultaneously in bert pretraining."
+ assert not (
+ args.from_checkpoint is not None
+ and args.from_pretrained_params is not None
+ ), (
+ "--from-pretrained-params and --from-checkpoint should "
+ "not be set simultaneously."
+ )
+ assert not (
+ args.phase1 and args.phase2
+ ), "--phase1 and --phase2 should not be set simultaneously in bert pretraining."
if args.from_phase1_final_params is not None:
- assert args.phase2, "--from-phase1-final-params should only be used in phase2"
+ assert (
+ args.phase2
+ ), "--from-phase1-final-params should only be used in phase2"
# SQuAD finetuning does not support suspend-resume yet.(TODO)
_get_full_path_of_ckpt(args)
if args.bert_model == 'custom':
- assert args.config_file is not None, "--config-file must be specified if --bert-model=custom"
+ assert (
+ args.config_file is not None
+ ), "--config-file must be specified if --bert-model=custom"
elif args.config_file is None:
args.config_file = _DEFAULT_BERT_CONFIG[args.bert_model]
logging.info(
@@ -144,7 +177,19 @@ def check_and_process_args(args, task=Task.pretrain):
_get_full_path_of_pretrained_params(args, task)
assert os.path.isfile(
- args.config_file), f"Cannot find config file in {args.config_file}"
+ args.config_file
+ ), f"Cannot find config file in {args.config_file}"
+
+    # cudnn mha fusion is only supported on Ampere and Hopper GPUs with cuDNN v8.9.1 or later
+ device_capability = paddle.device.cuda.get_device_capability()
+ cudnn_mha_supported = paddle.get_cudnn_version() >= 8901 and (
+ device_capability == (8, 0) or device_capability == (9, 0)
+ )
+ if (not cudnn_mha_supported or args.amp is False) and args.fuse_mha is True:
+ logging.info(
+            "cudnn mha fusion is not supported, falling back to unfused mha"
+ )
+ args.fuse_mha = False
def add_global_args(parser, task=Task.pretrain):
@@ -155,144 +200,165 @@ def add_global_args(parser, task=Task.pretrain):
type=str,
default=None,
required=True,
- help='The input data directory. Should be specified by users and contain .hdf5 files for the task.'
+ help='The input data directory. Should be specified by users and contain .hdf5 files for the task.',
)
+ group.add_argument('--num-workers', default=1, type=int)
if task == Task.squad:
group.add_argument(
'--train-file',
type=str,
default=None,
- help='SQuAD json for training. E.g., train-v1.1.json')
+ help='SQuAD json for training. E.g., train-v1.1.json',
+ )
group.add_argument(
'--predict-file',
type=str,
default=None,
- help='SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json'
+ help='SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json',
)
- group.add_argument(
- '--vocab-file',
- type=str,
- default=None,
- required=True,
- help="Vocabulary mapping/file BERT was pretrainined on")
group.add_argument(
"--eval-script",
help="Script to evaluate squad predictions",
default="evaluate.py",
- type=str)
+ type=str,
+ )
group.add_argument(
'--epochs',
type=int,
default=3,
- help='The number of epochs for training.')
+ help='The number of epochs for training.',
+ )
+ group.add_argument(
+ '--vocab-file',
+ type=str,
+ default=None,
+ required=True,
+        help="Vocabulary mapping/file BERT was pretrained on",
+ )
group.add_argument(
'--output-dir',
type=str,
default=None,
required=True,
- help='The output directory where the model checkpoints will be written. Should be specified by users.'
+ help='The output directory where the model checkpoints will be written. Should be specified by users.',
)
group.add_argument(
'--bert-model',
type=str,
default='bert-large-uncased',
- choices=('bert-base-uncased', 'bert-base-cased', 'bert-large-uncased',
- 'bert-large-cased', 'custom'),
+ choices=(
+ 'bert-base-uncased',
+ 'bert-base-cased',
+ 'bert-large-uncased',
+ 'bert-large-cased',
+ 'custom',
+ ),
help='Specifies the type of BERT model to use. If it is set as custom, '
- 'the path to the config file must be given by specifying --config-file')
+ 'the path to the config file must be given by specifying --config-file',
+ )
group.add_argument(
'--config-file',
type=str,
default=None,
- help='The BERT model config. If set to None, `<--bert-model>.json` in folder `bert_configs` will be used.'
+ help='The BERT model config. If set to None, `<--bert-model>.json` in folder `bert_configs` will be used.',
)
group.add_argument(
'--max-steps',
type=int,
default=None,
required=True if task == Task.pretrain else False,
- help='Total number of training steps to perform.')
+ help='Total number of training steps to perform.',
+ )
group.add_argument(
- '--log-freq', type=int, default=10, help='Frequency of logging loss.')
+ '--log-freq', type=int, default=10, help='Frequency of logging loss.'
+ )
group.add_argument(
'--num-steps-per-checkpoint',
type=int,
default=100,
- help='Number of update steps until a model checkpoint is saved to disk.'
+ help='Number of update steps until a model checkpoint is saved to disk.',
)
# Init model
group.add_argument(
'--from-pretrained-params',
type=str,
default=None,
- help='Path to pretrained parameters. If set to None, no pretrained params will be used.'
+ help='Path to pretrained parameters. If set to None, no pretrained params will be used.',
)
group.add_argument(
'--from-checkpoint',
type=str,
default=None,
- help='A checkpoint path to resume training. If set to None, no checkpoint will be used. ' \
- 'If not None, --from-pretrained-params will be ignored.')
+ help='A checkpoint path to resume training. If set to None, no checkpoint will be used. '
+ 'If not None, --from-pretrained-params will be ignored.',
+ )
group.add_argument(
'--last-step-of-checkpoint',
type=str,
default=None,
- help='The step id of the checkpoint given by --from-checkpoint. ' \
- 'It should be None, auto, or integer > 0. If it is set as ' \
- 'None, then training will start from the 1-th epoch. If it is set as ' \
- 'auto, then it will search largest integer-convertable folder ' \
- ' --from-checkpoint, which contains required checkpoint. '
+        help='The step id of the checkpoint given by --from-checkpoint. '
+        'It should be None, auto, or an integer > 0. If it is set to '
+        'None, training will start from the first step. If it is set to '
+        'auto, the largest integer-convertible folder under '
+        '--from-checkpoint that contains a required checkpoint will be used.',
)
if task == Task.pretrain:
group.add_argument(
'--from-phase1-final-params',
type=str,
default=None,
- help='Path to final checkpoint of phase1, which will be used to ' \
- 'initialize the parameter in the first step of phase2, and ' \
- 'ignored in the rest steps of phase2.'
+ help='Path to final checkpoint of phase1, which will be used to '
+ 'initialize the parameter in the first step of phase2, and '
+ 'ignored in the rest steps of phase2.',
)
group.add_argument(
'--steps-this-run',
type=int,
default=None,
- help='If provided, only run this many steps before exiting.' \
+ help='If provided, only run this many steps before exiting.',
)
group.add_argument(
- '--seed', type=int, default=42, help="random seed for initialization")
+ '--seed', type=int, default=42, help="random seed for initialization"
+ )
group.add_argument(
'--report-file',
type=str,
default='./report.json',
- help='A file in which to store JSON experiment report.')
+ help='A file in which to store JSON experiment report.',
+ )
group.add_argument(
'--model-prefix',
type=str,
default='bert_paddle',
- help='The prefix name of model files to save/load.')
+ help='The prefix name of model files to save/load.',
+ )
group.add_argument(
'--show-config',
type=distutils.util.strtobool,
default=True,
- help='To show arguments.')
+ help='To show arguments.',
+ )
group.add_argument(
'--enable-cpu-affinity',
type=distutils.util.strtobool,
default=True,
- help='To enable in-built GPU-CPU affinity.')
+ help='To enable in-built GPU-CPU affinity.',
+ )
group.add_argument(
- '--benchmark', action='http://github.com/store_true', help='To enable benchmark mode.')
+ '--benchmark', action='http://github.com/store_true', help='To enable benchmark mode.'
+ )
group.add_argument(
'--benchmark-steps',
type=int,
default=20,
- help='Steps for a benchmark run, only applied when --benchmark is set.')
+ help='Steps for a benchmark run, only applied when --benchmark is set.',
+ )
group.add_argument(
'--benchmark-warmup-steps',
type=int,
default=20,
- help='Warmup steps for a benchmark run, only applied when --benchmark is set.'
+ help='Warmup steps for a benchmark run, only applied when --benchmark is set.',
)
return parser
@@ -304,145 +370,166 @@ def add_training_args(parser, task=Task.pretrain):
default='Lamb',
metavar="OPTIMIZER",
choices=('Lamb', 'AdamW'),
- help='The name of optimizer. It should be one of {Lamb, AdamW}.')
+ help='The name of optimizer. It should be one of {Lamb, AdamW}.',
+ )
group.add_argument(
'--gradient-merge-steps',
type=int,
default=1,
- help="Number of update steps to accumualte before performing a backward/update pass."
+        help="Number of update steps to accumulate before performing a backward/update pass.",
)
group.add_argument(
'--learning-rate',
type=float,
default=1e-4,
- help='The initial learning rate.')
+ help='The initial learning rate.',
+ )
group.add_argument(
'--warmup-start-lr',
type=float,
default=0.0,
- help='The initial learning rate for warmup.')
+ help='The initial learning rate for warmup.',
+ )
group.add_argument(
'--warmup-proportion',
type=float,
default=0.01,
help='Proportion of training to perform linear learning rate warmup for. '
- 'For example, 0.1 = 10%% of training.')
+ 'For example, 0.1 = 10%% of training.',
+ )
group.add_argument(
'--beta1',
type=float,
default=0.9,
- help='The exponential decay rate for the 1st moment estimates.')
+ help='The exponential decay rate for the 1st moment estimates.',
+ )
group.add_argument(
'--beta2',
type=float,
default=0.999,
- help='The exponential decay rate for the 2st moment estimates.')
+        help='The exponential decay rate for the 2nd moment estimates.',
+ )
group.add_argument(
'--epsilon',
type=float,
default=1e-6,
- help='A small float value for numerical stability.')
+ help='A small float value for numerical stability.',
+ )
group.add_argument(
'--weight-decay',
type=float,
default=0.01,
- help='The weight decay coefficient.')
+ help='The weight decay coefficient.',
+ )
group.add_argument(
'--max-seq-length',
default=512,
type=int,
help='The maximum total input sequence length after WordPiece tokenization. \n'
'Sequences longer than this will be truncated, and sequences shorter \n'
- 'than this will be padded.')
+ 'than this will be padded.',
+ )
if task == Task.pretrain:
group.add_argument(
'--batch-size',
type=int,
default=32,
- help='The batch size for training')
+ help='The batch size for training',
+ )
group.add_argument(
'--phase1',
action='http://github.com/store_true',
- help='The phase of BERT pretraining. It should not be set ' \
- 'with --phase2 at the same time.'
+ help='The phase of BERT pretraining. It should not be set '
+ 'with --phase2 at the same time.',
)
group.add_argument(
'--phase2',
action='http://github.com/store_true',
- help='The phase of BERT pretraining. It should not be set ' \
- 'with --phase1 at the same time.'
+ help='The phase of BERT pretraining. It should not be set '
+ 'with --phase1 at the same time.',
)
group.add_argument(
'--max-predictions-per-seq',
default=80,
type=int,
- help='The maximum total of masked tokens in the input sequence')
+            help='The maximum total number of masked tokens in the input sequence',
+ )
if task == Task.squad:
group.add_argument(
- "--do-train", action='http://github.com/store_true', help="Whether to run training.")
+ "--do-train", action='http://github.com/store_true', help="Whether to run training."
+ )
group.add_argument(
"--do-predict",
action='http://github.com/store_true',
- help="Whether to run eval on the dev set.")
+ help="Whether to run eval on the dev set.",
+ )
group.add_argument(
"--do-eval",
action='http://github.com/store_true',
- help="Whether to use evaluate accuracy of predictions")
+            help="Whether to evaluate the accuracy of predictions",
+ )
group.add_argument(
"--train-batch-size",
default=32,
type=int,
- help="Total batch size for training.")
+ help="Total batch size for training.",
+ )
group.add_argument(
"--predict-batch-size",
default=8,
type=int,
- help="Total batch size for predictions.")
+ help="Total batch size for predictions.",
+ )
group.add_argument(
"--verbose-logging",
action='http://github.com/store_true',
help="If true, all of the warnings related to data processing will be printed. "
- "A number of warnings are expected for a normal SQuAD evaluation.")
+ "A number of warnings are expected for a normal SQuAD evaluation.",
+ )
group.add_argument(
"--doc-stride",
default=128,
type=int,
help="When splitting up a long document into chunks, how much stride to take "
- "between chunks.")
+ "between chunks.",
+ )
group.add_argument(
"--max-query-length",
default=64,
type=int,
help="The maximum number of tokens for the question. Questions longer than this "
- "will be truncated to this length.")
+ "will be truncated to this length.",
+ )
group.add_argument(
"--n-best-size",
default=20,
type=int,
help="The total number of n-best predictions to generate in the nbest_predictions.json "
- "output file.")
+ "output file.",
+ )
group.add_argument(
"--max-answer-length",
default=30,
type=int,
help="The maximum length of an answer that can be generated. This is needed because the start "
- "and end predictions are not conditioned on one another.")
+ "and end predictions are not conditioned on one another.",
+ )
group.add_argument(
"--do-lower-case",
action='http://github.com/store_true',
- help="Whether to lower case the input text. True for uncased models, False for cased models."
+ help="Whether to lower case the input text. True for uncased models, False for cased models.",
)
group.add_argument(
'--version-2-with-negative',
action='http://github.com/store_true',
- help='If true, the SQuAD examples contain some that do not have an answer.'
+ help='If true, the SQuAD examples contain some that do not have an answer.',
)
group.add_argument(
'--null-score-diff-threshold',
type=float,
default=0.0,
- help="If null_score - best_non_null is greater than the threshold predict null."
+ help="If null_score - best_non_null is greater than the threshold predict null.",
)
return parser
@@ -452,22 +539,29 @@ def add_advance_args(parser):
group.add_argument(
'--amp',
action='http://github.com/store_true',
- help='Enable automatic mixed precision training (AMP).')
+ help='Enable automatic mixed precision training (AMP).',
+ )
group.add_argument(
'--scale-loss',
type=float,
default=1.0,
- help='The loss scalar for AMP training, only applied when --amp is set.'
+ help='The loss scalar for AMP training, only applied when --amp is set.',
)
group.add_argument(
'--use-dynamic-loss-scaling',
action='http://github.com/store_true',
- help='Enable dynamic loss scaling in AMP training, only applied when --amp is set.'
+ help='Enable dynamic loss scaling in AMP training, only applied when --amp is set.',
)
group.add_argument(
'--use-pure-fp16',
action='http://github.com/store_true',
- help='Enable pure FP16 training, only applied when --amp is set.')
+ help='Enable pure FP16 training, only applied when --amp is set.',
+ )
+ group.add_argument(
+ '--fuse-mha',
+ action='http://github.com/store_true',
+        help='Enable multihead attention fusion. Requires cuDNN version >= 8.9.1',
+ )
return parser
@@ -475,8 +569,10 @@ def add_advance_args(parser):
def parse_args(task=Task.pretrain):
parser = argparse.ArgumentParser(
description="PaddlePaddle BERT pretraining script"
- if task == Task.pretrain else "PaddlePaddle SQuAD finetuning script",
- formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+ if task == Task.pretrain
+ else "PaddlePaddle SQuAD finetuning script",
+ formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+ )
parser = add_global_args(parser, task)
parser = add_training_args(parser, task)
diff --git a/PyTorch/Classification/ConvNets/image_classification/dataloaders.py b/PyTorch/Classification/ConvNets/image_classification/dataloaders.py
index 47a25862a..7f3249b4d 100644
--- a/PyTorch/Classification/ConvNets/image_classification/dataloaders.py
+++ b/PyTorch/Classification/ConvNets/image_classification/dataloaders.py
@@ -34,6 +34,7 @@
import torchvision.transforms as transforms
from PIL import Image
from functools import partial
+from torchvision.transforms.functional import InterpolationMode
from image_classification.autoaugment import AutoaugmentImageNetPolicy
@@ -422,9 +423,10 @@ def get_pytorch_train_loader(
prefetch_factor=2,
memory_format=torch.contiguous_format,
):
- interpolation = {"bicubic": Image.BICUBIC, "bilinear": Image.BILINEAR}[
- interpolation
- ]
+ interpolation = {
+ "bicubic": InterpolationMode.BICUBIC,
+ "bilinear": InterpolationMode.BILINEAR,
+ }[interpolation]
traindir = os.path.join(data_path, "train")
transforms_list = [
transforms.RandomResizedCrop(image_size, interpolation=interpolation),
@@ -474,9 +476,10 @@ def get_pytorch_val_loader(
memory_format=torch.contiguous_format,
prefetch_factor=2,
):
- interpolation = {"bicubic": Image.BICUBIC, "bilinear": Image.BILINEAR}[
- interpolation
- ]
+ interpolation = {
+ "bicubic": InterpolationMode.BICUBIC,
+ "bilinear": InterpolationMode.BILINEAR,
+ }[interpolation]
valdir = os.path.join(data_path, "val")
val_dataset = datasets.ImageFolder(
valdir,
diff --git a/PyTorch/Classification/ConvNets/image_classification/models/resnet.py b/PyTorch/Classification/ConvNets/image_classification/models/resnet.py
index 47e58022f..fbfd13c71 100644
--- a/PyTorch/Classification/ConvNets/image_classification/models/resnet.py
+++ b/PyTorch/Classification/ConvNets/image_classification/models/resnet.py
@@ -63,14 +63,16 @@ def __init__(
stride=1,
cardinality=1,
downsample=None,
+ fused_se=True,
last_bn_0_init=False,
+ trt=False,
):
super(BasicBlock, self).__init__()
- self.conv1 = builder.conv3x3(inplanes, planes, stride, cardinality=cardinality)
+ self.conv1 = builder.conv3x3(inplanes, planes, stride, groups=cardinality)
self.bn1 = builder.batchnorm(planes)
self.relu = builder.activation()
self.conv2 = builder.conv3x3(
- planes, planes * expansion, cardinality=cardinality
+ planes, planes * expansion, groups=cardinality
)
self.bn2 = builder.batchnorm(planes * expansion, zero_init=last_bn_0_init)
self.downsample = downsample
diff --git a/PyTorch/Classification/GPUNet/README.md b/PyTorch/Classification/GPUNet/README.md
index 02d272741..6a8a4ba20 100644
--- a/PyTorch/Classification/GPUNet/README.md
+++ b/PyTorch/Classification/GPUNet/README.md
@@ -413,7 +413,7 @@ We benchmark the training results following the steps in [Training](#training).
##### NVIDIA DGX V100 (8x V100 32GB)
| **Model**|**Batch**| **Epochs** | **GPUs** | **FP32 Top1** | **AMP Top1** | **FP32 (hours)
Train Time** | **AMP (hours)
Train Time** | **Training speedup
(FP32 / AMP)** |
|:--------:|:------:|:----------:|:--------:|:--------------:|:--------------:|:-------------------:|:-----------------------:|:--------------------------------:|
-| GPUNet-0 |192 | 450 | 8 | 77.90+/-0.03 | 77.96+/-0.05 |71.63|46.56| 1.54 x |
+| GPUNet-0 |192 | 450 | 8 | 78.90+/-0.03 | 78.96+/-0.05 |71.63|46.56| 1.54 x |
| GPUNet-1 |192 | 450 | 8 | 80.4-+/-0.03 | 80.5+/-0.03 |67.5 |43.5 | 1.55 x |
| GPUNet-2 |192 | 450 | 8 | 82.1-+/-0.04 | 82.2+/-0.04 |171 |84.25| 2.03 x |
diff --git a/PyTorch/Detection/Efficientdet/data/dataset.py b/PyTorch/Detection/Efficientdet/data/dataset.py
index de3ee474f..b01a01264 100644
--- a/PyTorch/Detection/Efficientdet/data/dataset.py
+++ b/PyTorch/Detection/Efficientdet/data/dataset.py
@@ -43,7 +43,7 @@ class CocoDetection(data.Dataset):
def __init__(self, root, ann_file, config, transform=None):
super(CocoDetection, self).__init__()
- if isinstance(root, torch._six.string_classes):
+ if isinstance(root, (str, bytes)):
root = os.path.expanduser(root)
self.root = root
self.transform = transform
diff --git a/PyTorch/Detection/Efficientdet/train.py b/PyTorch/Detection/Efficientdet/train.py
index 7ca278b57..b59472cc3 100755
--- a/PyTorch/Detection/Efficientdet/train.py
+++ b/PyTorch/Detection/Efficientdet/train.py
@@ -521,12 +521,14 @@ def train_epoch(
model.train()
+ torch.cuda.synchronize()
end = time.time()
last_idx = steps_per_epoch - 1
num_updates = epoch * steps_per_epoch
for batch_idx in range(steps_per_epoch):
input, target = next(loader_iter)
last_batch = batch_idx == last_idx
+ torch.cuda.synchronize()
data_time_m.update(time.time() - end)
with torch.cuda.amp.autocast(enabled=use_amp):
@@ -575,6 +577,7 @@ def train_epoch(
if lr_scheduler is not None:
lr_scheduler.step_update(num_updates=num_updates, metric=losses_m.avg)
+ torch.cuda.synchronize()
end = time.time()
if args.benchmark:
if batch_idx >= args.benchmark_steps:
@@ -597,6 +600,7 @@ def validate(model, loader, args, evaluator=None, epoch=0, log_suffix=''):
model.eval()
+ torch.cuda.synchronize()
end = time.time()
last_idx = len(loader) - 1
with torch.no_grad():
diff --git a/PyTorch/Detection/Efficientdet/validate.py b/PyTorch/Detection/Efficientdet/validate.py
index 6145596c2..06eaa69db 100644
--- a/PyTorch/Detection/Efficientdet/validate.py
+++ b/PyTorch/Detection/Efficientdet/validate.py
@@ -208,12 +208,14 @@ def validate(args):
bench.eval()
batch_time = AverageMeter()
throughput = AverageMeter()
+ torch.cuda.synchronize()
end = time.time()
total_time_start = time.time()
with torch.no_grad():
for i, (input, target) in enumerate(loader):
with torch.cuda.amp.autocast(enabled=args.amp):
output = bench(input, target['img_scale'], target['img_size'])
+ torch.cuda.synchronize()
batch_time.update(time.time() - end)
throughput.update(input.size(0) / batch_time.val)
evaluator.add_predictions(output, target)
@@ -235,6 +237,7 @@ def validate(args):
)
end = time.time()
+ torch.cuda.synchronize()
dllogger_metric['total_inference_time'] = time.time() - total_time_start
dllogger_metric['inference_throughput'] = throughput.avg
dllogger_metric['inference_time'] = 1000 / throughput.avg
@@ -245,6 +248,7 @@ def validate(args):
mean_ap = evaluator.evaluate()
else:
evaluator.save_predictions(args.results)
+ torch.cuda.synchronize()
dllogger_metric['map'] = mean_ap
dllogger_metric['total_eval_time'] = time.time() - total_time_start
else:
diff --git a/PyTorch/Detection/SSD/Dockerfile b/PyTorch/Detection/SSD/Dockerfile
index baa382bd8..822683b70 100755
--- a/PyTorch/Detection/SSD/Dockerfile
+++ b/PyTorch/Detection/SSD/Dockerfile
@@ -1,20 +1,14 @@
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:21.07-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.10-py3
FROM ${FROM_IMAGE_NAME}
# Set working directory
WORKDIR /workspace/ssd
-# Install nv-cocoapi
-ENV COCOAPI_VERSION=2.0+nv0.6.0
-RUN export COCOAPI_TAG=$(echo ${COCOAPI_VERSION} | sed 's/^.*+n//') \
- && pip install --no-cache-dir pybind11 \
- && pip install --no-cache-dir git+https://github.com/NVIDIA/cocoapi.git@${COCOAPI_TAG}#subdirectory=PythonAPI
-# Install dllogger
-RUN pip install --no-cache-dir git+https://github.com/NVIDIA/dllogger.git#egg=dllogger
+# Copy the model files
+COPY . .
-# Install requirements
-COPY requirements.txt .
-RUN pip install -r requirements.txt
-RUN python3 -m pip install pycocotools==2.0.0
+# Install python requirements
+RUN pip install --no-cache-dir -r requirements.txt
-COPY . .
+ENV CUDNN_V8_API_ENABLED=1
+ENV TORCH_CUDNN_V8_API_ENABLED=1
diff --git a/PyTorch/Detection/SSD/README.md b/PyTorch/Detection/SSD/README.md
index 402616e5d..b3ad035e7 100644
--- a/PyTorch/Detection/SSD/README.md
+++ b/PyTorch/Detection/SSD/README.md
@@ -218,11 +218,11 @@ The following section lists the requirements in order to start training the SSD3
### Requirements
-This repository contains `Dockerfile` which extends the PyTorch 21.05 NGC container
+This repository contains `Dockerfile` which extends the PyTorch 22.10 NGC container
and encapsulates some dependencies. Aside from these dependencies,
ensure you have the following software:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 21.05 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
+* [PyTorch 22.10 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
* GPU-based architecture:
* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
* [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/)
@@ -235,7 +235,7 @@ Documentation:
* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
-For those unable to use the [PyTorch 21.05 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch),
+For those unable to use the [PyTorch 22.10 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch),
to set up the required environment or create your own container,
see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
@@ -475,18 +475,18 @@ to evaluate models on the COCO dataset. We are using these scripts
during validation to measure a model's performance in AP metric.
Metrics below are evaluated using pycocotools’ methodology, in the following format:
```
- Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.250
- Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.423
- Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.257
- Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.076
- Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.269
- Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.399
- Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.237
- Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.342
- Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.358
- Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.118
- Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.394
- Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.548
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.27205
+ Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.45869
+ Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.27884
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.08275
+ Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.29840
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.42722
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.25092
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.36528
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.38262
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.13577
+ Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.42287
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.57277
```
The metric reported in our results is present in the first row.
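+
+The same numbers can be reproduced directly with pycocotools. Below is a minimal sketch of that workflow; the file names are hypothetical placeholders, not files shipped with this repository:
+```
+from pycocotools.coco import COCO
+from pycocotools.cocoeval import COCOeval
+
+coco_gt = COCO("annotations/instances_val2017.json")   # ground-truth annotations
+coco_dt = coco_gt.loadRes("predictions.json")          # detections in the COCO results format
+
+coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
+coco_eval.evaluate()
+coco_eval.accumulate()
+coco_eval.summarize()          # prints the AP/AR table in the format shown above
+mean_ap = coco_eval.stats[0]   # AP @ IoU=0.50:0.95, the metric from the first row
+```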
@@ -542,7 +542,7 @@ The training benchmark was run in various scenarios on A100 80GB and V100 16G GP
To benchmark training, run:
```
-python -m torch.distributed.launch --nproc_per_node={NGPU} \
+torchrun --nproc_per_node={NGPU} \
main.py --batch-size {bs} \
--mode benchmark-training \
--benchmark-warmup 100 \
@@ -583,37 +583,34 @@ The following sections provide details on how we achieved our performance and ac
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the `./examples/SSD300_A100_{FP16,TF32}_{1,4,8}GPU.sh`
-script in the `pytorch-21.05-py3` NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
+script in the `pytorch-22.10-py3` NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
|GPUs |Batch size / GPU|Accuracy - TF32|Accuracy - mixed precision|Time to train - TF32|Time to train - mixed precision|Time to train speedup (TF32 to mixed precision)|
|-----------|----------------|---------------|---------------------------|--------------------|--------------------------------|------------------------------------------------|
-|1 |64 |0.26 |0.26 |07:45:00 |05:09:00 |150.49% |
-|4 |64 |0.26 |0.26 |01:59:00 |01:19:00 |149.52% |
-|8 |64 |0.25 |0.26 |01:02:00 |00:40:00 |155.64% |
-|1 |128 |0.26 |0.26 |07:36:00 |04:57:00 |153.50% |
-|4 |128 |0.26 |0.26 |01:55:00 |01:15:00 |152.92% |
-|8 |128 |0.26 |0.25 |00:58:00 |00:38:00 |151.89% |
-|1 |256 |0.26 |0.26 |07:34:00 |04:53:00 |154.80% |
-|4 |256 |0.25 |0.26 |01:54:00 |01:14:00 |152.98% |
-|8 |256 |0.248 |0.25 |00:57:00 |00:37:00 |151.46% |
+|1 |64 |0.271 |0.272 |03:19:59 |03:18:35 |100% |
+|4 |64 |0.270 |0.270 |00:51:22 |00:51:31 | 99% |
+|8 |64 |0.270 |0.269 |00:26:10 |00:26:10 | 99% |
+|1 |128 |0.274 |0.271 |03:03:56 |03:03:50 |100% |
+|4 |128 |0.272 |0.270 |00:46:51 |00:47:01 | 99% |
+|8 |128 |0.267 |0.267 |00:23:44 |00:23:46 | 99% |
+|1 |256 |0.272 |0.272 |02:56:37 |02:56:44 | 99% |
+|4 |256 |0.271 |0.267 |00:45:05 |00:45:07 | 99% |
+|8 |256 |0.260 |0.258 |00:22:49 |00:22:56 |100% |
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the `./examples/SSD300_FP{16,32}_{1,4,8}GPU.sh`
-script in the `pytorch-21.05-py3` NGC container on NVIDIA DGX-1 with 8x
+script in the `pytorch-22.10-py3` NGC container on NVIDIA DGX-1 with 8x
V100 16GB GPUs.
|GPUs |Batch size / GPU|Accuracy - FP32|Accuracy - mixed precision|Time to train - FP32|Time to train - mixed precision|Time to train speedup (FP32 to mixed precision)|
|-----------|----------------|---------------|---------------------------|--------------------|--------------------------------|------------------------------------------------|
-|1 |32 |0.26 |0.26 |20:14:00 |10:09:00 |199.30% |
-|4 |32 |0.25 |0.25 |05:10:00 |02:40:00 |193.88% |
-|8 |32 |0.26 |0.25 |02:35:00 |01:20:00 |192.24% |
-|1 |64 | |0.26 |09:34:00 | | |
-|4 |64 | |0.26 |02:27:00 | | |
-|8 |64 | |0.26 |01:14:00 | | |
-
-
-
+|1 |32 |0.269 |0.271 |20:04:48 |07:25:27 |270% |
+|4 |32 |0.270 |0.269 |05:08:56 |01:58:41 |260% |
+|8 |32 |0.271 |0.269 |02:35:00 |01:00:27 |256% |
+|1 |64 | |0.272 | |06:47:58 | |
+|4 |64 | |0.270 | |01:46:34 | |
+|8 |64 | |0.269 | |00:53:52 | |
Due to smaller size, mixed precision models can be trained with bigger batches. In such cases mixed precision speedup is calculated versus FP32 training with maximum batch size for that precision
@@ -626,52 +623,51 @@ Here are example graphs of FP32, TF32 and AMP training on 8 GPU configuration:
##### Training stability test
The SSD300 v1.1 model was trained for 65 epochs, starting
-from 15 different initial random seeds. The training was performed in the `pytorch-21.05-py3` NGC container on
+from 15 different initial random seeds. The training was performed in the `pytorch-22.10-py3` NGC container on
NVIDIA DGX A100 8x A100 80GB GPUs with batch size per GPU = 128.
After training, the models were evaluated on the test dataset. The following
table summarizes the final mAP on the test set.
|**Precision**|**Average mAP**|**Standard deviation**|**Minimum**|**Maximum**|**Median**|
|------------:|--------------:|---------------------:|----------:|----------:|---------:|
-| AMP | 0.2514314286 | 0.001498316675 | 0.24456 | 0.25182 | 0.24907 |
-| TF32 | 0.2489106667 | 0.001749463047 | 0.24487 | 0.25148 | 0.24848 |
-
+| AMP | 0.2679503039 | 0.001360494012 | 0.26201 | 0.27013 | 0.26529 |
+| TF32 | 0.2670691823 | 0.001639394102 | 0.26181 | 0.27274 | 0.26492 |
#### Training performance results
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the `main.py` script with the `--mode
-benchmark-training` flag in the `pytorch-21.05-py3` NGC container on NVIDIA
+benchmark-training` flag in the `pytorch-22.10-py3` NGC container on NVIDIA
DGX A100 (8x A100 80GB) GPUs. Performance numbers (in items/images per second)
were averaged over an entire training epoch.
|GPUs |Batch size / GPU|Throughput - TF32|Throughput - mixed precision|Throughput speedup (TF32 - mixed precision)|Weak scaling - TF32 |Weak scaling - mixed precision |
|-----------|----------------|-----------------|-----------------------------|-------------------------------------------|--------------------------------|------------------------------------------------|
-|1 |64 |279.85 |428.30 |153.04% |100% |100% |
-|4 |64 |1095.17 |1660.59 |151.62% |391% |387% |
-|8 |64 |2181.21 |3301.58 |151.36% |779% |770% |
-|1 |128 |286.17 |440.74 |154.01% |100% |100% |
-|4 |128 |1135.02 |1755.94 |154.70% |396% |398% |
-|8 |128 |2264.92 |3510.29 |154.98% |791% |796% |
+|1 |64 | 364.27 | 662.91 |181% |100% |100% |
+|4 |64 |1432.73 |2581.24 |180% |393% |389% |
+|8 |64 |2838.76 |5252.84 |185% |779% |792% |
+|1 |128 | 377.18 | 724.41 |192% |100% |100% |
+|4 |128 |1493.13 |2885.55 |193% |395% |398% |
+|8 |128 |2967.23 |5733.98 |193% |786% |791% |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the `main.py` script with the `--mode
-benchmark-training` flag in the `pytorch-21.05-py3` NGC container on NVIDIA
+benchmark-training` flag in the `pytorch-22.10-py3` NGC container on NVIDIA
DGX-1 with 8x V100 16GB GPUs. Performance numbers (in items/images per second)
were averaged over an entire training epoch.
|GPUs |Batch size / GPU|Throughput - FP32|Throughput - mixed precision|Throughput speedup (FP32 - mixed precision)|Weak scaling - FP32 |Weak scaling - mixed precision |
|-----------|----------------|-----------------|-----------------------------|-------------------------------------------|--------------------------------|------------------------------------------------|
-|1 |32 |108.27 |212.95 |196.68% |100% |100% |
-|4 |32 |425.07 |826.38 |194.41% |392% |388% |
-|8 |32 |846.58 |1610.82 |190.27% |781% |756% |
-|1 |64 | |227.69 | | |100% |
-|4 |64 | |891.27 | | |391% |
-|8 |64 | |1770.09 | | |777% |
+|1 |32 |107.22 | 296.80 |276% |100% |100% |
+|4 |32 |419.54 |1115.59 |265% |391% |375% |
+|8 |32 |840.35 |2153.96 |256% |783% |725% |
+|1 |64 | | 322.81 | | |100% |
+|4 |64 | |1238.27 | | |383% |
+|8 |64 | |2520.50 | | |780% |
Due to smaller size, mixed precision models can be trained with bigger batches. In such cases mixed precision speedup is calculated versus FP32 training with maximum batch size for that precision
@@ -682,35 +678,35 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
##### Inference performance: NVIDIA DGX A100 (1x A100 80GB)
Our results were obtained by running the `main.py` script with `--mode
-benchmark-inference` flag in the pytorch-21.05-py3 NGC container on NVIDIA
+benchmark-inference` flag in the pytorch-22.10-py3 NGC container on NVIDIA
DGX A100 (1x A100 80GB) GPU.
|Batch size |Throughput - TF32|Throughput - mixed precision|Throughput speedup (TF32 - mixed precision)|Weak scaling - TF32 |Weak scaling - mixed precision |
|-----------|-----------------|-----------------------------|-------------------------------------------|--------------------|--------------------------------|
-|1 |105.53 | 90.62 | 85% |100% | 100% |
-|2 |197.77 | 168.41 | 85% |187% | 185% |
-|4 |332.10 | 323.68 | 97% |314% | 357% |
-|8 |526.12 | 523.96 | 99% |498% | 578% |
-|16 |634.50 | 816.91 |128% |601% | 901% |
-|32 |715.35 | 956.91 |133% |677% |1055% |
-|64 |752.57 |1053.39 |139% |713% |1162% |
+|1 |158.83 | 142.67 | 89% |100% |100% |
+|2 |308.31 | 261.21 | 84% |194% |183% |
+|4 |481.69 | 454.95 | 94% |303% |318% |
+|8 |597.72 | 742.05 |124% |376% |520% |
+|16 |590.44 | 887.01 |150% |371% |621% |
+|32 |708.97 | 970.27 |136% |446% |680% |
+|64 |798.16 |1057.51 |132% |502% |741% |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
Our results were obtained by running the `main.py` script with `--mode
-benchmark-inference` flag in the pytorch-21.05-py3 NGC container on NVIDIA
+benchmark-inference` flag in the pytorch-22.10-py3 NGC container on NVIDIA
DGX-1 with (1x V100 16GB) GPU.
|Batch size |Throughput - FP32|Throughput - mixed precision|Throughput speedup (FP32 - mixed precision)|Weak scaling - FP32 |Weak scaling - mixed precision |
|-----------|-----------------|-----------------------------|-------------------------------------------|--------------------|--------------------------------|
-|1 | 75.05 | 57.03 | 75% |100% |100% |
-|2 |138.39 |117.12 | 84% |184% |205% |
-|4 |190.74 |185.38 | 97% |254% |325% |
-|8 |237.34 |368.48 |155% |316% |646% |
-|16 |285.32 |504.77 |176% |380% |885% |
-|32 |306.22 |548.87 |179% |408% |962% |
+|1 | 93.21 | 84.59 | 90% |100% |100% |
+|2 |148.61 |165.30 |111% |159% |195% |
+|4 |206.82 |304.77 |147% |221% |360% |
+|8 |242.55 |447.25 |184% |260% |528% |
+|16 |292.44 |541.05 |185% |313% |639% |
+|32 |311.61 |605.30 |194% |334% |715% |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
@@ -718,6 +714,32 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
### Changelog
+October 2022
+ * upgrade the PyTorch container to 22.10
+ * switched to using torchvision IMAGENET1K_V2 backbone weights
+ * added a flag to select the torchvision weights version
+ * added a flag to control TF32 computations
+ * fixed various deprecation warnings
+ * set `TORCH_CUDNN_V8_API_ENABLED` environment variable which replaces `CUDNN_V8_API_ENABLED` from older containers
+ * updated [nv-cocoapi](https://github.com/NVIDIA/cocoapi/) from 0.6.0 to 0.7.3
+ * updated python dependencies
+
+June 2022
+ * upgrade the PyTorch container to 22.05
+ * fixed DALI deprecation warnings
+
+January 2022
+ * upgrade the PyTorch container to 22.01
+ * made AMP the default data precision
+ * added --data-layout option (channels_first is the recommended layout with --no-amp)
+ * updated README with new performance numbers
+
+November 2021
+ * upgrade the PyTorch container to 21.11
+ * switched data layout from NCHW (channels first) to NHWC (channels last)
+ * replaced `torch.distributed.launch` with `torchrun`
+ * updated README with new performance numbers
+
May 2021
* upgrade the PyTorch container to 21.05
* replaced APEX AMP with native PyTorch AMP
diff --git a/PyTorch/Detection/SSD/examples/SSD300_A100_FP16_1GPU.sh b/PyTorch/Detection/SSD/examples/SSD300_A100_FP16_1GPU.sh
index 1754a4aa5..3b880fc3f 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_A100_FP16_1GPU.sh
+++ b/PyTorch/Detection/SSD/examples/SSD300_A100_FP16_1GPU.sh
@@ -1,4 +1,4 @@
# This script launches SSD300 training in FP16 on 1 GPU using 256 batch size
# Usage bash SSD300_FP16_1GPU.sh
-python $1/main.py --backbone resnet50 --warmup 300 --bs 256 --amp --data $2 ${@:3}
+python $1/main.py --backbone resnet50 --warmup 300 --bs 256 --data $2 ${@:3}
diff --git a/PyTorch/Detection/SSD/examples/SSD300_A100_FP16_4GPU.sh b/PyTorch/Detection/SSD/examples/SSD300_A100_FP16_4GPU.sh
index 1aa66e10b..23580ed3d 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_A100_FP16_4GPU.sh
+++ b/PyTorch/Detection/SSD/examples/SSD300_A100_FP16_4GPU.sh
@@ -1,4 +1,4 @@
# This script launches SSD300 training in FP16 on 4 GPUs using 1024 batch size (256 per GPU)
# Usage ./SSD300_FP16_4GPU.sh
-python -m torch.distributed.launch --nproc_per_node=4 $1/main.py --backbone resnet50 --learning-rate 2.7e-3 --warmup 1200 --bs 256 --amp --data $2 ${@:3}
+torchrun --nproc_per_node=4 $1/main.py --backbone resnet50 --learning-rate 2.7e-3 --warmup 1200 --bs 256 --data $2 ${@:3}
diff --git a/PyTorch/Detection/SSD/examples/SSD300_A100_FP16_8GPU.sh b/PyTorch/Detection/SSD/examples/SSD300_A100_FP16_8GPU.sh
index 2857d0943..95007f6a9 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_A100_FP16_8GPU.sh
+++ b/PyTorch/Detection/SSD/examples/SSD300_A100_FP16_8GPU.sh
@@ -1,4 +1,4 @@
# This script launches SSD300 training in FP16 on 8 GPUs using 1024 batch size (128 per GPU)
# Usage ./SSD300_FP16_8GPU.sh
-python -m torch.distributed.launch --nproc_per_node=8 $1/main.py --backbone resnet50 --learning-rate 2.7e-3 --warmup 1200 --bs 128 --amp --data $2 ${@:3}
+torchrun --nproc_per_node=8 $1/main.py --backbone resnet50 --learning-rate 2.7e-3 --warmup 1200 --bs 128 --data $2 ${@:3}
diff --git a/PyTorch/Detection/SSD/examples/SSD300_A100_FP32_8GPU.sh b/PyTorch/Detection/SSD/examples/SSD300_A100_FP32_8GPU.sh
index 72c8e438f..eb455cab5 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_A100_FP32_8GPU.sh
+++ b/PyTorch/Detection/SSD/examples/SSD300_A100_FP32_8GPU.sh
@@ -1,4 +1,4 @@
# This script launches SSD300 training in FP32 on 8 GPUs using 1024 batch size (128 per GPU)
# Usage ./SSD300_FP32_8GPU.sh
-python -m torch.distributed.launch --nproc_per_node=8 $1/main.py --backbone resnet50 --learning-rate 2.7e-3 --warmup 1200 --bs 128 --data $2 ${@:3}
+torchrun --nproc_per_node=8 $1/main.py --backbone resnet50 --learning-rate 2.7e-3 --warmup 1200 --bs 128 --no-amp --data $2 ${@:3}
diff --git a/PyTorch/Detection/SSD/examples/SSD300_FP16_1GPU.sh b/PyTorch/Detection/SSD/examples/SSD300_FP16_1GPU.sh
index b2b4b9859..64037b569 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_FP16_1GPU.sh
+++ b/PyTorch/Detection/SSD/examples/SSD300_FP16_1GPU.sh
@@ -1,4 +1,4 @@
# This script launches SSD300 training in FP16 on 1 GPU using 64 batch size
# Usage bash SSD300_FP16_1GPU.sh
-python $1/main.py --backbone resnet50 --warmup 300 --bs 64 --amp --data $2 ${@:3}
+python $1/main.py --backbone resnet50 --warmup 300 --bs 64 --data $2 ${@:3}
diff --git a/PyTorch/Detection/SSD/examples/SSD300_FP16_4GPU.sh b/PyTorch/Detection/SSD/examples/SSD300_FP16_4GPU.sh
index f015bf3c2..dc1b40070 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_FP16_4GPU.sh
+++ b/PyTorch/Detection/SSD/examples/SSD300_FP16_4GPU.sh
@@ -1,4 +1,4 @@
# This script launches SSD300 training in FP16 on 4 GPUs using 256 batch size (64 per GPU)
# Usage ./SSD300_FP16_4GPU.sh
-python -m torch.distributed.launch --nproc_per_node=4 $1/main.py --backbone resnet50 --warmup 300 --bs 64 --amp --data $2 ${@:3}
+torchrun --nproc_per_node=4 $1/main.py --backbone resnet50 --warmup 300 --bs 64 --data $2 ${@:3}
diff --git a/PyTorch/Detection/SSD/examples/SSD300_FP16_8GPU.sh b/PyTorch/Detection/SSD/examples/SSD300_FP16_8GPU.sh
index 4434e8e3c..d62e60012 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_FP16_8GPU.sh
+++ b/PyTorch/Detection/SSD/examples/SSD300_FP16_8GPU.sh
@@ -1,4 +1,4 @@
# This script launches SSD300 training in FP16 on 8 GPUs using 512 batch size (64 per GPU)
# Usage ./SSD300_FP16_8GPU.sh
-python -m torch.distributed.launch --nproc_per_node=8 $1/main.py --backbone resnet50 --warmup 300 --bs 64 --amp --data $2 ${@:3}
+torchrun --nproc_per_node=8 $1/main.py --backbone resnet50 --warmup 300 --bs 64 --data $2 ${@:3}
diff --git a/PyTorch/Detection/SSD/examples/SSD300_FP16_EVAL.sh b/PyTorch/Detection/SSD/examples/SSD300_FP16_EVAL.sh
index 96adfbf50..1b233942c 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_FP16_EVAL.sh
+++ b/PyTorch/Detection/SSD/examples/SSD300_FP16_EVAL.sh
@@ -1,4 +1,4 @@
# This script evaluates SSD300 model in FP16 using 32 batch size on 1 GPU
# Usage: ./SSD300_FP16_EVAL.sh
-python $1/main.py --backbone resnet50 --amp --ebs 32 --data $2 --mode evaluation --checkpoint $3 ${@:4}
+python $1/main.py --backbone resnet50 --ebs 32 --data $2 --mode evaluation --checkpoint $3 ${@:4}
diff --git a/PyTorch/Detection/SSD/examples/SSD300_FP16_INFERENCE_BENCHMARK.sh b/PyTorch/Detection/SSD/examples/SSD300_FP16_INFERENCE_BENCHMARK.sh
index c26b80072..75fe322a8 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_FP16_INFERENCE_BENCHMARK.sh
+++ b/PyTorch/Detection/SSD/examples/SSD300_FP16_INFERENCE_BENCHMARK.sh
@@ -1,4 +1,4 @@
# This script launches SSD300 inference benchmark in FP16 on 1 GPU with 64 batch size
# Usage bash SSD300_FP16_INFERENCE_BENCHMARK.sh
-python $1/main.py --backbone resnet50 --mode benchmark-inference --bs 64 --amp --data $2 ${@:3}
+python $1/main.py --backbone resnet50 --mode benchmark-inference --bs 64 --data $2 ${@:3}
diff --git a/PyTorch/Detection/SSD/examples/SSD300_FP32_1GPU.sh b/PyTorch/Detection/SSD/examples/SSD300_FP32_1GPU.sh
index ea5240a76..d7e148dfd 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_FP32_1GPU.sh
+++ b/PyTorch/Detection/SSD/examples/SSD300_FP32_1GPU.sh
@@ -1,4 +1,4 @@
# This script launches SSD300 training in FP32 on 1 GPU using 32 batch size
# Usage ./SSD300_FP32_1GPU.sh
-python $1/main.py --backbone resnet50 --bs 32 --warmup 300 --data $2 ${@:3}
+python $1/main.py --backbone resnet50 --bs 32 --warmup 300 --no-amp --data-layout channels_first --data $2 ${@:3}
diff --git a/PyTorch/Detection/SSD/examples/SSD300_FP32_4GPU.sh b/PyTorch/Detection/SSD/examples/SSD300_FP32_4GPU.sh
index 29159557a..96b6a92bd 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_FP32_4GPU.sh
+++ b/PyTorch/Detection/SSD/examples/SSD300_FP32_4GPU.sh
@@ -1,4 +1,4 @@
# This script launches SSD300 training in FP32 on 4 GPUs using 128 batch size (32 per GPU)
# Usage ./SSD300_FP32_4GPU.sh
-python -m torch.distributed.launch --nproc_per_node=4 $1/main.py --backbone resnet50 --warmup 300 --bs 32 --data $2 ${@:3}
+torchrun --nproc_per_node=4 $1/main.py --backbone resnet50 --warmup 300 --bs 32 --no-amp --data-layout channels_first --data $2 ${@:3}
diff --git a/PyTorch/Detection/SSD/examples/SSD300_FP32_8GPU.sh b/PyTorch/Detection/SSD/examples/SSD300_FP32_8GPU.sh
index 441efd9d0..b880359c9 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_FP32_8GPU.sh
+++ b/PyTorch/Detection/SSD/examples/SSD300_FP32_8GPU.sh
@@ -1,4 +1,4 @@
# This script launches SSD300 training in FP32 on 8 GPUs using 256 batch size (32 per GPU)
# Usage ./SSD300_FP32_8GPU.sh
-python -m torch.distributed.launch --nproc_per_node=8 $1/main.py --backbone resnet50 --warmup 300 --bs 32 --data $2 ${@:3}
+torchrun --nproc_per_node=8 $1/main.py --backbone resnet50 --warmup 300 --bs 32 --no-amp --data-layout channels_first --data $2 ${@:3}
diff --git a/PyTorch/Detection/SSD/examples/SSD300_FP32_EVAL.sh b/PyTorch/Detection/SSD/examples/SSD300_FP32_EVAL.sh
index b3179c1ed..cd387f777 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_FP32_EVAL.sh
+++ b/PyTorch/Detection/SSD/examples/SSD300_FP32_EVAL.sh
@@ -1,4 +1,4 @@
# This script evaluates SSD300 model in FP32 using 32 batch size on 1 GPU
# Usage: ./SSD300_FP32_EVAL.sh
-python $1/main.py --backbone resnet50 --ebs 32 --data $2 --mode evaluation --checkpoint $3 ${@:4}
+python $1/main.py --backbone resnet50 --ebs 32 --data $2 --mode evaluation --no-amp --data-layout channels_first --checkpoint $3 ${@:4}
diff --git a/PyTorch/Detection/SSD/examples/SSD300_FP32_INFERENCE_BENCHMARK.sh b/PyTorch/Detection/SSD/examples/SSD300_FP32_INFERENCE_BENCHMARK.sh
index e7c0fa864..8f46338b4 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_FP32_INFERENCE_BENCHMARK.sh
+++ b/PyTorch/Detection/SSD/examples/SSD300_FP32_INFERENCE_BENCHMARK.sh
@@ -1,4 +1,4 @@
# This script launches SSD300 inference benchmark in FP32 on 1 GPU with 64 batch size
# Usage bash SSD300_FP32_INFERENCE_BENCHMARK.sh
-python $1/main.py --backbone resnet50 --warmup 300 --mode benchmark-inference --bs 32 --data $2 ${@:3}
+python $1/main.py --backbone resnet50 --warmup 300 --mode benchmark-inference --bs 32 --no-amp --data-layout channels_first --data $2 ${@:3}
diff --git a/PyTorch/Detection/SSD/examples/SSD300_inference.py b/PyTorch/Detection/SSD/examples/SSD300_inference.py
index bc5b20d9c..8681b423f 100644
--- a/PyTorch/Detection/SSD/examples/SSD300_inference.py
+++ b/PyTorch/Detection/SSD/examples/SSD300_inference.py
@@ -28,7 +28,7 @@ def load_checkpoint(model, model_file):
def build_predictor(model_file, backbone='resnet50'):
- ssd300 = SSD300(backbone=ResNet(backbone))
+ ssd300 = SSD300(backbone=ResNet(backbone=backbone))
load_checkpoint(ssd300, model_file)
return ssd300
diff --git a/PyTorch/Detection/SSD/main.py b/PyTorch/Detection/SSD/main.py
index c0c4db41b..4c3fc3e69 100644
--- a/PyTorch/Detection/SSD/main.py
+++ b/PyTorch/Detection/SSD/main.py
@@ -67,6 +67,9 @@ def make_parser():
help='manually set random seed for torch')
parser.add_argument('--checkpoint', type=str, default=None,
help='path to model checkpoint file')
+ parser.add_argument('--torchvision-weights-version', type=str, default="IMAGENET1K_V2",
+ choices=['IMAGENET1K_V1', 'IMAGENET1K_V2', 'DEFAULT'],
+ help='The torchvision weights version to use when --checkpoint is not specified')
parser.add_argument('--save', type=str, default=None,
help='save model checkpoints in the specified directory')
parser.add_argument('--mode', type=str, default='training',
@@ -97,9 +100,19 @@ def make_parser():
' backbone model declared with the --backbone argument.'
' When it is not provided, pretrained model from torchvision'
' will be downloaded.')
- parser.add_argument('--num-workers', type=int, default=4)
- parser.add_argument('--amp', action='http://github.com/store_true',
- help='Whether to enable AMP ops. When false, uses TF32 on A100 and FP32 on V100 GPUS.')
+ parser.add_argument('--num-workers', type=int, default=8)
+ parser.add_argument("--amp", dest='amp', action="http://github.com/store_true",
+ help="Enable Automatic Mixed Precision (AMP).")
+ parser.add_argument("--no-amp", dest='amp', action="http://github.com/store_false",
+ help="Disable Automatic Mixed Precision (AMP).")
+ parser.set_defaults(amp=True)
+ parser.add_argument("--allow-tf32", dest='allow_tf32', action="http://github.com/store_true",
+ help="Allow TF32 computations on supported GPUs.")
+ parser.add_argument("--no-allow-tf32", dest='allow_tf32', action="http://github.com/store_false",
+ help="Disable TF32 computations.")
+ parser.set_defaults(allow_tf32=True)
+ parser.add_argument('--data-layout', default="channels_last", choices=['channels_first', 'channels_last'],
+ help="Model data layout. It's recommended to use channels_first with --no-amp")
parser.add_argument('--log-interval', type=int, default=20,
help='Logging interval.')
parser.add_argument('--json-summary', type=str, default=None,
@@ -150,7 +163,9 @@ def train(train_loop_func, logger, args):
val_dataset = get_val_dataset(args)
val_dataloader = get_val_dataloader(val_dataset, args)
- ssd300 = SSD300(backbone=ResNet(args.backbone, args.backbone_path))
+ ssd300 = SSD300(backbone=ResNet(backbone=args.backbone,
+ backbone_path=args.backbone_path,
+ weights=args.torchvision_weights_version))
args.learning_rate = args.learning_rate * args.N_gpu * (args.batch_size / 32)
start_epoch = 0
iteration = 0
@@ -223,6 +238,7 @@ def train(train_loop_func, logger, args):
obj['model'] = ssd300.module.state_dict()
else:
obj['model'] = ssd300.state_dict()
+ os.makedirs(args.save, exist_ok=True)
save_path = os.path.join(args.save, f'epoch_{epoch}.pt')
torch.save(obj, save_path)
logger.log('model path', save_path)
@@ -261,6 +277,8 @@ def log_params(logger, args):
if args.local_rank == 0:
os.makedirs('./models', exist_ok=True)
+ torch.backends.cuda.matmul.allow_tf32 = args.allow_tf32
+ torch.backends.cudnn.allow_tf32 = args.allow_tf32
torch.backends.cudnn.benchmark = True
# write json only on the main thread
diff --git a/PyTorch/Detection/SSD/requirements.txt b/PyTorch/Detection/SSD/requirements.txt
index db0e31dff..636a76589 100644
--- a/PyTorch/Detection/SSD/requirements.txt
+++ b/PyTorch/Detection/SSD/requirements.txt
@@ -1,3 +1,6 @@
-Cython>=0.28.4
-scikit-image>=0.15.0
-ujson>=4.0.2
+Cython>=0.29.32
+scikit-image>=0.19.3
+ujson>=5.5.0
+pybind11>=2.10.0
+git+https://github.com/NVIDIA/cocoapi.git@v0.7.3#subdirectory=PythonAPI
+git+https://github.com/NVIDIA/dllogger.git#egg=dllogger
diff --git a/PyTorch/Detection/SSD/ssd/coco_pipeline.py b/PyTorch/Detection/SSD/ssd/coco_pipeline.py
index 3e2865b44..88a844422 100644
--- a/PyTorch/Detection/SSD/ssd/coco_pipeline.py
+++ b/PyTorch/Detection/SSD/ssd/coco_pipeline.py
@@ -21,6 +21,7 @@
# DALI imports
import nvidia.dali as dali
from nvidia.dali.pipeline import Pipeline
+from nvidia.dali.types import to_numpy_type
class COCOPipeline(Pipeline):
@@ -124,14 +125,14 @@ def define_graph(self):
return (images, bboxes.gpu(), labels.gpu())
to_torch_type = {
- np.dtype(np.float32) : torch.float32,
- np.dtype(np.float64) : torch.float64,
- np.dtype(np.float16) : torch.float16,
- np.dtype(np.uint8) : torch.uint8,
- np.dtype(np.int8) : torch.int8,
- np.dtype(np.int16) : torch.int16,
- np.dtype(np.int32) : torch.int32,
- np.dtype(np.int64) : torch.int64
+ np.float32 : torch.float32,
+ np.float64 : torch.float64,
+ np.float16 : torch.float16,
+ np.uint8 : torch.uint8,
+ np.int8 : torch.int8,
+ np.int16 : torch.int16,
+ np.int32 : torch.int32,
+ np.int64 : torch.int64
}
def feed_ndarray(dali_tensor, arr):
@@ -242,9 +243,9 @@ def __next__(self):
labels_shape[j].append(lshape)
        # We always need to allocate new memory as bboxes and labels vary in shape
- images_torch_type = to_torch_type[np.dtype(images[0].dtype())]
- bboxes_torch_type = to_torch_type[np.dtype(bboxes[0][0].dtype())]
- labels_torch_type = to_torch_type[np.dtype(labels[0][0].dtype())]
+ images_torch_type = to_torch_type[to_numpy_type(images[0].dtype)]
+ bboxes_torch_type = to_torch_type[to_numpy_type(bboxes[0][0].dtype)]
+ labels_torch_type = to_torch_type[to_numpy_type(labels[0][0].dtype)]
torch_gpu_device = torch.device('cuda', dev_id)
torch_cpu_device = torch.device('cpu')
diff --git a/PyTorch/Detection/SSD/ssd/evaluate.py b/PyTorch/Detection/SSD/ssd/evaluate.py
index 20ede8842..e96df0aaf 100644
--- a/PyTorch/Detection/SSD/ssd/evaluate.py
+++ b/PyTorch/Detection/SSD/ssd/evaluate.py
@@ -52,10 +52,8 @@ def evaluate(model, coco, cocoGt, encoder, inv_map, args):
try:
result = encoder.decode_batch(ploc_i, plabel_i, 0.50, 200)[0]
- except:
- # raise
- print("")
- print("No object detected in idx: {}".format(idx))
+ except Exception as e:
+ print("Skipping idx {}, failed to decode with message {}, Skipping.".format(idx, e))
continue
htot, wtot = img_size[0][idx].item(), img_size[1][idx].item()
diff --git a/PyTorch/Detection/SSD/ssd/model.py b/PyTorch/Detection/SSD/ssd/model.py
index 3da96f486..18a269d83 100644
--- a/PyTorch/Detection/SSD/ssd/model.py
+++ b/PyTorch/Detection/SSD/ssd/model.py
@@ -18,22 +18,22 @@
class ResNet(nn.Module):
- def __init__(self, backbone='resnet50', backbone_path=None):
+ def __init__(self, backbone='resnet50', backbone_path=None, weights="IMAGENET1K_V1"):
super().__init__()
if backbone == 'resnet18':
- backbone = resnet18(pretrained=not backbone_path)
+ backbone = resnet18(weights=None if backbone_path else weights)
self.out_channels = [256, 512, 512, 256, 256, 128]
elif backbone == 'resnet34':
- backbone = resnet34(pretrained=not backbone_path)
+ backbone = resnet34(weights=None if backbone_path else weights)
self.out_channels = [256, 512, 512, 256, 256, 256]
elif backbone == 'resnet50':
- backbone = resnet50(pretrained=not backbone_path)
+ backbone = resnet50(weights=None if backbone_path else weights)
self.out_channels = [1024, 512, 512, 256, 256, 256]
elif backbone == 'resnet101':
- backbone = resnet101(pretrained=not backbone_path)
+ backbone = resnet101(weights=None if backbone_path else weights)
self.out_channels = [1024, 512, 512, 256, 256, 256]
else: # backbone == 'resnet152':
- backbone = resnet152(pretrained=not backbone_path)
+ backbone = resnet152(weights=None if backbone_path else weights)
self.out_channels = [1024, 512, 512, 256, 256, 256]
if backbone_path:
backbone.load_state_dict(torch.load(backbone_path))
@@ -108,7 +108,7 @@ def _init_weights(self):
def bbox_view(self, src, loc, conf):
ret = []
for s, l, c in zip(src, loc, conf):
- ret.append((l(s).view(s.size(0), 4, -1), c(s).view(s.size(0), self.label_num, -1)))
+ ret.append((l(s).reshape(s.size(0), 4, -1), c(s).reshape(s.size(0), self.label_num, -1)))
locs, confs = list(zip(*ret))
locs, confs = torch.cat(locs, 2).contiguous(), torch.cat(confs, 2).contiguous()
diff --git a/PyTorch/Detection/SSD/ssd/train.py b/PyTorch/Detection/SSD/ssd/train.py
index 011f8210c..fa258f5a5 100644
--- a/PyTorch/Detection/SSD/ssd/train.py
+++ b/PyTorch/Detection/SSD/ssd/train.py
@@ -44,6 +44,8 @@ def train_loop(model, loss_func, scaler, epoch, optim, train_dataloader, val_dat
label = label.view(N, M)
with torch.cuda.amp.autocast(enabled=args.amp):
+ if args.data_layout == 'channels_last':
+ img = img.to(memory_format=torch.channels_last)
ploc, plabel = model(img)
ploc, plabel = ploc.float(), plabel.float()
@@ -101,6 +103,8 @@ def benchmark_train_loop(model, loss_func, scaler, epoch, optim, train_dataloade
label = label.view(N, M)
with torch.cuda.amp.autocast(enabled=args.amp):
+ if args.data_layout == 'channels_last':
+ img = img.to(memory_format=torch.channels_last)
ploc, plabel = model(img)
ploc, plabel = ploc.float(), plabel.float()
diff --git a/PyTorch/Detection/SSD/ssd/utils.py b/PyTorch/Detection/SSD/ssd/utils.py
index ab88bff88..27c2dd1c2 100644
--- a/PyTorch/Detection/SSD/ssd/utils.py
+++ b/PyTorch/Detection/SSD/ssd/utils.py
@@ -217,7 +217,7 @@ def decode_single(self, bboxes_in, scores_in, criteria, max_output, max_num=200)
_, max_ids = scores_out.sort(dim=0)
- max_ids = max_ids[-max_output:]
+ max_ids = max_ids[-max_output:].to("cpu")
return bboxes_out[max_ids, :], labels_out[max_ids], scores_out[max_ids]
diff --git a/PyTorch/DrugDiscovery/MoFlow/Dockerfile b/PyTorch/DrugDiscovery/MoFlow/Dockerfile
new file mode 100644
index 000000000..a95eef054
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/Dockerfile
@@ -0,0 +1,29 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.11-py3
+FROM ${FROM_IMAGE_NAME}
+
+WORKDIR /workspace/
+
+RUN python3 -m pip install --upgrade pip
+RUN python3 -m pip install git+https://github.com/NVIDIA/dllogger@v1.0.0#egg=dllogger
+
+RUN python3 -m pip install rdkit-pypi
+
+ARG WORKSPACE=/workspace/moflow_pyt
+WORKDIR ${WORKSPACE}
+ADD . ${WORKSPACE}
+RUN python3 -m pip install .
diff --git a/PyTorch/DrugDiscovery/MoFlow/LICENSE b/PyTorch/DrugDiscovery/MoFlow/LICENSE
new file mode 100644
index 000000000..86538fa63
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/LICENSE
@@ -0,0 +1,202 @@
+Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright 2022 NVIDIA Corporation
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
\ No newline at end of file
diff --git a/PyTorch/DrugDiscovery/MoFlow/NOTICE b/PyTorch/DrugDiscovery/MoFlow/NOTICE
new file mode 100644
index 000000000..f4561f45c
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/NOTICE
@@ -0,0 +1,3 @@
+MoFlow PyTorch
+
+This repository includes software from https://github.com/calvin-zcx/moflow licensed under the MIT License.
diff --git a/PyTorch/DrugDiscovery/MoFlow/README.md b/PyTorch/DrugDiscovery/MoFlow/README.md
new file mode 100644
index 000000000..94e5072f8
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/README.md
@@ -0,0 +1,580 @@
+# MoFlow For PyTorch
+
+This repository provides a script and recipe to train the MoFlow model to achieve state-of-the-art accuracy. The content of this repository is tested and maintained by NVIDIA.
+
+## Table Of Contents
+
+- [Model overview](#model-overview)
+ * [Model architecture](#model-architecture)
+ * [Default configuration](#default-configuration)
+ * [Feature support matrix](#feature-support-matrix)
+ * [Features](#features)
+ * [Mixed precision training](#mixed-precision-training)
+ * [Enabling mixed precision](#enabling-mixed-precision)
+ * [Enabling TF32](#enabling-tf32)
+ * [Glossary](#glossary)
+- [Setup](#setup)
+ * [Requirements](#requirements)
+- [Quick Start Guide](#quick-start-guide)
+- [Advanced](#advanced)
+ * [Scripts and sample code](#scripts-and-sample-code)
+ * [Parameters](#parameters)
+ * [Command-line options](#command-line-options)
+ * [Getting the data](#getting-the-data)
+ * [Dataset guidelines](#dataset-guidelines)
+ * [Multi-dataset](#multi-dataset)
+ * [Training process](#training-process)
+ * [Inference process](#inference-process)
+- [Performance](#performance)
+ * [Benchmarking](#benchmarking)
+ * [Training performance benchmark](#training-performance-benchmark)
+ * [Inference performance benchmark](#inference-performance-benchmark)
+ * [Results](#results)
+ * [Training accuracy results](#training-accuracy-results)
+ * [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
+ * [Training stability test](#training-stability-test)
+ * [Training performance results](#training-performance-results)
+ * [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
+ * [Inference performance results](#inference-performance-results)
+ * [Inference performance: NVIDIA DGX A100 (1x A100 80GB)](#inference-performance-nvidia-dgx-a100-1x-a100-80gb)
+- [Release notes](#release-notes)
+ * [Changelog](#changelog)
+ * [Known issues](#known-issues)
+
+
+
+## Model overview
+
+MoFlow is a model for molecule generation that leverages Normalizing Flows.
+Normalizing flows are a class of generative neural networks that directly model the probability density of the data. They consist of a sequence of invertible transformations that convert input data following some hard-to-model distribution into a latent code that follows a normal distribution, which can then easily be used for sampling.
+
+MoFlow was first introduced by Chengxi Zang et al. in their paper titled "MoFlow: An Invertible Flow Model for Generating Molecular Graphs" ([link](https://arxiv.org/pdf/2006.10137.pdf)).
+
+The model enables you to generate novel molecules that have similar properties to your training data.
+In the case of [ZINC dataset](https://zinc.docking.org/), which is used in this example, it allows you to navigate the chemical space of drug-like molecules and facilitate de-novo drug design.
+The differences between this version and the [original implementation](https://github.com/calvin-zcx/moflow) accompanying the paper are as follows:
+* Loss calculation was separated from the neural network
+* ActNorm layers were refactored and their initialization was moved outside of the forward pass
+* Numerical stability of the training was improved by introducing gradient clipping
+* Numerically-stable formulas for 1/sigmoid(x) and log(sigmoid(x)) were used in AffineCoupling and GraphAffineCoupling layers
+* Network and data configurations were untangled to allow for more flexibility
+* Linear transformations for node features were implemented using native Linear layers instead of custom GraphLinear layers
+* Rescaled adjacency matrix was removed as it did not provide any benefit for the training
+* Data pre-processing and loading were refactored
+* Support for data-parallel multi-GPU training was added
+* Option to capture CUDA graphs was added
+* Execution of the bond and atom models was put into two parallel CUDA streams
+* Option to compile model to TorchScript format was added
+* Support for Automatic Mixed Precision training and inference was added
+* FusedAdam optimizer from [Apex](https://github.com/NVIDIA/apex) was used instead of Adam
+* Training parameters were tuned to achieve better generation quality
+
+This model is trained with mixed precision using Tensor Cores on the NVIDIA Ampere GPU architectures. Therefore, researchers can get results up to 1.43x faster than training with full precision while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
+
+### Model architecture
+
+
+[Chengxi Zang and Fei Wang. 2020. MoFlow: An Invertible Flow Model for Generating Molecular Graphs. In Proceedings of the 26th ACM SIGKDD](https://arxiv.org/pdf/2006.10137.pdf)
+
+
+The MoFlow model consists of two parts.
+The first part, Glow, processes edges to convert an adjacency matrix into a latent vector Z_B.
+The second part, Graph Conditional Flow, processes nodes in the context of edges to produce conditional latent vector Z_{A|B}.
+Each part is a normalizing flow—a chain of invertible transformations with learnable parameters, which provide the ability to learn the distribution of the data.
+
+### Default configuration
+The MoFlow model is built out of Normalizing Flows. It consists of two parts: Glow for processing edges and Graph Conditional Flow for processing nodes in the context of edges.
+
+
+The following features were implemented in this model:
+* Data-parallel multi-GPU training (DDP)
+* Mixed precision training (autocast, gradient scaling)
+* Just-in-time compilation
+* Resumable training
+* CUDA graphs capture
+
+The following performance optimizations were implemented in this model:
+- A series of matrix manipulations in the GraphConv layer was replaced with a single torch.einsum (a minimal sketch of this pattern is shown after this list)
+- Tensors are created on the device with the desired dtype whenever possible
+- Channels-last memory format was used for Glow
+- Stream concurrency was introduced to allow executing Glow and Graph Conditional Flow at the same time. The concurrency happens in both the forward and backward passes, and it hides the runtime of the smaller sub-model. The performance improvement is most prominent for small batch sizes.
+- The number of nodes in the graph is now independent of the maximum number of atoms in the dataset. This provides more flexibility and allows the use of shapes divisible by eight for better Tensor Core usage.
+- FusedAdam optimizer is used instead of native Adam.
+- Normalization of the adjacency matrix was removed, as it did not benefit the training and required additional computation.
+
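+For illustration, the sketch below shows the general einsum pattern that fuses such per-relation matrix products; the tensor names and shapes are hypothetical and do not correspond exactly to the MoFlow implementation:
+
+```
+import torch
+
+# Hypothetical shapes: B graphs, R bond types, N nodes, C feature channels
+B, R, N, C = 2, 4, 9, 64
+adj = torch.randn(B, R, N, N)   # adjacency matrix per bond type
+h = torch.randn(B, N, C)        # node features
+
+# One einsum call replaces a Python loop of per-relation matmuls followed by stack/sum
+out = torch.einsum('brij,bjc->bric', adj, h)   # shape (B, R, N, C)
+```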
+
+### Feature support matrix
+
+This model supports the following features:
+
+| Feature | MoFlow
+|-----------------------|--------------------------
+|Automatic mixed precision (AMP) | Yes
+|Distributed data parallel (DDP) | Yes
+|CUDA Graphs | Yes
+
+
+
+
+
+#### Features
+**Distributed data parallel (DDP)**
+
+[DistributedDataParallel (DDP)](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) implements data parallelism at the module level that can run across multiple GPUs or machines.
+
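+As a rough sketch (hypothetical model; assumes the script is launched with `torchrun` so that the usual `RANK`/`LOCAL_RANK`/`WORLD_SIZE` environment variables are set), wrapping the network in DDP looks like this:
+
+```
+import os
+import torch
+import torch.distributed as dist
+
+dist.init_process_group(backend="nccl")          # one process per GPU
+local_rank = int(os.environ["LOCAL_RANK"])
+torch.cuda.set_device(local_rank)
+
+model = torch.nn.Linear(128, 10).cuda()          # placeholder for the real network
+model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
+```
+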
+**Automatic Mixed Precision (AMP)**
+
+This implementation uses the native PyTorch AMP implementation of mixed precision training. It allows us to use FP16 training with FP32 master weights by modifying just a few lines of code. A detailed explanation of mixed precision can be found in the next section.
+
+**CUDA Graphs**
+
+This feature allows launching multiple GPU operations through a single CPU operation. The result is a vast reduction in CPU overhead. The benefits are particularly pronounced when training with relatively small batch sizes. The CUDA Graphs feature has been available through a [native PyTorch API](https://pytorch.org/docs/master/notes/cuda.html#cuda-graphs) starting from PyTorch v1.10.
+
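+A minimal capture-and-replay sketch of that API is shown below (hypothetical module and shapes; real training additionally needs warm-up iterations and static input/output tensors, as described in the PyTorch documentation):
+
+```
+import torch
+
+model = torch.nn.Linear(64, 64).cuda()            # placeholder module
+static_x = torch.randn(8, 64, device="cuda")      # static input buffer reused across replays
+
+# Warm up on a side stream before capture
+s = torch.cuda.Stream()
+s.wait_stream(torch.cuda.current_stream())
+with torch.cuda.stream(s):
+    for _ in range(3):
+        model(static_x)
+torch.cuda.current_stream().wait_stream(s)
+
+g = torch.cuda.CUDAGraph()
+with torch.cuda.graph(g):
+    static_y = model(static_x)
+
+static_x.copy_(torch.randn(8, 64, device="cuda"))  # refill the static input with new data
+g.replay()                                          # relaunch all captured kernels with one CPU call
+```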
+
+### Mixed precision training
+
+Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in NVIDIA Volta, and following with both the NVIDIA Turing and NVIDIA Ampere Architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using [mixed precision training](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) previously required two steps:
+1. Porting the model to use the FP16 data type where appropriate.
+2. Adding loss scaling to preserve small gradient values.
+
+AMP enables mixed precision training on NVIDIA Volta, NVIDIA Turing, and NVIDIA Ampere GPU architectures automatically. The PyTorch framework code makes all necessary model changes internally.
+
+For information about:
+- How to train using mixed precision, refer to the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) documentation.
+- Techniques used for mixed precision training, refer to the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
+- APEX tools for mixed precision training, refer to the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
+
+
+#### Enabling mixed precision
+
+Mixed precision is enabled in PyTorch by using the native [Automatic Mixed Precision package](https://pytorch.org/docs/stable/amp.html), which casts variables to half-precision upon retrieval while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included when applying gradients. In PyTorch, loss scaling can be applied automatically using a `GradScaler`.
+Automatic Mixed Precision makes all the adjustments internally in PyTorch, providing two benefits over manual operations. First, programmers do not need to modify network model code, reducing development and maintenance efforts. Second, using AMP maintains forward and backward compatibility with all the APIs for defining and running PyTorch models.
+
+To enable mixed precision, you can simply use the `--amp` flag when running the training or inference scripts.
+
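+Under the hood, the `--amp` flag amounts to wrapping the forward pass in `autocast` and scaling the loss with a `GradScaler`. A minimal, generic sketch of this pattern (hypothetical model and data, not the actual MoFlow training loop):
+
+```
+import torch
+
+model = torch.nn.Linear(128, 10).cuda()
+optimizer = torch.optim.Adam(model.parameters())
+scaler = torch.cuda.amp.GradScaler()
+
+for _ in range(10):
+    x = torch.randn(32, 128, device="cuda")
+    target = torch.randint(0, 10, (32,), device="cuda")
+    optimizer.zero_grad(set_to_none=True)
+    with torch.cuda.amp.autocast():
+        loss = torch.nn.functional.cross_entropy(model(x), target)
+    scaler.scale(loss).backward()   # scale the loss so small gradients stay representable in FP16
+    scaler.step(optimizer)          # unscales gradients, then runs optimizer.step()
+    scaler.update()
+```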
+
+
+#### Enabling TF32
+
+TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on NVIDIA Volta GPUs.
+
+TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require a high dynamic range for weights or activations.
+
+For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
+
+TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
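+
+If you want to verify or explicitly control this behavior (for example, to compare TF32 against strict FP32), PyTorch exposes global switches for it. This is a generic PyTorch mechanism and is not something the MoFlow scripts require:
+```
+import torch
+
+# TF32 is on by default on Ampere GPUs; set these to False to force strict FP32 math.
+torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for matrix multiplications
+torch.backends.cudnn.allow_tf32 = True         # TF32 for cuDNN convolutions
+```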
+
+
+
+### Glossary
+**Normalizing flow** - a class of generative neural networks that directly models the probability density of the data.
+
+**Molecular graph** - a representation of a molecule in which nodes correspond to atoms and edges correspond to chemical bonds.
+
+**SMILES format** - a format that represents a molecule as a string of characters, for example, `CCO` for ethanol.
+## Setup
+
+The following section lists the requirements that you need to meet to start training the MoFlow model.
+
+### Requirements
+
+This repository contains a Dockerfile that extends the PyTorch 22.11 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
+- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
+- PyTorch 22.11+ NGC container
+- Supported GPUs:
+ - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
+ - [NVIDIA Turing architecture](https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/)
+ - [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
+
+For more information about how to get started with NGC containers, refer to the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
+- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
+- [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
+- Running [PyTorch](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/running.html#running)
+
+For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, refer to the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
+
+## Quick Start Guide
+
+To train your model using mixed or TF32 precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the MoFlow model on the ZINC 250k dataset. For the specifics concerning training and inference, refer to the [Advanced](#advanced) section.
+
+1. Clone the repository.
+```
+git clone https://github.com/NVIDIA/DeepLearningExamples
+cd DeepLearningExamples/PyTorch/DrugDiscovery/MoFlow
+```
+
+2. Build the MoFlow PyTorch NGC container.
+```
+docker build . -t moflow_pyt
+```
+
+3. Start an interactive session in the NGC container to run training/inference.
+Run the following command to launch the Docker container.
+
+```
+docker run --rm -it --shm-size=8gb --gpus all -v <path to results>:/results moflow_pyt
+```
+
+If you want to reuse the dataset between runs (recommended), also use `-v <path to data>:/data` to mount a data directory inside the container:
+```
+docker run --rm -it --shm-size=8gb --gpus all -v <path to results>:/results -v <path to data>:/data moflow_pyt
+```
+The contents of `/data` will be downloaded in the following step.
+
+
+
+4. Download and preprocess the dataset.
+```
+bash scripts/prepare_datasets.sh
+```
+
+5. Start training and evaluation.
+```
+bash scripts/train.sh
+```
+
+6. Start inference.
+
+You can train the model yourself (see the previous step) or download the pretrained weights from NGC:
+```
+wget 'https://api.ngc.nvidia.com/v2/models/nvidia/dle/moflow__pyt_ckpt/versions/22.11.0_amp/files/model_snapshot_epoch_300' -O /results/model_snapshot_epoch_300
+```
+Then you can run the inference:
+
+```
+bash scripts/predict.sh
+```
+
+Now that you have your model trained and evaluated, you can choose to compare your training results with our [Training accuracy results](#training-accuracy-results). You can also choose to benchmark your performance against the [Training performance benchmark](#training-performance-results) or the [Inference performance benchmark](#inference-performance-results). Following the steps in these sections will ensure that you achieve the same accuracy and performance results as stated in the [Results](#results) section.
+## Advanced
+
+The following sections provide greater details of the dataset, running training and inference, and the training results.
+
+### Scripts and sample code
+In the root directory, the most important files are:
+- Dockerfile - definition of the Docker image with all dependencies needed to run MoFlow
+- setup.py - script that allows installing MoFlow with pip. Note that it does not include dependencies.
+
+The `moflow` directory contains the definition of the network and the tools needed for using it:
+- `config.py` - configuration of the dataset and network
+- `data` - directory with tools needed to process and load the data
+- `model` - directory with the definition of the MoFlow’s building blocks and helper functions
+- `runtime` - directory that contains code for running experiments, multi-GPU training, and logging. The most important files in this directory are `train.py` and `generate.py`, which allow running training or inference, respectively.
+- `utils.py` - various helper functions
+
+The `scripts` directory contains scripts for running the most typical workflows inside the docker container:
+- `benchmark_inference.sh` and `benchmark_training.sh` for measuring the performance of inference or training, respectively
+- `data_preprocess.py` for dataset preparation
+- `prepare_datasets.sh` for downloading and preprocessing the data (note that it launches `data_preprocess.py`)
+- `train.sh` for launching training
+- `predict.sh` for sampling random molecules from the trained model
+### Parameters
+
+The complete list of parameters accepted by the runtime scripts (`moflow/runtime/train.py` and `moflow/runtime/generate.py`) consists of:
+* --data_dir - Location for the dataset.
+* --config_name - The config to choose. This parameter allows one to switch between different datasets and their dedicated configurations of the neural network. By default, a pre-defined “zinc250k” config is used.
+* --results_dir - Directory where checkpoints are stored.
+* --predictions_path - Path to store generated molecules. If an empty string is provided, predictions will not be saved (useful for benchmarking and debugging).
+* --log_path - Path for DLLogger log. This file will contain information about the speed and accuracy of the model during training and inference. Note that if the file already exists, new logs will be added at the end.
+* --log_interval - Frequency for writing logs, expressed in steps.
+* --warmup_steps - Number of warmup steps. This value is used for benchmarking and for CUDA graph capture.
+* --steps - Number of steps used for training/inference. This parameter allows finishing training earlier than the specified number of epochs. If used with inference, it allows generating more molecules (by default only a single batch of molecules is generated).
+* --save_epochs - Frequency for saving checkpoints, expressed in epochs. If -1 is provided, checkpoints will not be saved.
+* --eval_epochs - Evaluation frequency, expressed in epochs. If -1 is provided, an evaluation will not be performed.
+* --learning_rate - Base learning rate.
+* --beta1 - beta1 parameter for the Adam optimizer.
+* --beta2 - beta2 parameter for the Adam optimizer.
+* --clip - Gradient clipping norm.
+* --epochs - Number of training epochs. Note that you can finish training mid-epoch by using “--steps” flag.
+* --batch_size - Batch size per GPU.
+* --num_workers - Number of workers in the data loader.
+* --seed - Random seed used to initialize the distributed loaders.
+* --local_rank - rank of the GPU, used to launch distributed training. This argument is specified automatically by `torchrun` and does not have to be provided by the user.
+* --temperature - Temperature used for sampling.
+* --val_batch_size - Number of molecules to generate during the validation step.
+* --allow_untrained - Allow sampling molecules from an untrained network. Useful for performance benchmarking or debugging purposes.
+* --correct_validity - Apply validity correction after the generation of the molecules.
+* --amp - Use Automatic Mixed Precision.
+* --cuda_graph - Capture GPU kernels with CUDA graphs. This option can speed up training.
+* --jit - Compile the model with `torch.jit.script`. Can be used to speed up training or inference.
+* --verbosity - Verbosity level. Specify the following values: 0, 1, 2, 3, where 0 means minimal verbosity (errors only) and 3 - maximal (debugging).
+
+
+### Command-line options
+
+To view the full list of available options and their descriptions, use the `-h` or `--help` command-line option, for example:
+`python moflow/runtime/train.py --help`
+
+The following example output is printed when running the script with the `--help` flag:
+```
+usage: train.py [-h] [--data_dir DATA_DIR] [--config_name {zinc250k}] [--results_dir RESULTS_DIR] [--predictions_path PREDICTIONS_PATH] [--log_path LOG_PATH] [--log_interval LOG_INTERVAL]
+ [--warmup_steps WARMUP_STEPS] [--steps STEPS] [--save_epochs SAVE_EPOCHS] [--eval_epochs EVAL_EPOCHS] [--learning_rate LEARNING_RATE] [--beta1 BETA1] [--beta2 BETA2] [--clip CLIP]
+ [--epochs EPOCHS] [--batch_size BATCH_SIZE] [--num_workers NUM_WORKERS] [--seed SEED] [--local_rank LOCAL_RANK] [--temperature TEMPERATURE] [--val_batch_size VAL_BATCH_SIZE]
+ [--allow_untrained] [--correct_validity] [--amp] [--cuda_graph] [--jit] [--verbosity {0,1,2,3}]
+
+optional arguments:
+ -h, --help show this help message and exit
+ --data_dir DATA_DIR Location for the dataset.
+ --config_name {zinc250k}
+ The config to choose. This parameter allows one to switch between different datasets and their dedicated configurations of the neural network. By default, a pre-defined
+ "zinc250k" config is used.
+ --results_dir RESULTS_DIR
+ Directory where checkpoints are stored.
+ --predictions_path PREDICTIONS_PATH
+ Path to store generated molecules. If an empty string is provided, predictions will not be saved (useful for benchmarking and debugging).
+ --log_path LOG_PATH Path for DLLogger log. This file will contain information about the speed and accuracy of the model during training and inference. Note that if the file already exists, new logs
+ will be added at the end.
+ --log_interval LOG_INTERVAL
+ Frequency for writing logs, expressed in steps.
+ --warmup_steps WARMUP_STEPS
+ Number of warmup steps. This value is used for benchmarking and for CUDA graph capture.
+ --steps STEPS Number of steps used for training/inference. This parameter allows finishing training earlier than the specified number of epochs. If used with inference, it allows generating
+ more molecules (by default only a single batch of molecules is generated).
+ --save_epochs SAVE_EPOCHS
+ Frequency for saving checkpoints, expressed in epochs. If -1 is provided, checkpoints will not be saved.
+ --eval_epochs EVAL_EPOCHS
+ Evaluation frequency, expressed in epochs. If -1 is provided, an evaluation will not be performed.
+ --learning_rate LEARNING_RATE
+ Base learning rate.
+ --beta1 BETA1 beta1 parameter for the optimizer.
+ --beta2 BETA2 beta2 parameter for the optimizer.
+ --clip CLIP Gradient clipping norm.
+ --epochs EPOCHS Number of training epochs. Note that you can finish training mid-epoch by using "--steps" flag.
+ --batch_size BATCH_SIZE
+ Batch size per GPU.
+ --num_workers NUM_WORKERS
+ Number of workers in the data loader.
+ --seed SEED Random seed used to initialize the distributed loaders.
+ --local_rank LOCAL_RANK
+ rank of the GPU, used to launch distributed training. This argument is specified automatically by `torchrun` and does not have to be provided by the user.
+ --temperature TEMPERATURE
+ Temperature used for sampling.
+ --val_batch_size VAL_BATCH_SIZE
+ Number of molecules to generate during validation step.
+ --allow_untrained Allow sampling molecules from an untrained network. Useful for performance benchmarking or debugging purposes.
+ --correct_validity Apply validity correction after the generation of the molecules.
+ --amp Use Automatic Mixed Precision.
+ --cuda_graph Capture GPU kernels with CUDA graphs. This option allows to speed up training.
+ --jit Compile the model with `torch.jit.script`. Can be used to speed up training or inference.
+ --verbosity {0,1,2,3}
+ Verbosity level. Specify the following values: 0, 1, 2, 3, where 0 means minimal verbosity (errors only) and 3 - maximal (debugging).
+
+```
+### Getting the data
+
+The MoFlow model was trained on the ZINC 250k dataset. The original data split was used, with 224569 molecules in the training set and 24887 molecules in the test set.
+
+This repository contains the `prepare_datasets.sh` script that will automatically download and process the dataset. By default, data will be downloaded to the `/data/` directory.
+
+#### Dataset guidelines
+The dataset preparation is implemented in the `scripts/data_preprocess.py` script, and the parameters for the dataset are defined in the `moflow/config.py` file. The config includes information about data location, the structure of the CSV file, types and numbers of atoms in the molecules, and the number of nodes in the output graphs.
+
+Initially, the data is stored in a CSV file that contains the molecules in SMILES format, together with their properties (optional). The data is loaded using the `pandas` library, and the SMILES strings are converted to molecules with RDKit.
+
+Then, the molecules are converted into graphs with features assigned to nodes and edges. The first step is the standardization of molecular structures - each molecule is converted into canonical SMILES, loaded back, and kekulized. Then, two numpy arrays are constructed. The first array is a vector corresponding to graph nodes and contains atomic numbers for all atoms in the molecule. The second array is a 2D square matrix corresponding to graph edges and contains codes for atomic bond orders - 0 if two atoms are not connected, 1 for a single bond, 2 for a double bond, and 3 for a triple bond.
+
+Both arrays are padded to some predefined size larger than the maximum number of atoms in the molecules in the dataset. For ZINC 250k, the maximum number of atoms is 38, and the output size of the numpy arrays is set to 40 for the nodes array and 40x40 for the edges array.
+
+This representation of the data is dumped to disk using the numpy `savez` function.
+
+During training, the numpy arrays are loaded, and one-hot-encoding is used to represent atomic numbers (node features) and bond orders (edge features). This representation is then used for training the neural network.
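+
+The sketch below walks through this encoding for a single SMILES string. It is a simplified, stand-alone illustration of the steps described above (standardization, atomic-number vector, bond-order matrix, padding, one-hot encoding), not the repository implementation, and the constants are chosen to match the ZINC 250k configuration:
+```
+import numpy as np
+from rdkit import Chem
+
+MAX_NODES = 40                                   # padded size used for ZINC 250k
+BOND_CODES = {Chem.rdchem.BondType.SINGLE: 1,
+              Chem.rdchem.BondType.DOUBLE: 2,
+              Chem.rdchem.BondType.TRIPLE: 3}
+
+def encode(smiles):
+    # assumes a valid SMILES string
+    # standardization: convert to canonical SMILES, load back, and kekulize
+    mol = Chem.MolFromSmiles(Chem.MolToSmiles(Chem.MolFromSmiles(smiles)))
+    Chem.Kekulize(mol)
+    nodes = np.zeros(MAX_NODES, dtype=np.uint8)               # 0 means "no atom" (padding)
+    edges = np.zeros((MAX_NODES, MAX_NODES), dtype=np.uint8)  # 0 means "not connected"
+    for i, atom in enumerate(mol.GetAtoms()):
+        nodes[i] = atom.GetAtomicNum()
+    for bond in mol.GetBonds():
+        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
+        edges[i, j] = edges[j, i] = BOND_CODES[bond.GetBondType()]
+    return nodes, edges
+
+nodes, edges = encode('CCO')                       # ethanol: atomic numbers [6, 6, 8, 0, ...]
+bond_one_hot = np.eye(4, dtype=np.float32)[edges]  # one-hot over the 4 bond codes
+```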
+
+### Training process
+
+The training script is located in `moflow/runtime/train.py` and it accepts the parameters listed above.
+
+To make the model easier to use, there is also the `scripts/train.sh` script, which runs training with the default configuration and evaluates the trained checkpoint at the end. This script can be run without any arguments; it then launches training on a single GPU with performance optimizations enabled: automatic mixed precision (AMP) and CUDA graph capture.
+
+```
+./scripts/train.sh
+```
+
+It is also possible to pass the number of GPUs and precision (“amp” or “full”) that should be used for training. For example, to launch training with eight GPUs and AMP, run:
+```
+./scripts/train.sh 8
+```
+and to launch four GPU training with full precision, run:
+```
+./scripts/train.sh 4 full
+```
+These two arguments can also be followed by extra flags that will be passed to the training and evaluation commands. For example, to train on eight GPUs with AMP, a batch size of 2048 per GPU, and logs saved in `/results/dll.json`, run:
+```
+./scripts/train.sh 8 amp --batch_size 2048 --log_path /results/dll.json
+```
+
+Alternatively, you can launch training with `moflow/runtime/train.py`. To run the model with multiple GPUs, run:
+
+```
+torchrun --nproc_per_node=<# GPUs> moflow/runtime/train.py
+```
+To enable mixed precision training, add `--amp`. You can also optimize the performance further by adding `--cuda_graph` or `--jit` flags to enable CUDA graph capture or just-in-time compilation, respectively.
+
+#### Logs
+By default, logs are printed to the screen and not saved to disk. If you want to store the logs, pass the `--log_path` flag to `scripts/train.sh` or `moflow/runtime/train.py`.
+
+#### Checkpoints
+By default, the training script saves checkpoints inside `/results` every five epochs. The location of the checkpoints directory can be modified with the `--results_dir` flag and the saving interval with the `--save_epochs` flag (pass -1 if you do not want to save checkpoints). Up to five of the most recent checkpoints are kept, while older ones are removed.
+
+#### Evaluation
+The following metrics are used to evaluate the model:
+
+- Validity - the percentage of predictions corresponding to a correct molecular graph.
+- Uniqueness - the percentage of valid molecules that are unique.
+- Novelty - the percentage of valid and unique molecules not present in the training set.
+- N.U.V - the percentage of valid, unique, and novel molecules.
+
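+A rough, stand-alone illustration of how these metrics can be computed with RDKit is shown below. The function name and exact definitions are illustrative only; the repository computes the metrics in its own evaluation code:
+```
+from rdkit import Chem
+
+def evaluate_sample(generated_smiles, training_smiles):
+    # training_smiles is assumed to contain canonical SMILES strings
+    mols = (Chem.MolFromSmiles(s) for s in generated_smiles)
+    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]
+    unique = set(valid)
+    novel = unique - set(training_smiles)
+    n = len(generated_smiles)
+    return {
+        'validity': len(valid) / n,
+        'uniqueness': len(unique) / max(len(valid), 1),
+        'novelty': len(novel) / max(len(unique), 1),
+        'nuv': len(novel) / n,   # valid, unique, and novel, relative to all generated molecules
+    }
+```
+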
+During training, a single batch of molecules is generated every couple of epochs to assess two metrics: validity and uniqueness. These are quick to calculate and allow tracking the training progress.
+
+By default, the validation batch size is set to 100 molecules per GPU, and evaluation happens every five epochs. This can be changed with `--val_batch_size` and `--eval_epochs` flags, respectively. To disable evaluation, pass `--eval_epochs -1`.
+
+If you use `scripts/train.sh`, there is also a final evaluation of the model done on 100 batches of molecules. This larger sample is evaluated with all metrics described above, and we use N.U.V as the main metric.
+
+Alternatively, you can trigger evaluation manually by running the `moflow/runtime/evaluate.py` script. Make sure that you pass the same value for `--results_dir` to both the training and evaluation scripts.
+
+### Inference process
+
+Inference can be run by launching the `moflow/runtime/generate.py` or `scripts/predict.sh` script. The first one provides more flexibility and accepts the arguments listed above. The second script allows you to easily run the default configuration with performance optimization (the `--jit` flag) and molecule validity correction (`--correct_validity`). To generate a single batch of molecules with AMP and a batch size of 512, run:
+```
+./scripts/predict.sh
+```
+You can also provide the batch size and precision to use for predictions. For example, to generate 1000 molecules with full precision, run:
+
+```
+./scripts/predict.sh 1000 full
+```
+
+The script also allows you to pass extra flags to the generation. For example, to generate 10 batches of 1000 molecules each and save the predictions in `/results/predictions.smi`, run:
+```
+./scripts/predict.sh 1000 amp --steps 10 --predictions_path /results/predictions.smi
+```
+
+
+
+## Performance
+The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).
+
+### Benchmarking
+
+The following section shows how to run benchmarks measuring the model performance in training and inference modes.
+
+#### Training performance benchmark
+
+To benchmark the training performance on a specific number of GPUs, batch size and precision, run:
+
+```
+bash scripts/benchmark_training.sh <# GPUs> <batch size> <precision>
+```
+For example, running
+```
+./scripts/benchmark_training.sh 8 2048 amp
+```
+will measure performance for eight GPUs, a batch size of 2048 per GPU, and mixed precision, while running:
+```
+./scripts/benchmark_training.sh 1 1024 full
+```
+will measure performance for a single GPU, a batch size of 1024, and full precision.
+#### Inference performance benchmark
+To benchmark the inference performance on a specific batch size and precision, run:
+
+```
+bash scripts/benchmark_inference.sh <batch size> <precision>
+```
+
+For example, running
+```
+./scripts/benchmark_inference.sh 2048 amp
+```
+will measure performance for a batch size of 2048 and mixed precision, while running:
+```
+./scripts/benchmark_inference.sh 1024 full
+```
+will measure performance for a batch size of 1024 and full precision.
+
+### Results
+
+The following sections provide details on how we achieved our performance and accuracy in training and inference.
+
+#### Training accuracy results
+
+
+##### Training accuracy: NVIDIA A100 (8x A100 80GB)
+
+Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 22.11 NGC container on NVIDIA A100 (8x A100 80GB) GPUs. The values presented below were averaged over 20 experiments.
+
+| GPUs | Batch size / GPU | NUV - TF32 | NUV - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision)
+|---------|------------------|-----------------|----------------------------|-------------------------|----------------------------------|--------------
+| 1 | 512 | 89.63 % | 87.83 % | 5h8min | 4h0min | 1.28x
+| 8 | 512 | 87.03 % | 87.90 % | 48min | 40min | 1.20x
+
+
+##### Training stability test
+
+
+The MoFlow model was trained for 300 epochs starting from 20 different initial random seeds. Every five training epochs, the model was evaluated by generating a small sample of molecules (100 molecules per GPU), and validity and uniqueness were calculated. The training was performed in the PyTorch 22.11 Docker container on NVIDIA DGX A100 with 8x A100 80GB GPUs with AMP and CUDA graph capture enabled.
+
+The following table displays the validity and uniqueness scores after every 50 epochs for different initial random seeds.
+
+|epoch|validity mean|validity std|validity min|validity max|validity median|uniqueness mean|uniqueness std|uniqueness min|uniqueness max|uniqueness median|
+|-----|-------------|------------|------------|------------|---------------|---------------|--------------|--------------|--------------|-----------------|
+|50 |68.22 |5.25 |57.38 |74.75 |69.50 |93.64 |8.22 |62.56 |99.82 |95.30 |
+|100 |76.91 |4.23 |69.50 |84.38 |77.50 |99.39 |0.92 |96.31 |100.00 |99.83 |
+|150 |80.48 |3.80 |73.88 |88.25 |81.75 |99.58 |0.78 |96.64 |100.00 |99.85 |
+|200 |83.87 |3.98 |77.00 |90.62 |84.44 |99.76 |0.38 |98.81 |100.00 |100.00 |
+|250 |86.08 |4.46 |77.12 |93.12 |86.56 |99.87 |0.21 |99.27 |100.00 |100.00 |
+|300 |87.29 |3.70 |77.75 |93.38 |87.69 |99.82 |0.30 |98.70 |100.00 |99.93 |
+
+
+
+#### Training performance results
+
+
+##### Training performance: NVIDIA A100 (8x A100 80GB)
+
+Our results were obtained by running the `scripts/benchmark_training.sh` training script in the PyTorch 22.11 NGC container on NVIDIA A100 (8x A100 80GB) GPUs. Performance numbers (in molecules per second) were averaged over 190 iterations after 10 warm-up steps.
+
+|GPUs|Batch size / GPU|Throughput - TF32|Throughput - mixed precision|Throughput speedup (TF32 - mixed precision)|Weak scaling - TF32|Weak scaling - mixed precision|
+|----|----------------|-----------------|----------------------------|-------------------------------------------|-------------------|------------------------------|
+|1 |512 |3499.35 |4524.15 |1.29 | | |
+|1 |1024 |3883.49 |5392.78 |1.39 | | |
+|1 |2048 |4291.29 |6118.46 |1.43 | | |
+|8 |512 |24108.04 |29293.41 |1.22 |6.89 |6.47 |
+|8 |1024 |28104.62 |37365.05 |1.33 |7.24 |6.93 |
+|8 |2048 |30927.04 |42078.31 |1.36 |7.21 |6.88 |
+
+
+
+To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
+
+
+#### Inference performance results
+
+##### Inference performance: NVIDIA A100 (1x A100 80GB)
+
+Our results were obtained by running the `scripts/benchmark_inference.sh` inference benchmarking script in the PyTorch 22.11 NGC container on the NVIDIA A100 (1x A100 80GB) GPU. Throughput is reported in molecules per second and latencies in milliseconds.
+
+FP16
+|Batch size|Throughput Avg|Latency Avg|Latency 90%|Latency 95%|Latency 99%|
+|----------|--------------|-----------|-----------|-----------|-----------|
+|512 |12524.49 |41 |41 |41 |41 |
+|1024 |13871.60 |74 |74 |74 |74 |
+|2048 |14386.44 |142 |144 |144 |144 |
+
+TF32
+|Batch size|Throughput Avg|Latency Avg|Latency 90%|Latency 95%|Latency 99%|
+|----------|--------------|-----------|-----------|-----------|-----------|
+|512 |9696.35 |53 |53 |53 |53 |
+|1024 |10242.98 |100 |100 |100 |100 |
+|2048 |11174.75 |183 |187 |187 |187 |
+
+
+To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
+
+## Release notes
+
+### Changelog
+January 2023
+- Initial release
+
+### Known issues
+
+There is a known issue with the selection of the sampling temperature. For some runs, the default value (0.3) might be sub-optimal, and better prediction quality can be achieved by lowering or increasing the value of this parameter. To tune this parameter, run the `moflow/runtime/evaluate.py` script, passing different values for the `--temperature` flag.
diff --git a/PyTorch/DrugDiscovery/MoFlow/img/moflow.png b/PyTorch/DrugDiscovery/MoFlow/img/moflow.png
new file mode 100644
index 000000000..e806e1451
Binary files /dev/null and b/PyTorch/DrugDiscovery/MoFlow/img/moflow.png differ
diff --git a/TensorFlow2/Recommendation/DLRM/tensorflow-dot-based-interact/tensorflow_dot_based_interact/python/__init__.py b/PyTorch/DrugDiscovery/MoFlow/moflow/__init__.py
similarity index 100%
rename from TensorFlow2/Recommendation/DLRM/tensorflow-dot-based-interact/tensorflow_dot_based_interact/python/__init__.py
rename to PyTorch/DrugDiscovery/MoFlow/moflow/__init__.py
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/config.py b/PyTorch/DrugDiscovery/MoFlow/moflow/config.py
new file mode 100644
index 000000000..8bf4d07c4
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/config.py
@@ -0,0 +1,142 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from dataclasses import asdict, dataclass, field
+import json
+from typing import Dict, List, Optional
+
+from rdkit import Chem
+
+
+_VALID_IDX_FILE = 'valid_idx_{}.json'
+_CSV_FILE = '{}.csv'
+_DATASET_FILE = '{}_relgcn_kekulized_ggnp.npz'
+
+DUMMY_CODE = 0
+CODE_TO_BOND = dict(enumerate([
+ 'DUMMY',
+ Chem.rdchem.BondType.SINGLE,
+ Chem.rdchem.BondType.DOUBLE,
+ Chem.rdchem.BondType.TRIPLE,
+]))
+BOND_TO_CODE = {v: k for k, v in CODE_TO_BOND.items()}
+ATOM_VALENCY = {6: 4, 7: 3, 8: 2, 9: 1, 15: 3, 16: 2, 17: 1, 35: 1, 53: 1}
+
+
+@dataclass
+class DatasetConfig:
+ dataset_name: str
+ atomic_num_list: List[int]
+ max_num_atoms: int
+ labels: List[str]
+ smiles_col: str
+ code_to_atomic: Dict[int, int] = field(init=False)
+ atomic_to_code: Dict[int, int] = field(init=False)
+ valid_idx_file: str = field(init=False)
+ csv_file: str = field(init=False)
+ dataset_file: str = field(init=False)
+
+ def __post_init__(self):
+ self.valid_idx_file = _VALID_IDX_FILE.format(self.dataset_name)
+ self.csv_file = _CSV_FILE.format(self.dataset_name)
+ self.dataset_file = _DATASET_FILE.format(self.dataset_name)
+
+ self.code_to_atomic = dict(enumerate(sorted([DUMMY_CODE] + self.atomic_num_list)))
+ self.atomic_to_code = {v: k for k, v in self.code_to_atomic.items()}
+
+
+@dataclass
+class AtomFlowConfig:
+ n_flow: int
+ hidden_gnn: List[int]
+ hidden_lin: List[int]
+ n_block: int = 1
+ mask_row_size_list: List[int] = field(default_factory=lambda: [1])
+ mask_row_stride_list: List[int] = field(default_factory=lambda: [1])
+
+@dataclass
+class BondFlowConfig:
+ hidden_ch: List[int]
+ conv_lu: int
+ n_squeeze: int
+ n_block: int = 1
+ n_flow: int = 10
+
+
+@dataclass
+class ModelConfig:
+ atom_config: AtomFlowConfig
+ bond_config: BondFlowConfig
+ noise_scale: float = 0.6
+ learn_dist: bool = True
+
+@dataclass
+class Config:
+ dataset_config: DatasetConfig
+ model_config: ModelConfig
+ max_num_nodes: Optional[int] = None
+ num_node_features: Optional[int] = None
+ num_edge_features: int = len(CODE_TO_BOND)
+ z_dim: int = field(init=False)
+
+ def __post_init__(self):
+ if self.max_num_nodes is None:
+ self.max_num_nodes = self.dataset_config.max_num_atoms
+ if self.num_node_features is None:
+ self.num_node_features = len(self.dataset_config.code_to_atomic)
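+        # z_dim is the size of the flattened latent: the one-hot bond tensor (N x N x edge types)
+        # plus the one-hot atom matrix (N x atom types).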
+ bonds_dim = self.max_num_nodes * self.max_num_nodes * self.num_edge_features
+ atoms_dim = self.max_num_nodes * self.num_node_features
+ self.z_dim = bonds_dim + atoms_dim
+
+
+ def save(self, path):
+ self.path = path
+ with open(path, 'w') as f:
+ json.dump(asdict(self), f, indent=4, sort_keys=True)
+
+ @classmethod
+ def load(cls, path):
+ with open(path, 'r') as f:
+ data = json.load(f)
+ return cls(**data)
+
+ def __repr__(self) -> str:
+ return json.dumps(asdict(self), indent=4, separators=(',', ': '))
+
+
+ZINC250K_CONFIG = Config(
+ max_num_nodes=40,
+ dataset_config=DatasetConfig(
+ dataset_name='zinc250k',
+ atomic_num_list=[6, 7, 8, 9, 15, 16, 17, 35, 53],
+ max_num_atoms=38,
+ labels=['logP', 'qed', 'SAS'],
+ smiles_col='smiles',
+ ),
+ model_config=ModelConfig(
+ AtomFlowConfig(
+ n_flow=38,
+ hidden_gnn=[256],
+ hidden_lin=[512, 64],
+ ),
+ BondFlowConfig(
+ n_squeeze=20,
+ hidden_ch=[512, 512],
+ conv_lu=2
+ ),
+ )
+)
+
+CONFIGS = {'zinc250k': ZINC250K_CONFIG}
diff --git a/TensorFlow2/Recommendation/DLRM/tensorflow-dot-based-interact/tensorflow_dot_based_interact/python/ops/__init__.py b/PyTorch/DrugDiscovery/MoFlow/moflow/data/__init__.py
similarity index 100%
rename from TensorFlow2/Recommendation/DLRM/tensorflow-dot-based-interact/tensorflow_dot_based_interact/python/ops/__init__.py
rename to PyTorch/DrugDiscovery/MoFlow/moflow/data/__init__.py
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/data/data_frame_parser.py b/PyTorch/DrugDiscovery/MoFlow/moflow/data/data_frame_parser.py
new file mode 100644
index 000000000..ba76fc439
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/data/data_frame_parser.py
@@ -0,0 +1,109 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Copyright 2020 Chengxi Zang
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+
+
+from logging import getLogger
+import traceback
+from typing import List
+
+import numpy as np
+import pandas as pd
+from rdkit import Chem
+from tqdm import tqdm
+
+from moflow.data.encoding import MolEncoder, EncodingError
+from moflow.data.data_loader import NumpyTupleDataset
+
+
+class DataFrameParser:
+ """
+    This DataFrameParser parses a pandas DataFrame containing SMILES and, optionally, some additional features.
+
+ Args:
+ encoder (MolEncoder): encoder instance
+ labels (list): labels column that should be loaded
+ smiles_col (str): smiles column
+ """
+
+ def __init__(self, encoder: MolEncoder,
+ labels: List[str],
+ smiles_col: str = 'smiles'):
+ super(DataFrameParser, self).__init__()
+ self.labels = labels
+ self.smiles_col = smiles_col
+ self.logger = getLogger(__name__)
+ self.encoder = encoder
+
+ def parse(self, df: pd.DataFrame) -> NumpyTupleDataset:
+ """Parse DataFrame using `encoder` and prepare a dataset instance
+
+ Labels are extracted from `labels` columns and input features are
+ extracted from smiles information in `smiles` column.
+ """
+ all_nodes = []
+ all_edges = []
+
+ total_count = df.shape[0]
+ fail_count = 0
+ success_count = 0
+ for smiles in tqdm(df[self.smiles_col], total=df.shape[0]):
+ try:
+ mol = Chem.MolFromSmiles(smiles)
+ if mol is None:
+ fail_count += 1
+ continue
+ # Note that smiles expression is not unique.
+ # we obtain canonical smiles
+ nodes, edges = self.encoder.encode_mol(mol)
+
+ except EncodingError as e:
+ fail_count += 1
+ continue
+ except Exception as e:
+ self.logger.warning('parse(), type: {}, {}'
+ .format(type(e).__name__, e.args))
+ self.logger.info(traceback.format_exc())
+ fail_count += 1
+ continue
+ all_nodes.append(nodes)
+ all_edges.append(edges)
+ success_count += 1
+
+ result = [np.array(all_nodes), np.array(all_edges), *(df[label_col].values for label_col in self.labels)]
+ self.logger.info('Preprocess finished. FAIL {}, SUCCESS {}, TOTAL {}'
+ .format(fail_count, success_count, total_count))
+
+ dataset = NumpyTupleDataset(result)
+ return dataset
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/data/data_loader.py b/PyTorch/DrugDiscovery/MoFlow/moflow/data/data_loader.py
new file mode 100644
index 000000000..28f9378ca
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/data/data_loader.py
@@ -0,0 +1,110 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Copyright 2020 Chengxi Zang
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+
+
+import os
+import logging
+from typing import Any, Callable, Iterable, Optional, Tuple
+
+import numpy as np
+from torch.utils.data import Dataset
+
+
+class NumpyTupleDataset(Dataset):
+ """Dataset of a tuple of datasets.
+
+ It combines multiple datasets into one dataset. Each example is represented
+ by a tuple whose ``i``-th item corresponds to the i-th dataset.
+ And each ``i``-th dataset is expected to be an instance of numpy.ndarray.
+
+ Args:
+ datasets: Underlying datasets. The ``i``-th one is used for the
+ ``i``-th item of each example. All datasets must have the same
+ length.
+        transform: An optional function applied to an item before it is returned
+ """
+
+ def __init__(self, datasets: Iterable[np.ndarray], transform: Optional[Callable] = None) -> None:
+ if not datasets:
+ raise ValueError('no datasets are given')
+ length = len(datasets[0])
+ for i, dataset in enumerate(datasets):
+ if len(dataset) != length:
+ raise ValueError(
+ 'dataset of the index {} has a wrong length'.format(i))
+ self._datasets = datasets
+ self._length = length
+ self.transform = transform
+
+ def __len__(self) -> int:
+ return self._length
+
+ def __getitem__(self, index: int) -> Tuple[Any]:
+ item = [dataset[index] for dataset in self._datasets]
+
+ if self.transform:
+ item = self.transform(item)
+ return item
+
+ def get_datasets(self) -> Tuple[np.ndarray]:
+ return self._datasets
+
+
+ def save(self, filepath: str) -> None:
+ """save the dataset to filepath in npz format
+
+ Args:
+ filepath (str): filepath to save dataset. It is recommended to end
+ with '.npz' extension.
+ """
+ np.savez(filepath, *self._datasets)
+ logging.info('Save {} done.'.format(filepath))
+
+ @classmethod
+ def load(cls, filepath: str, transform: Optional[Callable] = None):
+ logging.info('Loading file {}'.format(filepath))
+ if not os.path.exists(filepath):
+ raise ValueError('Invalid filepath {} for dataset'.format(filepath))
+ load_data = np.load(filepath)
+ result = []
+ i = 0
+ while True:
+ key = 'arr_{}'.format(i)
+ if key in load_data.keys():
+ result.append(load_data[key])
+ i += 1
+ else:
+ break
+ return cls(result, transform)
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/data/encoding.py b/PyTorch/DrugDiscovery/MoFlow/moflow/data/encoding.py
new file mode 100644
index 000000000..d3d71fde9
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/data/encoding.py
@@ -0,0 +1,139 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Copyright 2020 Chengxi Zang
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+
+
+from typing import Tuple
+import numpy as np
+from rdkit import Chem
+
+from moflow.config import BOND_TO_CODE, DUMMY_CODE
+
+
+class MolEncoder:
+ """Encodes atoms and adjecency matrix.
+
+ Args:
+ out_size (int): It specifies the size of array returned by
+ `get_input_features`.
+ If the number of atoms in the molecule is less than this value,
+ the returned arrays is padded to have fixed size.
+ """
+
+ def __init__(self, out_size: int):
+ super(MolEncoder, self).__init__()
+ self.out_size = out_size
+
+ def encode_mol(self, mol: Chem.Mol) -> Tuple[np.ndarray, np.ndarray]:
+ """get input features
+
+ Args:
+ mol (Mol):
+
+ Returns:
+
+ """
+ mol = self._standardize_mol(mol)
+ self._check_num_atoms(mol)
+ atom_array = self.construct_atomic_number_array(mol)
+ adj_array = self.construct_discrete_edge_matrix(mol)
+ return atom_array, adj_array
+
+ def _standardize_mol(self, mol: Chem.Mol) -> Chem.Mol:
+ canonical_smiles = Chem.MolToSmiles(mol, isomericSmiles=False,
+ canonical=True)
+ mol = Chem.MolFromSmiles(canonical_smiles)
+ Chem.Kekulize(mol)
+ return mol
+
+ def _check_num_atoms(self, mol: Chem.Mol) -> None:
+ """Check number of atoms in `mol` does not exceed `out_size`"""
+ num_atoms = mol.GetNumAtoms()
+ if num_atoms > self.out_size:
+ raise EncodingError(f'Number of atoms in mol {num_atoms} exceeds num_max_atoms {self.out_size}')
+
+
+ def construct_atomic_number_array(self, mol: Chem.Mol) -> np.ndarray:
+ """Returns atomic numbers of atoms consisting a molecule.
+
+ Args:
+ mol (rdkit.Chem.Mol): Input molecule.
+
+ Returns:
+ numpy.ndarray: an array consisting of atomic numbers
+ of atoms in the molecule.
+ """
+
+ atom_list = [a.GetAtomicNum() for a in mol.GetAtoms()]
+ n_atom = len(atom_list)
+ if self.out_size < n_atom:
+ raise EncodingError(f'out_size {self.out_size} is smaller than number of atoms in mol {n_atom}')
+ atom_array = np.full(self.out_size, DUMMY_CODE, dtype=np.uint8)
+ atom_array[:n_atom] = atom_list
+ return atom_array
+
+
+ def construct_discrete_edge_matrix(self, mol: Chem.Mol) -> np.ndarray:
+ """Returns the edge-type dependent adjacency matrix of the given molecule.
+
+ Args:
+ mol (rdkit.Chem.Mol): Input molecule.
+
+ Returns:
+            adj_array (numpy.ndarray): The adjacency matrix of the input molecule.
+            It is a symmetric 2-dimensional array with shape (out_size, out_size),
+            filled with integers representing bond types. If two atoms are not
+            connected, DUMMY_CODE is used instead.
+ """
+ if mol is None:
+ raise EncodingError('mol is None')
+ n_atom = mol.GetNumAtoms()
+
+ if self.out_size < n_atom:
+ raise EncodingError(f'out_size {self.out_size} is smaller than number of atoms in mol {n_atom}')
+
+ adjs = np.full((self.out_size, self.out_size), DUMMY_CODE, dtype=np.uint8)
+
+ for bond in mol.GetBonds():
+ bond_type = bond.GetBondType()
+ # we need to use code here - bond types are rdkit objects
+ code = BOND_TO_CODE[bond_type]
+ i = bond.GetBeginAtomIdx()
+ j = bond.GetEndAtomIdx()
+ adjs[[i, j], [j, i]] = code
+ return adjs
+
+
+class EncodingError(Exception):
+ pass
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/data/transform.py b/PyTorch/DrugDiscovery/MoFlow/moflow/data/transform.py
new file mode 100644
index 000000000..eaf3e9e43
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/data/transform.py
@@ -0,0 +1,85 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Copyright 2020 Chengxi Zang
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+
+
+import json
+import logging
+import numpy as np
+import os
+from typing import Dict, Tuple
+
+from moflow.config import CODE_TO_BOND, DUMMY_CODE, Config
+
+
+def _onehot(data: np.ndarray, codes_dict: Dict[int, int], dtype=np.float32) -> np.ndarray:
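+    # Returns an array of shape (len(codes_dict), *data.shape) with a 1 in channel
+    # `code` wherever `data` equals the corresponding `obj_key`, and 0 elsewhere.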
+ shape = [len(codes_dict), *data.shape]
+ encoded = np.zeros(shape, dtype=dtype)
+ for obj_key, code in codes_dict.items():
+ encoded[code, data == obj_key] = 1
+ return encoded
+
+
+def encode_nodes(atomic_nums: np.ndarray, config: Config) -> np.ndarray:
+ padded_data = np.full(config.max_num_nodes, DUMMY_CODE, dtype=np.uint8)
+ padded_data[:len(atomic_nums)] = atomic_nums
+ encoded = _onehot(padded_data, config.dataset_config.atomic_to_code).T
+ return encoded
+
+
+def encode_edges(adj: np.ndarray, config: Config) -> np.ndarray:
+ padded_data = np.full((config.max_num_nodes, config.max_num_nodes), DUMMY_CODE, dtype=np.uint8)
+ n, m = adj.shape
+    assert n == m, 'adjacency matrix should be square'
+ padded_data[:n, :n] = adj
+ # we already store codes in the file - bond types are rdkit objects
+ encoded = _onehot(padded_data, {k:k for k in CODE_TO_BOND})
+ return encoded
+
+
+def transform_fn(data: Tuple[np.ndarray], config: Config) -> Tuple[np.ndarray]:
+ node, adj, *labels = data
+ node = encode_nodes(node, config)
+ adj = encode_edges(adj, config)
+ return (node, adj, *labels)
+
+
+def get_val_ids(config: Config, data_dir: str):
+ file_path = os.path.join(data_dir, config.dataset_config.valid_idx_file)
+ logging.info('loading train/valid split information from: {}'.format(file_path))
+ with open(file_path) as json_data:
+ data = json.load(json_data)
+
+ val_ids = [int(idx)-1 for idx in data]
+ return val_ids
diff --git a/Tools/PyTorch/TimeSeriesPredictionPlatform/distributed_launcher/__init__.py b/PyTorch/DrugDiscovery/MoFlow/moflow/model/__init__.py
similarity index 100%
rename from Tools/PyTorch/TimeSeriesPredictionPlatform/distributed_launcher/__init__.py
rename to PyTorch/DrugDiscovery/MoFlow/moflow/model/__init__.py
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/model/basic.py b/PyTorch/DrugDiscovery/MoFlow/moflow/model/basic.py
new file mode 100644
index 000000000..f6f0e32bf
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/model/basic.py
@@ -0,0 +1,194 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Copyright 2020 Chengxi Zang
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+
+
+import math
+from typing import Tuple
+import numpy as np
+from scipy import linalg as la
+import torch
+from torch import nn
+from torch.nn import functional as F
+
+from moflow.runtime.distributed_utils import get_world_size, reduce_tensor
+
+
+class ActNorm(nn.Module):
+ def __init__(self, num_channels, num_dims, channels_dim=1):
+ super().__init__()
+ self.num_channels = num_channels
+ self.num_dims = num_dims
+ self.channels_dim = channels_dim
+ self.shape = [1] * num_dims
+ self.shape[channels_dim] = num_channels
+ self.loc = nn.Parameter(torch.zeros(*self.shape))
+ self.scale = nn.Parameter(torch.ones(*self.shape))
+
+ self.register_buffer('initialized', torch.tensor(0, dtype=torch.uint8))
+ self.register_buffer('num_elements', torch.tensor(0, dtype=torch.uint8))
+
+ @torch.jit.ignore
+ def initialize(self, input):
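+        # Data-dependent initialization: estimate per-channel mean and std from the first
+        # batch (reduced across GPUs) so that normalized activations start near zero mean
+        # and unit variance; runs only once, guarded by the `initialized` buffer.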
+ if self.initialized.item() == 1:
+ return
+
+ dims = list(input.shape[1:])
+ del dims[self.channels_dim -1]
+
+ num_elems = math.prod(dims)
+ permutation = [self.channels_dim] + [i for i in range(self.num_dims) if i != self.channels_dim]
+ with torch.no_grad():
+
+ flatten = input.permute(*permutation).contiguous().view(self.num_channels, -1)
+ mean = flatten.mean(1).view(self.shape)
+ std = flatten.std(1).view(self.shape)
+
+ num_gpus = get_world_size()
+ mean = reduce_tensor(mean, num_gpus)
+ std = reduce_tensor(std, num_gpus)
+ self.loc.data.copy_(-mean)
+ self.scale.data.copy_(1 / (std + 1e-6))
+ self.initialized.fill_(1)
+ self.num_elements.fill_(num_elems)
+
+ def forward(self, input):
+ log_abs = torch.log(torch.abs(self.scale))
+ logdet = self.num_elements * torch.sum(log_abs)
+ return self.scale * (input + self.loc), logdet
+
+ @torch.jit.export
+ def reverse(self, output):
+ return output / self.scale - self.loc
+
+
+class InvConv2d(nn.Module):
+ def __init__(self, in_channel):
+ super().__init__()
+
+ weight = torch.randn(in_channel, in_channel)
+ q, _ = torch.qr(weight)
+ weight = q.unsqueeze(2).unsqueeze(3)
+ self.weight = nn.Parameter(weight)
+
+ def forward(self, input):
+ _, _, height, width = input.shape
+
+ out = F.conv2d(input, self.weight)
+ logdet = (
+ height * width * torch.slogdet(self.weight.squeeze().double())[1].float()
+ )
+
+ return out, logdet
+
+ def reverse(self, output):
+ return F.conv2d(
+ output, self.weight.squeeze().inverse().unsqueeze(2).unsqueeze(3)
+ )
+
+
+class InvConv2dLU(nn.Module):
+ def __init__(self, in_channel):
+ super().__init__()
+
+ weight = np.random.randn(in_channel, in_channel)
+ q, _ = la.qr(weight)
+ w_p, w_l, w_u = la.lu(q.astype(np.float32))
+ w_s = np.diag(w_u)
+ w_u = np.triu(w_u, 1)
+ u_mask = np.triu(np.ones_like(w_u), 1)
+ l_mask = u_mask.T
+
+ w_p = torch.from_numpy(w_p)
+ w_l = torch.from_numpy(w_l).contiguous()
+ w_s = torch.from_numpy(w_s)
+ w_u = torch.from_numpy(w_u)
+
+ self.register_buffer('w_p', w_p)
+ self.register_buffer('u_mask', torch.from_numpy(u_mask))
+ self.register_buffer('l_mask', torch.from_numpy(l_mask))
+ self.register_buffer('s_sign', torch.sign(w_s))
+ self.register_buffer('l_eye', torch.eye(l_mask.shape[0]))
+ self.w_l = nn.Parameter(w_l)
+ self.w_s = nn.Parameter(torch.log(torch.abs(w_s)))
+ self.w_u = nn.Parameter(w_u)
+
+ def forward(self, input):
+ _, _, height, width = input.shape
+
+ weight = self.calc_weight()
+
+ out = F.conv2d(input, weight)
+ logdet = height * width * torch.sum(self.w_s)
+
+ return out, logdet
+
+ def calc_weight(self):
+ weight = (
+ self.w_p
+ @ (self.w_l * self.l_mask + self.l_eye)
+ @ ((self.w_u * self.u_mask) + torch.diag(self.s_sign * torch.exp(self.w_s)))
+ )
+
+ return weight.unsqueeze(2).unsqueeze(3)
+
+ def reverse(self, output):
+ weight = self.calc_weight()
+ dtype = weight.dtype
+ weight = weight.float()
+ weight_inv = weight.squeeze().inverse().unsqueeze(2).unsqueeze(3)
+ weight_inv = weight_inv.to(dtype=dtype)
+
+ return F.conv2d(output, weight_inv)
+
+
+class GraphConv(nn.Module):
+ def __init__(self, in_channels, out_channels, num_atoms, num_edge_type=4):
+ super(GraphConv, self).__init__()
+
+ self.graph_linear_self = nn.Linear(in_channels, out_channels)
+ self.graph_linear_edge = nn.Linear(in_channels, out_channels * num_edge_type)
+ self.num_edge_type = num_edge_type
+ self.in_ch = in_channels
+ self.out_ch = out_channels
+ self.num_atoms = num_atoms
+
+ def forward(self, graph: Tuple[torch.Tensor, torch.Tensor]) -> torch.Tensor:
+ adj, nodes = graph
+ hs = self.graph_linear_self(nodes)
+ m = self.graph_linear_edge(nodes)
+ m = m.view(-1, self.num_atoms, self.out_ch, self.num_edge_type)
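+        # Typed message passing: hr[b, i, c] = sum_{e, j} adj[b, e, i, j] * m[b, j, c, e],
+        # i.e. neighbor features aggregated over edge types and adjacent atoms.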
+ hr = torch.einsum('bemn,bnce->bmc', adj, m)
+ hr = hr.unsqueeze(2)
+ return hs + hr
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/model/coupling.py b/PyTorch/DrugDiscovery/MoFlow/moflow/model/coupling.py
new file mode 100644
index 000000000..55a12c633
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/model/coupling.py
@@ -0,0 +1,196 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Copyright 2020 Chengxi Zang
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+
+
+from typing import Tuple
+import torch
+import torch.nn as nn
+from torch.nn.functional import logsigmoid
+
+from moflow.model.basic import GraphConv
+
+
+def sigmoid_inverse(x):
+ """Calculates 1/sigmoid(x) in a more numerically stable way"""
+ return 1 + torch.exp(-x)
+
+
+class AffineCoupling(nn.Module):
+ def __init__(self, in_channel, hidden_channels, mask_swap=False): # filter_size=512, --> hidden_channels =(512, 512)
+ super(AffineCoupling, self).__init__()
+
+ self.mask_swap=mask_swap
+ # self.norms_in = nn.ModuleList()
+ last_h = in_channel // 2
+ vh = tuple(hidden_channels)
+ layers = []
+ for h in vh:
+ layers.append(nn.Conv2d(last_h, h, kernel_size=3, padding=1))
+ layers.append(nn.BatchNorm2d(h))
+ layers.append(nn.ReLU(inplace=True))
+ last_h = h
+ layers.append(nn.Conv2d(last_h, in_channel, kernel_size=3, padding=1))
+ self.layers = nn.Sequential(*layers)
+
+ def forward(self, input: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+ in_a, in_b = input.chunk(2, 1) # (2,12,32,32) --> (2,6,32,32), (2,6,32,32)
+
+ if self.mask_swap:
+ in_a, in_b = in_b, in_a
+
+ s_logits, t = self._s_t_function(in_a)
+ s = torch.sigmoid(s_logits)
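+ # affine transform of the second half, conditioned on the first half;
+ # log|det J| = sum(log s) = sum(logsigmoid(s_logits))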
+ out_b = (in_b + t) * s
+ logdet = torch.sum(logsigmoid(s_logits).reshape(input.shape[0], -1), 1)
+
+ if self.mask_swap:
+ result = torch.cat([out_b, in_a], 1)
+ else:
+ result = torch.cat([in_a, out_b], 1)
+
+ return result, logdet
+
+ @torch.jit.export
+ def reverse(self, output: torch.Tensor) -> torch.Tensor:
+ out_a, out_b = output.chunk(2, 1)
+ if self.mask_swap:
+ out_a, out_b = out_b, out_a
+
+ s_logits, t = self._s_t_function(out_a)
+ s_inverse = sigmoid_inverse(s_logits)
+ in_b = out_b * s_inverse - t
+
+ if self.mask_swap:
+ result = torch.cat([in_b, out_a], 1)
+ else:
+ result = torch.cat([out_a, in_b], 1)
+
+ return result
+
+ def _s_t_function(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+ h = self.layers(x)
+ s_logits, t = h.chunk(2, 1)
+ return s_logits, t
+
+
+class ConvCouplingBlock(nn.Module):
+ def __init__(self, in_dim: int, out_dim: int, n_node: int) -> None:
+ super().__init__()
+ self.graph_conv = GraphConv(in_dim, out_dim, n_node)
+ self.bn = nn.BatchNorm2d(n_node)
+ self.relu = nn.ReLU(inplace=True)
+
+ def forward(self, graph: Tuple[torch.Tensor, torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]:
+ adj, nodes = graph
+ h = self.graph_conv(graph)
+ h = h.to(memory_format=torch.channels_last)
+ h = self.bn(h)
+ h = self.relu(h)
+ return adj, h
+
+
+class LinCouplingBlock(nn.Module):
+ def __init__(self, in_dim: int, out_dim: int, n_node: int) -> None:
+ super().__init__()
+ self.lin = nn.Linear(in_dim, out_dim)
+ self.bn = nn.BatchNorm2d(n_node)
+ self.relu = nn.ReLU(inplace=True)
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ h = self.lin(x)
+ h = h.to(memory_format=torch.channels_last)
+ h = self.bn(h)
+ h = self.relu(h)
+ return h
+
+
+class GraphAffineCoupling(nn.Module):
+ def __init__(self, n_node, in_dim, hidden_dim_dict, masked_row):
+ super(GraphAffineCoupling, self).__init__()
+ self.n_node = n_node
+ self.in_dim = in_dim
+ self.hidden_dim_dict = hidden_dim_dict
+ self.masked_row = masked_row
+
+ self.hidden_dim_gnn = hidden_dim_dict['gnn']
+ self.hidden_dim_linear = hidden_dim_dict['linear']
+
+ conv_layers = []
+ last_dim = in_dim
+ for out_dim in self.hidden_dim_gnn:
+ conv_layers.append(ConvCouplingBlock(last_dim, out_dim, n_node))
+ last_dim = out_dim
+ self.net_conv = nn.ModuleList(conv_layers)
+
+ lin_layers = []
+ for out_dim in self.hidden_dim_linear:
+ lin_layers.append(LinCouplingBlock(last_dim, out_dim, n_node))
+ last_dim = out_dim
+ lin_layers.append(nn.Linear(last_dim, in_dim*2))
+ self.net_lin = nn.Sequential(*lin_layers)
+
+ mask = torch.ones(n_node, in_dim)
+ mask[masked_row, :] = 0 # masked_row are kept same, and used for _s_t for updating the left rows
+ self.register_buffer('mask', mask)
+
+ def forward(self, graph: Tuple[torch.Tensor, torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]:
+ adj, input = graph
+ masked_x = self.mask * input
+ masked_x_sq = masked_x.unsqueeze(2)
+ s_logits, t = self._s_t_function((adj, masked_x_sq))
+ s = torch.sigmoid(s_logits)
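+ # rows selected by the mask pass through unchanged and parameterize the affine update of the remaining rows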
+ out = masked_x + (1-self.mask) * (input + t) * s
+ logdet = torch.sum(logsigmoid(s_logits).reshape(input.shape[0], -1), 1)
+ return out, logdet
+
+ @torch.jit.export
+ def reverse(self, graph: Tuple[torch.Tensor, torch.Tensor]) -> torch.Tensor:
+ adj, output = graph
+ masked_y = self.mask * output
+ masked_y_sq = masked_y.unsqueeze(2)
+ s_logits, t = self._s_t_function((adj, masked_y_sq))
+ s_inverse = sigmoid_inverse(s_logits)
+ input = masked_y + (1 - self.mask) * (output * s_inverse - t)
+ return input
+
+ def _s_t_function(self, graph: Tuple[torch.Tensor, torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]:
+ for l in self.net_conv:
+ graph = l(graph)
+ adj, h = graph
+ h = self.net_lin(h)
+ h = h.squeeze(2)
+ s_logits, t = h.chunk(2, dim=-1)
+
+ return s_logits, t
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/model/glow.py b/PyTorch/DrugDiscovery/MoFlow/moflow/model/glow.py
new file mode 100644
index 000000000..eaa69bf84
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/model/glow.py
@@ -0,0 +1,270 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Copyright 2020 Chengxi Zang
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+
+
+from typing import Tuple
+import torch
+import torch.nn as nn
+
+from moflow.model.basic import ActNorm, InvConv2dLU, InvConv2d
+from moflow.model.coupling import AffineCoupling, GraphAffineCoupling
+
+
+class Flow(nn.Module):
+ def __init__(self, in_channel, hidden_channels, conv_lu=2, mask_swap=False):
+ super(Flow, self).__init__()
+
+ # More stable to support more flows
+ self.actnorm = ActNorm(num_channels=in_channel, num_dims=4)
+
+ if conv_lu == 0:
+ self.invconv = InvConv2d(in_channel)
+ elif conv_lu == 1:
+ self.invconv = InvConv2dLU(in_channel)
+ elif conv_lu == 2:
+ self.invconv = None
+ else:
+ raise ValueError("conv_lu in {0,1,2}, 0:InvConv2d, 1:InvConv2dLU, 2:none-just swap to update in coupling")
+
+ self.coupling = AffineCoupling(in_channel, hidden_channels, mask_swap=mask_swap)
+
+ def forward(self, input: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+ out, logdet = self.actnorm(input)
+ if self.invconv is not None:
+ out, det1 = self.invconv(out)
+ else:
+ det1 = 0
+ out, det2 = self.coupling(out)
+
+ logdet = logdet + det1
+ if det2 is not None:
+ logdet = logdet + det2
+
+ return out, logdet
+
+ @torch.jit.export
+ def reverse(self, output: torch.Tensor) -> torch.Tensor:
+ input = self.coupling.reverse(output)
+ if self.invconv is not None:
+ input = self.invconv.reverse(input)
+ input = self.actnorm.reverse(input)
+
+ return input
+
+
+class FlowOnGraph(nn.Module):
+ def __init__(self, n_node, in_dim, hidden_dim_dict, masked_row):
+ super(FlowOnGraph, self).__init__()
+ self.n_node = n_node
+ self.in_dim = in_dim
+ self.hidden_dim_dict = hidden_dim_dict
+ self.masked_row = masked_row
+ self.actnorm = ActNorm(num_channels=n_node, num_dims=3)
+ self.coupling = GraphAffineCoupling(n_node, in_dim, hidden_dim_dict, masked_row)
+
+ def forward(self, graph: Tuple[torch.Tensor, torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]:
+ adj, input = graph
+ out, logdet = self.actnorm(input)
+ det1 = 0
+ out, det2 = self.coupling((adj, out))
+
+ logdet = logdet + det1
+ if det2 is not None:
+ logdet = logdet + det2
+ return out, logdet
+
+ @torch.jit.export
+ def reverse(self, graph: Tuple[torch.Tensor, torch.Tensor]) -> torch.Tensor:
+ adj, output = graph
+ input = self.coupling.reverse((adj, output))
+ input = self.actnorm.reverse(input)
+ return input
+
+
+class Block(nn.Module):
+ def __init__(self, in_channel, n_flow, squeeze_fold, hidden_channels, conv_lu=2):
+ super(Block, self).__init__()
+ self.squeeze_fold = squeeze_fold
+ squeeze_dim = in_channel * self.squeeze_fold * self.squeeze_fold
+
+ self.flows = nn.ModuleList()
+ for i in range(n_flow):
+ if conv_lu in (0, 1):
+ self.flows.append(Flow(squeeze_dim, hidden_channels,
+ conv_lu=conv_lu, mask_swap=False))
+ else:
+ self.flows.append(Flow(squeeze_dim, hidden_channels,
+ conv_lu=2, mask_swap=bool(i % 2)))
+
+ def forward(self, input: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+ out = self._squeeze(input)
+ logdet = 0
+
+ for flow in self.flows:
+ out, det = flow(out)
+ logdet = logdet + det
+
+ out = self._unsqueeze(out)
+ return out, logdet
+
+ @torch.jit.export
+ def reverse(self, output: torch.Tensor) -> torch.Tensor:
+ input = self._squeeze(output)
+
+ for flow in self.flows[::-1]:
+ input = flow.reverse(input)
+
+ unsqueezed = self._unsqueeze(input)
+ return unsqueezed
+
+ def _squeeze(self, x: torch.Tensor) -> torch.Tensor:
+ """Trade spatial extent for channels. In forward direction, convert each
+ 1x4x4 volume of input into a 4x1x1 volume of output.
+
+ Args:
+ x (torch.Tensor): Input to squeeze or unsqueeze.
+ reverse (bool): Reverse the operation, i.e., unsqueeze.
+
+ Returns:
+ x (torch.Tensor): Squeezed or unsqueezed tensor.
+ """
+ assert len(x.shape) == 4
+ b_size, n_channel, height, width = x.shape
+ fold = self.squeeze_fold
+
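+ # split each spatial dimension into (dim // fold, fold) and move the two fold factors into the channel dimension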
+ squeezed = x.view(b_size, n_channel, height // fold, fold, width // fold, fold)
+ squeezed = squeezed.permute(0, 1, 3, 5, 2, 4).contiguous()
+ out = squeezed.view(b_size, n_channel * fold * fold, height // fold, width // fold)
+ return out
+
+ def _unsqueeze(self, x: torch.Tensor) -> torch.Tensor:
+ assert len(x.shape) == 4
+ b_size, n_channel, height, width = x.shape
+ fold = self.squeeze_fold
+ unsqueezed = x.view(b_size, n_channel // (fold * fold), fold, fold, height, width)
+ unsqueezed = unsqueezed.permute(0, 1, 4, 2, 5, 3).contiguous()
+ out = unsqueezed.view(b_size, n_channel // (fold * fold), height * fold, width * fold)
+ return out
+
+
+class BlockOnGraph(nn.Module):
+ def __init__(self, n_node, in_dim, hidden_dim_dict, n_flow, mask_row_size=1, mask_row_stride=1):
+ super(BlockOnGraph, self).__init__()
+ assert 0 < mask_row_size < n_node
+ self.flows = nn.ModuleList()
+ for i in range(n_flow):
+ start = i * mask_row_stride
+ masked_row = [r % n_node for r in range(start, start + mask_row_size)]
+ self.flows.append(FlowOnGraph(n_node, in_dim, hidden_dim_dict, masked_row=masked_row))
+
+ def forward(self, graph: Tuple[torch.Tensor, torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]:
+ adj, input = graph
+ out = input
+ logdet = 0
+ for flow in self.flows:
+ out, det = flow((adj, out))
+ logdet = logdet + det
+ return out, logdet
+
+ @torch.jit.export
+ def reverse(self, graph: Tuple[torch.Tensor, torch.Tensor]) -> torch.Tensor:
+ adj, output = graph
+ input = output
+ for flow in self.flows[::-1]:
+ input = flow.reverse((adj, input))
+ return input
+
+
+class Glow(nn.Module):
+ def __init__(self, in_channel, n_flow, n_block, squeeze_fold, hidden_channel, conv_lu=2):
+ super(Glow, self).__init__()
+
+ self.blocks = nn.ModuleList()
+ n_channel = in_channel
+ for i in range(n_block):
+ self.blocks.append(Block(n_channel, n_flow, squeeze_fold, hidden_channel, conv_lu=conv_lu))
+
+ def forward(self, input: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+ logdet = 0
+ out = input
+
+ for block in self.blocks:
+ out, det = block(out)
+ logdet = logdet + det
+
+ return out, logdet
+
+ @torch.jit.export
+ def reverse(self, z: torch.Tensor) -> torch.Tensor:
+ h = z
+ for i, block in enumerate(self.blocks[::-1]):
+ h = block.reverse(h)
+
+ return h
+
+
+class GlowOnGraph(nn.Module):
+ def __init__(self, n_node, in_dim, hidden_dim_dict, n_flow, n_block,
+ mask_row_size_list=(2,), mask_row_stride_list=(1,)):
+ super(GlowOnGraph, self).__init__()
+
+ assert len(mask_row_size_list) == n_block or len(mask_row_size_list) == 1
+ assert len(mask_row_stride_list) == n_block or len(mask_row_stride_list) == 1
+ if len(mask_row_size_list) == 1:
+ mask_row_size_list = mask_row_size_list * n_block
+ if len(mask_row_stride_list) == 1:
+ mask_row_stride_list = mask_row_stride_list * n_block
+ self.blocks = nn.ModuleList()
+ for i in range(n_block):
+ mask_row_size = mask_row_size_list[i]
+ mask_row_stride = mask_row_stride_list[i]
+ self.blocks.append(BlockOnGraph(n_node, in_dim, hidden_dim_dict, n_flow, mask_row_size, mask_row_stride))
+
+ def forward(self, adj: torch.Tensor, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+ logdet = 0
+ out = x
+ for block in self.blocks:
+ out, det = block((adj, out))
+ logdet = logdet + det
+ return out, logdet
+
+ @torch.jit.export
+ def reverse(self, graph: Tuple[torch.Tensor, torch.Tensor]) -> torch.Tensor:
+ adj, z = graph
+ input = z
+ for i, block in enumerate(self.blocks[::-1]):
+ input = block.reverse((adj, input))
+
+ return input
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/model/model.py b/PyTorch/DrugDiscovery/MoFlow/moflow/model/model.py
new file mode 100644
index 000000000..83e39950f
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/model/model.py
@@ -0,0 +1,251 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Copyright 2020 Chengxi Zang
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+
+
+import math
+import torch
+import torch.nn as nn
+
+from moflow.config import Config
+from moflow.model.glow import Glow, GlowOnGraph
+
+def gaussian_nll(x, mean, ln_var):
+ """Computes the negative log-likelihood of a Gaussian distribution.
+
+ Given two variables ``mean`` representing :math:`\\mu` and ``ln_var``
+ representing :math:`\\log(\\sigma^2)`, this function computes, in an
+ elementwise manner, the negative log-likelihood of :math:`x` under a
+ Gaussian distribution :math:`N(\\mu, S)`,
+
+ .. math::
+
+ -\\log N(x; \\mu, \\sigma^2) =
+ \\log\\left(\\sqrt{(2\\pi)^D |S|}\\right) +
+ \\frac{1}{2}(x - \\mu)^\\top S^{-1}(x - \\mu),
+
+ where :math:`D` is a dimension of :math:`x` and :math:`S` is a diagonal
+ matrix where :math:`S_{ii} = \\sigma_i^2`.
+
+ Args:
+ x: Input variable.
+ mean: Mean of a Gaussian distribution, :math:`\\mu`.
+ ln_var: Logarithm of variance of a Gaussian distribution,
+ :math:`\\log(\\sigma^2)`.
+
+ Returns:
+ torch.Tensor:
+ Negative log-likelihood.
+ """
+
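+ # elementwise NLL: 0.5 * (ln_var + log(2*pi)) + 0.5 * (x - mean)^2 * exp(-ln_var)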
+ x_prec = torch.exp(-ln_var)
+ x_diff = x - mean
+ x_power = (x_diff * x_diff) * x_prec * -0.5
+ loss = (ln_var + math.log(2 * (math.pi))) / 2 - x_power
+ return loss
+
+
+class MoFlowLoss(nn.Module):
+ def __init__(self, config: Config) -> None:
+ super().__init__()
+ self.b_n_type = config.num_edge_features
+ self.a_n_node = config.max_num_nodes
+ self.a_n_type = config.num_node_features
+ self.b_size = self.a_n_node * self.a_n_node * self.b_n_type
+ self.a_size = self.a_n_node * self.a_n_type
+
+ if config.model_config.learn_dist:
+ self.ln_var = nn.Parameter(torch.zeros(1))
+ else:
+ self.register_buffer('ln_var', torch.zeros(1))
+
+ def forward(self, h, adj_h, sum_log_det_jacs_x, sum_log_det_jacs_adj):
+ z = [h, adj_h]
+ logdet = [sum_log_det_jacs_x, sum_log_det_jacs_adj]
+
+ device = z[0].device
+ dtype = z[0].dtype
+ z[0] = z[0].reshape(z[0].shape[0],-1)
+ z[1] = z[1].reshape(z[1].shape[0], -1)
+
+ logdet[0] = logdet[0] - self.a_size * math.log(2.)
+ logdet[1] = logdet[1] - self.b_size * math.log(2.)
+ ln_var_adj = self.ln_var * torch.ones([self.b_size], device=device, dtype=dtype)
+ ln_var_x = self.ln_var * torch.ones([self.a_size], device=device, dtype=dtype)
+ nll_adj = torch.mean(
+ torch.sum(gaussian_nll(z[1], torch.zeros(self.b_size, device=device, dtype=dtype), ln_var_adj), dim=1)
+ - logdet[1])
+ nll_adj = nll_adj / (self.b_size * math.log(2.)) # the negative log likelihood per dim with log base 2
+
+ nll_x = torch.mean(torch.sum(
+ gaussian_nll(z[0], torch.zeros(self.a_size, device=device, dtype=dtype), ln_var_x),
+ dim=1) - logdet[0])
+ nll_x = nll_x / (self.a_size * math.log(2.)) # the negative log likelihood per dim with log base 2
+
+ return nll_x, nll_adj
+
+
+class MoFlow(nn.Module):
+ def __init__(self, config: Config):
+ super(MoFlow, self).__init__()
+ self.config = config
+ self.b_n_type = config.num_edge_features
+ self.a_n_node = config.max_num_nodes
+ self.a_n_type = config.num_node_features
+ self.b_size = self.a_n_node * self.a_n_node * self.b_n_type
+ self.a_size = self.a_n_node * self.a_n_type
+ self.noise_scale = config.model_config.noise_scale
+
+ self.bond_model = Glow(
+ in_channel=self.b_n_type,
+ n_flow=config.model_config.bond_config.n_flow,
+ n_block=config.model_config.bond_config.n_block,
+ squeeze_fold=config.model_config.bond_config.n_squeeze,
+ hidden_channel=config.model_config.bond_config.hidden_ch,
+ conv_lu=config.model_config.bond_config.conv_lu
+ )
+
+ self.atom_model = GlowOnGraph(
+ n_node=self.a_n_node,
+ in_dim=self.a_n_type,
+ hidden_dim_dict={
+ 'gnn': config.model_config.atom_config.hidden_gnn,
+ 'linear': config.model_config.atom_config.hidden_lin
+ },
+ n_flow=config.model_config.atom_config.n_flow,
+ n_block=config.model_config.atom_config.n_block,
+ mask_row_size_list=config.model_config.atom_config.mask_row_size_list,
+ mask_row_stride_list=config.model_config.atom_config.mask_row_stride_list,
+ )
+
+ self._cuda_graphs = dict()
+ self.atom_stream = None
+ self.bond_stream = None
+
+ @torch.jit.ignore
+ def forward(self, adj: torch.Tensor, x: torch.Tensor, with_cuda_graph: bool = False):
+ """
+ :param adj: (256,4,9,9)
+ :param x: (256,9,5)
+ :return:
+ """
+ if with_cuda_graph and self.atom_stream is None:
+ self.atom_stream = torch.cuda.Stream()
+ self.bond_stream = torch.cuda.Stream()
+ h = x
+ # add uniform noise to node feature matrices
+ if self.training:
+ if self.noise_scale == 0:
+ h = h/2.0 - 0.5 + torch.rand_like(x) * 0.4
+ else:
+ h = h + torch.rand_like(x) * self.noise_scale
+ if with_cuda_graph:
+ if self.atom_model not in self._cuda_graphs:
+ h, sum_log_det_jacs_x = self._forward_graph(self.atom_model, adj, h)
+ else:
+ self.atom_stream.wait_stream(torch.cuda.current_stream())
+ with torch.cuda.stream(self.atom_stream):
+ h, sum_log_det_jacs_x = self._forward_graph(self.atom_model, adj, h)
+ else:
+ h, sum_log_det_jacs_x = self.atom_model(adj, h)
+
+ # add uniform noise to adjacency tensors
+ if self.training:
+ if self.noise_scale == 0:
+ adj_bond = adj/2.0 - 0.5 + torch.rand_like(adj) * 0.4
+ else:
+ adj_bond = adj + torch.rand_like(adj) * self.noise_scale
+ else:
+ adj_bond = adj
+ if with_cuda_graph:
+ if self.bond_model not in self._cuda_graphs:
+ adj_h, sum_log_det_jacs_adj = self._forward_graph(self.bond_model, adj_bond)
+ else:
+ self.bond_stream.wait_stream(torch.cuda.current_stream())
+ with torch.cuda.stream(self.bond_stream):
+ adj_h, sum_log_det_jacs_adj = self._forward_graph(self.bond_model, adj_bond)
+ else:
+ adj_h, sum_log_det_jacs_adj = self.bond_model(adj_bond)
+ if with_cuda_graph:
+ torch.cuda.current_stream().wait_stream(self.atom_stream)
+ torch.cuda.current_stream().wait_stream(self.bond_stream)
+ return h, adj_h, sum_log_det_jacs_x, sum_log_det_jacs_adj
+
+ @torch.jit.export
+ def reverse(self, z):
+ """
+ Returns a batch of molecules, given their latent vectors.
+ :param z: latent vector. Shape: [B, N*N*M + N*T]
+ B = Batch size, N = number of atoms, M = number of bond types,
+ T = number of atom types (Carbon, Oxygen etc.)
+ :return: adjacency matrix and feature matrix of a molecule
+ """
+ batch_size = z.shape[0]
+ z_x = z[:, :self.a_size]
+ z_adj = z[:, self.a_size:]
+
+ h_adj = z_adj.reshape(batch_size, self.b_n_type, self.a_n_node, self.a_n_node)
+ h_adj = h_adj.to(memory_format=torch.channels_last)
+ h_adj = self.bond_model.reverse(h_adj)
+
+ if self.noise_scale == 0:
+ h_adj = (h_adj + 0.5) * 2
+ adj = h_adj
+ adj = adj + adj.permute(0, 1, 3, 2)
+ adj = adj / 2
+ adj = adj.softmax(dim=1)
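+ # binarize: after dividing by the per-entry maximum over bond types, only the strongest bond type floors to 1, the rest to 0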
+ max_bond = adj.max(dim=1).values.reshape(batch_size, -1, self.a_n_node, self.a_n_node)
+ adj = torch.floor(adj / max_bond)
+
+ adj = adj.to(memory_format=torch.channels_last)
+ h_x = z_x.reshape(batch_size, self.a_n_node, self.a_n_type)
+ h_x = self.atom_model.reverse((adj, h_x))
+ if self.noise_scale == 0:
+ h_x = (h_x + 0.5) * 2
+ return adj, h_x
+
+ @torch.jit.ignore
+ def _forward_graph(self, model, *args):
+ if model not in self._cuda_graphs:
+ if torch.distributed.is_initialized():
+ torch.distributed.barrier()
+ torch.cuda.synchronize()
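+ # capture the module with CUDA graphs on first use; later calls replay the captured graph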
+ self._cuda_graphs[model] = torch.cuda.make_graphed_callables(
+ model,
+ args,
+ )
+ torch.cuda.synchronize()
+ if torch.distributed.is_initialized():
+ torch.distributed.barrier()
+ return self._cuda_graphs[model](*args)
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/model/utils.py b/PyTorch/DrugDiscovery/MoFlow/moflow/model/utils.py
new file mode 100644
index 000000000..6f9233040
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/model/utils.py
@@ -0,0 +1,42 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import logging
+from typing import Iterable
+import torch
+
+def initialize_module(module: torch.nn.Module, inputs: Iterable[torch.Tensor]) -> None:
+ """Use given sample input to initialize the module.
+ Module must implement method called `initialize` which takes list of input tensors
+ """
+ assert hasattr(module, 'initialize')
+ assert len(inputs) == 1, f'{len(inputs)} inputs'
+ assert module.initialized.item() == 0, 'module already initialized'
+ module.initialize(*inputs)
+ assert module.initialized.item() == 1, 'not initialized'
+
+
+def initialize(model: torch.nn.Module, single_batch: Iterable[torch.Tensor]) -> None:
+ """Initialize all sub-modules in the model given the sample input batch."""
+ hooks = []
+ for name, module in model.named_modules():
+ if hasattr(module, 'initialize'):
+ logging.info(f'marking {name} for initialization')
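+ # initialize_module matches the (module, inputs) signature expected of a forward pre-hook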
+ hook = module.register_forward_pre_hook(initialize_module)
+ hooks.append(hook)
+ _ = model(*single_batch)
+ logging.info('all modules initialized, removing hooks')
+ for hook in hooks:
+ hook.remove()
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/__init__.py b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/arguments.py b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/arguments.py
new file mode 100644
index 000000000..9aa610cbc
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/arguments.py
@@ -0,0 +1,69 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import argparse
+import os
+
+from moflow.config import CONFIGS
+from moflow.runtime.logger import LOGGING_LEVELS
+
+
+PARSER = argparse.ArgumentParser()
+PARSER.add_argument('--data_dir', type=str, default='/data', help='Location for the dataset.')
+PARSER.add_argument('--config_name', type=str, default='zinc250k', choices=list(CONFIGS),
+ help='The config to choose. This parameter allows one to switch between different datasets '
+ 'and their dedicated configurations of the neural network. By default, a pre-defined "zinc250k" config is used.')
+PARSER.add_argument('--results_dir', type=str, default='/results', help='Directory where checkpoints are stored.')
+PARSER.add_argument('--predictions_path', type=str, default='/results/predictions.smi',
+ help='Path to store generated molecules. If an empty string is provided, predictions will not be '
+ 'saved (useful for benchmarking and debugging).')
+PARSER.add_argument('--log_path', type=str, default=None,
+ help='Path for DLLogger log. This file will contain information about the speed and '
+ 'accuracy of the model during training and inference. Note that if the file '
+ 'already exists, new logs will be added at the end.')
+PARSER.add_argument('--log_interval', type=int, default=20, help='Frequency for writing logs, expressed in steps.')
+PARSER.add_argument('--warmup_steps', type=int, default=20,
+ help='Number of warmup steps. This value is used for benchmarking and for CUDA graph capture.')
+PARSER.add_argument('--steps', type=int, default=-1,
+ help='Number of steps used for training/inference. This parameter allows finishing '
+ 'training earlier than the specified number of epochs. If used with inference, '
+ 'it allows generating more molecules (by default only a single batch of molecules is generated).')
+PARSER.add_argument('--save_epochs', type=int, default=5,
+ help='Frequency for saving checkpoints, expressed in epochs. If -1 is provided, checkpoints will not be saved.')
+PARSER.add_argument('--eval_epochs', type=int, default=5,
+ help='Evaluation frequency, expressed in epochs. If -1 is provided, an evaluation will not be performed.')
+PARSER.add_argument('--learning_rate', type=float, default=0.0005, help='Base learning rate.')
+PARSER.add_argument('--beta1', type=float, default=0.9, help='beta1 parameter for the optimizer.')
+PARSER.add_argument('--beta2', type=float, default=0.99, help='beta2 parameter for the optimizer.')
+PARSER.add_argument('--clip', type=float, default=1, help='Gradient clipping norm.')
+PARSER.add_argument('--epochs', type=int, default=300,
+ help='Number of training epochs. Note that you can finish training mid-epoch by using "--steps" flag.')
+PARSER.add_argument('--batch_size', type=int, default=512, help='Batch size per GPU.')
+PARSER.add_argument('--num_workers', type=int, default=4, help='Number of workers in the data loader.')
+PARSER.add_argument('--seed', type=int, default=1, help='Random seed used to initialize the distributed loaders.')
+PARSER.add_argument('--local_rank', default=os.environ.get('LOCAL_RANK', 0), type=int,
+ help='rank of the GPU, used to launch distributed training. This argument is specified '
+ 'automatically by `torchrun` and does not have to be provided by the user.')
+PARSER.add_argument('--temperature', type=float, default=0.3, help='Temperature used for sampling.')
+PARSER.add_argument('--val_batch_size', type=int, default=100, help='Number of molecules to generate during validation step.')
+PARSER.add_argument('--allow_untrained', action='http://github.com/store_true',
+ help='Allow sampling molecules from an untrained network. Useful for performance benchmarking or debugging purposes.')
+PARSER.add_argument('--correct_validity', action='http://github.com/store_true', help='Apply validity correction after the generation of the molecules.')
+PARSER.add_argument('--amp', action='http://github.com/store_true', help='Use Automatic Mixed Precision.')
+PARSER.add_argument('--cuda_graph', action='http://github.com/store_true', help='Capture GPU kernels with CUDA graphs. This option can speed up training.')
+PARSER.add_argument('--jit', action='http://github.com/store_true', help='Compile the model with `torch.jit.script`. Can be used to speed up training or inference.')
+PARSER.add_argument('--verbosity', type=int, default=1, choices=list(LOGGING_LEVELS),
+ help='Verbosity level. Specify the following values: 0, 1, 2, 3, where 0 means minimal '
+ 'verbosity (errors only) and 3 - maximal (debugging).')
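+
+# Illustrative invocations (paths, GPU count and flag choices are examples, not requirements):
+#   single GPU:  python moflow/runtime/train.py --config_name zinc250k --amp
+#   multi GPU:   torchrun --nproc_per_node=8 moflow/runtime/train.py --config_name zinc250k --amp --cuda_graph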
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/common.py b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/common.py
new file mode 100644
index 000000000..1a31c4d76
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/common.py
@@ -0,0 +1,93 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from glob import glob
+import logging
+import os
+from typing import List, Optional, Tuple
+import torch
+
+from moflow.model.model import MoFlow
+
+
+CHECKPOINT_PATTERN = 'model_snapshot_epoch_%s'
+
+
+def _sort_checkpoints(paths: List[str]) -> List[str]:
+ return sorted(paths, key=lambda x: int(x.split('_')[-1]))
+
+
+def save_state(dir: str, model: MoFlow, optimizer: torch.optim.Optimizer, ln_var: float, epoch: int, keep: int = 1) -> None:
+ """Save training state in a given dir. This checkpoint can be used to resume training or run inference
+ with the trained model. This function will keep up to newest checkpoints and remove the oldest ones.
+ """
+ save_path = os.path.join(dir, CHECKPOINT_PATTERN % (epoch + 1))
+ state = {
+ 'model': model.state_dict(),
+ 'optimizer': optimizer.state_dict(),
+ 'ln_var': ln_var,
+ 'epoch': epoch,
+ }
+ torch.save(state, save_path)
+
+ if keep > 0:
+ filenames = glob(os.path.join(dir, CHECKPOINT_PATTERN % '*'))
+ if len(filenames) <= keep:
+ return
+
+ to_del = _sort_checkpoints(filenames)[:-keep]
+ for path in to_del:
+ os.remove(path)
+
+
+def load_state(path: str, model: MoFlow, device: torch.device, optimizer: Optional[torch.optim.Optimizer] = None) -> Tuple[int, float]:
+ """Load model's and optimizer's state from a given file.
+ This function returns the number of epochs the model was trained for and natural logarithm of variance
+ the for the distribution of the latent space.
+ """
+ state = torch.load(path, map_location=device)
+ model.load_state_dict(state['model'])
+ if optimizer is not None:
+ optimizer.load_state_dict(state['optimizer'])
+ return state['epoch'], state['ln_var']
+
+
+def get_newest_checkpoint(model_dir: str, validate: bool = True) -> Optional[str]:
+ """Find the newest checkpoint in a given directory.
+ If validate is set to True, this function also verifies that the file can be loaded and
+ falls back to an older checkpoint if necessary.
+ """
+ filenames = glob(os.path.join(model_dir, CHECKPOINT_PATTERN % '*'))
+ if len(filenames) == 0:
+ logging.info('No checkpoints available')
+ return None
+
+ paths = _sort_checkpoints(filenames)
+ if validate:
+ for latest_path in paths[::-1]:
+ try:
+ torch.load(latest_path, map_location='cpu')
+ break
+ except Exception:
+ logging.info(f'Checkpoint {latest_path} is corrupted')
+ else:
+ logging.info('All available checkpoints were corrupted')
+ return None
+
+ else:
+ latest_path = paths[-1]
+
+ logging.info(f'Found checkpoint {latest_path}')
+ return latest_path
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/distributed_utils.py b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/distributed_utils.py
new file mode 100644
index 000000000..67ca67e16
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/distributed_utils.py
@@ -0,0 +1,71 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import logging
+import os
+
+import torch
+import torch.distributed as dist
+
+
+def get_device(local_rank: int) -> torch.device:
+ if torch.cuda.is_available():
+ torch.cuda.set_device(local_rank % torch.cuda.device_count())
+ device = torch.device("cuda")
+ else:
+ device = torch.device("cpu")
+ logging.warning("not using a(ny) GPU(s)!")
+ return device
+
+
+def get_world_size() -> int:
+ return int(os.environ.get("WORLD_SIZE", 1))
+
+
+def reduce_tensor(tensor: torch.Tensor, num_gpus: int) -> torch.Tensor:
+ if num_gpus > 1:
+ rt = tensor.clone()
+ dist.all_reduce(rt, op=dist.ReduceOp.SUM)
+ if rt.is_floating_point():
+ rt = rt / num_gpus
+ else:
+ rt = rt // num_gpus
+ return rt
+ return tensor
+
+
+def init_distributed() -> bool:
+ world_size = int(os.environ.get("WORLD_SIZE", 1))
+ distributed = world_size > 1
+ if distributed:
+ backend = "nccl" if torch.cuda.is_available() else "gloo"
+ os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0" # Needed for CUDA graphs
+ dist.init_process_group(backend=backend, init_method="env://")
+ assert dist.is_initialized()
+
+ if get_rank() == 0:
+ logging.info(f"Distributed initialized. World size: {world_size}")
+ return distributed
+
+
+def get_rank() -> int:
+ """
+ Gets distributed rank or returns zero if distributed is not initialized.
+ """
+ if torch.distributed.is_available() and torch.distributed.is_initialized():
+ rank = torch.distributed.get_rank()
+ else:
+ rank = 0
+ return rank
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/evaluate.py b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/evaluate.py
new file mode 100644
index 000000000..080b2444b
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/evaluate.py
@@ -0,0 +1,97 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from functools import partial
+import os
+
+import numpy as np
+import torch
+from torch.cuda.amp import autocast
+
+from moflow.config import CONFIGS
+from moflow.data import transform
+from moflow.data.data_loader import NumpyTupleDataset
+
+from moflow.model.model import MoFlow
+from moflow.utils import check_validity, convert_predictions_to_mols, predictions_to_smiles, check_novelty
+from moflow.runtime.arguments import PARSER
+from moflow.runtime.common import get_newest_checkpoint, load_state
+from moflow.runtime.distributed_utils import get_device
+from moflow.runtime.generate import infer
+from moflow.runtime.logger import MetricsLogger, setup_logging
+
+
+if __name__ == '__main__':
+ from rdkit import RDLogger
+ RDLogger.DisableLog('rdApp.*')
+
+ args = PARSER.parse_args()
+ logger = setup_logging(args)
+
+ snapshot_path = get_newest_checkpoint(args.results_dir)
+ config = CONFIGS[args.config_name]
+ model = MoFlow(config)
+
+ device = get_device(args.local_rank)
+ if snapshot_path is not None:
+ epoch, ln_var = load_state(snapshot_path, model, device=device)
+ elif args.allow_untrained:
+ epoch, ln_var = 0, 0
+ else:
+ raise RuntimeError('Generating molecules from an untrained network! '
+ 'If this was intentional, pass --allow_untrained flag.')
+ model.to(device)
+ model.eval()
+
+ if args.steps == -1:
+ args.steps = 1
+
+ acc_logger = MetricsLogger(logger)
+ valid_idx = transform.get_val_ids(config, args.data_dir)
+ dataset = NumpyTupleDataset.load(
+ os.path.join(args.data_dir, config.dataset_config.dataset_file),
+ transform=partial(transform.transform_fn, config=config),
+ )
+ train_idx = [t for t in range(len(dataset)) if t not in valid_idx]
+ n_train = len(train_idx)
+ train_dataset = torch.utils.data.Subset(dataset, train_idx)
+ train_x = torch.Tensor(np.array([a[0] for a in train_dataset]))
+ train_adj = torch.Tensor(np.array([a[1] for a in train_dataset]))
+
+ train_smiles = set(predictions_to_smiles(train_adj, train_x, config))
+
+
+ with autocast(enabled=args.amp):
+ for i in range(args.steps):
+ results = infer(
+ model, config, ln_var=ln_var, temp=args.temperature, batch_size=args.batch_size,
+ device=device)
+
+ mols_batch = convert_predictions_to_mols(*results, correct_validity=args.correct_validity)
+ validity_info = check_validity(mols_batch)
+ novel_r, abs_novel_r = check_novelty(validity_info['valid_smiles'], train_smiles, len(mols_batch))
+ _, nuv = check_novelty(list(set(validity_info['valid_smiles'])), train_smiles, len(mols_batch))
+ metrics = {
+ 'validity': validity_info['valid_ratio'],
+ 'novelty': novel_r,
+ 'uniqueness': validity_info['unique_ratio'],
+ 'abs_novelty': abs_novel_r,
+ 'abs_uniqueness': validity_info['abs_unique_ratio'],
+ 'nuv': nuv,
+ }
+
+ acc_logger.update(metrics)
+
+ acc_logger.summarize(step=tuple())
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/generate.py b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/generate.py
new file mode 100644
index 000000000..91b5446e2
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/generate.py
@@ -0,0 +1,97 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from typing import Optional, Tuple
+
+import numpy as np
+from torch.cuda.amp import autocast
+import torch
+
+from moflow.config import CONFIGS, Config
+
+from moflow.model.model import MoFlow
+from moflow.utils import convert_predictions_to_mols, postprocess_predictions
+from moflow.runtime.arguments import PARSER
+from moflow.runtime.common import get_newest_checkpoint, load_state
+from moflow.runtime.distributed_utils import get_device
+from moflow.runtime.logger import PerformanceLogger, setup_logging
+
+
+def infer(model: MoFlow, config: Config, device: torch.device, *,
+ ln_var: float = 0, temp: float = 0.6, mu: Optional[torch.Tensor] = None,
+ batch_size: int = 20) -> Tuple[np.ndarray, np.ndarray]:
+
+ if mu is None:
+ mu = torch.zeros(config.z_dim, dtype=torch.float32, device=device)
+
+ sigma = temp * np.sqrt(np.exp(ln_var))
+ with torch.no_grad():
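+ # sample latent vectors z ~ N(mu, sigma^2 I) and decode them into adjacency and atom-feature tensors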
+ z = torch.normal(mu.reshape(-1, config.z_dim).repeat((batch_size, 1)), sigma)
+ adj, x = model.reverse(z)
+ x, adj = postprocess_predictions(x, adj, config=config)
+
+ return adj, x
+
+
+if __name__ == '__main__':
+ from rdkit import RDLogger
+ RDLogger.DisableLog('rdApp.*')
+
+ args = PARSER.parse_args()
+ logger = setup_logging(args)
+ perf_logger = PerformanceLogger(logger, args.batch_size, args.warmup_steps, mode='generate')
+ if args.predictions_path:
+ from rdkit.Chem import SmilesWriter
+ smiles_writer = SmilesWriter(args.predictions_path)
+
+ snapshot_path = get_newest_checkpoint(args.results_dir)
+ config = CONFIGS[args.config_name]
+ model = MoFlow(config)
+
+ device = get_device(args.local_rank)
+ if snapshot_path is not None:
+ epoch, ln_var = load_state(snapshot_path, model, device=device)
+ elif args.allow_untrained:
+ epoch, ln_var = 0, 0
+ else:
+ raise RuntimeError('Generating molecules from an untrained network! '
+ 'If this was intentional, pass --allow_untrained flag.')
+ model.to(device=device, memory_format=torch.channels_last)
+ model.eval()
+ if args.jit:
+ model.atom_model = torch.jit.script(model.atom_model)
+ model.bond_model = torch.jit.script(model.bond_model)
+
+
+ if args.steps == -1:
+ args.steps = 1
+
+ with autocast(enabled=args.amp):
+ for i in range(args.steps):
+ perf_logger.update()
+ results = infer(
+ model, config, ln_var=ln_var, temp=args.temperature, batch_size=args.batch_size,
+ device=device)
+
+ if (i + 1) % args.log_interval == 0:
+ perf_logger.summarize(step=(0, i, i))
+ if args.predictions_path:
+ mols_batch = convert_predictions_to_mols(*results, correct_validity=args.correct_validity)
+ for mol in mols_batch:
+ smiles_writer.write(mol)
+
+ perf_logger.summarize(step=tuple())
+ if args.predictions_path:
+ smiles_writer.close()
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/logger.py b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/logger.py
new file mode 100644
index 000000000..0918b1036
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/logger.py
@@ -0,0 +1,123 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from abc import ABC, abstractmethod
+import logging
+import time
+
+import dllogger
+from dllogger import JSONStreamBackend, StdOutBackend, Verbosity
+import numpy as np
+
+
+LOGGING_LEVELS = dict(enumerate([logging.ERROR, logging.WARNING, logging.INFO, logging.DEBUG]))
+
+
+def get_dllogger(args):
+ backends = []
+ if args.local_rank == 0:
+ backends.append(StdOutBackend(Verbosity.VERBOSE))
+ if args.log_path is not None:
+ backends.append(JSONStreamBackend(Verbosity.VERBOSE, args.log_path, append=True))
+ dllogger.init(backends=backends)
+ return dllogger
+
+
+def setup_logging(args):
+ logging.basicConfig(
+ format='%(asctime)s %(levelname)s:\t%(message)s', datefmt='%H:%M:%S', level=LOGGING_LEVELS[args.verbosity], force=True
+ )
+ return get_dllogger(args)
+
+
+class BaseLogger(ABC):
+ @abstractmethod
+ def update(self, **kwargs) -> None:
+ pass
+
+ @abstractmethod
+ def process_stats(self) -> dict:
+ return {}
+
+ @abstractmethod
+ def reset(self) -> None:
+ pass
+
+ def summarize(self, step: tuple) -> None:
+ stats = self.process_stats()
+ if len(stats) == 0:
+ logging.warning('Empty stats for logging, skipping')
+ return
+ self.logger.log(step=step, data=stats)
+ self.logger.flush()
+
+
+class PerformanceLogger(BaseLogger):
+ def __init__(self, logger, batch_size: int, warmup_steps: int = 100, mode: str = 'train'):
+ self.logger = logger
+ self.batch_size = batch_size
+ self.warmup_steps = warmup_steps
+ self._step = 0
+ self._timestamps = []
+ self.mode = mode
+
+ def update(self, **kwargs) -> None:
+ self._step += 1
+ if self._step >= self.warmup_steps:
+ self._timestamps.append(time.time())
+
+ def reset(self) -> None:
+ self._step = 0
+ self._timestamps = []
+
+ def process_stats(self) -> dict:
+ if len(self._timestamps) < 2:
+ logging.warning('Cannot process performance stats - less than 2 measurements collected')
+ return {}
+
+ timestamps = np.asarray(self._timestamps)
+ deltas = np.diff(timestamps)
+ throughput = (self.batch_size / deltas).mean()
+ stats = {
+ f'throughput_{self.mode}': throughput,
+ f'latency_{self.mode}_mean': deltas.mean(),
+ f'total_time_{self.mode}': timestamps[-1] - timestamps[0],
+ }
+ for level in [90, 95, 99]:
+ stats.update({f'latency_{self.mode}_{level}': np.percentile(deltas, level)})
+
+ return stats
+
+
+class MetricsLogger(BaseLogger):
+ def __init__(self, logger, mode: str = 'train'):
+ self.logger = logger
+ self.mode = mode
+ self._metrics_dict = {}
+
+ def update(self, metrics: dict, **kwargs) -> None:
+ for metrics_name, metric_val in metrics.items():
+ if metrics_name not in self._metrics_dict:
+ self._metrics_dict[metrics_name] = []
+ self._metrics_dict[metrics_name].append(float(metric_val))
+
+ def reset(self) -> None:
+ self._metrics_dict = {}
+
+ def process_stats(self) -> dict:
+ stats = {}
+ for metric_name, metric_val in self._metrics_dict.items():
+ stats[metric_name] = np.mean(metric_val)
+ return stats
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/train.py b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/train.py
new file mode 100644
index 000000000..cd19c06bc
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/runtime/train.py
@@ -0,0 +1,298 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Copyright 2020 Chengxi Zang
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+
+
+import argparse
+import functools
+import json
+import logging
+import os
+import signal
+from typing import Dict
+
+from apex.contrib.clip_grad import clip_grad_norm_
+from apex.optimizers import FusedAdam as Adam
+import torch
+from torch.cuda.amp import autocast, GradScaler
+from torch.utils.data.distributed import DistributedSampler
+
+from moflow.config import CONFIGS, Config
+from moflow.data.data_loader import NumpyTupleDataset
+from moflow.data import transform
+from moflow.model.model import MoFlow, MoFlowLoss
+from moflow.model.utils import initialize
+from moflow.runtime.logger import MetricsLogger, PerformanceLogger, setup_logging
+from moflow.runtime.arguments import PARSER
+from moflow.runtime.common import get_newest_checkpoint, load_state, save_state
+from moflow.runtime.distributed_utils import (
+ get_device, get_rank, get_world_size, init_distributed, reduce_tensor
+)
+from moflow.runtime.generate import infer
+from moflow.utils import check_validity, convert_predictions_to_mols
+
+
+torch._C._jit_set_autocast_mode(True)
+
+
+def run_validation(model: MoFlow, config: Config, ln_var: float, args: argparse.Namespace,
+ is_distributed: bool, world_size: int, device: torch.device) -> Dict[str, float]:
+ model.eval()
+ if is_distributed:
+ model_callable = model.module
+ else:
+ model_callable = model
+ result = infer(model_callable, config, device=device, ln_var=ln_var, batch_size=args.val_batch_size,
+ temp=args.temperature)
+ mols = convert_predictions_to_mols(*result, correct_validity=args.correct_validity)
+ validity_info = check_validity(mols)
+ valid_ratio = torch.tensor(validity_info['valid_ratio'], dtype=torch.float32, device=device)
+ unique_ratio = torch.tensor(validity_info['unique_ratio'], dtype=torch.float32, device=device)
+ valid_value = reduce_tensor(valid_ratio, world_size).detach().cpu().numpy()
+ unique_value = reduce_tensor(unique_ratio, world_size).detach().cpu().numpy()
+ model.train()
+ return {'valid': valid_value, 'unique': unique_value}
+
+
+def train(args: argparse.Namespace) -> None:
+ os.makedirs(args.results_dir, exist_ok=True)
+
+ # Device configuration
+ device = get_device(args.local_rank)
+ torch.cuda.set_stream(torch.cuda.Stream())
+ is_distributed = init_distributed()
+ world_size = get_world_size()
+ local_rank = get_rank()
+
+ logger = setup_logging(args)
+ if local_rank == 0:
+ perf_logger = PerformanceLogger(logger, args.batch_size * world_size, args.warmup_steps)
+ acc_logger = MetricsLogger(logger)
+
+ if local_rank == 0:
+ logging.info('Input args:')
+ logging.info(json.dumps(vars(args), indent=4, separators=(',', ':')))
+
+ # Model configuration
+ assert args.config_name in CONFIGS
+ config = CONFIGS[args.config_name]
+ data_file = config.dataset_config.dataset_file
+ transform_fn = functools.partial(transform.transform_fn, config=config)
+ valid_idx = transform.get_val_ids(config, args.data_dir)
+
+ if local_rank == 0:
+ logging.info('Config:')
+ logging.info(str(config))
+ model = MoFlow(config)
+ model.to(device)
+ loss_module = MoFlowLoss(config)
+ loss_module.to(device)
+
+ # Datasets:
+ dataset = NumpyTupleDataset.load(
+ os.path.join(args.data_dir, data_file),
+ transform=transform_fn,
+ )
+ if len(valid_idx) == 0:
+ raise ValueError('Empty validation set!')
+ train_idx = [t for t in range(len(dataset)) if t not in valid_idx]
+ train = torch.utils.data.Subset(dataset, train_idx)
+ test = torch.utils.data.Subset(dataset, valid_idx)
+
+ if world_size > 1:
+ sampler = DistributedSampler(train, seed=args.seed, drop_last=False)
+ else:
+ sampler = None
+ train_dataloader = torch.utils.data.DataLoader(
+ train,
+ batch_size=args.batch_size,
+ shuffle=sampler is None,
+ sampler=sampler,
+ num_workers=args.num_workers,
+ drop_last=True,
+ )
+
+ if local_rank == 0:
+ logging.info(f'Using {world_size} GPUs')
+ logging.info(f'Num training samples: {len(train)}')
+ logging.info(f'Minibatch-size: {args.batch_size}')
+ logging.info(f'Num Iter/Epoch: {len(train_dataloader)}')
+ logging.info(f'Num epoch: {args.epochs}')
+
+ if is_distributed:
+ train_dataloader.sampler.set_epoch(-1)
+ x, adj, *_ = next(iter(train_dataloader))
+ x = x.to(device)
+ adj = adj.to(device)
+ with autocast(enabled=args.amp):
+ initialize(model, (adj, x))
+
+ model.to(memory_format=torch.channels_last)
+ adj = adj.to(memory_format=torch.channels_last)
+
+ if args.jit:
+ model.bond_model = torch.jit.script(model.bond_model)
+ model.atom_model = torch.jit.script(model.atom_model)
+
+ # make one pass in both directions to make sure that model works
+ with torch.no_grad():
+ _ = model(adj, x)
+ _ = model.reverse(torch.randn(args.batch_size, config.z_dim, device=device))
+
+ if is_distributed:
+ model = torch.nn.parallel.DistributedDataParallel(
+ model,
+ device_ids=[local_rank],
+ output_device=local_rank,
+ )
+ loss_module = torch.nn.parallel.DistributedDataParallel(
+ loss_module,
+ device_ids=[local_rank],
+ output_device=local_rank,
+ )
+ model_callable = model.module
+ loss_callable = loss_module.module
+ else:
+ model_callable = model
+ loss_callable = loss_module
+
+ # Loss and optimizer
+ optimizer = Adam((*model.parameters(), *loss_module.parameters()), lr=args.learning_rate, betas=(args.beta1, args.beta2))
+ scaler = GradScaler()
+
+ if args.save_epochs == -1:
+ args.save_epochs = args.epochs
+ if args.eval_epochs == -1:
+ args.eval_epochs = args.epochs
+ if args.steps == -1:
+ args.steps = args.epochs * len(train_dataloader)
+
+ snapshot_path = get_newest_checkpoint(args.results_dir)
+ if snapshot_path is not None:
+ snapshot_epoch, ln_var = load_state(snapshot_path, model_callable, optimizer=optimizer, device=device)
+ loss_callable.ln_var = torch.nn.Parameter(torch.tensor(ln_var))
+ first_epoch = snapshot_epoch + 1
+ step = first_epoch * len(train_dataloader)
+ else:
+ first_epoch = 0
+ step = 0
+
+ if first_epoch >= args.epochs:
+ logging.info(f'Model was already trained for {first_epoch} epochs')
+ exit(0)
+
+ for epoch in range(first_epoch, args.epochs):
+ if local_rank == 0:
+ acc_logger.reset()
+ if is_distributed:
+ train_dataloader.sampler.set_epoch(epoch)
+ for i, batch in enumerate(train_dataloader):
+ if local_rank == 0:
+ perf_logger.update()
+ step += 1
+ optimizer.zero_grad()
+ x = batch[0].to(device)
+ adj = batch[1].to(device=device, memory_format=torch.channels_last)
+
+ # Forward, backward and optimize
+ with_cuda_graph = (
+ args.cuda_graph
+ and step >= args.warmup_steps
+ and x.size(0) == args.batch_size
+ )
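+ # CUDA graphs are used only after warmup and for full-size batches, so the captured shapes stay static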
+ with autocast(enabled=args.amp, cache_enabled=not with_cuda_graph):
+ output = model(adj, x, with_cuda_graph=with_cuda_graph)
+ nll_x, nll_adj = loss_module(*output)
+ loss = nll_x + nll_adj
+
+ if args.amp:
+ scaler.scale(loss).backward()
+ scaler.unscale_(optimizer)
+ clip_grad_norm_(model.parameters(), args.clip)
+ scaler.step(optimizer)
+ scaler.update()
+ else:
+ loss.backward()
+ clip_grad_norm_(model.parameters(), args.clip)
+ optimizer.step()
+
+ # Print log info
+ if (i + 1) % args.log_interval == 0:
+ nll_x_value = reduce_tensor(nll_x, world_size).item()
+ nll_adj_value = reduce_tensor(nll_adj, world_size).item()
+ loss_value = nll_x_value + nll_adj_value
+
+ if local_rank == 0:
+ acc_logger.update({
+ 'loglik': loss_value,
+ 'nll_x': nll_x_value,
+ 'nll_adj': nll_adj_value
+ })
+
+ acc_logger.summarize(step=(epoch, i, i))
+ perf_logger.summarize(step=(epoch, i, i))
+
+ if step >= args.steps:
+ break
+
+ if (epoch + 1) % args.eval_epochs == 0:
+ with autocast(enabled=args.amp):
+ metrics = run_validation(model, config, loss_callable.ln_var.item(), args, is_distributed, world_size, device)
+ if local_rank == 0:
+ acc_logger.update(metrics)
+
+ # The same report for each epoch
+ if local_rank == 0:
+ acc_logger.summarize(step=(epoch,))
+ perf_logger.summarize(step=(epoch,))
+
+ # Save the model checkpoints
+ if (epoch + 1) % args.save_epochs == 0:
+ if local_rank == 0 or not is_distributed:
+ save_state(args.results_dir, model_callable, optimizer, loss_callable.ln_var.item(), epoch, keep=5)
+
+ if step >= args.steps:
+ break
+
+ if local_rank == 0:
+ acc_logger.summarize(step=tuple())
+ perf_logger.summarize(step=tuple())
+
+
+if __name__ == '__main__':
+ from rdkit import RDLogger
+ RDLogger.DisableLog('rdApp.*')
+
+ args = PARSER.parse_args()
+ train(args)
diff --git a/PyTorch/DrugDiscovery/MoFlow/moflow/utils.py b/PyTorch/DrugDiscovery/MoFlow/moflow/utils.py
new file mode 100644
index 000000000..d197934f4
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/moflow/utils.py
@@ -0,0 +1,211 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Copyright 2020 Chengxi Zang
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+
+
+import re
+from typing import Dict, List, Optional, Tuple, Union
+
+import numpy as np
+from rdkit import Chem
+import torch
+
+from moflow.config import Config, ATOM_VALENCY, CODE_TO_BOND, DUMMY_CODE
+
+
+def postprocess_predictions(x: Union[torch.Tensor, np.ndarray], adj: Union[torch.Tensor, np.ndarray], config: Config) -> Tuple[np.ndarray, np.ndarray]:
+ assert x.ndim == 3 and adj.ndim == 4, 'expected batched predictions'
+ n = config.dataset_config.max_num_atoms
+ adj = adj[:, :, :n, :n]
+ x = x[:, :n]
+
+ atoms = torch.argmax(x, dim=2)
+ atoms = _to_numpy_array(atoms)
+
+ adj = torch.argmax(adj, dim=1)
+ adj = _to_numpy_array(adj)
+
+ decoded = np.zeros_like(atoms)
+ for code, atomic_num in config.dataset_config.code_to_atomic.items():
+ decoded[atoms == code] = atomic_num
+
+ return decoded, adj
+
+
+def convert_predictions_to_mols(adj: np.ndarray, x: np.ndarray, correct_validity: bool = False) -> List[Chem.Mol]:
+ molecules = [construct_mol(x_elem, adj_elem) for x_elem, adj_elem in zip(x, adj)]
+
+ if correct_validity:
+ molecules = [correct_mol(mol) for mol in molecules]
+ return molecules
+
+
+def construct_mol(atoms: np.ndarray, adj: np.ndarray) -> Chem.Mol:
+ from rdkit import RDLogger
+ RDLogger.DisableLog('rdApp.*')
+ atoms_exist = (atoms != 0)
+ atoms = atoms[atoms_exist]
+ adj = adj[atoms_exist][:, atoms_exist]
+
+ mol = Chem.RWMol()
+
+ for atom in atoms:
+ mol.AddAtom(Chem.Atom(int(atom)))
+
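+ # Iterate only over the lower triangle (start > end) so each atom pair
+ # contributes at most one bond.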
+ for start, end in zip(*np.where(adj != DUMMY_CODE)):
+ if start > end:
+ mol.AddBond(int(start), int(end), CODE_TO_BOND[int(adj[start, end])])
+ # Add a formal charge to the atom if needed, e.g. [O+], [N+], [S+];
+ # charges like [O-], [N-], [S-], [NH+] are not supported.
+ flag, atomid_valence = check_valency(mol)
+ if flag:
+ continue
+ else:
+ assert len(atomid_valence) == 2
+ idx = atomid_valence[0]
+ v = atomid_valence[1]
+ an = mol.GetAtomWithIdx(idx).GetAtomicNum()
+ if an in (7, 8, 16) and (v - ATOM_VALENCY[an]) == 1:
+ mol.GetAtomWithIdx(idx).SetFormalCharge(1)
+ return mol
+
+
+def valid_mol(x: Optional[Chem.Mol]) -> Optional[Chem.Mol]:
+ if x is None:
+ # RDKit wasn't able to create the mol
+ return None
+ smi = Chem.MolToSmiles(x, isomericSmiles=True)
+ if len(smi) == 0 or '.' in smi:
+ # Mol is empty or fragmented
+ return None
+ reloaded = Chem.MolFromSmiles(smi)
+ # If the SMILES is invalid, MolFromSmiles returns None; otherwise the mol is valid
+ return reloaded
+
+
+def check_valency(mol: Chem.Mol) -> Tuple[bool, List[int]]:
+ """Checks that no atoms in the mol have exceeded their possible
+ valency. Returns True if no valency issues, False otherwise
+ plus information about problematic atom.
+ """
+ try:
+ Chem.SanitizeMol(mol, sanitizeOps=Chem.SanitizeFlags.SANITIZE_PROPERTIES)
+ return True, None
+ except ValueError as e:
+ e = str(e)
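+ # RDKit raises messages like "Explicit valence for atom # 3 N, 4, is greater
+ # than permitted"; take the text after '#' and extract the two integers
+ # (atom index and its valence).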
+ p = e.find('#')
+ e_sub = e[p:]
+ atomid_valence = list(map(int, re.findall(r'\d+', e_sub)))
+ return False, atomid_valence
+
+
+def correct_mol(mol: Chem.Mol) -> Chem.Mol:
+ flag, atomid_valence = check_valency(mol)
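+ # While valency errors remain, sort the offending atom's bonds by bond order
+ # and downgrade the highest-order one by a single order (removing it entirely
+ # if it was a single bond), then re-check.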
+ while not flag:
+ assert len(atomid_valence) == 2
+ idx = atomid_valence[0]
+ v = atomid_valence[1]
+ queue = []
+ for b in mol.GetAtomWithIdx(idx).GetBonds():
+ queue.append(
+ (b.GetIdx(), int(b.GetBondType()), b.GetBeginAtomIdx(), b.GetEndAtomIdx())
+ )
+ queue.sort(key=lambda tup: tup[1], reverse=True)
+ if len(queue) > 0:
+ start = queue[0][2]
+ end = queue[0][3]
+ t = queue[0][1] - 1
+ mol.RemoveBond(start, end)
+ if t >= 1:
+ mol.AddBond(start, end, CODE_TO_BOND[t])
+ flag, atomid_valence = check_valency(mol)
+
+ # if mol is fragmented, select the largest fragment
+ mols = Chem.GetMolFrags(mol, asMols=True)
+ mol = max(mols, key=lambda m: m.GetNumAtoms())
+
+ return mol
+
+
+def predictions_to_smiles(adj: torch.Tensor, x: torch.Tensor, config: Config) -> List[str]:
+ x, adj = postprocess_predictions(x, adj, config=config)
+ valid = [Chem.MolToSmiles(construct_mol(x_elem, adj_elem), isomericSmiles=True)
+ for x_elem, adj_elem in zip(x, adj)]
+ return valid
+
+
+def check_validity(molecules: List[Chem.Mol]) -> dict:
+ valid = [valid_mol(mol) for mol in molecules]
+ valid = [mol for mol in valid if mol is not None]
+
+ n_mols = len(molecules)
+ valid_ratio = len(valid) / n_mols
+ valid_smiles = [Chem.MolToSmiles(mol, isomericSmiles=False) for mol in valid]
+ unique_smiles = list(set(valid_smiles))
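+ # unique_ratio is the fraction of unique SMILES among valid molecules;
+ # abs_unique_ratio below divides by all generated molecules instead.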
+ unique_ratio = 0.
+ if len(valid) > 0:
+ unique_ratio = len(unique_smiles) / len(valid)
+ valid_mols = [Chem.MolFromSmiles(s) for s in valid_smiles]
+ abs_unique_ratio = len(unique_smiles) / n_mols
+
+ results = dict()
+ results['valid_mols'] = valid_mols
+ results['valid_smiles'] = valid_smiles
+ results['valid_ratio'] = valid_ratio * 100
+ results['unique_ratio'] = unique_ratio * 100
+ results['abs_unique_ratio'] = abs_unique_ratio * 100
+
+ return results
+
+
+def check_novelty(gen_smiles: List[str], train_smiles: List[str], n_generated_mols: int):
+ if len(gen_smiles) == 0:
+ novel_ratio = 0.
+ abs_novel_ratio = 0.
+ else:
+ duplicates = [1 for mol in gen_smiles if mol in train_smiles]
+ novel = len(gen_smiles) - sum(duplicates)
+ novel_ratio = novel * 100. / len(gen_smiles)
+ abs_novel_ratio = novel * 100. / n_generated_mols
+ return novel_ratio, abs_novel_ratio
+
+
+def _to_numpy_array(a):
+ if isinstance(a, torch.Tensor):
+ a = a.cpu().detach().numpy()
+ elif isinstance(a, np.ndarray):
+ pass
+ else:
+ raise TypeError("a ({}) is neither a torch.Tensor nor a np.ndarray".format(type(a)))
+ return a
diff --git a/PyTorch/DrugDiscovery/MoFlow/scripts/benchmark_inference.sh b/PyTorch/DrugDiscovery/MoFlow/scripts/benchmark_inference.sh
new file mode 100755
index 000000000..f53d039fc
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/scripts/benchmark_inference.sh
@@ -0,0 +1,39 @@
+#!/bin/bash
+
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+bs=${1:-512}
+prec=${2:-amp}
+flags="${@:3}"
+
+
+cmd="python \
+ /workspace/moflow_pyt/moflow/runtime/generate.py \
+ --batch_size ${bs} \
+ --steps 200 \
+ --warmup_steps 10 \
+ --allow_untrained \
+ --predictions_path '' \
+ --jit \
+ ${flags} \
+ "
+
+if [ "$prec" == "amp" ]; then
+ cmd="${cmd} --amp"
+fi
+
+set -x
+bash -c "${cmd}"
diff --git a/PyTorch/DrugDiscovery/MoFlow/scripts/benchmark_training.sh b/PyTorch/DrugDiscovery/MoFlow/scripts/benchmark_training.sh
new file mode 100755
index 000000000..e65e7099f
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/scripts/benchmark_training.sh
@@ -0,0 +1,66 @@
+#!/bin/bash
+
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Copyright 2020 Chengxi Zang
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+
+
+gpus=${1:-1}
+bs=${2:-512}
+prec=${3:-amp}
+flags="${@:4}"
+
+
+if [[ "${gpus}" == "1" ]]; then
+ cmd="python"
+else
+ cmd="torchrun --nproc_per_node=${gpus}"
+fi
+
+cmd="${cmd} \
+ /workspace/moflow_pyt/moflow/runtime/train.py \
+ --batch_size ${bs} \
+ --steps 200 \
+ --eval_epochs -1 \
+ --save_epochs -1 \
+ --cuda_graph \
+ ${flags} \
+ "
+
+if [ "$prec" == "amp" ]; then
+ cmd="${cmd} --amp"
+fi
+
+set -x
+bash -c "${cmd}"
diff --git a/PyTorch/DrugDiscovery/MoFlow/scripts/data_preprocess.py b/PyTorch/DrugDiscovery/MoFlow/scripts/data_preprocess.py
new file mode 100644
index 000000000..1cbf0a3e6
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/scripts/data_preprocess.py
@@ -0,0 +1,80 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Copyright 2020 Chengxi Zang
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+
+
+import os
+import pandas as pd
+import argparse
+import time
+
+from moflow.config import CONFIGS
+from moflow.data.data_frame_parser import DataFrameParser
+from moflow.data.encoding import MolEncoder
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='')
+ parser.add_argument('--data_name', type=str,
+ choices=list(CONFIGS),
+ help='dataset to be downloaded')
+ parser.add_argument('--data_dir', type=str, default='/data')
+ args = parser.parse_args()
+ return args
+
+def main(args):
+ start_time = time.time()
+ print('args', vars(args))
+
+ assert args.data_name in CONFIGS
+ dataset_config = CONFIGS[args.data_name].dataset_config
+
+ preprocessor = MolEncoder(out_size=dataset_config.max_num_atoms)
+
+ input_path = os.path.join(args.data_dir, dataset_config.csv_file)
+ output_path = os.path.join(args.data_dir, dataset_config.dataset_file)
+
+ print(f'Preprocessing {args.data_name} data:')
+ df = pd.read_csv(input_path, index_col=0)
+ parser = DataFrameParser(preprocessor, labels=dataset_config.labels, smiles_col=dataset_config.smiles_col)
+ dataset = parser.parse(df)
+
+ dataset.save(output_path)
+ print('Total time:', time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)))
+
+
+if __name__ == '__main__':
+ args = parse_args()
+ main(args)
diff --git a/Tools/PyTorch/TimeSeriesPredictionPlatform/conf/model_dataset/auto_arima_electricity.yaml b/PyTorch/DrugDiscovery/MoFlow/scripts/predict.sh
old mode 100644
new mode 100755
similarity index 62%
rename from Tools/PyTorch/TimeSeriesPredictionPlatform/conf/model_dataset/auto_arima_electricity.yaml
rename to PyTorch/DrugDiscovery/MoFlow/scripts/predict.sh
index d82464068..d8cfda06b
--- a/Tools/PyTorch/TimeSeriesPredictionPlatform/conf/model_dataset/auto_arima_electricity.yaml
+++ b/PyTorch/DrugDiscovery/MoFlow/scripts/predict.sh
@@ -1,10 +1,12 @@
+#!/bin/bash
+
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
-# http://www.apache.org/licenses/LICENSE-2.0
+# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
@@ -12,6 +14,23 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-dataset:
- config:
- stride: 400
+
+bs=${1:-512}
+prec=${2:-amp}
+flags="${@:3}"
+
+
+cmd="python \
+ /workspace/moflow_pyt/moflow/runtime/generate.py \
+ --batch_size ${bs} \
+ --jit \
+ --correct_validity \
+ ${flags} \
+ "
+
+if [ "$prec" == "amp" ]; then
+ cmd="${cmd} --amp"
+fi
+
+set -x
+bash -c "${cmd}"
diff --git a/PyTorch/DrugDiscovery/MoFlow/scripts/prepare_datasets.sh b/PyTorch/DrugDiscovery/MoFlow/scripts/prepare_datasets.sh
new file mode 100755
index 000000000..33c095a0d
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/scripts/prepare_datasets.sh
@@ -0,0 +1,25 @@
+#!/bin/bash
+
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+REPO_URL='https://raw.githubusercontent.com/calvin-zcx/moflow'
+GIT_HASH='3026b2e9bb8de027f3887deb96ccdd876ba51664'
+DATA_DIR="/data"
+
+wget -O "${DATA_DIR}/zinc250k.csv" "${REPO_URL}/${GIT_HASH}/data/zinc250k.csv"
+wget -O "${DATA_DIR}/valid_idx_zinc250k.json" "${REPO_URL}/${GIT_HASH}/data/valid_idx_zinc.json"
+
+python ${PWD}/scripts/data_preprocess.py --data_name "zinc250k" --data_dir ${DATA_DIR}
diff --git a/PyTorch/DrugDiscovery/MoFlow/scripts/train.sh b/PyTorch/DrugDiscovery/MoFlow/scripts/train.sh
new file mode 100755
index 000000000..06ede7c96
--- /dev/null
+++ b/PyTorch/DrugDiscovery/MoFlow/scripts/train.sh
@@ -0,0 +1,73 @@
+#!/bin/bash
+
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Copyright 2020 Chengxi Zang
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+
+
+gpus=${1:-1}
+prec=${2:-amp}
+flags="${@:3}"
+
+
+if [[ "${gpus}" == "1" ]]; then
+ cmd="python"
+else
+ cmd="torchrun --nproc_per_node=${gpus}"
+fi
+
+cmd="${cmd} \
+ /workspace/moflow_pyt/moflow/runtime/train.py \
+ --cuda_graph \
+ ${flags} \
+ "
+
+eval_cmd="python \
+ /workspace/moflow_pyt/moflow/runtime/evaluate.py \
+ --steps 1000 \
+ --jit \
+ ${flags} \
+ "
+
+if [ "$prec" == "amp" ]; then
+ cmd="${cmd} --amp"
+ eval_cmd="${eval_cmd} --amp"
+fi
+
+if [[ $gpus == 1 ]]; then
+ cmd="${cmd} --learning_rate 0.0001"
+fi
+
+set -x
+bash -c "${cmd} && ${eval_cmd}"
diff --git a/Tools/PyTorch/TimeSeriesPredictionPlatform/conf/model/cuml_auto_arima.yaml b/PyTorch/DrugDiscovery/MoFlow/setup.py
similarity index 63%
rename from Tools/PyTorch/TimeSeriesPredictionPlatform/conf/model/cuml_auto_arima.yaml
rename to PyTorch/DrugDiscovery/MoFlow/setup.py
index 01cfc79bd..18b690e3d 100644
--- a/Tools/PyTorch/TimeSeriesPredictionPlatform/conf/model/cuml_auto_arima.yaml
+++ b/PyTorch/DrugDiscovery/MoFlow/setup.py
@@ -4,7 +4,7 @@
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
-# http://www.apache.org/licenses/LICENSE-2.0
+# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
@@ -12,8 +12,17 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-_target_: models.stat_models.CUMLAutoARIMA
-defaults:
- - _self_
- - /trainer@_global_/trainer: stattrainer
+from setuptools import setup
+
+setup(
+ name='moflow_pyt',
+ packages=[
+ 'moflow',
+ 'moflow.data',
+ 'moflow.model',
+ 'moflow.runtime'
+ ],
+ version='0.0.1',
+ description='MoFlow: an invertible flow model for generating molecular graphs',
+)
diff --git a/PyTorch/Forecasting/TFT/Dockerfile b/PyTorch/Forecasting/TFT/Dockerfile
old mode 100644
new mode 100755
index 6f94e4726..7b057ad95
--- a/PyTorch/Forecasting/TFT/Dockerfile
+++ b/PyTorch/Forecasting/TFT/Dockerfile
@@ -12,7 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:21.12-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.11-py3
+
FROM ${FROM_IMAGE_NAME}
# Set workdir and python path
diff --git a/PyTorch/Forecasting/TFT/Dockerfile-triton b/PyTorch/Forecasting/TFT/Dockerfile-triton
index 2d338397f..f4bc92fe2 100644
--- a/PyTorch/Forecasting/TFT/Dockerfile-triton
+++ b/PyTorch/Forecasting/TFT/Dockerfile-triton
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:21.12-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.11-py3
FROM ${FROM_IMAGE_NAME}
# Ensure apt-get won't prompt for selecting options
diff --git a/PyTorch/Forecasting/TFT/README.md b/PyTorch/Forecasting/TFT/README.md
index 6dda2327c..8284f9804 100644
--- a/PyTorch/Forecasting/TFT/README.md
+++ b/PyTorch/Forecasting/TFT/README.md
@@ -123,9 +123,6 @@ For information about:
Training of Deep Neural
Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/)
blog.
-* APEX tools for mixed precision training, refer to the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in
- PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/)
- .
#### Enabling mixed precision
@@ -169,7 +166,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains Dockerfile, which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-- [PyTorch 21.12 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
+- [PyTorch 22.11 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
- Supported GPUs:
- [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
- [NVIDIA Turing architecture](https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/)
@@ -371,7 +368,7 @@ The [NVIDIA Triton Inference Server](https://github.com/triton-inference-server/
### Benchmarking
-The following section shows how to run benchmarks measuring the model performance in training and inference modes.
+The following section shows how to run benchmarks measuring the model performance in training and inference modes. Note that the first three steps of each epoch are excluded from the throughput and latency calculations, because nvFuser performs its optimizations during the third step of the first epoch, which causes a multi-second pause.
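+
+As a rough sketch of this exclusion (illustrative only, assuming a simple step-timing loop; `run_step` and `batch_size` are placeholders, the actual mechanism is the `exclude_from_total` argument passed to `PerformanceMeter.update` in `inference.py`):
+```
+step_times = []
+for step, batch in enumerate(data_loader):
+    elapsed = run_step(batch)  # placeholder timing helper
+    # skip the nvFuser warm-up steps and the last, possibly partial, batch
+    if step not in (0, 1, 2, len(data_loader) - 1):
+        step_times.append(elapsed)
+throughput = batch_size * len(step_times) / sum(step_times)  # items per second
+```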
#### Training performance benchmark
@@ -390,24 +387,24 @@ We conducted an extensive hyperparameter search along with stability tests. The
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
-Our results were obtained by running the `train.sh` training script in the [PyTorch 21.06 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA A100 (8x A100 80GB) GPUs.
+Our results were obtained by running the `train.sh` training script in the [PyTorch 22.11 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA A100 (8x A100 80GB) GPUs.
| Dataset | GPUs | Batch size / GPU | Accuracy - TF32 | Accuracy - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision)
|-------------|---|------|-----------------------|-----------------------|-------|-------|-------
-| Electricity | 8 | 1024 | 0.027 / 0.057 / 0.029 | 0.028 / 0.057 / 0.029 | 216s | 176s | 1.227x
-| Traffic | 8 | 1024 | 0.043 / 0.108 / 0.079 | 0.042 / 0.107 / 0.078 | 151s | 126s | 1.198x
+| Electricity | 8 | 1024 | 0.026 / 0.056 / 0.029 | 0.028 / 0.058 / 0.029 | 200s | 176s | 1.136x
+| Traffic | 8 | 1024 | 0.044 / 0.108 / 0.078 | 0.044 / 0.109 / 0.079 | 140s | 129s | 1.085x
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
-Our results were obtained by running the `train.sh` training script in the [PyTorch 21.06 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
+Our results were obtained by running the `train.sh` training script in the [PyTorch 22.11 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
| Dataset | GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (FP32 to mixed precision)
|-------------|---|------|-----------------------|-----------------------|-------|-------|-----------
-| Electricity | 8 | 1024 | 0.028 / 0.057 / 0.029 | 0.027 / 0.057 / 0.029 | 381s | 261s | 1.460x
-| Traffic | 8 | 1024 | 0.042 / 0.106 / 0.076 | 0.040 / 0.103 / 0.074 | 256s | 176s | 1.455x
+| Electricity | 8 | 1024 | 0.028 / 0.057 / 0.028 | 0.027 / 0.059 / 0.030 | 371s | 269s | 1.379x
+| Traffic | 8 | 1024 | 0.042 / 0.110 / 0.080 | 0.043 / 0.109 / 0.080 | 251s | 191s | 1.314x
@@ -417,22 +414,22 @@ In order to get a greater picture of the model’s accuracy, we performed a hype
| Dataset | #GPU | Hidden size | #Heads | Local BS | LR | Gradient clipping | Dropout | Mean q-risk | Std q-risk | Min q-risk | Max q-risk
|-------------|------|-------------|--------|----------|------|-------------------|---------|-------------|------------| -----------|------
-| Electricity | 8 | 128 | 4 | 1024 | 1e-3 | 0.0 | 0.1 | 0.1131 | 0.0025 | 0.1080 | 0.1200
-| Traffic | 8 | 128 | 4 | 1024 | 1e-3 | 0.0 | 0.3 | 0.2180 | 0.0049 | 0.2069 | 0.2336
+| Electricity | 8 | 128 | 4 | 1024 | 1e-3 | 0.0 | 0.1 | 0.1129 | 0.0025 | 0.1074 | 0.1244
+| Traffic | 8 | 128 | 4 | 1024 | 1e-3 | 0.0 | 0.3 | 0.2262 | 0.0027 | 0.2207 | 0.2331
#### Training performance results
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
-Our results were obtained by running the `train.sh` training script in the [PyTorch 21.06 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA A100 (8x A100 80GB) GPUs. Performance numbers (in items/images per second) were averaged over an entire training epoch.
+Our results were obtained by running the `train.sh` training script in the [PyTorch 22.11 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA A100 (8x A100 80GB) GPUs. Performance numbers (in items/images per second) were averaged over an entire training epoch.
| Dataset | GPUs | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision
|-------------|---|------|--------|--------|-------|-------|-----
-| Electricity | 1 | 1024 | 10173 | 13703 | 1.35x | 1 | 1
-| Electricity | 8 | 1024 | 80596 | 107761 | 1.34x | 7.92x | 7.86x
-| Traffic | 1 | 1024 | 10197 | 13779 | 1.35x | 1 | 1
-| Traffic | 8 | 1024 | 80692 | 107979 | 1.34x | 7.91x | 7.84x
+| Electricity | 1 | 1024 | 12435 | 17608 | 1.42x | 1 | 1
+| Electricity | 8 | 1024 | 94389 | 130769 | 1.39x | 7.59x | 7.42x
+| Traffic | 1 | 1024 | 12509 | 17591 | 1.40x | 1 | 1
+| Traffic | 8 | 1024 | 94476 | 130992 | 1.39x | 7.55x | 7.45x
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
@@ -442,14 +439,14 @@ The performance metrics used were items per second.
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
-Our results were obtained by running the `train.sh` training script in the [PyTorch 21.06 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX-1 with (8x V100 16GB) GPUs. Performance numbers (in items/images per second) were averaged over an entire training epoch.
+Our results were obtained by running the `train.sh` training script in the [PyTorch 22.11 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX-1 with (8x V100 16GB) GPUs. Performance numbers (in items/images per second) were averaged over an entire training epoch.
| Dataset | GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision
|-------------|---|------|-------|-------|-------|------|----
-| Electricity | 1 | 1024 | 5580 | 9148 | 1.64x | 1 | 1
-| Electricity | 8 | 1024 | 43351 | 69855 | 1.61x | 7.77x | 7.64x
-| Traffic | 1 | 1024 | 5593 | 9194 | 1.64x | 1 | 1
-| Traffic | 8 | 1024 | 43426 | 69983 | 1.61x | 7.76x | 7.61x
+| Electricity | 1 | 1024 | 5932 | 10163 | 1.71x | 1 | 1
+| Electricity | 8 | 1024 | 45566 | 75660 | 1.66x | 7.68x | 7.44x
+| Traffic | 1 | 1024 | 5971 | 10166 | 1.70x | 1 | 1
+| Traffic | 8 | 1024 | 45925 | 75640 | 1.64x | 7.69x | 7.44x
@@ -463,39 +460,44 @@ The performance metrics used were items per second.
##### Inference Performance: NVIDIA DGX A100
-Our results were obtained by running the `inference.py` script in the [PyTorch 21.12 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX A100. Throughput is measured in items per second and latency is measured in milliseconds.
+Our results were obtained by running the `inference.py` script in the [PyTorch 22.11 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX A100. Throughput is measured in items per second and latency is measured in milliseconds.
To benchmark the inference performance on a specific batch size and dataset, run the `inference.py` script.
| Dataset | GPUs | Batch size / GPU | Throughput - mixed precision (item/s) | Average Latency (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms)
|-------------|--------|-----|---------------------------------|-----------------|-------------|-------------|------------
-| Electricity | 1 | 1 | 144.37 | 6.93 | 7.00 | 7.04 | 7.25
-| Electricity | 1 | 2 | 277.53 | 7.21 | 7.25 | 7.27 | 7.48
-| Electricity | 1 | 4 | 564.37 | 7.09 | 7.13 | 7.15 | 7.64
-| Electricity | 1 | 8 | 1399.25 | 5.72 | 5.71 | 5.77 | 7.51
-| Traffic | 1 | 1 | 145.26 | 6.88 | 6.91 | 6.95 | 7.60
-| Traffic | 1 | 2 | 277.97 | 7.19 | 7.28 | 7.30 | 7.46
-| Traffic | 1 | 4 | 563.05 | 7.10 | 7.14 | 7.16 | 7.42
-| Traffic | 1 | 8 | 1411.62 | 5.67 | 5.69 | 5.79 | 6.21
+| Electricity | 1 | 1 | 272.43 | 3.67 | 3.70 | 3.87 | 4.18
+| Electricity | 1 | 2 | 518.13 | 3.86 | 3.88 | 3.93 | 4.19
+| Electricity | 1 | 4 | 1039.31 | 3.85 | 3.89 | 3.97 | 4.15
+| Electricity | 1 | 8 | 2039.54 | 3.92 | 3.93 | 3.95 | 4.32
+| Traffic | 1 | 1 | 269.59 | 3.71 | 3.74 | 3.79 | 4.30
+| Traffic | 1 | 2 | 518.73 | 3.86 | 3.78 | 3.91 | 4.66
+| Traffic | 1 | 4 | 1021.49 | 3.92 | 3.94 | 3.95 | 4.25
+| Traffic | 1 | 8 | 2005.54 | 3.99 | 4.01 | 4.03 | 4.39
##### Inference Performance: NVIDIA DGX-1 V100
-Our results were obtained by running the `inference.py` script in the [PyTorch 21.12 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX-1 V100. Throughput is measured in items per second and latency is measured in milliseconds.
+Our results were obtained by running the `inference.py` script in the [PyTorch 22.11 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX-1 V100. Throughput is measured in items per second and latency is measured in milliseconds.
To benchmark the inference performance on a specific batch size and dataset, run the `inference.py` script.
| Dataset | GPUs | Batch size / GPU | Throughput - mixed precision (item/s) | Average Latency (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms)
|-------------|--------|-----|---------------------------------|-----------------|-------------|-------------|------------
-| Electricity | 1 | 1 | 95.65 | 10.45 | 11.30 | 11.95 | 12.13
-| Electricity | 1 | 2 | 193.15 | 10.35 | 10.80 | 11.46 | 12.16
-| Electricity | 1 | 4 | 381.09 | 10.49 | 10.75 | 12.29 | 12.41
-| Electricity | 1 | 8 | 805.49 | 9.93 | 10.41 | 10.48 | 10.91
-| Traffic | 1 | 1 | 96.72 | 10.34 | 10.53 | 11.99 | 12.13
-| Traffic | 1 | 2 | 192.93 | 10.37 | 10.80 | 11.97 | 12.12
-| Traffic | 1 | 4 | 379.00 | 10.55 | 10.88 | 11.09 | 11.96
-| Traffic | 1 | 8 | 859.69 | 9.30 | 10.58 | 10.65 | 11.28
+| Electricity | 1 | 1 | 171.68 | 5.82 | 5.99 | 6.17 | 7.00
+| Electricity | 1 | 2 | 318.92 | 6.27 | 6.43 | 6.60 | 7.51
+| Electricity | 1 | 4 | 684.79 | 5.84 | 6.02 | 6.08 | 6.47
+| Electricity | 1 | 8 | 1275.54 | 6.27 | 7.31 | 7.36 | 7.51
+| Traffic | 1 | 1 | 183.39 | 5.45 | 5.64 | 5.86 | 6.73
+| Traffic | 1 | 2 | 340.73 | 5.87 | 6.07 | 6.77 | 7.25
+| Traffic | 1 | 4 | 647.33 | 6.18 | 6.35 | 7.99 | 8.07
+| Traffic | 1 | 8 | 1364.39 | 5.86 | 6.07 | 6.40 | 7.31
## Release notes
The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to https://developer.nvidia.com/deep-learning-performance-training-inference.
### Changelog
+March 2023
+- 23.01 Container Update
+- Switched from NVIDIA Apex AMP and NVIDIA Apex FusedLayerNorm to native PyTorch AMP and native PyTorch LayerNorm
+- Acceleration using NvFuser
+
February 2022
- 21.12 Container Update
- Triton Inference Performance Numbers
diff --git a/PyTorch/Forecasting/TFT/configuration.py b/PyTorch/Forecasting/TFT/configuration.py
index b2e3ceb56..09b97f7ef 100644
--- a/PyTorch/Forecasting/TFT/configuration.py
+++ b/PyTorch/Forecasting/TFT/configuration.py
@@ -124,5 +124,5 @@ def __init__(self):
CONFIGS = {'electricity': ElectricityConfig,
- 'traffic': TrafficConfig,
+ 'traffic': TrafficConfig,
}
diff --git a/PyTorch/Forecasting/TFT/criterions.py b/PyTorch/Forecasting/TFT/criterions.py
index 2f469f779..12de5be76 100644
--- a/PyTorch/Forecasting/TFT/criterions.py
+++ b/PyTorch/Forecasting/TFT/criterions.py
@@ -15,6 +15,7 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
+import numpy as np
class QuantileLoss(nn.Module):
def __init__(self, config):
@@ -26,3 +27,11 @@ def forward(self, predictions, targets):
ql = (1-self.q)*F.relu(diff) + self.q*F.relu(-diff)
losses = ql.view(-1, ql.shape[-1]).mean(0)
return losses
+
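+# Normalized quantile risk (q-risk) as reported for TFT: 2 * quantile loss
+# divided by the mean absolute target, averaged over all forecast points;
+# returns one value per quantile.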
+def qrisk(pred, tgt, quantiles):
+ diff = pred - tgt
+ ql = (1-quantiles)*np.clip(diff,0, float('inf')) + quantiles*np.clip(-diff,0, float('inf'))
+ losses = ql.reshape(-1, ql.shape[-1])
+ normalizer = np.abs(tgt).mean()
+ risk = 2 * losses / normalizer
+ return risk.mean(0)
diff --git a/PyTorch/Forecasting/TFT/data_utils.py b/PyTorch/Forecasting/TFT/data_utils.py
index b851f1854..ce6c4f6ed 100644
--- a/PyTorch/Forecasting/TFT/data_utils.py
+++ b/PyTorch/Forecasting/TFT/data_utils.py
@@ -41,7 +41,8 @@
from bisect import bisect
import torch
-from torch.utils.data import Dataset,IterableDataset,DataLoader
+from torch.utils.data import Dataset, IterableDataset, DataLoader, DistributedSampler, RandomSampler
+from torch.utils.data.dataloader import default_collate
class DataTypes(enum.IntEnum):
"""Defines numerical types of each column."""
@@ -401,6 +402,51 @@ def sample_data(dataset, num_samples):
else:
return torch.utils.data.Subset(dataset, np.random.choice(np.arange(len(dataset)), size=num_samples, replace=False))
+def load_dataset(args, config, collate_fn=default_collate):
+ from utils import print_once
+ train_split = TFTBinaryDataset(os.path.join(args.data_path, 'train.bin'), config)
+ train_split = sample_data(train_split, args.sample_data[0])
+ if args.distributed_world_size > 1:
+ data_sampler = DistributedSampler(train_split, args.distributed_world_size, args.distributed_rank, seed=args.seed + args.distributed_rank, drop_last=True)
+ else:
+ data_sampler = RandomSampler(train_split)
+ train_loader = DataLoader(train_split,
+ batch_size=args.batch_size,
+ num_workers=4,
+ sampler=data_sampler,
+ collate_fn=collate_fn,
+ pin_memory=True)
+
+ valid_split = TFTBinaryDataset(os.path.join(args.data_path, 'valid.bin'), config)
+ valid_split = sample_data(valid_split, args.sample_data[1])
+ if args.distributed_world_size > 1:
+ data_sampler = DistributedSampler(valid_split, args.distributed_world_size, args.distributed_rank, shuffle=False, drop_last=False)
+ else:
+ data_sampler = None
+ valid_loader = DataLoader(valid_split,
+ batch_size=args.batch_size,
+ sampler=data_sampler,
+ num_workers=4,
+ collate_fn=collate_fn,
+ pin_memory=True)
+
+ test_split = TFTBinaryDataset(os.path.join(args.data_path, 'test.bin'), config)
+ if args.distributed_world_size > 1:
+ data_sampler = DistributedSampler(test_split, args.distributed_world_size, args.distributed_rank, shuffle=False, drop_last=False)
+ else:
+ data_sampler = None
+ test_loader = DataLoader(test_split,
+ batch_size=args.batch_size,
+ sampler=data_sampler,
+ num_workers=4,
+ collate_fn=collate_fn,
+ pin_memory=True)
+
+ print_once(f'Train split length: {len(train_split)}')
+ print_once(f'Valid split length: {len(valid_split)}')
+ print_once(f'Test split length: {len(test_split)}')
+
+ return train_loader, valid_loader, test_loader
def standarize_electricity(path):
"""Code taken from https://github.com/google-research/google-research/blob/master/tft/script_download_data.py"""
@@ -574,4 +620,3 @@ def read_matrix(filename):
flat_df.to_csv(os.path.join(path, 'standarized.csv'))
-
diff --git a/PyTorch/Forecasting/TFT/inference.py b/PyTorch/Forecasting/TFT/inference.py
index f1d3ab97f..7f60f5588 100644
--- a/PyTorch/Forecasting/TFT/inference.py
+++ b/PyTorch/Forecasting/TFT/inference.py
@@ -26,12 +26,12 @@
from configuration import ElectricityConfig
from data_utils import TFTDataset
from utils import PerformanceMeter
-from criterions import QuantileLoss
+from criterions import qrisk
import dllogger
from log_helper import setup_logger
+from torch.cuda import amp
def _unscale_per_id(config, values, ids, scalers):
- values = values.cpu().numpy()
num_horizons = config.example_length - config.encoder_length + 1
flat_values = pd.DataFrame(
values,
@@ -51,11 +51,9 @@ def _unscale_per_id(config, values, ids, scalers):
flat_values = pd.concat(df_list, axis=0)
flat_values = flat_values[[col for col in flat_values if not 'id' in col]]
- flat_tensor = torch.from_numpy(flat_values.values)
- return flat_tensor
+ return flat_values.values
def _unscale(config, values, scaler):
- values = values.cpu().numpy()
num_horizons = config.example_length - config.encoder_length + 1
flat_values = pd.DataFrame(
values,
@@ -68,46 +66,46 @@ def _unscale(config, values, scaler):
flat_values[col] = _t_col
flat_values = flat_values[[col for col in flat_values if not 'id' in col]]
- flat_tensor = torch.from_numpy(flat_values.values)
- return flat_tensor
+ return flat_values.values
def predict(args, config, model, data_loader, scalers, cat_encodings, extend_targets=False):
model.eval()
predictions = []
targets = []
ids = []
- perf_meter = PerformanceMeter()
+ perf_meter = PerformanceMeter(benchmark_mode=not args.disable_benchmark)
n_workers = args.distributed_world_size if hasattr(args, 'distributed_world_size') else 1
-
- for step, batch in enumerate(data_loader):
- perf_meter.reset_current_lap()
- with torch.no_grad():
- batch = {key: tensor.cuda() if tensor.numel() else None for key, tensor in batch.items()}
- ids.append(batch['id'][:,0,:])
- targets.append(batch['target'])
- predictions.append(model(batch).float())
-
- perf_meter.update(args.batch_size * n_workers,
- exclude_from_total=step in [0, len(data_loader)-1])
-
- targets = torch.cat(targets, dim=0)
+
+ with torch.jit.fuser("fuser2"):
+ for step, batch in enumerate(data_loader):
+ perf_meter.reset_current_lap()
+ with torch.no_grad():
+ batch = {key: tensor.cuda() if tensor.numel() else None for key, tensor in batch.items()}
+ ids.append(batch['id'][:,0,:])
+ targets.append(batch['target'])
+ predictions.append(model(batch).float())
+
+ perf_meter.update(args.batch_size * n_workers,
+ exclude_from_total=step in [0, 1, 2, len(data_loader)-1])
+
+ targets = torch.cat(targets, dim=0).cpu().numpy()
if not extend_targets:
targets = targets[:,config.encoder_length:,:]
- predictions = torch.cat(predictions, dim=0)
+ predictions = torch.cat(predictions, dim=0).cpu().numpy()
if config.scale_per_id:
ids = torch.cat(ids, dim=0).cpu().numpy()
- unscaled_predictions = torch.stack(
+ unscaled_predictions = np.stack(
[_unscale_per_id(config, predictions[:,:,i], ids, scalers) for i in range(len(config.quantiles))],
- dim=-1)
- unscaled_targets = _unscale_per_id(config, targets[:,:,0], ids, scalers).unsqueeze(-1)
+ axis=-1)
+ unscaled_targets = np.expand_dims(_unscale_per_id(config, targets[:,:,0], ids, scalers), axis=-1)
else:
ids = None
- unscaled_predictions = torch.stack(
+ unscaled_predictions = np.stack(
[_unscale(config, predictions[:,:,i], scalers['']) for i in range(len(config.quantiles))],
- dim=-1)
- unscaled_targets = _unscale(config, targets[:,:,0], scalers['']).unsqueeze(-1)
+ axis=-1)
+ unscaled_targets = np.expand_dims(_unscale(config, targets[:,:,0], scalers['']), axis=-1)
return unscaled_predictions, unscaled_targets, ids, perf_meter
@@ -173,9 +171,11 @@ def inference(args, config, model, data_loader, scalers, cat_encodings):
os.makedirs(os.path.join(args.results, 'predictions', str(key)), exist_ok=True)
df.to_csv(os.path.join(args.results, 'predictions', str(key), q+'.csv'))
- losses = QuantileLoss(config)(unscaled_predictions, unscaled_targets)
- normalizer = unscaled_targets.abs().mean()
- q_risk = 2 * losses / normalizer
+ #losses = QuantileLoss(config)(torch.from_numpy(unscaled_predictions).contiguous(),
+ # torch.from_numpy(unscaled_targets).contiguous()).numpy()
+ #normalizer = np.mean(np.abs(unscaled_targets))
+ #q_risk = 2 * losses / normalizer
+ risk = qrisk(unscaled_predictions, unscaled_targets, np.array(config.quantiles))
perf_dict = {
'throughput': perf_meter.avg,
@@ -186,7 +186,7 @@ def inference(args, config, model, data_loader, scalers, cat_encodings):
'total_infernece_time': perf_meter.total_time,
}
- return q_risk, perf_dict
+ return risk, perf_dict
def main(args):
@@ -215,7 +215,7 @@ def main(args):
quantiles = {'test_p10': quantiles[0].item(), 'test_p50': quantiles[1].item(), 'test_p90': quantiles[2].item(), 'sum':sum(quantiles).item()}
finish_log = {**quantiles, **perf_dict}
dllogger.log(step=(), data=finish_log, verbosity=1)
- print('Test q-risk: P10 {} | P50 {} | P90 {}'.format(*quantiles))
+ print('Test q-risk: P10 {test_p10} | P50 {test_p50} | P90 {test_p90}'.format(**quantiles))
print('Latency:\n\tAverage {:.3f}s\n\tp90 {:.3f}s\n\tp95 {:.3f}s\n\tp99 {:.3f}s'.format(
perf_dict['latency_avg'], perf_dict['latency_p90'], perf_dict['latency_p95'], perf_dict['latency_p99']))
@@ -235,5 +235,6 @@ def main(args):
parser.add_argument('--save_predictions', action='http://github.com/store_true')
parser.add_argument('--results', type=str, default='/results')
parser.add_argument('--log_file', type=str, default='dllogger.json')
+ parser.add_argument("--disable_benchmark", action='http://github.com/store_true', help='Disable benchmarking mode')
ARGS = parser.parse_args()
main(ARGS)
diff --git a/PyTorch/Forecasting/TFT/modeling.py b/PyTorch/Forecasting/TFT/modeling.py
old mode 100644
new mode 100755
index a0300ea99..d5c214d5c
--- a/PyTorch/Forecasting/TFT/modeling.py
+++ b/PyTorch/Forecasting/TFT/modeling.py
@@ -17,12 +17,11 @@
import torch.nn.functional as F
from torch import Tensor
+from torch.nn.parameter import UninitializedParameter
from typing import Dict, Tuple, Optional, List
-if os.environ.get("TFT_SCRIPTING", False):
- from torch.nn import LayerNorm
-else:
- from apex.normalization.fused_layer_norm import FusedLayerNorm as LayerNorm
+MAKE_CONVERT_COMPATIBLE = os.environ.get("TFT_SCRIPTING", None) is not None
+from torch.nn import LayerNorm
class MaybeLayerNorm(nn.Module):
def __init__(self, output_size, hidden_size, eps):
@@ -46,21 +45,20 @@ def forward(self, x: Tensor) -> Tensor:
x = F.glu(x)
return x
-
class GRN(nn.Module):
def __init__(self,
input_size,
- hidden_size,
+ hidden_size,
output_size=None,
context_hidden_size=None,
- dropout=0):
+ dropout=0.0,):
super().__init__()
-
-
self.layer_norm = MaybeLayerNorm(output_size, hidden_size, eps=1e-3)
self.lin_a = nn.Linear(input_size, hidden_size)
if context_hidden_size is not None:
self.lin_c = nn.Linear(context_hidden_size, hidden_size, bias=False)
+ else:
+ self.lin_c = nn.Identity()
self.lin_i = nn.Linear(hidden_size, hidden_size)
self.glu = GLU(hidden_size, output_size if output_size else hidden_size)
self.dropout = nn.Dropout(dropout)
@@ -74,13 +72,28 @@ def forward(self, a: Tensor, c: Optional[Tensor] = None):
x = self.lin_i(x)
x = self.dropout(x)
x = self.glu(x)
- y = a if not self.out_proj else self.out_proj(a)
+ y = a if self.out_proj is None else self.out_proj(a)
x = x + y
- x = self.layer_norm(x)
- return x
+ return self.layer_norm(x)
+
+
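+# The two scripted helpers below embed each continuous feature with its own
+# vector (equivalent to einsum('btf,fh->btfh')) and add a per-feature bias;
+# TorchScript lets nvFuser fuse the multiply and add into a single kernel.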
+# @torch.jit.script #Currently broken with autocast
+def fused_pointwise_linear_v1(x, a, b):
+ out = torch.mul(x.unsqueeze(-1), a)
+ out = out + b
+ return out
+
+@torch.jit.script
+def fused_pointwise_linear_v2(x, a, b):
+ out = x.unsqueeze(3) * a
+ out = out + b
+ return out
+
class TFTEmbedding(nn.Module):
- def __init__(self, config):
+ def __init__(self, config, initialize_cont_params=True):
+ # initialize_cont_params=False prevents initializing parameters inside this class,
+ # so they can be lazily initialized in the LazyEmbedding module
super().__init__()
self.s_cat_inp_lens = config.static_categorical_inp_lens
self.t_cat_k_inp_lens = config.temporal_known_categorical_inp_lens
@@ -108,23 +121,43 @@ def __init__(self, config):
self.t_cat_o_embed = nn.ModuleList([
nn.Embedding(n, self.hidden_size) for n in self.t_cat_o_inp_lens]) if self.t_cat_o_inp_lens else None
- self.s_cont_embedding_vectors = nn.Parameter(torch.Tensor(self.s_cont_inp_size, self.hidden_size)) if self.s_cont_inp_size else None
- self.t_cont_k_embedding_vectors = nn.Parameter(torch.Tensor(self.t_cont_k_inp_size, self.hidden_size)) if self.t_cont_k_inp_size else None
- self.t_cont_o_embedding_vectors = nn.Parameter(torch.Tensor(self.t_cont_o_inp_size, self.hidden_size)) if self.t_cont_o_inp_size else None
- self.t_tgt_embedding_vectors = nn.Parameter(torch.Tensor(self.t_tgt_size, self.hidden_size))
+ if initialize_cont_params:
+ self.s_cont_embedding_vectors = nn.Parameter(torch.Tensor(self.s_cont_inp_size, self.hidden_size)) if self.s_cont_inp_size else None
+ self.t_cont_k_embedding_vectors = nn.Parameter(torch.Tensor(self.t_cont_k_inp_size, self.hidden_size)) if self.t_cont_k_inp_size else None
+ self.t_cont_o_embedding_vectors = nn.Parameter(torch.Tensor(self.t_cont_o_inp_size, self.hidden_size)) if self.t_cont_o_inp_size else None
+ self.t_tgt_embedding_vectors = nn.Parameter(torch.Tensor(self.t_tgt_size, self.hidden_size))
- self.s_cont_embedding_bias = nn.Parameter(torch.zeros(self.s_cont_inp_size, self.hidden_size)) if self.s_cont_inp_size else None
- self.t_cont_k_embedding_bias = nn.Parameter(torch.zeros(self.t_cont_k_inp_size, self.hidden_size)) if self.t_cont_k_inp_size else None
- self.t_cont_o_embedding_bias = nn.Parameter(torch.zeros(self.t_cont_o_inp_size, self.hidden_size)) if self.t_cont_o_inp_size else None
- self.t_tgt_embedding_bias = nn.Parameter(torch.zeros(self.t_tgt_size, self.hidden_size))
+ self.s_cont_embedding_bias = nn.Parameter(torch.zeros(self.s_cont_inp_size, self.hidden_size)) if self.s_cont_inp_size else None
+ self.t_cont_k_embedding_bias = nn.Parameter(torch.zeros(self.t_cont_k_inp_size, self.hidden_size)) if self.t_cont_k_inp_size else None
+ self.t_cont_o_embedding_bias = nn.Parameter(torch.zeros(self.t_cont_o_inp_size, self.hidden_size)) if self.t_cont_o_inp_size else None
+ self.t_tgt_embedding_bias = nn.Parameter(torch.zeros(self.t_tgt_size, self.hidden_size))
+ self.reset_parameters()
+
+
+ def reset_parameters(self):
if self.s_cont_embedding_vectors is not None:
torch.nn.init.xavier_normal_(self.s_cont_embedding_vectors)
+ torch.nn.init.zeros_(self.s_cont_embedding_bias)
if self.t_cont_k_embedding_vectors is not None:
torch.nn.init.xavier_normal_(self.t_cont_k_embedding_vectors)
+ torch.nn.init.zeros_(self.t_cont_k_embedding_bias)
if self.t_cont_o_embedding_vectors is not None:
torch.nn.init.xavier_normal_(self.t_cont_o_embedding_vectors)
- torch.nn.init.xavier_normal_(self.t_tgt_embedding_vectors)
+ torch.nn.init.zeros_(self.t_cont_o_embedding_bias)
+ if self.t_tgt_embedding_vectors is not None:
+ torch.nn.init.xavier_normal_(self.t_tgt_embedding_vectors)
+ torch.nn.init.zeros_(self.t_tgt_embedding_bias)
+ if self.s_cat_embed is not None:
+ for module in self.s_cat_embed:
+ module.reset_parameters()
+ if self.t_cat_k_embed is not None:
+ for module in self.t_cat_k_embed:
+ module.reset_parameters()
+ if self.t_cat_o_embed is not None:
+ for module in self.t_cat_o_embed:
+ module.reset_parameters()
+
def _apply_embedding(self,
cat: Optional[Tensor],
@@ -138,8 +171,11 @@ def _apply_embedding(self,
#the line below is equivalent to following einsums
#e_cont = torch.einsum('btf,fh->bthf', cont, cont_emb)
#e_cont = torch.einsum('bf,fh->bhf', cont, cont_emb)
- e_cont = torch.mul(cont.unsqueeze(-1), cont_emb)
- e_cont = e_cont + cont_bias
+ if MAKE_CONVERT_COMPATIBLE:
+ e_cont = torch.mul(cont.unsqueeze(-1), cont_emb)
+ e_cont = e_cont + cont_bias
+ else:
+ e_cont = fused_pointwise_linear_v1(cont, cont_emb, cont_bias)
else:
e_cont = None
@@ -185,11 +221,68 @@ def forward(self, x: Dict[str, Tensor]):
# Temporal observed targets
# t_observed_tgt = torch.einsum('btf,fh->btfh', t_tgt_obs, self.t_tgt_embedding_vectors)
- t_observed_tgt = torch.matmul(t_tgt_obs.unsqueeze(3).unsqueeze(4), self.t_tgt_embedding_vectors.unsqueeze(1)).squeeze(3)
- t_observed_tgt = t_observed_tgt + self.t_tgt_embedding_bias
+ if MAKE_CONVERT_COMPATIBLE:
+ t_observed_tgt = torch.matmul(t_tgt_obs.unsqueeze(3).unsqueeze(4), self.t_tgt_embedding_vectors.unsqueeze(1)).squeeze(3)
+ t_observed_tgt = t_observed_tgt + self.t_tgt_embedding_bias
+ else:
+ t_observed_tgt = fused_pointwise_linear_v2(t_tgt_obs, self.t_tgt_embedding_vectors, self.t_tgt_embedding_bias)
return s_inp, t_known_inp, t_observed_inp, t_observed_tgt
+class LazyEmbedding(nn.modules.lazy.LazyModuleMixin, TFTEmbedding):
+ cls_to_become = TFTEmbedding
+
+ def __init__(self, config):
+ super().__init__(config, initialize_cont_params=False)
+
+ if config.static_continuous_inp_size:
+ self.s_cont_embedding_vectors = UninitializedParameter()
+ self.s_cont_embedding_bias = UninitializedParameter()
+ else:
+ self.s_cont_embedding_vectors = None
+ self.s_cont_embedding_bias = None
+
+ if config.temporal_known_continuous_inp_size:
+ self.t_cont_k_embedding_vectors = UninitializedParameter()
+ self.t_cont_k_embedding_bias = UninitializedParameter()
+ else:
+ self.t_cont_k_embedding_vectors = None
+ self.t_cont_k_embedding_bias = None
+
+ if config.temporal_observed_continuous_inp_size:
+ self.t_cont_o_embedding_vectors = UninitializedParameter()
+ self.t_cont_o_embedding_bias = UninitializedParameter()
+ else:
+ self.t_cont_o_embedding_vectors = None
+ self.t_cont_o_embedding_bias = None
+
+ self.t_tgt_embedding_vectors = UninitializedParameter()
+ self.t_tgt_embedding_bias = UninitializedParameter()
+
+ def initialize_parameters(self, x):
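+ # Invoked by LazyModuleMixin on the first forward pass: parameter shapes are
+ # inferred from the first batch, materialized, and initialized via
+ # reset_parameters().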
+ if self.has_uninitialized_params():
+ s_cont_inp = x.get('s_cont', None)
+ t_cont_k_inp = x.get('k_cont', None)
+ t_cont_o_inp = x.get('o_cont', None)
+ t_tgt_obs = x['target'] # Has to be present
+
+ if s_cont_inp is not None:
+ self.s_cont_embedding_vectors.materialize((s_cont_inp.shape[-1], self.hidden_size))
+ self.s_cont_embedding_bias.materialize((s_cont_inp.shape[-1], self.hidden_size))
+
+ if t_cont_k_inp is not None:
+ self.t_cont_k_embedding_vectors.materialize((t_cont_k_inp.shape[-1], self.hidden_size))
+ self.t_cont_k_embedding_bias.materialize((t_cont_k_inp.shape[-1], self.hidden_size))
+
+ if t_cont_o_inp is not None:
+ self.t_cont_o_embedding_vectors.materialize((t_cont_o_inp.shape[-1], self.hidden_size))
+ self.t_cont_o_embedding_bias.materialize((t_cont_o_inp.shape[-1], self.hidden_size))
+
+ self.t_tgt_embedding_vectors.materialize((t_tgt_obs.shape[-1], self.hidden_size))
+ self.t_tgt_embedding_bias.materialize((t_tgt_obs.shape[-1], self.hidden_size))
+
+ self.reset_parameters()
+
class VariableSelectionNetwork(nn.Module):
def __init__(self, config, num_inputs):
super().__init__()
@@ -197,7 +290,7 @@ def __init__(self, config, num_inputs):
self.var_grns = nn.ModuleList([GRN(config.hidden_size, config.hidden_size, dropout=config.dropout) for _ in range(num_inputs)])
def forward(self, x: Tensor, context: Optional[Tensor] = None):
- Xi = x.reshape(*x.shape[:-2], -1)
+ Xi = torch.flatten(x, start_dim=-2)
grn_outputs = self.joint_grn(Xi, c=context)
sparse_weights = F.softmax(grn_outputs, dim=-1)
transformed_embed_list = [m(x[...,i,:]) for i, m in enumerate(self.var_grns)]
@@ -223,7 +316,7 @@ def forward(self, x: Tensor) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
# enrichment context
# state_c context
# state_h context
- cs, ce, ch, cc = tuple(m(variable_ctx) for m in self.context_grns)
+ cs, ce, ch, cc = [m(variable_ctx) for m in self.context_grns]
return cs, ce, ch, cc
@@ -241,7 +334,7 @@ def __init__(self, config):
self.scale = self.d_head**-0.5
self.register_buffer("_mask", torch.triu(torch.full((config.example_length, config.example_length), float('-inf')), 1).unsqueeze(0))
- def forward(self, x: Tensor, mask_future_timesteps: bool = True) -> Tuple[Tensor, Tensor]:
+ def forward(self, x: Tensor) -> Tuple[Tensor, Tensor]:
bs, t, h_size = x.shape
qkv = self.qkv_linears(x)
q, k, v = qkv.split((self.n_head * self.d_head, self.n_head * self.d_head, self.d_head), dim=-1)
@@ -253,8 +346,7 @@ def forward(self, x: Tensor, mask_future_timesteps: bool = True) -> Tuple[Tensor
attn_score = torch.matmul(q.permute((0, 2, 1, 3)), k.permute((0, 2, 3, 1)))
attn_score.mul_(self.scale)
- if mask_future_timesteps:
- attn_score = attn_score + self._mask
+ attn_score = attn_score + self._mask
attn_prob = F.softmax(attn_score, dim=3)
attn_prob = self.attn_dropout(attn_prob)
@@ -265,26 +357,14 @@ def forward(self, x: Tensor, mask_future_timesteps: bool = True) -> Tuple[Tensor
out = self.out_proj(m_attn_vec)
out = self.out_dropout(out)
- return out, attn_vec
+ return out, attn_prob
-
-
-class TemporalFusionTransformer(nn.Module):
- """
- Implementation of https://arxiv.org/abs/1912.09363
- """
+class TFTBack(nn.Module):
def __init__(self, config):
super().__init__()
- if hasattr(config, 'model'):
- config = config.model
-
- self.encoder_length = config.encoder_length #this determines from how distant past we want to use data from
-
- self.embedding = TFTEmbedding(config)
- self.static_encoder = StaticCovariateEncoder(config)
-
- self.history_vsn = VariableSelectionNetwork(config, config.num_historic_vars)
+ self.encoder_length = config.encoder_length
+ self.history_vsn = VariableSelectionNetwork(config, config.num_historic_vars)
self.history_encoder = nn.LSTM(config.hidden_size, config.hidden_size, batch_first=True)
self.future_vsn = VariableSelectionNetwork(config, config.num_future_vars)
self.future_encoder = nn.LSTM(config.hidden_size, config.hidden_size, batch_first=True)
@@ -309,28 +389,13 @@ def __init__(self, config):
self.decoder_ln = LayerNorm(config.hidden_size, eps=1e-3)
self.quantile_proj = nn.Linear(config.hidden_size, len(config.quantiles))
-
- def forward(self, x: Dict[str, Tensor]) -> Tensor:
- s_inp, t_known_inp, t_observed_inp, t_observed_tgt = self.embedding(x)
-
- # Static context
- cs, ce, ch, cc = self.static_encoder(s_inp)
- ch, cc = ch.unsqueeze(0), cc.unsqueeze(0) #lstm initial states
-
- # Temporal input
- _historical_inputs = [t_known_inp[:,:self.encoder_length,:], t_observed_tgt[:,:self.encoder_length,:]]
- if t_observed_inp is not None:
- _historical_inputs.insert(0,t_observed_inp[:,:self.encoder_length,:])
-
- historical_inputs = torch.cat(_historical_inputs, dim=-2)
- future_inputs = t_known_inp[:, self.encoder_length:]
-
- # Encoders
+
+ def forward(self, historical_inputs, cs, ch, cc, ce, future_inputs):
historical_features, _ = self.history_vsn(historical_inputs, cs)
history, state = self.history_encoder(historical_features, (ch, cc))
future_features, _ = self.future_vsn(future_inputs, cs)
future, _ = self.future_encoder(future_features, state)
- torch.cuda.synchronize() # this call gives perf boost for unknown reasons
+ torch.cuda.synchronize()
# skip connection
input_embedding = torch.cat([historical_features, future_features], dim=1)
@@ -343,7 +408,7 @@ def forward(self, x: Dict[str, Tensor]) -> Tensor:
enriched = self.enrichment_grn(temporal_features, c=ce)
# Temporal self attention
- x, _ = self.attention(enriched, mask_future_timesteps=True)
+ x, _ = self.attention(enriched)
    # Don't compute historical quantiles
x = x[:, self.encoder_length:, :]
@@ -365,3 +430,39 @@ def forward(self, x: Dict[str, Tensor]) -> Tensor:
out = self.quantile_proj(x)
return out
+
+
+class TemporalFusionTransformer(nn.Module):
+ """
+ Implementation of https://arxiv.org/abs/1912.09363
+ """
+ def __init__(self, config):
+ super().__init__()
+
+ if hasattr(config, 'model'):
+ config = config.model
+
+        self.encoder_length = config.encoder_length # determines how far into the past the model uses data
+
+ self.embedding = LazyEmbedding(config)
+ self.static_encoder = StaticCovariateEncoder(config)
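+        # The back half of the network is TorchScript-compiled unless conversion/export compatibility is requested.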
+ if MAKE_CONVERT_COMPATIBLE:
+ self.TFTpart2 = TFTBack(config)
+ else:
+ self.TFTpart2 = torch.jit.script(TFTBack(config))
+
+ def forward(self, x: Dict[str, Tensor]) -> Tensor:
+ s_inp, t_known_inp, t_observed_inp, t_observed_tgt = self.embedding(x)
+
+ # Static context
+ cs, ce, ch, cc = self.static_encoder(s_inp)
+ ch, cc = ch.unsqueeze(0), cc.unsqueeze(0) #lstm initial states
+
+ # Temporal input
+ _historical_inputs = [t_known_inp[:,:self.encoder_length,:], t_observed_tgt[:,:self.encoder_length,:]]
+ if t_observed_inp is not None:
+ _historical_inputs.insert(0,t_observed_inp[:,:self.encoder_length,:])
+
+ historical_inputs = torch.cat(_historical_inputs, dim=-2)
+ future_inputs = t_known_inp[:, self.encoder_length:]
+ return self.TFTpart2(historical_inputs, cs, ch, cc, ce, future_inputs)
\ No newline at end of file
diff --git a/PyTorch/Forecasting/TFT/tft_torchhub.py b/PyTorch/Forecasting/TFT/tft_torchhub.py
new file mode 100644
index 000000000..88888ed2d
--- /dev/null
+++ b/PyTorch/Forecasting/TFT/tft_torchhub.py
@@ -0,0 +1,95 @@
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import sys
+import urllib.request
+from zipfile import ZipFile
+import torch
+from torch.utils.data import DataLoader
+NGC_CHECKPOINT_URLS = {}
+NGC_CHECKPOINT_URLS["electricity"] = "https://api.ngc.nvidia.com/v2/models/nvidia/dle/tft_base_pyt_ckpt_ds-electricity/versions/22.11.0_amp/zip"
+NGC_CHECKPOINT_URLS["traffic"] = "https://api.ngc.nvidia.com/v2/models/nvidia/dle/tft_base_pyt_ckpt_ds-traffic/versions/22.11.0_amp/zip"
+def _download_checkpoint(checkpoint, force_reload):
+ model_dir = os.path.join(torch.hub._get_torch_home(), 'checkpoints')
+ if not os.path.exists(model_dir):
+ os.makedirs(model_dir)
+ ckpt_file = os.path.join(model_dir, os.path.basename(checkpoint))
+ if not os.path.exists(ckpt_file) or force_reload:
+ sys.stderr.write('Downloading checkpoint from {}\n'.format(checkpoint))
+ urllib.request.urlretrieve(checkpoint, ckpt_file)
+ with ZipFile(ckpt_file, "r") as zf:
+ zf.extractall(path=model_dir)
+ return os.path.join(model_dir, "checkpoint.pt")
+
+def nvidia_tft(pretrained=True, **kwargs):
+ from .modeling import TemporalFusionTransformer
+ """Constructs a TFT model.
+    For detailed information on model input and output, training recipes, inference and performance
+ visit: github.com/NVIDIA/DeepLearningExamples and/or ngc.nvidia.com
+ Args (type[, default value]):
+ pretrained (bool, True): If True, returns a pretrained model.
+        dataset (str, 'electricity'): selects the model type to load, either electricity or traffic. Defaults to electricity
+ """
+ ds_type = kwargs.get("dataset", "electricity")
+ ckpt = _download_checkpoint(NGC_CHECKPOINT_URLS[ds_type], True)
+ state_dict = torch.load(ckpt)
+ config = state_dict['config']
+
+ model = TemporalFusionTransformer(config)
+ if pretrained:
+ model.load_state_dict(state_dict['model'])
+ model.eval()
+ return model
+
+def nvidia_tft_data_utils(**kwargs):
+
+ from .data_utils import TFTDataset
+ from .configuration import ElectricityConfig
+ class Processing:
+ @staticmethod
+ def download_data(path):
+ if not os.path.exists(os.path.join(path, "raw")):
+ os.makedirs(os.path.join(path, "raw"), exist_ok=True)
+ dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00321/LD2011_2014.txt.zip"
+ ckpt_file = os.path.join(path, "raw/electricity.zip")
+ if not os.path.exists(ckpt_file):
+                sys.stderr.write('Downloading dataset from {}\n'.format(dataset_url))
+ urllib.request.urlretrieve(dataset_url, ckpt_file)
+ with ZipFile(ckpt_file, "r") as zf:
+ zf.extractall(path=os.path.join(path, "raw/electricity/"))
+
+ @staticmethod
+ def preprocess(path):
+ config = ElectricityConfig()
+ if not os.path.exists(os.path.join(path, "processed")):
+ os.makedirs(os.path.join(path, "processed"), exist_ok=True)
+ from data_utils import standarize_electricity as standarize
+ from data_utils import preprocess
+ standarize(os.path.join(path, "raw/electricity"))
+ preprocess(os.path.join(path, "raw/electricity/standarized.csv"), os.path.join(path, "processed/electricity_bin/"), config)
+
+
+ @staticmethod
+ def get_batch(path):
+ config = ElectricityConfig()
+ test_split = TFTDataset(os.path.join(path, "processed/electricity_bin/", "test.csv"), config)
+ data_loader = DataLoader(test_split, batch_size=16, num_workers=0)
+ for i, batch in enumerate(data_loader):
+ if i == 40:
+ break
+ return batch
+
+ return Processing()
+
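+# Hypothetical usage sketch (the entry-point names come from this file; the exact torch.hub repo string below is an assumption):
+#   model = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tft', dataset='electricity', pretrained=True)
+#   utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tft_data_utils')
+#   utils.download_data('/tmp/tft'); utils.preprocess('/tmp/tft'); batch = utils.get_batch('/tmp/tft')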
diff --git a/PyTorch/Forecasting/TFT/train.py b/PyTorch/Forecasting/TFT/train.py
old mode 100644
new mode 100755
index 37396f80b..cfdba7102
--- a/PyTorch/Forecasting/TFT/train.py
+++ b/PyTorch/Forecasting/TFT/train.py
@@ -23,10 +23,9 @@
import torch.nn.functional as F
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, RandomSampler
-from apex import amp
from apex.optimizers import FusedAdam
-#from torch.nn.parallel import DistributedDataParallel as DDP
-from apex.parallel import DistributedDataParallel as DDP
+from torch.nn.parallel import DistributedDataParallel as DDP
+from torch.cuda import amp
import numpy as np
@@ -34,48 +33,14 @@
from modeling import TemporalFusionTransformer
from configuration import CONFIGS
-from data_utils import TFTBinaryDataset, sample_data
+from data_utils import load_dataset
from log_helper import setup_logger
from criterions import QuantileLoss
from inference import predict
-from utils import PerformanceMeter
+from utils import PerformanceMeter, print_once
import gpu_affinity
from ema import ModelEma
-def load_dataset(args, config):
- train_split = TFTBinaryDataset(os.path.join(args.data_path, 'train.bin'), config)
- train_split = sample_data(train_split, args.sample_data[0])
- if args.distributed_world_size > 1:
- data_sampler = DistributedSampler(train_split, args.distributed_world_size, args.distributed_rank, seed=args.seed + args.distributed_rank, drop_last=True)
- else:
- data_sampler = RandomSampler(train_split)
- train_loader = DataLoader(train_split, batch_size=args.batch_size, num_workers=4, sampler=data_sampler, pin_memory=True)
-
- valid_split = TFTBinaryDataset(os.path.join(args.data_path, 'valid.bin'), config)
- valid_split = sample_data(valid_split, args.sample_data[1])
- if args.distributed_world_size > 1:
- data_sampler = DistributedSampler(valid_split, args.distributed_world_size, args.distributed_rank, shuffle=False, drop_last=False)
- else:
- data_sampler = None
- valid_loader = DataLoader(valid_split, batch_size=args.batch_size, sampler=data_sampler, num_workers=4, pin_memory=True)
-
- test_split = TFTBinaryDataset(os.path.join(args.data_path, 'test.bin'), config)
- if args.distributed_world_size > 1:
- data_sampler = DistributedSampler(test_split, args.distributed_world_size, args.distributed_rank, shuffle=False, drop_last=False)
- else:
- data_sampler = None
- test_loader = DataLoader(test_split, batch_size=args.batch_size, sampler=data_sampler, num_workers=4, pin_memory=True)
-
- print_once(f'Train split length: {len(train_split)}')
- print_once(f'Valid split length: {len(valid_split)}')
- print_once(f'Test split length: {len(test_split)}')
-
- return train_loader, valid_loader, test_loader
-
-def print_once(*args, **kwargs):
- if not dist.is_initialized() or dist.get_rank() == 0:
- print(*args, **kwargs)
-
def main(args):
### INIT DISTRIBUTED
@@ -113,23 +78,28 @@ def main(args):
dllogger.log(step='HPARAMS', data={**vars(args), **vars(config)}, verbosity=1)
+ train_loader, valid_loader, test_loader = load_dataset(args, config)
+
model = TemporalFusionTransformer(config).cuda()
if args.ema_decay:
model_ema = ModelEma(model, decay=args.ema_decay)
- print_once('Model params: {}'.format(sum(p.numel() for p in model.parameters())))
+ # Run dummy iteration to initialize lazy modules
+ dummy_batch = next(iter(train_loader))
+ dummy_batch = {key: tensor.cuda() if tensor.numel() else None for key, tensor in dummy_batch.items()}
+ model(dummy_batch)
+
criterion = QuantileLoss(config).cuda()
optimizer = FusedAdam(model.parameters(), lr=args.lr)
- if args.use_amp:
- model, optimizer = amp.initialize(model, optimizer, opt_level="O2", loss_scale="dynamic")
if args.distributed_world_size > 1:
- #model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)
- model = DDP(model)
+ model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)
- train_loader, valid_loader, test_loader = load_dataset(args, config)
+ print_once('Model params: {}'.format(sum(p.numel() for p in model.parameters())))
global_step = 0
- perf_meter = PerformanceMeter()
+ perf_meter = PerformanceMeter(benchmark_mode=not args.disable_benchmark)
+ if args.use_amp:
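+        # Native AMP gradient scaler with dynamic loss scaling (replaces the removed apex amp loss scaling)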
+ scaler = amp.GradScaler(init_scale=32768.0)
for epoch in range(args.epochs):
start = time.time()
@@ -139,20 +109,28 @@ def main(args):
for local_step, batch in enumerate(train_loader):
perf_meter.reset_current_lap()
batch = {key: tensor.cuda() if tensor.numel() else None for key, tensor in batch.items()}
- predictions = model(batch)
- targets = batch['target'][:,config.encoder_length:,:]
- p_losses = criterion(predictions, targets)
- loss = p_losses.sum()
-
+ with torch.jit.fuser("fuser2"), amp.autocast(enabled=args.use_amp):
+ predictions = model(batch)
+ targets = batch['target'][:,config.encoder_length:,:]
+ p_losses = criterion(predictions, targets)
+ loss = p_losses.sum()
+ if global_step == 0 and args.ema_decay:
+ model_ema(batch)
if args.use_amp:
- with amp.scale_loss(loss, optimizer) as scaled_loss:
- scaled_loss.backward()
+ scaler.scale(loss).backward()
+
else:
loss.backward()
if not args.grad_accumulation or (global_step+1) % args.grad_accumulation == 0:
+ if args.use_amp:
+ scaler.unscale_(optimizer)
if args.clip_grad:
torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip_grad)
- optimizer.step()
+ if args.use_amp:
+ scaler.step(optimizer)
+ scaler.update()
+ else:
+ optimizer.step()
optimizer.zero_grad()
if args.ema_decay:
model_ema.update(model)
@@ -164,7 +142,7 @@ def main(args):
torch.cuda.synchronize()
ips = perf_meter.update(args.batch_size * args.distributed_world_size,
- exclude_from_total=local_step in [0, len(train_loader)-1])
+ exclude_from_total=local_step in [0, 1, 2, len(train_loader)-1])
log_dict = {'P10':p_losses[0].item(), 'P50':p_losses[1].item(), 'P90':p_losses[2].item(), 'loss': loss.item(), 'items/s':ips}
dllogger.log(step=global_step, data=log_dict, verbosity=1)
@@ -188,6 +166,10 @@ def main(args):
cat_encodings = pickle.load(open(os.path.join(args.data_path,'cat_encodings.bin'), 'rb'))
unscaled_predictions, unscaled_targets, _, _ = predict(args, config, model, test_loader, tgt_scalers, cat_encodings)
+
+ unscaled_predictions = torch.from_numpy(unscaled_predictions).contiguous()
+ unscaled_targets = torch.from_numpy(unscaled_targets).contiguous()
+
losses = QuantileLoss(config)(unscaled_predictions, unscaled_targets)
normalizer = unscaled_targets.abs().mean()
quantiles = 2 * losses / normalizer
@@ -209,9 +191,10 @@ def validate(args, config, model, criterion, dataloader, global_step):
model.eval()
losses = []
+ torch.cuda.synchronize()
validation_start = time.time()
for batch in dataloader:
- with torch.no_grad():
+ with torch.jit.fuser("fuser2"), amp.autocast(enabled=args.use_amp), torch.no_grad():
batch = {key: tensor.cuda() if tensor.numel() else None for key, tensor in batch.items()}
predictions = model(batch)
targets = batch['target'][:,config.encoder_length:,:]
@@ -219,6 +202,7 @@ def validate(args, config, model, criterion, dataloader, global_step):
bs = next(t for t in batch.values() if t is not None).shape[0]
losses.append((p_losses, bs))
+ torch.cuda.synchronize()
validation_end = time.time()
    p_losses = sum([l[0]*l[1] for l in losses])/sum([l[1] for l in losses]) #takes into account that the last batch is not full
@@ -280,6 +264,7 @@ def validate(args, config, model, criterion, dataloader, global_step):
'disabled'],
help='type of CPU affinity')
parser.add_argument("--ema_decay", type=float, default=0.0, help='Use exponential moving average')
+ parser.add_argument("--disable_benchmark", action='http://github.com/store_true', help='Disable benchmarking mode')
ARGS = parser.parse_args()
diff --git a/PyTorch/Forecasting/TFT/triton/README.md b/PyTorch/Forecasting/TFT/triton/README.md
index c548a9401..862c252ef 100644
--- a/PyTorch/Forecasting/TFT/triton/README.md
+++ b/PyTorch/Forecasting/TFT/triton/README.md
@@ -146,6 +146,9 @@ NVIDIA DGX A100 (1x A100 80GB): bash ./triton/runner/start_NVIDIA-DGX-A100-\(1x-
NVIDIA T4: bash ./triton/runner/start_NVIDIA-T4.sh
```
+If one encounters an error like `the provided PTX was compiled with an unsupported toolchain`, follow the steps in
+[Step by step deployment process](#step-by-step-deployment-process).
+
## Performance
The performance measurements in this document were conducted at the time of publication and may not reflect
the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to
@@ -2077,7 +2080,7 @@ Please use the data download from the [Main QSG](https://github.com/NVIDIA/DeepL
#### Prepare Checkpoint
Please place a `checkpoint.pt` from TFT trained on electricity in `runner_workspace/checkpoints/electricity_bin/`. Note that the `electricity_bin`
subdirectory may not be created yet. In addition one can download a zip archive of a trained checkpoint
-[here](https://api.ngc.nvidia.com/v2/models/nvidia/tft_pyt_ckpt_base_eletricity_amp/versions/21.06.0/zip)
+[here](https://api.ngc.nvidia.com/v2/models/nvidia/dle/tft_base_pyt_ckpt_ds-electricity/versions/22.11.0_amp/zip)
#### Setup Container
Build and run a container that extends the NGC PyTorch container with the Triton Inference Server client libraries and dependencies.
@@ -2242,7 +2245,7 @@ mkdir -p ${SHARED_DIR}/input_data
python triton/prepare_input_data.py \
--input-data-dir ${SHARED_DIR}/input_data/ \
--dataset ${DATASETS_DIR}/${DATASET} \
- --checkpoint ${CHECKPOINT_DIR}/ \
+ --checkpoint ${CHECKPOINT_DIR}/
```
diff --git a/PyTorch/Forecasting/TFT/triton/deployment_toolkit/bermuda/pyt.py b/PyTorch/Forecasting/TFT/triton/deployment_toolkit/bermuda/pyt.py
index 2d3e3a67c..0578f3d49 100644
--- a/PyTorch/Forecasting/TFT/triton/deployment_toolkit/bermuda/pyt.py
+++ b/PyTorch/Forecasting/TFT/triton/deployment_toolkit/bermuda/pyt.py
@@ -161,6 +161,8 @@ def load(self, model_path: Union[str, Path], **kwargs) -> Model:
def _trace(self, model: Model, dataloader_fn) -> Model:
device = get_model_device(model.handle)
dummy_input = get_sample_input(dataloader_fn(), device)
+ # Run dummy forward to initialize lazy modules
+ model.handle(*dummy_input)
traced_model = torch.jit.trace_module(model.handle, {"forward": dummy_input})
return Model(traced_model, precision=model.precision, inputs=model.inputs, outputs=model.outputs)
@@ -213,6 +215,7 @@ def save(self, model: Model, model_path: Union[str, Path], dataloader_fn) -> Mod
device = get_model_device(model.handle)
dummy_input = get_sample_input(dataloader_fn(), device)
+ model.handle(*dummy_input)
with torch.no_grad():
torch.onnx.export(
model.handle,
diff --git a/PyTorch/Forecasting/TFT/triton/requirements.txt b/PyTorch/Forecasting/TFT/triton/requirements.txt
index a0af48ed3..30cbed0fa 100644
--- a/PyTorch/Forecasting/TFT/triton/requirements.txt
+++ b/PyTorch/Forecasting/TFT/triton/requirements.txt
@@ -11,7 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-model_navigator[pyt] @ git+https://github.com/triton-inference-server/model_navigator.git@v0.2.5#egg=model_navigator
+model_navigator[pyt] @ git+https://github.com/triton-inference-server/model_navigator.git@v0.2.7#egg=model_navigator
natsort>=7.0.0
networkx==2.5
numpy
@@ -21,3 +21,4 @@ pycuda>=2019.1.2
PyYAML>=5.2
tabulate>=0.8.7
tqdm>=4.44.1
+triton-model-analyzer==1.22.0
diff --git a/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-A30.yaml b/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-A30.yaml
index b76b17bb8..372b640e5 100644
--- a/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-A30.yaml
+++ b/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-A30.yaml
@@ -1,8 +1,8 @@
checkpoints:
- name: electricity_bin
- url: https://api.ngc.nvidia.com/v2/models/nvidia/tft_pyt_ckpt_base_eletricity_amp/versions/21.06.0/zip
+ url: https://api.ngc.nvidia.com/v2/models/nvidia/dle/tft_base_pyt_ckpt_ds-electricity/versions/22.11.0_amp/zip
- name: traffic_bin
- url: https://api.ngc.nvidia.com/v2/models/nvidia/tft_pyt_ckpt_base_traffic_amp/versions/21.06.0/zip
+ url: https://api.ngc.nvidia.com/v2/models/nvidia/dle/tft_base_pyt_ckpt_ds-traffic/versions/22.11.0_amp/zip
configurations:
- accelerator: none
batch_size:
@@ -112,7 +112,7 @@ configurations:
triton_gpu_engine_count: 2
triton_max_queue_delay: 1
triton_preferred_batch_sizes: 512 1024
-container_version: '21.12'
+container_version: '22.11'
datasets:
- name: electricity_bin
- name: traffic_bin
diff --git a/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-DGX-1-(1x-V100-32GB).yaml b/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-DGX-1-(1x-V100-32GB).yaml
index b76b17bb8..372b640e5 100644
--- a/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-DGX-1-(1x-V100-32GB).yaml
+++ b/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-DGX-1-(1x-V100-32GB).yaml
@@ -1,8 +1,8 @@
checkpoints:
- name: electricity_bin
- url: https://api.ngc.nvidia.com/v2/models/nvidia/tft_pyt_ckpt_base_eletricity_amp/versions/21.06.0/zip
+ url: https://api.ngc.nvidia.com/v2/models/nvidia/dle/tft_base_pyt_ckpt_ds-electricity/versions/22.11.0_amp/zip
- name: traffic_bin
- url: https://api.ngc.nvidia.com/v2/models/nvidia/tft_pyt_ckpt_base_traffic_amp/versions/21.06.0/zip
+ url: https://api.ngc.nvidia.com/v2/models/nvidia/dle/tft_base_pyt_ckpt_ds-traffic/versions/22.11.0_amp/zip
configurations:
- accelerator: none
batch_size:
@@ -112,7 +112,7 @@ configurations:
triton_gpu_engine_count: 2
triton_max_queue_delay: 1
triton_preferred_batch_sizes: 512 1024
-container_version: '21.12'
+container_version: '22.11'
datasets:
- name: electricity_bin
- name: traffic_bin
diff --git a/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-DGX-A100-(1x-A100-80GB).yaml b/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-DGX-A100-(1x-A100-80GB).yaml
index b76b17bb8..372b640e5 100644
--- a/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-DGX-A100-(1x-A100-80GB).yaml
+++ b/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-DGX-A100-(1x-A100-80GB).yaml
@@ -1,8 +1,8 @@
checkpoints:
- name: electricity_bin
- url: https://api.ngc.nvidia.com/v2/models/nvidia/tft_pyt_ckpt_base_eletricity_amp/versions/21.06.0/zip
+ url: https://api.ngc.nvidia.com/v2/models/nvidia/dle/tft_base_pyt_ckpt_ds-electricity/versions/22.11.0_amp/zip
- name: traffic_bin
- url: https://api.ngc.nvidia.com/v2/models/nvidia/tft_pyt_ckpt_base_traffic_amp/versions/21.06.0/zip
+ url: https://api.ngc.nvidia.com/v2/models/nvidia/dle/tft_base_pyt_ckpt_ds-traffic/versions/22.11.0_amp/zip
configurations:
- accelerator: none
batch_size:
@@ -112,7 +112,7 @@ configurations:
triton_gpu_engine_count: 2
triton_max_queue_delay: 1
triton_preferred_batch_sizes: 512 1024
-container_version: '21.12'
+container_version: '22.11'
datasets:
- name: electricity_bin
- name: traffic_bin
diff --git a/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-T4.yaml b/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-T4.yaml
index b76b17bb8..372b640e5 100644
--- a/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-T4.yaml
+++ b/PyTorch/Forecasting/TFT/triton/runner/config_NVIDIA-T4.yaml
@@ -1,8 +1,8 @@
checkpoints:
- name: electricity_bin
- url: https://api.ngc.nvidia.com/v2/models/nvidia/tft_pyt_ckpt_base_eletricity_amp/versions/21.06.0/zip
+ url: https://api.ngc.nvidia.com/v2/models/nvidia/dle/tft_base_pyt_ckpt_ds-electricity/versions/22.11.0_amp/zip
- name: traffic_bin
- url: https://api.ngc.nvidia.com/v2/models/nvidia/tft_pyt_ckpt_base_traffic_amp/versions/21.06.0/zip
+ url: https://api.ngc.nvidia.com/v2/models/nvidia/dle/tft_base_pyt_ckpt_ds-traffic/versions/22.11.0_amp/zip
configurations:
- accelerator: none
batch_size:
@@ -112,7 +112,7 @@ configurations:
triton_gpu_engine_count: 2
triton_max_queue_delay: 1
triton_preferred_batch_sizes: 512 1024
-container_version: '21.12'
+container_version: '22.11'
datasets:
- name: electricity_bin
- name: traffic_bin
diff --git a/PyTorch/Forecasting/TFT/triton/scripts/docker/triton_inference_server.sh b/PyTorch/Forecasting/TFT/triton/scripts/docker/triton_inference_server.sh
index 242434d3a..481e6c9c2 100644
--- a/PyTorch/Forecasting/TFT/triton/scripts/docker/triton_inference_server.sh
+++ b/PyTorch/Forecasting/TFT/triton/scripts/docker/triton_inference_server.sh
@@ -41,7 +41,7 @@ docker run --rm -d \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--ipc=host \
- nvcr.io/nvidia/tritonserver:21.12-py3 tritonserver \
+ nvcr.io/nvidia/tritonserver:22.11-py3 tritonserver \
--model-store=${MODEL_REPOSITORY_PATH} \
--strict-model-config=false \
--exit-on-error=true \
diff --git a/PyTorch/Forecasting/TFT/utils.py b/PyTorch/Forecasting/TFT/utils.py
index fc993bd63..b85d361c1 100644
--- a/PyTorch/Forecasting/TFT/utils.py
+++ b/PyTorch/Forecasting/TFT/utils.py
@@ -13,12 +13,17 @@
# limitations under the License.
import time
+import torch.distributed as dist
+import torch
class PerformanceMeter():
- def __init__(self):
+ def __init__(self, benchmark_mode=True):
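+        # benchmark_mode adds torch.cuda.synchronize() around timing points so measured intervals reflect completed GPU work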
+ self.benchmark_mode = benchmark_mode
self.reset()
def reset(self):
+ if self.benchmark_mode:
+ torch.cuda.synchronize()
self.avg = 0
self.count = 0
self.total_time = 0
@@ -26,6 +31,8 @@ def reset(self):
self.intervals = []
def update(self, n, exclude_from_total=False):
+ if self.benchmark_mode:
+ torch.cuda.synchronize()
delta = time.time() - self.last_update_time
self.intervals.append(delta)
if not exclude_from_total:
@@ -37,6 +44,8 @@ def update(self, n, exclude_from_total=False):
return n/delta
def reset_current_lap(self):
+ if self.benchmark_mode:
+ torch.cuda.synchronize()
self.last_update_time = time.time()
def p(self, i):
@@ -44,3 +53,7 @@ def p(self, i):
idx = int(len(self.intervals) * i / 100)
return sorted(self.intervals)[idx]
+def print_once(*args, **kwargs):
+ if not dist.is_initialized() or dist.get_rank() == 0:
+ print(*args, **kwargs)
+
diff --git a/PyTorch/LanguageModeling/BART/Dockerfile b/PyTorch/LanguageModeling/BART/Dockerfile
index c09b2e2ad..f49237538 100755
--- a/PyTorch/LanguageModeling/BART/Dockerfile
+++ b/PyTorch/LanguageModeling/BART/Dockerfile
@@ -14,55 +14,25 @@
# limitations under the License.
# ==============================================================================
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:21.02-py3
-
-######
-# Tokenizers is only available pre-built on x86
-#
-FROM ${FROM_IMAGE_NAME} AS tokenizers_amd64
-WORKDIR /wheelhouse
-RUN pip download tokenizers==0.8.0
-
-FROM quay.io/pypa/manylinux2014_aarch64 as tokenizers_arm64
-ARG PYVER=38
-RUN yum install -y openssl-devel
-RUN curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2020-05-14 -y
-ENV PATH="/root/.cargo/bin:$PATH"
-ENV PYBIN=/opt/python/cp${PYVER}-cp${PYVER}/bin
-ENV PYTHON_SYS_EXECUTABLE="$PYBIN/python"
-RUN git clone -b python-v0.8.0 https://github.com/huggingface/tokenizers.git /opt/tokenizers
-WORKDIR /opt/tokenizers/bindings/python
-RUN "${PYBIN}/pip" install setuptools-rust \
- && "${PYBIN}/python" setup.py bdist_wheel \
- && rm -rf build/* \
- && for whl in dist/*.whl; do \
- auditwheel repair "$whl" -w dist/; \
- done \
- && rm dist/*-linux_* \
- && mkdir -p /wheelhouse \
- && mv dist/*.whl /wheelhouse
-
-ARG TARGETARCH
-FROM tokenizers_${TARGETARCH} AS tokenizers
-#
-#####
-
-
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.08-py3
FROM ${FROM_IMAGE_NAME}
-RUN apt-get update && apt-get install -y pbzip2
-RUN --mount=from=tokenizers,source=/wheelhouse,target=/tmp/wheelhouse \
- pip install --no-cache-dir /tmp/wheelhouse/tokenizers*.whl
-RUN pip install --no-cache-dir dataclasses gitpython rouge-score pynvml==8.0.4 \
- git+https://github.com/NVIDIA/dllogger pytorch-lightning==1.1.5 gdown sacrebleu
-
-RUN pip install tqdm --upgrade
+RUN apt-get update
+COPY requirements.txt .
+RUN pip install --upgrade --no-cache-dir pip \
+ && pip install --no-cache-dir -r requirements.txt
WORKDIR /workspace
-RUN git clone https://github.com/artmatsak/cnn-dailymail.git
+RUN git clone https://github.com/abisee/cnn-dailymail.git
RUN git clone https://github.com/gcunhase/AMICorpusXML.git
+# Re-build apex
+RUN git clone https://github.com/nv-joseli/apex.git
+RUN cd apex && \
+ git checkout bf16lamb && \
+ NVCC_APPEND_FLAGS='--threads 1' pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
+
WORKDIR /workspace/bart
COPY . .
diff --git a/PyTorch/LanguageModeling/BART/README.md b/PyTorch/LanguageModeling/BART/README.md
index 044f2907a..16e2761c3 100755
--- a/PyTorch/LanguageModeling/BART/README.md
+++ b/PyTorch/LanguageModeling/BART/README.md
@@ -1,4 +1,4 @@
-# BART 1.0 For PyTorch
+# BART For PyTorch
This repository provides a script and recipe to train the BART model to achieve state-of-the-art accuracy and is tested and maintained by NVIDIA.
@@ -30,16 +30,15 @@ This repository provides a script and recipe to train the BART model to achieve
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
- * [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
- * [Training accuracy: NVIDIA DGX-1 V100 (8x V100 32GB)](#training-accuracy-nvidia-dgx-1-v100-8x-v100-32gb)
+ * [Pre-training accuracy: NVIDIA DGX A100 (320x A100 80GB)](#pre-training-accuracy-nvidia-dgx-a100-320x-a100-80gb)
+ * [Fine-tuning accuracy: NVIDIA DGX A100 (8x A100 80GB)](#fine-tuning-accuracy-nvidia-dgx-a100-8x-a100-80gb)
* [Training stability test](#training-stability-test)
* [Training performance results](#training-performance-results)
- * [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
- * [Training performance: NVIDIA DGX-1 V100 (8x V100 32GB)](#training-performance-nvidia-dgx-1-v100-8x-v100-32gb)
+ * [Pre-training performance: Single-node on NVIDIA DGX A100 (8x A100 80GB)](#pre-training-performance-single-node-on-nvidia-dgx-a100-8x-a100-80gb)
+ * [Pre-training performance: Multi-node on NVIDIA DGX A100 (8x A100 80GB)](#pre-training-performance-multi-node-on-nvidia-dgx-a100-8x-a100-80gb)
+ * [Fine-tuning performance: NVIDIA DGX A100 (8x A100 80GB)](#fine-tuning-performance-nvidia-dgx-a100-8x-a100-80gb)
* [Inference performance results](#inference-performance-results)
* [Inference performance: NVIDIA DGX A100 (1x A100 80GB)](#inference-performance-nvidia-dgx-a100-1x-a100-80gb)
- * [Inference performance: NVIDIA DGX-1 V100 (1x V100 32GB)](#inference-performance-nvidia-dgx-1-v100-1x-v100-16gb)
- * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
@@ -76,16 +75,33 @@ Inference is done by default with beam search 4 for CNN-DM dataset and 6 for XSu
The following features are supported by this model:
-| **Feature** | **BERT** |
+| **Feature** | **BART** |
|:---------:|:----------:|
-|APEX AMP|Yes|
-|APEX DDP|Yes|
+| PyTorch AMP | Yes |
+| PyTorch DDP | Yes |
+| LAMB | Yes |
+| Multi-node | Yes |
+| LDDL | Yes |
+| Pre-LN | Yes |
#### Features
[APEX](https://github.com/NVIDIA/apex) is a PyTorch extension with NVIDIA-maintained utilities to streamline mixed precision and distributed training, whereas [AMP](https://nvidia.github.io/apex/amp.html) is an abbreviation used for automatic mixed precision training.
[DDP](https://nvidia.github.io/apex/parallel.html) stands for DistributedDataParallel and is used for multi-GPU training.
+
+[LAMB](https://arxiv.org/pdf/1904.00962.pdf), which stands for Layerwise Adaptive Moments based optimizer, is a large batch optimization technique that helps accelerate training of deep neural networks using large minibatches. It allows using a global batch size of 65536 and 32768 on sequence lengths 128 and 512 respectively, compared to a batch size of 256 for [Adam](https://arxiv.org/pdf/1412.6980.pdf). The optimized implementation accumulates 1024 gradient batches in phase 1 and 4096 steps in phase 2 before updating weights once. This results in a 15% training speedup. On multi-node systems, LAMB allows scaling up to 1024 GPUs, resulting in training speedups of up to 72x in comparison to Adam. Adam has limitations on the learning rate that can be used since it is applied globally on all parameters, whereas LAMB follows a layerwise learning rate strategy.
+
+NVLAMB adds the necessary tweaks to [LAMB version 1](https://arxiv.org/abs/1904.00962v1), to ensure correct convergence. The algorithm is as follows:
+
+ 
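+
+A rough, layerwise sketch of the LAMB trust-ratio update is shown below (illustrative only; `adam_directions` stands in for the usual Adam update including weight decay, and clipping/epsilon details differ between implementations):
+
+```python
+import torch
+
+def lamb_like_step(params, adam_directions, lr=2e-3):
+    """Apply a simplified layerwise trust-ratio update to each weight tensor."""
+    with torch.no_grad():
+        for w, u in zip(params, adam_directions):    # one trust ratio per weight tensor ("layer")
+            w_norm, u_norm = w.norm(), u.norm()
+            trust = w_norm / u_norm if u_norm > 0 and w_norm > 0 else 1.0
+            w -= lr * trust * u                      # layerwise adaptive learning rate
+
+weights = [torch.randn(4, 4), torch.randn(4)]
+directions = [torch.randn_like(t) for t in weights]  # stand-ins for Adam update directions
+lamb_like_step(weights, directions)
+```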
+
+In this PyTorch BART example, we used global batch sizes of 64000 and 30720 for sequence lengths 128 and 512 respectively, compared to the batch size of 8000 at sequence length 512 from [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf), which Facebook followed for BART. We trained on only 44% of the total number of tokens compared to Facebook's BART, obtaining a 2.7x training speedup while achieving similar accuracy.
+
+[LDDL](../lddl) is a library that enables scalable data preprocessing and loading. LDDL is used by this PyTorch BART example.
+
+[Pre-LN](https://arxiv.org/pdf/2002.04745.pdf) is a transformer architecture variant in which layer normalization is placed inside the residual blocks. In our experiments, the loss of the Pre-LN transformer decays faster and training is more stable, without exploding or vanishing gradients.
+
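+A minimal, self-contained sketch of a Pre-LN block (illustrative only; not the module used in this repository) could look like this:
+
+```python
+import torch
+from torch import nn
+
+class PreLNBlock(nn.Module):
+    """Pre-LN transformer block: LayerNorm is applied inside each residual branch."""
+    def __init__(self, d_model=64, n_heads=4):
+        super().__init__()
+        self.norm1 = nn.LayerNorm(d_model)
+        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
+        self.norm2 = nn.LayerNorm(d_model)
+        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
+
+    def forward(self, x):
+        h = self.norm1(x)                                   # normalize before attention
+        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual add on the un-normalized path
+        return x + self.ff(self.norm2(x))                   # normalize before the feed-forward
+
+out = PreLNBlock()(torch.randn(2, 10, 64))                  # (batch, seq_len, d_model)
+```
+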
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using [mixed precision training](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) previously required two steps:
@@ -99,17 +115,15 @@ For information about:
#### Enabling mixed precision
-In this repository, mixed precision training is enabled by PyTorch Lightning with NVIDIA’s APEX library. The APEX library has an automatic mixed precision module that allows mixed precision to be enabled with minimal code changes.
+In this repository, mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP)
+autocast [torch.cuda.amp.autocast](https://pytorch.org/docs/stable/amp.html#autocasting) which casts variables
+to half-precision upon retrieval, while storing variables in single-precision format.
+Furthermore, to preserve small gradient magnitudes in backpropagation,
+a [gradient scaling](https://pytorch.org/docs/stable/amp.html#gradient-scaling)
+step must be included.
-Automatic mixed precision can be enabled with the following code changes:
-
-```
-if args.fp16:
- train_params["precision"] = 16
- train_params["amp_level"] = args.amp_level
-```
-
-Where `` is the optimization level. In the summarization, `O1` is set as the optimization level. Mixed precision training can be turned on by passing the `fp16` argument to the `finetune.py`. All shell scripts have a positional argument available to enable mixed precision training.
+For an in-depth walkthrough of AMP, check out the sample usage
+[here](https://pytorch.org/docs/stable/amp.html).
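+
+A minimal sketch of the autocast/GradScaler pattern follows (illustrative only; the model, data, and optimizer here are placeholders, not the ones used in this repository):
+
+```python
+import torch
+from torch import nn
+from torch.cuda import amp
+
+model = nn.Linear(16, 1).cuda()                        # stand-in model
+optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
+scaler = amp.GradScaler()                              # dynamic loss scaling
+
+for _ in range(10):
+    x = torch.randn(8, 16, device='cuda')
+    target = torch.randn(8, 1, device='cuda')
+    optimizer.zero_grad()
+    with amp.autocast():                               # run eligible ops in reduced precision
+        loss = nn.functional.mse_loss(model(x), target)
+    scaler.scale(loss).backward()                      # scale loss to avoid fp16 gradient underflow
+    scaler.step(optimizer)                             # unscales gradients, skips step on inf/nan
+    scaler.update()                                    # adjust the loss scale for the next step
+```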
#### TF32
@@ -146,10 +160,8 @@ The following section lists the requirements that you need to meet in order to s
This repository contains Dockerfile which extends the PyTorch
NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-- [PyTorch 21.02-py3+](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) NGC container
+- [PyTorch 22.08-py3+](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) NGC container
- Supported GPUs:
-- [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
-- [NVIDIA Turing architecture](https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/)
- [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
@@ -193,43 +205,73 @@ Use the following script to download and preprocess CNN DM data as well as XSum
bash scripts/get_data.sh
```
+Use the following script to download Wikipedia, Common Crawl, and OpenWebTextCorpus as the pre-training dataset:
+```bash
+bash scripts/get_pretraining_data.sh
+```
+The pretraining dataset is 200GB+ and takes 24+ hours to download.
+
+To download a smaller dataset, you can narrow the date range of the Common Crawl archive so the download takes less time. For example:
+```bash
+download_common_crawl \
+ --outdir $data_folder/common_crawl \
+ --warc-files-start-date 2016-09-01 \
+ --warc-files-end-date 2016-10-31 \
+ --start-date 2016-09-01 \
+ --end-date 2016-10-31
+```
+
+Use the following script to preprocess the pre-training dataset into LDDL Parquet shards:
+```bash
+bash scripts/preprocess_pretrain_data.sh
+```
+
By default, the path to the data folder is set to /workspace/bart/data for ease of use in all the scripts.
-5. Start summarizing.
+5. Start pre-training
+
+BART is designed to pre-train language representations. The following scripts replicate pre-training on Wikipedia, Common Crawl, and OpenWebTextCorpus from the LAMB paper. These scripts are general and can be used for pre-training language representations on any corpus of choice.
+From within the container, you can use the following script to run pre-training using LAMB.
+
+```bash
+bash scripts/run_pretraining.sh
+```
+
+6. Start summarizing.
Pretrained BART representations can be fine tuned for a state-of-the-art summarization system. From within the container, you can use the following script to run summarization on CNN DM dataset.
```bash
-bash scripts/run_summarization.sh
+bash scripts/run_summarization.sh
```
This repository contains a number of predefined configurations to run the CNN+DM fine tuning on NVIDIA DGX-1 V100 or NVIDIA DGX A100 nodes in `scripts/params/cnn_dm_params.sh`. For example, to use the default DGX A100 8 gpu config, run:
```bash
-bash scripts/run_summarization.sh $(source scripts/params/cnn_dm_params.sh && dgxa100_8gpu_fp16)
+bash scripts/run_summarization.sh $(source scripts/params/cnn_dm_params.sh && dgxa100_8gpu_bf16)
```
Similarly, configurations for XSum dataset are available in `scripts/params/xsum_params.sh`.
-6. Start inference/predictions.
+7. Start inference/predictions.
You can run the following script to run inference summarization using a fine-tuned checkpoint:
```bash
-bash scripts/run_eval_summarization.sh
+bash scripts/run_eval_summarization.sh
```
This repository contains multiple predefined configurations in `scripts/params/cnn_dm_params.sh` and `scripts/params/xsum_params.sh`. For example, to run inference on CNN-DM with a checkpoint run:
```bash
-bash scripts/run_eval_summarization.sh $(source scripts/params/cnn_dm_params.sh && dgxa100_8gpu_fp16_eval)
+bash scripts/run_eval_summarization.sh $(source scripts/params/cnn_dm_params.sh && dgxa100_8gpu_bf16_eval)
```
Now that you have your model trained and evaluated, you can choose to compare your training results with our [Training accuracy results](#training-accuracy-results). You can also choose to benchmark your performance against the [Training performance benchmark](#training-performance-results), or [Inference performance benchmark](#inference-performance-results). Following the steps in these sections will ensure that you achieve the same accuracy and performance results as stated in the [Results](#results) section.
-7. Run Custom Inference with the fine-tuned checkpoint
+8. Run Custom Inference with the fine-tuned checkpoint
We can write a simple few lines of code to run custom inference with the fine-tuned checkpoint.
```python
@@ -238,7 +280,8 @@ from bart.tokenization.tokenization_bart import BartTokenizer
from bart.configuration.configuration_bart import BartConfig
import json
config = BartConfig(**json.load(open('configs/config.json', "r")))
-config.fp16 = False
+config.dtype = None
+config.pre_ln = True
model_path = 'results/_epoch1_step2000.ckpt' # The fine-tuned checkpoint path
model = BartForConditionalGeneration.from_pretrained(model_path, config=config)
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
@@ -262,6 +305,7 @@ The following sections provide greater details of the dataset, running training
### Scripts and sample code
In the root directory, the most important files are:
+* `pretrain.py` - Serves as entry point for pre-training
* `finetune.py` - Serves as entry point for fine-tuning
* `run_eval.py` - Serves as entry point for inference
* `Dockerfile` - Container with the basic set of dependencies to run BART
@@ -270,6 +314,8 @@ The `scripts/` folder encapsulates all the one-click scripts required for runnin
* `run_summarization.sh` - Runs summarization finetuning followed by inference using the `finetune.py` and `run_eval.py` files.
* `run_summarization_eval.sh` - Runs inference on fine tuned checkpoint using the `run_eval.py` file.
* `get_data.sh` - Preprocesses CNN-DM dataset as well as downloads and preprocesses XSum dataset.
+* `get_pretraining_data.sh` - Downloads pre-train dataset.
+* `preprocess_pretrain_data.sh` - Preprocesses pre-train dataset.
Other folders included in the root directory are:
* `data/` - Necessary folder to download datasets required for fine tuning of BART.
@@ -277,27 +323,47 @@ Other folders included in the root directory are:
* `utils/` - Necessary utility files for BART model.
### Parameters
-Aside from the options to set hyperparameters, the relevant options to control the behaviour of the `run_pretraining.py` script are:
+Aside from the options to set hyperparameters, the relevant options to control the behaviour of the `pretrain.py` script are:
+
+```
+--config_path: The configuration file corresponding to BART Model
+--warmup_steps: Number of WARMUP_STEPS
+--max_steps: Number of MAX_STEPS
+--data_dir: Location to DATA_DIR
+--learning_rate: Learning Rate
+--n_val: Number of validation examples to test for early stopping
+--train_batch_size: Train batch size
+--gradient_accumulation_steps: Number of accumulation steps
+--max_source_length: Maximum source length
+--max_target_length: Maximum target length
+--val_max_target_length: Maximum length of validation tokens
+--eval_max_gen_length: Maximum length while generating validation tokens
+--weight_decay: weight decay
+--dropout: drop out
+--lamb: Whether to use LAMB optimizer
+--pre_ln: Whether to use Pre-LN architecture
+--allreduce_post_accumulation_half_precision: Whether to do fp16/bf16 allreduce post accumulation
+```
+
+Aside from the options to set hyperparameters, the relevant options to control the behaviour of the `finetune.py` script are:
```
--config_path: The configuration file corresponding to BART Model
--warmup_steps: Number of WARMUP_STEPS
--max_steps: Number of MAX_STEPS
--data_dir: Location to DATA_DIR
---gpus: Number of GPUs
--learning_rate: Learning Rate
--n_val: Number of validation examples to test for early stopping
--train_batch_size: Train batch size
--gradient_accumulation_steps: Number of accumulation steps
---val_check_interval: Periodicity of checking validation score
--max_source_length: Maximum source length
--max_target_length: Maximum target length
--val_max_target_length: Maximum length of validation tokens
--eval_max_gen_length: Maximum length while generating validation tokens
--weight_decay: weight decay
--dropout: drop out
---early_stopping_patience: number of validation trials of no improvement before which to trigger early stopping
---amp_level: amp mode of optimization level to use if training with mixed precision
+--pre_ln: Whether to use Pre-LN architecture
+--allreduce_post_accumulation_half_precision: Whether to do fp16/bf16 allreduce post accumulation
```
### Command-line options
@@ -305,13 +371,19 @@ Aside from the options to set hyperparameters, the relevant options to control t
To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option with the Python file, for example:
```bash
+python pretrain.py --help
python finetune.py --help
python run_eval.py --help
```
### Getting the data
+For pre-training BART, we use the concatenation of Wikipedia, Common Crawl, and OpenWebTextCorpus.
+
+Common Crawl is an archive of news articles from small and major publishers worldwide, provided by commoncrawl.org.
+
+OpenWebTextCorpus is an open source effort to reproduce OpenAI’s WebText dataset. The distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University.
-We have tested fine tuning the BART model on summarization benchmarks such as CNN-DM and XSum.
+For fine-tuning, we have tested the BART model on summarization benchmarks such as CNN-DM and XSum.
CNN-DM is a concatenation of CNN Stories as well as Daily Mail Stories. CNN consists of approximately 90k documents whereas Daily Mail consists of 197k documents.
@@ -323,7 +395,7 @@ XSum, on the other hand, is also a single-document summarization task dataset bu
#### Dataset guidelines
-The repository contains a script to preprocess and download data. It can be run as:
+The repository contains scripts to download and preprocess the data. They can be run as:
```bash
bash scripts/get_data.sh
@@ -333,15 +405,67 @@ The script downloads CNN and DM raw data from [here](https://cs.nyu.edu/~kcho/DM
The script also downloads the XSum dataset from the [HuggingFace storage](https://s3.amazonaws.com/datasets.huggingface.co/summarization/xsum.tar.gz).
+```bash
+bash scripts/get_pretraining_data.sh
+```
+The script uses the LDDL downloader to fetch the Wikipedia, Common Crawl, and OpenWebTextCorpus datasets. Common Crawl is downloaded with [news-please](https://github.com/fhamborg/news-please), and OpenWebTextCorpus is downloaded from [here](https://skylion007.github.io/OpenWebTextCorpus/).
+
+To download a smaller dataset, you can narrow the date range of the Common Crawl archive in the script so the download takes less time. For example:
+```bash
+download_common_crawl \
+ --outdir $data_folder/common_crawl \
+ --warc-files-start-date 2016-09-01 \
+ --warc-files-end-date 2016-10-31 \
+ --start-date 2016-09-01 \
+ --end-date 2016-10-31
+```
+
+```bash
+bash scripts/preprocess_pretrain_data.sh
+```
+The script uses the LDDL preprocessor and load balancer to preprocess the pre-training dataset into Parquet shards, which are then streamed during pre-training by the LDDL data loader.
+
The script by default stores the data into the `/workspace/bart/data` folder.
### Training process
+The training process consists of two steps: pre-training and fine-tuning.
+
+#### Pre-training
+Pre-training BART is done using the `scripts/run_pretraining.sh` script that, in turn, uses the `pretrain.py` file to perform training.
+
+For example, it can be invoked by calling:
+
+```bash
+bash scripts/run_pretraining.sh
+```
+
+Where:
+* train_batch_size_phase* - per-GPU batch size used for training in the respective phase
+* learning_rate_phase* - Learning rate in the respective phase
+* precision - fp16/bf16/fp32/tf32 precision for training
+* use_preln - Whether to use Pre-LN architecture
+* num_gpus - number of GPUs to run training with
+* warmup_steps_phase* - Number of warmup steps for learning rate scheduler in the respective phase
+* train_steps_phase* - Number of training steps in the respective phase
+* save_checkpoint_steps - Number of steps for saving checkpoint
+* num_accumulation_phase* - Number of accumulation steps for an effective larger training batch size in the respective phase
+* config_path - path to configuration file of BART Model
+
+
+
+By default, the training script stores results to `results/bart_pyt_pretraining` and runs with:
+
+```bash
+bash scripts/run_pretraining.sh 200 32 5e-3 4e-3 bf16 true 8 2166 200 95040 7560 100 40 120 configs/config.json
+```
+
+#### Fine-tuning
Training BART for summarization is done using `scripts/run_summarization.sh` script that, in turn, uses the `finetune.py` file to perform training.
For example, it can be invoked by calling:
```bash
-Bash scripts/run_summarization.sh