Commit f81fca9

leo0519 authored and nv-kkudrynski committed
[ResNet50/Paddle] Do inference with synthetic input as default
1 parent 1e10352 commit f81fca9

File tree: 8 files changed (+169, -82 lines)

PaddlePaddle/Classification/RN50v1.5/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/paddlepaddle:22.05-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/paddlepaddle:23.02-py3
 FROM ${FROM_IMAGE_NAME}
 
 ADD requirements.txt /workspace/

PaddlePaddle/Classification/RN50v1.5/README.md

Lines changed: 81 additions & 62 deletions
@@ -504,6 +504,8 @@ Advanced Training:
                         be applied when --asp and --prune-model is set. (default: mask_1d)
 
 Paddle-TRT:
+  --device DEVICE_ID
+                        The GPU device id for Paddle-TRT inference. (default: 0)
   --trt-inference-dir TRT_INFERENCE_DIR
                         A path to store/load inference models. export_model.py would export models to this folder, then inference.py
                         would load from here. (default: ./inference)
@@ -521,7 +523,7 @@ Paddle-TRT:
                         A file in which to store JSON model exporting report. (default: ./export.json)
   --trt-log-path TRT_LOG_PATH
                         A file in which to store JSON inference report. (default: ./inference.json)
-  --trt-use-synthat TRT_USE_SYNTHAT
+  --trt-use-synthetic TRT_USE_SYNTHETIC
                         Apply synthetic data for benchmark. (default: False)
 ```
 
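As an aside on the renamed option: `--trt-use-synthetic` takes an explicit string value rather than acting as a store-true switch. A minimal sketch of how such a flag is commonly wired with argparse follows; the `str2bool` helper is an assumption for illustration, not code from this repository.

```python
# Hypothetical sketch of a string-to-bool argparse flag like
# --trt-use-synthetic; the str2bool helper name is an assumption.
import argparse

def str2bool(v):
    # Accept "True"/"False" (any case) as passed on the command line.
    if isinstance(v, bool):
        return v
    if v.lower() in ("true", "t", "yes", "1"):
        return True
    if v.lower() in ("false", "f", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"boolean value expected, got {v!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--trt-use-synthetic", type=str2bool, default=False,
                    metavar="TRT_USE_SYNTHETIC",
                    help="Apply synthetic data for benchmark. (default: False)")

print(parser.parse_args(["--trt-use-synthetic", "True"]).trt_use_synthetic)  # True
```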
@@ -672,17 +674,27 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
 ```
 
 #### Inference with TensorRT
-To run inference with TensorRT for the best performance, you can apply the scripts in `scripts/inference`.
+For inference with TensorRT, we provide two benchmark modes: with or without data preprocessing.
+
+The default scripts in `scripts/inference` use synthetic input to run inference without data preprocessing.
 
 For example,
 1. Run `bash scripts/inference/export_resnet50_AMP.sh <your_checkpoint>` to export an inference model.
-   - The default path of checkpoint is `./output/ResNet50/89`.
+   - The default path of the checkpoint is `./output/ResNet50/89`.
 2. Run `bash scripts/inference/infer_resnet50_AMP.sh` to infer with TensorRT.
 
 Or you could manually run `export_model.py` and `inference.py` with specific arguments; refer to [Command-line options](#command-line-options).
 
 Note that arguments passed to `export_model.py` and `inference.py` should be the same as the arguments used in training.
 
+To run inference with data preprocessing, set the option `--trt-use-synthetic` to `False` and `--image-root` to the path of your own dataset. For example,
+
+```bash
+python inference.py --trt-inference-dir <path_to_model> \
+                    --image-root <your_own_data_set> \
+                    --trt-use-synthetic False
+```
+
 ## Performance
 
 The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).
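To make the new default concrete, the mode selection inside `inference.py` can be pictured roughly as below. This is a hypothetical wrapper assuming the `build_dataloader`, `dali_synthetic_dataloader`, and `Mode` names from `dali.py`; the commit's actual wiring may differ.

```python
# Hypothetical wrapper (not the commit's actual inference.py code)
# showing how --trt-use-synthetic would pick a dataloader.
from dali import build_dataloader, dali_synthetic_dataloader, Mode

def get_eval_loader(args, device="gpu:0"):
    if args.trt_use_synthetic:
        # Synthetic pinned-memory input: no file I/O, no preprocessing.
        return dali_synthetic_dataloader(args, device)
    # Real validation images under <image-root>/val, with full preprocessing.
    return build_dataloader(args, Mode.EVAL)
```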
@@ -748,7 +760,7 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
 
 ##### Benchmark with TensorRT
 
-To benchmark the inference performance with TensorRT on a specific batch size, run:
+To benchmark the inference performance with TensorRT on a specific batch size, run `inference.py` with `--trt-use-synthetic True`. The benchmark uses synthetic input without data preprocessing.
 
 * FP32 / TF32
 ```bash
@@ -757,7 +769,8 @@ python inference.py \
     --trt-precision FP32 \
     --batch-size <batch_size> \
     --benchmark-steps 1024 \
-    --benchmark-warmup-steps 16
+    --benchmark-warmup-steps 16 \
+    --trt-use-synthetic True
 ```
 
 * FP16
@@ -767,13 +780,12 @@ python inference.py \
     --trt-precision FP16 \
     --batch-size <batch_size> \
     --benchmark-steps 1024 \
-    --benchmark-warmup-steps 16
+    --benchmark-warmup-steps 16 \
+    --trt-use-synthetic True
 ```
 
 Note that arguments passed to `inference.py` should be the same as the arguments used in training.
 
-The benchmark uses the validation dataset by default, which should be put in `--image-root/val`.
-For the performance benchmark of the raw model, a synthetic dataset can be used. To use synthetic dataset, add `--trt-use-synthat True` as a command line option.
 
 ### Results
 
@@ -866,96 +878,103 @@ Our results were obtained by running the applicable training script with `--run-
 #### Paddle-TRT performance: NVIDIA DGX A100 (1x A100 80GB)
 Our results for Paddle-TRT were obtained by running the `inference.py` script on NVIDIA DGX A100 with 1x A100 80GB GPU.
 
+Note that the benchmark does not include data preprocessing. Refer to [Benchmark with TensorRT](#benchmark-with-tensorrt).
+
 **TF32 Inference Latency**
 
 |**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
 |--------------|------------------|---------------|---------------|---------------|---------------|
-| 1 | 716.49 img/s | 1.40 ms | 1.96 ms | 2.20 ms | 3.01 ms |
-| 2 | 1219.98 img/s | 1.64 ms | 2.26 ms | 2.90 ms | 5.04 ms |
-| 4 | 1880.12 img/s | 2.13 ms | 3.39 ms | 4.44 ms | 7.32 ms |
-| 8 | 2404.10 img/s | 3.33 ms | 4.51 ms | 5.90 ms | 10.39 ms |
-| 16 | 3101.28 img/s | 5.16 ms | 7.06 ms | 9.13 ms | 15.18 ms |
-| 32 | 3294.11 img/s | 9.71 ms | 21.42 ms | 26.94 ms | 35.79 ms |
-| 64 | 4327.38 img/s | 14.79 ms | 25.59 ms | 30.45 ms | 45.34 ms |
-| 128 | 4956.59 img/s | 25.82 ms | 33.74 ms | 40.36 ms | 56.06 ms |
-| 256 | 5244.29 img/s | 48.81 ms | 62.11 ms | 67.56 ms | 88.38 ms |
+| 1 | 915.48 img/s | 1.09 ms | 1.09 ms | 1.18 ms | 1.19 ms |
+| 2 | 1662.70 img/s | 1.20 ms | 1.21 ms | 1.29 ms | 1.30 ms |
+| 4 | 2856.25 img/s | 1.40 ms | 1.40 ms | 1.49 ms | 1.55 ms |
+| 8 | 3988.80 img/s | 2.01 ms | 2.01 ms | 2.10 ms | 2.18 ms |
+| 16 | 5409.55 img/s | 2.96 ms | 2.96 ms | 3.05 ms | 3.07 ms |
+| 32 | 6406.13 img/s | 4.99 ms | 5.00 ms | 5.08 ms | 5.12 ms |
+| 64 | 7169.75 img/s | 8.93 ms | 8.94 ms | 9.01 ms | 9.04 ms |
+| 128 | 7616.79 img/s | 16.80 ms | 16.89 ms | 16.90 ms | 16.99 ms |
+| 256 | 7843.26 img/s | 32.64 ms | 32.85 ms | 32.88 ms | 32.93 ms |
 
 **FP16 Inference Latency**
 
 |**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
 |--------------|------------------|---------------|---------------|---------------|---------------|
-| 1 | 860.90 img/s | 1.16 ms | 1.81 ms | 2.06 ms | 2.98 ms |
-| 2 | 1464.06 img/s | 1.37 ms | 2.13 ms | 2.73 ms | 4.76 ms |
-| 4 | 2246.24 img/s | 1.78 ms | 3.17 ms | 4.20 ms | 7.39 ms |
-| 8 | 2457.44 img/s | 3.25 ms | 4.35 ms | 5.50 ms | 9.98 ms |
-| 16 | 3928.83 img/s | 4.07 ms | 6.26 ms | 8.50 ms | 15.10 ms |
-| 32 | 3853.13 img/s | 8.30 ms | 19.87 ms | 25.51 ms | 34.99 ms |
-| 64 | 5581.89 img/s | 11.46 ms | 22.32 ms | 30.75 ms | 43.35 ms |
-| 128 | 6846.77 img/s | 18.69 ms | 25.43 ms | 35.03 ms | 50.04 ms |
-| 256 | 7481.19 img/s | 34.22 ms | 40.92 ms | 51.10 ms | 65.68 ms |
+| 1 | 1265.67 img/s | 0.79 ms | 0.79 ms | 0.88 ms | 0.89 ms |
+| 2 | 2339.59 img/s | 0.85 ms | 0.86 ms | 0.94 ms | 0.96 ms |
+| 4 | 4271.30 img/s | 0.94 ms | 0.94 ms | 1.03 ms | 1.04 ms |
+| 8 | 7053.76 img/s | 1.13 ms | 1.14 ms | 1.22 ms | 1.31 ms |
+| 16 | 10225.85 img/s | 1.56 ms | 1.57 ms | 1.65 ms | 1.67 ms |
+| 32 | 12802.53 img/s | 2.50 ms | 2.50 ms | 2.59 ms | 2.61 ms |
+| 64 | 14723.56 img/s | 4.35 ms | 4.35 ms | 4.43 ms | 4.45 ms |
+| 128 | 16157.12 img/s | 7.92 ms | 7.96 ms | 8.00 ms | 8.06 ms |
+| 256 | 17054.80 img/s | 15.01 ms | 15.06 ms | 15.07 ms | 15.16 ms |
 
 #### Paddle-TRT performance: NVIDIA A30 (1x A30 24GB)
 Our results for Paddle-TRT were obtained by running the `inference.py` script on NVIDIA A30 with 1x A30 24GB GPU.
 
+Note that the benchmark does not include data preprocessing. Refer to [Benchmark with TensorRT](#benchmark-with-tensorrt).
+
 **TF32 Inference Latency**
 
 |**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
 |--------------|------------------|---------------|---------------|---------------|---------------|
-| 1 | 672.79 img/s | 1.49 ms | 2.01 ms | 2.29 ms | 3.04 ms |
-| 2 | 1041.47 img/s | 1.92 ms | 2.49 ms | 2.87 ms | 4.13 ms |
-| 4 | 1505.64 img/s | 2.66 ms | 3.43 ms | 4.06 ms | 6.85 ms |
-| 8 | 2001.13 img/s | 4.00 ms | 4.72 ms | 5.54 ms | 9.51 ms |
-| 16 | 2462.80 img/s | 6.50 ms | 7.71 ms | 9.32 ms | 15.54 ms |
-| 32 | 2474.34 img/s | 12.93 ms | 21.61 ms | 25.76 ms | 34.69 ms |
-| 64 | 2949.38 img/s | 21.70 ms | 29.58 ms | 34.63 ms | 47.11 ms |
-| 128 | 3278.67 img/s | 39.04 ms | 43.34 ms | 52.72 ms | 66.78 ms |
-| 256 | 3293.10 img/s | 77.74 ms | 90.51 ms | 99.71 ms | 110.80 ms |
+| 1 | 781.87 img/s | 1.28 ms | 1.29 ms | 1.38 ms | 1.45 ms |
+| 2 | 1290.14 img/s | 1.55 ms | 1.55 ms | 1.65 ms | 1.67 ms |
+| 4 | 1876.48 img/s | 2.13 ms | 2.13 ms | 2.23 ms | 2.25 ms |
+| 8 | 2451.23 img/s | 3.26 ms | 3.27 ms | 3.37 ms | 3.42 ms |
+| 16 | 2974.77 img/s | 5.38 ms | 5.42 ms | 5.47 ms | 5.53 ms |
+| 32 | 3359.63 img/s | 9.52 ms | 9.62 ms | 9.66 ms | 9.72 ms |
+| 64 | 3585.82 img/s | 17.85 ms | 18.03 ms | 18.09 ms | 18.20 ms |
+| 128 | 3718.44 img/s | 34.42 ms | 34.71 ms | 34.75 ms | 34.91 ms |
+| 256 | 3806.11 img/s | 67.26 ms | 67.61 ms | 67.71 ms | 67.86 ms |
 
 **FP16 Inference Latency**
 
 |**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
 |--------------|------------------|---------------|---------------|---------------|---------------|
-| 1 | 804.56 img/s | 1.24 ms | 1.81 ms | 2.15 ms | 3.07 ms |
-| 2 | 1435.74 img/s | 1.39 ms | 2.05 ms | 2.48 ms | 3.86 ms |
-| 4 | 2169.87 img/s | 1.84 ms | 2.72 ms | 3.39 ms | 5.94 ms |
-| 8 | 2395.13 img/s | 3.34 ms | 4.46 ms | 5.11 ms | 9.49 ms |
-| 16 | 3779.82 img/s | 4.23 ms | 5.83 ms | 7.66 ms | 14.44 ms |
-| 32 | 3620.18 img/s | 8.84 ms | 17.90 ms | 22.31 ms | 30.91 ms |
-| 64 | 4592.08 img/s | 13.94 ms | 24.00 ms | 29.38 ms | 41.41 ms |
-| 128 | 5064.06 img/s | 25.28 ms | 31.73 ms | 37.79 ms | 53.01 ms |
-| 256 | 4774.61 img/s | 53.62 ms | 59.04 ms | 67.29 ms | 80.51 ms |
+| 1 | 1133.80 img/s | 0.88 ms | 0.89 ms | 0.98 ms | 0.99 ms |
+| 2 | 2068.18 img/s | 0.97 ms | 0.97 ms | 1.06 ms | 1.08 ms |
+| 4 | 3181.06 img/s | 1.26 ms | 1.27 ms | 1.35 ms | 1.38 ms |
+| 8 | 5078.30 img/s | 1.57 ms | 1.58 ms | 1.68 ms | 1.74 ms |
+| 16 | 6240.02 img/s | 2.56 ms | 2.58 ms | 2.67 ms | 2.86 ms |
+| 32 | 7000.86 img/s | 4.57 ms | 4.66 ms | 4.69 ms | 4.76 ms |
+| 64 | 7523.45 img/s | 8.51 ms | 8.62 ms | 8.73 ms | 8.86 ms |
+| 128 | 7914.47 img/s | 16.17 ms | 16.31 ms | 16.34 ms | 16.46 ms |
+| 256 | 8225.56 img/s | 31.12 ms | 31.29 ms | 31.38 ms | 31.50 ms |
 
 #### Paddle-TRT performance: NVIDIA A10 (1x A10 24GB)
 Our results for Paddle-TRT were obtained by running the `inference.py` script on NVIDIA A10 with 1x A10 24GB GPU.
 
+Note that the benchmark does not include data preprocessing. Refer to [Benchmark with TensorRT](#benchmark-with-tensorrt).
+
 **TF32 Inference Latency**
 
 |**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
 |--------------|------------------|---------------|---------------|---------------|---------------|
-| 1 | 372.04 img/s | 2.69 ms | 3.64 ms | 4.20 ms | 5.28 ms |
-| 2 | 615.93 img/s | 3.25 ms | 4.08 ms | 4.59 ms | 6.42 ms |
-| 4 | 1070.02 img/s | 3.74 ms | 3.90 ms | 4.35 ms | 7.48 ms |
-| 8 | 1396.88 img/s | 5.73 ms | 6.87 ms | 7.52 ms | 10.63 ms |
-| 16 | 1522.20 img/s | 10.51 ms | 12.73 ms | 13.84 ms | 17.84 ms |
-| 32 | 1674.39 img/s | 19.11 ms | 23.23 ms | 24.63 ms | 29.55 ms |
-| 64 | 1782.14 img/s | 35.91 ms | 41.84 ms | 44.53 ms | 48.94 ms |
-| 128 | 1722.33 img/s | 74.32 ms | 85.37 ms | 89.27 ms | 94.85 ms |
-| 256 | 1576.89 img/s | 162.34 ms | 181.01 ms | 185.92 ms | 194.42 ms |
+| 1 | 563.63 img/s | 1.77 ms | 1.79 ms | 1.87 ms | 1.89 ms |
+| 2 | 777.13 img/s | 2.57 ms | 2.63 ms | 2.68 ms | 2.89 ms |
+| 4 | 1171.93 img/s | 3.41 ms | 3.43 ms | 3.51 ms | 3.55 ms |
+| 8 | 1627.81 img/s | 4.91 ms | 4.97 ms | 5.02 ms | 5.09 ms |
+| 16 | 1986.40 img/s | 8.05 ms | 8.11 ms | 8.19 ms | 8.37 ms |
+| 32 | 2246.04 img/s | 14.25 ms | 14.33 ms | 14.40 ms | 14.57 ms |
+| 64 | 2398.07 img/s | 26.69 ms | 26.87 ms | 26.91 ms | 27.06 ms |
+| 128 | 2489.96 img/s | 51.41 ms | 51.74 ms | 51.80 ms | 51.94 ms |
+| 256 | 2523.22 img/s | 101.46 ms | 102.13 ms | 102.35 ms | 102.77 ms |
 
 **FP16 Inference Latency**
 
 |**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
 |--------------|------------------|---------------|---------------|---------------|---------------|
-| 1 | 365.38 img/s | 2.74 ms | 3.94 ms | 4.35 ms | 5.64 ms |
-| 2 | 612.52 img/s | 3.26 ms | 4.34 ms | 4.80 ms | 6.97 ms |
-| 4 | 1018.15 img/s | 3.93 ms | 4.95 ms | 5.55 ms | 9.16 ms |
-| 8 | 1924.26 img/s | 4.16 ms | 5.44 ms | 6.20 ms | 11.89 ms |
-| 16 | 2477.49 img/s | 6.46 ms | 8.07 ms | 9.21 ms | 15.05 ms |
-| 32 | 2896.01 img/s | 11.05 ms | 13.56 ms | 15.32 ms | 21.76 ms |
-| 64 | 3165.27 img/s | 20.22 ms | 24.20 ms | 25.94 ms | 33.18 ms |
-| 128 | 3176.46 img/s | 40.29 ms | 46.36 ms | 49.15 ms | 54.95 ms |
-| 256 | 3110.01 img/s | 82.31 ms | 93.21 ms | 96.06 ms | 99.97 ms |
+| 1 | 1296.81 img/s | 0.77 ms | 0.77 ms | 0.87 ms | 0.88 ms |
+| 2 | 2224.06 img/s | 0.90 ms | 0.90 ms | 1.00 ms | 1.01 ms |
+| 4 | 2845.61 img/s | 1.41 ms | 1.43 ms | 1.51 ms | 1.53 ms |
+| 8 | 3793.35 img/s | 2.11 ms | 2.19 ms | 2.22 ms | 2.30 ms |
+| 16 | 4315.53 img/s | 3.71 ms | 3.80 ms | 3.86 ms | 3.98 ms |
+| 32 | 4815.26 img/s | 6.64 ms | 6.74 ms | 6.79 ms | 7.15 ms |
+| 64 | 5103.27 img/s | 12.54 ms | 12.66 ms | 12.70 ms | 13.01 ms |
+| 128 | 5393.20 img/s | 23.73 ms | 23.98 ms | 24.05 ms | 24.20 ms |
+| 256 | 5505.24 img/s | 46.50 ms | 46.82 ms | 46.92 ms | 47.17 ms |
 
 ## Release notes

PaddlePaddle/Classification/RN50v1.5/dali.py

Lines changed: 56 additions & 0 deletions
@@ -12,10 +12,15 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import ctypes
 import os
 from dataclasses import dataclass
+from cuda import cudart
 import paddle
+import numpy as np
+from nvidia.dali.backend import TensorListCPU
 import nvidia.dali.ops as ops
+import nvidia.dali.fn as fn
 import nvidia.dali.types as types
 from nvidia.dali.pipeline import Pipeline
 from nvidia.dali.plugin.paddle import DALIGenericIterator
@@ -236,3 +241,54 @@ def build_dataloader(args, mode):
     """
     assert mode in Mode, "Dataset mode should be in supported Modes (train or eval)"
     return dali_dataloader(args, mode, paddle.device.get_device())
+
+
+def dali_synthetic_dataloader(args, device):
+    """
+    Define a DALI dataloader with synthetic data.
+
+    Args:
+        args(Namespace): Arguments obtained from ArgumentParser.
+        device(str): Device string of the GPU that loads the data, e.g. "gpu:0".
+    Outputs:
+        DALIGenericIterator(nvidia.dali.plugin.paddle.DALIGenericIterator)
+            Iterable outputs of the DALI pipeline,
+            including "data" in the type of Paddle's Tensor.
+    """
+    assert "gpu" in device, "a gpu device is required for DALI"
+
+    device_id = int(device.split(':')[1])
+
+    batch_size = args.batch_size
+    image_shape = args.image_shape
+    output_dtype = types.FLOAT16 if args.dali_output_fp16 else types.FLOAT
+    num_threads = args.dali_num_threads
+
+    class ExternalInputIterator(object):
+        def __init__(self, batch_size, image_shape):
+            # Allocate pinned host memory (4 bytes per FP32 element) so that
+            # host-to-device copies can run asynchronously.
+            n_bytes = int(batch_size * np.prod(image_shape) * 4)
+            err, mem = cudart.cudaMallocHost(n_bytes)
+            assert err == cudart.cudaError_t.cudaSuccess
+            mem_ptr = ctypes.cast(mem, ctypes.POINTER(ctypes.c_float))
+            self.synthetic_data = np.ctypeslib.as_array(mem_ptr, shape=(batch_size, *image_shape))
+            self.n = args.benchmark_steps
+
+        def __iter__(self):
+            self.i = 0
+            return self
+
+        def __next__(self):
+            if self.i >= self.n:
+                # Reset the counter so the iterator can be reused, then
+                # signal the end of the epoch.
+                self.__iter__()
+                raise StopIteration()
+            self.i += 1
+            # The same pinned buffer is returned every step; no_copy below
+            # lets DALI consume it without an extra host-side copy.
+            return TensorListCPU(self.synthetic_data, is_pinned=True)
+
+    eli = ExternalInputIterator(batch_size, image_shape)
+    pipe = Pipeline(batch_size=batch_size, num_threads=num_threads, device_id=device_id)
+    with pipe:
+        images = fn.external_source(source=eli, no_copy=True, dtype=output_dtype)
+        images = images.gpu()
+        pipe.set_outputs(images)
+    pipe.build()
+    return DALIGenericIterator([pipe], ['data'])
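For readers who want to exercise the new dataloader in isolation, a minimal driver of the following shape should work. It is a sketch, not part of the commit: the `SimpleNamespace` bundle stands in for the real ArgumentParser output, and the field values are illustrative.

```python
# Hypothetical driver for dali_synthetic_dataloader; the field names mirror
# the attributes the function reads, not a confirmed CLI surface.
from types import SimpleNamespace
from dali import dali_synthetic_dataloader

args = SimpleNamespace(
    batch_size=32,
    image_shape=[3, 224, 224],   # CHW layout; adjust to match your config
    dali_output_fp16=False,
    dali_num_threads=4,
    benchmark_steps=16,
)

loader = dali_synthetic_dataloader(args, device="gpu:0")
for batch in loader:
    images = batch[0]["data"]    # Paddle tensor, already on the GPU
    # feed `images` to the TensorRT-backed predictor here
```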
