
Commit 2007b52

Add benchmarks (deepspeedai#254)

Authored by awan-10, lekurile, and molly-smith
Co-authored-by: Lev Kurilenko <[email protected]>
Co-authored-by: Molly Smith <[email protected]>

1 parent 60be252 commit 2007b52

File tree

18 files changed: +1428 −1 lines changed

benchmarks/README.md

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-The new home for DeepSpeed benchmarks. TODO: Move DS benchmarks to this repo.
+All benchmarks that use the DeepSpeed library are maintained in this folder. We welcome contributions in this space!
benchmarks/communication/README.md

Lines changed: 89 additions & 0 deletions

@@ -0,0 +1,89 @@

# The DeepSpeed Communication Benchmarking Suite

The intent of these benchmarks is to measure the communication latency/bandwidth of DeepSpeed and/or PyTorch distributed communication operations at the Python layer. These benchmarks are complementary to C-level comms benchmarks like [OSU Micro-Benchmarks](https://mvapich.cse.ohio-state.edu/benchmarks/) and [NCCL Tests](https://github.com/NVIDIA/nccl-tests) in that users can:

- Easily debug which layer of the communication software stack a hang or performance degradation originates from
- Measure the expected communication performance of either DeepSpeed comms or pure PyTorch distributed

To run benchmarks, there are two options:

1. Run a single communication operation:

For example, run with a single large message size (calculated to barely fit within GPU memory):
<pre>
deepspeed all_reduce.py
</pre>

Scan across message sizes:
<pre>
deepspeed all_reduce.py --scan
</pre>

Benchmark pure PyTorch distributed comms (without importing or using DeepSpeed) with MPI:
<pre>
mpirun -np 16 --hostfile ${HOSTFILE} -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD python all_reduce.py --scan --dist="torch"
</pre>

or with Slurm:
<pre>
srun -n 16 python all_reduce.py --scan --dist="torch"
</pre>

2. Run all available communication benchmarks:

<pre>
deepspeed run_all.py
</pre>

Like the individual benchmarks, `run_all.py` supports scanning arguments for the max message size, bw-unit, etc. Simply pass the desired arguments to `run_all.py` and they'll be propagated to each comm op.

<pre>
usage: ds_bench [-h] [--local_rank LOCAL_RANK] [--trials TRIALS] [--warmups WARMUPS] [--maxsize MAXSIZE] [--async-op] [--bw-unit {Gbps,GBps}] [--backend {nccl}] [--dist {deepspeed,torch}] [--scan] [--raw] [--all-reduce] [--all-gather] [--all-to-all]
                [--pt2pt] [--broadcast] [--dtype DTYPE] [--mem-factor MEM_FACTOR] [--debug]

optional arguments:
  -h, --help            show this help message and exit
  --local_rank LOCAL_RANK
  --trials TRIALS       Number of timed iterations
  --warmups WARMUPS     Number of warmup (non-timed) iterations
  --maxsize MAXSIZE     Max message size as a power of 2
  --async-op            Enables non-blocking communication
  --bw-unit {Gbps,GBps}
  --backend {nccl}      Communication library to use
  --dist {deepspeed,torch}
                        Distributed DL framework to use
  --scan                Enables scanning all message sizes
  --raw                 Print the message size and latency without units
  --all-reduce          Run all_reduce
  --all-gather          Run all_gather
  --all-to-all          Run all_to_all
  --pt2pt               Run pt2pt
  --broadcast           Run broadcast
  --dtype DTYPE         PyTorch tensor dtype
  --mem-factor MEM_FACTOR
                        Proportion of max available GPU memory to use for single-size evals
  --debug               Enables all_to_all debug prints
</pre>
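
For example, an illustrative invocation (the flag values here are arbitrary; all flags are documented in the usage listing above) that scans message sizes up to the `--maxsize` exponent and reports bandwidth in Gbps:

<pre>
deepspeed run_all.py --scan --maxsize 24 --bw-unit Gbps --trials 10
</pre>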
Note that `ds_bench` is a pre-packaged wrapper around `run_all.py`. Users can pass the same arguments as well:

<pre>
<path to deepspeed>/bin/ds_bench --scan --trials=10
</pre>

Finally, users can choose specific communication operations to run in `run_all.py` or `ds_bench` by passing them as arguments (all operations are run by default). For example:

<pre>
deepspeed run_all.py --scan --all-reduce --all-to-all --broadcast
</pre>


# Adding Communication Benchmarks

To add new communication benchmarks, follow this general procedure (a sketch of step 2 follows the list):

1. Copy a similar benchmark file (e.g. to add `reduce_scatter`, copy `all_reduce.py` as a template)
2. Add a new bw formula in `utils.get_bw`, a new maximum tensor element formula in `utils.max_numel`, and a new arg in `utils.benchmark_parser`
3. Replace the comm op calls in the new file with find-and-replace
4. Find a good default `mem_factor` for use in the `run_<collective>_single()` function
5. Add the new comm op to `run_all.py`
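
As an illustration of step 2, below is a minimal, hypothetical sketch of the bandwidth bookkeeping a `reduce_scatter` benchmark would need. It is not the repository's `utils.get_bw` implementation (that file is not part of this diff); only the call shape used by the benchmarks in this commit, `get_bw(comm_op, size, duration, args)`, is taken from the source, and the function and flag names here are placeholders.

<pre>
# Hypothetical sketch only -- illustrates the kind of formula step 2 asks for.
# The (n - 1) / n bus-bandwidth factor is the standard NCCL-tests convention for
# reduce_scatter; the actual utils.get_bw in this repo may be organized differently.

def reduce_scatter_bw(size, duration, world_size):
    """Return (algorithmic throughput, bus bandwidth) in bytes/second.

    size: message size in bytes; duration: averaged per-trial time in seconds.
    """
    tput = size / duration
    busbw = tput * (world_size - 1) / world_size
    return tput, busbw


if __name__ == "__main__":
    # Worked example: 1 GiB message, 10 ms average latency, 16 ranks.
    tput, busbw = reduce_scatter_bw(size=2**30, duration=0.010, world_size=16)
    print(f"algbw = {tput / 1e9:.2f} GB/s, busbw = {busbw / 1e9:.2f} GB/s")

# Step 2 also adds a CLI flag in utils.benchmark_parser (argparse), mirroring the
# existing ops, e.g.: parser.add_argument('--reduce-scatter', action='store_true')
</pre>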
Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
'''Copyright The Microsoft DeepSpeed Team'''
benchmarks/communication/all_gather.py

Lines changed: 136 additions & 0 deletions

@@ -0,0 +1,136 @@

# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team

import torch
import sys, os, time

COMMS_BENCH_DIR = os.path.join(os.path.dirname(__file__), "../")
sys.path.append(COMMS_BENCH_DIR)

from communication.utils import *
from communication.constants import *
from deepspeed.accelerator import get_accelerator
from deepspeed.comm import TorchBackend


# Run all_gather and print metrics
def timed_all_gather(input, output, args):
    if args.dist == 'torch':
        import torch.distributed as dist

        all_gather_func = TorchBackend.get_all_gather_function()
    elif args.dist == 'deepspeed':
        import deepspeed.comm as dist

        all_gather_func = dist.allgather_fn

    sync_all()
    # Warmups, establish connections, etc.
    for i in range(args.warmups):
        all_gather_func(output, input, group=None, async_op=args.async_op)
    sync_all()

    # time the actual comm op trials times and average it
    pre = time.perf_counter()
    for i in range(args.trials):
        all_gather_func(output, input, group=None, async_op=args.async_op)
    sync_all()
    duration = time.perf_counter() - pre

    # maintain and clean performance data
    avg_duration = duration / args.trials
    size = input.element_size() * input.nelement()
    tput, busbw = get_bw('all_gather', size, avg_duration, args)
    tput_str, busbw_str, duration_str = get_metric_strings(args, tput, busbw, avg_duration)
    desc = f'{input.nelement()}x{input.element_size()}'

    if not args.raw:
        size = convert_size(size)

    print_rank_0(f"{size:<20} {desc:25s} {duration_str:20s} {tput_str:20s} {busbw_str:20s}")


def run_all_gather(local_rank, args):
    if args.dist == 'torch':
        import torch.distributed as dist
    elif args.dist == 'deepspeed':
        import deepspeed.comm as dist

    # Prepare benchmark header
    print_header(args, 'all_gather')
    global_rank = dist.get_rank()
    world_size = dist.get_world_size()

    if args.scan:
        # Create list of message sizes
        M_LIST = []
        for x in (2**p for p in range(1, args.maxsize)):
            M_LIST.append(x)

        sync_all()
        # loop over various tensor sizes
        for M in M_LIST:
            global_rank = dist.get_rank()
            try:
                mat = torch.ones(world_size, M,
                                 dtype=getattr(torch, args.dtype)).to(get_accelerator().device_name(local_rank))
                sync_all()
                input = ((mat.mul_(float(global_rank))).view(-1))
                # Delete original mat to avoid OOM
                del mat
                get_accelerator().empty_cache()
                output = torch.zeros(input.nelement() * world_size,
                                     dtype=getattr(torch, args.dtype)).to(get_accelerator().device_name(local_rank))
            except RuntimeError as e:
                if 'out of memory' in str(e):
                    if dist.get_rank() == 0:
                        print('WARNING: Ran out of GPU memory. Exiting comm op.')
                    sync_all()
                    break
                else:
                    raise e
            sync_all()
            timed_all_gather(input, output, args)
    else:
        # all_gather_into_tensor saves memory
        if ((args.dist == 'torch' or args.dist == 'deepspeed') and dist.has_all_gather_into_tensor()):
            mem_factor = args.mem_factor + 0.2
        else:
            mem_factor = args.mem_factor
        # Send the biggest message size our GPUs can fit. If you're facing OOM errors, reduce the mem_factor
        sync_all()
        elements_per_gpu = max_numel(comm_op='all_gather',
                                     dtype=getattr(torch, args.dtype),
                                     mem_factor=mem_factor,
                                     local_rank=local_rank,
                                     args=args)
        try:
            mat = torch.ones(elements_per_gpu,
                             dtype=getattr(torch, args.dtype)).to(get_accelerator().device_name(local_rank))
            # multiply each GPU's tensor by the rank to ease debugging
            input = ((mat.mul_(float(global_rank))).view(-1))
            # Delete original mat to avoid OOM
            del mat
            get_accelerator().empty_cache()
            output = torch.zeros(elements_per_gpu * world_size,
                                 dtype=getattr(torch, args.dtype)).to(get_accelerator().device_name(local_rank))
        except RuntimeError as e:
            if 'out of memory' in str(e):
                if dist.get_rank() == 0:
                    print('WARNING: Ran out of GPU memory. Try to reduce the --mem-factor argument!')
                sync_all()
                return
            else:
                raise e

        sync_all()
        timed_all_gather(input, output, args)


if __name__ == "__main__":
    args = benchmark_parser().parse_args()
    rank = args.local_rank
    init_processes(local_rank=rank, args=args)
    run_all_gather(local_rank=rank, args=args)
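
By analogy with the `all_reduce` commands in the README above, and since this file follows the same launcher pattern, it can presumably be run directly, e.g. `deepspeed all_gather.py --scan`.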
benchmarks/communication/all_reduce.py

Lines changed: 114 additions & 0 deletions

@@ -0,0 +1,114 @@

# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team

import torch
import sys, os, time

COMMS_BENCH_DIR = os.path.join(os.path.dirname(__file__), "../")
sys.path.append(COMMS_BENCH_DIR)

from communication.utils import *
from communication.constants import *
from deepspeed.accelerator import get_accelerator


def timed_all_reduce(input, args):
    if args.dist == 'torch':
        import torch.distributed as dist
    elif args.dist == 'deepspeed':
        import deepspeed.comm as dist

    sync_all()
    # Warmups, establish connections, etc.
    for i in range(args.warmups):
        dist.all_reduce(input, async_op=args.async_op)
    sync_all()

    # time the actual comm op trials times and average it
    pre = time.perf_counter()
    for i in range(args.trials):
        dist.all_reduce(input, async_op=args.async_op)
    sync_all()
    duration = time.perf_counter() - pre

    # maintain and clean performance data
    avg_duration = duration / args.trials
    size = input.element_size() * input.nelement()
    n = dist.get_world_size()
    tput, busbw = get_bw('all_reduce', size, avg_duration, args)
    tput_str, busbw_str, duration_str = get_metric_strings(args, tput, busbw, avg_duration)
    desc = f'{input.nelement()}x{input.element_size()}'

    if not args.raw:
        size = convert_size(size)

    print_rank_0(f"{size:<20} {desc:25s} {duration_str:20s} {tput_str:20s} {busbw_str:20s}")


def run_all_reduce(local_rank, args):
    if args.dist == 'torch':
        import torch.distributed as dist
    elif args.dist == 'deepspeed':
        import deepspeed.comm as dist

    # Prepare benchmark header
    print_header(args, 'all_reduce')

    world_size = dist.get_world_size()
    global_rank = dist.get_rank()

    if args.scan:
        M_LIST = []
        for x in (2**p for p in range(1, args.maxsize)):
            M_LIST.append(x)

        sync_all()
        # loop over various tensor sizes
        for M in M_LIST:
            global_rank = dist.get_rank()
            try:
                mat = torch.ones(world_size, M,
                                 dtype=getattr(torch, args.dtype)).to(get_accelerator().device_name(local_rank))
                sync_all()
                input = ((mat.mul_(float(global_rank))).view(-1))
            except RuntimeError as e:
                if 'out of memory' in str(e):
                    if dist.get_rank() == 0:
                        print('WARNING: Ran out of GPU memory. Exiting comm op.')
                    sync_all()
                    break
                else:
                    raise e
            sync_all()
            timed_all_reduce(input, args)
    else:
        # Send the biggest message size our GPUs can fit. If you're facing OOM errors, reduce the mem_factor
        # Don't need output tensor, so we double mem_factor
        elements_per_gpu = max_numel(comm_op='all_reduce',
                                     dtype=getattr(torch, args.dtype),
                                     mem_factor=args.mem_factor * 2,
                                     local_rank=local_rank,
                                     args=args)
        try:
            mat = torch.ones(elements_per_gpu,
                             dtype=getattr(torch, args.dtype)).to(get_accelerator().device_name(local_rank))
            input = ((mat.mul_(float(global_rank))).view(-1))
        except RuntimeError as e:
            if 'out of memory' in str(e):
                if dist.get_rank() == 0:
                    print('WARNING: Ran out of GPU memory. Try to reduce the --mem-factor argument!')
                sync_all()
                return
            else:
                raise e
        sync_all()
        timed_all_reduce(input, args)


if __name__ == "__main__":
    args = benchmark_parser().parse_args()
    rank = args.local_rank
    init_processes(local_rank=rank, args=args)
    run_all_reduce(local_rank=rank, args=args)
