[CPT] Reproducibility

Original post, last modified 2025-05-21 10:41:47

This content is licensed under CC 4.0 BY-SA.

First published 2025-05-16 18:56:00

Reproducibility

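Source: https://docs.pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms

Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds.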

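However, there are steps you can take to limit the number of sources of nondeterministic behavior for a specific platform, device, and PyTorch release. First, you can control the sources of randomness that cause multiple executions of your application to behave differently. Second, you can configure PyTorch to avoid nondeterministic algorithms for some operations, so that repeated calls to those operations, given the same inputs, produce the same result.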

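Warning

Deterministic operations are often slower than nondeterministic operations, so single-run performance may decrease for your model. However, determinism may save time in development by making experimentation, debugging, and regression testing easier.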

Controlling sources of randomness

PyTorch random number generator

You can use torch.manual_seed() to seed the RNG for all devices (both CPU and CUDA):

import torch
torch.manual_seed(0)

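Some PyTorch operations use random numbers internally; torch.svd_lowrank() is one example. Consequently, calling such an operation several times in a row with the same input arguments may give different results. However, as long as torch.manual_seed() is set to a constant at the beginning of the application and all other sources of nondeterminism have been eliminated, the same series of random numbers is generated each time the application runs in the same environment.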

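It is also possible to obtain identical results from an operation that uses random numbers by setting torch.manual_seed() to the same value between subsequent calls.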
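As a minimal sketch of the reseeding behavior described above, the snippet below draws random values twice with the same seed set before each call, and once more without reseeding. The helper name `sample` is hypothetical and only serves the illustration.

```python
import torch


def sample(seed: int) -> torch.Tensor:
    """Reseed the global RNG, then draw three uniform random values."""
    torch.manual_seed(seed)
    return torch.rand(3)


first = sample(0)
second = sample(0)     # same seed set again right before the call
third = torch.rand(3)  # no reseed: continues the existing random stream

# Reseeding with the same value reproduces the draw exactly.
print(torch.equal(first, second))
# Without reseeding, the next draw comes from further along the stream.
print(torch.equal(first, third))
```

Reseeding before every call is useful for isolating a single operation in a test; for a whole application, seeding once at startup is usually enough.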

Python

For custom operators, you might need to set the Python seed as well:

import random
random.seed(0)

Random number generators in other libraries

If you or any of the libraries you are using rely on NumPy, you can seed the global NumPy RNG with:

import numpy as np
np.random.seed(0)

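However, some applications and libraries use NumPy Random Generator objects (https://numpy.org/doc/stable/reference/random/generator.html) rather than the global RNG, and those objects need to be seeded consistently as well.

If you are using any other libraries that use random number generators, refer to the documentation for those libraries to see how to set consistent seeds for them.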
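To illustrate the difference, here is a small sketch assuming NumPy's `default_rng` constructor is how the Generator objects are created. Seeding the global RNG does not touch them, so each Generator needs its own seed.

```python
import numpy as np

# Seeds only the legacy global RNG behind np.random.rand, np.random.randint, etc.
np.random.seed(0)

# Generator objects carry their own state and must be seeded explicitly.
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)

draws_a = rng_a.random(3)
draws_b = rng_b.random(3)

# Two Generators built from the same seed produce the same stream.
print(np.array_equal(draws_a, draws_b))
```

If a library accepts a Generator or a seed argument, passing it explicitly is generally more robust than relying on global state.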

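CUDA convolution benchmarking

The cuDNN library, used by CUDA convolution operations, can be a source of nondeterminism across multiple executions of an application. When a cuDNN convolution is called with a new set of size parameters, an optional feature can run multiple convolution algorithms and benchmark them to find the fastest one. That algorithm is then used for the rest of the process whenever the same set of size parameters occurs. Because of benchmarking noise and hardware differences, the benchmark may select different algorithms on subsequent runs, even on the same machine.

Disabling the benchmarking feature with torch.backends.cudnn.benchmark = False causes cuDNN to select an algorithm deterministically, possibly at the cost of reduced performance.

However, if you do not need reproducibility across multiple executions of your application, enabling the benchmarking feature with torch.backends.cudnn.benchmark = True might improve performance.

Note that this setting is different from the torch.backends.cudnn.deterministic setting discussed below.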
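One way to toggle the benchmarking flag from code is sketched below. The helper `set_cudnn_benchmark` is hypothetical (it is not a PyTorch API); it only wraps the `torch.backends.cudnn.benchmark` attribute mentioned above and returns the previous value so a caller can restore it.

```python
import torch


def set_cudnn_benchmark(enabled: bool) -> bool:
    """Set the cuDNN benchmarking flag and return its previous value."""
    previous = torch.backends.cudnn.benchmark
    torch.backends.cudnn.benchmark = enabled
    return previous


# Reproducible runs: let cuDNN choose its algorithm without benchmarking.
set_cudnn_benchmark(False)
print(torch.backends.cudnn.benchmark)
```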

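Avoiding nondeterministic algorithms

torch.use_deterministic_algorithms() lets you configure PyTorch to use deterministic algorithms instead of nondeterministic ones where available, and to throw an error if an operation is known to be nondeterministic and has no deterministic alternative.

See the documentation for torch.use_deterministic_algorithms() for a full list of affected operations. If an operation does not behave as documented, or if you need a deterministic implementation of an operation that does not have one yet, please submit an issue: https://github.com/pytorch/pytorch/issues?q=label:%22module:%20determinism%22

For example, running the nondeterministic CUDA implementation of torch.Tensor.index_add_() throws an error: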

>>> import torch
>>> torch.use_deterministic_algorithms(True)
>>> torch.randn(2, 2).cuda().index_add_(0, torch.tensor([0, 1]), torch.randn(2, 2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: index_add_cuda_ does not have a deterministic implementation, but you set
'torch.use_deterministic_algorithms(True)'. ...

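When torch.bmm() is called with sparse-dense CUDA tensors, it typically uses a nondeterministic algorithm, but when the deterministic flag is turned on, its alternate deterministic implementation is used: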

>>> import torch
>>> torch.use_deterministic_algorithms(True)
>>> torch.bmm(torch.randn(2, 2, 2).to_sparse().cuda(), torch.randn(2, 2, 2).cuda())
tensor([[[ 1.1900, -2.3409],
         [ 0.4796,  0.8003]],
        [[ 0.1509,  1.8027],
         [ 0.0333, -1.1444]]], device='cuda:0')

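Furthermore, if you are using CUDA tensors and your CUDA version is 10.2 or greater, you should set the environment variable CUBLAS_WORKSPACE_CONFIG according to the CUDA documentation: https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility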
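A minimal sketch of setting this variable from inside a Python program follows. The value `:4096:8` is one of the two settings listed in the documentation for torch.use_deterministic_algorithms() (the other is `:16:8`); choosing it here is an assumption, and the variable can equally be exported in the shell before launching the program.

```python
import os

# Assumption: ":4096:8" is used here; ":16:8" is the other documented choice.
# The variable has to be in place before the first cuBLAS call, so set it
# at the very top of the program, before any CUDA work is done.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

print(os.environ["CUBLAS_WORKSPACE_CONFIG"])  # prints ":4096:8"
```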

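CUDA convolution determinism

While disabling CUDA convolution benchmarking (discussed above) ensures that CUDA selects the same algorithm each time an application is run, that algorithm itself may be nondeterministic unless either torch.use_deterministic_algorithms(True) or torch.backends.cudnn.deterministic = True is set. The latter setting controls only this behavior, whereas torch.use_deterministic_algorithms() makes other PyTorch operations behave deterministically as well.

CUDA RNN and LSTM

In some versions of CUDA, RNNs and LSTM networks may have nondeterministic behavior. See torch.nn.RNN() and torch.nn.LSTM() for details and workarounds.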

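Filling uninitialized memory

Operations like torch.empty() and torch.Tensor.resize_() can return tensors with uninitialized memory that contains undefined values. Using such a tensor as an input to another operation is invalid if determinism is required, because the output will be nondeterministic. But nothing actually prevents such invalid code from being run. So, for safety, torch.utils.deterministic.fill_uninitialized_memory is set to True by default, which fills the uninitialized memory with a known value whenever torch.use_deterministic_algorithms(True) is set. This prevents that kind of nondeterministic behavior.

However, filling uninitialized memory is detrimental to performance. So if your program is valid and does not use uninitialized memory as the input to an operation, this setting can be turned off for better performance.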

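DataLoader

DataLoader reseeds its workers following the "Randomness in multi-process data loading" algorithm. Use worker_init_fn() and generator to preserve reproducibility: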

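import random

import numpy
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Derive a seed from the one PyTorch assigned to this worker, then seed
    # the NumPy and Python RNGs used inside the worker process.
    worker_seed = torch.initial_seed() % 2**32
    numpy.random.seed(worker_seed)
    random.seed(worker_seed)

# A dedicated generator makes shuffling and the workers' base seed repeatable.
g = torch.Generator()
g.manual_seed(0)

# train_dataset, batch_size and num_workers are defined elsewhere in your program.
DataLoader(
    train_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    worker_init_fn=seed_worker,
    generator=g,
)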
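For a self-contained check, the following sketch builds two loaders over a toy dataset and compares the order in which samples are returned. The helper `epoch_order` and the ten-element TensorDataset are assumptions made for the illustration. It defaults to num_workers=0, in which case worker_init_fn is not called and only the generator affects the shuffle order.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id: int) -> None:
    """Seed NumPy and Python RNGs inside a worker process."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def epoch_order(seed: int, num_workers: int = 0) -> list:
    """Return the batches of one shuffled epoch over a toy dataset."""
    dataset = TensorDataset(torch.arange(10))
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(
        dataset,
        batch_size=2,
        shuffle=True,
        num_workers=num_workers,
        worker_init_fn=seed_worker,
        generator=g,
    )
    # Each batch is a one-element list holding a tensor of sample values.
    return [batch[0].tolist() for batch in loader]


# The same generator seed gives the same shuffle order.
print(epoch_order(0) == epoch_order(0))
```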

Megatron reproducibility

Excerpted from: GitHub Megatron

Megatron training can be bitwise reproducible; to enable this mode use --deterministic-mode. This means that the same training config run twice in the same HW and SW environment should produce identical model checkpoints, losses and accuracy metric values (iteration time metrics may vary).

There are currently three known Megatron optimizations that break reproducibility whilst still producing almost identical training runs:

The specific NCCL algorithm that is used during an all-reduce (as specified by the environment variable NCCL_ALGO) is important. We have tested the following: ^NVLS, Tree, Ring, CollnetDirect, CollnetChain. The code admits the use of ^NVLS, which allows NCCL the choice of non-NVLS algorithms; its choice seems to be stable.
Flash attention is non-deterministic; do not use --use-flash-attn.
If using Transformer Engine, you must also set the environment variable NVTE_ALLOW_NONDETERMINISTIC_ALGO=0.
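The three points above could be collected in a small launch helper, sketched below. The function `deterministic_launch` is hypothetical and not part of Megatron; `--bf16` only stands in for whatever other arguments a run already uses. The environment variable names and flags are the ones quoted in the excerpt.

```python
import os


def deterministic_launch(args, base_env=None):
    """Return (args, env) adjusted for Megatron's deterministic mode."""
    env = dict(os.environ if base_env is None else base_env)
    # Exclude NVLS so NCCL chooses among the remaining algorithms.
    env["NCCL_ALGO"] = "^NVLS"
    # Only relevant when Transformer Engine is used; harmless otherwise.
    env["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "0"
    # Flash attention is nondeterministic, so drop its flag if present.
    cleaned = [a for a in args if a != "--use-flash-attn"]
    if "--deterministic-mode" not in cleaned:
        cleaned.append("--deterministic-mode")
    return cleaned, env


args, env = deterministic_launch(["--use-flash-attn", "--bf16"], base_env={})
print(args)  # ['--bf16', '--deterministic-mode']
print(env)
```

The returned pair can then be passed to whatever launcher starts the training script, for example through the `env` argument of `subprocess.run`.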

In addition, determinism has only been verified in NGC PyTorch containers up to and newer than 23.12. If you observe nondeterminism in Megatron training under other circumstances, please open an issue.