✅ **FSDP + LoRA Training: A Complete Practical Guide with Troubleshooting, Pitfalls, and Version Selection**

Latest recommended article published on 2026-06-19 17:00:49

This content is released under the CC 4.0 BY-SA license.

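For fine-tuning large models, combining LoRA (Low-Rank Adaptation) with FSDP (FullyShardedDataParallel) is one of the most memory-efficient and scalable approaches currently available. Getting the combination to actually run, however, requires attention to injection order, parameter freezing, version compatibility, save logic, gradient synchronization, and warnings.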

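This article is compiled entirely from real pitfalls and the steps taken to resolve them, with no steps skipped and no details hidden. If you run into similar problems, it is meant to serve as a complete walkthrough of the fix.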

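📌 1. Training Goals and Key Principles

Core goals:
✅ Use LoRA for parameter-efficient fine-tuning of a large model
✅ Train only the LoRA parameters and freeze everything else
✅ Use FSDP distributed training to save GPU memory
✅ After training, save only the LoRA adapter (not the full model)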
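
To make "parameter-efficient" concrete, here is a small back-of-the-envelope sketch of how many parameters LoRA adds. It assumes the configuration used later in this article (r=8 on q_proj and v_proj) and, as labeled assumptions, a Llama-2-7B shape of hidden size 4096 with 32 decoder layers; `lora_param_count` is a hypothetical helper, not part of PEFT.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


hidden_size = 4096      # assumption: Llama-2-7B hidden size
num_layers = 32         # assumption: Llama-2-7B decoder layer count
targets_per_layer = 2   # q_proj and v_proj, as in the article's LoraConfig
r = 8                   # LoRA rank, as in the article's LoraConfig

per_module = lora_param_count(hidden_size, hidden_size, r)
total = per_module * targets_per_layer * num_layers

print(per_module)  # 65536
print(total)       # 4194304
```

Under these assumptions only about 4.2 million parameters are trainable, which is why the optimizer state stays small even when the frozen base model is sharded across GPUs.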

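Key order of operations:
✔ Inject LoRA first, then wrap with FSDP (the order cannot be reversed)
✔ Pass only parameters with requires_grad=True to the optimizer
✔ When saving, unwrap FSDP and call PeftModel.save_pretrained() to export only the LoRA weights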

⚙️ 2. Core Code Logic (the Correct Way to Combine LoRA + FSDP)

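import torch
import torch.distributed as dist
from transformers import AutoModelForCausalLM
from peft import get_peft_model, LoraConfig
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)
from torch.optim import AdamW

# Assumes a torchrun launch with one process per GPU
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# ① Load the base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16
)

# ② Inject LoRA (key point: before FSDP)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # for GPT-J/Qwen/Mistral, switch to the matching module names
    lora_dropout=0.05
)
model = get_peft_model(model, lora_config)

# ③ Train only the LoRA parameters (freeze everything else)
for name, param in model.named_parameters():
    param.requires_grad = "lora" in name

# ④ Wrap with FSDP (key point: use_orig_params=True)
model = FSDP(
    model,
    use_orig_params=True,       # ✅ avoids the mixed requires_grad error
    auto_wrap_policy=None,      # example default; can be customized to wrap per LlamaDecoderLayer
    device_id=torch.cuda.current_device()
)

# ⑤ Give the optimizer only the LoRA parameters
optimizer = AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)

# ... training loop goes here ...

# ⑥ Save the LoRA adapter from rank 0
# Gathering the full state dict is a collective operation, so every rank must call it
save_config = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_config):
    full_state = model.state_dict()
if dist.get_rank() == 0:
    # model.module unwraps FSDP; only the LoRA weights are written
    model.module.save_pretrained("output_lora_adapter", state_dict=full_state)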
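
The comment on auto_wrap_policy in step ④ mentions customizing it for LlamaDecoderLayer. One possible way to express that is sketched below as a configuration fragment; it assumes a Llama-family model and uses `transformer_auto_wrap_policy` from `torch.distributed.fsdp.wrap`. The variable name `llama_wrap_policy` is illustrative only, and this is not necessarily the policy the author used.

```python
import functools

from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Wrap each decoder layer as its own FSDP unit so that parameters are
# gathered one layer at a time instead of all at once.
llama_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

# Usage (inside the FSDP call from step ④):
#   model = FSDP(model, use_orig_params=True, auto_wrap_policy=llama_wrap_policy)
```

Wrapping per decoder layer usually lowers peak memory compared with a single root-level wrap, at the cost of more communication calls; for other architectures the layer class would need to be swapped for the matching one.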

⚠️ 3. Common Errors and How They Were Resolved

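✅ Problem 1: ValueError: FlatParameter requires uniform requires_grad

Cause:
By default, FSDP flattens the parameters of each wrapped unit into a single FlatParameter. If those parameters mix requires_grad=True and requires_grad=False (for example, LoRA leaves most layers frozen), FSDP raises this error.

Solution (Option A):
Set use_orig_params=True. FSDP then exposes the original parameters instead of only the merged FlatParameter, which allows trainable and frozen parameters to be interleaved.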
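
To see where the mixing happens in a given model, a diagnostic along the following lines may help. It is a simplified model of the uniformity condition, not FSDP's own code: `find_mixed_units` is a hypothetical helper that groups parameters by a name prefix (a stand-in for a wrap unit) and reports the groups that contain both trainable and frozen parameters. The parameter names in the example are made up.

```python
from collections import defaultdict


def find_mixed_units(named_flags, depth=3):
    """Return name prefixes whose parameters mix trainable and frozen flags.

    named_flags: iterable of (parameter_name, requires_grad) pairs.
    depth: how many dot-separated name components define one group.
    """
    groups = defaultdict(set)
    for name, requires_grad in named_flags:
        unit = ".".join(name.split(".")[:depth])
        groups[unit].add(bool(requires_grad))
    return sorted(unit for unit, flags in groups.items() if len(flags) > 1)


# Hypothetical parameter names, for illustration only
example = [
    ("layers.0.q_proj.weight", False),
    ("layers.0.q_proj.lora_A.weight", True),
    ("layers.0.q_proj.lora_B.weight", True),
    ("layers.0.mlp.weight", False),
]

print(find_mixed_units(example, depth=2))  # ['layers.0']
print(find_mixed_units(example, depth=3))  # ['layers.0.q_proj']
```

With a real model the input could be built as `[(n, p.requires_grad) for n, p in model.named_parameters()]` before wrapping with FSDP.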

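✅ Problem 2: Option A still raises the error? The cause is the Torch version

See https://github.com/pytorch/pytorch/issues/104690
The issue suggests that older Torch versions may not fully implement this feature, so I installed torch 2.2.2.

Installing torch > 2.2 is recommended.

Version	Result
torch 2.0 / 2.1	`use_orig_params` does not fully take effect; the error persists
torch 2.2.2	✅ Runs, but emits `_optim_state_dict()`-related WARNING messages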
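
A training script can fail early on an unsuitable version instead of hitting the error mid-run. The sketch below is one way to do that with only the standard library; the threshold of 2.2 is taken from the table above rather than from any official compatibility statement, and both function names are hypothetical.

```python
import re


def version_tuple(version):
    """Parse strings such as '2.2.2+cu121' into (major, minor, patch)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(version, minimum=(2, 2, 0)):
    """True if `version` is at least `minimum` (assumed threshold: torch 2.2)."""
    return version_tuple(version) >= minimum


print(meets_minimum("2.1.2"))        # False
print(meets_minimum("2.2.2+cu121"))  # True
```

In a real script the argument would be `torch.__version__`, for example `assert meets_minimum(torch.__version__)` before the model is wrapped.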

✅ Problem 3: New warnings appear after upgrading

Example warning:

[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |                  PyTorch CUDA memory summary, device ID 0                 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |            CUDA OOMs: 0            |        cudaMalloc retries: 6         |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |===========================================================================|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Allocated memory      |   5453 MiB |  17469 MiB |  14326 GiB |  14320 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |   5443 MiB |  17455 MiB |  14287 GiB |  14282 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     10 MiB |     20 MiB |     38 GiB |     38 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Active memory         |   5453 MiB |  17679 MiB |  14326 GiB |  14320 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |   5443 MiB |  17665 MiB |  14287 GiB |  14282 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     10 MiB |     20 MiB |     38 GiB |     38 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Requested memory      |   5452 MiB |  17651 MiB |  14282 GiB |  14277 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |   5442 MiB |  17637 MiB |  14243 GiB |  14238 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     10 MiB |     20 MiB |     38 GiB |     38 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | GPU reserved memory   |  20886 MiB |  23580 MiB |  77494 MiB |  56608 MiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |  20872 MiB |  23558 MiB |  77444 MiB |  56572 MiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     14 MiB |     22 MiB |     50 MiB |     36 MiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Non-releasable memory | 715104 KiB |   4311 MiB |   8595 GiB |   8595 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool | 713440 KiB |   4310 MiB |   8552 GiB |   8551 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |   1664 KiB |      3 MiB |     43 GiB |     43 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Allocations           |     240    |     952    |  642681    |  642441    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      67    |     553    |  446331    |  446264    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     173    |     406    |  196350    |  196177    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Active allocs         |     245    |     952    |  642681    |  642436    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      67    |     553    |  446331    |  446264    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     178    |     407    |  196350    |  196172    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | GPU reserved segments |      93    |     174    |     419    |     326    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      86    |     163    |     394    |     308    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |       7    |      11    |      25    |      18    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Non-releasable allocs |      49    |     198    |  326102    |  326053    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      36    |     160    |  247991    |  247955    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |      13    |      46    |   78111    |   78098    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Oversize allocations  |       0    |       0    |       0    |       0    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Oversize GPU segments |       0    |       0    |       0    |       0    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |===========================================================================|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] 
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] CUDA Memory Summary before calling to _allgather_orig_param_states |===========================================================================|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |                  PyTorch CUDA memory summary, device ID 0                 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |            CUDA OOMs: 0            |        cudaMalloc retries: 6         |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |===========================================================================|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Allocated memory      |   5453 MiB |  17469 MiB |  14326 GiB |  14321 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |   5443 MiB |  17455 MiB |  14288 GiB |  14282 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     10 MiB |     20 MiB |     38 GiB |     38 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Active memory         |   5453 MiB |  17679 MiB |  14326 GiB |  14321 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |   5443 MiB |  17665 MiB |  14288 GiB |  14282 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     10 MiB |     20 MiB |     38 GiB |     38 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Requested memory      |   5452 MiB |  17651 MiB |  14283 GiB |  14277 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |   5442 MiB |  17637 MiB |  14244 GiB |  14239 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     10 MiB |     20 MiB |     38 GiB |     38 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | GPU reserved memory   |  20886 MiB |  23580 MiB |  77494 MiB |  56608 MiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |  20872 MiB |  23558 MiB |  77444 MiB |  56572 MiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     14 MiB |     22 MiB |     50 MiB |     36 MiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Non-releasable memory | 715104 KiB |   4311 MiB |   8596 GiB |   8595 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool | 713440 KiB |   4310 MiB |   8553 GiB |   8552 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |   1664 KiB |      3 MiB |     43 GiB |     43 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Allocations           |     240    |     952    |  642699    |  642459    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      67    |     553    |  446336    |  446269    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     173    |     406    |  196363    |  196190    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Active allocs         |     245    |     952    |  642699    |  642454    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      67    |     553    |  446336    |  446269    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     178    |     407    |  196363    |  196185    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | GPU reserved segments |      93    |     174    |     419    |     326    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      86    |     163    |     394    |     308    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |       7    |      11    |      25    |      18    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Non-releasable allocs |      49    |     198    |  326105    |  326056    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      36    |     160    |  247993    |  247957    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |      13    |      46    |   78112    |   78099    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Oversize allocations  |       0    |       0    |       0    |       0    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Oversize GPU segments |       0    |       0    |       0    |       0    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |===========================================================================|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] 
[rank0]:[2025-11-06 19:37:50,769] torch.distributed.fsdp._debug_utils: [WARNING] FSDP _optim_state_dict() profiling:  defaultdict(<class 'float'>, {'preprocessing': 0.020030266998219304, 'preprocessing_with_comm': 0.0005485500005306676, <Type.ALLGATHER_OBJ: 'all_gather_object'>: 0.007294446957530454, <Type.ALLGATHER: 'all_gather'>: 0.7452608009771211, <Type.D2H: 'D2H'>: 0.0029178579716244712, <Type.RESHARDING: 'resharding'>: 0.7839153779787011, 'state_converting': 0.7841921350045595, <Type.ALL: 'all'>: 0.8060961599985603})

With torch 2.3.1 (recommended), the error and the warnings above are fully resolved and the training output is clean.
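
If upgrading is not possible, one workaround to consider is raising the log level of the two loggers named in the output above. This is only a sketch: it assumes those messages go through Python's standard `logging` module under the logger names shown in the log prefix, which may vary between torch versions. `quiet_fsdp_optim_logs` is a hypothetical helper.

```python
import logging

# Logger names copied from the log prefix in the warning output
NOISY_FSDP_LOGGERS = (
    "torch.distributed.fsdp._optim_utils",
    "torch.distributed.fsdp._debug_utils",
)


def quiet_fsdp_optim_logs(level=logging.ERROR):
    """Raise the threshold so WARNING-level records from these loggers are dropped."""
    for name in NOISY_FSDP_LOGGERS:
        logging.getLogger(name).setLevel(level)


quiet_fsdp_optim_logs()
```

This hides the memory summaries and profiling lines only; it does not change what FSDP does when the optimizer state dict is gathered.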

✅ Problem 4: A gradient warning appears after upgrading

UserWarning: Called FSDP.clip_grad_norm_() on rank 1 with no gradients...

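Cause:
Some ranks have no LoRA parameters or no gradients at the current step, for example:

The parameters in that shard are all frozen
The data is unevenly distributed, or a microbatch is empty

Impact:
✔ It does not affect training correctness
✔ It only indicates that some ranks have no gradients to clip in this step
✔ To eliminate it entirely, adjust the LoRA injection scope or the wrap policy so that every rank holds at least some trainable parameters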
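
Since the warning is harmless, another option is simply to filter it out with the standard `warnings` module. The sketch below matches on the message text quoted above; the pattern and the helper name are assumptions, and the filter would need updating if the wording changes in a future torch release.

```python
import warnings

# Regex matched against the start of the warning message
CLIP_NO_GRAD_PATTERN = r"Called FSDP\.clip_grad_norm_\(\) on rank \d+ with no gradients"


def ignore_clip_grad_no_gradients_warning():
    """Suppress only the 'no gradients' UserWarning from FSDP.clip_grad_norm_()."""
    warnings.filterwarnings(
        "ignore",
        message=CLIP_NO_GRAD_PATTERN,
        category=UserWarning,
    )


# Call once at startup, before the training loop begins
ignore_clip_grad_no_gradients_warning()
```

Filtering by message keeps other UserWarnings visible, which is safer than silencing the whole category.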
