✅ **FSDP + LoRA Training: A Complete Practical and Troubleshooting Guide (with Pitfalls and Version Selection)**


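When fine-tuning large models, LoRA (Low-Rank Adaptation) combined with FSDP (FullyShardedDataParallel) is one of the most memory-efficient and scalable setups available today. Getting it to run end to end, however, requires attention to injection order, parameter freezing, version compatibility, saving logic, gradient synchronization, and warnings.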

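This article is compiled entirely from real pitfalls and how they were resolved, with no skipped steps and no hidden details. If you hit similar problems, it is meant to serve as a complete walkthrough of the fix.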


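📌 1. Training Goals and Key Principles

Core goals:
✅ Use LoRA for parameter-efficient fine-tuning of a large model
✅ Train only the LoRA parameters and freeze all others
✅ Use FSDP distributed training to save GPU memory
✅ After training, save only the LoRA adapter (not the full model)

Key order of operations:
Inject LoRA first, then wrap with FSDP (the order must not be reversed)
The optimizer receives only parameters with requires_grad=True
When saving, unwrap FSDP and call PeftModel.save_pretrained() to export only the LoRA adapter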


⚙️ 2. Core Code Logic (the Correct Way to Combine LoRA + FSDP)

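import os

import torch
import torch.distributed as dist
from peft import LoraConfig, get_peft_model
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)
from torch.optim import AdamW
from transformers import AutoModelForCausalLM

# ⓪ Initialize the process group (assumes a torchrun launch, which sets LOCAL_RANK)
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# ① Load the base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
)

# ② Inject LoRA (key point: before FSDP)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # for GPT-J/Qwen/Mistral, use that model's module names
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)

# ③ Train only the LoRA parameters (freeze everything else)
for name, param in model.named_parameters():
    param.requires_grad = "lora" in name

# ④ Wrap with FSDP (key point: use_orig_params=True)
model = FSDP(
    model,
    use_orig_params=True,   # ✅ avoids the non-uniform requires_grad error
    auto_wrap_policy=None,  # example default; can be customized, e.g. to wrap LlamaDecoderLayer
    device_id=torch.cuda.current_device(),
)

# ⑤ The optimizer receives only the LoRA parameters
optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)

# ... training loop goes here ...

# ⑥ Save the LoRA adapter: all ranks gather the full state dict, rank 0 writes it
save_cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_cfg):
    full_state_dict = model.state_dict()  # collective call: must run on every rank
if dist.get_rank() == 0:
    # model.module is the unwrapped PeftModel; only the adapter weights are exported
    model.module.save_pretrained("output_lora_adapter", state_dict=full_state_dict)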
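
Before wrapping with FSDP, it can help to confirm that the freezing rule in step ③ left only LoRA parameters trainable. The sketch below is a minimal check under stated assumptions: `summarize_trainable` and `ToyLoraLinear` are hypothetical names introduced here for illustration, and the toy module only mimics the `lora_A` / `lora_B` naming rather than using the real peft layer.

```python
import torch.nn as nn


def summarize_trainable(model: nn.Module) -> dict:
    """Count trainable vs. total parameters and collect the trainable names."""
    trainable, total, names = 0, 0, []
    for name, param in model.named_parameters():
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
            names.append(name)
    return {"trainable": trainable, "total": total, "names": names}


class ToyLoraLinear(nn.Module):
    """Hypothetical stand-in for a LoRA-wrapped linear layer (not the peft class)."""

    def __init__(self, dim: int = 4, rank: int = 2):
        super().__init__()
        self.base = nn.Linear(dim, dim, bias=False)
        self.lora_A = nn.Linear(dim, rank, bias=False)
        self.lora_B = nn.Linear(rank, dim, bias=False)

    def forward(self, x):
        return self.base(x) + self.lora_B(self.lora_A(x))


toy = ToyLoraLinear()
# Same freezing rule as step ③: only parameters whose name contains "lora" train.
for name, param in toy.named_parameters():
    param.requires_grad = "lora" in name

summary = summarize_trainable(toy)
print(summary["trainable"], summary["total"], summary["names"])
```

On a real model the same function can be called on the object returned by get_peft_model; a trainable count far above the expected adapter size usually means the freezing rule matched more names than intended.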

⚠️ 3. Common Errors and How They Were Resolved

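Issue 1: ValueError: FlatParameter requires uniform requires_grad

Cause:
By default, FSDP flattens the parameters of each wrapped unit into a single FlatParameter. If those parameters mix requires_grad=True and requires_grad=False (as with LoRA, which freezes most layers), FSDP raises this error.

Solution (Solution A):
Enable use_orig_params=True. FSDP still flattens the storage internally, but it exposes the original parameters, so trainable and frozen parameters can be interleaved within one unit.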
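
To see which wrapped units would trip the uniform requires_grad check, one option is to scan the model before wrapping it. This is a minimal diagnostic sketch, not part of FSDP: `mixed_requires_grad_units` and `ToyBlock` are hypothetical names, and the toy block stands in for whatever layer class the wrap policy targets (for example a decoder layer).

```python
import torch.nn as nn


def mixed_requires_grad_units(model: nn.Module, unit_type: type) -> list:
    """Return names of `unit_type` submodules that mix frozen and trainable parameters."""
    mixed = []
    for name, module in model.named_modules():
        if isinstance(module, unit_type):
            flags = {p.requires_grad for p in module.parameters()}
            if len(flags) > 1:  # both True and False are present in this unit
                mixed.append(name)
    return mixed


class ToyBlock(nn.Module):
    """Hypothetical block: one base weight plus a low-rank adapter pair."""

    def __init__(self, dim: int = 4, rank: int = 2):
        super().__init__()
        self.base = nn.Linear(dim, dim, bias=False)
        self.lora_A = nn.Linear(dim, rank, bias=False)
        self.lora_B = nn.Linear(rank, dim, bias=False)


model = nn.Sequential(ToyBlock(), ToyBlock())
for name, param in model.named_parameters():
    param.requires_grad = "lora" in name  # freeze the base, train the adapters

mixed_units = mixed_requires_grad_units(model, ToyBlock)
print(mixed_units)
```

Every unit reported here contains both kinds of parameters, which is the situation that use_orig_params=True is meant to allow.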


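Issue 2: Solution A still fails? → The cause is the Torch version

See https://github.com/pytorch/pytorch/issues/104690, which suggests that older Torch versions may not fully implement this feature. Installing torch 2.2.2 made the error go away.
Recommendation: install a torch version newer than 2.2 (2.2.2 or later).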

| Version | Result |
|---|---|
| torch 2.0 / 2.1 | use_orig_params does not fully take effect; the error persists |
| torch 2.2.2 | ✅ Runs, but emits _optim_state_dict()-related WARNINGs |
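
Since the behavior depends on the installed version, a training script can fail fast when the environment is too old. The sketch below is one way to do that with the standard library only; `parse_version` and `meets_minimum` are hypothetical helpers, and the (2, 2, 2) threshold is taken from the versions listed above rather than from any official compatibility statement.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn a string such as '2.3.1+cu121' into a comparable (major, minor, patch) tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component (e.g. '2.2') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(version: str, minimum: tuple = (2, 2, 2)) -> bool:
    """True if `version` is at least `minimum`."""
    return parse_version(version) >= minimum


for candidate in ["2.0.1", "2.1.2+cu118", "2.2.2", "2.3.1+cu121"]:
    print(candidate, meets_minimum(candidate))
```

In a real script the argument would be torch.__version__, checked once at startup before the model is built.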

Issue 3: New warnings appear after upgrading

Example warning:

[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |                  PyTorch CUDA memory summary, device ID 0                 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |            CUDA OOMs: 0            |        cudaMalloc retries: 6         |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |===========================================================================|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Allocated memory      |   5453 MiB |  17469 MiB |  14326 GiB |  14320 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |   5443 MiB |  17455 MiB |  14287 GiB |  14282 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     10 MiB |     20 MiB |     38 GiB |     38 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Active memory         |   5453 MiB |  17679 MiB |  14326 GiB |  14320 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |   5443 MiB |  17665 MiB |  14287 GiB |  14282 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     10 MiB |     20 MiB |     38 GiB |     38 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Requested memory      |   5452 MiB |  17651 MiB |  14282 GiB |  14277 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |   5442 MiB |  17637 MiB |  14243 GiB |  14238 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     10 MiB |     20 MiB |     38 GiB |     38 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | GPU reserved memory   |  20886 MiB |  23580 MiB |  77494 MiB |  56608 MiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |  20872 MiB |  23558 MiB |  77444 MiB |  56572 MiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     14 MiB |     22 MiB |     50 MiB |     36 MiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Non-releasable memory | 715104 KiB |   4311 MiB |   8595 GiB |   8595 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool | 713440 KiB |   4310 MiB |   8552 GiB |   8551 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |   1664 KiB |      3 MiB |     43 GiB |     43 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Allocations           |     240    |     952    |  642681    |  642441    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      67    |     553    |  446331    |  446264    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     173    |     406    |  196350    |  196177    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Active allocs         |     245    |     952    |  642681    |  642436    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      67    |     553    |  446331    |  446264    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     178    |     407    |  196350    |  196172    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | GPU reserved segments |      93    |     174    |     419    |     326    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      86    |     163    |     394    |     308    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |       7    |      11    |      25    |      18    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Non-releasable allocs |      49    |     198    |  326102    |  326053    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      36    |     160    |  247991    |  247955    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |      13    |      46    |   78111    |   78098    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Oversize allocations  |       0    |       0    |       0    |       0    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Oversize GPU segments |       0    |       0    |       0    |       0    |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |===========================================================================|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] 
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] CUDA Memory Summary before calling to _allgather_orig_param_states |===========================================================================|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |                  PyTorch CUDA memory summary, device ID 0                 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |            CUDA OOMs: 0            |        cudaMalloc retries: 6         |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |===========================================================================|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Allocated memory      |   5453 MiB |  17469 MiB |  14326 GiB |  14321 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |   5443 MiB |  17455 MiB |  14288 GiB |  14282 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     10 MiB |     20 MiB |     38 GiB |     38 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Active memory         |   5453 MiB |  17679 MiB |  14326 GiB |  14321 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |   5443 MiB |  17665 MiB |  14288 GiB |  14282 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     10 MiB |     20 MiB |     38 GiB |     38 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Requested memory      |   5452 MiB |  17651 MiB |  14283 GiB |  14277 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |   5442 MiB |  17637 MiB |  14244 GiB |  14239 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     10 MiB |     20 MiB |     38 GiB |     38 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | GPU reserved memory   |  20886 MiB |  23580 MiB |  77494 MiB |  56608 MiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |  20872 MiB |  23558 MiB |  77444 MiB |  56572 MiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     14 MiB |     22 MiB |     50 MiB |     36 MiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Non-releasable memory | 715104 KiB |   4311 MiB |   8596 GiB |   8595 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool | 713440 KiB |   4310 MiB |   8553 GiB |   8552 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |   1664 KiB |      3 MiB |     43 GiB |     43 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Allocations           |     240    |     952    |  642699    |  642459    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      67    |     553    |  446336    |  446269    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     173    |     406    |  196363    |  196190    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Active allocs         |     245    |     952    |  642699    |  642454    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      67    |     553    |  446336    |  446269    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |     178    |     407    |  196363    |  196185    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | GPU reserved segments |      93    |     174    |     419    |     326    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      86    |     163    |     394    |     308    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |       7    |      11    |      25    |      18    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Non-releasable allocs |      49    |     198    |  326105    |  326056    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from large pool |      36    |     160    |  247993    |  247957    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |       from small pool |      13    |      46    |   78112    |   78099    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Oversize allocations  |       0    |       0    |       0    |       0    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Oversize GPU segments |       0    |       0    |       0    |       0    |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |===========================================================================|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] 
[rank0]:[2025-11-06 19:37:50,769] torch.distributed.fsdp._debug_utils: [WARNING] FSDP _optim_state_dict() profiling:  defaultdict(<class 'float'>, {'preprocessing': 0.020030266998219304, 'preprocessing_with_comm': 0.0005485500005306676, <Type.ALLGATHER_OBJ: 'all_gather_object'>: 0.007294446957530454, <Type.ALLGATHER: 'all_gather'>: 0.7452608009771211, <Type.D2H: 'D2H'>: 0.0029178579716244712, <Type.RESHARDING: 'resharding'>: 0.7839153779787011, 'state_converting': 0.7841921350045595, <Type.ALL: 'all'>: 0.8060961599985603})
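
If these WARNING lines are only noise in your logs and you cannot change the torch version yet, one possible workaround is to raise the level of the loggers that emit them. This is a sketch under assumptions: the logger names are copied from the log lines above, `quiet_fsdp_optim_logs` is a hypothetical helper, and whether this silences everything may depend on how your torch version configures its own logging, so call it after importing torch.

```python
import logging

# Logger names as they appear in the log lines above.
NOISY_FSDP_LOGGERS = (
    "torch.distributed.fsdp._optim_utils",
    "torch.distributed.fsdp._debug_utils",
)


def quiet_fsdp_optim_logs(level: int = logging.ERROR) -> None:
    """Raise the threshold of the listed loggers so WARNING records are dropped."""
    for name in NOISY_FSDP_LOGGERS:
        logging.getLogger(name).setLevel(level)


quiet_fsdp_optim_logs()
```

Errors from the same modules still get through, since only records below the chosen level are filtered.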

torch 2.3.1 (recommended): ✅ errors and warnings fully resolved; the training run is clean.

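Issue 4: A new gradient warning appears after upgrading

UserWarning: Called FSDP.clip_grad_norm_() on rank 1 with no gradients...

Cause:
Some ranks hold no LoRA parameters, or have no gradients, at the current step. For example:

  • every parameter in that rank's shard is frozen
  • the data is unevenly distributed, or a microbatch is empty

Impact:
✔ Does not affect training correctness
✔ It only indicates that some ranks had no gradients to clip in this step
✔ To eliminate it entirely, adjust the LoRA injection scope or the wrap policy so that every rank holds at least some trainable parameters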
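
If changing the injection scope or wrap policy is not practical and you have confirmed the warning is harmless in your run, another option is to filter just this message. The sketch below uses the standard warnings module; `ignore_empty_clip_warning` is a hypothetical helper, and the pattern is based on the warning text quoted above, so it may need adjusting if the wording differs in your torch version.

```python
import warnings


def ignore_empty_clip_warning() -> None:
    """Hide only the 'no gradients' UserWarning from FSDP.clip_grad_norm_()."""
    warnings.filterwarnings(
        "ignore",
        # Matched against the start of the warning message.
        message=r"Called FSDP\.clip_grad_norm_\(\) on rank \d+ with no gradients",
        category=UserWarning,
    )


ignore_empty_clip_warning()
```

Other UserWarnings are unaffected, because the filter matches this message prefix only.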
