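For fine-tuning large models, LoRA (Low-Rank Adaptation) combined with FSDP (FullyShardedDataParallel) is one of the most memory-efficient and scalable setups available today. Getting it to actually run, however, means paying attention to injection order, parameter freezing, version compatibility, saving logic, gradient synchronization, and warnings.
This post is compiled entirely from real pitfalls and the steps taken to resolve them, with no steps skipped and no details hidden. If you run into similar problems, it should serve as a complete walkthrough.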
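📌 1. Training goals and key principles
Core goals:
✅ Use LoRA for parameter-efficient fine-tuning of a large model
✅ Train only the LoRA parameters and freeze everything else
✅ Use FSDP distributed training to save GPU memory
✅ After training, save only the LoRA adapter (not the full model)
Key order of operations:
✔ Inject LoRA first, then wrap with FSDP (the order cannot be reversed)
✔ Give the optimizer only the parameters with requires_grad=True
✔ When saving, unwrap FSDP and call PeftModel.save_pretrained() so that only the LoRA adapter is exported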
⚙️ 2. Core code (the correct way to combine LoRA + FSDP)
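import torch
import torch.distributed as dist
from transformers import AutoModelForCausalLM
from peft import get_peft_model, LoraConfig
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)
from torch.optim import AdamW

# Assumes the process group is already initialized (e.g. the script is launched
# with torchrun and has called dist.init_process_group) and that each rank has
# selected its own GPU with torch.cuda.set_device.

# ① Load the base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
)

# ② Inject LoRA (key point: do this BEFORE wrapping with FSDP)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # for GPT-J/Qwen/Mistral, use that model's module names
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)

# ③ Train only the LoRA parameters (freeze everything else)
for name, param in model.named_parameters():
    param.requires_grad = "lora" in name

# ④ Wrap with FSDP (key point: use_orig_params=True)
model = FSDP(
    model,
    use_orig_params=True,   # ✅ allows mixed requires_grad, avoiding the uniformity error
    auto_wrap_policy=None,  # default for this example; can be customized to wrap per LlamaDecoderLayer
    device_id=torch.cuda.current_device(),  # move the shards onto this rank's GPU
)

# ⑤ The optimizer only receives the LoRA parameters
optimizer = AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)

# ⑥ Save the LoRA adapter from rank 0.
# The parameters are sharded across ranks, so gather a full state dict first.
# This is a collective call: every rank must execute it.
save_cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_cfg):
    full_state = model.state_dict()
if dist.get_rank() == 0:
    peft_model = model.module  # unwrap FSDP to get the PeftModel back
    peft_model.save_pretrained("output_lora_adapter", state_dict=full_state)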
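
To get a feel for how few parameters step ③ leaves trainable, here is a rough back-of-the-envelope count for the LoRA configuration above. It is only a sketch: the hidden size of 4096 and the 32 decoder layers are the published Llama-2-7B shape rather than something stated in this post, and the helper name `lora_param_count` is made up for illustration.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


# Assumed Llama-2-7B shape (not stated in the post): hidden size 4096, 32 layers.
HIDDEN = 4096
NUM_LAYERS = 32
TARGETS_PER_LAYER = 2  # q_proj and v_proj, as in target_modules
R = 8                  # LoRA rank, as in LoraConfig(r=8)

per_module = lora_param_count(HIDDEN, HIDDEN, R)
trainable = per_module * TARGETS_PER_LAYER * NUM_LAYERS

print(f"LoRA parameters per wrapped module: {per_module:,}")
print(f"Total trainable LoRA parameters:    {trainable:,}")
```

Under these assumptions the adapter holds roughly 4.2 million parameters, far below 0.1% of a 7B-parameter base model, which is why filtering the optimizer down to requires_grad=True parameters saves so much optimizer state.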
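⚠️ 3. Common errors and how they were resolved
✅ Problem 1: ValueError: FlatParameter requires uniform requires_grad
Cause:
By default, FSDP merges the parameters it manages into a FlatParameter. If those parameters mix requires_grad=True and requires_grad=False (for example, because LoRA froze most of the layers), FSDP raises this error.
Solution (Option A):
Enable use_orig_params=True. FSDP still shards through a FlatParameter internally, but it exposes the original parameters and therefore allows trainable and frozen parameters to be interleaved.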
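
The rule behind this error can be illustrated with a small toy model. The sketch below is not FSDP's actual code: `ToyParam` and `toy_flatten` are hypothetical names, and the check is a simplified stand-in for the uniformity requirement described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyParam:
    name: str
    requires_grad: bool


def toy_flatten(params: List[ToyParam], use_orig_params: bool) -> List[str]:
    """Simplified stand-in for the check made before parameters are flattened.

    Mixed requires_grad flags are rejected unless use_orig_params is set.
    """
    flags = {p.requires_grad for p in params}
    if len(flags) > 1 and not use_orig_params:
        raise ValueError("FlatParameter requires uniform requires_grad")
    return [p.name for p in params]


params = [
    ToyParam("q_proj.weight", requires_grad=False),        # frozen base weight
    ToyParam("q_proj.lora_A.weight", requires_grad=True),  # trainable LoRA factor
    ToyParam("q_proj.lora_B.weight", requires_grad=True),  # trainable LoRA factor
]

try:
    toy_flatten(params, use_orig_params=False)
    mixed_ok_without_flag = True
except ValueError:
    mixed_ok_without_flag = False

mixed_ok_with_flag = toy_flatten(params, use_orig_params=True) == [p.name for p in params]
print(mixed_ok_without_flag, mixed_ok_with_flag)
```

A LoRA-injected layer always looks like the `params` list here (one frozen base weight next to trainable adapter factors), so any wrap unit that contains a target module hits the mixed case.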
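✅ Problem 2: Option A still fails? → The cause is the Torch version
See https://github.com/pytorch/pytorch/issues/104690, which suggests that older Torch versions may not fully implement this feature. I then installed torch 2.2.2.
Recommendation: install torch>2.2.
| Version | Result |
|---|---|
| torch 2.0 / 2.1 | use_orig_params does not fully take effect; the error persists |
| torch 2.2.2 | ✅ Runs, but emits _optim_state_dict()-related WARNINGs |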
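
Since the outcome depends on the installed version, it can help to fail fast at startup. The following is a minimal sketch, assuming that 2.2.2 (the first version reported as working in the table) is the minimum you want to enforce; `parse_version` and `check_min_version` are illustrative helpers, not part of torch.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from strings such as '2.2.2+cu121'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def check_min_version(version: str, minimum: Tuple[int, int, int] = (2, 2, 2)) -> bool:
    """Return True when the given version is at least the assumed minimum."""
    return parse_version(version) >= minimum


# Sample version strings; in a training script you would pass torch.__version__.
for candidate in ("2.0.1+cu118", "2.1.0", "2.2.2+cu121", "2.3.1"):
    print(candidate, check_min_version(candidate))
```

In a real script the call would be `check_min_version(torch.__version__)`, raising an error with a clear message before any model is loaded.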
✅ Problem 3: new warnings appear after the upgrade
Example warning:
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | PyTorch CUDA memory summary, device ID 0 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | CUDA OOMs: 0 | cudaMalloc retries: 6 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |===========================================================================|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Allocated memory | 5453 MiB | 17469 MiB | 14326 GiB | 14320 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 5443 MiB | 17455 MiB | 14287 GiB | 14282 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 10 MiB | 20 MiB | 38 GiB | 38 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Active memory | 5453 MiB | 17679 MiB | 14326 GiB | 14320 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 5443 MiB | 17665 MiB | 14287 GiB | 14282 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 10 MiB | 20 MiB | 38 GiB | 38 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Requested memory | 5452 MiB | 17651 MiB | 14282 GiB | 14277 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 5442 MiB | 17637 MiB | 14243 GiB | 14238 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 10 MiB | 20 MiB | 38 GiB | 38 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | GPU reserved memory | 20886 MiB | 23580 MiB | 77494 MiB | 56608 MiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 20872 MiB | 23558 MiB | 77444 MiB | 56572 MiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 14 MiB | 22 MiB | 50 MiB | 36 MiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Non-releasable memory | 715104 KiB | 4311 MiB | 8595 GiB | 8595 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 713440 KiB | 4310 MiB | 8552 GiB | 8551 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 1664 KiB | 3 MiB | 43 GiB | 43 GiB |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Allocations | 240 | 952 | 642681 | 642441 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 67 | 553 | 446331 | 446264 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 173 | 406 | 196350 | 196177 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Active allocs | 245 | 952 | 642681 | 642436 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 67 | 553 | 446331 | 446264 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 178 | 407 | 196350 | 196172 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | GPU reserved segments | 93 | 174 | 419 | 326 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 86 | 163 | 394 | 308 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 7 | 11 | 25 | 18 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Non-releasable allocs | 49 | 198 | 326102 | 326053 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 36 | 160 | 247991 | 247955 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 13 | 46 | 78111 | 78098 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Oversize allocations | 0 | 0 | 0 | 0 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] | Oversize GPU segments | 0 | 0 | 0 | 0 |
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING] |===========================================================================|
[rank0]:[2025-11-06 19:37:50,720] torch.distributed.fsdp._optim_utils: [WARNING]
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] CUDA Memory Summary before calling to _allgather_orig_param_states |===========================================================================|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | PyTorch CUDA memory summary, device ID 0 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | CUDA OOMs: 0 | cudaMalloc retries: 6 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |===========================================================================|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Allocated memory | 5453 MiB | 17469 MiB | 14326 GiB | 14321 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 5443 MiB | 17455 MiB | 14288 GiB | 14282 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 10 MiB | 20 MiB | 38 GiB | 38 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Active memory | 5453 MiB | 17679 MiB | 14326 GiB | 14321 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 5443 MiB | 17665 MiB | 14288 GiB | 14282 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 10 MiB | 20 MiB | 38 GiB | 38 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Requested memory | 5452 MiB | 17651 MiB | 14283 GiB | 14277 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 5442 MiB | 17637 MiB | 14244 GiB | 14239 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 10 MiB | 20 MiB | 38 GiB | 38 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | GPU reserved memory | 20886 MiB | 23580 MiB | 77494 MiB | 56608 MiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 20872 MiB | 23558 MiB | 77444 MiB | 56572 MiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 14 MiB | 22 MiB | 50 MiB | 36 MiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Non-releasable memory | 715104 KiB | 4311 MiB | 8596 GiB | 8595 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 713440 KiB | 4310 MiB | 8553 GiB | 8552 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 1664 KiB | 3 MiB | 43 GiB | 43 GiB |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Allocations | 240 | 952 | 642699 | 642459 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 67 | 553 | 446336 | 446269 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 173 | 406 | 196363 | 196190 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Active allocs | 245 | 952 | 642699 | 642454 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 67 | 553 | 446336 | 446269 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 178 | 407 | 196363 | 196185 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | GPU reserved segments | 93 | 174 | 419 | 326 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 86 | 163 | 394 | 308 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 7 | 11 | 25 | 18 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Non-releasable allocs | 49 | 198 | 326105 | 326056 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from large pool | 36 | 160 | 247993 | 247957 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | from small pool | 13 | 46 | 78112 | 78099 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Oversize allocations | 0 | 0 | 0 | 0 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |---------------------------------------------------------------------------|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] | Oversize GPU segments | 0 | 0 | 0 | 0 |
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING] |===========================================================================|
[rank0]:[2025-11-06 19:37:50,745] torch.distributed.fsdp._optim_utils: [WARNING]
[rank0]:[2025-11-06 19:37:50,769] torch.distributed.fsdp._debug_utils: [WARNING] FSDP _optim_state_dict() profiling: defaultdict(<class 'float'>, {'preprocessing': 0.020030266998219304, 'preprocessing_with_comm': 0.0005485500005306676, <Type.ALLGATHER_OBJ: 'all_gather_object'>: 0.007294446957530454, <Type.ALLGATHER: 'all_gather'>: 0.7452608009771211, <Type.D2H: 'D2H'>: 0.0029178579716244712, <Type.RESHARDING: 'resharding'>: 0.7839153779787011, 'state_converting': 0.7841921350045595, <Type.ALL: 'all'>: 0.8060961599985603})
torch 2.3.1 (recommended): ✅ the error and the warnings are fully resolved, and the training output is clean.
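
If upgrading is not immediately possible, one option is to raise the log threshold for the modules that print these messages. This is a sketch under an assumption: the logger names are copied from the prefixes of the log lines above, and it presumes those messages are routed through Python's standard `logging` module under exactly those names. `quiet_fsdp_state_dict_logs` is a hypothetical helper.

```python
import logging

# Logger names copied from the prefixes of the log lines above (assumed to be
# standard `logging` logger names).
NOISY_FSDP_LOGGERS = (
    "torch.distributed.fsdp._optim_utils",
    "torch.distributed.fsdp._debug_utils",
)


def quiet_fsdp_state_dict_logs(level: int = logging.ERROR) -> None:
    """Raise the threshold so WARNING-level records from these loggers are dropped."""
    for name in NOISY_FSDP_LOGGERS:
        logging.getLogger(name).setLevel(level)


quiet_fsdp_state_dict_logs()
```

This only hides the output; it does not change what FSDP does while building the optimizer state dict.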
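✅ Problem 4: a new gradient warning appears after the upgrade
UserWarning: Called FSDP.clip_grad_norm_() on rank 1 with no gradients...
Cause:
On the current step, some ranks hold no LoRA parameters or have no gradients, for example:
- every parameter in that rank's shard is frozen
- the data is uneven, or a microbatch is empty
Impact:
✔ It does not affect training correctness
✔ It only signals that some ranks had no gradients to clip on this step
✔ To eliminate it entirely, adjust the LoRA injection scope or the wrap policy so that every rank holds at least some trainable parameters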
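
If you would rather keep the current wrap policy and simply hide the message, Python's standard `warnings` module can filter it by its text. This is a minimal sketch: the pattern is taken from the warning shown above, and `silence_clip_grad_warning` is an illustrative name.

```python
import warnings


def silence_clip_grad_warning() -> None:
    """Ignore only the 'no gradients' UserWarning from FSDP.clip_grad_norm_()."""
    warnings.filterwarnings(
        "ignore",
        message=r"Called FSDP\.clip_grad_norm_\(\) on rank",  # matched at the start of the text
        category=UserWarning,
    )


# Call once at startup, before the training loop.
silence_clip_grad_warning()
```

Because the filter matches on the message prefix, other UserWarnings still get through.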