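From e91a6480959bea20b894d1c4c56a8ac41f4fcb7b Mon Sep 17 00:00:00 2001
From: Mattheliu
Date: Fri, 12 Sep 2025 15:56:24 +0800
Subject: [PATCH] add v3.2 Release Note (#7430)
---
 docs/release_note_cn.md | 440 +++++++++++++++---------------------
 docs/release_note_en.md | 483 +++++++++++++++++-----------------------
 2 files changed, 377 insertions(+), 546 deletions(-)
diff --git a/docs/release_note_cn.md b/docs/release_note_cn.md
index b4b8d0e2f75..d0c739cc615 100644
--- a/docs/release_note_cn.md
+++ b/docs/release_note_cn.md
@@ -1,299 +1,211 @@
-# 3.1 Release Note
+# 3.2 Release Note
+PaddlePaddle framework 3.2 further improves large-model training and inference performance, hardware adaptation, and support for mainstream large models and high-performance acceleration libraries.
-PaddlePaddle framework 3.1 further refines the core automatic parallelism feature, improving usability and performance; it adds FP8 low-precision training support, speeding up large-model training by 10-20%; it improves the hardware extension mechanism, lowering the cost of adapting CUDA-like hardware so that users only need to register kernels; and it strengthens basic framework capabilities to improve stability. Key updates are as follows:
+- For large-model training, the framework is upgraded in three areas: computation, parallel strategy, and fault tolerance:
+  - For basic computing performance, FlashMask V3, a sparse-mask attention computation that overlaps memory access with computation, is introduced to maximize the computational efficiency of Attention; an efficient FP8 mixed-precision training technique with no loss of model quality is also implemented.
+  - For distributed parallel strategy, a dynamic, adaptive GPU memory offloading strategy is introduced to achieve the best balance between memory and computation; combined with a newly designed memory-friendly pipeline-parallel schedule, it further reduces GPU memory overhead.
+  - The framework's native fault tolerance is strengthened with a fault-tolerance system for large-scale cluster training, which monitors hard-to-detect faults such as silent data corruption online without affecting training efficiency, and with a highly available checkpoint disaster-recovery method that reduces the cost of recovering from interruptions.
+- For hardware adaptation, the plug-in adaptation solution is fully upgraded for CUDA-like chips.
+  - For device resource management and scheduling and the high-performance collective communication library, management interfaces are upgraded and communication capabilities enhanced for CUDA-like chips; in particular, distributed communication is strengthened so that XCCL aligns with the structures and functions of NCCL.
+  - A registration mechanism for CUDA-like operators is added. Taking the MetaX adaptation as an example, an operator kernel can be registered with a single line of code by reusing the GPU operator kernel. According to statistics, the kernel reuse rate can reach up to 92%, which greatly reduces hardware adaptation cost.
+- For user experience, the focus is on compatibility: development APIs compatible with common industry usage, compatibility with the Safetensors model format, and compatibility with third-party high-performance acceleration libraries.
+  - Development APIs are added and modified to be compatible with common industry usage, including a series of new APIs and aliases, new parameter aliases, and new API-specific and general parameters.
+  - The Safetensors model format is fully supported. The new FlexCheckpoint mechanism automatically reshards parameters across distributed strategies and model structures, which significantly reduces weight conversion cost and improves end-to-end development efficiency for large-model training and inference.
+  - API compatibility and operator registration are systematically enhanced so that high-performance acceleration libraries can be imported with one click and reused directly, without code changes, for model training and inference acceleration in PaddlePaddle.
-- **Automatic parallel architecture:** The automatic parallel architecture is further refined to improve the usability of its core mechanism and dynamic graph performance. The core mechanism is improved by adding sharding derivation rules for several operators, supporting sharding of the same dimension of a distributed tensor across multiple mesh dimensions, and supporting dynamic graph parallel strategies (PP, CP, SEP, TP-CONV). Dynamic graph automatic parallelism is also systematically optimized for performance, roughly matching manual parallelism on model series such as Llama2, Qwen, and Baichuan.
-- **Low-precision training:** Low-precision training is supported based on blockwise FP8 GEMM operators, with training accuracy comparable to BF16 and large-model training 10-20% faster.
-- **Heterogeneous multi-chip adaptation:** A reuse mechanism for CUDA-like operators is provided; the corresponding kernel can be used once it is registered.
-- **Framework stability:** Incorrect operator results for 0-Size and large-dimension inputs are systematically fixed.
-
-## 1. User Experience Upgrade
-
-API enhancements, bug fixes, and improvements aimed at improving user experience and API usability. The `paddle.randn_like` API is added, functional defects in several APIs are fixed, and support for complex types and 0-Size Tensors is enhanced. Documentation and code are also updated and optimized to improve overall accuracy and professionalism.
+## 1. User Experience
 ### New Features
-
-- Added the `paddle.randn_like` API. [#72492](https://github.com/PaddlePaddle/Paddle/pull/72492)
-
-### Bug Fixes
-
-- Fixed inconsistent input and output types in the `tensordot` API. [#72139](https://github.com/PaddlePaddle/Paddle/pull/72139)
-- Fixed an issue in the `atleast` API when the output is a list of Tensors. [#73102](https://github.com/PaddlePaddle/Paddle/pull/73102)
-- Fixed an issue in the `nonzero` API. [#72003](https://github.com/PaddlePaddle/Paddle/pull/72003)
-- Fixed a memory leak in `dualpipev`. [#72070](https://github.com/PaddlePaddle/Paddle/pull/72070)
-- Fixed a computation overflow in `softmax`. [#71935](https://github.com/PaddlePaddle/Paddle/pull/71935)
-- Fixed shape checking in `take_along_axis` when `broadcast=False`. [#72436](https://github.com/PaddlePaddle/Paddle/pull/72436)
-- Fixed incorrect handling of NaN inputs in `maximum` and `minimum`. [#71933](https://github.com/PaddlePaddle/Paddle/pull/71933)
-- Fixed an issue in `visit_type`. [#72782](https://github.com/PaddlePaddle/Paddle/pull/72782)
-- Fixed an int32 out-of-bounds issue in `gather_scatter_functor`. [#72905](https://github.com/PaddlePaddle/Paddle/pull/72905)
-- Fixed the inplace implementation of `Bernoulli`. [#73271](https://github.com/PaddlePaddle/Paddle/pull/73271)
-- Fixed issues in `moe_permute` and `moe_unpermute`. [#73365](https://github.com/PaddlePaddle/Paddle/pull/73365)
-- Fixed syntax checking of pyi files by `ast.parse`. [#71872](https://github.com/PaddlePaddle/Paddle/pull/71872)
-- Fixed a complex division issue. [#73331](https://github.com/PaddlePaddle/Paddle/pull/73331)
-- Fixed issues related to TensorRT integration. [#72302](https://github.com/PaddlePaddle/Paddle/pull/72302), [#72278](https://github.com/PaddlePaddle/Paddle/pull/72278)
+- 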
Added APIs: `paddle.msort`, `paddle.ravel`, `paddle.nn.functional.dropout1d`, `paddle.Tensor.type_as`, `paddle.Tensor.requires_grad`, `paddle.view_as_complex`, `paddle.view_as_real`, `paddle.nn.Parameter`, `paddle.broadcast_shapes`, `paddle.range`, `paddle.as_tensor`, `paddle.scatter_reduce/scatter_reduce_`, `paddle.scatter_add`, `paddle.tensor`, `paddle.softmax`, `paddle.Tensor.softmax`, `paddle.rand_like`, `paddle.is_autocast_enabled`, `paddle.get_autocast_gpu_dtype`, `paddle.Tensor.repeat`, `paddle.permute`. [#74421](https://github.com/PaddlePaddle/Paddle/pull/74421), [#74439](https://github.com/PaddlePaddle/Paddle/pull/74439), [#74444](https://github.com/PaddlePaddle/Paddle/pull/74444), [#74454](https://github.com/PaddlePaddle/Paddle/pull/74454), [#74459](https://github.com/PaddlePaddle/Paddle/pull/74459), [#74491](https://github.com/PaddlePaddle/Paddle/pull/74491), [#74466](https://github.com/PaddlePaddle/Paddle/pull/74466), [#74438](https://github.com/PaddlePaddle/Paddle/pull/74438), [#74594](https://github.com/PaddlePaddle/Paddle/pull/74594), [#74542](https://github.com/PaddlePaddle/Paddle/pull/74542), [#74694](https://github.com/PaddlePaddle/Paddle/pull/74694), [#74564](https://github.com/PaddlePaddle/Paddle/pull/74564), [#74540](https://github.com/PaddlePaddle/Paddle/pull/74540), [#74586](https://github.com/PaddlePaddle/Paddle/pull/74586), [#74651](https://github.com/PaddlePaddle/Paddle/pull/74651), [#74807](https://github.com/PaddlePaddle/Paddle/pull/74807), [#74632](https://github.com/PaddlePaddle/Paddle/pull/74632), [#74834](https://github.com/PaddlePaddle/Paddle/pull/74834), [#74952](https://github.com/PaddlePaddle/Paddle/pull/74952), [#74772](https://github.com/PaddlePaddle/Paddle/pull/74772), [#74441](https://github.com/PaddlePaddle/Paddle/pull/74441), [#74561](https://github.com/PaddlePaddle/Paddle/pull/74561), [#74525](https://github.com/PaddlePaddle/Paddle/pull/74525)
+- Added the `paddle.compat.*` series of APIs, which support common industry usage to ease code migration, including `paddle.compat.median`, `paddle.compat.nanmedian`, `paddle.compat.softmax`, `paddle.compat.sort`, `paddle.compat.split`, `paddle.compat.min/max`, `paddle.compat.Unfold`. [#74865](https://github.com/PaddlePaddle/Paddle/pull/74865), [#74874](https://github.com/PaddlePaddle/Paddle/pull/74874)
+- Added a series of initialization APIs that support common industry parameter-initialization methods, including `paddle.nn.init.kaiming_uniform_`, `paddle.nn.init.xavier_uniform_`, `paddle.nn.init.uniform_`, `paddle.nn.init.kaiming_normal_`, `paddle.nn.init.xavier_normal_`, `paddle.nn.init.normal_`, `paddle.nn.init.calculate_gain`, `paddle.nn.init.constant_`, `paddle.nn.init.dirac_`, `paddle.nn.init.eye_`, `paddle.nn.init.ones_`, `paddle.nn.init.orthogonal_`, `paddle.nn.init.trunc_normal_`, `paddle.nn.init.zeros_`. [#74478](https://github.com/PaddlePaddle/Paddle/pull/74478)
+- Added parameter aliases to APIs for more flexible usage; for example, either `x` or `input` can be passed. Affected APIs include `paddle.maximum`, `paddle.minimum`, `paddle.sqrt`, `paddle.topk`, `paddle.polar`, `paddle.stack`, `paddle.cos`, `paddle.floor`, `paddle.log`, `paddle.pow`, `paddle.rsqrt`, `paddle.sign`, `paddle.sin`, `paddle.multiply`, `paddle.where`, and others. [#74683](https://github.com/PaddlePaddle/Paddle/pull/74683), [#74795](https://github.com/PaddlePaddle/Paddle/pull/74795), [#74887](https://github.com/PaddlePaddle/Paddle/pull/74887), [#74592](https://github.com/PaddlePaddle/Paddle/pull/74592)
+- `paddle.Tensor` now supports multiple initialization methods for flexible Tensor creation. [#74619](https://github.com/PaddlePaddle/Paddle/pull/74619), [#75022](https://github.com/PaddlePaddle/Paddle/pull/75022), [#75065](https://github.com/PaddlePaddle/Paddle/pull/75065)
+- Added API-specific parameters to enhance existing functionality, including 
`paddle.nn.functional.gelu`, `paddle.divide/div/div_`, `paddle.add`, `paddle.Tensor.copy_`, `paddle.norm`, `paddle.linalg.norm`, `paddle.nn.functional.silu`, `paddle.repeat_interleave`. [#74485](https://github.com/PaddlePaddle/Paddle/pull/74485), [#74562](https://github.com/PaddlePaddle/Paddle/pull/74562), [#74420](https://github.com/PaddlePaddle/Paddle/pull/74420), [#74768](https://github.com/PaddlePaddle/Paddle/pull/74768), [#74855](https://github.com/PaddlePaddle/Paddle/pull/74855), [#74903](https://github.com/PaddlePaddle/Paddle/pull/74903), [#74788](https://github.com/PaddlePaddle/Paddle/pull/74788), [#74631](https://github.com/PaddlePaddle/Paddle/pull/74631), [#74947](https://github.com/PaddlePaddle/Paddle/pull/74947)
+- Added general parameters to APIs to enhance existing functionality: `out`, `device`, `dtype`, `requires_grad`, `pin_memory`, `bias`. Affected APIs include `paddle.zeros`, `paddle.zeros_like`, `paddle.ones`, `paddle.ones_like`, `paddle.arange`, `paddle.eye`, `paddle.empty`, `paddle.empty_like`, `paddle.full`, `paddle.full_like`, `paddle.randn`, `paddle.Tensor.new_full`, `paddle.Tensor.new_empty`, `paddle.Tensor.new_ones`, `paddle.Tensor.new_zeros`, `paddle.tril/triu`, `paddle.bmm`, `paddle.nn.Conv1D/Conv2D/Conv3D/Embedding`, `paddle.diff`, `paddle.cumsum`, `paddle.var`, `paddle.multinomial`, `paddle.mean`, and others. [#74477](https://github.com/PaddlePaddle/Paddle/pull/74477), [#74526](https://github.com/PaddlePaddle/Paddle/pull/74526), [#74711](https://github.com/PaddlePaddle/Paddle/pull/74711), [#74582](https://github.com/PaddlePaddle/Paddle/pull/74582), [#74624](https://github.com/PaddlePaddle/Paddle/pull/74624), [#74849](https://github.com/PaddlePaddle/Paddle/pull/74849), [#74612](https://github.com/PaddlePaddle/Paddle/pull/74612), [#74875](https://github.com/PaddlePaddle/Paddle/pull/74875), [#74641](https://github.com/PaddlePaddle/Paddle/pull/74641), [#74949](https://github.com/PaddlePaddle/Paddle/pull/74949), [#74918](https://github.com/PaddlePaddle/Paddle/pull/74918), [#74914](https://github.com/PaddlePaddle/Paddle/pull/74914), [#74934](https://github.com/PaddlePaddle/Paddle/pull/74934), [#74920](https://github.com/PaddlePaddle/Paddle/pull/74920), [#74955](https://github.com/PaddlePaddle/Paddle/pull/74955), [#74226](https://github.com/PaddlePaddle/Paddle/pull/74226), [#74946](https://github.com/PaddlePaddle/Paddle/pull/74946)
+- Added API aliases to support more calling styles, including `paddle.Tensor.mul_/mul`, `paddle.autograd.Function`, `paddle.argwhere`, `paddle.cat`, `paddle.clamp`, `paddle.ger`, `paddle.take_along_dim`, `paddle.linalg.matmul`, `paddle.special.logsumexp`, `paddle.concatenate`, `paddle.eq/gt`, `paddle.Tensor.take_along_dim`, `paddle.nn.Conv1d/Conv2d/Conv3d`, and others. [#74493](https://github.com/PaddlePaddle/Paddle/pull/74493), [#74569](https://github.com/PaddlePaddle/Paddle/pull/74569), [#74870](https://github.com/PaddlePaddle/Paddle/pull/74870)
+
+### Bug Fixes
+- Fixed a precision issue in `paddle.nanmedian`. [#74263](https://github.com/PaddlePaddle/Paddle/pull/74263)
+- Fixed an issue in `paddle.distributed.fleet.utils.hybrid_parallel_util.fused_allreduce_gradients` in the 0-D case. [#74957](https://github.com/PaddlePaddle/Paddle/pull/74957)
+- Fixed an issue in `paddle.matmul` in distributed scenarios. [#74989](https://github.com/PaddlePaddle/Paddle/pull/74989)
 ### Enhancements
-
-- Enhanced API functionality and usability to improve user experience, including but not limited to extending supported data types, checking API parameters, correcting API parameter defaults, and improving API return values. [#71997](https://github.com/PaddlePaddle/Paddle/pull/71997), [#72911](https://github.com/PaddlePaddle/Paddle/pull/72911), [#72985](https://github.com/PaddlePaddle/Paddle/pull/72985), [#73240](https://github.com/PaddlePaddle/Paddle/pull/73240), [#72927](https://github.com/PaddlePaddle/Paddle/pull/72927), [#73451](https://github.com/PaddlePaddle/Paddle/pull/73451), [#73416](https://github.com/PaddlePaddle/Paddle/pull/73416), [#73420](https://github.com/PaddlePaddle/Paddle/pull/73420), [#73347](https://github.com/PaddlePaddle/Paddle/pull/73347), [#73050](https://github.com/PaddlePaddle/Paddle/pull/73050), [#73246](https://github.com/PaddlePaddle/Paddle/pull/73246), [#73123](https://github.com/PaddlePaddle/Paddle/pull/73123), [#73336](https://github.com/PaddlePaddle/Paddle/pull/73336), 
[#73062](https://github.com/PaddlePaddle/Paddle/pull/73062), [#72201](https://github.com/PaddlePaddle/Paddle/pull/72201), [#72190](https://github.com/PaddlePaddle/Paddle/pull/72190) -- 增强 API 对复数类型的支持。[#72279](https://github.com/PaddlePaddle/Paddle/pull/72279), [#72308](https://github.com/PaddlePaddle/Paddle/pull/72308), [#72518](https://github.com/PaddlePaddle/Paddle/pull/72518), [#72391](https://github.com/PaddlePaddle/Paddle/pull/72391), [#72239](https://github.com/PaddlePaddle/Paddle/pull/72239), [#72286](https://github.com/PaddlePaddle/Paddle/pull/72286), [#72169](https://github.com/PaddlePaddle/Paddle/pull/72169), [#72577](https://github.com/PaddlePaddle/Paddle/pull/72577), [#72619](https://github.com/PaddlePaddle/Paddle/pull/72619) -- 增强 API 对 0-Size Tensor 的支持。[#72570](https://github.com/PaddlePaddle/Paddle/pull/72570), [#72692](https://github.com/PaddlePaddle/Paddle/pull/72692), [#72138](https://github.com/PaddlePaddle/Paddle/pull/72138), [#72410](https://github.com/PaddlePaddle/Paddle/pull/72410), [#72565](https://github.com/PaddlePaddle/Paddle/pull/72565), [#72262](https://github.com/PaddlePaddle/Paddle/pull/72262) -- 修改对 API 代码中的拼写错误,以提高整体的准确性和专业性。[#71780](https://github.com/PaddlePaddle/Paddle/pull/71780), [#71786](https://github.com/PaddlePaddle/Paddle/pull/71786), [#72093](https://github.com/PaddlePaddle/Paddle/pull/72093), [#72113](https://github.com/PaddlePaddle/Paddle/pull/72113), [#72241](https://github.com/PaddlePaddle/Paddle/pull/72241), [#72237](https://github.com/PaddlePaddle/Paddle/pull/72237), [#72590](https://github.com/PaddlePaddle/Paddle/pull/72590), [#72591](https://github.com/PaddlePaddle/Paddle/pull/72591), [#72769](https://github.com/PaddlePaddle/Paddle/pull/72769), [#72858](https://github.com/PaddlePaddle/Paddle/pull/72858), [#73045](https://github.com/PaddlePaddle/Paddle/pull/73045), [#72195](https://github.com/PaddlePaddle/Paddle/pull/72195), [#72627](https://github.com/PaddlePaddle/Paddle/pull/72627), 
[#72657](https://github.com/PaddlePaddle/Paddle/pull/72657), [#73162](https://github.com/PaddlePaddle/Paddle/pull/73162), [#73402](https://github.com/PaddlePaddle/Paddle/pull/73402), [#72208](https://github.com/PaddlePaddle/Paddle/pull/72208), [#72659](https://github.com/PaddlePaddle/Paddle/pull/72659), [#72658](https://github.com/PaddlePaddle/Paddle/pull/72658), [#72660](https://github.com/PaddlePaddle/Paddle/pull/72660), [#72661](https://github.com/PaddlePaddle/Paddle/pull/72661), [#72656](https://github.com/PaddlePaddle/Paddle/pull/72656) -- 通信优化减少显存峰值。[#72035](https://github.com/PaddlePaddle/Paddle/pull/72035) +- 针对返回多个Tensor的情况,通过paddle数据结构来封装,优化体验。包括 `paddle.topk`。[#74931](https://github.com/PaddlePaddle/Paddle/pull/74931) +- 创建类API支持size为可变参数的用法。[#74494](https://github.com/PaddlePaddle/Paddle/pull/74494) ### 文档 +- 新增或修复文档。[#74453](https://github.com/PaddlePaddle/Paddle/pull/74453),[#74846](https://github.com/PaddlePaddle/Paddle/pull/74846),[#74982](https://github.com/PaddlePaddle/Paddle/pull/74982) -- 修正了文档中的错误,提高了文档的可用性和用户体验。[#72549](https://github.com/PaddlePaddle/Paddle/pull/72549), [#73036](https://github.com/PaddlePaddle/Paddle/pull/73036) - -### 开发者相关 - -- 代码风格检查规则更新。[#72896](https://github.com/PaddlePaddle/Paddle/pull/72896), [#73179](https://github.com/PaddlePaddle/Paddle/pull/73179), [#73060](https://github.com/PaddlePaddle/Paddle/pull/73060), [#72553](https://github.com/PaddlePaddle/Paddle/pull/72553), [#72915](https://github.com/PaddlePaddle/Paddle/pull/72915), [#72916](https://github.com/PaddlePaddle/Paddle/pull/72916), [#73338](https://github.com/PaddlePaddle/Paddle/pull/73338), [#72935](https://github.com/PaddlePaddle/Paddle/pull/72935), [#72325](https://github.com/PaddlePaddle/Paddle/pull/72325), [#72935](https://github.com/PaddlePaddle/Paddle/pull/72935) -- 代码变量命名更新与代码迁移。[#73048](https://github.com/PaddlePaddle/Paddle/pull/73048), [#73148](https://github.com/PaddlePaddle/Paddle/pull/73148), 
[#73149](https://github.com/PaddlePaddle/Paddle/pull/73149), [#73264](https://github.com/PaddlePaddle/Paddle/pull/73264), [#73159](https://github.com/PaddlePaddle/Paddle/pull/73159), [#73124](https://github.com/PaddlePaddle/Paddle/pull/73124), [#73160](https://github.com/PaddlePaddle/Paddle/pull/73160), [#73161](https://github.com/PaddlePaddle/Paddle/pull/73161), [#73374](https://github.com/PaddlePaddle/Paddle/pull/73374), [#73395](https://github.com/PaddlePaddle/Paddle/pull/73395), [#73076](https://github.com/PaddlePaddle/Paddle/pull/73076), [#73163](https://github.com/PaddlePaddle/Paddle/pull/73163), [#73255](https://github.com/PaddlePaddle/Paddle/pull/73255) -- LodTensor 退场。[#71968](https://github.com/PaddlePaddle/Paddle/pull/71968), [#72152](https://github.com/PaddlePaddle/Paddle/pull/72152), [#72145](https://github.com/PaddlePaddle/Paddle/pull/72145) - -### 废弃代码清理 - -- 无用代码清理。[#71795](https://github.com/PaddlePaddle/Paddle/pull/71795), [#71792](https://github.com/PaddlePaddle/Paddle/pull/71792), [#71794](https://github.com/PaddlePaddle/Paddle/pull/71794), [#71793](https://github.com/PaddlePaddle/Paddle/pull/71793), [#72265](https://github.com/PaddlePaddle/Paddle/pull/72265), [#73167](https://github.com/PaddlePaddle/Paddle/pull/73167), [#73115](https://github.com/PaddlePaddle/Paddle/pull/73115), [#73049](https://github.com/PaddlePaddle/Paddle/pull/73049), [#72162](https://github.com/PaddlePaddle/Paddle/pull/72162), [#72321](https://github.com/PaddlePaddle/Paddle/pull/72321), [#72336](https://github.com/PaddlePaddle/Paddle/pull/72336), [#72952](https://github.com/PaddlePaddle/Paddle/pull/72952), [#72828](https://github.com/PaddlePaddle/Paddle/pull/72828) +### 其他 +- 
代码风格相关的优化。[#74654](https://github.com/PaddlePaddle/Paddle/pull/74654),[#74655](https://github.com/PaddlePaddle/Paddle/pull/74655),[#74665](https://github.com/PaddlePaddle/Paddle/pull/74665),[#74660](https://github.com/PaddlePaddle/Paddle/pull/74660),[#74667](https://github.com/PaddlePaddle/Paddle/pull/74667),[#74664](https://github.com/PaddlePaddle/Paddle/pull/74664),[#74662](https://github.com/PaddlePaddle/Paddle/pull/74662),[#74661](https://github.com/PaddlePaddle/Paddle/pull/74661),[#74658](https://github.com/PaddlePaddle/Paddle/pull/74658),[#74657](https://github.com/PaddlePaddle/Paddle/pull/74657),[#74666](https://github.com/PaddlePaddle/Paddle/pull/74666),[#74659](https://github.com/PaddlePaddle/Paddle/pull/74659),[#74663](https://github.com/PaddlePaddle/Paddle/pull/74663),[#74656](https://github.com/PaddlePaddle/Paddle/pull/74656),[#74673](https://github.com/PaddlePaddle/Paddle/pull/74673),[#74672](https://github.com/PaddlePaddle/Paddle/pull/74672),[#74671](https://github.com/PaddlePaddle/Paddle/pull/74671),[#74674](https://github.com/PaddlePaddle/Paddle/pull/74674),[#74675](https://github.com/PaddlePaddle/Paddle/pull/74675),[#74670](https://github.com/PaddlePaddle/Paddle/pull/74670),[#74669](https://github.com/PaddlePaddle/Paddle/pull/74669),[#74677](https://github.com/PaddlePaddle/Paddle/pull/74677),[#74709](https://github.com/PaddlePaddle/Paddle/pull/74709),[#74714](https://github.com/PaddlePaddle/Paddle/pull/74714),[#74712](https://github.com/PaddlePaddle/Paddle/pull/74712),[#74713](https://github.com/PaddlePaddle/Paddle/pull/74713),[#74704](https://github.com/PaddlePaddle/Paddle/pull/74704),[#74746](https://github.com/PaddlePaddle/Paddle/pull/74746),[#74748](https://github.com/PaddlePaddle/Paddle/pull/74748),[#74743](https://github.com/PaddlePaddle/Paddle/pull/74743),[#74742](https://github.com/PaddlePaddle/Paddle/pull/74742),[#74744](https://github.com/PaddlePaddle/Paddle/pull/74744),[#74745](https://github.com/PaddlePaddle/Paddle/pull/74745),[#74747](h
ttps://github.com/PaddlePaddle/Paddle/pull/74747),[#74794](https://github.com/PaddlePaddle/Paddle/pull/74794),[#74789](https://github.com/PaddlePaddle/Paddle/pull/74789),[#74793](https://github.com/PaddlePaddle/Paddle/pull/74793),[#74786](https://github.com/PaddlePaddle/Paddle/pull/74786),[#74791](https://github.com/PaddlePaddle/Paddle/pull/74791),[#74787](https://github.com/PaddlePaddle/Paddle/pull/74787),[#74827](https://github.com/PaddlePaddle/Paddle/pull/74827),[#74608](https://github.com/PaddlePaddle/Paddle/pull/74608),[#74288](https://github.com/PaddlePaddle/Paddle/pull/74288),[#74287](https://github.com/PaddlePaddle/Paddle/pull/74287),[#74385](https://github.com/PaddlePaddle/Paddle/pull/74385),[#74395](https://github.com/PaddlePaddle/Paddle/pull/74395),[#74475](https://github.com/PaddlePaddle/Paddle/pull/74475),[#74647](https://github.com/PaddlePaddle/Paddle/pull/74647) +- MKLDNN/ONEDNN相关的优化。[#74299](https://github.com/PaddlePaddle/Paddle/pull/74299),[#74244](https://github.com/PaddlePaddle/Paddle/pull/74244),[#74230](https://github.com/PaddlePaddle/Paddle/pull/74230),[#74314](https://github.com/PaddlePaddle/Paddle/pull/74314),[#74327](https://github.com/PaddlePaddle/Paddle/pull/74327),[#74325](https://github.com/PaddlePaddle/Paddle/pull/74325),[#74326](https://github.com/PaddlePaddle/Paddle/pull/74326),[#74315](https://github.com/PaddlePaddle/Paddle/pull/74315),[#74399](https://github.com/PaddlePaddle/Paddle/pull/74399),[#74398](https://github.com/PaddlePaddle/Paddle/pull/74398),[#74393](https://github.com/PaddlePaddle/Paddle/pull/74393),[#74392](https://github.com/PaddlePaddle/Paddle/pull/74392),[#74367](https://github.com/PaddlePaddle/Paddle/pull/74367),[#74391](https://github.com/PaddlePaddle/Paddle/pull/74391),[#74423](https://github.com/PaddlePaddle/Paddle/pull/74423),[#74424](https://github.com/PaddlePaddle/Paddle/pull/74424),[#74436](https://github.com/PaddlePaddle/Paddle/pull/74436),[#74417](https://github.com/PaddlePaddle/Paddle/pull/74417),[#74410]
(https://github.com/PaddlePaddle/Paddle/pull/74410),[#74473](https://github.com/PaddlePaddle/Paddle/pull/74473),[#74458](https://github.com/PaddlePaddle/Paddle/pull/74458),[#74501](https://github.com/PaddlePaddle/Paddle/pull/74501),[#74487](https://github.com/PaddlePaddle/Paddle/pull/74487),[#74502](https://github.com/PaddlePaddle/Paddle/pull/74502),[#74513](https://github.com/PaddlePaddle/Paddle/pull/74513),[#74518](https://github.com/PaddlePaddle/Paddle/pull/74518),[#74516](https://github.com/PaddlePaddle/Paddle/pull/74516),[#74507](https://github.com/PaddlePaddle/Paddle/pull/74507),[#74504](https://github.com/PaddlePaddle/Paddle/pull/74504),[#74505](https://github.com/PaddlePaddle/Paddle/pull/74505),[#74509](https://github.com/PaddlePaddle/Paddle/pull/74509),[#74535](https://github.com/PaddlePaddle/Paddle/pull/74535),[#74536](https://github.com/PaddlePaddle/Paddle/pull/74536),[#74517](https://github.com/PaddlePaddle/Paddle/pull/74517),[#74503](https://github.com/PaddlePaddle/Paddle/pull/74503),[#74557](https://github.com/PaddlePaddle/Paddle/pull/74557),[#74550](https://github.com/PaddlePaddle/Paddle/pull/74550),[#74575](https://github.com/PaddlePaddle/Paddle/pull/74575),[#74587](https://github.com/PaddlePaddle/Paddle/pull/74587),[#74576](https://github.com/PaddlePaddle/Paddle/pull/74576),[#74588](https://github.com/PaddlePaddle/Paddle/pull/74588),[#74549](https://github.com/PaddlePaddle/Paddle/pull/74549),[#74581](https://github.com/PaddlePaddle/Paddle/pull/74581),[#74583](https://github.com/PaddlePaddle/Paddle/pull/74583),[#74628](https://github.com/PaddlePaddle/Paddle/pull/74628),[#74630](https://github.com/PaddlePaddle/Paddle/pull/74630),[#74635](https://github.com/PaddlePaddle/Paddle/pull/74635),[#74679](https://github.com/PaddlePaddle/Paddle/pull/74679),[#74648](https://github.com/PaddlePaddle/Paddle/pull/74648),[#74127](https://github.com/PaddlePaddle/Paddle/pull/74127),[#74636](https://github.com/PaddlePaddle/Paddle/pull/74636),[#74552](https://github.com/
PaddlePaddle/Paddle/pull/74552),[#74551](https://github.com/PaddlePaddle/Paddle/pull/74551),[#74678](https://github.com/PaddlePaddle/Paddle/pull/74678),[#74680](https://github.com/PaddlePaddle/Paddle/pull/74680),[#74730](https://github.com/PaddlePaddle/Paddle/pull/74730),[#74751](https://github.com/PaddlePaddle/Paddle/pull/74751),[#74895](https://github.com/PaddlePaddle/Paddle/pull/74895),[#74821](https://github.com/PaddlePaddle/Paddle/pull/74821),[#74897](https://github.com/PaddlePaddle/Paddle/pull/74897),[#74734](https://github.com/PaddlePaddle/Paddle/pull/74734) +- 代码实现相关的优化,变量与文件重命名。[#74309](https://github.com/PaddlePaddle/Paddle/pull/74309),[#74597](https://github.com/PaddlePaddle/Paddle/pull/74597),[#74613](https://github.com/PaddlePaddle/Paddle/pull/74613),[#74376](https://github.com/PaddlePaddle/Paddle/pull/74376),[#74479](https://github.com/PaddlePaddle/Paddle/pull/74479),[#74960](https://github.com/PaddlePaddle/Paddle/pull/74960),[#74968](https://github.com/PaddlePaddle/Paddle/pull/74968),[#74977](https://github.com/PaddlePaddle/Paddle/pull/74977) +- 单测相关的优化,单测问题修复。[#74595](https://github.com/PaddlePaddle/Paddle/pull/74595) +- 编译相关的优化,CI问题修复。[#74356](https://github.com/PaddlePaddle/Paddle/pull/74356),[#74936](https://github.com/PaddlePaddle/Paddle/pull/74936) +- 优化调试与打印信息,优化报错信息。[#74765](https://github.com/PaddlePaddle/Paddle/pull/74765),[#74381](https://github.com/PaddlePaddle/Paddle/pull/74381),[#74384](https://github.com/PaddlePaddle/Paddle/pull/74384),[#74386](https://github.com/PaddlePaddle/Paddle/pull/74386),[#74387](https://github.com/PaddlePaddle/Paddle/pull/74387),[#74383](https://github.com/PaddlePaddle/Paddle/pull/74383),[#74519](https://github.com/PaddlePaddle/Paddle/pull/74519),[#74520](https://github.com/PaddlePaddle/Paddle/pull/74520),[#74468](https://github.com/PaddlePaddle/Paddle/pull/74468) +- 自定义算子相关优化。[#74402](https://github.com/PaddlePaddle/Paddle/pull/74402) +- 
Distributed FlexCheckpoint support. [#74966](https://github.com/PaddlePaddle/Paddle/pull/74966), [#74593](https://github.com/PaddlePaddle/Paddle/pull/74593), [#74785](https://github.com/PaddlePaddle/Paddle/pull/74785), [#74814](https://github.com/PaddlePaddle/Paddle/pull/74814)
 ## 2. Basic Execution Architecture
-FP8 matrix operations are supported to improve model training efficiency, and several models are enhanced for better stability; backward interfaces can be called through C_ops, which facilitates GPU memory optimization and functional experiments.
-
-### New Features
-
-- Supported FP8 matrix multiplication acceleration, improving computing performance and precision adaptability. [#73092](https://github.com/PaddlePaddle/Paddle/pull/73092)
-- Supported execution of 0-size Tensors. [#71829](https://github.com/PaddlePaddle/Paddle/pull/71829), [#72263](https://github.com/PaddlePaddle/Paddle/pull/72263), [#72244](https://github.com/PaddlePaddle/Paddle/pull/72244), [#72814](https://github.com/PaddlePaddle/Paddle/pull/72814)
-- Supported DeepEP. [#73495](https://github.com/PaddlePaddle/Paddle/pull/73495)
-- Enabled the CINN backend by default. [#71838](https://github.com/PaddlePaddle/Paddle/pull/71838)
-- Supported SOT-related execution. [#72472](https://github.com/PaddlePaddle/Paddle/pull/72472), [#72559](https://github.com/PaddlePaddle/Paddle/pull/72559), [#72466](https://github.com/PaddlePaddle/Paddle/pull/72466), [#73269](https://github.com/PaddlePaddle/Paddle/pull/73269), [#73329](https://github.com/PaddlePaddle/Paddle/pull/73329), [#73405](https://github.com/PaddlePaddle/Paddle/pull/73405), [#73399](https://github.com/PaddlePaddle/Paddle/pull/73399), [#73424](https://github.com/PaddlePaddle/Paddle/pull/73424), [#73509](https://github.com/PaddlePaddle/Paddle/pull/73509)
-- Supported dynamic-to-static conversion. [#73417](https://github.com/PaddlePaddle/Paddle/pull/73417), [#73081](https://github.com/PaddlePaddle/Paddle/pull/73081)
-- Added kernels that support the stride mechanism. [#73053](https://github.com/PaddlePaddle/Paddle/pull/73053)
-
-### Bug Fixes
-
-- Performance optimization and stability: optimized training stability, enhanced Python 3.11+ support, improved the logic for automatically enabling the CINN compiler in dynamic graph mode, fixed dynamic shape inference and gradient backpropagation issues, optimized GPU kernel execution efficiency (such as for_range and constant folding), improved NPU memory copy and context management, and improved large-scale model training performance and hardware utilization. [#71777](https://github.com/PaddlePaddle/Paddle/pull/71777), [#71837](https://github.com/PaddlePaddle/Paddle/pull/71837), 
[#71834](https://github.com/PaddlePaddle/Paddle/pull/71834), [#71950](https://github.com/PaddlePaddle/Paddle/pull/71950), [#71960](https://github.com/PaddlePaddle/Paddle/pull/71960), [#72103](https://github.com/PaddlePaddle/Paddle/pull/72103), [#70652](https://github.com/PaddlePaddle/Paddle/pull/70652), [#72313](https://github.com/PaddlePaddle/Paddle/pull/72313), [#72405](https://github.com/PaddlePaddle/Paddle/pull/72405), [#72581](https://github.com/PaddlePaddle/Paddle/pull/72581), [#73418](https://github.com/PaddlePaddle/Paddle/pull/73418) -- 大 Tensor 支持扩展:扩展算子对超大尺寸 Tensor 的支持,包括数学运算(lerp/mean/bmm/trapezoid)、张量操作(arg_min_max/diag/prelu)、填充(pad)、比较(allclose/isclose)及融合算子(softmax_mask_fuse)等,解决混合精度训练中的兼容性问题。 [#71916](https://github.com/PaddlePaddle/Paddle/pull/71916), [#71970](https://github.com/PaddlePaddle/Paddle/pull/71970), [#72516](https://github.com/PaddlePaddle/Paddle/pull/72516), [#72517](https://github.com/PaddlePaddle/Paddle/pull/72517), [#72638](https://github.com/PaddlePaddle/Paddle/pull/72638), [#72652](https://github.com/PaddlePaddle/Paddle/pull/72652), [#73046](https://github.com/PaddlePaddle/Paddle/pull/73046), [#73093](https://github.com/PaddlePaddle/Paddle/pull/73093), [#73136](https://github.com/PaddlePaddle/Paddle/pull/73136), [#72679](https://github.com/PaddlePaddle/Paddle/pull/72679), [#73174](https://github.com/PaddlePaddle/Paddle/pull/73174), [#73198](https://github.com/PaddlePaddle/Paddle/pull/73198), [#73121](https://github.com/PaddlePaddle/Paddle/pull/73121), [#73096](https://github.com/PaddlePaddle/Paddle/pull/73096), [#73261](https://github.com/PaddlePaddle/Paddle/pull/73261), [#73201](https://github.com/PaddlePaddle/Paddle/pull/73201), [#73291](https://github.com/PaddlePaddle/Paddle/pull/73291), [#73373](https://github.com/PaddlePaddle/Paddle/pull/73373), [#73318](https://github.com/PaddlePaddle/Paddle/pull/73318), [#73436](https://github.com/PaddlePaddle/Paddle/pull/73436), [#72705](https://github.com/PaddlePaddle/Paddle/pull/72705), 
[#72276](https://github.com/PaddlePaddle/Paddle/pull/72276), [#73135](https://github.com/PaddlePaddle/Paddle/pull/73135), [#73304](https://github.com/PaddlePaddle/Paddle/pull/73304), [#73381](https://github.com/PaddlePaddle/Paddle/pull/73381), [#72712](https://github.com/PaddlePaddle/Paddle/pull/72712), [#72717](https://github.com/PaddlePaddle/Paddle/pull/72717), [#72634](https://github.com/PaddlePaddle/Paddle/pull/72634), [#72562](https://github.com/PaddlePaddle/Paddle/pull/72562), [#72628](https://github.com/PaddlePaddle/Paddle/pull/72628), [#72706](https://github.com/PaddlePaddle/Paddle/pull/72706), [#72831](https://github.com/PaddlePaddle/Paddle/pull/72831), [#72888](https://github.com/PaddlePaddle/Paddle/pull/72888), [#72753](https://github.com/PaddlePaddle/Paddle/pull/72753), [#72931](https://github.com/PaddlePaddle/Paddle/pull/72931), [#73021](https://github.com/PaddlePaddle/Paddle/pull/73021), [#73064](https://github.com/PaddlePaddle/Paddle/pull/73064), [#73069](https://github.com/PaddlePaddle/Paddle/pull/73069), [#73153](https://github.com/PaddlePaddle/Paddle/pull/73153), [#73118](https://github.com/PaddlePaddle/Paddle/pull/73118), [#73252](https://github.com/PaddlePaddle/Paddle/pull/73252), [#73253](https://github.com/PaddlePaddle/Paddle/pull/73253), [#73262](https://github.com/PaddlePaddle/Paddle/pull/73262), [#73259](https://github.com/PaddlePaddle/Paddle/pull/73259), [#73288](https://github.com/PaddlePaddle/Paddle/pull/73288), [#73105](https://github.com/PaddlePaddle/Paddle/pull/73105), [#73275](https://github.com/PaddlePaddle/Paddle/pull/73275), [#73284](https://github.com/PaddlePaddle/Paddle/pull/73284), [#73110](https://github.com/PaddlePaddle/Paddle/pull/73110), [#73335](https://github.com/PaddlePaddle/Paddle/pull/73335), [#73342](https://github.com/PaddlePaddle/Paddle/pull/73342), [#73447](https://github.com/PaddlePaddle/Paddle/pull/73447), [#73460](https://github.com/PaddlePaddle/Paddle/pull/73460), 
[#73194](https://github.com/PaddlePaddle/Paddle/pull/73194) -- 0-Size Tensor 问题修复:修复 0-Size Tensor 导致的计算异常,覆盖池化(max_pool1d/lp_pool1d)、排序(matrix_rank)、统计(std/nanmedian)及元素级操作(elementwise compare)等,确保极端输入场景下的数值稳定性与 API 一致性。 [#71961](https://github.com/PaddlePaddle/Paddle/pull/71961), [#72017](https://github.com/PaddlePaddle/Paddle/pull/72017), [#72785](https://github.com/PaddlePaddle/Paddle/pull/72785), [#73214](https://github.com/PaddlePaddle/Paddle/pull/73214), [#73263](https://github.com/PaddlePaddle/Paddle/pull/73263), [#73267](https://github.com/PaddlePaddle/Paddle/pull/73267), [#73280](https://github.com/PaddlePaddle/Paddle/pull/73280), [#72444](https://github.com/PaddlePaddle/Paddle/pull/72444), [#72437](https://github.com/PaddlePaddle/Paddle/pull/72437), [#72460](https://github.com/PaddlePaddle/Paddle/pull/72460), [#73090](https://github.com/PaddlePaddle/Paddle/pull/73090), [#73516](https://github.com/PaddlePaddle/Paddle/pull/73516), [#72807](https://github.com/PaddlePaddle/Paddle/pull/72807), [#72799](https://github.com/PaddlePaddle/Paddle/pull/72799), [#72800](https://github.com/PaddlePaddle/Paddle/pull/72800), [#72809](https://github.com/PaddlePaddle/Paddle/pull/72809), [#73497](https://github.com/PaddlePaddle/Paddle/pull/73497) -- API 功能增强与兼容性:新增对 Python 标准库类型(dataclasses)的支持,扩展 API 数据类型兼容性(bfloat16 参数创建、-1 维自动推断),修复 NumPy API 交互错误,优化 BatchNorm 内存布局。 [#72059](https://github.com/PaddlePaddle/Paddle/pull/72059), [#72283](https://github.com/PaddlePaddle/Paddle/pull/72283), [#72451](https://github.com/PaddlePaddle/Paddle/pull/72451), [#72512](https://github.com/PaddlePaddle/Paddle/pull/72512), [#72618](https://github.com/PaddlePaddle/Paddle/pull/72618), [#72976](https://github.com/PaddlePaddle/Paddle/pull/72976), [#73084](https://github.com/PaddlePaddle/Paddle/pull/73084), [#73205](https://github.com/PaddlePaddle/Paddle/pull/73205), [#73250](https://github.com/PaddlePaddle/Paddle/pull/73250), [#73111](https://github.com/PaddlePaddle/Paddle/pull/73111), 
[#73260](https://github.com/PaddlePaddle/Paddle/pull/73260), [#72094](https://github.com/PaddlePaddle/Paddle/pull/72094), [#71844](https://github.com/PaddlePaddle/Paddle/pull/71844), [#71357](https://github.com/PaddlePaddle/Paddle/pull/71357) -- 内存管理与错误修复:解决内存越界(set_value/nonzero)、空指针(data nullptr)、CUDA graph 分配失败等高危问题,修复梯度裁剪(clip_grad)、张量赋值(assign)、广播(broadcast)等核心操作的内存泄漏与计算错误,优化 NPU 异步执行与预测器 GIL 释放逻辑,提升系统健壮性。 [#71895](https://github.com/PaddlePaddle/Paddle/pull/71895), [#72101](https://github.com/PaddlePaddle/Paddle/pull/72101), [#72133](https://github.com/PaddlePaddle/Paddle/pull/72133), [#72149](https://github.com/PaddlePaddle/Paddle/pull/72149), [#72176](https://github.com/PaddlePaddle/Paddle/pull/72176), [#72314](https://github.com/PaddlePaddle/Paddle/pull/72314), [#72256](https://github.com/PaddlePaddle/Paddle/pull/72256), [#72757](https://github.com/PaddlePaddle/Paddle/pull/72757), [#72749](https://github.com/PaddlePaddle/Paddle/pull/72749), [#72792](https://github.com/PaddlePaddle/Paddle/pull/72792), [#72815](https://github.com/PaddlePaddle/Paddle/pull/72815), [#72819](https://github.com/PaddlePaddle/Paddle/pull/72819), [#72958](https://github.com/PaddlePaddle/Paddle/pull/72958), [#73023](https://github.com/PaddlePaddle/Paddle/pull/73023), [#73103](https://github.com/PaddlePaddle/Paddle/pull/73103), [#73014](https://github.com/PaddlePaddle/Paddle/pull/73014), [#73137](https://github.com/PaddlePaddle/Paddle/pull/73137), [#73256](https://github.com/PaddlePaddle/Paddle/pull/73256), [#73211](https://github.com/PaddlePaddle/Paddle/pull/73211), [#73251](https://github.com/PaddlePaddle/Paddle/pull/73251), [#73210](https://github.com/PaddlePaddle/Paddle/pull/73210), [#73415](https://github.com/PaddlePaddle/Paddle/pull/73415), [#73206](https://github.com/PaddlePaddle/Paddle/pull/73206), [#71983](https://github.com/PaddlePaddle/Paddle/pull/71983), [#72485](https://github.com/PaddlePaddle/Paddle/pull/72485), [#72561](https://github.com/PaddlePaddle/Paddle/pull/72561) 
-- 其他重要修复:修复科学计算、save/load 等模块缺陷,改进 Slice 算子内核配置,优化动态 shaoe 推断的回退策略,完善异常抛出与类型检查逻辑等。 [#71810](https://github.com/PaddlePaddle/Paddle/pull/71810), [#72246](https://github.com/PaddlePaddle/Paddle/pull/72246), [#72378](https://github.com/PaddlePaddle/Paddle/pull/72378), [#72467](https://github.com/PaddlePaddle/Paddle/pull/72467), [#72635](https://github.com/PaddlePaddle/Paddle/pull/72635), [#72751](https://github.com/PaddlePaddle/Paddle/pull/72751), [#72044](https://github.com/PaddlePaddle/Paddle/pull/72044), [#72051](https://github.com/PaddlePaddle/Paddle/pull/72051), [#73231](https://github.com/PaddlePaddle/Paddle/pull/73231), [#73109](https://github.com/PaddlePaddle/Paddle/pull/73109)
-- SOT 相关问题修复, [#71932](https://github.com/PaddlePaddle/Paddle/pull/71932), [#71971](https://github.com/PaddlePaddle/Paddle/pull/71971), [#72194](https://github.com/PaddlePaddle/Paddle/pull/72194), [#72288](https://github.com/PaddlePaddle/Paddle/pull/72288), [#72306](https://github.com/PaddlePaddle/Paddle/pull/72306), [#72367](https://github.com/PaddlePaddle/Paddle/pull/72367), [#72495](https://github.com/PaddlePaddle/Paddle/pull/72495), [#72522](https://github.com/PaddlePaddle/Paddle/pull/72522), [#72704](https://github.com/PaddlePaddle/Paddle/pull/72704), [#72631](https://github.com/PaddlePaddle/Paddle/pull/72631), [#72737](https://github.com/PaddlePaddle/Paddle/pull/72737), [#73067](https://github.com/PaddlePaddle/Paddle/pull/73067), [#73030](https://github.com/PaddlePaddle/Paddle/pull/73030), [#73059](https://github.com/PaddlePaddle/Paddle/pull/73059), [#73282](https://github.com/PaddlePaddle/Paddle/pull/73282), [#73511](https://github.com/PaddlePaddle/Paddle/pull/73511), [#73526](https://github.com/PaddlePaddle/Paddle/pull/73526), [#73549](https://github.com/PaddlePaddle/Paddle/pull/73549), [#73515](https://github.com/PaddlePaddle/Paddle/pull/73515)
+### New Features
+- Dynamic graph support. [#74484](https://github.com/PaddlePaddle/Paddle/pull/74484)
+- Support 
safetensors. [#74642](https://github.com/PaddlePaddle/Paddle/pull/74642), [#74609](https://github.com/PaddlePaddle/Paddle/pull/74609), [#75049](https://github.com/PaddlePaddle/Paddle/pull/75049)
+- Add an offloader to improve computational efficiency. [#74837](https://github.com/PaddlePaddle/Paddle/pull/74837)
+- Add API support for the conv_transpose forward computation. [#74431](https://github.com/PaddlePaddle/Paddle/pull/74431)
+- Inference deployment adds w4afp8 quantized inference, with support for pure permutation of w4afp8 quantized weights and all2all communication. [#74270](https://github.com/PaddlePaddle/Paddle/pull/74270)
+
+### Bug Fixes
+- Core framework and infrastructure fixes. [#74336](https://github.com/PaddlePaddle/Paddle/pull/74336), [#74554](https://github.com/PaddlePaddle/Paddle/pull/74554), [#74634](https://github.com/PaddlePaddle/Paddle/pull/74634)
+- Computation precision and type handling. [#74278](https://github.com/PaddlePaddle/Paddle/pull/74278), [#74222](https://github.com/PaddlePaddle/Paddle/pull/74222), [#74830](https://github.com/PaddlePaddle/Paddle/pull/74830)
+- Dynamic dimension check logic improvements. [#74633](https://github.com/PaddlePaddle/Paddle/pull/74633), [#74650](https://github.com/PaddlePaddle/Paddle/pull/74650)
+- Memory and illegal access fixes. [#74347](https://github.com/PaddlePaddle/Paddle/pull/74347), [#73443](https://github.com/PaddlePaddle/Paddle/pull/73443), [#74953](https://github.com/PaddlePaddle/Paddle/pull/74953)
+- Fix error and warning message printing. [#74474](https://github.com/PaddlePaddle/Paddle/pull/74474), [#74533](https://github.com/PaddlePaddle/Paddle/pull/74533), [#74685](https://github.com/PaddlePaddle/Paddle/pull/74685), [#74721](https://github.com/PaddlePaddle/Paddle/pull/74721), [#74754](https://github.com/PaddlePaddle/Paddle/pull/74754)
+- Code quality and documentation corrections. [#74378](https://github.com/PaddlePaddle/Paddle/pull/74378), [#74828](https://github.com/PaddlePaddle/Paddle/pull/74828)
+- Fix the flashmask API handling logic. [#74928](https://github.com/PaddlePaddle/Paddle/pull/74928)
+- Fix an issue where CudaGraph subgraph splitting did not take effect in dynamic-to-static mode. ([#74749](https://github.com/PaddlePaddle/Paddle/pull/74749))
 ### 功能增强
-
-- Paddle API 0-size 机制建设。 
[#72721](https://github.com/PaddlePaddle/Paddle/pull/72721), [#72756](https://github.com/PaddlePaddle/Paddle/pull/72756), [#72790](https://github.com/PaddlePaddle/Paddle/pull/72790), [#72806](https://github.com/PaddlePaddle/Paddle/pull/72806), [#72764](https://github.com/PaddlePaddle/Paddle/pull/72764), [#72786](https://github.com/PaddlePaddle/Paddle/pull/72786), [#72853](https://github.com/PaddlePaddle/Paddle/pull/72853), [#72826](https://github.com/PaddlePaddle/Paddle/pull/72826), [#72851](https://github.com/PaddlePaddle/Paddle/pull/72851), [#72928](https://github.com/PaddlePaddle/Paddle/pull/72928), [#72912](https://github.com/PaddlePaddle/Paddle/pull/72912), [#72922](https://github.com/PaddlePaddle/Paddle/pull/72922), [#72924](https://github.com/PaddlePaddle/Paddle/pull/72924), [#72887](https://github.com/PaddlePaddle/Paddle/pull/72887), [#72921](https://github.com/PaddlePaddle/Paddle/pull/72921), [#72906](https://github.com/PaddlePaddle/Paddle/pull/72906), [#72895](https://github.com/PaddlePaddle/Paddle/pull/72895), [#72821](https://github.com/PaddlePaddle/Paddle/pull/72821), [#72914](https://github.com/PaddlePaddle/Paddle/pull/72914), [#72936](https://github.com/PaddlePaddle/Paddle/pull/72936), [#72943](https://github.com/PaddlePaddle/Paddle/pull/72943), [#72694](https://github.com/PaddlePaddle/Paddle/pull/72694), [#72919](https://github.com/PaddlePaddle/Paddle/pull/72919), [#72940](https://github.com/PaddlePaddle/Paddle/pull/72940), [#72820](https://github.com/PaddlePaddle/Paddle/pull/72820), [#72934](https://github.com/PaddlePaddle/Paddle/pull/72934), [#72975](https://github.com/PaddlePaddle/Paddle/pull/72975), [#72872](https://github.com/PaddlePaddle/Paddle/pull/72872), [#72984](https://github.com/PaddlePaddle/Paddle/pull/72984), [#72988](https://github.com/PaddlePaddle/Paddle/pull/72988), [#72972](https://github.com/PaddlePaddle/Paddle/pull/72972), [#72977](https://github.com/PaddlePaddle/Paddle/pull/72977), 
[#72937](https://github.com/PaddlePaddle/Paddle/pull/72937), [#73086](https://github.com/PaddlePaddle/Paddle/pull/73086), [#73042](https://github.com/PaddlePaddle/Paddle/pull/73042), [#73017](https://github.com/PaddlePaddle/Paddle/pull/73017), [#73044](https://github.com/PaddlePaddle/Paddle/pull/73044), [#73077](https://github.com/PaddlePaddle/Paddle/pull/73077), [#73108](https://github.com/PaddlePaddle/Paddle/pull/73108), [#73027](https://github.com/PaddlePaddle/Paddle/pull/73027), [#72970](https://github.com/PaddlePaddle/Paddle/pull/72970), [#73008](https://github.com/PaddlePaddle/Paddle/pull/73008), [#72996](https://github.com/PaddlePaddle/Paddle/pull/72996), [#73165](https://github.com/PaddlePaddle/Paddle/pull/73165), [#73166](https://github.com/PaddlePaddle/Paddle/pull/73166), [#73170](https://github.com/PaddlePaddle/Paddle/pull/73170), [#73122](https://github.com/PaddlePaddle/Paddle/pull/73122), [#73204](https://github.com/PaddlePaddle/Paddle/pull/73204), [#73207](https://github.com/PaddlePaddle/Paddle/pull/73207), [#73186](https://github.com/PaddlePaddle/Paddle/pull/73186), [#73197](https://github.com/PaddlePaddle/Paddle/pull/73197), [#73168](https://github.com/PaddlePaddle/Paddle/pull/73168), [#73172](https://github.com/PaddlePaddle/Paddle/pull/73172), [#73125](https://github.com/PaddlePaddle/Paddle/pull/73125), [#73181](https://github.com/PaddlePaddle/Paddle/pull/73181), [#73270](https://github.com/PaddlePaddle/Paddle/pull/73270), [#73028](https://github.com/PaddlePaddle/Paddle/pull/73028), [#73094](https://github.com/PaddlePaddle/Paddle/pull/73094), [#73180](https://github.com/PaddlePaddle/Paddle/pull/73180), [#73276](https://github.com/PaddlePaddle/Paddle/pull/73276), [#73333](https://github.com/PaddlePaddle/Paddle/pull/73333), [#73341](https://github.com/PaddlePaddle/Paddle/pull/73341), [#73299](https://github.com/PaddlePaddle/Paddle/pull/73299), [#73346](https://github.com/PaddlePaddle/Paddle/pull/73346), 
[#73361](https://github.com/PaddlePaddle/Paddle/pull/73361), [#73375](https://github.com/PaddlePaddle/Paddle/pull/73375), [#73152](https://github.com/PaddlePaddle/Paddle/pull/73152), [#73377](https://github.com/PaddlePaddle/Paddle/pull/73377), [#73355](https://github.com/PaddlePaddle/Paddle/pull/73355), [#73382](https://github.com/PaddlePaddle/Paddle/pull/73382), [#73385](https://github.com/PaddlePaddle/Paddle/pull/73385), [#73386](https://github.com/PaddlePaddle/Paddle/pull/73386), [#73352](https://github.com/PaddlePaddle/Paddle/pull/73352), [#73387](https://github.com/PaddlePaddle/Paddle/pull/73387), [#73401](https://github.com/PaddlePaddle/Paddle/pull/73401), [#73384](https://github.com/PaddlePaddle/Paddle/pull/73384), [#73450](https://github.com/PaddlePaddle/Paddle/pull/73450), [#73437](https://github.com/PaddlePaddle/Paddle/pull/73437), [#73503](https://github.com/PaddlePaddle/Paddle/pull/73503), [#73507](https://github.com/PaddlePaddle/Paddle/pull/73507), [#73477](https://github.com/PaddlePaddle/Paddle/pull/73477), [#73513](https://github.com/PaddlePaddle/Paddle/pull/73513), [#73525](https://github.com/PaddlePaddle/Paddle/pull/73525), [#73528](https://github.com/PaddlePaddle/Paddle/pull/73528), [#73517](https://github.com/PaddlePaddle/Paddle/pull/73517), [#72898](https://github.com/PaddlePaddle/Paddle/pull/72898), [#72880](https://github.com/PaddlePaddle/Paddle/pull/72880), [#72864](https://github.com/PaddlePaddle/Paddle/pull/72864), [#72993](https://github.com/PaddlePaddle/Paddle/pull/72993), [#72954](https://github.com/PaddlePaddle/Paddle/pull/72954), [#72866](https://github.com/PaddlePaddle/Paddle/pull/72866), [#72878](https://github.com/PaddlePaddle/Paddle/pull/72878), [#72889](https://github.com/PaddlePaddle/Paddle/pull/72889), [#72861](https://github.com/PaddlePaddle/Paddle/pull/72861), [#72837](https://github.com/PaddlePaddle/Paddle/pull/72837) -- SOT 相关提升:增强了功能(如 NumPy 互操作性和 super 支持)、改进训练稳定性,修复多个问题以提升代码健壮性, 
[#71763](https://github.com/PaddlePaddle/Paddle/pull/71763), [#71666](https://github.com/PaddlePaddle/Paddle/pull/71666), [#71858](https://github.com/PaddlePaddle/Paddle/pull/71858), [#71865](https://github.com/PaddlePaddle/Paddle/pull/71865), [#72474](https://github.com/PaddlePaddle/Paddle/pull/72474), [#72154](https://github.com/PaddlePaddle/Paddle/pull/72154), [#72784](https://github.com/PaddlePaddle/Paddle/pull/72784), [#72956](https://github.com/PaddlePaddle/Paddle/pull/72956), [#73038](https://github.com/PaddlePaddle/Paddle/pull/73038), [#73066](https://github.com/PaddlePaddle/Paddle/pull/73066), [#73287](https://github.com/PaddlePaddle/Paddle/pull/73287), [#73278](https://github.com/PaddlePaddle/Paddle/pull/73278), [#73332](https://github.com/PaddlePaddle/Paddle/pull/73332), [#73372](https://github.com/PaddlePaddle/Paddle/pull/73372), [#73412](https://github.com/PaddlePaddle/Paddle/pull/73412), [#73407](https://github.com/PaddlePaddle/Paddle/pull/73407), [#73506](https://github.com/PaddlePaddle/Paddle/pull/73506) -- 代码风格重构:通过代码重构及跨平台内核行为统一,提升代码质量与可维护性,并新增了 YAML 格式预提交检查工具, [#72216](https://github.com/PaddlePaddle/Paddle/pull/72216), [#72360](https://github.com/PaddlePaddle/Paddle/pull/72360), [#72816](https://github.com/PaddlePaddle/Paddle/pull/72816), [#72969](https://github.com/PaddlePaddle/Paddle/pull/72969), [#73106](https://github.com/PaddlePaddle/Paddle/pull/73106), [#72825](https://github.com/PaddlePaddle/Paddle/pull/72825), [#73150](https://github.com/PaddlePaddle/Paddle/pull/73150), [#73151](https://github.com/PaddlePaddle/Paddle/pull/73151), [#73158](https://github.com/PaddlePaddle/Paddle/pull/73158), [#73101](https://github.com/PaddlePaddle/Paddle/pull/73101), [#73326](https://github.com/PaddlePaddle/Paddle/pull/73326), [#72580](https://github.com/PaddlePaddle/Paddle/pull/72580), [#72424](https://github.com/PaddlePaddle/Paddle/pull/72424) -- Paddle CPU/GPU Kernel 精度问题推全。 [#72879](https://github.com/PaddlePaddle/Paddle/pull/72879), 
[#72894](https://github.com/PaddlePaddle/Paddle/pull/72894), [#73012](https://github.com/PaddlePaddle/Paddle/pull/73012), [#72973](https://github.com/PaddlePaddle/Paddle/pull/72973), [#73018](https://github.com/PaddlePaddle/Paddle/pull/73018), [#72965](https://github.com/PaddlePaddle/Paddle/pull/72965), [#73128](https://github.com/PaddlePaddle/Paddle/pull/73128), [#73229](https://github.com/PaddlePaddle/Paddle/pull/73229), [#72992](https://github.com/PaddlePaddle/Paddle/pull/72992), [#73344](https://github.com/PaddlePaddle/Paddle/pull/73344), [#73274](https://github.com/PaddlePaddle/Paddle/pull/73274), [#73295](https://github.com/PaddlePaddle/Paddle/pull/73295), [#73293](https://github.com/PaddlePaddle/Paddle/pull/73293), [#73317](https://github.com/PaddlePaddle/Paddle/pull/73317), [#73320](https://github.com/PaddlePaddle/Paddle/pull/73320), [#73454](https://github.com/PaddlePaddle/Paddle/pull/73454), [#73492](https://github.com/PaddlePaddle/Paddle/pull/73492), [#73535](https://github.com/PaddlePaddle/Paddle/pull/73535) - -- slice 问题修复:修复了 slice 相关问题,包括索引逻辑、性能优化等, [#72644](https://github.com/PaddlePaddle/Paddle/pull/72644), [#72676](https://github.com/PaddlePaddle/Paddle/pull/72676), [#72838](https://github.com/PaddlePaddle/Paddle/pull/72838), [#72966](https://github.com/PaddlePaddle/Paddle/pull/72966), [#73095](https://github.com/PaddlePaddle/Paddle/pull/73095), [#72840](https://github.com/PaddlePaddle/Paddle/pull/72840), [#73112](https://github.com/PaddlePaddle/Paddle/pull/73112), [#73367](https://github.com/PaddlePaddle/Paddle/pull/73367), [#73390](https://github.com/PaddlePaddle/Paddle/pull/73390), [#73307](https://github.com/PaddlePaddle/Paddle/pull/73307), [#73465](https://github.com/PaddlePaddle/Paddle/pull/73465), [#73362](https://github.com/PaddlePaddle/Paddle/pull/73362), [#72733](https://github.com/PaddlePaddle/Paddle/pull/72733), [#72886](https://github.com/PaddlePaddle/Paddle/pull/72886) -- 性能优化:通过优化索引逻辑、性能提升等手段,提升整体性能表现, 
[#72707](https://github.com/PaddlePaddle/Paddle/pull/72707), [#73485](https://github.com/PaddlePaddle/Paddle/pull/73485) -- 其他重要提升:包括动态 shape 支持、修复 meshgrid 并增加单元测试、升级 CUB 至 2.1.0 版本、改进 FP8 数值处理、优化 CUDA 图共享池机制、移除 ShadowFeedOp 以简化数据流、增强 PIR 模型保存/加载的版本兼容性、修复 flip 和 reverse 内核问题、改进 paddle.angle 的 NaN 传播逻辑、引入异步 GC 检查机制、优化 Dy2St 的 Scope 无锁接口、清理未使用的第三方依赖(absl),并进一步推进 PHI 与 Fluid 的解耦,提升框架的稳定性、性能和扩展性。 [#72356](https://github.com/PaddlePaddle/Paddle/pull/72356), [#72380](https://github.com/PaddlePaddle/Paddle/pull/72380), [#72633](https://github.com/PaddlePaddle/Paddle/pull/72633), [#72794](https://github.com/PaddlePaddle/Paddle/pull/72794), [#72917](https://github.com/PaddlePaddle/Paddle/pull/72917), [#72920](https://github.com/PaddlePaddle/Paddle/pull/72920), [#72945](https://github.com/PaddlePaddle/Paddle/pull/72945), [#72620](https://github.com/PaddlePaddle/Paddle/pull/72620), [#73011](https://github.com/PaddlePaddle/Paddle/pull/73011), [#73051](https://github.com/PaddlePaddle/Paddle/pull/73051), [#73052](https://github.com/PaddlePaddle/Paddle/pull/73052), [#73075](https://github.com/PaddlePaddle/Paddle/pull/73075), [#73176](https://github.com/PaddlePaddle/Paddle/pull/73176), [#73191](https://github.com/PaddlePaddle/Paddle/pull/73191), [#73337](https://github.com/PaddlePaddle/Paddle/pull/73337), [#73311](https://github.com/PaddlePaddle/Paddle/pull/73311), [#73173](https://github.com/PaddlePaddle/Paddle/pull/73173), [#73239](https://github.com/PaddlePaddle/Paddle/pull/73239), [#73448](https://github.com/PaddlePaddle/Paddle/pull/73448), [#73478](https://github.com/PaddlePaddle/Paddle/pull/73478), [#73522](https://github.com/PaddlePaddle/Paddle/pull/73522), [#73369](https://github.com/PaddlePaddle/Paddle/pull/73369) - -### 性能提升 - -- SOT 相关:通过优化 Guard 条件机制、增强动态 shape 处理能力及新增 no_grad 支持等改进,提升了执行效率并扩展了功能特性,同时优化了代码结构与性能表现。 [#70362](https://github.com/PaddlePaddle/Paddle/pull/70362), [#70154](https://github.com/PaddlePaddle/Paddle/pull/70154), 
[#71748](https://github.com/PaddlePaddle/Paddle/pull/71748), [#72004](https://github.com/PaddlePaddle/Paddle/pull/72004), [#72159](https://github.com/PaddlePaddle/Paddle/pull/72159), [#72174](https://github.com/PaddlePaddle/Paddle/pull/72174), [#71994](https://github.com/PaddlePaddle/Paddle/pull/71994), [#72250](https://github.com/PaddlePaddle/Paddle/pull/72250), [#72285](https://github.com/PaddlePaddle/Paddle/pull/72285), [#72322](https://github.com/PaddlePaddle/Paddle/pull/72322), [#72272](https://github.com/PaddlePaddle/Paddle/pull/72272), [#72417](https://github.com/PaddlePaddle/Paddle/pull/72417), [#72438](https://github.com/PaddlePaddle/Paddle/pull/72438), [#72462](https://github.com/PaddlePaddle/Paddle/pull/72462), [#72463](https://github.com/PaddlePaddle/Paddle/pull/72463), [#72503](https://github.com/PaddlePaddle/Paddle/pull/72503), [#72501](https://github.com/PaddlePaddle/Paddle/pull/72501), [#72521](https://github.com/PaddlePaddle/Paddle/pull/72521), [#72509](https://github.com/PaddlePaddle/Paddle/pull/72509), [#72544](https://github.com/PaddlePaddle/Paddle/pull/72544), [#73469](https://github.com/PaddlePaddle/Paddle/pull/73469), [#73471](https://github.com/PaddlePaddle/Paddle/pull/73471), [#73555](https://github.com/PaddlePaddle/Paddle/pull/73555)
+- C++ extension development. [#74338](https://github.com/PaddlePaddle/Paddle/pull/74338)
+- FlexCP feature improvements. [#74752](https://github.com/PaddlePaddle/Paddle/pull/74752), [#74981](https://github.com/PaddlePaddle/Paddle/pull/74981)
+- Optimize memory allocation. [#74463](https://github.com/PaddlePaddle/Paddle/pull/74463)
 ### 废弃
-
-- 代码清理:清理 Python 3.8 支持声明,并完成了相关代码清理、依赖精简及语法现代化更新,以优化代码维护性与兼容性。 [#71815](https://github.com/PaddlePaddle/Paddle/pull/71815), [#72802](https://github.com/PaddlePaddle/Paddle/pull/72802), [#72856](https://github.com/PaddlePaddle/Paddle/pull/72856), [#72854](https://github.com/PaddlePaddle/Paddle/pull/72854), [#72855](https://github.com/PaddlePaddle/Paddle/pull/72855), [#72873](https://github.com/PaddlePaddle/Paddle/pull/72873), 
[#72870](https://github.com/PaddlePaddle/Paddle/pull/72870), [#72868](https://github.com/PaddlePaddle/Paddle/pull/72868), [#72891](https://github.com/PaddlePaddle/Paddle/pull/72891)
-
-### 开发者相关
-
-- 优化了 CINN 后端集成与动态 shape 处理逻辑,通过代码结构重构与测试强化提升了框架稳定性,并新增调试日志功能以增强可维护性。 [#71817](https://github.com/PaddlePaddle/Paddle/pull/71817), [#71896](https://github.com/PaddlePaddle/Paddle/pull/71896), [#71984](https://github.com/PaddlePaddle/Paddle/pull/71984), [#72067](https://github.com/PaddlePaddle/Paddle/pull/72067), [#72165](https://github.com/PaddlePaddle/Paddle/pull/72165), [#72207](https://github.com/PaddlePaddle/Paddle/pull/72207), [#72235](https://github.com/PaddlePaddle/Paddle/pull/72235), [#72273](https://github.com/PaddlePaddle/Paddle/pull/72273), [#72326](https://github.com/PaddlePaddle/Paddle/pull/72326), [#72400](https://github.com/PaddlePaddle/Paddle/pull/72400), [#72381](https://github.com/PaddlePaddle/Paddle/pull/72381), [#72560](https://github.com/PaddlePaddle/Paddle/pull/72560), [#72783](https://github.com/PaddlePaddle/Paddle/pull/72783), [#73530](https://github.com/PaddlePaddle/Paddle/pull/73530)
+- Clean up unit tests related to the old IR in dynamic-to-static. [#74698](https://github.com/PaddlePaddle/Paddle/pull/74698), [#74715](https://github.com/PaddlePaddle/Paddle/pull/74715), [#74718](https://github.com/PaddlePaddle/Paddle/pull/74718), [#74782](https://github.com/PaddlePaddle/Paddle/pull/74782), [#74962](https://github.com/PaddlePaddle/Paddle/pull/74962)
 ### 其他
-
-- 其他:新增 CPU 部分 kernel 对 FP16/BF16 数据类型的内核支持,优化测试模块错误处理与容差配置等。 [#71764](https://github.com/PaddlePaddle/Paddle/pull/71764), [#71951](https://github.com/PaddlePaddle/Paddle/pull/71951), [#72944](https://github.com/PaddlePaddle/Paddle/pull/72944)
-
-## 3. 
编译器架构 - -优化编译器性能和增加稳定性 - -### 性能优化 - -- 支持训练场景的 Layout 自动转换优化。([#71891](https://github.com/PaddlePaddle/Paddle/pull/71891)) -- 后端新增了 argmin、argmax、arange 等算子的 Kernel 编译优化。([#71956](https://github.com/PaddlePaddle/Paddle/pull/71956), [#72598](https://github.com/PaddlePaddle/Paddle/pull/72598))) -- 支持矩阵乘的融合优化。([#72846](https://github.com/PaddlePaddle/Paddle/pull/72846)) -- 优化部分算子 Kernel 计算性能。([#72871](https://github.com/PaddlePaddle/Paddle/pull/72871)) - -### Bug 修复 - -修复各类场景下的一些处理逻辑 Bug。([#71813](https://github.com/PaddlePaddle/Paddle/pull/71813), [#71886](https://github.com/PaddlePaddle/Paddle/pull/71886), [#71927](https://github.com/PaddlePaddle/Paddle/pull/71927), [#71915](https://github.com/PaddlePaddle/Paddle/pull/71915), [#71946](https://github.com/PaddlePaddle/Paddle/pull/71946), [#71949](https://github.com/PaddlePaddle/Paddle/pull/71949), [#71955](https://github.com/PaddlePaddle/Paddle/pull/71955), [#71942](https://github.com/PaddlePaddle/Paddle/pull/71942), [#71939](https://github.com/PaddlePaddle/Paddle/pull/71939), [#71973](https://github.com/PaddlePaddle/Paddle/pull/71973), [#72001](https://github.com/PaddlePaddle/Paddle/pull/72001), [#72020](https://github.com/PaddlePaddle/Paddle/pull/72020), [#72014](https://github.com/PaddlePaddle/Paddle/pull/72014), [#72021](https://github.com/PaddlePaddle/Paddle/pull/72021), [#72027](https://github.com/PaddlePaddle/Paddle/pull/72027), [#72061](https://github.com/PaddlePaddle/Paddle/pull/72061), [#72025](https://github.com/PaddlePaddle/Paddle/pull/72025), [#72095](https://github.com/PaddlePaddle/Paddle/pull/72095), [#72108](https://github.com/PaddlePaddle/Paddle/pull/72108), [#72132](https://github.com/PaddlePaddle/Paddle/pull/72132), [#71985](https://github.com/PaddlePaddle/Paddle/pull/71985), [#72106](https://github.com/PaddlePaddle/Paddle/pull/72106), [#72140](https://github.com/PaddlePaddle/Paddle/pull/72140), [#72167](https://github.com/PaddlePaddle/Paddle/pull/72167), 
[#72037](https://github.com/PaddlePaddle/Paddle/pull/72037), [#72178](https://github.com/PaddlePaddle/Paddle/pull/72178), [#72143](https://github.com/PaddlePaddle/Paddle/pull/72143), [#72175](https://github.com/PaddlePaddle/Paddle/pull/72175), [#72191](https://github.com/PaddlePaddle/Paddle/pull/72191), [#72213](https://github.com/PaddlePaddle/Paddle/pull/72213), [#72189](https://github.com/PaddlePaddle/Paddle/pull/72189), [#72214](https://github.com/PaddlePaddle/Paddle/pull/72214), [#72166](https://github.com/PaddlePaddle/Paddle/pull/72166), [#72180](https://github.com/PaddlePaddle/Paddle/pull/72180), [#72284](https://github.com/PaddlePaddle/Paddle/pull/72284), [#72267](https://github.com/PaddlePaddle/Paddle/pull/72267), [#72348](https://github.com/PaddlePaddle/Paddle/pull/72348), [#72332](https://github.com/PaddlePaddle/Paddle/pull/72332), [#72307](https://github.com/PaddlePaddle/Paddle/pull/72307), [#72353](https://github.com/PaddlePaddle/Paddle/pull/72353), [#72204](https://github.com/PaddlePaddle/Paddle/pull/72204), [#72457](https://github.com/PaddlePaddle/Paddle/pull/72457), [#72426](https://github.com/PaddlePaddle/Paddle/pull/72426), [#72536](https://github.com/PaddlePaddle/Paddle/pull/72536), [#72541](https://github.com/PaddlePaddle/Paddle/pull/72541), [#72365](https://github.com/PaddlePaddle/Paddle/pull/72365), [#72621](https://github.com/PaddlePaddle/Paddle/pull/72621), [#72630](https://github.com/PaddlePaddle/Paddle/pull/72630), [#72669](https://github.com/PaddlePaddle/Paddle/pull/72669), [#72682](https://github.com/PaddlePaddle/Paddle/pull/72682), [#72732](https://github.com/PaddlePaddle/Paddle/pull/72732), [#72811](https://github.com/PaddlePaddle/Paddle/pull/72811), [#72941](https://github.com/PaddlePaddle/Paddle/pull/72941), [#72795](https://github.com/PaddlePaddle/Paddle/pull/72795), [#73536](https://github.com/PaddlePaddle/Paddle/pull/73536)) - -## 4. 
自动并行架构 - -在 3.1 版本中,我们对自动并行架构进一步打磨,以提高自动并行易用性和动态图性能。具体地,我们完善了自动并行核心机制,包括新增了多个算子的切分推导规则,支持分布式张量的同一维度被多个 mesh 维度切分,支持动态图并行策略(PP,CP,SEP,TP-CONV)等。同时,对动态图自动并行系统地做了性能优化,在 Llama 等系列模型上性能基本持平手动并行的性能。 - -### 功能改进 - -- 支持分布式张量的同一维度被多个 mesh 维度切分。 [#73233](https://github.com/PaddlePaddle/Paddle/pull/73233) -- 支持自动并行通信拓扑描述 ProcessMesh 转换为手动并行通信组。 [#72052](https://github.com/PaddlePaddle/Paddle/pull/72052) -- 支持任意可序列化 python object 的 send/recv。 [#72098](https://github.com/PaddlePaddle/Paddle/pull/72098) -- 动态图并行策略补齐 - - - 支持流水线并行策略 1F1B 和 VPP 调度。 [#72155](https://github.com/PaddlePaddle/Paddle/pull/72155),[#72480](https://github.com/PaddlePaddle/Paddle/pull/72480),[#72179](https://github.com/PaddlePaddle/Paddle/pull/72179) - - 支持长文并行策略。[#73195](https://github.com/PaddlePaddle/Paddle/pull/73195) - - 支持视觉并行策略。[#73063](https://github.com/PaddlePaddle/Paddle/pull/73063),[#73039](https://github.com/PaddlePaddle/Paddle/pull/73039) - - 支持自动并行在数据并行维度的通信。[#72540](https://github.com/PaddlePaddle/Paddle/pull/72540) -- 新增以下算子的切分推导规则 - - - `min`, `min_grad` [#72269](https://github.com/PaddlePaddle/Paddle/pull/72269) - - `bitwise_or`,`atan2`,`fmax`,`fmin`,`reciprocal` [#72310](https://github.com/PaddlePaddle/Paddle/pull/72310) - - `argmin`, `abs`, `cosh` [#72264](https://github.com/PaddlePaddle/Paddle/pull/72264) - - `mean_all`, `mean_all_grad` [#72479](https://github.com/PaddlePaddle/Paddle/pull/72479) - - `topk`, `topk_grad` [#72499](https://github.com/PaddlePaddle/Paddle/pull/72499) - - `argsort` [#72388](https://github.com/PaddlePaddle/Paddle/pull/72388) - - `round`, `mish`, `elu`, `selu`, `celu`, `stanh`, `softplus`, `softshrink`, `thresholded_relu`, `logit`, `nonzero` [#72312](https://github.com/PaddlePaddle/Paddle/pull/72312) - - `unique ops` [#72824](https://github.com/PaddlePaddle/Paddle/pull/72824) - - `put_along_axis` [#72766](https://github.com/PaddlePaddle/Paddle/pull/72766) - - `round_grad`, `trunc_grad`, `ceil_grad`, `floor_grad`, `poisson_grad` 
[#72677](https://github.com/PaddlePaddle/Paddle/pull/72677) - - `log_softmax`, `cummax`, `cummin` [#72720](https://github.com/PaddlePaddle/Paddle/pull/72720) - - `unary` [#72177](https://github.com/PaddlePaddle/Paddle/pull/72177) - - `unary_grad` [#72260](https://github.com/PaddlePaddle/Paddle/pull/72260) - - `index_select`, `index_select_grad` [#72727](https://github.com/PaddlePaddle/Paddle/pull/72727) - - `roll`, `roll_grad` [#72740](https://github.com/PaddlePaddle/Paddle/pull/72740) - - `empty_like` [#73169](https://github.com/PaddlePaddle/Paddle/pull/73169) - - `roi_align`, `roi_align_grad` [#72925](https://github.com/PaddlePaddle/Paddle/pull/72925) - - `expand_as`, `expand_as_grad` [#73107](https://github.com/PaddlePaddle/Paddle/pull/73107) - - `fused_gemm_epilogur` [#73126](https://github.com/PaddlePaddle/Paddle/pull/73126) - - `label_smooth`, `label_smooth` [#72845](https://github.com/PaddlePaddle/Paddle/pull/72845) - - `group_norm`, `group_norm_grad` [#72946](https://github.com/PaddlePaddle/Paddle/pull/72946) - - `instance_norm`, `instance_norm_grad` [#72938](https://github.com/PaddlePaddle/Paddle/pull/72938) - - `batch_norm`, `sync_batch_norm` [#72918](https://github.com/PaddlePaddle/Paddle/pull/72918) - - `reduce_any` [#73175](https://github.com/PaddlePaddle/Paddle/pull/73175) - - `fused_gemm_epilogue_rule` [#73494](https://github.com/PaddlePaddle/Paddle/pull/73494) - -### 性能优化 - -* 支持分组切分并行的 tensor_fusion 优化策略和 overlap 优化策略。 [#72551](https://github.com/PaddlePaddle/Paddle/pull/72551), [#72902](https://github.com/PaddlePaddle/Paddle/pull/72902), [#73142](https://github.com/PaddlePaddle/Paddle/pull/73142),[#71785](https://github.com/PaddlePaddle/Paddle/pull/71785) -* 优化 reshard 模块,以降低通信开销。[#71969](https://github.com/PaddlePaddle/Paddle/pull/71969), [#73024](https://github.com/PaddlePaddle/Paddle/pull/73024),[#71868](https://github.com/PaddlePaddle/Paddle/pull/71868) -* 优化 multiply 的切分推导规则,以降低通信开销。[#73408](https://github.com/PaddlePaddle/Paddle/pull/73408) 
-* 优化分布式切分状态为 Partial 时反向通信,以降低通信开销。 [#73236](https://github.com/PaddlePaddle/Paddle/pull/73236)
-* 梯度更新时通信融合优化。 [#72120](https://github.com/PaddlePaddle/Paddle/pull/72120 )、[#72745](https://github.com/PaddlePaddle/Paddle/pull/72745)
-* 优化 gelu 切分推导,以降低通信开销。 [#73279](https://github.com/PaddlePaddle/Paddle/pull/73279)
-* 优化 fused_rms_norm 在输入有 Partial 状态时的切分推导规则,以减少通信和计算开销。 [#73054](https://github.com/PaddlePaddle/Paddle/pull/73054)
-
-### Bug 修复
-
-- 修复虚拟流水线并行策略在 H 卡上通信 hang 的 bug。[#71104](https://github.com/PaddlePaddle/Paddle/pull/71104), [#73470](https://github.com/PaddlePaddle/Paddle/pull/73470)
-- 修复 save/load 的 bug。 [#72023](https://github.com/PaddlePaddle/Paddle/pull/72023)
-- 修复 linear_fused_grad_add 策略在动态图模式下跑不通的 bug。 [#72708](https://github.com/PaddlePaddle/Paddle/pull/72708))
-- 修复 fused_rms_norm 算子跑不通和精度 bug。 [#72663](https://github.com/PaddlePaddle/Paddle/pull/72663 )
-- 修复 expand 算子切分推导规则的 bug。[#73154](https://github.com/PaddlePaddle/Paddle/pull/73154)
-
-### 其他
-
-- 清理废弃代码,以便于维护代码。 [#71814](https://github.com/PaddlePaddle/Paddle/pull/71814),[#72538](https://github.com/PaddlePaddle/Paddle/pull/72538)
-- 新增 API local_map,将分布式张量传递给为普通张量编写的函数。 ([#71804](https://github.com/PaddlePaddle/Paddle/pull/71804))
-- 为算子 fused_linear_param_grad_add 增加检查。([#72483](https://github.com/PaddlePaddle/Paddle/pull/72483))
-
-## 5. 算子机制
-
+- Update the patch version. [#74940](https://github.com/PaddlePaddle/Paddle/pull/74940)
+
+## 3. 
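Distributed & Auto Parallel
+
+### Parallel Strategies
+In version 3.2, we made several enhancements to pipeline parallelism, including support for passing dictionary parameters and extending Pipeline Layer and SharedLayerDesc to work with non-pipeline parallelism. We also fixed several critical issues: IPC API failures with large tensors, eval batch and non-compute_loss problems in pipeline parallelism, a gradient release error in MoE models, a hang caused by rebuilding NCCL communicators in PP scenarios, and an event management error in dual pipeline parallelism. In addition, we improved the computation overlap efficiency of dual pipeline parallelism to increase training performance, and upgraded the clear_param_storage method to support clearing and resetting multiple color collections in sharding mode.
+
+#### New Features
+- Support passing dictionary parameters in pipeline parallelism (Pipeline Parallel). [#74574](https://github.com/PaddlePaddle/Paddle/pull/74574),[#74867](https://github.com/PaddlePaddle/Paddle/pull/74867)
+- Pipeline Layer and SharedLayerDesc support non-pipeline parallelism (nonpp parallel). [#74573](https://github.com/PaddlePaddle/Paddle/pull/74573)
+
+#### Bug Fixes
+- Fix the IPC API issue for large tensors. [#74472](https://github.com/PaddlePaddle/Paddle/pull/74472)
+- Fix eval batch and non-compute_loss issues in pipeline parallelism. [#74170](https://github.com/PaddlePaddle/Paddle/pull/74170)
+- Fix the gradient release issue on MoE models. [#74972](https://github.com/PaddlePaddle/Paddle/pull/74972)
+- Fix a hang when rebuilding the NCCL comm in PP scenarios. [#73625](https://github.com/PaddlePaddle/Paddle/pull/73625)
+- Fix the event management error in dual pipeline parallelism (dual pp). [#74158](https://github.com/PaddlePaddle/Paddle/pull/74158)
+
+#### Optimizations and Improvements
+- Improve the computation overlap efficiency of dual pipeline parallelism to increase training performance. [#74527](https://github.com/PaddlePaddle/Paddle/pull/74527)
+- Upgrade the clear_param_storage method to support clearing and resetting multiple color collections under sharding. [#74741](https://github.com/PaddlePaddle/Paddle/pull/74741)
+
+### Auto Parallel
+#### Functional Improvements
+- Support default sharding derivation rules when the same dimension of a distributed tensor is sharded by multiple mesh dimensions. [#74396](https://github.com/PaddlePaddle/Paddle/pull/74396)
+- Improve the sharding derivation rule of the `reshape` operator to support the case where the same dimension of a distributed tensor is sharded by multiple mesh dimensions. [#74352](https://github.com/PaddlePaddle/Paddle/pull/74352),[#74579](https://github.com/PaddlePaddle/Paddle/pull/74579), [#74565](https://github.com/PaddlePaddle/Paddle/pull/74565)
+- Support changing the mesh of a distributed tensor without changing its data. [#74248](https://github.com/PaddlePaddle/Paddle/pull/74248)
+
+#### Bug Fixes
+- Fix a bug where calling the `get_group` method of `ProcessMesh` created duplicate communication groups. [#73099](https://github.com/PaddlePaddle/Paddle/pull/73099)
+- Fix a bug in the `get_local_slices` method in MoE scenarios. [#74705](https://github.com/PaddlePaddle/Paddle/pull/74705)
+- 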
修复MoE场景下梯度裁剪的bug。[#74916](https://github.com/PaddlePaddle/Paddle/pull/74916) +- 修复流水线并行场景下不同stage间无法传递`stop_gradient`参数的bug。[#73459](https://github.com/PaddlePaddle/Paddle/pull/73459) +- 修复流水线并行场景下梯度裁剪的精度bug。[#74409](https://github.com/PaddlePaddle/Paddle/pull/74409) +- 修复动态图流水线并行场景下产生冗余输出的bug。[#74913](https://github.com/PaddlePaddle/Paddle/pull/74913) +- 修复算子`moe_combine`和`moe_gate_dispatch`在MoE场景下跑不通的bug。[#74645](https://github.com/PaddlePaddle/Paddle/pull/74645) + +#### 其他 +- 支持dataloader手动并行和自动并行的精度对齐。[#73941](https://github.com/PaddlePaddle/Paddle/pull/73941) +- 优化动态图流水并行调度逻辑。[#74720](https://github.com/PaddlePaddle/Paddle/pull/74720) + +### 通信库 +在3.2版本中,我们修复了DeepEP支持sm90编译的一个报错,同时对DeepEP申请的显存分配添加了预分配功能,并升级了其intranode和internode计算kernel,进一步优化了性能和稳定性。 + +#### Bug修复 +- 修复DeepEP支持sm90 编译的一个报错。[#74762](https://github.com/PaddlePaddle/Paddle/pull/74762) + +#### 功能改进 +- 对DeepEP申请的显存分配添加预分配功能。[#74465](https://github.com/PaddlePaddle/Paddle/pull/74465) +- 升级DeepEP的intranode和internode计算kernel。[#74284](https://github.com/PaddlePaddle/Paddle/pull/74284) + +## 4. 
算子机制 ### 新特性 +- API 兼容性支持。 [#74506](https://github.com/PaddlePaddle/Paddle/pull/74506), [#74676](https://github.com/PaddlePaddle/Paddle/pull/74676), [#74558](https://github.com/PaddlePaddle/Paddle/pull/74558), [#74572](https://github.com/PaddlePaddle/Paddle/pull/74572), [#74691](https://github.com/PaddlePaddle/Paddle/pull/74691), [#74703](https://github.com/PaddlePaddle/Paddle/pull/74703), [#74750](https://github.com/PaddlePaddle/Paddle/pull/74750), [#74757](https://github.com/PaddlePaddle/Paddle/pull/74757), [#74802](https://github.com/PaddlePaddle/Paddle/pull/74802), [#74546](https://github.com/PaddlePaddle/Paddle/pull/74546), [#74547](https://github.com/PaddlePaddle/Paddle/pull/74547), [#74859](https://github.com/PaddlePaddle/Paddle/pull/74859), [#74910](https://github.com/PaddlePaddle/Paddle/pull/74910), [#74873](https://github.com/PaddlePaddle/Paddle/pull/74873), [#74882](https://github.com/PaddlePaddle/Paddle/pull/74882), [#74901](https://github.com/PaddlePaddle/Paddle/pull/74901), [#74899](https://github.com/PaddlePaddle/Paddle/pull/74899), [#74449](https://github.com/PaddlePaddle/Paddle/pull/74449) +- 新增 fused_partial_rope 算子。 [#74577](https://github.com/PaddlePaddle/Paddle/pull/74577) -- 梯度与自动微分优化:初步支持 put_along_axis 及 repeat_interleave 操作的双重梯度计算,提升复杂算子在自动微分场景下的数值稳定性,实现 masked_fill 操作的算子分解。 [#72789](https://github.com/PaddlePaddle/Paddle/pull/72789), [#73056](https://github.com/PaddlePaddle/Paddle/pull/73056), [#73225](https://github.com/PaddlePaddle/Paddle/pull/73225) -- 运算符机制扩展:新增对__radd__和__rmul__的自定义支持,增强框架对非对称运算符的重载能力。 [#73119](https://github.com/PaddlePaddle/Paddle/pull/73119) -- FP8 模块支持及算子开发:新增 FP8 块量化 GEMM 支持,引入多个融合算子,为混合专家(MoE)模型提供高效算子级实现,提升训推性能。 [#73228](https://github.com/PaddlePaddle/Paddle/pull/73228), [#73285](https://github.com/PaddlePaddle/Paddle/pull/73285), [#73133](https://github.com/PaddlePaddle/Paddle/pull/73133), [#73364](https://github.com/PaddlePaddle/Paddle/pull/73364), 
[#73520](https://github.com/PaddlePaddle/Paddle/pull/73520), [#73531](https://github.com/PaddlePaddle/Paddle/pull/73531) - -### Bug 修复 - -- 梯度与自动微分稳定性提升:修复部分反向算子梯度计算错误,增强自动微分场景下的数值稳定性与功能正确性。 [#71716](https://github.com/PaddlePaddle/Paddle/pull/71716), [#72299](https://github.com/PaddlePaddle/Paddle/pull/72299), [#72358](https://github.com/PaddlePaddle/Paddle/pull/72358), [#73037](https://github.com/PaddlePaddle/Paddle/pull/73037), [#73140](https://github.com/PaddlePaddle/Paddle/pull/73140), [#73185](https://github.com/PaddlePaddle/Paddle/pull/73185) -- 数值精度与溢出防护:解决数值溢出、精度损失及大 tensor 溢出问题,保障低精度计算与大张量操作的可靠性。 [#72584](https://github.com/PaddlePaddle/Paddle/pull/72584), [#72608](https://github.com/PaddlePaddle/Paddle/pull/72608), [#72681](https://github.com/PaddlePaddle/Paddle/pull/72681), [#72639](https://github.com/PaddlePaddle/Paddle/pull/72639), [#73245](https://github.com/PaddlePaddle/Paddle/pull/73245), [#73359](https://github.com/PaddlePaddle/Paddle/pull/73359), [#72456](https://github.com/PaddlePaddle/Paddle/pull/72456) -- 算子逻辑与框架对齐:对齐算子运算逻辑,修复部分算子输入异常等问题,其他重要修复:添加检查,保障框架功能正确性。 [#72282](https://github.com/PaddlePaddle/Paddle/pull/72282), [#71863](https://github.com/PaddlePaddle/Paddle/pull/71863), [#72650](https://github.com/PaddlePaddle/Paddle/pull/72650), [#72843](https://github.com/PaddlePaddle/Paddle/pull/72843), [#73070](https://github.com/PaddlePaddle/Paddle/pull/73070), [#73141](https://github.com/PaddlePaddle/Paddle/pull/73141), [#73203](https://github.com/PaddlePaddle/Paddle/pull/73203), [#73350](https://github.com/PaddlePaddle/Paddle/pull/73350), [#73440](https://github.com/PaddlePaddle/Paddle/pull/73440), [#73539](https://github.com/PaddlePaddle/Paddle/pull/73539), [#73339](https://github.com/PaddlePaddle/Paddle/pull/73339) -- CUDA 内核与硬件适配优化:支持 NVIDIA SM90 架构,修复溢出等问题,移除冗余 CUDA 错误检查,提升 GPU 计算效率与新硬件适配性。 [#72507](https://github.com/PaddlePaddle/Paddle/pull/72507), [#72849](https://github.com/PaddlePaddle/Paddle/pull/72849), 
[#72959](https://github.com/PaddlePaddle/Paddle/pull/72959), [#73130](https://github.com/PaddlePaddle/Paddle/pull/73130), [#73489](https://github.com/PaddlePaddle/Paddle/pull/73489) +### Bug修复 +- 0-size Tensor 相关修复。 [#74295](https://github.com/PaddlePaddle/Paddle/pull/74295), [#74305](https://github.com/PaddlePaddle/Paddle/pull/74305), [#74323](https://github.com/PaddlePaddle/Paddle/pull/74323), [#74354](https://github.com/PaddlePaddle/Paddle/pull/74354) +- 大 Tensor 相关修复。 [#74242](https://github.com/PaddlePaddle/Paddle/pull/74242), [#74293](https://github.com/PaddlePaddle/Paddle/pull/74293), [#74289](https://github.com/PaddlePaddle/Paddle/pull/74289), [#74279](https://github.com/PaddlePaddle/Paddle/pull/74279), [#74330](https://github.com/PaddlePaddle/Paddle/pull/74330), [#74329](https://github.com/PaddlePaddle/Paddle/pull/74329), [#74342](https://github.com/PaddlePaddle/Paddle/pull/74342), [#74369](https://github.com/PaddlePaddle/Paddle/pull/74369), [#74370](https://github.com/PaddlePaddle/Paddle/pull/74370), [#74404](https://github.com/PaddlePaddle/Paddle/pull/74404), [#74537](https://github.com/PaddlePaddle/Paddle/pull/74537), [#74451](https://github.com/PaddlePaddle/Paddle/pull/74451), [#74172](https://github.com/PaddlePaddle/Paddle/pull/74172), [#74324](https://github.com/PaddlePaddle/Paddle/pull/74324), [#74964](https://github.com/PaddlePaddle/Paddle/pull/74964), [#74360](https://github.com/PaddlePaddle/Paddle/pull/74360), [#74379](https://github.com/PaddlePaddle/Paddle/pull/74379), [#74377](https://github.com/PaddlePaddle/Paddle/pull/74377), [#74380](https://github.com/PaddlePaddle/Paddle/pull/74380), [#74362](https://github.com/PaddlePaddle/Paddle/pull/74362), [#74197](https://github.com/PaddlePaddle/Paddle/pull/74197) +- API 兼容性相关修复。 [#74764](https://github.com/PaddlePaddle/Paddle/pull/74764), [#74869](https://github.com/PaddlePaddle/Paddle/pull/74869), [#74935](https://github.com/PaddlePaddle/Paddle/pull/74935) +- 【开源任务】Paddle CPU/GPU Kernel 精度问题推全。 
[#74149](https://github.com/PaddlePaddle/Paddle/pull/74149), [#74598](https://github.com/PaddlePaddle/Paddle/pull/74598), [#74719](https://github.com/PaddlePaddle/Paddle/pull/74719), [#74625](https://github.com/PaddlePaddle/Paddle/pull/74625), [#74555](https://github.com/PaddlePaddle/Paddle/pull/74555) +- 其他重要修复。 [#74282](https://github.com/PaddlePaddle/Paddle/pull/74282), [#74313](https://github.com/PaddlePaddle/Paddle/pull/74313), [#74303](https://github.com/PaddlePaddle/Paddle/pull/74303), [#74306](https://github.com/PaddlePaddle/Paddle/pull/74306), [#74298](https://github.com/PaddlePaddle/Paddle/pull/74298), [#74044](https://github.com/PaddlePaddle/Paddle/pull/74044), [#74290](https://github.com/PaddlePaddle/Paddle/pull/74290), [#74348](https://github.com/PaddlePaddle/Paddle/pull/74348), [#74364](https://github.com/PaddlePaddle/Paddle/pull/74364), [#74332](https://github.com/PaddlePaddle/Paddle/pull/74332), [#74224](https://github.com/PaddlePaddle/Paddle/pull/74224), [#74382](https://github.com/PaddlePaddle/Paddle/pull/74382), [#74406](https://github.com/PaddlePaddle/Paddle/pull/74406), [#74434](https://github.com/PaddlePaddle/Paddle/pull/74434), [#74448](https://github.com/PaddlePaddle/Paddle/pull/74448), [#74457](https://github.com/PaddlePaddle/Paddle/pull/74457), [#74322](https://github.com/PaddlePaddle/Paddle/pull/74322), [#74530](https://github.com/PaddlePaddle/Paddle/pull/74530), [#74716](https://github.com/PaddlePaddle/Paddle/pull/74716), [#74839](https://github.com/PaddlePaddle/Paddle/pull/74839), [#74842](https://github.com/PaddlePaddle/Paddle/pull/74842), [#74854](https://github.com/PaddlePaddle/Paddle/pull/74854), [#74919](https://github.com/PaddlePaddle/Paddle/pull/74919), [#74767](https://github.com/PaddlePaddle/Paddle/pull/74767), [#75003](https://github.com/PaddlePaddle/Paddle/pull/75003) ### 功能增强 +- API 兼容能力提升。 [#74456](https://github.com/PaddlePaddle/Paddle/pull/74456), [#74480](https://github.com/PaddlePaddle/Paddle/pull/74480), 
[#74523](https://github.com/PaddlePaddle/Paddle/pull/74523), [#74490](https://github.com/PaddlePaddle/Paddle/pull/74490), [#74548](https://github.com/PaddlePaddle/Paddle/pull/74548), [#74596](https://github.com/PaddlePaddle/Paddle/pull/74596), [#74568](https://github.com/PaddlePaddle/Paddle/pull/74568), [#74559](https://github.com/PaddlePaddle/Paddle/pull/74559), [#74629](https://github.com/PaddlePaddle/Paddle/pull/74629), [#74623](https://github.com/PaddlePaddle/Paddle/pull/74623), [#74700](https://github.com/PaddlePaddle/Paddle/pull/74700), [#74643](https://github.com/PaddlePaddle/Paddle/pull/74643), [#74602](https://github.com/PaddlePaddle/Paddle/pull/74602), [#74783](https://github.com/PaddlePaddle/Paddle/pull/74783), [#74781](https://github.com/PaddlePaddle/Paddle/pull/74781), [#74735](https://github.com/PaddlePaddle/Paddle/pull/74735), [#74725](https://github.com/PaddlePaddle/Paddle/pull/74725), [#74815](https://github.com/PaddlePaddle/Paddle/pull/74815), [#74856](https://github.com/PaddlePaddle/Paddle/pull/74856), [#74925](https://github.com/PaddlePaddle/Paddle/pull/74925), [#74545](https://github.com/PaddlePaddle/Paddle/pull/74545), [#74932](https://github.com/PaddlePaddle/Paddle/pull/74932), [#74784](https://github.com/PaddlePaddle/Paddle/pull/74784) +- slice/stride 相关优化。 [#74731](https://github.com/PaddlePaddle/Paddle/pull/74731), [#74740](https://github.com/PaddlePaddle/Paddle/pull/74740), [#74769](https://github.com/PaddlePaddle/Paddle/pull/74769), [#74810](https://github.com/PaddlePaddle/Paddle/pull/74810), [#74841](https://github.com/PaddlePaddle/Paddle/pull/74841), [#74954](https://github.com/PaddlePaddle/Paddle/pull/74954), [#74888](https://github.com/PaddlePaddle/Paddle/pull/74888), [#74944](https://github.com/PaddlePaddle/Paddle/pull/74944), [#74312](https://github.com/PaddlePaddle/Paddle/pull/74312), [#74291](https://github.com/PaddlePaddle/Paddle/pull/74291), [#74271](https://github.com/PaddlePaddle/Paddle/pull/74271), 
[#74320](https://github.com/PaddlePaddle/Paddle/pull/74320), [#74344](https://github.com/PaddlePaddle/Paddle/pull/74344), [#74727](https://github.com/PaddlePaddle/Paddle/pull/74727), [#74637](https://github.com/PaddlePaddle/Paddle/pull/74637) +- 算子优化与 CUDA 支持。 [#74693](https://github.com/PaddlePaddle/Paddle/pull/74693), [#74922](https://github.com/PaddlePaddle/Paddle/pull/74922), [#74967](https://github.com/PaddlePaddle/Paddle/pull/74967) +- 改进调试信息、兼容性增强。 [#74372](https://github.com/PaddlePaddle/Paddle/pull/74372), [#74622](https://github.com/PaddlePaddle/Paddle/pull/74622) +- 算子功能扩展与优化。 [#74790](https://github.com/PaddlePaddle/Paddle/pull/74790), [#74979](https://github.com/PaddlePaddle/Paddle/pull/74979) -- 新增 int64_t 版本的快速除法取模实现,提升大整数场景下的计算性能与数值稳定性, [#72530](https://github.com/PaddlePaddle/Paddle/pull/72530) -- 优化带步长张量拷贝 kernel,改进非连续内存布局下的数据拷贝效率。 [#72662](https://github.com/PaddlePaddle/Paddle/pull/72662) - --统一动态图与静态图模式下量化 API 的使用方式,简化量化模型开发流程, [#73100](https://github.com/PaddlePaddle/Paddle/pull/73100) - -### 性能提升 +### 性能优化 +- FP8 计算优化。 [#74471](https://github.com/PaddlePaddle/Paddle/pull/74471), [#74684](https://github.com/PaddlePaddle/Paddle/pull/74684), [#74911](https://github.com/PaddlePaddle/Paddle/pull/74911) +- 基础算子性能优化。 [#74442](https://github.com/PaddlePaddle/Paddle/pull/74442), [#74638](https://github.com/PaddlePaddle/Paddle/pull/74638) +- 支持 fa3 变长序列反向计算并优化前向 API。 [#73831](https://github.com/PaddlePaddle/Paddle/pull/73831) +- 新增 FlashMask V2 功能。 [#74729](https://github.com/PaddlePaddle/Paddle/pull/74729) -- 优化 gelu 算子分解性能,提升计算效率。 [#72812](https://github.com/PaddlePaddle/Paddle/pull/72812) +### 文档 +- 修复英文文档问题以及版权年份问题。 [#74737](https://github.com/PaddlePaddle/Paddle/pull/74737) ### 其他 +- 在XPU硬件上默认开启 WITH_XPU_FFT 选项。 [#74699](https://github.com/PaddlePaddle/Paddle/pull/74699) -- fluid 算子规范化与退场, [#71789](https://github.com/PaddlePaddle/Paddle/pull/71789), [#71818](https://github.com/PaddlePaddle/Paddle/pull/71818), 
[#71808](https://github.com/PaddlePaddle/Paddle/pull/71808), [#71860](https://github.com/PaddlePaddle/Paddle/pull/71860), [#71806](https://github.com/PaddlePaddle/Paddle/pull/71806), [#72011](https://github.com/PaddlePaddle/Paddle/pull/72011), [#72043](https://github.com/PaddlePaddle/Paddle/pull/72043), [#72034](https://github.com/PaddlePaddle/Paddle/pull/72034), [#72047](https://github.com/PaddlePaddle/Paddle/pull/72047), [#72056](https://github.com/PaddlePaddle/Paddle/pull/72056), [#72087](https://github.com/PaddlePaddle/Paddle/pull/72087), [#72086](https://github.com/PaddlePaddle/Paddle/pull/72086), [#72083](https://github.com/PaddlePaddle/Paddle/pull/72083), [#72079](https://github.com/PaddlePaddle/Paddle/pull/72079), [#72078](https://github.com/PaddlePaddle/Paddle/pull/72078), [#72076](https://github.com/PaddlePaddle/Paddle/pull/72076), [#72057](https://github.com/PaddlePaddle/Paddle/pull/72057), [#72077](https://github.com/PaddlePaddle/Paddle/pull/72077), [#72096](https://github.com/PaddlePaddle/Paddle/pull/72096), [#72085](https://github.com/PaddlePaddle/Paddle/pull/72085), [#72092](https://github.com/PaddlePaddle/Paddle/pull/72092), [#72110](https://github.com/PaddlePaddle/Paddle/pull/72110), [#72127](https://github.com/PaddlePaddle/Paddle/pull/72127), [#72111](https://github.com/PaddlePaddle/Paddle/pull/72111), [#72126](https://github.com/PaddlePaddle/Paddle/pull/72126), [#72135](https://github.com/PaddlePaddle/Paddle/pull/72135), [#72112](https://github.com/PaddlePaddle/Paddle/pull/72112), [#72131](https://github.com/PaddlePaddle/Paddle/pull/72131), [#70358](https://github.com/PaddlePaddle/Paddle/pull/70358), [#72125](https://github.com/PaddlePaddle/Paddle/pull/72125), [#72171](https://github.com/PaddlePaddle/Paddle/pull/72171), [#72160](https://github.com/PaddlePaddle/Paddle/pull/72160), [#72188](https://github.com/PaddlePaddle/Paddle/pull/72188), [#72197](https://github.com/PaddlePaddle/Paddle/pull/72197), 
[#72212](https://github.com/PaddlePaddle/Paddle/pull/72212), [#72211](https://github.com/PaddlePaddle/Paddle/pull/72211), [#72184](https://github.com/PaddlePaddle/Paddle/pull/72184), [#71897](https://github.com/PaddlePaddle/Paddle/pull/71897), [#72219](https://github.com/PaddlePaddle/Paddle/pull/72219), [#72218](https://github.com/PaddlePaddle/Paddle/pull/72218), [#72074](https://github.com/PaddlePaddle/Paddle/pull/72074), [#70330](https://github.com/PaddlePaddle/Paddle/pull/70330), [#70274](https://github.com/PaddlePaddle/Paddle/pull/70274), [#72295](https://github.com/PaddlePaddle/Paddle/pull/72295), [#72220](https://github.com/PaddlePaddle/Paddle/pull/72220), [#72343](https://github.com/PaddlePaddle/Paddle/pull/72343), [#72303](https://github.com/PaddlePaddle/Paddle/pull/72303), [#72296](https://github.com/PaddlePaddle/Paddle/pull/72296), [#72338](https://github.com/PaddlePaddle/Paddle/pull/72338), [#70001](https://github.com/PaddlePaddle/Paddle/pull/70001), [#70348](https://github.com/PaddlePaddle/Paddle/pull/70348), [#70329](https://github.com/PaddlePaddle/Paddle/pull/70329) - -## 6. 
框架性能优化 - -### 新特性 - -支持`sharding_overlap`的`acc_steps`可配置。 [#72395](https://github.com/PaddlePaddle/Paddle/pull/72395) - -### Bug 修复 - -- 修复算子`c_softmax_with_cross_entropy_grad`的`inplace`问题。 [#72366](https://github.com/PaddlePaddle/Paddle/pull/72366) - -### 功能增强 - -- 性能优化与加速:启用深度卷积的 cuDNN 支持,提升卷积运算效率。更新池化操作策略并优化 permute 内存操作,减少 CUDA 内存占用。优化打印速度,加速调试与日志输出流程。 [#71796](https://github.com/PaddlePaddle/Paddle/pull/71796), [#73442](https://github.com/PaddlePaddle/Paddle/pull/73442), [#73563](https://github.com/PaddlePaddle/Paddle/pull/73563) -- 功能增强与操作支持:新增 masked_fill 操作及布尔索引优化,增强张量掩码处理能力。实现 index_elementwise 操作,支持基于索引的元素级运算。添加池化与 reshape 执行策略,提升模型操作的灵活性。 [#72788](https://github.com/PaddlePaddle/Paddle/pull/72788), [#72942](https://github.com/PaddlePaddle/Paddle/pull/72942) -- 错误修复与稳定性提升:修复 fused_rms_norm 在 SPMD 并行模式下的部分状态支持问题。修正 slice 操作中输出维度计算及 IndexGetStride 的索引错误,确保计算正确性。 [#72118](https://github.com/PaddlePaddle/Paddle/pull/72118), [#72223](https://github.com/PaddlePaddle/Paddle/pull/72223), [#73184](https://github.com/PaddlePaddle/Paddle/pull/73184), [#73237](https://github.com/PaddlePaddle/Paddle/pull/73237), [#73054](https://github.com/PaddlePaddle/Paddle/pull/73054) - -### 性能提升 - -- Faster Guard 适配:减少 SOT 端到端链路开销。 [#71900](https://github.com/PaddlePaddle/Paddle/pull/71900), [#71979](https://github.com/PaddlePaddle/Paddle/pull/71979), [#72081](https://github.com/PaddlePaddle/Paddle/pull/72081), [#72327](https://github.com/PaddlePaddle/Paddle/pull/72327), [#72564](https://github.com/PaddlePaddle/Paddle/pull/72564), [#72823](https://github.com/PaddlePaddle/Paddle/pull/72823) -- 性能优化与加速:优化算子调度策略。升级 Flash Attention 至 v3 版本,减少计算开销。修复模型性能瓶颈,提升推理与训练速度。 [#71937](https://github.com/PaddlePaddle/Paddle/pull/71937), [#71828](https://github.com/PaddlePaddle/Paddle/pull/71828), [#71461](https://github.com/PaddlePaddle/Paddle/pull/71461), [#72039](https://github.com/PaddlePaddle/Paddle/pull/72039), [#72228](https://github.com/PaddlePaddle/Paddle/pull/72228), 
[#72225](https://github.com/PaddlePaddle/Paddle/pull/72225), [#72623](https://github.com/PaddlePaddle/Paddle/pull/72623), [#72666](https://github.com/PaddlePaddle/Paddle/pull/72666), [#73147](https://github.com/PaddlePaddle/Paddle/pull/73147), [#73393](https://github.com/PaddlePaddle/Paddle/pull/73393) -- 并行计算:优化自动并行中的网格重分片策略,实现 Sharding Stage 的通信融合并优化逻辑,提升分布式训练稳定性,降低分布式训练通信开销。 [#71969](https://github.com/PaddlePaddle/Paddle/pull/71969), [#72120](https://github.com/PaddlePaddle/Paddle/pull/72120), [#73279](https://github.com/PaddlePaddle/Paddle/pull/73279), [#73406](https://github.com/PaddlePaddle/Paddle/pull/73406) +## 5. 硬件适配 +### 类CUDA硬件接入方案完善 +- 类CUDA硬件接入方案支持cuBLAS kernel的复用 [#74591](https://github.com/PaddlePaddle/Paddle/pull/74591) +- 类CUDA硬件接入方案已知问题修复 + [#74397](https://github.com/PaddlePaddle/Paddle/pull/74397), [#74411](https://github.com/PaddlePaddle/Paddle/pull/74411), [#74428](https://github.com/PaddlePaddle/Paddle/pull/74428), [#74877](https://github.com/PaddlePaddle/Paddle/pull/74877), [#74939](https://github.com/PaddlePaddle/Paddle/pull/74939) -功能增强与修复:- 优化算子索引和内核调度逻辑。 [#72625](https://github.com/PaddlePaddle/Paddle/pull/72625), [#72741](https://github.com/PaddlePaddle/Paddle/pull/72741), [#73082](https://github.com/PaddlePaddle/Paddle/pull/73082), [#73501](https://github.com/PaddlePaddle/Paddle/pull/73501) +### 主仓单测支持多硬件 +- 单测支持多硬件 [#74349](https://github.com/PaddlePaddle/Paddle/pull/74349), [#74363](https://github.com/PaddlePaddle/Paddle/pull/74363), [#74806](https://github.com/PaddlePaddle/Paddle/pull/74806), [#74868](https://github.com/PaddlePaddle/Paddle/pull/74868), [#74820](https://github.com/PaddlePaddle/Paddle/pull/74820), [#74927](https://github.com/PaddlePaddle/Paddle/pull/74927) -- 模型与操作支持:支持 NHWC 格式的深度卷积,适配更多硬件内存布局。 [#72121](https://github.com/PaddlePaddle/Paddle/pull/72121) - -## 7. 
硬件适配 - -优化硬件机制,提供类 cuda 硬件 kernel 复用方案。 - -### 新特性 - -以 customdevice 接入方案为基础,增加低成本支持类 cuda 后端硬件的支持方案。类 cuda 后端可以以插件式方式接入 paddle,低成本复用 paddle 中多数 nv 生态中的 cuda kernel,且可以与 paddle 框架中的特性 feature 升级解耦,大大降低硬件后端接入与迭代成本,提升用户接入意愿,形成 paddle 与硬件厂商共建生态的良好合作关系。 -[#72604](https://github.com/PaddlePaddle/Paddle/pull/72604)[#72668](https://github.com/PaddlePaddle/Paddle/pull/72668))[#72758](https://github.com/PaddlePaddle/Paddle/pull/72758)[#72865](https://github.com/PaddlePaddle/Paddle/pull/72865)[#72910](https://github.com/PaddlePaddle/Paddle/pull/72910)[#73033](https://github.com/PaddlePaddle/Paddle/pull/73033))[#73145](https://github.com/PaddlePaddle/Paddle/pull/73145)[#73281](https://github.com/PaddlePaddle/Paddle/pull/73281)[#73079](https://github.com/PaddlePaddle/Paddle/pull/73079) - -补充 XPU 基础能力:XPU 环境下增加 kernel ,扩展数据类型,补充分支 -[#71424](https://github.com/PaddlePaddle/Paddle/pull/71424)[#71809](https://github.com/PaddlePaddle/Paddle/pull/71809)[#71594](https://github.com/PaddlePaddle/Paddle/pull/71594)[#71779](https://github.com/PaddlePaddle/Paddle/pull/71779)[#71756](https://github.com/PaddlePaddle/Paddle/pull/71756)[#71573](https://github.com/PaddlePaddle/Paddle/pull/71573)[#71883](https://github.com/PaddlePaddle/Paddle/pull/71883)[#71954](https://github.com/PaddlePaddle/Paddle/pull/71954)[#71931](https://github.com/PaddlePaddle/Paddle/pull/71931)[#72280](https://github.com/PaddlePaddle/Paddle/pull/72280)[#72361](https://github.com/PaddlePaddle/Paddle/pull/72361)[#72406](https://github.com/PaddlePaddle/Paddle/pull/72406)[#72528](https://github.com/PaddlePaddle/Paddle/pull/72528)[#72752](https://github.com/PaddlePaddle/Paddle/pull/72752)[#72852](https://github.com/PaddlePaddle/Paddle/pull/72852)[#72982](https://github.com/PaddlePaddle/Paddle/pull/72982)[#73357](https://github.com/PaddlePaddle/Paddle/pull/73357)[#73414](https://github.com/PaddlePaddle/Paddle/pull/73414)[#73464](https://github.com/PaddlePaddle/Paddle/pull/73464)[#73234](https://github.com/PaddlePaddle/Paddle/
pull/73234)[#71776](https://github.com/PaddlePaddle/Paddle/pull/71776) - -DCU kernel 扩展数据类型 -[#73129](https://github.com/PaddlePaddle/Paddle/pull/73129) +### 新增Custom Device API支持 +- 新增Custom Device API支持 [#74308](https://github.com/PaddlePaddle/Paddle/pull/74308), [#74371](https://github.com/PaddlePaddle/Paddle/pull/74371), [#74539](https://github.com/PaddlePaddle/Paddle/pull/74539) +## 6. 安装环境 ### Bug 修复 -修复 xpu 执行问题 -[#71852](https://github.com/PaddlePaddle/Paddle/pull/71852)[#71966](https://github.com/PaddlePaddle/Paddle/pull/71966)[#72005](https://github.com/PaddlePaddle/Paddle/pull/72005)[#71908](https://github.com/PaddlePaddle/Paddle/pull/71908)[#72431](https://github.com/PaddlePaddle/Paddle/pull/72431)[#72519](https://github.com/PaddlePaddle/Paddle/pull/72519)[#72734](https://github.com/PaddlePaddle/Paddle/pull/72734)[#72763](https://github.com/PaddlePaddle/Paddle/pull/72763)[#72762](https://github.com/PaddlePaddle/Paddle/pull/72762)[#72890](https://github.com/PaddlePaddle/Paddle/pull/72890)[#72867](https://github.com/PaddlePaddle/Paddle/pull/72867)[#73071](https://github.com/PaddlePaddle/Paddle/pull/73071)[#73004](https://github.com/PaddlePaddle/Paddle/pull/73004)[#72726](https://github.com/PaddlePaddle/Paddle/pull/72726)[#73113](https://github.com/PaddlePaddle/Paddle/pull/73113)[#73127](https://github.com/PaddlePaddle/Paddle/pull/73127)[#73025](https://github.com/PaddlePaddle/Paddle/pull/73025)[#73301](https://github.com/PaddlePaddle/Paddle/pull/73301)[#73292](https://github.com/PaddlePaddle/Paddle/pull/73292)[#73272](https://github.com/PaddlePaddle/Paddle/pull/73272)[#73305](https://github.com/PaddlePaddle/Paddle/pull/73305)[#73356](https://github.com/PaddlePaddle/Paddle/pull/73356)[#73438](https://github.com/PaddlePaddle/Paddle/pull/73438)[#72041](https://github.com/PaddlePaddle/Paddle/pull/72041)[#72275](https://github.com/PaddlePaddle/Paddle/pull/72275)[#72787](https://github.com/PaddlePaddle/Paddle/pull/72787)[#73504](https://github.com/PaddlePaddle/P
addle/pull/73504)[#73290](https://github.com/PaddlePaddle/Paddle/pull/73290) - -## 8. 安装环境适配 - -优化了框架的稳定性和跨平台兼容性,修复了不同平台上的编译安装失败问题;升级 CUDA 等关键依赖,进一步优化 CI/CD 流程,提升构建速度并增强系统整体稳定性;停止对 Python3.8 环境下的编译安装维护。 - -### Bug 修复 +- 修复FlashAttention编译缓存的bug。[#74388](https://github.com/PaddlePaddle/Paddle/pull/74388) +- 修复site.USER_SITE为None的bug。 [#74373](https://github.com/PaddlePaddle/Paddle/pull/74373) +- 修复多架构 Linux 系统下gtest的编译bug。 [#74723](https://github.com/PaddlePaddle/Paddle/pull/74723) +- 修复在 WITH_GPU=ON 情况下 DEBUG 模式编译时的多个报错。 [#74401](https://github.com/PaddlePaddle/Paddle/pull/74401) +- 修复Windows下CUDA12.6编译bug。 [#74990](https://github.com/PaddlePaddle/Paddle/pull/74990) +- 修复api-benchmark基线流水线bug。 [#74770](https://github.com/PaddlePaddle/Paddle/pull/74770) +- 修复api-benchmark基线流水线bug。 [#74778](https://github.com/PaddlePaddle/Paddle/pull/74778) +- 修复api-benchmark基线流水线bug。 [#74779](https://github.com/PaddlePaddle/Paddle/pull/74779) +- 修复api-benchmark基线流水线bug。 [#74780](https://github.com/PaddlePaddle/Paddle/pull/74780) +- 修复api-benchmark基线流水线bug。 [#74800](https://github.com/PaddlePaddle/Paddle/pull/74800) +- 修复api-benchmark基线流水线bug。 [#74803](https://github.com/PaddlePaddle/Paddle/pull/74803) -- 修复使用 clang17 编译第三方库时的编译错误。[#72524](https://github.com/PaddlePaddle/Paddle/pull/72524) -- 修复使用 CUDA12.9 时的编译问题。 [#72808](https://github.com/PaddlePaddle/Paddle/pull/72808), [#72841](https://github.com/PaddlePaddle/Paddle/pull/72841), [#72978](https://github.com/PaddlePaddle/Paddle/pull/72978), [#73360](https://github.com/PaddlePaddle/Paddle/pull/73360) -- 修复使用 GCC13.3 时的编译问题。[#73144](https://github.com/PaddlePaddle/Paddle/pull/73144) -- 修复 WITH_PIP_CUDA_LIBRARIES=ON 时的编译问题。[#72907](https://github.com/PaddlePaddle/Paddle/pull/72907) -- 修复 WITH_NVSHMEM=ON 时的编译问题。[#73368](https://github.com/PaddlePaddle/Paddle/pull/73368) - -### 功能增强 - -- 避免自定义算子编译产生的临时文件的拷贝。[#73196](https://github.com/PaddlePaddle/Paddle/pull/73196) -- Warning 信息优化。[#72877](https://github.com/PaddlePaddle/Paddle/pull/72877) 
- -### 开发者相关 - -- 编译安装维护与升级。[#71911](https://github.com/PaddlePaddle/Paddle/pull/71911), [#73005](https://github.com/PaddlePaddle/Paddle/pull/73005) -- 镜像维护与更新。[#71065](https://github.com/PaddlePaddle/Paddle/pull/71065), [#71821](https://github.com/PaddlePaddle/Paddle/pull/71821) -- Windows 平台符号的导入导出更新。[#72497](https://github.com/PaddlePaddle/Paddle/pull/72497), [#72498](https://github.com/PaddlePaddle/Paddle/pull/72498), [#72500](https://github.com/PaddlePaddle/Paddle/pull/72500) -- Windows 平台支持 CUDA12.8。[#72433](https://github.com/PaddlePaddle/Paddle/pull/72433) -- CI 维护与升级。[#72443](https://github.com/PaddlePaddle/Paddle/pull/72443), [#72836](https://github.com/PaddlePaddle/Paddle/pull/72836), [#72563](https://github.com/PaddlePaddle/Paddle/pull/72563), [#72653](https://github.com/PaddlePaddle/Paddle/pull/72653), [#72477](https://github.com/PaddlePaddle/Paddle/pull/72477), [#72778](https://github.com/PaddlePaddle/Paddle/pull/72778), [#72960](https://github.com/PaddlePaddle/Paddle/pull/72960), [#73289](https://github.com/PaddlePaddle/Paddle/pull/73289), [#73422](https://github.com/PaddlePaddle/Paddle/pull/73422), [#73514](https://github.com/PaddlePaddle/Paddle/pull/73514), [#72748](https://github.com/PaddlePaddle/Paddle/pull/72748), -- Github Action CI 建设。[#71738](https://github.com/PaddlePaddle/Paddle/pull/71738), [#70602](https://github.com/PaddlePaddle/Paddle/pull/70602), [#71958](https://github.com/PaddlePaddle/Paddle/pull/71958), [#71959](https://github.com/PaddlePaddle/Paddle/pull/71959), [#71992](https://github.com/PaddlePaddle/Paddle/pull/71992), [#72013](https://github.com/PaddlePaddle/Paddle/pull/72013), [#72153](https://github.com/PaddlePaddle/Paddle/pull/72153), [#72031](https://github.com/PaddlePaddle/Paddle/pull/72031), [#72141](https://github.com/PaddlePaddle/Paddle/pull/72141), [#72104](https://github.com/PaddlePaddle/Paddle/pull/72104), [#72182](https://github.com/PaddlePaddle/Paddle/pull/72182), 
[#72342](https://github.com/PaddlePaddle/Paddle/pull/72342), [#72352](https://github.com/PaddlePaddle/Paddle/pull/72352), [#72249](https://github.com/PaddlePaddle/Paddle/pull/72249), [#72068](https://github.com/PaddlePaddle/Paddle/pull/72068), [#72441](https://github.com/PaddlePaddle/Paddle/pull/72441), [#72392](https://github.com/PaddlePaddle/Paddle/pull/72392), [#72446](https://github.com/PaddlePaddle/Paddle/pull/72446), [#72435](https://github.com/PaddlePaddle/Paddle/pull/72435), [#72515](https://github.com/PaddlePaddle/Paddle/pull/72515), [#72514](https://github.com/PaddlePaddle/Paddle/pull/72514), [#72396](https://github.com/PaddlePaddle/Paddle/pull/72396), [#72547](https://github.com/PaddlePaddle/Paddle/pull/72547), [#72345](https://github.com/PaddlePaddle/Paddle/pull/72345), [#72236](https://github.com/PaddlePaddle/Paddle/pull/72236), [#72586](https://github.com/PaddlePaddle/Paddle/pull/72586), [#72537](https://github.com/PaddlePaddle/Paddle/pull/72537), [#72609](https://github.com/PaddlePaddle/Paddle/pull/72609), [#72632](https://github.com/PaddlePaddle/Paddle/pull/72632), [#72642](https://github.com/PaddlePaddle/Paddle/pull/72642), [#72673](https://github.com/PaddlePaddle/Paddle/pull/72673), [#72647](https://github.com/PaddlePaddle/Paddle/pull/72647), [#72696](https://github.com/PaddlePaddle/Paddle/pull/72696), [#72771](https://github.com/PaddlePaddle/Paddle/pull/72771), [#72711](https://github.com/PaddlePaddle/Paddle/pull/72711), [#72680](https://github.com/PaddlePaddle/Paddle/pull/72680), [#72774](https://github.com/PaddlePaddle/Paddle/pull/72774), [#72813](https://github.com/PaddlePaddle/Paddle/pull/72813), [#72804](https://github.com/PaddlePaddle/Paddle/pull/72804), [#72903](https://github.com/PaddlePaddle/Paddle/pull/72903), [#72900](https://github.com/PaddlePaddle/Paddle/pull/72900), [#72932](https://github.com/PaddlePaddle/Paddle/pull/72932), [#72967](https://github.com/PaddlePaddle/Paddle/pull/72967), 
[#72991](https://github.com/PaddlePaddle/Paddle/pull/72991), [#72115](https://github.com/PaddlePaddle/Paddle/pull/72115), [#73242](https://github.com/PaddlePaddle/Paddle/pull/73242), [#72801](https://github.com/PaddlePaddle/Paddle/pull/72801), [#73433](https://github.com/PaddlePaddle/Paddle/pull/73433), [#73391](https://github.com/PaddlePaddle/Paddle/pull/73391), [#73456](https://github.com/PaddlePaddle/Paddle/pull/73456), [#73376](https://github.com/PaddlePaddle/Paddle/pull/73376), [#73453](https://github.com/PaddlePaddle/Paddle/pull/73453), [#73481](https://github.com/PaddlePaddle/Paddle/pull/73481), [#73546](https://github.com/PaddlePaddle/Paddle/pull/73546), [#73446](https://github.com/PaddlePaddle/Paddle/pull/73446), [#72744](https://github.com/PaddlePaddle/Paddle/pull/72744) - -### 废弃 - -- 停止支持 Python3.8 环境下的编译。[#72827](https://github.com/PaddlePaddle/Paddle/pull/72827) - -## 9. 贡献者名单 +### 其他 -0x3878f, A-nnonymous, AndSonder, ApricityXX, aquagull, author, baoqiwen, BeingGod, blacksheep-Aristotle, BoShen5, bukejiyu, cangtianhuang, carryyu, chang-wenbin, changeyoung98, chen2016013, ckl117, co63oc, cqulilujia, crashbussy, cszdrg, Cutelemon6, cyy536, DanielSun11, danleifeng, datutu-L, deepllz, Dmovic, DrRyanHuang, dynamicheart, Eddie-Wang1120, eggman-1024, emmanuel-ferdman, Enigmatisms, enkilee, fangfangssj, feixi21, FeixLiu, ForFishes, Function-Samuel, ggggxm, GITD245, Glencsa, GoldenStain, gongshaotian, gouzil, gzy19990617, hanlintang, Hongqing-work, houj04, huangjiyi, hxzd5568, HydrogenSulfate, jzhang533, LCStayingdullCircuit, leon062112, lifulll, linkk08, LittleHeroZZZX, liufengwei0103, Liujie0926, liuruyan, lixinqi, LiYuRio, lizexu123, lizhenyun01, lj970926, lshpku, megemini, mikethegoblin, ming1753, mzj104, NKNaN, ooooo-create, pesionzhao, phlrain, pkuzyc, PolaKuma, Qin-sx, RichardWooSJTU, risemeup1, runzhech, RuohengMa, sasaya123, shanjiang7, SigureMo, sneaxiy, swgu98, SylarTiaNII, tianhaodongbd, tianshuo78520a, timminator, tizhou86, umiswing, waliwali777, 
wanghuancoder, Waynezee, Wennie396, xiaoguoguo626807, XieYunshen, Xing-lil, xkkkkkk23, Xreki, xuxinyi389, Yeenyeong, yongqiangma, YqGe585, yuanlehome, YuanRisheng, yulangz, yuwu46, zeroRains, zhangbo9674, zhanghonggeng, zhangting2020, ZhangX-21, zhangyk0314, zhangyuqin1998, zhink, zhiqiu, zhouquan32, zhoutianzi666, zhupengyang, zrr1999, zty-king, zyfncg +### 其他 + +- 禁用test_custom_contiguous单测。 [#74337](https://github.com/PaddlePaddle/Paddle/pull/74337) +- 支持录制slice流水线基线任务定时触发。 [#74419](https://github.com/PaddlePaddle/Paddle/pull/74419) +- 支持slice录基线添加手动指定PR。 [#74445](https://github.com/PaddlePaddle/Paddle/pull/74445) +- 检查代码中是否带有中文。 [#74460](https://github.com/PaddlePaddle/Paddle/pull/74460) +- 支持CI PaddleX在XPU上的任务。 [#74426](https://github.com/PaddlePaddle/Paddle/pull/74426) +- 支持slice流水线豁免机制。 [#74482](https://github.com/PaddlePaddle/Paddle/pull/74482) +- 更新paddle基础镜像。 [#73423](https://github.com/PaddlePaddle/Paddle/pull/73423) +- Windows 固定ninja版本1.11。 [#74590](https://github.com/PaddlePaddle/Paddle/pull/74590) +- 支持添加关闭PR取消CI。 [#74604](https://github.com/PaddlePaddle/Paddle/pull/74604) +- 支持快速跳过所有CI。 [#74696](https://github.com/PaddlePaddle/Paddle/pull/74696) +- 增加api-benchmark基线流水线。 [#74690](https://github.com/PaddlePaddle/Paddle/pull/74690) +- 更新NCCL版本。 [#74809](https://github.com/PaddlePaddle/Paddle/pull/74809) +- 更新approve流水线RD名单。 [#74838](https://github.com/PaddlePaddle/Paddle/pull/74838) +- 更新approve流水线RD名单。 [#74902](https://github.com/PaddlePaddle/Paddle/pull/74902) +- 更新safetensors到镜像中。 [#74904](https://github.com/PaddlePaddle/Paddle/pull/74904) +- 添加FlashAttention的编译flag。 [#74959](https://github.com/PaddlePaddle/Paddle/pull/74959) +- 临时禁用win-inference流水线。 [#74980](https://github.com/PaddlePaddle/Paddle/pull/74980) +- 支持Windows编译phi动态库。 [#74950](https://github.com/PaddlePaddle/Paddle/pull/74950) + +## 7. 
Contributors +AIbin, Ayakouji, baiyue, baoqiwen, Chang Lu, Chen Zhiyang, co63oc, cyberslack_lee, cyy536, datutu-L, Deng Haodong, Difer, Eddie-Wang, enzodechine, fangfangssj, feri, fxyfxy777, ggggxm, GoldPancake, gouzil, Gu Shiwei, Haze188 灏喆, hohdiy, hong, HU Shenwei, huangjiyi, HydrogenSulfate, kjagsdq, LCStayingdullCircuit, Leo Guo, lightbrother, liufengwei0103, liuruyan, LiYuRio, LLSGYN, Lucas, Luckycheng222, lzy, Nana, Nyakku Shigure, ooo oo, Qianyue He, risemeup1, Ruibiao Chen, Ryan, Shuhao Liang, sneaxiy, Starrysea996, SUN Dong, Tao Luo, Tian, tianhaodongbd, tianshuo78520a, umiswing, waliwali777, wanghuancoder, Wenhao.Dai, wyw, XiaoguangHu, xiaoguoguo626807, xingmingyyj, Yichen Zhang, Yohanna, yongqiangma, Yuan Xiaolan, YUNSHEN XIE, Yuntao Nie, Yuqiang Ge, Yutian Rao, Zero Rains, Zhan Rongrui, Zhang Ting, zhanghonggeng, Zhaowu Pan, zhengshengning, ZhenxingLi, Zhou Xin, zhupengyang, zhwesky2010, Zichao, zty-king, Zx, zyfncg, zzm, 周周周, 正在学习, 苍天荒 \ No newline at end of file diff --git a/docs/release_note_en.md b/docs/release_note_en.md index e5b53007587..31149c8acb0 100644 --- a/docs/release_note_en.md +++ b/docs/release_note_en.md @@ -1,292 +1,211 @@ # 3.1 Release Note -PaddlePaddle framework version 3.1 has undergone further optimization and polishing for its core function of automatic parallelism, enhancing usability and performance. It also provides support for FP8 low-precision training, improving the training speed of large language models by 10-20%. The hardware expansion mechanism has been improved, reducing the cost of adapting to hardware similar to CUDA. Users only need to register the kernel. At the same time, the basic capabilities of the framework have been enhanced to improve its stability. The key updated features are as follows: - -- **Automatic Parallel Architecture:** The automatic parallel architecture has undergone further refinement to enhance the usability of the automatic parallel core mechanism and improve dynamic graph performance. 
The automatic parallel core mechanism has been improved, including the addition of multiple operator slicing derivation rules, support for distributed tensors being sliced along the same dimension by multiple mesh dimensions, and support for dynamic graph parallel strategies (PP, CP, SEP, TP-CONV), among others. Meanwhile, performance optimizations have been systematically implemented for the automatic parallel system of dynamic graphs, achieving performance that is essentially on par with manual parallelism on models such as Llama2, Qwen Baichuan, and others. -- **Low-precision training:** Based on the blockwise fp8 gemm operator, it supports low-precision training, achieving training accuracy comparable to BF16, and speeding up large model training by 10-20%. -- **Heterogeneous multi-core adaptation:** Provides a CUDA-like operator reuse mechanism, where users can simply register to use the corresponding kernel. -- **Framework stability enhancement:** The system has fixed the calculation errors of operators in the cases of 0-Size and large dimensions. - -## 1. User experience - -API enhancements, bug fixes, and improvements are aimed at enhancing user experience and improving the usability of the API. The `paddle.randn_like` API has been added, multiple functional defects in APIs have been fixed, and support for complex types and 0-Size Tensor has been enhanced. Documentation and code have also been updated and optimized accordingly to improve overall accuracy and professionalism. - -### New Features - -- Added `paddle.randn_like` API. [#72492](https://github.com/PaddlePaddle/Paddle/pull/72492) - -### Bug Fixes - -- Fixed the issue of inconsistent input and output types in the `tensordot` API. [#72139](https://github.com/PaddlePaddle/Paddle/pull/72139) -- Fixed the issue where the output of the `atleast` API was a Tensor list. [#73102](https://github.com/PaddlePaddle/Paddle/pull/73102) -- Fixed the issue with the `nonzer` API. 
[#72003](https://github.com/PaddlePaddle/Paddle/pull/72003) -- Fixed the memory leak issue in `dualpipev`. [#72070](https://github.com/PaddlePaddle/Paddle/pull/72070) -- Fixed the overflow issue in `softmax` calculation. [#71935](https://github.com/PaddlePaddle/Paddle/pull/71935) -- Fixed the shape checking issue in `take_along_axis` when `broadcast=False`. [#72436](https://github.com/PaddlePaddle/Paddle/pull/72436) -- Fixed the incorrect handling of Nan input in `maximum` and `minimum` functions. [#71933](https://github.com/PaddlePaddle/Paddle/pull/71933) -- Fixed the issue with `visit_type`. [#72782](https://github.com/PaddlePaddle/Paddle/pull/72782) -- Fixed the int32 out-of-bounds issue in `gather_scatter_functor`. [#72905](https://github.com/PaddlePaddle/Paddle/pull/72905) -- Fixed the inplace implementation of `Bernoulli` in PaddlePaddle. [#73271](https://github.com/PaddlePaddle/Paddle/pull/73271) -- Fixed issues with `moe_permute` and `moe_unpermute`. [#73365](https://github.com/PaddlePaddle/Paddle/pull/73365) -- Fixed the syntax checking issue of `ast.parse` for pyi files. [#71872](https://github.com/PaddlePaddle/Paddle/pull/71872) -- Fixed the issue of complex division. [#73331](https://github.com/PaddlePaddle/Paddle/pull/73331) -- Fixed issues related to TensorRT integration. [#72302](https://github.com/PaddlePaddle/Paddle/pull/72302), [#72278](https://github.com/PaddlePaddle/Paddle/pull/72278) - -### Improvements - -- Enhance the functionality of the API, improve its usability, and enhance the user experience. This includes but is not limited to expanding the data types supported by the API, checking API parameters, correcting default values of API parameters, and refining API return values. 
[#71997](https://github.com/PaddlePaddle/Paddle/pull/71997), [#72911](https://github.com/PaddlePaddle/Paddle/pull/72911), [#72985](https://github.com/PaddlePaddle/Paddle/pull/72985), [#73240](https://github.com/PaddlePaddle/Paddle/pull/73240), [#72927](https://github.com/PaddlePaddle/Paddle/pull/72927), [#73451](https://github.com/PaddlePaddle/Paddle/pull/73451), [#73416](https://github.com/PaddlePaddle/Paddle/pull/73416), [#73420](https://github.com/PaddlePaddle/Paddle/pull/73420), [#73347](https://github.com/PaddlePaddle/Paddle/pull/73347), [#73050](https://github.com/PaddlePaddle/Paddle/pull/73050), [#73246](https://github.com/PaddlePaddle/Paddle/pull/73246), [#73123](https://github.com/PaddlePaddle/Paddle/pull/73123), [#73336](https://github.com/PaddlePaddle/Paddle/pull/73336), [#73062](https://github.com/PaddlePaddle/Paddle/pull/73062), [#72201](https://github.com/PaddlePaddle/Paddle/pull/72201), [#72190](https://github.com/PaddlePaddle/Paddle/pull/72190) -- Enhanced API support for complex types. [#72279](https://github.com/PaddlePaddle/Paddle/pull/72279), [#72308](https://github.com/PaddlePaddle/Paddle/pull/72308), [#72518](https://github.com/PaddlePaddle/Paddle/pull/72518), [#72391](https://github.com/PaddlePaddle/Paddle/pull/72391), [#72239](https://github.com/PaddlePaddle/Paddle/pull/72239), [#72286](https://github.com/PaddlePaddle/Paddle/pull/72286), [#72169](https://github.com/PaddlePaddle/Paddle/pull/72169), [#72577](https://github.com/PaddlePaddle/Paddle/pull/72577), [#72619](https://github.com/PaddlePaddle/Paddle/pull/72619) -- Enhanced API support for 0-Size Tensor. 
[#72570](https://github.com/PaddlePaddle/Paddle/pull/72570), [#72692](https://github.com/PaddlePaddle/Paddle/pull/72692), [#72138](https://github.com/PaddlePaddle/Paddle/pull/72138), [#72410](https://github.com/PaddlePaddle/Paddle/pull/72410), [#72565](https://github.com/PaddlePaddle/Paddle/pull/72565), [#72262](https://github.com/PaddlePaddle/Paddle/pull/72262) -- Correct spelling errors in the API code to enhance overall accuracy and professionalism. [#71780](https://github.com/PaddlePaddle/Paddle/pull/71780), [#71786](https://github.com/PaddlePaddle/Paddle/pull/71786), [#72093](https://github.com/PaddlePaddle/Paddle/pull/72093), [#72113](https://github.com/PaddlePaddle/Paddle/pull/72113), [#72241](https://github.com/PaddlePaddle/Paddle/pull/72241), [#72237](https://github.com/PaddlePaddle/Paddle/pull/72237), [#72590](https://github.com/PaddlePaddle/Paddle/pull/72590), [#72591](https://github.com/PaddlePaddle/Paddle/pull/72591), [#72769](https://github.com/PaddlePaddle/Paddle/pull/72769), [#72858](https://github.com/PaddlePaddle/Paddle/pull/72858), [#73045](https://github.com/PaddlePaddle/Paddle/pull/73045), [#72195](https://github.com/PaddlePaddle/Paddle/pull/72195), [#72627](https://github.com/PaddlePaddle/Paddle/pull/72627), [#72657](https://github.com/PaddlePaddle/Paddle/pull/72657), [#73162](https://github.com/PaddlePaddle/Paddle/pull/73162), [#73402](https://github.com/PaddlePaddle/Paddle/pull/73402), [#72208](https://github.com/PaddlePaddle/Paddle/pull/72208), [#72659](https://github.com/PaddlePaddle/Paddle/pull/72659), [#72658](https://github.com/PaddlePaddle/Paddle/pull/72658), [#72660](https://github.com/PaddlePaddle/Paddle/pull/72660), [#72661](https://github.com/PaddlePaddle/Paddle/pull/72661), [#72656](https://github.com/PaddlePaddle/Paddle/pull/72656) -- Communication optimization reduces peak memory usage. 
[#72035](https://github.com/PaddlePaddle/Paddle/pull/72035) - -### Docs - -- Fixed errors in the documentation, improving its usability and user experience. [#72549](https://github.com/PaddlePaddle/Paddle/pull/72549), [#73036](https://github.com/PaddlePaddle/Paddle/pull/73036) - -### Devs - -- Updates to code style check rules. [#72896](https://github.com/PaddlePaddle/Paddle/pull/72896), [#73179](https://github.com/PaddlePaddle/Paddle/pull/73179), [#73060](https://github.com/PaddlePaddle/Paddle/pull/73060), [#72553](https://github.com/PaddlePaddle/Paddle/pull/72553), [#72915](https://github.com/PaddlePaddle/Paddle/pull/72915), [#72916](https://github.com/PaddlePaddle/Paddle/pull/72916), [#73338](https://github.com/PaddlePaddle/Paddle/pull/73338), [#72935](https://github.com/PaddlePaddle/Paddle/pull/72935), [#72325](https://github.com/PaddlePaddle/Paddle/pull/72325), [#72935](https://github.com/PaddlePaddle/Paddle/pull/72935) -- Code variable naming updates and code migration. [#73048](https://github.com/PaddlePaddle/Paddle/pull/73048), [#73148](https://github.com/PaddlePaddle/Paddle/pull/73148), [#73149](https://github.com/PaddlePaddle/Paddle/pull/73149), [#73264](https://github.com/PaddlePaddle/Paddle/pull/73264), [#73159](https://github.com/PaddlePaddle/Paddle/pull/73159), [#73124](https://github.com/PaddlePaddle/Paddle/pull/73124), [#73160](https://github.com/PaddlePaddle/Paddle/pull/73160), [#73161](https://github.com/PaddlePaddle/Paddle/pull/73161), [#73374](https://github.com/PaddlePaddle/Paddle/pull/73374), [#73395](https://github.com/PaddlePaddle/Paddle/pull/73395), [#73076](https://github.com/PaddlePaddle/Paddle/pull/73076), [#73163](https://github.com/PaddlePaddle/Paddle/pull/73163), [#73255](https://github.com/PaddlePaddle/Paddle/pull/73255) -- LodTensor is being phased out. 
[#71968](https://github.com/PaddlePaddle/Paddle/pull/71968), [#72152](https://github.com/PaddlePaddle/Paddle/pull/72152), [#72145](https://github.com/PaddlePaddle/Paddle/pull/72145) - -### Deprecations - -- Cleaned up useless code. [#71795](https://github.com/PaddlePaddle/Paddle/pull/71795), [#71792](https://github.com/PaddlePaddle/Paddle/pull/71792), [#71794](https://github.com/PaddlePaddle/Paddle/pull/71794), [#71793](https://github.com/PaddlePaddle/Paddle/pull/71793), [#72265](https://github.com/PaddlePaddle/Paddle/pull/72265), [#73167](https://github.com/PaddlePaddle/Paddle/pull/73167), [#73115](https://github.com/PaddlePaddle/Paddle/pull/73115), [#73049](https://github.com/PaddlePaddle/Paddle/pull/73049), [#72162](https://github.com/PaddlePaddle/Paddle/pull/72162), [#72321](https://github.com/PaddlePaddle/Paddle/pull/72321), [#72336](https://github.com/PaddlePaddle/Paddle/pull/72336), [#72952](https://github.com/PaddlePaddle/Paddle/pull/72952), [#72828](https://github.com/PaddlePaddle/Paddle/pull/72828) - -## 2. Execution architecture - -Supports FP8 matrix operations, enhances model training efficiency, and simultaneously enhances multiple models to improve stability; provides a C_ops-style interface for calling the inverse, facilitating memory optimization and functional experimentation. - -### New Features - -- Support FP8 matrix multiplication acceleration to enhance computational performance and precision adaptability. [#73092](https://github.com/PaddlePaddle/Paddle/pull/73092) -- Support for 0-size Tensor execution. [#71829](https://github.com/PaddlePaddle/Paddle/pull/71829), [#72263](https://github.com/PaddlePaddle/Paddle/pull/72263), [#72244](https://github.com/PaddlePaddle/Paddle/pull/72244), [#72814](https://github.com/PaddlePaddle/Paddle/pull/72814) -- Support for DeepEP. [#73495](https://github.com/PaddlePaddle/Paddle/pull/73495) -- The CINN backend is enabled by default. 
[#71838](https://github.com/PaddlePaddle/Paddle/pull/71838) -- Support for SOT-related execution. [#72472](https://github.com/PaddlePaddle/Paddle/pull/72472), [#72559](https://github.com/PaddlePaddle/Paddle/pull/72559), [#72466](https://github.com/PaddlePaddle/Paddle/pull/72466), [#73269](https://github.com/PaddlePaddle/Paddle/pull/73269), [#73329](https://github.com/PaddlePaddle/Paddle/pull/73329), [#73405](https://github.com/PaddlePaddle/Paddle/pull/73405), [#73399](https://github.com/PaddlePaddle/Paddle/pull/73399), [#73424](https://github.com/PaddlePaddle/Paddle/pull/73424), [#73509](https://github.com/PaddlePaddle/Paddle/pull/73509) -- Support for converting dynamic to static. [#73417](https://github.com/PaddlePaddle/Paddle/pull/73417), [#73081](https://github.com/PaddlePaddle/Paddle/pull/73081) -- Added support for kernels with the stride mechanism. [#73053](https://github.com/PaddlePaddle/Paddle/pull/73053) +PaddlePaddle 3.2 further improves large model training and inference performance, hardware adaptation, and support for mainstream large models and high-performance acceleration libraries. + +- For large model training, the framework has been upgraded in three areas: computation, parallel strategy, and fault tolerance. +  - Basic computational performance: FlashMask V3, a sparse-mask attention computation that overlaps memory access with computation, is introduced to maximize attention efficiency, along with an efficient FP8 mixed-precision training technique that preserves training quality. +  - Distributed parallel strategy: a dynamically adaptive GPU memory offloading strategy balances memory and computation, and a newly designed memory-friendly pipeline-parallel schedule further reduces GPU memory overhead. 
+  - Fault tolerance: the framework's native fault tolerance has been strengthened with a fault-tolerance system for large-scale cluster training that detects silent data corruption and other hard-to-notice faults online without affecting training efficiency, and with a highly available checkpoint disaster-recovery method that reduces the cost of recovering from interruptions. +- For hardware adaptation, the plug-in adaptation solution for CUDA-like chips has been comprehensively upgraded. +  - Device resource management and scheduling, and the high-performance collective communication library: management interfaces have been upgraded and communication capabilities enhanced for CUDA-like chips; in particular, distributed communication has been strengthened so that XCCL aligns with the structures and functions of NCCL. +  - A registration mechanism for CUDA-like operators has been added. Taking the MetaX (Muxi) adaptation as an example, an operator kernel can be registered with a single line of code by reusing the GPU operator kernel. According to statistics, the operator kernel reuse rate can reach up to 92%, which significantly reduces hardware adaptation costs. +- For user experience, the focus is on compatibility: development interfaces compatible with common industry usage, compatibility with the Safetensors model format, and compatibility with third-party high-performance acceleration libraries. +  - Development interfaces have been added and modified to match common industry usage, including a series of new APIs and API aliases, new parameter aliases, and new API-specific and common parameters. +  - Full compatibility with the Safetensors model format. 
The newly added FlexCheckpoint mechanism supports automatic parameter resharding across distributed strategies and model structures, which significantly reduces the cost of weight conversion and improves end-to-end development efficiency for large model training and inference. +  - Interface compatibility and operator registration have been systematically enhanced so that high-performance acceleration libraries can be imported with one click and reused directly in PaddlePaddle model training and inference acceleration without code changes. + +## 1. User experience + +### New features +- New APIs: `paddle.msort`, `paddle.ravel`, `paddle.nn.functional.dropout1d`, `paddle.Tensor.type_as`, `paddle.Tensor.requires_grad`, `paddle.view_as_complex`, `paddle.view_as_real`, `paddle.nn.Parameter`, `paddle.broadcast_shapes`, `paddle.range`, `paddle.as_tensor`, `paddle.scatter_reduce/scatter_reduce_`, `paddle.scatter_add`, `paddle.tensor`, `paddle.softmax`, `paddle.Tensor.softmax`, `paddle.rand_like`, `paddle.is_autocast_enabled`, `paddle.get_autocast_gpu_dtype`, `paddle.Tensor.repeat`, `paddle.permute`. 
[#74421](https://github.com/PaddlePaddle/Paddle/pull/74421), [#74439](https://github.com/PaddlePaddle/Paddle/pull/74439), [#74444](https://github.com/PaddlePaddle/Paddle/pull/74444), [#74454](https://github.com/PaddlePaddle/Paddle/pull/74454), [#74459](https://github.com/PaddlePaddle/Paddle/pull/74459), [#74491](https://github.com/PaddlePaddle/Paddle/pull/74491), [#74466](https://github.com/PaddlePaddle/Paddle/pull/74466), [#74438](https://github.com/PaddlePaddle/Paddle/pull/74438), [#74594](https://github.com/PaddlePaddle/Paddle/pull/74594), [#74542](https://github.com/PaddlePaddle/Paddle/pull/74542), [#74694](https://github.com/PaddlePaddle/Paddle/pull/74694), [#74564](https://github.com/PaddlePaddle/Paddle/pull/74564), [#74540](https://github.com/PaddlePaddle/Paddle/pull/74540), [#74586](https://github.com/PaddlePaddle/Paddle/pull/74586), [#74651](https://github.com/PaddlePaddle/Paddle/pull/74651), [#74807](https://github.com/PaddlePaddle/Paddle/pull/74807), [#74632](https://github.com/PaddlePaddle/Paddle/pull/74632), [#74834](https://github.com/PaddlePaddle/Paddle/pull/74834), [#74952](https://github.com/PaddlePaddle/Paddle/pull/74952), [#74772](https://github.com/PaddlePaddle/Paddle/pull/74772), [#74441](https://github.com/PaddlePaddle/Paddle/pull/74441), [#74561](https://github.com/PaddlePaddle/Paddle/pull/74561), [#74525](https://github.com/PaddlePaddle/Paddle/pull/74525) +- Added a series of APIs under `paddle.compat.*` to support common usage in the industry and facilitate code migration, including `paddle.compat.median`, `paddle.compat.nanmedian`, `paddle.compat.softmax`, `paddle.compat.sort`, `paddle.compat.split`, `paddle.compat.min/max`, and `paddle.compat.Unfold`. 
[#74865](https://github.com/PaddlePaddle/Paddle/pull/74865), [#74874](https://github.com/PaddlePaddle/Paddle/pull/74874) +- Added a series of initialization APIs to support commonly used parameter initialization methods in the industry, including `paddle.nn.init.kaiming_uniform_`, `paddle.nn.init.xavier_uniform_`, `paddle.nn.init.uniform_`, `paddle.nn.init.kaiming_normal_`, `paddle.nn.init.xavier_normal_`, `paddle.nn.init.normal_`, `paddle.nn.init.calculate_gain`, `paddle.nn.init.constant_`, `paddle.nn.init.dirac_`, `paddle.nn.init.eye_`, `paddle.nn.init.ones_`, `paddle.nn.init.orthogonal_`, `paddle.nn.init.trunc_normal_`, and `paddle.nn.init.zeros_`. [#74478](https://github.com/PaddlePaddle/Paddle/pull/74478) +- Added parameter aliases to APIs so that an argument can be passed under either name, for example `x` or `input`. Affected APIs include `paddle.maximum`, `paddle.minimum`, `paddle.sqrt`, `paddle.topk`, `paddle.polar`, `paddle.stack`, `paddle.cos`, `paddle.floor`, `paddle.log`, `paddle.pow`, `paddle.rsqrt`, `paddle.sign`, `paddle.sin`, `paddle.multiply`, and `paddle.where`. [#74683](https://github.com/PaddlePaddle/Paddle/pull/74683), [#74795](https://github.com/PaddlePaddle/Paddle/pull/74795), [#74887](https://github.com/PaddlePaddle/Paddle/pull/74887), [#74592](https://github.com/PaddlePaddle/Paddle/pull/74592) +- `paddle.Tensor` now supports multiple initialization methods, enabling flexible Tensor creation. [#74619](https://github.com/PaddlePaddle/Paddle/pull/74619), [#75022](https://github.com/PaddlePaddle/Paddle/pull/75022), [#75065](https://github.com/PaddlePaddle/Paddle/pull/75065) +- Added API-specific parameters to extend existing functionality, including `paddle.nn.functional.gelu`, `paddle.divide/div/div_`, `paddle.add`, `paddle.Tensor.copy_`, `paddle.norm`, `paddle.linalg.norm`, `paddle.nn.functional.silu`, and `paddle.repeat_interleave`. 
[#74485](https://github.com/PaddlePaddle/Paddle/pull/74485), [#74562](https://github.com/PaddlePaddle/Paddle/pull/74562), [#74420](https://github.com/PaddlePaddle/Paddle/pull/74420), [#74768](https://github.com/PaddlePaddle/Paddle/pull/74768), [#74855](https://github.com/PaddlePaddle/Paddle/pull/74855), [#74903](https://github.com/PaddlePaddle/Paddle/pull/74903), [#74788](https://github.com/PaddlePaddle/Paddle/pull/74788), [#74631](https://github.com/PaddlePaddle/Paddle/pull/74631), [#74947](https://github.com/PaddlePaddle/Paddle/pull/74947) +- Added common parameters (`out`, `device`, `dtype`, `requires_grad`, `pin_memory`, and `bias`) to extend existing functionality. Affected APIs include `paddle.zeros`, `paddle.zeros_like`, `paddle.ones`, `paddle.ones_like`, `paddle.arange`, `paddle.eye`, `paddle.empty`, `paddle.empty_like`, `paddle.full`, `paddle.full_like`, `paddle.randn`, `paddle.Tensor.new_full`, `paddle.Tensor.new_empty`, `paddle.Tensor.new_ones`, `paddle.Tensor.new_zeros`, `paddle.tril/triu`, `paddle.bmm`, `paddle.nn.Conv1D/Conv2D/Conv3D/Embedding`, `paddle.diff`, `paddle.cumsum`, `paddle.var`, `paddle.multinomial`, and `paddle.mean`. 
[#74477](https://github.com/PaddlePaddle/Paddle/pull/74477), [#74526](https://github.com/PaddlePaddle/Paddle/pull/74526), [#74711](https://github.com/PaddlePaddle/Paddle/pull/74711), [#74582](https://github.com/PaddlePaddle/Paddle/pull/74582), [#74624](https://github.com/PaddlePaddle/Paddle/pull/74624), [#74849](https://github.com/PaddlePaddle/Paddle/pull/74849), [#74612](https://github.com/PaddlePaddle/Paddle/pull/74612), [#74875](https://github.com/PaddlePaddle/Paddle/pull/74875), [#74641](https://github.com/PaddlePaddle/Paddle/pull/74641), [#74949](https://github.com/PaddlePaddle/Paddle/pull/74949), [#74918](https://github.com/PaddlePaddle/Paddle/pull/74918), [#74914](https://github.com/PaddlePaddle/Paddle/pull/74914), [#74934](https://github.com/PaddlePaddle/Paddle/pull/74934), [#74920](https://github.com/PaddlePaddle/Paddle/pull/74920), [#74955](https://github.com/PaddlePaddle/Paddle/pull/74955), [#74226](https://github.com/PaddlePaddle/Paddle/pull/74226), [#74946](https://github.com/PaddlePaddle/Paddle/pull/74946) +- Added API aliases to support more calling conventions, including `paddle.Tensor.mul_/mul`, `paddle.autograd.Function`, `paddle.argwhere`, `paddle.cat`, `paddle.clamp`, `paddle.ger`, `paddle.take_along_dim`, `paddle.linalg.matmul`, `paddle.special.logsumexp`, `paddle.concatenate`, `paddle.eq/gt`, `paddle.Tensor.take_along_dim`, and `paddle.nn.Conv1d/Conv2d/Conv3d`. 
[#74493](https://github.com/PaddlePaddle/Paddle/pull/74493), [#74569](https://github.com/PaddlePaddle/Paddle/pull/74569), [#74870](https://github.com/PaddlePaddle/Paddle/pull/74870) ### Bug fixes - -- Performance optimization and stability: Optimize training stability, enhance support for Python 3.11+ versions, improve the automatic activation logic of the CINN compiler in dynamic graph mode, fix issues related to dynamic shape inference and gradient backpropagation, optimize GPU kernel execution efficiency (such as for_range and constant folding), improve NPU memory copy and context management, and enhance large-scale model training performance and hardware utilization. [#71777](https://github.com/PaddlePaddle/Paddle/pull/71777), [#71837](https://github.com/PaddlePaddle/Paddle/pull/71837), [#71834](https://github.com/PaddlePaddle/Paddle/pull/71834), [#71950](https://github.com/PaddlePaddle/Paddle/pull/71950), [#71960](https://github.com/PaddlePaddle/Paddle/pull/71960), [#72103](https://github.com/PaddlePaddle/Paddle/pull/72103), [#70652](https://github.com/PaddlePaddle/Paddle/pull/70652), [#72313](https://github.com/PaddlePaddle/Paddle/pull/72313), [#72405](https://github.com/PaddlePaddle/Paddle/pull/72405), [#72581](https://github.com/PaddlePaddle/Paddle/pull/72581), [#73418](https://github.com/PaddlePaddle/Paddle/pull/73418) -- Large Tensor Support Extension: The extension operator supports very large-sized tensors, including mathematical operations (lerp/mean/bmm/trapezoid), tensor operations (arg_min_max/diag/prelu), padding, comparisons (allclose/isclose), and fusion operators (softmax_mask_fuse), addressing compatibility issues in mixed-precision training. 
[#71916](https://github.com/PaddlePaddle/Paddle/pull/71916), [#71970](https://github.com/PaddlePaddle/Paddle/pull/71970), [#72516](https://github.com/PaddlePaddle/Paddle/pull/72516), [#72517](https://github.com/PaddlePaddle/Paddle/pull/72517), [#72638](https://github.com/PaddlePaddle/Paddle/pull/72638), [#72652](https://github.com/PaddlePaddle/Paddle/pull/72652), [#73046](https://github.com/PaddlePaddle/Paddle/pull/73046), [#73093](https://github.com/PaddlePaddle/Paddle/pull/73093), [#73136](https://github.com/PaddlePaddle/Paddle/pull/73136), [#72679](https://github.com/PaddlePaddle/Paddle/pull/72679), [#73174](https://github.com/PaddlePaddle/Paddle/pull/73174), [#73198](https://github.com/PaddlePaddle/Paddle/pull/73198), [#73121](https://github.com/PaddlePaddle/Paddle/pull/73121), [#73096](https://github.com/PaddlePaddle/Paddle/pull/73096), [#73261](https://github.com/PaddlePaddle/Paddle/pull/73261), [#73201](https://github.com/PaddlePaddle/Paddle/pull/73201), [#73291](https://github.com/PaddlePaddle/Paddle/pull/73291), [#73373](https://github.com/PaddlePaddle/Paddle/pull/73373), [#73318](https://github.com/PaddlePaddle/Paddle/pull/73318), [#73436](https://github.com/PaddlePaddle/Paddle/pull/73436), [#72705](https://github.com/PaddlePaddle/Paddle/pull/72705), [#72276](https://github.com/PaddlePaddle/Paddle/pull/72276), [#73135](https://github.com/PaddlePaddle/Paddle/pull/73135), [#73304](https://github.com/PaddlePaddle/Paddle/pull/73304), [#73381](https://github.com/PaddlePaddle/Paddle/pull/73381), [#72712](https://github.com/PaddlePaddle/Paddle/pull/72712), [#72717](https://github.com/PaddlePaddle/Paddle/pull/72717), [#72634](https://github.com/PaddlePaddle/Paddle/pull/72634), [#72562](https://github.com/PaddlePaddle/Paddle/pull/72562), [#72628](https://github.com/PaddlePaddle/Paddle/pull/72628), [#72706](https://github.com/PaddlePaddle/Paddle/pull/72706), [#72831](https://github.com/PaddlePaddle/Paddle/pull/72831), 
[#72888](https://github.com/PaddlePaddle/Paddle/pull/72888), [#72753](https://github.com/PaddlePaddle/Paddle/pull/72753), [#72931](https://github.com/PaddlePaddle/Paddle/pull/72931), [#73021](https://github.com/PaddlePaddle/Paddle/pull/73021), [#73064](https://github.com/PaddlePaddle/Paddle/pull/73064), [#73069](https://github.com/PaddlePaddle/Paddle/pull/73069), [#73153](https://github.com/PaddlePaddle/Paddle/pull/73153), [#73118](https://github.com/PaddlePaddle/Paddle/pull/73118), [#73252](https://github.com/PaddlePaddle/Paddle/pull/73252), [#73253](https://github.com/PaddlePaddle/Paddle/pull/73253), [#73262](https://github.com/PaddlePaddle/Paddle/pull/73262), [#73259](https://github.com/PaddlePaddle/Paddle/pull/73259), [#73288](https://github.com/PaddlePaddle/Paddle/pull/73288), [#73105](https://github.com/PaddlePaddle/Paddle/pull/73105), [#73275](https://github.com/PaddlePaddle/Paddle/pull/73275), [#73284](https://github.com/PaddlePaddle/Paddle/pull/73284), [#73110](https://github.com/PaddlePaddle/Paddle/pull/73110), [#73335](https://github.com/PaddlePaddle/Paddle/pull/73335), [#73342](https://github.com/PaddlePaddle/Paddle/pull/73342), [#73447](https://github.com/PaddlePaddle/Paddle/pull/73447), [#73460](https://github.com/PaddlePaddle/Paddle/pull/73460), [#73194](https://github.com/PaddlePaddle/Paddle/pull/73194) -- Fix for 0-Size Tensor issue: Fixed computation anomalies caused by 0-Size Tensor, covering pooling (max_pool1d/lp_pool1d), sorting (matrix_rank), statistics (std/nanmedian), and element-level operations (elementwise compare), ensuring numerical stability and API consistency under extreme input scenarios. 
[#71961](https://github.com/PaddlePaddle/Paddle/pull/71961), [#72017](https://github.com/PaddlePaddle/Paddle/pull/72017), [#72785](https://github.com/PaddlePaddle/Paddle/pull/72785), [#73214](https://github.com/PaddlePaddle/Paddle/pull/73214), [#73263](https://github.com/PaddlePaddle/Paddle/pull/73263), [#73267](https://github.com/PaddlePaddle/Paddle/pull/73267), [#73280](https://github.com/PaddlePaddle/Paddle/pull/73280), [#72444](https://github.com/PaddlePaddle/Paddle/pull/72444), [#72437](https://github.com/PaddlePaddle/Paddle/pull/72437), [#72460](https://github.com/PaddlePaddle/Paddle/pull/72460), [#73090](https://github.com/PaddlePaddle/Paddle/pull/73090), [#73516](https://github.com/PaddlePaddle/Paddle/pull/73516), [#72807](https://github.com/PaddlePaddle/Paddle/pull/72807), [#72799](https://github.com/PaddlePaddle/Paddle/pull/72799), [#72800](https://github.com/PaddlePaddle/Paddle/pull/72800), [#72809](https://github.com/PaddlePaddle/Paddle/pull/72809), [#73497](https://github.com/PaddlePaddle/Paddle/pull/73497) -- API Enhancements and Compatibility: Added support for Python standard library types (dataclasses), expanded API data type compatibility (creation of bfloat16 parameters, automatic inference of -1 dimension), fixed NumPy API interaction errors, and optimized BatchNorm memory layout. 
[#72059](https://github.com/PaddlePaddle/Paddle/pull/72059), [#72283](https://github.com/PaddlePaddle/Paddle/pull/72283), [#72451](https://github.com/PaddlePaddle/Paddle/pull/72451), [#72512](https://github.com/PaddlePaddle/Paddle/pull/72512), [#72618](https://github.com/PaddlePaddle/Paddle/pull/72618), [#72976](https://github.com/PaddlePaddle/Paddle/pull/72976), [#73084](https://github.com/PaddlePaddle/Paddle/pull/73084), [#73205](https://github.com/PaddlePaddle/Paddle/pull/73205), [#73250](https://github.com/PaddlePaddle/Paddle/pull/73250), [#73111](https://github.com/PaddlePaddle/Paddle/pull/73111), [#73260](https://github.com/PaddlePaddle/Paddle/pull/73260), [#72094](https://github.com/PaddlePaddle/Paddle/pull/72094), [#71844](https://github.com/PaddlePaddle/Paddle/pull/71844), [#71357](https://github.com/PaddlePaddle/Paddle/pull/71357) -- Memory management and bug fixes: Address high-risk issues such as memory overflow (set_value/nonzero), null pointer (data nullptr), and CUDA graph allocation failure. Fix memory leaks and computational errors in core operations such as gradient clipping (clip_grad), tensor assignment (assign), and broadcasting (broadcast). Optimize NPU asynchronous execution and predictor GIL release logic to enhance system robustness. 
[#71895](https://github.com/PaddlePaddle/Paddle/pull/71895), [#72101](https://github.com/PaddlePaddle/Paddle/pull/72101), [#72133](https://github.com/PaddlePaddle/Paddle/pull/72133), [#72149](https://github.com/PaddlePaddle/Paddle/pull/72149), [#72176](https://github.com/PaddlePaddle/Paddle/pull/72176), [#72314](https://github.com/PaddlePaddle/Paddle/pull/72314), [#72256](https://github.com/PaddlePaddle/Paddle/pull/72256), [#72757](https://github.com/PaddlePaddle/Paddle/pull/72757), [#72749](https://github.com/PaddlePaddle/Paddle/pull/72749), [#72792](https://github.com/PaddlePaddle/Paddle/pull/72792), [#72815](https://github.com/PaddlePaddle/Paddle/pull/72815), [#72819](https://github.com/PaddlePaddle/Paddle/pull/72819), [#72958](https://github.com/PaddlePaddle/Paddle/pull/72958), [#73023](https://github.com/PaddlePaddle/Paddle/pull/73023), [#73103](https://github.com/PaddlePaddle/Paddle/pull/73103), [#73014](https://github.com/PaddlePaddle/Paddle/pull/73014), [#73137](https://github.com/PaddlePaddle/Paddle/pull/73137), [#73256](https://github.com/PaddlePaddle/Paddle/pull/73256), [#73211](https://github.com/PaddlePaddle/Paddle/pull/73211), [#73251](https://github.com/PaddlePaddle/Paddle/pull/73251), [#73210](https://github.com/PaddlePaddle/Paddle/pull/73210), [#73415](https://github.com/PaddlePaddle/Paddle/pull/73415), [#73206](https://github.com/PaddlePaddle/Paddle/pull/73206), [#71983](https://github.com/PaddlePaddle/Paddle/pull/71983), [#72485](https://github.com/PaddlePaddle/Paddle/pull/72485), [#72561](https://github.com/PaddlePaddle/Paddle/pull/72561) -- Other important fixes: Fixed defects in scientific computation, save/load modules, improved Slice operator kernel configuration, optimized fallback strategy for dynamic shape inference, and refined exception throwing and type checking logic. 
[#71810](https://github.com/PaddlePaddle/Paddle/pull/71810), [#72246](https://github.com/PaddlePaddle/Paddle/pull/72246), [#72378](https://github.com/PaddlePaddle/Paddle/pull/72378), [#72467](https://github.com/PaddlePaddle/Paddle/pull/72467), [#72635](https://github.com/PaddlePaddle/Paddle/pull/72635), [#72751](https://github.com/PaddlePaddle/Paddle/pull/72751), [#72044](https://github.com/PaddlePaddle/Paddle/pull/72044), [#72051](https://github.com/PaddlePaddle/Paddle/pull/72051), [#73231](https://github.com/PaddlePaddle/Paddle/pull/73231), [#73109](https://github.com/PaddlePaddle/Paddle/pull/73109) -- Fixed issues related to SOT, [#71932](https://github.com/PaddlePaddle/Paddle/pull/71932), [#71971](https://github.com/PaddlePaddle/Paddle/pull/71971), [#72194](https://github.com/PaddlePaddle/Paddle/pull/72194), [#72288](https://github.com/PaddlePaddle/Paddle/pull/72288), [#72306](https://github.com/PaddlePaddle/Paddle/pull/72306), [#72367](https://github.com/PaddlePaddle/Paddle/pull/72367), [#72495](https://github.com/PaddlePaddle/Paddle/pull/72495), [#72522](https://github.com/PaddlePaddle/Paddle/pull/72522), [#72704](https://github.com/PaddlePaddle/Paddle/pull/72704), [#72631](https://github.com/PaddlePaddle/Paddle/pull/72631), [#72737](https://github.com/PaddlePaddle/Paddle/pull/72737), [#73067](https://github.com/PaddlePaddle/Paddle/pull/73067), [#73030](https://github.com/PaddlePaddle/Paddle/pull/73030), [#73059](https://github.com/PaddlePaddle/Paddle/pull/73059), [#73282](https://github.com/PaddlePaddle/Paddle/pull/73282), [#73511](https://github.com/PaddlePaddle/Paddle/pull/73511), [#73526](https://github.com/PaddlePaddle/Paddle/pull/73526), [#73549](https://github.com/PaddlePaddle/Paddle/pull/73549), [#73515](https://github.com/PaddlePaddle/Paddle/pull/73515) - -### Improvements - -- Development of the 0-size mechanism for Paddle API. 
[#72721](https://github.com/PaddlePaddle/Paddle/pull/72721), [#72756](https://github.com/PaddlePaddle/Paddle/pull/72756), [#72790](https://github.com/PaddlePaddle/Paddle/pull/72790), [#72806](https://github.com/PaddlePaddle/Paddle/pull/72806), [#72764](https://github.com/PaddlePaddle/Paddle/pull/72764), [#72786](https://github.com/PaddlePaddle/Paddle/pull/72786), [#72853](https://github.com/PaddlePaddle/Paddle/pull/72853), [#72826](https://github.com/PaddlePaddle/Paddle/pull/72826), [#72851](https://github.com/PaddlePaddle/Paddle/pull/72851), [#72928](https://github.com/PaddlePaddle/Paddle/pull/72928), [#72912](https://github.com/PaddlePaddle/Paddle/pull/72912), [#72922](https://github.com/PaddlePaddle/Paddle/pull/72922), [#72924](https://github.com/PaddlePaddle/Paddle/pull/72924), [#72887](https://github.com/PaddlePaddle/Paddle/pull/72887), [#72921](https://github.com/PaddlePaddle/Paddle/pull/72921), [#72906](https://github.com/PaddlePaddle/Paddle/pull/72906), [#72895](https://github.com/PaddlePaddle/Paddle/pull/72895), [#72821](https://github.com/PaddlePaddle/Paddle/pull/72821), [#72914](https://github.com/PaddlePaddle/Paddle/pull/72914), [#72936](https://github.com/PaddlePaddle/Paddle/pull/72936), [#72943](https://github.com/PaddlePaddle/Paddle/pull/72943), [#72694](https://github.com/PaddlePaddle/Paddle/pull/72694), [#72919](https://github.com/PaddlePaddle/Paddle/pull/72919), [#72940](https://github.com/PaddlePaddle/Paddle/pull/72940), [#72820](https://github.com/PaddlePaddle/Paddle/pull/72820), [#72934](https://github.com/PaddlePaddle/Paddle/pull/72934), [#72975](https://github.com/PaddlePaddle/Paddle/pull/72975), [#72872](https://github.com/PaddlePaddle/Paddle/pull/72872), [#72984](https://github.com/PaddlePaddle/Paddle/pull/72984), [#72988](https://github.com/PaddlePaddle/Paddle/pull/72988), [#72972](https://github.com/PaddlePaddle/Paddle/pull/72972), [#72977](https://github.com/PaddlePaddle/Paddle/pull/72977), 
[#72937](https://github.com/PaddlePaddle/Paddle/pull/72937), [#73086](https://github.com/PaddlePaddle/Paddle/pull/73086), [#73042](https://github.com/PaddlePaddle/Paddle/pull/73042), [#73017](https://github.com/PaddlePaddle/Paddle/pull/73017), [#73044](https://github.com/PaddlePaddle/Paddle/pull/73044), [#73077](https://github.com/PaddlePaddle/Paddle/pull/73077), [#73108](https://github.com/PaddlePaddle/Paddle/pull/73108), [#73027](https://github.com/PaddlePaddle/Paddle/pull/73027), [#72970](https://github.com/PaddlePaddle/Paddle/pull/72970), [#73008](https://github.com/PaddlePaddle/Paddle/pull/73008), [#72996](https://github.com/PaddlePaddle/Paddle/pull/72996), [#73165](https://github.com/PaddlePaddle/Paddle/pull/73165), [#73166](https://github.com/PaddlePaddle/Paddle/pull/73166), [#73170](https://github.com/PaddlePaddle/Paddle/pull/73170), [#73122](https://github.com/PaddlePaddle/Paddle/pull/73122), [#73204](https://github.com/PaddlePaddle/Paddle/pull/73204), [#73207](https://github.com/PaddlePaddle/Paddle/pull/73207), [#73186](https://github.com/PaddlePaddle/Paddle/pull/73186), [#73197](https://github.com/PaddlePaddle/Paddle/pull/73197), [#73168](https://github.com/PaddlePaddle/Paddle/pull/73168), [#73172](https://github.com/PaddlePaddle/Paddle/pull/73172), [#73125](https://github.com/PaddlePaddle/Paddle/pull/73125), [#73181](https://github.com/PaddlePaddle/Paddle/pull/73181), [#73270](https://github.com/PaddlePaddle/Paddle/pull/73270), [#73028](https://github.com/PaddlePaddle/Paddle/pull/73028), [#73094](https://github.com/PaddlePaddle/Paddle/pull/73094), [#73180](https://github.com/PaddlePaddle/Paddle/pull/73180), [#73276](https://github.com/PaddlePaddle/Paddle/pull/73276), [#73333](https://github.com/PaddlePaddle/Paddle/pull/73333), [#73341](https://github.com/PaddlePaddle/Paddle/pull/73341), [#73299](https://github.com/PaddlePaddle/Paddle/pull/73299), [#73346](https://github.com/PaddlePaddle/Paddle/pull/73346), 
[#73361](https://github.com/PaddlePaddle/Paddle/pull/73361), [#73375](https://github.com/PaddlePaddle/Paddle/pull/73375), [#73152](https://github.com/PaddlePaddle/Paddle/pull/73152), [#73377](https://github.com/PaddlePaddle/Paddle/pull/73377), [#73355](https://github.com/PaddlePaddle/Paddle/pull/73355), [#73382](https://github.com/PaddlePaddle/Paddle/pull/73382), [#73385](https://github.com/PaddlePaddle/Paddle/pull/73385), [#73386](https://github.com/PaddlePaddle/Paddle/pull/73386), [#73352](https://github.com/PaddlePaddle/Paddle/pull/73352), [#73387](https://github.com/PaddlePaddle/Paddle/pull/73387), [#73401](https://github.com/PaddlePaddle/Paddle/pull/73401), [#73384](https://github.com/PaddlePaddle/Paddle/pull/73384), [#73450](https://github.com/PaddlePaddle/Paddle/pull/73450), [#73437](https://github.com/PaddlePaddle/Paddle/pull/73437), [#73503](https://github.com/PaddlePaddle/Paddle/pull/73503), [#73507](https://github.com/PaddlePaddle/Paddle/pull/73507), [#73477](https://github.com/PaddlePaddle/Paddle/pull/73477), [#73513](https://github.com/PaddlePaddle/Paddle/pull/73513), [#73525](https://github.com/PaddlePaddle/Paddle/pull/73525), [#73528](https://github.com/PaddlePaddle/Paddle/pull/73528), [#73517](https://github.com/PaddlePaddle/Paddle/pull/73517), [#72898](https://github.com/PaddlePaddle/Paddle/pull/72898), [#72880](https://github.com/PaddlePaddle/Paddle/pull/72880), [#72864](https://github.com/PaddlePaddle/Paddle/pull/72864), [#72993](https://github.com/PaddlePaddle/Paddle/pull/72993), [#72954](https://github.com/PaddlePaddle/Paddle/pull/72954), [#72866](https://github.com/PaddlePaddle/Paddle/pull/72866), [#72878](https://github.com/PaddlePaddle/Paddle/pull/72878), [#72889](https://github.com/PaddlePaddle/Paddle/pull/72889), [#72861](https://github.com/PaddlePaddle/Paddle/pull/72861), [#72837](https://github.com/PaddlePaddle/Paddle/pull/72837) -- SOT-related enhancements: Enhanced functionalities (such as NumPy interoperability and super support), 
improved training stability, and fixed multiple issues to enhance code robustness, [#71763](https://github.com/PaddlePaddle/Paddle/pull/71763), [#71666](https://github.com/PaddlePaddle/Paddle/pull/71666), [#71858](https://github.com/PaddlePaddle/Paddle/pull/71858), [#71865](https://github.com/PaddlePaddle/Paddle/pull/71865), [#72474](https://github.com/PaddlePaddle/Paddle/pull/72474), [#72154](https://github.com/PaddlePaddle/Paddle/pull/72154), [#72784](https://github.com/PaddlePaddle/Paddle/pull/72784), [#72956](https://github.com/PaddlePaddle/Paddle/pull/72956), [#73038](https://github.com/PaddlePaddle/Paddle/pull/73038), [#73066](https://github.com/PaddlePaddle/Paddle/pull/73066), [#73287](https://github.com/PaddlePaddle/Paddle/pull/73287), [#73278](https://github.com/PaddlePaddle/Paddle/pull/73278), [#73332](https://github.com/PaddlePaddle/Paddle/pull/73332), [#73372](https://github.com/PaddlePaddle/Paddle/pull/73372), [#73412](https://github.com/PaddlePaddle/Paddle/pull/73412), [#73407](https://github.com/PaddlePaddle/Paddle/pull/73407), [#73506](https://github.com/PaddlePaddle/Paddle/pull/73506) -- Code style refactoring: Through code refactoring and the unification of cross-platform kernel behaviors, we have improved code quality and maintainability. 
Additionally, we have added a YAML format pre-commit check tool, as documented in [#72216](https://github.com/PaddlePaddle/Paddle/pull/72216), [#72360](https://github.com/PaddlePaddle/Paddle/pull/72360), [#72816](https://github.com/PaddlePaddle/Paddle/pull/72816), [#72969](https://github.com/PaddlePaddle/Paddle/pull/72969), [#73106](https://github.com/PaddlePaddle/Paddle/pull/73106), [#72825](https://github.com/PaddlePaddle/Paddle/pull/72825), [#73150](https://github.com/PaddlePaddle/Paddle/pull/73150), [#73151](https://github.com/PaddlePaddle/Paddle/pull/73151), [#73158](https://github.com/PaddlePaddle/Paddle/pull/73158), [#73101](https://github.com/PaddlePaddle/Paddle/pull/73101), [#73326](https://github.com/PaddlePaddle/Paddle/pull/73326), [#72580](https://github.com/PaddlePaddle/Paddle/pull/72580), and [#72424](https://github.com/PaddlePaddle/Paddle/pull/72424) -- Paddle CPU/GPU Kernel accuracy issue is pushed to the whole team. [#72879](https://github.com/PaddlePaddle/Paddle/pull/72879), [#72894](https://github.com/PaddlePaddle/Paddle/pull/72894), [#73012](https://github.com/PaddlePaddle/Paddle/pull/73012), [#72973](https://github.com/PaddlePaddle/Paddle/pull/72973), [#73018](https://github.com/PaddlePaddle/Paddle/pull/73018), [#72965](https://github.com/PaddlePaddle/Paddle/pull/72965), [#73128](https://github.com/PaddlePaddle/Paddle/pull/73128), [#73229](https://github.com/PaddlePaddle/Paddle/pull/73229), [#72992](https://github.com/PaddlePaddle/Paddle/pull/72992), [#73344](https://github.com/PaddlePaddle/Paddle/pull/73344), [#73274](https://github.com/PaddlePaddle/Paddle/pull/73274), [#73295](https://github.com/PaddlePaddle/Paddle/pull/73295), [#73293](https://github.com/PaddlePaddle/Paddle/pull/73293), [#73317](https://github.com/PaddlePaddle/Paddle/pull/73317), [#73320](https://github.com/PaddlePaddle/Paddle/pull/73320), [#73454](https://github.com/PaddlePaddle/Paddle/pull/73454), [#73492](https://github.com/PaddlePaddle/Paddle/pull/73492), 
[#73535](https://github.com/PaddlePaddle/Paddle/pull/73535) -- Slice issue fixes: Fixed issues related to slices, including indexing logic, performance optimization, etc., [#72644](https://github.com/PaddlePaddle/Paddle/pull/72644), [#72676](https://github.com/PaddlePaddle/Paddle/pull/72676), [#72838](https://github.com/PaddlePaddle/Paddle/pull/72838), [#72966](https://github.com/PaddlePaddle/Paddle/pull/72966), [#73095](https://github.com/PaddlePaddle/Paddle/pull/73095), [#72840](https://github.com/PaddlePaddle/Paddle/pull/72840), [#73112](https://github.com/PaddlePaddle/Paddle/pull/73112), [#73367](https://github.com/PaddlePaddle/Paddle/pull/73367), [#73390](https://github.com/PaddlePaddle/Paddle/pull/73390), [#73307](https://github.com/PaddlePaddle/Paddle/pull/73307), [#73465](https://github.com/PaddlePaddle/Paddle/pull/73465), [#73362](https://github.com/PaddlePaddle/Paddle/pull/73362), [#72733](https://github.com/PaddlePaddle/Paddle/pull/72733), [#72886](https://github.com/PaddlePaddle/Paddle/pull/72886) -- Performance optimization: By optimizing index logic and enhancing performance, we aim to improve overall performance, [#72707](https://github.com/PaddlePaddle/Paddle/pull/72707), [#73485](https://github.com/PaddlePaddle/Paddle/pull/73485) -- Other significant improvements: including dynamic shape support, fixing meshgrid and adding unit tests, upgrading CUB to version 2.1.0, improving FP8 numerical processing, optimizing the CUDA graph shared pool mechanism, removing ShadowFeedOp to simplify data flow, enhancing version compatibility for PIR model saving/loading, fixing flip and reverse kernel issues, improving the NaN propagation logic of paddle.angle, introducing an asynchronous GC check mechanism, optimizing the Scope lock-free interface of Dy2St, cleaning up unused third-party dependencies (absl), and further promoting the decoupling of PHI and Fluid to enhance the framework's stability, performance, and scalability. 
[#72356](https://github.com/PaddlePaddle/Paddle/pull/72356), [#72380](https://github.com/PaddlePaddle/Paddle/pull/72380), [#72633](https://github.com/PaddlePaddle/Paddle/pull/72633), [#72794](https://github.com/PaddlePaddle/Paddle/pull/72794), [#72917](https://github.com/PaddlePaddle/Paddle/pull/72917), [#72920](https://github.com/PaddlePaddle/Paddle/pull/72920), [#72945](https://github.com/PaddlePaddle/Paddle/pull/72945), [#72620](https://github.com/PaddlePaddle/Paddle/pull/72620), [#73011](https://github.com/PaddlePaddle/Paddle/pull/73011), [#73051](https://github.com/PaddlePaddle/Paddle/pull/73051), [#73052](https://github.com/PaddlePaddle/Paddle/pull/73052), [#73075](https://github.com/PaddlePaddle/Paddle/pull/73075), [#73176](https://github.com/PaddlePaddle/Paddle/pull/73176), [#73191](https://github.com/PaddlePaddle/Paddle/pull/73191), [#73337](https://github.com/PaddlePaddle/Paddle/pull/73337), [#73311](https://github.com/PaddlePaddle/Paddle/pull/73311), [#73173](https://github.com/PaddlePaddle/Paddle/pull/73173), [#73239](https://github.com/PaddlePaddle/Paddle/pull/73239), [#73448](https://github.com/PaddlePaddle/Paddle/pull/73448), [#73478](https://github.com/PaddlePaddle/Paddle/pull/73478), [#73522](https://github.com/PaddlePaddle/Paddle/pull/73522), [#73369](https://github.com/PaddlePaddle/Paddle/pull/73369) - -### Performance - -- SOT-related: Through improvements such as optimizing the Guard condition mechanism, enhancing dynamic shape processing capabilities, and adding support for no_grad, execution efficiency has been enhanced, functional features have been expanded, and the code structure and performance have been optimized. 
[#70362](https://github.com/PaddlePaddle/Paddle/pull/70362), [#70154](https://github.com/PaddlePaddle/Paddle/pull/70154), [#71748](https://github.com/PaddlePaddle/Paddle/pull/71748), [#72004](https://github.com/PaddlePaddle/Paddle/pull/72004), [#72159](https://github.com/PaddlePaddle/Paddle/pull/72159), [#72174](https://github.com/PaddlePaddle/Paddle/pull/72174), [#71994](https://github.com/PaddlePaddle/Paddle/pull/71994), [#72250](https://github.com/PaddlePaddle/Paddle/pull/72250), [#72285](https://github.com/PaddlePaddle/Paddle/pull/72285), [#72322](https://github.com/PaddlePaddle/Paddle/pull/72322), [#72272](https://github.com/PaddlePaddle/Paddle/pull/72272), [#72417](https://github.com/PaddlePaddle/Paddle/pull/72417), [#72438](https://github.com/PaddlePaddle/Paddle/pull/72438), [#72462](https://github.com/PaddlePaddle/Paddle/pull/72462), [#72463](https://github.com/PaddlePaddle/Paddle/pull/72463), [#72503](https://github.com/PaddlePaddle/Paddle/pull/72503), [#72501](https://github.com/PaddlePaddle/Paddle/pull/72501), [#72521](https://github.com/PaddlePaddle/Paddle/pull/72521), [#72509](https://github.com/PaddlePaddle/Paddle/pull/72509), [#72544](https://github.com/PaddlePaddle/Paddle/pull/72544), [#73469](https://github.com/PaddlePaddle/Paddle/pull/73469), [#73471](https://github.com/PaddlePaddle/Paddle/pull/73471), [#73555](https://github.com/PaddlePaddle/Paddle/pull/73555) - -### Deprecations - -- Code cleanup: Cleaned up Python 3.8 support declarations, and completed related code cleanup, dependency streamlining, and syntax modernization updates to optimize code maintainability and compatibility. 
[#71815](https://github.com/PaddlePaddle/Paddle/pull/71815), [#72802](https://github.com/PaddlePaddle/Paddle/pull/72802), [#72856](https://github.com/PaddlePaddle/Paddle/pull/72856), [#72854](https://github.com/PaddlePaddle/Paddle/pull/72854), [#72855](https://github.com/PaddlePaddle/Paddle/pull/72855), [#72873](https://github.com/PaddlePaddle/Paddle/pull/72873), [#72870](https://github.com/PaddlePaddle/Paddle/pull/72870), [#72868](https://github.com/PaddlePaddle/Paddle/pull/72868), [#72891](https://github.com/PaddlePaddle/Paddle/pull/72891) - -### Devs - -- Optimized CINN backend integration and dynamic shape processing logic, improved framework stability through code structure refactoring and test reinforcement, and added debug log functionality to enhance maintainability. [#71817](https://github.com/PaddlePaddle/Paddle/pull/71817), [#71896](https://github.com/PaddlePaddle/Paddle/pull/71896), [#71984](https://github.com/PaddlePaddle/Paddle/pull/71984), [#72067](https://github.com/PaddlePaddle/Paddle/pull/72067), [#72165](https://github.com/PaddlePaddle/Paddle/pull/72165), [#72207](https://github.com/PaddlePaddle/Paddle/pull/72207), [#72235](https://github.com/PaddlePaddle/Paddle/pull/72235), [#72273](https://github.com/PaddlePaddle/Paddle/pull/72273), [#72326](https://github.com/PaddlePaddle/Paddle/pull/72326), [#72400](https://github.com/PaddlePaddle/Paddle/pull/72400), [#72381](https://github.com/PaddlePaddle/Paddle/pull/72381), [#72560](https://github.com/PaddlePaddle/Paddle/pull/72560), [#72783](https://github.com/PaddlePaddle/Paddle/pull/72783), [#73530](https://github.com/PaddlePaddle/Paddle/pull/73530) - -### Others - -- Others: Added kernel support for FP16/BF16 data types in CPU sections, optimized error handling and tolerance configuration in the testing module, etc. 
[#71764](https://github.com/PaddlePaddle/Paddle/pull/71764), [#71951](https://github.com/PaddlePaddle/Paddle/pull/71951), [#72944](https://github.com/PaddlePaddle/Paddle/pull/72944) - -## 3. CINN - -Optimize compiler performance and enhance stability - -### Performance - -- Support automatic conversion and optimization of Layout in training scenarios. ([#71891](https://github.com/PaddlePaddle/Paddle/pull/71891)) -- Kernel compilation optimizations for operators such as argmin, argmax, and arange have been added to the backend. ([#71956](https://github.com/PaddlePaddle/Paddle/pull/71956), [#72598](https://github.com/PaddlePaddle/Paddle/pull/72598))) -- Support for fused optimization of matrix multiplication. ([#72846](https://github.com/PaddlePaddle/Paddle/pull/72846)) -- Optimize the computation performance of some operators, specifically the Kernel. ([#72871](https://github.com/PaddlePaddle/Paddle/pull/72871)) +- Fixed a precision issue in `paddle.nanmedian`. [#74263](https://github.com/PaddlePaddle/Paddle/pull/74263) +- Fixed an issue with `paddle.distributed.fleet.utils.hybrid_parallel_util.fused_allreduce_gradients` in 0-D scenarios. [#74957](https://github.com/PaddlePaddle/Paddle/pull/74957) +- Fixed an issue with `paddle.matmul` in distributed mode. [#74989](https://github.com/PaddlePaddle/Paddle/pull/74989) + +### Enhanced functionality +- APIs that return multiple Tensor objects, such as `paddle.topk`, now wrap their results in a Paddle data structure for easier use. [#74931](https://github.com/PaddlePaddle/Paddle/pull/74931) +- Creation-type APIs now support variable-length (variadic) parameters. [#74494](https://github.com/PaddlePaddle/Paddle/pull/74494) + +### Documents +- Added or fixed documentation. 
[#74453](https://github.com/PaddlePaddle/Paddle/pull/74453), [#74846](https://github.com/PaddlePaddle/Paddle/pull/74846), [#74982](https://github.com/PaddlePaddle/Paddle/pull/74982) + +### Other +- Optimization related to code style. [#74654](https://github.com/PaddlePaddle/Paddle/pull/74654),[#74655](https://github.com/PaddlePaddle/Paddle/pull/74655),[#74665](https://github.com/PaddlePaddle/Paddle/pull/74665),[#74660](https://github.com/PaddlePaddle/Paddle/pull/74660),[#74667](https://github.com/PaddlePaddle/Paddle/pull/74667),[#74664](https://github.com/PaddlePaddle/Paddle/pull/74664),[#74662](https://github.com/PaddlePaddle/Paddle/pull/74662),[#74661](https://github.com/PaddlePaddle/Paddle/pull/74661),[#74658](https://github.com/PaddlePaddle/Paddle/pull/74658),[#74657](https://github.com/PaddlePaddle/Paddle/pull/74657),[#74666](https://github.com/PaddlePaddle/Paddle/pull/74666),[#74659](https://github.com/PaddlePaddle/Paddle/pull/74659),[#74663](https://github.com/PaddlePaddle/Paddle/pull/74663),[#74656](https://github.com/PaddlePaddle/Paddle/pull/74656),[#74673](https://github.com/PaddlePaddle/Paddle/pull/74673),[#74672](https://github.com/PaddlePaddle/Paddle/pull/74672),[#74671](https://github.com/PaddlePaddle/Paddle/pull/74671),[#74674](https://github.com/PaddlePaddle/Paddle/pull/74674),[#74675](https://github.com/PaddlePaddle/Paddle/pull/74675),[#74670](https://github.com/PaddlePaddle/Paddle/pull/74670),[#74669](https://github.com/PaddlePaddle/Paddle/pull/74669),[#74677](https://github.com/PaddlePaddle/Paddle/pull/74677),[#74709](https://github.com/PaddlePaddle/Paddle/pull/74709),[#74714](https://github.com/PaddlePaddle/Paddle/pull/74714),[#74712](https://github.com/PaddlePaddle/Paddle/pull/74712),[#74713](https://github.com/PaddlePaddle/Paddle/pull/74713),[#74704](https://github.com/PaddlePaddle/Paddle/pull/74704),[#74746](https://github.com/PaddlePaddle/Paddle/pull/74746),[#74748](https://github.com/PaddlePaddle/Paddle/pull/74748),[#74743](https://github.co
m/PaddlePaddle/Paddle/pull/74743),[#74742](https://github.com/PaddlePaddle/Paddle/pull/74742),[#74744](https://github.com/PaddlePaddle/Paddle/pull/74744),[#74745](https://github.com/PaddlePaddle/Paddle/pull/74745),[#74747](https://github.com/PaddlePaddle/Paddle/pull/74747),[#74794](https://github.com/PaddlePaddle/Paddle/pull/74794),[#74789](https://github.com/PaddlePaddle/Paddle/pull/74789),[#74793](https://github.com/PaddlePaddle/Paddle/pull/74793),[#74786](https://github.com/PaddlePaddle/Paddle/pull/74786),[#74791](https://github.com/PaddlePaddle/Paddle/pull/74791),[#74787](https://github.com/PaddlePaddle/Paddle/pull/74787),[#74827](https://github.com/PaddlePaddle/Paddle/pull/74827),[#74608](https://github.com/PaddlePaddle/Paddle/pull/74608),[#74288](https://github.com/PaddlePaddle/Paddle/pull/74288),[#74287](https://github.com/PaddlePaddle/Paddle/pull/74287),[#74385](https://github.com/PaddlePaddle/Paddle/pull/74385),[#74395](https://github.com/PaddlePaddle/Paddle/pull/74395),[#74475](https://github.com/PaddlePaddle/Paddle/pull/74475),[#74647](https://github.com/PaddlePaddle/Paddle/pull/74647) +- Optimization related to MKLDNN/ONEDNN. 
[#74299](https://github.com/PaddlePaddle/Paddle/pull/74299),[#74244](https://github.com/PaddlePaddle/Paddle/pull/74244),[#74230](https://github.com/PaddlePaddle/Paddle/pull/74230),[#74314](https://github.com/PaddlePaddle/Paddle/pull/74314),[#74327](https://github.com/PaddlePaddle/Paddle/pull/74327),[#74325](https://github.com/PaddlePaddle/Paddle/pull/74325),[#74326](https://github.com/PaddlePaddle/Paddle/pull/74326),[#74315](https://github.com/PaddlePaddle/Paddle/pull/74315),[#74399](https://github.com/PaddlePaddle/Paddle/pull/74399),[#74398](https://github.com/PaddlePaddle/Paddle/pull/74398),[#74393](https://github.com/PaddlePaddle/Paddle/pull/74393),[#74392](https://github.com/PaddlePaddle/Paddle/pull/74392),[#74367](https://github.com/PaddlePaddle/Paddle/pull/74367),[#74391](https://github.com/PaddlePaddle/Paddle/pull/74391),[#74423](https://github.com/PaddlePaddle/Paddle/pull/74423),[#74424](https://github.com/PaddlePaddle/Paddle/pull/74424),[#74436](https://github.com/PaddlePaddle/Paddle/pull/74436),[#74417](https://github.com/PaddlePaddle/Paddle/pull/74417),[#74410](https://github.com/PaddlePaddle/Paddle/pull/74410),[#74473](https://github.com/PaddlePaddle/Paddle/pull/74473),[#74458](https://github.com/PaddlePaddle/Paddle/pull/74458),[#74501](https://github.com/PaddlePaddle/Paddle/pull/74501),[#74487](https://github.com/PaddlePaddle/Paddle/pull/74487),[#74502](https://github.com/PaddlePaddle/Paddle/pull/74502),[#74513](https://github.com/PaddlePaddle/Paddle/pull/74513),[#74518](https://github.com/PaddlePaddle/Paddle/pull/74518),[#74516](https://github.com/PaddlePaddle/Paddle/pull/74516),[#74507](https://github.com/PaddlePaddle/Paddle/pull/74507),[#74504](https://github.com/PaddlePaddle/Paddle/pull/74504),[#74505](https://github.com/PaddlePaddle/Paddle/pull/74505),[#74509](https://github.com/PaddlePaddle/Paddle/pull/74509),[#74535](https://github.com/PaddlePaddle/Paddle/pull/74535),[#74536](https://github.com/PaddlePaddle/Paddle/pull/74536),[#74517](https://git
hub.com/PaddlePaddle/Paddle/pull/74517),[#74503](https://github.com/PaddlePaddle/Paddle/pull/74503),[#74557](https://github.com/PaddlePaddle/Paddle/pull/74557),[#74550](https://github.com/PaddlePaddle/Paddle/pull/74550),[#74575](https://github.com/PaddlePaddle/Paddle/pull/74575),[#74587](https://github.com/PaddlePaddle/Paddle/pull/74587),[#74576](https://github.com/PaddlePaddle/Paddle/pull/74576),[#74588](https://github.com/PaddlePaddle/Paddle/pull/74588),[#74549](https://github.com/PaddlePaddle/Paddle/pull/74549),[#74581](https://github.com/PaddlePaddle/Paddle/pull/74581),[#74583](https://github.com/PaddlePaddle/Paddle/pull/74583),[#74628](https://github.com/PaddlePaddle/Paddle/pull/74628),[#74630](https://github.com/PaddlePaddle/Paddle/pull/74630),[#74635](https://github.com/PaddlePaddle/Paddle/pull/74635),[#74679](https://github.com/PaddlePaddle/Paddle/pull/74679),[#74648](https://github.com/PaddlePaddle/Paddle/pull/74648),[#74127](https://github.com/PaddlePaddle/Paddle/pull/74127),[#74636](https://github.com/PaddlePaddle/Paddle/pull/74636),[#74552](https://github.com/PaddlePaddle/Paddle/pull/74552),[#74551](https://github.com/PaddlePaddle/Paddle/pull/74551),[#74678](https://github.com/PaddlePaddle/Paddle/pull/74678),[#74680](https://github.com/PaddlePaddle/Paddle/pull/74680),[#74730](https://github.com/PaddlePaddle/Paddle/pull/74730),[#74751](https://github.com/PaddlePaddle/Paddle/pull/74751),[#74895](https://github.com/PaddlePaddle/Paddle/pull/74895),[#74821](https://github.com/PaddlePaddle/Paddle/pull/74821),[#74897](https://github.com/PaddlePaddle/Paddle/pull/74897),[#74734](https://github.com/PaddlePaddle/Paddle/pull/74734) +- Optimizations related to code implementation, variable and file renaming. 
[#74309](https://github.com/PaddlePaddle/Paddle/pull/74309), [#74597](https://github.com/PaddlePaddle/Paddle/pull/74597), [#74613](https://github.com/PaddlePaddle/Paddle/pull/74613), [#74376](https://github.com/PaddlePaddle/Paddle/pull/74376), [#74479](https://github.com/PaddlePaddle/Paddle/pull/74479), [#74960](https://github.com/PaddlePaddle/Paddle/pull/74960), [#74968](https://github.com/PaddlePaddle/Paddle/pull/74968), [#74977](https://github.com/PaddlePaddle/Paddle/pull/74977) +- Unit test optimizations and fixes for unit test issues. [#74595](https://github.com/PaddlePaddle/Paddle/pull/74595) +- Compilation-related optimizations and CI issue fixes. [#74356](https://github.com/PaddlePaddle/Paddle/pull/74356), [#74936](https://github.com/PaddlePaddle/Paddle/pull/74936) +- Improved debugging output, printed information, and error messages. [#74765](https://github.com/PaddlePaddle/Paddle/pull/74765), [#74381](https://github.com/PaddlePaddle/Paddle/pull/74381), [#74384](https://github.com/PaddlePaddle/Paddle/pull/74384), [#74386](https://github.com/PaddlePaddle/Paddle/pull/74386), [#74387](https://github.com/PaddlePaddle/Paddle/pull/74387), [#74383](https://github.com/PaddlePaddle/Paddle/pull/74383), [#74519](https://github.com/PaddlePaddle/Paddle/pull/74519), [#74520](https://github.com/PaddlePaddle/Paddle/pull/74520), [#74468](https://github.com/PaddlePaddle/Paddle/pull/74468) +- Optimizations related to custom operators. [#74402](https://github.com/PaddlePaddle/Paddle/pull/74402) +- Distributed FlexCheckpoint support. [#74966](https://github.com/PaddlePaddle/Paddle/pull/74966), [#74593](https://github.com/PaddlePaddle/Paddle/pull/74593), [#74785](https://github.com/PaddlePaddle/Paddle/pull/74785), [#74814](https://github.com/PaddlePaddle/Paddle/pull/74814) + +## 2. Basic execution architecture + +### New features +- Support for dynamic graphs. 
[#74484](https://github.com/PaddlePaddle/Paddle/pull/74484) +- Support for safetensors. [#74642](https://github.com/PaddlePaddle/Paddle/pull/74642), [#74609](https://github.com/PaddlePaddle/Paddle/pull/74609), [#75049](https://github.com/PaddlePaddle/Paddle/pull/75049) +- Added offloader to optimize computation efficiency. [#74837](https://github.com/PaddlePaddle/Paddle/pull/74837) +- Added API support for forward computation of conv_transpose. [#74431](https://github.com/PaddlePaddle/Paddle/pull/74431) +- Added offloader to optimize computation efficiency. [#74837](https://github.com/PaddlePaddle/Paddle/pull/74837) +- Inference deployment adds w4afp8 quantized inference, with support for pure permutation of w4afp8 quantized weights and all2all communication. [#74270](https://github.com/PaddlePaddle/Paddle/pull/74270) ### Bug fixes - -Fix some processing logic bugs in various scenarios. ([#71813](https://github.com/PaddlePaddle/Paddle/pull/71813), [#71886](https://github.com/PaddlePaddle/Paddle/pull/71886), [#71927](https://github.com/PaddlePaddle/Paddle/pull/71927), [#71915](https://github.com/PaddlePaddle/Paddle/pull/71915), [#71946](https://github.com/PaddlePaddle/Paddle/pull/71946), [#71949](https://github.com/PaddlePaddle/Paddle/pull/71949), [#71955](https://github.com/PaddlePaddle/Paddle/pull/71955), [#71942](https://github.com/PaddlePaddle/Paddle/pull/71942), [#71939](https://github.com/PaddlePaddle/Paddle/pull/71939), [#71973](https://github.com/PaddlePaddle/Paddle/pull/71973), [#72001](https://github.com/PaddlePaddle/Paddle/pull/72001), [#72020](https://github.com/PaddlePaddle/Paddle/pull/72020), [#72014](https://github.com/PaddlePaddle/Paddle/pull/72014), [#72021](https://github.com/PaddlePaddle/Paddle/pull/72021), [#72027](https://github.com/PaddlePaddle/Paddle/pull/72027), [#72061](https://github.com/PaddlePaddle/Paddle/pull/72061), [#72025](https://github.com/PaddlePaddle/Paddle/pull/72025), 
[#72095](https://github.com/PaddlePaddle/Paddle/pull/72095), [#72108](https://github.com/PaddlePaddle/Paddle/pull/72108), [#72132](https://github.com/PaddlePaddle/Paddle/pull/72132), [#71985](https://github.com/PaddlePaddle/Paddle/pull/71985), [#72106](https://github.com/PaddlePaddle/Paddle/pull/72106), [#72140](https://github.com/PaddlePaddle/Paddle/pull/72140), [#72167](https://github.com/PaddlePaddle/Paddle/pull/72167), [#72037](https://github.com/PaddlePaddle/Paddle/pull/72037), [#72178](https://github.com/PaddlePaddle/Paddle/pull/72178), [#72143](https://github.com/PaddlePaddle/Paddle/pull/72143), [#72175](https://github.com/PaddlePaddle/Paddle/pull/72175), [#72191](https://github.com/PaddlePaddle/Paddle/pull/72191), [#72213](https://github.com/PaddlePaddle/Paddle/pull/72213), [#72189](https://github.com/PaddlePaddle/Paddle/pull/72189), [#72214](https://github.com/PaddlePaddle/Paddle/pull/72214), [#72166](https://github.com/PaddlePaddle/Paddle/pull/72166), [#72180](https://github.com/PaddlePaddle/Paddle/pull/72180), [#72284](https://github.com/PaddlePaddle/Paddle/pull/72284), [#72267](https://github.com/PaddlePaddle/Paddle/pull/72267), [#72348](https://github.com/PaddlePaddle/Paddle/pull/72348), [#72332](https://github.com/PaddlePaddle/Paddle/pull/72332), [#72307](https://github.com/PaddlePaddle/Paddle/pull/72307), [#72353](https://github.com/PaddlePaddle/Paddle/pull/72353), [#72204](https://github.com/PaddlePaddle/Paddle/pull/72204), [#72457](https://github.com/PaddlePaddle/Paddle/pull/72457), [#72426](https://github.com/PaddlePaddle/Paddle/pull/72426), [#72536](https://github.com/PaddlePaddle/Paddle/pull/72536), [#72541](https://github.com/PaddlePaddle/Paddle/pull/72541), [#72365](https://github.com/PaddlePaddle/Paddle/pull/72365), [#72621](https://github.com/PaddlePaddle/Paddle/pull/72621), [#72630](https://github.com/PaddlePaddle/Paddle/pull/72630), [#72669](https://github.com/PaddlePaddle/Paddle/pull/72669), 
[#72682](https://github.com/PaddlePaddle/Paddle/pull/72682), [#72732](https://github.com/PaddlePaddle/Paddle/pull/72732), [#72811](https://github.com/PaddlePaddle/Paddle/pull/72811), [#72941](https://github.com/PaddlePaddle/Paddle/pull/72941), [#72795](https://github.com/PaddlePaddle/Paddle/pull/72795), [#73536](https://github.com/PaddlePaddle/Paddle/pull/73536)) - -## 4. Auto Parallel architecture - -In version 3.1, we further refined the automatic parallel architecture to enhance the usability of automatic parallelism and the performance of dynamic graphs. Specifically, we improved the core mechanism of automatic parallelism, including adding multiple operator splitting derivation rules, supporting the splitting of the same dimension of distributed tensors by multiple mesh dimensions, and supporting dynamic graph parallel strategies (PP, CP, SEP, TP-CONV), etc. At the same time, we systematically optimized the performance of the automatic parallel system for dynamic graphs, achieving performance that is basically on par with manual parallelism on models such as Llama. - -### Improvements - -- Support for distributed tensors where the same dimension is partitioned by multiple mesh dimensions. [#73233](https://github.com/PaddlePaddle/Paddle/pull/73233) -- Support for converting automatic parallel communication topology descriptions (ProcessMesh) into manual parallel communication groups. [#72052](https://github.com/PaddlePaddle/Paddle/pull/72052) -- Support send/recv of any serializable Python object. [#72098](https://github.com/PaddlePaddle/Paddle/pull/72098) -- Complete dynamic graph parallel strategy -- Support for pipeline parallelism strategies 1F1B and VPP scheduling. [#72155](https://github.com/PaddlePaddle/Paddle/pull/72155), [#72480](https://github.com/PaddlePaddle/Paddle/pull/72480), [#72179](https://github.com/PaddlePaddle/Paddle/pull/72179) -- Support for parallel processing of long texts. 
[#73195](https://github.com/PaddlePaddle/Paddle/pull/73195) -- Support for visual parallelism strategy. [#73063](https://github.com/PaddlePaddle/Paddle/pull/73063), [#73039](https://github.com/PaddlePaddle/Paddle/pull/73039) -- Support automatic parallel communication in the data parallel dimension. [#72540](https://github.com/PaddlePaddle/Paddle/pull/72540) -- Add the following operator segmentation derivation rules -- `min`, `min_grad` [#72269](https://github.com/PaddlePaddle/Paddle/pull/72269) -- `bitwise_or`,`atan2`,`fmax`,`fmin`,`reciprocal` [#72310](https://github.com/PaddlePaddle/Paddle/pull/72310) -- `argmin`, `abs`, `cosh` [#72264](https://github.com/PaddlePaddle/Paddle/pull/72264) -- `mean_all`, `mean_all_grad` [#72479](https://github.com/PaddlePaddle/Paddle/pull/72479) -- `topk`, `topk_grad` [#72499](https://github.com/PaddlePaddle/Paddle/pull/72499) -- `argsort` [#72388](https://github.com/PaddlePaddle/Paddle/pull/72388) -- `round`, `mish`, `elu`, `selu`, `celu`, `stanh`, `softplus`, `softshrink`, `thresholded_relu`, `logit`, `nonzero` [#72312](https://github.com/PaddlePaddle/Paddle/pull/72312) -- `unique ops` [#72824](https://github.com/PaddlePaddle/Paddle/pull/72824) -- `put_along_axis` [#72766](https://github.com/PaddlePaddle/Paddle/pull/72766) -- `round_grad`, `trunc_grad`, `ceil_grad`, `floor_grad`, `poisson_grad` [#72677](https://github.com/PaddlePaddle/Paddle/pull/72677) -- `log_softmax`, `cummax`, `cummin` [#72720](https://github.com/PaddlePaddle/Paddle/pull/72720) -- `unary` [#72177](https://github.com/PaddlePaddle/Paddle/pull/72177) -- `unary_grad` [#72260](https://github.com/PaddlePaddle/Paddle/pull/72260) -- `index_select`, `index_select_grad` [#72727](https://github.com/PaddlePaddle/Paddle/pull/72727) -- `roll`, `roll_grad` [#72740](https://github.com/PaddlePaddle/Paddle/pull/72740) -- `empty_like` [#73169](https://github.com/PaddlePaddle/Paddle/pull/73169) -- `roi_align`, `roi_align_grad` 
[#72925](https://github.com/PaddlePaddle/Paddle/pull/72925) -- `expand_as`, `expand_as_grad` [#73107](https://github.com/PaddlePaddle/Paddle/pull/73107) -- `fused_gemm_epilogur` [#73126](https://github.com/PaddlePaddle/Paddle/pull/73126) -- `label_smooth`, `label_smooth` [#72845](https://github.com/PaddlePaddle/Paddle/pull/72845) -- `group_norm`, `group_norm_grad` [#72946](https://github.com/PaddlePaddle/Paddle/pull/72946) -- `instance_norm`, `instance_norm_grad` [#72938](https://github.com/PaddlePaddle/Paddle/pull/72938) -- `batch_norm`, `sync_batch_norm` [#72918](https://github.com/PaddlePaddle/Paddle/pull/72918) -- `reduce_any` [#73175](https://github.com/PaddlePaddle/Paddle/pull/73175) -- `fused_gemm_epilogue_rule` [#73494](https://github.com/PaddlePaddle/Paddle/pull/73494) - -### Performance - -* Support for the tensor_fusion optimization strategy and overlap optimization strategy with grouped parallel slicing. [#72551](https://github.com/PaddlePaddle/Paddle/pull/72551), [#72902](https://github.com/PaddlePaddle/Paddle/pull/72902), [#73142](https://github.com/PaddlePaddle/Paddle/pull/73142), [#71785](https://github.com/PaddlePaddle/Paddle/pull/71785) -* Optimize the reshard module to reduce communication overhead. [#71969](https://github.com/PaddlePaddle/Paddle/pull/71969), [#73024](https://github.com/PaddlePaddle/Paddle/pull/73024), [#71868](https://github.com/PaddlePaddle/Paddle/pull/71868) -* Optimize the slicing derivation rule for multiply to reduce communication overhead. [#73408](https://github.com/PaddlePaddle/Paddle/pull/73408) -* Optimize the reverse communication when the distributed partition status is set to Partial, in order to reduce communication overhead. [#73236](https://github.com/PaddlePaddle/Paddle/pull/73236) -* Communication fusion optimization during gradient update. 
[#72120](https://github.com/PaddlePaddle/Paddle/pull/72120) and [#72745](https://github.com/PaddlePaddle/Paddle/pull/72745) -* Optimize the derivation of gelu slicing to reduce communication overhead. [#73279](https://github.com/PaddlePaddle/Paddle/pull/73279) -* Optimize the slicing derivation rule of fused_rms_norm when there is a Partial state in the input, to reduce communication and computation overhead. [#73054](https://github.com/PaddlePaddle/Paddle/pull/73054) +- Core framework and infrastructure optimization. [#74336](https://github.com/PaddlePaddle/Paddle/pull/74336), [#74554](https://github.com/PaddlePaddle/Paddle/pull/74554), [#74634](https://github.com/PaddlePaddle/Paddle/pull/74634) +- Calculation accuracy and type handling. [#74278](https://github.com/PaddlePaddle/Paddle/pull/74278), [#74222](https://github.com/PaddlePaddle/Paddle/pull/74222), [#74830](https://github.com/PaddlePaddle/Paddle/pull/74830) +- Optimization of dynamic dimension check logic. [#74633](https://github.com/PaddlePaddle/Paddle/pull/74633), [#74650](https://github.com/PaddlePaddle/Paddle/pull/74650) +- Memory and illegal access fixes. [#74347](https://github.com/PaddlePaddle/Paddle/pull/74347), [#73443](https://github.com/PaddlePaddle/Paddle/pull/73443), [#74953](https://github.com/PaddlePaddle/Paddle/pull/74953) +- Fixed printing of error/warning messages. [#74474](https://github.com/PaddlePaddle/Paddle/pull/74474), [#74533](https://github.com/PaddlePaddle/Paddle/pull/74533), [#74685](https://github.com/PaddlePaddle/Paddle/pull/74685), [#74721](https://github.com/PaddlePaddle/Paddle/pull/74721), [#74754](https://github.com/PaddlePaddle/Paddle/pull/74754) +- Code quality and documentation correction. [#74378](https://github.com/PaddlePaddle/Paddle/pull/74378), [#74828](https://github.com/PaddlePaddle/Paddle/pull/74828) +- Fixed the processing logic of the flashmask API. 
[#74928](https://github.com/PaddlePaddle/Paddle/pull/74928) +- Fixed the issue where splitting CudaGraph subgraphs did not take effect in dynamic-to-static mode. ([#74749](https://github.com/PaddlePaddle/Paddle/pull/74749)) + +### Enhanced functionality +- C++ extension development. [#74338](https://github.com/PaddlePaddle/Paddle/pull/74338) +- Optimization of FlexCP function. [#74752](https://github.com/PaddlePaddle/Paddle/pull/74752), [#74981](https://github.com/PaddlePaddle/Paddle/pull/74981) +- Optimize memory allocation. [#74463](https://github.com/PaddlePaddle/Paddle/pull/74463) + +### Deprecated +- Clean up old IR-related unit tests for dynamic, static, and transition scenarios. [#74698](https://github.com/PaddlePaddle/Paddle/pull/74698), [#74715](https://github.com/PaddlePaddle/Paddle/pull/74715), [#74718](https://github.com/PaddlePaddle/Paddle/pull/74718), [#74782](https://github.com/PaddlePaddle/Paddle/pull/74782), [#74962](https://github.com/PaddlePaddle/Paddle/pull/74962) + +### Other +- Update patch version. [#74940](https://github.com/PaddlePaddle/Paddle/pull/74940) + +## 3. Distributed & automatic parallelism + +### Parallel strategy +In version 3.2, we made multiple enhancements to pipeline parallelism, including support for dictionary parameter passing and extending Pipeline Layer and SharedLayerDesc to work with non-pipeline parallelism. Additionally, we fixed several critical issues: IPC API exceptions for large tensors, evaluation-batch and non-`compute_loss` handling in pipeline parallelism, gradient release errors in MoE models, hangs when rebuilding NCCL communication in PP scenarios, and event management errors in dual-pipeline parallelism. 
Furthermore, we improved the computation overlap efficiency of dual-pipeline parallelism to enhance training performance, and upgraded the `clear_param_storage` method to support clearing and resetting multiple color collections in sharding mode. + +#### New Features +- Implement support for dictionary parameter passing in Pipeline Parallel. [#74574](https://github.com/PaddlePaddle/Paddle/pull/74574), [#74867](https://github.com/PaddlePaddle/Paddle/pull/74867) +- Pipeline Layer and SharedLayerDesc support non-pipeline parallelism (nonpp parallel). [#74573](https://github.com/PaddlePaddle/Paddle/pull/74573) + +#### Bug fixes +- Fixed the IPC API issue with large tensors. [#74472](https://github.com/PaddlePaddle/Paddle/pull/74472) +- Fixed issues related to evaluation batch and non-`compute_loss` in pipeline parallelism. [#74170](https://github.com/PaddlePaddle/Paddle/pull/74170) +- Fixed the gradient release issue in MoE models. [#74972](https://github.com/PaddlePaddle/Paddle/pull/74972) +- Fixed the hang issue when rebuilding NCCL comm in the pp scenario. [#73625](https://github.com/PaddlePaddle/Paddle/pull/73625) +- Fixed the event management error in dual pipeline parallelism (dual pp). [#74158](https://github.com/PaddlePaddle/Paddle/pull/74158) + +#### Optimization and improvement +- Optimize the efficiency of computation overlap in dual pipeline parallelism to enhance training performance. [#74527](https://github.com/PaddlePaddle/Paddle/pull/74527) +- Upgrade the `clear_param_storage` method to support the clearing and resetting of multiple color collections under sharding. [#74741](https://github.com/PaddlePaddle/Paddle/pull/74741) + +### Automatic parallelism +#### Functional improvements +- Support the default splitting derivation rule for the same dimension of distributed tensors when it is split by multiple mesh dimensions. 
[#74396](https://github.com/PaddlePaddle/Paddle/pull/74396) +- Improved the slicing derivation rule of the `reshape` operator to support scenarios where the same dimension of a distributed tensor is sliced by multiple mesh dimensions. [#74352](https://github.com/PaddlePaddle/Paddle/pull/74352), [#74579](https://github.com/PaddlePaddle/Paddle/pull/74579), [#74565](https://github.com/PaddlePaddle/Paddle/pull/74565) +- Support changing the mesh of a tensor without altering the distributed tensor data. [#74248](https://github.com/PaddlePaddle/Paddle/pull/74248) + +#### Bug fixes +- Fixed the bug of repeatedly creating communication groups when calling the `get_group` method of `ProcessMesh`. [#73099](https://github.com/PaddlePaddle/Paddle/pull/73099) +- Fixed the bug in the `get_local_slices` method in the MoE scenario. [#74705](https://github.com/PaddlePaddle/Paddle/pull/74705) +- Fixed the bug of gradient clipping in the MoE scenario. [#74916](https://github.com/PaddlePaddle/Paddle/pull/74916) +- Fixed the bug where the `stop_gradient` parameter could not be passed between different stages in the pipeline parallel scenario. [#73459](https://github.com/PaddlePaddle/Paddle/pull/73459) +- Fixed the accuracy bug of gradient clipping in parallel pipeline scenarios. [#74409](https://github.com/PaddlePaddle/Paddle/pull/74409) +- Fixed the bug of generating redundant outputs in the dynamic graph pipeline parallel scenario. [#74913](https://github.com/PaddlePaddle/Paddle/pull/74913) +- Fixed the bug that the operators `moe_combine` and `moe_gate_dispatch` did not work in the MoE scenario. [#74645](https://github.com/PaddlePaddle/Paddle/pull/74645) + +#### Other +- Support accuracy alignment for manual and automatic parallelism of data loaders. [#73941](https://github.com/PaddlePaddle/Paddle/pull/73941) +- Optimize the dynamic graph pipeline parallel scheduling logic. 
[#74720](https://github.com/PaddlePaddle/Paddle/pull/74720) + +### Communication Library +In version 3.2, we fixed a bug in DeepEP's support for sm90 compilation, added pre-allocation for the GPU memory requested by DeepEP, and upgraded its intranode and internode computation kernels, further optimizing performance and stability. + +#### Bug fixes +- Fixed a bug in DeepEP support for sm90 compilation. [#74762](https://github.com/PaddlePaddle/Paddle/pull/74762) + +#### Functional improvements +- Added pre-allocation function for the GPU memory allocation requested by DeepEP. [#74465](https://github.com/PaddlePaddle/Paddle/pull/74465) +- Upgraded the intranode and internode computation kernels of DeepEP. [#74284](https://github.com/PaddlePaddle/Paddle/pull/74284) + +## 4. Operator mechanism +### New features +- API compatibility support. [#74506](https://github.com/PaddlePaddle/Paddle/pull/74506), [#74676](https://github.com/PaddlePaddle/Paddle/pull/74676), [#74558](https://github.com/PaddlePaddle/Paddle/pull/74558), [#74572](https://github.com/PaddlePaddle/Paddle/pull/74572), [#74691](https://github.com/PaddlePaddle/Paddle/pull/74691), [#74703](https://github.com/PaddlePaddle/Paddle/pull/74703), [#74750](https://github.com/PaddlePaddle/Paddle/pull/74750), [#74757](https://github.com/PaddlePaddle/Paddle/pull/74757), [#74802](https://github.com/PaddlePaddle/Paddle/pull/74802), [#74546](https://github.com/PaddlePaddle/Paddle/pull/74546), [#74547](https://github.com/PaddlePaddle/Paddle/pull/74547), [#74859](https://github.com/PaddlePaddle/Paddle/pull/74859), [#74910](https://github.com/PaddlePaddle/Paddle/pull/74910), [#74873](https://github.com/PaddlePaddle/Paddle/pull/74873), [#74882](https://github.com/PaddlePaddle/Paddle/pull/74882), [#74901](https://github.com/PaddlePaddle/Paddle/pull/74901), [#74899](https://github.com/PaddlePaddle/Paddle/pull/74899), 
[#74449](https://github.com/PaddlePaddle/Paddle/pull/74449) +- Added fused_partial_rope operator. [#74577](https://github.com/PaddlePaddle/Paddle/pull/74577) ### Bug fixes - -- Fixed the bug of communication hang in the virtual pipeline parallel strategy on H-card. [#71104](https://github.com/PaddlePaddle/Paddle/pull/71104), [#73470](https://github.com/PaddlePaddle/Paddle/pull/73470) -- Fixed the bug in save/load. [#72023](https://github.com/PaddlePaddle/Paddle/pull/72023) -- Fixed the bug that the linear_fused_grad_add strategy did not work in dynamic graph mode. [#72708](https://github.com/PaddlePaddle/Paddle/pull/72708) -- Fixed the issues of the fused_rms_norm operator not running and accuracy bugs. [#72663](https://github.com/PaddlePaddle/Paddle/pull/72663) -- Fixed the bug in the derivation rule for the expand operator segmentation. [#73154](https://github.com/PaddlePaddle/Paddle/pull/73154) - -### Others - -- Clean up dead code to facilitate code maintenance. [#71814](https://github.com/PaddlePaddle/Paddle/pull/71814), [#72538](https://github.com/PaddlePaddle/Paddle/pull/72538) -- Added a new API, `local_map`, to pass distributed tensors to functions written for ordinary tensors. ([#71804](https://github.com/PaddlePaddle/Paddle/pull/71804)) -- Add checks for operator fused_linear_param_grad_add. ([#72483](https://github.com/PaddlePaddle/Paddle/pull/72483)) - -## 5. Operator Mechanism - -### New Features - -- Gradient and automatic differentiation optimization: Initially supports dual gradient computation for put_along_axis and repeat_interleave operations, improves the numerical stability of complex operators in automatic differentiation scenarios, and implements operator decomposition for masked_fill operations. 
[#72789](https://github.com/PaddlePaddle/Paddle/pull/72789), [#73056](https://github.com/PaddlePaddle/Paddle/pull/73056), [#73225](https://github.com/PaddlePaddle/Paddle/pull/73225) -- Operator mechanism extension: Added custom support for __radd__ and __rmul__, enhancing the framework's ability to overload asymmetric operators. [#73119](https://github.com/PaddlePaddle/Paddle/pull/73119) -- FP8 Module Support and Operator Development: Added support for FP8 block quantization GEMM, introduced multiple fused operators, and provided efficient operator-level implementations for Mixture of Experts (MoE) models, enhancing training and inference performance. [#73228](https://github.com/PaddlePaddle/Paddle/pull/73228), [#73285](https://github.com/PaddlePaddle/Paddle/pull/73285), [#73133](https://github.com/PaddlePaddle/Paddle/pull/73133), [#73364](https://github.com/PaddlePaddle/Paddle/pull/73364), [#73520](https://github.com/PaddlePaddle/Paddle/pull/73520), [#73531](https://github.com/PaddlePaddle/Paddle/pull/73531) - -### Bug Fixes - -- Gradient and automatic differentiation stability improvement: Fixed some errors in the calculation of inverse operator gradients, enhancing numerical stability and functional correctness in automatic differentiation scenarios. [#71716](https://github.com/PaddlePaddle/Paddle/pull/71716), [#72299](https://github.com/PaddlePaddle/Paddle/pull/72299), [#72358](https://github.com/PaddlePaddle/Paddle/pull/72358), [#73037](https://github.com/PaddlePaddle/Paddle/pull/73037), [#73140](https://github.com/PaddlePaddle/Paddle/pull/73140), [#73185](https://github.com/PaddlePaddle/Paddle/pull/73185) -- Numerical accuracy and overflow protection: Addresses issues such as numerical overflow, loss of precision, and large tensor overflow, ensuring the reliability of low-precision computations and large tensor operations. 
[#72584](https://github.com/PaddlePaddle/Paddle/pull/72584), [#72608](https://github.com/PaddlePaddle/Paddle/pull/72608), [#72681](https://github.com/PaddlePaddle/Paddle/pull/72681), [#72639](https://github.com/PaddlePaddle/Paddle/pull/72639), [#73245](https://github.com/PaddlePaddle/Paddle/pull/73245), [#73359](https://github.com/PaddlePaddle/Paddle/pull/73359), [#72456](https://github.com/PaddlePaddle/Paddle/pull/72456) -- Operator logic and framework alignment: Align operator operation logic, fix issues such as abnormal operator inputs, and other important fixes: add checks to ensure the correctness of framework functionality. [#72282](https://github.com/PaddlePaddle/Paddle/pull/72282), [#71863](https://github.com/PaddlePaddle/Paddle/pull/71863), [#72650](https://github.com/PaddlePaddle/Paddle/pull/72650), [#72843](https://github.com/PaddlePaddle/Paddle/pull/72843), [#73070](https://github.com/PaddlePaddle/Paddle/pull/73070), [#73141](https://github.com/PaddlePaddle/Paddle/pull/73141), [#73203](https://github.com/PaddlePaddle/Paddle/pull/73203), [#73350](https://github.com/PaddlePaddle/Paddle/pull/73350), [#73440](https://github.com/PaddlePaddle/Paddle/pull/73440), [#73539](https://github.com/PaddlePaddle/Paddle/pull/73539), [#73339](https://github.com/PaddlePaddle/Paddle/pull/73339) -- CUDA kernel and hardware adaptation optimization: Supports NVIDIA SM90 architecture, fixes issues such as overflow, removes redundant CUDA error checks, and enhances GPU computing efficiency and adaptability to new hardware. 
[#72507](https://github.com/PaddlePaddle/Paddle/pull/72507), [#72849](https://github.com/PaddlePaddle/Paddle/pull/72849), [#72959](https://github.com/PaddlePaddle/Paddle/pull/72959), [#73130](https://github.com/PaddlePaddle/Paddle/pull/73130), [#73489](https://github.com/PaddlePaddle/Paddle/pull/73489) - -### Improvements - -- Added an implementation of fast division and modulo operation for the int64_t version, improving computational performance and numerical stability in large integer scenarios, [#72530](https://github.com/PaddlePaddle/Paddle/pull/72530) -- Optimize the kernel with stride tensor copy to improve the efficiency of data copy under non-continuous memory layout. [#72662](https://github.com/PaddlePaddle/Paddle/pull/72662) - --Unify the usage of quantization API in dynamic and static graph modes, simplifying the development process of quantization models, [#73100](https://github.com/PaddlePaddle/Paddle/pull/73100) - -### Performance - -- Optimize the decomposition performance of the gelu operator to enhance computational efficiency. 
[#72812](https://github.com/PaddlePaddle/Paddle/pull/72812) - -### Others - -- Fluid operator normalization and exit, [#71789](https://github.com/PaddlePaddle/Paddle/pull/71789), [#71818](https://github.com/PaddlePaddle/Paddle/pull/71818), [#71808](https://github.com/PaddlePaddle/Paddle/pull/71808), [#71860](https://github.com/PaddlePaddle/Paddle/pull/71860), [#71806](https://github.com/PaddlePaddle/Paddle/pull/71806), [#72011](https://github.com/PaddlePaddle/Paddle/pull/72011), [#72043](https://github.com/PaddlePaddle/Paddle/pull/72043), [#72034](https://github.com/PaddlePaddle/Paddle/pull/72034), [#72047](https://github.com/PaddlePaddle/Paddle/pull/72047), [#72056](https://github.com/PaddlePaddle/Paddle/pull/72056), [#72087](https://github.com/PaddlePaddle/Paddle/pull/72087), [#72086](https://github.com/PaddlePaddle/Paddle/pull/72086), [#72083](https://github.com/PaddlePaddle/Paddle/pull/72083), [#72079](https://github.com/PaddlePaddle/Paddle/pull/72079), [#72078](https://github.com/PaddlePaddle/Paddle/pull/72078), [#72076](https://github.com/PaddlePaddle/Paddle/pull/72076), [#72057](https://github.com/PaddlePaddle/Paddle/pull/72057), [#72077](https://github.com/PaddlePaddle/Paddle/pull/72077), [#72096](https://github.com/PaddlePaddle/Paddle/pull/72096), [#72085](https://github.com/PaddlePaddle/Paddle/pull/72085), [#72092](https://github.com/PaddlePaddle/Paddle/pull/72092), [#72110](https://github.com/PaddlePaddle/Paddle/pull/72110), [#72127](https://github.com/PaddlePaddle/Paddle/pull/72127), [#72111](https://github.com/PaddlePaddle/Paddle/pull/72111), [#72126](https://github.com/PaddlePaddle/Paddle/pull/72126), [#72135](https://github.com/PaddlePaddle/Paddle/pull/72135), [#72112](https://github.com/PaddlePaddle/Paddle/pull/72112), [#72131](https://github.com/PaddlePaddle/Paddle/pull/72131), [#70358](https://github.com/PaddlePaddle/Paddle/pull/70358), [#72125](https://github.com/PaddlePaddle/Paddle/pull/72125), 
[#72171](https://github.com/PaddlePaddle/Paddle/pull/72171), [#72160](https://github.com/PaddlePaddle/Paddle/pull/72160), [#72188](https://github.com/PaddlePaddle/Paddle/pull/72188), [#72197](https://github.com/PaddlePaddle/Paddle/pull/72197) - -## 6. Performance - -### New Features - -The `acc_steps` of `sharding_overlap` is configurable. [#72395](https://github.com/PaddlePaddle/Paddle/pull/72395) - +- 0-size Tensor related fixes. [#74295](https://github.com/PaddlePaddle/Paddle/pull/74295), [#74305](https://github.com/PaddlePaddle/Paddle/pull/74305), [#74323](https://github.com/PaddlePaddle/Paddle/pull/74323), [#74354](https://github.com/PaddlePaddle/Paddle/pull/74354) +- Major Tensor-related fixes. [#74242](https://github.com/PaddlePaddle/Paddle/pull/74242), [#74293](https://github.com/PaddlePaddle/Paddle/pull/74293), [#74289](https://github.com/PaddlePaddle/Paddle/pull/74289), [#74279](https://github.com/PaddlePaddle/Paddle/pull/74279), [#74330](https://github.com/PaddlePaddle/Paddle/pull/74330), [#74329](https://github.com/PaddlePaddle/Paddle/pull/74329), [#74342](https://github.com/PaddlePaddle/Paddle/pull/74342), [#74369](https://github.com/PaddlePaddle/Paddle/pull/74369), [#74370](https://github.com/PaddlePaddle/Paddle/pull/74370), [#74404](https://github.com/PaddlePaddle/Paddle/pull/74404), [#74537](https://github.com/PaddlePaddle/Paddle/pull/74537), [#74451](https://github.com/PaddlePaddle/Paddle/pull/74451), [#74172](https://github.com/PaddlePaddle/Paddle/pull/74172), [#74324](https://github.com/PaddlePaddle/Paddle/pull/74324), [#74964](https://github.com/PaddlePaddle/Paddle/pull/74964), [#74360](https://github.com/PaddlePaddle/Paddle/pull/74360), [#74379](https://github.com/PaddlePaddle/Paddle/pull/74379), [#74377](https://github.com/PaddlePaddle/Paddle/pull/74377), [#74380](https://github.com/PaddlePaddle/Paddle/pull/74380), [#74362](https://github.com/PaddlePaddle/Paddle/pull/74362), [#74197](https://github.com/PaddlePaddle/Paddle/pull/74197) +- API 
compatibility-related fixes. [#74764](https://github.com/PaddlePaddle/Paddle/pull/74764), [#74869](https://github.com/PaddlePaddle/Paddle/pull/74869), [#74935](https://github.com/PaddlePaddle/Paddle/pull/74935) +- [Open Source Task] Investigate and resolve precision issues in Paddle CPU/GPU Kernels. [#74149](https://github.com/PaddlePaddle/Paddle/pull/74149), [#74598](https://github.com/PaddlePaddle/Paddle/pull/74598), [#74719](https://github.com/PaddlePaddle/Paddle/pull/74719), [#74625](https://github.com/PaddlePaddle/Paddle/pull/74625), [#74555](https://github.com/PaddlePaddle/Paddle/pull/74555) +- Other important fixes. [#74282](https://github.com/PaddlePaddle/Paddle/pull/74282), [#74313](https://github.com/PaddlePaddle/Paddle/pull/74313), [#74303](https://github.com/PaddlePaddle/Paddle/pull/74303), [#74306](https://github.com/PaddlePaddle/Paddle/pull/74306), [#74298](https://github.com/PaddlePaddle/Paddle/pull/74298), [#74044](https://github.com/PaddlePaddle/Paddle/pull/74044), [#74290](https://github.com/PaddlePaddle/Paddle/pull/74290), [#74348](https://github.com/PaddlePaddle/Paddle/pull/74348), [#74364](https://github.com/PaddlePaddle/Paddle/pull/74364), [#74332](https://github.com/PaddlePaddle/Paddle/pull/74332), [#74224](https://github.com/PaddlePaddle/Paddle/pull/74224), [#74382](https://github.com/PaddlePaddle/Paddle/pull/74382), [#74406](https://github.com/PaddlePaddle/Paddle/pull/74406), [#74434](https://github.com/PaddlePaddle/Paddle/pull/74434), [#74448](https://github.com/PaddlePaddle/Paddle/pull/74448), [#74457](https://github.com/PaddlePaddle/Paddle/pull/74457), [#74322](https://github.com/PaddlePaddle/Paddle/pull/74322), [#74530](https://github.com/PaddlePaddle/Paddle/pull/74530), [#74716](https://github.com/PaddlePaddle/Paddle/pull/74716), [#74839](https://github.com/PaddlePaddle/Paddle/pull/74839), [#74842](https://github.com/PaddlePaddle/Paddle/pull/74842), [#74854](https://github.com/PaddlePaddle/Paddle/pull/74854), 
[#74919](https://github.com/PaddlePaddle/Paddle/pull/74919), [#74767](https://github.com/PaddlePaddle/Paddle/pull/74767), [#75003](https://github.com/PaddlePaddle/Paddle/pull/75003) + +### Enhanced functionality +- Improved API compatibility. [#74456](https://github.com/PaddlePaddle/Paddle/pull/74456), [#74480](https://github.com/PaddlePaddle/Paddle/pull/74480), [#74523](https://github.com/PaddlePaddle/Paddle/pull/74523), [#74490](https://github.com/PaddlePaddle/Paddle/pull/74490), [#74548](https://github.com/PaddlePaddle/Paddle/pull/74548), [#74596](https://github.com/PaddlePaddle/Paddle/pull/74596), [#74568](https://github.com/PaddlePaddle/Paddle/pull/74568), [#74559](https://github.com/PaddlePaddle/Paddle/pull/74559), [#74629](https://github.com/PaddlePaddle/Paddle/pull/74629), [#74623](https://github.com/PaddlePaddle/Paddle/pull/74623), [#74700](https://github.com/PaddlePaddle/Paddle/pull/74700), [#74643](https://github.com/PaddlePaddle/Paddle/pull/74643), [#74602](https://github.com/PaddlePaddle/Paddle/pull/74602), [#74783](https://github.com/PaddlePaddle/Paddle/pull/74783), [#74781](https://github.com/PaddlePaddle/Paddle/pull/74781), [#74735](https://github.com/PaddlePaddle/Paddle/pull/74735), [#74725](https://github.com/PaddlePaddle/Paddle/pull/74725), [#74815](https://github.com/PaddlePaddle/Paddle/pull/74815), [#74856](https://github.com/PaddlePaddle/Paddle/pull/74856), [#74925](https://github.com/PaddlePaddle/Paddle/pull/74925), [#74545](https://github.com/PaddlePaddle/Paddle/pull/74545), [#74932](https://github.com/PaddlePaddle/Paddle/pull/74932), [#74784](https://github.com/PaddlePaddle/Paddle/pull/74784) +- Slice/stride related optimizations. 
[#74731](https://github.com/PaddlePaddle/Paddle/pull/74731), [#74740](https://github.com/PaddlePaddle/Paddle/pull/74740), [#74769](https://github.com/PaddlePaddle/Paddle/pull/74769), [#74810](https://github.com/PaddlePaddle/Paddle/pull/74810), [#74841](https://github.com/PaddlePaddle/Paddle/pull/74841), [#74954](https://github.com/PaddlePaddle/Paddle/pull/74954), [#74888](https://github.com/PaddlePaddle/Paddle/pull/74888), [#74944](https://github.com/PaddlePaddle/Paddle/pull/74944), [#74312](https://github.com/PaddlePaddle/Paddle/pull/74312), [#74291](https://github.com/PaddlePaddle/Paddle/pull/74291), [#74271](https://github.com/PaddlePaddle/Paddle/pull/74271), [#74320](https://github.com/PaddlePaddle/Paddle/pull/74320), [#74344](https://github.com/PaddlePaddle/Paddle/pull/74344), [#74727](https://github.com/PaddlePaddle/Paddle/pull/74727), [#74637](https://github.com/PaddlePaddle/Paddle/pull/74637) +- Operator optimization and CUDA support. [#74693](https://github.com/PaddlePaddle/Paddle/pull/74693), [#74922](https://github.com/PaddlePaddle/Paddle/pull/74922), [#74967](https://github.com/PaddlePaddle/Paddle/pull/74967) +- Improved debugging information and compatibility enhancements. [#74372](https://github.com/PaddlePaddle/Paddle/pull/74372), [#74622](https://github.com/PaddlePaddle/Paddle/pull/74622) +- Operator function expansion and optimization. [#74790](https://github.com/PaddlePaddle/Paddle/pull/74790), [#74979](https://github.com/PaddlePaddle/Paddle/pull/74979) + +### Performance optimization +- FP8 computation optimization. [#74471](https://github.com/PaddlePaddle/Paddle/pull/74471), [#74684](https://github.com/PaddlePaddle/Paddle/pull/74684), [#74911](https://github.com/PaddlePaddle/Paddle/pull/74911) +- Basic operator performance optimization. [#74442](https://github.com/PaddlePaddle/Paddle/pull/74442), [#74638](https://github.com/PaddlePaddle/Paddle/pull/74638) +- Support fa3 variable-length sequence backward computation and optimize the forward API. 
[#73831](https://github.com/PaddlePaddle/Paddle/pull/73831) +- Added FlashMask V2 function. [#74729](https://github.com/PaddlePaddle/Paddle/pull/74729) + +### Documentation +- Fixed issues with English documentation and copyright year. [#74737](https://github.com/PaddlePaddle/Paddle/pull/74737) + +### Other +- The WITH_XPU_FFT option is enabled by default on XPU hardware. [#74699](https://github.com/PaddlePaddle/Paddle/pull/74699) + +## 5. Hardware adaptation +### Improved CUDA-like hardware integration solution +- The CUDA-like hardware integration solution supports the reuse of cuBLAS kernels. [#74591](https://github.com/PaddlePaddle/Paddle/pull/74591) +- Fixed known issues in the CUDA-like hardware integration solution. [#74397](https://github.com/PaddlePaddle/Paddle/pull/74397), [#74411](https://github.com/PaddlePaddle/Paddle/pull/74411), [#74428](https://github.com/PaddlePaddle/Paddle/pull/74428), [#74877](https://github.com/PaddlePaddle/Paddle/pull/74877), [#74939](https://github.com/PaddlePaddle/Paddle/pull/74939) + +### Main repository unit tests support multiple hardware +- Unit tests support multiple hardware. [#74349](https://github.com/PaddlePaddle/Paddle/pull/74349), [#74363](https://github.com/PaddlePaddle/Paddle/pull/74363), [#74806](https://github.com/PaddlePaddle/Paddle/pull/74806), [#74868](https://github.com/PaddlePaddle/Paddle/pull/74868), [#74820](https://github.com/PaddlePaddle/Paddle/pull/74820), [#74927](https://github.com/PaddlePaddle/Paddle/pull/74927) + +### New Custom Device API Support +- Added support for Custom Device API. [#74308](https://github.com/PaddlePaddle/Paddle/pull/74308), [#74371](https://github.com/PaddlePaddle/Paddle/pull/74371), [#74539](https://github.com/PaddlePaddle/Paddle/pull/74539) + +## 6. Installation environment ### Bug fixes -- Fixed the `inplace` issue of operator `c_softmax_with_cross_entropy_grad`. 
[#72366](https://github.com/PaddlePaddle/Paddle/pull/72366) - -### Improvements - -- Performance optimization and acceleration: Enabled cuDNN support for deep convolution, enhancing convolution operation efficiency. Updated pooling operation strategy and optimized permute memory operations to reduce CUDA memory usage. Optimized printing speed, accelerating debugging and log output processes. [#71796](https://github.com/PaddlePaddle/Paddle/pull/71796), [#73442](https://github.com/PaddlePaddle/Paddle/pull/73442), [#73563](https://github.com/PaddlePaddle/Paddle/pull/73563) -- Feature Enhancements and Operational Support: Added the masked_fill operation and Boolean index optimization to enhance tensor masking processing capabilities. Implemented the index_elementwise operation to support index-based element-level operations. Added pooling and reshape execution strategies to enhance the flexibility of model operations. [#72788](https://github.com/PaddlePaddle/Paddle/pull/72788), [#72942](https://github.com/PaddlePaddle/Paddle/pull/72942) -- Bug fixes and stability improvements: Fixed a partial state support issue with fused_rms_norm in SPMD parallel mode. Corrected index errors in output dimension calculation and IndexGetStride during the slice operation to ensure computational correctness. [#72118](https://github.com/PaddlePaddle/Paddle/pull/72118), [#72223](https://github.com/PaddlePaddle/Paddle/pull/72223), [#73184](https://github.com/PaddlePaddle/Paddle/pull/73184), [#73237](https://github.com/PaddlePaddle/Paddle/pull/73237), [#73054](https://github.com/PaddlePaddle/Paddle/pull/73054) -- Faster Guard adaptation: Reduce SOT end-to-end link overhead. 
[#71900](https://github.com/PaddlePaddle/Paddle/pull/71900), [#71979](https://github.com/PaddlePaddle/Paddle/pull/71979), [#72081](https://github.com/PaddlePaddle/Paddle/pull/72081), [#72327](https://github.com/PaddlePaddle/Paddle/pull/72327), [#72564](https://github.com/PaddlePaddle/Paddle/pull/72564), [#72823](https://github.com/PaddlePaddle/Paddle/pull/72823) -- Performance optimization and acceleration: Optimize operator scheduling strategy. Upgrade Flash Attention to version v3 to reduce computational overhead. Fix model performance bottlenecks and improve inference and training speed. [#71937](https://github.com/PaddlePaddle/Paddle/pull/71937), [#71828](https://github.com/PaddlePaddle/Paddle/pull/71828), [#71461](https://github.com/PaddlePaddle/Paddle/pull/71461), [#72039](https://github.com/PaddlePaddle/Paddle/pull/72039), [#72228](https://github.com/PaddlePaddle/Paddle/pull/72228), [#72225](https://github.com/PaddlePaddle/Paddle/pull/72225), [#72623](https://github.com/PaddlePaddle/Paddle/pull/72623), [#72666](https://github.com/PaddlePaddle/Paddle/pull/72666), [#73147](https://github.com/PaddlePaddle/Paddle/pull/73147), [#73393](https://github.com/PaddlePaddle/Paddle/pull/73393) -- Parallel computing: Optimize the grid re-sharding strategy in automatic parallelism, integrate communication and optimize logic in the Sharding Stage, enhance the stability of distributed training, and reduce the communication overhead of distributed training. [#71969](https://github.com/PaddlePaddle/Paddle/pull/71969), [#72120](https://github.com/PaddlePaddle/Paddle/pull/72120), [#73279](https://github.com/PaddlePaddle/Paddle/pull/73279), [#73406](https://github.com/PaddlePaddle/Paddle/pull/73406) - -Feature enhancements and fixes: - Optimized operator indexing and kernel scheduling logic. 
[#72625](https://github.com/PaddlePaddle/Paddle/pull/72625), [#72741](https://github.com/PaddlePaddle/Paddle/pull/72741), [#73082](https://github.com/PaddlePaddle/Paddle/pull/73082), [#73501](https://github.com/PaddlePaddle/Paddle/pull/73501) - -- Model and operation support: Supports deep convolution in NHWC format, adapting to more hardware memory layouts. [#72121](https://github.com/PaddlePaddle/Paddle/pull/72121) - -## 7. Custom Device - -Optimize hardware mechanisms and provide a solution for reusing CUDA-like hardware kernels. - -### New Features - -Based on the customdevice integration solution, we introduce a low-cost support solution for hardware backends similar to CUDA. These CUDA-like backends can be plugged into Paddle in a modular manner, allowing for cost-effective reuse of the majority of CUDA kernels from the NVIDIA ecosystem within Paddle. Furthermore, they can be decoupled from feature upgrades within the Paddle framework, significantly reducing the cost of hardware backend integration and iteration, enhancing user willingness to adopt, and fostering a positive collaborative ecosystem between Paddle and hardware manufacturers. 
-[#72604](https://github.com/PaddlePaddle/Paddle/pull/72604), [#72668](https://github.com/PaddlePaddle/Paddle/pull/72668), [#72758](https://github.com/PaddlePaddle/Paddle/pull/72758), [#72865](https://github.com/PaddlePaddle/Paddle/pull/72865), [#72910](https://github.com/PaddlePaddle/Paddle/pull/72910), [#73033](https://github.com/PaddlePaddle/Paddle/pull/73033), [#73145](https://github.com/PaddlePaddle/Paddle/pull/73145), [#73281](https://github.com/PaddlePaddle/Paddle/pull/73281), [#73079](https://github.com/PaddlePaddle/Paddle/pull/73079) - -Enhance XPU fundamental capabilities: add kernels, expand data types, and supplement branches in the XPU environment -[#71424](https://github.com/PaddlePaddle/Paddle/pull/71424), [#71809](https://github.com/PaddlePaddle/Paddle/pull/71809), [#71594](https://github.com/PaddlePaddle/Paddle/pull/71594), [#71779](https://github.com/PaddlePaddle/Paddle/pull/71779), [#71756](https://github.com/PaddlePaddle/Paddle/pull/71756), [#71573](https://github.com/PaddlePaddle/Paddle/pull/71573), [#71883](https://github.com/PaddlePaddle/Paddle/pull/71883), [#71954](https://github.com/PaddlePaddle/Paddle/pull/71954), [#71931](https://github.com/PaddlePaddle/Paddle/pull/71931), [#72280](https://github.com/PaddlePaddle/Paddle/pull/72280), [#72361](https://github.com/PaddlePaddle/Paddle/pull/72361), [#72406](https://github.com/PaddlePaddle/Paddle/pull/72406), [#72528](https://github.com/PaddlePaddle/Paddle/pull/72528), [#72752](https://github.com/PaddlePaddle/Paddle/pull/72752), [#72852](https://github.com/PaddlePaddle/Paddle/pull/72852), [#72982](https://github.com/PaddlePaddle/Paddle/pull/72982), [#73357](https://github.com/PaddlePaddle/Paddle/pull/73357), [#73414](https://github.com/PaddlePaddle/Paddle/pull/73414), [#73464](https://github.com/PaddlePaddle/Paddle/pull/73464), [#73234](https://github.com/PaddlePaddle/Paddle/pull/73234), [#71776](https://github.com/PaddlePaddle/Paddle/pull/71776) - -DCU kernel extended data type 
-[#73129](https://github.com/PaddlePaddle/Paddle/pull/73129) - -### Bug Fixes - -Fix xpu execution issues -[#71852](https://github.com/PaddlePaddle/Paddle/pull/71852), [#71966](https://github.com/PaddlePaddle/Paddle/pull/71966), [#72005](https://github.com/PaddlePaddle/Paddle/pull/72005), [#71908](https://github.com/PaddlePaddle/Paddle/pull/71908), [#72431](https://github.com/PaddlePaddle/Paddle/pull/72431), [#72519](https://github.com/PaddlePaddle/Paddle/pull/72519), [#72734](https://github.com/PaddlePaddle/Paddle/pull/72734), [#72763](https://github.com/PaddlePaddle/Paddle/pull/72763), [#72762](https://github.com/PaddlePaddle/Paddle/pull/72762), [#72890](https://github.com/PaddlePaddle/Paddle/pull/72890), [#72867](https://github.com/PaddlePaddle/Paddle/pull/72867), [#73071](https://github.com/PaddlePaddle/Paddle/pull/73071), [#73004](https://github.com/PaddlePaddle/Paddle/pull/73004), [#72726](https://github.com/PaddlePaddle/Paddle/pull/72726), [#73113](https://github.com/PaddlePaddle/Paddle/pull/73113), [#73127](https://github.com/PaddlePaddle/Paddle/pull/73127), [#73025](https://github.com/PaddlePaddle/Paddle/pull/73025), [#73301](https://github.com/PaddlePaddle/Paddle/pull/73301), [#73292](https://github.com/PaddlePaddle/Paddle/pull/73292), [#73272](https://github.com/PaddlePaddle/Paddle/pull/73272), [#73305](https://github.com/PaddlePaddle/Paddle/pull/73305), [#73356](https://github.com/PaddlePaddle/Paddle/pull/73356), [#73438](https://github.com/PaddlePaddle/Paddle/pull/73438), [#72041](https://github.com/PaddlePaddle/Paddle/pull/72041), [#72275](https://github.com/PaddlePaddle/Paddle/pull/72275), [#72787](https://github.com/PaddlePaddle/Paddle/pull/72787), [#73504](https://github.com/PaddlePaddle/Paddle/pull/73504), [#73290](https://github.com/PaddlePaddle/Paddle/pull/73290) - -## 8. 
Environment Adaptation - -We have optimized the stability and cross-platform compatibility of the framework, and resolved issues related to compilation and installation failures on various platforms. We have upgraded key dependencies such as CUDA, further optimized the CI/CD process, improved the build speed, and enhanced the overall stability of the system. Additionally, we have ceased maintenance of compilation and installation in the Python 3.8 environment. - -### Bug fixes - -- Fixed compilation errors when using clang17 to compile third-party libraries. [#72524](https://github.com/PaddlePaddle/Paddle/pull/72524) -- Fixed compilation issues when using CUDA 12.9. [#72808](https://github.com/PaddlePaddle/Paddle/pull/72808), [#72841](https://github.com/PaddlePaddle/Paddle/pull/72841), [#72978](https://github.com/PaddlePaddle/Paddle/pull/72978), [#73360](https://github.com/PaddlePaddle/Paddle/pull/73360) -- Fixed compilation issues when using GCC 13.3. [#73144](https://github.com/PaddlePaddle/Paddle/pull/73144) -- Fixed compilation issues when WITH_PIP_CUDA_LIBRARIES=ON. [#72907](https://github.com/PaddlePaddle/Paddle/pull/72907) -- Fixed compilation issues when WITH_NVSHMEM=ON. [#73368](https://github.com/PaddlePaddle/Paddle/pull/73368) - -### Improvements - -- Avoid copying temporary files generated during the compilation of custom operators. [#73196](https://github.com/PaddlePaddle/Paddle/pull/73196) -- Warning message optimization. [#72877](https://github.com/PaddlePaddle/Paddle/pull/72877) - -### Devs - -- Compilation, installation, maintenance, and upgrade. [#71911](https://github.com/PaddlePaddle/Paddle/pull/71911), [#73005](https://github.com/PaddlePaddle/Paddle/pull/73005) -- Image maintenance and updates. [#71065](https://github.com/PaddlePaddle/Paddle/pull/71065), [#71821](https://github.com/PaddlePaddle/Paddle/pull/71821) -- Import, export, and update of symbols for the Windows platform. 
[#72497](https://github.com/PaddlePaddle/Paddle/pull/72497), [#72498](https://github.com/PaddlePaddle/Paddle/pull/72498), [#72500](https://github.com/PaddlePaddle/Paddle/pull/72500) -- Windows platform supports CUDA 12.8. [#72433](https://github.com/PaddlePaddle/Paddle/pull/72433) -- CI maintenance and upgrade. [#72443](https://github.com/PaddlePaddle/Paddle/pull/72443), [#72836](https://github.com/PaddlePaddle/Paddle/pull/72836), [#72563](https://github.com/PaddlePaddle/Paddle/pull/72563), [#72653](https://github.com/PaddlePaddle/Paddle/pull/72653), [#72477](https://github.com/PaddlePaddle/Paddle/pull/72477), [#72778](https://github.com/PaddlePaddle/Paddle/pull/72778), [#72960](https://github.com/PaddlePaddle/Paddle/pull/72960), [#73289](https://github.com/PaddlePaddle/Paddle/pull/73289), [#73422](https://github.com/PaddlePaddle/Paddle/pull/73422), [#73514](https://github.com/PaddlePaddle/Paddle/pull/73514), [#72748](https://github.com/PaddlePaddle/Paddle/pull/72748), -- Github Action CI construction. 
[#71738](https://github.com/PaddlePaddle/Paddle/pull/71738), [#70602](https://github.com/PaddlePaddle/Paddle/pull/70602), [#71958](https://github.com/PaddlePaddle/Paddle/pull/71958), [#71959](https://github.com/PaddlePaddle/Paddle/pull/71959), [#71992](https://github.com/PaddlePaddle/Paddle/pull/71992), [#72013](https://github.com/PaddlePaddle/Paddle/pull/72013), [#72153](https://github.com/PaddlePaddle/Paddle/pull/72153), [#72031](https://github.com/PaddlePaddle/Paddle/pull/72031), [#72141](https://github.com/PaddlePaddle/Paddle/pull/72141), [#72104](https://github.com/PaddlePaddle/Paddle/pull/72104), [#72182](https://github.com/PaddlePaddle/Paddle/pull/72182), [#72342](https://github.com/PaddlePaddle/Paddle/pull/72342), [#72352](https://github.com/PaddlePaddle/Paddle/pull/72352), [#72249](https://github.com/PaddlePaddle/Paddle/pull/72249), [#72068](https://github.com/PaddlePaddle/Paddle/pull/72068), [#72441](https://github.com/PaddlePaddle/Paddle/pull/72441), [#72392](https://github.com/PaddlePaddle/Paddle/pull/72392), [#72446](https://github.com/PaddlePaddle/Paddle/pull/72446), [#72435](https://github.com/PaddlePaddle/Paddle/pull/72435), [#72515](https://github.com/PaddlePaddle/Paddle/pull/72515), [#72514](https://github.com/PaddlePaddle/Paddle/pull/72514), [#72396](https://github.com/PaddlePaddle/Paddle/pull/72396), [#72547](https://github.com/PaddlePaddle/Paddle/pull/72547), [#72345](https://github.com/PaddlePaddle/Paddle/pull/72345), [#72236](https://github.com/PaddlePaddle/Paddle/pull/72236), [#72586](https://github.com/PaddlePaddle/Paddle/pull/72586), [#72537](https://github.com/PaddlePaddle/Paddle/pull/72537), [#72609](https://github.com/PaddlePaddle/Paddle/pull/72609), [#72632](https://github.com/PaddlePaddle/Paddle/pull/72632), [#72642](https://github.com/PaddlePaddle/Paddle/pull/72642), [#72673](https://github.com/PaddlePaddle/Paddle/pull/72673), [#72647](https://github.com/PaddlePaddle/Paddle/pull/72647), 
[#72696](https://github.com/PaddlePaddle/Paddle/pull/72696), [#72771](https://github.com/PaddlePaddle/Paddle/pull/72771), [#72711](https://github.com/PaddlePaddle/Paddle/pull/72711), [#72680](https://github.com/PaddlePaddle/Paddle/pull/72680), [#72774](https://github.com/PaddlePaddle/Paddle/pull/72774), [#72813](https://github.com/PaddlePaddle/Paddle/pull/72813), [#72804](https://github.com/PaddlePaddle/Paddle/pull/72804), [#72903](https://github.com/PaddlePaddle/Paddle/pull/72903), [#72900](https://github.com/PaddlePaddle/Paddle/pull/72900), [#72932](https://github.com/PaddlePaddle/Paddle/pull/72932), [#72967](https://github.com/PaddlePaddle/Paddle/pull/72967), [#72991](https://github.com/PaddlePaddle/Paddle/pull/72991), [#72115](https://github.com/PaddlePaddle/Paddle/pull/72115), [#73242](https://github.com/PaddlePaddle/Paddle/pull/73242), [#72801](https://github.com/PaddlePaddle/Paddle/pull/72801), [#73433](https://github.com/PaddlePaddle/Paddle/pull/73433), [#73391](https://github.com/PaddlePaddle/Paddle/pull/73391), [#73456](https://github.com/PaddlePaddle/Paddle/pull/73456), [#73376](https://github.com/PaddlePaddle/Paddle/pull/73376), [#73453](https://github.com/PaddlePaddle/Paddle/pull/73453), [#73481](https://github.com/PaddlePaddle/Paddle/pull/73481), [#73546](https://github.com/PaddlePaddle/Paddle/pull/73546), [#73446](https://github.com/PaddlePaddle/Paddle/pull/73446), [#72744](https://github.com/PaddlePaddle/Paddle/pull/72744) - -### Deprecations - -- Discontinue support for compilation in Python 3.8 environment. [#72827](https://github.com/PaddlePaddle/Paddle/pull/72827) - -## 9. 
List of contributors - -0x3878f, A-nnonymous, AndSonder, ApricityXX, aquagull, author, baoqiwen, BeingGod, blacksheep-Aristotle, BoShen5, bukejiyu, cangtianhuang, carryyu, chang-wenbin, changeyoung98, chen2016013, ckl117, co63oc, cqulilujia, crashbussy, cszdrg, Cutelemon6, cyy536, DanielSun11, danleifeng, datutu-L, deepllz, Dmovic, DrRyanHuang, dynamicheart, Eddie-Wang1120, eggman-1024, emmanuel-ferdman, Enigmatisms, enkilee, fangfangssj, feixi21, FeixLiu, ForFishes, Function-Samuel, ggggxm, GITD245, Glencsa, GoldenStain, gongshaotian, gouzil, gzy19990617, hanlintang, Hongqing-work, houj04, huangjiyi, hxzd5568, HydrogenSulfate, jzhang533, LCStayingdullCircuit, leon062112, lifulll, linkk08, LittleHeroZZZX, liufengwei0103, Liujie0926, liuruyan, lixinqi, LiYuRio, lizexu123, lizhenyun01, lj970926, lshpku, megemini, mikethegoblin, ming1753, mzj104, NKNaN, ooooo-create, pesionzhao, phlrain, pkuzyc, PolaKuma, Qin-sx, RichardWooSJTU, risemeup1, runzhech, RuohengMa, sasaya123, shanjiang7, SigureMo, sneaxiy, swgu98, SylarTiaNII, tianhaodongbd, tianshuo78520a, timminator, tizhou86, umiswing, waliwali777, wanghuancoder, Waynezee, Wennie396, xiaoguoguo626807, XieYunshen, Xing-lil, xkkkkkk23, Xreki, xuxinyi389, Yeenyeong, yongqiangma, YqGe585, yuanlehome, YuanRisheng, yulangz, yuwu46, zeroRains, zhangbo9674, zhanghonggeng, zhangting2020, ZhangX-21, zhangyk0314, zhangyuqin1998, zhink, zhiqiu, zhouquan32, zhoutianzi666, zhupengyang, zrr1999, zty-king, zyfncg +- Fixed a bug in the FlashAttention compilation cache. [#74388](https://github.com/PaddlePaddle/Paddle/pull/74388) +- Fixed a bug that occurred when `site.USER_SITE` was None. [#74373](https://github.com/PaddlePaddle/Paddle/pull/74373) +- Fixed a gtest compilation bug on multi-architecture Linux systems. [#74723](https://github.com/PaddlePaddle/Paddle/pull/74723) +- Fixed multiple compilation errors in DEBUG mode when WITH_GPU=ON. 
[#74401](https://github.com/PaddlePaddle/Paddle/pull/74401) +- Fixed a compilation bug with CUDA 12.6 on Windows. [#74990](https://github.com/PaddlePaddle/Paddle/pull/74990) +- Fixed bugs in the api-benchmark baseline pipeline. [#74770](https://github.com/PaddlePaddle/Paddle/pull/74770), [#74778](https://github.com/PaddlePaddle/Paddle/pull/74778), [#74779](https://github.com/PaddlePaddle/Paddle/pull/74779), [#74780](https://github.com/PaddlePaddle/Paddle/pull/74780), [#74800](https://github.com/PaddlePaddle/Paddle/pull/74800), [#74803](https://github.com/PaddlePaddle/Paddle/pull/74803) + +### Other + +- Disabled the test_custom_contiguous unit test. [#74337](https://github.com/PaddlePaddle/Paddle/pull/74337) +- Support scheduled triggering of baseline tasks in the slice pipeline. [#74419](https://github.com/PaddlePaddle/Paddle/pull/74419) +- Support manually specifying the PR used to record slice baselines. [#74445](https://github.com/PaddlePaddle/Paddle/pull/74445) +- Check if there are any issues in the code. [#74460](https://github.com/PaddlePaddle/Paddle/pull/74460) +- Support CI PaddleX tasks on XPU. [#74426](https://github.com/PaddlePaddle/Paddle/pull/74426) +- Support an exemption mechanism for the slice pipeline. [#74482](https://github.com/PaddlePaddle/Paddle/pull/74482) +- Updated the Paddle base image. [#73423](https://github.com/PaddlePaddle/Paddle/pull/73423) +- Fixed the Ninja version at 1.11 on Windows. [#74590](https://github.com/PaddlePaddle/Paddle/pull/74590) +- Added the ability to close PRs and cancel CI runs. [#74604](https://github.com/PaddlePaddle/Paddle/pull/74604) +- Support quickly skipping all CI. 
[#74696](https://github.com/PaddlePaddle/Paddle/pull/74696) +- Added an api-benchmark baseline pipeline. [#74690](https://github.com/PaddlePaddle/Paddle/pull/74690) +- Updated the NCCL version. [#74809](https://github.com/PaddlePaddle/Paddle/pull/74809) +- Updated the RD list for the approve pipeline. [#74838](https://github.com/PaddlePaddle/Paddle/pull/74838), [#74902](https://github.com/PaddlePaddle/Paddle/pull/74902) +- Updated safetensors to the mirror. [#74904](https://github.com/PaddlePaddle/Paddle/pull/74904) +- Added a compilation flag for FlashAttention. [#74959](https://github.com/PaddlePaddle/Paddle/pull/74959) +- Temporarily disabled the win-inference pipeline. [#74980](https://github.com/PaddlePaddle/Paddle/pull/74980) +- Support compiling phi dynamic libraries on Windows. [#74950](https://github.com/PaddlePaddle/Paddle/pull/74950) + +## 7. List of contributors +AIbin, Ayakouji, baiyue, baoqiwen, Chang Lu, Chen Zhiyang, co63oc, cyberslack_lee, cyy536, datutu-L, Deng Haodong, Difer, Eddie-Wang, enzodechine, fangfangssj, feri, fxyfxy777, ggggxm, GoldPancake, gouzil, Gu Shiwei, Haze188 灏喆, hohdiy, hong, HU Shenwei, huangjiyi, HydrogenSulfate, kjagsdq, LCStayingdullCircuit, Leo Guo, lightbrother, liufengwei0103, liuruyan, LiYuRio, LLSGYN, Lucas, Luckycheng222, lzy, Nana, Nyakku Shigure, ooo oo, Qianyue He, risemeup1, Ruibiao Chen, Ryan, Shuhao Liang, sneaxiy, Starrysea996, SUN Dong, Tao Luo, Tian, tianhaodongbd, tianshuo78520a, umiswing, waliwali777, wanghuancoder, Wenhao.Dai, wyw, XiaoguangHu, xiaoguoguo626807, xingmingyyj, Yichen Zhang, Yohanna, yongqiangma, Yuan Xiaolan, YUNSHEN XIE, Yuntao Nie, Yuqiang Ge, Yutian Rao, Zero Rains, Zhan Rongrui, Zhang Ting, zhanghonggeng, Zhaowu Pan, zhengshengning, ZhenxingLi, Zhou Xin, zhupengyang, zhwesky2010, Zichao, zty-king, Zx, zyfncg, zzm, 周周周, 正在学习, 苍天荒 \ No newline at end of file