diff --git a/ci_scripts/Dockerfile b/ci_scripts/Dockerfile index 890978dd273..d20e278f66b 100644 --- a/ci_scripts/Dockerfile +++ b/ci_scripts/Dockerfile @@ -1,4 +1,4 @@ -ARG BASEIMAGE=registry.baidubce.com/paddleopen/paddle:build-gpu-cuda10.2-cudnn8-devel-ubuntu18.04-gcc8-py38 +ARG BASEIMAGE=ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddleopen/paddle:build-gpu-cuda10.2-cudnn8-devel-ubuntu18.04-gcc8-py38 FROM ${BASEIMAGE} RUN git config --global user.name "PaddleCI" \ diff --git a/docs-build.sh b/docs-build.sh index 47b16dbdc5e..a710c48d53f 100644 --- a/docs-build.sh +++ b/docs-build.sh @@ -50,8 +50,8 @@ fi VERSIONSTR=${VERSIONSTR:=develop} -SPHINX_DOCKERIMAGE=registry.baidubce.com/paddleopen/fluiddoc-sphinx:20210610-py38 -PADDLEDEV_DOCKERIMAGE=registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda10.1-cudnn7-gcc82 +SPHINX_DOCKERIMAGE=ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddleopen/fluiddoc-sphinx:20210610-py38 +PADDLEDEV_DOCKERIMAGE=ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda10.1-cudnn7-gcc82 if [ "$PADDLE_WHL" = '' ] && [ "$PADDLE_DIR" = '' ] ; then diff --git a/docs/api/paddle/fluid/Overview_cn.rst b/docs/api/paddle/fluid/Overview_cn.rst index 44fe15b4332..618a71b9566 100644 --- a/docs/api/paddle/fluid/Overview_cn.rst +++ b/docs/api/paddle/fluid/Overview_cn.rst @@ -3,4 +3,4 @@ paddle.fluid --------------------- .. 
warning:: - 从飞桨框架 2.5 版本开始,我们已经废弃了 `paddle.fluid` namespace 下的 API,请使用其他的替代 API。 + 从飞桨框架 2.5 版本开始,我们已经废弃了 `paddle.fluid` namespace 下的 API,请使用其他的替代 API,请参考 :ref:`cn_guides_api_mapping` 获取更多信息。如有疑问,请于 `Issue 区 `_ 按规范提交 Issue。 diff --git a/docs/faq/install_cn.md b/docs/faq/install_cn.md index eabd50dae60..efea7de0b70 100644 --- a/docs/faq/install_cn.md +++ b/docs/faq/install_cn.md @@ -1,5 +1,19 @@ # 安装常见问题 +#### 问题:conda 环境下安装 paddlepaddle 3.0.0b0 版本,运行`import paddle`时报错,报错信息为找不到 libpython.so 文件 + ++ 问题描述: +> ImportError: libpython3.12.so.1.0: cannot open shared object file: No such file or directory + ++ 问题分析: +遇到该问题是因为 3.0.0b0 版本中增加了对于 libpython 的依赖,但是使用 conda 安装的 python 环境时,未把 libpython.so 文件所在路径加入到环境变量中,导致找不到该文件。 + ++ 解决办法: +例如:执行`find / -name libpython3.12.so.1.0`, 发现 libpython 的路径如`/opt/conda/lib/`,使用如下命令安装即可; + +```bash +export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/conda/lib/ +``` ##### 问题:使用过程中报找不到 tensorrt 库的日志 diff --git a/docs/guides/06_distributed_training/deployment_cn.rst b/docs/guides/06_distributed_training/deployment_cn.rst index fc948ba2f85..e076eaeeeda 100644 --- a/docs/guides/06_distributed_training/deployment_cn.rst +++ b/docs/guides/06_distributed_training/deployment_cn.rst @@ -52,7 +52,7 @@ paddle 环境安装 .. 
code-block:: - $ docker run --name paddle -it --net=host -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda11.2-cudnn8 /bin/bash + $ docker run --name paddle -it --net=host -v $PWD:/paddle ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda11.2-cudnn8 /bin/bash * 当使用 gpu 时请配置 nvidia docker runtime 或使用 nvidia-docker 启动容器,进入容器后使用 nvidia-smi 命令确认环境正确 * 使用分布式时需要添加 --net=host 参数让容器使用主机网络以实现跨机建立连接 @@ -230,14 +230,14 @@ paddlejob 任务提交 spec: containers: - name: paddle - image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1 + image: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddle-operator/demo-wide-and-deep:v1 ps: replicas: 2 template: spec: containers: - name: paddle - image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1 + image: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddle-operator/demo-wide-and-deep:v1 说明: @@ -277,7 +277,7 @@ paddlejob 任务提交 spec: containers: - name: paddle - image: registry.baidubce.com/paddle-operator/demo-resnet:v1 + image: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddle-operator/demo-resnet:v1 command: - python args: diff --git a/docs/guides/06_distributed_training/images/reshard.png b/docs/guides/06_distributed_training/images/reshard.png deleted file mode 100644 index f942be94a4f..00000000000 Binary files a/docs/guides/06_distributed_training/images/reshard.png and /dev/null differ diff --git a/docs/guides/06_distributed_training/images/shard.png b/docs/guides/06_distributed_training/images/shard.png deleted file mode 100644 index bab204fd632..00000000000 Binary files a/docs/guides/06_distributed_training/images/shard.png and /dev/null differ diff --git a/docs/guides/hardware_support/dcu/install_cn.md b/docs/guides/hardware_support/dcu/install_cn.md index e577865a57f..2066399711c 100644 --- a/docs/guides/hardware_support/dcu/install_cn.md +++ b/docs/guides/hardware_support/dcu/install_cn.md @@ -18,13 +18,13 @@ ```bash # 拉取镜像 -docker pull 
registry.baidubce.com/device/paddle-dcu:dtk23.10.1-kylinv10-gcc73-py310 +docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-dcu:dtk23.10.1-kylinv10-gcc73-py310 # 参考如下命令,启动容器 docker run -it --name paddle-dcu-dev -v `pwd`:/work \ -w=/work --shm-size=128G --network=host --privileged \ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \ - registry.baidubce.com/device/paddle-dcu:dtk23.10.1-kylinv10-gcc73-py310 /bin/bash + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-dcu:dtk23.10.1-kylinv10-gcc73-py310 /bin/bash # 检查容器内是否可以正常识别海光 DCU 设备 rocm-smi @@ -51,7 +51,7 @@ DCU Temp AvgPwr Fan Perf PwrCap VRAM% DCU% ```bash # 下载并安装 wheel 包 -pip install paddlepaddle-rocm -i https://www.paddlepaddle.org.cn/packages/nightly/dcu +python -m pip install --pre paddlepaddle-dcu -i https://www.paddlepaddle.org.cn/packages/nightly/dcu/ ``` ### 安装方式二:源代码编译安装 @@ -76,7 +76,7 @@ cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCMAKE_CXX_FLAGS="-Wno-error -w" \ make -j16 # 编译产出在 build/python/dist/ 路径下,使用 pip 安装即可 -pip install -U paddlepaddle_rocm-0.0.0-cp310-cp310-linux_x86_64.whl +pip install -U paddlepaddle_dcu-0.0.0-cp310-cp310-linux_x86_64.whl ``` ## 基础功能检查 @@ -106,5 +106,5 @@ PaddlePaddle is installed successfully! 
Let's start deep learning with PaddlePad 请使用以下命令卸载: ```bash -pip uninstall paddlepaddle-rocm +pip uninstall paddlepaddle-dcu ``` diff --git a/docs/guides/hardware_support/mlu/install_cn.md b/docs/guides/hardware_support/mlu/install_cn.md index 0d186715db2..42fd7f78dd7 100644 --- a/docs/guides/hardware_support/mlu/install_cn.md +++ b/docs/guides/hardware_support/mlu/install_cn.md @@ -18,14 +18,14 @@ ```bash # 拉取镜像 -docker pull registry.baidubce.com/device/paddle-mlu:ubuntu20-x86_64-gcc84-py310 +docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-mlu:ubuntu20-x86_64-gcc84-py310 # 参考如下命令,启动容器 docker run -it --name paddle-mlu-dev -v $(pwd):/work \ -w=/work --shm-size=128G --network=host --privileged \ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \ -v /usr/bin/cnmon:/usr/bin/cnmon \ - registry.baidubce.com/device/paddle-mlu:ubuntu20-x86_64-gcc84-py310 /bin/bash + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-mlu:ubuntu20-x86_64-gcc84-py310 /bin/bash # 检查容器内是否可以正常识别寒武纪 MLU 设备 cnmon diff --git a/docs/guides/hardware_support/npu/install_cn.md b/docs/guides/hardware_support/npu/install_cn.md index 9e6516cfbd6..4a90b977af1 100644 --- a/docs/guides/hardware_support/npu/install_cn.md +++ b/docs/guides/hardware_support/npu/install_cn.md @@ -28,8 +28,8 @@ lspci | grep d802 ```bash # 拉取镜像 -docker pull registry.baidubce.com/device/paddle-npu:cann80RC1-ubuntu20-x86_64-gcc84-py39 # X86 架构 -docker pull registry.baidubce.com/device/paddle-npu:cann80RC1-ubuntu20-aarch64-gcc84-py39 # ARM 架构 +docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann80RC1-ubuntu20-x86_64-gcc84-py39 # X86 架构 +docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann80RC1-ubuntu20-aarch64-gcc84-py39 # ARM 架构 # 考如下命令启动容器,ASCEND_RT_VISIBLE_DEVICES 可指定可见的 NPU 卡号 docker run -it --name paddle-npu-dev -v $(pwd):/work \ @@ -38,7 +38,7 @@ docker run -it --name paddle-npu-dev -v $(pwd):/work \ -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ -v 
/usr/local/dcmi:/usr/local/dcmi \ -e ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \ - registry.baidubce.com/device/paddle-npu:cann80RC1-ubuntu20-$(uname -m)-gcc84-py39 /bin/bash + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann80RC1-ubuntu20-$(uname -m)-gcc84-py39 /bin/bash # 检查容器内是否可以正常识别昇腾 NPU 设备 npu-smi info diff --git a/docs/guides/hardware_support/xpu/install_cn.md b/docs/guides/hardware_support/xpu/install_cn.md index e5377983103..b9dbe3c40c5 100644 --- a/docs/guides/hardware_support/xpu/install_cn.md +++ b/docs/guides/hardware_support/xpu/install_cn.md @@ -18,13 +18,13 @@ ```bash # 拉取镜像 -docker pull registry.baidubce.com/device/paddle-xpu:ubuntu20-x86_64-gcc84-py310 +docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-xpu:ubuntu20-x86_64-gcc84-py310 # 参考如下命令,启动容器 docker run -it --name paddle-xpu-dev -v $(pwd):/work \ -w=/work --shm-size=128G --network=host --privileged \ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \ - registry.baidubce.com/device/paddle-xpu:ubuntu20-x86_64-gcc84-py310 /bin/bash + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-xpu:ubuntu20-x86_64-gcc84-py310 /bin/bash # 检查容器内是否可以正常识别昆仑芯 XPU 设备 xpu_smi diff --git a/docs/guides/index_cn.rst b/docs/guides/index_cn.rst index 1f5cba841c5..3a033a98f5c 100644 --- a/docs/guides/index_cn.rst +++ b/docs/guides/index_cn.rst @@ -8,6 +8,7 @@ 使用教程分为如下的模块: +- `飞桨 3.0 全新特性 <./paddle_v3_features/index_cn.html>`_ - `模型开发入门 <./beginner/index_cn.html>`_ - `模型开发更多用法 <./advanced/index_cn.html>`_ - `动态图转静态图 <./jit/index_cn.html>`_ @@ -22,6 +23,7 @@ .. 
toctree:: :hidden: + paddle_v3_features/index_cn.rst beginner/index_cn.rst advanced/index_cn.rst jit/index_cn.rst diff --git a/docs/guides/model_convert/convert_from_pytorch/index_cn.rst b/docs/guides/model_convert/convert_from_pytorch/index_cn.rst index c9d82028683..cc44f001d6f 100644 --- a/docs/guides/model_convert/convert_from_pytorch/index_cn.rst +++ b/docs/guides/model_convert/convert_from_pytorch/index_cn.rst @@ -4,24 +4,25 @@ 您可以通过下面的内容,如何将 PyTorch 训练代码迁移到飞桨: - +- `代码自动转换工具 <./paconvert_introduction_cn.html>`_ : 介绍 Pytorch 代码自动转 Paddle 工具使用方法。 +- `PyTorch API 映射表 <./pytorch_api_mapping_cn.html>`_ : 说明 PyTorch 最新 release 版本 与 Paddle develop 版本 API 对应关系。 - `CV - 快速上手 <./cv_quick_start_cn.html>`_ : 以 MobileNetV3 为例,介绍如何从 PyTorch 迁移到飞桨。 - `CV - 迁移经验总结 <./cv_experience_cn.html>`_ : 介绍 CV 各个方向从 PyTorch 迁移到飞桨的基本流程、常用工具、定位问题的思路及解决方法。 - `NLP - 快速上手 <./nlp_fast_explore_cn.html>`_ : 以 Bert 为例,介绍如何从 PyTorch 迁移到飞桨。 - `NLP - 迁移经验总结 <./nlp_migration_experiences_cn.html>`_ : 介绍 NLP 各个方向从 PyTorch 迁移到飞桨的基本流程、常用工具、定位问题的思路及解决方法。 - `解读网络结构转换 <./convert_net_structure_cn.html>`_ : 介绍网络结构转换的思路和方法。 - `解读 Bert 模型权重转换 <./convert_bert_weights_cn.html>`_ : 介绍如何进行不同框架下的模型权重转换。 -- `PyTorch API 映射表 <./pytorch_api_mapping_cn.html>`_ : 说明 PyTorch 2.1.0 版本与 Paddle develop 版本 API 对应关系。 - `PyTorch 自定义算子转写教程 <./pytorch_custom_op_convert_cn.html>`_ : 介绍 PyTorch 中自定义算子转写成 Paddle 自定义算子的思路和方法。 .. 
toctree:: :hidden: + paconvert_introduction_cn.md + pytorch_api_mapping_cn.md cv_quick_start_cn.md cv_experience_cn.md nlp_fast_explore_cn.md nlp_migration_experiences_cn.md convert_net_structure_cn.md convert_bert_weights_cn.md - pytorch_api_mapping_cn.md pytorch_custom_op_convert_cn.md diff --git a/docs/guides/model_convert/convert_from_pytorch/paconvert_introduction_cn.md b/docs/guides/model_convert/convert_from_pytorch/paconvert_introduction_cn.md new file mode 100644 index 00000000000..74777dca163 --- /dev/null +++ b/docs/guides/model_convert/convert_from_pytorch/paconvert_introduction_cn.md @@ -0,0 +1,184 @@ +# 代码自动转换工具 + +![](https://img.shields.io/badge/version-v2.0-brightgreen) ![](https://img.shields.io/badge/docs-latest-brightgreen) ![](https://img.shields.io/badge/PRs-welcome-brightgreen) ![](https://img.shields.io/badge/pre--commit-Yes-brightgreen) + +**[PaConvert Github](https://github.com/PaddlePaddle/PaConvert)** + +**Pa**ddlePaddle Code **Convert** Toolkits + +## 🤗 公告 🤗 +- 本工具由 Paddle 官方团队维护与建设,所有转换代码均已经过测试,欢迎使用,高效迁移 Pytorch 代码到 PaddlePaddle + +- 当前共支持约 1300+个 Pytorch API 的一键转换,我们通过 300+个 Pytorch 模型测试,代码行数平均转换率约为 **90+%** + +- 本工具基于 [PyTorch 最新 release 与 Paddle develop API 映射表](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/model_convert/convert_from_pytorch/pytorch_api_mapping_cn.html) 实现,表中 API 均经过详细对比分析,欢迎查阅 + +- 有使用问题和建议欢迎在 [PaConvert GitHub Issues](https://github.com/PaddlePaddle/PaConvert/issues) 中提出 + +## 概述 + +`代码自动转换工具` 能自动将其它深度学习框架训练或推理的**代码**,转换为 PaddlePaddle 的**代码**,方便快速自动地 **模型代码迁移**。 + +目前仅支持自动转换 Pytorch 代码,其它深度学习框架的支持后续新增中,转换时会尽量保持原代码的风格与结构,将其它深度学习框架的 API 接口 转换为 PaddlePaddle 的 API 接口。 + +转换过程中不会改动原文件,会将原项目中的文件一一转换到 `out_dir` 文件夹中(如不指定`out_dir`,则默认在当前目录下新建`paddle_project/`)。对不同类型的文件的处理逻辑分别为: + +- Python 代码文件:识别代码中调用其它深度学习框架的接口并转换为 PaddlePaddle 的接口 +- requirements.txt: 替换其中的安装依赖为 `paddlepaddle-gpu` +- 其他文件:原样拷贝 + +## 安装与使用 + +由于使用了一些较新的 Python 功能特性,你需要使用 `>=python3.8` 的解释器。 + +1. 
使用 pip 安装 + +```bash +pip install -U paconvert +paconvert --in_dir torch_project [--out_dir paddle_project] [--exclude_dirs exclude_dirs] [--log_dir log_dir] [--log_level "DEBUG"] [--run_check 1] +``` + +2. 使用源码安装 + +```bash +git clone https://github.com/PaddlePaddle/PaConvert.git +python paconvert/main.py --in_dir torch_project [--out_dir paddle_project] [--exclude_dirs exclude_dirs] [--log_dir log_dir] [--log_level "DEBUG"] [--run_check 1] +``` + +**参数介绍** + +``` +--in_dir 输入 torch 项目文件,可以为单个文件或文件夹 +--out_dir 可选,输出 paddle 项目文件,可以为单个文件或文件夹,默认在当前目录下创建./paddle_project/ +--exclude_dirs 可选,排除转换的文件或文件夹,排除多项时请用逗号分隔,默认不排除 +--log_dir 可选,输出日志的路径,默认会在终端上打印日志 +--log_level 可选,打印 log 等级,仅支持"INFO" "DEBUG",默认"INFO" +--run_check 可选,工具自检 +``` + + +## 转换示例 + +以下面 Pytorch 代码为例,转换前: +``` +import torch +import torch.nn as nn +import torch.optim as optim +import torch.nn.Linear as Linear +import torch.nn.functional as F + +class MyNet(nn.Module): + test = "str" + + def __init__(self): + self._fc1 = torch.nn.Linear(10, 10) + self._fc2 = nn.Linear(10, 10) + self._fc3 = Linear(10, 10) + + @torch.no_grad() + def forward(self, x): + x = self._fc1(x) + x = self._fc2(x) + x = self._fc3(x) + y = torch.add(x, x) + return F.relu(y) + +net = MyNet() + +sgd = optim.SGD(net.parameters(), lr=0.1, momentum=0.9) +lr = optim.lr_scheduler.MultiStepLR(sgd, milestones=[2, 4, 6], gamma=0.8) +``` + +转换后: +``` +import paddle + + +class MyNet(paddle.nn.Layer): + test = 'str' + + def __init__(self): + self._fc1 = paddle.nn.Linear(in_features=10, out_features=10) + self._fc2 = paddle.nn.Linear(in_features=10, out_features=10) + self._fc3 = paddle.nn.Linear(in_features=10, out_features=10) + + @paddle.no_grad() + def forward(self, x): + x = self._fc1(x) + x = self._fc2(x) + x = self._fc3(x) + y = paddle.add(x=x, y=paddle.to_tensor(x)) + return paddle.nn.functional.relu(x=y) + + +net = MyNet() +>>>>>>sgd = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9) +tmp_lr = 
paddle.optimizer.lr.MultiStepDecay(milestones=[2, 4, 6], gamma=0.8, + learning_rate=sgd.get_lr()) +sgd.set_lr_scheduler(tmp_lr) +lr = tmp_lr +``` + +打印信息如下: + +```text +=========================================== +PyTorch to Paddle Convert Start ------>: +=========================================== +Start convert file: /workspace/PaConvert/test_code.py --> /workspace/PaConvert/paddle_project/test_code.py +[test_code.py:1] remove 'import torch' +[test_code.py:2] remove 'import torch.nn as nn' +[test_code.py:3] remove 'import torch.optim as optim' +[test_code.py:4] remove 'import torch.nn.Linear as Linear' +[test_code.py:5] remove 'import torch.nn.functional as F' +[test_code.py] add 'import paddle' in line 1 +[test_code.py:1] [Success] Convert torch.nn.Module to Paddle +[test_code.py:11] [Success] Convert torch.nn.Linear to Paddle +[test_code.py:12] [Success] Convert torch.nn.Linear to Paddle +[test_code.py:13] [Success] Convert torch.nn.Linear to Paddle +[test_code.py:20] [Success] Convert torch.add to Paddle +[test_code.py:21] [Success] Convert torch.nn.functional.relu to Paddle +[test_code.py:15] [Success] Convert torch.no_grad to Paddle +[test_code.py:25] [Success] Convert Class Method: torch.nn.Module.parameters to Paddle +[test_code.py:25] [Not Support] convert torch.optim.SGD to Paddle is not supported currently +[test_code.py:26] [Success] Convert torch.optim.lr_scheduler.MultiStepLR to Paddle +Finish convert /workspace/PaConvert/test_code.py --> /workspace/PaConvert/paddle_project/test_code.py + + +=========================================== +Convert Summary: +=========================================== +There are 10 Pytorch APIs in this Project: + 9 Pytorch APIs have been converted to Paddle successfully! + 1 Pytorch APIs are not supported to convert to Paddle currently! 
+ Convert Rate is: 90.000% + +For these 1 Pytorch APIs that currently do not support to convert, which have been marked by >>> before the line, +please refer to [https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/model_convert/convert_from_pytorch/pytorch_api_mapping_cn.html] +and convert it by yourself manually. In addition, these APIs will be supported in future. + +Thank you to use Paddle Code Convert Tool. You can make any suggestions +to us by submitting issues to [https://github.com/PaddlePaddle/PaConvert]. + +**************************************************************** +______ _____ _ +| ___ \ / ____| | | +| |_/ /_ _| | ___ _ ____ _____ _ __| |_ +| __/ _ | | / _ \| \_ \ \ / / _ \ \__| __| +| | | (_| | |___| (_) | | | \ V / __/ | | |_ +\_| \__,_|\_____\___/|_| |_|\_/ \___|_| \__| + +*************************************************************** + +``` + +转换完成后,会打印 **转换总结** ,包含 **总 API 数、成功转换 API 数、不支持转换 API 数、转化率** 。例如,上述代码里一共有 10 个 Pytorch API,其中 9 个被成功转换,1 个不支持转换,因此转换率为 `90.00%` 。 + +**对于成功转换的 API**:代码风格会略有变化,会 **补全 API 全名、补全参数关键字、移除注释、移除多余空行** 。因为在代码识别的过程中,**注释、空行** 等无法识别。 + +**对于不支持转换的 API**:将 **补全为 Pytorch API 全名**,同时在行前通过 `>>>>>>` 的形式加以标记,用户需要对该 API 进行人工手动转换,然后删除 `>>>>>>` 标记,否则代码无法运行。 + + +## 贡献代码 + +代码自动转换工具([PaConvert](https://github.com/PaddlePaddle/PaConvert))为开源贡献形式,欢迎你向我们贡献代码,详细开发步骤请参考 [贡献代码教程](https://github.com/PaddlePaddle/PaConvert/blob/master/docs/CONTRIBUTING.md) diff --git a/docs/guides/model_convert/index_cn.rst b/docs/guides/model_convert/index_cn.rst index b440f95ea35..fdb6e037212 100644 --- a/docs/guides/model_convert/index_cn.rst +++ b/docs/guides/model_convert/index_cn.rst @@ -7,13 +7,14 @@ - `迁移指南 <./convert_guide_cn.html>`_ : 介绍模型迁移场景及概览。 - `从 PyTorch 迁移到飞桨 <./convert_from_pytorch/index_cn.html>`_ : 介绍如何将 PyTorch 训练代码迁移到飞桨。 + - `代码自动转换工具 <./paconvert_introduction_cn.html>`_ : 介绍 Pytorch 代码自动转 Paddle 工具使用方法。 + - `PyTorch API 映射表 <./pytorch_api_mapping_cn.html>`_ : 说明 PyTorch 最新 release 版本 与 Paddle develop 版本 
API 对应关系。 - `CV - 快速上手 <./convert_from_pytorch/cv_quick_start_cn.html>`_ : 以 MobileNetV3 为例,介绍如何从 PyTorch 迁移到飞桨。 - `CV - 迁移经验总结 <./convert_from_pytorch/cv_experience_cn.html>`_ : 介绍 CV 各个方向从 PyTorch 迁移到飞桨的基本流程、常用工具、定位问题的思路及解决方法。 - `NLP - 快速上手 <./convert_from_pytorch/nlp_fast_explore_cn.html>`_ : 以 Bert 为例,介绍如何从 PyTorch 迁移到飞桨。 - `NLP - 迁移经验总结 <./convert_from_pytorch/nlp_migration_experiences_cn.html>`_ : 介绍 NLP 各个方向从 PyTorch 迁移到飞桨的基本流程、常用工具、定位问题的思路及解决方法。 - `解读网络结构转换 <./convert_from_pytorch/convert_net_structure_cn.html>`_ : 介绍网络结构转换的思路和方法。 - `解读 Bert 模型权重转换 <./convert_from_pytorch/convert_bert_weights_cn.html>`_ : 介绍如何进行不同框架下的模型权重转换。 - - `PyTorch API 映射表 <./convert_from_pytorch/pytorch_api_mapping_cn.html>`_ : 说明 PyTorch 1.8 版本与 Paddle 2.0 API 对应关系。 - `PyTorch 自定义算子转写教程 <./pytorch_custom_op_convert_cn.html>`_ : 介绍 PyTorch 中自定义算子转写成 Paddle 自定义算子的思路和方法。 - `使用 X2Paddle 迁移推理模型 <./convert_with_x2paddle_cn.html>`_ : 介绍如何使用 X2Paddle 工具将 PyTorch、ONNX、TensorFlow、Caffe 推理模型迁移到飞桨。 - `迁移飞桨旧版本 <./convert_from_older_versions/index_cn.html>`_ : 介绍如何将飞桨 1.X 版本的训练代码与模型迁移到飞桨最新版。 diff --git a/docs/guides/06_distributed_training/auto_parallel_cn.md b/docs/guides/paddle_v3_features/auto_parallel_cn.md similarity index 78% rename from docs/guides/06_distributed_training/auto_parallel_cn.md rename to docs/guides/paddle_v3_features/auto_parallel_cn.md index 4ac741cd772..e164595750d 100644 --- a/docs/guides/06_distributed_training/auto_parallel_cn.md +++ b/docs/guides/paddle_v3_features/auto_parallel_cn.md @@ -40,7 +40,12 @@ * Shard(axis),指将张量沿 axis 维度做切分后,放到不同的计算设备上。 * Partial,指每个计算设备只拥有部分值,需要通过指定的规约操作才能恢复成全量数据。 -![三种分布式状态](images/placements.png) + +

+[Figure: the three distributed states]

+ + 在如下的示例中,我们希望在 6 个计算设备上,创建一个形状为(4, 3)的分布式张量,其中沿着计算设备的 x 维,切分张量的 0 维;沿着计算设备的 y 维上,切分张量的 1 维。最终,每个计算设备实际拥有大小为(2, 1)的实际张量,如图所示。 @@ -58,7 +63,9 @@ dense_tensor = paddle.to_tensor([[1,2,3], placements = [dist.Shard(0), dist.Shard(1)] dist_tensor = dist.shard_tensor(dense_tensor, mesh, placements) ``` -![切分状态](images/shard.png) +
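The layout described above can be reproduced without any devices. What follows is a minimal pure-Python sketch, not Paddle code: `shard_2d` is a hypothetical helper, the 2 x 3 mesh shape is inferred from the (4, 3) tensor and the (2, 1) local shape stated above, and the tensor values are made up for illustration.

```python
def shard_2d(tensor, mesh_shape):
    """Split a 2-D nested list over a 2-D device mesh.

    Mesh axis 0 ("x") splits tensor axis 0 (rows) and mesh axis 1 ("y")
    splits tensor axis 1 (columns), mirroring [Shard(0), Shard(1)].
    """
    mesh_x, mesh_y = mesh_shape
    rows, cols = len(tensor), len(tensor[0])
    if rows % mesh_x or cols % mesh_y:
        raise ValueError("this sketch only handles evenly divisible shapes")
    block_rows, block_cols = rows // mesh_x, cols // mesh_y
    shards = {}
    for i in range(mesh_x):
        for j in range(mesh_y):
            row_block = tensor[i * block_rows:(i + 1) * block_rows]
            shards[(i, j)] = [
                row[j * block_cols:(j + 1) * block_cols] for row in row_block
            ]
    return shards


# Assumed example data: a (4, 3) tensor on a 2 x 3 mesh of 6 devices.
dense = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
shards = shard_2d(dense, mesh_shape=(2, 3))
for coord, local in sorted(shards.items()):
    print(coord, local)
```

Each mesh coordinate ends up with a local block of shape (2, 1), which matches the local tensor size given in the text.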

+[Figure: sharding state]

同时,为了提供 ``重切分`` 的能力,我们提供 ``paddle.distributed.reshard`` 接口,支持跨 ``ProcessMesh`` 的分布式张量转换,比如,我们可以把在[0, 1] 两个设备上状态为 ``Replicate`` 的分布式张量,转换到 [2, 3] 这两个设备上,并变成状态为 ``Shard`` 的分布式张量。 @@ -78,8 +85,54 @@ placements1 = [dist.Shard(0)] dist_tensor = dist.shard_tensor(dense_tensor, mesh0, placements0) dist_tensor_after_reshard = dist.reshard(dist_tensor, mesh1, placements1) ``` -![切分状态](images/reshard.png) +
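The reshard step described above, from `Replicate` on devices [0, 1] to `Shard(0)` on devices [2, 3], can be pictured with a small pure-Python simulation. This is only a sketch under assumptions: `reshard_replicate_to_shard0` is a hypothetical helper, the tensor values are made up, and no real communication takes place.

```python
def reshard_replicate_to_shard0(replicas, dst_ranks):
    """Move a replicated tensor onto new ranks, sharded along axis 0."""
    # Every source rank holds the same full tensor, so any copy will do.
    full = next(iter(replicas.values()))
    if len(full) % len(dst_ranks):
        raise ValueError("this sketch only handles evenly divisible shapes")
    block = len(full) // len(dst_ranks)
    return {
        rank: full[k * block:(k + 1) * block]
        for k, rank in enumerate(dst_ranks)
    }


dense = [[1, 2], [3, 4]]                 # assumed example data
on_mesh0 = {0: dense, 1: dense}          # Replicate on ranks [0, 1]
on_mesh1 = reshard_replicate_to_shard0(on_mesh0, dst_ranks=[2, 3])
print(on_mesh1)
```

After the call, rank 2 holds the first row block and rank 3 the second, while ranks 0 and 1 no longer take part in the new placement.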

+[Figure: sharding state after reshard]

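+
+# 三、How It Works
+
+The following simple example walks through the execution flow and the underlying principles of the auto-parallel framework.
+
+From the single-device logical view, we want to compute C = Matmul(A, B) and D = Relu(C).
+Suppose the user marks TensorB as sharded along axis 0, meaning that in the actual distributed cluster TensorB is split by rows across the devices, and marks TensorA as replicated, meaning that every device holds a complete copy of TensorA.
+
+```python
+import paddle
+import paddle.distributed as dist
+
+mesh = dist.ProcessMesh([0, 1], dim_names=['x'])
+dense_tensorA = paddle.to_tensor([[1., 2.], [3., 4.]])
+dense_tensorB = paddle.to_tensor([[5., 6.], [7., 8.]])
+placementsA = [dist.Replicate()]  # every device holds all of TensorA
+placementsB = [dist.Shard(0)]     # TensorB is split along axis 0 (rows)
+
+dist_tensorA = dist.shard_tensor(dense_tensorA, mesh, placementsA)
+dist_tensorB = dist.shard_tensor(dense_tensorB, mesh, placementsB)
+dist_tensorC = paddle.matmul(dist_tensorA, dist_tensorB)
+dist_tensorD = paddle.nn.functional.relu(dist_tensorC)
+```
+
+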

+ +

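+
+Next comes the first core piece of auto-parallel logic: **sharding inference**.
+The input placements marked by the user cannot be computed by the Matmul operator as they stand: the contracted dimension is sharded on TensorB (axis 0) but not on TensorA (axis 1), so the two do not match.
+The framework therefore applies the sharding inference rule of the current operator (e.g. the MatmulSPMD Rule) and, from the placements of the input tensors, infers a legal and reasonably efficient set of placements for the input and output tensors.
+For the input placements above, the framework infers that TensorA should be sharded by columns, that TensorB keeps its placement, and that the Matmul result TensorC is in the Partial state.
+Because the following Relu operator is nonlinear, its input cannot be Partial, so the framework uses the ReluSPMD Rule to infer that TensorC must be Replicated before it enters Relu.
+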
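Why the Matmul output is Partial, and why Relu cannot consume a Partial input, can be checked numerically. The sketch below is plain Python rather than Paddle code. It assumes two simulated devices, and it deliberately uses a negative entry in A (the values are not the ones from the example above) so that the nonlinearity of Relu becomes visible.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(x, y):
    """Elementwise sum of two equally shaped matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def relu(x):
    """Elementwise max(v, 0)."""
    return [[max(v, 0) for v in row] for row in x]


A = [[1, -2], [3, 4]]   # assumed data; the negative entry exposes the issue
B = [[5, 6], [7, 8]]

# Device r holds column r of A (A sharded by columns) and row r of B
# (B sharded by rows), and computes a local product of full output shape.
partials = []
for rank in range(2):
    a_local = [[row[rank]] for row in A]
    b_local = [B[rank]]
    partials.append(matmul(a_local, b_local))

full = matmul(A, B)
reduced = add(partials[0], partials[1])          # what an allreduce would yield
relu_after_reduce = relu(reduced)
relu_before_reduce = add(relu(partials[0]), relu(partials[1]))

print(reduced == full)                           # partial results sum to C
print(relu_after_reduce, relu_before_reduce)     # the two orders disagree
```

Each device's local product is only one summand of C, which is exactly the Partial state. Applying Relu to the summands first and reducing afterwards gives a different result from reducing first, so the reduction has to happen before Relu.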

+[Figure: sharding inference for the Matmul and Relu example]

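+
+Next comes the second core piece of auto-parallel logic: **sharding conversion**.
+Based on a tensor's current placement (src_placement) and the placement the operator needs according to sharding inference (dst_placement), the framework inserts the corresponding communication or tensor-dimension transformation operators.
+Following the sharding inference in the figure above, a split operator is added before Matmul and an Allreduce is added before Relu, so that each input tensor is converted to the required placement before the actual computation.
+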
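The two conversions named above can be sketched as a small lookup from (src_placement, dst_placement) to an operation. This is an illustrative pure-Python model, not the framework's implementation: `pick_conversion`, `split_columns` and `allreduce_sum` are hypothetical names, only the two cases from the text are covered, and the values are made up.

```python
def pick_conversion(src_placement, dst_placement):
    """Return the name of the op that converts src_placement to dst_placement."""
    table = {
        ("Replicate", "Shard"): "split",        # local slice, no communication
        ("Partial", "Replicate"): "allreduce",  # sum partial values across ranks
    }
    try:
        return table[(src_placement, dst_placement)]
    except KeyError:
        raise NotImplementedError(
            f"{src_placement} -> {dst_placement} is not covered by this sketch"
        )


def split_columns(full, rank, world_size):
    """Replicate -> Shard(1): each rank keeps only its own column block."""
    block = len(full[0]) // world_size
    return [row[rank * block:(rank + 1) * block] for row in full]


def allreduce_sum(partials):
    """Partial -> Replicate: every rank ends up with the elementwise sum."""
    total = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
    return [[list(row) for row in total] for _ in partials]


A = [[1, -2], [3, 4]]                                  # assumed data
a_locals = [split_columns(A, rank, 2) for rank in range(2)]
c_partials = [[[5, 6], [15, 18]], [[-14, -16], [28, 32]]]
c_replicas = allreduce_sum(c_partials)

print(pick_conversion("Replicate", "Shard"), a_locals)
print(pick_conversion("Partial", "Replicate"), c_replicas[0])
```

The split is purely local because each rank already holds the full tensor, whereas the allreduce is the step that actually needs communication between devices.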

+ +

+ 三、自动并行和分布式策略 ------------------- @@ -96,6 +149,7 @@ dist_tensor_after_reshard = dist.reshard(dist_tensor, mesh1, placements1) import paddle import paddle.distributed as dist from paddle.io import BatchSampler, DataLoader, Dataset +import numpy as np mesh = dist.ProcessMesh([0, 1, 2, 3], dim_names=['x']) @@ -218,6 +272,7 @@ class MlpModel(paddle.nn.Layer): import paddle import paddle.distributed as dist from paddle.io import BatchSampler, DataLoader, Dataset +import numpy as np mesh0 = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=['x', 'y']) # 创建进程网格 mesh1 = dist.ProcessMesh([[4, 5], [6, 7]], dim_names=['x', 'y']) # 创建进程网格 @@ -283,7 +338,9 @@ for step, inputs in enumerate(dataloader): 自动并行的 API 在设计之初,就以实现统一的用户标记接口和逻辑为目标,保证动静半框架保证在相同的用户标记下,动静态图分布式执行逻辑一致。这样用户在全流程过程中只需要标记一套动态图组网,即可以实现动态图下的分布式训练 Debug 和 静态图下的分布式推理等逻辑。整个动转静训练的逻辑如下: -![切分状态](images/dynamic_static_unified_auto_parallel.png) +

+[Figure: dynamic-to-static unified auto-parallel training flow]

```python ... diff --git a/docs/guides/paddle_v3_features/cinn_cn.md b/docs/guides/paddle_v3_features/cinn_cn.md new file mode 100644 index 00000000000..1e0eb35f5a4 --- /dev/null +++ b/docs/guides/paddle_v3_features/cinn_cn.md @@ -0,0 +1,360 @@ +# CINN 神经网络编译器 + +## 一、概念简介 +深度学习编译器是一种专门为深度学习模型优化和部署而设计的工具,用于提高模型的计算效率、降低内存占用、加速训练推理过程。其功能是将高层次的深度学习模型转换为低层次的、高效的、底层硬件可执行的代码。简单来说,深度学习编译器在深度学习框架和底层硬件之间充当了“翻译”的角色,能够将用户定义的神经网络模型描述转化为底层硬件能够理解和执行的指令。编译器在实现这种转换的过程中,应用了一系列优化技术,以提高模型在各种硬件平台上(如 CPU、GPU)的执行效率。 +深度学习编译器的主要功能包括: +- **模型转换**:将高层次的深度学习模型转换为适合目标硬件的中间表示(IR)。 +- **优化**:应用各种编译优化技术,如图优化、内存优化、算子融合等,以提高执行效率。 +- **代码生成**:生成适合目标硬件的可执行代码。 + +## 二、背景与动机 +深度学习模型的训练和推理过程涉及大量的计算,对硬件性能要求很高。飞桨框架虽然提供了高级的编程接口和丰富的算子库,但在执行效率和模型部署方面还有很大的优化空间。使用深度学习编译器的主要动机包括: +#### 1. 优化性能与资源利用率 +深度学习模型往往需要处理大量的数据和复杂的计算,直接在高层次框架上执行可能无法充分利用底层硬件的能力。深度学习编译器能够深入硬件特性,应用多种优化技术,提高计算效率,降低延迟。并且通过优化模型的计算图和内存使用,深度学习编译器也能够明显降低模型的内存和 IO 资源的消耗,进而提高计算性能。 +#### 2. 硬件多样性支持 +不同的硬件平台有不同的特性和优化需求。在现有机制下,新的异构硬件设备接入深度学习框架需要手工实现几百个算子对应的硬件 Kernel 代码,开发的工作量非常大。如果使用深度学习编译器,理论上仅需实现新硬件 IR 层面的对接,以及相应的硬件 IR 优化策略就能完成与深度学习框架的对接,相比于实现几百个硬件 Kernel,开发的工作量会大幅减少。 +#### 3. 
提升开发效率 +深度学习编译器可以自动化许多优化过程,减少手动调优的工作量。开发者只需关注模型的设计和训练,而不必深入了解底层硬件优化细节,从而提高开发效率。 + +## 三、使用示例: +飞桨框架编译器(CINN, Compiler Infrastructure for Neural Networks)使用时仅需在原先的模型动转静或推理流程下打开编译器相关 FLAGS 即可,无需对模型代码做任何改动。以下是一个使用样例: + +示例代码文件:`run_net.py` +```python +import paddle +from paddle import nn +from paddle.static import InputSpec + +# 定义神经网络 +class RMSNorm(nn.Layer): + def __init__(self): + super().__init__() + paddle.seed(2024) + self.hidden_size = 768 + self.weight = paddle.randn([self.hidden_size], dtype="float32") + self.variance_epsilon = 1e-6 + + def forward(self, hidden_states): + variance = (hidden_states * hidden_states).sum(-1, keepdim=True) / 768 + hidden_states = ( + paddle.rsqrt(variance + self.variance_epsilon) * hidden_states + ) + return hidden_states * self.weight + + +def run_net(input_data): + net = RMSNorm() + + # 指定输入变量的维度、数据类型等信息,具体接口可参考: + # https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/jit/basic_usage_cn.html#inputspec + input_spec = [ + InputSpec(shape=[1, None, 768], dtype='float32'), + ] + net = paddle.jit.to_static( + net, + input_spec=input_spec, + full_graph=True, + ) + # 使用 eval 模式 + net.eval() + # 执行计算图 + out = net(input_data) + return out + +# 创建输入数据 +input_data = paddle.randn([1, 2048, 768], dtype="float32") +# 运行神经网络 +out = run_net(input_data) +print(out) +``` + +脚本执行:`run.sh` +``` +# 打开组合算子 +export FLAGS_prim_enable_dynamic=true && export FLAGS_prim_all=true + +# 打开 CINN 编译器相关 FLAG +export FLAGS_use_cinn=true +export FLAGS_cinn_new_group_scheduler=true +export FLAGS_group_schedule_tiling_first=true +export FLAGS_cinn_bucket_compile=true + +# 打开 PIR 模式 +export FLAGS_enable_pir_api=true + +# 是否打印 Program IR 信息 +export FLAGS_print_ir=false + +python run_net.py +``` + +上述代码示例中我们创建了一个简单的`rms_norm`计算子图,使用飞桨的动转静流程将子图转为静态图并调用编译器 CINN 进行优化和执行。经过性能对比测试,在 A100 GPU 环境中上述子图使用 CINN 可以取得 3 倍左右的性能提升(该性能数据仅供学习参考,在实际应用模型中能够取得的性能提升效果一般会低于该数据)。 + +注:由于飞桨的编译器仍然处在快速迭代开发阶段,我们设置了较多 FLAGS 进行分支的选择和调试,因此现阶段在使用 CINN 时需要对如下 
FLAGS(`FLAGS_prim_enable_dynamic`、 `FLAGS_cinn_new_group_scheduler`、 `FLAGS_group_schedule_tiling_first`、 `FLAGS_cinn_bucket_compile`、 `FLAGS_enable_pir_api`) 进行手动设置,待后续相关功能完备后这些 FLAGS 会默认开启,无需再手动设置。 + +## 四、设计架构 +
+[Figure 1: CINN overall architecture]
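+
+The overall architecture of the PaddlePaddle compiler (CINN, Compiler Infrastructure for Neural Networks) is shown in the figure above. It consists of three main modules: the compiler frontend, the compiler backend, and the executor.
+
+### 1. Compiler frontend
+In general, a compiler frontend has to convert deep learning models from different frameworks and formats into the compiler's internal IR and perform graph-level optimization. As the native compiler of the PaddlePaddle framework, CINN can directly use the model loading and intermediate representation (Paddle IR, PIR for short) components that the framework provides. The main job of the CINN frontend is therefore graph-level optimization on PIR, plus partitioning the graph into subgraphs to support high-performance kernel code generation in the backend. The key frontend flow has three parts:
+
+#### a. Decomposing composite operators
+PaddlePaddle divides operators into two classes: basic operators and non-basic operators. A basic operator, also called an atomic operator, cannot be semantically decomposed any further into other operators, and basic operators can be recombined to implement the logic of a composite operator equivalently. Because non-basic operators are numerous and hard for the compiler to recognize and handle, we decompose each of them into an equivalent combination of basic operators. After this decomposition, the original computation graph offers much more room for performance optimization.
+
+#### b. Graph optimization passes
+PIR passes are applied at the computation-graph level. Common graph optimization passes include constant folding, dead code elimination (DCE), common subexpression elimination (CSE), redundant operator elimination, and merging of operator computations.
+
+#### c. Operator fusion
+Operator fusion is a very important frontend feature. It packs multiple operators into one subgraph (corresponding to a FusionOp) and hands it to the compiler backend, which generates one efficient hardware-specific compute kernel.
+In essence, operator fusion accelerates memory-bound operators by optimizing IO: merging two consecutive kernels into a single kernel call removes the cost of writing and re-reading the intermediate variable, so fusing two memory-bound ops yields higher performance. An example is shown in the figure below:
+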
+[Figure 2: Operator fusion example]
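+
+There are two operators, Relu and Scale, and both are IO-bound (their computational complexity is low). Without fusion we read A and B once each and write B and C once each. With the fused kernel (right side of the figure) we only read A once and write C once. Fusion thus reduces the number of memory accesses, which can greatly improve performance for IO-bound operators.
+The actual fusion strategy is very complex and is not covered here. Interested readers can read the relevant source code: [#cinn_group_cluster_pass](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/cinn/hlir/dialect/operator/transforms/cinn_group_cluster_pass.cc).
+
+### 2. Compiler backend
+The compiler backend converts the IR processed by the frontend into code or a hardware description that the target hardware can execute. Its main functions include hardware-aware IR optimization, efficient memory management, and code generation.
+
+#### 2.1. CINN AST IR
+Example of printed AST IR:
+```
+ScheduleBlock(root)
+{
+  serial for (i, 0, 32)
+  {
+    serial for (j_0, 0, 64)
+    {
+      serial for (j_1, 0, 128)
+      {
+        ScheduleBlock(A)
+        {
+          vi, vj = axis.bind(i, j_0 * 64 + j_1) // affine map from loop variables to tensor indices
+          A[vi, vj] = X[vi, vj] * 2
+        }
+      }
+    }
+  }
+}
+```
+The CINN AST IR carries the following information, although the sets and mappings are not stored explicitly in any particular data structure.
+
+  **Sets**: statement instances & memory cells
+  **Mappings**:
+   Memory-access relation: statement instance <---> memory cell
+   Dependence relation: statement instance <---> statement instance
+   Execution order: statement instance -----> statement instance
+
+  Execution order = the ordering among statement instances
+  Range of the statement-instance set = loop bounds + loop step. Loops form a constrained integer space, the iteration space; the iteration space determines the statement instances, and the statement instances fill the iteration space.
+
+#### 2.2. Schedules based on the AST IR
+A schedule is an optimization strategy defined on the CINN AST IR. Common schedules include LoopAlignment, Tile, Inline, Vectorize, and Unroll.
+The following simulates a possible sequence of AST transformations for a composite operator:**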
** + [S1, S2, 1024] ==E=> [S1, S2, 1024] ==R=> [S1, S2] ==E=> [S1, S2] ==B=> [S1, S2, 1024] ==E=> [S1, S2, 1024] + +**(1) LowerToAst 得到的结果** +``` +// Elemenwise-1 +serial for (i, 0, S1) + serial for (j, 0, S2) + serial for (k, 0, 1024) + ScheduleBlock(A) + vi, vj, vk = axis.bind(i, j, k) + A[vi, vj, vk] = X[vi, vj, vk] * 2 +// Elemenwise-2 +serial for (i, 0, S1) + serial for (j, 0, S2) + serial for (k, 0, 1024) + ScheduleBlock(B) + vi, vj, vk = axis.bind(i, j, k) + B[vi, vj, vk] = A[vi, vj, vk] + 1 +// Reduce-1 +serial for (i, 0, S1) + serial for (j, 0, S2) + ScheduleBlock(C__reduce_init) + vi, vj = axis.bind(i, j) + C_init[vi, vj] = 0 +serial for (i, 0, S1) + serial for (j, 0, S2) + serial for (k, 0, 1024) // Reduce + ScheduleBlock(C) + vi, vj, vk = axis.bind(i, j, k) + C[vi, vj] = C[vi, vj] + B[vi, vj, vk] +// Elemenwise-3 +serial for (i, 0, S1) + serial for (j, 0, S2) + ScheduleBlock(D) + vi, vj = axis.bind(i, j) + D[vi, vj] = C[vi, vj] * 2 +// Broadcast-1 +serial for (i, 0, S1) + serial for (j, 0, S2) + serial for (k, 0, 1024) // Broadcast + ScheduleBlock(E) + vi, vj, vk = axis.bind(i, j, k) + E[vi, vj, vk] = D[vi, vj] +// Elemenwise-4 +serial for (i, 0, S1) + serial for (j, 0, S2) + serial for (k, 0, 1024) + ScheduleBlock(F) + vi, vj, vk = axis.bind(i, j, k) + F[vi, vj, vk] = E[vi, vj, vk] + 1 +``` +**(2) 迭代空间对齐** +``` +// 所有 ScheduleBlock 的 loop nest 都变为以下 2 种格式中的一种 +// 1 +serial for (sp, 0, S1 * S2) // pure_spatial_iter + serial for (rb, 0, 1024) // impure_spatial_iter + ScheduleBlock(XXX) + vsp1, vsp2, vrb = axis.bind(sp / S2, sp % S2, rb) + XXX = XXXXXX +// 2 +serial for (sp, 0, S1 * S2) // pure_spatial_iter + ScheduleBlock(XXX) + vsp1, vsp2 = axis.bind(sp / S2, sp % S2) + XXX = XXXXXX +``` +**(3) Tile: 对所有 ScheduleBlock 的 loop nest 做相同的 Tile** +``` +// pure_spatial 轴 Tile 为:-1 * 16 * 64 Tile size 可为参数传入 +serial for (sp1, 0, S1 * S2 / 1024) + serial for (sp2, 0, 16) + serial for (sp3, 0, 64) // S1 * S2 / 16 / 64, predicate: sp1 * 1024 + sp2 * 16 + sp3 < S1 
* S2 + XXXXXX +// impure_spatial_iter 轴 Tile 为 32 +serial for (sp1, 0, S1 * S2 / 1024) + serial for (sp2, 0, 16) + serial for (sp3, 0, 64) + serial for (rb1, 0, 32) + serial for (rb2, 0, 32) + ScheduleBlock(XXX) + predicate = sp1 * 1024 + sp2 * 16 + sp3 < S1 * S2 + vsp1 = axis.bind((sp1 * 1024 + sp2 * 16 + sp3) / S2) + vsp2 = axis.bind((sp1 * 1024 + sp2 * 16 + sp3) % S2) + vrb = axis.bind(rb1 * 32 + rb2) + XXX = XXXXX +``` +**(4) ComputeInline** +``` +// 例如 ScheduleBlock(A) inline 到 ScheduleBlock(B) +serial for (sp1, 0, S1 * S2 / 1024) + serial for (sp2, 0, 16) + serial for (sp3, 0, 64) + serial for (rb1, 0, 32) + serial for (rb2, 0, 32) + ScheduleBlock(A) + predicate = sp1 * 1024 + sp2 * 16 + sp3 < S1 * S2 + vsp1 = axis.bind((sp1 * 1024 + sp2 * 16 + sp3) / S2) + vsp2 = axis.bind((sp1 * 1024 + sp2 * 16 + sp3) % S2) + vrb = axis.bind(rb1 * 32 + rb2) + B[vsp1, vsp2, vrb] = (X[vsp1, vsp2, vrb] * 2) + 1 +``` +**(5) Reduce 优化: two step reduce & 绑定部分 reduce 轴到 cuda** +``` +// 为了简洁,此处省略 reduce_init Block 和 predicate +serial for (sp1, 0, S1 * S2 / 1024) + serial for (sp2, 0, 16) + serial for (sp3, 0, 64) + CudaBind[ThreadIdx.x] for (rb1, 0, 32) + serial for (rb2, 0, 32) + ScheduleBlock(C_rf) + vsp1 = axis.bind((sp1 * 1024 + sp2 * 16 + sp3) / S2) + vsp2 = axis.bind((sp1 * 1024 + sp2 * 16 + sp3) % S2) + vrb1 = axis.bind(rb1) + vrb2 = axis.bind(rb2) + C_rf[vsp1, vsp2, vrb1] = C_rf[vsp1, vsp2, vrb1] + B[vsp1, vsp2, vrb1 * 32 + vrb2] + ScheduleBlock(C) + vsp1 = axis.bind((sp1 * 1024 + sp2 * 16 + sp3) / S2) + vsp2 = axis.bind((sp1 * 1024 + sp2 * 16 + sp3) % S2) + vrb1 = axis.bind(rb1) + C[vsp1, vsp2] = C[vsp1, vsp2] + C_rf[vsp1, vsp2, vrb1] +``` +**(6) 循环融合: ComputeAt && SimpleComputeAt,融合外层循环乘积相同的循环,并且保证不破坏图级别依赖(规则负责)和元素级别依赖(原语负责)** +``` +serial for (sp1, 0, S1 * S2 / 1024) + serial for (sp2, 0, 16) + serial for (sp3, 0, 64) + ScheduleBlock(D) + vsp1 = axis.bind((sp1 * 1024 + sp2 * 16 + sp3) / S2) + vsp2 = axis.bind((sp1 * 1024 + sp2 * 16 + sp3) % S2) + D[vsp1, vsp2] = C[vsp1, 
vsp2] * 2
+      serial for (rb1, 0, 32)
+        serial for (rb2, 0, 32)
+          ScheduleBlock(E)
+            vsp1 = axis.bind((sp1 * 1024 + sp2 * 16 + sp3) / S2)
+            vsp2 = axis.bind((sp1 * 1024 + sp2 * 16 + sp3) % S2)
+            vrb = axis.bind(rb1 * 32 + rb2)
+            E[vsp1, vsp2, vrb] = D[vsp1, vsp2]
+          ScheduleBlock(F)
+            vsp1 = axis.bind((sp1 * 1024 + sp2 * 16 + sp3) / S2)
+            vsp2 = axis.bind((sp1 * 1024 + sp2 * 16 + sp3) % S2)
+            vrb = axis.bind(rb1 * 32 + rb2)
+            F[vsp1, vsp2, vrb] = E[vsp1, vsp2, vrb] + 1
+```
+**(7) Bind CUDA axes: the loops that step 2 produced for every ScheduleBlock must be bound to the same CUDA axes**
+```
+serial for (sp1, 0, S1 * S2 / 1024)
+  CudaBind[BlockIdx.x] for (sp2, 0, 16)
+    CudaBind[ThreadIdx.y] for (sp3, 0, 64)
+      CudaBind[ThreadIdx.x] for (rb1, 0, 32)
+        serial for (rb2, 0, 32)
+          ScheduleBlock(XXX)
+```
+
+#### 2.3. Kernel code generation and compilation
+
+Codegen performs a pre-order traversal of the CINN IR AST and emits instructions for the target hardware. The emitted code is compiled by the matching compiler (for example llvm or nvcc) into a runnable function pointer, which is wrapped in a `JitKernelOp` so that the executor can parse and run it later.
+
+a. Taking a function definition as an example, a CUDA kernel function differs from an x86 kernel function in that `__global__` is emitted before the function name.
+
+For x86 hardware, the code that emits `ir::_LoweredFunc_` is:
+```
+void CodeGenC::Visit(const ir::_LoweredFunc_ *op) {
+  PrintFunctionDeclaration(op);  // the pre-order traversal goes on to emit the function name, parameters, etc.
+  str_ += "\n";
+  ...
+  ...
+}
+```
+On NVIDIA GPUs, the corresponding code is:
+```
+void CodeGenCUDA_Dev::Visit(const ir::_LoweredFunc_ *op) {
+  str_ += "__global__\n";        // unlike x86, __global__ is emitted first
+  PrintFunctionDeclaration(op);  // the pre-order traversal goes on to emit the function name, parameters, etc.
+  str_ += "\n";
+  ...
+  ...
+}
+```
+b. 
在动态形状场景下,还会 codegen 出 infer shape function, infer shape function 的 CINN IR 会在 Bucket Lowering 中得到,转义过程复用的 x86 硬件的 codegen。infer shape kernel 如下: +``` +// infer shape 函数名字的组成:kernel_name + "infer_shape" +// 函数参数: +// kernel_args: 指针数组,和 kernel func args 一致 +// kernel_args_num: kernel_args 的长度 +// tensor_shape_args: 指针数组,存储输出 tensor 的 shape +function fn_exp_0_subtract_0_infer_shape (kernel_args, kernel_args_num, tensor_shape_args) +{ + int64 S0 = cinn_get_value_in_cuda_kernel_args(kernel_args, 2) + { + // CINN IR 暂时不支持数据索引的语法,暂时用函数调用实现,下面 2 条语句等价于 + // tensor_shape_args[0] = {S0, 256ll}; + // 即第 0 个出 tensor 的 shape 为{S0, 256ll}; + infer_shape_set_value(0, 0, S0, tensor_shape_args) + infer_shape_set_value(0, 1, 256ll, tensor_shape_args) + } +} +``` + +### 3. 执行器 + +编译器生成的 Kernel 代码需要与深度学习框架执行器完成交互和集成才能最终运行起来,因此需要基于执行器的运行调度接口对编译器生成的 Kernel 进行封装。 + +接入执行器后在运行时对于经过编译器处理的子图将执行 CINN 生成的 Kernel, 否则将执行常规的 PHI 算子 Kernel。 diff --git a/docs/guides/paddle_v3_features/higher_order_ad_cn.md b/docs/guides/paddle_v3_features/higher_order_ad_cn.md new file mode 100644 index 00000000000..ddf6f160ece --- /dev/null +++ b/docs/guides/paddle_v3_features/higher_order_ad_cn.md @@ -0,0 +1,209 @@ +高阶自动微分功能支持科学计算 +================ + +本篇文章主要为你介绍飞桨的高阶微分机制,帮助你更好的使用飞桨。 + +一、背景与动机 +-------- + +深度学习模型的训练过程涉及使用随机梯度下降(SGD)等优化算法来更新模型参数。在这一过程中,深度学习框架的自动微分功能发挥着核心作用,它利用链式法则自动计算出损失函数相对于模型参数的梯度。尽管大多数深度学习任务只需计算一阶导数,但在某些 AI for Science 场景中,却需要计算高阶导数,这无疑增加了自动微分的复杂性。以 2D 矩形平板分布受载问题为例,该问题的内在机理需要使用 4 阶微分方程来描述。为了求解这类问题,深度学习框架必须支持高阶自动微分功能。 + +
+ +
+ +
+ +
+ +二、设计思想 +------------------------------ + +高阶自动微分的实现面临诸多挑战。具体而言,框架需要为每个算子编写高阶微分规则。随着阶数的增加,微分规则的复杂性也随之上升。当阶数达到三阶或更高时,编写这些规则变得极其困难,同时正确性难以保证。为了解决这一问题,飞桨提出了基于基础算子组合的高阶自动微分技术。该技术的关键在于将复杂算子(如 log_softmax)拆解为多个基础算子的组合。然后,我们对这些基础算子进行一阶自动微分变换。重要的是,基础算子经过一阶自动微分变换后,其得到的计算图仍然是由基础算子所构成。通过反复应用一阶自动微分规则,我们可以轻松地获得高阶自动微分结果。 + +**log_softmax 拆解与微分示例** + +根据 log_softmax 表达式拆解为 exp、max、log 等细粒度基础算子组成,基础算子是指由简单运算逻辑组成的有限集合,数量较少。基于飞桨的自动微分体系,使用基础算子的微分规则自动推导 log_softmax 一阶微分,注意基础算子微分规则仍由基础算子实现,因此 log_softmax 的一阶微分仍由基础算子组成。重复上述微分过程实现 log_softmax 高阶微分求解。 + +
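The decomposition described above can be sketched in plain Python. This is only an illustration under stated assumptions, not Paddle code: the helper names `log_softmax`, `log_softmax_grad` and `finite_difference` are hypothetical. The sketch writes log_softmax using only max, subtraction, exp, sum and log, expresses its first derivative with the same primitives, and checks that derivative against a finite difference.

```python
import math

def log_softmax(xs):
    """log_softmax built only from primitive ops: max, sub, exp, sum, log."""
    m = max(xs)                                             # max
    shifted = [x - m for x in xs]                           # sub
    log_sum = math.log(sum(math.exp(s) for s in shifted))   # exp, sum, log
    return [s - log_sum for s in shifted]                   # sub

def log_softmax_grad(xs, i, j):
    """d log_softmax(xs)[i] / d xs[j], written with the same primitives.

    The closed form is delta_ij - softmax(xs)[j]; softmax is recovered as
    exp(log_softmax), so no new kind of operator is needed.
    """
    softmax_j = math.exp(log_softmax(xs)[j])
    return (1.0 if i == j else 0.0) - softmax_j

def finite_difference(xs, i, j, eps=1e-6):
    """Central finite difference, used only to sanity-check the rule above."""
    hi = list(xs)
    lo = list(xs)
    hi[j] += eps
    lo[j] -= eps
    return (log_softmax(hi)[i] - log_softmax(lo)[i]) / (2 * eps)

xs = [0.5, -1.0, 2.0]
analytic = log_softmax_grad(xs, 0, 2)
numeric = finite_difference(xs, 0, 2)
```

Because the derivative is itself composed of primitive operations, the same procedure can be applied to it again, which is the property the text relies on for higher orders.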
+ +
+ +三、框架架构 +------------------------------ + +为了支持高阶自动微分,飞桨框架精心设计与实现了组合算子机制。这一机制不仅兼容动态图模式和静态图模式,而且在动态图模式下支持 N+1 阶微分的拆分,同时在静态图模式下能够进行编译器融合优化。创新性地设计并实现了动静一体的算子组合规则,这意味着同一套组合规则在动态图和静态图两种模式下均可复用,从而避免了重复开发。在构建基础算子体系时,我们以 Tensor 作为核心操作对象,确保了算子的原子性、实用性和完备性。此外,我们还支持自定义反向操作和自动重计算功能,这些特性不仅提升了模型的精度,还有效地减少了显存占用,为用户提供了更高效、更灵活的深度学习体验。 + +
+ +
+ + **基础算子集合设计** + +基础算子集合的设计需要兼顾通用性、计算效率、易用性和兼容性,此外,还需要具备可扩展性,以便可以方便地添加新的数据处理操作和模型,并可以组合支撑更加复杂的计算工作。飞桨制定了基础算子集合设计原则,1)原子性,即基础算子的操作不能拆分为更基础的操作,如不能把大于等于拆分为不小于;2)实用性,基础算子有实际应用场景;3)面向张量,基础算子的操作粒度为张量,如果一个算子需要在张量的元素粒度上进行复杂操作,则这个算子本身应为基础算子;4)完备性,可以支持复杂算子拆分需求。基于上述原则设计和实现基础算子集合,最终预期基础算子规模约控制到 200 左右,当前还在持续演进中。 + +**动静一体组合规则** + +组合规则是指使用基础算子接口组合实现的复杂算子集合,为了能够在动态图和静态图体系下复用同一套组合规则,减少编码工作量,在基础算子层,设计一套抽象接口,屏蔽动态图基础算子和静态图基础算子实现细节,组合规则的实现调用抽象接口实现,并设计一套分发机制,根据动态图和静态图数据类型的不同进行分发到具体基础算子执行,从而实现动态图和静态图不同模式下组合规则的复用。 + +**从机制上保障性能** + +随着算子细粒度拆分,算子数量会急剧膨胀,算子调度开销也会加大。动态图模式算子动态执行,无法提前优化,为了减少算子拆分造成的动态图性能损耗,飞桨采取了拆解 N+1 阶算子方法,即如果现有算子已经实现了 N 阶反向大算子,那么为了保证现有模型性能不降,实现 N+1 拆解逻辑,从而调度上优先运行 1-N 阶大算子逻辑,N+1 拆解成基础算子,保证性能同时支持高阶微分。静态图模式下,由于可以提前整图优化,基于飞桨编译器技术进行图层调度优化和算子融合优化,并且由于算子粒度更细,存在优化空间更大,部分模型上基于组合算子体系和编译器优化的模型性能已经超越了原有大算子体系下模型性能。 + +**从机制上保障显存和精度** + +模型执行过程通常是先执行前向计算,并保存反向计算依赖的中间变量,反向计算复用这些中间变量进行计算。算子细粒度拆分,使需要保存的中间变量急剧增大,模型运行需要的显存大幅增加。飞桨使用自定义反向技术解决该问题,对于一个复杂大算子,支持自定义其反向微分规则,该微分规则实现只依赖其前向大算子的输入输出,并在框架调度上优先保障走该算子的自定义反向微分,而非自动推导的微分规则,从而减少中间变量,降低显存。 + + + +四、开始使用 +------------------------------ + +飞桨提供了完善高阶自动微分求解 API,包括通用反向微分求解 paddle.grad,多元函数雅可比矩阵计算 `paddle.autograd.jacobian` ,多元函数海森矩阵计算 `paddle.autograd.hessian`. 功能与链接具体参考 4.1. 
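+
+The following simple example shows how to use Paddle's higher-order automatic differentiation.
+
+**Step 1: Import dependencies**
+
+```python
+import paddle
+```
+
+**Step 2: Define the network**
+
+Taking a single fully connected layer as an example, `MyNet` inherits from `paddle.nn.Layer`, initializes the network parameters in `__init__`, and implements the forward logic in `forward`. Note that higher-order automatic differentiation currently supports most commonly used Paddle APIs and covers mainstream scientific computing models. If you run into a higher-order differentiation problem when writing a new model, please report it through a Paddle issue.
+
+```python
+class MyNet(paddle.nn.Layer):
+    def __init__(self):
+        super(MyNet, self).__init__()
+        self.weight = self.create_parameter(shape=(2,2), dtype=paddle.float32, is_bias=False)
+        self.bias = self.create_parameter(shape=(2,2), dtype=paddle.float32, is_bias=True)
+        self.add_parameter("weight", self.weight)
+        self.add_parameter("bias", self.bias)
+
+    def forward(self, x):
+        y = paddle.matmul(x, self.weight) + self.bias
+        return paddle.tanh(y)
+```
+
+**Step 3: Create the network, declare the input data, and run the forward pass**
+
+```python
+x = paddle.randn(shape=(2,2), dtype=paddle.float32)
+x.stop_gradient = False  # x must track gradients so that paddle.grad can differentiate with respect to it
+net = MyNet()
+y = net(x)
+```
+
+**Step 4: Compute the loss**
+
+To demonstrate higher-order differentiation, the loss here uses the `paddle.grad` API to compute the second-order derivative of `y` with respect to `x` and takes its L2 norm.
+
+```python
+# create_graph=True keeps the graph so that the result can be differentiated again
+grad1 = paddle.grad(y, x, create_graph=True)
+grad2 = paddle.grad(grad1, x, create_graph=True)
+# paddle.grad returns a list of tensors, so take the first element
+loss = paddle.norm(grad2[0], p=2)
+```
+
+**Step 5: Run the backward pass and update the parameters with the Adam optimizer**
+
+```python
+opt = paddle.optimizer.Adam(parameters=net.parameters())
+loss.backward()
+opt.step()
+```
+
+### 4.1 Automatic differentiation APIs
+
+API name | Description |
+:-----: | :-----: |
+[paddle.grad](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/grad_cn.html#grad) | Reverse-mode automatic differentiation |
+[paddle.autograd.jacobian](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/autograd/jacobian_cn.html#jacobian) | Jacobian matrix computation |
+[paddle.autograd.hessian](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/autograd/hessian_cn.html#hessian) | Hessian matrix computation |
+
+**Computing higher-order derivatives of tanh with the reverse-mode API paddle.grad**
+
+```python
+import paddle
+
+# Build the computation
+x = paddle.rand((2,))
+x.stop_gradient = False  # required so that gradients with respect to x can be computed
+y = paddle.tanh(x)
+grad1 = paddle.grad(y, x, create_graph=True) # first-order derivative
+grad2 = paddle.grad(grad1, x, create_graph=True) # second-order derivative
+grad3 = paddle.grad(grad2, x) # 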
三阶微分 + +print(grad1, grad2, grad3) +# [0.41997433] [-0.6397] [0.6216267] +``` + +**使用 paddle.autograd.jacobian 计算 Jacobian 矩阵** + +```python +import paddle + +x1 = paddle.randn([3, ]) +x2 = paddle.randn([3, ]) +x1.stop_gradient = False +x2.stop_gradient = False + +y = x1 + x2 + +J = paddle.autograd.jacobian(y, (x1, x2)) +J_y_x1 = J[0][:] # evaluate result of dy/dx1 +J_y_x2 = J[1][:] # evaluate result of dy/dx2 + +print(J_y_x1.shape) +# [3, 3] +print(J_y_x2.shape) +# [3, 3] +``` + +**使用 paddle.autograd.hessian 计算 Hessian 矩阵** + +```python +import paddle + +x1 = paddle.randn([3, ]) +x2 = paddle.randn([4, ]) +x1.stop_gradient = False +x2.stop_gradient = False + +y = x1.sum() + x2.sum() + +H = paddle.autograd.hessian(y, (x1, x2)) +H_y_x1_x1 = H[0][0][:] # evaluate result of ddy/dx1x1 +H_y_x1_x2 = H[0][1][:] # evaluate result of ddy/dx1x2 +H_y_x2_x1 = H[1][0][:] # evaluate result of ddy/dx2x1 +H_y_x2_x2 = H[1][1][:] # evaluate result of ddy/dx2x2 + +print(H_y_x1_x1.shape) +# [3, 3] +print(H_y_x1_x2.shape) +# [3, 4] +print(H_y_x2_x1.shape) +# [4, 3] +print(H_y_x2_x2.shape) +# [4, 4] +``` + + + +五、飞桨支撑科学计算 AI For Science +------------------------------ + +基于飞桨框架 3.0 为科学计算提供了高阶自动微分、编译优化、分布式训练能力支撑,提供了面向通用数理问题求解的赛桨 PaddleScience 以及专注于生物计算的螺旋桨 PaddleHelix 工具组件。为了更好地支撑 AI for Science 生态,飞桨对国内外主流开源科学计算工具进行了适配,并被国际主流的科学计算深度学习库 DeepXDE 唯一推荐。在与 NVIDIA 合作适配其 AI Physics 工具 Modulus 的过程中,飞桨利用其高阶自动微分与编译优化技术,成功完成了全量模型适配,实现了方程求解类模型性能的大幅优化,相比 Modulus 现有后端求解速度平均提升 71%。 + +
+ +
diff --git a/docs/guides/paddle_v3_features/images/auto_parallel/dynamic-static-unified.png b/docs/guides/paddle_v3_features/images/auto_parallel/dynamic-static-unified.png new file mode 100644 index 00000000000..7973b6c03a0 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/auto_parallel/dynamic-static-unified.png differ diff --git a/docs/guides/06_distributed_training/images/dynamic_static_unified_auto_parallel.png b/docs/guides/paddle_v3_features/images/auto_parallel/dynamic_static_unified_auto_parallel.png similarity index 100% rename from docs/guides/06_distributed_training/images/dynamic_static_unified_auto_parallel.png rename to docs/guides/paddle_v3_features/images/auto_parallel/dynamic_static_unified_auto_parallel.png diff --git a/docs/guides/paddle_v3_features/images/auto_parallel/mesh.png b/docs/guides/paddle_v3_features/images/auto_parallel/mesh.png new file mode 100644 index 00000000000..528a8f3f917 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/auto_parallel/mesh.png differ diff --git a/docs/guides/paddle_v3_features/images/auto_parallel/reshard.png b/docs/guides/paddle_v3_features/images/auto_parallel/reshard.png new file mode 100644 index 00000000000..aa73de9d72d Binary files /dev/null and b/docs/guides/paddle_v3_features/images/auto_parallel/reshard.png differ diff --git a/docs/guides/paddle_v3_features/images/auto_parallel/shard.png b/docs/guides/paddle_v3_features/images/auto_parallel/shard.png new file mode 100644 index 00000000000..84520720c54 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/auto_parallel/shard.png differ diff --git a/docs/guides/paddle_v3_features/images/auto_parallel/shard_anonation.png b/docs/guides/paddle_v3_features/images/auto_parallel/shard_anonation.png new file mode 100644 index 00000000000..25cc7a1de01 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/auto_parallel/shard_anonation.png differ diff --git 
a/docs/guides/paddle_v3_features/images/auto_parallel/shard_convertion.png b/docs/guides/paddle_v3_features/images/auto_parallel/shard_convertion.png new file mode 100644 index 00000000000..fbcd1d5903b Binary files /dev/null and b/docs/guides/paddle_v3_features/images/auto_parallel/shard_convertion.png differ diff --git a/docs/guides/paddle_v3_features/images/auto_parallel/shard_propogation.png b/docs/guides/paddle_v3_features/images/auto_parallel/shard_propogation.png new file mode 100644 index 00000000000..8aa595a1e78 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/auto_parallel/shard_propogation.png differ diff --git a/docs/guides/paddle_v3_features/images/cinn/cinn_design.png b/docs/guides/paddle_v3_features/images/cinn/cinn_design.png new file mode 100644 index 00000000000..51341c63b9c Binary files /dev/null and b/docs/guides/paddle_v3_features/images/cinn/cinn_design.png differ diff --git a/docs/guides/paddle_v3_features/images/cinn/op_fusion.png b/docs/guides/paddle_v3_features/images/cinn/op_fusion.png new file mode 100644 index 00000000000..503390fed00 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/cinn/op_fusion.png differ diff --git a/docs/guides/paddle_v3_features/images/higher_order_ad/ai4s.png b/docs/guides/paddle_v3_features/images/higher_order_ad/ai4s.png new file mode 100644 index 00000000000..4e1ae538802 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/higher_order_ad/ai4s.png differ diff --git a/docs/guides/paddle_v3_features/images/higher_order_ad/architecture.png b/docs/guides/paddle_v3_features/images/higher_order_ad/architecture.png new file mode 100644 index 00000000000..187133d238d Binary files /dev/null and b/docs/guides/paddle_v3_features/images/higher_order_ad/architecture.png differ diff --git a/docs/guides/paddle_v3_features/images/higher_order_ad/background.png b/docs/guides/paddle_v3_features/images/higher_order_ad/background.png new file mode 100644 index 
00000000000..e5992c808d4 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/higher_order_ad/background.png differ diff --git a/docs/guides/paddle_v3_features/images/higher_order_ad/softmax_example.png b/docs/guides/paddle_v3_features/images/higher_order_ad/softmax_example.png new file mode 100644 index 00000000000..73c7dec845a Binary files /dev/null and b/docs/guides/paddle_v3_features/images/higher_order_ad/softmax_example.png differ diff --git a/docs/guides/paddle_v3_features/images/overview/paddle_v3_2d_plate.png b/docs/guides/paddle_v3_features/images/overview/paddle_v3_2d_plate.png new file mode 100644 index 00000000000..336993c2eb9 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/overview/paddle_v3_2d_plate.png differ diff --git a/docs/guides/paddle_v3_features/images/overview/paddle_v3_2d_plate_pde.png b/docs/guides/paddle_v3_features/images/overview/paddle_v3_2d_plate_pde.png new file mode 100644 index 00000000000..aab37c2e0ee Binary files /dev/null and b/docs/guides/paddle_v3_features/images/overview/paddle_v3_2d_plate_pde.png differ diff --git a/docs/guides/paddle_v3_features/images/overview/paddle_v3_arch.png b/docs/guides/paddle_v3_features/images/overview/paddle_v3_arch.png new file mode 100644 index 00000000000..d936b6d5459 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/overview/paddle_v3_arch.png differ diff --git a/docs/guides/paddle_v3_features/images/overview/paddle_v3_cinn_arch.png b/docs/guides/paddle_v3_features/images/overview/paddle_v3_cinn_arch.png new file mode 100644 index 00000000000..40c99c6ab19 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/overview/paddle_v3_cinn_arch.png differ diff --git a/docs/guides/paddle_v3_features/images/overview/paddle_v3_hardware.png b/docs/guides/paddle_v3_features/images/overview/paddle_v3_hardware.png new file mode 100644 index 00000000000..5ecfac6ffb7 Binary files /dev/null and 
b/docs/guides/paddle_v3_features/images/overview/paddle_v3_hardware.png differ diff --git a/docs/guides/paddle_v3_features/images/overview/paddle_v3_parallel.png b/docs/guides/paddle_v3_features/images/overview/paddle_v3_parallel.png new file mode 100644 index 00000000000..73e74be857e Binary files /dev/null and b/docs/guides/paddle_v3_features/images/overview/paddle_v3_parallel.png differ diff --git a/docs/guides/paddle_v3_features/images/overview/paddle_v3_placement.png b/docs/guides/paddle_v3_features/images/overview/paddle_v3_placement.png new file mode 100644 index 00000000000..e520464793d Binary files /dev/null and b/docs/guides/paddle_v3_features/images/overview/paddle_v3_placement.png differ diff --git a/docs/guides/paddle_v3_features/images/overview/paddle_v3_process_mesh.png b/docs/guides/paddle_v3_features/images/overview/paddle_v3_process_mesh.png new file mode 100644 index 00000000000..e11dd81cf90 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/overview/paddle_v3_process_mesh.png differ diff --git a/docs/guides/paddle_v3_features/images/overview/paddle_v3_rmsnorm.png b/docs/guides/paddle_v3_features/images/overview/paddle_v3_rmsnorm.png new file mode 100644 index 00000000000..9c78fa8f494 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/overview/paddle_v3_rmsnorm.png differ diff --git a/docs/guides/paddle_v3_features/images/overview/paddle_v3_rmsnorm_perf.png b/docs/guides/paddle_v3_features/images/overview/paddle_v3_rmsnorm_perf.png new file mode 100644 index 00000000000..d9a98521c48 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/overview/paddle_v3_rmsnorm_perf.png differ diff --git a/docs/guides/paddle_v3_features/images/overview/paddle_v3_workflow.png b/docs/guides/paddle_v3_features/images/overview/paddle_v3_workflow.png new file mode 100644 index 00000000000..76da20557c6 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/overview/paddle_v3_workflow.png differ diff --git 
a/docs/guides/paddle_v3_features/images/paddle_ir/overview.png b/docs/guides/paddle_v3_features/images/paddle_ir/overview.png new file mode 100644 index 00000000000..3c0cf1f50d4 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/paddle_ir/overview.png differ diff --git a/docs/guides/paddle_v3_features/images/paddle_ir/pass_design.png b/docs/guides/paddle_v3_features/images/paddle_ir/pass_design.png new file mode 100644 index 00000000000..89db07cdd6e Binary files /dev/null and b/docs/guides/paddle_v3_features/images/paddle_ir/pass_design.png differ diff --git a/docs/guides/paddle_v3_features/images/paddle_ir/pass_example.png b/docs/guides/paddle_v3_features/images/paddle_ir/pass_example.png new file mode 100644 index 00000000000..f42e955b336 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/paddle_ir/pass_example.png differ diff --git a/docs/guides/paddle_v3_features/images/paddle_ir/pir_design.png b/docs/guides/paddle_v3_features/images/paddle_ir/pir_design.png new file mode 100644 index 00000000000..328dfa9bc84 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/paddle_ir/pir_design.png differ diff --git a/docs/guides/paddle_v3_features/images/paddle_ir/vs_program.png b/docs/guides/paddle_v3_features/images/paddle_ir/vs_program.png new file mode 100644 index 00000000000..7bf1c7f1c71 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/paddle_ir/vs_program.png differ diff --git a/docs/guides/paddle_v3_features/images/sot/sot_framework.png b/docs/guides/paddle_v3_features/images/sot/sot_framework.png new file mode 100644 index 00000000000..cd097967f65 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/sot/sot_framework.png differ diff --git a/docs/guides/paddle_v3_features/images/sot/sot_procedure.png b/docs/guides/paddle_v3_features/images/sot/sot_procedure.png new file mode 100644 index 00000000000..504a06dbe23 Binary files /dev/null and 
b/docs/guides/paddle_v3_features/images/sot/sot_procedure.png differ diff --git a/docs/guides/paddle_v3_features/images/sot/sot_vs_ast.png b/docs/guides/paddle_v3_features/images/sot/sot_vs_ast.png new file mode 100644 index 00000000000..2629f2fde01 Binary files /dev/null and b/docs/guides/paddle_v3_features/images/sot/sot_vs_ast.png differ diff --git a/docs/guides/paddle_v3_features/index_cn.rst b/docs/guides/paddle_v3_features/index_cn.rst new file mode 100644 index 00000000000..3ea3431c8a4 --- /dev/null +++ b/docs/guides/paddle_v3_features/index_cn.rst @@ -0,0 +1,30 @@ +############### +飞桨 3.0 全新特性 +############### + + +**以下将详细地介绍飞桨 3.0 版本下发布的全新 Feature 内容:** + +- `3.0 新特性总览 `_ :概述了飞桨框架 3.0 版本下的架构设计、新特性、性能提升等 + +- `动静统一自动并行 `_ :介绍了飞桨动静统一的自动并行编程范式 + +- `神经网络编译器 `_ :介绍了神经网络编译器自动优化的基本原理、架构和功能 + +- `高阶自动微分 `_ :介绍了飞桨高阶自动微分在科学计算领域的应用 + +- `动转静 SOT 原理及使用 `_ :介绍了动转静 SOT 原理及使用方式 + +- `PIR 基本概念和开发 `_ :介绍了飞桨新一代中间表示(PIR)的设计和开发范式 + + + +.. toctree:: + :hidden: + + overview_cn.md + auto_parallel_cn.md + cinn_cn.md + higher_order_ad_cn.md + sot_cn.md + paddle_ir_cn.md diff --git a/docs/guides/paddle_v3_features/overview_cn.md b/docs/guides/paddle_v3_features/overview_cn.md new file mode 100644 index 00000000000..57169a34bee --- /dev/null +++ b/docs/guides/paddle_v3_features/overview_cn.md @@ -0,0 +1,359 @@ +# 飞桨框架 3.0 新特性 + +## 一、概述 + +深度学习框架作为基础软件,不仅促进了深度学习技术的飞速进步,更为人工智能技术的广泛应用铺设了坚实的基础。首先深度学习框架为开发者提供了便捷易用的开发接口,这些接口对数据和操作进行了高度抽象,使得开发者能够更专注于算法和模型的设计,而不必深陷底层数据的处理细节。通过这些接口,开发者无需直接感知和应对复杂的硬件底层开发细节,从而极大地提升了开发效率和体验。其次深度学习框架还提供了自动微分这一强大功能,开发者通常只需要编写前向传播网络的代码,而繁琐的反向传播网络则交由框架自动完成。 + +飞桨框架是我国首个自主研发、开源开放且功能丰富的深度学习框架,自 2016 年起正式对外开源。2018 年,我们发布了飞桨框架 1.0 版本,该版本默认使用静态图,并着手研发动态图功能。2021 年初,飞桨框架 2.0 版本问世,它默认采用动态图,并实现了动静统一与训推一体的设计。此版本进一步融合了动态图的灵活性与静态图的高效性,同时支持了千亿参数模型的混合并行训练。在此期间,飞桨还踏上了神经网络编译器技术的探索征程。随着大模型时代的到来,模型参数规模日益扩大,训练成本也随之上升,这对深度学习框架在大规模分布式训练和性能优化方面提出了更高要求。近期,我们推出了飞桨框架 3.0-Beta 版本,标志着飞桨新一代框架技术创新之路的开启。该版本的核心特性包括动静统一自动并行技术和神经网络编译器自动优化等新技术,旨在应对当前深度学习领域的新挑战。飞桨框架 3.x 版本延续了 2.x 
版本动静统一、训推一体的设计理念,其开发接口全面兼容 2.x 版本。这意味着,使用 2.x 版本开发的代码,在绝大多数情况下无需修改,即可直接在 3.x 版本上运行。 + +以下是飞桨框架 3.x 的新特性: + +- **动静统一自动并行:** 为了降低大模型的编程难度,飞桨还优化了动静统一的半自动并行编程范式,显著简化了编程的复杂度。开发者无需深入研究手动并行编程的复杂概念和 API,只需进行少量的张量切分标注,即可完成混合并行模型的构建。框架能够自动推导分布式切分状态并添加通信算子,同时还支持一键动转静分布式训练,从而大幅简化了混合并行训练代码的开发过程。动静统一方面,飞桨通过采用基于字节码的动静转换技术,全面升级了其动转静训练能力,支持自适应的图构建功能。在 700 多个飞桨产业级模型上进行了验证,实现了一键动转静训练 100%的成功率。 +- **神经网络编译器自动优化:** 飞桨神经网络编译器 CINN(Compiler Infrastructure for Neural Networks)采用与框架一体化的设计,能够支持生成式模型、科学计算模型等多种模型的高效训练与可变形状推理,为计算灵活性与高性能之间提供了一个良好的平衡点。通过算子的自动融合和代码生成技术,Llama2 和 Stable Diffusion 模型的性能提升了 30%。 +- **高阶自动微分:** 为了更好支持科学计算等场景,飞桨框架设计并实现了基于组合算子机制的高阶自动微分技术,结合神经网络编译器自动优化技术,我们测试了超过 40 多个科学计算场景的微分方程,其求解速度领先业界同类产品 70%。 +- **高扩展中间表示** :为了提升飞桨框架的可扩展性,我们研发了高扩展中间表示 PIR(Paddle Intermediate Representation)。这一表示系统性地抽象了底层核心概念,提供了灵活且高效的组件。PIR 作为基础设施,支撑着动转静、自动微分、自动并行、组合算子、图优化等多项技术,并广泛应用于分布式训练、模型压缩、推理部署等场景。通过 PIR 提供的 DRR(Declarative Rewrite Rule)机制,Pass 的开发成本可以降低 60%。我们对超过 900 个模型配置进行了测试,结果显示,在使用 PIR 后,推理的整体性能提升了超过 10%。 +- **多硬件适配:** 飞桨为大模型硬件适配提供了功能完善且低成本的方案。新硬件仅需适配 30 余个接口,即可支持大模型的训练、压缩与推理。同时,飞桨提供了基于编译器的硬件接入方式,硬件厂商只需以插件的形式实现编译器的代码生成后端,便能实现与飞桨框架的高效适配。 + +上述特性在飞桨框架 2.6 版本或更早版本时就已经开始开发,目前已达到外部可试用的阶段。由于这些新特性在使用体验、性能、二次开发便利度以及硬件适配能力等方面带来了显著提升,因此我们决定发布 3.0-Beta 版本。此版本包含了对框架 2.x 版本部分已有功能的改进,并且在不使用新特性的情况下,表现是成熟稳定的。展望未来,我们预计将在 2024 年 12 月发布飞桨框架 3.0 的正式版本。 + +## 二、设计思想 + +当前,AI 技术的发展正日新月异,引领着科技的前沿。深度学习框架的设计对于推动人工智能技术的发展至关重要,其核心设计目标是让深度学习技术的创新与应用更简单。那么如何做到这一点呢?我们需要从以下几个方面来考虑。 + +首先,框架向上对接开发者的需求。一个优秀的深度学习框架应当为开发者提供极致的开发体验。这不仅仅意味着提供一个用户友好的开发环境,更重要的是要能够大幅度减少开发者的学习成本和时间成本,同时显著提升开发的便利性。为此,飞桨框架提出了“动静统一、训推一体、自动并行”的理念,极大地提高了开发效率。 + +其次,框架向下对接硬件。现代深度学习应用往往需要在多样化的硬件平台上运行,因此,框架必须能够兼容并适配各种不同的硬件设备。这要求框架能够智能地隔离不同硬件接口之间的差异,实现广泛的硬件适配性。同时,为了充分发挥硬件的性能,框架还需要具备软硬件协同工作的能力,确保在利用硬件资源时能够达到最优的性能表现。 + +再者,框架需要考虑到 AI 技术发展的整体趋势。随着技术的不断进步,诸如 MOE(Mixture of Experts)、多模态以及科学智能(AI for 
Science)等前沿技术逐渐成为新的研究热点。深度学习框架应当能够紧跟这些技术发展的步伐,为研究者提供必要的支持和工具,以推动相关技术的快速发展和应用。在大模型领域,模型的显著特点是参数规模庞大、训练数据海量,以及对算力的巨大需求。随着模型复杂性的增加,计算瓶颈、存储瓶颈、访存瓶颈以及通信瓶颈等问题逐渐凸显。同时新的网络结构如 RWKV、Mamba 等也在不断涌现,为 AI 技术的发展注入了新的活力。为了解决这些问题,分布式训练和通用性能优化的需求日益迫切。在 AI for Science 领域,人工智能正引发科学发现和模式创新的深刻变革。以 AlphaFold 为代表的生物计算模型,GraphCast 等气象模型,物理信息神经网络(PINN)和傅里叶算子学习方法(FNO)都展示了 AI 在科学研究中的强大能力。为了支持科学计算模型,框架的设计需要能够支持高阶自动微分、复数运算、傅里叶变换等功能。 + +最后,框架需要能够支持产业的实际落地应用。在产业化方面,框架需要具备支持训练、压缩、推理一体化的全流程能力。这意味着,从模型的训练到优化,再到实际部署和推理,框架应当提供一套完整、高效的解决方案,以满足产业界对于深度学习技术的实际需求。 + +总的来说,飞桨将为开发者提供一个“动静统一、训推一体、自动并行、自动优化、广泛硬件适配”的深度学习框架,开发者可以像写单机代码一样写分布式代码,无需感知复杂的通信和调度逻辑,即可实现大模型的开发;可以像写数学公式一样用 Python 语言写神经网络,无需使用硬件开发语言编写复杂的算子内核代码,即可实现高效运行。 + +## 三、框架架构 + +为了实现深度学习框架的上述特性,我们必须对框架的架构进行精心设计,确保其能够支持各种复杂的模型构建,同时与多样化的芯片实现无缝对接。接下来,我们将通过直观的架构图,详细展示飞桨新一代框架内所涵盖的功能模块,以及这些模块之间的相互作用与联系。以下为飞桨框架 3.0 的架构图。 + + +
+ +
+ +飞桨框架对外提供了丰富的深度学习相关的各种开发接口,如张量表示、数学计算、模型组网、优化策略等。通过这些接口,开发者能够便捷地构建和训练自己的深度学习模型,无需深入到底层的技术细节中去。 + +在开发接口之下,飞桨框架可以划分为 4 个层次:表示层、调度层、算子层和适配层。 + +- 表示层专注于计算图的表达与转换,通过高可扩展中间表示 PIR,为动转静(动态图转为静态图)、自动微分、自动并行、组合算子以及计算图优化等核心功能提供坚实支撑。 +- 调度层则负责对代码或计算图进行智能编排与高效调度,并且能够根据实际需求进行显存和内存的管理优化,支持动态图和静态图高效执行。无论开发者选择使用动态图还是静态图进行模型开发,飞桨框架都能提供高效的执行环境,同时确保资源利用的最优化。 +- 算子层由神经网络编译器 CINN 和算子库 PHI 共同构成,涵盖了张量定义、算子定义、算子自动融合和算子内核实现等关键功能。 +- 适配层则用于实现与底层芯片适配,包括设备管理、算子适配、通信适配以及编译接入等功能。 + +## 四、动静统一自动并行 + +### 4.1 动静统一 + +我们来回顾下飞桨框架所提供的静态图和动态图两种开发模式。这两种模式在模型组网阶段的代码是完全一致的,因此我们称之为动静统一的组网方式。然而,它们之间的主要差异体现在计算图的构建和执行过程中。在静态图开发模式下,一旦计算图被创建,它将保持不变。这意味着,在运行阶段,不能再根据输入的计算数据作为判断条件,来调整计算图。相反,在动态图开发模式下,每当输入新的数据批次时,计算图会动态地生成和执行。这种灵活性使得动态图模式在现代深度学习任务中备受欢迎。然而,尽管动态图模式具有诸多优势,但也存在一个问题:由于计算图会频繁地创建和执行,这使得对其进行优化变得相当困难。特别是在推理部署场景下,动态图模式往往难以摆脱对 Python 解释器的依赖进行部署。而 Python 解释器的引入,在某些场景下,比如对性能要求很高的大模型推理部署场景或者资源受限的端侧场景,可能会导致效率低下或无法使用。为了克服这一难题,飞桨研发了动静转换技术,通过简单的一行命令(to_static),便能够将动态图的代码轻松转换为静态图代码。 + +飞桨采用的技术方案是源代码到源代码的转换,即分析并转写动态图 Python 源代码,进而生成对应的静态图 Python 源代码;在获取源代码后,使用静态 Python 解释器来执行这段静态图代码,从而得到计算图表示。动静转换技术的核心挑战在于对 Python 语法的支持程度。通过实际测试,我们发现飞桨对 Python 语法的支持率高达 94%,飞桨的动静转换功能在整图导出任务的成功率高达 95%。飞桨框架的优势在于它同时兼容动态图和静态图两种开发模式。因此,在进行动静转换时,仅需实现从动态图 Python 源代码到静态图 Python 源代码的转换。这一转换过程可以通过 Python 解释器进一步增强对 Python 语法的支持,从而大大降低了实现的难度。在训练场景,针对那些无法进行动静转换的情况,例如 Python 代码中调用 Numpy 等第三方库时,这些库的函数调用无法直接转换为静态图表示。为了解决这一问题,飞桨创新性地研发了“自适应图构建机制”。当遇到不支持的语法时,该机制会被触发,自动断开这些部分,并利用前后相邻的图进行重新构建。通过采用这种方案,我们在训练场景中可以实现 100%的动静转换成功率,从而为编译器等计算图优化技术提供了更广阔的空间。更多关于动静转换的信息,请参考以下链接:[《动转静 SOT 原理及使用》](./sot_cn.md) + +### 4.2 自动并行 + +在大模型开发场景中,多维混合并行显得尤为重要。对于百亿甚至千亿规模的大模型,一般需要使用张量模型并行、流水并行、数据并行、分组参数切片并行的混合并行方式进行训练。然而,多维混合并行的开发过程往往相当复杂。以数据并行、张量模型并行和流水线并行为例,开发者必须精心处理计算、通信、调度等多元逻辑,才能编写出正确的混合并行代码,这无疑提高了开发的难度。为了解决这一难题,我们提出了动静统一的自动并行方案。自动并行,开发者只需要提供模型结构和集群,以及少量的标记信息,飞桨框架可以根据模型结构和集群信息自动寻找合适的分布式训练策略。我们来看一下对分布式标记(DistAttr)的定义。通过使用 ProcessMesh 将一个设备(比如一块 GPU 卡)映射为一个进程,将多个设备映射为多个进程组成的一维或多维数组,下图展示了由 8 个设备构成的两种不同 ProcessMesh 抽象表示。 + +
+ +
+ +然后通过使用 Placement 来表示张量在不同设备上的切分状态,分为 Replicate、Shard 和 Partial 这 3 种切分状态。如下图所示,Replicate 表示张量在不同设备上会以复制的形式存在;Shard 表示按照特定的维度在不同设备上进行切分;Partial 表示设备上的张量不完整,需要进行 Reduce Sum 或者 Reduce Mean 等不同方式的操作后,才能得到完整的状态。 + +
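As a rough mental model of the mesh and the three placement states described above, the following plain-Python sketch simulates devices with lists. It is an assumption-laden illustration, not the `paddle.distributed` API: the helpers `replicate`, `shard` and `reduce_sum` are hypothetical names, and the "tensor" is a one-dimensional list.

```python
def replicate(tensor, num_devices):
    """Replicate: every device holds a full copy of the tensor."""
    return [list(tensor) for _ in range(num_devices)]

def shard(tensor, num_devices):
    """Shard: split the tensor evenly along its (only) dimension."""
    assert len(tensor) % num_devices == 0, "this sketch assumes an even split"
    step = len(tensor) // num_devices
    return [tensor[d * step:(d + 1) * step] for d in range(num_devices)]

def reduce_sum(partials):
    """Partial -> complete: element-wise sum of the per-device values."""
    return [sum(vals) for vals in zip(*partials)]

# A 2 x 4 arrangement of 8 device ids, analogous to a two-dimensional mesh.
mesh = [[0, 1, 2, 3], [4, 5, 6, 7]]
devices_per_row = len(mesh[0])

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
w = [1.0] * 8

replicated = replicate(x, devices_per_row)
x_shards = shard(x, devices_per_row)
w_shards = shard(w, devices_per_row)

# Each device computes a dot product over its own shard only, so its result
# is "Partial": it is incomplete until the per-device values are reduce-summed.
partial_dots = [[sum(a * b for a, b in zip(xs, ws))]
                for xs, ws in zip(x_shards, w_shards)]
full_dot = reduce_sum(partial_dots)
```

The last step is why a Partial tensor needs a reduction: each simulated device only sees part of the contraction dimension.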
+ +
+ +在完成分布式标记抽象后,我们通过调用`paddle.distributed.shard_tensor()`接口,实现对张量切分的标记。通过张量切分的标记,我们可以表示复杂的分布式混合并行,下图展示了一个具体的数据并行、张量模型并行、流水线并行组成的混合并行的例子。 + +
+ +
+ +以下代码展示了混合并行的具体例子。 + +```python +import paddle +import paddle.distributed as dist +from paddle.io import BatchSampler, DataLoader, Dataset +import numpy as np +... +mesh0 = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=['x', 'y']) +mesh1 = dist.ProcessMesh([[4, 5], [6, 7]], dim_names=['x', 'y']) +... +class MlpModel(paddle.nn.Layer): + def __init__(self): + super(MlpModel, self).__init__() + # 张量切分标记 + self.w0 = dist.shard_tensor( + self.create_parameter(shape=[1024, 4096]), + mesh0, [dist.Replicate(), dist.Shard(1)]) + self.w1 = dist.shard_tensor( + self.create_parameter(shape=[4096, 1024]), + mesh1, [dist.Replicate(), dist.Shard(0)]) + + def forward(self, x): + # 张量切分标记 + dist.shard_tensor(x, mesh0, [dist.Shard(0), dist.Replicate()]) + y = paddle.matmul(x, self.w0) + # 张量重切分 + y = dist.reshard(y, mesh1, [dist.Shard(0), dist.Shard(2)]) + z = paddle.matmul(y, self.w1) + return z +... +# 创建模型 +model = MlpModel() +opt = paddle.optimizer.AdamW(...) +... +# 动转静训练 +dist_model, dist_loader = dist.to_static(model, opt, ...) +for step, data in enumerate(dist_loader()): + ... + loss = dist_model(data) + ... 
+``` + +我们以具体的 Llama 模型训练为例,动态图手动并行的开发方式,它要求开发者不仅要选择合适的并行策略,还必须精心设计通信逻辑;通过采用自动并行的开发方式,开发者无需再考虑复杂的通信逻辑。其分布式训练核心代码量减少了 50%,从而大大降低了开发的难度;从我们的一些实验可知,当前这种自动并行的性能优于动态图手动并行的性能。未来,我们将进一步探索无需使用张量切分标记的全自动并行,让开发者可以像写单机代码一样写分布式代码,进一步提升大模型的开发体验。更多关于自动并行的信息,请参考以下文档:[《动静统一自动并行》](./auto_parallel_cn.md) + +## 五、神经网络编译器自动优化 + +编译器(compiler)是一种计算机程序,负责将用某种编程语言编写的源代码(原始语言)转换成另一种编程语言(目标语言)。以高级编程语言编译器为例,如 gcc,它能将 C 语言代码转换成 CPU 可执行的机器指令。类似地,神经网络编译器,通常也被称为深度学习编译器,是深度学习领域特有的工具,用于将一种神经网络中间表示(IR)转换为另一种中间表示(IR)。例如,飞桨神经网络编译器 CINN,能够将神经网络中间表示转换为其他形式的中间表示,如 CUDA C 语言代码、SyCL 语言代码,或 LLVM IR。之后,利用芯片软件栈提供的编程语言编译器,比如英伟达的 NVCC(NVIDIA CUDA Compiler)编译器或 NVRTC(NVIDIA CUDA Runtime Compilation)运行时编译库,将这些中间表示进一步转换为可在英伟达 GPU 上运行的机器指令。 + +### 5.1 从 RMSNorm 说起 + +为什么在深度学习框架中需要引入编译器技术呢?让我们通过一个实例来阐释这一点。我们以 Llama 模型中经常使用的 RMS Normalization ([Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467))为例,其计算公式相对简单明了。 + +
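For reference, the RMSNorm formula from the cited paper can be written as below, where $\mathbf{a}$ is the input vector of length $n$ and $\mathbf{g}$ is the learnable gain. The symbols follow the paper; implementations commonly add a small $\epsilon$ under the square root for numerical stability, which is an addition to the formula as stated here.

$$
\bar{a}_i = \frac{a_i}{\operatorname{RMS}(\mathbf{a})}\, g_i,
\qquad
\operatorname{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} a_i^{2}}
$$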
+ +
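+
+Suppose we need to implement RMS Normalization. The simplest approach is to use the tensor operation APIs that the Paddle framework provides and call the squaring, summation, division, and square-root operations, as in the following code:
+
+```python
+import paddle
+
+class RMSNorm(paddle.nn.Layer):
+    def __init__(self):
+        super().__init__()
+        self.variance_epsilon = 1e-6
+        self.size = 768
+        self.weight = paddle.create_parameter(
+            shape=[self.size],
+            dtype=paddle.get_default_dtype(),
+            default_initializer=paddle.nn.initializer.Constant(1.0),
+        )
+
+    def forward(self, x):
+        variance = x.pow(2).mean(-1, keepdim=True)
+        x = paddle.rsqrt(variance + self.variance_epsilon) * x
+        return x * self.weight
+```
+
+The code above corresponds closely to the formula. The `weight` variable corresponds to `g` in the formula and `x` corresponds to `a`. The `pow` function performs the squaring, `mean` corresponds to the sum-and-average in the formula, and `rsqrt` computes the reciprocal of the square root. Writing code this way keeps it flexible and maintainable: developers write code much as they would write the formula, which lowers the cost of understanding and maintaining it. To adopt a new normalization strategy, a developer only needs to modify the code to match the new formula.
+
+Flexible as it is, this implementation has a serious problem: it runs slowly. The problem is especially pronounced for large models, where the amount of computation is huge and compute is expensive. The main cause is that every function call triggers one operator call in the underlying Paddle framework. An operator is the smallest unit of scheduling and execution in a deep learning framework; when it runs, it has to move data from GPU memory into registers, compute, and write the result back to GPU memory. These frequent memory reads and writes lower the compute density and, with limited memory bandwidth, slow the program down considerably.
+
+The simplest fix is to add an operator named RMSNorm together with a Python-level API named RMSNorm. This approach has been possible since Paddle 1.x, for example with the following code:
+
+```python
+class RMSNorm(paddle.nn.Layer):
+    def __init__(self):
+        super().__init__()
+        self.variance_epsilon = 1e-6
+        self.size = 768
+        self.weight = paddle.create_parameter(
+            shape=[self.size],
+            dtype=paddle.get_default_dtype(),
+            default_initializer=paddle.nn.initializer.Constant(1.0),
+        )
+
+    def forward(self, x):
+        return paddle.incubate.nn.functional.fused_rms_norm(
+            x=x,
+            norm_weight=self.weight,
+            norm_bias=None,
+            epsilon=self.variance_epsilon,
+            begin_norm_axis=2,
+        )
+```
+
+The code above calls `rms_norm` through the `fused_rms_norm` API. This change brings a significant performance gain and solves the slowness of the previous version. However, it also has several drawbacks.
+
+Most prominently, it raises the bar for developers considerably: they now need a solid understanding of core Paddle concepts such as tensors and operators, and must be familiar with the full workflow of developing and registering an operator. In addition, to write a high-performance reduce-sum, they need to be proficient in high-performance CUDA C programming and understand advanced concepts such as Shared Memory, Warp Divergence, and Bank Conflict.
+
+Second, this approach increases the number of framework APIs and makes them harder to use and maintain. Developer needs are usually highly varied, for example the many variants of normalization strategies, and adding a Python API for every specific operation makes the number of framework APIs grow quickly; Paddle already has close to 2,000 public APIs. Meanwhile, as the granularity of operator fusion grows, the number of parameters per API rises sharply; some fused-operator APIs take more than 30 parameters, which makes them hard to use and maintain.
+
+Third, this approach keeps increasing the number of operators in the framework's operator library, which in turn raises the cost of hardware adaptation. Even after the operators have been cleaned up and standardized, and without counting fused operators, the Paddle operator library still contains more than 800 operators, which makes hardware adaptation very challenging. The newly added operators have to be implemented in CUDA C, and running them on other kinds of hardware requires developing a corresponding version for each. Given how many kinds of hardware need to be supported, this greatly increases development cost.
+
+The following excerpt from the CUDA C implementation shows how much more complex the code becomes.
+
+```cpp
+    const ComputeType row_sum_square =
+        BlockAllReduce(thread_sum_square);
+
+    // use multiply instead of divide. Author(zhengzekang).
+    ComputeType row_rms = row_sum_square * col_divisor;
+    ComputeType row_inv_rms =
+        Rsqrt(row_rms + static_cast<ComputeType>(epsilon));
+    // save for backward
+    if (inv_var_data != nullptr) {
+      inv_var_data[row] = row_inv_rms;
+    }
+    for (int pack_id = tid; pack_id < num_packs; pack_id += block_size) {
+      ComputeType pack[kPackSize];
+#pragma unroll
+      for (int i = 0; i < kPackSize; ++i) {
+        pack[i] = static_cast<ComputeType>(buf[i * num_packs + pack_id]) *
+                  row_inv_rms;
+      }
+      store.template store(pack, row, pack_id * kPackSize);
+    }
+```
+
+With neural network compiler technology, we can achieve a significant performance gain while keeping the same flexibility and ease of use. The RMSNorm operator performance results on the A100 platform below illustrate this: compared with the implementation composed from Python APIs, the compiler-optimized operator improves running speed by 4x; even compared with manual operator fusion, it is 14% faster. This shows the balance between flexibility and performance that the Paddle framework is able to strike.
+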
+ +
+ +### 5.2 飞桨神经网络编译器 CINN + +飞桨神经网络编译器 CINN 采用了与框架一体化的设计,其基础设施是基于飞桨的高扩展中间表示 PIR。这一设计使得 CINN 能够同时支持训练和推理过程,并且具备处理动态可变形状输入的能力。在生成式大语言模型 Llama 和文生图模型 Stable Diffusion 上的实验结果显示,通过使用编译器的优化技术,相较于未采用手动性能优化的基础版本,推理速度分别实现了 36%和 30%的提升。那么,编译器究竟是如何实现深度学习任务的加速呢?以下,我们将通过一个由 Add 和 Relu 算子组成的例子来具体展示这一过程。 + +
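To make the Add and Relu example concrete before the walkthrough, here is a plain-Python sketch of what fusing the two operators means. It is only an illustration and not CINN output: the function names are hypothetical, and Python lists stand in for tensors in device memory.

```python
def add_then_relu_unfused(a, b):
    """Two separate 'operators': the intermediate list stands in for a
    tensor that is written to device memory and then read back."""
    tmp = [x + y for x, y in zip(a, b)]   # operator 1: Add
    return [max(t, 0.0) for t in tmp]     # operator 2: Relu

def add_relu_fused(a, b):
    """One fused 'kernel': each element is added and clamped in a single
    pass, so no intermediate tensor is materialized."""
    return [max(x + y, 0.0) for x, y in zip(a, b)]

a = [1.0, -2.0, 3.0]
b = [-0.5, 1.0, -4.0]
unfused = add_then_relu_unfused(a, b)
fused = add_relu_fused(a, b)
```

Both versions compute the same result; the fused one simply avoids the round trip through the intermediate buffer, which is the kind of saving the compiler aims for automatically.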
+ +
+ +首先,该过程会利用组合算子机制,将原始的计算图拆解为由一系列基础算子构成的计算图,并在此过程中详细记录算子输入输出张量之间的形状关系,以确保其能够适应动态形状张量的复杂情况。随后,在神经网络编译器的前端部分,编译器会进行智能判断,识别出哪些基础算子具备融合潜力。对于这些可融合的基础算子,编译器会进一步调用基础的 Compute 函数,巧妙地将它们降级为由抽象语法树(AST)构成的低层中间表示(IR)。接下来,在神经网络编译器的后端部分,这些中间表示会被进一步精心转换成具体的代码实现,这既可能是 CUDA C 代码,也可能是 LLVM IR 代码,具体取决于目标平台的需求。最终,利用 NVCC 编译器或 LLVM 编译器,将这些代码转换成能够在芯片上高效运行的可执行代码,从而实现深度学习任务的显著加速。 + +更多关于神经网络编译器的信息,请参考文档[《神经网络编译器》](./cinn_cn.md)。 + +## 六、高阶自动微分 + +深度学习模型的训练过程,核心在于利用随机梯度下降(SGD)等优化算法来更新模型参数。在此过程中,深度学习框架的自动微分功能扮演着至关重要的角色,它基于链式法则自动计算出损失函数相对于模型参数的梯度。尽管在大多数深度学习任务中,仅需计算一阶导数,但在某些“AI for Science”的应用场景中,却需要计算高阶导数,这无疑大大增加了自动微分的复杂性。以 2D 矩形平板分布受载问题为例,该问题的内在机理需借助 4 阶微分方程来描述。因此,为了求解这类问题,深度学习框架必须提供高阶自动微分功能。然而,实现高阶自动微分面临着诸多挑战。具体来说,框架需要为每个算子编写高阶微分规则,而随着阶数的增加,这些微分规则的复杂性也随之上升。当阶数达到三阶或更高时,编写这些规则不仅变得极其困难,而且其正确性也难以保证。为了解决这一难题,我们提出了基于基础算子组合的高阶自动微分技术。该技术的核心思想是将复杂算子(如 log_softmax)拆解为多个基础算子的组合,然后对这些基础算子进行一阶自动微分变换。重要的是,基础算子经过一阶自动微分变换后,其所得的计算图仍然由基础算子构成。通过反复应用一阶自动微分规则,我们可以轻松地获得高阶自动微分的结果。 + +
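The idea of obtaining higher orders by reapplying a first-order rule can be sketched for tanh in plain Python. This is a toy model under stated assumptions, not Paddle's implementation: every derivative of tanh is a polynomial in t = tanh(x), so polynomials in t play the role of "graphs of primitive operators", and the hypothetical helper `differentiate` is the single first-order rule d/dx P(t) = P'(t) * (1 - t^2).

```python
import math

def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest power first)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def poly_deriv(p):
    """Derivative of a polynomial with respect to its own variable."""
    return [i * c for i, c in enumerate(p)][1:] or [0.0]

def differentiate(p):
    """One first-order step: d/dx P(tanh x) = P'(t) * (1 - t^2).

    The result is again a polynomial in t, so the same rule can be reapplied.
    """
    return poly_mul(poly_deriv(p), [1.0, 0.0, -1.0])

def evaluate(p, t):
    return sum(c * t ** i for i, c in enumerate(p))

def tanh_derivative(x, order):
    """Derivative of tanh of the given order, by repeating the first-order rule."""
    p = [0.0, 1.0]  # P(t) = t, i.e. tanh(x) itself
    for _ in range(order):
        p = differentiate(p)
    return evaluate(p, math.tanh(x))
```

No second-order or third-order rule is written by hand here; calling `tanh_derivative(x, 3)` just applies the same first-order step three times.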
+ +
+ +
+ +
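+
+To fully support higher-order automatic differentiation, PaddlePaddle designed and implemented a composite-operator mechanism. The mechanism works in both dynamic-graph and static-graph modes: in dynamic-graph mode it supports flexible decomposition for (N+1)-th order differentiation, and in static-graph mode it enables efficient compiler fusion. The composition rules are unified across the two modes, so one set of rules is reused in both and duplicate development is avoided. The basic operator set takes Tensor as its core operand and is built for atomicity, practicality, and completeness. Custom backward operations and automatic recomputation are also provided, which improve model accuracy and reduce GPU memory usage.
+
+Building on this work, PaddlePaddle has started exploring AI for Science. To meet the varied needs of AI for Science tasks, the framework implements higher-order automatic differentiation based on composite operators and provides development interfaces dedicated to scientific computing. Higher-order optimizers such as LBFGS are implemented as well. At the model level, we developed toolkits such as PaddleScience and PaddleHelix. PaddlePaddle has been adapted to mainstream open-source scientific computing tools such as DeepXDE and Modulus, and is the default recommended backend of DeepXDE, a widely used deep learning library for scientific computing. While adapting Modulus together with NVIDIA, we used the framework's higher-order automatic differentiation and compiler optimization to speed up equation-solving models: compared with the solving speed of Modulus' existing backend, the average improvement is 71%. We implemented data-driven, mechanism-driven, and hybrid methods such as physics-informed neural networks (PINN) and Fourier neural operators (FNO). These methods have broad potential in aerospace, automotive and shipbuilding, meteorology and oceanography, life sciences, and other fields.
+
+For more on higher-order automatic differentiation and AI for Science, see [Higher-Order Automatic Differentiation](./higher_order_ad_cn.md).
+
+## 7. Highly extensible intermediate representation (PIR)
+
+After dynamic-to-static conversion produces a computation graph, the graph still has to go through a series of transformations such as automatic differentiation, distributed transformation, and compiler acceleration. These transformations need a highly extensible intermediate representation: PIR (Paddle Intermediate Representation). PIR offers flexible basic components and supports defining Operation, Value, Attribute, and similar elements, which makes it easy to extend. Dialect is the core building block of PIR. A Dialect is similar to a formal language that describes a relatively complete system, and developers can define custom Dialects as needed, which greatly improves the framework's extensibility. The Dialects cover distributed training, the compiler, dynamic shape inference, control flow, and more. PIR follows the SSA (Static Single Assignment) principle and unifies the top-level structure, so that "operator ordering" and "computation-graph semantics" are represented compatibly. PIR also provides a simpler, lower-cost Pass development system with a rich set of built-in Pass strategies, which supports performance optimization of large models. Two mechanisms, DRR and Pattern Rewriter, are available for transforming the IR. To validate PIR, we compared inference speed on more than 900 model configurations. With PIR, 25% of the models ran more than 30% faster and 60% ran more than 10% faster; overall inference performance improved by more than 10%. The gain mainly comes from selecting kernels statically in advance, which lowers scheduling cost. Constant folding and the Inplace Pass also apply to more cases. With PIR, training and inference share a single representation. For more on PIR, see [PIR Basic Concepts and Development](./paddle_ir_cn.md).
+
+## 8. Multi-hardware adaptation
+
+A key challenge for a deep learning framework is adapting efficiently to different kinds of hardware. PaddlePaddle provides several integration paths so that the adaptation needs of different chips can be met flexibly, which improves application performance while keeping hardware compatibility broad. For large-model scenarios in particular, PaddlePaddle provides a standardized hardware adaptation interface: adapting just over 30 interfaces is enough to support the full large-model workflow of training, compression, and inference. The basic operator set reduces the number of operators a vendor has to develop; operator fusion and memory reuse are supported for large-model performance optimization; and adaptation through the neural network compiler's CodeGen backend is supported, which brings automatic operator fusion and performance optimization.
+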
+ +
+
+Building on the technologies above, PaddlePaddle works with chip vendors to build a hardware ecosystem in three stages. The first is "gathering together": we launched the PaddlePaddle hardware ecosystem jointly with several chip vendors. The second is "joint research": co-optimizing software and hardware with the vendors. The third is "co-creation": deeper cooperation to grow the ecosystem. So far we have released PaddlePaddle ecosystem distributions jointly with 22 hardware vendors, and the ecosystem has more than 40 member organizations covering the mainstream hardware vendors, which gives users a wide range of hardware choices.
+
+## 9. Getting started
+
+You are welcome to try the PaddlePaddle 3.0-Beta release and send us feedback. Before you start, make sure the develop version of PaddlePaddle is installed. The example below, built from a matrix multiplication and a Softmax, shows how the new framework provides unified dynamic-static automatic parallelism and automatic compiler optimization:
+
+```python
+# test_demo.py
+import paddle
+import paddle.distributed as dist
+import paddle.nn.functional as F
+from paddle.io import Dataset, DataLoader
+import numpy as np
+
+# A 1-D process mesh over two devices
+mesh = dist.ProcessMesh([0, 1], dim_names=["x"])
+
+class DemoDataset(Dataset):
+    def __init__(self, num_samples):
+        self.num_samples = num_samples
+
+    def __getitem__(self, idx):
+        return np.array([[1., 2.], [3., 4.],[5., 6.]]).astype('float32'), np.array([1.])
+
+    def __len__(self):
+        return self.num_samples
+
+class DemoLayer(paddle.nn.Layer):
+    def __init__(self):
+        super(DemoLayer, self).__init__()
+        # Shard the weight along its columns (dim 1) across the mesh
+        self.w = dist.shard_tensor(
+            paddle.create_parameter(shape=[2, 4], dtype='float32'),
+            mesh, [dist.Shard(1)])
+        self.b = paddle.to_tensor([0.1, 0.2, 0.3, 0.4])
+
+    def forward(self, x):
+        y = paddle.matmul(x, self.w)
+        z = F.softmax(y + self.b)
+        return z
+
+dataset = DemoDataset(10)
+loader = DataLoader(dataset, batch_size=1)
+
+def loss_fn(logits, label):
+    loss = paddle.nn.MSELoss(reduction="sum")
+    logits = paddle.sum(logits, axis=[1, 2])
+    return loss(logits, label)
+
+layer = DemoLayer()
+dist_layer = dist.to_static(layer, loader, loss_fn)
+
+dist_layer.eval()
+for data in loader():
+    loss = dist_layer(data[0], data[1])
+    print('loss', loss, flush=True)
+```
+
+Some features are still under development, so to avoid surprising users the highly extensible intermediate representation (PIR) and automatic compiler optimization are not enabled by default. Before running the example, set the following environment variables to turn the new features on:
+
+```bash
+# Enable composite operators
+export FLAGS_prim_enable_dynamic=true && export FLAGS_prim_all=true
+
+# Enable the CINN compiler flags
+export FLAGS_use_cinn=true
+export FLAGS_cinn_new_group_scheduler=true
+export FLAGS_group_schedule_tiling_first=true
+export FLAGS_cinn_bucket_compile=true
+
+# Enable PIR mode
+export FLAGS_enable_pir_api=true
+
+# Whether to print the Program IR
+export FLAGS_print_ir=false
+
+# Run command
+# python -u -m paddle.distributed.launch --gpus "0,1" test_demo.py
+```
+
+Once the environment variables are set, PaddlePaddle can be used as usual. The figure below shows how the example above is executed:
+
+ +
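+
+When writing dynamic-graph code, developers mark how a tensor is sharded with the `shard_tensor` distributed API. In this example the matmul parameter `w` is sharded by column.
+
+Step 1: dynamic-to-static conversion turns the dynamic-graph code into static-graph code and yields a static-graph intermediate representation.
+
+Step 2: sharding inference rules convert the static-graph IR into a distributed IR automatically. In this step the sharding marks of some tensors change, and the distributed communication operator `allgather` is inserted automatically.
+
+Step 3: the composite-operator mechanism decomposes complex operators in the graph into finer-grained basic operators. For example, `softmax` is decomposed into `max`, `subtract`, `exp`, `sum`, and `divide`, which makes later performance optimization easier.
+
+Step 4: the compiler fuses these basic operators automatically and generates high-performance kernel code.
+
+As the example shows, with the new framework developers only need a few tensor sharding marks and no distributed communication logic to train large models in a distributed way, and performance is optimized automatically without hand-written high-performance kernels.
diff --git a/docs/guides/paddle_v3_features/paddle_ir_cn.md b/docs/guides/paddle_v3_features/paddle_ir_cn.md
new file mode 100644
index 00000000000..8f5616e2b57
--- /dev/null
+++ b/docs/guides/paddle_v3_features/paddle_ir_cn.md
@@ -0,0 +1,127 @@
+# PIR basic concepts and development
+
+In version 3.0, PaddlePaddle introduces a new-generation intermediate representation based on the MLIR paradigm: Paddle IR (PIR for short). It systematically abstracts low-level core concepts such as Operation and Attribute and gives developers flexible basic components. By introducing the Dialect concept, PaddlePaddle can manage the IR needs of each framework module in a complete, layered way, and developers can define custom Dialects as needed, which greatly improves the framework's extensibility. PIR follows the SSA (Static Single Assignment) principle and unifies the top-level structure, so that "operator ordering" and "computation-graph semantics" are represented compatibly. PIR also provides a simpler, lower-cost Pass development system with a rich set of built-in Pass strategies, which supports performance optimization of large models.
+
+## 1. Basic concepts
+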
+ +
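+
+In deep learning framework IRs, "ordering" and "graph semantics" are two very common concepts. The old IR system was carried by two core classes: ProgramDesc for ordering and Graph for graph semantics. Under the static-graph API or the dynamic-to-static module, the IR produced is an op-by-op Program. To apply higher-level optimizations (such as operator fusion, inplace strategies, or pruning), the framework builds a Graph from the Program; the Graph consists of data nodes, operator nodes, and the edges connecting them.
+In the new Paddle IR, the bottom layer abstracts a set of highly extensible basic components, including Type, Attribute, Op, Trait, and Interface, and introduces the Dialect concept so that developers can extend and customize freely with complete and robust semantic expressiveness. At the model representation layer, multiple Dialects are managed as modules, which unifies representation across targets, gives training and inference a single architecture-wide representation, connects seamlessly with composite operators and the compiler, and supports automatic optimization and multi-hardware adaptation. At the graph transformation layer, unified low-level modules and simplified basic concepts give users a Pass mechanism that is cheap to develop, easy to use, high-performing, and pluggable.
+The new IR follows the SSA (static single assignment) principle, so a model is equivalent to a directed acyclic graph. The computation graph is abstracted with Value and Operation: an Operation is a node and a Value is an edge.
+
+* An Operation is a node in the computation graph. One Operation represents one operator and contains zero or more Regions; a Region represents a closure and contains zero or more Blocks; a Block is an SSA-conforming basic block and contains zero or more Operations. The three nest recursively and can express arbitrarily complex syntactic structures.
+* A Value is a directed edge in the computation graph. It connects two Operations and describes the UD chain (Use-Define chain) of the program. OpResult is the defining end, which defines a Value; OpOperand is the using end, which describes one use of a Value.
+
+## 2. Design motivation
+The computation-graph intermediate representation (IR) is a foundation for performance optimization, inference deployment, and compilers in a deep learning framework. In recent years, more and more frameworks and researchers have brought compiler techniques into neural network optimization, using compiler concepts, techniques, and tools for automatic optimization and code generation. Historically, PaddlePaddle had several different IR systems coexisting at the architecture level. Their expressiveness differed, Passes were costly to develop and maintain, code reuse was poor, there was no unified specification, and framework stability suffered seriously.
+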
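The Operation / Region / Block nesting and the use-define chain carried by Value can be sketched with a few plain Python classes. This is only a toy model under stated assumptions: the class and field names (`Value`, `Operation`, `Block`, `Region`, `users`, `build_op`) are made up for illustration and are not PIR's actual C++ API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False, repr=False)
class Value:
    """An SSA value: defined once, used any number of times."""
    name: str
    defining_op: Optional["Operation"] = None                # the "define" end
    users: List["Operation"] = field(default_factory=list)   # the "use" ends


@dataclass(eq=False, repr=False)
class Block:
    """A basic block holding zero or more operations in program order."""
    ops: List["Operation"] = field(default_factory=list)


@dataclass(eq=False, repr=False)
class Region:
    """A region holding zero or more blocks."""
    blocks: List[Block] = field(default_factory=list)


@dataclass(eq=False, repr=False)
class Operation:
    """A graph node; nested regions allow control-flow-like structure."""
    name: str
    operands: List[Value] = field(default_factory=list)
    results: List[Value] = field(default_factory=list)
    regions: List[Region] = field(default_factory=list)


def build_op(block, name, operands, result_names):
    """Append an op to `block` and wire up both ends of the use-define chain."""
    op = Operation(name=name, operands=list(operands))
    for value in operands:
        value.users.append(op)
    for result_name in result_names:
        op.results.append(Value(name=result_name, defining_op=op))
    block.ops.append(op)
    return op


def build_add_relu():
    """Build the toy graph relu(add(x, y)) inside a single block."""
    block = Block()
    x = build_op(block, "data", [], ["%0"]).results[0]
    y = build_op(block, "data", [], ["%1"]).results[0]
    z = build_op(block, "add", [x, y], ["%2"]).results[0]
    build_op(block, "relu", [z], ["%3"])
    return block


block = build_add_relu()
for op in block.ops:
    print(op.name, [v.name for v in op.operands], "->", [v.name for v in op.results])
```

Because every `Value` records its single defining op and all of its users, a transformation can walk the graph in either direction without rebuilding a separate graph structure from an op list.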
+ +
+
+
+Therefore, in version 3.0 PaddlePaddle standardizes the IR definition at the infrastructure level, giving the whole architecture one unified representation so that upstream and downstream directions share development results:
++ **Inference deployment**: simplifies the abstract computation graph, resolves the cyclic-graph problem, and lowers Pass development cost
++ **Distributed**: operators are managed through multiple Dialects, with flexible marking of distributed attributes
++ **Compiler**: strictly follows the SSA principle and supports robust compiler optimization
+
+
+The new IR architecture focuses on two dimensions, flexibility and extensibility. With more complete and robust semantic expressiveness, a unified representation across training and inference, and an efficient pluggable mechanism for developing optimization Passes, it supports complex semantics, makes it easier to support the rich sharding strategies of automatic parallelism for large models, and connects seamlessly to the neural network compiler for automatic performance optimization and multi-hardware adaptation.
+
+## 3. Usage guide
+
+The new IR is an infrastructure-level upgrade and is transparent at the API level. Users can keep their existing dynamic-to-static code (`paddle.jit.to_static`) or static-graph code unchanged; in 3.0-Beta the new IR is enabled simply by setting `export FLAGS_enable_pir_api=1`. A simple example follows.
+
+```python
+# test_add_relu.py
+
+import unittest
+import numpy as np
+import paddle
+from paddle.static import InputSpec
+
+class SimpleNet(paddle.nn.Layer):
+    def __init__(self):
+        super().__init__()
+
+
+    def forward(self, x, y):
+        z = x + y
+        out = paddle.nn.functional.relu(z)
+        return out
+# Step 1: build the model and apply dynamic-to-static conversion
+specs = [InputSpec(shape=(-1, -1)), InputSpec(shape=(-1, -1))]
+net = paddle.jit.to_static(SimpleNet(), specs)
+
+# Step 2: prepare the inputs and run forward
+x = paddle.rand(shape=[16, 64], dtype=paddle.float32)
+y = paddle.rand(shape=[16, 64], dtype=paddle.float32)
+out = net(x, y)
+print(out)
+```
+
+Save the file as test_add_relu.py and run `FLAGS_enable_pir_api=1 python test_add_relu.py`. Developers can additionally set GLOG_v=6 to print logs and inspect the Program under the new IR, as shown below. In dynamic-to-static or static-graph mode, the user code first goes through the network-building API and produces a computation graph in the Operator Dialect. At execution time PaddlePaddle lowers it to the Kernel Dialect for the given hardware, and the executor then schedules the corresponding kernels in the PHI operator library one by one to compute the final result.
+
+```python
+{ // Operator Dialect
+    (%0) = "pd_op.data" () {dtype:(pd_op.DataType)float32,name:"x",place:(pd_op.Place)Place(undefined:0),shape:(pd_op.IntArray)[-1,-1],stop_gradient:[true]} : () -> builtin.tensor<-1x-1xf32>
+    (%1) = "pd_op.data" () {dtype:(pd_op.DataType)float32,name:"y",place:(pd_op.Place)Place(undefined:0),shape:(pd_op.IntArray)[-1,-1],stop_gradient:[true]} : () -> builtin.tensor<-1x-1xf32>
+    (%2) = "pd_op.add" (%0, %1) {stop_gradient:[true]} : (builtin.tensor<-1x-1xf32>, builtin.tensor<-1x-1xf32>) -> builtin.tensor<-1x-1xf32>
+    (%3) = "pd_op.relu" (%2) {stop_gradient:[true]} : (builtin.tensor<-1x-1xf32>) -> builtin.tensor<-1x-1xf32>
+    () = "builtin.shadow_output" (%3) {output_name:"output_0"} : (builtin.tensor<-1x-1xf32>) ->
+}
+
+// IR after lowering
+{ // Kernel Dialect
+    (%0) = "data(phi_kernel)" () {dtype:(pd_op.DataType)float32,kernel_key:,kernel_name:"data",name:"x",op_name:"pd_op.data",place:(pd_op.Place)Place(undefined:0),shape:(pd_op.IntArray)[-1,-1],stop_gradient:[true]} : () -> undefined_tensor<-1x-1xf32>
+    (%1) = "shadow_feed(phi_kernel)" (%0) {kernel_key:,kernel_name:"shadow_feed",op_name:"pd_op.shadow_feed"} : (undefined_tensor<-1x-1xf32>) -> gpu_tensor<-1x-1xf32>
+    (%2) = "data(phi_kernel)" () {dtype:(pd_op.DataType)float32,kernel_key:,kernel_name:"data",name:"y",op_name:"pd_op.data",place:(pd_op.Place)Place(undefined:0),shape:(pd_op.IntArray)[-1,-1],stop_gradient:[true]} : () -> undefined_tensor<-1x-1xf32>
+    (%3) = "shadow_feed(phi_kernel)" (%2) {kernel_key:,kernel_name:"shadow_feed",op_name:"pd_op.shadow_feed"} : (undefined_tensor<-1x-1xf32>) -> gpu_tensor<-1x-1xf32>
+    (%4) = "add(phi_kernel)" (%1, %3) {kernel_key:,kernel_name:"add",op_name:"pd_op.add",stop_gradient:[true]} : (gpu_tensor<-1x-1xf32>, gpu_tensor<-1x-1xf32>) -> gpu_tensor<-1x-1xf32>
+    (%5) = "relu(phi_kernel)" (%4) {kernel_key:,kernel_name:"relu",op_name:"pd_op.relu",stop_gradient:[true]} : (gpu_tensor<-1x-1xf32>) -> gpu_tensor<-1x-1xf32>
+    () = "builtin.shadow_output" (%5) {output_name:"output_0"} : (gpu_tensor<-1x-1xf32>) ->
+}
+```
+
+## 4. Architecture
+Large-model scenarios place new demands on the flexibility, extensibility, and completeness of a framework's intermediate representation. PaddlePaddle abstracts the core structures, introduces the Dialect concept to modularize multiple Dialects, and provides Pass optimization strategies that are easy to use, high-performing, cheap to develop, and pluggable. It connects to the AI compiler, adapts to heterogeneous hardware, and targets faster training and inference for large models.
+
+ +
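+
+
+As shown on the left of the figure above, the new IR is designed in three layers, from bottom to top:
+### 1. Flexible basic components
+PaddlePaddle provides two mechanisms, Trait and Interface, to mark the characteristics and interfaces of an operator (Op) abstractly. For example, InplaceTrait means that an Op has the Inplace characteristic, and InferShapeInterface means that an operator defines the InferShape function interface. Both can be extended freely: a new one only has to derive from the corresponding base class and follow the implementation rules. The core concepts of the operator system are abstracted as Type, Attribute, and Op, which are defined on top of Trait and Interface and are associated with the Traits and Interfaces they own. A Dialect manages Types, Attributes, and Ops as a module, for example BuiltinDialect, PaddleDialect, and CinnDialect. A Dialect contains the definitions of a set of Types, Attributes, and Ops, and each Type, Attribute, and Op is defined in exactly one Dialect. For the IR framework as a whole, Dialects are pluggable and freely extensible.
+
+This layer is what lets the IR adapt to different scenarios, and every element in it can be customized. In general, a specific scenario such as distributed training or the compiler defines the Traits and Interfaces it needs, then defines its own Dialect, and inside that Dialect defines the Types, Attributes, and Ops it needs.
+
+### 2. Multi-level Dialects
+
+PaddlePaddle uses Dialects at different levels to manage the operator systems of different domains in the framework: the built-in Shape Dialect and Control Flow Dialect, used for symbolic shape inference and control-flow representation respectively; the Operator Dialect and Kernel Dialect, tied to the execution system of the PHI operator library; the CINN Dialect for the neural network compiler; and so on. The neural network compiler mainly takes the Operator Dialect computation graph as input. After composite-operator decomposition and the Pass pipeline, the graph is converted to the CINN Dialect with symbolic information from the Shape Dialect attached, and is finally lowered to the compiler's AST IR.
+The Ops in these multi-level Dialects form a Program, which represents a concrete model. A Program has two parts: the computation graph and the weights.
+* Value and Operation abstract the computation graph. A Value is a directed edge that connects two Operations and describes the UD chain of the program; an Operation is a node. One Operation represents one operator and contains zero or more Regions. A Region represents a closure and contains zero or more Blocks. A Block is an SSA-conforming basic block and contains zero or more Operations. The three nest recursively and can express arbitrarily complex syntactic structures.
+* Weight stores the model's weight parameters separately, which is where a deep learning framework differs from a traditional compiler. A traditional compiler embeds the data segment in the program, because data and code are tightly bound and cannot be separated. In a neural network, however, a computation graph has one copy of the weights per epoch, and several computation graphs may share one copy of the weights, so the two are not tightly bound.
+
+### 3. A full-featured Pass system
+
+The core of a Pass is subgraph matching and replacement (graph transformation): it turns one Program into a new Program according to some rule. The IR holds global information about the computation graph, such as the adjacency of upstream and downstream operators, which helps graph optimizations such as constant folding, operator fusion, and Inplace strategies:
+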
+ +
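+
+PaddlePaddle ships a set of general, configurable Passes for graph optimization, memory optimization, quantization, and more. It also simplifies the basic concepts and offers two Pass development paradigms, Pattern Rewriter and Declarative Rewrite Rule (DRR), which balance flexibility of customization with ease of development and greatly lower the entry barrier and the amount of code needed to write a Pass.
+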
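To illustrate what "subgraph matching and replacement" means, here is a minimal sketch of a fusion pass over a toy op list. It is not the Pattern Rewriter or DRR API; the `Op` tuple, the `fuse_add_relu` function, and the `fused_add_relu` op name are all hypothetical and exist only for this example.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    """A toy SSA operation: one output value, any number of input values."""
    name: str
    inputs: Tuple[str, ...]
    output: str


def count_uses(program: List[Op], value: str) -> int:
    """Number of times `value` appears as an input anywhere in the program."""
    return sum(op.inputs.count(value) for op in program)


def fuse_add_relu(program: List[Op]) -> List[Op]:
    """Rewrite relu(add(a, b)) into a single fused_add_relu(a, b) op.

    The match only fires when the add result has exactly one user;
    otherwise removing the add would break its other users.
    """
    producers = {op.output: op for op in program}
    fused_away = set()
    rewritten = []
    for op in program:
        src = producers.get(op.inputs[0]) if op.name == "relu" and op.inputs else None
        if src is not None and src.name == "add" and count_uses(program, src.output) == 1:
            fused_away.add(src.output)
            rewritten.append(Op("fused_add_relu", src.inputs, op.output))
        else:
            rewritten.append(op)
    # Drop the add ops whose results were absorbed into a fused op.
    return [op for op in rewritten if op.output not in fused_away]


program = [
    Op("data", (), "%0"),
    Op("data", (), "%1"),
    Op("add", ("%0", "%1"), "%2"),
    Op("relu", ("%2",), "%3"),
]
for op in fuse_add_relu(program):
    print(op)
```

The single-user check is the kind of global adjacency information the text refers to: a pass can only delete the matched `add` safely because the IR tells it nobody else reads `%2`.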
+ +
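+
+## 5. References
+1. [[Design] IR low-level basic type system design document](https://github.com/PaddlePaddle/community/blob/master/pfcc/paddle-code-reading/IR_Dialect/basic_concepts.md)
+2. [[Design] IR top-level model structure representation design document](https://github.com/PaddlePaddle/community/blob/master/pfcc/paddle-code-reading/IR_Dialect/ir_program.md)
+3. [[Design] Control flow design document](https://github.com/PaddlePaddle/community/blob/master/pfcc/paddle-code-reading/IR_Dialect/control_flow.md)
diff --git a/docs/guides/paddle_v3_features/sot_cn.md b/docs/guides/paddle_v3_features/sot_cn.md
new file mode 100644
index 00000000000..11ce7043505
--- /dev/null
+++ b/docs/guides/paddle_v3_features/sot_cn.md
@@ -0,0 +1,237 @@
+# Dynamic-to-static SOT: principles and usage
+
+## **1. Background and motivation**
+
+Dynamic graphs and static graphs each have strengths and weaknesses in performance, user experience, and secondary development, so unifying the two at the architecture level while giving users a consistent experience is very challenging. PaddlePaddle starts from the user experience, focuses on training and inference, follows the needs of the large-model era, and addresses user pain points in performance, deployment, and large-model customization. As the technology evolved, however, the traditional AST-transformation approach to dynamic-static unification could no longer handle increasingly complex user models. Transcription fails on some uncommon syntax and in highly dynamic scenarios, and once transcription fails, none of the later static-graph optimizations can be applied. AST transcription is high-level and easy to carry out, but because Python is a purely dynamic language and Paddle's static data representation is limited, the current AST approach has the following limitations:
+
+- **Mixed dynamic and static code is hard to handle**, for example conversion back and forth between numpy and tensor (see the sample code);
+
+- **Mixing control flow with containers has edge cases**, often leading to parse errors or incomplete representation;
+
+- **Encrypted-source scenarios are not supported, so completeness has an upper bound**. For example, Python code executed from the C++ side cannot be transcribed by AST, and `.pyc` files cannot be processed (as in some encryption scenarios).
+
+
+A classic scenario seen in toolkits and user code:
+
+```python
+# A simple case:
+import numpy as np
+import paddle
+
+@paddle.jit.to_static()
+def unsupport_func(x):
+    x = 2 * x
+    t = x.numpy()  # t depends on the value of x, i.e. on the result of running the static graph
+    t = np.ones(t)
+    return paddle.to_tensor(t)
+
+x = paddle.to_tensor([2])
+unsupport_func(x)  # raise error
+```
+
+Because of dynamic-to-static conversion, `x` and `t` above are actually of type Variable. Passed into `np.ones`, they cannot provide a real value, so the numpy interface raises an error. This is a case that AST-based conversion cannot solve even in theory, because AST requires that the transcribed function can be represented **as a whole graph** in static-graph IR.
+
+Such long-tail cases can be avoided by asking users to follow recommended practices, but they keep appearing, because users do not want to think about dynamic-to-static conversion while writing dynamic-graph code.
+
+To bridge the large gap between the highly flexible Python language and the framework's intermediate representation, PaddlePaddle introduced a bytecode-level symbolic simulation mechanism into the dynamic-to-static module: Symbolic OpCode Translator (SOT). SOT analyzes and simulates dynamic-graph model code at the bytecode level, extracts the static network-building code dynamically, and builds a new, equivalent Python function, which removes the gap between language and representation and achieves an equivalent dynamic-to-static transcription. The bytecode simulation can also trigger subgraph-level breaks (Graph Break) adaptively, so that control-flow code keeps running as a dynamic graph.
+
+## 2. Overview
+
+### **Adaptive graph-break mechanism:**
+
+The SOT approach introduces an adaptive graph-break mechanism to reach a theoretical dynamic-to-static success rate of 100%. The old AST approach transcribes the whole graph from source code, and the whole-graph transcription fails as soon as an Op cannot be made static. SOT first upgrades source-code transcription to bytecode transcription. When an Op cannot be made static, the whole graph is split into subgraphs that are **stitched together with bytecode**, so the transcription succeeds. With adaptive graph breaks, users can write dynamic-graph code more freely and still get dynamic-to-static conversion and compiler acceleration at the subgraph level.
+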
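The effect of a graph break can be sketched as splitting a linear sequence of operations into alternating static and dynamic segments. This is only an illustration under strong assumptions: the real SOT makes this decision while simulating bytecode, and the `UNSUPPORTED` set and the operation names below are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

# Hypothetical operations that cannot be captured in a static graph.
UNSUPPORTED = {"tensor.numpy", "np.sum", "print"}


def split_into_segments(ops: List[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive ops into ("static", [...]) or ("dynamic", [...]) runs."""

    def kind(op: str) -> str:
        return "dynamic" if op in UNSUPPORTED else "static"

    return [(k, list(group)) for k, group in groupby(ops, key=kind)]


# A forward pass that mixes framework ops with numpy calls.
forward_ops = ["linear", "add", "tensor.numpy", "np.sum", "to_tensor"]
for segment_kind, segment in split_into_segments(forward_ops):
    print(segment_kind, segment)
```

Each "static" run corresponds to a subgraph that can be converted and compiled, while each "dynamic" run keeps executing eagerly between the subgraphs.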

+ +

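+
+
+### Execution flow:
+
+In the SOT flow, dynamic-to-static analysis happens at the bytecode level. SOT first uses a registered Python EvalFrame hook to obtain the bytecode of the user function at run time together with the PyFrame context (local variables, arguments, and so on). It then runs its own **bytecode simulator** over the bytecode and finally produces a new PyCodeObject that replaces the original bytecode. The simulator separates the bytecode that should be made static from the bytecode that cannot be: for the latter it breaks the graph and falls back to dynamic-graph execution, and for the former it generates a static graph as a replacement. On the second execution, SOT first checks whether the cache from the previous transcription is hit; if so, the previously transcribed PyCodeObject is reused directly. The figure below shows the whole SOT execution flow.
+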
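The cache check on repeated calls can be sketched as a per-code-object cache whose entries carry a validity predicate. This is a simplified illustration, not SOT's implementation: `TranslationCache`, `translate_by_arg_types`, and the choice of argument types as the validity condition are all assumptions made for this sketch.

```python
from typing import Any, Callable, Dict, List, Tuple

Guard = Callable[[Tuple[Any, ...]], bool]


class TranslationCache:
    """Cache translated functions per code object, validated by a guard."""

    def __init__(self, translate):
        self._translate = translate
        self._entries: Dict[Any, List[Tuple[Guard, Callable]]] = {}
        self.translations = 0  # how many times the expensive step ran

    def lookup(self, fn, args):
        entries = self._entries.setdefault(fn.__code__, [])
        for guard, compiled in entries:
            if guard(args):          # cache hit: reuse the earlier result
                return compiled
        guard, compiled = self._translate(fn, args)   # cache miss
        self.translations += 1
        entries.append((guard, compiled))
        return compiled


def translate_by_arg_types(fn, args):
    """Stand-in for the expensive simulate-and-rewrite step.

    The result is assumed valid only for the same argument types.
    """
    expected = tuple(type(a) for a in args)

    def guard(new_args):
        return tuple(type(a) for a in new_args) == expected

    return guard, fn  # a real translator would return rewritten code


def scale(x, factor):
    return x * factor


cache = TranslationCache(translate_by_arg_types)
print(cache.lookup(scale, (2, 3))(2, 3))          # first call translates
print(cache.lookup(scale, (4, 5))(4, 5))          # same types: cache hit
print(cache.lookup(scale, (1.5, 2.0))(1.5, 2.0))  # new types: translate again
print("translations:", cache.translations)
```

Keeping a list of entries per code object lets one function hold several translated variants at once, each selected by its own validity check.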

+ +

+
+## 3. Framework architecture
+

+ +

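+
+
+The figure above shows all the components of SOT. Some of the terms and modules are introduced briefly below:
+
+### 3.1 EvalFrame Hooker module
+
+PEP 523, proposed for Python in 2016, added support for a custom callback that replaces the default evaluator with a user-defined frame evaluation function. Combining this mechanism with the needs of the subgraph fallback design, we expose the `paddle.core.set_eval_frame` interface in Paddle's Pybind layer.
+
+### 3.2 Bytecode simulator (OpcodeExecutor) module
+
+This is the core of SOT. Its job is to simulate the obtained PyCodeObject and separate dynamic code from static code, so the bytecode simulator is the module that maps one Python function to a new Python function. Depending on how far a function can be made static, the **bytecode simulator** handles it in one of the following ways:
+
+1. If the target function can be made **fully static**, it returns a new executable function that builds the subgraph corresponding to the target function;
+2. If the target function can only be made **partially static**, it also returns a new executable function, which extracts the static part as a subgraph and the non-static part as sub-functions (possibly one per branch), handled recursively through the Eval Frame mechanism.
+3. If the target function **cannot be made static** at all, it returns the original function, which runs as a dynamic graph.
+
+The SOT project includes a complete Python bytecode interpreter with the following characteristics:
+
+- Well designed, with a dispatch mechanism; it follows the open-closed principle and is easy to maintain.
+- Supports triggering graph breaks and fallback at any point.
+- Supports recursive simulation of sub-functions.
+- Broad bytecode and version support: simulation of almost 90% of the common bytecodes in Python 3.8 through Python 3.12.
+
+### 3.3 Adaptive subgraph-break module
+
+**When control flow such as If or For depends on a Tensor, graph construction has to be broken and only part of the function made static. Subgraph breaking is the core capability that lets SOT reach a success rate close to 100%.**
+
+We studied the types, design, and mechanisms of breaks in depth and classified all break scenarios into two behaviors:
+
+- BreakGraph: triggers a subgraph break; the current function produces one subgraph and one resume function for the next round of simulation.
+- Fallback: triggers a subgraph break; the current function produces no subgraph and runs directly as a dynamic graph.
+
+Different exception propagation paths and handling logic are designed for different scenarios.
+
+### 3.4 Tracker, Guard, and cache module
+
+The subgraph fallback implementation as a whole can be seen as converting the original bytecode of the user function into new bytecode. **To avoid re-running the expensive bytecode conversion every time the same input is passed in, a cache is needed to reuse previously transcribed code, giving a JIT effect.**
+
+However, bytecode that was converted successfully once cannot always be reused the second time. The conversion is obtained by simulating execution from the initial state of the Frame, which means **the converted bytecode depends strongly on the initial state of the Frame**. When the initial state changes, the converted bytecode is likely to change too, so a mechanism is needed to decide, from the initial Frame state, whether a cached bytecode is still valid. We call this mechanism the Guard function. Generating a Guard function relies on the Tracker recorded for each simulated variable during bytecode simulation.
+
+### 3.5 Side-effect handling module
+
+**A SideEffect is any effect that code has on its caller during execution beyond the function's return value, such as modifying a global variable or a shared mutable variable.**
+
+During simulation the code runs in a virtual environment, and the real environment should not and will not be modified. If the user code produces a SideEffect, the generated code must reflect it, which means adding SideEffect handling to the bytecode generation step. The side-effect module is the module dedicated to recording side effects and handling them correctly.
+
+### 3.6 StatementIR module
+
+**StatementIR is the bridge between Paddle's dynamic-to-static module and subgraph fallback; it is what allows the existing dynamic-to-static machinery to be reused.**
+
+Like a Program, StatementIR is a structure that represents computation. **During bytecode execution, all network-building code has to be recorded temporarily and finally assembled into a Program.** StatementIR is the carrier for this record. When the function ends, the recorded StatementIR is turned into a function. Unlike the original user code, a function generated from StatementIR is guaranteed to be convertible to a static graph, so the existing to_static function can be reused to run it as a static graph.
+
+## 4. Comparison with the AST approach
+
+Compared with the AST approach, SOT has the following advantages:
+
+1. [Higher success rate] When SOT meets unsupported syntax, it breaks the graph automatically and runs the unsupported part as a dynamic graph, so in theory the success rate can approach 100%.
+2. [Transcription completeness] SOT depends only on Python bytecode, so code whose source is unavailable can still run and produce correct results.
+3. [Control-flow support] Because SOT supports adaptive subgraph breaks, some container operations need not be made static, and control flow combined with containers is handled better. The static graph does not need to support many container-like structures at the bottom layer, such as TensorArray or TensorDict.
+4. [Adaptive subgraph breaks] When code cannot be made static, SOT breaks graph construction, runs the static graph to obtain its output, and then starts a new round of graph construction. The benefits of the static graph and the compiler are therefore available at the subgraph level.
+
+**Note: Save/Load mode requires whole-graph export and switches to AST mode automatically.**
+
+## 5. Getting started
+
+### 5.1 Using SOT mode (the default)
+
+SOT is currently the default transcription mode for dynamic-to-static conversion. Users only need to call paddle.jit.to_static with its defaults. Below is an example of SOT dynamic-to-static conversion:
+
+```python
+import paddle
+from paddle.jit import to_static
+from paddle.static import InputSpec
+import numpy as np
+import random
+
+# set seed for deterministic output
+paddle.seed(2024)
+np.random.seed(2024)
+random.seed(2024)
+
+class SimpleNet(paddle.nn.Layer):
+    def __init__(self):
+        super().__init__()
+        self.linear = paddle.nn.Linear(10, 3)
+
+    def forward(self, x, y):
+        x = self.linear(x)
+        x = x + y
+        np_x = x.numpy()
+        np_x = np.sum(np_x) * 2
+        return paddle.to_tensor(np_x)
+
+net = SimpleNet()
+
+net = paddle.jit.to_static(net, full_graph=False)  # dynamic-to-static; full_graph=False selects SOT mode
+x = paddle.randn((10, 10))
+y = paddle.randn((3,))
+out = net(x, y)
+print(out)
+```
+
+The output is:
+
+```bash
+Tensor(shape=[], dtype=float64, place=Place(gpu:0), stop_gradient=True,
+       54.16428375)
+```
+
+### 5.2 Using AST mode
+
+If you are sure that your code can be made fully static, you can turn on AST mode manually. AST mode usually has a lower success rate, but its scheduling overhead is smaller and it supports inference deployment.
+
+```python
+import paddle
+from paddle.jit import to_static
+from paddle.static import InputSpec
+import numpy as np
+import random
+
+# set seed for deterministic output
+paddle.seed(2024)
+np.random.seed(2024)
+random.seed(2024)
+
+class SimpleNet(paddle.nn.Layer):
+    def __init__(self):
+        super().__init__()
+        self.linear = paddle.nn.Linear(10, 3)
+
+    def forward(self, x, y):
+        x = self.linear(x)
+        x = x + y
+        np_x = x.numpy()
+        np_x = np.sum(np_x) * 2
+        return paddle.to_tensor(np_x)
+
+net = SimpleNet()
+
+net = paddle.jit.to_static(net, full_graph=True)  # dynamic-to-static; full_graph=True selects AST mode
+x = paddle.randn((10, 10))
+y = paddle.randn((3,))
+out = net(x, y)
+print(out)
+```
+
+In this case AST mode raises an error, because the code mixes numpy and paddle APIs and cannot be made static as a whole graph.
+```bash
+Traceback (most recent call last):
+  File "ttt.py", line 29, in
+    out = net(x, y)
+  File "/home/ssd2/xiongkun/Paddle/build/python/paddle/nn/layer/layers.py", line 1484, in __call__
+    return self.forward(*inputs, **kwargs)
+  File "/home/ssd2/xiongkun/Paddle/build/python/paddle/jit/dy2static/program_translator.py", line 502, in __call__
+    return self._perform_call(*args, **kwargs)
+  File "/home/ssd2/xiongkun/Paddle/build/python/paddle/jit/dy2static/program_translator.py", line 822, in _perform_call
+    error_data.raise_new_exception()
+  File "/home/ssd2/xiongkun/Paddle/build/python/paddle/jit/dy2static/error.py", line 448, in raise_new_exception
+    raise new_exception from None
+TypeError: In transformed code:
+
+    File "ttt.py", line 21, in forward
+        x = x + y
+        np_x = x.numpy()
+        np_x = np.sum(np_x) * 2
+        ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
+        return paddle.to_tensor(np_x)
+
+    File "<__array_function__ internals>", line 200, in sum
+
+    File "/root/miniconda3/envs/py38/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2324, in sum
+        return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
+    File "/root/miniconda3/envs/py38/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 84, in _wrapreduction
+        return reduction(axis=axis, out=out, **passkwargs)
+
+    TypeError: Code 'np_x = np.sum(np_x) * 2' called numpy API np.sum, please use Paddle API to replace it.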
+ values will be changed to variables by dy2static, numpy api can not handle variables + +``` diff --git a/docs/hardware_support/dcu/index_cn.rst b/docs/hardware_support/dcu/index_cn.rst index ec70f613e29..5e29f453f52 100644 --- a/docs/hardware_support/dcu/index_cn.rst +++ b/docs/hardware_support/dcu/index_cn.rst @@ -9,12 +9,12 @@ 飞桨框架支持基于海光 DCU 芯片的训练和推理,请参考以下内容快速体验: - `海光 DCU 基于框架的使用指南 <./paddle_tutorial_cn.html>`_ : 海光 DCU 基于框架的使用指南 -- `海光 DCU 基于套件的使用指南 <./suite_tutorial_cn.html>`_ : 海光 DCU 基于套件的使用指南 +- `海光 DCU 基于套件的使用指南 <./paddlex_tutorial_cn.html>`_ : 海光 DCU 基于套件的使用指南 - `海光 DCU 支持模型 <./support_cn.html>`_ : 海光 DCU 支持模型 .. toctree:: :hidden: paddle_tutorial_cn.md - suite_tutorial_cn.md + paddlex_tutorial_cn.md support_cn.md diff --git a/docs/hardware_support/dcu/paddle_tutorial_cn.md b/docs/hardware_support/dcu/paddle_tutorial_cn.md index cc6cebd9a9f..5d5c331edf7 100644 --- a/docs/hardware_support/dcu/paddle_tutorial_cn.md +++ b/docs/hardware_support/dcu/paddle_tutorial_cn.md @@ -8,7 +8,7 @@ * 考虑到环境差异性,我们推荐使用教程提供的标准镜像完成环境准备: - * 镜像链接: registry.baidubce.com/device/paddle-dcu:dtk23.10.1-kylinv10-gcc73-py310 + * 镜像链接:ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle-dcu:dtk24.04.1-kylinv10-gcc82 ### 环境安装 @@ -19,7 +19,7 @@ *由于 dcu 代码位于飞桨主框架中,因此我们不需要安装额外的 Custom Device 包* ```shell -python -m pip install --pre paddlepaddle-rocm -i https://www.paddlepaddle.org.cn/packages/nightly/dcu/ +python -m pip install --pre paddlepaddle-dcu -i https://www.paddlepaddle.org.cn/packages/nightly/dcu/ ``` ## 二、运行示例 diff --git a/docs/hardware_support/dcu/paddlex_tutorial_cn.md b/docs/hardware_support/dcu/paddlex_tutorial_cn.md new file mode 100644 index 00000000000..07cae94c184 --- /dev/null +++ b/docs/hardware_support/dcu/paddlex_tutorial_cn.md @@ -0,0 +1,212 @@ +# 海光 DCU 基于 PaddleX 的使用指南 + +## 环境准备 + +### 环境说明 + +* 本教程介绍如何基于海光 DCU 进行 ResNet50 / DeepLabv3p 等不同领域模型的训练,总共需要 4 卡进行训练 + +* 考虑到环境差异性,我们推荐使用教程提供的标准镜像完成环境准备: + + * 
Image: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle-dcu:dtk24.04.1-kylinv10-gcc82
+
+### Installing the environment
+
+1. Install PaddlePaddle
+
+*This command installs the nightly-build version of the main PaddlePaddle framework, which is built automatically every day*
+
+*Because the dcu code lives in the main PaddlePaddle framework, no additional Custom Device package is needed*
+
+```shell
+python -m pip install --pre paddlepaddle-dcu -i https://www.paddlepaddle.org.cn/packages/nightly/dcu/
+```
+
+2. Install the PaddleX repository
+
+```shell
+git clone https://github.com/PaddlePaddle/PaddleX.git
+
+# If this is slow, consider cloning from gitee instead
+# git clone https://gitee.com/paddlepaddle/PaddleX.git
+
+cd PaddleX
+
+# Install the PaddleX wheel
+# -e: install in editable mode, so code changes in this project take effect directly in the installed PaddleX wheel
+pip install -e .
+```
+
+## Training ResNet50 with PaddleX
+
+### 1. Install PaddleX dependencies
+
+```shell
+# Go to the PaddleX root directory
+cd /path/to/paddlex
+
+# Install the PaddleX dependencies; this is an image classification model, so install the image classification library
+paddlex --install PaddleClas
+
+# After installation the following message is shown:
+# All packages are installed.
+```
+
+### 2. Data preparation
+
+For a quick start, we use the flowers 102 dataset:
+
+1. Download the dataset
+
+```shell
+# Go to the PaddleX root directory
+cd /path/to/paddlex
+
+# Download and extract the data
+wget https://paddle-model-ecology.bj.bcebos.com/paddlex/data/cls_flowers_examples.tar -P ./dataset
+tar -xf ./dataset/cls_flowers_examples.tar -C ./dataset/
+```
+
+2. Validate the data
+
+```shell
+# PaddleX can validate a dataset to make sure its format meets PaddleX requirements. During validation it also analyzes the dataset and collects basic statistics.
+python main.py -c paddlex/configs/image_classification/ResNet50.yaml \
+    -o Global.mode=check_dataset \
+    -o Global.dataset_dir=./dataset/cls_flowers_examples
+
+# On success, the log prints the message: Check dataset passed !
+```
+
+For more about PaddleX datasets, see [PaddleX dataset validation](https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-beta/docs/tutorials/data/dataset_check.md)
+
+### 3. Model training
+
+In the `PaddleX` directory, run the following command to start training on 4 DCU cards (cards 0 to 3), where:
+
+* `-o Global.device` specifies the device to run on; pass `gpu:0,1,2,3` here. With this parameter PaddleX calls PaddlePaddle's device API `paddle.set_device` to set the running device to `gpu`, and during training PaddlePaddle automatically calls the dcu operators to run the model computation. For more details on device selection, see the official API [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device).
+
+* `-c paddlex/configs/image_classification/ResNet50.yaml` reads the configuration file at the given path. The file specifies the model structure, training hyperparameters, and every other setting needed for training; the model structure it specifies is `ResNet50`
+
+```shell
+python main.py -c paddlex/configs/image_classification/ResNet50.yaml \
+    -o Global.mode=train \
+    -o Global.dataset_dir=./dataset/cls_flowers_examples \
+    -o Global.output=resnet50_output \
+    -o Global.device="gpu:0,1,2,3"
+```
+
+The command creates a `resnet50_output/` directory under `PaddleX`, which stores the model parameters produced during training
+
+### 4. Model inference
+
+#### Inference with PaddleInference
+
+After training, the best weights are stored in `resnet50_output/best_model/`. The three files `inference.pdiparams`, `inference.pdiparams.info`, and `inference.pdmodel` are the static-graph files used for inference. Run inference with the following command
+
+```shell
+python main.py -c paddlex/configs/image_classification/ResNet50.yaml \
+    -o Global.mode=predict \
+    -o Predict.model_dir="./resnet50_output/best_model" \
+    -o Predict.input_path="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg" \
+    -o Global.device="gpu:0"
+```
+
+#### Converting to an ONNX model
+
+If your deployment needs to be based on ONNX, a dedicated tool is available for exporting ONNX models. Follow the steps below to convert the static-graph model produced above into an ONNX model:
+
+a. Install the environment
+
+```shell
+# Install paddle2onnx, which converts PaddleInference models to ONNX format
+python -m pip install paddle2onnx
+```
+
+b. Convert the model
+
+```shell
+paddle2onnx --model_dir=./resnet50_output/best_model/ \
+    --model_filename=inference.pdmodel \
+    --params_filename=inference.pdiparams \
+    --save_file=./resnet50_output/best_model/inference.onnx \
+    --enable_onnx_checker=True
+```
+
+The command generates `inference.onnx` in the `resnet50_output/best_model` directory
+
+## Training DeepLabv3+ with PaddleX
+
+### 1. Install PaddleX dependencies
+
+```shell
+# Go to the PaddleX root directory
+cd /path/to/paddlex
+
+# Install the PaddleX dependencies; this is an image segmentation model, so install the image segmentation library
+paddlex --install PaddleSeg
+
+# After installation the following message is shown:
+# All packages are installed.
+```
+
+### 2. Data preparation
+
+For a quick start, we use the demo dataset prepared by PaddleX:
+
+1. Download the dataset
+
+```shell
+# Go to the PaddleX root directory
+cd /path/to/paddlex
+
+# Download and extract the data
+wget https://paddle-model-ecology.bj.bcebos.com/paddlex/data/seg_optic_examples.tar -P ./dataset
+tar -xf ./dataset/seg_optic_examples.tar -C ./dataset/
+```
+
+2. Validate the data
+
+```shell
+# PaddleX can validate a dataset to make sure its format meets PaddleX requirements. During validation it also analyzes the dataset and collects basic statistics.
+python main.py -c paddlex/configs/semantic_segmentation/Deeplabv3_Plus-R50.yaml \
+    -o Global.mode=check_dataset \
+    -o Global.dataset_dir=./dataset/seg_optic_examples
+
+# On success, the log prints the message: Check dataset passed !
+```
+
+For more about PaddleX datasets, see [PaddleX dataset validation](https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-beta/docs/tutorials/data/dataset_check.md)
+
+
+### 3. Model training
+
+In the `PaddleX` directory, run the following command to start training on 4 DCU cards (cards 0 to 3), where:
+
+* `-o Global.device` specifies the device to run on; pass `gpu:0,1,2,3` here. With this parameter PaddleX calls PaddlePaddle's device API `paddle.set_device` to set the running device to `gpu`, and during training PaddlePaddle automatically calls the dcu operators to run the model computation. For more details on device selection, see the official API [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device).
+
+* `-c paddlex/configs/semantic_segmentation/Deeplabv3_Plus-R50.yaml` reads the configuration file at the given path. The file specifies the model structure, training hyperparameters, and every other setting needed for training; the model structure it specifies is `Deeplabv3_Plus-R50`
+
+```shell
+python main.py -c paddlex/configs/semantic_segmentation/Deeplabv3_Plus-R50.yaml \
+    -o Global.mode=train \
+    -o Global.dataset_dir=./dataset/seg_optic_examples \
+    -o Global.output=deeplabv3p_output \
+    -o Global.device="gpu:0,1,2,3"
+```
+
+The command creates a `deeplabv3p_output/` directory under `PaddleX`, which stores the model parameters produced during training
+
+### 4. Model inference
+
+#### Inference with PaddleInference
+
+After training, the best weights are stored in `deeplabv3p_output/best_model/`. The three files `model/inference.pdiparams`, `model/inference.pdiparams.info`, and `model/inference.pdmodel` are the static-graph files used for inference. Run inference with the following command
+
+```shell
+python main.py -c paddlex/configs/semantic_segmentation/Deeplabv3_Plus-R50.yaml \
+    -o Global.mode=predict \
+    -o Predict.model_dir="./deeplabv3p_output/best_model/model/" \
+    -o Predict.input_path="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_semantic_segmentation_001.jpg" \
+    -o Global.device="gpu:0"
+```
diff --git a/docs/hardware_support/dcu/suite_tutorial_cn.md b/docs/hardware_support/dcu/suite_tutorial_cn.md
deleted file mode 100644
index c914d5956e5..00000000000
--- a/docs/hardware_support/dcu/suite_tutorial_cn.md
+++ /dev/null
@@ -1,217 +0,0 @@
-# 海光 DCU 基于套件的使用指南
-
-## 环境准备
-
-### 环境说明
-
-* 本教程介绍如何基于海光 DCU 进行 ResNet50 的训练,总共需要 4 卡进行训练
-
-* 考虑到环境差异性,我们推荐使用教程提供的标准镜像完成环境准备:
-
-  * 镜像链接: 
registry.baidubce.com/device/paddle-dcu:dtk23.10.1-kylinv10-gcc73-py310 - -### 环境安装 - -安装 PaddlePaddle - -*该命令会自动安装飞桨主框架每日自动构建的 nightly-build 版本* - -*由于 dcu 代码位于飞桨主框架中,因此我们不需要安装额外的 Custom Device 包* - -```shell -python -m pip install --pre paddlepaddle-rocm -i https://www.paddlepaddle.org.cn/packages/nightly/dcu/ -``` - -## 基于 PaddleClas 训练 ResNet50 - -### 一、安装 PaddleClas 代码库 - -```shell -git clone https://github.com/PaddlePaddle/PaddleClas.git -b release/2.5.1 -cd PaddleClas -python -m pip install -r requirements.txt -python -m pip install . -``` - -### 二、数据准备 - -请根据 [数据说明文档](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.5.1/docs/zh_CN/models/ImageNet1k/ResNet.md#32-%E6%95%B0%E6%8D%AE%E5%87%86%E5%A4%87) 准备 ImageNet1k 数据集,准备完成后解压到 PaddleClas/dataset/目录下,目录结构如下: - -``` -PaddleClas/dataset/ILSVRC2012/ -|_ train/ -| |_ n01440764 -| | |_ n01440764_10026.JPEG -| | |_ ... -| |_ ... -| | -| |_ n15075141 -| |_ ... -| |_ n15075141_9993.JPEG -|_ val/ -| |_ ILSVRC2012_val_00000001.JPEG -| |_ ... 
-| |_ ILSVRC2012_val_00050000.JPEG -|_ train_list.txt -|_ val_list.txt -``` - -### 三、模型训练 - -进入 PaddleClas 目录下,执行如下命令启动 4 卡 DCU(0 ~ 3 号卡)训练,其中: - -* 参数 `-o Global.device` 指定的是即将运行的设备,为了保持和 GPU 兼容,我们在命名上做了兼容处理,dcu 设备的名字同样叫做 gpu,因此这里需要传入的是 gpu ,通过指定该参数,PaddleSeg 调用飞桨的设备指定接口 `paddle.set_device` 来指定运行设备为 dcu,在进行模型训练时,飞桨将自动调用 dcu 算子用于执行模型计算。关于设备指定的更多细节,可以参考官方 api [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 - -* 参数 `-c ./ppcls/configs/ImageNet/ResNet/ResNet50.yaml` 表示读取指定目录下的配置文件,配置文件中指定了模型结构,训练超参等所有训练模型需要用到的配置,该文件中指定的模型结构为 `ResNet50` - -```shell -python -u -m paddle.distributed.launch --devices 0,1,2,3 tools/train.py \ - -c ./ppcls/configs/ImageNet/ResNet/ResNet50.yaml \ - -o Global.output_dir="output/ResNet50" \ - -o Global.device="gpu" -``` - -上述命令会在 PaddleClas 目录下产生一个 output/ResNet50 目录,该目录会存放训练过程中的模型参数 - -### 四、模型导出 & 推理 - -#### 模型导出 - -训练完成后,最后一个 epoch 的权重放在 output/ResNet50/ 目录下的 epoch_120.pdparams 文件中,执行以下命令将模型转成 Paddle 静态图格式存储,以获得更好的推理性能: - -* export_model.py 执行的是 `动转静` 操作,飞桨框架会对代码进行分析,将动态图代码(灵活易用)转为 静态图模型(高效),以达到更加高效的推理性能 - -* 该操作会在指定./deploy/models/ResNet50 下生成 inference.pdiparams、inference.pdiparams.info、inference.pdmodel 3 个文件 - -```shell -python tools/export_model.py \ - -c ./ppcls/configs/ImageNet/ResNet/ResNet50.yaml \ - -o Global.pretrained_model=./output/ResNet50/epoch_120 \ - -o Global.save_inference_dir=./deploy/models/ResNet50 \ - -o Global.device=gpu -``` - -#### 基于 PaddleInference 推理 - -推理代码位于 PaddleClas/deploy 目录下,执行下列命令进行 DCU 推理: - -* 该脚本将会加载上一步保存的静态图,使用飞桨预测库 PaddleInference 进行推理 - -* PaddleInference 内置了大量的高性能 Kernel,并且可以基于计算图分析,完成细粒度 OP 横向纵向融合,实现了高性能推理 - -```shell -cd deploy -export FLAGS_conv_workspace_size_limit=2000 -python python/predict_cls.py \ - -c ./configs/inference_cls.yaml \ - -o Global.inference_model_dir=./models/ResNet50 \ - -o Global.use_gpu=True \ - -o Global.infer_imgs=./images/ImageNet -``` - -#### 转换 ONNX 模型 - -如果您有额外的部署需求需要基于 ONNX 
实现,我们也提供了专用的工具用于导出 ONNX 模型,参考如下步骤,即可将第一步导出的静态图模型转换为 ONNX 模型: - -a. 安装环境 - -```shell -# 安装 paddle2onnx,该工具支持将 PaddleInference 模型转换为 ONNX 格式 -python -m pip install paddle2onnx -``` - -b. 模型转换 - -```shell -paddle2onnx --model_dir=./deploy/models/ResNet50/ \ - --model_filename=inference.pdmodel \ - --params_filename=inference.pdiparams \ - --save_file=./deploy/models/ResNet50_onnx/inference.onnx \ - --opset_version=10 \ - --enable_onnx_checker=True -``` - -该命令会在 deploy/models/ResNet50_onnx 目录下生成 inference.onnx 文件,生成的文件可以基于 ONNX Runtime 进行推理,具体使用方式参考 [ONNX Runtime 官网](https://onnxruntime.ai/) - -## 基于 PaddleSeg 训练 DeepLabv3+ - -### 一、安装 PaddleSeg 代码库 - -```shell -git clone https://github.com/PaddlePaddle/PaddleSeg -b release/2.9 -cd PaddleSeg -python -m pip install -r requirements.txt -python -m pip install -e . -``` - -### 二、数据准备 - -请根据 [数据说明文档](https://github.com/PaddlePaddle/PaddleSeg/blob/release/2.9/docs/data/pre_data_cn.md#cityscapes%E6%95%B0%E6%8D%AE%E9%9B%86) 准备 Cityscapes 数据集,准备完成后解压到 PaddleSeg/data/目录下,目录结构如下: - -```shell -PaddleSeg/data/cityscapes -├── leftImg8bit -│ ├── train -│ ├── val -├── gtFine -│ ├── train -│ ├── val -``` - -### 三、模型训练 - -进入 PaddleSeg 目录下,执行如下命令启动 4 卡 DCU(0 ~ 3 号卡)训练,其中: - -* 参数 `--device` 指定的是即将运行的设备,为了保持和 GPU 兼容,我们在命名上做了兼容处理,dcu 设备的名字同样叫做 gpu,因此这里需要传入的是 gpu ,通过指定该参数,PaddleSeg 调用飞桨的设备指定接口 `paddle.set_device` 来指定运行设备为 dcu,在进行模型训练时,飞桨将自动调用 dcu 算子用于执行模型计算。关于设备指定的更多细节,可以参考官方 api [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 - -* 参数 `--config configs/deeplabv3p/deeplabv3p_resnet50_os8_cityscapes_1024x512_80k.yml` 表示读取指定目录下的配置文件,配置文件中指定了模型结构,训练超参等所有训练模型需要用到的配置,该文件中指定的模型结构为 DeepLabv3+ - -```shell -python -u -m paddle.distributed.launch --devices 0,1,2,3 tools/train.py \ - --config configs/deeplabv3p/deeplabv3p_resnet50_os8_cityscapes_1024x512_80k.yml \ - --num_workers 8 \ - --save_dir output/deeplabv3p_resnet50 \ - --log_iters 10 \ - --device gpu \ - --do_eval \ - 
--save_interval 1000 \ - --seed 2048 -``` - -上述命令会在 PaddleSeg 目录下产生一个 output/deeplabv3p_resnet50 目录,该目录会存放训练过程中的模型参数 - -### 四、模型导出 & 推理 - -#### 模型导出 - -训练完成后,最优指标对应的权重放在 output/deeplabv3p_resnet50/best_model 目录下,执行以下命令将模型转成 Paddle 静态图格式存储,以获得更好的推理性能: - -* export.py 执行的是 `动转静` 操作,飞桨框架会对代码进行分析,将动态图代码(灵活易用)转为 静态图模型(高效),以达到更加高效的推理性能 - -* 该操作会在指定 output/deeplabv3p_resnet50_inference_model 下生成 inference.pdiparams、inference.pdiparams.info、inference.pdmodel 3 个文件 - -```shell -python tools/export.py \ - --config configs/deeplabv3p/deeplabv3p_resnet50_os8_cityscapes_1024x512_80k.yml \ - --model_path output/deeplabv3p_resnet50/best_model/model.pdparams \ - --save_dir output/deeplabv3p_resnet50_inference_model -``` - -#### 基于 PaddleInference 推理 - -推理代码位于 PaddleSeg/deploy 目录下,执行下列命令进行 DCU 推理: - -* 该脚本将会加载上一步保存的静态图,使用飞桨预测库 PaddleInference 进行推理 - -* PaddleInference 内置了大量的高性能 Kernel,并且可以基于计算图分析,完成细粒度 OP 横向纵向融合,实现了高性能推理 - -```shell -wget https://paddleseg.bj.bcebos.com/dygraph/demo/cityscapes_demo.png - -export FLAGS_conv_workspace_size_limit=2000 -python deploy/python/infer.py \ - --config output/deeplabv3p_resnet50_inference_model/deploy.yaml \ - --image_path ./cityscapes_demo.png \ - --save_dir ./output \ - --device "gpu" -``` diff --git a/docs/hardware_support/dcu/support_cn.md b/docs/hardware_support/dcu/support_cn.md index 599539d0e19..5c16a418d85 100644 --- a/docs/hardware_support/dcu/support_cn.md +++ b/docs/hardware_support/dcu/support_cn.md @@ -1,318 +1,14 @@ # 海光 DCU 验证模型 -基于 2024Q2 devlop 版本,飞桨框架在海光 DCU 上通过精度验证的模型情况如下: +飞桨框架在海光 DCU 上通过精度验证的模型情况如下: | 模型库 | 模型名称 | 训练 | 推理 | -| - | - | - | - | -| PaddleClas | ResNet50 | √ | √ | -| PaddleClas | ResNet101 | √ | √ | -| PaddleClas | SE_ResNet50_vd | √ | √ | -| PaddleClas | VGG16 | √ | √ | -| PaddleClas | InceptionV4 | √ | √ | -| PaddleClas | AlexNet | √ | √ | -| PaddleClas | DenseNet121 | √ | √ | -| PaddleClas | CLIP_vit_base_patch16_224 | 未测试 | √ | -| PaddleClas | CSPDarkNet53 | 未测试 | √ | -| PaddleClas | 
CSWinTransformer_base_224 | 未测试 | √ | -| PaddleClas | CSWinTransformer_base_384 | 未测试 | √ | -| PaddleClas | CSWinTransformer_large_224 | 未测试 | √ | -| PaddleClas | CSWinTransformer_large_384 | 未测试 | √ | -| PaddleClas | CSWinTransformer_small_224 | 未测试 | √ | -| PaddleClas | CSWinTransformer_tiny_224 | 未测试 | √ | -| PaddleClas | ConvNeXt_small | 未测试 | √ | -| PaddleClas | ConvNeXt_tiny | 未测试 | √ | -| PaddleClas | CvT_13_224 | 未测试 | √ | -| PaddleClas | CvT_13_384 | 未测试 | √ | -| PaddleClas | CvT_21_224 | 未测试 | √ | -| PaddleClas | CvT_21_384 | 未测试 | √ | -| PaddleClas | DLA102 | 未测试 | √ | -| PaddleClas | DLA102x | 未测试 | √ | -| PaddleClas | DLA102x2 | 未测试 | √ | -| PaddleClas | DLA169 | 未测试 | √ | -| PaddleClas | DLA34 | 未测试 | √ | -| PaddleClas | DLA46_c | 未测试 | √ | -| PaddleClas | DLA46x_c | 未测试 | √ | -| PaddleClas | DLA60 | 未测试 | √ | -| PaddleClas | DLA60x | 未测试 | √ | -| PaddleClas | DLA60x_c | 未测试 | √ | -| PaddleClas | DPN107 | 未测试 | √ | -| PaddleClas | DPN131 | 未测试 | √ | -| PaddleClas | DPN68 | 未测试 | √ | -| PaddleClas | DPN92 | 未测试 | √ | -| PaddleClas | DPN98 | 未测试 | √ | -| PaddleClas | DSNet_base | 未测试 | √ | -| PaddleClas | DSNet_small | 未测试 | √ | -| PaddleClas | DSNet_tiny | 未测试 | √ | -| PaddleClas | DarkNet53 | 未测试 | √ | -| PaddleClas | DeiT_base_patch16_224 | 未测试 | √ | -| PaddleClas | DeiT_base_patch16_384 | 未测试 | √ | -| PaddleClas | DeiT_small_patch16_224 | 未测试 | √ | -| PaddleClas | DeiT_tiny_patch16_224 | 未测试 | √ | -| PaddleClas | DenseNet161 | 未测试 | √ | -| PaddleClas | DenseNet169 | 未测试 | √ | -| PaddleClas | DenseNet201 | 未测试 | √ | -| PaddleClas | DenseNet264 | 未测试 | √ | -| PaddleClas | DistillationModel | 未测试 | √ | -| PaddleClas | ESNet_x0_25 | 未测试 | √ | -| PaddleClas | ESNet_x0_5 | 未测试 | √ | -| PaddleClas | ESNet_x0_75 | 未测试 | √ | -| PaddleClas | ESNet_x1_0 | 未测试 | √ | -| PaddleClas | EfficientNetB0 | 未测试 | √ | -| PaddleClas | EfficientNetB1 | 未测试 | √ | -| PaddleClas | EfficientNetB2 | 未测试 | √ | -| PaddleClas | EfficientNetB3 | 未测试 | √ | -| PaddleClas | 
EfficientNetB4 | 未测试 | √ | -| PaddleClas | EfficientNetB5 | 未测试 | √ | -| PaddleClas | EfficientNetB6 | 未测试 | √ | -| PaddleClas | EfficientNetB7 | 未测试 | √ | -| PaddleClas | GeneralRecognitionV2_PPLCNetV2_base_ultra | 未测试 | √ | -| PaddleClas | GhostNet_x0_5 | 未测试 | √ | -| PaddleClas | GhostNet_x1_0 | 未测试 | √ | -| PaddleClas | GhostNet_x1_3 | 未测试 | √ | -| PaddleClas | GoogLeNet | 未测试 | √ | -| PaddleClas | HRNet_W18_C | 未测试 | √ | -| PaddleClas | HRNet_W30_C | 未测试 | √ | -| PaddleClas | HRNet_W32_C | 未测试 | √ | -| PaddleClas | HRNet_W40_C | 未测试 | √ | -| PaddleClas | HRNet_W44_C | 未测试 | √ | -| PaddleClas | HRNet_W48_C | 未测试 | √ | -| PaddleClas | HRNet_W64_C | 未测试 | √ | -| PaddleClas | HarDNet39_ds | 未测试 | √ | -| PaddleClas | HarDNet68 | 未测试 | √ | -| PaddleClas | HarDNet68_ds | 未测试 | √ | -| PaddleClas | HarDNet85 | 未测试 | √ | -| PaddleClas | InceptionV3 | 未测试 | √ | -| PaddleClas | LeViT_128 | 未测试 | √ | -| PaddleClas | LeViT_128S | 未测试 | √ | -| PaddleClas | LeViT_192 | 未测试 | √ | -| PaddleClas | LeViT_256 | 未测试 | √ | -| PaddleClas | LeViT_384 | 未测试 | √ | -| PaddleClas | MetaBIN_ResNet50 | 未测试 | √ | -| PaddleClas | MicroNet_M0 | 未测试 | √ | -| PaddleClas | MicroNet_M1 | 未测试 | √ | -| PaddleClas | MicroNet_M2 | 未测试 | √ | -| PaddleClas | MicroNet_M3 | 未测试 | √ | -| PaddleClas | MixNet_L | 未测试 | √ | -| PaddleClas | MixNet_M | 未测试 | √ | -| PaddleClas | MixNet_S | 未测试 | √ | -| PaddleClas | MobileNeXt_x1_0 | 未测试 | √ | -| PaddleClas | MobileNetV1 | 未测试 | √ | -| PaddleClas | MobileNetV1_x0_25 | 未测试 | √ | -| PaddleClas | MobileNetV1_x0_5 | 未测试 | √ | -| PaddleClas | MobileNetV1_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV2 | 未测试 | √ | -| PaddleClas | MobileNetV2_x0_25 | 未测试 | √ | -| PaddleClas | MobileNetV2_x0_5 | 未测试 | √ | -| PaddleClas | MobileNetV2_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV2_x1_5 | 未测试 | √ | -| PaddleClas | MobileNetV2_x2_0 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x0_35 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x0_5 | 未测试 | √ | -| PaddleClas | 
MobileNetV3_large_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x1_0 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x1_25 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x0_35 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x0_5 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x1_0 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x1_25 | 未测试 | √ | -| PaddleClas | MobileViTV2_x0_5 | 未测试 | √ | -| PaddleClas | MobileViTV2_x1_0 | 未测试 | √ | -| PaddleClas | MobileViTV2_x1_5 | 未测试 | √ | -| PaddleClas | MobileViTV2_x2_0 | 未测试 | √ | -| PaddleClas | MobileViTV3_S | 未测试 | √ | -| PaddleClas | MobileViTV3_S_L2 | 未测试 | √ | -| PaddleClas | MobileViTV3_XS | 未测试 | √ | -| PaddleClas | MobileViTV3_XS_L2 | 未测试 | √ | -| PaddleClas | MobileViTV3_XXS | 未测试 | √ | -| PaddleClas | MobileViTV3_XXS_L2 | 未测试 | √ | -| PaddleClas | MobileViTV3_x0_5 | 未测试 | √ | -| PaddleClas | MobileViTV3_x0_75 | 未测试 | √ | -| PaddleClas | MobileViTV3_x1_0 | 未测试 | √ | -| PaddleClas | MobileViT_S | 未测试 | √ | -| PaddleClas | MobileViT_XS | 未测试 | √ | -| PaddleClas | MobileViT_XXS | 未测试 | √ | -| PaddleClas | NextViT_base_224 | 未测试 | √ | -| PaddleClas | NextViT_base_384 | 未测试 | √ | -| PaddleClas | NextViT_large_224 | 未测试 | √ | -| PaddleClas | NextViT_large_384 | 未测试 | √ | -| PaddleClas | NextViT_small_224 | 未测试 | √ | -| PaddleClas | NextViT_small_384 | 未测试 | √ | -| PaddleClas | PPHGNet_small | 未测试 | √ | -| PaddleClas | PPHGNet_tiny | 未测试 | √ | -| PaddleClas | PPLCNetV2_base | 未测试 | √ | -| PaddleClas | PPLCNet_x0_25 | 未测试 | √ | -| PaddleClas | PPLCNet_x0_35 | 未测试 | √ | -| PaddleClas | PPLCNet_x0_5 | 未测试 | √ | -| PaddleClas | PPLCNet_x0_75 | 未测试 | √ | -| PaddleClas | PPLCNet_x1_0 | 未测试 | √ | -| PaddleClas | PPLCNet_x1_5 | 未测试 | √ | -| PaddleClas | PPLCNet_x2_0 | 未测试 | √ | -| PaddleClas | PPLCNet_x2_5 | 未测试 | √ | -| PaddleClas | PVT_V2_B0 | 未测试 | √ | -| PaddleClas | PVT_V2_B1 | 未测试 | √ | -| PaddleClas | PVT_V2_B2 | 未测试 | √ | -| PaddleClas | PVT_V2_B2_Linear | 未测试 | √ | -| 
PaddleClas | PVT_V2_B3 | 未测试 | √ | -| PaddleClas | PVT_V2_B4 | 未测试 | √ | -| PaddleClas | PVT_V2_B5 | 未测试 | √ | -| PaddleClas | ReXNet_1_0 | 未测试 | √ | -| PaddleClas | ReXNet_1_3 | 未测试 | √ | -| PaddleClas | ReXNet_1_5 | 未测试 | √ | -| PaddleClas | ReXNet_2_0 | 未测试 | √ | -| PaddleClas | ReXNet_3_0 | 未测试 | √ | -| PaddleClas | RepVGG_B3 | 未测试 | √ | -| PaddleClas | Res2Net101_vd_26w_4s | 未测试 | √ | -| PaddleClas | Res2Net200_vd_26w_4s | 未测试 | √ | -| PaddleClas | Res2Net50_14w_8s | 未测试 | √ | -| PaddleClas | Res2Net50_26w_4s | 未测试 | √ | -| PaddleClas | Res2Net50_vd_26w_4s | 未测试 | √ | -| PaddleClas | ResNeSt101 | 未测试 | √ | -| PaddleClas | ResNeSt50 | 未测试 | √ | -| PaddleClas | ResNeSt50_fast_1s1x64d | 未测试 | √ | -| PaddleClas | ResNeXt101_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt101_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt101_vd_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt101_vd_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt152_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt152_vd_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt152_vd_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_vd_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_vd_64x4d | 未测试 | √ | -| PaddleClas | ResNet101_vd | 未测试 | √ | -| PaddleClas | ResNet152 | 未测试 | √ | -| PaddleClas | ResNet152_vd | 未测试 | √ | -| PaddleClas | ResNet18 | 未测试 | √ | -| PaddleClas | ResNet18_vd | 未测试 | √ | -| PaddleClas | ResNet200_vd | 未测试 | √ | -| PaddleClas | ResNet34 | 未测试 | √ | -| PaddleClas | ResNet34_vd | 未测试 | √ | -| PaddleClas | ResNet50_vd | 未测试 | √ | -| PaddleClas | SENet154_vd | 未测试 | √ | -| PaddleClas | SE_ResNeXt101_32x4d | 未测试 | √ | -| PaddleClas | SE_ResNeXt50_32x4d | 未测试 | √ | -| PaddleClas | SE_ResNeXt50_vd_32x4d | 未测试 | √ | -| PaddleClas | SE_ResNet18_vd | 未测试 | √ | -| PaddleClas | SE_ResNet34_vd | 未测试 | √ | -| PaddleClas | ShuffleNetV2_swish | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x0_25 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x0_33 | 未测试 | √ | -| PaddleClas | 
ShuffleNetV2_x0_5 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x1_0 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x1_5 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x2_0 | 未测试 | √ | -| PaddleClas | SqueezeNet1_0 | 未测试 | √ | -| PaddleClas | SqueezeNet1_1 | 未测试 | √ | -| PaddleClas | SwinTransformer_base_patch4_window12_384 | 未测试 | √ | -| PaddleClas | SwinTransformer_base_patch4_window7_224 | 未测试 | √ | -| PaddleClas | SwinTransformer_large_patch4_window12_384 | 未测试 | √ | -| PaddleClas | SwinTransformer_large_patch4_window7_224 | 未测试 | √ | -| PaddleClas | SwinTransformer_small_patch4_window7_224 | 未测试 | √ | -| PaddleClas | SwinTransformer_tiny_patch4_window7_224 | 未测试 | √ | -| PaddleClas | TNT_small | 未测试 | √ | -| PaddleClas | TinyNet_A | 未测试 | √ | -| PaddleClas | TinyNet_B | 未测试 | √ | -| PaddleClas | TinyNet_C | 未测试 | √ | -| PaddleClas | TinyNet_D | 未测试 | √ | -| PaddleClas | TinyNet_E | 未测试 | √ | -| PaddleClas | UniFormer_base | 未测试 | √ | -| PaddleClas | UniFormer_base_ls | 未测试 | √ | -| PaddleClas | UniFormer_small | 未测试 | √ | -| PaddleClas | UniFormer_small_plus | 未测试 | √ | -| PaddleClas | UniFormer_small_plus_dim64 | 未测试 | √ | -| PaddleClas | VAN_B0 | 未测试 | √ | -| PaddleClas | VAN_B1 | 未测试 | √ | -| PaddleClas | VGG11 | 未测试 | √ | -| PaddleClas | VGG13 | 未测试 | √ | -| PaddleClas | VGG19 | 未测试 | √ | -| PaddleClas | ViT_base_patch16_224 | 未测试 | √ | -| PaddleClas | ViT_base_patch16_384 | 未测试 | √ | -| PaddleClas | ViT_base_patch32_384 | 未测试 | √ | -| PaddleClas | ViT_large_patch16_224 | 未测试 | √ | -| PaddleClas | ViT_large_patch16_384 | 未测试 | √ | -| PaddleClas | ViT_large_patch32_384 | 未测试 | √ | -| PaddleClas | ViT_small_patch16_224 | 未测试 | √ | -| PaddleClas | Xception41 | 未测试 | √ | -| PaddleClas | Xception41_deeplab | 未测试 | √ | -| PaddleClas | Xception65 | 未测试 | √ | -| PaddleClas | Xception65_deeplab | 未测试 | √ | -| PaddleClas | Xception71 | 未测试 | √ | -| PaddleClas | alt_gvt_base | 未测试 | √ | -| PaddleClas | alt_gvt_large | 未测试 | √ | -| PaddleClas | alt_gvt_small | 未测试 | √ | -| PaddleClas 
| cae_base_patch16_224 | 未测试 | √ | -| PaddleClas | pcpvt_base | 未测试 | √ | -| PaddleClas | pcpvt_large | 未测试 | √ | -| PaddleClas | pcpvt_small | 未测试 | √ | -| PaddleDetection | SOLOv2 | 未测试 | √ | -| PaddleDetection | TTFNet | 未测试 | √ | -| PaddleDetection | dark_hrnet_w32_256x192 | 未测试 | √ | -| PaddleDetection | fairmot_hrnetv2_w18_dlafpn_30e_576x320 | 未测试 | √ | -| PaddleDetection | hrnet_w32_256x192 | 未测试 | √ | -| PaddleDetection | ppyolo_r50vd_dcn_1x_coco | 未测试 | √ | -| PaddleDetection | ppyolov2_r50vd_dcn_365e_coco | 未测试 | √ | -| PaddleDetection | solov2_r50_enhance_coco | 未测试 | √ | -| PaddleDetection | tinypose_128x96 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv2_det_0 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv2_rec | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv3_det_0 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv3_rec | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv4_mobile_det_0 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv4_server_det_0 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv4_server_rec | 未测试 | √ | -| PaddleOCR | ch_ppocr_mobile_v2_0_det_0 | 未测试 | √ | -| PaddleOCR | ch_ppocr_mobile_v2_0_rec | 未测试 | √ | -| PaddleOCR | ch_ppocr_server_v2_0_det_0 | 未测试 | √ | -| PaddleOCR | ch_ppocr_server_v2_0_rec | 未测试 | √ | -| PaddleOCR | det_mv3_db_v2_0_0 | 未测试 | √ | -| PaddleOCR | det_r50_db_plusplus_0 | 未测试 | √ | -| PaddleOCR | det_r50_db_v2_0_0 | 未测试 | √ | -| PaddleOCR | en_table_structure | 未测试 | √ | -| PaddleOCR | rec_abinet | 未测试 | √ | -| PaddleOCR | rec_mtb_nrtr | 未测试 | √ | -| PaddleOCR | rec_mv3_none_bilstm_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_mv3_none_none_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_mv3_tps_bilstm_att_v2_0 | 未测试 | √ | -| PaddleOCR | rec_mv3_tps_bilstm_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_r31_sar | 未测试 | √ | -| PaddleOCR | rec_r32_gaspin_bilstm_att | 未测试 | √ | -| PaddleOCR | rec_r34_vd_none_bilstm_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_r34_vd_none_none_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_r34_vd_tps_bilstm_att_v2_0 | 未测试 | √ | -| PaddleOCR | rec_r34_vd_tps_bilstm_ctc_v2_0 | 未测试 | √ | -| 
PaddleOCR | rec_r50_fpn_vd_none_srn | 未测试 | √ | -| PaddleOCR | rec_resnet_rfl | 未测试 | √ | -| PaddleOCR | rec_svtrnet | 未测试 | √ | -| PaddleOCR | rec_vitstr | 未测试 | √ | -| PaddleOCR | slanet | 未测试 | √ | -| PaddleSeg | DANet | √ | √ | -| PaddleSeg | DeepLabV3+ | √ | √ | -| PaddleSeg | FCN | √ | √ | -| PaddleSeg | GCNet | √ | √ | -| PaddleSeg | U-Net | √ | √ | -| PaddleSeg | bisenetv2 | 未测试 | √ | -| PaddleSeg | deeplabv3p_resnet50_cityscapes | 未测试 | √ | -| PaddleSeg | fcn_hrnetw18 | 未测试 | √ | -| PaddleSeg | fcn_hrnetw18_small | 未测试 | √ | -| PaddleSeg | fcn_uhrnetw18_small | 未测试 | √ | -| PaddleSeg | ocrnet_hrnetw18 | 未测试 | √ | -| PaddleSeg | ocrnet_hrnetw48 | 未测试 | √ | -| PaddleSeg | pfpnnet | 未测试 | √ | -| PaddleSeg | pp_liteseg_stdc1 | 未测试 | √ | -| PaddleSeg | pphumanseg_lite | 未测试 | √ | -| PaddleSeg | pphumanseg_server | 未测试 | √ | -| PaddleSeg | ppmatting | 未测试 | √ | -| PaddleSeg | seaformer_base | 未测试 | √ | -| PaddleSeg | sfnet | 未测试 | √ | -| PaddleVideo | STGCN | 未测试 | √ | -| PaddleVideo | AGCN | 未测试 | √ | -| PaddleVideo | AGCN2s | 未测试 | √ | -| PaddleVideo | BMN | 未测试 | √ | -| PaddleVideo | PP-TSM | 未测试 | √ | -| PaddleVideo | PP-TSN | 未测试 | √ | -| PaddleVideo | SlowFast | 未测试 | √ | -| PaddleVideo | TSN | 未测试 | √ | -| PaddleNLP | Transformer | √ | 未测试 | -| PaddleGAN | msvsr | 未测试 | √ | +| - | - | - | - | +| PaddleX | ResNet18 | √ | √ | +| PaddleX | ResNet34 | √ | √ | +| PaddleX | ResNet50 | √ | √ | +| PaddleX | ResNet101 | √ | √ | +| PaddleX | ResNet152 | √ | √ | +| PaddleX | Deeplabv3_Plus-R50 | √ | √ | +| PaddleX | Deeplabv3_Plus-R101 | √ | √ | +| PaddleNLP | BERT | √ | √ | diff --git a/docs/hardware_support/mlu/index_cn.rst b/docs/hardware_support/mlu/index_cn.rst index 4e0fddcf2bb..b4d0dbcf4bf 100644 --- a/docs/hardware_support/mlu/index_cn.rst +++ b/docs/hardware_support/mlu/index_cn.rst @@ -9,7 +9,7 @@ 飞桨框架支持基于寒武纪 MLU 芯片的训练和推理,请参考以下内容快速体验: - `寒武纪 MLU 基于框架的使用指南 <./paddle_tutorial_cn.html>`_ : 寒武纪 MLU 基于框架的使用指南 -- `寒武纪 MLU 基于套件的使用指南 <./suite_tutorial_cn.html>`_ : 
寒武纪 MLU 基于套件的使用指南 +- `寒武纪 MLU 基于套件的使用指南 <./paddlex_tutorial_cn.html>`_ : 寒武纪 MLU 基于套件的使用指南 - `寒武纪 MLU 支持模型 <./support_cn.html>`_ : 寒武纪 MLU 支持模型 @@ -17,5 +17,5 @@ :hidden: paddle_tutorial_cn.md - suite_tutorial_cn.md + paddlex_tutorial_cn.md support_cn.md diff --git a/docs/hardware_support/mlu/paddle_tutorial_cn.md b/docs/hardware_support/mlu/paddle_tutorial_cn.md index 8c2de9b0bc4..9633e525331 100644 --- a/docs/hardware_support/mlu/paddle_tutorial_cn.md +++ b/docs/hardware_support/mlu/paddle_tutorial_cn.md @@ -8,7 +8,7 @@ * 考虑到环境差异性,我们推荐使用教程提供的标准镜像完成环境准备: - * 镜像链接: registry.baidubce.com/device/paddle-mlu:ubuntu20-x86_64-gcc84-py310 + * 镜像链接: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-mlu:ctr2.15.0-ubuntu20-gcc84-py310 ### 环境安装 diff --git a/docs/hardware_support/mlu/paddlex_tutorial_cn.md b/docs/hardware_support/mlu/paddlex_tutorial_cn.md new file mode 100644 index 00000000000..9f4fe6bf882 --- /dev/null +++ b/docs/hardware_support/mlu/paddlex_tutorial_cn.md @@ -0,0 +1,218 @@ +# 寒武纪 MLU 基于 PaddleX 的使用指南 + +## 环境准备 + +### 环境说明 + +* 本教程介绍如何基于寒武纪 MLU 进行 ResNet50 / DeepLabv3p 等不同领域模型的训练,总共需要 4 卡进行训练 + +* 考虑到环境差异性,我们推荐使用教程提供的标准镜像完成环境准备: + + * 镜像链接: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-mlu:ctr2.15.0-ubuntu20-gcc84-py310 + +### 环境安装 + +1. 安装 PaddlePaddle + +*该命令会自动安装飞桨主框架每日自动构建的 nightly-build 版本* + +```shell +python -m pip install paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/ +``` + +2. 安装 CustomDevice + +*该命令会自动安装飞桨 Custom Device 每日自动构建的 nightly-build 版本* + +```shell +python -m pip install --pre paddle-custom-mlu -i https://www.paddlepaddle.org.cn/packages/nightly/mlu/ +``` + +3. 安装 PaddleX 代码库 + +```shell +git clone https://github.com/PaddlePaddle/PaddleX.git + +# 如果速度较慢,可以考虑从 gitee 拉取 +# git clone https://gitee.com/paddlepaddle/PaddleX.git + +cd PaddleX + +# 安装 PaddleX whl +# -e:以可编辑模式安装,当前项目的代码更改,都会直接作用到已经安装的 PaddleX Wheel +pip install -e . 
+``` + +## 基于 PaddleX 训练 ResNet50 + +### 一、安装 PaddleX 依赖 + +```shell +# 跳转到 PaddleX 根目录下 +cd /path/to/paddlex + +# 安装 PaddleX 相关依赖,由于我们使用的是图像分类模型,因此安装图像分类库 +paddlex --install PaddleClas + +# 完成安装后会有如下提示: +# All packages are installed. +``` + +### 二、数据准备 + +为了快速上手验证,我们基于 flowers 102 数据集进行快速体验: + +1. 下载数据集 + +```shell +# 跳转到 PaddleX 根目录下 +cd /path/to/paddlex + +# 下载并解压数据 +wget https://paddle-model-ecology.bj.bcebos.com/paddlex/data/cls_flowers_examples.tar -P ./dataset +tar -xf ./dataset/cls_flowers_examples.tar -C ./dataset/ +``` + +2. 数据校验 + +```shell +# PaddleX 支持对数据集进行校验,确保数据集格式符合 PaddleX 的相关要求。同时在数据校验时,能够对数据集进行分析,统计数据集的基本信息。 +python main.py -c paddlex/configs/image_classification/ResNet50.yaml \ + -o Global.mode=check_dataset \ + -o Global.dataset_dir=./dataset/cls_flowers_examples + +# 命令运行成功后会在 log 中打印出 Check dataset passed ! 信息 +``` + +更多关于 PaddleX 数据集说明的内容,可以查看 [PaddleX 数据集校验](https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-beta/docs/tutorials/data/dataset_check.md) + +### 三、模型训练 + +进入 `PaddleX` 目录下,执行如下命令启动 4 卡 MLU(0 ~ 3 号卡)训练,其中: + +* 参数 `-o Global.device` 指定的是即将运行的设备,这里需要传入的是 `mlu:0,1,2,3` ,通过指定该参数,PaddleX 调用飞桨的设备指定接口 `paddle.set_device` 来指定运行设备为 `mlu` ,在进行模型训练时,飞桨将自动调用 mlu 算子用于执行模型计算。关于设备指定的更多细节,可以参考官方 api [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 + +* 参数 `-c paddlex/configs/image_classification/ResNet50.yaml` 表示读取指定目录下的配置文件,配置文件中指定了模型结构,训练超参等所有训练模型需要用到的配置,该文件中指定的模型结构为 `ResNet50` + +```shell +python main.py -c paddlex/configs/image_classification/ResNet50.yaml \ + -o Global.mode=train \ + -o Global.dataset_dir=./dataset/cls_flowers_examples \ + -o Global.output=resnet50_output \ + -o Global.device="mlu:0,1,2,3" +``` + +上述命令会在 `PaddleX` 目录下产生一个 `resnet50_output/` 目录,该目录会存放训练过程中的模型参数 + +### 四、模型推理 + +#### 基于 PaddleInference 推理 + +训练完成后,最优权重放在 `resnet50_output/best_model/` 目录下,其中 `inference.pdiparams`、`inference.pdiparams.info`、`inference.pdmodel` 3 
个文件为静态图文件,用于推理使用,使用如下命令进行推理 + +```shell +python main.py -c paddlex/configs/image_classification/ResNet50.yaml \ + -o Global.mode=predict \ + -o Predict.model_dir="./resnet50_output/best_model" \ + -o Predict.input_path="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg" \ + -o Global.device="mlu:0" +``` + +#### 转换 ONNX 模型 + +如果您有额外的部署需求需要基于 ONNX 实现,我们也提供了专用的工具用于导出 ONNX 模型,参考如下步骤,即可将第一步导出的静态图模型转换为 ONNX 模型: + +a. 安装环境 + +```shell +# 安装 paddle2onnx,该工具支持将 PaddleInference 模型转换为 ONNX 格式 +python -m pip install paddle2onnx +``` + +b. 模型转换 + +```shell +paddle2onnx --model_dir=./resnet50_output/best_model/ \ + --model_filename=inference.pdmodel \ + --params_filename=inference.pdiparams \ + --save_file=./resnet50_output/best_model/inference.onnx \ + --enable_onnx_checker=True +``` + +该命令会在 `resnet50_output/best_model` 目录下生成 `inference.onnx` 文件 + +## 基于 PaddleX 训练 DeepLabv3+ + +### 一、安装 PaddleX 依赖 + +```shell +# 跳转到 PaddleX 根目录下 +cd /path/to/paddlex + +# 安装 PaddleX 相关依赖,由于我们使用的是图像分割模型,因此安装图像分割库 +paddlex --install PaddleSeg + +# 完成安装后会有如下提示: +# All packages are installed. +``` + +### 二、数据准备 + +为了快速上手验证,我们基于 PaddleX 准备的 Demo 数据集进行快速体验: + +1. 下载数据集 + +```shell +# 跳转到 PaddleX 根目录下 +cd /path/to/paddlex + +# 下载并解压数据 +wget https://paddle-model-ecology.bj.bcebos.com/paddlex/data/seg_optic_examples.tar -P ./dataset +tar -xf ./dataset/seg_optic_examples.tar -C ./dataset/ +``` + +2. 数据校验 + +```shell +# PaddleX 支持对数据集进行校验,确保数据集格式符合 PaddleX 的相关要求。同时在数据校验时,能够对数据集进行分析,统计数据集的基本信息。 +python main.py -c paddlex/configs/semantic_segmentation/Deeplabv3_Plus-R50.yaml \ + -o Global.mode=check_dataset \ + -o Global.dataset_dir=./dataset/seg_optic_examples + +# 命令运行成功后会在 log 中打印出 Check dataset passed ! 
信息 +``` + +更多关于 PaddleX 数据集说明的内容,可以查看 [PaddleX 数据集校验](https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-beta/docs/tutorials/data/dataset_check.md) + + +### 三、模型训练 + +进入 `PaddleX` 目录下,执行如下命令启动 4 卡 MLU(0 ~ 3 号卡)训练,其中: + +* 参数 `-o Global.device` 指定的是即将运行的设备,这里需要传入的是 `mlu:0,1,2,3` ,通过指定该参数,PaddleX 调用飞桨的设备指定接口 `paddle.set_device` 来指定运行设备为 `mlu` ,在进行模型训练时,飞桨将自动调用 mlu 算子用于执行模型计算。关于设备指定的更多细节,可以参考官方 api [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 + +* 参数 `-c paddlex/configs/semantic_segmentation/Deeplabv3_Plus-R50.yaml` 表示读取指定目录下的配置文件,配置文件中指定了模型结构,训练超参等所有训练模型需要用到的配置,该文件中指定的模型结构为 `Deeplabv3_Plus-R50` + +```shell +python main.py -c paddlex/configs/semantic_segmentation/Deeplabv3_Plus-R50.yaml \ + -o Global.mode=train \ + -o Global.dataset_dir=./dataset/seg_optic_examples \ + -o Global.output=deeplabv3p_output \ + -o Global.device="mlu:0,1,2,3" +``` + +上述命令会在 `PaddleX` 目录下产生一个 `deeplabv3p_output/` 目录,该目录会存放训练过程中的模型参数 + +### 四、模型推理 + +#### 基于 PaddleInference 推理 + +训练完成后,最优权重放在 `deeplabv3p_output/best_model/` 目录下,其中 `model/inference.pdiparams`、`model/inference.pdiparams.info`、`model/inference.pdmodel` 3 个文件为静态图文件,用于推理使用,使用如下命令进行推理 + +```shell +python main.py -c paddlex/configs/semantic_segmentation/Deeplabv3_Plus-R50.yaml \ + -o Global.mode=predict \ + -o Predict.model_dir="./deeplabv3p_output/best_model/model/" \ + -o Predict.input_path="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_semantic_segmentation_001.jpg" \ + -o Global.device="mlu:0" +``` diff --git a/docs/hardware_support/mlu/suite_tutorial_cn.md b/docs/hardware_support/mlu/suite_tutorial_cn.md deleted file mode 100644 index d15b9d0f1e6..00000000000 --- a/docs/hardware_support/mlu/suite_tutorial_cn.md +++ /dev/null @@ -1,221 +0,0 @@ -# 寒武纪 MLU 基于套件的使用指南 - -## 环境准备 - -### 环境说明 - -* 本教程介绍如何基于寒武纪 MLU 进行 ResNet50 的训练,总共需要 4 卡进行训练 - -* 考虑到环境差异性,我们推荐使用教程提供的标准镜像完成环境准备: - - * 镜像链接: 
registry.baidubce.com/device/paddle-mlu:ubuntu20-x86_64-gcc84-py310 - -### 环境安装 - -1. 安装 PaddlePaddle - -*该命令会自动安装飞桨主框架每日自动构建的 nightly-build 版本* - -```shell -python -m pip install paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/ -``` - -2. 安装 CustomDevice - -*该命令会自动安装飞桨 Custom Device 每日自动构建的 nightly-build 版本* - -```shell -python -m pip install --pre paddle-custom-mlu -i https://www.paddlepaddle.org.cn/packages/nightly/mlu/ -``` - -## 基于 PaddleClas 训练 ResNet50 - -### 一、安装 PaddleClas 代码库 - -```shell -git clone https://github.com/PaddlePaddle/PaddleClas.git -b release/2.5.1 -cd PaddleClas -python -m pip install -r requirements.txt -python -m pip install . -``` - -### 二、数据准备 - -请根据 [数据说明文档](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.5.1/docs/zh_CN/models/ImageNet1k/ResNet.md#32-%E6%95%B0%E6%8D%AE%E5%87%86%E5%A4%87) 准备 ImageNet1k 数据集,准备完成后解压到 PaddleClas/dataset/目录下,目录结构如下: -``` -PaddleClas/dataset/ILSVRC2012/ -|_ train/ -| |_ n01440764 -| | |_ n01440764_10026.JPEG -| | |_ ... -| |_ ... -| | -| |_ n15075141 -| |_ ... -| |_ n15075141_9993.JPEG -|_ val/ -| |_ ILSVRC2012_val_00000001.JPEG -| |_ ... 
-| |_ ILSVRC2012_val_00050000.JPEG -|_ train_list.txt -|_ val_list.txt -``` - -### 三、模型训练 - -进入 PaddleClas 目录下,执行如下命令启动 4 卡 MLU(0 ~ 3 号卡)训练,其中: - -* 参数 `-o Global.device` 指定的是即将运行的设备,这里需要传入的是 `mlu` ,通过指定该参数,PaddleClas 调用飞桨的设备指定接口 `paddle.set_device` 来指定运行设备为 mlu,在进行模型训练时,飞桨将自动调用 mlu 算子用于执行模型计算。关于设备指定的更多细节,可以参考官方 api [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 - -* 参数 `-c ./ppcls/configs/ImageNet/ResNet/ResNet50.yaml` 表示读取指定目录下的配置文件,配置文件中指定了模型结构,训练超参等所有训练模型需要用到的配置,该文件中指定的模型结构为 `ResNet50` - -```shell -python -u -m paddle.distributed.launch --devices 0,1,2,3 tools/train.py \ - -c ./ppcls/configs/ImageNet/ResNet/ResNet50.yaml \ - -o Global.output_dir="output/ResNet50" \ - -o Global.device="mlu" -``` - -上述命令会在 PaddleClas 目录下产生一个 output/ResNet50 目录,该目录会存放训练过程中的模型参数 - -### 四、模型导出 & 推理 - -#### 模型导出 - -训练完成后,最后一个 epoch 的权重放在 output/ResNet50/ 目录下的 epoch_120.pdparams 文件中,执行以下命令将模型转成 Paddle 静态图格式存储,以获得更好的推理性能: - -* export_model.py 执行的是 `动转静` 操作,飞桨框架会对代码进行分析,将动态图代码(灵活易用)转为 静态图模型(高效),以达到更加高效的推理性能 - -* 该操作会在指定./deploy/models/ResNet50 下生成 inference.pdiparams、inference.pdiparams.info、inference.pdmodel 3 个文件 - -```shell -python tools/export_model.py \ - -c ./ppcls/configs/ImageNet/ResNet/ResNet50.yaml \ - -o Global.pretrained_model=./output/ResNet50/epoch_120 \ - -o Global.save_inference_dir=./deploy/models/ResNet50 \ - -o Global.device=mlu -``` - -#### 基于 PaddleInference 推理 - -推理代码位于 PaddleClas/deploy 目录下,执行下列命令进行 MLU 推理: - -* 该脚本将会加载上一步保存的静态图,使用飞桨预测库 PaddleInference 进行推理 - -* PaddleInference 内置了大量的高性能 Kernel,并且可以基于计算图分析,完成细粒度 OP 横向纵向融合,实现了高性能推理 - -```shell -cd deploy -python python/predict_cls.py \ - -c ./configs/inference_cls.yaml \ - -o Global.inference_model_dir=./models/ResNet50 \ - -o Global.use_gpu=False \ - -o Global.use_mlu=True \ - -o Global.infer_imgs=./images/ImageNet -``` - -#### 转换 ONNX 模型 - -如果您有额外的部署需求需要基于 ONNX 实现,我们也提供了专用的工具用于导出 ONNX 模型,参考如下步骤,即可将第一步导出的静态图模型转换为 ONNX 模型: - -a. 
安装环境 - -```shell -# 安装 paddle2onnx,该工具支持将 PaddleInference 模型转换为 ONNX 格式 -python -m pip install paddle2onnx -``` - -b. 模型转换 - -```shell -paddle2onnx --model_dir=./deploy/models/ResNet50/ \ - --model_filename=inference.pdmodel \ - --params_filename=inference.pdiparams \ - --save_file=./deploy/models/ResNet50_onnx/inference.onnx \ - --opset_version=10 \ - --enable_onnx_checker=True -``` - -该命令会在 deploy/models/ResNet50_onnx 目录下生成 inference.onnx 文件,生成的文件可以基于 ONNX Runtime 进行推理,具体使用方式参考 [ONNX Runtime 官网](https://onnxruntime.ai/) - -## 基于 PaddleSeg 训练 DeepLabv3+ - -### 一、安装 PaddleSeg 代码库 - -```shell -git clone https://github.com/PaddlePaddle/PaddleSeg -b release/2.9 -cd PaddleSeg -python -m pip install -r requirements.txt -python -m pip install -e . -``` - -### 二、数据准备 - -请根据 [数据说明文档](https://github.com/PaddlePaddle/PaddleSeg/blob/release/2.9/docs/data/pre_data_cn.md#cityscapes%E6%95%B0%E6%8D%AE%E9%9B%86) 准备 Cityscapes 数据集,准备完成后解压到 PaddleSeg/data/目录下,目录结构如下: - -```shell -PaddleSeg/data/cityscapes -├── leftImg8bit -│ ├── train -│ ├── val -├── gtFine -│ ├── train -│ ├── val -``` - -### 三、模型训练 - -进入 PaddleSeg 目录下,执行如下命令启动 4 卡 MLU(0 ~ 3 号卡)训练,其中: - -* 参数 `--device` 指定的是即将运行的设备,这里需要传入的是 mlu ,通过指定该参数,PaddleSeg 调用飞桨的设备指定接口 `paddle.set_device` 来指定运行设备为 mlu,在进行模型训练时,飞桨将自动调用 mlu 算子用于执行模型计算。关于设备指定的更多细节,可以参考官方 api [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 - -* 参数 `--config configs/deeplabv3p/deeplabv3p_resnet50_os8_cityscapes_1024x512_80k.yml` 表示读取指定目录下的配置文件,配置文件中指定了模型结构,训练超参等所有训练模型需要用到的配置,该文件中指定的模型结构为 DeepLabv3+ - -```shell -python -u -m paddle.distributed.launch --devices 0,1,2,3 tools/train.py \ - --config configs/deeplabv3p/deeplabv3p_resnet50_os8_cityscapes_1024x512_80k.yml \ - --num_workers 8 \ - --save_dir output/deeplabv3p_resnet50 \ - --log_iters 10 \ - --device mlu \ - --do_eval \ - --save_interval 1000 \ - --seed 2048 -``` - -上述命令会在 PaddleSeg 目录下产生一个 output/deeplabv3p_resnet50 
目录,该目录会存放训练过程中的模型参数 - -### 四、模型导出 & 推理 - -#### 模型导出 - -训练完成后,最优指标对应的权重放在 output/deeplabv3p_resnet50/best_model 目录下,执行以下命令将模型转成 Paddle 静态图格式存储,以获得更好的推理性能: - -* export.py 执行的是 `动转静` 操作,飞桨框架会对代码进行分析,将动态图代码(灵活易用)转为 静态图模型(高效),以达到更加高效的推理性能 - -* 该操作会在指定 output/deeplabv3p_resnet50_inference_model 下生成 inference.pdiparams、inference.pdiparams.info、inference.pdmodel 3 个文件 - -```shell -python tools/export.py \ - --config configs/deeplabv3p/deeplabv3p_resnet50_os8_cityscapes_1024x512_80k.yml \ - --model_path output/deeplabv3p_resnet50/best_model/model.pdparams \ - --save_dir output/deeplabv3p_resnet50_inference_model -``` - -#### 基于 PaddleInference 推理 - -推理代码位于 PaddleSeg/deploy 目录下,执行下列命令进行 MLU 推理: - -* 该脚本将会加载上一步保存的静态图,使用飞桨预测库 PaddleInference 进行推理 - -* PaddleInference 内置了大量的高性能 Kernel,并且可以基于计算图分析,完成细粒度 OP 横向纵向融合,实现了高性能推理 - -```shell -wget https://paddleseg.bj.bcebos.com/dygraph/demo/cityscapes_demo.png - -python deploy/python/infer.py \ - --config output/deeplabv3p_resnet50_inference_model/deploy.yaml \ - --image_path ./cityscapes_demo.png \ - --save_dir ./output \ - --device "mlu" -``` diff --git a/docs/hardware_support/mlu/support_cn.md b/docs/hardware_support/mlu/support_cn.md index 90fdefde4a7..6259b70bd56 100644 --- a/docs/hardware_support/mlu/support_cn.md +++ b/docs/hardware_support/mlu/support_cn.md @@ -1,316 +1,45 @@ # 寒武纪 MLU 验证模型 -基于 2024Q2 devlop 版本,飞桨框架在寒武纪 MLU 上通过精度验证的模型情况如下: +飞桨框架在寒武纪 MLU 上通过精度验证的模型情况如下: | 模型库 | 模型名称 | 训练 | 推理 | -| - | - | - | - | -| PaddleClas | ResNet50 | √ | √ | -| PaddleClas | VGG16 | √ | √ | -| PaddleClas | VGG19 | √ | √ | -| PaddleClas | InceptionV4 | √ | √ | -| PaddleClas | MobileNetV3 | √ | √ | -| PaddleClas | AlexNet | 未测试 | √ | -| PaddleClas | CLIP_vit_base_patch16_224 | 未测试 | √ | -| PaddleClas | CSPDarkNet53 | 未测试 | √ | -| PaddleClas | CSWinTransformer_base_224 | 未测试 | √ | -| PaddleClas | CSWinTransformer_base_384 | 未测试 | √ | -| PaddleClas | CSWinTransformer_large_224 | 未测试 | √ | -| PaddleClas | CSWinTransformer_large_384 | 未测试 | √ | 
-| PaddleClas | CSWinTransformer_small_224 | 未测试 | √ | -| PaddleClas | CSWinTransformer_tiny_224 | 未测试 | √ | -| PaddleClas | ConvNeXt_small | 未测试 | √ | -| PaddleClas | ConvNeXt_tiny | 未测试 | √ | -| PaddleClas | CvT_13_224 | 未测试 | √ | -| PaddleClas | CvT_13_384 | 未测试 | √ | -| PaddleClas | CvT_21_224 | 未测试 | √ | -| PaddleClas | CvT_21_384 | 未测试 | √ | -| PaddleClas | CycleGAN | 未测试 | √ | -| PaddleClas | DLA102 | 未测试 | √ | -| PaddleClas | DLA102x | 未测试 | √ | -| PaddleClas | DLA102x2 | 未测试 | √ | -| PaddleClas | DLA169 | 未测试 | √ | -| PaddleClas | DLA34 | 未测试 | √ | -| PaddleClas | DLA46_c | 未测试 | √ | -| PaddleClas | DLA46x_c | 未测试 | √ | -| PaddleClas | DLA60 | 未测试 | √ | -| PaddleClas | DLA60x | 未测试 | √ | -| PaddleClas | DLA60x_c | 未测试 | √ | -| PaddleClas | DPN107 | 未测试 | √ | -| PaddleClas | DPN131 | 未测试 | √ | -| PaddleClas | DPN68 | 未测试 | √ | -| PaddleClas | DPN92 | 未测试 | √ | -| PaddleClas | DPN98 | 未测试 | √ | -| PaddleClas | DSNet_base | 未测试 | √ | -| PaddleClas | DSNet_small | 未测试 | √ | -| PaddleClas | DSNet_tiny | 未测试 | √ | -| PaddleClas | DarkNet53 | 未测试 | √ | -| PaddleClas | DeiT_base_patch16_224 | 未测试 | √ | -| PaddleClas | DeiT_base_patch16_384 | 未测试 | √ | -| PaddleClas | DeiT_small_patch16_224 | 未测试 | √ | -| PaddleClas | DeiT_tiny_patch16_224 | 未测试 | √ | -| PaddleClas | DenseNet121 | 未测试 | √ | -| PaddleClas | DenseNet161 | 未测试 | √ | -| PaddleClas | DenseNet169 | 未测试 | √ | -| PaddleClas | DenseNet201 | 未测试 | √ | -| PaddleClas | DenseNet264 | 未测试 | √ | -| PaddleClas | DistillationModel | 未测试 | √ | -| PaddleClas | ESNet_x0_25 | 未测试 | √ | -| PaddleClas | ESNet_x0_5 | 未测试 | √ | -| PaddleClas | ESNet_x0_75 | 未测试 | √ | -| PaddleClas | ESNet_x1_0 | 未测试 | √ | -| PaddleClas | EfficientNetB0 | 未测试 | √ | -| PaddleClas | EfficientNetB1 | 未测试 | √ | -| PaddleClas | EfficientNetB2 | 未测试 | √ | -| PaddleClas | EfficientNetB3 | 未测试 | √ | -| PaddleClas | EfficientNetB4 | 未测试 | √ | -| PaddleClas | EfficientNetB5 | 未测试 | √ | -| PaddleClas | EfficientNetB6 | 未测试 | √ | -| PaddleClas | 
EfficientNetB7 | 未测试 | √ | -| PaddleClas | GeneralRecognitionV2_PPLCNetV2_base_ultra | 未测试 | √ | -| PaddleClas | GhostNet_x0_5 | 未测试 | √ | -| PaddleClas | GhostNet_x1_0 | 未测试 | √ | -| PaddleClas | GhostNet_x1_3 | 未测试 | √ | -| PaddleClas | GoogLeNet | 未测试 | √ | -| PaddleClas | HRNet_W18_C | 未测试 | √ | -| PaddleClas | HRNet_W30_C | 未测试 | √ | -| PaddleClas | HRNet_W32_C | 未测试 | √ | -| PaddleClas | HRNet_W40_C | 未测试 | √ | -| PaddleClas | HRNet_W44_C | 未测试 | √ | -| PaddleClas | HRNet_W48_C | 未测试 | √ | -| PaddleClas | HRNet_W64_C | 未测试 | √ | -| PaddleClas | HarDNet39_ds | 未测试 | √ | -| PaddleClas | HarDNet68 | 未测试 | √ | -| PaddleClas | HarDNet68_ds | 未测试 | √ | -| PaddleClas | HarDNet85 | 未测试 | √ | -| PaddleClas | InceptionV3 | 未测试 | √ | -| PaddleClas | LeViT_128 | 未测试 | √ | -| PaddleClas | LeViT_128S | 未测试 | √ | -| PaddleClas | LeViT_192 | 未测试 | √ | -| PaddleClas | LeViT_256 | 未测试 | √ | -| PaddleClas | LeViT_384 | 未测试 | √ | -| PaddleClas | MetaBIN_ResNet50 | 未测试 | √ | -| PaddleClas | MicroNet_M0 | 未测试 | √ | -| PaddleClas | MicroNet_M1 | 未测试 | √ | -| PaddleClas | MicroNet_M2 | 未测试 | √ | -| PaddleClas | MicroNet_M3 | 未测试 | √ | -| PaddleClas | MixNet_L | 未测试 | √ | -| PaddleClas | MixNet_M | 未测试 | √ | -| PaddleClas | MixNet_S | 未测试 | √ | -| PaddleClas | MobileNeXt_x1_0 | 未测试 | √ | -| PaddleClas | MobileNetV1 | 未测试 | √ | -| PaddleClas | MobileNetV1_x0_25 | 未测试 | √ | -| PaddleClas | MobileNetV1_x0_5 | 未测试 | √ | -| PaddleClas | MobileNetV1_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV2 | 未测试 | √ | -| PaddleClas | MobileNetV2_x0_25 | 未测试 | √ | -| PaddleClas | MobileNetV2_x0_5 | 未测试 | √ | -| PaddleClas | MobileNetV2_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV2_x1_5 | 未测试 | √ | -| PaddleClas | MobileNetV2_x2_0 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x0_35 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x0_5 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x1_0 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x1_25 | 未测试 | √ | -| 
PaddleClas | MobileNetV3_small_x0_35 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x0_5 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x1_0 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x1_25 | 未测试 | √ | -| PaddleClas | MobileViTV2_x0_5 | 未测试 | √ | -| PaddleClas | MobileViTV2_x1_0 | 未测试 | √ | -| PaddleClas | MobileViTV2_x1_5 | 未测试 | √ | -| PaddleClas | MobileViTV2_x2_0 | 未测试 | √ | -| PaddleClas | MobileViTV3_S | 未测试 | √ | -| PaddleClas | MobileViTV3_S_L2 | 未测试 | √ | -| PaddleClas | MobileViTV3_XS | 未测试 | √ | -| PaddleClas | MobileViTV3_XS_L2 | 未测试 | √ | -| PaddleClas | MobileViTV3_XXS | 未测试 | √ | -| PaddleClas | MobileViTV3_XXS_L2 | 未测试 | √ | -| PaddleClas | MobileViTV3_x0_5 | 未测试 | √ | -| PaddleClas | MobileViTV3_x0_75 | 未测试 | √ | -| PaddleClas | MobileViTV3_x1_0 | 未测试 | √ | -| PaddleClas | MobileViT_S | 未测试 | √ | -| PaddleClas | MobileViT_XS | 未测试 | √ | -| PaddleClas | MobileViT_XXS | 未测试 | √ | -| PaddleClas | NextViT_base_224 | 未测试 | √ | -| PaddleClas | NextViT_base_384 | 未测试 | √ | -| PaddleClas | NextViT_large_224 | 未测试 | √ | -| PaddleClas | NextViT_large_384 | 未测试 | √ | -| PaddleClas | NextViT_small_224 | 未测试 | √ | -| PaddleClas | NextViT_small_384 | 未测试 | √ | -| PaddleClas | PPHGNet_small | 未测试 | √ | -| PaddleClas | PPHGNet_tiny | 未测试 | √ | -| PaddleClas | PPLCNetV2_base | 未测试 | √ | -| PaddleClas | PPLCNet_x0_25 | 未测试 | √ | -| PaddleClas | PPLCNet_x0_35 | 未测试 | √ | -| PaddleClas | PPLCNet_x0_5 | 未测试 | √ | -| PaddleClas | PPLCNet_x0_75 | 未测试 | √ | -| PaddleClas | PPLCNet_x1_0 | 未测试 | √ | -| PaddleClas | PPLCNet_x1_5 | 未测试 | √ | -| PaddleClas | PPLCNet_x2_0 | 未测试 | √ | -| PaddleClas | PPLCNet_x2_5 | 未测试 | √ | -| PaddleClas | PVT_V2_B0 | 未测试 | √ | -| PaddleClas | PVT_V2_B1 | 未测试 | √ | -| PaddleClas | PVT_V2_B2 | 未测试 | √ | -| PaddleClas | PVT_V2_B2_Linear | 未测试 | √ | -| PaddleClas | PVT_V2_B3 | 未测试 | √ | -| PaddleClas | PVT_V2_B4 | 未测试 | √ | -| PaddleClas | PVT_V2_B5 | 未测试 | √ | -| PaddleClas | ReXNet_1_0 | 未测试 | 
√ | -| PaddleClas | ReXNet_1_3 | 未测试 | √ | -| PaddleClas | ReXNet_1_5 | 未测试 | √ | -| PaddleClas | ReXNet_2_0 | 未测试 | √ | -| PaddleClas | ReXNet_3_0 | 未测试 | √ | -| PaddleClas | RepVGG_B3 | 未测试 | √ | -| PaddleClas | Res2Net101_vd_26w_4s | 未测试 | √ | -| PaddleClas | Res2Net200_vd_26w_4s | 未测试 | √ | -| PaddleClas | Res2Net50_14w_8s | 未测试 | √ | -| PaddleClas | Res2Net50_26w_4s | 未测试 | √ | -| PaddleClas | Res2Net50_vd_26w_4s | 未测试 | √ | -| PaddleClas | ResNeSt101 | 未测试 | √ | -| PaddleClas | ResNeSt50 | 未测试 | √ | -| PaddleClas | ResNeSt50_fast_1s1x64d | 未测试 | √ | -| PaddleClas | ResNeXt101_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt101_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt101_vd_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt101_vd_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt152_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt152_vd_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt152_vd_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_vd_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_vd_64x4d | 未测试 | √ | -| PaddleClas | ResNet101 | 未测试 | √ | -| PaddleClas | ResNet101_vd | 未测试 | √ | -| PaddleClas | ResNet152 | 未测试 | √ | -| PaddleClas | ResNet152_vd | 未测试 | √ | -| PaddleClas | ResNet18 | 未测试 | √ | -| PaddleClas | ResNet18_vd | 未测试 | √ | -| PaddleClas | ResNet200_vd | 未测试 | √ | -| PaddleClas | ResNet34 | 未测试 | √ | -| PaddleClas | ResNet34_vd | 未测试 | √ | -| PaddleClas | ResNet50_vd | 未测试 | √ | -| PaddleClas | SENet154_vd | 未测试 | √ | -| PaddleClas | SE_ResNeXt101_32x4d | 未测试 | √ | -| PaddleClas | SE_ResNeXt50_32x4d | 未测试 | √ | -| PaddleClas | SE_ResNeXt50_vd_32x4d | 未测试 | √ | -| PaddleClas | SE_ResNet18_vd | 未测试 | √ | -| PaddleClas | SE_ResNet34_vd | 未测试 | √ | -| PaddleClas | SE_ResNet50_vd | 未测试 | √ | -| PaddleClas | ShuffleNetV2_swish | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x0_25 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x0_33 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x0_5 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x1_0 
| 未测试 | √ | -| PaddleClas | ShuffleNetV2_x1_5 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x2_0 | 未测试 | √ | -| PaddleClas | SlowFast | 未测试 | √ | -| PaddleClas | SqueezeNet1_0 | 未测试 | √ | -| PaddleClas | SqueezeNet1_1 | 未测试 | √ | -| PaddleClas | SwinTransformer_base_patch4_window12_384 | 未测试 | √ | -| PaddleClas | SwinTransformer_base_patch4_window7_224 | 未测试 | √ | -| PaddleClas | SwinTransformer_large_patch4_window12_384 | 未测试 | √ | -| PaddleClas | SwinTransformer_large_patch4_window7_224 | 未测试 | √ | -| PaddleClas | SwinTransformer_small_patch4_window7_224 | 未测试 | √ | -| PaddleClas | SwinTransformer_tiny_patch4_window7_224 | 未测试 | √ | -| PaddleClas | TinyNet_A | 未测试 | √ | -| PaddleClas | TinyNet_B | 未测试 | √ | -| PaddleClas | TinyNet_C | 未测试 | √ | -| PaddleClas | TinyNet_D | 未测试 | √ | -| PaddleClas | TinyNet_E | 未测试 | √ | -| PaddleClas | UniFormer_base | 未测试 | √ | -| PaddleClas | UniFormer_base_ls | 未测试 | √ | -| PaddleClas | UniFormer_small | 未测试 | √ | -| PaddleClas | UniFormer_small_plus | 未测试 | √ | -| PaddleClas | UniFormer_small_plus_dim64 | 未测试 | √ | -| PaddleClas | VAN_B0 | 未测试 | √ | -| PaddleClas | VAN_B1 | 未测试 | √ | -| PaddleClas | VGG11 | 未测试 | √ | -| PaddleClas | VGG13 | 未测试 | √ | -| PaddleClas | ViT_base_patch16_224 | 未测试 | √ | -| PaddleClas | ViT_base_patch16_384 | 未测试 | √ | -| PaddleClas | ViT_base_patch32_384 | 未测试 | √ | -| PaddleClas | ViT_large_patch16_224 | 未测试 | √ | -| PaddleClas | ViT_large_patch16_384 | 未测试 | √ | -| PaddleClas | ViT_large_patch32_384 | 未测试 | √ | -| PaddleClas | ViT_small_patch16_224 | 未测试 | √ | -| PaddleClas | Xception41 | 未测试 | √ | -| PaddleClas | Xception41_deeplab | 未测试 | √ | -| PaddleClas | Xception65 | 未测试 | √ | -| PaddleClas | Xception65_deeplab | 未测试 | √ | -| PaddleClas | Xception71 | 未测试 | √ | -| PaddleClas | alt_gvt_base | 未测试 | √ | -| PaddleClas | alt_gvt_large | 未测试 | √ | -| PaddleClas | alt_gvt_small | 未测试 | √ | -| PaddleClas | pcpvt_base | 未测试 | √ | -| PaddleClas | pcpvt_large | 未测试 | √ | -| PaddleClas | pcpvt_small | 未测试 
| √ | -| PaddleDetection | YOLOv3 | √ | √ | -| PaddleDetection | SSD | √ | √ | -| PaddleDetection | dark_hrnet_w32_256x192 | 未测试 | √ | -| PaddleDetection | fairmot_hrnetv2_w18_dlafpn_30e_576x320 | 未测试 | √ | -| PaddleDetection | fcos_r50_fpn_1x_coco | 未测试 | √ | -| PaddleDetection | higherhrnet_hrnet_w32_512 | 未测试 | √ | -| PaddleDetection | hrnet_w32_256x192 | 未测试 | √ | -| PaddleDetection | picodet_lcnet_1_5x_416_coco | 未测试 | √ | -| PaddleDetection | picodet_s_320_coco | 未测试 | √ | -| PaddleDetection | ppyoloe_crn_s_300e_coco | 未测试 | √ | -| PaddleDetection | ppyoloe_plus_crn_s_80e_coco | 未测试 | √ | -| PaddleDetection | ppyoloe_plus_sod_crn_l_80e_coco | 未测试 | √ | -| PaddleDetection | ppyoloe_vit_base_csppan_cae_36e_coco | 未测试 | √ | -| PaddleDetection | ppyolov2_r50vd_dcn_365e_coco | 未测试 | √ | -| PaddleDetection | solov2_r50_enhance_coco | 未测试 | √ | -| PaddleDetection | tinypose_128x96 | 未测试 | √ | -| PaddleDetection | ttfnet_darknet53_1x_coco | 未测试 | √ | -| PaddleDetection | yolov5_s_300e_coco | 未测试 | √ | -| PaddleDetection | yolov7_tiny_300e_coco | 未测试 | √ | -| PaddleNLP | BERT | √ | √ | -| PaddleNLP | Transformer | √ | | -| PaddleNLP | Bi-LSTM | √ | √ | -| PaddleNLP | bisenetv2 | 未测试 | √ | -| PaddleOCR | OCR-Clas | √ | √ | -| PaddleOCR | OCR-E2E | √ | √ | -| PaddleOCR | ch_PP-OCRv2_det_0 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv3_det_0 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv3_rec | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv4_mobile_det_0 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv4_server_det_0 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv4_server_rec | 未测试 | √ | -| PaddleOCR | ch_ppocr_mobile_v2_0_det_0 | 未测试 | √ | -| PaddleOCR | ch_ppocr_server_v2_0_det_0 | 未测试 | √ | -| PaddleOCR | det_mv3_db_v2_0_0 | 未测试 | √ | -| PaddleOCR | det_r50_db_plusplus_0 | 未测试 | √ | -| PaddleOCR | det_r50_db_v2_0_0 | 未测试 | √ | -| PaddleOCR | det_r50_dcn_fce_ctw_v2_0_0 | 未测试 | √ | -| PaddleOCR | en_table_structure | 未测试 | √ | -| PaddleOCR | rec_abinet | 未测试 | √ | -| PaddleOCR | rec_mv3_none_none_ctc_v2_0 | 未测试 | √ | 
-| PaddleOCR | rec_r34_vd_none_none_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_resnet_rfl | 未测试 | √ | -| PaddleOCR | rec_svtrnet | 未测试 | √ | -| PaddleOCR | rec_vitstr | 未测试 | √ | -| PaddleOCR | slanet | 未测试 | √ | -| PaddleSeg | DeepLabV3+ | √ | √ | -| PaddleSeg | U-Net | √ | √ | -| PaddleSeg | deeplabv3p_resnet50_cityscapes | 未测试 | √ | -| PaddleSeg | fastscnn | 未测试 | √ | -| PaddleSeg | fcn_hrnetw18 | 未测试 | √ | -| PaddleSeg | fcn_hrnetw18_small | 未测试 | √ | -| PaddleSeg | fcn_uhrnetw18_small | 未测试 | √ | -| PaddleSeg | ocrnet_hrnetw18 | 未测试 | √ | -| PaddleSeg | ocrnet_hrnetw48 | 未测试 | √ | -| PaddleSeg | pfpnnet | 未测试 | √ | -| PaddleSeg | pp_liteseg_stdc2 | 未测试 | √ | -| PaddleSeg | pphumanseg_lite | 未测试 | √ | -| PaddleSeg | pphumanseg_server | 未测试 | √ | -| PaddleSeg | ppmatting | 未测试 | √ | -| PaddleSeg | seaformer_base | 未测试 | √ | -| PaddleVideo | AGCN | 未测试 | √ | -| PaddleVideo | AGCN2s | 未测试 | √ | -| PaddleVideo | BMN | 未测试 | √ | -| PaddleVideo | PP-TSM | 未测试 | √ | -| PaddleVideo | PP-TSN | 未测试 | √ | -| PaddleVideo | STGCN | 未测试 | √ | -| PaddleVideo | TSN | 未测试 | √ | +| - | - | - | - | +| PaddleX | ResNet18 | √ | √ | +| PaddleX | ResNet34 | √ | √ | +| PaddleX | ResNet50 | √ | √ | +| PaddleX | ResNet101 | √ | √ | +| PaddleX | ResNet152 | √ | √ | +| PaddleX | PPLCNet_x0_25 | √ | √ | +| PaddleX | PPLCNet_x0_35 | √ | √ | +| PaddleX | PPLCNet_x0_5 | √ | √ | +| PaddleX | PPLCNet_x0_75 | √ | √ | +| PaddleX | PPLCNet_x1_0 | √ | √ | +| PaddleX | PPLCNet_x1_5 | √ | √ | +| PaddleX | PPLCNet_x2_0 | √ | √ | +| PaddleX | PPLCNet_x2_5 | √ | √ | +| PaddleX | MobileNetV3_small_x0_35 | √ | √ | +| PaddleX | MobileNetV3_small_x0_5 | √ | √ | +| PaddleX | MobileNetV3_small_x0_75 | √ | √ | +| PaddleX | MobileNetV3_small_x1_0 | √ | √ | +| PaddleX | MobileNetV3_small_x1_25 | √ | √ | +| PaddleX | MobileNetV3_large_x0_35 | √ | √ | +| PaddleX | MobileNetV3_large_x0_5 | √ | √ | +| PaddleX | MobileNetV3_large_x0_75 | √ | √ | +| PaddleX | MobileNetV3_large_x1_0 | √ | √ | +| PaddleX | 
MobileNetV3_large_x1_25 | √ | √ | +| PaddleX | PP-HGNet_small | √ | √ | +| PaddleX | PP-YOLOE_plus-S | √ | √ | +| PaddleX | PP-YOLOE_plus-M | √ | √ | +| PaddleX | PP-YOLOE_plus-L | √ | √ | +| PaddleX | PP-YOLOE_plus-X | √ | √ | +| PaddleX | PicoDet-S | √ | √ | +| PaddleX | PicoDet-L | √ | √ | +| PaddleX | PP-LiteSeg-T | √ | √ | +| PaddleX | PP-OCRv4_mobile_det | √ | √ | +| PaddleX | PP-OCRv4_server_det | √ | √ | +| PaddleX | PP-OCRv4_mobile_rec | √ | √ | +| PaddleX | PP-OCRv4_server_rec | √ | √ | +| PaddleX | DLinear | √ | √ | +| PaddleX | RLinear | √ | √ | +| PaddleX | NLinear | √ | √ | +| PaddleNLP | BERT | √ | √ | diff --git a/docs/hardware_support/npu/index_cn.rst b/docs/hardware_support/npu/index_cn.rst index 01ed8340009..8b385e5d4b9 100644 --- a/docs/hardware_support/npu/index_cn.rst +++ b/docs/hardware_support/npu/index_cn.rst @@ -9,12 +9,12 @@ 飞桨框架支持基于昇腾 NPU 芯片的训练和推理,请参考以下内容快速体验: - `昇腾 NPU 基于框架的使用指南 <./paddle_tutorial_cn.html>`_ : 昇腾 NPU 基于框架的使用指南 -- `昇腾 NPU 基于套件的使用指南 <./suite_tutorial_cn.html>`_ : 昇腾 NPU 基于套件的使用指南 +- `昇腾 NPU 基于套件的使用指南 <./paddlex_tutorial_cn.html>`_ : 昇腾 NPU 基于套件的使用指南 - `昇腾 NPU 支持模型 <./support_cn.html>`_ : 昇腾 NPU 支持模型 .. 
toctree:: :hidden: paddle_tutorial_cn.md - suite_tutorial_cn.md + paddlex_tutorial_cn.md support_cn.md diff --git a/docs/hardware_support/npu/paddle_tutorial_cn.md b/docs/hardware_support/npu/paddle_tutorial_cn.md index be3aa559f48..a144106bcc4 100644 --- a/docs/hardware_support/npu/paddle_tutorial_cn.md +++ b/docs/hardware_support/npu/paddle_tutorial_cn.md @@ -8,9 +8,11 @@ * 考虑到环境差异性,我们推荐使用教程提供的标准镜像完成环境准备: - * 镜像链接:registry.baidubce.com/device/paddle-npu:cann80RC1-ubuntu20-x86_64-gcc84-py39 + * x86_64 镜像链接:ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann80T13-ubuntu20-x86_64-gcc84-py39 - * 镜像中已经默认安装了昇腾算子库 CANN-8.0.RC1 + * aarch64 镜像链接:ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann80T13-ubuntu20-aarch_64-gcc84-py39 + + * 镜像中已经默认安装了昇腾算子库 CANN-8.0.T13 * 昇腾驱动版本为 23.0.3 diff --git a/docs/hardware_support/npu/paddlex_tutorial_cn.md b/docs/hardware_support/npu/paddlex_tutorial_cn.md new file mode 100644 index 00000000000..97abc37e1c9 --- /dev/null +++ b/docs/hardware_support/npu/paddlex_tutorial_cn.md @@ -0,0 +1,299 @@ +# 昇腾 NPU 基于 PaddleX 的使用指南 + +## 环境准备 + +### 环境说明 + +* 本教程介绍如何基于昇腾 910B NPU 进行 ResNet50 / PP-YOLOE+ / DeepLabv3p 等不同领域模型的训练,总共需要 4 卡进行训练 + +* 考虑到环境差异性,我们推荐使用教程提供的标准镜像完成环境准备: + + * x86_64 镜像链接:ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann80T13-ubuntu20-x86_64-gcc84-py39 + + * aarch64 镜像链接:ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-npu:cann80T13-ubuntu20-aarch_64-gcc84-py39 + + * 镜像中已经默认安装了昇腾算子库 CANN-8.0.T13 + +* 昇腾驱动版本为 23.0.3 + +### 环境安装 + +1. 安装 PaddlePaddle + +*该命令会自动安装飞桨主框架每日自动构建的 nightly-build 版本* + +```shell +python -m pip install paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/ +``` + +2. 安装 CustomDevice + +*该命令会自动安装飞桨 Custom Device 每日自动构建的 nightly-build 版本* + +```shell +python -m pip install paddle-custom-npu -i https://www.paddlepaddle.org.cn/packages/nightly/npu/ +``` + +3. 
安装 PaddleX 代码库 + +```shell +git clone https://github.com/PaddlePaddle/PaddleX.git + +# 如果速度较慢,可以考虑从 gitee 拉取 +# git clone https://gitee.com/paddlepaddle/PaddleX.git + +cd PaddleX + +# 安装 PaddleX whl +# -e:以可编辑模式安装,当前项目的代码更改,都会直接作用到已经安装的 PaddleX Wheel +pip install -e . +``` + +## 基于 PaddleX 训练 ResNet50 + +### 一、安装 PaddleX 依赖 + +```shell +# 跳转到 PaddleX 根目录下 +cd /path/to/paddlex + +# 安装 PaddleX 相关依赖,由于我们使用的是图像分类模型,因此安装图像分类库 +paddlex --install PaddleClas + +# 完成安装后会有如下提示: +# All packages are installed. +``` + +### 二、数据准备 + +为了快速上手验证,我们基于 flowers 102 数据集进行快速体验: + +1. 下载数据集 + +```shell +# 跳转到 PaddleX 根目录下 +cd /path/to/paddlex + +# 下载并解压数据 +wget https://paddle-model-ecology.bj.bcebos.com/paddlex/data/cls_flowers_examples.tar -P ./dataset +tar -xf ./dataset/cls_flowers_examples.tar -C ./dataset/ +``` + +2. 数据校验 + +```shell +# PaddleX 支持对数据集进行校验,确保数据集格式符合 PaddleX 的相关要求。同时在数据校验时,能够对数据集进行分析,统计数据集的基本信息。 +python main.py -c paddlex/configs/image_classification/ResNet50.yaml \ + -o Global.mode=check_dataset \ + -o Global.dataset_dir=./dataset/cls_flowers_examples + +# 命令运行成功后会在 log 中打印出 Check dataset passed ! 
信息 +``` + +更多关于 PaddleX 数据集说明的内容,可以查看 [PaddleX 数据集校验](https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-beta/docs/tutorials/data/dataset_check.md) + +### 三、模型训练 + +进入 `PaddleX` 目录下,执行如下命令启动 4 卡 NPU(0 ~ 3 号卡)训练,其中: + +* 参数 `-o Global.device` 指定的是即将运行的设备,这里需要传入的是 `npu:0,1,2,3` ,通过指定该参数,PaddleX 调用飞桨的设备指定接口 `paddle.set_device` 来指定运行设备为 `npu` ,在进行模型训练时,飞桨将自动调用 npu 算子用于执行模型计算。关于设备指定的更多细节,可以参考官方 api [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 + +* 参数 `-c paddlex/configs/image_classification/ResNet50.yaml` 表示读取指定目录下的配置文件,配置文件中指定了模型结构,训练超参等所有训练模型需要用到的配置,该文件中指定的模型结构为 `ResNet50` + +```shell +python main.py -c paddlex/configs/image_classification/ResNet50.yaml \ + -o Global.mode=train \ + -o Global.dataset_dir=./dataset/cls_flowers_examples \ + -o Global.output=resnet50_output \ + -o Global.device="npu:0,1,2,3" +``` + +上述命令会在 `PaddleX` 目录下产生一个 `resnet50_output/` 目录,该目录会存放训练过程中的模型参数 + +### 四、模型推理 + +#### 基于 PaddleInference 推理 + +训练完成后,最优权重放在 `resnet50_output/best_model/` 目录下,其中 `inference.pdiparams`、`inference.pdiparams.info`、`inference.pdmodel` 3 个文件为静态图文件,用于推理使用,使用如下命令进行推理 + +```shell +python main.py -c paddlex/configs/image_classification/ResNet50.yaml \ + -o Global.mode=predict \ + -o Predict.model_dir="./resnet50_output/best_model" \ + -o Predict.input_path="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg" \ + -o Global.device="npu:0" +``` + +#### 转换 ONNX 模型 + +如果您有额外的部署需求需要基于 ONNX 实现,我们也提供了专用的工具用于导出 ONNX 模型,参考如下步骤,即可将第一步导出的静态图模型转换为 ONNX 模型: + +a. 安装环境 + +```shell +# 安装 paddle2onnx,该工具支持将 PaddleInference 模型转换为 ONNX 格式 +python -m pip install paddle2onnx +``` + +b. 
模型转换 + +```shell +paddle2onnx --model_dir=./resnet50_output/best_model/ \ + --model_filename=inference.pdmodel \ + --params_filename=inference.pdiparams \ + --save_file=./resnet50_output/best_model/inference.onnx \ + --enable_onnx_checker=True +``` + +该命令会在 `resnet50_output/best_model` 目录下生成 `inference.onnx` 文件 + +## 基于 PaddleX 训练 PP-YOLOE+ + +### 一、安装 PaddleX 依赖 + +```shell +# 跳转到 PaddleX 根目录下 +cd /path/to/paddlex + +# 安装 PaddleX 相关依赖,由于我们使用的是目标检测模型,因此安装目标检测库 +paddlex --install PaddleDetection + +# 完成安装后会有如下提示: +# All packages are installed. +``` + +### 二、数据准备 + +为了快速上手验证,我们基于 PaddleX 准备的 Demo 数据集进行快速体验: + +1. 下载数据集 + +```shell +# 跳转到 PaddleX 根目录下 +cd /path/to/paddlex + +# 下载并解压数据 +wget https://paddle-model-ecology.bj.bcebos.com/paddlex/data/det_coco_examples.tar -P ./dataset +tar -xf ./dataset/det_coco_examples.tar -C ./dataset/ +``` + +2. 数据校验 + +```shell +# PaddleX 支持对数据集进行校验,确保数据集格式符合 PaddleX 的相关要求。同时在数据校验时,能够对数据集进行分析,统计数据集的基本信息。 +python main.py -c paddlex/configs/object_detection/PP-YOLOE_plus-S.yaml \ + -o Global.mode=check_dataset \ + -o Global.dataset_dir=./dataset/det_coco_examples + +# 命令运行成功后会在 log 中打印出 Check dataset passed ! 
信息 +``` + +更多关于 PaddleX 数据集说明的内容,可以查看 [PaddleX 数据集校验](https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-beta/docs/tutorials/data/dataset_check.md) + +### 三、模型训练 + +进入 `PaddleX` 目录下,执行如下命令启动 4 卡 NPU(0 ~ 3 号卡)训练,其中: + +* 参数 `-o Global.device` 指定的是即将运行的设备,这里需要传入的是 `npu:0,1,2,3` ,通过指定该参数,PaddleX 调用飞桨的设备指定接口 `paddle.set_device` 来指定运行设备为 `npu` ,在进行模型训练时,飞桨将自动调用 npu 算子用于执行模型计算。关于设备指定的更多细节,可以参考官方 api [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 + +* 参数 `-c paddlex/configs/object_detection/PP-YOLOE_plus-S.yaml` 表示读取指定目录下的配置文件,配置文件中指定了模型结构,训练超参等所有训练模型需要用到的配置,该文件中指定的模型结构为 `PP-YOLOE_plus-S` + +```shell +python main.py -c paddlex/configs/object_detection/PP-YOLOE_plus-S.yaml \ + -o Global.mode=train \ + -o Global.dataset_dir=./dataset/det_coco_examples \ + -o Global.output=ppyolo_plus_s_output \ + -o Global.device="npu:0,1,2,3" +``` + +上述命令会在 `PaddleX` 目录下产生一个 `ppyolo_plus_s_output/` 目录,该目录会存放训练过程中的模型参数 + +### 四、模型推理 + +#### 基于 PaddleInference 推理 + +训练完成后,最优权重放在 `ppyolo_plus_s_output/best_model/` 目录下,其中 `inference.pdiparams`、`inference.pdiparams.info`、`inference.pdmodel` 3 个文件为静态图文件,用于推理使用,使用如下命令进行推理 + +```shell +python main.py -c paddlex/configs/object_detection/PP-YOLOE_plus-S.yaml \ + -o Global.mode=predict \ + -o Predict.model_dir="./ppyolo_plus_s_output/best_model" \ + -o Predict.input_path="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_object_detection_002.png" \ + -o Global.device="npu:0" +``` + +## 基于 PaddleX 训练 DeepLabv3+ + +### 一、安装 PaddleX 依赖 + +```shell +# 跳转到 PaddleX 根目录下 +cd /path/to/paddlex + +# 安装 PaddleX 相关依赖,由于我们使用的是图像分割模型,因此安装图像分割库 +paddlex --install PaddleSeg + +# 完成安装后会有如下提示: +# All packages are installed. +``` + +### 二、数据准备 + +为了快速上手验证,我们基于 PaddleX 准备的 Demo 数据集进行快速体验: + +1. 
下载数据集 + +```shell +# 跳转到 PaddleX 根目录下 +cd /path/to/paddlex + +# 下载并解压数据 +wget https://paddle-model-ecology.bj.bcebos.com/paddlex/data/seg_optic_examples.tar -P ./dataset +tar -xf ./dataset/seg_optic_examples.tar -C ./dataset/ +``` + +2. 数据校验 + +```shell +# PaddleX 支持对数据集进行校验,确保数据集格式符合 PaddleX 的相关要求。同时在数据校验时,能够对数据集进行分析,统计数据集的基本信息。 +python main.py -c paddlex/configs/semantic_segmentation/Deeplabv3_Plus-R50.yaml \ + -o Global.mode=check_dataset \ + -o Global.dataset_dir=./dataset/seg_optic_examples + +# 命令运行成功后会在 log 中打印出 Check dataset passed ! 信息 +``` + +更多关于 PaddleX 数据集说明的内容,可以查看 [PaddleX 数据集校验](https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-beta/docs/tutorials/data/dataset_check.md) + + +### 三、模型训练 + +进入 `PaddleX` 目录下,执行如下命令启动 4 卡 NPU(0 ~ 3 号卡)训练,其中: + +* 参数 `-o Global.device` 指定的是即将运行的设备,这里需要传入的是 `npu:0,1,2,3` ,通过指定该参数,PaddleX 调用飞桨的设备指定接口 `paddle.set_device` 来指定运行设备为 `npu` ,在进行模型训练时,飞桨将自动调用 npu 算子用于执行模型计算。关于设备指定的更多细节,可以参考官方 api [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 + +* 参数 `-c paddlex/configs/semantic_segmentation/Deeplabv3_Plus-R50.yaml` 表示读取指定目录下的配置文件,配置文件中指定了模型结构,训练超参等所有训练模型需要用到的配置,该文件中指定的模型结构为 `Deeplabv3_Plus-R50` + +```shell +python main.py -c paddlex/configs/semantic_segmentation/Deeplabv3_Plus-R50.yaml \ + -o Global.mode=train \ + -o Global.dataset_dir=./dataset/seg_optic_examples \ + -o Global.output=deeplabv3p_output \ + -o Global.device="npu:0,1,2,3" +``` + +上述命令会在 `PaddleX` 目录下产生一个 `deeplabv3p_output/` 目录,该目录会存放训练过程中的模型参数 + +### 四、模型推理 + +#### 基于 PaddleInference 推理 + +训练完成后,最优权重放在 `deeplabv3p_output/best_model/` 目录下,其中 `model/inference.pdiparams`、`model/inference.pdiparams.info`、`model/inference.pdmodel` 3 个文件为静态图文件,用于推理使用,使用如下命令进行推理 + +```shell +python main.py -c paddlex/configs/semantic_segmentation/Deeplabv3_Plus-R50.yaml \ + -o Global.mode=predict \ + -o Predict.model_dir="./deeplabv3p_output/best_model/model/" \ + -o 
Predict.input_path="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_semantic_segmentation_001.jpg" \ + -o Global.device="npu:0" +``` diff --git a/docs/hardware_support/npu/suite_tutorial_cn.md b/docs/hardware_support/npu/suite_tutorial_cn.md deleted file mode 100644 index a57f544f0e6..00000000000 --- a/docs/hardware_support/npu/suite_tutorial_cn.md +++ /dev/null @@ -1,300 +0,0 @@ -# 昇腾 NPU 基于套件的使用指南 - -## 环境准备 - -### 环境说明 - -* 本教程介绍如何基于昇腾 910B NPU 进行 ResNet50 的训练,总共需要 4 卡进行训练 - -* 考虑到环境差异性,我们推荐使用教程提供的标准镜像完成环境准备: - - * 镜像链接:registry.baidubce.com/device/paddle-npu:cann80RC1-ubuntu20-x86_64-gcc84-py39 - - * 镜像中已经默认安装了昇腾算子库 CANN-8.0.RC1 - -* 昇腾驱动版本为 23.0.3 - -### 环境安装 - -1. 安装 PaddlePaddle - -*该命令会自动安装飞桨主框架每日自动构建的 nightly-build 版本* - -```shell -python -m pip install paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/ -``` - -2. 安装 CustomDevice - -*该命令会自动安装飞桨 Custom Device 每日自动构建的 nightly-build 版本* - -```shell -python -m pip install paddle-custom-npu -i https://www.paddlepaddle.org.cn/packages/nightly/npu/ -``` - -## 基于 PaddleClas 训练 ResNet50 - -### 一、安装 PaddleClas 代码库 - -```shell -git clone https://github.com/PaddlePaddle/PaddleClas.git -b release/2.5.1 -cd PaddleClas -python -m pip install -r requirements.txt -python -m pip install . -``` - -### 二、数据准备 - -请根据 [数据说明文档](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.5.1/docs/zh_CN/models/ImageNet1k/ResNet.md#32-%E6%95%B0%E6%8D%AE%E5%87%86%E5%A4%87) 准备 ImageNet1k 数据集,准备完成后解压到 PaddleClas/dataset/目录下,目录结构如下: -``` -PaddleClas/dataset/ILSVRC2012/ -|_ train/ -| |_ n01440764 -| | |_ n01440764_10026.JPEG -| | |_ ... -| |_ ... -| | -| |_ n15075141 -| |_ ... -| |_ n15075141_9993.JPEG -|_ val/ -| |_ ILSVRC2012_val_00000001.JPEG -| |_ ... 
-| |_ ILSVRC2012_val_00050000.JPEG -|_ train_list.txt -|_ val_list.txt -``` - -### 三、模型训练 - -进入 PaddleClas 目录下,执行如下命令启动 4 卡 NPU(0 ~ 3 号卡)训练,其中: - - * 参数 `-o Global.device` 指定的是即将运行的设备,这里需要传入的是 `npu` ,通过指定该参数,PaddleClas 调用飞桨的设备指定接口 `paddle.set_device` 来指定运行设备为 npu,在进行模型训练时,飞桨将自动调用 npu 算子用于执行模型计算。关于设备指定的更多细节,可以参考官方 api [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 - - * 参数 `-c ./ppcls/configs/ImageNet/ResNet/ResNet50.yaml` 表示读取指定目录下的配置文件,配置文件中指定了模型结构,训练超参等所有训练模型需要用到的配置,该文件中指定的模型结构为 `ResNet50` - -```shell -python -u -m paddle.distributed.launch --devices 0,1,2,3 tools/train.py \ - -c ./ppcls/configs/ImageNet/ResNet/ResNet50.yaml \ - -o Global.output_dir="output/ResNet50" \ - -o Global.device="npu" -``` - -上述命令会在 PaddleClas 目录下产生一个 output/ResNet50 目录,该目录会存放训练过程中的模型参数 - -### 四、模型导出 & 推理 - -#### 模型导出 - -训练完成后,最后一个 epoch 的权重放在 output/ResNet50/ 目录下的 epoch_120.pdparams 文件中,执行以下命令将模型转成 Paddle 静态图格式存储,以获得更好的推理性能: - -* export_model.py 执行的是 `动转静` 操作,飞桨框架会对代码进行分析,将动态图代码(灵活易用)转为 静态图模型(高效),以达到更加高效的推理性能 - -* 该操作会在指定./deploy/models/ResNet50 下生成 inference.pdiparams、inference.pdiparams.info、inference.pdmodel 3 个文件 - -```shell -python tools/export_model.py \ - -c ./ppcls/configs/ImageNet/ResNet/ResNet50.yaml \ - -o Global.pretrained_model=./output/ResNet50/epoch_120 \ - -o Global.save_inference_dir=./deploy/models/ResNet50 -``` - -#### 基于 PaddleInference 推理 - -推理代码位于 PaddleClas/deploy 目录下,执行下列命令进行 NPU 推理: - -* 该脚本将会加载上一步保存的静态图,使用飞桨预测库 PaddleInference 进行推理 - -* PaddleInference 内置了大量的高性能 Kernel,并且可以基于计算图分析,完成细粒度 OP 横向纵向融合,实现了高性能推理 - -```shell -cd deploy -python python/predict_cls.py \ - -c ./configs/inference_cls.yaml \ - -o Global.inference_model_dir=./models/ResNet50 \ - -o Global.use_gpu=False \ - -o Global.use_npu=True \ - -o Global.infer_imgs=./images/ImageNet -``` - -#### 转换 ONNX 模型 - -如果您有额外的部署需求需要基于 ONNX 实现,我们也提供了专用的工具用于导出 ONNX 模型,参考如下步骤,即可将第一步导出的静态图模型转换为 ONNX 模型: - -a. 
安装环境 - -```shell -# 安装 paddle2onnx,该工具支持将 PaddleInference 模型转换为 ONNX 格式 -python -m pip install paddle2onnx -``` - -b. 模型转换 - -```shell -paddle2onnx --model_dir=./deploy/models/ResNet50/ \ - --model_filename=inference.pdmodel \ - --params_filename=inference.pdiparams \ - --save_file=./deploy/models/ResNet50_onnx/inference.onnx \ - --opset_version=10 \ - --enable_onnx_checker=True -``` - -该命令会在 deploy/models/ResNet50_onnx 目录下生成 inference.onnx 文件,生成的文件可以基于 ONNX Runtime 进行推理,具体使用方式参考 [ONNX Runtime 官网](https://onnxruntime.ai/) - -## 基于 PaddleDetection 训 PP-YOLOE+ - -### 一、安装 PaddleDetection 代码库 - -```shell -git clone https://github.com/PaddlePaddle/PaddleDetection.git -b release_2_7_npu -cd PaddleDetection -python -m pip install -r requirements.txt -python -m pip install -e . -``` - -### 二、数据准备 - -请根据 [数据说明文档](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.4/docs/tutorials/PrepareDataSet.md#coco%E6%95%B0%E6%8D%AE) 准备 COCO 2017 数据集,准备完成后解压到 PaddleDetection/dataset/目录下,目录结构如下: - -``` -PaddleDetection/dataset/coco/ -├── annotations -│ ├── instances_train2014.json -│ ├── instances_train2017.json -│ ├── instances_val2014.json -│ ├── instances_val2017.json -│ │ ... -├── train2017 -│ ├── 000000000009.jpg -│ ├── 000000580008.jpg -│ │ ... -├── val2017 -│ ├── 000000000139.jpg -│ ├── 000000000285.jpg -│ │ ... 
-``` - -### 三、模型训练 - -进入 PaddleDetection 目录下,执行如下命令启动 8 卡 NPU(0 ~ 7 号卡)训练,其中: - -* 参数 `--config configs/ppyoloe/ppyoloe_plus_crn_l_80e_coco.yml` 表示读取指定目录下的配置文件,配置文件中指定了模型结构,训练超参等所有训练模型需要用到的配置,该文件中指定的模型结构为 `PP-YOLOE+-l` - -```shell -export FLAGS_npu_jit_compile=0 -export FLAGS_use_stride_kernel=0 - -python -u -m paddle.distributed.launch --devices 0,1,2,3,4,5,6,7 \ - tools/train.py --eval --config configs/ppyoloe/ppyoloe_plus_crn_l_80e_coco.yml \ - --enable_ce True -``` - -上述命令会在 PaddleDetection 目录下产生一个 output/ppyoloe_plus_crn_l_80e_coco 目录,该目录会存放训练过程中的模型参数 - -### 四、模型导出 & 推理 - -#### 模型导出 - -训练完成后,最优指标对应的权重放在 output/ppyoloe_plus_crn_l_80e_coco/pipeline/best_model/ 目录下,执行以下命令将模型转成 Paddle 静态图格式存储,以获得更好的推理性能: - -* export_model.py 执行的是 动转静 操作,飞桨框架会对代码进行分析,将动态图代码(灵活易用)转为 静态图模型(高效),以达到更加高效的推理性能 - -* 该操作会在指定 output_inference/ppyoloe_plus_crn_l_80e_coco 下生成 inference.pdiparams、inference.pdiparams.info、inference.pdmodel 3 个文件 - -```shell -python tools/export_model.py -c configs/ppyoloe/ppyoloe_plus_crn_l_80e_coco.yml -o weights=output/ppyoloe_plus_crn_l_80e_coco/pipeline/best_model/model.pdparams -``` - -#### 基于 PaddleInference 推理 - -推理代码位于 PaddleDetection/deploy 目录下,执行下列命令进行 NPU 推理: - -* 该脚本将会加载上一步保存的静态图,使用飞桨预测库 PaddleInference 进行推理 - -* PaddleInference 内置了大量的高性能 Kernel,并且可以基于计算图分析,完成细粒度 OP 横向纵向融合,实现了高性能推理 - -```shell -python deploy/python/infer.py --model_dir=output_inference/ppyoloe_plus_crn_l_80e_coco --image_file=demo/000000014439_640x640.jpg --run_mode=paddle --device=npu -``` - -## 基于 PaddleSeg 训练 DeepLabv3+ - -### 一、安装 PaddleSeg 代码库 - -```shell -git clone https://github.com/PaddlePaddle/PaddleSeg -b npu_cann8.0 -cd PaddleSeg -python -m pip install -r requirements.txt -python -m pip install -e . 
-``` - -### 二、数据准备 - -请根据 [数据说明文档](https://github.com/PaddlePaddle/PaddleSeg/blob/release/2.9/docs/data/pre_data_cn.md#cityscapes%E6%95%B0%E6%8D%AE%E9%9B%86) 准备 Cityscapes 数据集,准备完成后解压到 PaddleSeg/data/目录下,目录结构如下: - -```shell -PaddleSeg/data/cityscapes -├── leftImg8bit -│ ├── train -│ ├── val -├── gtFine -│ ├── train -│ ├── val -``` - -### 三、模型训练 - -进入 PaddleSeg 目录下,执行如下命令启动 4 卡 NPU(0 ~ 3 号卡)训练,其中: - -* 参数 `--device` 指定的是即将运行的设备,这里需要传入的是 npu ,通过指定该参数,PaddleSeg 调用飞桨的设备指定接口 `paddle.set_device` 来指定运行设备为 npu,在进行模型训练时,飞桨将自动调用 npu 算子用于执行模型计算。关于设备指定的更多细节,可以参考官方 api [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 - -* 参数 `--config configs/deeplabv3p/deeplabv3p_resnet50_os8_cityscapes_1024x512_80k.yml` 表示读取指定目录下的配置文件,配置文件中指定了模型结构,训练超参等所有训练模型需要用到的配置,该文件中指定的模型结构为 DeepLabv3+ - -```shell -python -u -m paddle.distributed.launch --devices 0,1,2,3 tools/train.py \ - --config configs/deeplabv3p/deeplabv3p_resnet50_os8_cityscapes_1024x512_80k.yml \ - --num_workers 8 \ - --save_dir output/deeplabv3p_resnet50 \ - --log_iters 10 \ - --device npu \ - --do_eval \ - --save_interval 1000 \ - --seed 2048 -``` - -上述命令会在 PaddleSeg 目录下产生一个 output/deeplabv3p_resnet50 目录,该目录会存放训练过程中的模型参数 - -### 四、模型导出 & 推理 - -#### 模型导出 - -训练完成后,最优指标对应的权重放在 output/deeplabv3p_resnet50/best_model 目录下,执行以下命令将模型转成 Paddle 静态图格式存储,以获得更好的推理性能: - -* export.py 执行的是 `动转静` 操作,飞桨框架会对代码进行分析,将动态图代码(灵活易用)转为 静态图模型(高效),以达到更加高效的推理性能 - -* 该操作会在指定 output/deeplabv3p_resnet50_inference_model 下生成 inference.pdiparams、inference.pdiparams.info、inference.pdmodel 3 个文件 - -```shell -python tools/export.py \ - --config configs/deeplabv3p/deeplabv3p_resnet50_os8_cityscapes_1024x512_80k.yml \ - --model_path output/deeplabv3p_resnet50/best_model/model.pdparams \ - --save_dir output/deeplabv3p_resnet50_inference_model -``` - -#### 基于 PaddleInference 推理 - -推理代码位于 PaddleSeg/deploy 目录下,执行下列命令进行 NPU 推理: - -* 该脚本将会加载上一步保存的静态图,使用飞桨预测库 PaddleInference 进行推理 - -* PaddleInference 
内置了大量的高性能 Kernel,并且可以基于计算图分析,完成细粒度 OP 横向纵向融合,实现了高性能推理 - -```shell -wget https://paddleseg.bj.bcebos.com/dygraph/demo/cityscapes_demo.png - -python deploy/python/infer.py \ - --config output/deeplabv3p_resnet50_inference_model/deploy.yaml \ - --image_path ./cityscapes_demo.png \ - --save_dir ./output \ - --device "npu" -``` diff --git a/docs/hardware_support/npu/support_cn.md b/docs/hardware_support/npu/support_cn.md index da9a98814ea..f5f2c4064bc 100644 --- a/docs/hardware_support/npu/support_cn.md +++ b/docs/hardware_support/npu/support_cn.md @@ -1,287 +1,67 @@ # 昇腾 NPU 验证模型 -基于 2024Q2 devlop 版本,飞桨框架在昇腾 NPU 上通过精度验证的模型情况如下: +飞桨框架在昇腾 NPU 上通过精度验证的模型情况如下: | 模型库 | 模型名称 | 训练 | 推理 | | - | - | - | - | -| PaddleClas | ResNet50 | √ | √ | -| PaddleClas | ResNet18 | √ | √ | -| PaddleClas | ResNet101 | √ | √ | -| PaddleClas | ResNet34 | √ | √ | -| PaddleClas | ResNet152 | √ | √ | -| PaddleClas | MobileNetV3_small_x1_0 | √ | √ | -| PaddleClas | convnext | √ | √ | -| PaddleClas | MobilenetV2 | √ | √ | -| PaddleClas | mobilnetv2_x0_5 | √ | √ | -| PaddleClas | mobilnetv2_x0_25 | √ | √ | -| PaddleClas | AlexNet | 未测试 | √ | -| PaddleClas | CLIP_vit_base_patch16_224 | 未测试 | √ | -| PaddleClas | CSPDarkNet53 | 未测试 | √ | -| PaddleClas | CSWinTransformer_base_224 | 未测试 | √ | -| PaddleClas | CSWinTransformer_base_384 | 未测试 | √ | -| PaddleClas | CSWinTransformer_large_224 | 未测试 | √ | -| PaddleClas | CSWinTransformer_large_384 | 未测试 | √ | -| PaddleClas | CSWinTransformer_small_224 | 未测试 | √ | -| PaddleClas | CSWinTransformer_tiny_224 | 未测试 | √ | -| PaddleClas | ConvNeXt_small | 未测试 | √ | -| PaddleClas | ConvNeXt_tiny | 未测试 | √ | -| PaddleClas | CvT_13_224 | 未测试 | √ | -| PaddleClas | CvT_13_384 | 未测试 | √ | -| PaddleClas | CvT_21_224 | 未测试 | √ | -| PaddleClas | CvT_21_384 | 未测试 | √ | -| PaddleClas | DLA102 | 未测试 | √ | -| PaddleClas | DLA102x | 未测试 | √ | -| PaddleClas | DLA102x2 | 未测试 | √ | -| PaddleClas | DLA169 | 未测试 | √ | -| PaddleClas | DLA34 | 未测试 | √ | -| PaddleClas | DLA46_c | 未测试 | √ | 
-| PaddleClas | DLA46x_c | 未测试 | √ | -| PaddleClas | DLA60 | 未测试 | √ | -| PaddleClas | DLA60x | 未测试 | √ | -| PaddleClas | DLA60x_c | 未测试 | √ | -| PaddleClas | DPN107 | 未测试 | √ | -| PaddleClas | DPN131 | 未测试 | √ | -| PaddleClas | DPN68 | 未测试 | √ | -| PaddleClas | DPN92 | 未测试 | √ | -| PaddleClas | DPN98 | 未测试 | √ | -| PaddleClas | DSNet_base | 未测试 | √ | -| PaddleClas | DSNet_tiny | 未测试 | √ | -| PaddleClas | DarkNet53 | 未测试 | √ | -| PaddleClas | DeiT_base_patch16_224 | 未测试 | √ | -| PaddleClas | DeiT_base_patch16_384 | 未测试 | √ | -| PaddleClas | DeiT_small_patch16_224 | 未测试 | √ | -| PaddleClas | DeiT_tiny_patch16_224 | 未测试 | √ | -| PaddleClas | DenseNet121 | 未测试 | √ | -| PaddleClas | DenseNet161 | 未测试 | √ | -| PaddleClas | DenseNet169 | 未测试 | √ | -| PaddleClas | DenseNet201 | 未测试 | √ | -| PaddleClas | DenseNet264 | 未测试 | √ | -| PaddleClas | DistillationModel | 未测试 | √ | -| PaddleClas | ESNet_x0_25 | 未测试 | √ | -| PaddleClas | ESNet_x0_5 | 未测试 | √ | -| PaddleClas | ESNet_x0_75 | 未测试 | √ | -| PaddleClas | ESNet_x1_0 | 未测试 | √ | -| PaddleClas | EfficientNetB0 | 未测试 | √ | -| PaddleClas | EfficientNetB1 | 未测试 | √ | -| PaddleClas | EfficientNetB2 | 未测试 | √ | -| PaddleClas | EfficientNetB3 | 未测试 | √ | -| PaddleClas | EfficientNetB4 | 未测试 | √ | -| PaddleClas | EfficientNetB5 | 未测试 | √ | -| PaddleClas | EfficientNetB6 | 未测试 | √ | -| PaddleClas | EfficientNetB7 | 未测试 | √ | -| PaddleClas | GhostNet_x0_5 | 未测试 | √ | -| PaddleClas | GhostNet_x1_0 | 未测试 | √ | -| PaddleClas | GhostNet_x1_3 | 未测试 | √ | -| PaddleClas | GoogLeNet | 未测试 | √ | -| PaddleClas | HRNet_W18_C | 未测试 | √ | -| PaddleClas | HRNet_W30_C | 未测试 | √ | -| PaddleClas | HRNet_W32_C | 未测试 | √ | -| PaddleClas | HRNet_W40_C | 未测试 | √ | -| PaddleClas | HRNet_W44_C | 未测试 | √ | -| PaddleClas | HRNet_W48_C | 未测试 | √ | -| PaddleClas | HRNet_W64_C | 未测试 | √ | -| PaddleClas | HarDNet39_ds | 未测试 | √ | -| PaddleClas | HarDNet68 | 未测试 | √ | -| PaddleClas | HarDNet68_ds | 未测试 | √ | -| PaddleClas | HarDNet85 | 未测试 | √ | -| PaddleClas | 
InceptionV3 | 未测试 | √ | -| PaddleClas | InceptionV4 | 未测试 | √ | -| PaddleClas | LeViT_128 | 未测试 | √ | -| PaddleClas | LeViT_128S | 未测试 | √ | -| PaddleClas | LeViT_192 | 未测试 | √ | -| PaddleClas | LeViT_256 | 未测试 | √ | -| PaddleClas | LeViT_384 | 未测试 | √ | -| PaddleClas | MetaBIN_ResNet50 | 未测试 | √ | -| PaddleClas | MixNet_L | 未测试 | √ | -| PaddleClas | MixNet_M | 未测试 | √ | -| PaddleClas | MixNet_S | 未测试 | √ | -| PaddleClas | MobileNeXt_x1_0 | 未测试 | √ | -| PaddleClas | MobileNetV1 | 未测试 | √ | -| PaddleClas | MobileNetV1_x0_25 | 未测试 | √ | -| PaddleClas | MobileNetV1_x0_5 | 未测试 | √ | -| PaddleClas | MobileNetV1_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV2_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV2_x1_5 | 未测试 | √ | -| PaddleClas | MobileNetV2_x2_0 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x0_35 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x0_5 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x1_0 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x1_25 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x0_35 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x0_5 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x1_25 | 未测试 | √ | -| PaddleClas | MobileViTV2_x0_5 | 未测试 | √ | -| PaddleClas | MobileViTV2_x1_0 | 未测试 | √ | -| PaddleClas | MobileViTV2_x1_5 | 未测试 | √ | -| PaddleClas | MobileViTV2_x2_0 | 未测试 | √ | -| PaddleClas | MobileViTV3_S | 未测试 | √ | -| PaddleClas | MobileViTV3_S_L2 | 未测试 | √ | -| PaddleClas | MobileViTV3_XS | 未测试 | √ | -| PaddleClas | MobileViTV3_XS_L2 | 未测试 | √ | -| PaddleClas | MobileViTV3_XXS | 未测试 | √ | -| PaddleClas | MobileViTV3_XXS_L2 | 未测试 | √ | -| PaddleClas | MobileViTV3_x0_5 | 未测试 | √ | -| PaddleClas | MobileViTV3_x0_75 | 未测试 | √ | -| PaddleClas | MobileViTV3_x1_0 | 未测试 | √ | -| PaddleClas | MobileViT_S | 未测试 | √ | -| PaddleClas | MobileViT_XS | 未测试 | √ | -| PaddleClas | MobileViT_XXS | 未测试 | √ | -| PaddleClas | NextViT_base_224 | 未测试 | √ | -| PaddleClas | 
NextViT_base_384 | 未测试 | √ | -| PaddleClas | NextViT_large_224 | 未测试 | √ | -| PaddleClas | NextViT_large_384 | 未测试 | √ | -| PaddleClas | NextViT_small_224 | 未测试 | √ | -| PaddleClas | NextViT_small_384 | 未测试 | √ | -| PaddleClas | PPHGNet_small | 未测试 | √ | -| PaddleClas | PPHGNet_tiny | 未测试 | √ | -| PaddleClas | PPLCNetV2_base | 未测试 | √ | -| PaddleClas | PPLCNet_x0_25 | 未测试 | √ | -| PaddleClas | PPLCNet_x0_35 | 未测试 | √ | -| PaddleClas | PPLCNet_x0_5 | 未测试 | √ | -| PaddleClas | PPLCNet_x0_75 | 未测试 | √ | -| PaddleClas | PPLCNet_x1_0 | 未测试 | √ | -| PaddleClas | PPLCNet_x1_5 | 未测试 | √ | -| PaddleClas | PPLCNet_x2_0 | 未测试 | √ | -| PaddleClas | PPLCNet_x2_5 | 未测试 | √ | -| PaddleClas | PVT_V2_B0 | 未测试 | √ | -| PaddleClas | PVT_V2_B1 | 未测试 | √ | -| PaddleClas | PVT_V2_B2 | 未测试 | √ | -| PaddleClas | PVT_V2_B2_Linear | 未测试 | √ | -| PaddleClas | PVT_V2_B3 | 未测试 | √ | -| PaddleClas | PVT_V2_B4 | 未测试 | √ | -| PaddleClas | PVT_V2_B5 | 未测试 | √ | -| PaddleClas | ReXNet_1_0 | 未测试 | √ | -| PaddleClas | ReXNet_1_3 | 未测试 | √ | -| PaddleClas | ReXNet_1_5 | 未测试 | √ | -| PaddleClas | ReXNet_2_0 | 未测试 | √ | -| PaddleClas | ReXNet_3_0 | 未测试 | √ | -| PaddleClas | RepVGG_B3 | 未测试 | √ | -| PaddleClas | Res2Net101_vd_26w_4s | 未测试 | √ | -| PaddleClas | Res2Net200_vd_26w_4s | 未测试 | √ | -| PaddleClas | Res2Net50_14w_8s | 未测试 | √ | -| PaddleClas | Res2Net50_26w_4s | 未测试 | √ | -| PaddleClas | Res2Net50_vd_26w_4s | 未测试 | √ | -| PaddleClas | ResNeSt101 | 未测试 | √ | -| PaddleClas | ResNeSt50 | 未测试 | √ | -| PaddleClas | ResNeSt50_fast_1s1x64d | 未测试 | √ | -| PaddleClas | ResNeXt101_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt101_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt101_vd_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt101_vd_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt152_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt152_vd_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt152_vd_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_vd_32x4d | 未测试 | √ 
| -| PaddleClas | ResNeXt50_vd_64x4d | 未测试 | √ | -| PaddleClas | ResNet101_vd | 未测试 | √ | -| PaddleClas | ResNet152_vd | 未测试 | √ | -| PaddleClas | ResNet18_vd | 未测试 | √ | -| PaddleClas | ResNet200_vd | 未测试 | √ | -| PaddleClas | ResNet34_vd | 未测试 | √ | -| PaddleClas | ResNet50_vd | 未测试 | √ | -| PaddleClas | SENet154_vd | 未测试 | √ | -| PaddleClas | SE_ResNeXt101_32x4d | 未测试 | √ | -| PaddleClas | SE_ResNeXt50_32x4d | 未测试 | √ | -| PaddleClas | SE_ResNeXt50_vd_32x4d | 未测试 | √ | -| PaddleClas | SE_ResNet18_vd | 未测试 | √ | -| PaddleClas | SE_ResNet34_vd | 未测试 | √ | -| PaddleClas | SE_ResNet50_vd | 未测试 | √ | -| PaddleClas | ShuffleNetV2_swish | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x0_25 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x0_33 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x0_5 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x1_0 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x1_5 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x2_0 | 未测试 | √ | -| PaddleClas | SqueezeNet1_0 | 未测试 | √ | -| PaddleClas | SqueezeNet1_1 | 未测试 | √ | -| PaddleClas | SwinTransformerV2_base_patch4_window16_256 | 未测试 | √ | -| PaddleClas | SwinTransformerV2_base_patch4_window8_256 | 未测试 | √ | -| PaddleClas | SwinTransformerV2_large_patch4_window16_256 | 未测试 | √ | -| PaddleClas | SwinTransformerV2_small_patch4_window16_256 | 未测试 | √ | -| PaddleClas | SwinTransformerV2_small_patch4_window8_256 | 未测试 | √ | -| PaddleClas | SwinTransformerV2_tiny_patch4_window16_256 | 未测试 | √ | -| PaddleClas | SwinTransformerV2_tiny_patch4_window8_256 | 未测试 | √ | -| PaddleClas | SwinTransformer_base_patch4_window12_384 | 未测试 | √ | -| PaddleClas | SwinTransformer_base_patch4_window7_224 | 未测试 | √ | -| PaddleClas | SwinTransformer_large_patch4_window12_384 | 未测试 | √ | -| PaddleClas | SwinTransformer_large_patch4_window7_224 | 未测试 | √ | -| PaddleClas | SwinTransformer_small_patch4_window7_224 | 未测试 | √ | -| PaddleClas | SwinTransformer_tiny_patch4_window7_224 | 未测试 | √ | -| PaddleClas | TNT_small | 未测试 | √ | -| PaddleClas | TinyNet_A | 未测试 | √ 
| -| PaddleClas | TinyNet_B | 未测试 | √ | -| PaddleClas | TinyNet_C | 未测试 | √ | -| PaddleClas | TinyNet_D | 未测试 | √ | -| PaddleClas | TinyNet_E | 未测试 | √ | -| PaddleClas | VAN_B0 | 未测试 | √ | -| PaddleClas | VAN_B1 | 未测试 | √ | -| PaddleClas | VGG13 | 未测试 | √ | -| PaddleClas | VGG16 | 未测试 | √ | -| PaddleClas | VGG19 | 未测试 | √ | -| PaddleClas | ViT_base_patch16_224 | 未测试 | √ | -| PaddleClas | ViT_base_patch16_384 | 未测试 | √ | -| PaddleClas | ViT_base_patch32_384 | 未测试 | √ | -| PaddleClas | ViT_large_patch16_224 | 未测试 | √ | -| PaddleClas | ViT_large_patch16_384 | 未测试 | √ | -| PaddleClas | ViT_large_patch32_384 | 未测试 | √ | -| PaddleClas | ViT_small_patch16_224 | 未测试 | √ | -| PaddleClas | Xception41 | 未测试 | √ | -| PaddleClas | Xception41_deeplab | 未测试 | √ | -| PaddleClas | Xception65 | 未测试 | √ | -| PaddleClas | Xception65_deeplab | 未测试 | √ | -| PaddleClas | Xception71 | 未测试 | √ | -| PaddleClas | alt_gvt_base | 未测试 | √ | -| PaddleClas | alt_gvt_large | 未测试 | √ | -| PaddleClas | alt_gvt_small | 未测试 | √ | -| PaddleClas | cae_base_patch16_224 | 未测试 | √ | -| PaddleClas | pcpvt_base | 未测试 | √ | -| PaddleClas | pcpvt_large | 未测试 | √ | -| PaddleClas | pcpvt_small | 未测试 | √ | -| PaddleDetection | ppyoloe_vit_base_csppan_cae_36e_coco | 未测试 | √ | -| PaddleDetection | solov2_r50_enhance_coco | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv2_det_0 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv2_rec | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv3_det_0 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv4_mobile_det_0 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv4_server_det_0 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv4_server_rec | 未测试 | √ | -| PaddleOCR | ch_ppocr_mobile_v2_0_rec | 未测试 | √ | -| PaddleOCR | ch_ppocr_server_v2_0_rec | 未测试 | √ | -| PaddleOCR | det_mv3_db_v2_0_0 | 未测试 | √ | -| PaddleOCR | det_r50_db_plusplus_0 | 未测试 | √ | -| PaddleOCR | en_table_structure | 未测试 | √ | -| PaddleOCR | rec_mv3_none_bilstm_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_mv3_none_none_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_mv3_tps_bilstm_att_v2_0 | 未测试 | √ | 
-| PaddleOCR | rec_mv3_tps_bilstm_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_r32_gaspin_bilstm_att | 未测试 | √ | -| PaddleOCR | rec_r34_vd_none_bilstm_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_r34_vd_none_none_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_r34_vd_tps_bilstm_att_v2_0 | 未测试 | √ | -| PaddleOCR | rec_r34_vd_tps_bilstm_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_svtrnet | 未测试 | √ | -| PaddleOCR | rec_vitstr | 未测试 | √ | -| PaddleOCR | slanet | 未测试 | √ | -| PaddleSeg | Deeplabv3_R50 | √ | √ | -| PaddleSeg | Deeplabv3_R101 | √ | √ | -| PaddleSeg | Deeplabv3Plus-R50 | √ | √ | -| PaddleSeg | Deeplabv3Plus-R101 | √ | √ | -| PaddleSeg | fcn_hrnetw18 | 未测试 | √ | -| PaddleSeg | fcn_hrnetw18_small | 未测试 | √ | -| PaddleSeg | fcn_uhrnetw18_small | 未测试 | √ | -| PaddleSeg | ocrnet_hrnetw48 | 未测试 | √ | -| PaddleSeg | pfpnnet | 未测试 | √ | -| PaddleSeg | pphumanseg_lite | 未测试 | √ | -| PaddleSeg | pphumanseg_server | 未测试 | √ | -| PaddleSeg | ppmatting | 未测试 | √ | -| PaddleNLP | Bert | √ | √ | -| PaddleTS | DLinear | √ | √ | -| PaddleTS | RLinear | √ | √ | -| PaddleTS | NLinear | √ | √ | -| PaddleVideo | AGCN | 未测试 | √ | -| PaddleVideo | STGCN | 未测试 | √ | +| PaddleX | ResNet18 | √ | √ | +| PaddleX | ResNet34 | √ | √ | +| PaddleX | ResNet50 | √ | √ | +| PaddleX | ResNet101 | √ | √ | +| PaddleX | ResNet152 | √ | √ | +| PaddleX | PPLCNet_x0_25 | √ | √ | +| PaddleX | PPLCNet_x0_35 | √ | √ | +| PaddleX | PPLCNet_x0_5 | √ | √ | +| PaddleX | PPLCNet_x0_75 | √ | √ | +| PaddleX | PPLCNet_x1_0 | √ | √ | +| PaddleX | PPLCNet_x1_5 | √ | √ | +| PaddleX | PPLCNet_x2_0 | √ | √ | +| PaddleX | PPLCNet_x2_5 | √ | √ | +| PaddleX | MobileNetV3_small_x0_35 | √ | √ | +| PaddleX | MobileNetV3_small_x0_5 | √ | √ | +| PaddleX | MobileNetV3_small_x0_75 | √ | √ | +| PaddleX | MobileNetV3_small_x1_0 | √ | √ | +| PaddleX | MobileNetV3_small_x1_25 | √ | √ | +| PaddleX | MobileNetV3_large_x0_35 | √ | √ | +| PaddleX | MobileNetV3_large_x0_5 | √ | √ | +| PaddleX | MobileNetV3_large_x0_75 | √ | √ | +| PaddleX | 
MobileNetV3_large_x1_0 | √ | √ | +| PaddleX | MobileNetV3_large_x1_25 | √ | √ | +| PaddleX | ConvNeXt_tiny | √ | √ | +| PaddleX | MobilenetV2_x0_25 | √ | √ | +| PaddleX | MobilenetV2_x0_5 | √ | √ | +| PaddleX | MobileNetV2_x1_0 | √ | √ | +| PaddleX | MobileNetV2_x1_5 | √ | √ | +| PaddleX | MobileNetV2_x2_0 | √ | √ | +| PaddleX | SwinTransformer_base_patch4_window7_224 | √ | √ | +| PaddleX | PP-HGNet_small | √ | √ | +| PaddleX | PP-HGNetV2-B0 | √ | √ | +| PaddleX | PP-HGNetV2-B4 | √ | √ | +| PaddleX | PP-HGNetV2-B6 | √ | √ | +| PaddleX | CLIP_vit_base_patch16_224 | √ | √ | +| PaddleX | CLIP_vit_large_patch14_224 | √ | √ | +| PaddleX | PP-YOLOE_plus-S | √ | √ | +| PaddleX | PP-YOLOE_plus-M | √ | √ | +| PaddleX | PP-YOLOE_plus-L | √ | √ | +| PaddleX | PP-YOLOE_plus-X | √ | √ | +| PaddleX | RT-DETR-L | √ | √ | +| PaddleX | RT-DETR-H | √ | √ | +| PaddleX | RT-DETR-X | √ | √ | +| PaddleX | RT-DETR-R18 | √ | √ | +| PaddleX | RT-DETR-R50 | √ | √ | +| PaddleX | PicoDet-S | √ | √ | +| PaddleX | PicoDet-L | √ | √ | +| PaddleX | Deeplabv3-R50 | √ | √ | +| PaddleX | Deeplabv3-R101 | √ | √ | +| PaddleX | Deeplabv3_Plus-R50 | √ | √ | +| PaddleX | Deeplabv3_Plus-R101 | √ | √ | +| PaddleX | PP-LiteSeg-T | √ | √ | +| PaddleX | OCRNet_HRNet-W48 | √ | √ | +| PaddleX | PP-OCRv4_mobile_det | √ | √ | +| PaddleX | PP-OCRv4_server_det | √ | √ | +| PaddleX | PP-OCRv4_mobile_rec | √ | √ | +| PaddleX | PP-OCRv4_server_rec | √ | √ | +| PaddleX | DLinear | √ | √ | +| PaddleX | RLinear | √ | √ | +| PaddleX | NLinear | √ | √ | +| PaddleNLP | BERT | √ | √ | diff --git a/docs/hardware_support/xpu/index_cn.rst b/docs/hardware_support/xpu/index_cn.rst index aa6ac11000c..93360342c41 100644 --- a/docs/hardware_support/xpu/index_cn.rst +++ b/docs/hardware_support/xpu/index_cn.rst @@ -4,17 +4,21 @@ 昆仑芯 XPU 芯片 #################### -百度昆仑芯 AI 计算处理器(Baidu KUNLUN AI Computing Processor)是百度集十年 AI 产业技术实践于 2019 年推出的全功能 AI 芯片。基于自主研发的先进 XPU 架构,为云端和边缘端的人工智能业务而设计。 百度昆仑芯与飞桨及其他国产软硬件强强组合,打造一个全面领先的国产化 AI 技术生态,部署和应用于诸多 
“人工智能+“的行业领域,包括智能云和高性能计算,智慧制造、智慧城市和安防等。更多昆仑芯 XPU 芯片详情及技术指标请 `点击这里 `_ 。 +昆仑芯是百度集十年 AI 产业技术实践、自主研发的 AI 芯片。该产品拥有创新的 XPU 架构、领先的产品竞争力及丰富的场景落地实践等优势,目前已被广泛部署在百度搜索、小度、大模型等业务并在百度外有数百家客户,将 AI 算力赋能千行百业智能化升级。 + +昆仑芯和飞桨的适配自 2018 年启动,当前已携手共同完成了一套端到端的 AI 计算系统解决方案,并致力于携手打造一个全栈式软硬一体的 AI 生态,夯实算力基础设施建设。“昆仑芯+飞桨”方案已在智慧金融、工业质检等领域成功部署落地。 + +更多昆仑芯 XPU 芯片详情及技术指标请 `点击这里 `_ 。 飞桨框架支持基于昆仑芯 XPU 芯片的训练和推理,请参考以下内容快速体验: - `昆仑芯 XPU 基于框架的使用指南 <./paddle_tutorial_cn.html>`_ : 昆仑芯 XPU 基于框架的使用指南 -- `昆仑芯 XPU 基于套件的使用指南 <./suite_tutorial_cn.html>`_ : 昆仑芯 XPU 基于套件的使用指南 +- `昆仑芯 XPU 基于套件的使用指南 <./paddlex_tutorial_cn.html>`_ : 昆仑芯 XPU 基于套件的使用指南 - `昆仑芯 XPU 支持模型 <./support_cn.html>`_ : 昆仑芯 XPU 支持模型 .. toctree:: :hidden: paddle_tutorial_cn.md - suite_tutorial_cn.md + paddlex_tutorial_cn.md support_cn.md diff --git a/docs/hardware_support/xpu/paddle_tutorial_cn.md b/docs/hardware_support/xpu/paddle_tutorial_cn.md index dddea1966ee..05b72fa6ade 100644 --- a/docs/hardware_support/xpu/paddle_tutorial_cn.md +++ b/docs/hardware_support/xpu/paddle_tutorial_cn.md @@ -8,7 +8,7 @@ * 考虑到环境差异性,我们推荐使用教程提供的标准镜像完成环境准备: - * 镜像链接: registry.baidubce.com/device/paddle-xpu:ubuntu20-x86_64-gcc84-py310 + * 镜像链接: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-xpu:ubuntu20-x86_64-gcc84-py310 ### 环境安装 diff --git a/docs/hardware_support/xpu/paddlex_tutorial_cn.md b/docs/hardware_support/xpu/paddlex_tutorial_cn.md new file mode 100644 index 00000000000..775cc929371 --- /dev/null +++ b/docs/hardware_support/xpu/paddlex_tutorial_cn.md @@ -0,0 +1,136 @@ +# 昆仑芯 XPU 基于 PaddleX 的使用指南 + +## 环境准备 + +### 环境说明 + +* 本教程介绍如何基于昆仑芯 XPU 进行 ResNet50 的训练,总共需要 4 卡进行训练 + +* 考虑到环境差异性,我们推荐使用教程提供的标准镜像完成环境准备: + + * 镜像链接: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/device/paddle-xpu:ubuntu20-x86_64-gcc84-py310 + +### 环境安装 + +1. 安装 PaddlePaddle + +*该命令会自动安装飞桨主框架每日自动构建的 nightly-build 版本* + +*由于 xpu 代码位于飞桨主框架中,因此我们不需要安装额外的 Custom Device 包* + +```shell +python -m pip install paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu/ +``` + +2. 
安装 PaddleX 代码库 + +```shell +git clone https://github.com/PaddlePaddle/PaddleX.git + +# 如果速度较慢,可以考虑从 gitee 拉取 +# git clone https://gitee.com/paddlepaddle/PaddleX.git + +cd PaddleX + +# 安装 PaddleX whl +# -e:以可编辑模式安装,当前项目的代码更改,都会直接作用到已经安装的 PaddleX Wheel +pip install -e . +``` + +## 基于 PaddleX 训练 ResNet50 + +### 一、安装 PaddleX 依赖 + +```shell +# 跳转到 PaddleX 根目录下 +cd /path/to/paddlex + +# 安装 PaddleX 相关依赖,由于我们使用的是图像分类模型,因此安装图像分类库 +paddlex --install PaddleClas + +# 完成安装后会有如下提示: +# All packages are installed. +``` + +### 二、数据准备 + +为了快速上手验证,我们基于 flowers 102 数据集进行快速体验: + +1. 下载数据集 + +```shell +# 跳转到 PaddleX 根目录下 +cd /path/to/paddlex + +# 下载并解压数据 +wget https://paddle-model-ecology.bj.bcebos.com/paddlex/data/cls_flowers_examples.tar -P ./dataset +tar -xf ./dataset/cls_flowers_examples.tar -C ./dataset/ +``` + +2. 数据校验 + +```shell +# PaddleX 支持对数据集进行校验,确保数据集格式符合 PaddleX 的相关要求。同时在数据校验时,能够对数据集进行分析,统计数据集的基本信息。 +python main.py -c paddlex/configs/image_classification/ResNet50.yaml \ + -o Global.mode=check_dataset \ + -o Global.dataset_dir=./dataset/cls_flowers_examples + +# 命令运行成功后会在 log 中打印出 Check dataset passed ! 
信息 +``` + +更多关于 PaddleX 数据集说明的内容,可以查看 [PaddleX 数据集校验](https://github.com/PaddlePaddle/PaddleX/blob/release/3.0-beta/docs/tutorials/data/dataset_check.md) + +### 三、模型训练 + +进入 `PaddleX` 目录下,执行如下命令启动 4 卡 XPU(0 ~ 3 号卡)训练,其中: + +* 参数 `-o Global.device` 指定的是即将运行的设备,这里需要传入的是 `xpu:0,1,2,3` ,通过指定该参数,PaddleX 调用飞桨的设备指定接口 `paddle.set_device` 来指定运行设备为 `xpu` ,在进行模型训练时,飞桨将自动调用 xpu 算子用于执行模型计算。关于设备指定的更多细节,可以参考官方 api [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 + +* 参数 `-c paddlex/configs/image_classification/ResNet50.yaml` 表示读取指定目录下的配置文件,配置文件中指定了模型结构,训练超参等所有训练模型需要用到的配置,该文件中指定的模型结构为 `ResNet50` + +```shell +python main.py -c paddlex/configs/image_classification/ResNet50.yaml \ + -o Global.mode=train \ + -o Global.dataset_dir=./dataset/cls_flowers_examples \ + -o Global.output=resnet50_output \ + -o Global.device="xpu:0,1,2,3" +``` + +上述命令会在 `PaddleX` 目录下产生一个 `resnet50_output/` 目录,该目录会存放训练过程中的模型参数 + +### 四、模型推理 + +#### 基于 PaddleInference 推理 + +训练完成后,最优权重放在 `resnet50_output/best_model/` 目录下,其中 `inference.pdiparams`、`inference.pdiparams.info`、`inference.pdmodel` 3 个文件为静态图文件,用于推理使用,使用如下命令进行推理 + +```shell +python main.py -c paddlex/configs/image_classification/ResNet50.yaml \ + -o Global.mode=predict \ + -o Predict.model_dir="./resnet50_output/best_model" \ + -o Predict.input_path="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg" \ + -o Global.device="xpu:0" +``` + +#### 转换 ONNX 模型 + +如果您有额外的部署需求需要基于 ONNX 实现,我们也提供了专用的工具用于导出 ONNX 模型,参考如下步骤,即可将第一步导出的静态图模型转换为 ONNX 模型: + +a. 安装环境 + +```shell +# 安装 paddle2onnx,该工具支持将 PaddleInference 模型转换为 ONNX 格式 +python -m pip install paddle2onnx +``` + +b. 
模型转换 + +```shell +paddle2onnx --model_dir=./resnet50_output/best_model/ \ + --model_filename=inference.pdmodel \ + --params_filename=inference.pdiparams \ + --save_file=./resnet50_output/best_model/inference.onnx \ + --enable_onnx_checker=True +``` + +该命令会在 `resnet50_output/best_model` 目录下生成 `inference.onnx` 文件 diff --git a/docs/hardware_support/xpu/suite_tutorial_cn.md b/docs/hardware_support/xpu/suite_tutorial_cn.md deleted file mode 100644 index a21f76d99e0..00000000000 --- a/docs/hardware_support/xpu/suite_tutorial_cn.md +++ /dev/null @@ -1,134 +0,0 @@ -# 昆仑芯 XPU 基于套件的使用指南 - -## 环境准备 - -### 环境说明 - -* 本教程介绍如何基于昆仑芯 XPU 进行 ResNet50 的训练,总共需要 4 卡进行训练 - -* 考虑到环境差异性,我们推荐使用教程提供的标准镜像完成环境准备: - - * 镜像链接: registry.baidubce.com/device/paddle-xpu:ubuntu20-x86_64-gcc84-py310 - -### 环境安装 - -安装 PaddlePaddle - -*该命令会自动安装飞桨主框架每日自动构建的 nightly-build 版本* - -*由于 xpu 代码位于飞桨主框架中,因此我们不需要安装额外的 Custom Device 包* - -```shell -python -m pip install paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/packages/nightly/xpu/ -``` - -## 基于 PaddleClas 训练 ResNet50 - -### 一、安装 PaddleClas 代码库 - -```shell -git clone https://github.com/PaddlePaddle/PaddleClas.git -b release/2.5.1 -cd PaddleClas -python -m pip install -r requirements.txt -python -m pip install . -``` - -### 二、数据准备 - -请根据 [数据说明文档](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.5.1/docs/zh_CN/models/ImageNet1k/ResNet.md#32-%E6%95%B0%E6%8D%AE%E5%87%86%E5%A4%87) 准备 ImageNet1k 数据集,准备完成后解压到 PaddleClas/dataset/目录下,目录结构如下: - -``` -PaddleClas/dataset/ILSVRC2012/ -|_ train/ -| |_ n01440764 -| | |_ n01440764_10026.JPEG -| | |_ ... -| |_ ... -| | -| |_ n15075141 -| |_ ... -| |_ n15075141_9993.JPEG -|_ val/ -| |_ ILSVRC2012_val_00000001.JPEG -| |_ ... 
-| |_ ILSVRC2012_val_00050000.JPEG -|_ train_list.txt -|_ val_list.txt -``` - -### 三、模型训练 - -进入 PaddleClas 目录下,执行如下命令启动 4 卡 XPU(0 ~ 3 号卡)训练,其中: - -* 参数 `-o Global.device` 指定的是即将运行的设备,这里需要传入的是 `xpu` ,通过指定该参数,PaddleClas 调用飞桨的设备指定接口 `paddle.set_device` 来指定运行设备为 xpu,在进行模型训练时,飞桨将自动调用 xpu 算子用于执行模型计算。关于设备指定的更多细节,可以参考官方 api [paddle.set_device](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 - -* 参数 `-c ./ppcls/configs/ImageNet/ResNet/ResNet50.yaml` 表示读取指定目录下的配置文件,配置文件中指定了模型结构,训练超参等所有训练模型需要用到的配置,该文件中指定的模型结构为 `ResNet50` - -```shell -python -u -m paddle.distributed.launch --devices 0,1,2,3 tools/train.py \ - -c ./ppcls/configs/ImageNet/ResNet/ResNet50.yaml \ - -o Global.output_dir="output/ResNet50" \ - -o Global.device="xpu" -``` - -上述命令会在 PaddleClas 目录下产生一个 output/ResNet50 目录,该目录会存放训练过程中的模型参数 - -### 四、模型导出 & 推理 - -#### 模型导出 - -训练完成后,最后一个 epoch 的权重放在 output/ResNet50/ 目录下的 epoch_120.pdparams 文件中,执行以下命令将模型转成 Paddle 静态图格式存储,以获得更好的推理性能: - -* export_model.py 执行的是 `动转静` 操作,飞桨框架会对代码进行分析,将动态图代码(灵活易用)转为 静态图模型(高效),以达到更加高效的推理性能 - -* 该操作会在指定./deploy/models/ResNet50 下生成 inference.pdiparams、inference.pdiparams.info、inference.pdmodel 3 个文件 - -```shell -python tools/export_model.py \ - -c ./ppcls/configs/ImageNet/ResNet/ResNet50.yaml \ - -o Global.pretrained_model=./output/ResNet50/epoch_120 \ - -o Global.save_inference_dir=./deploy/models/ResNet50 \ - -o Global.device=xpu -``` - -#### 基于 PaddleInference 推理 - -推理代码位于 PaddleClas/deploy 目录下,执行下列命令进行 XPU 推理: - -* 该脚本将会加载上一步保存的静态图,使用飞桨预测库 PaddleInference 进行推理 - -* PaddleInference 内置了大量的高性能 Kernel,并且可以基于计算图分析,完成细粒度 OP 横向纵向融合,实现了高性能推理 - -```shell -cd deploy -python python/predict_cls.py \ - -c ./configs/inference_cls.yaml \ - -o Global.inference_model_dir=./models/ResNet50 \ - -o Global.use_gpu=False \ - -o Global.use_xpu=True \ - -o Global.infer_imgs=./images/ImageNet -``` - -#### 转换 ONNX 模型 - -如果您有额外的部署需求需要基于 ONNX 实现,我们也提供了专用的工具用于导出 ONNX 模型,参考如下步骤,即可将第一步导出的静态图模型转换为 ONNX 模型: - -a. 
安装环境 - -```shell -# 安装 paddle2onnx,该工具支持将 PaddleInference 模型转换为 ONNX 格式 -python -m pip install paddle2onnx -``` - -b. 模型转换 - -```shell -paddle2onnx --model_dir=./deploy/models/ResNet50/ \ - --model_filename=inference.pdmodel \ - --params_filename=inference.pdiparams \ - --save_file=./deploy/models/ResNet50_onnx/inference.onnx \ - --opset_version=10 \ - --enable_onnx_checker=True -``` - -该命令会在 deploy/models/ResNet50_onnx 目录下生成 inference.onnx 文件,生成的文件可以基于 ONNX Runtime 进行推理,具体使用方式参考 [ONNX Runtime 官网](https://onnxruntime.ai/) diff --git a/docs/hardware_support/xpu/support_cn.md b/docs/hardware_support/xpu/support_cn.md index c5134619bb5..b96ae042442 100644 --- a/docs/hardware_support/xpu/support_cn.md +++ b/docs/hardware_support/xpu/support_cn.md @@ -1,279 +1,30 @@ # 昆仑芯 XPU 支持模型 -基于 2024Q2 devlop 版本,飞桨框架在昆仑芯 XPU 上通过精度验证的模型情况如下: +飞桨框架在昆仑芯 XPU 上通过精度验证的模型情况如下: | 模型库 | 模型名称 | 训练 | 推理 | -| - | - | - | - | -| PaddleClas | ResNet50 | √ | √ | -| PaddleClas | MobileNetV3 | √ | √ | -| PaddleClas | PP-LCNet | √ | √ | -| PaddleClas | AlexNet | 未测试 | √ | -| PaddleClas | CLIP_vit_base_patch16_224 | 未测试 | √ | -| PaddleClas | CSPDarkNet53 | 未测试 | √ | -| PaddleClas | ConvNeXt_small | 未测试 | √ | -| PaddleClas | ConvNeXt_tiny | 未测试 | √ | -| PaddleClas | CvT_13_224 | 未测试 | √ | -| PaddleClas | CvT_13_384 | 未测试 | √ | -| PaddleClas | CvT_21_224 | 未测试 | √ | -| PaddleClas | CvT_21_384 | 未测试 | √ | -| PaddleClas | DLA102 | 未测试 | √ | -| PaddleClas | DLA102x | 未测试 | √ | -| PaddleClas | DLA102x2 | 未测试 | √ | -| PaddleClas | DLA169 | 未测试 | √ | -| PaddleClas | DLA34 | 未测试 | √ | -| PaddleClas | DLA46_c | 未测试 | √ | -| PaddleClas | DLA46x_c | 未测试 | √ | -| PaddleClas | DLA60 | 未测试 | √ | -| PaddleClas | DLA60x | 未测试 | √ | -| PaddleClas | DLA60x_c | 未测试 | √ | -| PaddleClas | DPN107 | 未测试 | √ | -| PaddleClas | DPN131 | 未测试 | √ | -| PaddleClas | DPN68 | 未测试 | √ | -| PaddleClas | DPN92 | 未测试 | √ | -| PaddleClas | DPN98 | 未测试 | √ | -| PaddleClas | DSNet_base | 未测试 | √ | -| PaddleClas | DSNet_tiny | 未测试 | √ | -| 
PaddleClas | DarkNet53 | 未测试 | √ | -| PaddleClas | DeiT_base_patch16_224 | 未测试 | √ | -| PaddleClas | DeiT_base_patch16_384 | 未测试 | √ | -| PaddleClas | DeiT_small_patch16_224 | 未测试 | √ | -| PaddleClas | DeiT_tiny_patch16_224 | 未测试 | √ | -| PaddleClas | DenseNet121 | 未测试 | √ | -| PaddleClas | DenseNet161 | 未测试 | √ | -| PaddleClas | DenseNet169 | 未测试 | √ | -| PaddleClas | DenseNet201 | 未测试 | √ | -| PaddleClas | DenseNet264 | 未测试 | √ | -| PaddleClas | DistillationModel | 未测试 | √ | -| PaddleClas | ESNet_x0_25 | 未测试 | √ | -| PaddleClas | ESNet_x0_5 | 未测试 | √ | -| PaddleClas | ESNet_x0_75 | 未测试 | √ | -| PaddleClas | ESNet_x1_0 | 未测试 | √ | -| PaddleClas | EfficientNetB0 | 未测试 | √ | -| PaddleClas | EfficientNetB1 | 未测试 | √ | -| PaddleClas | EfficientNetB2 | 未测试 | √ | -| PaddleClas | EfficientNetB3 | 未测试 | √ | -| PaddleClas | EfficientNetB4 | 未测试 | √ | -| PaddleClas | EfficientNetB5 | 未测试 | √ | -| PaddleClas | EfficientNetB6 | 未测试 | √ | -| PaddleClas | EfficientNetB7 | 未测试 | √ | -| PaddleClas | GhostNet_x0_5 | 未测试 | √ | -| PaddleClas | GhostNet_x1_0 | 未测试 | √ | -| PaddleClas | GhostNet_x1_3 | 未测试 | √ | -| PaddleClas | GoogLeNet | 未测试 | √ | -| PaddleClas | HRNet_W18_C | 未测试 | √ | -| PaddleClas | HRNet_W30_C | 未测试 | √ | -| PaddleClas | HRNet_W32_C | 未测试 | √ | -| PaddleClas | HRNet_W40_C | 未测试 | √ | -| PaddleClas | HRNet_W44_C | 未测试 | √ | -| PaddleClas | HRNet_W48_C | 未测试 | √ | -| PaddleClas | HRNet_W64_C | 未测试 | √ | -| PaddleClas | HarDNet39_ds | 未测试 | √ | -| PaddleClas | HarDNet68 | 未测试 | √ | -| PaddleClas | HarDNet68_ds | 未测试 | √ | -| PaddleClas | HarDNet85 | 未测试 | √ | -| PaddleClas | InceptionV3 | 未测试 | √ | -| PaddleClas | InceptionV4 | 未测试 | √ | -| PaddleClas | LeViT_128 | 未测试 | √ | -| PaddleClas | LeViT_128S | 未测试 | √ | -| PaddleClas | LeViT_192 | 未测试 | √ | -| PaddleClas | LeViT_256 | 未测试 | √ | -| PaddleClas | LeViT_384 | 未测试 | √ | -| PaddleClas | MicroNet_M0 | 未测试 | √ | -| PaddleClas | MicroNet_M1 | 未测试 | √ | -| PaddleClas | MicroNet_M2 | 未测试 | √ | -| PaddleClas | 
MicroNet_M3 | 未测试 | √ | -| PaddleClas | MixNet_L | 未测试 | √ | -| PaddleClas | MixNet_M | 未测试 | √ | -| PaddleClas | MixNet_S | 未测试 | √ | -| PaddleClas | MobileNeXt_x1_0 | 未测试 | √ | -| PaddleClas | MobileNetV1 | 未测试 | √ | -| PaddleClas | MobileNetV1_x0_25 | 未测试 | √ | -| PaddleClas | MobileNetV1_x0_5 | 未测试 | √ | -| PaddleClas | MobileNetV1_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV2 | 未测试 | √ | -| PaddleClas | MobileNetV2_x0_25 | 未测试 | √ | -| PaddleClas | MobileNetV2_x0_5 | 未测试 | √ | -| PaddleClas | MobileNetV2_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV2_x1_5 | 未测试 | √ | -| PaddleClas | MobileNetV2_x2_0 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x0_35 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x0_5 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x1_0 | 未测试 | √ | -| PaddleClas | MobileNetV3_large_x1_25 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x0_35 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x0_5 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x0_75 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x1_0 | 未测试 | √ | -| PaddleClas | MobileNetV3_small_x1_25 | 未测试 | √ | -| PaddleClas | MobileViTV2_x0_5 | 未测试 | √ | -| PaddleClas | MobileViTV2_x1_0 | 未测试 | √ | -| PaddleClas | MobileViTV2_x1_5 | 未测试 | √ | -| PaddleClas | MobileViTV2_x2_0 | 未测试 | √ | -| PaddleClas | MobileViTV3_S | 未测试 | √ | -| PaddleClas | MobileViTV3_S_L2 | 未测试 | √ | -| PaddleClas | MobileViTV3_XS | 未测试 | √ | -| PaddleClas | MobileViTV3_XS_L2 | 未测试 | √ | -| PaddleClas | MobileViTV3_XXS | 未测试 | √ | -| PaddleClas | MobileViTV3_XXS_L2 | 未测试 | √ | -| PaddleClas | MobileViTV3_x0_5 | 未测试 | √ | -| PaddleClas | MobileViTV3_x0_75 | 未测试 | √ | -| PaddleClas | MobileViTV3_x1_0 | 未测试 | √ | -| PaddleClas | NextViT_base_224 | 未测试 | √ | -| PaddleClas | NextViT_base_384 | 未测试 | √ | -| PaddleClas | NextViT_large_224 | 未测试 | √ | -| PaddleClas | NextViT_large_384 | 未测试 | √ | -| PaddleClas | NextViT_small_224 | 未测试 | √ | -| PaddleClas | NextViT_small_384 | 未测试 | √ | -| 
PaddleClas | PPHGNet_small | 未测试 | √ | -| PaddleClas | PPHGNet_tiny | 未测试 | √ | -| PaddleClas | PPLCNetV2_base | 未测试 | √ | -| PaddleClas | PPLCNet_x0_25 | 未测试 | √ | -| PaddleClas | PPLCNet_x0_35 | 未测试 | √ | -| PaddleClas | PPLCNet_x0_5 | 未测试 | √ | -| PaddleClas | PPLCNet_x0_75 | 未测试 | √ | -| PaddleClas | PPLCNet_x1_0 | 未测试 | √ | -| PaddleClas | PPLCNet_x1_5 | 未测试 | √ | -| PaddleClas | PPLCNet_x2_0 | 未测试 | √ | -| PaddleClas | PPLCNet_x2_5 | 未测试 | √ | -| PaddleClas | PVT_V2_B0 | 未测试 | √ | -| PaddleClas | PVT_V2_B1 | 未测试 | √ | -| PaddleClas | PVT_V2_B2 | 未测试 | √ | -| PaddleClas | PVT_V2_B2_Linear | 未测试 | √ | -| PaddleClas | PVT_V2_B3 | 未测试 | √ | -| PaddleClas | PVT_V2_B4 | 未测试 | √ | -| PaddleClas | PVT_V2_B5 | 未测试 | √ | -| PaddleClas | RepVGG_B3 | 未测试 | √ | -| PaddleClas | Res2Net101_vd_26w_4s | 未测试 | √ | -| PaddleClas | Res2Net200_vd_26w_4s | 未测试 | √ | -| PaddleClas | Res2Net50_14w_8s | 未测试 | √ | -| PaddleClas | Res2Net50_26w_4s | 未测试 | √ | -| PaddleClas | Res2Net50_vd_26w_4s | 未测试 | √ | -| PaddleClas | ResNeSt101 | 未测试 | √ | -| PaddleClas | ResNeSt50 | 未测试 | √ | -| PaddleClas | ResNeSt50_fast_1s1x64d | 未测试 | √ | -| PaddleClas | ResNeXt101_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt101_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt101_vd_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt101_vd_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt152_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt152_vd_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt152_vd_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_64x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_vd_32x4d | 未测试 | √ | -| PaddleClas | ResNeXt50_vd_64x4d | 未测试 | √ | -| PaddleClas | ResNet101 | 未测试 | √ | -| PaddleClas | ResNet101_vd | 未测试 | √ | -| PaddleClas | ResNet152 | 未测试 | √ | -| PaddleClas | ResNet152_vd | 未测试 | √ | -| PaddleClas | ResNet18 | 未测试 | √ | -| PaddleClas | ResNet18_vd | 未测试 | √ | -| PaddleClas | ResNet200_vd | 未测试 | √ | -| PaddleClas | ResNet34 | 未测试 | √ | -| PaddleClas | ResNet34_vd | 未测试 | √ | -| 
PaddleClas | ResNet50_vd | 未测试 | √ | -| PaddleClas | SENet154_vd | 未测试 | √ | -| PaddleClas | SE_ResNeXt101_32x4d | 未测试 | √ | -| PaddleClas | SE_ResNeXt50_32x4d | 未测试 | √ | -| PaddleClas | SE_ResNeXt50_vd_32x4d | 未测试 | √ | -| PaddleClas | SE_ResNet18_vd | 未测试 | √ | -| PaddleClas | SE_ResNet34_vd | 未测试 | √ | -| PaddleClas | SE_ResNet50_vd | 未测试 | √ | -| PaddleClas | ShuffleNetV2_swish | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x0_25 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x0_33 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x0_5 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x1_0 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x1_5 | 未测试 | √ | -| PaddleClas | ShuffleNetV2_x2_0 | 未测试 | √ | -| PaddleClas | SlowFast | 未测试 | √ | -| PaddleClas | SqueezeNet1_0 | 未测试 | √ | -| PaddleClas | SqueezeNet1_1 | 未测试 | √ | -| PaddleClas | SwinTransformer_base_patch4_window12_384 | 未测试 | √ | -| PaddleClas | SwinTransformer_base_patch4_window7_224 | 未测试 | √ | -| PaddleClas | SwinTransformer_large_patch4_window12_384 | 未测试 | √ | -| PaddleClas | SwinTransformer_large_patch4_window7_224 | 未测试 | √ | -| PaddleClas | SwinTransformer_small_patch4_window7_224 | 未测试 | √ | -| PaddleClas | SwinTransformer_tiny_patch4_window7_224 | 未测试 | √ | -| PaddleClas | TNT_small | 未测试 | √ | -| PaddleClas | TinyNet_A | 未测试 | √ | -| PaddleClas | TinyNet_B | 未测试 | √ | -| PaddleClas | TinyNet_C | 未测试 | √ | -| PaddleClas | TinyNet_D | 未测试 | √ | -| PaddleClas | TinyNet_E | 未测试 | √ | -| PaddleClas | UniFormer_base | 未测试 | √ | -| PaddleClas | UniFormer_base_ls | 未测试 | √ | -| PaddleClas | UniFormer_small | 未测试 | √ | -| PaddleClas | UniFormer_small_plus | 未测试 | √ | -| PaddleClas | UniFormer_small_plus_dim64 | 未测试 | √ | -| PaddleClas | VAN_B0 | 未测试 | √ | -| PaddleClas | VAN_B1 | 未测试 | √ | -| PaddleClas | VGG13 | 未测试 | √ | -| PaddleClas | VGG16 | 未测试 | √ | -| PaddleClas | VGG19 | 未测试 | √ | -| PaddleClas | ViT_base_patch16_224 | 未测试 | √ | -| PaddleClas | ViT_base_patch16_384 | 未测试 | √ | -| PaddleClas | ViT_base_patch32_384 | 未测试 | √ | -| 
PaddleClas | ViT_large_patch16_224 | 未测试 | √ | -| PaddleClas | ViT_large_patch16_384 | 未测试 | √ | -| PaddleClas | ViT_large_patch32_384 | 未测试 | √ | -| PaddleClas | ViT_small_patch16_224 | 未测试 | √ | -| PaddleClas | Xception41 | 未测试 | √ | -| PaddleClas | Xception41_deeplab | 未测试 | √ | -| PaddleClas | Xception65 | 未测试 | √ | -| PaddleClas | Xception65_deeplab | 未测试 | √ | -| PaddleClas | Xception71 | 未测试 | √ | -| PaddleClas | alt_gvt_base | 未测试 | √ | -| PaddleClas | alt_gvt_large | 未测试 | √ | -| PaddleClas | alt_gvt_small | 未测试 | √ | -| PaddleClas | cae_base_patch16_224 | 未测试 | √ | -| PaddleClas | pcpvt_base | 未测试 | √ | -| PaddleClas | pcpvt_large | 未测试 | √ | -| PaddleClas | pcpvt_small | 未测试 | √ | -| PaddleDetection | cascade_mask_rcnn_r50_fpn_1x_coco | 未测试 | √ | -| PaddleDetection | cascade_rcnn_r50_fpn_1x_coco | 未测试 | √ | -| PaddleDetection | faster_rcnn_r50_1x_coco | 未测试 | √ | -| PaddleDetection | faster_rcnn_r50_fpn_1x_coco | 未测试 | √ | -| PaddleDetection | mask_rcnn_r50_1x_coco | 未测试 | √ | -| PaddleNLP | BERT | √ | √ | -| PaddleNLP | ERINE3.0 | √ | √ | -| PaddleNLP | UIE | √ | √ | -| PaddleOCR | PP-OCRv4-mobile-det | √ | √ | -| PaddleOCR | PP-OCRv4-server-det | √ | √ | -| PaddleOCR | ch_PP-OCRv2_det_0 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv2_rec | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv3_det_0 | 未测试 | √ | -| PaddleOCR | ch_PP-OCRv4_server_rec | 未测试 | √ | -| PaddleOCR | ch_ppocr_mobile_v2_0_rec | 未测试 | √ | -| PaddleOCR | ch_ppocr_server_v2_0_rec | 未测试 | √ | -| PaddleOCR | det_mv3_db_v2_0_0 | 未测试 | √ | -| PaddleOCR | det_r50_db_plusplus_0 | 未测试 | √ | -| PaddleOCR | en_table_structure | 未测试 | √ | -| PaddleOCR | rec_mv3_none_bilstm_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_mv3_none_none_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_mv3_tps_bilstm_att_v2_0 | 未测试 | √ | -| PaddleOCR | rec_mv3_tps_bilstm_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_r31_sar | 未测试 | √ | -| PaddleOCR | rec_r34_vd_none_bilstm_ctc_v2_0 | 未测试 | √ | -| PaddleOCR | rec_r34_vd_none_none_ctc_v2_0 | 未测试 | √ | -| 
PaddleOCR | rec_vitstr | 未测试 | √ | -| PaddleOCR | slanet | 未测试 | √ | -| PaddleSeg | PP-LiteSeg | √ | √ | -| PaddleSeg | SFNet | √ | √ | -| PaddleSeg | deeplabv3p_resnet50 | 未测试 | √ | -| PaddleSeg | fcn_hrnetw18 | 未测试 | √ | -| PaddleSeg | fcn_hrnetw18_small | 未测试 | √ | -| PaddleSeg | fcn_uhrnetw18_small | 未测试 | √ | -| PaddleSeg | fcos_r50_fpn_1x_coco | 未测试 | √ | -| PaddleSeg | ocrnet_hrnetw48 | 未测试 | √ | -| PaddleSeg | pphumanseg_lite | 未测试 | √ | -| PaddleSeg | pphumanseg_server | 未测试 | √ | -| PaddleTS | DLinear | √ | 未测试 | -| PaddleTS | NLinear | √ | 未测试 | -| PaddleTS | RLinear | √ | 未测试 | -| PaddleTS | TiDE | √ | 未测试 | -| PaddleTS | PatchTST | √ | 未测试 | -| PaddleVideo | AGCN | 未测试 | √ | -| PaddleVideo | AGCN2s | 未测试 | √ | -| PaddleVideo | BMN | 未测试 | √ | +| - | - | - | - | +| PaddleX | ResNet18 | √ | √ | +| PaddleX | ResNet34 | √ | √ | +| PaddleX | ResNet50 | √ | √ | +| PaddleX | ResNet101 | √ | √ | +| PaddleX | ResNet152 | √ | √ | +| PaddleX | PPLCNet_x0_25 | √ | √ | +| PaddleX | PPLCNet_x0_35 | √ | √ | +| PaddleX | PPLCNet_x0_5 | √ | √ | +| PaddleX | PPLCNet_x0_75 | √ | √ | +| PaddleX | PPLCNet_x1_0 | √ | √ | +| PaddleX | PPLCNet_x1_5 | √ | √ | +| PaddleX | PPLCNet_x2_0 | √ | √ | +| PaddleX | PPLCNet_x2_5 | √ | √ | +| PaddleX | PP-HGNet_small | √ | √ | +| PaddleX | PP-LiteSeg-T | √ | √ | +| PaddleX | PP-OCRv4_mobile_det | √ | √ | +| PaddleX | PP-OCRv4_server_det | √ | √ | +| PaddleX | PP-OCRv4_mobile_rec | √ | √ | +| PaddleX | PP-OCRv4_server_rec | √ | √ | +| PaddleX | DLinear | √ | √ | +| PaddleX | RLinear | √ | √ | +| PaddleX | NLinear | √ | √ | +| PaddleNLP | BERT | √ | √ | +| PaddleNLP | ERINE3.0 | √ | √ | diff --git a/docs/install/Tables.md b/docs/install/Tables.md index e7d2405d820..6a768904947 100644 --- a/docs/install/Tables.md +++ b/docs/install/Tables.md @@ -290,11 +290,11 @@ PaddePaddle 通过编译时指定路径来实现引用各种 BLAS/CUDA/cuDNN 库 - paddlepaddle==[版本号] 例如 paddlepaddle==2.6.1 + paddlepaddle==[版本号] 例如 paddlepaddle==3.0.0b2 只支持 CPU 对应版本的 PaddlePaddle,具体版本请参见Pypi - 
paddlepaddle-gpu==[版本号] 例如 paddlepaddle-gpu==2.6.1 + paddlepaddle-gpu==[版本号] 例如 paddlepaddle-gpu==3.0.0b2 默认安装支持 CUDA 11.8 和 cuDNN 8 的对应[版本号]的 PaddlePaddle 安装包 @@ -302,9 +302,8 @@ PaddePaddle 通过编译时指定路径来实现引用各种 BLAS/CUDA/cuDNN 库

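 You can find every released version of PaddlePaddle-gpu in the [Release History](https://pypi.org/project/paddlepaddle-gpu/#history).
-> Here `postXX` corresponds to the CUDA and cuDNN versions, and the number before `postXX` is the Paddle version.
-Note that on Windows, paddlepaddle-gpu==2.6.1 in the command installs by default the PaddlePaddle package of the corresponding [version number] built for CUDA 11.8 and cuDNN 8.
+Note that on Windows, paddlepaddle-gpu==3.0.0b2 in the command installs by default the PaddlePaddle package of the corresponding [version number] built for CUDA 11.8 and cuDNN 8.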

@@ -326,181 +325,86 @@ PaddePaddle 通过编译时指定路径来实现引用各种 BLAS/CUDA/cuDNN 库 cpu-mkl-avx - paddlepaddle-2.6.1-cp38-cp38-linux_x86_64.whl - paddlepaddle-2.6.1-cp39-cp39-linux_x86_64.whl - paddlepaddle-2.6.1-cp310-cp310-linux_x86_64.whl - paddlepaddle-2.6.1-cp311-cp311-linux_x86_64.whl - paddlepaddle-2.6.1-cp312-cp312-linux_x86_64.whl - - - cpu-openblas-avx - paddlepaddle-2.6.1-cp38-cp38-linux_x86_64.whl - - - - - - - - - - - cuda11.2-cudnn8.1-mkl-gcc8.2-avx - - paddlepaddle_gpu-2.6.1.post112-cp38-cp38-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post112-cp39-cp39-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post112-cp310-cp310-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post112-cp311-cp311-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post112-cp312-cp312-linux_x86_64.whl - - - cuda11.6-cudnn8.4-mkl-gcc8.2-avx - - paddlepaddle_gpu-2.6.1.post116-cp38-cp38-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post116-cp39-cp39-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post116-cp310-cp310-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post116-cp311-cp311-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post116-cp312-cp312-linux_x86_64.whl - - - cuda11.7-cudnn8.4-mkl-gcc8.2-avx - - paddlepaddle_gpu-2.6.1.post117-cp38-cp38-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post117-cp39-cp39-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post117-cp310-cp310-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post117-cp310-cp310-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post117-cp310-cp310-linux_x86_64.whl + paddlepaddle-3.0.0b2-cp38-cp38-linux_x86_64.whl + paddlepaddle-3.0.0b2-cp39-cp39-linux_x86_64.whl + paddlepaddle-3.0.0b2-cp310-cp310-linux_x86_64.whl + paddlepaddle-3.0.0b2-cp311-cp311-linux_x86_64.whl + paddlepaddle-3.0.0b2-cp312-cp312-linux_x86_64.whl cuda11.8-cudnn8.6-mkl-gcc8.2-avx - - paddlepaddle_gpu-2.6.1-cp38-cp38-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1-cp39-cp39-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1-cp310-cp310-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1-cp311-cp311-linux_x86_64.whl - - 
paddlepaddle_gpu-2.6.1-cp312-cp312-linux_x86_64.whl - - cuda12.0-cudnn8.9-mkl-gcc12.2-avx - - paddlepaddle_gpu-2.6.1.post120-cp38-cp38-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post120-cp39-cp39-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post120-cp310-cp310-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post120-cp311-cp311-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post120-cp312-cp312-linux_x86_64.whl - - - macos-cpu-openblas - - paddlepaddle-2.6.1-cp38-cp38-macosx_10_14_x86_64.whl - - paddlepaddle-2.6.1-cp39-cp39-macosx_10_14_x86_64.whl - - paddlepaddle-2.6.1-cp310-cp310-macosx_10_14_universal2.whl - - paddlepaddle-2.6.1-cp311-cp311-macosx_10_14_universal2.whl - - paddlepaddle-2.6.1-cp312-cp312-macosx_10_14_universal2.whl - - - macos-cpu-openblas-m1 - - paddlepaddle-2.6.1-cp38-cp38-macosx_11_0_arm64.whl - - paddlepaddle-2.6.1-cp39-cp39-macosx_11_0_arm64.whl - - paddlepaddle-2.6.1-cp310-cp310-macosx_11_0_arm64.whl - - paddlepaddle-2.6.1-cp311-cp311-macosx_11_0_arm64.whl - - paddlepaddle-2.6.1-cp312-cp312-macosx_11_0_arm64.whl + + paddlepaddle_gpu-3.0.0b2-cp38-cp38-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp39-cp39-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp310-cp310-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp311-cp311-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp312-cp312-linux_x86_64.whl + + cuda12.3-cudnn9.0-mkl-gcc12.2-avx + + paddlepaddle_gpu-3.0.0b2-cp38-cp38-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp39-cp39-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp310-cp310-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp311-cp311-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp312-cp312-linux_x86_64.whl + + + macos-cpu-x86 + + paddlepaddle-3.0.0b2-cp38-cp38-macosx_10_9_x86_64.whl + + paddlepaddle-3.0.0b2-cp39-cp39-macosx_10_9_x86_64.whl + + paddlepaddle-3.0.0b2-cp310-cp310-macosx_10_9_universal2.whl + + paddlepaddle-3.0.0b2-cp311-cp311-macosx_10_9_universal2.whl + + paddlepaddle-3.0.0b2-cp312-cp312-macosx_10_9_universal2.whl + + + 
macos-cpu-arm + + paddlepaddle-3.0.0b2-cp38-cp38-macosx_11_0_arm64.whl + + paddlepaddle-3.0.0b2-cp39-cp39-macosx_11_0_arm64.whl + + paddlepaddle-3.0.0b2-cp310-cp310-macosx_11_0_arm64.whl + + paddlepaddle-3.0.0b2-cp311-cp311-macosx_11_0_arm64.whl + + paddlepaddle-3.0.0b2-cp312-cp312-macosx_11_0_arm64.whl win-cpu-mkl-avx - paddlepaddle-2.6.1-cp38-cp38-win_amd64.whl - paddlepaddle-2.6.1-cp39-cp39-win_amd64.whl - paddlepaddle-2.6.1-cp310-cp310-win_amd64.whl - paddlepaddle-2.6.1-cp311-cp311-win_amd64.whl - paddlepaddle-2.6.1-cp312-cp312-win_amd64.whl - - - win-cpu-openblas-avx - paddlepaddle-2.6.1-cp38-cp38-win_amd64.whl - - - - - - - - - - - win-cuda11.2-cudnn8.2-mkl-vs2019-avx - paddlepaddle_gpu-2.6.1.post112-cp38-cp38-win_amd64.whl - paddlepaddle_gpu-2.6.1.post112-cp39-cp39-win_amd64.whl - paddlepaddle_gpu-2.6.1.post112-cp310-cp310-win_amd64.whl - paddlepaddle_gpu-2.6.1.post112-cp311-cp311-win_amd64.whl - paddlepaddle_gpu-2.6.1.post112-cp312-cp312-win_amd64.whl - - - win-cuda11.6-cudnn8.4-mkl-vs2019-avx - paddlepaddle_gpu-2.6.1.post116-cp38-cp38-win_amd64.whl - paddlepaddle_gpu-2.6.1.post116-cp39-cp39-win_amd64.whl - paddlepaddle_gpu-2.6.1.post116-cp310-cp310-win_amd64.whl - paddlepaddle_gpu-2.6.1.post116-cp311-cp311-win_amd64.whl - paddlepaddle_gpu-2.6.1.post116-cp312-cp312-win_amd64.whl - - - win-cuda11.7-cudnn8.4-mkl-vs2019-avx - paddlepaddle_gpu-2.6.1.post117-cp38-cp38-win_amd64.whl - paddlepaddle_gpu-2.6.1.post117-cp39-cp39-win_amd64.whl - paddlepaddle_gpu-2.6.1.post117-cp310-cp310-win_amd64.whl - paddlepaddle_gpu-2.6.1.post117-cp311-cp311-win_amd64.whl - paddlepaddle_gpu-2.6.1.post117-cp312-cp312-win_amd64.whl + paddlepaddle-3.0.0b2-cp38-cp38-win_amd64.whl + paddlepaddle-3.0.0b2-cp39-cp39-win_amd64.whl + paddlepaddle-3.0.0b2-cp310-cp310-win_amd64.whl + paddlepaddle-3.0.0b2-cp311-cp311-win_amd64.whl + paddlepaddle-3.0.0b2-cp312-cp312-win_amd64.whl win-cuda11.8-cudnn8.6-mkl-vs2019-avx - paddlepaddle_gpu-2.6.1-cp38-cp38-win_amd64.whl - 
paddlepaddle_gpu-2.6.1-cp39-cp39-win_amd64.whl - paddlepaddle_gpu-2.6.1-cp310-cp310-win_amd64.whl - paddlepaddle_gpu-2.6.1-cp311-cp311-win_amd64.whl - paddlepaddle_gpu-2.6.1-cp312-cp312-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp38-cp38-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp39-cp39-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp310-cp310-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp311-cp311-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp312-cp312-win_amd64.whl - win-cuda12.0-cudnn8.9-mkl-vs2019-avx - paddlepaddle_gpu-2.6.1.post120-cp38-cp38-win_amd64.whl - paddlepaddle_gpu-2.6.1.post120-cp39-cp39-win_amd64.whl - paddlepaddle_gpu-2.6.1.post120-cp310-cp310-win_amd64.whl - paddlepaddle_gpu-2.6.1.post120-cp311-cp311-win_amd64.whl - paddlepaddle_gpu-2.6.1.post120-cp312-cp312-win_amd64.whl - - - linux-cinn-cuda11.2-cudnn8-mkl-gcc8.2-avx - paddlepaddle_gpu-2.6.1.post112-cp38-cp38-linux_x86_64.whl - paddlepaddle_gpu-2.6.1.post112-cp39-cp39-linux_x86_64.whl - paddlepaddle_gpu-2.6.1.post112-cp310-cp310-linux_x86_64.whl - paddlepaddle_gpu-2.6.1.post112-cp311-cp311-linux_x86_64.whl - paddlepaddle_gpu-2.6.1.post112-cp312-cp312-linux_x86_64.whl - - - linux-cuda11.2-cudnn8-mkl-gcc8.2-avx-pascal - paddlepaddle_gpu-2.6.1-cp38-cp38-linux_x86_64.whl - paddlepaddle_gpu-2.6.1-cp39-cp39-linux_x86_64.whl - paddlepaddle_gpu-2.6.1-cp310-cp310-linux_x86_64.whl - paddlepaddle_gpu-2.6.1-cp311-cp311-linux_x86_64.whl - paddlepaddle_gpu-2.6.1-cp312-cp312-linux_x86_64.whl + win-cuda12.3-cudnn9.0-mkl-vs2019-avx + paddlepaddle_gpu-3.0.0b2-cp38-cp38-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp39-cp39-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp310-cp310-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp311-cp311-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp312-cp312-win_amd64.whl @@ -539,155 +443,6 @@ abi tag: 类似'cp33m', 'abi3', 'none' platform tag: 类似 'linux_x86_64', 'any' - -

-## **Multi-version whl package list - develop**
-
-| Version description | cp38-cp38 | cp39-cp39 | cp310-cp310 | cp311-cp311 | cp312-cp312 |
-| - | - | - | - | - | - |
-| linux-cpu-mkl-avx | paddlepaddle-latest-cp38-cp38-linux_x86_64.whl | paddlepaddle-latest-cp39-cp39-linux_x86_64.whl | paddlepaddle-latest-cp310-cp310-linux_x86_64.whl | paddlepaddle-latest-cp311-cp311-linux_x86_64.whl | paddlepaddle-latest-cp312-cp312-linux_x86_64.whl |
-| linux-cpu-openblas-avx | paddlepaddle-latest-cp38-cp38-linux_x86_64.whl | - | - | - | - |
-| cuda11.2-cudnn8.1-mkl | paddlepaddle_gpu-latest-cp38-cp38-linux_x86_64.whl | paddlepaddle_gpu-latest-cp39-cp39-linux_x86_64.whl | paddlepaddle_gpu-latest-cp310-cp310-linux_x86_64.whl | paddlepaddle_gpu-latest-cp311-cp311-linux_x86_64.whl | paddlepaddle_gpu-latest-cp312-cp312-linux_x86_64.whl |
-| cuda11.6-cudnn8.4-mkl | paddlepaddle_gpu-latest-cp38-cp38-linux_x86_64.whl | paddlepaddle_gpu-latest-cp39-cp39-linux_x86_64.whl | paddlepaddle_gpu-latest-cp310-cp310-linux_x86_64.whl | paddlepaddle_gpu-latest-cp311-cp311-linux_x86_64.whl | paddlepaddle_gpu-latest-cp312-cp312-linux_x86_64.whl |
-| cuda11.7-cudnn8.4-mkl | paddlepaddle_gpu-latest-cp38-cp38-linux_x86_64.whl | paddlepaddle_gpu-latest-cp39-cp39-linux_x86_64.whl | paddlepaddle_gpu-latest-cp310-cp310-linux_x86_64.whl | paddlepaddle_gpu-latest-cp311-cp311-linux_x86_64.whl | paddlepaddle_gpu-latest-cp312-cp312-linux_x86_64.whl |
-| cuda11.8-cudnn8.6-mkl | paddlepaddle_gpu-latest-cp38-cp38-linux_x86_64.whl | paddlepaddle_gpu-latest-cp39-cp39-linux_x86_64.whl | paddlepaddle_gpu-latest-cp310-cp310-linux_x86_64.whl | paddlepaddle_gpu-latest-cp311-cp311-linux_x86_64.whl | paddlepaddle_gpu-latest-cp312-cp312-linux_x86_64.whl |
-| cuda12.0-cudnn8.9-mkl | paddlepaddle_gpu-latest-cp38-cp38-linux_x86_64.whl | paddlepaddle_gpu-latest-cp39-cp39-linux_x86_64.whl | paddlepaddle_gpu-latest-cp310-cp310-linux_x86_64.whl | paddlepaddle_gpu-latest-cp311-cp311-linux_x86_64.whl | paddlepaddle_gpu-latest-cp312-cp312-linux_x86_64.whl |
-| mac-cpu | paddlepaddle-cp38-cp38-macosx_10_9_x86_64.whl | paddlepaddle-cp39-cp39-macosx_10_9_x86_64.whl | paddlepaddle-cp310-cp310-macosx_10_9_x86_64.whl | paddlepaddle-cp311-cp311-macosx_10_9_x86_64.whl | paddlepaddle-cp312-cp312-macosx_10_9_x86_64.whl |
-| macos-cpu-openblas-m1 | paddlepaddle-cp38-cp38-macosx_11_0_arm64.whl | paddlepaddle-cp39-cp39-macosx_11_0_arm64.whl | paddlepaddle-cp310-cp310-macosx_11_0_arm64.whl | paddlepaddle-cp311-cp311-macosx_11_0_arm64.whl | paddlepaddle-cp312-cp312-macosx_11_0_arm64.whl |
-| win-cpu-mkl-avx | paddlepaddle-latest-cp38-cp38-win_amd64.whl | paddlepaddle-latest-cp39-cp39-win_amd64.whl | paddlepaddle-latest-cp310-cp310-win_amd64.whl | paddlepaddle-latest-cp311-cp311-win_amd64.whl | paddlepaddle-latest-cp312-cp312-win_amd64.whl |
-| win-cpu-openblas-avx | paddlepaddle-latest-cp38-cp38-win_amd64.whl | - | - | - | - |
-| win-cuda11.2-cudnn8.2-mkl-vs2019-avx | paddlepaddle_gpu-latest-cp38-cp38-win_amd64.whl | paddlepaddle_gpu-latest-cp39-cp39-win_amd64.whl | paddlepaddle_gpu-latest-cp310-cp310-win_amd64.whl | paddlepaddle_gpu-latest-cp311-cp311-win_amd64.whl | paddlepaddle_gpu-latest-cp312-cp312-win_amd64.whl |
-| win-cuda11.6-cudnn8.4.0-mkl-avx-vs2019 | paddlepaddle_gpu-latest-cp38-cp38-win_amd64.whl | paddlepaddle_gpu-latest-cp39-cp39-win_amd64.whl | paddlepaddle_gpu-latest-cp310-cp310-win_amd64.whl | paddlepaddle_gpu-latest-cp311-cp311-win_amd64.whl | paddlepaddle_gpu-latest-cp312-cp312-win_amd64.whl |
-| win-cuda11.7-cudnn8.4.1-mkl-avx-vs2019 | paddlepaddle_gpu-latest-cp38-cp38-win_amd64.whl | paddlepaddle_gpu-latest-cp39-cp39-win_amd64.whl | paddlepaddle_gpu-latest-cp310-cp310-win_amd64.whl | paddlepaddle_gpu-latest-cp311-cp311-win_amd64.whl | paddlepaddle_gpu-latest-cp312-cp312-win_amd64.whl |
-| win-cuda11.8-cudnn8.6.0-mkl-avx-vs2019 | paddlepaddle_gpu-latest-cp38-cp38-win_amd64.whl | paddlepaddle_gpu-latest-cp39-cp39-win_amd64.whl | paddlepaddle_gpu-latest-cp310-cp310-win_amd64.whl | paddlepaddle_gpu-latest-cp311-cp311-win_amd64.whl | paddlepaddle_gpu-latest-cp312-cp312-win_amd64.whl |
-| win-cuda12.0-cudnn8.9.1-mkl-avx-vs2019 | paddlepaddle_gpu-latest-cp38-cp38-win_amd64.whl | paddlepaddle_gpu-latest-cp39-cp39-win_amd64.whl | paddlepaddle_gpu-latest-cp310-cp310-win_amd64.whl | paddlepaddle_gpu-latest-cp311-cp311-win_amd64.whl | paddlepaddle_gpu-latest-cp312-cp312-win_amd64.whl |
-

- @@ -706,17 +461,17 @@ platform tag: 类似 'linux_x86_64', 'any' cd /home/work ``` ``` -docker run -it -v $PWD:/work registry.baidubce.com/paddlepaddle/paddle /work/train.py +docker run -it -v $PWD:/work ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle /work/train.py ``` 上述命令中,`-it` 参数说明容器已交互式运行;`-v $PWD:/work` 指定将当前路径(Linux 中 PWD 变量会展开为当前路径的绝对路径)挂载到容器内部的:`/work` -目录: `registry.baidubce.com/paddlepaddle/paddle` 指定需要使用的容器; 最后`/work/train.py`为容器内执行的命令,即运行训练程序。 +目录: `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle` 指定需要使用的容器; 最后`/work/train.py`为容器内执行的命令,即运行训练程序。 当然,您也可以进入到 Docker 容器中,以交互式的方式执行或调试您的代码: ``` -docker run -it -v $PWD:/work registry.baidubce.com/paddlepaddle/paddle /bin/bash +docker run -it -v $PWD:/work ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle /bin/bash ``` ``` cd /work @@ -740,13 +495,13 @@ PaddlePaddle Book 是为用户和开发者制作的一个交互式的 Jupyter No 我们提供可以直接运行 PaddlePaddle Book 的 Docker 镜像,直接运行: ``` -docker run -p 8888:8888 registry.baidubce.com/paddlepaddle/book +docker run -p 8888:8888 ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/book ``` 国内用户可以使用下面的镜像源来加速访问: ``` -docker run -p 8888:8888 registry.baidubce.com/paddlepaddle/book +docker run -p 8888:8888 ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/book ``` 然后在浏览器中输入以下网址: @@ -765,7 +520,7 @@ http://localhost:8888/ 请不要忘记提前在物理机上安装 GPU 最新驱动。 ``` -nvidia-docker run -it -v $PWD:/work registry.baidubce.com/paddlepaddle/paddle:latest-gpu /bin/bash +nvidia-docker run -it -v $PWD:/work ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-gpu /bin/bash ``` **注: 如果没有安装 nvidia-docker,可以尝试以下的方法,将 CUDA 库和 Linux 设备挂载到 Docker 容器内:** @@ -775,5 +530,5 @@ export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') \ $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')" export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}') docker run ${CUDA_SO} \ -${DEVICES} -it registry.baidubce.com/paddlepaddle/paddle:latest-gpu +${DEVICES} -it 
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-gpu ``` diff --git a/docs/install/Tables_en.md b/docs/install/Tables_en.md index 9a651c5cd0a..2949939f18c 100644 --- a/docs/install/Tables_en.md +++ b/docs/install/Tables_en.md @@ -282,11 +282,11 @@ PaddePaddle implements references to various BLAS/CUDA/cuDNN libraries by specif - paddlepaddle==[version code] such as paddlepaddle==2.6.1 + paddlepaddle==[version code] such as paddlepaddle==3.0.0b2 Only support the corresponding version of the CPU PaddlePaddle, please refer to Pypi for the specific version. - paddlepaddle-gpu==[version code], such as paddlepaddle-gpu==2.6.1 + paddlepaddle-gpu==[version code], such as paddlepaddle-gpu==3.0.0b2 The default installation supports the PaddlePaddle installation package corresponding to [version number] of CUDA 11.2 and cuDNN 8 @@ -294,9 +294,8 @@ PaddePaddle implements references to various BLAS/CUDA/cuDNN libraries by specif

You can find various distributions of PaddlePaddle-gpu in [the Release History](https://pypi.org/project/paddlepaddle-gpu/#history). -> 'postxx' corresponds to CUDA and cuDNN versions, and the number before 'postxx' represents the version of Paddle -Please note that: in the commands, paddlepaddle-gpu==2.6.1 will install the installation package of PaddlePaddle that supports CUDA 11.2 and cuDNN 8 by default under Windows environment. +Please note that: in the commands, paddlepaddle-gpu==3.0.0b2 will install the installation package of PaddlePaddle that supports CUDA 11.2 and cuDNN 8 by default under Windows environment. @@ -320,181 +319,86 @@ Please note that: in the commands, paddlepaddle-gpu==2.6.1 will i cpu-mkl-avx - paddlepaddle-2.6.1-cp38-cp38-linux_x86_64.whl - paddlepaddle-2.6.1-cp39-cp39-linux_x86_64.whl - paddlepaddle-2.6.1-cp310-cp310-linux_x86_64.whl - paddlepaddle-2.6.1-cp311-cp311-linux_x86_64.whl - paddlepaddle-2.6.1-cp312-cp312-linux_x86_64.whl - - - cpu-openblas-avx - paddlepaddle-2.6.1-cp38-cp38-linux_x86_64.whl - - - - - - - - - - - cuda11.2-cudnn8.1-mkl-gcc8.2-avx - - paddlepaddle_gpu-2.6.1.post112-cp38-cp38-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post112-cp39-cp39-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post112-cp310-cp310-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post112-cp311-cp311-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post112-cp312-cp312-linux_x86_64.whl - - - cuda11.6-cudnn8.4-mkl-gcc8.2-avx - - paddlepaddle_gpu-2.6.1.post116-cp38-cp38-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post116-cp39-cp39-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post116-cp310-cp310-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post116-cp311-cp311-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post116-cp312-cp312-linux_x86_64.whl - - - cuda11.7-cudnn8.4-mkl-gcc8.2-avx - - paddlepaddle_gpu-2.6.1.post117-cp38-cp38-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post117-cp39-cp39-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post117-cp310-cp310-linux_x86_64.whl - 
- paddlepaddle_gpu-2.6.1.post117-cp311-cp311-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post117-cp312-cp312-linux_x86_64.whl + paddlepaddle-3.0.0b2-cp38-cp38-linux_x86_64.whl + paddlepaddle-3.0.0b2-cp39-cp39-linux_x86_64.whl + paddlepaddle-3.0.0b2-cp310-cp310-linux_x86_64.whl + paddlepaddle-3.0.0b2-cp311-cp311-linux_x86_64.whl + paddlepaddle-3.0.0b2-cp312-cp312-linux_x86_64.whl cuda11.8-cudnn8.6-mkl-gcc8.2-avx - - paddlepaddle_gpu-2.6.1-cp38-cp38-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1-cp39-cp39-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1-cp310-cp310-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1-cp311-cp311-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1-cp312-cp312-linux_x86_64.whl - - cuda12.0-cudnn8.9-mkl-gcc12.2-avx - - paddlepaddle_gpu-2.6.1.post120-cp38-cp38-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post120-cp39-cp39-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post120-cp310-cp310-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post120-cp311-cp311-linux_x86_64.whl - - paddlepaddle_gpu-2.6.1.post120-cp312-cp312-linux_x86_64.whl - - - macos-cpu-openblas - - paddlepaddle-2.6.1-cp38-cp38-macosx_10_14_x86_64.whl - - paddlepaddle-2.6.1-cp39-cp39-macosx_10_14_x86_64.whl - - paddlepaddle-2.6.1-cp310-cp310-macosx_10_14_universal2.whl - - paddlepaddle-2.6.1-cp311-cp311-macosx_10_14_universal2.whl - - paddlepaddle-2.6.1-cp312-cp312-macosx_10_14_universal2.whl - - - macos-cpu-openblas-m1 - - paddlepaddle-2.6.1-cp38-cp38-macosx_11_0_arm64.whl - - paddlepaddle-2.6.1-cp39-cp39-macosx_11_0_arm64.whl - - paddlepaddle-2.6.1-cp310-cp310-macosx_11_0_arm64.whl - - paddlepaddle-2.6.1-cp311-cp311-macosx_11_0_arm64.whl - - paddlepaddle-2.6.1-cp312-cp312-macosx_11_0_arm64.whl + + paddlepaddle_gpu-3.0.0b2-cp38-cp38-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp39-cp39-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp310-cp310-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp311-cp311-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp312-cp312-linux_x86_64.whl + + 
cuda12.3-cudnn9.0-mkl-gcc12.2-avx + + paddlepaddle_gpu-3.0.0b2-cp38-cp38-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp39-cp39-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp310-cp310-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp311-cp311-linux_x86_64.whl + + paddlepaddle_gpu-3.0.0b2-cp312-cp312-linux_x86_64.whl + + + macos-cpu-x86 + + paddlepaddle-3.0.0b2-cp38-cp38-macosx_10_9_x86_64.whl + + paddlepaddle-3.0.0b2-cp39-cp39-macosx_10_9_x86_64.whl + + paddlepaddle-3.0.0b2-cp310-cp310-macosx_10_9_universal2.whl + + paddlepaddle-3.0.0b2-cp311-cp311-macosx_10_9_universal2.whl + + paddlepaddle-3.0.0b2-cp312-cp312-macosx_10_9_universal2.whl + + + macos-cpu-arm + + paddlepaddle-3.0.0b2-cp38-cp38-macosx_11_0_arm64.whl + + paddlepaddle-3.0.0b2-cp39-cp39-macosx_11_0_arm64.whl + + paddlepaddle-3.0.0b2-cp310-cp310-macosx_11_0_arm64.whl + + paddlepaddle-3.0.0b2-cp311-cp311-macosx_11_0_arm64.whl + + paddlepaddle-3.0.0b2-cp312-cp312-macosx_11_0_arm64.whl win-cpu-mkl-avx - paddlepaddle-2.6.1-cp38-cp38-win_amd64.whl - paddlepaddle-2.6.1-cp39-cp39-win_amd64.whl - paddlepaddle-2.6.1-cp310-cp310-win_amd64.whl - paddlepaddle-2.6.1-cp311-cp311-win_amd64.whl - paddlepaddle-2.6.1-cp312-cp312-win_amd64.whl - - - win-cpu-openblas-avx - paddlepaddle-2.6.1-cp38-cp38-win_amd64.whl - - - - - - - - - - - win-cuda11.2-cudnn8.2-mkl-vs2019-avx - paddlepaddle_gpu-2.6.1.post112-cp38-cp38-win_amd64.whl - paddlepaddle_gpu-2.6.1.post112-cp39-cp39-win_amd64.whl - paddlepaddle_gpu-2.6.1.post112-cp310-cp310-win_amd64.whl - paddlepaddle_gpu-2.6.1.post112-cp311-cp311-win_amd64.whl - paddlepaddle_gpu-2.6.1.post112-cp312-cp312-win_amd64.whl - - - win-cuda11.6-cudnn8.4-mkl-vs2019-avx - paddlepaddle_gpu-2.6.1.post116-cp38-cp38-win_amd64.whl - paddlepaddle_gpu-2.6.1.post116-cp39-cp39-win_amd64.whl - paddlepaddle_gpu-2.6.1.post116-cp310-cp310-win_amd64.whl - paddlepaddle_gpu-2.6.1.post116-cp311-cp311-win_amd64.whl - paddlepaddle_gpu-2.6.1.post116-cp312-cp312-win_amd64.whl - - - 
win-cuda11.7-cudnn8.4-mkl-vs2019-avx - paddlepaddle_gpu-2.6.1.post117-cp38-cp38-win_amd64.whl - paddlepaddle_gpu-2.6.1.post117-cp39-cp39-win_amd64.whl - paddlepaddle_gpu-2.6.1.post117-cp310-cp310-win_amd64.whl - paddlepaddle_gpu-2.6.1.post117-cp311-cp311-win_amd64.whl - paddlepaddle_gpu-2.6.1.post117-cp312-cp312-win_amd64.whl + paddlepaddle-3.0.0b2-cp38-cp38-win_amd64.whl + paddlepaddle-3.0.0b2-cp39-cp39-win_amd64.whl + paddlepaddle-3.0.0b2-cp310-cp310-win_amd64.whl + paddlepaddle-3.0.0b2-cp311-cp311-win_amd64.whl + paddlepaddle-3.0.0b2-cp312-cp312-win_amd64.whl win-cuda11.8-cudnn8.6-mkl-vs2019-avx - paddlepaddle_gpu-2.6.1-cp38-cp38-win_amd64.whl - paddlepaddle_gpu-2.6.1-cp39-cp39-win_amd64.whl - paddlepaddle_gpu-2.6.1-cp310-cp310-win_amd64.whl - paddlepaddle_gpu-2.6.1-cp311-cp311-win_amd64.whl - paddlepaddle_gpu-2.6.1-cp312-cp312-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp38-cp38-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp39-cp39-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp310-cp310-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp311-cp311-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp312-cp312-win_amd64.whl - win-cuda12.0-cudnn8.9-mkl-vs2019-avx - paddlepaddle_gpu-2.6.1.post120-cp38-cp38-win_amd64.whl - paddlepaddle_gpu-2.6.1.post120-cp39-cp39-win_amd64.whl - paddlepaddle_gpu-2.6.1.post120-cp310-cp310-win_amd64.whl - paddlepaddle_gpu-2.6.1.post120-cp311-cp311-win_amd64.whl - paddlepaddle_gpu-2.6.1.post120-cp312-cp312-win_amd64.whl - - - linux-cinn-cuda11.2-cudnn8-mkl-gcc8.2-avx - paddlepaddle_gpu-2.6.1.post112-cp38-cp38-linux_x86_64.whl - paddlepaddle_gpu-2.6.1.post112-cp39-cp39-linux_x86_64.whl - paddlepaddle_gpu-2.6.1.post112-cp310-cp310-linux_x86_64.whl - paddlepaddle_gpu-2.6.1.post112-cp311-cp311-linux_x86_64.whl - paddlepaddle_gpu-2.6.1.post112-cp312-cp312-linux_x86_64.whl - - - linux-cuda11.2-cudnn8-mkl-gcc8.2-avx-pascal - paddlepaddle_gpu-2.6.1-cp38-cp38-linux_x86_64.whl - paddlepaddle_gpu-2.6.1-cp39-cp39-linux_x86_64.whl - 
paddlepaddle_gpu-2.6.1-cp310-cp310-linux_x86_64.whl - paddlepaddle_gpu-2.6.1-cp311-cp311-linux_x86_64.whl - paddlepaddle_gpu-2.6.1-cp312-cp312-linux_x86_64.whl + win-cuda12.3-cudnn9.0-mkl-vs2019-avx + paddlepaddle_gpu-3.0.0b2-cp38-cp38-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp39-cp39-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp310-cp310-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp311-cp311-win_amd64.whl + paddlepaddle_gpu-3.0.0b2-cp312-cp312-win_amd64.whl @@ -537,155 +441,6 @@ abi tag: similar to 'cp33m', 'abi3', 'none' platform tag: similar to 'linux_x86_64', 'any' - -

-## **Multi-version whl package list - dev**
-
-| Release Instruction | cp38-cp38 | cp39-cp39 | cp310-cp310 | cp311-cp311 | cp312-cp312 |
-| - | - | - | - | - | - |
-| linux-cpu-mkl-avx | paddlepaddle-latest-cp38-cp38-linux_x86_64.whl | paddlepaddle-latest-cp39-cp39-linux_x86_64.whl | paddlepaddle-latest-cp310-cp310-linux_x86_64.whl | paddlepaddle-latest-cp311-cp311-linux_x86_64.whl | paddlepaddle-latest-cp312-cp312-linux_x86_64.whl |
-| linux-cpu-openblas-avx | paddlepaddle-latest-cp38-cp38-linux_x86_64.whl | - | - | - | - |
-| cuda11.2-cudnn8.1-mkl | paddlepaddle_gpu-latest-cp38-cp38-linux_x86_64.whl | paddlepaddle_gpu-latest-cp39-cp39-linux_x86_64.whl | paddlepaddle_gpu-latest-cp310-cp310-linux_x86_64.whl | paddlepaddle_gpu-latest-cp311-cp311-linux_x86_64.whl | paddlepaddle_gpu-latest-cp312-cp312-linux_x86_64.whl |
-| cuda11.6-cudnn8.4-mkl | paddlepaddle_gpu-latest-cp38-cp38-linux_x86_64.whl | paddlepaddle_gpu-latest-cp39-cp39-linux_x86_64.whl | paddlepaddle_gpu-latest-cp310-cp310-linux_x86_64.whl | paddlepaddle_gpu-latest-cp311-cp311-linux_x86_64.whl | paddlepaddle_gpu-latest-cp312-cp312-linux_x86_64.whl |
-| cuda11.7-cudnn8.4-mkl | paddlepaddle_gpu-latest-cp38-cp38-linux_x86_64.whl | paddlepaddle_gpu-latest-cp39-cp39-linux_x86_64.whl | paddlepaddle_gpu-latest-cp310-cp310-linux_x86_64.whl | paddlepaddle_gpu-latest-cp311-cp311-linux_x86_64.whl | paddlepaddle_gpu-latest-cp312-cp312-linux_x86_64.whl |
-| cuda11.8-cudnn8.6-mkl | paddlepaddle_gpu-latest-cp38-cp38-linux_x86_64.whl | paddlepaddle_gpu-latest-cp39-cp39-linux_x86_64.whl | paddlepaddle_gpu-latest-cp310-cp310-linux_x86_64.whl | paddlepaddle_gpu-latest-cp311-cp311-linux_x86_64.whl | paddlepaddle_gpu-latest-cp312-cp312-linux_x86_64.whl |
-| cuda12.0-cudnn8.9-mkl | paddlepaddle_gpu-latest-cp38-cp38-linux_x86_64.whl | paddlepaddle_gpu-latest-cp39-cp39-linux_x86_64.whl | paddlepaddle_gpu-latest-cp310-cp310-linux_x86_64.whl | paddlepaddle_gpu-latest-cp311-cp311-linux_x86_64.whl | paddlepaddle_gpu-latest-cp312-cp312-linux_x86_64.whl |
-| mac-cpu | paddlepaddle-cp38-cp38-macosx_10_9_x86_64.whl | paddlepaddle-cp39-cp39-macosx_10_9_x86_64.whl | paddlepaddle-cp310-cp310-macosx_10_9_x86_64.whl | paddlepaddle-cp311-cp311-macosx_10_9_x86_64.whl | paddlepaddle-cp312-cp312-macosx_10_9_x86_64.whl |
-| macos-cpu-openblas-m1 | paddlepaddle-cp38-cp38-macosx_11_0_arm64.whl | paddlepaddle-cp39-cp39-macosx_11_0_arm64.whl | paddlepaddle-cp310-cp310-macosx_11_0_arm64.whl | paddlepaddle-cp311-cp311-macosx_11_0_arm64.whl | paddlepaddle-cp312-cp312-macosx_11_0_arm64.whl |
-| win-cpu-mkl-avx | paddlepaddle-latest-cp38-cp38-win_amd64.whl | paddlepaddle-latest-cp39-cp39-win_amd64.whl | paddlepaddle-latest-cp310-cp310-win_amd64.whl | paddlepaddle-latest-cp311-cp311-win_amd64.whl | paddlepaddle-latest-cp312-cp312-win_amd64.whl |
-| win-cpu-openblas-avx | paddlepaddle-latest-cp38-cp38-win_amd64.whl | - | - | - | - |
-| win-cuda11.2-cudnn8.2-mkl-vs2019-avx | paddlepaddle_gpu-latest-cp38-cp38-win_amd64.whl | paddlepaddle_gpu-latest-cp39-cp39-win_amd64.whl | paddlepaddle_gpu-latest-cp310-cp310-win_amd64.whl | paddlepaddle_gpu-latest-cp311-cp311-win_amd64.whl | paddlepaddle_gpu-latest-cp312-cp312-win_amd64.whl |
-| win-cuda11.6-cudnn8.4.0-mkl-avx-vs2019 | paddlepaddle_gpu-latest-cp38-cp38-win_amd64.whl | paddlepaddle_gpu-latest-cp39-cp39-win_amd64.whl | paddlepaddle_gpu-latest-cp310-cp310-win_amd64.whl | paddlepaddle_gpu-latest-cp311-cp311-win_amd64.whl | paddlepaddle_gpu-latest-cp312-cp312-win_amd64.whl |
-| win-cuda11.7-cudnn8.4.1-mkl-avx-vs2019 | paddlepaddle_gpu-latest-cp38-cp38-win_amd64.whl | paddlepaddle_gpu-latest-cp39-cp39-win_amd64.whl | paddlepaddle_gpu-latest-cp310-cp310-win_amd64.whl | paddlepaddle_gpu-latest-cp311-cp311-win_amd64.whl | paddlepaddle_gpu-latest-cp312-cp312-win_amd64.whl |
-| win-cuda11.8-cudnn8.6.0-mkl-avx-vs2019 | paddlepaddle_gpu-latest-cp38-cp38-win_amd64.whl | paddlepaddle_gpu-latest-cp39-cp39-win_amd64.whl | paddlepaddle_gpu-latest-cp310-cp310-win_amd64.whl | paddlepaddle_gpu-latest-cp311-cp311-win_amd64.whl | paddlepaddle_gpu-latest-cp312-cp312-win_amd64.whl |
-| win-cuda12.0-cudnn8.9.1-mkl-avx-vs2019 | paddlepaddle_gpu-latest-cp38-cp38-win_amd64.whl | paddlepaddle_gpu-latest-cp39-cp39-win_amd64.whl | paddlepaddle_gpu-latest-cp310-cp310-win_amd64.whl | paddlepaddle_gpu-latest-cp311-cp311-win_amd64.whl | paddlepaddle_gpu-latest-cp312-cp312-win_amd64.whl |
-

- - @@ -701,16 +456,16 @@ Suppose you have written a PaddlePaddle program in the current directory (such a cd /home/work ``` ``` -docker run -it -v $PWD:/work registry.baidubce.com/paddlepaddle/paddle /work/train.py +docker run -it -v $PWD:/work ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle /work/train.py ``` -In the above commands, the `-it` parameter indicates that the container has been run interactively; `-v $PWD:/work` specifies that the current path (the absolute path where the PWD variable in Linux will expand to the current path) is mounted to the `:/work` directory inside the container: `registry.baidubce.com/paddlepaddle/paddle` specifies the container to be used; finally `/work/train.py` is the command executed inside the container, ie. the training program. +In the above commands, the `-it` parameter indicates that the container has been run interactively; `-v $PWD:/work` specifies that the current path (the absolute path where the PWD variable in Linux will expand to the current path) is mounted to the `:/work` directory inside the container: `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle` specifies the container to be used; finally `/work/train.py` is the command executed inside the container, ie. the training program. 
 Of course, you can also enter into the Docker container and execute or debug your code interactively:
 ```
-docker run -it -v $PWD:/work registry.baidubce.com/paddlepaddle/paddle /bin/bash
+docker run -it -v $PWD:/work ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle /bin/bash
 ```
 ```
 cd /work
@@ -732,13 +487,13 @@ Use Docker to quickly launch a local Jupyter Notebook containing the PaddlePaddl
 We provide a Docker image that can run the PaddlePaddle Book directly, running directly:
 ```
-docker run -p 8888:8888 registry.baidubce.com/paddlepaddle/book
+docker run -p 8888:8888 ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/book
 ```
 Domestic users can use the following image source to speed up access:
 ```
-docker run -p 8888:8888 registry.baidubce.com/paddlepaddle/book
+docker run -p 8888:8888 ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/book
 ```
 Then enter the following URL in your browser:
@@ -756,7 +511,7 @@ http://localhost:8888/
 In order to ensure that the GPU driver works properly in the image, we recommend using [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) to run the image. Don't forget to install the latest GPU drivers on your physical machine in advance.
``` -Nvidia-docker run -it -v $PWD:/work registry.baidubce.com/paddlepaddle/paddle:latest-gpu /bin/bash +Nvidia-docker run -it -v $PWD:/work ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-gpu /bin/bash ``` **Note: If you don't have nvidia-docker installed, you can try the following to mount the CUDA library and Linux devices into the Docker container:** @@ -766,5 +521,5 @@ export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') \ $(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')" export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}') docker run ${CUDA_SO} \ -${DEVICES} -it registry.baidubce.com/paddlepaddle/paddle:latest-gpu +${DEVICES} -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-gpu ``` diff --git a/docs/install/compile/linux-compile-by-make.md b/docs/install/compile/linux-compile-by-make.md index d896d387019..eb1dc506b7a 100644 --- a/docs/install/compile/linux-compile-by-make.md +++ b/docs/install/compile/linux-compile-by-make.md @@ -60,12 +60,12 @@ cd Paddle * CPU 版的 PaddlePaddle: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev ``` * GPU 版的 PaddlePaddle: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 ``` 如果您的机器不在中国大陆地区,可以直接从 [DockerHub 中的 paddle 镜像仓库](https://hub.docker.com/r/paddlepaddle/paddle/tags) 拉取镜像: @@ -90,7 +90,7 @@ cd Paddle 用从百度拉取的镜像创建容器: ``` - docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash + docker run --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash ``` - `--name paddle-test`:为您创建的 Docker 容器命名为 paddle-test; @@ -99,7 +99,7 @@ cd Paddle - 
`-it`: 与宿主机保持交互状态; - - `registry.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`registry.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。 + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。 若使用的是从 DockerHub 拉取的镜像创建容器,则修改镜像名即可: ``` @@ -110,7 +110,7 @@ cd Paddle 用从百度拉取的镜像创建容器 ``` - docker run --gpus all --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash + docker run --gpus all --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash ``` - `--gpus all`: 在 Docker 容器中允许使用 gpu; @@ -121,7 +121,7 @@ cd Paddle - `-it`: 与宿主机保持交互状态; - - `registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`:使用名为`registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。 + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`:使用名为`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。 若使用的是从 DockerHub 拉取的镜像创建容器,则修改镜像名即可: ``` diff --git a/docs/install/compile/linux-compile-by-make_en.md b/docs/install/compile/linux-compile-by-make_en.md index f5d6db2479c..0f1a9c42827 100644 --- a/docs/install/compile/linux-compile-by-make_en.md +++ b/docs/install/compile/linux-compile-by-make_en.md @@ -58,12 +58,12 @@ For domestic users, when downloading docker is slow due to network problems, you * CPU version of PaddlePaddle: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev 
``` * GPU version of PaddlePaddle: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 ``` If your machine is not in mainland China, you can pull the image directly from DockerHub: @@ -90,7 +90,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g Using the image pulled from Baidu. ``` - docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash + docker run --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash ``` - `--name paddle-test`: names the Docker container you created as paddle-test; @@ -101,7 +101,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g - `-it`: keeps interaction with the host; - - `registry.baidubce.com/paddlepaddle/paddle:latest-dev`: use the image named `registry.baidubce.com/paddlepaddle/paddle:latest-dev` to create Docker container, /bin/bash start the /bin/bash command after entering the container. + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`: use the image named `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev` to create Docker container, /bin/bash start the /bin/bash command after entering the container. If you are using the image pulled from DockerHub, just modify the image name. ``` @@ -113,7 +113,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g Using the image pulled from Baidu. 
``` - docker run --gpus all --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash + docker run --gpus all --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash ``` - `--gpus all`: gpu resources can be used in Docker container; @@ -127,7 +127,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g - `-it`: keeps interaction with the host; - - `registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`: use the image named `registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2` to create Docker container, /bin/bash start the /bin/bash command after entering the container. + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`: use the image named `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2` to create Docker container, /bin/bash start the /bin/bash command after entering the container. If you are using the image pulled from DockerHub, just modify the image name. 
``` diff --git a/docs/install/compile/linux-compile-by-ninja.md b/docs/install/compile/linux-compile-by-ninja.md index 018bd28e267..74ade0b7be4 100644 --- a/docs/install/compile/linux-compile-by-ninja.md +++ b/docs/install/compile/linux-compile-by-ninja.md @@ -60,12 +60,12 @@ cd Paddle * CPU 版的 PaddlePaddle: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev ``` * GPU 版的 PaddlePaddle: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 ``` 如果您的机器不在中国大陆地区,可以直接从 [DockerHub 中的 paddle 镜像仓库](https://hub.docker.com/r/paddlepaddle/paddle/tags) 拉取镜像: @@ -90,7 +90,7 @@ cd Paddle 用从百度拉取的镜像创建容器 ``` - docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash + docker run --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash ``` - `--name paddle-test`:为您创建的 Docker 容器命名为 paddle-test; @@ -99,7 +99,7 @@ cd Paddle - `-it`: 与宿主机保持交互状态; - - `registry.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`registry.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。 + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。 若使用的是从 DockerHub 拉取的镜像创建容器,则修改镜像名即可: ``` @@ -110,7 +110,7 @@ cd Paddle 用从百度拉取的镜像创建容器 ``` - docker run --gpus all --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash + docker run --gpus all --name paddle-test -v $PWD:/paddle --network=host -it 
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash ``` - `--gpus all`: 在 Docker 容器中允许使用 gpu; @@ -121,7 +121,7 @@ cd Paddle - `-it`: 与宿主机保持交互状态; - - `registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`:使用名为`registry.baidubce.com/paddlepaddle/paddle`, tag 为`latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。 + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`:使用名为`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle`, tag 为`latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。 若使用的是从 DockerHub 拉取的镜像创建容器,则修改镜像名即可: ``` diff --git a/docs/install/compile/macos-compile-make.md b/docs/install/compile/macos-compile-make.md index 813df2634cf..12abdda7698 100644 --- a/docs/install/compile/macos-compile-make.md +++ b/docs/install/compile/macos-compile-make.md @@ -48,7 +48,7 @@ cd Paddle * CPU 版的 PaddlePaddle: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev ``` 如果您的机器不在中国大陆地区,可以直接从 DockerHub 拉取镜像: @@ -64,7 +64,7 @@ cd Paddle #### 5. 
创建并进入满足编译环境的 Docker 容器: ``` -docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash +docker run --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash ``` - `--name paddle-test`:为您创建的 Docker 容器命名为 paddle-test @@ -73,7 +73,7 @@ docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidub - `-it`:与宿主机保持交互状态 -- `registry.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`registry.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令 +- `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令 注意:请确保至少为 docker 分配 4g 以上的内存,否则编译过程可能因内存不足导致失败。您可以在 docker 用户界面的“Preferences-Resources”中设置容器的内存分配上限。 diff --git a/docs/install/compile/macos-compile-make_en.md b/docs/install/compile/macos-compile-make_en.md index 4cc2fbbea1f..6b7da5879eb 100644 --- a/docs/install/compile/macos-compile-make_en.md +++ b/docs/install/compile/macos-compile-make_en.md @@ -49,7 +49,7 @@ For domestic users, when downloading docker is slow due to network problems, you * CPU version of PaddlePaddle: ``` -docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev +docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev ``` If your machine is not in mainland China, you can pull the image directly from DockerHub: @@ -65,7 +65,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g #### 5. 
Create and enter a Docker container that meets the compilation environment: ``` -docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash +docker run --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash ``` - `--name paddle-test`: name the Docker container you created as paddle-test, @@ -74,7 +74,7 @@ docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidub - `-it`: keeps interacting with the host; -- `registry.baidubce.com/paddlepaddle/paddle:latest-dev`: creates a Docker container with a mirror named `registry.baidubce.com/paddlepaddle/paddle:latest-dev`, /bin /bash starts the /bin/bash command after entering the container. +- `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`: creates a Docker container with a mirror named `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`, /bin /bash starts the /bin/bash command after entering the container. Note: diff --git a/docs/install/compile/macos-compile-ninja.md b/docs/install/compile/macos-compile-ninja.md index bed6f9378f2..ab2c12801e3 100644 --- a/docs/install/compile/macos-compile-ninja.md +++ b/docs/install/compile/macos-compile-ninja.md @@ -48,7 +48,7 @@ cd Paddle * CPU 版的 PaddlePaddle: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev ``` 如果您的机器不在中国大陆地区,可以直接从 DockerHub 拉取镜像: * CPU 版的 PaddlePaddle: @@ -58,12 +58,12 @@ cd Paddle 您可以访问[DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/)获取与您机器适配的镜像。 #### 5. 
创建并进入满足编译环境的 Docker 容器: ``` -docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash +docker run --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash ``` - `--name paddle-test`:为您创建的 Docker 容器命名为 paddle-test - `-v:$PWD:/paddle`:将当前目录挂载到 Docker 容器中的/paddle 目录下(Linux 中 PWD 变量会展开为当前路径的[绝对路径](https://baike.baidu.com/item/绝对路径/481185)) - `-it`:与宿主机保持交互状态 -- `registry.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`registry.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令 +- `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令 注意:请确保至少为 docker 分配 4g 以上的内存,否则编译过程可能因内存不足导致失败。您可以在 docker 用户界面的“Preferences-Resources”中设置容器的内存分配上限。 #### 6. 进入 Docker 后进入 paddle 目录下: ``` diff --git a/docs/install/conda/linux-conda.md b/docs/install/conda/linux-conda.md index fc55e0a4929..caece6b81dc 100644 --- a/docs/install/conda/linux-conda.md +++ b/docs/install/conda/linux-conda.md @@ -86,39 +86,29 @@ python3 -c "import platform;print(platform.architecture()[0]);print(platform.mac #### CPU 版的 PaddlePaddle + 如果您的计算机没有 NVIDIA® GPU,请安装 CPU 版的 PaddlePaddle ``` -conda install paddlepaddle==2.6.1 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ +conda install paddlepaddle==3.0.0b2 -c paddle ``` - #### GPU 版的 PaddlePaddle - -* 对于 `CUDA 11.2`,需要搭配 cuDNN 8.2.1(多卡环境下 NCCL>=2.7),安装命令为: +* 对于 `CUDA 11.8` 安装命令为: ``` - conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge + conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=11.8 -c paddle -c nvidia ``` -* 对于 `CUDA 11.6`,需要搭配 cuDNN 8.4.0(多卡环境下 NCCL>=2.7),安装命令为: +* 对于 `CUDA 12.3` 安装命令为: ``` - conda install paddlepaddle-gpu==2.6.1 
cudatoolkit=11.6 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge + conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=12.3 -c paddle -c nvidia ``` -* 对于 `CUDA 11.7`,需要搭配 cuDNN 8.4.1(多卡环境下 NCCL>=2.7),安装命令为: - - ``` - conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.7 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge - ``` - -您可参考 NVIDIA 官方文档了解 CUDA 和 CUDNN 的安装流程和配置方法,请见[CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/) - - ## **三、验证安装** diff --git a/docs/install/conda/linux-conda_en.md b/docs/install/conda/linux-conda_en.md index 14916f84957..fefbee05785 100644 --- a/docs/install/conda/linux-conda_en.md +++ b/docs/install/conda/linux-conda_en.md @@ -91,37 +91,29 @@ You can choose the following version of PaddlePaddle to start installation: #### CPU Version of PaddlePaddle + If your computer doesn't have NVIDIA® GPU, please install `the CPU Version of PaddlePaddle` ``` -conda install paddlepaddle==2.6.1 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ +conda install paddlepaddle==3.0.0b2 -c paddle ``` - #### GPU Version of PaddlePaddle -* If you are using CUDA 11.2,cuDNN 8.2.1(for multi card support, NCCL>=2.7): +* If you are using CUDA 11.8: ``` - conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge + conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=11.8 -c paddle -c nvidia ``` -* If you are using CUDA 11.6,cuDNN 8.4.0(for multi card support, NCCL>=2.7): +* If you are using CUDA 12.3: ``` - conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.6 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge + conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=12.3 -c paddle -c nvidia ``` -* If you are using CUDA 11.7,cuDNN 8.4.1(for multi card support, NCCL>=2.7): - - ``` - conda install 
paddlepaddle-gpu==2.6.1 cudatoolkit=11.7 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge - ``` - -You can refer to NVIDIA official documents for installation process and configuration method of CUDA and cudnn. Please refer to [CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/) - ## Verify installation diff --git a/docs/install/conda/macos-conda.md b/docs/install/conda/macos-conda.md index 30600afbf54..3a43c093759 100644 --- a/docs/install/conda/macos-conda.md +++ b/docs/install/conda/macos-conda.md @@ -83,7 +83,7 @@ python3 -c "import platform;print(platform.architecture()[0]);print(platform.mac * 目前在 macOS 环境仅支持 CPU 版 PaddlePaddle,请参考如下命令安装 Paddle: ``` - conda install paddlepaddle==2.6.1 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ + conda install paddlepaddle==3.0.0b2 -c paddle ``` ## **三、验证安装** diff --git a/docs/install/conda/macos-conda_en.md b/docs/install/conda/macos-conda_en.md index 371e218972e..ac3eff46eb5 100644 --- a/docs/install/conda/macos-conda_en.md +++ b/docs/install/conda/macos-conda_en.md @@ -87,7 +87,7 @@ conda config --set show_channel_urls yes * Currently, only the CPU version of PaddlePaddle is supported in the macOS environment. 
Please use the following command to install PaddlePaddle: ``` - conda install paddlepaddle==2.6.1 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ + conda install paddlepaddle==3.0.0b2 -c paddle ``` diff --git a/docs/install/conda/windows-conda.md b/docs/install/conda/windows-conda.md index 29891f4c291..6edf9dea2f6 100644 --- a/docs/install/conda/windows-conda.md +++ b/docs/install/conda/windows-conda.md @@ -90,35 +90,27 @@ python -c "import platform;print(platform.architecture()[0]);print(platform.mach 如果您的计算机没有 NVIDIA® GPU,请安装 CPU 版的 PaddlePaddle + ``` -conda install paddlepaddle==2.6.1 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ +conda install paddlepaddle==3.0.0b2 -c paddle ``` - #### GPU 版的 PaddlePaddle -* 对于 `CUDA 11.2`,需要搭配 cuDNN 8.2.1,安装命令为: +* 对于 `CUDA 11.8` 安装命令为: ``` - conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge + conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=11.8 -c paddle -c nvidia ``` -* 对于 `CUDA 11.6`,需要搭配 cuDNN 8.4.0,安装命令为: +* 对于 `CUDA 12.3` 安装命令为: ``` - conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.6 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge + conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=12.3 -c paddle -c nvidia ``` -* 对于 `CUDA 11.7`,需要搭配 cuDNN 8.4.1,安装命令为: - - ``` - conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.7 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge - ``` - -您可参考 NVIDIA 官方文档了解 CUDA 和 CUDNN 的安装流程和配置方法,请见[CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/) - ## **三、验证安装** diff --git a/docs/install/conda/windows-conda_en.md b/docs/install/conda/windows-conda_en.md index 736e0da59a7..a7be323dfe5 100644 --- a/docs/install/conda/windows-conda_en.md +++ b/docs/install/conda/windows-conda_en.md @@ -93,10 +93,11 @@ You can choose 
the following version of PaddlePaddle to start installation: #### CPU Version of PaddlePaddle + If your computer doesn't have NVIDIA® GPU, please install `the CPU Version of PaddlePaddle` ``` -conda install paddlepaddle==2.6.1 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ +conda install paddlepaddle==3.0.0b2 -c paddle ``` @@ -105,26 +106,18 @@ conda install paddlepaddle==2.6.1 --channel https://mirrors.tuna.tsinghua.edu.cn #### GPU Version of PaddlePaddle -* If you are using CUDA 11.2,cuDNN 8.2.1: - - ``` - conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge - ``` - -* If you are using CUDA 11.6,cuDNN 8.4.0: +* If you are using CUDA 11.8: ``` - conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.6 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge + conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=11.8 -c paddle -c nvidia ``` -* If you are using CUDA 11.7,cuDNN 8.4.1: +* If you are using CUDA 12.3: ``` - conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.7 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge + conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=12.3 -c paddle -c nvidia ``` -You can refer to NVIDIA official documents for installation process and configuration method of CUDA and cudnn. 
Please refer to [CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/) - ## Verify installation diff --git a/docs/install/docker/docker_list.md b/docs/install/docker/docker_list.md index d1a48be1885..60179698128 100644 --- a/docs/install/docker/docker_list.md +++ b/docs/install/docker/docker_list.md @@ -18,7 +18,7 @@ - registry.baidubce.com/paddlepaddle/paddle:latest-dev + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev CPU @@ -26,7 +26,7 @@ 12.2 - registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.2-cudnn8.2-trt8.0-gcc82 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.2-cudnn8.2-trt8.0-gcc82 11.2 8.2 8.0 @@ -34,7 +34,7 @@ 8.2 - registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.6-cudnn8.4-trt8.4-gcc82 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.6-cudnn8.4-trt8.4-gcc82 11.6 8.4 8.4.0.6 @@ -42,7 +42,7 @@ 8.2 - registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.7-cudnn8.4-trt8.4-gcc82 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.7-cudnn8.4-trt8.4-gcc82 11.7 8.4 8.4.2.4 @@ -50,7 +50,7 @@ 8.2 - registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.8-cudnn8.6-trt8.5-gcc82 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.8-cudnn8.6-trt8.5-gcc82 11.8 8.6 8.5 @@ -58,7 +58,7 @@ 8.2 - registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 12.0 8.9 8.6 @@ -66,7 +66,7 @@ 12.2 - registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.3-cudnn9.0-trt8.6-gcc12.2 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.3-cudnn9.0-trt8.6-gcc12.2 12.3 9.0 8.6 diff --git a/docs/install/docker/docker_list_en.md b/docs/install/docker/docker_list_en.md index c70f84b5817..ac4eb66e8f5 
100644 --- a/docs/install/docker/docker_list_en.md +++ b/docs/install/docker/docker_list_en.md @@ -18,7 +18,7 @@ This document introduces the Docker environment commonly used by PaddlePaddle - registry.baidubce.com/paddlepaddle/paddle:latest-dev + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev CPU @@ -26,7 +26,7 @@ This document introduces the Docker environment commonly used by PaddlePaddle 12.2 - registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.2-cudnn8.2-trt8.0-gcc82 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.2-cudnn8.2-trt8.0-gcc82 11.2 8.2 8.0 @@ -34,7 +34,7 @@ This document introduces the Docker environment commonly used by PaddlePaddle 8.2 - registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.6-cudnn8.4-trt8.4-gcc82 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.6-cudnn8.4-trt8.4-gcc82 11.6 8.4 8.4.0.6 @@ -42,7 +42,7 @@ This document introduces the Docker environment commonly used by PaddlePaddle 8.2 - registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.7-cudnn8.4-trt8.4-gcc82 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.7-cudnn8.4-trt8.4-gcc82 11.7 8.4 8.4.2.4 @@ -50,7 +50,7 @@ This document introduces the Docker environment commonly used by PaddlePaddle 8.2 - registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.8-cudnn8.6-trt8.5-gcc82 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.8-cudnn8.6-trt8.5-gcc82 11.8 8.6 8.5 @@ -58,7 +58,7 @@ This document introduces the Docker environment commonly used by PaddlePaddle 8.2 - registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 12.0 8.9 8.6 @@ -66,7 +66,7 @@ This document introduces the Docker environment commonly used by PaddlePaddle 12.2 - 
registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.3-cudnn9.0-trt8.6-gcc12.2 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.3-cudnn9.0-trt8.6-gcc12.2 12.3 9.0 8.6 diff --git a/docs/install/docker/linux-docker.md b/docs/install/docker/linux-docker.md index 9ef65d125c1..08f2bf39c1e 100644 --- a/docs/install/docker/linux-docker.md +++ b/docs/install/docker/linux-docker.md @@ -21,46 +21,40 @@ * CPU 版的 PaddlePaddle: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1 + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2 ``` * CPU 版的 PaddlePaddle,且镜像中预装好了 jupyter: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter ``` * GPU 版的 PaddlePaddle(**建议拉取最新版本镜像,并确保已经成功安装 NVIDIA Container Toolkit**): ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0 + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda11.8-cudnn8.6-trt8.5 ``` ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.7-cudnn8.4-trt8.4 - ``` - ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6 + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6 ``` 如果您的机器不在中国大陆地区,可以直接从 DockerHub 拉取镜像: * CPU 版的 PaddlePaddle: ``` - docker pull paddlepaddle/paddle:2.6.1 + docker pull paddlepaddle/paddle:3.0.0b2 ``` * CPU 版的 PaddlePaddle,且镜像中预装好了 jupyter: ``` - docker pull paddlepaddle/paddle:2.6.1-jupyter + docker pull paddlepaddle/paddle:3.0.0b2-jupyter ``` * GPU 版的 PaddlePaddle(**建议拉取最新版本镜像,并确保已经成功安装 NVIDIA Container Toolkit**): ``` - docker pull paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0 - ``` + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda11.8-cudnn8.6-trt8.5 ``` - docker pull 
paddlepaddle/paddle:2.6.1-gpu-cuda11.7-cudnn8.4-trt8.4 ``` - ``` - docker pull paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6 ``` 您还可以访问[DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/)获取更多镜像。 @@ -72,7 +66,7 @@ ``` - docker run --name paddle_docker -it -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1 /bin/bash + docker run --name paddle_docker -it -v $PWD:/paddle ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2 /bin/bash ``` - `--name paddle_docker`:设定 Docker 的名称,`paddle_docker` 是自己设置的名称; @@ -83,7 +77,7 @@ - `-v $PWD:/paddle`:指定将当前路径(PWD 变量会展开为当前路径的绝对路径)挂载到容器内部的 /paddle 目录; - - `registry.baidubce.com/paddlepaddle/paddle:2.6.1`:指定需要使用的 image 名称,您可以通过`docker images`命令查看;/bin/bash 是在 Docker 中要执行的命令 + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2`:指定需要使用的 image 名称,您可以通过`docker images`命令查看;/bin/bash 是在 Docker 中要执行的命令 * 使用 CPU 版本的 PaddlePaddle,且镜像中预装好了 jupyter: @@ -98,7 +92,7 @@ cd ./jupyter_docker ``` ``` - docker run -p 80:80 --rm --env USER_PASSWD="password you set" -v $PWD:/home/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter + docker run -p 80:80 --rm --env USER_PASSWD="password you set" -v $PWD:/home/paddle ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter ``` - `--rm`:关闭容器后删除容器; @@ -109,13 +103,13 @@ - `-v $PWD:/home/paddle`:指定将当前路径(PWD 变量会展开为当前路径的绝对路径)挂载到容器内部的 /home/paddle 目录; - - `registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter`:指定需要使用的 image 名称,您可以通过`docker images`命令查看 + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter`:指定需要使用的 image 名称,您可以通过`docker images`命令查看 * 使用 GPU 版本的 PaddlePaddle: ``` - docker run --gpus all --name paddle_docker -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash + docker run --gpus all --name paddle_docker -v 
$PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6 /bin/bash ``` - `--gpus all`: 在 Docker 容器中允许使用 gpu; @@ -127,7 +121,7 @@ - `-it`: 与宿主机保持交互状态; - - `registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`:使用名为`registry.baidubce.com/paddlepaddle/paddle`, tag 为`latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。 + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6`:使用名为`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle`, tag 为`3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。 @@ -146,24 +140,20 @@ - registry.baidubce.com/paddlepaddle/paddle:2.6.1 - 安装了 2.6.1 版本 paddle 的 CPU 镜像 - - - registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter - 安装了 2.6.1 版本 paddle 的 CPU 镜像,且镜像中预装好了 jupyter,启动 docker 即运行 jupyter 服务 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2 + 安装了 3.0.0b2 版本 paddle 的 CPU 镜像 - registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6 - 安装了 2.6.1 版本 paddle 的 GPU 镜像,cuda 版本为 12.0,cudnn 版本为 8.9,trt 版本为 8.6 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter + 安装了 3.0.0b2 版本 paddle 的 CPU 镜像,且镜像中预装好了 jupyter,启动 docker 即运行 jupyter 服务 - registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.7-cudnn8.4-trt8.4 - 安装了 2.6.1 版本 paddle 的 GPU 镜像,cuda 版本为 11.7,cudnn 版本为 8.4,trt 版本为 8.4 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda11.8-cudnn8.6-trt8.5 + 安装了 3.0.0b2 版本 paddle 的 GPU 镜像,cuda 版本为 11.8,cudnn 版本为 8.6,trt 版本为 8.5 - registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0 - 安装了 2.6.1 版本 paddle 的 GPU 镜像,cuda 版本为 11.2,cudnn 版本为 8.2,trt 版本为 8.0 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6 + 安装了 3.0.0b2 版本 paddle 的 GPU 镜像,cuda 版本为 12.3,cudnn 版本为 9.0,trt 版本为 8.6 diff --git 
a/docs/install/docker/linux-docker_en.md b/docs/install/docker/linux-docker_en.md index 71f66e14a35..8c08f70bca5 100644 --- a/docs/install/docker/linux-docker_en.md +++ b/docs/install/docker/linux-docker_en.md @@ -21,46 +21,40 @@ For domestic users, when downloading docker is slow due to network problems, you * CPU version of PaddlePaddle: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1 + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2 ``` * CPU version of PaddlePaddle, and the image is pre-installed with jupyter: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter ``` * GPU version of PaddlePaddle(**Latest version of gpu image is recommended, and make sure NVIDIA Container Toolkit is installed successfully**): ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0 + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6 ``` ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.7-cudnn8.4-trt8.4 - ``` - ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6 + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda11.8-cudnn8.6-trt8.5 ``` If your machine is not in mainland China, you can pull the image directly from DockerHub: * CPU version of PaddlePaddle: ``` - docker pull paddlepaddle/paddle:2.6.1 + docker pull paddlepaddle/paddle:3.0.0b2 ``` * CPU version of PaddlePaddle, and the image is pre-installed with jupyter: ``` - docker pull paddlepaddle/paddle:2.6.1-jupyter + docker pull paddlepaddle/paddle:3.0.0b2-jupyter ``` * GPU version of PaddlePaddle(**Latest version of gpu image is recommended, and make sure NVIDIA Container Toolkit is installed successfully**): ``` - docker pull 
paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0 - ``` + docker pull paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6 ``` - docker pull paddlepaddle/paddle:2.6.1-gpu-cuda11.7-cudnn8.4-trt8.4 ``` - ``` - docker pull paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6 + docker pull paddlepaddle/paddle:3.0.0b2-gpu-cuda11.8-cudnn8.6-trt8.5 ``` You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to get more images. @@ -72,7 +66,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g ``` - docker run --name paddle_docker -it -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1 /bin/bash + docker run --name paddle_docker -it -v $PWD:/paddle ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2 /bin/bash ``` - `--name paddle_docker`: set name of Docker, `paddle_docker` is name of docker you set; @@ -83,7 +77,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g - `-v $PWD:/paddle`: Specifies to mount the current path of the host (PWD variable in Linux will expand to the absolute path of the current path) to the /paddle directory inside the container; - - `registry.baidubce.com/paddlepaddle/paddle:2.6.1`: Specify the name of the image to be used. You can view it through the 'docker images' command. /bin/Bash is the command to be executed in Docker + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2`: Specify the name of the image to be used. You can view it through the 'docker images' command. 
/bin/Bash is the command to be executed in Docker * Use GPU version of PaddlePaddle: @@ -91,7 +85,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g ``` - docker run --gpus all --name paddle_docker -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash + docker run --gpus all --name paddle_docker -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6 /bin/bash ``` - `--gpus all`: gpu resources can be used in Docker container; @@ -104,7 +98,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g - `-v $PWD:/paddle`: Specifies to mount the current path of the host (PWD variable in Linux will expand to the absolute path of the current path) to the /paddle directory inside the container; - - `registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`: Specify the name of the image to be used. You can view it through the 'docker images' command. /bin/Bash is the command to be executed in Docker + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6`: Specify the name of the image to be used. You can view it through the 'docker images' command. 
/bin/Bash is the command to be executed in Docker * Use CPU version of PaddlePaddle with jupyter: @@ -120,7 +114,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g cd ./jupyter_docker ``` ``` - docker run -p 80:80 --rm --env USER_PASSWD="password you set" -v $PWD:/home/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter + docker run -p 80:80 --rm --env USER_PASSWD="password you set" -v $PWD:/home/paddle ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter ``` - `--rm`: Delete the container after closing it; @@ -131,7 +125,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g - `-v $PWD:/home/paddle`: Specifies to mount the current path (the PWD variable will be expanded to the absolute path of the current path) to the /home/paddle directory inside the container; - - `registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter`: Specify the name of the image to be used, you can view it through the `docker images` command + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter`: Specify the name of the image to be used, you can view it through the `docker images` command Now you have successfully used Docker to install PaddlePaddle. For more information about using Docker, see[Docker official documents](https://docs.docker.com) @@ -149,24 +143,20 @@ Now you have successfully used Docker to install PaddlePaddle. For more informat - registry.baidubce.com/paddlepaddle/paddle:2.6.1 - CPU image with 2.6.1 version of paddle installed - - - registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter - CPU image of paddle version 2.6.1 is installed, and jupyter is pre-installed in the image. 
Start the docker to run the jupyter service + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2 + CPU image with 3.0.0b2 version of paddle installed - registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6 - GPU image of paddle version 2.6.1 is installed, cuda version is 12.0, cudnn version is 8.9, trt version is 8.6 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter + CPU image of paddle version 3.0.0b2 is installed, and jupyter is pre-installed in the image. Start the docker to run the jupyter service - registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.7-cudnn8.4-trt8.4 - GPU image of paddle version 2.6.1 is installed, cuda version is 11.7, cudnn version is 8.4, trt version is 8.4 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6 + GPU image of paddle version 3.0.0b2 is installed, cuda version is 12.3, cudnn version is 9.0, trt version is 8.6 - registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0 - GPU image of paddle version 2.6.1 is installed, cuda version is 11.2, cudnn version is 8.2, trt version is 8.0 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda11.8-cudnn8.6-trt8.5 + GPU image of paddle version 3.0.0b2 is installed, cuda version is 11.8, cudnn version is 8.6, trt version is 8.5 diff --git a/docs/install/docker/macos-docker.md b/docs/install/docker/macos-docker.md index 348b66bb22a..59d56f2c2a9 100644 --- a/docs/install/docker/macos-docker.md +++ b/docs/install/docker/macos-docker.md @@ -19,24 +19,24 @@ * CPU 版的 PaddlePaddle: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1 + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2 ``` * CPU 版的 PaddlePaddle,且镜像中预装好了 jupyter: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter ``` 
如果您的机器不在中国大陆地区,可以直接从 DockerHub 拉取镜像: * CPU 版的 PaddlePaddle: ``` - docker pull paddlepaddle/paddle:2.6.1 + docker pull paddlepaddle/paddle:3.0.0b2 ``` * CPU 版的 PaddlePaddle,且镜像中预装好了 jupyter: ``` - docker pull paddlepaddle/paddle:2.6.1-jupyter + docker pull paddlepaddle/paddle:3.0.0b2-jupyter ``` 您还可以访问[DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/)获取更多镜像。 @@ -48,7 +48,7 @@ ``` - docker run --name paddle_docker -it -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1 /bin/bash + docker run --name paddle_docker -it -v $PWD:/paddle ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2 /bin/bash ``` - `--name paddle_docker`:设定 Docker 的名称,`paddle_docker` 是自己设置的名称; @@ -59,7 +59,7 @@ - `-v $PWD:/paddle`:指定将当前路径(PWD 变量会展开为当前路径的绝对路径)挂载到容器内部的 /paddle 目录; - - `registry.baidubce.com/paddlepaddle/paddle:2.6.1`:指定需要使用的 image 名称,您可以通过`docker images`命令查看;/bin/bash 是在 Docker 中要执行的命令 + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2`:指定需要使用的 image 名称,您可以通过`docker images`命令查看;/bin/bash 是在 Docker 中要执行的命令 * 使用 CPU 版本的 PaddlePaddle,且镜像中预装好了 jupyter: @@ -73,7 +73,7 @@ cd ./jupyter_docker ``` ``` - docker run -p 80:80 --rm --env USER_PASSWD="password you set" -v $PWD:/home/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter + docker run -p 80:80 --rm --env USER_PASSWD="password you set" -v $PWD:/home/paddle ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter ``` - `--rm`:关闭容器后删除容器; @@ -84,7 +84,7 @@ - `-v $PWD:/home/paddle`:指定将当前路径(PWD 变量会展开为当前路径的绝对路径)挂载到容器内部的 /home/paddle 目录; - - `registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter`:指定需要使用的 image 名称,您可以通过`docker images`命令查看 + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter`:指定需要使用的 image 名称,您可以通过`docker images`命令查看 @@ -104,12 +104,12 @@ - registry.baidubce.com/paddlepaddle/paddle:2.6.1 - 安装了 2.6.1 版本 paddle 的 CPU 镜像 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2 + 安装了 3.0.0b2 版本 paddle 的 CPU 镜像 - 
registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter - 安装了 2.6.1 版本 paddle 的 CPU 镜像,且镜像中预装好了 jupyter,启动 docker 即运行 jupyter 服务 + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter + 安装了 3.0.0b2 版本 paddle 的 CPU 镜像,且镜像中预装好了 jupyter,启动 docker 即运行 jupyter 服务 diff --git a/docs/install/docker/macos-docker_en.md b/docs/install/docker/macos-docker_en.md index fadf8883c33..80148a14a04 100644 --- a/docs/install/docker/macos-docker_en.md +++ b/docs/install/docker/macos-docker_en.md @@ -19,24 +19,24 @@ For domestic users, when downloading docker is slow due to network problems, you * CPU version of PaddlePaddle: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1 + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2 ``` * CPU version of PaddlePaddle, and the image is pre-installed with jupyter: ``` - docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter + docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter ``` If your machine is not in mainland China, you can pull the image directly from DockerHub: * CPU version of PaddlePaddle: ``` - docker pull paddlepaddle/paddle:2.6.1 + docker pull paddlepaddle/paddle:3.0.0b2 ``` * CPU version of PaddlePaddle, and the image is pre-installed with jupyter: ``` - docker pull paddlepaddle/paddle:2.6.1-jupyter + docker pull paddlepaddle/paddle:3.0.0b2-jupyter ``` You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to get more images. 
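Most of the changes in these Docker hunks replace the registry host `registry.baidubce.com` with `ccr-2vdh3abv-pub.cnc.bj.baidubce.com` (the image tags are also bumped from 2.6.1 to 3.0.0b2, which the sketch below does not handle). As a minimal sketch, assuming you have local scripts that still reference the old host, a helper along these lines could rewrite a single image reference. The function name `rewrite_registry` is hypothetical and is not part of the PaddlePaddle repository.

```shell
# Hypothetical helper (not from the PaddlePaddle repo): swap the old registry
# host for the new one at the start of a single image reference.
rewrite_registry() {
  # Only a leading "registry.baidubce.com/" is replaced; the repository path
  # and tag are passed through unchanged.
  printf '%s\n' "$1" |
    sed 's|^registry\.baidubce\.com/|ccr-2vdh3abv-pub.cnc.bj.baidubce.com/|'
}

# Example call with an image reference taken from the diff above.
rewrite_registry 'registry.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter'
# prints: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter
```

References that do not start with the old host, such as the DockerHub form `paddlepaddle/paddle:3.0.0b2`, pass through unchanged.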
@@ -48,7 +48,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g ``` - docker run --name paddle_docker -it -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1 /bin/bash + docker run --name paddle_docker -it -v $PWD:/paddle ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2 /bin/bash ``` - `--name paddle_docker`: set name of Docker, `paddle_docker` is name of docker you set; @@ -59,7 +59,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g - `-v $PWD:/paddle`: Specifies to mount the current path of the host (PWD variable in Linux will expand to the absolute path of the current path) to the /paddle directory inside the container; - - `registry.baidubce.com/paddlepaddle/paddle:2.6.1`: Specify the name of the image to be used. You can view it through the 'docker images' command. /bin/Bash is the command to be executed in Docker + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2`: Specify the name of the image to be used. You can view it through the 'docker images' command. 
/bin/Bash is the command to be executed in Docker * Use CPU version of PaddlePaddle with jupyter: @@ -75,7 +75,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g cd ./jupyter_docker ``` ``` - docker run -p 80:80 --rm --env USER_PASSWD="password you set" -v $PWD:/home/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter + docker run -p 80:80 --rm --env USER_PASSWD="password you set" -v $PWD:/home/paddle ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter ``` - `--rm`: Delete the container after closing it; @@ -86,7 +86,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g - `-v $PWD:/home/paddle`: Specifies to mount the current path (the PWD variable will be expanded to the absolute path of the current path) to the /home/paddle directory inside the container; - - `registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter`: Specify the name of the image to be used, you can view it through the `docker images` command + - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter`: Specify the name of the image to be used, you can view it through the `docker images` command @@ -105,12 +105,12 @@ Now you have successfully used Docker to install PaddlePaddle. For more informat - registry.baidubce.com/paddlepaddle/paddle:2.6.1 - CPU image with 2.6.1 version of paddle installed + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2 + CPU image with 3.0.0b2 version of paddle installed - registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter - CPU image of paddle version 2.6.1 is installed, and jupyter is pre-installed in the image. Start the docker to run the jupyter service + ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter + CPU image of paddle version 3.0.0b2 is installed, and jupyter is pre-installed in the image. 
Start the docker to run the jupyter service diff --git a/docs/install/pip/linux-pip.md b/docs/install/pip/linux-pip.md index 19c12ab2ec4..b3a64e0cebe 100644 --- a/docs/install/pip/linux-pip.md +++ b/docs/install/pip/linux-pip.md @@ -2,6 +2,8 @@ [The Python Package Index(PyPI)](https://pypi.org/)是 Python 的包管理器。本文档为你介绍 PyPI 安装方式,飞桨提供的 PyPI 安装包支持分布式训练(多机多卡)、TensorRT 推理功能。 +* 您无需再安装 CUDA\CUDNN\NCCL 等软件, paddle whl 包中已经自带, 直接安装 paddle whl 包即可 + ## 一、环境准备 ### 1.1 如何查看您的环境 @@ -31,9 +33,6 @@ * 需要确认 pip 的版本是否满足要求,要求 pip 版本为 20.2.2 或更高版本 - ``` - python3 -m ensurepip - ``` ``` python3 -m pip --version @@ -50,7 +49,7 @@ -* 默认提供的安装包需要计算机支持 MKL +* 默认提供的安装包需要计算机支持 MKL, Intel 芯片都支持 MKL * 如果您对机器环境不了解,请下载使用[快速安装脚本](https://fast-install.bj.bcebos.com/fast_install.sh),配套说明请参考[这里](https://github.com/PaddlePaddle/FluidDoc/tree/develop/docs/install/install_script.md)。 @@ -64,131 +63,39 @@ * 如果您的计算机有 NVIDIA® GPU,请确保满足以下条件并且安装[GPU 版 PaddlePaddle](#gpu),依赖库环境版本要求如下: - * **CUDA 工具包 11.2 配合 cuDNN v8.2.1, 如需使用 PaddleTensorRT 推理,需配合 TensorRT8.0.3.4** - - * **CUDA 工具包 11.6 配合 cuDNN v8.4.0, 如需使用 PaddleTensorRT 推理,需配合 TensorRT8.4.0.6** - - * **CUDA 工具包 11.7 配合 cuDNN v8.4.1, 如需使用 PaddleTensorRT 推理,需配合 TensorRT8.4.2.4** - - * **CUDA 工具包 11.8 配合 cuDNN v8.6.0, 如需使用 PaddleTensorRT 推理,需配合 TensorRT8.5.1.7** - - * **CUDA 工具包 12.0 配合 cuDNN v8.9.1, 如需使用 PaddleTensorRT 推理,需配合 TensorRT8.6.1.6** - - * **如需使用分布式多卡环境,需配合 NCCL>=2.7** - * **GPU 运算能力超过 6.0 的硬件设备** 您可参考 NVIDIA 官方文档了解 CUDA、CUDNN 和 TensorRT 的安装流程和配置方法,请见[CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/),[TensorRT](https://developer.nvidia.com/tensorrt) -* 如果您需要使用多卡环境请确保您已经正确安装 nccl2,或者按照以下指令安装 nccl2(这里提供的是 CUDA11.2,cuDNN7 下 nccl2 的安装指令,更多版本的安装信息请参考 NVIDIA[官方网站](https://developer.nvidia.com/nccl)): - - - ``` - rm -f /usr/local/lib/libnccl.so - wget --no-check-certificate -q https://nccl2-deb.cdn.bcebos.com/libnccl-2.10.3-1+cuda11.4.x86_64.rpm - wget --no-check-certificate 
-q https://nccl2-deb.cdn.bcebos.com/libnccl-devel-2.10.3-1+cuda11.4.x86_64.rpm - wget --no-check-certificate -q https://nccl2-deb.cdn.bcebos.com/libnccl-static-2.10.3-1+cuda11.4.x86_64.rpm - rpm -ivh libnccl-2.10.3-1+cuda11.4.x86_64.rpm - rpm -ivh libnccl-devel-2.10.3-1+cuda11.4.x86_64.rpm - rpm -ivh libnccl-static-2.10.3-1+cuda11.4.x86_64.rpm - ``` #### 2.1 CPU 版的 PaddlePaddle ``` - python3 -m pip install paddlepaddle==2.6.1 -i https://mirror.baidu.com/pypi/simple + python3 -m pip install paddlepaddle==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/ ``` - #### 2.2 GPU 版的 PaddlePaddle - -2.2.1 CUDA11.2 的 PaddlePaddle - - - ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html - ``` - - - CUDA11.2 包含 cuDNN 动态链接库的 PaddlePaddle - - - ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post112 -f https://www.paddlepaddle.org.cn/whl/linux/cudnnin/stable.html - ``` - - -2.2.3 CUDA11.6 的 PaddlePaddle - - - ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post116 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html - ``` - - - CUDA11.6 包含 cuDNN 动态链接库的 PaddlePaddle +2.2.1 CUDA11.8 的 PaddlePaddle(依赖 gcc8+, 如果需要使用 TensorRT 可自行安装 TensorRT8.5.3.1) ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post116 -f https://www.paddlepaddle.org.cn/whl/linux/cudnnin/stable.html + python3 -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/ ``` -2.2.4 CUDA11.7 的 PaddlePaddle +2.2.2 CUDA12.3 的 PaddlePaddle(依赖 gcc12+, 如果需要使用 TensorRT 可自行安装 TensorRT8.6.1.6) ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post117 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html + python3 -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/ ``` - CUDA11.7 包含 cuDNN 动态链接库的 PaddlePaddle - - - ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post117 -f 
https://www.paddlepaddle.org.cn/whl/linux/cudnnin/stable.html - ``` - - -2.2.5 CUDA11.8 的 PaddlePaddle - - - ``` - python3 -m pip install paddlepaddle-gpu==2.6.1 -i https://mirror.baidu.com/pypi/simple - ``` - - - CUDA11.8 包含 cuDNN 动态链接库的 PaddlePaddle,需要先使用如下命令将 wheel 包下载到本地,再使用`python3 -m pip install [name].whl`本地安装([name]为 wheel 包名称): - - - ``` - python3 -m pip download paddlepaddle-gpu==2.6.1 -f https://www.paddlepaddle.org.cn/whl/linux/cudnnin/stable.html --no-index --no-deps - - ``` - - -2.2.6 CUDA12.0 的 PaddlePaddle - - - ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post120 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html - ``` - - - CUDA12.0 包含 cuDNN 动态链接库的 PaddlePaddle - - - ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post120 -f https://www.paddlepaddle.org.cn/whl/linux/cudnnin/stable.html - ``` - - - 注: * 飞桨对于主流各 python 版本均提供了对应的安装包,而您环境中可能有多个 Python,请确认你想使用的 python 版本并下载对应的 paddlepaddle 安装包。例如您想使用 python3.10 的环境,则安装命令为 python3.10 -m pip install paddlepaddle。 @@ -201,7 +108,7 @@ * 如果你想安装`avx`、`openblas`的 Paddle 包,可以通过以下命令将 wheel 包下载到本地,再使用`python3 -m pip install [name].whl`本地安装([name]为 wheel 包名称): ``` - python3 -m pip download paddlepaddle==2.6.1 -f https://www.paddlepaddle.org.cn/whl/linux/openblas/avx/stable.html --no-index --no-deps + python3 -m pip install https://paddle-wheel.bj.bcebos.com/3.0.0-beta0/linux/linux-cpu-openblas-avx/paddlepaddle-3.0.0b2-cp38-cp38-linux_x86_64.whl ``` diff --git a/docs/install/pip/linux-pip_en.md b/docs/install/pip/linux-pip_en.md index c032bbb1603..0108420f83a 100644 --- a/docs/install/pip/linux-pip_en.md +++ b/docs/install/pip/linux-pip_en.md @@ -1,5 +1,9 @@ # Install on Linux via PIP +[The Python Package Index(PyPI)]( https://pypi.org/ )It is a package manager for Python. This document introduces the PyPI installation method. The PyPI installation package provided by PaddlePaddle supports distributed training (multiple computers and multiple cards) and TensorRT reasoning functions. 
+ +* You don't need to install CUDA, CUDNN, NCCL and other software anymore. The Paddle WHL package already comes with it, just install the Paddle WHL package directly + ## Environmental preparation ### 1.1 How to check your environment @@ -31,10 +35,6 @@ * It is required to confirm whether the version of pip meets the requirements. The version of pip is required to be 20.2.2 or above - ``` - python3 -m ensurepip - ``` - ``` python3 -m pip --version ``` @@ -49,10 +49,11 @@ -* The installation package provided by default requires computer support for MKL - -* If you do not know the machine environment, please download and use[Quick install script](https://fast-install.bj.bcebos.com/fast_install.sh), for instructions please refer to[here](https://github.com/PaddlePaddle/FluidDoc/tree/develop/doc/fluid/install/install_script.md)。 +* The installation package provided by default requires computer support for MKL, Intel chips all support MKL + ``` + cat /proc/cpuinfo + ``` ## INSTALLATION @@ -63,33 +64,10 @@ * If your computer has NVIDIA® GPU, please make sure that the following conditions are met and install [the GPU Version of PaddlePaddle](#gpu) - * **CUDA toolkit 11.2 with cuDNN v8.2.1(for multi card support, NCCL2.7 or higher;for PaddleTensorRT deployment, TensorRT8.0.3.4)** - - * **CUDA toolkit 11.6 with cuDNN v8.4.0(for multi card support, NCCL2.7 or higher;for PaddleTensorRT deployment, TensorRT8.4.0.6)** - - * **CUDA toolkit 11.7 with cuDNN v8.4.1(for multi card support, NCCL2.7 or higher;for PaddleTensorRT deployment, TensorRT8.4.2.4)** - - * **CUDA toolkit 11.8 with cuDNN v8.6.0(for multi card support, NCCL2.7 or higher;for PaddleTensorRT deployment, TensorRT8.5.1.7)** - - * **CUDA toolkit 12.0 with cuDNN v8.9.1(for multi card support, NCCL2.7 or higher;for PaddleTensorRT deployment, TensorRT8.6.1.6)** - - * **Hardware devices with GPU computing power over 3.5** + * **Hardware devices with GPU computing power over 6.0** You can refer to NVIDIA official 
documents for installation process and configuration method of CUDA, cuDNN and TensorRT. Please refer to [CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/),[TensorRT](https://developer.nvidia.com/tensorrt) -* If you need to use a multi-card environment, please make sure that you have installed nccl2 correctly, or install nccl2 according to the following instructions (here are the installation instructions of nccl2 under CUDA11.2 and cuDNN7. For more version installation information, please refer to NVIDIA [Official Website](https://developer.nvidia.com/nccl)): - - - ``` - rm -f /usr/local/lib/libnccl.so - wget --no-check-certificate -q https://nccl2-deb.cdn.bcebos.com/libnccl-2.10.3-1+cuda11.4.x86_64.rpm - wget --no-check-certificate -q https://nccl2-deb.cdn.bcebos.com/libnccl-devel-2.10.3-1+cuda11.4.x86_64.rpm - wget --no-check-certificate -q https://nccl2-deb.cdn.bcebos.com/libnccl-static-2.10.3-1+cuda11.4.x86_64.rpm - rpm -ivh libnccl-2.10.3-1+cuda11.4.x86_64.rpm - rpm -ivh libnccl-devel-2.10.3-1+cuda11.4.x86_64.rpm - rpm -ivh libnccl-static-2.10.3-1+cuda11.4.x86_64.rpm - ``` - ## Installation Step @@ -102,93 +80,25 @@ You can choose the following version of PaddlePaddle to start installation: ``` - python3 -m pip install paddlepaddle==2.6.1 -i https://mirror.baidu.com/pypi/simple + python3 -m pip install paddlepaddle==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/ ``` - #### 2.2 GPU Version of PaddlePaddle - -2.2.1 If you are using CUDA 11.2 - - - ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post112 -i https://mirror.baidu.com/pypi/simple - ``` - - - CUDA11.2 with cuDNN dynamic library PaddlePaddle - - - ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post112 -f https://www.paddlepaddle.org.cn/whl/linux/cudnnin/stable.html - ``` - - -2.2.2 If you are using CUDA 11.6 - - - ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post116 -f 
https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html - ``` - - - CUDA11.6 with cuDNN dynamic library PaddlePaddle - - - ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post116 -f https://www.paddlepaddle.org.cn/whl/linux/cudnnin/stable.html - ``` - - -2.2.3 If you are using CUDA 11.7 +2.2.4 If you are using CUDA 11.8(Dependent on GCC8+, If you need to use TensorRT, you can install TensorRT 8.5.3.1 yourself) ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post117 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html + python3 -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/ ``` - CUDA11.7 with cuDNN dynamic library PaddlePaddle - - - ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post117 -f https://www.paddlepaddle.org.cn/whl/linux/cudnnin/stable.html - ``` - - -2.2.4 If you are using CUDA 11.8 - - - ``` - python3 -m pip install paddlepaddle-gpu==2.6.1 -i https://mirror.baidu.com/pypi/simple - ``` - - - CUDA11.8 with cuDNN dynamic library PaddlePaddle, you can use the following command to download the wheel package to the local, and then use `python3 -m pip install [name].whl` to install locally ([name] is the name of the wheel package) - - - ``` - python3 -m pip download paddlepaddle-gpu==2.6.1 -f https://www.paddlepaddle.org.cn/whl/linux/cudnnin/stable.html --no-index --no-deps - - ``` - - -2.2.5 If you are using CUDA 12.0 - - - ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post120 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html - ``` - - - CUDA12.0 with cuDNN dynamic library PaddlePaddle - +2.2.5 If you are using CUDA 12.3(Dependent on GCC8+, If you need to use TensorRT, you can install TensorRT 8.6.1.6 yourself) ``` - python3 -m pip install paddlepaddle-gpu==2.6.1.post120 -f https://www.paddlepaddle.org.cn/whl/linux/cudnnin/stable.html + python3 -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/ ``` @@ 
-204,7 +114,7 @@ Note: * If you want to install the Paddle package with `avx` and `openblas`, you can use the following command to download the wheel package to the local, and then use `python3 -m pip install [name].whl` to install locally ([name] is the name of the wheel package): ``` - python3 -m pip download paddlepaddle==2.6.1 -f https://www.paddlepaddle.org.cn/whl/linux/openblas/avx/stable.html --no-index --no-deps + python3 -m pip install https://paddle-wheel.bj.bcebos.com/3.0.0-beta0/linux/linux-cpu-openblas-avx/paddlepaddle-3.0.0b2-cp38-cp38-linux_x86_64.whl ``` diff --git a/docs/install/pip/macos-pip.md b/docs/install/pip/macos-pip.md index 0dd481ab901..04638ea2840 100644 --- a/docs/install/pip/macos-pip.md +++ b/docs/install/pip/macos-pip.md @@ -23,7 +23,6 @@ ``` - * 需要确认 python 的版本是否满足要求 * 使用以下命令确认是 3.8/3.9/3.10/3.11/3.12 @@ -35,10 +34,6 @@ * 需要确认 pip 的版本是否满足要求,要求 pip 版本为 20.2.2 或更高版本 - ``` - python3 -m ensurepip - ``` - ``` python3 -m pip --version ``` @@ -70,7 +65,7 @@ ``` - python3 -m pip install paddlepaddle==2.6.1 -i https://mirror.baidu.com/pypi/simple + python3 -m pip install paddlepaddle==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/ ``` diff --git a/docs/install/pip/macos-pip_en.md b/docs/install/pip/macos-pip_en.md index 1ac04afc248..1b277d4dd97 100644 --- a/docs/install/pip/macos-pip_en.md +++ b/docs/install/pip/macos-pip_en.md @@ -11,7 +11,6 @@ ``` - * Confirm that the Python where you need to install PaddlePaddle is your expected location, because your computer may have multiple Python * Use the following command to output Python path. 
Depending on the environment, you may need to replace python3 in all command lines in the description with specific Python path @@ -21,7 +20,6 @@ ``` - * You need to confirm whether the version of Python meets the requirements * Use the following command to confirm that it is 3.8/3.9/3.10/3.11/3.12 @@ -30,9 +28,6 @@ * It is required to confirm whether the version of pip meets the requirements. The version of pip is required to be 20.2.2 or above - ``` - python3 -m ensurepip - ``` ``` python3 -m pip --version @@ -66,7 +61,7 @@ You can choose the following version of PaddlePaddle to start installation: ``` -python3 -m pip install paddlepaddle==2.6.1 -i https://mirror.baidu.com/pypi/simple +python3 -m pip install paddlepaddle==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/ ``` Note: diff --git a/docs/install/pip/windows-pip.md b/docs/install/pip/windows-pip.md index 48946f16bb0..26ba892299d 100644 --- a/docs/install/pip/windows-pip.md +++ b/docs/install/pip/windows-pip.md @@ -16,10 +16,6 @@ * 需要确认 pip 的版本是否满足要求,要求 pip 版本为 20.2.2 或更高版本 - ``` - python -m ensurepip - ``` - ``` python -m pip --version ``` @@ -32,8 +28,8 @@ ``` -* 默认提供的安装包需要计算机支持 MKL * Windows 暂不支持 NCCL,分布式等相关功能 +* 默认提供的安装包需要计算机支持 MKL, Intel 芯片都支持 MKL ## 二、开始安装 @@ -46,17 +42,7 @@ * 如果您的计算机有 NVIDIA® GPU,请确保满足以下条件并且安装 GPU 版 PaddlePaddle - * **CUDA 工具包 11.2 配合 cuDNN v8.2.1,如需使用 PaddleTensorRT 推理,需配合 TensorRT8.2.4.2** - - * **CUDA 工具包 11.6 配合 cuDNN v8.4.0,如需使用 PaddleTensorRT 推理,需配合 TensorRT8.4.0.6** - - * **CUDA 工具包 11.7 配合 cuDNN v8.4.1,如需使用 PaddleTensorRT 推理,需配合 TensorRT8.4.2.4** - - * **CUDA 工具包 11.8 配合 cuDNN v8.6.0,如需使用 PaddleTensorRT 推理,需配合 TensorRT8.5.1.7** - - * **CUDA 工具包 12.0 配合 cuDNN v8.9.1, 如需使用 PaddleTensorRT 推理,需配合 TensorRT8.6.1.6** - - * **GPU 运算能力超过 3.5 的硬件设备** + * **GPU 运算能力超过 6.0 的硬件设备** * 注:目前官方发布的 windows 安装包仅包含 CUDA 11.2/11.6/11.7/11.8/12.0,如需使用其他 cuda 版本,请通过源码自行编译。您可参考 NVIDIA 官方文档了解 CUDA、CUDNN 和 TensorRT 
的安装流程和配置方法,请见[CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/),[TensorRT](https://developer.nvidia.com/tensorrt) @@ -71,41 +57,23 @@ ``` - python -m pip install paddlepaddle==2.6.1 -i https://mirror.baidu.com/pypi/simple - ``` - -#### 2.2 GPU 版的 PaddlePaddle - - - -2.2.1 CUDA11.2 的 PaddlePaddle - - ``` - python -m pip install paddlepaddle-gpu==2.6.1.post112 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html + python -m pip install paddlepaddle==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/ ``` -2.2.2 CUDA11.6 的 PaddlePaddle - ``` - python -m pip install paddlepaddle-gpu==2.6.1.post116 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html - ``` - -2.2.3 CUDA11.7 的 PaddlePaddle +#### 2.2 GPU 版的 PaddlePaddle - ``` - python -m pip install paddlepaddle-gpu==2.6.1.post117 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html - ``` -2.2.4 CUDA11.8 的 PaddlePaddle +2.2.4 CUDA11.8 的 PaddlePaddle(如果需要使用 TensorRT 可自行安装 TensorRT8.5.1.7) ``` - python -m pip install paddlepaddle-gpu==2.6.1 -i https://mirror.baidu.com/pypi/simple + python -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/ ``` -2.2.5 CUDA12.0 的 PaddlePaddle +2.2.5 CUDA12.3 的 PaddlePaddle(如果需要使用 TensorRT 可自行安装 TensorRT8.6.1.6) ``` - python -m pip install paddlepaddle-gpu==2.6.1.post120 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html + python -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/ ``` 注: @@ -117,7 +85,7 @@ * 如果你想安装`avx`、`openblas`的 Paddle 包,可以通过以下命令将 wheel 包下载到本地,再使用`python -m pip install [name].whl`本地安装([name]为 wheel 包名称): ``` - python -m pip download paddlepaddle==2.6.1 -f https://www.paddlepaddle.org.cn/whl/windows/openblas/avx/stable.html --no-index --no-deps + python -m pip install 
https://paddle-wheel.bj.bcebos.com/3.0.0-beta0/windows/windows-cpu-avx-openblas-vs2017/paddlepaddle-3.0.0b2-cp38-cp38-win_amd64.whl
  ```
diff --git a/docs/install/pip/windows-pip_en.md b/docs/install/pip/windows-pip_en.md
index e21c9fd5497..4a390b395e2 100644
--- a/docs/install/pip/windows-pip_en.md
+++ b/docs/install/pip/windows-pip_en.md
@@ -13,10 +13,6 @@
 * Confirm whether the version of pip meets the requirements. The version of pip is required to be 20.2.2 or above
- ```
- python -m ensurepip
- ```
-
  ```
  python -m pip --version
  ```
@@ -28,9 +24,8 @@
  ```
-* The installation package provided by default requires computer support for MKL
 * NCCL, distribution are not supported on windows now
-
+* The installation package provided by default requires computer support for MKL, Intel chips all support MKL
 ## INSTALLATION
@@ -43,17 +38,7 @@ If you installed Python via Homebrew or the Python website, `pip` was installed
 * If your computer has NVIDIA® GPU, please make sure that the following conditions are met and install [the GPU Version of PaddlePaddle](#gpu)
- * **CUDA toolkit 11.2 with cuDNN v8.2.1(for PaddleTensorRT deployment, TensorRT8.2.4.2)**
-
- * **CUDA toolkit 11.6 with cuDNN v8.4.0(for PaddleTensorRT deployment, TensorRT8.4.0.6)**
-
- * **CUDA toolkit 11.7 with cuDNN v8.4.1(for PaddleTensorRT deployment, TensorRT8.4.2.4)**
-
- * **CUDA toolkit 11.8 with cuDNN v8.6.0(for PaddleTensorRT deployment, TensorRT8.5.1.7)**
-
- * **CUDA toolkit 12.0 with cuDNN v8.9.1(for multi card support, NCCL2.7 or higher;for PaddleTensorRT deployment, TensorRT8.6.1.6)**
-
- * **GPU CUDA capability over 3.5**
+ * **GPU CUDA capability over 6.0**
 You can refer to NVIDIA official documents for installation process and configuration method of CUDA, cuDNN and TensorRT.
Please refer to [CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/),[TensorRT](https://developer.nvidia.com/tensorrt)
@@ -68,43 +53,23 @@ You can choose the following version of PaddlePaddle to start installation:
  ```
- python -m pip install paddlepaddle==2.6.1 -i https://mirror.baidu.com/pypi/simple
+ python -m pip install paddlepaddle==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
  ```
-
 #### 2.2 GPU Version of PaddlePaddle
-
-2.2.1 If you are using CUDA 11.2
-
- ```
- python -m pip install paddlepaddle-gpu==2.6.1.post112 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html
- ```
-
-2.2.2 If you are using CUDA 11.6
-
- ```
- python -m pip install paddlepaddle-gpu==2.6.1.post116 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html
- ```
-
-2.2.3 If you are using CUDA 11.7
-
- ```
- python -m pip install paddlepaddle-gpu==2.6.1.post117 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html
- ```
-
-2.2.4 If you are using CUDA 11.8
+2.2.4 If you are using CUDA 11.8(If you need to use TensorRT, you can install TensorRT 8.5.1.7 yourself)
  ```
- python -m pip install paddlepaddle-gpu==2.6.1 -i https://mirror.baidu.com/pypi/simple
+ python -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
  ```
-2.2.5 If you are using CUDA 12.0
+2.2.5 If you are using CUDA 12.3(If you need to use TensorRT, you can install TensorRT 8.6.1.6 yourself)
  ```
- python -m pip install paddlepaddle-gpu==2.6.1.post120 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html
+ python -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
  ```
 Note:
@@ -117,7 +82,7 @@ Note:
 * If you want to install the Paddle package with `avx` and `openblas`, you can use the following command to download the wheel package to the local, and then use `python -m pip install [name].whl` 
to install locally ([name] is the name of the wheel package): ``` - python -m pip download paddlepaddle==2.6.1 -f https://www.paddlepaddle.org.cn/whl/windows/openblas/avx/stable.html --no-index --no-deps + python -m pip install https://paddle-wheel.bj.bcebos.com/3.0.0-beta0/windows/windows-cpu-avx-openblas-vs2017/paddlepaddle-3.0.0b2-cp38-cp38-win_amd64.whl ``` ## Verify installation diff --git a/docs/release_note_cn.md b/docs/release_note_cn.md index 6665039e57a..0aaa2b14b52 100644 --- a/docs/release_note_cn.md +++ b/docs/release_note_cn.md @@ -1,3486 +1,462 @@ -# 2.6.0 Release Note +# 3.0 Beta Release Note +本版本的核心特性主要包括动静统一自动并行技术和神经网络编译器自动优化等新技术,旨在应对当前深度学习领域的新挑战。飞桨框架 3.0 Beta 版本延续了 2.x 版本动静统一、训推一体的设计理念,其开发接口全面兼容 2.x 版本。这意味着,使用 2.x 版本开发的代码,在绝大多数情况下无需修改,即可直接在 3.x 版本上运行。几个重点特性具体展开说明如下: +- 动静统一自动并行:为了降低大模型的编程难度,飞桨还优化了动静统一的半自动并行编程范式,显著简化了编程的复杂度。开发者无需深入研究手动并行编程的复杂概念和 API,只需进行少量的张量切分标注,即可完成混合并行模型的构建。框架能够自动推导分布式切分状态并添加通信算子,同时还支持一键动转静分布式训练,从而大幅简化了混合并行训练代码的开发过程。动静统一方面,飞桨通过采用基于字节码的动静转换技术,全面升级了其动转静训练能力,支持自适应的图构建功能。在 700 多个飞桨产业级模型上进行了验证,实现了一键动转静训练 100%的成功率。 +- 神经网络编译器自动优化:飞桨神经网络编译器 CINN(Compiler Infrastructure for Neural Networks)采用与框架一体化的设计,能够支持生成式模型、科学计算模型等多种模型的高效训练与可变形状推理,为计算灵活性与高性能之间提供了一个良好的平衡点。通过算子的自动融合和代码生成技术,Llama2 和 Stable Diffusion 模型的性能提升了 30%。 +- 高阶自动微分:为了更好支持科学计算等场景,飞桨框架设计并实现了基于组合算子机制的高阶自动微分技术,结合神经网络编译器自动优化技术,我们测试了超过 40 多个科学计算场景的微分方程,其求解速度领先业界同类产品 70%。 +- 高扩展中间表示 :为了提升飞桨框架的可扩展性,我们研发了高扩展中间表示 PIR(Paddle Intermediate Representation)。这一表示系统性地抽象了底层核心概念,提供了灵活且高效的组件。PIR 作为基础设施,支撑着动转静、自动微分、自动并行、组合算子、图优化等多项技术,并广泛应用于分布式训练、模型压缩、推理部署等场景。通过 PIR 提供的 DRR(Declarative Rewrite Rule)机制,Pass 的开发成本可以降低 60%。我们对超过 900 个模型配置进行了测试,结果显示,在使用 PIR 后,推理的整体性能提升了超过 10%。 +- 多硬件适配:飞桨为大模型硬件适配提供了功能完善且低成本的方案。新硬件仅需适配 30 余个接口,即可支持大模型的训练、压缩与推理。同时,飞桨提供了基于编译器的硬件接入方式,硬件厂商只需以插件的形式实现编译器的代码生成后端,便能实现与飞桨框架的高效适配。飞桨硬件接入本次新增了对 4 款硬件昆仑 XPU、昇腾 NPU、海光 DCU 和寒武纪 MLU 的日常发版支持。 + +此版本包含了对框架 2.x 版本部分已有功能的持续改进,同时本版本的新特性在使用体验、性能、二次开发便利度以及硬件适配能力等方面带来了显著提升。除了上述核心特性外,此版本在用户体验层面持续丰富并增强了满足更多场景的 API 
功能,针对大模型场景优化完善了分布式并行策略优化和推理功能增强,在编译安装方面做了比较彻底的易用性改进,对依赖包的安装方式和版本进行了全新同步升级,对系统安全进行了全面加固,对产品文档也进行了全面的纠错检查,同时也对一些废弃代码做了大量的清理以保证架构的简洁性。飞桨 3.0 Beta 版本在不使用新特性的情况下,表现仍然是成熟稳定的,每个新特性都提供了可灵活进行控制的开关,方便用户快速了解相关产品功能和体验对比。 + +## 1.用户体验升级 + +### 不兼容升级 +- 飞桨 API 支持隐式类型提升。在加减乘除等最常用的计算中,如果两个输入的数据类型不一样,就需要确定输出的数据类型问题。飞桨历史上的现状是部分支持且实际规则并不清楚,客观上表现为动静不一致、API 和运算符重载不一致 及 不符合交换率,特别是在大模型广泛使用 bf16/fp16 与 fp32 进行混合计算时容易出现非预期问题且难以定位。飞桨从 3.0 beta 版本开始,明确了[隐式数据类型提升规则](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/advanced/auto_type_promotion_cn.html),其中详细定义了 Tensor 与 Tensor 和 Tensor 与 1 个数(Scalar)计算结果的类型,保证了计算符合交换律,运算符重载与二元 API 结果一致,动态图与静态图结果一致。更符合用户理解和业界习惯。[#60638](https://github.com/PaddlePaddle/Paddle/pull/60638), [#63842](https://github.com/PaddlePaddle/Paddle/pull/63842), [#60011](https://github.com/PaddlePaddle/Paddle/pull/60011) + +### 废弃功能 +- 支持 0 维 Tensor 已经稳定了 2 个版本,本版本取消了在一些情况下将 0 维 Tensor 转成只含 1 个元素的 1 维 Tensor 的开关`FLAGS_set_to_1d`,这个开关是为了兼容一些套件中用 1 个元素的 1 维 Tensor 表示 0 维 Tensor 的不正确写法。即当前飞桨完全区分 0 维 Tensor 和只含 1 个元素的 1 维 Tensor 的语义,两者不等价。[#61227](https://github.com/PaddlePaddle/Paddle/pull/61227) + +### 新增 API 功能 +此版本相比上一个版本新增 126 个 API,API 功能更加丰富,以更好支持大模型、科学计算等需求,包括: +- 新增 Tensor 计算类 API。`paddle.gammaln`, `paddle.gammainc`, `paddle.gammaincc`, `paddle.sinc`, `paddle.pdist`, `paddle.histogramdd`,`paddle.signbit`, `paddle.copysign`, `paddle.bitwise_right_shift/bitwise_left_shift`, `paddle.isposinf/isneginf/isreal`, `paddle.isin`, `paddle.hsplit/dsplit`, `paddle.column_stack/row_stack/dstack/hstack/vstack`, `paddle.slice_scatter`, `paddle.masked_scatter` [#60553](https://github.com/PaddlePaddle/Paddle/pull/60553), [#59311](https://github.com/PaddlePaddle/Paddle/pull/59311), [#59357](https://github.com/PaddlePaddle/Paddle/pull/59357), [#63521](https://github.com/PaddlePaddle/Paddle/pull/63521), [#57869](https://github.com/PaddlePaddle/Paddle/pull/57869), [#57880](https://github.com/PaddlePaddle/Paddle/pull/57880), 
[#57882](https://github.com/PaddlePaddle/Paddle/pull/57882), [#60150](https://github.com/PaddlePaddle/Paddle/pull/60150), [#57785](https://github.com/PaddlePaddle/Paddle/pull/57785), [#58092](https://github.com/PaddlePaddle/Paddle/pull/58092), [#63523](https://github.com/PaddlePaddle/Paddle/pull/63523), [#64001](https://github.com/PaddlePaddle/Paddle/pull/64001), [#58917](https://github.com/PaddlePaddle/Paddle/pull/58917), [#59127](https://github.com/PaddlePaddle/Paddle/pull/59127), [#59973](https://github.com/PaddlePaddle/Paddle/pull/59973), [#59383](https://github.com/PaddlePaddle/Paddle/pull/59383) +- 新增概率分布类 API。`paddle.distribution.ContinuousBernoulli`, `paddle.distribution.MultivariateNormal`, `paddle.distribution.Exponential`, `paddle.distribution.Gamma`, `paddle.distribution.Binomial`, `paddle.distribution.Poisson` [#58004](https://github.com/PaddlePaddle/Paddle/pull/58004), [#57899](https://github.com/PaddlePaddle/Paddle/pull/57899), [#57856](https://github.com/PaddlePaddle/Paddle/pull/57856) +- 新增优化器类 API。`paddle.optimizer.ASGD`, `paddle.optimizer.NAdam`, `paddle.optimizer.RAdam`, `paddle.optimizer.Rprop` [#58834](https://github.com/PaddlePaddle/Paddle/pull/58834), [#63671](https://github.com/PaddlePaddle/Paddle/pull/63671), [#58851](https://github.com/PaddlePaddle/Paddle/pull/58851) +- 新增线性代数类 API。`paddle.linalg.matrix_exp` [#59715](https://github.com/PaddlePaddle/Paddle/pull/59715) +- 新增其他 API。`paddle.bernoulli_`, `paddle.nn.ZeroPad1D/ZeroPad3D`, `paddle.nn.AdaptiveLogSoftmaxWithLoss`, `paddle.Tensor.apply` [#64252](https://github.com/PaddlePaddle/Paddle/pull/64252), [#59690](https://github.com/PaddlePaddle/Paddle/pull/59690), [#63728](https://github.com/PaddlePaddle/Paddle/pull/63728), [#63302](https://github.com/PaddlePaddle/Paddle/pull/63302), [#59374](https://github.com/PaddlePaddle/Paddle/pull/59374),[#63227](https://github.com/PaddlePaddle/Paddle/pull/63227) + +### 部分 API 功能增强 +- 增强了约 30 个 API 以支持复数计算,如`paddle.log`, `paddle.log1p`, 
`paddle.square`, `paddle.reciprocal`等,进而扩展对更多科学计算场景的支持。[#62448](https://github.com/PaddlePaddle/Paddle/pull/62448), [#60821](https://github.com/PaddlePaddle/Paddle/pull/60821), [#60897](https://github.com/PaddlePaddle/Paddle/pull/60897), [#62764](https://github.com/PaddlePaddle/Paddle/pull/62764), [#59536](https://github.com/PaddlePaddle/Paddle/pull/59536), [#59529](https://github.com/PaddlePaddle/Paddle/pull/59529), [#63207](https://github.com/PaddlePaddle/Paddle/pull/63207), [#62237](https://github.com/PaddlePaddle/Paddle/pull/62237), [#64684](https://github.com/PaddlePaddle/Paddle/pull/64684) +- 增强了 46 个 API 的功能,使得已有 API 更易用,也更容易进行代码转换。包括但不限于增加 API 参数,扩展 API 支持的数据类型,以及修正原有不合理设计等。[#59890](https://github.com/PaddlePaddle/Paddle/pull/59890), [#63513](https://github.com/PaddlePaddle/Paddle/pull/63513), [#59674](https://github.com/PaddlePaddle/Paddle/pull/59674), [#62778](https://github.com/PaddlePaddle/Paddle/pull/62778), [#64110](https://github.com/PaddlePaddle/Paddle/pull/64110), [#63222](https://github.com/PaddlePaddle/Paddle/pull/63222), [#64331](https://github.com/PaddlePaddle/Paddle/pull/64331), [#64715](https://github.com/PaddlePaddle/Paddle/pull/64715), [#61155](https://github.com/PaddlePaddle/Paddle/pull/61155), [#60070](https://github.com/PaddlePaddle/Paddle/pull/60070), [#61974](https://github.com/PaddlePaddle/Paddle/pull/61974), [#62407](https://github.com/PaddlePaddle/Paddle/pull/62407), [#62672](https://github.com/PaddlePaddle/Paddle/pull/62672),[#62722](https://github.com/PaddlePaddle/Paddle/pull/62722), [#62876](https://github.com/PaddlePaddle/Paddle/pull/62876), [#63284](https://github.com/PaddlePaddle/Paddle/pull/63284), [#63860](https://github.com/PaddlePaddle/Paddle/pull/63860), [#60466](https://github.com/PaddlePaddle/Paddle/pull/60466), [#63690](https://github.com/PaddlePaddle/Paddle/pull/63690), [#63953](https://github.com/PaddlePaddle/Paddle/pull/63953), [#63901](https://github.com/PaddlePaddle/Paddle/pull/63901), 
[#62624](https://github.com/PaddlePaddle/Paddle/pull/62624), [#59857](https://github.com/PaddlePaddle/Paddle/pull/59857), [#60084](https://github.com/PaddlePaddle/Paddle/pull/60084), [#60766](https://github.com/PaddlePaddle/Paddle/pull/60766), [#62788](https://github.com/PaddlePaddle/Paddle/pull/62788), [#62937](https://github.com/PaddlePaddle/Paddle/pull/62937), [#63134](https://github.com/PaddlePaddle/Paddle/pull/63134), [#62966](https://github.com/PaddlePaddle/Paddle/pull/62966), [#63648](https://github.com/PaddlePaddle/Paddle/pull/63648), [#63881](https://github.com/PaddlePaddle/Paddle/pull/63881), [#64358](https://github.com/PaddlePaddle/Paddle/pull/64358), [#60503](https://github.com/PaddlePaddle/Paddle/pull/60503), [#63604](https://github.com/PaddlePaddle/Paddle/pull/63604), [#62338](https://github.com/PaddlePaddle/Paddle/pull/62338) +- 增强了高阶微分的单测基础设施,能够更容易地添加高阶微分的单测用例。[#62074](https://github.com/PaddlePaddle/Paddle/pull/62074) + +### API 性能提升 +- 对 Tensor 基础索引、高级索引和联合索引的性能进行了集中优化,在 GPU 上的计算性能较此前提升 2 到 31 倍,CPU 上提升 1.8 到 1004 倍。[#60254](https://github.com/PaddlePaddle/Paddle/pull/60254), [#60276](https://github.com/PaddlePaddle/Paddle/pull/60276), [#60452](https://github.com/PaddlePaddle/Paddle/pull/60452), [#60771](https://github.com/PaddlePaddle/Paddle/pull/60771), [#61021](https://github.com/PaddlePaddle/Paddle/pull/61021), [#60983](https://github.com/PaddlePaddle/Paddle/pull/60983), [#61060](https://github.com/PaddlePaddle/Paddle/pull/61060), [#60618](https://github.com/PaddlePaddle/Paddle/pull/60618) + +### Bug 修复 +- 修复 `paddle.optimizer.LBFGS` 中使用非 Tensor 进行计算导致的报错。 [#60219](https://github.com/PaddlePaddle/Paddle/pull/60219) +- 修复 `paddle.optimizer.LBFGS` 中随机数不能固定的问题。 [#60591](https://github.com/PaddlePaddle/Paddle/pull/60591) +- 修复 `set_value` 算子梯度计算不正确的问题。 [#59034](https://github.com/PaddlePaddle/Paddle/pull/59034) +- 修复 Tensor 基础索引适配 PIR 的问题。 [#60259](https://github.com/PaddlePaddle/Paddle/pull/60259), 
[#61103](https://github.com/PaddlePaddle/Paddle/pull/61103) +- 修复 Tensor 联合索引赋值时的问题。[#60376](https://github.com/PaddlePaddle/Paddle/issues/60376), [#60447](https://github.com/PaddlePaddle/Paddle/pull/60447) +- 修复 Tensor 联合索引取值时的问题。 [#61922](https://github.com/PaddlePaddle/Paddle/pull/61922) +- 修复 `paddle.flatten` stride 计算错误问题,并能够新增`paddle.flatten_` 。[#63084](https://github.com/PaddlePaddle/Paddle/pull/63084) +- 修复 `paddle.index_fill` 和 `paddle.index_fill_` 结果不一致问题。 [#59863](https://github.com/PaddlePaddle/Paddle/pull/59863) +- 修复 `paddle.masked_scatter`报错问题。 [#60835](https://github.com/PaddlePaddle/Paddle/pull/60835) +- 修复 `paddle.histogramdd` cpu 报错问题。 [#61891](https://github.com/PaddlePaddle/Paddle/pull/61891) +- 修复 `paddle.cast_` 在 cpu 上连续使用导致结果错误的 bug。 [#60054](https://github.com/PaddlePaddle/Paddle/pull/60054) +- 修复 `paddle.put_along_axis` 在输入 size 很大的时候存在 bug 的问题。 [#60551](https://github.com/PaddlePaddle/Paddle/pull/60551) +- 修复 `paddle.nanmedian` cpu 报错问题。 [#63221](https://github.com/PaddlePaddle/Paddle/pull/63221) +- 修复 `paddle.median` 在 min 分支下不支持输入为除浮点类型以外的类型。 [#64444](https://github.com/PaddlePaddle/Paddle/pull/64444) +- 修复 分布式场景中的 dataloader 问题。 [#62696](https://github.com/PaddlePaddle/Paddle/pull/62696), [#63378](https://github.com/PaddlePaddle/Paddle/pull/63378) +- 修复 error 提示的格式问题。 [#63106](https://github.com/PaddlePaddle/Paddle/pull/63106), [#63144](https://github.com/PaddlePaddle/Paddle/pull/63144) +- 修复 GLOG_v>=6 下格式问题。 [#63345](https://github.com/PaddlePaddle/Paddle/pull/63345) + +### 安全改善 +- 增强对 parent_ids 的检查。 [#62826](https://github.com/PaddlePaddle/Paddle/pull/62826) + +## 2.基础执行架构 + +PIR 基础功能全面升级完善,成熟度大幅提升,基于 PIR 使飞桨基础架构设计更合理、保证了框架卓越的性能表现和良好的拓展性。在此版本中,完成了 PIR 多场景的推全验证:单机场景完成动转静场景 PIR 后端切换;推理场景完成全部存量模型验证,并在 84.2%模型有 10%+收益;完成分布式场景基于 PIR 的验证。同时基于 PIR 完成控制流、backward 逻辑、save/load、OneDNN 适配等核心模块的开发验证,为飞桨 PIR 切换为默认模式,奠定了坚实的基础。对飞桨框架算子体系的功能完备性、执行效率和稳定性进一步提升,给开发者带来更好的使用和开发体验。 -## 1. 
重要更新 - -- **新一代中间表示 PIR**:为了进一步提升飞桨框架的可扩展性,研制了新一代中间表示 PIR(Paddle Intermediate Representation)。实现系统性的抽象飞桨框架底层核心概念,如:Operation、Attribute 和 Type 等,为开发者提供了灵活、高效的基础组件。通过引入 Dialect 机制,可以全面、分层次地满足各模块对中间表示的需求,从而极大地提升了框架的扩展性。PIR 严格遵循 SSA(Static Single Assignment)原则,在实现了顶层结构的统一的同时,还确保了“算子顺序性”与“计算图语义”的和谐共存。此外,PIR 提供了更为简洁、低成本的 Pass 开发流程,内置了一系列丰富且功能完备的 Pass 优化策略,为大型模型的极致性能优化提供了技术支撑。 -- **动转静编译优化架构**:为了进一步提升框架的模型开发性能,飞桨动转静训练能力全面升级,支持自适应的图构建能力,在 700 多个飞桨产业级模型上验证,一键动转静训练成功率达到 100%。同时,飞桨框架的神经网络编译器 CINN 整合入飞桨主 Repo,使得编译器与飞桨更加融为一体。CINN 完成了架构的梳理和扩展能力的完善,提升系统稳定性。基于 PIR 完成动转静、组合算子、执行器和编译器的紧密联动,为飞桨框架整体性能的提升提供了更大的空间。 -- **增强动态图分布式能力**:大模型对框架的分布式训练性能提出了更高的需求。飞桨在通信库、图分析、分布式策略和任务启停等维度进行了全面优化,增强了飞桨动态图的分布式计算能力,为大型模型高效训练提供了支持。在性能方面,通过减少流水线 GPU 显存占用、采用 TensorFusion 技术、实现通信计算 overlap 以及减少非必要的数据同步拷贝等方式,进一步提升了训练性能。同时,通过环境变量控制 Optimizer 等方式提高了混合并行调试的灵活性。此外,通过相关 Bug 的修复,显著提升了系统的稳定性。 -- **动静统一自动并行架构**:为了进一步降低大模型编程和优化难度,飞桨对动静统一的半自动并行(Auto Parallel)编程范式进行了全面的优化,简化了开发者的编程复杂度。开发者无需深入了解手动并行编程范式下的复杂概念和 API 接口,如行切分、列切分等,仅需通过少量的张量切分标注即可完成混合并行模型的构建,框架便能够自动推导出所有张量和算子的分布式切分状态,并添加合适的通信算子。同时支持一键动转静进行分布式训练,使开发者能够高效地实现任意混合并行策略,大幅简化了混合并行训练代码的开发过程。 -- **硬件适配方案(CustomDevice)**:大模型场景下新硬件并行训练需求增加,飞桨新增了对分布式高级策略、自定义算子和自定义融合策略的支持。升级了分布式通信库,新增了对 MP、GroupShared、PP、SP 和 MOE 等多项高级分布式策略的支持。同时,支持厂商灵活接入不同颗粒度的 Transformer 算子库并通过融合 Pass 修改计算图进行性能加速。 -- **安装和开发体验**:采用模块化编译的方式优化了 CMake 代码的逻辑,提升了飞桨全量编译和增量编译的效率,提升了 RD 开发效率,同时支持了 Python3.12,CUDA12,Hopper 架构编译,并引入 Clang 等工具全面优化了代码格式。此外,将 C++单测从链接静态库的方式转变为链接动态库,减小编译体积。这些改进措施为用户提供更加流畅、高效的安装和开发体验。 - -## 2. 
不兼容升级 - -- 为了避免误用,去除了 0 维 Tensor 兼容态开关,实现 API 行为和业界主流习惯一致。在上一个版本中,我们已经支持 0 维 Tensor,但是考虑到尽量避免部分模型的报错,添加了兼容态开关。即在一些模型套件使用较多且没有修改完成的场景中还是默认使用只有 1 个元素的 1 维 Tensor 来替代 0 维 Tensor。这个版本去除了兼容态开关,在任何场景中都不会再使用只有 1 个元素的 1 维 Tensor 来替代 0 维 Tensor,应该支持 0 维 Tensor 的 376 个 API 的行为都完成了修正和统一,彻底完成对 0 维 Tensor 的支持。[#57036](https://github.com/PaddlePaddle/Paddle/pull/57036), [#54581](https://github.com/PaddlePaddle/Paddle/pull/54581), [#54500](https://github.com/PaddlePaddle/Paddle/pull/54500) -- 为了提升 API 易用性,将 paddle.nn.functional.diag_embed 精简为 paddle.diag_embed,并支持 Tensor.diag_embed 方式使用。 [#58223](https://github.com/PaddlePaddle/Paddle/pull/58223) -- 为了解决在静态图下 Tensor 索引写(如 tensor[0] = 10)导致的微分计算错误问题,并符合静态图的规范,本版本引入了 paddle.static.setitem API。在静态图环境中,更推荐使用此 API 来支持 tensor 的索引写操作,而非下标运算符。这一变化并不影响动态图环境,其中仍允许使用下标运算符进行索引写操作。[#53682](https://github.com/PaddlePaddle/Paddle/pull/53682) -- 本版本中 paddle.fluid API 全面退出历史舞台。在本次更新中,我们彻底移除了所有 paddle.fluid API,并删除了 fluid 目录。同时,飞桨底层的少量公共组件已被整合至 paddle.base 目录中。使得飞桨用户无需再关注 fluid 相关概念和接口,进一步简化了飞桨 API 体系,提升可读性。[#56576](https://github.com/PaddlePaddle/Paddle/pull/56576), [#54424](https://github.com/PaddlePaddle/Paddle/pull/54424), [#54829](https://github.com/PaddlePaddle/Paddle/pull/54829), [#53992](https://github.com/PaddlePaddle/Paddle/pull/53992), [#54806](https://github.com/PaddlePaddle/Paddle/pull/54806), [#55754](https://github.com/PaddlePaddle/Paddle/pull/55754), [#55986](https://github.com/PaddlePaddle/Paddle/pull/55986), [#55345](https://github.com/PaddlePaddle/Paddle/pull/55345), [#56099](https://github.com/PaddlePaddle/Paddle/pull/56099), [#51717](https://github.com/PaddlePaddle/Paddle/pull/51717), [#54152](https://github.com/PaddlePaddle/Paddle/pull/54152), [#55522](https://github.com/PaddlePaddle/Paddle/pull/55522), [#55757](https://github.com/PaddlePaddle/Paddle/pull/55757), [#58521](https://github.com/PaddlePaddle/Paddle/pull/58521), [#54936](https://github.com/PaddlePaddle/Paddle/pull/54936), 
[#55007](https://github.com/PaddlePaddle/Paddle/pull/55007), [#55661](https://github.com/PaddlePaddle/Paddle/pull/55661), [#55970](https://github.com/PaddlePaddle/Paddle/pull/55970) - -## 3. 训练框架(含分布式) - -### Python API - -#### 升级 Tensor 索引机制 - -本版本全面优化了 Tensor 的基础索引、高级索引以及联合索引功能,以更好地符合业界标准与用户习惯。具体包括:在基础索引中增加了对 view 的支持,修正了高级索引中的一些错误行为,并实现了联合索引的读取功能。此外,我们还将索引解析下沉到 C++层面,改进了高级索引算子的性能,并移除了 bool 索引中的冗余计算。通过这些优化措施,Tensor 的基础索引、高级索引和联合索引性能得到了全面提升。[#56893](https://github.com/PaddlePaddle/Paddle/pull/56893), [#58643](https://github.com/PaddlePaddle/Paddle/pull/58643), [#57986](https://github.com/PaddlePaddle/Paddle/pull/57986), [#56272](https://github.com/PaddlePaddle/Paddle/pull/56272), [#58856](https://github.com/PaddlePaddle/Paddle/pull/58856), [#55211](https://github.com/PaddlePaddle/Paddle/pull/55211), [#57023](https://github.com/PaddlePaddle/Paddle/pull/57023), [#56613](https://github.com/PaddlePaddle/Paddle/pull/56613), [#55602](https://github.com/PaddlePaddle/Paddle/pull/55602), [#59281](https://github.com/PaddlePaddle/Paddle/pull/59281), [#57737](https://github.com/PaddlePaddle/Paddle/pull/57737) - -#### 升级 Inplace 机制 - -在之前的版本中,为了确保反向微分计算的正确性,当某个 API 的反向计算依赖于其前向输入数据时,飞桨会避免使用 Inplace 操作方式,因为这种方法可能会覆盖原始输入数据。虽然这种机制简化了实现过程,但也限制了许多 API 实现 Inplace 功能,从而影响了用户体验。 -在本版本中,飞桨对 Inplace 机制进行了全面升级。实现自动检测反向计算对前向输入的依赖关系,并在需要时保存这些输入数据,从而支持更多的 Inplace 操作。这一改进不仅提升了内存使用效率,还增强了 API 的功能性。 -此外,我们新增了 109 个支持 Inplace 操作的 API,包括 paddle.abs_、paddle.sin_/cos_/tan_、比较操作如 paddle.greater_than_/less_than_/equal_、逻辑操作如 paddle.logical_and_/logical_or_/logical_not_,以及 paddle.neg_和 paddle.log_等。在丰富飞桨的功能集同时,提升了用户在数值计算和深度学习任务中的效率与便捷性。[#54683](https://github.com/PaddlePaddle/Paddle/pull/54683), [#55078](https://github.com/PaddlePaddle/Paddle/pull/55078), [#55576](https://github.com/PaddlePaddle/Paddle/pull/55576), [#56888](https://github.com/PaddlePaddle/Paddle/pull/56888), [#55509](https://github.com/PaddlePaddle/Paddle/pull/55509), [#57093](https://github.com/PaddlePaddle/Paddle/pull/57093) - -#### 
其他新增 API - -- 新增 paddle.nn.functional.scaled_dot_product_attention,显著提升大模型中注意力(attention)机制的计算效率,更好地满足大规模深度学习模型对高性能计算的需求。。[#55242](https://github.com/PaddlePaddle/Paddle/pull/55242) -- 新增了一系列科学计算相关 API,包括 paddle.cummax 和 paddle.cummin 用于累积最大值和最小值的计算,paddle.index_fill 和 paddle.masked_fill 用于按索引或掩码填充张量,paddle.linalg.pca_lowrank 用于低秩主成分分析,paddle.hypot 用于计算直角三角形的斜边长,以及 paddle.atleast_1d、paddle.atleast_2d 和 paddle.atleast_3d 用于确保张量至少有一维、二维或三维。同时,我们还提供了 paddle.select_scatter 和 paddle.diagonal_scatter 用于更灵活地选择和散列张量数据,以及 paddle.multigammaln 用于计算多伽马函数的自然对数。此外,本版本新增优化器相关 API,包括:paddle.optimizer.lr.LinearLR 和 paddle.optimizer.lr.CosineAnnealingWarmRestarts 学习率调度策略;引入了 paddle.io.SubsetRandomSampler 以支持从数据子集中进行随机采样。这些新增 API 将进一步提升飞桨在各类应用场景中的灵活性和高效性。。 [#57416](https://github.com/PaddlePaddle/Paddle/pull/57416), [#53546](https://github.com/PaddlePaddle/Paddle/pull/53546), [#53743](https://github.com/PaddlePaddle/Paddle/pull/53743), [#57295](https://github.com/PaddlePaddle/Paddle/pull/57295), [#57726](https://github.com/PaddlePaddle/Paddle/pull/57726), [#58764](https://github.com/PaddlePaddle/Paddle/pull/58764), [#58323](https://github.com/PaddlePaddle/Paddle/pull/58323), [#57720](https://github.com/PaddlePaddle/Paddle/pull/57720), [#58209](https://github.com/PaddlePaddle/Paddle/pull/58209), [#58214](https://github.com/PaddlePaddle/Paddle/pull/58214), [#57792](https://github.com/PaddlePaddle/Paddle/pull/57792), [#51395](https://github.com/PaddlePaddle/Paddle/pull/51395), [#57724](https://github.com/PaddlePaddle/Paddle/pull/57724), [#57355](https://github.com/PaddlePaddle/Paddle/pull/57355), [#57744](https://github.com/PaddlePaddle/Paddle/pull/57744), [#58244](https://github.com/PaddlePaddle/Paddle/pull/58244), [#57599](https://github.com/PaddlePaddle/Paddle/pull/57599), [#59343](https://github.com/PaddlePaddle/Paddle/pull/59343), [#57879](https://github.com/PaddlePaddle/Paddle/pull/57879) - -### 新一代中间表示(PIR) - -PIR(Paddle Intermediate Representation)对底层的核心概念如 Operation、Attribute 和 
Type 等进行了系统性的抽象,为开发者构建了一套灵活且强大的基础组件。此外,通过引入 Dialect 这一概念,飞桨框架能够全面且分层次地管理各模块对中间表示(IR)的需求,并支持开发者根据特定需求定制化扩展 Dialect,从而显著提升了框架的扩展性和适应性。在设计上,PIR 严格遵循 SSA(Static Single Assignment)原则,统一了顶层结构,实现了“算子顺序性”与“计算图语义”的兼容表示,为复杂的计算流程提供了清晰且一致的视图。为了进一步优化大模型的性能,PIR 还提供了一套更加简洁、低成本的 Pass 开发流程,包括 DRR(Declarative Rewrite Rule)和模式重写器(Pattern Rewriter)。同时,内置了一系列丰富且功能完备的 Pass 优化策略,这些策略能够针对大模型的特点进行深度优化,从而为大模型的极致性能提供了强有力支撑。通过这些创新设计和优化手段,PIR 为飞桨框架的高效运行和持续扩展奠定了坚实基础。 - -#### 新功能 - -- 系统抽象了 IR 底层的核心概念,为开发者提供了灵活的基础组件,如 Operation、Attribute、Value、Type、Trait、Interface 等。[#56354](https://github.com/PaddlePaddle/Paddle/pull/56354),[#57106](https://github.com/PaddlePaddle/Paddle/pull/57106),[#57349](https://github.com/PaddlePaddle/Paddle/pull/57349),[#54844](https://github.com/PaddlePaddle/Paddle/pull/54844),[#54984](https://github.com/PaddlePaddle/Paddle/pull/54984),[#54565](https://github.com/PaddlePaddle/Paddle/pull/54565),[#54562](https://github.com/PaddlePaddle/Paddle/pull/54562),[#57249](https://github.com/PaddlePaddle/Paddle/pull/57249),[#57550](https://github.com/PaddlePaddle/Paddle/pull/57550),[#59278](https://github.com/PaddlePaddle/Paddle/pull/59278),[#54875](https://github.com/PaddlePaddle/Paddle/pull/54875),[#55041](https://github.com/PaddlePaddle/Paddle/pull/55041),[#54987](https://github.com/PaddlePaddle/Paddle/pull/54987),[#55903](https://github.com/PaddlePaddle/Paddle/pull/55903),[#57582](https://github.com/PaddlePaddle/Paddle/pull/57582),[#57580](https://github.com/PaddlePaddle/Paddle/pull/57580),[#58052](https://github.com/PaddlePaddle/Paddle/pull/58052),[#55322](https://github.com/PaddlePaddle/Paddle/pull/55322),[#57418](https://github.com/PaddlePaddle/Paddle/pull/57418),[#57635](https://github.com/PaddlePaddle/Paddle/pull/57635),[#55328](https://github.com/PaddlePaddle/Paddle/pull/55328),[#57463](https://github.com/PaddlePaddle/Paddle/pull/57463),[#59791](https://github.com/PaddlePaddle/Paddle/pull/59791),[#59821](https://github.com/PaddlePaddle/Paddle/pull/59821),[#59115](https://g
ithub.com/PaddlePaddle/Paddle/pull/59115),[#57461](https://github.com/PaddlePaddle/Paddle/pull/57461),[#59392](https://github.com/PaddlePaddle/Paddle/pull/59392),[#57373](https://github.com/PaddlePaddle/Paddle/pull/57373),[#59118](https://github.com/PaddlePaddle/Paddle/pull/59118) -- 新增引入 Dialect 机制,支持全面、分层次管理框架各个模块对中间表示的需求,且内置了 Builtin Dialect,支持开发者根据需求自定义化扩展 Dialect。 [#56325](https://github.com/PaddlePaddle/Paddle/pull/56325),[#57539](https://github.com/PaddlePaddle/Paddle/pull/57539),[#54682](https://github.com/PaddlePaddle/Paddle/pull/54682),[#55381](https://github.com/PaddlePaddle/Paddle/pull/55381),[#56156](https://github.com/PaddlePaddle/Paddle/pull/56156),[#56431](https://github.com/PaddlePaddle/Paddle/pull/56431),[#56615](https://github.com/PaddlePaddle/Paddle/pull/56615),[#57103](https://github.com/PaddlePaddle/Paddle/pull/57103),[#57209](https://github.com/PaddlePaddle/Paddle/pull/57209) -- 规范化了飞桨静态图算子体系,新增 OperatorDialect、KernelDialect,以 Dialect 形式分层管理编译期和执行期的算子表示概念差异性,架构更加清晰。[#56284](https://github.com/PaddlePaddle/Paddle/pull/56284),[#54469](https://github.com/PaddlePaddle/Paddle/pull/54469),[#58660](https://github.com/PaddlePaddle/Paddle/pull/58660),[#58975](https://github.com/PaddlePaddle/Paddle/pull/58975),[#56680](https://github.com/PaddlePaddle/Paddle/pull/56680),[#54790](https://github.com/PaddlePaddle/Paddle/pull/54790),[#54826](https://github.com/PaddlePaddle/Paddle/pull/54826),[#54840](https://github.com/PaddlePaddle/Paddle/pull/54840),[#55699](https://github.com/PaddlePaddle/Paddle/pull/55699),[#55648](https://github.com/PaddlePaddle/Paddle/pull/55648),[#55880](https://github.com/PaddlePaddle/Paddle/pull/55880),[#56101](https://github.com/PaddlePaddle/Paddle/pull/56101),[#56754](https://github.com/PaddlePaddle/Paddle/pull/56754),[#54944](https://github.com/PaddlePaddle/Paddle/pull/54944),[#56836](https://github.com/PaddlePaddle/Paddle/pull/56836),[#57185](https://github.com/PaddlePaddle/Paddle/pull/57185),[#58757](https://github.com/PaddlePad
dle/Paddle/pull/58757),[#56243](https://github.com/PaddlePaddle/Paddle/pull/56243),[#56436](https://github.com/PaddlePaddle/Paddle/pull/56436),[#57741](https://github.com/PaddlePaddle/Paddle/pull/57741),[#59124](https://github.com/PaddlePaddle/Paddle/pull/59124),[#57054](https://github.com/PaddlePaddle/Paddle/pull/57054),[#56984](https://github.com/PaddlePaddle/Paddle/pull/56984),[#57403](https://github.com/PaddlePaddle/Paddle/pull/57403),[#57904](https://github.com/PaddlePaddle/Paddle/pull/57904),[#58031](https://github.com/PaddlePaddle/Paddle/pull/58031),[#56924](https://github.com/PaddlePaddle/Paddle/pull/56924),[#59270](https://github.com/PaddlePaddle/Paddle/pull/59270),[#55343](https://github.com/PaddlePaddle/Paddle/pull/55343),[#56557](https://github.com/PaddlePaddle/Paddle/pull/56557),[#55693](https://github.com/PaddlePaddle/Paddle/pull/55693),[#54428](https://github.com/PaddlePaddle/Paddle/pull/54428) -- 新增 ShapeDialect,内置了丰富的 Shape 操作算子,用于面向 AI 编译器的动态 Shape 约束关系和表达式的构建。[#56727](https://github.com/PaddlePaddle/Paddle/pull/56727),[#59254](https://github.com/PaddlePaddle/Paddle/pull/59254),[#58368](https://github.com/PaddlePaddle/Paddle/pull/58368),[#57069](https://github.com/PaddlePaddle/Paddle/pull/57069),[#57337](https://github.com/PaddlePaddle/Paddle/pull/57337),[#56351](https://github.com/PaddlePaddle/Paddle/pull/56351),[#57029](https://github.com/PaddlePaddle/Paddle/pull/57029),[#58036](https://github.com/PaddlePaddle/Paddle/pull/58036),[#59032](https://github.com/PaddlePaddle/Paddle/pull/59032),[#57961](https://github.com/PaddlePaddle/Paddle/pull/57961),[#56427](https://github.com/PaddlePaddle/Paddle/pull/56427),[#57459](https://github.com/PaddlePaddle/Paddle/pull/57459) -- 统一了框架 Program 顶层结构,支持兼容表示“算子顺序性”和“计算图语义”,解耦对 ir::Graph 的依赖,且严格遵循 SSA (即 Static Single 
Assignment)原则。[#59369](https://github.com/PaddlePaddle/Paddle/pull/59369),[#54563](https://github.com/PaddlePaddle/Paddle/pull/54563),[#57051](https://github.com/PaddlePaddle/Paddle/pull/57051),[#57306](https://github.com/PaddlePaddle/Paddle/pull/57306),[#57857](https://github.com/PaddlePaddle/Paddle/pull/57857) -- 新增了 IrPrinter 和 IrPaser 组件,支持 PIR Program 的序列化和反序列化功能,提供了友好的 PIR 开发调试体验。[#55695](https://github.com/PaddlePaddle/Paddle/pull/55695),[#59449](https://github.com/PaddlePaddle/Paddle/pull/59449),[#54369](https://github.com/PaddlePaddle/Paddle/pull/54369),[#54499](https://github.com/PaddlePaddle/Paddle/pull/54499),[#55518](https://github.com/PaddlePaddle/Paddle/pull/55518),[#55784](https://github.com/PaddlePaddle/Paddle/pull/55784),[#57180](https://github.com/PaddlePaddle/Paddle/pull/57180),[#57471](https://github.com/PaddlePaddle/Paddle/pull/57471),[#54859](https://github.com/PaddlePaddle/Paddle/pull/54859),[#54968](https://github.com/PaddlePaddle/Paddle/pull/54968),[#55209](https://github.com/PaddlePaddle/Paddle/pull/55209),[#57314](https://github.com/PaddlePaddle/Paddle/pull/57314),[#57969](https://github.com/PaddlePaddle/Paddle/pull/57969) -- 基于 DRR(即 Declarative Rewrite Rule) 和 Pattern Rewriter 构建了全新、简洁、低成本的 Pass 开发体系,并内置了一系列丰富且功能完备的 Pass 优化策略,加速训练和推理执行过程。[#54385](https://github.com/PaddlePaddle/Paddle/pull/54385),[#54738](https://github.com/PaddlePaddle/Paddle/pull/54738),[#55859](https://github.com/PaddlePaddle/Paddle/pull/55859),[#56638](https://github.com/PaddlePaddle/Paddle/pull/56638),[#57090](https://github.com/PaddlePaddle/Paddle/pull/57090),[#58673](https://github.com/PaddlePaddle/Paddle/pull/58673),[#59415](https://github.com/PaddlePaddle/Paddle/pull/59415),[#56729](https://github.com/PaddlePaddle/Paddle/pull/56729),[#58655](https://github.com/PaddlePaddle/Paddle/pull/58655) -- 新增 ProgramTranslator 组件,支持由 ProgramDesc 一键转换为飞桨新一代 IR 表示,并提供了易用的 C++和 Python 
接口。[#55433](https://github.com/PaddlePaddle/Paddle/pull/55433),[#54470](https://github.com/PaddlePaddle/Paddle/pull/54470),[#58044](https://github.com/PaddlePaddle/Paddle/pull/58044),[#58390](https://github.com/PaddlePaddle/Paddle/pull/58390),[#58100](https://github.com/PaddlePaddle/Paddle/pull/58100),[#55403](https://github.com/PaddlePaddle/Paddle/pull/55403),[#55406](https://github.com/PaddlePaddle/Paddle/pull/55406),[#54719](https://github.com/PaddlePaddle/Paddle/pull/54719),[#56550](https://github.com/PaddlePaddle/Paddle/pull/56550),[#55448](https://github.com/PaddlePaddle/Paddle/pull/55448),[#55453](https://github.com/PaddlePaddle/Paddle/pull/55453),[#56294](https://github.com/PaddlePaddle/Paddle/pull/56294),[#56308](https://github.com/PaddlePaddle/Paddle/pull/56308),[#56842](https://github.com/PaddlePaddle/Paddle/pull/56842),[#58517](https://github.com/PaddlePaddle/Paddle/pull/58517) -- 借助自动代码生成技术,一键生成飞桨框架全量静态图算子表示。将静态图组网逻辑下沉至 C++端,统一绑定到\_C_ops 模块,大幅精简 Python 端 API 代码,实现飞桨框架 API 的极致化动静统一,升级了诸多 Python API 以支持新 IR 
静态图组网。[#56570](https://github.com/PaddlePaddle/Paddle/pull/56570),[#55745](https://github.com/PaddlePaddle/Paddle/pull/55745),[#56955](https://github.com/PaddlePaddle/Paddle/pull/56955),[#57298](https://github.com/PaddlePaddle/Paddle/pull/57298),[#57946](https://github.com/PaddlePaddle/Paddle/pull/57946),[#57248](https://github.com/PaddlePaddle/Paddle/pull/57248),[#56080](https://github.com/PaddlePaddle/Paddle/pull/56080),[#54396](https://github.com/PaddlePaddle/Paddle/pull/54396),[#54551](https://github.com/PaddlePaddle/Paddle/pull/54551),[#56520](https://github.com/PaddlePaddle/Paddle/pull/56520),[#55002](https://github.com/PaddlePaddle/Paddle/pull/55002),[#57067](https://github.com/PaddlePaddle/Paddle/pull/57067),[#59320](https://github.com/PaddlePaddle/Paddle/pull/59320),[#59348](https://github.com/PaddlePaddle/Paddle/pull/59348),[#57164](https://github.com/PaddlePaddle/Paddle/pull/57164),[#57267](https://github.com/PaddlePaddle/Paddle/pull/57267),[#59064](https://github.com/PaddlePaddle/Paddle/pull/59064),[#54340](https://github.com/PaddlePaddle/Paddle/pull/54340),[#54895](https://github.com/PaddlePaddle/Paddle/pull/54895),[#55004](https://github.com/PaddlePaddle/Paddle/pull/55004),[#56196](https://github.com/PaddlePaddle/Paddle/pull/56196),[#56862](https://github.com/PaddlePaddle/Paddle/pull/56862),[#58991](https://github.com/PaddlePaddle/Paddle/pull/58991),[#55428](https://github.com/PaddlePaddle/Paddle/pull/55428),[#55909](https://github.com/PaddlePaddle/Paddle/pull/55909),[#56241](https://github.com/PaddlePaddle/Paddle/pull/56241),[#56526](https://github.com/PaddlePaddle/Paddle/pull/56526),[#56571](https://github.com/PaddlePaddle/Paddle/pull/56571),[#56518](https://github.com/PaddlePaddle/Paddle/pull/56518),[#57016](https://github.com/PaddlePaddle/Paddle/pull/57016),[#56653](https://github.com/PaddlePaddle/Paddle/pull/56653),[#56809](https://github.com/PaddlePaddle/Paddle/pull/56809),[#57158](https://github.com/PaddlePaddle/Paddle/pull/57158),[#55422](https
://github.com/PaddlePaddle/Paddle/pull/55422),[#55458](https://github.com/PaddlePaddle/Paddle/pull/55458),[#55432](https://github.com/PaddlePaddle/Paddle/pull/55432),[#55467](https://github.com/PaddlePaddle/Paddle/pull/55467),[#55483](https://github.com/PaddlePaddle/Paddle/pull/55483),[#55419](https://github.com/PaddlePaddle/Paddle/pull/55419),[#55517](https://github.com/PaddlePaddle/Paddle/pull/55517),[#55500](https://github.com/PaddlePaddle/Paddle/pull/55500),[#56674](https://github.com/PaddlePaddle/Paddle/pull/56674),[#57693](https://github.com/PaddlePaddle/Paddle/pull/57693),[#55008](https://github.com/PaddlePaddle/Paddle/pull/55008),[#57166](https://github.com/PaddlePaddle/Paddle/pull/57166),[#57157](https://github.com/PaddlePaddle/Paddle/pull/57157),[#57159](https://github.com/PaddlePaddle/Paddle/pull/57159),[#57175](https://github.com/PaddlePaddle/Paddle/pull/57175),[#57325](https://github.com/PaddlePaddle/Paddle/pull/57325),[#57330](https://github.com/PaddlePaddle/Paddle/pull/57330),[#57415](https://github.com/PaddlePaddle/Paddle/pull/57415),[#57122](https://github.com/PaddlePaddle/Paddle/pull/57122),[#57393](https://github.com/PaddlePaddle/Paddle/pull/57393),[#57344](https://github.com/PaddlePaddle/Paddle/pull/57344),[#57667](https://github.com/PaddlePaddle/Paddle/pull/57667),[#57348](https://github.com/PaddlePaddle/Paddle/pull/57348),[#57700](https://github.com/PaddlePaddle/Paddle/pull/57700),[#58093](https://github.com/PaddlePaddle/Paddle/pull/58093),[#58005](https://github.com/PaddlePaddle/Paddle/pull/58005),[#58081](https://github.com/PaddlePaddle/Paddle/pull/58081),[#58094](https://github.com/PaddlePaddle/Paddle/pull/58094),[#58137](https://github.com/PaddlePaddle/Paddle/pull/58137),[#58287](https://github.com/PaddlePaddle/Paddle/pull/58287),[#58352](https://github.com/PaddlePaddle/Paddle/pull/58352),[#58340](https://github.com/PaddlePaddle/Paddle/pull/58340),[#58363](https://github.com/PaddlePaddle/Paddle/pull/58363),[#58331](https://github.com/Paddle
Paddle/Paddle/pull/58331),[#58343](https://github.com/PaddlePaddle/Paddle/pull/58343),[#58317](https://github.com/PaddlePaddle/Paddle/pull/58317),[#58450](https://github.com/PaddlePaddle/Paddle/pull/58450),[#58377](https://github.com/PaddlePaddle/Paddle/pull/58377),[#58466](https://github.com/PaddlePaddle/Paddle/pull/58466),[#58470](https://github.com/PaddlePaddle/Paddle/pull/58470),[#58491](https://github.com/PaddlePaddle/Paddle/pull/58491),[#58546](https://github.com/PaddlePaddle/Paddle/pull/58546),[#58587](https://github.com/PaddlePaddle/Paddle/pull/58587),[#58453](https://github.com/PaddlePaddle/Paddle/pull/58453),[#58634](https://github.com/PaddlePaddle/Paddle/pull/58634),[#58604](https://github.com/PaddlePaddle/Paddle/pull/58604),[#58605](https://github.com/PaddlePaddle/Paddle/pull/58605),[#58593](https://github.com/PaddlePaddle/Paddle/pull/58593),[#58675](https://github.com/PaddlePaddle/Paddle/pull/58675),[#58699](https://github.com/PaddlePaddle/Paddle/pull/58699),[#58384](https://github.com/PaddlePaddle/Paddle/pull/58384),[#58629](https://github.com/PaddlePaddle/Paddle/pull/58629),[#58579](https://github.com/PaddlePaddle/Paddle/pull/58579),[#58695](https://github.com/PaddlePaddle/Paddle/pull/58695),[#58548](https://github.com/PaddlePaddle/Paddle/pull/58548),[#58688](https://github.com/PaddlePaddle/Paddle/pull/58688),[#58792](https://github.com/PaddlePaddle/Paddle/pull/58792),[#58843](https://github.com/PaddlePaddle/Paddle/pull/58843),[#58840](https://github.com/PaddlePaddle/Paddle/pull/58840),[#58718](https://github.com/PaddlePaddle/Paddle/pull/58718),[#58883](https://github.com/PaddlePaddle/Paddle/pull/58883),[#58785](https://github.com/PaddlePaddle/Paddle/pull/58785),[#58608](https://github.com/PaddlePaddle/Paddle/pull/58608),[#58781](https://github.com/PaddlePaddle/Paddle/pull/58781),[#58783](https://github.com/PaddlePaddle/Paddle/pull/58783),[#58429](https://github.com/PaddlePaddle/Paddle/pull/58429),[#58685](https://github.com/PaddlePaddle/Paddle/pull/5
8685),[#58696](https://github.com/PaddlePaddle/Paddle/pull/58696),[#58690](https://github.com/PaddlePaddle/Paddle/pull/58690),[#58831](https://github.com/PaddlePaddle/Paddle/pull/58831),[#58929](https://github.com/PaddlePaddle/Paddle/pull/58929),[#58740](https://github.com/PaddlePaddle/Paddle/pull/58740),[#58937](https://github.com/PaddlePaddle/Paddle/pull/58937),[#58782](https://github.com/PaddlePaddle/Paddle/pull/58782),[#58833](https://github.com/PaddlePaddle/Paddle/pull/58833),[#58882](https://github.com/PaddlePaddle/Paddle/pull/58882),[#58935](https://github.com/PaddlePaddle/Paddle/pull/58935),[#58931](https://github.com/PaddlePaddle/Paddle/pull/58931),[#59041](https://github.com/PaddlePaddle/Paddle/pull/59041),[#59040](https://github.com/PaddlePaddle/Paddle/pull/59040),[#58877](https://github.com/PaddlePaddle/Paddle/pull/58877),[#58888](https://github.com/PaddlePaddle/Paddle/pull/58888),[#59042](https://github.com/PaddlePaddle/Paddle/pull/59042),[#58780](https://github.com/PaddlePaddle/Paddle/pull/58780),[#58682](https://github.com/PaddlePaddle/Paddle/pull/58682),[#58815](https://github.com/PaddlePaddle/Paddle/pull/58815),[#58676](https://github.com/PaddlePaddle/Paddle/pull/58676),[#58678](https://github.com/PaddlePaddle/Paddle/pull/58678),[#58446](https://github.com/PaddlePaddle/Paddle/pull/58446),[#59077](https://github.com/PaddlePaddle/Paddle/pull/59077),[#59091](https://github.com/PaddlePaddle/Paddle/pull/59091),[#58661](https://github.com/PaddlePaddle/Paddle/pull/58661),[#58832](https://github.com/PaddlePaddle/Paddle/pull/58832),[#58642](https://github.com/PaddlePaddle/Paddle/pull/58642),[#58698](https://github.com/PaddlePaddle/Paddle/pull/58698),[#59313](https://github.com/PaddlePaddle/Paddle/pull/59313),[#59371](https://github.com/PaddlePaddle/Paddle/pull/59371),[#58700](https://github.com/PaddlePaddle/Paddle/pull/58700),[#58953](https://github.com/PaddlePaddle/Paddle/pull/58953),[#58879](https://github.com/PaddlePaddle/Paddle/pull/58879),[#59469](https
://github.com/PaddlePaddle/Paddle/pull/59469),[#59573](https://github.com/PaddlePaddle/Paddle/pull/59573),[#59481](https://github.com/PaddlePaddle/Paddle/pull/59481),[#59419](https://github.com/PaddlePaddle/Paddle/pull/59419),[#59509](https://github.com/PaddlePaddle/Paddle/pull/59509),[#58735](https://github.com/PaddlePaddle/Paddle/pull/58735),[#59616](https://github.com/PaddlePaddle/Paddle/pull/59616),[#59582](https://github.com/PaddlePaddle/Paddle/pull/59582),[#59420](https://github.com/PaddlePaddle/Paddle/pull/59420),[#59500](https://github.com/PaddlePaddle/Paddle/pull/59500),[#58911](https://github.com/PaddlePaddle/Paddle/pull/58911),[#59535](https://github.com/PaddlePaddle/Paddle/pull/59535),[#54891](https://github.com/PaddlePaddle/Paddle/pull/54891),[#56794](https://github.com/PaddlePaddle/Paddle/pull/56794),[#57477](https://github.com/PaddlePaddle/Paddle/pull/57477),[#57929](https://github.com/PaddlePaddle/Paddle/pull/57929),[#57765](https://github.com/PaddlePaddle/Paddle/pull/57765),[#58693](https://github.com/PaddlePaddle/Paddle/pull/58693),[#58603](https://github.com/PaddlePaddle/Paddle/pull/58603),[#56291](https://github.com/PaddlePaddle/Paddle/pull/56291),[#57123](https://github.com/PaddlePaddle/Paddle/pull/57123),[#57317](https://github.com/PaddlePaddle/Paddle/pull/57317),[#57341](https://github.com/PaddlePaddle/Paddle/pull/57341),[#57020](https://github.com/PaddlePaddle/Paddle/pull/57020),[#57324](https://github.com/PaddlePaddle/Paddle/pull/57324),[#57761](https://github.com/PaddlePaddle/Paddle/pull/57761),[#57762](https://github.com/PaddlePaddle/Paddle/pull/57762),[#57907](https://github.com/PaddlePaddle/Paddle/pull/57907),[#57909](https://github.com/PaddlePaddle/Paddle/pull/57909),[#58099](https://github.com/PaddlePaddle/Paddle/pull/58099),[#58110](https://github.com/PaddlePaddle/Paddle/pull/58110),[#58114](https://github.com/PaddlePaddle/Paddle/pull/58114),[#58139](https://github.com/PaddlePaddle/Paddle/pull/58139),[#58144](https://github.com/Paddle
Paddle/Paddle/pull/58144),[#58165](https://github.com/PaddlePaddle/Paddle/pull/58165),[#58194](https://github.com/PaddlePaddle/Paddle/pull/58194),[#58138](https://github.com/PaddlePaddle/Paddle/pull/58138),[#58113](https://github.com/PaddlePaddle/Paddle/pull/58113),[#58245](https://github.com/PaddlePaddle/Paddle/pull/58245),[#58318](https://github.com/PaddlePaddle/Paddle/pull/58318),[#58105](https://github.com/PaddlePaddle/Paddle/pull/58105),[#58348](https://github.com/PaddlePaddle/Paddle/pull/58348),[#58235](https://github.com/PaddlePaddle/Paddle/pull/58235),[#58354](https://github.com/PaddlePaddle/Paddle/pull/58354),[#58341](https://github.com/PaddlePaddle/Paddle/pull/58341),[#58445](https://github.com/PaddlePaddle/Paddle/pull/58445),[#58418](https://github.com/PaddlePaddle/Paddle/pull/58418),[#58239](https://github.com/PaddlePaddle/Paddle/pull/58239),[#58473](https://github.com/PaddlePaddle/Paddle/pull/58473),[#58239](https://github.com/PaddlePaddle/Paddle/pull/58239),[#58391](https://github.com/PaddlePaddle/Paddle/pull/58391),[#58501](https://github.com/PaddlePaddle/Paddle/pull/58501),[#58519](https://github.com/PaddlePaddle/Paddle/pull/58519),[#58416](https://github.com/PaddlePaddle/Paddle/pull/58416),[#58588](https://github.com/PaddlePaddle/Paddle/pull/58588),[#58531](https://github.com/PaddlePaddle/Paddle/pull/58531),[#58730](https://github.com/PaddlePaddle/Paddle/pull/58730),[#58773](https://github.com/PaddlePaddle/Paddle/pull/58773),[#58862](https://github.com/PaddlePaddle/Paddle/pull/58862),[#58946](https://github.com/PaddlePaddle/Paddle/pull/58946),[#58500](https://github.com/PaddlePaddle/Paddle/pull/58500),[#56585](https://github.com/PaddlePaddle/Paddle/pull/56585),[#57480](https://github.com/PaddlePaddle/Paddle/pull/57480),[#57433](https://github.com/PaddlePaddle/Paddle/pull/57433),[#58498](https://github.com/PaddlePaddle/Paddle/pull/58498) - -#### 功能优化 - -- 升级了静态图执行器,扩展了更多 Kernel Instruction 类型,支持加载 PIR 
并高效调度执行,在训练、推理环节都有显存和性能收益。[#54570](https://github.com/PaddlePaddle/Paddle/pull/54570),[#58665](https://github.com/PaddlePaddle/Paddle/pull/58665),[#57291](https://github.com/PaddlePaddle/Paddle/pull/57291),[#54452](https://github.com/PaddlePaddle/Paddle/pull/54452),[#57431](https://github.com/PaddlePaddle/Paddle/pull/57431),[#54692](https://github.com/PaddlePaddle/Paddle/pull/54692),[#55112](https://github.com/PaddlePaddle/Paddle/pull/55112),[#55210](https://github.com/PaddlePaddle/Paddle/pull/55210),[#55401](https://github.com/PaddlePaddle/Paddle/pull/55401),[#55772](https://github.com/PaddlePaddle/Paddle/pull/55772),[#55828](https://github.com/PaddlePaddle/Paddle/pull/55828),[#56148](https://github.com/PaddlePaddle/Paddle/pull/56148),[#54763](https://github.com/PaddlePaddle/Paddle/pull/54763),[#56886](https://github.com/PaddlePaddle/Paddle/pull/56886),[#57284](https://github.com/PaddlePaddle/Paddle/pull/57284),[#57268](https://github.com/PaddlePaddle/Paddle/pull/57268),[#57791](https://github.com/PaddlePaddle/Paddle/pull/57791),[#56789](https://github.com/PaddlePaddle/Paddle/pull/56789),[#56704](https://github.com/PaddlePaddle/Paddle/pull/56704),[#57594](https://github.com/PaddlePaddle/Paddle/pull/57594),[#58397](https://github.com/PaddlePaddle/Paddle/pull/58397),[#58337](https://github.com/PaddlePaddle/Paddle/pull/58337),[#58756](https://github.com/PaddlePaddle/Paddle/pull/58756),[#58371](https://github.com/PaddlePaddle/Paddle/pull/58371) -- 面向 PIR 重构了自动微分模块,迁移适配了高阶自动微分功能,优化了 Stop Gradient 
传递机制,逻辑更加清晰,功能更加鲁棒。[#55660](https://github.com/PaddlePaddle/Paddle/pull/55660),[#57084](https://github.com/PaddlePaddle/Paddle/pull/57084),[#56890](https://github.com/PaddlePaddle/Paddle/pull/56890),[#58942](https://github.com/PaddlePaddle/Paddle/pull/58942),[#59373](https://github.com/PaddlePaddle/Paddle/pull/59373),[#57206](https://github.com/PaddlePaddle/Paddle/pull/57206),[#58145](https://github.com/PaddlePaddle/Paddle/pull/58145),[#55235](https://github.com/PaddlePaddle/Paddle/pull/55235),[#57255](https://github.com/PaddlePaddle/Paddle/pull/57255),[#56925](https://github.com/PaddlePaddle/Paddle/pull/56925),[#55957](https://github.com/PaddlePaddle/Paddle/pull/55957),[#56163](https://github.com/PaddlePaddle/Paddle/pull/56163),[#56316](https://github.com/PaddlePaddle/Paddle/pull/56316),[#57294](https://github.com/PaddlePaddle/Paddle/pull/57294),[#57449](https://github.com/PaddlePaddle/Paddle/pull/57449),[#59520](https://github.com/PaddlePaddle/Paddle/pull/59520),[#59565](https://github.com/PaddlePaddle/Paddle/pull/59565),[#56265](https://github.com/PaddlePaddle/Paddle/pull/56265),[#56512](https://github.com/PaddlePaddle/Paddle/pull/56512),[#56650](https://github.com/PaddlePaddle/Paddle/pull/56650),[#57183](https://github.com/PaddlePaddle/Paddle/pull/57183),[#57956](https://github.com/PaddlePaddle/Paddle/pull/57956),[#59100](https://github.com/PaddlePaddle/Paddle/pull/59100) -- 优化了控制流前向,反向算子的设计和表示,引入 ControlFlow Dialect,并支持 ProgramDesc 下控制流算子到 PIR 
的转换和执行。[#58729](https://github.com/PaddlePaddle/Paddle/pull/58729),[#57364](https://github.com/PaddlePaddle/Paddle/pull/57364),[#58625](https://github.com/PaddlePaddle/Paddle/pull/58625),[#57475](https://github.com/PaddlePaddle/Paddle/pull/57475),[#57265](https://github.com/PaddlePaddle/Paddle/pull/57265),[#56799](https://github.com/PaddlePaddle/Paddle/pull/56799),[#59033](https://github.com/PaddlePaddle/Paddle/pull/59033),[#57342](https://github.com/PaddlePaddle/Paddle/pull/57342),[#57801](https://github.com/PaddlePaddle/Paddle/pull/57801),[#57958](https://github.com/PaddlePaddle/Paddle/pull/57958),[#57949](https://github.com/PaddlePaddle/Paddle/pull/57949),[#57937](https://github.com/PaddlePaddle/Paddle/pull/57937),[#59231](https://github.com/PaddlePaddle/Paddle/pull/59231),[#59496](https://github.com/PaddlePaddle/Paddle/pull/59496),[#59321](https://github.com/PaddlePaddle/Paddle/pull/59321),[#58088](https://github.com/PaddlePaddle/Paddle/pull/58088),[#58198](https://github.com/PaddlePaddle/Paddle/pull/58198),[#58024](https://github.com/PaddlePaddle/Paddle/pull/58024),[#58089](https://github.com/PaddlePaddle/Paddle/pull/58089),[#58086](https://github.com/PaddlePaddle/Paddle/pull/58086),[#59175](https://github.com/PaddlePaddle/Paddle/pull/59175),[#59423](https://github.com/PaddlePaddle/Paddle/pull/59423),[#59567](https://github.com/PaddlePaddle/Paddle/pull/59567),[#58098](https://github.com/PaddlePaddle/Paddle/pull/58098),[#58163](https://github.com/PaddlePaddle/Paddle/pull/58163),[#58250](https://github.com/PaddlePaddle/Paddle/pull/58250),[#58277](https://github.com/PaddlePaddle/Paddle/pull/58277),[#58355](https://github.com/PaddlePaddle/Paddle/pull/58355),[#59020](https://github.com/PaddlePaddle/Paddle/pull/59020),[#59200](https://github.com/PaddlePaddle/Paddle/pull/59200),[#59585](https://github.com/PaddlePaddle/Paddle/pull/59585),[#58109](https://github.com/PaddlePaddle/Paddle/pull/58109) -- 动转静执行流程升级支持 PIR,优化了动转静子图 Pass 机制,支持用户在@to_static 功能下尝鲜使用 PIR 
体系下功能。[#57566](https://github.com/PaddlePaddle/Paddle/pull/57566),[#55620](https://github.com/PaddlePaddle/Paddle/pull/55620),[#56791](https://github.com/PaddlePaddle/Paddle/pull/56791),[#57357](https://github.com/PaddlePaddle/Paddle/pull/57357),[#59152](https://github.com/PaddlePaddle/Paddle/pull/59152),[#59312](https://github.com/PaddlePaddle/Paddle/pull/59312),[#58630](https://github.com/PaddlePaddle/Paddle/pull/58630),[#56035](https://github.com/PaddlePaddle/Paddle/pull/56035),[#59447](https://github.com/PaddlePaddle/Paddle/pull/59447),[#57361](https://github.com/PaddlePaddle/Paddle/pull/57361),[#59261](https://github.com/PaddlePaddle/Paddle/pull/59261),[#59774](https://github.com/PaddlePaddle/Paddle/pull/59774) -- 升级了组合算子功能,引入 Backend 概念分层管理动、静态图组合算子模块逻辑,将必要组件和算子拆分规则下沉至 C++,大幅降低了维护成本。[#58153](https://github.com/PaddlePaddle/Paddle/pull/58153),[#56391](https://github.com/PaddlePaddle/Paddle/pull/56391),[#56614](https://github.com/PaddlePaddle/Paddle/pull/56614),[#57030](https://github.com/PaddlePaddle/Paddle/pull/57030),[#57554](https://github.com/PaddlePaddle/Paddle/pull/57554),[#58018](https://github.com/PaddlePaddle/Paddle/pull/58018),[#58130](https://github.com/PaddlePaddle/Paddle/pull/58130),[#58581](https://github.com/PaddlePaddle/Paddle/pull/58581),[#58679](https://github.com/PaddlePaddle/Paddle/pull/58679),[#59054](https://github.com/PaddlePaddle/Paddle/pull/59054),[#55480](https://github.com/PaddlePaddle/Paddle/pull/55480),[#58451](https://github.com/PaddlePaddle/Paddle/pull/58451),[#55647](https://github.com/PaddlePaddle/Paddle/pull/55647),[#56342](https://github.com/PaddlePaddle/Paddle/pull/56342),[#56798](https://github.com/PaddlePaddle/Paddle/pull/56798),[#57561](https://github.com/PaddlePaddle/Paddle/pull/57561),[#58023](https://github.com/PaddlePaddle/Paddle/pull/58023),[#57722](https://github.com/PaddlePaddle/Paddle/pull/57722) - -#### 性能优化 - -- 新增 DCE、constant_folding_pass 等 PIR Program 算子和结构优化的 
Pass。[#54935](https://github.com/PaddlePaddle/Paddle/pull/54935),[#59430](https://github.com/PaddlePaddle/Paddle/pull/59430),[#58753](https://github.com/PaddlePaddle/Paddle/pull/58753),[#58732](https://github.com/PaddlePaddle/Paddle/pull/58732) +### 功能优化 +- 完善 PIR 的基础功能,包含基础的类型系统增强、调试、打印、Pass 开发、AMP 支持等,提升 PIR 的研发效率。[#60723](https://github.com/PaddlePaddle/Paddle/pull/60723), [#60677](https://github.com/PaddlePaddle/Paddle/pull/60677), [#60783](https://github.com/PaddlePaddle/Paddle/pull/60783), [#60798](https://github.com/PaddlePaddle/Paddle/pull/60798), [#61053](https://github.com/PaddlePaddle/Paddle/pull/61053), [#61366](https://github.com/PaddlePaddle/Paddle/pull/61366), [#61446](https://github.com/PaddlePaddle/Paddle/pull/61446), [#60024](https://github.com/PaddlePaddle/Paddle/pull/60024), [#59939](https://github.com/PaddlePaddle/Paddle/pull/59939), [#63376](https://github.com/PaddlePaddle/Paddle/pull/63376), [#61853](https://github.com/PaddlePaddle/Paddle/pull/61853), [#63914](https://github.com/PaddlePaddle/Paddle/pull/63914), [#60170](https://github.com/PaddlePaddle/Paddle/pull/60170), [#60678](https://github.com/PaddlePaddle/Paddle/pull/60678), [#64093](https://github.com/PaddlePaddle/Paddle/pull/64093), [#64065](https://github.com/PaddlePaddle/Paddle/pull/64065), [#62451](https://github.com/PaddlePaddle/Paddle/pull/62451), [#59784](https://github.com/PaddlePaddle/Paddle/pull/59784), [#60136](https://github.com/PaddlePaddle/Paddle/pull/60136), [#63336](https://github.com/PaddlePaddle/Paddle/pull/63336), [#62108](https://github.com/PaddlePaddle/Paddle/pull/62108), [#60860](https://github.com/PaddlePaddle/Paddle/pull/60860), [#60536](https://github.com/PaddlePaddle/Paddle/pull/60536), [#60590](https://github.com/PaddlePaddle/Paddle/pull/60590), [#60752](https://github.com/PaddlePaddle/Paddle/pull/60752), [#61435](https://github.com/PaddlePaddle/Paddle/pull/61435), [#62977](https://github.com/PaddlePaddle/Paddle/pull/62977), 
[#62139](https://github.com/PaddlePaddle/Paddle/pull/62139), [#60432](https://github.com/PaddlePaddle/Paddle/pull/60432), [#61452](https://github.com/PaddlePaddle/Paddle/pull/61452), [#61978](https://github.com/PaddlePaddle/Paddle/pull/61978), [#62262](https://github.com/PaddlePaddle/Paddle/pull/62262), [#62422](https://github.com/PaddlePaddle/Paddle/pull/62422), [#60359](https://github.com/PaddlePaddle/Paddle/pull/60359), [#62989](https://github.com/PaddlePaddle/Paddle/pull/62989), [#61297](https://github.com/PaddlePaddle/Paddle/pull/61297), [#61399](https://github.com/PaddlePaddle/Paddle/pull/61399), [#61871](https://github.com/PaddlePaddle/Paddle/pull/61871), [#61496](https://github.com/PaddlePaddle/Paddle/pull/61496), [#62413](https://github.com/PaddlePaddle/Paddle/pull/62413) +- 优化飞桨执行器执行逻辑,完善 Pass 体系,提升训推性能表现,并更好的支持分布式并行的逻辑运行。 [#60182](https://github.com/PaddlePaddle/Paddle/pull/60182), [#60516](https://github.com/PaddlePaddle/Paddle/pull/60516), [#63573](https://github.com/PaddlePaddle/Paddle/pull/63573), [#60181](https://github.com/PaddlePaddle/Paddle/pull/60181), [#59792](https://github.com/PaddlePaddle/Paddle/pull/59792), [#62025](https://github.com/PaddlePaddle/Paddle/pull/62025), [#61160](https://github.com/PaddlePaddle/Paddle/pull/61160), [#61188](https://github.com/PaddlePaddle/Paddle/pull/61188), [#61277](https://github.com/PaddlePaddle/Paddle/pull/61277), [#61669](https://github.com/PaddlePaddle/Paddle/pull/61669), [#60823](https://github.com/PaddlePaddle/Paddle/pull/60823), [#61310](https://github.com/PaddlePaddle/Paddle/pull/61310), [#60892](https://github.com/PaddlePaddle/Paddle/pull/60892), [#60578](https://github.com/PaddlePaddle/Paddle/pull/60578), [#61657](https://github.com/PaddlePaddle/Paddle/pull/61657), [#62638](https://github.com/PaddlePaddle/Paddle/pull/62638), [#63960](https://github.com/PaddlePaddle/Paddle/pull/63960), [#64234](https://github.com/PaddlePaddle/Paddle/pull/64234) -2. 
新增 fused_attention,fused_dropout_add,fused_gemm_epilogue_pass,fused_linear_param_grad_add_pass,fused_weight_only_linear_pass,fused_softmax_mask_upper_triangle 等优化算子融合类 Pass,提升训练和推理性能。[#57557](https://github.com/PaddlePaddle/Paddle/pull/57557),[#58272](https://github.com/PaddlePaddle/Paddle/pull/58272),[#58188](https://github.com/PaddlePaddle/Paddle/pull/58188),[#58401](https://github.com/PaddlePaddle/Paddle/pull/58401),[#59366](https://github.com/PaddlePaddle/Paddle/pull/59366),[#57655](https://github.com/PaddlePaddle/Paddle/pull/57655),[#57360](https://github.com/PaddlePaddle/Paddle/pull/57360),[#56672](https://github.com/PaddlePaddle/Paddle/pull/56672),[#58537](https://github.com/PaddlePaddle/Paddle/pull/58537),[#56247](https://github.com/PaddlePaddle/Paddle/pull/56247),[#59391](https://github.com/PaddlePaddle/Paddle/pull/59391),[#58897](https://github.com/PaddlePaddle/Paddle/pull/58897),[#54933](https://github.com/PaddlePaddle/Paddle/pull/54933) +### PIR 新功能 +- 基于 PIR 实现反向逻辑,直接生成反向计算图,同时支持高阶微分。 [#60174](https://github.com/PaddlePaddle/Paddle/pull/60174), [#60328](https://github.com/PaddlePaddle/Paddle/pull/60328), [#60818](https://github.com/PaddlePaddle/Paddle/pull/60818), [#61352](https://github.com/PaddlePaddle/Paddle/pull/61352), [#61661](https://github.com/PaddlePaddle/Paddle/pull/61661), [#61927](https://github.com/PaddlePaddle/Paddle/pull/61927), [#62772](https://github.com/PaddlePaddle/Paddle/pull/62772), [#60360](https://github.com/PaddlePaddle/Paddle/pull/60360), [#60866](https://github.com/PaddlePaddle/Paddle/pull/60866), [#60970](https://github.com/PaddlePaddle/Paddle/pull/60970), [#60810](https://github.com/PaddlePaddle/Paddle/pull/60810), [#64696](https://github.com/PaddlePaddle/Paddle/pull/64696), [#59844](https://github.com/PaddlePaddle/Paddle/pull/59844), [#59999](https://github.com/PaddlePaddle/Paddle/pull/59999), [#60262](https://github.com/PaddlePaddle/Paddle/pull/60262), [#60338](https://github.com/PaddlePaddle/Paddle/pull/60338), 
[#59935](https://github.com/PaddlePaddle/Paddle/pull/59935), [#59982](https://github.com/PaddlePaddle/Paddle/pull/59982), [#60221](https://github.com/PaddlePaddle/Paddle/pull/60221), [#62621](https://github.com/PaddlePaddle/Paddle/pull/62621), [#60044](https://github.com/PaddlePaddle/Paddle/pull/60044), [#59790](https://github.com/PaddlePaddle/Paddle/pull/59790), [#60529](https://github.com/PaddlePaddle/Paddle/pull/60529), [#61378](https://github.com/PaddlePaddle/Paddle/pull/61378), [#61584](https://github.com/PaddlePaddle/Paddle/pull/61584) +- 基于 PIR 实现控制流逻辑,提升 PIR 的表达能力,更好的支持训练和推理等多场景业务。[#61396](https://github.com/PaddlePaddle/Paddle/pull/61396), [#64045](https://github.com/PaddlePaddle/Paddle/pull/64045), [#60953](https://github.com/PaddlePaddle/Paddle/pull/60953), [#61091](https://github.com/PaddlePaddle/Paddle/pull/61091), [#61304](https://github.com/PaddlePaddle/Paddle/pull/61304), [#62093](https://github.com/PaddlePaddle/Paddle/pull/62093), [#64710](https://github.com/PaddlePaddle/Paddle/pull/64710), [#60668](https://github.com/PaddlePaddle/Paddle/pull/60668), [#60433](https://github.com/PaddlePaddle/Paddle/pull/60433), [#60963](https://github.com/PaddlePaddle/Paddle/pull/60963), [#61192](https://github.com/PaddlePaddle/Paddle/pull/61192), [#60895](https://github.com/PaddlePaddle/Paddle/pull/60895), [#60017](https://github.com/PaddlePaddle/Paddle/pull/60017), [#60369](https://github.com/PaddlePaddle/Paddle/pull/60369), [#60330](https://github.com/PaddlePaddle/Paddle/pull/60330), [#60364](https://github.com/PaddlePaddle/Paddle/pull/60364), [#61416](https://github.com/PaddlePaddle/Paddle/pull/61416), [#60460](https://github.com/PaddlePaddle/Paddle/pull/60460), [#60703](https://github.com/PaddlePaddle/Paddle/pull/60703), [#61027](https://github.com/PaddlePaddle/Paddle/pull/61027) +- 基于 PIR 实现 save/load 逻辑,打通 PIR 和上下游训练和推理业务的流程。 [#63438](https://github.com/PaddlePaddle/Paddle/pull/63438), [#63574](https://github.com/PaddlePaddle/Paddle/pull/63574), 
[#64281](https://github.com/PaddlePaddle/Paddle/pull/64281), [#64327](https://github.com/PaddlePaddle/Paddle/pull/64327), [#63622](https://github.com/PaddlePaddle/Paddle/pull/63622), [#64507](https://github.com/PaddlePaddle/Paddle/pull/64507), [#63389](https://github.com/PaddlePaddle/Paddle/pull/63389), [#63539](https://github.com/PaddlePaddle/Paddle/pull/63539), [#63749](https://github.com/PaddlePaddle/Paddle/pull/63749), [#63957](https://github.com/PaddlePaddle/Paddle/pull/63957), [#64044](https://github.com/PaddlePaddle/Paddle/pull/64044), [#64121](https://github.com/PaddlePaddle/Paddle/pull/64121), [#64239](https://github.com/PaddlePaddle/Paddle/pull/64239), [#63818](https://github.com/PaddlePaddle/Paddle/pull/63818), [#63910](https://github.com/PaddlePaddle/Paddle/pull/63910),[#63380](https://github.com/PaddlePaddle/Paddle/pull/63380),[#63275](https://github.com/PaddlePaddle/Paddle/pull/63275),[#63663](https://github.com/PaddlePaddle/Paddle/pull/63663),[#64692](https://github.com/PaddlePaddle/Paddle/pull/64692),[#63958](https://github.com/PaddlePaddle/Paddle/pull/63958) +- 完成 OneDNN 相关基础功能开发和验证,为 OneDNN 全面切换做准备。 [#60680](https://github.com/PaddlePaddle/Paddle/pull/60680), [#60665](https://github.com/PaddlePaddle/Paddle/pull/60665), [#63162](https://github.com/PaddlePaddle/Paddle/pull/63162), [#59917](https://github.com/PaddlePaddle/Paddle/pull/59917), [#62901](https://github.com/PaddlePaddle/Paddle/pull/62901), [#59918](https://github.com/PaddlePaddle/Paddle/pull/59918), [#60257](https://github.com/PaddlePaddle/Paddle/pull/60257), [#60502](https://github.com/PaddlePaddle/Paddle/pull/60502), [#61062](https://github.com/PaddlePaddle/Paddle/pull/61062), [#61170](https://github.com/PaddlePaddle/Paddle/pull/61170), [#61474](https://github.com/PaddlePaddle/Paddle/pull/61474), [#60874](https://github.com/PaddlePaddle/Paddle/pull/60874), [#61495](https://github.com/PaddlePaddle/Paddle/pull/61495), 
[#61664](https://github.com/PaddlePaddle/Paddle/pull/61664), [#61649](https://github.com/PaddlePaddle/Paddle/pull/61649), [#61592](https://github.com/PaddlePaddle/Paddle/pull/61592), [#61667](https://github.com/PaddlePaddle/Paddle/pull/61667), [#61137](https://github.com/PaddlePaddle/Paddle/pull/61137), [#60952](https://github.com/PaddlePaddle/Paddle/pull/60952), [#61651](https://github.com/PaddlePaddle/Paddle/pull/61651), [#62126](https://github.com/PaddlePaddle/Paddle/pull/62126), [#62187](https://github.com/PaddlePaddle/Paddle/pull/62187), [#61307](https://github.com/PaddlePaddle/Paddle/pull/61307), [#62734](https://github.com/PaddlePaddle/Paddle/pull/62734), [#60974](https://github.com/PaddlePaddle/Paddle/pull/60974), [#61451](https://github.com/PaddlePaddle/Paddle/pull/61451), [#61011](https://github.com/PaddlePaddle/Paddle/pull/61011), [#61218](https://github.com/PaddlePaddle/Paddle/pull/61218), [#61623](https://github.com/PaddlePaddle/Paddle/pull/61623), [#61893](https://github.com/PaddlePaddle/Paddle/pull/61893), [#61876](https://github.com/PaddlePaddle/Paddle/pull/61876), [#61892](https://github.com/PaddlePaddle/Paddle/pull/61892), [#62085](https://github.com/PaddlePaddle/Paddle/pull/62085), [#62220](https://github.com/PaddlePaddle/Paddle/pull/62220), [#62244](https://github.com/PaddlePaddle/Paddle/pull/62244), [#62265](https://github.com/PaddlePaddle/Paddle/pull/62265), [#60754](https://github.com/PaddlePaddle/Paddle/pull/60754), [#60896](https://github.com/PaddlePaddle/Paddle/pull/60896), [#61868](https://github.com/PaddlePaddle/Paddle/pull/61868), [#61659](https://github.com/PaddlePaddle/Paddle/pull/61659), [#62241](https://github.com/PaddlePaddle/Paddle/pull/62241), [#62471](https://github.com/PaddlePaddle/Paddle/pull/62471), 
[#61165](https://github.com/PaddlePaddle/Paddle/pull/61165),[#64441](https://github.com/PaddlePaddle/Paddle/pull/64441),[#63141](https://github.com/PaddlePaddle/Paddle/pull/63141),[#63145](https://github.com/PaddlePaddle/Paddle/pull/63145),[#63592](https://github.com/PaddlePaddle/Paddle/pull/63592),[#63617](https://github.com/PaddlePaddle/Paddle/pull/63617),[#63518](https://github.com/PaddlePaddle/Paddle/pull/63518),[#63726](https://github.com/PaddlePaddle/Paddle/pull/63726),[#63853](https://github.com/PaddlePaddle/Paddle/pull/63853),[#63812](https://github.com/PaddlePaddle/Paddle/pull/63812),[#63811](https://github.com/PaddlePaddle/Paddle/pull/63811),[#64524](https://github.com/PaddlePaddle/Paddle/pull/64524),[#62993](https://github.com/PaddlePaddle/Paddle/pull/62993),[#63516](https://github.com/PaddlePaddle/Paddle/pull/63516),[#62998](https://github.com/PaddlePaddle/Paddle/pull/62998),[#63151](https://github.com/PaddlePaddle/Paddle/pull/63151),[#64661](https://github.com/PaddlePaddle/Paddle/pull/64661),[#64433](https://github.com/PaddlePaddle/Paddle/pull/64433),[#64448](https://github.com/PaddlePaddle/Paddle/pull/64448),[#63201](https://github.com/PaddlePaddle/Paddle/pull/63201),[#63230](https://github.com/PaddlePaddle/Paddle/pull/63230),[#63233](https://github.com/PaddlePaddle/Paddle/pull/63233),[#63281](https://github.com/PaddlePaddle/Paddle/pull/63281),[#64671](https://github.com/PaddlePaddle/Paddle/pull/64671),[#63274](https://github.com/PaddlePaddle/Paddle/pull/63274) +- 基于 PIR 实现 Sparse 相关逻辑,包含基础的 Type 类型和算子表达,并完成 Sparse 重点功能验证。 [#62868](https://github.com/PaddlePaddle/Paddle/pull/62868), [#63015](https://github.com/PaddlePaddle/Paddle/pull/63015), [#62894](https://github.com/PaddlePaddle/Paddle/pull/62894) -### 动转静能力增强 +### 动转静功能优化 +优化动转静基础能力,适配 SOT 训练场景下的动态维度,支持 Python3.12。 +- 完成动转静场景的 PIR 适配。[#60988](https://github.com/PaddlePaddle/Paddle/pull/60988), [#61936](https://github.com/PaddlePaddle/Paddle/pull/61936), 
[#59929](https://github.com/PaddlePaddle/Paddle/pull/59929), [#61790](https://github.com/PaddlePaddle/Paddle/pull/61790), [#64323](https://github.com/PaddlePaddle/Paddle/pull/64323), [#62030](https://github.com/PaddlePaddle/Paddle/pull/62030), [#61143](https://github.com/PaddlePaddle/Paddle/pull/61143), [#62680](https://github.com/PaddlePaddle/Paddle/pull/62680), [#63309](https://github.com/PaddlePaddle/Paddle/pull/63309), [#63311](https://github.com/PaddlePaddle/Paddle/pull/63311), [#62199](https://github.com/PaddlePaddle/Paddle/pull/62199)
+- SOT now supports Python 3.12 bytecode, so the dynamic-to-static SOT feature can be used on Python 3.12. [#61414](https://github.com/PaddlePaddle/Paddle/pull/61414), [#59562](https://github.com/PaddlePaddle/Paddle/pull/59562), [#61031](https://github.com/PaddlePaddle/Paddle/pull/61031), [#61272](https://github.com/PaddlePaddle/Paddle/pull/61272), [#61412](https://github.com/PaddlePaddle/Paddle/pull/61412), [#61305](https://github.com/PaddlePaddle/Paddle/pull/61305), [#61964](https://github.com/PaddlePaddle/Paddle/pull/61964), [#62008](https://github.com/PaddlePaddle/Paddle/pull/62008), [#62028](https://github.com/PaddlePaddle/Paddle/pull/62028), [#61995](https://github.com/PaddlePaddle/Paddle/pull/61995), [#62073](https://github.com/PaddlePaddle/Paddle/pull/62073), [#62120](https://github.com/PaddlePaddle/Paddle/pull/62120), [#62218](https://github.com/PaddlePaddle/Paddle/pull/62218), [#62155](https://github.com/PaddlePaddle/Paddle/pull/62155)
+- SOT has been adapted to dynamic dimensions in training scenarios, which avoids repeated graph construction when dimensions change and improves runtime efficiency. [#64278](https://github.com/PaddlePaddle/Paddle/pull/64278), [#64435](https://github.com/PaddlePaddle/Paddle/pull/64435), [#64499](https://github.com/PaddlePaddle/Paddle/pull/64499), [#64500](https://github.com/PaddlePaddle/Paddle/pull/64500), [#62080](https://github.com/PaddlePaddle/Paddle/pull/62080)
-Converting dynamic graphs to static graphs is a key technology in deep learning frameworks; it lets developers find the best balance between flexibility and training efficiency. In this release, PaddlePaddle comprehensively upgraded the core dynamic-to-static functionality; across the more than 700 models in PaddlePaddle's industrial-grade model zoo, dynamic-to-static training achieves a 100% success rate.
+### Operator mechanism
+To address problems such as incomplete Kernel implementations and inefficient computation logic in some PaddlePaddle operators, we further refined and optimized some operator functionality and the internal mechanisms of the operator system, fixed some known issues, and added support for some new features.
+- For XPU Kernels, improved data type support for operators such as `numel`, `concat`, and `slice`, and mixed-precision training support for the `AdamW` optimizer. [#63715](https://github.com/PaddlePaddle/Paddle/pull/63715), [#61617](https://github.com/PaddlePaddle/Paddle/pull/61617), [#61694](https://github.com/PaddlePaddle/Paddle/pull/61694), [#64542](https://github.com/PaddlePaddle/Paddle/pull/64542), [#63644](https://github.com/PaddlePaddle/Paddle/pull/63644), [#61340](https://github.com/PaddlePaddle/Paddle/pull/61340), [#63108](https://github.com/PaddlePaddle/Paddle/pull/63108)
+- Improved the functionality and performance of some operators. [#59413](https://github.com/PaddlePaddle/Paddle/pull/59413), [#60295](https://github.com/PaddlePaddle/Paddle/pull/60295), [#64304](https://github.com/PaddlePaddle/Paddle/pull/64304), [#60979](https://github.com/PaddlePaddle/Paddle/pull/60979), [#63556](https://github.com/PaddlePaddle/Paddle/pull/63556), [#63061](https://github.com/PaddlePaddle/Paddle/pull/63061), [#62533](https://github.com/PaddlePaddle/Paddle/pull/62533)
+- Refined the internal implementation mechanism of composite operators, and added or optimized decomposition logic for some operators. [#59448](https://github.com/PaddlePaddle/Paddle/pull/59448), [#60505](https://github.com/PaddlePaddle/Paddle/pull/60505), [#59891](https://github.com/PaddlePaddle/Paddle/pull/59891), [#63161](https://github.com/PaddlePaddle/Paddle/pull/63161), [#63245](https://github.com/PaddlePaddle/Paddle/pull/63245), [#63782](https://github.com/PaddlePaddle/Paddle/pull/63782), [#64346](https://github.com/PaddlePaddle/Paddle/pull/64346), [#63156](https://github.com/PaddlePaddle/Paddle/pull/63156), [#63171](https://github.com/PaddlePaddle/Paddle/pull/63171), [#61315](https://github.com/PaddlePaddle/Paddle/pull/61315), [#61701](https://github.com/PaddlePaddle/Paddle/pull/61701), [#61874](https://github.com/PaddlePaddle/Paddle/pull/61874), [#61873](https://github.com/PaddlePaddle/Paddle/pull/61873), [#62059](https://github.com/PaddlePaddle/Paddle/pull/62059), [#61912](https://github.com/PaddlePaddle/Paddle/pull/61912),
[#62112](https://github.com/PaddlePaddle/Paddle/pull/62112), [#63011](https://github.com/PaddlePaddle/Paddle/pull/63011), [#63009](https://github.com/PaddlePaddle/Paddle/pull/63009), [#64714](https://github.com/PaddlePaddle/Paddle/pull/64714)
-#### New features
+### Bug fixes
+- Fixed bugs related to PIR, the executor, dynamic-to-static conversion, and related components. [#64442](https://github.com/PaddlePaddle/Paddle/pull/64442), [#60443](https://github.com/PaddlePaddle/Paddle/pull/60443), [#60122](https://github.com/PaddlePaddle/Paddle/pull/60122), [#60625](https://github.com/PaddlePaddle/Paddle/pull/60625), [#60607](https://github.com/PaddlePaddle/Paddle/pull/60607), [#60705](https://github.com/PaddlePaddle/Paddle/pull/60705), [#61110](https://github.com/PaddlePaddle/Paddle/pull/61110), [#61278](https://github.com/PaddlePaddle/Paddle/pull/61278), [#61448](https://github.com/PaddlePaddle/Paddle/pull/61448), [#61491](https://github.com/PaddlePaddle/Paddle/pull/61491), [#61692](https://github.com/PaddlePaddle/Paddle/pull/61692), [#62100](https://github.com/PaddlePaddle/Paddle/pull/62100), [#62239](https://github.com/PaddlePaddle/Paddle/pull/62239), [#62365](https://github.com/PaddlePaddle/Paddle/pull/62365), [#62758](https://github.com/PaddlePaddle/Paddle/pull/62758), [#63395](https://github.com/PaddlePaddle/Paddle/pull/63395), [#64272](https://github.com/PaddlePaddle/Paddle/pull/64272), [#62165](https://github.com/PaddlePaddle/Paddle/pull/62165), [#64151](https://github.com/PaddlePaddle/Paddle/pull/64151), [#64204](https://github.com/PaddlePaddle/Paddle/pull/64204), [#64815](https://github.com/PaddlePaddle/Paddle/pull/64815), [#63757](https://github.com/PaddlePaddle/Paddle/pull/63757), [#61972](https://github.com/PaddlePaddle/Paddle/pull/61972), [#64806](https://github.com/PaddlePaddle/Paddle/pull/64806), [#60010](https://github.com/PaddlePaddle/Paddle/pull/60010), [#60461](https://github.com/PaddlePaddle/Paddle/pull/60461), [#60310](https://github.com/PaddlePaddle/Paddle/pull/60310), [#62006](https://github.com/PaddlePaddle/Paddle/pull/62006),
[#61591](https://github.com/PaddlePaddle/Paddle/pull/61591), [#60327](https://github.com/PaddlePaddle/Paddle/pull/60327), [#60720](https://github.com/PaddlePaddle/Paddle/pull/60720), [#64656](https://github.com/PaddlePaddle/Paddle/pull/64656), [#60236](https://github.com/PaddlePaddle/Paddle/pull/60236), [#60684](https://github.com/PaddlePaddle/Paddle/pull/60684), [#60790](https://github.com/PaddlePaddle/Paddle/pull/60790), [#60944](https://github.com/PaddlePaddle/Paddle/pull/60944), [#62056](https://github.com/PaddlePaddle/Paddle/pull/62056), [#62891](https://github.com/PaddlePaddle/Paddle/pull/62891), [#64676](https://github.com/PaddlePaddle/Paddle/pull/64676), [#60271](https://github.com/PaddlePaddle/Paddle/pull/60271), [#60634](https://github.com/PaddlePaddle/Paddle/pull/60634), [#60663](https://github.com/PaddlePaddle/Paddle/pull/60663), [#60827](https://github.com/PaddlePaddle/Paddle/pull/60827), [#60845](https://github.com/PaddlePaddle/Paddle/pull/60845), [#60905](https://github.com/PaddlePaddle/Paddle/pull/60905), [#60945](https://github.com/PaddlePaddle/Paddle/pull/60945), [#60949](https://github.com/PaddlePaddle/Paddle/pull/60949), [#61107](https://github.com/PaddlePaddle/Paddle/pull/61107), [#61111](https://github.com/PaddlePaddle/Paddle/pull/61111), [#61117](https://github.com/PaddlePaddle/Paddle/pull/61117), [#61158](https://github.com/PaddlePaddle/Paddle/pull/61158), [#61177](https://github.com/PaddlePaddle/Paddle/pull/61177), [#61355](https://github.com/PaddlePaddle/Paddle/pull/61355), [#61593](https://github.com/PaddlePaddle/Paddle/pull/61593), [#61666](https://github.com/PaddlePaddle/Paddle/pull/61666), [#61934](https://github.com/PaddlePaddle/Paddle/pull/61934), [#62216](https://github.com/PaddlePaddle/Paddle/pull/62216), [#62491](https://github.com/PaddlePaddle/Paddle/pull/62491), [#62515](https://github.com/PaddlePaddle/Paddle/pull/62515), [#62594](https://github.com/PaddlePaddle/Paddle/pull/62594), 
[#62605](https://github.com/PaddlePaddle/Paddle/pull/62605), [#62895](https://github.com/PaddlePaddle/Paddle/pull/62895), [#62913](https://github.com/PaddlePaddle/Paddle/pull/62913), [#64413](https://github.com/PaddlePaddle/Paddle/pull/64413), [#59947](https://github.com/PaddlePaddle/Paddle/pull/59947), [#60264](https://github.com/PaddlePaddle/Paddle/pull/60264), [#60721](https://github.com/PaddlePaddle/Paddle/pull/60721), [#63113](https://github.com/PaddlePaddle/Paddle/pull/63113), [#63629](https://github.com/PaddlePaddle/Paddle/pull/63629), [#64300](https://github.com/PaddlePaddle/Paddle/pull/64300), [#64450](https://github.com/PaddlePaddle/Paddle/pull/64450), [#64532](https://github.com/PaddlePaddle/Paddle/pull/64532), [#64561](https://github.com/PaddlePaddle/Paddle/pull/64561), [#64625](https://github.com/PaddlePaddle/Paddle/pull/64625), [#64731](https://github.com/PaddlePaddle/Paddle/pull/64731), [#60059](https://github.com/PaddlePaddle/Paddle/pull/60059), [#60487](https://github.com/PaddlePaddle/Paddle/pull/60487), [#60423](https://github.com/PaddlePaddle/Paddle/pull/60423), [#61599](https://github.com/PaddlePaddle/Paddle/pull/61599), [#62032](https://github.com/PaddlePaddle/Paddle/pull/62032), [#62686](https://github.com/PaddlePaddle/Paddle/pull/62686), [#64055](https://github.com/PaddlePaddle/Paddle/pull/64055), [#60751](https://github.com/PaddlePaddle/Paddle/pull/60751), [#61646](https://github.com/PaddlePaddle/Paddle/pull/61646), [#60454](https://github.com/PaddlePaddle/Paddle/pull/60454), [#62530](https://github.com/PaddlePaddle/Paddle/pull/62530), [#62821](https://github.com/PaddlePaddle/Paddle/pull/62821), [#64454](https://github.com/PaddlePaddle/Paddle/pull/64454), [#64754](https://github.com/PaddlePaddle/Paddle/pull/64754), [#59860](https://github.com/PaddlePaddle/Paddle/pull/59860), [#60280](https://github.com/PaddlePaddle/Paddle/pull/60280), [#60357](https://github.com/PaddlePaddle/Paddle/pull/60357), 
[#60363](https://github.com/PaddlePaddle/Paddle/pull/60363), [#60900](https://github.com/PaddlePaddle/Paddle/pull/60900), [#61185](https://github.com/PaddlePaddle/Paddle/pull/61185), [#61505](https://github.com/PaddlePaddle/Paddle/pull/61505), [#61644](https://github.com/PaddlePaddle/Paddle/pull/61644), [#62256](https://github.com/PaddlePaddle/Paddle/pull/62256), [#62396](https://github.com/PaddlePaddle/Paddle/pull/62396), [#63040](https://github.com/PaddlePaddle/Paddle/pull/63040), [#63409](https://github.com/PaddlePaddle/Paddle/pull/63409), [#63764](https://github.com/PaddlePaddle/Paddle/pull/63764), [#59571](https://github.com/PaddlePaddle/Paddle/pull/59571), [#59894](https://github.com/PaddlePaddle/Paddle/pull/59894), [#59569](https://github.com/PaddlePaddle/Paddle/pull/59569), [#59896](https://github.com/PaddlePaddle/Paddle/pull/59896), [#60015](https://github.com/PaddlePaddle/Paddle/pull/60015), [#60081](https://github.com/PaddlePaddle/Paddle/pull/60081), [#60164](https://github.com/PaddlePaddle/Paddle/pull/60164), [#60200](https://github.com/PaddlePaddle/Paddle/pull/60200), [#60211](https://github.com/PaddlePaddle/Paddle/pull/60211), [#60267](https://github.com/PaddlePaddle/Paddle/pull/60267), [#60458](https://github.com/PaddlePaddle/Paddle/pull/60458), [#60395](https://github.com/PaddlePaddle/Paddle/pull/60395), [#60907](https://github.com/PaddlePaddle/Paddle/pull/60907), [#60707](https://github.com/PaddlePaddle/Paddle/pull/60707), [#60993](https://github.com/PaddlePaddle/Paddle/pull/60993), [#61401](https://github.com/PaddlePaddle/Paddle/pull/61401), [#61433](https://github.com/PaddlePaddle/Paddle/pull/61433), [#61450](https://github.com/PaddlePaddle/Paddle/pull/61450), [#61577](https://github.com/PaddlePaddle/Paddle/pull/61577), [#61575](https://github.com/PaddlePaddle/Paddle/pull/61575), [#61703](https://github.com/PaddlePaddle/Paddle/pull/61703), [#61711](https://github.com/PaddlePaddle/Paddle/pull/61711), 
[#61883](https://github.com/PaddlePaddle/Paddle/pull/61883), [#61822](https://github.com/PaddlePaddle/Paddle/pull/61822), [#62012](https://github.com/PaddlePaddle/Paddle/pull/62012), [#61858](https://github.com/PaddlePaddle/Paddle/pull/61858), [#62176](https://github.com/PaddlePaddle/Paddle/pull/62176), [#62257](https://github.com/PaddlePaddle/Paddle/pull/62257), [#62470](https://github.com/PaddlePaddle/Paddle/pull/62470), [#62536](https://github.com/PaddlePaddle/Paddle/pull/62536), [#62606](https://github.com/PaddlePaddle/Paddle/pull/62606), [#62808](https://github.com/PaddlePaddle/Paddle/pull/62808), [#62854](https://github.com/PaddlePaddle/Paddle/pull/62854), [#62879](https://github.com/PaddlePaddle/Paddle/pull/62879), [#62864](https://github.com/PaddlePaddle/Paddle/pull/62864), [#63063](https://github.com/PaddlePaddle/Paddle/pull/63063), [#62958](https://github.com/PaddlePaddle/Paddle/pull/62958), [#63397](https://github.com/PaddlePaddle/Paddle/pull/63397), [#63805](https://github.com/PaddlePaddle/Paddle/pull/63805), [#63694](https://github.com/PaddlePaddle/Paddle/pull/63694), [#64168](https://github.com/PaddlePaddle/Paddle/pull/64168), [#64184](https://github.com/PaddlePaddle/Paddle/pull/64184), [#64174](https://github.com/PaddlePaddle/Paddle/pull/64174), [#64315](https://github.com/PaddlePaddle/Paddle/pull/64315), [#64362](https://github.com/PaddlePaddle/Paddle/pull/64362), [#64400](https://github.com/PaddlePaddle/Paddle/pull/64400), [#64475](https://github.com/PaddlePaddle/Paddle/pull/64475), [#64458](https://github.com/PaddlePaddle/Paddle/pull/64458), [#64548](https://github.com/PaddlePaddle/Paddle/pull/64548), [#59858](https://github.com/PaddlePaddle/Paddle/pull/59858), [#61132](https://github.com/PaddlePaddle/Paddle/pull/61132), [#62010](https://github.com/PaddlePaddle/Paddle/pull/62010), [#62069](https://github.com/PaddlePaddle/Paddle/pull/62069), [#62707](https://github.com/PaddlePaddle/Paddle/pull/62707), 
[#62921](https://github.com/PaddlePaddle/Paddle/pull/62921), [#63085](https://github.com/PaddlePaddle/Paddle/pull/63085), [#63321](https://github.com/PaddlePaddle/Paddle/pull/63321), [#63351](https://github.com/PaddlePaddle/Paddle/pull/63351), [#63549](https://github.com/PaddlePaddle/Paddle/pull/63549), [#64567](https://github.com/PaddlePaddle/Paddle/pull/64567), [#59936](https://github.com/PaddlePaddle/Paddle/pull/59936), [#60269](https://github.com/PaddlePaddle/Paddle/pull/60269), [#60879](https://github.com/PaddlePaddle/Paddle/pull/60879), [#61314](https://github.com/PaddlePaddle/Paddle/pull/61314), [#61391](https://github.com/PaddlePaddle/Paddle/pull/61391), [#61479](https://github.com/PaddlePaddle/Paddle/pull/61479), [#61789](https://github.com/PaddlePaddle/Paddle/pull/61789), [#61832](https://github.com/PaddlePaddle/Paddle/pull/61832), [#61864](https://github.com/PaddlePaddle/Paddle/pull/61864), [#61917](https://github.com/PaddlePaddle/Paddle/pull/61917), [#62052](https://github.com/PaddlePaddle/Paddle/pull/62052), [#62068](https://github.com/PaddlePaddle/Paddle/pull/62068), [#62293](https://github.com/PaddlePaddle/Paddle/pull/62293), [#62479](https://github.com/PaddlePaddle/Paddle/pull/62479), [#62506](https://github.com/PaddlePaddle/Paddle/pull/62506), [#59948](https://github.com/PaddlePaddle/Paddle/pull/59948), [#64118](https://github.com/PaddlePaddle/Paddle/pull/64118), [#64126](https://github.com/PaddlePaddle/Paddle/pull/64126), [#64195](https://github.com/PaddlePaddle/Paddle/pull/64195), [#64307](https://github.com/PaddlePaddle/Paddle/pull/64307), [#64314](https://github.com/PaddlePaddle/Paddle/pull/64314), [#64276](https://github.com/PaddlePaddle/Paddle/pull/64276), [#64312](https://github.com/PaddlePaddle/Paddle/pull/64312), [#64350](https://github.com/PaddlePaddle/Paddle/pull/64350), [#64319](https://github.com/PaddlePaddle/Paddle/pull/64319), [#64463](https://github.com/PaddlePaddle/Paddle/pull/64463), 
[#64457](https://github.com/PaddlePaddle/Paddle/pull/64457), [#64455](https://github.com/PaddlePaddle/Paddle/pull/64455), [#64487](https://github.com/PaddlePaddle/Paddle/pull/64487), [#64645](https://github.com/PaddlePaddle/Paddle/pull/64645), [#63155](https://github.com/PaddlePaddle/Paddle/pull/63155), [#59893](https://github.com/PaddlePaddle/Paddle/pull/59893), [#63332](https://github.com/PaddlePaddle/Paddle/pull/63332), [#64786](https://github.com/PaddlePaddle/Paddle/pull/64786), [#60515](https://github.com/PaddlePaddle/Paddle/pull/60515), [#60627](https://github.com/PaddlePaddle/Paddle/pull/60627), [#60863](https://github.com/PaddlePaddle/Paddle/pull/60863), [#60854](https://github.com/PaddlePaddle/Paddle/pull/60854), [#61447](https://github.com/PaddlePaddle/Paddle/pull/61447), [#61440](https://github.com/PaddlePaddle/Paddle/pull/61440), [#61932](https://github.com/PaddlePaddle/Paddle/pull/61932), [#62131](https://github.com/PaddlePaddle/Paddle/pull/62131), [#62252](https://github.com/PaddlePaddle/Paddle/pull/62252), [#62283](https://github.com/PaddlePaddle/Paddle/pull/62283), [#62358](https://github.com/PaddlePaddle/Paddle/pull/62358), [#62411](https://github.com/PaddlePaddle/Paddle/pull/62411), [#62424](https://github.com/PaddlePaddle/Paddle/pull/62424), [#62810](https://github.com/PaddlePaddle/Paddle/pull/62810), [#62811](https://github.com/PaddlePaddle/Paddle/pull/62811), [#62896](https://github.com/PaddlePaddle/Paddle/pull/62896), [#62947](https://github.com/PaddlePaddle/Paddle/pull/62947), [#63182](https://github.com/PaddlePaddle/Paddle/pull/63182), [#63190](https://github.com/PaddlePaddle/Paddle/pull/63190), [#63294](https://github.com/PaddlePaddle/Paddle/pull/63294), [#63306](https://github.com/PaddlePaddle/Paddle/pull/63306), [#63352](https://github.com/PaddlePaddle/Paddle/pull/63352), [#63404](https://github.com/PaddlePaddle/Paddle/pull/63404),
[#63474](https://github.com/PaddlePaddle/Paddle/pull/63474), [#64013](https://github.com/PaddlePaddle/Paddle/pull/64013), [#64674](https://github.com/PaddlePaddle/Paddle/pull/64674),[#60055](https://github.com/PaddlePaddle/Paddle/pull/60055),[#62050](https://github.com/PaddlePaddle/Paddle/pull/62050),[#62770](https://github.com/PaddlePaddle/Paddle/pull/62770),[#63234](https://github.com/PaddlePaddle/Paddle/pull/63234),[#63374](https://github.com/PaddlePaddle/Paddle/pull/63374),[#64277](https://github.com/PaddlePaddle/Paddle/pull/64277), [#63420](https://github.com/PaddlePaddle/Paddle/pull/63420), [#60312](https://github.com/PaddlePaddle/Paddle/pull/60312), [#63810](https://github.com/PaddlePaddle/Paddle/pull/63810), [#64631](https://github.com/PaddlePaddle/Paddle/pull/64631), [#63970](https://github.com/PaddlePaddle/Paddle/pull/63970), [#63708](https://github.com/PaddlePaddle/Paddle/pull/63708), [#62062](https://github.com/PaddlePaddle/Paddle/pull/62062), [#60898](https://github.com/PaddlePaddle/Paddle/pull/60898), [#62373](https://github.com/PaddlePaddle/Paddle/pull/62373), [#59878](https://github.com/PaddlePaddle/Paddle/pull/59878)
+- Fixed bugs in some operator mechanisms, operator implementation logic, and related unit tests. [#63792](https://github.com/PaddlePaddle/Paddle/pull/63792), [#60570](https://github.com/PaddlePaddle/Paddle/pull/60570), [#61572](https://github.com/PaddlePaddle/Paddle/pull/61572), [#59971](https://github.com/PaddlePaddle/Paddle/pull/59971), [#61336](https://github.com/PaddlePaddle/Paddle/pull/61336), [#63276](https://github.com/PaddlePaddle/Paddle/pull/63276), [#63251](https://github.com/PaddlePaddle/Paddle/pull/63251), [#63697](https://github.com/PaddlePaddle/Paddle/pull/63697), [#63706](https://github.com/PaddlePaddle/Paddle/pull/63706), [#64685](https://github.com/PaddlePaddle/Paddle/pull/64685), [#64009](https://github.com/PaddlePaddle/Paddle/pull/64009), [#62461](https://github.com/PaddlePaddle/Paddle/pull/62461), [#61568](https://github.com/PaddlePaddle/Paddle/pull/61568),
[#63912](https://github.com/PaddlePaddle/Paddle/pull/63912), [#60475](https://github.com/PaddlePaddle/Paddle/pull/60475), [#60222](https://github.com/PaddlePaddle/Paddle/pull/60222), [#63961](https://github.com/PaddlePaddle/Paddle/pull/63961), [#63593](https://github.com/PaddlePaddle/Paddle/pull/63593)
-- Using Python Eval Frame and virtual-machine simulated execution, implemented an adaptive Graph Break mechanism. For control-flow scenarios in particular, this mechanism introduces a CallLayer mechanism that takes advantage of PaddlePaddle's unified dynamic and static graphs, supports a hybrid mode of AST (abstract syntax tree) and bytecode simulation, and effectively captures control-flow operators, which greatly improves the ability to convert computation graphs to static form. For cache optimization, advanced techniques such as common subexpression elimination were integrated, significantly improving Guard execution efficiency. These optimizations both reduce redundant computation and speed up the overall system. To improve robustness, a simple and efficient intermediate data layer was designed; it supports correct restoration of SideEffects, ensuring stability and reliability in complex environments. In addition, mainstream interpreter versions from Python 3.8 to 3.11 are supported, giving users broad applicability. [#57824](https://github.com/PaddlePaddle/Paddle/pull/57824),[#55887](https://github.com/PaddlePaddle/Paddle/pull/55887),[#58155](https://github.com/PaddlePaddle/Paddle/pull/58155),[#56107](https://github.com/PaddlePaddle/Paddle/pull/56107),[#57490](https://github.com/PaddlePaddle/Paddle/pull/57490),[#58829](https://github.com/PaddlePaddle/Paddle/pull/58829),[#57240](https://github.com/PaddlePaddle/Paddle/pull/57240),[#57588](https://github.com/PaddlePaddle/Paddle/pull/57588),[#58117](https://github.com/PaddlePaddle/Paddle/pull/58117),[#59823](https://github.com/PaddlePaddle/Paddle/pull/59823),[#56077](https://github.com/PaddlePaddle/Paddle/pull/56077),[#58956](https://github.com/PaddlePaddle/Paddle/pull/58956),[#57653](https://github.com/PaddlePaddle/Paddle/pull/57653),[#59855](https://github.com/PaddlePaddle/Paddle/pull/59855),[#59017](https://github.com/PaddlePaddle/Paddle/pull/59017),[#58424](https://github.com/PaddlePaddle/Paddle/pull/58424),[#58187](https://github.com/PaddlePaddle/Paddle/pull/58187),[#57793](https://github.com/PaddlePaddle/Paddle/pull/57793),[#59698](https://github.com/PaddlePaddle/Paddle/pull/59698),[#59747](https://github.com/PaddlePaddle/Paddle/pull/59747),[#59710](https://github.com/PaddlePaddle/Paddle/pull/59710),[#59297](https://github.com/PaddlePaddle/Paddle/pull/59297),[#58423](https://github.com/PaddlePaddle/Paddle/pull/58423),[#56262](https://github.com/PaddlePaddle/Paddle/pull/56262),[#58103](https://github.com/PaddlePaddle/Paddle/pull/58103),[#58538](https://github.com/PaddlePaddle/Paddle/pull/58538),[#58771](https://github.com/PaddlePaddle/Paddle/pull/58771),[#59191](https://github.com/PaddlePaddle/Paddle/pull/59191),[#57754](https://github.com/PaddlePaddle/Paddle/pull/57754),[#59439](https://github.com/PaddlePaddle/Paddle/pull/59439),[#59816](https://github.com/PaddlePaddle/Paddle/pull/59816),[#59035](https://github.com/PaddlePaddle/Paddle/pull/59035)
-- Added dynamic-to-static syntax transcription and parsing for PyLayer, making conversion of PyLayer between dynamic and static graphs smoother. Users can now run dynamic-to-static training with PyLayer seamlessly and easily export inference models. [#56108](https://github.com/PaddlePaddle/Paddle/pull/56108),[#56531](https://github.com/PaddlePaddle/Paddle/pull/56531),[#57066](https://github.com/PaddlePaddle/Paddle/pull/57066),[#57633](https://github.com/PaddlePaddle/Paddle/pull/57633)
+### Developer-related changes
+- Developer-related changes, including PRs for switching to PIR, enabling unit tests, and feature validation. [#60621](https://github.com/PaddlePaddle/Paddle/pull/60621), [#59703](https://github.com/PaddlePaddle/Paddle/pull/59703), [#59694](https://github.com/PaddlePaddle/Paddle/pull/59694), [#59717](https://github.com/PaddlePaddle/Paddle/pull/59717), [#59729](https://github.com/PaddlePaddle/Paddle/pull/59729), [#59730](https://github.com/PaddlePaddle/Paddle/pull/59730), [#60216](https://github.com/PaddlePaddle/Paddle/pull/60216), [#60238](https://github.com/PaddlePaddle/Paddle/pull/60238), [#60246](https://github.com/PaddlePaddle/Paddle/pull/60246), [#60343](https://github.com/PaddlePaddle/Paddle/pull/60343), [#60302](https://github.com/PaddlePaddle/Paddle/pull/60302), [#60870](https://github.com/PaddlePaddle/Paddle/pull/60870), [#59956](https://github.com/PaddlePaddle/Paddle/pull/59956), [#60795](https://github.com/PaddlePaddle/Paddle/pull/60795), [#62528](https://github.com/PaddlePaddle/Paddle/pull/62528), [#59932](https://github.com/PaddlePaddle/Paddle/pull/59932),
[#59636](https://github.com/PaddlePaddle/Paddle/pull/59636), [#59959](https://github.com/PaddlePaddle/Paddle/pull/59959), [#59734](https://github.com/PaddlePaddle/Paddle/pull/59734), [#60287](https://github.com/PaddlePaddle/Paddle/pull/60287), [#60347](https://github.com/PaddlePaddle/Paddle/pull/60347), [#60335](https://github.com/PaddlePaddle/Paddle/pull/60335), [#60332](https://github.com/PaddlePaddle/Paddle/pull/60332), [#59631](https://github.com/PaddlePaddle/Paddle/pull/59631), [#60255](https://github.com/PaddlePaddle/Paddle/pull/60255), [#60329](https://github.com/PaddlePaddle/Paddle/pull/60329), [#60401](https://github.com/PaddlePaddle/Paddle/pull/60401), [#60522](https://github.com/PaddlePaddle/Paddle/pull/60522), [#60792](https://github.com/PaddlePaddle/Paddle/pull/60792), [#59617](https://github.com/PaddlePaddle/Paddle/pull/59617), [#60277](https://github.com/PaddlePaddle/Paddle/pull/60277), [#60584](https://github.com/PaddlePaddle/Paddle/pull/60584), [#60911](https://github.com/PaddlePaddle/Paddle/pull/60911), [#61322](https://github.com/PaddlePaddle/Paddle/pull/61322), [#60838](https://github.com/PaddlePaddle/Paddle/pull/60838), [#60602](https://github.com/PaddlePaddle/Paddle/pull/60602), [#61458](https://github.com/PaddlePaddle/Paddle/pull/61458), [#61607](https://github.com/PaddlePaddle/Paddle/pull/61607), [#61960](https://github.com/PaddlePaddle/Paddle/pull/61960), [#60484](https://github.com/PaddlePaddle/Paddle/pull/60484), [#61662](https://github.com/PaddlePaddle/Paddle/pull/61662), [#62263](https://github.com/PaddlePaddle/Paddle/pull/62263), [#62270](https://github.com/PaddlePaddle/Paddle/pull/62270), [#62469](https://github.com/PaddlePaddle/Paddle/pull/62469), [#62416](https://github.com/PaddlePaddle/Paddle/pull/62416), [#62443](https://github.com/PaddlePaddle/Paddle/pull/62443), [#62412](https://github.com/PaddlePaddle/Paddle/pull/62412), [#62541](https://github.com/PaddlePaddle/Paddle/pull/62541), 
[#62634](https://github.com/PaddlePaddle/Paddle/pull/62634), [#62369](https://github.com/PaddlePaddle/Paddle/pull/62369), [#60805](https://github.com/PaddlePaddle/Paddle/pull/60805), [#62644](https://github.com/PaddlePaddle/Paddle/pull/62644), [#62494](https://github.com/PaddlePaddle/Paddle/pull/62494), [#62767](https://github.com/PaddlePaddle/Paddle/pull/62767), [#62735](https://github.com/PaddlePaddle/Paddle/pull/62735), [#62802](https://github.com/PaddlePaddle/Paddle/pull/62802), [#62801](https://github.com/PaddlePaddle/Paddle/pull/62801), [#62783](https://github.com/PaddlePaddle/Paddle/pull/62783), [#62579](https://github.com/PaddlePaddle/Paddle/pull/62579), [#62833](https://github.com/PaddlePaddle/Paddle/pull/62833), [#62668](https://github.com/PaddlePaddle/Paddle/pull/62668), [#62972](https://github.com/PaddlePaddle/Paddle/pull/62972), [#62505](https://github.com/PaddlePaddle/Paddle/pull/62505), [#63005](https://github.com/PaddlePaddle/Paddle/pull/63005), [#62900](https://github.com/PaddlePaddle/Paddle/pull/62900), [#60577](https://github.com/PaddlePaddle/Paddle/pull/60577), [#60877](https://github.com/PaddlePaddle/Paddle/pull/60877), [#61076](https://github.com/PaddlePaddle/Paddle/pull/61076), [#61038](https://github.com/PaddlePaddle/Paddle/pull/61038), [#61112](https://github.com/PaddlePaddle/Paddle/pull/61112), [#61120](https://github.com/PaddlePaddle/Paddle/pull/61120), [#61582](https://github.com/PaddlePaddle/Paddle/pull/61582), [#61119](https://github.com/PaddlePaddle/Paddle/pull/61119), [#61036](https://github.com/PaddlePaddle/Paddle/pull/61036), [#61289](https://github.com/PaddlePaddle/Paddle/pull/61289), [#60695](https://github.com/PaddlePaddle/Paddle/pull/60695), [#61039](https://github.com/PaddlePaddle/Paddle/pull/61039), [#61963](https://github.com/PaddlePaddle/Paddle/pull/61963), [#62118](https://github.com/PaddlePaddle/Paddle/pull/62118), [#62797](https://github.com/PaddlePaddle/Paddle/pull/62797), 
[#62807](https://github.com/PaddlePaddle/Paddle/pull/62807), [#62887](https://github.com/PaddlePaddle/Paddle/pull/62887), [#62830](https://github.com/PaddlePaddle/Paddle/pull/62830), [#62849](https://github.com/PaddlePaddle/Paddle/pull/62849), [#62750](https://github.com/PaddlePaddle/Paddle/pull/62750), [#62965](https://github.com/PaddlePaddle/Paddle/pull/62965), [#59742](https://github.com/PaddlePaddle/Paddle/pull/59742), [#59867](https://github.com/PaddlePaddle/Paddle/pull/59867), [#60836](https://github.com/PaddlePaddle/Paddle/pull/60836), [#60902](https://github.com/PaddlePaddle/Paddle/pull/60902), [#61228](https://github.com/PaddlePaddle/Paddle/pull/61228), [#60037](https://github.com/PaddlePaddle/Paddle/pull/60037), [#60079](https://github.com/PaddlePaddle/Paddle/pull/60079), [#60173](https://github.com/PaddlePaddle/Paddle/pull/60173), [#60373](https://github.com/PaddlePaddle/Paddle/pull/60373), [#60380](https://github.com/PaddlePaddle/Paddle/pull/60380), [#60381](https://github.com/PaddlePaddle/Paddle/pull/60381), [#60750](https://github.com/PaddlePaddle/Paddle/pull/60750), [#61065](https://github.com/PaddlePaddle/Paddle/pull/61065), [#61122](https://github.com/PaddlePaddle/Paddle/pull/61122), [#61074](https://github.com/PaddlePaddle/Paddle/pull/61074), [#61204](https://github.com/PaddlePaddle/Paddle/pull/61204), [#61191](https://github.com/PaddlePaddle/Paddle/pull/61191), [#61182](https://github.com/PaddlePaddle/Paddle/pull/61182), [#61219](https://github.com/PaddlePaddle/Paddle/pull/61219), [#61296](https://github.com/PaddlePaddle/Paddle/pull/61296), [#61503](https://github.com/PaddlePaddle/Paddle/pull/61503), [#61484](https://github.com/PaddlePaddle/Paddle/pull/61484), [#61513](https://github.com/PaddlePaddle/Paddle/pull/61513), [#61476](https://github.com/PaddlePaddle/Paddle/pull/61476), [#61510](https://github.com/PaddlePaddle/Paddle/pull/61510), [#61511](https://github.com/PaddlePaddle/Paddle/pull/61511), 
[#61526](https://github.com/PaddlePaddle/Paddle/pull/61526), [#61524](https://github.com/PaddlePaddle/Paddle/pull/61524), [#61525](https://github.com/PaddlePaddle/Paddle/pull/61525), [#61466](https://github.com/PaddlePaddle/Paddle/pull/61466), [#61497](https://github.com/PaddlePaddle/Paddle/pull/61497), [#61538](https://github.com/PaddlePaddle/Paddle/pull/61538), [#61533](https://github.com/PaddlePaddle/Paddle/pull/61533), [#61530](https://github.com/PaddlePaddle/Paddle/pull/61530), [#61468](https://github.com/PaddlePaddle/Paddle/pull/61468), [#61527](https://github.com/PaddlePaddle/Paddle/pull/61527), [#61535](https://github.com/PaddlePaddle/Paddle/pull/61535), [#61512](https://github.com/PaddlePaddle/Paddle/pull/61512), [#61531](https://github.com/PaddlePaddle/Paddle/pull/61531), [#61539](https://github.com/PaddlePaddle/Paddle/pull/61539), [#61532](https://github.com/PaddlePaddle/Paddle/pull/61532), [#61521](https://github.com/PaddlePaddle/Paddle/pull/61521), [#61517](https://github.com/PaddlePaddle/Paddle/pull/61517), [#61518](https://github.com/PaddlePaddle/Paddle/pull/61518), [#61550](https://github.com/PaddlePaddle/Paddle/pull/61550), [#61545](https://github.com/PaddlePaddle/Paddle/pull/61545), [#61548](https://github.com/PaddlePaddle/Paddle/pull/61548), [#61519](https://github.com/PaddlePaddle/Paddle/pull/61519), [#61549](https://github.com/PaddlePaddle/Paddle/pull/61549), [#61574](https://github.com/PaddlePaddle/Paddle/pull/61574), [#61585](https://github.com/PaddlePaddle/Paddle/pull/61585), [#61581](https://github.com/PaddlePaddle/Paddle/pull/61581), [#61553](https://github.com/PaddlePaddle/Paddle/pull/61553), [#61504](https://github.com/PaddlePaddle/Paddle/pull/61504), [#61603](https://github.com/PaddlePaddle/Paddle/pull/61603), [#61534](https://github.com/PaddlePaddle/Paddle/pull/61534), [#61567](https://github.com/PaddlePaddle/Paddle/pull/61567), [#61523](https://github.com/PaddlePaddle/Paddle/pull/61523), 
[#61565](https://github.com/PaddlePaddle/Paddle/pull/61565), [#61564](https://github.com/PaddlePaddle/Paddle/pull/61564), [#61707](https://github.com/PaddlePaddle/Paddle/pull/61707), [#61560](https://github.com/PaddlePaddle/Paddle/pull/61560), [#61684](https://github.com/PaddlePaddle/Paddle/pull/61684), [#61706](https://github.com/PaddlePaddle/Paddle/pull/61706), [#61724](https://github.com/PaddlePaddle/Paddle/pull/61724), [#61719](https://github.com/PaddlePaddle/Paddle/pull/61719), [#61729](https://github.com/PaddlePaddle/Paddle/pull/61729), [#61763](https://github.com/PaddlePaddle/Paddle/pull/61763), [#61755](https://github.com/PaddlePaddle/Paddle/pull/61755), [#61737](https://github.com/PaddlePaddle/Paddle/pull/61737), [#61750](https://github.com/PaddlePaddle/Paddle/pull/61750), [#61753](https://github.com/PaddlePaddle/Paddle/pull/61753), [#61756](https://github.com/PaddlePaddle/Paddle/pull/61756), [#61777](https://github.com/PaddlePaddle/Paddle/pull/61777), [#61758](https://github.com/PaddlePaddle/Paddle/pull/61758), [#61731](https://github.com/PaddlePaddle/Paddle/pull/61731), [#61771](https://github.com/PaddlePaddle/Paddle/pull/61771), [#61739](https://github.com/PaddlePaddle/Paddle/pull/61739), [#61559](https://github.com/PaddlePaddle/Paddle/pull/61559), [#61717](https://github.com/PaddlePaddle/Paddle/pull/61717), [#61733](https://github.com/PaddlePaddle/Paddle/pull/61733), [#61563](https://github.com/PaddlePaddle/Paddle/pull/61563), [#61546](https://github.com/PaddlePaddle/Paddle/pull/61546), [#61566](https://github.com/PaddlePaddle/Paddle/pull/61566), [#61562](https://github.com/PaddlePaddle/Paddle/pull/61562), [#61793](https://github.com/PaddlePaddle/Paddle/pull/61793), [#61902](https://github.com/PaddlePaddle/Paddle/pull/61902), [#61905](https://github.com/PaddlePaddle/Paddle/pull/61905), [#61904](https://github.com/PaddlePaddle/Paddle/pull/61904), [#62227](https://github.com/PaddlePaddle/Paddle/pull/62227), 
[#62332](https://github.com/PaddlePaddle/Paddle/pull/62332), [#62653](https://github.com/PaddlePaddle/Paddle/pull/62653), [#62681](https://github.com/PaddlePaddle/Paddle/pull/62681), [#62709](https://github.com/PaddlePaddle/Paddle/pull/62709), [#62794](https://github.com/PaddlePaddle/Paddle/pull/62794), [#62938](https://github.com/PaddlePaddle/Paddle/pull/62938), [#63185](https://github.com/PaddlePaddle/Paddle/pull/63185), [#63754](https://github.com/PaddlePaddle/Paddle/pull/63754), [#63769](https://github.com/PaddlePaddle/Paddle/pull/63769), [#63793](https://github.com/PaddlePaddle/Paddle/pull/63793), [#63830](https://github.com/PaddlePaddle/Paddle/pull/63830), [#63939](https://github.com/PaddlePaddle/Paddle/pull/63939), [#64340](https://github.com/PaddlePaddle/Paddle/pull/64340), [#64657](https://github.com/PaddlePaddle/Paddle/pull/64657), [#62527](https://github.com/PaddlePaddle/Paddle/pull/62527), [#64088](https://github.com/PaddlePaddle/Paddle/pull/64088), [#60203](https://github.com/PaddlePaddle/Paddle/pull/60203), [#60372](https://github.com/PaddlePaddle/Paddle/pull/60372), [#60685](https://github.com/PaddlePaddle/Paddle/pull/60685), [#60815](https://github.com/PaddlePaddle/Paddle/pull/60815), [#60791](https://github.com/PaddlePaddle/Paddle/pull/60791), [#60864](https://github.com/PaddlePaddle/Paddle/pull/60864), [#60851](https://github.com/PaddlePaddle/Paddle/pull/60851), [#60844](https://github.com/PaddlePaddle/Paddle/pull/60844), [#60694](https://github.com/PaddlePaddle/Paddle/pull/60694), [#60855](https://github.com/PaddlePaddle/Paddle/pull/60855), [#60869](https://github.com/PaddlePaddle/Paddle/pull/60869), [#60948](https://github.com/PaddlePaddle/Paddle/pull/60948), [#61042](https://github.com/PaddlePaddle/Paddle/pull/61042), [#61455](https://github.com/PaddlePaddle/Paddle/pull/61455), [#61580](https://github.com/PaddlePaddle/Paddle/pull/61580), [#61589](https://github.com/PaddlePaddle/Paddle/pull/61589), 
[#61609](https://github.com/PaddlePaddle/Paddle/pull/61609), [#61616](https://github.com/PaddlePaddle/Paddle/pull/61616), [#61715](https://github.com/PaddlePaddle/Paddle/pull/61715), [#61716](https://github.com/PaddlePaddle/Paddle/pull/61716), [#61759](https://github.com/PaddlePaddle/Paddle/pull/61759), [#61555](https://github.com/PaddlePaddle/Paddle/pull/61555), [#61492](https://github.com/PaddlePaddle/Paddle/pull/61492), [#61805](https://github.com/PaddlePaddle/Paddle/pull/61805), [#61712](https://github.com/PaddlePaddle/Paddle/pull/61712), [#61615](https://github.com/PaddlePaddle/Paddle/pull/61615), [#61713](https://github.com/PaddlePaddle/Paddle/pull/61713), [#62129](https://github.com/PaddlePaddle/Paddle/pull/62129), [#59294](https://github.com/PaddlePaddle/Paddle/pull/59294), [#59865](https://github.com/PaddlePaddle/Paddle/pull/59865), [#60270](https://github.com/PaddlePaddle/Paddle/pull/60270), [#60547](https://github.com/PaddlePaddle/Paddle/pull/60547), [#60698](https://github.com/PaddlePaddle/Paddle/pull/60698), [#60762](https://github.com/PaddlePaddle/Paddle/pull/60762), [#60753](https://github.com/PaddlePaddle/Paddle/pull/60753), [#60966](https://github.com/PaddlePaddle/Paddle/pull/60966), [#60976](https://github.com/PaddlePaddle/Paddle/pull/60976), [#61100](https://github.com/PaddlePaddle/Paddle/pull/61100), [#61203](https://github.com/PaddlePaddle/Paddle/pull/61203), [#61210](https://github.com/PaddlePaddle/Paddle/pull/61210), [#61424](https://github.com/PaddlePaddle/Paddle/pull/61424), [#61213](https://github.com/PaddlePaddle/Paddle/pull/61213), [#61275](https://github.com/PaddlePaddle/Paddle/pull/61275), [#61276](https://github.com/PaddlePaddle/Paddle/pull/61276), [#61279](https://github.com/PaddlePaddle/Paddle/pull/61279), [#61292](https://github.com/PaddlePaddle/Paddle/pull/61292), [#61295](https://github.com/PaddlePaddle/Paddle/pull/61295), [#61298](https://github.com/PaddlePaddle/Paddle/pull/61298), 
[#61299](https://github.com/PaddlePaddle/Paddle/pull/61299), [#61301](https://github.com/PaddlePaddle/Paddle/pull/61301), [#61302](https://github.com/PaddlePaddle/Paddle/pull/61302), [#61329](https://github.com/PaddlePaddle/Paddle/pull/61329), [#61804](https://github.com/PaddlePaddle/Paddle/pull/61804), [#62745](https://github.com/PaddlePaddle/Paddle/pull/62745), [#62909](https://github.com/PaddlePaddle/Paddle/pull/62909), [#64247](https://github.com/PaddlePaddle/Paddle/pull/64247), [#64308](https://github.com/PaddlePaddle/Paddle/pull/64308), [#60690](https://github.com/PaddlePaddle/Paddle/pull/60690), [#61149](https://github.com/PaddlePaddle/Paddle/pull/61149), [#61145](https://github.com/PaddlePaddle/Paddle/pull/61145), [#61193](https://github.com/PaddlePaddle/Paddle/pull/61193), [#61207](https://github.com/PaddlePaddle/Paddle/pull/61207), [#61229](https://github.com/PaddlePaddle/Paddle/pull/61229), [#61236](https://github.com/PaddlePaddle/Paddle/pull/61236), [#61244](https://github.com/PaddlePaddle/Paddle/pull/61244), [#61242](https://github.com/PaddlePaddle/Paddle/pull/61242), [#61263](https://github.com/PaddlePaddle/Paddle/pull/61263), [#61370](https://github.com/PaddlePaddle/Paddle/pull/61370), [#61410](https://github.com/PaddlePaddle/Paddle/pull/61410), [#61480](https://github.com/PaddlePaddle/Paddle/pull/61480), [#61522](https://github.com/PaddlePaddle/Paddle/pull/61522), [#61540](https://github.com/PaddlePaddle/Paddle/pull/61540), [#61520](https://github.com/PaddlePaddle/Paddle/pull/61520), [#61625](https://github.com/PaddlePaddle/Paddle/pull/61625), [#61700](https://github.com/PaddlePaddle/Paddle/pull/61700), [#61708](https://github.com/PaddlePaddle/Paddle/pull/61708), [#61736](https://github.com/PaddlePaddle/Paddle/pull/61736), [#61889](https://github.com/PaddlePaddle/Paddle/pull/61889), [#61952](https://github.com/PaddlePaddle/Paddle/pull/61952), [#62033](https://github.com/PaddlePaddle/Paddle/pull/62033), 
[#62637](https://github.com/PaddlePaddle/Paddle/pull/62637), [#62777](https://github.com/PaddlePaddle/Paddle/pull/62777), [#62779](https://github.com/PaddlePaddle/Paddle/pull/62779), [#63226](https://github.com/PaddlePaddle/Paddle/pull/63226), [#63287](https://github.com/PaddlePaddle/Paddle/pull/63287), [#63398](https://github.com/PaddlePaddle/Paddle/pull/63398), [#63431](https://github.com/PaddlePaddle/Paddle/pull/63431), [#64000](https://github.com/PaddlePaddle/Paddle/pull/64000), [#64058](https://github.com/PaddlePaddle/Paddle/pull/64058), [#64059](https://github.com/PaddlePaddle/Paddle/pull/64059), [#64063](https://github.com/PaddlePaddle/Paddle/pull/64063), [#64066](https://github.com/PaddlePaddle/Paddle/pull/64066), [#64089](https://github.com/PaddlePaddle/Paddle/pull/64089), [#64170](https://github.com/PaddlePaddle/Paddle/pull/64170), [#64235](https://github.com/PaddlePaddle/Paddle/pull/64235), [#64237](https://github.com/PaddlePaddle/Paddle/pull/64237), [#64243](https://github.com/PaddlePaddle/Paddle/pull/64243), [#64242](https://github.com/PaddlePaddle/Paddle/pull/64242), [#64286](https://github.com/PaddlePaddle/Paddle/pull/64286), [#64322](https://github.com/PaddlePaddle/Paddle/pull/64322), [#64317](https://github.com/PaddlePaddle/Paddle/pull/64317), [#64490](https://github.com/PaddlePaddle/Paddle/pull/64490), [#60138](https://github.com/PaddlePaddle/Paddle/pull/60138), [#62384](https://github.com/PaddlePaddle/Paddle/pull/62384), [#59702](https://github.com/PaddlePaddle/Paddle/pull/59702), [#60341](https://github.com/PaddlePaddle/Paddle/pull/60341), [#60636](https://github.com/PaddlePaddle/Paddle/pull/60636), [#60714](https://github.com/PaddlePaddle/Paddle/pull/60714), [#60716](https://github.com/PaddlePaddle/Paddle/pull/60716), [#60700](https://github.com/PaddlePaddle/Paddle/pull/60700), [#60702](https://github.com/PaddlePaddle/Paddle/pull/60702), [#60704](https://github.com/PaddlePaddle/Paddle/pull/60704), 
[#60715](https://github.com/PaddlePaddle/Paddle/pull/60715), [#60713](https://github.com/PaddlePaddle/Paddle/pull/60713), [#60711](https://github.com/PaddlePaddle/Paddle/pull/60711), [#60724](https://github.com/PaddlePaddle/Paddle/pull/60724), [#60803](https://github.com/PaddlePaddle/Paddle/pull/60803), [#61331](https://github.com/PaddlePaddle/Paddle/pull/61331), [#63286](https://github.com/PaddlePaddle/Paddle/pull/63286), [#60473](https://github.com/PaddlePaddle/Paddle/pull/60473), [#61046](https://github.com/PaddlePaddle/Paddle/pull/61046), [#61859](https://github.com/PaddlePaddle/Paddle/pull/61859), [#60675](https://github.com/PaddlePaddle/Paddle/pull/60675), [#60719](https://github.com/PaddlePaddle/Paddle/pull/60719), [#62863](https://github.com/PaddlePaddle/Paddle/pull/62863), [#63013](https://github.com/PaddlePaddle/Paddle/pull/63013), [#61293](https://github.com/PaddlePaddle/Paddle/pull/61293), [#62781](https://github.com/PaddlePaddle/Paddle/pull/62781), [#62935](https://github.com/PaddlePaddle/Paddle/pull/62935), [#63014](https://github.com/PaddlePaddle/Paddle/pull/63014), [#64203](https://github.com/PaddlePaddle/Paddle/pull/64203), [#63349](https://github.com/PaddlePaddle/Paddle/pull/63349), [#59572](https://github.com/PaddlePaddle/Paddle/pull/59572), [#59911](https://github.com/PaddlePaddle/Paddle/pull/59911), [#59861](https://github.com/PaddlePaddle/Paddle/pull/59861), [#60014](https://github.com/PaddlePaddle/Paddle/pull/60014), [#59913](https://github.com/PaddlePaddle/Paddle/pull/59913), [#58889](https://github.com/PaddlePaddle/Paddle/pull/58889), [#60114](https://github.com/PaddlePaddle/Paddle/pull/60114), [#59928](https://github.com/PaddlePaddle/Paddle/pull/59928), [#60180](https://github.com/PaddlePaddle/Paddle/pull/60180), [#60168](https://github.com/PaddlePaddle/Paddle/pull/60168), [#60166](https://github.com/PaddlePaddle/Paddle/pull/60166), [#60250](https://github.com/PaddlePaddle/Paddle/pull/60250), 
[#60247](https://github.com/PaddlePaddle/Paddle/pull/60247), [#60172](https://github.com/PaddlePaddle/Paddle/pull/60172), [#59661](https://github.com/PaddlePaddle/Paddle/pull/59661), [#58880](https://github.com/PaddlePaddle/Paddle/pull/58880), [#60291](https://github.com/PaddlePaddle/Paddle/pull/60291), [#58881](https://github.com/PaddlePaddle/Paddle/pull/58881), [#58955](https://github.com/PaddlePaddle/Paddle/pull/58955), [#58684](https://github.com/PaddlePaddle/Paddle/pull/58684), [#58708](https://github.com/PaddlePaddle/Paddle/pull/58708), [#60323](https://github.com/PaddlePaddle/Paddle/pull/60323), [#58762](https://github.com/PaddlePaddle/Paddle/pull/58762), [#60048](https://github.com/PaddlePaddle/Paddle/pull/60048), [#60345](https://github.com/PaddlePaddle/Paddle/pull/60345), [#60325](https://github.com/PaddlePaddle/Paddle/pull/60325), [#59627](https://github.com/PaddlePaddle/Paddle/pull/59627), [#60416](https://github.com/PaddlePaddle/Paddle/pull/60416), [#60434](https://github.com/PaddlePaddle/Paddle/pull/60434), [#59801](https://github.com/PaddlePaddle/Paddle/pull/59801), [#60619](https://github.com/PaddlePaddle/Paddle/pull/60619), [#60445](https://github.com/PaddlePaddle/Paddle/pull/60445), [#60666](https://github.com/PaddlePaddle/Paddle/pull/60666), [#60353](https://github.com/PaddlePaddle/Paddle/pull/60353), [#60733](https://github.com/PaddlePaddle/Paddle/pull/60733), [#60693](https://github.com/PaddlePaddle/Paddle/pull/60693), [#60350](https://github.com/PaddlePaddle/Paddle/pull/60350), [#61096](https://github.com/PaddlePaddle/Paddle/pull/61096), [#61121](https://github.com/PaddlePaddle/Paddle/pull/61121), [#61164](https://github.com/PaddlePaddle/Paddle/pull/61164), [#62054](https://github.com/PaddlePaddle/Paddle/pull/62054), [#62136](https://github.com/PaddlePaddle/Paddle/pull/62136), [#62508](https://github.com/PaddlePaddle/Paddle/pull/62508), [#62988](https://github.com/PaddlePaddle/Paddle/pull/62988), 
[#63472](https://github.com/PaddlePaddle/Paddle/pull/63472), [#60193](https://github.com/PaddlePaddle/Paddle/pull/60193), [#60197](https://github.com/PaddlePaddle/Paddle/pull/60197), [#60198](https://github.com/PaddlePaddle/Paddle/pull/60198), [#60346](https://github.com/PaddlePaddle/Paddle/pull/60346), [#60318](https://github.com/PaddlePaddle/Paddle/pull/60318), [#60645](https://github.com/PaddlePaddle/Paddle/pull/60645), [#60650](https://github.com/PaddlePaddle/Paddle/pull/60650), [#60660](https://github.com/PaddlePaddle/Paddle/pull/60660), [#60706](https://github.com/PaddlePaddle/Paddle/pull/60706), [#60799](https://github.com/PaddlePaddle/Paddle/pull/60799), [#60837](https://github.com/PaddlePaddle/Paddle/pull/60837), [#60817](https://github.com/PaddlePaddle/Paddle/pull/60817), [#60820](https://github.com/PaddlePaddle/Paddle/pull/60820), [#60894](https://github.com/PaddlePaddle/Paddle/pull/60894), [#61079](https://github.com/PaddlePaddle/Paddle/pull/61079), [#61087](https://github.com/PaddlePaddle/Paddle/pull/61087), [#61073](https://github.com/PaddlePaddle/Paddle/pull/61073), [#61072](https://github.com/PaddlePaddle/Paddle/pull/61072), [#61127](https://github.com/PaddlePaddle/Paddle/pull/61127), [#61097](https://github.com/PaddlePaddle/Paddle/pull/61097), [#61365](https://github.com/PaddlePaddle/Paddle/pull/61365), [#61456](https://github.com/PaddlePaddle/Paddle/pull/61456), [#61846](https://github.com/PaddlePaddle/Paddle/pull/61846), [#62217](https://github.com/PaddlePaddle/Paddle/pull/62217), [#62519](https://github.com/PaddlePaddle/Paddle/pull/62519), [#62881](https://github.com/PaddlePaddle/Paddle/pull/62881), [#62880](https://github.com/PaddlePaddle/Paddle/pull/62880), [#59723](https://github.com/PaddlePaddle/Paddle/pull/59723), [#59722](https://github.com/PaddlePaddle/Paddle/pull/59722), [#59797](https://github.com/PaddlePaddle/Paddle/pull/59797), [#59960](https://github.com/PaddlePaddle/Paddle/pull/59960), 
[#59761](https://github.com/PaddlePaddle/Paddle/pull/59761), [#59996](https://github.com/PaddlePaddle/Paddle/pull/59996), [#60009](https://github.com/PaddlePaddle/Paddle/pull/60009), [#58896](https://github.com/PaddlePaddle/Paddle/pull/58896), [#60051](https://github.com/PaddlePaddle/Paddle/pull/60051), [#60410](https://github.com/PaddlePaddle/Paddle/pull/60410), [#60420](https://github.com/PaddlePaddle/Paddle/pull/60420), [#60548](https://github.com/PaddlePaddle/Paddle/pull/60548), [#60575](https://github.com/PaddlePaddle/Paddle/pull/60575), [#60726](https://github.com/PaddlePaddle/Paddle/pull/60726), [#60809](https://github.com/PaddlePaddle/Paddle/pull/60809), [#61346](https://github.com/PaddlePaddle/Paddle/pull/61346), [#61222](https://github.com/PaddlePaddle/Paddle/pull/61222), [#61099](https://github.com/PaddlePaddle/Paddle/pull/61099), [#62254](https://github.com/PaddlePaddle/Paddle/pull/62254), [#62269](https://github.com/PaddlePaddle/Paddle/pull/62269), [#62362](https://github.com/PaddlePaddle/Paddle/pull/62362)
+- Improved Paddle's low-level error checking and related mechanisms to make debugging easier for developers. [#62571](https://github.com/PaddlePaddle/Paddle/pull/62571), [#62602](https://github.com/PaddlePaddle/Paddle/pull/62602), [#60903](https://github.com/PaddlePaddle/Paddle/pull/60903), [#64695](https://github.com/PaddlePaddle/Paddle/pull/64695), [#59907](https://github.com/PaddlePaddle/Paddle/pull/59907), [#62018](https://github.com/PaddlePaddle/Paddle/pull/62018), [#62839](https://github.com/PaddlePaddle/Paddle/pull/62839), [#60651](https://github.com/PaddlePaddle/Paddle/pull/60651), [#61488](https://github.com/PaddlePaddle/Paddle/pull/61488), [#64064](https://github.com/PaddlePaddle/Paddle/pull/64064), [#63192](https://github.com/PaddlePaddle/Paddle/pull/63192), [#63525](https://github.com/PaddlePaddle/Paddle/pull/63525)
-#### Bug Fix
+### Vulnerability Fixes
+- Fixed potential security vulnerabilities. [#59957](https://github.com/PaddlePaddle/Paddle/pull/59957), [#61032](https://github.com/PaddlePaddle/Paddle/pull/61032),
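[#61356](https://github.com/PaddlePaddle/Paddle/pull/61356), [#61573](https://github.com/PaddlePaddle/Paddle/pull/61573), [#61671](https://github.com/PaddlePaddle/Paddle/pull/61671), [#62345](https://github.com/PaddlePaddle/Paddle/pull/62345), [#60097](https://github.com/PaddlePaddle/Paddle/pull/60097), [#61161](https://github.com/PaddlePaddle/Paddle/pull/61161), [#61294](https://github.com/PaddlePaddle/Paddle/pull/61294), [#61349](https://github.com/PaddlePaddle/Paddle/pull/61349), [#61344](https://github.com/PaddlePaddle/Paddle/pull/61344), [#61162](https://github.com/PaddlePaddle/Paddle/pull/61162), [#61285](https://github.com/PaddlePaddle/Paddle/pull/61285), [#61826](https://github.com/PaddlePaddle/Paddle/pull/61826), [#59967](https://github.com/PaddlePaddle/Paddle/pull/59967), [#59976](https://github.com/PaddlePaddle/Paddle/pull/59976), [#59979](https://github.com/PaddlePaddle/Paddle/pull/59979), [#60527](https://github.com/PaddlePaddle/Paddle/pull/60527), [#60646](https://github.com/PaddlePaddle/Paddle/pull/60646), [#61827](https://github.com/PaddlePaddle/Paddle/pull/61827)
-- Fixed abnormal GPU memory usage in some dynamic-to-static scenarios when is_test=True. [#58350](https://github.com/PaddlePaddle/Paddle/pull/58350)
-- Fixed a jit.save model export issue for functions decorated with @to_static in cases such as foo(x,x,y). [#55963](https://github.com/PaddlePaddle/Paddle/pull/55963)
-- Fixed inconsistent dynamic/static logic in the behavior of some APIs, improving the success rate of whole-graph dynamic-to-static conversion and the user experience. [#56092](https://github.com/PaddlePaddle/Paddle/pull/56092)
+### Deprecated Features
+- Cleaned up deprecated executor logic and similar code to reduce redundancy. [#64822](https://github.com/PaddlePaddle/Paddle/pull/64822), [#60941](https://github.com/PaddlePaddle/Paddle/pull/60941)
-#### Vulnerability Fixes
+## 3. Compiler Architecture
+In version 3.0, the compiler architecture received a major upgrade. A system for automatic symbol derivation and simplification is built on Shape Dialect, supporting symbolic expressions and constraint construction, which enables end-to-end compiler execution under dynamic shapes. The Paddle compiler CINN also upgrades its automatic subgraph fusion and Pass Pipeline mechanisms, merging the core modules and iteration paths for dynamic and static shapes into a clear, unified architecture. In this release, key backend modules such as AST Compute, Schedule strategies, and Tiling were refactored to improve the compiler's general optimization capability. Training and inference correctness under dynamic shapes, together with the resulting speedups, were verified on model subgraphs from Paddle's industrial suites and on typical large models such as Llama2-13B and Stable Diffusion.
-- Fixed a potential security vulnerability caused by the use of eval() in the dynamic-to-static syntax transcription module. [#60100](https://github.com/PaddlePaddle/Paddle/pull/60100)
+### New Features
+1. 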
Upgraded to a new automatic subgraph fusion mechanism and proposed a fusion theory for TrivialOp and ReduceOp. It supports a wider range of vertical and horizontal fusion, ensures the correctness and robustness of subgraph fusion, and makes fuller use of the fusion potential of the neural network compiler ([#63340](https://github.com/PaddlePaddle/Paddle/pull/63340), [#63913](https://github.com/PaddlePaddle/Paddle/pull/63913), [#63579](https://github.com/PaddlePaddle/Paddle/pull/63579), [#63605](https://github.com/PaddlePaddle/Paddle/pull/63605), [#60769](https://github.com/PaddlePaddle/Paddle/pull/60769), [#62088](https://github.com/PaddlePaddle/Paddle/pull/62088), [#63124](https://github.com/PaddlePaddle/Paddle/pull/63124), [#63658](https://github.com/PaddlePaddle/Paddle/pull/63658), [#64557](https://github.com/PaddlePaddle/Paddle/pull/64557), [#63318](https://github.com/PaddlePaddle/Paddle/pull/63318), [#62545](https://github.com/PaddlePaddle/Paddle/pull/62545))
+2. Added symbolic derivation for dynamic shapes. Based on Shape Dialect, it implements dynamic symbol construction, automatic derivation, constraint expression, and symbol simplification, introduces the DimExpr concept, and adds InferSymbolicShape logic for more than 150 typical basic operators in the Paddle framework, giving the compiler more information for training and inference under dynamic shapes ([#60843](https://github.com/PaddlePaddle/Paddle/pull/60843), [#62662](https://github.com/PaddlePaddle/Paddle/pull/62662), [#63790](https://github.com/PaddlePaddle/Paddle/pull/63790), [#60098](https://github.com/PaddlePaddle/Paddle/pull/60098), [#60511](https://github.com/PaddlePaddle/Paddle/pull/60511), [#61232](https://github.com/PaddlePaddle/Paddle/pull/61232), [#61939](https://github.com/PaddlePaddle/Paddle/pull/61939), [#62798](https://github.com/PaddlePaddle/Paddle/pull/62798), [#62955](https://github.com/PaddlePaddle/Paddle/pull/62955), [#63029](https://github.com/PaddlePaddle/Paddle/pull/63029), [#60572](https://github.com/PaddlePaddle/Paddle/pull/60572), [#61035](https://github.com/PaddlePaddle/Paddle/pull/61035), [#61224](https://github.com/PaddlePaddle/Paddle/pull/61224), [#61587](https://github.com/PaddlePaddle/Paddle/pull/61587), [#61937](https://github.com/PaddlePaddle/Paddle/pull/61937), [#62314](https://github.com/PaddlePaddle/Paddle/pull/62314), [#62394](https://github.com/PaddlePaddle/Paddle/pull/62394), [#62569](https://github.com/PaddlePaddle/Paddle/pull/62569), [#62495](https://github.com/PaddlePaddle/Paddle/pull/62495), [#62844](https://github.com/PaddlePaddle/Paddle/pull/62844), [#63000](https://github.com/PaddlePaddle/Paddle/pull/63000), [#63016](https://github.com/PaddlePaddle/Paddle/pull/63016), [#64222](https://github.com/PaddlePaddle/Paddle/pull/64222), [#60129](https://github.com/PaddlePaddle/Paddle/pull/60129), [#60899](https://github.com/PaddlePaddle/Paddle/pull/60899), [#61342](https://github.com/PaddlePaddle/Paddle/pull/61342), [#61439](https://github.com/PaddlePaddle/Paddle/pull/61439), [#62766](https://github.com/PaddlePaddle/Paddle/pull/62766), [#61133](https://github.com/PaddlePaddle/Paddle/pull/61133), [#61430](https://github.com/PaddlePaddle/Paddle/pull/61430), [#61498](https://github.com/PaddlePaddle/Paddle/pull/61498), [#61680](https://github.com/PaddlePaddle/Paddle/pull/61680), [#63367](https://github.com/PaddlePaddle/Paddle/pull/63367), [#62151](https://github.com/PaddlePaddle/Paddle/pull/62151), [#62665](https://github.com/PaddlePaddle/Paddle/pull/62665), [#61407](https://github.com/PaddlePaddle/Paddle/pull/61407), [#61502](https://github.com/PaddlePaddle/Paddle/pull/61502), [#61655](https://github.com/PaddlePaddle/Paddle/pull/61655), [#64115](https://github.com/PaddlePaddle/Paddle/pull/64115), [#61791](https://github.com/PaddlePaddle/Paddle/pull/61791), [#62141](https://github.com/PaddlePaddle/Paddle/pull/62141), [#63422](https://github.com/PaddlePaddle/Paddle/pull/63422), [#63577](https://github.com/PaddlePaddle/Paddle/pull/63577), [#63978](https://github.com/PaddlePaddle/Paddle/pull/63978), [#63576](https://github.com/PaddlePaddle/Paddle/pull/63576), [#63947](https://github.com/PaddlePaddle/Paddle/pull/63947), [#64332](https://github.com/PaddlePaddle/Paddle/pull/64332), [#63990](https://github.com/PaddlePaddle/Paddle/pull/63990))
+3. 
Added the Pass Pipeline feature, including Pass strategies such as PdToCinn, CinnPreprocess, BuildGroupOp, DivideGroupOp, CinnLowering, and accuracy checking. It uniformly supports Lowering and execution of subgraphs under both dynamic and static shapes, with a clear architecture ([#61611](https://github.com/PaddlePaddle/Paddle/pull/61611), [#62612](https://github.com/PaddlePaddle/Paddle/pull/62612), [#64354](https://github.com/PaddlePaddle/Paddle/pull/64354), [#61848](https://github.com/PaddlePaddle/Paddle/pull/61848), [#62316](https://github.com/PaddlePaddle/Paddle/pull/62316), [#64152](https://github.com/PaddlePaddle/Paddle/pull/64152), [#61619](https://github.com/PaddlePaddle/Paddle/pull/61619), [#62318](https://github.com/PaddlePaddle/Paddle/pull/62318), [#61977](https://github.com/PaddlePaddle/Paddle/pull/61977), [#62211](https://github.com/PaddlePaddle/Paddle/pull/62211), [#63972](https://github.com/PaddlePaddle/Paddle/pull/63972), [#63686](https://github.com/PaddlePaddle/Paddle/pull/63686), [#64505](https://github.com/PaddlePaddle/Paddle/pull/64505))
+4. Added BucketLower and DyShapeSchedule, which implement automatic bucketed compilation optimization according to the range of dynamic shapes. The CodeGen module logic was adapted and upgraded to support InferShape function generation and conditional-branch dispatch in Host functions, which supports accelerated training and inference of large models under dynamic shapes ([#62730](https://github.com/PaddlePaddle/Paddle/pull/62730), [#61115](https://github.com/PaddlePaddle/Paddle/pull/61115), [#59941](https://github.com/PaddlePaddle/Paddle/pull/59941), [#62207](https://github.com/PaddlePaddle/Paddle/pull/62207), [#64318](https://github.com/PaddlePaddle/Paddle/pull/64318), [#64345](https://github.com/PaddlePaddle/Paddle/pull/64345), [#60519](https://github.com/PaddlePaddle/Paddle/pull/60519), [#62584](https://github.com/PaddlePaddle/Paddle/pull/62584), [#60828](https://github.com/PaddlePaddle/Paddle/pull/60828), [#60533](https://github.com/PaddlePaddle/Paddle/pull/60533), [#61436](https://github.com/PaddlePaddle/Paddle/pull/61436), [#62071](https://github.com/PaddlePaddle/Paddle/pull/62071), [#63971](https://github.com/PaddlePaddle/Paddle/pull/63971), [#61656](https://github.com/PaddlePaddle/Paddle/pull/61656), [#63083](https://github.com/PaddlePaddle/Paddle/pull/63083), [#64405](https://github.com/PaddlePaddle/Paddle/pull/64405), [#63047](https://github.com/PaddlePaddle/Paddle/pull/63047), [#64655](https://github.com/PaddlePaddle/Paddle/pull/64655), [#63095](https://github.com/PaddlePaddle/Paddle/pull/63095), [#63829](https://github.com/PaddlePaddle/Paddle/pull/63829), [#63572](https://github.com/PaddlePaddle/Paddle/pull/63572))
+5. Added a compilation cache strategy that automatically identifies, merges, and reuses compilation results for identical subgraph structures, and uses multithreading to improve compilation efficiency and the user experience ([#62952](https://github.com/PaddlePaddle/Paddle/pull/62952), [#63269](https://github.com/PaddlePaddle/Paddle/pull/63269), [#64718](https://github.com/PaddlePaddle/Paddle/pull/64718), [#61367](https://github.com/PaddlePaddle/Paddle/pull/61367), [#63305](https://github.com/PaddlePaddle/Paddle/pull/63305), [#63750](https://github.com/PaddlePaddle/Paddle/pull/63750), [#63871](https://github.com/PaddlePaddle/Paddle/pull/63871), [#64893](https://github.com/PaddlePaddle/Paddle/pull/64893))
+6. 
Added the GenerateShape mechanism together with the corresponding AST Compute operator definition, supporting automatic parsing of dynamic symbols and automatic generation of ShapeOp during the Lowering stage ([#64167](https://github.com/PaddlePaddle/Paddle/pull/64167), [#64636](https://github.com/PaddlePaddle/Paddle/pull/64636), [#61993](https://github.com/PaddlePaddle/Paddle/pull/61993), [#64843](https://github.com/PaddlePaddle/Paddle/pull/64843), [#62587](https://github.com/PaddlePaddle/Paddle/pull/62587))
-### Dynamic Graph Distributed Capability Enhancements
+### Functional Improvements
+1. Optimized the BuildCinnPass logic and upgraded how the compiler recognizes operators on the allow and deny lists, improving the robustness of the Pass logic ([#62372](https://github.com/PaddlePaddle/Paddle/pull/62372), [#61081](https://github.com/PaddlePaddle/Paddle/pull/61081), [#61225](https://github.com/PaddlePaddle/Paddle/pull/61225), [#58863](https://github.com/PaddlePaddle/Paddle/pull/58863))
+2. Optimized the OpLoweringGroup data structure by removing unnecessary interfaces and members, reducing coupling between upstream and downstream modules ([#62339](https://github.com/PaddlePaddle/Paddle/pull/62339))
+3. Optimized the design of the compiler's architecture (Arch) components by abstracting the hardware concept, lowering the cost of adapting domestically produced hardware ([#63530](https://github.com/PaddlePaddle/Paddle/pull/63530), [#64347](https://github.com/PaddlePaddle/Paddle/pull/64347), [#64506](https://github.com/PaddlePaddle/Paddle/pull/64506), [#64587](https://github.com/PaddlePaddle/Paddle/pull/64587))
+4. Upgraded the compiler backend operator module AST Compute to support the computation logic for dynamic shapes ([#62488](https://github.com/PaddlePaddle/Paddle/pull/62488), [#63581](https://github.com/PaddlePaddle/Paddle/pull/63581), [#63687](https://github.com/PaddlePaddle/Paddle/pull/63687), [#63654](https://github.com/PaddlePaddle/Paddle/pull/63654), [#64217](https://github.com/PaddlePaddle/Paddle/pull/64217))
+
+### Performance Optimization
+1. 
Optimized the Schedule logic of AST IR and refactored core modules such as Vectorize, Unroll, AxisBind, and ComputeAt, merging the iteration paths for dynamic and static shapes to reduce development and maintenance costs ([#60449](https://github.com/PaddlePaddle/Paddle/pull/60449), [#60155](https://github.com/PaddlePaddle/Paddle/pull/60155), [#60342](https://github.com/PaddlePaddle/Paddle/pull/60342), [#60498](https://github.com/PaddlePaddle/Paddle/pull/60498), [#60538](https://github.com/PaddlePaddle/Paddle/pull/60538), [#60190](https://github.com/PaddlePaddle/Paddle/pull/60190), [#61197](https://github.com/PaddlePaddle/Paddle/pull/61197), [#63140](https://github.com/PaddlePaddle/Paddle/pull/63140), [#61156](https://github.com/PaddlePaddle/Paddle/pull/61156))
+2. Optimized the Tiling strategy and the temp Buffer feature, supporting warp-level contiguous memory reads as well as cache_read and cache_write, to improve subgraph execution performance ([#64240](https://github.com/PaddlePaddle/Paddle/pull/64240), [#60562](https://github.com/PaddlePaddle/Paddle/pull/60562), [#64711](https://github.com/PaddlePaddle/Paddle/pull/64711), [#62856](https://github.com/PaddlePaddle/Paddle/pull/62856), [#61576](https://github.com/PaddlePaddle/Paddle/pull/61576), [#61901](https://github.com/PaddlePaddle/Paddle/pull/61901), [#62581](https://github.com/PaddlePaddle/Paddle/pull/62581), [#61987](https://github.com/PaddlePaddle/Paddle/pull/61987), [#60190](https://github.com/PaddlePaddle/Paddle/pull/60190), [#63138](https://github.com/PaddlePaddle/Paddle/pull/63138), [#62517](https://github.com/PaddlePaddle/Paddle/pull/62517))
+3. Added automatic search for Schedule configurations, with an AOT-style offline saving mechanism, to speed up subgraph Kernels ([#64271](https://github.com/PaddlePaddle/Paddle/pull/64271), [#64588](https://github.com/PaddlePaddle/Paddle/pull/64588), [#64694](https://github.com/PaddlePaddle/Paddle/pull/64694), [#64620](https://github.com/PaddlePaddle/Paddle/pull/64620), [#64702](https://github.com/PaddlePaddle/Paddle/pull/64702), [#63086](https://github.com/PaddlePaddle/Paddle/pull/63086))
+4. 
Added the OptimizeReductionTactic optimization strategy to improve kernel performance in Reduce scenarios ([#60661](https://github.com/PaddlePaddle/Paddle/pull/60661), [#61363](https://github.com/PaddlePaddle/Paddle/pull/61363), [#60881](https://github.com/PaddlePaddle/Paddle/pull/60881), [#63859](https://github.com/PaddlePaddle/Paddle/pull/63859))
+5. Enhanced the DCE Pass to remove redundant If/For branch code, improving execution efficiency ([#61682](https://github.com/PaddlePaddle/Paddle/pull/61682))
+6. Added FuseParallelMatmulPass, which can fuse multiple Matmul operators for acceleration ([#63623](https://github.com/PaddlePaddle/Paddle/pull/63623))
+
+### Bug Fixes
+1. Fixed bugs that occurred when some special operators were lowered to the compiler, improving the end-to-end user experience ([#60800](https://github.com/PaddlePaddle/Paddle/pull/60800), [#64720](https://github.com/PaddlePaddle/Paddle/pull/64720), [#62593](https://github.com/PaddlePaddle/Paddle/pull/62593), [#62661](https://github.com/PaddlePaddle/Paddle/pull/62661), [#64626](https://github.com/PaddlePaddle/Paddle/pull/64626), [#63320](https://github.com/PaddlePaddle/Paddle/pull/63320), [#64581](https://github.com/PaddlePaddle/Paddle/pull/64581), [#61608](https://github.com/PaddlePaddle/Paddle/pull/61608), [#64135](https://github.com/PaddlePaddle/Paddle/pull/64135), [#64659](https://github.com/PaddlePaddle/Paddle/pull/64659), [#62391](https://github.com/PaddlePaddle/Paddle/pull/62391), [#62490](https://github.com/PaddlePaddle/Paddle/pull/62490), [#63891](https://github.com/PaddlePaddle/Paddle/pull/63891), [#64529](https://github.com/PaddlePaddle/Paddle/pull/64529))
+2. 
Fixed bugs in the symbolic derivation logic of some operators ([#62141](https://github.com/PaddlePaddle/Paddle/pull/62141), [#62376](https://github.com/PaddlePaddle/Paddle/pull/62376), [#62941](https://github.com/PaddlePaddle/Paddle/pull/62941), [#63322](https://github.com/PaddlePaddle/Paddle/pull/63322), [#64672](https://github.com/PaddlePaddle/Paddle/pull/64672), [#64407](https://github.com/PaddlePaddle/Paddle/pull/64407), [#60241](https://github.com/PaddlePaddle/Paddle/pull/60241), [#60440](https://github.com/PaddlePaddle/Paddle/pull/60440), [#62503](https://github.com/PaddlePaddle/Paddle/pull/62503), [#62997](https://github.com/PaddlePaddle/Paddle/pull/62997), [#63169](https://github.com/PaddlePaddle/Paddle/pull/63169), [#61098](https://github.com/PaddlePaddle/Paddle/pull/61098), [#63973](https://github.com/PaddlePaddle/Paddle/pull/63973), [#62248](https://github.com/PaddlePaddle/Paddle/pull/62248), [#62321](https://github.com/PaddlePaddle/Paddle/pull/62321), [#63755](https://github.com/PaddlePaddle/Paddle/pull/63755), [#63917](https://github.com/PaddlePaddle/Paddle/pull/63917), [#63903](https://github.com/PaddlePaddle/Paddle/pull/63903), [#64173](https://github.com/PaddlePaddle/Paddle/pull/64173), [#64525](https://github.com/PaddlePaddle/Paddle/pull/64525), [#64615](https://github.com/PaddlePaddle/Paddle/pull/64615), [#62247](https://github.com/PaddlePaddle/Paddle/pull/62247), [#62455](https://github.com/PaddlePaddle/Paddle/pull/62455), [#62898](https://github.com/PaddlePaddle/Paddle/pull/62898), [#62867](https://github.com/PaddlePaddle/Paddle/pull/62867), [#63608](https://github.com/PaddlePaddle/Paddle/pull/63608), [#63789](https://github.com/PaddlePaddle/Paddle/pull/63789), [#64085](https://github.com/PaddlePaddle/Paddle/pull/64085), [#64136](https://github.com/PaddlePaddle/Paddle/pull/64136), [#64181](https://github.com/PaddlePaddle/Paddle/pull/64181))
+3. 
Fixed a number of issues that caused incorrect compiler execution results under dynamic and static shapes, improving the robustness of the framework mechanisms ([#60813](https://github.com/PaddlePaddle/Paddle/pull/60813), [#61877](https://github.com/PaddlePaddle/Paddle/pull/61877), [#61909](https://github.com/PaddlePaddle/Paddle/pull/61909), [#62954](https://github.com/PaddlePaddle/Paddle/pull/62954), [#63614](https://github.com/PaddlePaddle/Paddle/pull/63614), [#60339](https://github.com/PaddlePaddle/Paddle/pull/60339), [#60623](https://github.com/PaddlePaddle/Paddle/pull/60623), [#60658](https://github.com/PaddlePaddle/Paddle/pull/60658), [#60669](https://github.com/PaddlePaddle/Paddle/pull/60669), [#58823](https://github.com/PaddlePaddle/Paddle/pull/58823), [#62483](https://github.com/PaddlePaddle/Paddle/pull/62483), [#62742](https://github.com/PaddlePaddle/Paddle/pull/62742), [#61797](https://github.com/PaddlePaddle/Paddle/pull/61797), [#63411](https://github.com/PaddlePaddle/Paddle/pull/63411), [#64077](https://github.com/PaddlePaddle/Paddle/pull/64077), [#62736](https://github.com/PaddlePaddle/Paddle/pull/62736), [#62390](https://github.com/PaddlePaddle/Paddle/pull/62390), [#63689](https://github.com/PaddlePaddle/Paddle/pull/63689))
+
+### Deprecated Features
+1. Removed unused symbol-related components such as adt DimExpr, SymbolicDimExpr, and ShapedTypeInterface ([#60901](https://github.com/PaddlePaddle/Paddle/pull/60901), [#60933](https://github.com/PaddlePaddle/Paddle/pull/60933), [#60744](https://github.com/PaddlePaddle/Paddle/pull/60744), [#64176](https://github.com/PaddlePaddle/Paddle/pull/64176), [#64140](https://github.com/PaddlePaddle/Paddle/pull/64140))
+2. 
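Removed the old Group Cluster, the frontend representation under the old IR, and other related components, making the architecture simpler ([#63683](https://github.com/PaddlePaddle/Paddle/pull/63683), [#64630](https://github.com/PaddlePaddle/Paddle/pull/64630), [#61380](https://github.com/PaddlePaddle/Paddle/pull/61380))
+
+## 4. Auto Parallel Architecture
+To further improve the usability of the Auto Parallel architecture in large-model training scenarios, Paddle has improved auto parallel for both dynamic and static graphs. This includes adding parallel strategies such as Sharding and interleaved pipeline, supporting lazy parameter initialization, and adding or improving the sharding inference rules for some operators. The Auto Parallel architecture has been fully verified on multiple mainstream large language models. In addition, as part of building Paddle's new 3.0 architecture, the static-graph Auto Parallel architecture has been comprehensively upgraded on top of the new-generation intermediate representation PIR. It adds DistDialect as an extension, natively supports distributed attributes (DistAttr) and distributed tensors (DistTensor) in the computation graph representation, and connects the full static-graph auto parallel workflow, further strengthening the dynamic-static unification of Auto Parallel and the consistency of the Paddle architecture. Finally, several performance optimization techniques were added or improved, including the zero bubble pipeline scheduling strategy, so that end-to-end training performance on typical large models such as Llama-2 13B/70B matches or exceeds manual parallelism.
+
+### Functional Improvements
+- Added the dtensor_from_local interface, which creates a DistTensor from a local tensor that has already been sharded (by contrast, shard_tensor creates a DistTensor from the global tensor before sharding). [#60206](https://github.com/PaddlePaddle/Paddle/pull/60206)
+- Added the unshard_tensor interface, which converts a DistTensor into a global tensor; it is the inverse of shard_tensor. [#60272](https://github.com/PaddlePaddle/Paddle/pull/60272)
+- To reduce GPU memory usage during training, added the Sharding strategy, including stage1, stage2, and stage3. [#61926](https://github.com/PaddlePaddle/Paddle/pull/61926), [#62711](https://github.com/PaddlePaddle/Paddle/pull/62711), [#62486](https://github.com/PaddlePaddle/Paddle/pull/62486), [#62230](https://github.com/PaddlePaddle/Paddle/pull/62230)
+- To avoid running out of GPU memory when parameters are initialized before being sharded, added the LazyInit feature for auto parallel parameters, which shards parameters first and initializes them afterwards. [#60316](https://github.com/PaddlePaddle/Paddle/pull/60316), [#60441](https://github.com/PaddlePaddle/Paddle/pull/60441), [#60563](https://github.com/PaddlePaddle/Paddle/pull/60563), [#61792](https://github.com/PaddlePaddle/Paddle/pull/61792)
+- To reduce bubbles in pipeline parallelism, added the interleaved pipeline parallel strategy. A configuration option can also convert the pipeline parallelism in a user's network into interleaved pipeline parallelism automatically, so users do not need complex annotations in the network definition. [#59751](https://github.com/PaddlePaddle/Paddle/pull/59751), [#60050](https://github.com/PaddlePaddle/Paddle/pull/60050), [#60467](https://github.com/PaddlePaddle/Paddle/pull/60467), [#60868](https://github.com/PaddlePaddle/Paddle/pull/60868), [#60187](https://github.com/PaddlePaddle/Paddle/pull/60187),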
[#62884](https://github.com/PaddlePaddle/Paddle/pull/62884), [#60560](https://github.com/PaddlePaddle/Paddle/pull/60560), [#61541](https://github.com/PaddlePaddle/Paddle/pull/61541) +- 新增 stack, gather, scatter_grad, cumsum, unbind, swiglu, fused_linear_param_grad 等算子的切分推导规则,完善和优化 fused_rope, reshape, flatten, fused_rms_norm, slice, tile, flash_attn, cross_entropy 等算子切分推导规则实现,解决在部分模型组网场景中不兼容的问题。[#62720](https://github.com/PaddlePaddle/Paddle/pull/62720), [#64202](https://github.com/PaddlePaddle/Paddle/pull/64202), [#63361](https://github.com/PaddlePaddle/Paddle/pull/63361), [#63290](https://github.com/PaddlePaddle/Paddle/pull/63290), [#61460](https://github.com/PaddlePaddle/Paddle/pull/61460), [#59986](https://github.com/PaddlePaddle/Paddle/pull/59986), [#61184](https://github.com/PaddlePaddle/Paddle/pull/61184), [#60144](https://github.com/PaddlePaddle/Paddle/pull/60144), [#62525](https://github.com/PaddlePaddle/Paddle/pull/62525), [#62053](https://github.com/PaddlePaddle/Paddle/pull/62053), [#60709](https://github.com/PaddlePaddle/Paddle/pull/60709), [#60111](https://github.com/PaddlePaddle/Paddle/pull/60111), [#63681](https://github.com/PaddlePaddle/Paddle/pull/63681), [#62180](https://github.com/PaddlePaddle/Paddle/pull/62180), [#60794](https://github.com/PaddlePaddle/Paddle/pull/60794), [#60632](https://github.com/PaddlePaddle/Paddle/pull/60632), [#62439](https://github.com/PaddlePaddle/Paddle/pull/62439) +- 完善分布式 checkpoint 存储和加载功能,支持 master_weights 存储,修复随机挂问题。[#60027](https://github.com/PaddlePaddle/Paddle/pull/60027), [#59872](https://github.com/PaddlePaddle/Paddle/pull/59872) +- 为支持任意 shape 张量的自动并行,新增支持张量非均匀切分特性。[#62611](https://github.com/PaddlePaddle/Paddle/pull/62611), [#61432](https://github.com/PaddlePaddle/Paddle/pull/61432) +- 为支持用户在自动并行组网中使用自定义算子,支持用户在框架外注册自定义该类算子的切分推导规则。 [#60509](https://github.com/PaddlePaddle/Paddle/pull/60509) +- 完善切分转换规则,支持从任意状态转为 replicate 以及从 replicate 状态转换为任意状态。[#60281](https://github.com/PaddlePaddle/Paddle/pull/60281), 
[#59869](https://github.com/PaddlePaddle/Paddle/pull/59869) +- 新增 MoE 专家并行策略(experimental),目前仅支持动态图自动并行。[#63904](https://github.com/PaddlePaddle/Paddle/pull/63904) +- 修复自动并行与动态图执行、动转静等流程适配的部分问题。[#60214](https://github.com/PaddlePaddle/Paddle/pull/60214), [#60546](https://github.com/PaddlePaddle/Paddle/pull/60546), [#62082](https://github.com/PaddlePaddle/Paddle/pull/62082), [#61313](https://github.com/PaddlePaddle/Paddle/pull/61313), [#61840](https://github.com/PaddlePaddle/Paddle/pull/61840), [#60614](https://github.com/PaddlePaddle/Paddle/pull/60614), [#60234](https://github.com/PaddlePaddle/Paddle/pull/60234), [#64813](https://github.com/PaddlePaddle/Paddle/pull/64813), [#61606](https://github.com/PaddlePaddle/Paddle/pull/61606), [#63405](https://github.com/PaddlePaddle/Paddle/pull/63405), [#64334](https://github.com/PaddlePaddle/Paddle/pull/64334), [#60504](https://github.com/PaddlePaddle/Paddle/pull/60504) + +### 性能优化 +- 为减少流水线并行中的 bubble,支持 backward 中参数和激活的反向计算拆分,新增 zero bubble pipeline 调度策略,提升训练性能。[#62865](https://github.com/PaddlePaddle/Paddle/pull/62865), [#62737](https://github.com/PaddlePaddle/Paddle/pull/62737), [#64534](https://github.com/PaddlePaddle/Paddle/pull/64534) +- 为提升序列并行(sequence parallel)的性能,对相关通信操作和计算操作进行 fusion,并优化冗余的 transpose 操作。[#64807](https://github.com/PaddlePaddle/Paddle/pull/64807), [#63948](https://github.com/PaddlePaddle/Paddle/pull/63948), [#64316](https://github.com/PaddlePaddle/Paddle/pull/64316), [#64119](https://github.com/PaddlePaddle/Paddle/pull/64119) +- 优化静态图自动并行图优化耗时,减少从启动训练到第一个 step 完成的延时。[#59912](https://github.com/PaddlePaddle/Paddle/pull/59912), [#61817](https://github.com/PaddlePaddle/Paddle/pull/61817), [#60022](https://github.com/PaddlePaddle/Paddle/pull/60022), [#60125](https://github.com/PaddlePaddle/Paddle/pull/60125) +- 优化混合并行场景下相关通信操作的耗时。[#62157](https://github.com/PaddlePaddle/Paddle/pull/62157), [#61622](https://github.com/PaddlePaddle/Paddle/pull/61622) +- 
优化自动并行动转静下参数的冗余显存占用。[#62746](https://github.com/PaddlePaddle/Paddle/pull/62746) +- 完善自动并行的混合精度训练功能,支持设置局部 auto_cast 和黑白名单,支持 master grad 功能,适配不同的并行策略等。[#60158](https://github.com/PaddlePaddle/Paddle/pull/60158), [#59987](https://github.com/PaddlePaddle/Paddle/pull/59987), [#62629](https://github.com/PaddlePaddle/Paddle/pull/62629), [#60385](https://github.com/PaddlePaddle/Paddle/pull/60385), [#62015](https://github.com/PaddlePaddle/Paddle/pull/62015), [#60514](https://github.com/PaddlePaddle/Paddle/pull/60514), [#61221](https://github.com/PaddlePaddle/Paddle/pull/61221), [#60779](https://github.com/PaddlePaddle/Paddle/pull/60779), [#63228](https://github.com/PaddlePaddle/Paddle/pull/63228) +- 优化 type promotion 和 amp 带来的非必要的 cast,提升性能。[#63293](https://github.com/PaddlePaddle/Paddle/pull/63293), [#63228](https://github.com/PaddlePaddle/Paddle/pull/63228) + +### 静态图自动并行架构升级 +- 基于新一代中间表示 PIR,新增 DistDialect,在计算图表示中原生支持了分布式属性(DistAttr)和分布式张量(DistTensor),实现了分布式属性和张量或算子的直接绑定,使自动并行架构更简洁统一。[#63828](https://github.com/PaddlePaddle/Paddle/pull/63828), [#64299](https://github.com/PaddlePaddle/Paddle/pull/64299), [#63870](https://github.com/PaddlePaddle/Paddle/pull/63870), [#64144](https://github.com/PaddlePaddle/Paddle/pull/64144), [#62524](https://github.com/PaddlePaddle/Paddle/pull/62524), [#62630](https://github.com/PaddlePaddle/Paddle/pull/62630), [#62897](https://github.com/PaddlePaddle/Paddle/pull/62897), [#60478](https://github.com/PaddlePaddle/Paddle/pull/60478), [#60574](https://github.com/PaddlePaddle/Paddle/pull/60574), [#63876](https://github.com/PaddlePaddle/Paddle/pull/63876), [#63798](https://github.com/PaddlePaddle/Paddle/pull/63798), [#62560](https://github.com/PaddlePaddle/Paddle/pull/62560), [#63676](https://github.com/PaddlePaddle/Paddle/pull/63676) +- 完成自动并行 PIR 新架构对 shard_tensor、reshard、to_static 等 API 的适配,支持用户将动态图模型组网直接转成 PIR 静态计算图进行优化和训练。[#62945](https://github.com/PaddlePaddle/Paddle/pull/62945), 
[#62356](https://github.com/PaddlePaddle/Paddle/pull/62356), [#60175](https://github.com/PaddlePaddle/Paddle/pull/60175), [#62654](https://github.com/PaddlePaddle/Paddle/pull/62654), [#63347](https://github.com/PaddlePaddle/Paddle/pull/63347) +- 优化静态图自动并行的图优化编译过程,通过重构优化静半中计算图切分和通信解析两个主要过程的实现,减少静态图编译优化耗时。[#64137](https://github.com/PaddlePaddle/Paddle/pull/64137), [#62201](https://github.com/PaddlePaddle/Paddle/pull/62201), [#64143](https://github.com/PaddlePaddle/Paddle/pull/64143), [#62560](https://github.com/PaddlePaddle/Paddle/pull/62560) +- 优化静态图中切分推导规则的调用流程,实现切分推导结果在动-静态图下的一致,提升了架构的统一性和稳定性。 [#62659](https://github.com/PaddlePaddle/Paddle/pull/62659), [#62547](https://github.com/PaddlePaddle/Paddle/pull/62547), [#63117](https://github.com/PaddlePaddle/Paddle/pull/63117), [#63434](https://github.com/PaddlePaddle/Paddle/pull/63434), [#63770](https://github.com/PaddlePaddle/Paddle/pull/63770), [#64361](https://github.com/PaddlePaddle/Paddle/pull/64361), [#63073](https://github.com/PaddlePaddle/Paddle/pull/63073) +- 升级静态图中张量切分转换的实现,动-静态图下使用一致的切分转换通信规则,保障动-静态图下张量切分转换执行逻辑和结果的一致性,提升用户体验。[#62718](https://github.com/PaddlePaddle/Paddle/pull/62718), [#62694](https://github.com/PaddlePaddle/Paddle/pull/62694), [#60215](https://github.com/PaddlePaddle/Paddle/pull/60215), [#63362](https://github.com/PaddlePaddle/Paddle/pull/63362), [#63072](https://github.com/PaddlePaddle/Paddle/pull/63072), [#63962](https://github.com/PaddlePaddle/Paddle/pull/63962), [#64223](https://github.com/PaddlePaddle/Paddle/pull/64223), [#61796](https://github.com/PaddlePaddle/Paddle/pull/61796), [#64465](https://github.com/PaddlePaddle/Paddle/pull/64465), [#64623](https://github.com/PaddlePaddle/Paddle/pull/64623), [#64418](https://github.com/PaddlePaddle/Paddle/pull/64418) + +### 训练策略自动搜索和调优 +为提升训练策略自动搜索和调优工具(AutoTuner)的易用性,支持用户自定义搜索项,支持设置搜索项的优先级,支持用户配置不合法的策略组合,全面增强了运行时和运行后日志中的报错信息,支持在 NPU 设备上进行 AutoTuner。[#60101](https://github.com/PaddlePaddle/Paddle/pull/60101), 
[#60294](https://github.com/PaddlePaddle/Paddle/pull/60294), [#61898](https://github.com/PaddlePaddle/Paddle/pull/61898), [#60248](https://github.com/PaddlePaddle/Paddle/pull/60248), [#60417](https://github.com/PaddlePaddle/Paddle/pull/60417), [#60954](https://github.com/PaddlePaddle/Paddle/pull/60954), [#61499](https://github.com/PaddlePaddle/Paddle/pull/61499), [#62724](https://github.com/PaddlePaddle/Paddle/pull/62724), [#63693](https://github.com/PaddlePaddle/Paddle/pull/63693), [#62853](https://github.com/PaddlePaddle/Paddle/pull/62853), [#62984](https://github.com/PaddlePaddle/Paddle/pull/62984) + +## 5.CUDA 训练性能优化 +本次升级从算子计算效率、分布式通信优化、显存优化等多个角度实现了大模型训练效率的提升。 + +### 功能完善 +- FlashAttention 算子功能增强,包含支持 NVIDIA SM90 GPU 编译,支持 Group Query Attention,支持 cuDNN 接入,支持 QKV-packed 形式输入等。[#59820](https://github.com/PaddlePaddle/Paddle/pull/59820),[#60776](https://github.com/PaddlePaddle/Paddle/pull/60776),[#58680](https://github.com/PaddlePaddle/Paddle/pull/58680),[#63289](https://github.com/PaddlePaddle/Paddle/pull/63289) +- repeat_interleave 算子添加 BFloat16 数据类型的支持。[#61854](https://github.com/PaddlePaddle/Paddle/pull/61854) +- 针对 fused_scale_bias_add_relu、fused_scale_bias_relu_conv_bn、fused_dconv_drelu_dbn 等 ResNet 类模型接口参数多、算子易用性差等问题,添加了 fuse_resunit pass,支持上述算子的自动融合,实现通用性能优化。([#59771](https://github.com/PaddlePaddle/Paddle/pull/59771)) -为了满足大型模型的需求,本版本重点提升了飞桨动态图的分布式计算能力。在通信库、图分析、分布式策略和任务启停等方面进行了多方面的改进,为大型模型训练提供了全面的支持。在性能方面,我们通过减少流水并行 GPU 显存占用、采用 TensorFusion 技术、实现通信计算 overlap 以及减少非必要的数据同步拷贝等方式,进一步提升了训练性能。同时,通过环境变量控制 Optimizer 等方式提高了混合并行调试的灵活性。此外,通过修复相关 Bug,进一步提升了系统的稳定性。 +### 性能提升 +- 针对 Llama 类模型 SwiGLU 激活模块计算过程显存占用较大的问题,新增了 SwiGLU 融合算子,节省中间变量的显存占用,从而降低大模型训练过程显存开销,减少重计算以提升性能,Llama-70B 模型性能提升 9%。 [#61508](https://github.com/PaddlePaddle/Paddle/pull/61508) +- 针对序列并行(Sequence Parallel)过程通信占比较高的问题,实现了序列并行反向过程通信与 Matmul 计算的 overlap,节省端到端耗时,在大模型训练场景端到端性能提升 
1%~2%。[#62284](https://github.com/PaddlePaddle/Paddle/pull/62284),[#63531](https://github.com/PaddlePaddle/Paddle/pull/63531) +- 针对 Sharding 反向通信后仍需要除以 nranks 导致训练速度慢的问题,支持了反向通信与除以 nranks 运算的融合,支持 ReduceScatter Average 的模式,提升大模型训练性能。[#62623](https://github.com/PaddlePaddle/Paddle/pull/62623) +- 针对张量模型并行过程输入数据广播过程导致训练速度抖动的问题,修复了数据广播过程的不必要的 CPU 和 GPU 间的同步,保证训练速度的稳定性。[#60816](https://github.com/PaddlePaddle/Paddle/pull/60816) +- 针对流水线模型并行 P2P 通信时间较长导致训练速度低下的问题,实现了 P2P 通信与前反向计算的 overlap,大模型端到端训练性能提升 2%~3%。[#61935](https://github.com/PaddlePaddle/Paddle/pull/61935),[#62051](https://github.com/PaddlePaddle/Paddle/pull/62051) +- 针对 fused_linear_param_grad_add 算子 bias 梯度计算效率低下问题,优化了 bias 梯度计算环节的计算效率,大模型端到端训练性能提升 0.2%。[#63114](https://github.com/PaddlePaddle/Paddle/pull/63114) +- 针对 Sharding 反向计算结束后参数广播过程耗时较长的问题,实现了参数广播与下一个 step 计算的 overlap,大模型端到端训练性能提升 2%以上。[#63945](https://github.com/PaddlePaddle/Paddle/pull/63945) +- 针对流水线并行训练过程梯度占用显存过高从而引入过多重计算导致训练速度慢的问题,实现了梯度动态释放技术,大模型端到端训练性能提升 3.4%。[#59739](https://github.com/PaddlePaddle/Paddle/pull/59739) + +### Bug 修复 +- 修复 StreamSafeCUDAAllocator CUDA Event 资源泄露导致大模型训练降速等问题。[#64621](https://github.com/PaddlePaddle/Paddle/pull/64621) +- 修复 fused_rotary_position_embedding 算子反向计算错误的 bug。[#60217](https://github.com/PaddlePaddle/Paddle/pull/60217) +- 修复自定义算子在 AMP 场景下无法通过黑白名单控制计算精度的 bug。[#60052](https://github.com/PaddlePaddle/Paddle/pull/60052) +- 修复 add_、divide_等原生支持不同数据类型运算的算子在类型提升时发生预期外的类型提升的 bug。[#64302](https://github.com/PaddlePaddle/Paddle/pull/64302) + +## 6.分布式策略增强 +重点强化了飞桨动态图分布式计算功能体验,对 AutoTuner、流水线并行、Sharding 等并行策略做了多方面的功能改进,增强了大模型训练的灵活性;新增 Flash Attention Mask 等功能,显著降低大模型训练特别是长 sequence 训练的显存占用,提升训练性能,为大模型训练提供更强的能力支持;另外修复了若干 Bug 以及潜在的安全性风险,显著提升了系统整体稳定性。 -#### 新功能 +### 功能优化 +- 优化了 Autotuner 的搜索空间,大幅提升了搜索的性能。[#62608](https://github.com/PaddlePaddle/Paddle/pull/62608) +- 针对流水线并行中由于在 eval 
过程检查发送类型,导致训练可能出错的问题,增加训练配置,跳过流水线发送的冗余接收检查,灵活性更高、性能更好。[#63001](https://github.com/PaddlePaddle/Paddle/pull/63001) +- 在动态图流水并行中,增加了发送和接收数据的大小和类型的检查,增加报错信息,使得鲁棒性、可调试性更好。[#59405](https://github.com/PaddlePaddle/Paddle/pull/59405) +- 支持动态图流水并行设定多个损失函数,并返回多个 loss,提升了动态图流水线的灵活性。[#63167](https://github.com/PaddlePaddle/Paddle/pull/63167) +- 在动态图流水并行中,增加流水线缓存清除配置选项,可以及时清除流水线中发送和接收的 cache,更好地支持动态 batchsize 训练。[#62277](https://github.com/PaddlePaddle/Paddle/pull/62277) +- 针对 sharding stage3 策略无法逐位对齐的问题,将无序的 set 集合换成了有序的 OrderedSet,避免了累加顺序导致的误差,修复完后可以逐位对齐。[#60085](https://github.com/PaddlePaddle/Paddle/pull/60085) +- 为了进一步降低序列并行中的显存占用,新增重计算 allgather 的方法,减少 allgather 的 activation 的显存大小。[#64244](https://github.com/PaddlePaddle/Paddle/pull/64244) + +### 动态图新功能 +- 针对 autotuner 的搜索空间,新增了 refined recompute 的搜索维度,使得搜索结果更精准,调优模型的门槛更低。[#62430](https://github.com/PaddlePaddle/Paddle/pull/62430) +- 针对虚拟流水线并行中,需要限制训练批大小的问题,修改了流水线调度方式,解除批大小限制,支持更灵活的批大小。[#61561](https://github.com/PaddlePaddle/Paddle/pull/61561),[#60314](https://github.com/PaddlePaddle/Paddle/pull/60134) +- 针对使用 flash attention 具有 mask 时,mask 的显存占用随序列长度呈二次方复杂度、性能低的问题,使用稀疏的 mask 表达、优化 mask 的显存,显存复杂度从序列长度的二次方降低为一次方,减少了存储的访问次数,同时使用 shared memory 加速访存,大幅提升性能。[#62029](https://github.com/PaddlePaddle/Paddle/pull/62029) +- 动态图 Sharding 并行策略新增完善通信和计算 overlap 功能,提升训练过程中的性能。[#60455](https://github.com/PaddlePaddle/Paddle/pull/60455) + +### 通信库功能优化 +- 增强 NCCL 通信库的功能,支持初始化时传入额外的初始化参数以支持定制的 NCCL 库的初始化。[#62193](https://github.com/PaddlePaddle/Paddle/pull/62193) +- 增加 NCCL 库路径查找功能,支持更灵活的 NCCL 库查找方式。[#62492](https://github.com/PaddlePaddle/Paddle/pull/62492) + +### Bug 修复 +- 修复 fused\_linear\_param\_grad\_add\_kernel 算子 dbias_out 空间申请问题,同时增加梯度地址检查逻辑,使得报错信息更易调试。[#63433](https://github.com/PaddlePaddle/Paddle/pull/63433),[#64460](https://github.com/PaddlePaddle/Paddle/pull/64460) +- 修复 sharding 策略在支持 reduce_avg 操作中、comm_overlap 在关闭时未对梯度进行缩放的问题。[#62702](https://github.com/PaddlePaddle/Paddle/pull/62702) +- 解决 Stage2 中 main grad 
计算顺序、fusion 相关的 bug。[#59142](https://github.com/PaddlePaddle/Paddle/pull/59142) +- 修复 sharding 策略下,当开启 reduce_avg 通信操作时,无法找到该开关属性的问题。[#62502](https://github.com/PaddlePaddle/Paddle/pull/62502) +- sharding stage1 训练支持非训练参数训练,解决部分参数设置 stop_gradient=True 的问题。[#62616](https://github.com/PaddlePaddle/Paddle/pull/62616) +- 修正 TCP 关闭时打印的信息,防止误导用户。[#62631](https://github.com/PaddlePaddle/Paddle/pull/62631) +- 针对数据并行训练中,部分梯度没有初始化,出现 segmentation fault 错误,修改 DataParallel 训练问题,解决多卡训练出错的问题。[#62299](https://github.com/PaddlePaddle/Paddle/pull/62299) +- 针对开启序列并行的场景,修复了部分模型因为权重冻结而导致的 bug。[#63596](https://github.com/PaddlePaddle/Paddle/pull/63596) +- 针对单路 dp 的 autotuner 场景,修复了一些 bug。[#60757](https://github.com/PaddlePaddle/Paddle/pull/60757) +- 修复流水并行策略 aadiff bug。 ([#64716](https://github.com/PaddlePaddle/Paddle/pull/64716)) +- 移除部分分布式单测。 ([#62762](https://github.com/PaddlePaddle/Paddle/pull/62762)) + +### 安全风险修复 +- 针对 prune\_by\_memory\_estimation 算子中存在安全泄露风险,修补安全漏洞。[#61320](https://github.com/PaddlePaddle/Paddle/pull/61320) + +## 7.参数服务器 +本次更新主要修复了参数服务器使用过程的若干 bug 以及编译安装等问题。 + +### Bug 修复 +- 针对 unique 算子读写越界的问题,修复了 unique 算子计算过程长度设置错误问题,保证 unique 算子运算正确性。[#60840](https://github.com/PaddlePaddle/Paddle/pull/60840) +- 针对 PGLBox 训练过程 save/load 功能缺失以及编译错误等问题,修复了 PGLBox save/load 和编译过程的若干 bug,保证了 PGLBox 功能的正确性。[#63905](https://github.com/PaddlePaddle/Paddle/pull/63905) +- 针对 CPUPS 训练过程触发 GPUPS 逻辑导致训练挂掉的问题,修复了 CPUPS 中 use_ps_gpu 的设置值,保证 CPUPS 训练流程的正确性。[#61406](https://github.com/PaddlePaddle/Paddle/pull/61406) +- 针对 GPUPS 在 CUDA 12.3 中训练出 cudaErrorInvalidResourceHandle 错误的问题,加入了 device id 切换机制,保证在正确的设备上进行对应的资源操作。[#63391](https://github.com/PaddlePaddle/Paddle/pull/63391) +- 针对 PGLBox Embedding Dump 过程出现乱码的问题,修复了 C++ std::string 使用不当的 bug,保证 Embedding Dump 结果的正确性。[#65179](https://github.com/PaddlePaddle/Paddle/pull/65179) + +### 文档完善 +- 在 RPC 接口文档中接入安全警告,提醒用户需要在安全的网络条件下使用此接口。[#64100](https://github.com/PaddlePaddle/Paddle/pull/64100) + +### 安全加强 +- 
修复若干代码安全问题,防止恶意代码注入。[#60023](https://github.com/PaddlePaddle/Paddle/pull/60023),[#60544](https://github.com/PaddlePaddle/Paddle/pull/60544),[#60615](https://github.com/PaddlePaddle/Paddle/pull/60615) + +## 8.推理部署 +推理框架基于 PIR 升级了 GPU、XPU、CPU 硬件下 PASS,相比上个版本可大幅减少代码行数,提升开发效率。底层执行器升级到了新版异步执行器,在大多数模型上提升推理性能。完成基于 CINN 编译器进行推理加速的适配对接。针对这些特性增加了开关,用户可设置开启。此外,Paddle Inference 还支持了原生与 TensorRT 子图混合推理下直接加载优化后的序列化模型,可以减少启动时耗时。针对 Paddle-TensorRT 增加灵活控制节点计算精度、子图是否进入 TensorRT 计算等接口,方便调试。 性能优化上,GPU、XPU、CPU 都增加了较多 Transformer 及 LLM 计算加速的融合算子,如分组注意力机制融合算子、GQA 结构、WINT4 等支持,并支持通过 PASS 自动匹配。 + +### 新增功能 +- Paddle-TensorRT + - Paddle-TensorRT 底层调用的 API 升级,在 TensorRT 版本大于 8.5 以上时,调用的 EnqueueV2 API (后续会被废弃)升级为 EnqueueV3 API。[#60807](https://github.com/PaddlePaddle/Paddle/pull/60807) + - 增加配置 config.exp_disable_tensorrt_subgraph()可以设置一些子图不进入 TensorRT。[#61967](https://github.com/PaddlePaddle/Paddle/pull/61967) + - 增加配置 config.exp_disable_tensorrt_dynamic_shape_ops()可设置动态 shape 输入的算子不进入 TensorRT,默认值为 False。[#62352](https://github.com/PaddlePaddle/Paddle/pull/62352) + - 增加配置 config.exp_specify_tensorrt_subgraph_precision()可以设置节点跑不同的精度类型。[#62402](https://github.com/PaddlePaddle/Paddle/pull/62402) +- Inference 中增加开启 CINN 编译器的开关,配置推理 config 时,通过 config.enable_cinn()开启 CINN。[#61949](https://github.com/PaddlePaddle/Paddle/pull/61949) +- Inference 升级使用 PIR 机制 + - config 增加 enable_new_ir()接口使能 PIR。[#61968](https://github.com/PaddlePaddle/Paddle/pull/61968) + - config 增加 set_optimization_level()接口可设置不同优化等级。[#61968](https://github.com/PaddlePaddle/Paddle/pull/61968) + - PIR 机制下 PASS 功能支持自定义 C++PASS。[#62468](https://github.com/PaddlePaddle/Paddle/pull/62468) + - 推理库对外暴露 PIR 相关实现头文件,支持用户基于 PIR 的二次开发,如自定义 Pass 开发等。[#61863](https://github.com/PaddlePaddle/Paddle/pull/61863),[#62293](https://github.com/PaddlePaddle/Paddle/pull/62293) + - PIR 机制下支持通过对 Predictor 注册 Hook 操作算子的输入输出。[#63101](https://github.com/PaddlePaddle/Paddle/pull/63101) +- 多层 Transformer 融合算子 fused_multi_transformer_op 融合算子支持 GQA 
计算。[#64125](https://github.com/PaddlePaddle/Paddle/pull/64125) + +### 功能完善 +- 推理支持直接加载优化后的模型,使得可以完全跳过 IR 优化,使用该方式部署可以最大程度降低框架开销。[#61598](https://github.com/PaddlePaddle/Paddle/pull/61598) +- 支持加载保存下来的经过 IR PASS 优化后的模型推理时,重新指定 shape 范围信息文件。[#60457](https://github.com/PaddlePaddle/Paddle/pull/60457) +- 控制流算子的子图内可收集 Shape 信息,支持使用 Paddle-TensorRT 推理加速。[#60451](https://github.com/PaddlePaddle/Paddle/pull/60451) ,[#59588](https://github.com/PaddlePaddle/Paddle/pull/59588) +- GPU 原生推理的混合精度 PASS(auto_mixed_precision_pass)支持处理稀疏 Tensor。[#62656](https://github.com/PaddlePaddle/Paddle/pull/62656) +- XPU 硬件相关 + - XPU 针对 Conv 和 FC 的融合 PASS 支持 Float 到 INT31 类型的转换。[#59981](https://github.com/PaddlePaddle/Paddle/pull/59981) + - XPU 的 strided slice 算子支持设置 strides 为负数。 [#62268](https://github.com/PaddlePaddle/Paddle/pull/62268) + - XPU 的多层 Encoder 融合 PASS 可以自适应序列长度并支持变长。 [#63825](https://github.com/PaddlePaddle/Paddle/pull/63825) +- Paddle TensorRT INT8 计算模式下支持 tile 算子进入 TensorRT 计算,提升部分模型 INT8 性能。 [#60189](https://github.com/PaddlePaddle/Paddle/pull/60189) + +### 模型压缩 +主要针对训练后量化(Post Training Quantization,PTQ)和量化训练(Quantization Aware Training,QAT)做了 bug 修复和功能优化。 +- 支持按照通道内分组的模拟量化[#61828](https://github.com/PaddlePaddle/Paddle/pull/61828) +- 支持动态图下自动保存量化 scale 到模型参数文件中[#59441](https://github.com/PaddlePaddle/Paddle/pull/59441) +- 去除中 dataloader 必须是 DataLoader 实例的限制[#61798](https://github.com/PaddlePaddle/Paddle/pull/61798) + +### 性能优化 +- 推理执行器升级,保证性能不变情况下,大幅度降低运行时显存占用,可通过 config.enable_use_executor(True)来使用。[#57920](https://github.com/PaddlePaddle/Paddle/pull/57920),[#58452](https://github.com/PaddlePaddle/Paddle/pull/58452),[#63350](https://github.com/PaddlePaddle/Paddle/pull/63350),[#64466](https://github.com/PaddlePaddle/Paddle/pull/64466) +- 升级 paddle inference 的 oneDNN 版本到 v3.4,其中整体性能相比 v3.3 版本有提升。 [#64661](https://github.com/PaddlePaddle/Paddle/pull/64661) +- 升级基于 CUTLASS 支持矩阵乘与激活的融合计算。 ([#61925](https://github.com/PaddlePaddle/Paddle/pull/61925)) + +#### PIR 机制下新增通用 PASS +- 添加 
identity_op_clean_pass 和 matmul_scale_fuse_pass。 [#59840](https://github.com/PaddlePaddle/Paddle/pull/59840) +- 添加 fused_flash_attn_pass,该 pass 会调用 flash_attention 替换原始的 attention 计算。[#64213](https://github.com/PaddlePaddle/Paddle/pull/64213),[#64707](https://github.com/PaddlePaddle/Paddle/pull/64707),[#63304](https://github.com/PaddlePaddle/Paddle/pull/63304) +- 推理 PIR 新架构下全新升级 layout 布局调整算法,支持 conv 类、norm 类等算子的 NHWC 推理,在 SD 模型上测试大幅提升性能。[#63628](https://github.com/PaddlePaddle/Paddle/pull/63628),[#64634](https://github.com/PaddlePaddle/Paddle/pull/64634),[#64658](https://github.com/PaddlePaddle/Paddle/pull/64658),[#64708](https://github.com/PaddlePaddle/Paddle/pull/64708),[#64830](https://github.com/PaddlePaddle/Paddle/pull/64830),[#64896](https://github.com/PaddlePaddle/Paddle/pull/64896) +- 增加 remove_redundant_transpose PASS。 [#63357](https://github.com/PaddlePaddle/Paddle/pull/63357) +- 在推理中使能 CSE PASS,提升推理性能。[#64523](https://github.com/PaddlePaddle/Paddle/pull/64523) -- 通信库新增 TraceHang 功能,当集群训练出现 Hang 的问题时,能够快速的定位到出现问题的节点。[#59217](https://github.com/PaddlePaddle/Paddle/pull/59217) -- 为了提升训练效率和降低显存,动态图支持 stride 机制。[#55156](https://github.com/PaddlePaddle/Paddle/pull/55156),[#54762](https://github.com/PaddlePaddle/Paddle/pull/54762),[#55850](https://github.com/PaddlePaddle/Paddle/pull/55850),[#59190](https://github.com/PaddlePaddle/Paddle/pull/59190),[#57005](https://github.com/PaddlePaddle/Paddle/pull/57005),[#57005](https://github.com/PaddlePaddle/Paddle/pull/57005),[#57331](https://github.com/PaddlePaddle/Paddle/pull/57331),[#58033](https://github.com/PaddlePaddle/Paddle/pull/58033),[#58033](https://github.com/PaddlePaddle/Paddle/pull/58033),[#58303](https://github.com/PaddlePaddle/Paddle/pull/58303),[#57835](https://github.com/PaddlePaddle/Paddle/pull/57835),[#57189](https://github.com/PaddlePaddle/Paddle/pull/57189) -- 为了方便计算图的分析,增强 paddleviz 
功能。[#56837](https://github.com/PaddlePaddle/Paddle/pull/56837),[#57626](https://github.com/PaddlePaddle/Paddle/pull/57626) -- 分布式 Sharding 策略(Stage1,2,3)新增 main_grad 功能,以支持更高精度的梯度累加,减少低精度累加带来的精度损失。[#57972](https://github.com/PaddlePaddle/Paddle/pull/57972),[#57934](https://github.com/PaddlePaddle/Paddle/pull/57934),[#57473](https://github.com/PaddlePaddle/Paddle/pull/57473),[#57537](https://github.com/PaddlePaddle/Paddle/pull/57537),[#59611](https://github.com/PaddlePaddle/Paddle/pull/59611),[#57960](https://github.com/PaddlePaddle/Paddle/pull/57960) -- Sharding Stage1 策略新增开关变量,可以控制是否对 Optimizer 进行 fusion 计算。[#58790](https://github.com/PaddlePaddle/Paddle/pull/58790) -- Recompute 功能新增对 Tuple 输入参数的支持,增强了 Recompute 接口的调用能力。[#56793](https://github.com/PaddlePaddle/Paddle/pull/56793) -- 增强 Launch 功能,动态图下无需指定 endpoints 也可以进行分布式训练。 [#54636](https://github.com/PaddlePaddle/Paddle/pull/54636) +#### GPU 性能优化 +含新增融合算子及 PIR 机制下新增 PASS。 +- 稀疏卷积算子(sparse conv)性能优化,提升 BEV 等模型的推理性能。[#63067](https://github.com/PaddlePaddle/Paddle/pull/63067) +- 新增基于 flash attention 的融合 PASS。 [#63220](https://github.com/PaddlePaddle/Paddle/pull/63220) +- 推理支持 elementwise_add+group_norm+silu 激活的算子融合 pattern 及其对应融合 kernel。[#64199](https://github.com/PaddlePaddle/Paddle/pull/64199) +- 矩阵乘计算支持 groupwise 的 Weight only INT4 计算。[#60422](https://github.com/PaddlePaddle/Paddle/pull/60422) 、[#63212](https://github.com/PaddlePaddle/Paddle/pull/63212) 、[#60204](https://github.com/PaddlePaddle/Paddle/pull/60204) +- 分组注意力机制融合算子 block_multi_head_attention 的算子实现支持 KV Cache 量化。[#59951](https://github.com/PaddlePaddle/Paddle/pull/59951) +- 推理使用 CUTLASS 升级 conv 融合算子实现并支持 PASS 自动融合支持 bias 与 activation,新算子相较原先 cuDNN 实现有显著的性能加速。需通过 config.exp_enable_use_cutlass(True)使用。[#64201](https://github.com/PaddlePaddle/Paddle/pull/64201)、[#64641](https://github.com/PaddlePaddle/Paddle/pull/64641) +- 添加 blha_get_max_len 算子并去除了 block_multihead_attention 中每次调用 get_max_len 
的行为,该功能应用于大模型动态推理加速。[#64246](https://github.com/PaddlePaddle/Paddle/pull/64246) +- 数据排布优化 PASS 禁止 conv 融合算子 FP32 精度类型时使用 NHWC 模式计算,原因是 cuDNN 在此条件下会导致性能退化。[#63400](https://github.com/PaddlePaddle/Paddle/pull/63400) +- GPU 峰值显存优化,升级底层接口 TryShrinkMemory,支持在 GPU place 下释放显存池空闲显存,某些场景下可大幅度削减峰值显存。[#61319](https://github.com/PaddlePaddle/Paddle/pull/61319) -#### 功能优化 +#### CPU 性能优化 +含新增融合算子及 PIR 机制下新增 PASS 并优化部分 Kernel。 +- 添加 scale_matmul_fuse_pass。 [#63313](https://github.com/PaddlePaddle/Paddle/pull/63313) +- 融合算子 fused_bias_residual_layernorm 和 fused_rms_norm 添加 CPU 实现,大幅度提升推理速度。[#63196](https://github.com/PaddlePaddle/Paddle/pull/63196)、[#63165](https://github.com/PaddlePaddle/Paddle/pull/63165) +- 新增 Deconvolution kernel 的缓存优化,从而大大提升该算子的执行速度。 [#60922](https://github.com/PaddlePaddle/Paddle/pull/60922) +- PIR 下新增 depthwise_conv 融合 PASS,将 depthwise_conv 算子转换为 conv2d,从而使用 onednn conv2d 的 kernel 优化,提升该算子推理速度。 [#63051](https://github.com/PaddlePaddle/Paddle/pull/63051) +- PIR 下新增 Conv 与激活的融合 PASS(conv_activation_mkldnn_fuse_pass),支持 conv 和 13 种激活函数进行融合,大大提升 conv 相关算子的推理速度。 [#63145](https://github.com/PaddlePaddle/Paddle/pull/63145) +- PIR 下新增多种算子和 unsqueeze 的算子融合 PASS(operator_unsqueeze_onednn_fuse_pass),提升推理速度。 [#63592](https://github.com/PaddlePaddle/Paddle/pull/63592) +- PIR 下新增将 reshape 融合进多个算子的 PASS (operator_reshape_onednn_fuse_pass)。 [#63812](https://github.com/PaddlePaddle/Paddle/pull/63812) +- PIR 下新增 scale 融合 PASS (operator_scale_onednn_fuse_pass)。 [#63811](https://github.com/PaddlePaddle/Paddle/pull/63811) +- PIR 下新增 conv 与 bias 融合的 PASS (conv2d_transpose_bias 算子) 。 [#62241](https://github.com/PaddlePaddle/Paddle/pull/62241) +- PIR 下新增 onednn_placement_pass,支持了 151 种算子从 Phi 算子转换为 oneDNN 算子,从而使用 oneDNN 高性能库进行优化,提升推理速度。 [#63982](https://github.com/PaddlePaddle/Paddle/pull/63982) +- PIR 下新增 elementwise 类型算子和 13 种激活函数的融合,大大提升 cpu 下开启 onednn 的推理速度。 [#63516](https://github.com/PaddlePaddle/Paddle/pull/63516) +- PIR 下新增多个 conv + concat + 激活函数和 fused_conv + concat + 
激活函数的融合,大大提升了 conv 下有 concat 和激活函数的情况下的推理速度。 [#62993](https://github.com/PaddlePaddle/Paddle/pull/62993)、 [#62713](https://github.com/PaddlePaddle/Paddle/pull/62713) +- PIR 下新增 matmul+add 算子融合 PASS (matmul_elementwise_add_fuse_pass)。[#62715](https://github.com/PaddlePaddle/Paddle/pull/62715) +- PIR 下新增 scale 参数折叠 PASS(scale_matmul_fuse_pass)。[#63313](https://github.com/PaddlePaddle/Paddle/pull/63313) +- PIR 下新增 softplus 与 12 种激活函数融合 PASS(softplus_activation_fuse_pass)。[#63617](https://github.com/PaddlePaddle/Paddle/pull/63617) +- PIR 下新增 fc 算子转换 PASS(fc_onednn_enable_pass)。[#63518](https://github.com/PaddlePaddle/Paddle/pull/63518) +- PIR 下新增自注意力算子融合 PASS(self_attention_fuse_pass)。[#63726](https://github.com/PaddlePaddle/Paddle/pull/63726) +- PIR 下新增 fc 与 12 种激活函数融合 PASS(fc_activation_fuse_pass)。[#63853](https://github.com/PaddlePaddle/Paddle/pull/63853) +- PIR 下新增 BatchNorm 折叠 PASS(conv2d_bn_onednn_fuse_pass),扩增后续 pass 的融合几率。[#64524](https://github.com/PaddlePaddle/Paddle/pull/64524) +- PIR 下新增 matmul 与 12 种激活函数融合 PASS(matmul_activation_fuse_pass)。[#62901](https://github.com/PaddlePaddle/Paddle/pull/62901) +- PIR 下新增 reshape + transpose + reshape 融合 PASS(shuffle_channel_detect_pass),在特定条件下融合为 shuffle_channel 算子。[#64053](https://github.com/PaddlePaddle/Paddle/pull/64053) +- PIR 下新增 reshape + transpose + matmul 融合 PASS(reshape_transpose_matmul_fuse_pass)。[#62998](https://github.com/PaddlePaddle/Paddle/pull/62998) +- PIR 下新增 matmul + transpose + reshape 融合 PASS(matmul_transpose_reshape_fuse_pass),在部分场景下显著提升性能。[#63151](https://github.com/PaddlePaddle/Paddle/pull/63151) + +### Bug 修复 +- 修复 faster_rcnn_swin_tiny_fpn_1x_coco 等模型中的混合精度转换问题,解决了 mixed_precision_pass 的错误。 [#64673](https://github.com/PaddlePaddle/Paddle/pull/64673) +- 阻止 fused_conv2d_add_act pass 在激活函数为 sigmoid 时生效(cudnn 版本 8.0~8.7 之间时,融合 conv2d 和 sigmoid 会导致性能退化)。[#64717](https://github.com/PaddlePaddle/Paddle/pull/64717) +- 修复 self_dp_attention 和 
fused_layer_norm_avx_kernel 在 Clang12 中的编译问题。 [#63414](https://github.com/PaddlePaddle/Paddle/pull/63414) +- 修复部分模型在 IR/Pass 阶段 qdq 算子中的 scale 和 zeroPoint 过早删除的问题。 [#62225](https://github.com/PaddlePaddle/Paddle/pull/62225) +- 修复同时开启 Config.UseOptimizedModel()和 config.EnableMemoryOptim()时导致报错的问题。 [#62501](https://github.com/PaddlePaddle/Paddle/pull/62501) +- 增加 matmul_scale_fuse_pass 的约束,其中输入 w 必须是权重,否则不会匹配该 pass。 [#62850](https://github.com/PaddlePaddle/Paddle/pull/62850) +- 保证 inference 模型输出键的顺序与动态图模型导出时的顺序一致。 [#63791](https://github.com/PaddlePaddle/Paddle/pull/63791) +- 修复常量折叠 PASS 在"被折叠的 op 和其输入输出不在一个子图时"出错的问题。 [#62148](https://github.com/PaddlePaddle/Paddle/pull/62148) +- 修复 PaddleTRT 模式下若干运行时问题。包括 int8 模式下 yolo_box 算子引起的量化校准表生成失败、reduce 算子 dim 属性数据类型未正确处理引起的报错。[#61596](https://github.com/PaddlePaddle/Paddle/pull/61596) +- 修复混合精度推理模式下若干运行时报错问题。包括 fused conv2d 算子间共享权重未正确转换权重 layout、fused conv2d 算子 backend 未正常选择为 cuDNN、fused conv2d 算子在 NHWC 下错误处理 bias 维度、错误处理 norm 类算子的输入数据类型引起的报错。[#60955](https://github.com/PaddlePaddle/Paddle/pull/60955)、[#60076](https://github.com/PaddlePaddle/Paddle/pull/60076)、[#63007](https://github.com/PaddlePaddle/Paddle/pull/63007)、[#63988](https://github.com/PaddlePaddle/Paddle/pull/63988) +- 修复 config.delete_pass 功能未生效问题。[#61056](https://github.com/PaddlePaddle/Paddle/pull/61056) +- PIR 中修复 While 控制流的 GC 机制,提前回收不需要的输入,减少峰值显存,例如在 LLaMA 7B 模型中减少 2GB 显存。[#63062](https://github.com/PaddlePaddle/Paddle/pull/63062) +- 修正了 OneDNN mean kernel 回退错误。 [#64676](https://github.com/PaddlePaddle/Paddle/pull/64676) +- 修正 conv_bias_fuse_pass,新增了若干强约束,例如 bias 的 shape 不能为 1,从而保证 pass 推理结果稳定。 [#64412](https://github.com/PaddlePaddle/Paddle/pull/64412) +- 修正 conv_elementwise_add_onednn_fuse_pass,新增了若干强约束,例如 conv2d_out 和 residual_param 的尺寸必须一致,从而保证 pass 推理稳定。 [#64448](https://github.com/PaddlePaddle/Paddle/pull/64448) +- 修复在特定情况下,反复插入量化反量化算子的问题。 [#63082](https://github.com/PaddlePaddle/Paddle/pull/63082) + +## 9.硬件适配 + +### 适配方案 (Custom Device) 
+飞桨硬件接入本次新增了对 4 款硬件昆仑 XPU、昇腾 NPU、海光 DCU 和寒武纪 MLU 的日常发版支持,同时通过大模型训练和推理部署的打磨修复了分布式通信中存在的问题,并通过显存优化、计算和通信的 overlap 等功能进行性能优化。其次、本次各个硬件还新增了大量 BFloat16 数据类型的算子支持,以及众多算子融合 Pass 和各个硬件上的融合算子,通过软硬联合的方式接入硬件大 Transformer 算子库来充分提升大模型性能。 -- 实现动静统一的新通信库,通信算子全面适配 PHI 算子体系,减少开发和维护成本,更好地支持动态图和自动并行架构升级。[#54417](https://github.com/PaddlePaddle/Paddle/pull/54417),[#57768](https://github.com/PaddlePaddle/Paddle/pull/57768),[#57897](https://github.com/PaddlePaddle/Paddle/pull/57897),[#55537](https://github.com/PaddlePaddle/Paddle/pull/55537),[#56604](https://github.com/PaddlePaddle/Paddle/pull/56604),[#57519](https://github.com/PaddlePaddle/Paddle/pull/57519),[#56088](https://github.com/PaddlePaddle/Paddle/pull/56088),[#57153](https://github.com/PaddlePaddle/Paddle/pull/57153),[#57161](https://github.com/PaddlePaddle/Paddle/pull/57161),[#57252](https://github.com/PaddlePaddle/Paddle/pull/57252),[#57251](https://github.com/PaddlePaddle/Paddle/pull/57251),[#57208](https://github.com/PaddlePaddle/Paddle/pull/57208),[#57305](https://github.com/PaddlePaddle/Paddle/pull/57305),[#57424](https://github.com/PaddlePaddle/Paddle/pull/57424),[#57548](https://github.com/PaddlePaddle/Paddle/pull/57548),[#57560](https://github.com/PaddlePaddle/Paddle/pull/57560),[#57564](https://github.com/PaddlePaddle/Paddle/pull/57564),[#57233](https://github.com/PaddlePaddle/Paddle/pull/57233),[#55726](https://github.com/PaddlePaddle/Paddle/pull/55726),[#58073](https://github.com/PaddlePaddle/Paddle/pull/58073) -- TCPStore 改为单例以便更灵活地支持动态图和自动并行功能。[#55956](https://github.com/PaddlePaddle/Paddle/pull/55956) -- 改善了 MP/PP/SP 等分布式策略的可维护性和灵活性,包含增加打印 warning、报错信息,对代码文件进行结构清理,梳理 PP 
对输入的限制等。[#54448](https://github.com/PaddlePaddle/Paddle/pull/54448),[#59762](https://github.com/PaddlePaddle/Paddle/pull/59762),[#55462](https://github.com/PaddlePaddle/Paddle/pull/55462),[#54788](https://github.com/PaddlePaddle/Paddle/pull/54788),[#54664](https://github.com/PaddlePaddle/Paddle/pull/54664),[#56456](https://github.com/PaddlePaddle/Paddle/pull/56456),[#55540](https://github.com/PaddlePaddle/Paddle/pull/55540) -- PP 策略中增加可以在计算流中进行 P2P 通信的支持,通信模式更加灵活。[#54747](https://github.com/PaddlePaddle/Paddle/pull/54747) -- Sharding 策略支持对梯度进行 reduce 操作。[#58842](https://github.com/PaddlePaddle/Paddle/pull/58842),[#57967](https://github.com/PaddlePaddle/Paddle/pull/57967),[#55495](https://github.com/PaddlePaddle/Paddle/pull/55495) +#### 新增功能 +- 新增分布式策略 sharding stage1 v2 的支持。[#61500](https://github.com/PaddlePaddle/Paddle/pull/61500) +- 支持分布式通信模块支持 BF16 数据类型。新增部分算子对 BF16 数据类型的支持,如 empty、shape 等。[#60768](https://github.com/PaddlePaddle/Paddle/pull/60768),[#62140](https://github.com/PaddlePaddle/Paddle/pull/62140),[#62604](https://github.com/PaddlePaddle/Paddle/pull/62604) +- 新增 get_comm_name 接口的支持,对 memory stat 功能支持, 支持 Profiler 对内存时间的记录。[#62556](https://github.com/PaddlePaddle/Paddle/pull/62556),[#61030](https://github.com/PaddlePaddle/Paddle/pull/61030),[#62292](https://github.com/PaddlePaddle/Paddle/pull/62292) +- 新增部分融合策略和算子的支持,包括 silu_fuse_pass, conv_elementwise_add_act_fuse_pass, generator offset 的支持。 [#60595](https://github.com/PaddlePaddle/Paddle/pull/60595),[#60708](https://github.com/PaddlePaddle/Paddle/pull/60708),[#60616](https://github.com/PaddlePaddle/Paddle/pull/60616) #### 性能优化 - -- 实现 PP 策略的最后一层及时释放 output,以节约显存。[#54505](https://github.com/PaddlePaddle/Paddle/pull/54505) -- MP 策略 Tensor fusion 支持传入 params group,增强 Tensor fusion 功能;增加 allreduce 异步通信性能,通过计算和通信的 overlap 提升训练性能。[#57690](https://github.com/PaddlePaddle/Paddle/pull/57690),[#55662](https://github.com/PaddlePaddle/Paddle/pull/55662) -- Sharding 策略反向计算和梯度通信进行 overlap 以提升训练性能。Sharding stage1 
新增 Tensor fusion 和 fuse grad clip,optimizer 等优化提高计算效率。支持 VPP 与 DP/Sharding Stage1 的 overlap,提升通信计算并行度。优化 Sharding Stage1 在 FP16 下的性能,在 check finite 阶段只对本 sharding rank 负责的梯度进行检查,降低计算开销;增加环境变量,控制是否进行 Optimize,以节约显存支持,实现使用更少的资源进行模型训练调试。[#55598](https://github.com/PaddlePaddle/Paddle/pull/55598),[#55427](https://github.com/PaddlePaddle/Paddle/pull/55427),[#56063](https://github.com/PaddlePaddle/Paddle/pull/56063),[#55766](https://github.com/PaddlePaddle/Paddle/pull/55766),[#59848](https://github.com/PaddlePaddle/Paddle/pull/59848) -- 混合并行策略将 PP/VPP 下的 Tensor fusion 提到运行前,解决运行时 fuse 对显存额外开销的问题。通过减少非必需的同步 memcpy,以提升模型训练性能。[#54403](https://github.com/PaddlePaddle/Paddle/pull/54403),[#57215](https://github.com/PaddlePaddle/Paddle/pull/57215) - -#### Bug Fix - -- 修复了 PP、Launch 功能、MP 策略以及 fuse_rope 等 13 个 bug,增强了分布式策略的稳定性;机制层面,修复 inplace,tensor 引用的错误,提升稳定性。[#55116](https://github.com/PaddlePaddle/Paddle/pull/55116),[#55782](https://github.com/PaddlePaddle/Paddle/pull/55782),[#59609](https://github.com/PaddlePaddle/Paddle/pull/59609),[#57394](https://github.com/PaddlePaddle/Paddle/pull/57394),[#55864](https://github.com/PaddlePaddle/Paddle/pull/55864),[#58482](https://github.com/PaddlePaddle/Paddle/pull/58482),[#54571](https://github.com/PaddlePaddle/Paddle/pull/54571),[#55896](https://github.com/PaddlePaddle/Paddle/pull/55896),[#54648](https://github.com/PaddlePaddle/Paddle/pull/54648),[#58307](https://github.com/PaddlePaddle/Paddle/pull/58307),[#55679](https://github.com/PaddlePaddle/Paddle/pull/55679),[#58133](https://github.com/PaddlePaddle/Paddle/pull/58133),[#58408](https://github.com/PaddlePaddle/Paddle/pull/58408),[#59707](https://github.com/PaddlePaddle/Paddle/pull/59707),[#55342](https://github.com/PaddlePaddle/Paddle/pull/55342),[#54703](https://github.com/PaddlePaddle/Paddle/pull/54703),[#54869](https://github.com/PaddlePaddle/Paddle/pull/54869),[#55568](https://github.com/PaddlePaddle/Paddle/pull/55568),[#55233](https://github.com/PaddlePaddle/Paddle/pull/55233),
[#56418](https://github.com/PaddlePaddle/Paddle/pull/56418),[#56428](https://github.com/PaddlePaddle/Paddle/pull/56428),[#56892](https://github.com/PaddlePaddle/Paddle/pull/56892),[#57192](https://github.com/PaddlePaddle/Paddle/pull/57192),[#59161](https://github.com/PaddlePaddle/Paddle/pull/59161),[#59340](https://github.com/PaddlePaddle/Paddle/pull/59340),[#57006](https://github.com/PaddlePaddle/Paddle/pull/57006),[#57353](https://github.com/PaddlePaddle/Paddle/pull/57353),[#57352](https://github.com/PaddlePaddle/Paddle/pull/57352),[#59088](https://github.com/PaddlePaddle/Paddle/pull/59088) -- 修复了 PP 策略无法及时释放单层 output 的 bug,以及初始化过程中可能会 Hang 的 bug。 [#54624](https://github.com/PaddlePaddle/Paddle/pull/54624),[#58844](https://github.com/PaddlePaddle/Paddle/pull/58844),[#54673](https://github.com/PaddlePaddle/Paddle/pull/54673),[#58376](https://github.com/PaddlePaddle/Paddle/pull/58376) -- 修复了 MP 策略下,当输入数据类型不统一时计算出错的 bug,修复了 MP 策略下参数同步的 bug 和没有正确使用用户输入 config 的 bug。[#58858](https://github.com/PaddlePaddle/Paddle/pull/58858),[#57918](https://github.com/PaddlePaddle/Paddle/pull/57918),[#58037](https://github.com/PaddlePaddle/Paddle/pull/58037) -- 统一 dygraph 和 dynamic 模式的判断方法。[#54633](https://github.com/PaddlePaddle/Paddle/pull/54633) -- 修复了 fuse_rope 中 sin 和 cos 的 Shape 不对的 bug。[#56132](https://github.com/PaddlePaddle/Paddle/pull/56132) -- 修复了 Launch 功能分布式场景下 endpoints 太长导致不能启动任务的 bug,同时修复了 endpoints 可能乱序的 bug。 [#55011](https://github.com/PaddlePaddle/Paddle/pull/55011),[#55478](https://github.com/PaddlePaddle/Paddle/pull/55478) -- 修复了 MEA 功能可能导致 segmentation fault error 的 bug。[#55408](https://github.com/PaddlePaddle/Paddle/pull/55408) - -### 自动并行 - -本版本对动静统一自动并行(Auto Parallel)编程范式进行了全面的优化,简化了开发者的编程复杂度。开发者无需深入了解手动并行编程范式下的复杂概念和 API 接口,如行切分、列切分等。仅需通过少量的张量切分标注即可完成混合并行模型的构建。框架能够自动推导出所有张量和算子的分布式切分状态,并添加合适的通信算子。同时支持一键动转静进行分布式训练,使开发者能够高效轻松地实现任意混合并行策略,大幅降低了混合并行训练代码的开发成本。 - -#### 完善了自动并行核心功能 - -- 实现
process_mesh、placement、shard_tensor、reshard、dtensor_from_fn、unshard_dtensor、shard_layer、to_static 等自动并行核心接口 [#55494](https://github.com/PaddlePaddle/Paddle/pull/55494),[#59059](https://github.com/PaddlePaddle/Paddle/pull/59059),[#56561](https://github.com/PaddlePaddle/Paddle/pull/56561),[#54425](https://github.com/PaddlePaddle/Paddle/pull/54425),[#59557](https://github.com/PaddlePaddle/Paddle/pull/59557),[#59682](https://github.com/PaddlePaddle/Paddle/pull/59682),[#56565](https://github.com/PaddlePaddle/Paddle/pull/56565),[#59862](https://github.com/PaddlePaddle/Paddle/pull/59862),[#59856](https://github.com/PaddlePaddle/Paddle/pull/59856),[#59342](https://github.com/PaddlePaddle/Paddle/pull/59342),[#59575](https://github.com/PaddlePaddle/Paddle/pull/59575),[#57604](https://github.com/PaddlePaddle/Paddle/pull/57604),[#57293](https://github.com/PaddlePaddle/Paddle/pull/57293),[#57278](https://github.com/PaddlePaddle/Paddle/pull/57278) -- 实现基于 Einsum 表达式的切分推导规则,并完成 20+类算子切分推导规则,覆盖 LLaMA、GPT
等主流生成式大语言模型。[#55196](https://github.com/PaddlePaddle/Paddle/pull/55196),[#53863](https://github.com/PaddlePaddle/Paddle/pull/53863),[#56257](https://github.com/PaddlePaddle/Paddle/pull/56257),[#55394](https://github.com/PaddlePaddle/Paddle/pull/55394),[#54810](https://github.com/PaddlePaddle/Paddle/pull/54810),[#55508](https://github.com/PaddlePaddle/Paddle/pull/55508),[#56257](https://github.com/PaddlePaddle/Paddle/pull/56257),[#57813](https://github.com/PaddlePaddle/Paddle/pull/57813),[#58149](https://github.com/PaddlePaddle/Paddle/pull/58149),[#58506](https://github.com/PaddlePaddle/Paddle/pull/58506),[#58563](https://github.com/PaddlePaddle/Paddle/pull/58563),[#58360](https://github.com/PaddlePaddle/Paddle/pull/58360),[#58920](https://github.com/PaddlePaddle/Paddle/pull/58920),[#59050](https://github.com/PaddlePaddle/Paddle/pull/59050),[#58760](https://github.com/PaddlePaddle/Paddle/pull/58760),[#59083](https://github.com/PaddlePaddle/Paddle/pull/59083),[#59236](https://github.com/PaddlePaddle/Paddle/pull/59236),[#59350](https://github.com/PaddlePaddle/Paddle/pull/59350),[#59411](https://github.com/PaddlePaddle/Paddle/pull/59411),[#59260](https://github.com/PaddlePaddle/Paddle/pull/59260),[#54373](https://github.com/PaddlePaddle/Paddle/pull/54373),[#54991](https://github.com/PaddlePaddle/Paddle/pull/54991),[#55397](https://github.com/PaddlePaddle/Paddle/pull/55397),[#55350](https://github.com/PaddlePaddle/Paddle/pull/55350),[#55177](https://github.com/PaddlePaddle/Paddle/pull/55177),[#56443](https://github.com/PaddlePaddle/Paddle/pull/56443),[#58097](https://github.com/PaddlePaddle/Paddle/pull/58097),[#56509](https://github.com/PaddlePaddle/Paddle/pull/56509),[#56502](https://github.com/PaddlePaddle/Paddle/pull/56502),[#56504](https://github.com/PaddlePaddle/Paddle/pull/56504),[#56506](https://github.com/PaddlePaddle/Paddle/pull/56506),[#56507](https://github.com/PaddlePaddle/Paddle/pull/56507),[#56505](https://github.com/PaddlePaddle/Paddle/pull/56505),[#57176]
(https://github.com/PaddlePaddle/Paddle/pull/57176),[#57374](https://github.com/PaddlePaddle/Paddle/pull/57374),[#57573](https://github.com/PaddlePaddle/Paddle/pull/57573),[#57545](https://github.com/PaddlePaddle/Paddle/pull/57545),[#57875](https://github.com/PaddlePaddle/Paddle/pull/57875),[#57866](https://github.com/PaddlePaddle/Paddle/pull/57866),[#58854](https://github.com/PaddlePaddle/Paddle/pull/58854),[#59109](https://github.com/PaddlePaddle/Paddle/pull/59109),[#59185](https://github.com/PaddlePaddle/Paddle/pull/59185),[#58913](https://github.com/PaddlePaddle/Paddle/pull/58913),[#59547](https://github.com/PaddlePaddle/Paddle/pull/59547),[#58296](https://github.com/PaddlePaddle/Paddle/pull/58296),[#59545](https://github.com/PaddlePaddle/Paddle/pull/59545),[#59039](https://github.com/PaddlePaddle/Paddle/pull/59039),[#59002](https://github.com/PaddlePaddle/Paddle/pull/59002),[#58087](https://github.com/PaddlePaddle/Paddle/pull/58087),[#56367](https://github.com/PaddlePaddle/Paddle/pull/56367),[#57877](https://github.com/PaddlePaddle/Paddle/pull/57877),[#56839](https://github.com/PaddlePaddle/Paddle/pull/56839),[#59003](https://github.com/PaddlePaddle/Paddle/pull/59003),[#57269](https://github.com/PaddlePaddle/Paddle/pull/57269),[#55130](https://github.com/PaddlePaddle/Paddle/pull/55130),[#58474](https://github.com/PaddlePaddle/Paddle/pull/58474),[#57197](https://github.com/PaddlePaddle/Paddle/pull/57197),[#57467](https://github.com/PaddlePaddle/Paddle/pull/57467),[#57259](https://github.com/PaddlePaddle/Paddle/pull/57259),[#57280](https://github.com/PaddlePaddle/Paddle/pull/57280),[#56508](https://github.com/PaddlePaddle/Paddle/pull/56508) -- 实现动静统一的分布式 checkpoint 存储和加载,支持任意按切分状态存储和加载时重切分。[#59659](https://github.com/PaddlePaddle/Paddle/pull/59659),[#59843](https://github.com/PaddlePaddle/Paddle/pull/59843),[#60033](https://github.com/PaddlePaddle/Paddle/pull/60033),[#60034](https://github.com/PaddlePaddle/Paddle/pull/60034) - -#### 增强动态图半自动并行能力 - -- 
基础数据结构补充:C++端新增 DistTensor、Placements 等分布式特有的基础数据结构,并暴露到 Python 端,支持对相关属性和值的调试打印。[#58930](https://github.com/PaddlePaddle/Paddle/pull/58930),[#59068](https://github.com/PaddlePaddle/Paddle/pull/59068),[#55436](https://github.com/PaddlePaddle/Paddle/pull/55436),[#56449](https://github.com/PaddlePaddle/Paddle/pull/56449),[#59683](https://github.com/PaddlePaddle/Paddle/pull/59683),[#55593](https://github.com/PaddlePaddle/Paddle/pull/55593),[#58032](https://github.com/PaddlePaddle/Paddle/pull/58032),[#56368](https://github.com/PaddlePaddle/Paddle/pull/56368),[#59086](https://github.com/PaddlePaddle/Paddle/pull/59086) -- 在前、反向算子执行流程中添加 SPMD 推导与 Reshard 的生成逻辑,适配 vector、optional 等多类型输入输出以及 cpu fallback、多 kernel 选择等特殊机制。[#56602](https://github.com/PaddlePaddle/Paddle/pull/56602),[#57321](https://github.com/PaddlePaddle/Paddle/pull/57321),[#57092](https://github.com/PaddlePaddle/Paddle/pull/57092),[#56831](https://github.com/PaddlePaddle/Paddle/pull/56831),[#57119](https://github.com/PaddlePaddle/Paddle/pull/57119),[#58819](https://github.com/PaddlePaddle/Paddle/pull/58819),[#58254](https://github.com/PaddlePaddle/Paddle/pull/58254),[#55698](https://github.com/PaddlePaddle/Paddle/pull/55698),[#59241](https://github.com/PaddlePaddle/Paddle/pull/59241),[#59328](https://github.com/PaddlePaddle/Paddle/pull/59328),[#58644](https://github.com/PaddlePaddle/Paddle/pull/58644),[#56202](https://github.com/PaddlePaddle/Paddle/pull/56202),[#59159](https://github.com/PaddlePaddle/Paddle/pull/59159),[#58573](https://github.com/PaddlePaddle/Paddle/pull/58573),[#59246](https://github.com/PaddlePaddle/Paddle/pull/59246),[#59133](https://github.com/PaddlePaddle/Paddle/pull/59133),[#59186](https://github.com/PaddlePaddle/Paddle/pull/59186),[#57505](https://github.com/PaddlePaddle/Paddle/pull/57505),[#57241](https://github.com/PaddlePaddle/Paddle/pull/57241),[#58928](https://github.com/PaddlePaddle/Paddle/pull/58928) - -- 对 custom 算子、手写算子等特殊类型的算子,适配自动并行的执行逻辑。支持 DistTensor 和 DenseTensor 
作为混合输入时的自动转换。[#57774](https://github.com/PaddlePaddle/Paddle/pull/57774),[#59108](https://github.com/PaddlePaddle/Paddle/pull/59108),[#58436](https://github.com/PaddlePaddle/Paddle/pull/58436),[#59523](https://github.com/PaddlePaddle/Paddle/pull/59523),[#59136](https://github.com/PaddlePaddle/Paddle/pull/59136),[#59352](https://github.com/PaddlePaddle/Paddle/pull/59352),[#59062](https://github.com/PaddlePaddle/Paddle/pull/59062),[#58434](https://github.com/PaddlePaddle/Paddle/pull/58434),[#59148](https://github.com/PaddlePaddle/Paddle/pull/59148),[#58553](https://github.com/PaddlePaddle/Paddle/pull/58553),[#58716](https://github.com/PaddlePaddle/Paddle/pull/58716),[#58369](https://github.com/PaddlePaddle/Paddle/pull/58369),[#59061](https://github.com/PaddlePaddle/Paddle/pull/59061),[#58841](https://github.com/PaddlePaddle/Paddle/pull/58841),[#59139](https://github.com/PaddlePaddle/Paddle/pull/59139),[#59141](https://github.com/PaddlePaddle/Paddle/pull/59141),[#58837](https://github.com/PaddlePaddle/Paddle/pull/58837),[#59137](https://github.com/PaddlePaddle/Paddle/pull/59137),[#59143](https://github.com/PaddlePaddle/Paddle/pull/59143) - -- 动态图执行体系完善:适配 Autograd 
执行过程,支持动态图的反向梯度聚合、AMP、Hook、PyLayer、View、自定义算子等周围机制。[#58437](https://github.com/PaddlePaddle/Paddle/pull/58437),[#58769](https://github.com/PaddlePaddle/Paddle/pull/58769),[#58796](https://github.com/PaddlePaddle/Paddle/pull/58796),[#58339](https://github.com/PaddlePaddle/Paddle/pull/58339),[#58409](https://github.com/PaddlePaddle/Paddle/pull/58409),[#58772](https://github.com/PaddlePaddle/Paddle/pull/58772),[#58380](https://github.com/PaddlePaddle/Paddle/pull/58380),[#58447](https://github.com/PaddlePaddle/Paddle/pull/58447),[#58706](https://github.com/PaddlePaddle/Paddle/pull/58706),[#58656](https://github.com/PaddlePaddle/Paddle/pull/58656),[#58172](https://github.com/PaddlePaddle/Paddle/pull/58172),[#59401](https://github.com/PaddlePaddle/Paddle/pull/59401),[#58727](https://github.com/PaddlePaddle/Paddle/pull/58727),[#58238](https://github.com/PaddlePaddle/Paddle/pull/58238),[#59243](https://github.com/PaddlePaddle/Paddle/pull/59243),[#58469](https://github.com/PaddlePaddle/Paddle/pull/58469),[#58442](https://github.com/PaddlePaddle/Paddle/pull/58442),[#58487](https://github.com/PaddlePaddle/Paddle/pull/58487),[#58476](https://github.com/PaddlePaddle/Paddle/pull/58476),[#59706](https://github.com/PaddlePaddle/Paddle/pull/59706) - -- 新增对 PP、SP 等分布式策略的支持。[#58126](https://github.com/PaddlePaddle/Paddle/pull/58126),[#59766](https://github.com/PaddlePaddle/Paddle/pull/59766),[#59060](https://github.com/PaddlePaddle/Paddle/pull/59060),[#59841](https://github.com/PaddlePaddle/Paddle/pull/59841),[#58609](https://github.com/PaddlePaddle/Paddle/pull/58609),[#59688](https://github.com/PaddlePaddle/Paddle/pull/59688),[#58449](https://github.com/PaddlePaddle/Paddle/pull/58449)、[#59598](https://github.com/PaddlePaddle/Paddle/pull/59598) -- 新增多种 Reshard 
策略,支持张量在不同分布式状态间的转换。[#58592](https://github.com/PaddlePaddle/Paddle/pull/58592),[#59138](https://github.com/PaddlePaddle/Paddle/pull/59138),[#59367](https://github.com/PaddlePaddle/Paddle/pull/59367),[#59621](https://github.com/PaddlePaddle/Paddle/pull/59621),[#59758](https://github.com/PaddlePaddle/Paddle/pull/59758),[#59777](https://github.com/PaddlePaddle/Paddle/pull/59777),[#56975](https://github.com/PaddlePaddle/Paddle/pull/56975),[#58550](https://github.com/PaddlePaddle/Paddle/pull/58550),[#58703](https://github.com/PaddlePaddle/Paddle/pull/58703),[#57210](https://github.com/PaddlePaddle/Paddle/pull/57210),[#58734](https://github.com/PaddlePaddle/Paddle/pull/58734),[#56833](https://github.com/PaddlePaddle/Paddle/pull/56833),[#59292](https://github.com/PaddlePaddle/Paddle/pull/59292),[#57432](https://github.com/PaddlePaddle/Paddle/pull/57432),[#57568](https://github.com/PaddlePaddle/Paddle/pull/57568),[#56553](https://github.com/PaddlePaddle/Paddle/pull/56553),[#58284](https://github.com/PaddlePaddle/Paddle/pull/58284),[#56039](https://github.com/PaddlePaddle/Paddle/pull/56039),[#55552](https://github.com/PaddlePaddle/Paddle/pull/55552),[#56149](https://github.com/PaddlePaddle/Paddle/pull/56149) - -#### 静态图半自动并行能力增强 - -- 新增 Sequence Parallel 并行策略;流水线并行新增: FThenB、Interleaved 1F1B、Eager 1F1B、VPP 等调度模式,支持流水线调度的可视化,并支持上述策略与原有并行策略的混合并行;升级梯度同步机制,支持数据在任意 broadcast 
维度后需要的梯度同步。[#57605](https://github.com/PaddlePaddle/Paddle/pull/57605),[#54727](https://github.com/PaddlePaddle/Paddle/pull/54727),[#54409](https://github.com/PaddlePaddle/Paddle/pull/54409),[#54787](https://github.com/PaddlePaddle/Paddle/pull/54787),[#58313](https://github.com/PaddlePaddle/Paddle/pull/58313),[#59179](https://github.com/PaddlePaddle/Paddle/pull/59179),[#59416](https://github.com/PaddlePaddle/Paddle/pull/59416),[#59719](https://github.com/PaddlePaddle/Paddle/pull/59719),[#59822](https://github.com/PaddlePaddle/Paddle/pull/59822),[#59057](https://github.com/PaddlePaddle/Paddle/pull/59057),[#59522](https://github.com/PaddlePaddle/Paddle/pull/59522),[#57061](https://github.com/PaddlePaddle/Paddle/pull/57061) -- 执行体系与 PIR 进一步适配,打通 PIR 的优化 Pass,分布式场景下支持了 fuse_linear fuse 优化,实现性能提升。[#58459](https://github.com/PaddlePaddle/Paddle/pull/58459),[#58528](https://github.com/PaddlePaddle/Paddle/pull/58528),[#55555](https://github.com/PaddlePaddle/Paddle/pull/55555),[#59757](https://github.com/PaddlePaddle/Paddle/pull/59757),[#59102](https://github.com/PaddlePaddle/Paddle/pull/59102),[#57917](https://github.com/PaddlePaddle/Paddle/pull/57917) -- 底层架构升级: 执行器升级支持图依赖信息复用和静态化 kernel 选择;整图切分补全机制升级,切换新切分推导规则并支持更多长尾 cases 的正确切分补全;优化了静态图分布式下对控制流的支持,适配更多场景;优化了整图编译速度、日志信息格式等提升用户体验。 
[#55389](https://github.com/PaddlePaddle/Paddle/pull/55389),[#55650](https://github.com/PaddlePaddle/Paddle/pull/55650),[#54938](https://github.com/PaddlePaddle/Paddle/pull/54938),[#57447](https://github.com/PaddlePaddle/Paddle/pull/57447),[#57751](https://github.com/PaddlePaddle/Paddle/pull/57751),[#57742](https://github.com/PaddlePaddle/Paddle/pull/57742),[#59524](https://github.com/PaddlePaddle/Paddle/pull/59524),[#59526](https://github.com/PaddlePaddle/Paddle/pull/59526),[#58669](https://github.com/PaddlePaddle/Paddle/pull/58669),[#57616](https://github.com/PaddlePaddle/Paddle/pull/57616),[#56511](https://github.com/PaddlePaddle/Paddle/pull/56511),[#55727](https://github.com/PaddlePaddle/Paddle/pull/55727),[#58906](https://github.com/PaddlePaddle/Paddle/pull/58906),[#56016](https://github.com/PaddlePaddle/Paddle/pull/56016),[#54897](https://github.com/PaddlePaddle/Paddle/pull/54897) -- 优化静态图显存管理,新增精细化重计算策略;优化混合精度适配,支持用户手动指定 cast 范围等场景;支持 Cross Entropy 的并行计算;支持 scaled_dot_product_attention、fuse_rope 等融合算子;执行调度优化,支持张量并行、流水线并行中通信计算间更好地 Overlap。[#58421](https://github.com/PaddlePaddle/Paddle/pull/58421),[#58533](https://github.com/PaddlePaddle/Paddle/pull/58533),[#59498](https://github.com/PaddlePaddle/Paddle/pull/59498),[#59498](https://github.com/PaddlePaddle/Paddle/pull/59498),[#59187](https://github.com/PaddlePaddle/Paddle/pull/59187),[#59188](https://github.com/PaddlePaddle/Paddle/pull/59188),[#58172](https://github.com/PaddlePaddle/Paddle/pull/58172),[#58628](https://github.com/PaddlePaddle/Paddle/pull/58628),[#56185](https://github.com/PaddlePaddle/Paddle/pull/56185),[#56696](https://github.com/PaddlePaddle/Paddle/pull/56696),[#59497](https://github.com/PaddlePaddle/Paddle/pull/59497),[#58304](https://github.com/PaddlePaddle/Paddle/pull/58304),[#58977](https://github.com/PaddlePaddle/Paddle/pull/58977) - -#### AutoTuner - -本版本实现基于 Profiling 的并行策略自动搜索和调优工具 AutoTuner,能够在给定模型和硬件资源的条件下,自动将并行策略和优化策略进行组合,并选取有效的组合配置运行实验,从而搜索出大模型训练和推理的最佳配置。此外,AutoTuner 
实现了多种剪枝优化策略,包括显存建模等,能够大幅度减少搜索空间和搜索时间。[#54460](https://github.com/PaddlePaddle/Paddle/pull/54460),[#54668](https://github.com/PaddlePaddle/Paddle/pull/54668),[#59794](https://github.com/PaddlePaddle/Paddle/pull/59794),[#59727](https://github.com/PaddlePaddle/Paddle/pull/59727),[#59782](https://github.com/PaddlePaddle/Paddle/pull/59782),[#54834](https://github.com/PaddlePaddle/Paddle/pull/54834),[#58127](https://github.com/PaddlePaddle/Paddle/pull/58127),[#56968](https://github.com/PaddlePaddle/Paddle/pull/56968),[#55466](https://github.com/PaddlePaddle/Paddle/pull/55466),[#56939](https://github.com/PaddlePaddle/Paddle/pull/56939),[#58183](https://github.com/PaddlePaddle/Paddle/pull/58183),[#58314](https://github.com/PaddlePaddle/Paddle/pull/58314),[#55499](https://github.com/PaddlePaddle/Paddle/pull/55499),[#59748](https://github.com/PaddlePaddle/Paddle/pull/59748) - -### 算子库 - -#### 不兼容升级 - -为了提升飞桨框架的可维护性,删除框架中部分废弃的算子(如 diag_v1, isfinite_v1, pad2d_v1 等),通过飞桨 1.x 版本训练所保存的使用到这些算子的模型将无法在飞桨新版本上进行推理。[#57895](https://github.com/PaddlePaddle/Paddle/pull/57895),[#57892](https://github.com/PaddlePaddle/Paddle/pull/57892),[#57898](https://github.com/PaddlePaddle/Paddle/pull/57898),[#57730](https://github.com/PaddlePaddle/Paddle/pull/57730),[#57732](https://github.com/PaddlePaddle/Paddle/pull/57732),[#57810](https://github.com/PaddlePaddle/Paddle/pull/57810),[#57884](https://github.com/PaddlePaddle/Paddle/pull/57884),[#57794](https://github.com/PaddlePaddle/Paddle/pull/57794),[#57926](https://github.com/PaddlePaddle/Paddle/pull/57926),[#57925](https://github.com/PaddlePaddle/Paddle/pull/57925),[#57807](https://github.com/PaddlePaddle/Paddle/pull/57807),[#57808](https://github.com/PaddlePaddle/Paddle/pull/57808) - -#### 算子库功能增强 - -- 飞桨 PHI 算子库复数计算功能进一步增强,累计新增支持复数计算 Kernel 40+。[#55380](https://github.com/PaddlePaddle/Paddle/pull/55380), [#56349](https://github.com/PaddlePaddle/Paddle/pull/56349), [#56412](https://github.com/PaddlePaddle/Paddle/pull/56412), 
[#56323](https://github.com/PaddlePaddle/Paddle/pull/56323), [#56723](https://github.com/PaddlePaddle/Paddle/pull/56723), [#56457](https://github.com/PaddlePaddle/Paddle/pull/56457), [#56903](https://github.com/PaddlePaddle/Paddle/pull/56903), [#56914](https://github.com/PaddlePaddle/Paddle/pull/56914), [#57116](https://github.com/PaddlePaddle/Paddle/pull/57116), [#56048](https://github.com/PaddlePaddle/Paddle/pull/56048), [#57244](https://github.com/PaddlePaddle/Paddle/pull/57244), [#57639](https://github.com/PaddlePaddle/Paddle/pull/57639), [#57638](https://github.com/PaddlePaddle/Paddle/pull/57638), [#57540](https://github.com/PaddlePaddle/Paddle/pull/57540), [#58545](https://github.com/PaddlePaddle/Paddle/pull/58545), [#58336](https://github.com/PaddlePaddle/Paddle/pull/58336), [#58532](https://github.com/PaddlePaddle/Paddle/pull/58532), [#58839](https://github.com/PaddlePaddle/Paddle/pull/58839), [#59079](https://github.com/PaddlePaddle/Paddle/pull/59079), [#59277](https://github.com/PaddlePaddle/Paddle/pull/59277), [#59122](https://github.com/PaddlePaddle/Paddle/pull/59122), [#57058](https://github.com/PaddlePaddle/Paddle/pull/57058) - -- 优化和新增部分算子的 XPU Kernel,并增强了 XPU Kernel 对 bfloat16 等数据类型的运算支持。[#54478](https://github.com/PaddlePaddle/Paddle/pull/54478), [#57740](https://github.com/PaddlePaddle/Paddle/pull/57740), [#58346](https://github.com/PaddlePaddle/Paddle/pull/58346), [#58456](https://github.com/PaddlePaddle/Paddle/pull/58456), [#58662](https://github.com/PaddlePaddle/Paddle/pull/58662), [#59066](https://github.com/PaddlePaddle/Paddle/pull/59066), [#59263](https://github.com/PaddlePaddle/Paddle/pull/59263), [#59375](https://github.com/PaddlePaddle/Paddle/pull/59375), [#59505](https://github.com/PaddlePaddle/Paddle/pull/59505), [#59653](https://github.com/PaddlePaddle/Paddle/pull/59653), [#55001](https://github.com/PaddlePaddle/Paddle/pull/55001), [#57272](https://github.com/PaddlePaddle/Paddle/pull/57272), 
[#56169](https://github.com/PaddlePaddle/Paddle/pull/56169), [#59454](https://github.com/PaddlePaddle/Paddle/pull/59454), [#59480](https://github.com/PaddlePaddle/Paddle/pull/59480), [#55914](https://github.com/PaddlePaddle/Paddle/pull/55914), [#54758](https://github.com/PaddlePaddle/Paddle/pull/54758), [#54827](https://github.com/PaddlePaddle/Paddle/pull/54827), [#58364](https://github.com/PaddlePaddle/Paddle/pull/58364), [#58419](https://github.com/PaddlePaddle/Paddle/pull/58419), [#58982](https://github.com/PaddlePaddle/Paddle/pull/58982), [#57216](https://github.com/PaddlePaddle/Paddle/pull/57216), [#59166](https://github.com/PaddlePaddle/Paddle/pull/59166), [#55033](https://github.com/PaddlePaddle/Paddle/pull/55033), [#55375](https://github.com/PaddlePaddle/Paddle/pull/55375), [#58805](https://github.com/PaddlePaddle/Paddle/pull/58805), [#59389](https://github.com/PaddlePaddle/Paddle/pull/59389), [#57077](https://github.com/PaddlePaddle/Paddle/pull/57077), [#55166](https://github.com/PaddlePaddle/Paddle/pull/55166), [#56773](https://github.com/PaddlePaddle/Paddle/pull/56773) - -- 新增了用于优化大模型训练和推理性能的常见算子。[#55758](https://github.com/PaddlePaddle/Paddle/pull/55758), [#54998](https://github.com/PaddlePaddle/Paddle/pull/54998), [#55400](https://github.com/PaddlePaddle/Paddle/pull/55400), [#54630](https://github.com/PaddlePaddle/Paddle/pull/54630), [#55969](https://github.com/PaddlePaddle/Paddle/pull/55969), [#55026](https://github.com/PaddlePaddle/Paddle/pull/55026), [#58986](https://github.com/PaddlePaddle/Paddle/pull/58986) - -- 完善算子库 Tensor Strided 机制。[#59422](https://github.com/PaddlePaddle/Paddle/pull/59422), [#59325](https://github.com/PaddlePaddle/Paddle/pull/59325), [#56863](https://github.com/PaddlePaddle/Paddle/pull/56863), [#56882](https://github.com/PaddlePaddle/Paddle/pull/56882), [#56947](https://github.com/PaddlePaddle/Paddle/pull/56947) - -- 对算子 Kernel 中的函数实现以及模板调用接口进行了编译优化,降低算子库编包体积。[#57083](https://github.com/PaddlePaddle/Paddle/pull/57083), 
[#57299](https://github.com/PaddlePaddle/Paddle/pull/57299), [#57261](https://github.com/PaddlePaddle/Paddle/pull/57261), [#57290](https://github.com/PaddlePaddle/Paddle/pull/57290), [#57118](https://github.com/PaddlePaddle/Paddle/pull/57118), [#57551](https://github.com/PaddlePaddle/Paddle/pull/57551), [#57509](https://github.com/PaddlePaddle/Paddle/pull/57509), [#57558](https://github.com/PaddlePaddle/Paddle/pull/57558), [#57064](https://github.com/PaddlePaddle/Paddle/pull/57064), [#57365](https://github.com/PaddlePaddle/Paddle/pull/57365), [#57327](https://github.com/PaddlePaddle/Paddle/pull/57327), [#57603](https://github.com/PaddlePaddle/Paddle/pull/57603), [#57671](https://github.com/PaddlePaddle/Paddle/pull/57671), [#57672](https://github.com/PaddlePaddle/Paddle/pull/57672), [#57631](https://github.com/PaddlePaddle/Paddle/pull/57631), [#57082](https://github.com/PaddlePaddle/Paddle/pull/57082), [#57721](https://github.com/PaddlePaddle/Paddle/pull/57721), [#57823](https://github.com/PaddlePaddle/Paddle/pull/57823), [#57821](https://github.com/PaddlePaddle/Paddle/pull/57821), [#57815](https://github.com/PaddlePaddle/Paddle/pull/57815), [#57822](https://github.com/PaddlePaddle/Paddle/pull/57822), [#57541](https://github.com/PaddlePaddle/Paddle/pull/57541), [#57817](https://github.com/PaddlePaddle/Paddle/pull/57817), [#57838](https://github.com/PaddlePaddle/Paddle/pull/57838) +- 分布式通信策略 Sharding 在 Broadcast 参数时采用异步策略,提升计算和通信的 overlap。 [#59745](https://github.com/PaddlePaddle/Paddle/pull/59745) +- 新增 STRIDED Layout 算子支持,提升算子性能。[#62532](https://github.com/PaddlePaddle/Paddle/pull/62532),[#62697](https://github.com/PaddlePaddle/Paddle/pull/62697),[#62649](https://github.com/PaddlePaddle/Paddle/pull/62649) +- 优化 elementwise_mul 算子内存使用。[#62377](https://github.com/PaddlePaddle/Paddle/pull/62377) #### Bug 修复 +- 修复分布式策略 Sharding
下的错误。[#61942](https://github.com/PaddlePaddle/Paddle/pull/61942),[#62236](https://github.com/PaddlePaddle/Paddle/pull/62236),[#62305](https://github.com/PaddlePaddle/Paddle/pull/62305),[#62535](https://github.com/PaddlePaddle/Paddle/pull/62535),[#62572](https://github.com/PaddlePaddle/Paddle/pull/62572),[#61601](https://github.com/PaddlePaddle/Paddle/pull/61601) +- 修复 c_embedding 算子不在 PHI namespace 下导致的算子无法注册的问题。[#60774](https://github.com/PaddlePaddle/Paddle/pull/60774) +- 修复 xccl_comm 释放问题。[#60465](https://github.com/PaddlePaddle/Paddle/pull/60465) +- 修复 index_put 算子 fallback 到 CPU 时导致的数据地址错误。[#61842](https://github.com/PaddlePaddle/Paddle/pull/61842) +- 修复 stream_safe_custom_device_allocator 的问题。[#63369](https://github.com/PaddlePaddle/Paddle/pull/63369) +- 修复分布式下 worker 端口冲突问题。[#61409](https://github.com/PaddlePaddle/Paddle/pull/61409) +- 修复 comm 数据类型以提升设备兼容性。[#62306](https://github.com/PaddlePaddle/Paddle/pull/62306) +- 统一通信数据类型的使用为 phi::DataType。[#62464](https://github.com/PaddlePaddle/Paddle/pull/62464),[#62562](https://github.com/PaddlePaddle/Paddle/pull/62562) +- 修复 PD_ConfigEnableCustomDevice 缺少 precision 参数问题。[#63702](https://github.com/PaddlePaddle/Paddle/pull/63702) + +### 昆仑 XPU -- 修复了飞桨框架适配 CUDA 12 的一些问题。[#54640](https://github.com/PaddlePaddle/Paddle/pull/54640), [#57820](https://github.com/PaddlePaddle/Paddle/pull/57820), [#58958](https://github.com/PaddlePaddle/Paddle/pull/58958), [#58179](https://github.com/PaddlePaddle/Paddle/pull/58179), [#55594](https://github.com/PaddlePaddle/Paddle/pull/55594) - -### CUDA - -#### 新功能 - -- 新增调试类 API paddle.amp.debugging.check_numerics,计算并返回这个 Tensor 数值中异常值(NaN、Inf)和零元素的数量。[#54301](https://github.com/PaddlePaddle/Paddle/pull/54301) -- 新增 fused_rope 融合算子,加速 LLaMA 类大模型训练。[#54351](https://github.com/PaddlePaddle/Paddle/pull/54351) -- 更新 CUDNN Frontend API 版本到 v0.9.1,并新增加速 ResNet 网络的 fused_scale_bias_add_relu 融合算子。注意该功能处于实验期,默认不开启。[#58367](https://github.com/PaddlePaddle/Paddle/pull/58367), 
[#54949](https://github.com/PaddlePaddle/Paddle/pull/54949), [#58504](https://github.com/PaddlePaddle/Paddle/pull/58504) -- 基于 Flash-Attention v2,添加 Tensor 类似 Mask 功能支持,反向算子支持确定性计算,便于调试。[#57276](https://github.com/PaddlePaddle/Paddle/pull/57276), [#56363](https://github.com/PaddlePaddle/Paddle/pull/56363) -- 修改稀疏 conv3d 后端实现以支持 2d 形状,避免前端 reshape 的开销。[#54707](https://github.com/PaddlePaddle/Paddle/pull/54707) -- 新增 matmul_int8 算子。[#55228](https://github.com/PaddlePaddle/Paddle/pull/55228) - -#### 功能优化 - -- 优化 CUDA Graph 对随机数算子的支持。[#58310](https://github.com/PaddlePaddle/Paddle/pull/58310) -- 自动混合精度训练默认功能加强,包括: - - 优化自动混合精度训练接口的使用体验。[#58152](https://github.com/PaddlePaddle/Paddle/pull/58152),[#55364](https://github.com/PaddlePaddle/Paddle/pull/55364),[#57903](https://github.com/PaddlePaddle/Paddle/pull/57903) - - 将 fused_attention、fused_feedforward、fused_gemm_epilogue 等矩阵计算类算子加入框架默认的白名单,并统一动静态图默认黑白名单设置。[#55373](https://github.com/PaddlePaddle/Paddle/pull/55373), [#55713](https://github.com/PaddlePaddle/Paddle/pull/55713) - - argsort、dist、erfinv、nanmedian、poisson 算子和 lamb 优化器算子支持 FP16、BF16 低精度计算。[#51662](https://github.com/PaddlePaddle/Paddle/pull/51662), [#55105](https://github.com/PaddlePaddle/Paddle/pull/55105), [#55287](https://github.com/PaddlePaddle/Paddle/pull/55287), [#55824](https://github.com/PaddlePaddle/Paddle/pull/55824), [#56056](https://github.com/PaddlePaddle/Paddle/pull/56056), [#56184](https://github.com/PaddlePaddle/Paddle/pull/56184), [#55641](https://github.com/PaddlePaddle/Paddle/pull/55641) - - 修复 elementwise_max 算子低精度实现,改成使用 FP32 类型进行数值计算,减少精度损失。[#54799](https://github.com/PaddlePaddle/Paddle/pull/54799) - - 将 Reduce 类算子计算需要的临时结果 Tensor 改成 FP32 类型,避免将中间结果转换成低精度带来的精度损失。[#55709](https://github.com/PaddlePaddle/Paddle/pull/55709) -- flip、roll & roll_grad、index_put & index_put_grad 等算子 GPU 代码实现优化,在性能不下降的前提下移除不必要的 C++模板,优化算子编译耗时并减少编译生成的二进制体积。[#57309](https://github.com/PaddlePaddle/Paddle/pull/57309), 
[#57525](https://github.com/PaddlePaddle/Paddle/pull/57525) -- bernoulli 算子增加对输入概率合法性的检查。[#59174](https://github.com/PaddlePaddle/Paddle/pull/59174) - -#### 性能优化 - -- 优化 BroadcastKernel 对大 Tensor 的支持,改成对大 Tensor 切片多次调用 INT32 版本实现的方式,算子性能提升 7.27x。[#57313](https://github.com/PaddlePaddle/Paddle/pull/57313), [#57996](https://github.com/PaddlePaddle/Paddle/pull/57996) -- 优化 Tensor 保存接口的性能,通过先将 Tensor 拷贝到 CPU 再转 numpy,避免 Tensor 不连续时自动转换成连续 Tensor 的开销。[#57040](https://github.com/PaddlePaddle/Paddle/pull/57040) - -#### Bug Fix - -- 修复 memory_efficient_attention 算子对 sm_90 的支持。[#58070](https://github.com/PaddlePaddle/Paddle/pull/58070) -- 修复 softmax 算子在 axis=-1 且长度大于 100000 时出现的 NaN 问题。[#57851](https://github.com/PaddlePaddle/Paddle/pull/57851) -- 修复 set_constant 算子在一些情况下出现 GPU 访存错误问题。[#59905](https://github.com/PaddlePaddle/Paddle/pull/59905) -- 修复 layer_norm 算子快速实现版本中出现的 GPU 存储读写竞争问题。[#56435](https://github.com/PaddlePaddle/Paddle/pull/56435) - -### 拓展神经网络编译器 CINN 架构能力 - -在本次更新中,飞桨神经网络编译器 CINN 的重点在于架构的梳理和能力的全面扩展。鉴于大模型对动态 Shape 的需求日益增长,初步探索并实现了在动态 shape 下编译器的有效运行和优化策略。 -在架构层面,引入了 Python DSL,这一举措显著提升了 CINN 的开发便捷性和 Debug 能力,使得开发者能够更高效地编写和调试代码。同时,对 Schedule 的逻辑进行了重构,以 GroupSchedule 为主导,从而在算子 Group 层面实现更加通用且稳定的优化策略。为了增强 CINN 的稳定性,探索并引入了强约束组件,这一组件能够有效减少系统中的不确定性和潜在错误。此外,对 CINN 的历史工具类和软件结构进行了系统性的整理、优化和改进,进一步提升了代码的可读性和可维护性。在与飞桨其他组件的整合方面,进一步加强了 CINN 与 PIR、Paddle 的紧密结合,使得编译器与飞桨整体框架更加协调一致。这一改进不仅提升了编译器的性能,还为开发者提供了更加流畅和统一的开发体验。 - -#### 兼容性升级 - -- 更新存储读取接口至兼容 Paddle 2.0。 [#55836](https://github.com/PaddlePaddle/Paddle/pull/55836) -- 更新 relu6 Op Mapper 的兼容性。 [#55611](https://github.com/PaddlePaddle/Paddle/pull/55611) - -#### 改造废弃 - -- 删除旧的 Schedule 形式。 [#55566](https://github.com/PaddlePaddle/Paddle/pull/55566),[#55391](https://github.com/PaddlePaddle/Paddle/pull/55391) -- 删除一些过时测试。 [#56245](https://github.com/PaddlePaddle/Paddle/pull/56245),[#57987](https://github.com/PaddlePaddle/Paddle/pull/57987) -- 删除不再适用的 remove_nested_block Visitor 工具。 
[#56972](https://github.com/PaddlePaddle/Paddle/pull/56972) -- 删除其他无用代码。 [#55413](https://github.com/PaddlePaddle/Paddle/pull/55413) - -#### 新功能 - -- 增加飞桨端 CINN paddle.framework.core.is_run_with_cinn()运行接口。 [#54355](https://github.com/PaddlePaddle/Paddle/pull/54355) -- 增加 CINN 相关算子逻辑,包括各种组合算子拆解逻辑。 [#56072](https://github.com/PaddlePaddle/Paddle/pull/56072),[#58210](https://github.com/PaddlePaddle/Paddle/pull/58210),[#58502](https://github.com/PaddlePaddle/Paddle/pull/58502), [#58591](https://github.com/PaddlePaddle/Paddle/pull/58591), [#58981](https://github.com/PaddlePaddle/Paddle/pull/58981), [#59135](https://github.com/PaddlePaddle/Paddle/pull/59135), [#59274](https://github.com/PaddlePaddle/Paddle/pull/59274), [#59306](https://github.com/PaddlePaddle/Paddle/pull/59306), [#59202](https://github.com/PaddlePaddle/Paddle/pull/59202), [#59176](https://github.com/PaddlePaddle/Paddle/pull/59176), [#59534](https://github.com/PaddlePaddle/Paddle/pull/59534), [#59713](https://github.com/PaddlePaddle/Paddle/pull/59713), [#59798](https://github.com/PaddlePaddle/Paddle/pull/59798);支持 bf16、amp 等形式[#54399](https://github.com/PaddlePaddle/Paddle/pull/54399), [#54368](https://github.com/PaddlePaddle/Paddle/pull/54368), [#54608](https://github.com/PaddlePaddle/Paddle/pull/54608);支持算子零维能力[#54892](https://github.com/PaddlePaddle/Paddle/pull/54892), [#54919](https://github.com/PaddlePaddle/Paddle/pull/54919), [#54907](https://github.com/PaddlePaddle/Paddle/pull/54907), [#54966](https://github.com/PaddlePaddle/Paddle/pull/54966) -- CINN 和飞桨 PIR、组合算子交界运行方式,使新增 PIR 和 CINN 运行浑然一体。 [#54732](https://github.com/PaddlePaddle/Paddle/pull/54732), [#56074](https://github.com/PaddlePaddle/Paddle/pull/56074), [#58216](https://github.com/PaddlePaddle/Paddle/pull/58216), [#55680](https://github.com/PaddlePaddle/Paddle/pull/55680), [#56302](https://github.com/PaddlePaddle/Paddle/pull/56302), [#59037](https://github.com/PaddlePaddle/Paddle/pull/59037), 
[#55186](https://github.com/PaddlePaddle/Paddle/pull/55186), [#58641](https://github.com/PaddlePaddle/Paddle/pull/58641) -- 对 CINN 变化起到稳定作用的强约束组件。 [#58719](https://github.com/PaddlePaddle/Paddle/pull/58719), [#59309](https://github.com/PaddlePaddle/Paddle/pull/59309), [#58993](https://github.com/PaddlePaddle/Paddle/pull/58993) -- Group Schedule 相关的 CINN 架构流程添加。 [#58399](https://github.com/PaddlePaddle/Paddle/pull/58399), [#56444](https://github.com/PaddlePaddle/Paddle/pull/56444) -- CINN 架构功能初步增加 CUTLASS、报错处理、NVRTC Cubin Fmad 选项。 [#58079](https://github.com/PaddlePaddle/Paddle/pull/58079), [#57198](https://github.com/PaddlePaddle/Paddle/pull/57198), [#58794](https://github.com/PaddlePaddle/Paddle/pull/58794) -- CINN 增加 Python 界面语言。 [#57731](https://github.com/PaddlePaddle/Paddle/pull/57731), [#57515](https://github.com/PaddlePaddle/Paddle/pull/57515), [#57644](https://github.com/PaddlePaddle/Paddle/pull/57644), [#57981](https://github.com/PaddlePaddle/Paddle/pull/57981), [#58009](https://github.com/PaddlePaddle/Paddle/pull/58009) -- CINN 增加动态 Shape 功能,涵盖 ASTGen 可以代替 ISL 产生动态 Shape 符号 [#56360](https://github.com/PaddlePaddle/Paddle/pull/56360), [#57207](https://github.com/PaddlePaddle/Paddle/pull/57207), [#57454](https://github.com/PaddlePaddle/Paddle/pull/57454);增加分桶条件编译功能 [#59165](https://github.com/PaddlePaddle/Paddle/pull/59165);增加 Schedule、Device、IR 层面支持动态 shape 的功能 [#58988](https://github.com/PaddlePaddle/Paddle/pull/58988), [#59493](https://github.com/PaddlePaddle/Paddle/pull/59493), [#58717](https://github.com/PaddlePaddle/Paddle/pull/58717), [#58602](https://github.com/PaddlePaddle/Paddle/pull/58602), [#59196](https://github.com/PaddlePaddle/Paddle/pull/59196) -- CINN Group Schedule 算子 Group 层面做更通用稳定的 Schedule 优化。 [#56122](https://github.com/PaddlePaddle/Paddle/pull/56122), [#57777](https://github.com/PaddlePaddle/Paddle/pull/57777), [#57569](https://github.com/PaddlePaddle/Paddle/pull/57569) - -#### 功能优化 - -- 
丰富或改善算子功能,包括修理反向、FP16、Infershape、算子单测等各种算子过程的改善。 [#56320](https://github.com/PaddlePaddle/Paddle/pull/56320), [#56845](https://github.com/PaddlePaddle/Paddle/pull/56845), [#54939](https://github.com/PaddlePaddle/Paddle/pull/54939),[#54378](https://github.com/PaddlePaddle/Paddle/pull/54378),[#55321](https://github.com/PaddlePaddle/Paddle/pull/55321),[#55336](https://github.com/PaddlePaddle/Paddle/pull/55336),[#55337](https://github.com/PaddlePaddle/Paddle/pull/55337),[#55442](https://github.com/PaddlePaddle/Paddle/pull/55442),[#55470](https://github.com/PaddlePaddle/Paddle/pull/55470),[#55489](https://github.com/PaddlePaddle/Paddle/pull/55489),[#55510](https://github.com/PaddlePaddle/Paddle/pull/55510),[#55547](https://github.com/PaddlePaddle/Paddle/pull/55547),[#55505](https://github.com/PaddlePaddle/Paddle/pull/55505),[#55563](https://github.com/PaddlePaddle/Paddle/pull/55563),[#54280](https://github.com/PaddlePaddle/Paddle/pull/54280),[#59650](https://github.com/PaddlePaddle/Paddle/pull/59650),[#54862](https://github.com/PaddlePaddle/Paddle/pull/54862),[#55135](https://github.com/PaddlePaddle/Paddle/pull/55135),[#55292](https://github.com/PaddlePaddle/Paddle/pull/55292),[#55333](https://github.com/PaddlePaddle/Paddle/pull/55333),[#55316](https://github.com/PaddlePaddle/Paddle/pull/55316),[#55379](https://github.com/PaddlePaddle/Paddle/pull/55379),[#55326](https://github.com/PaddlePaddle/Paddle/pull/55326) -- CINN、飞桨、PIR、组合算子交界运行方式改善,主要包括各种和 PIR 及其执行器接口和 CINN 互相支持。 
[#59170](https://github.com/PaddlePaddle/Paddle/pull/59170),[#58766](https://github.com/PaddlePaddle/Paddle/pull/58766),[#59255](https://github.com/PaddlePaddle/Paddle/pull/59255),[#59203](https://github.com/PaddlePaddle/Paddle/pull/59203),[#59024](https://github.com/PaddlePaddle/Paddle/pull/59024),[#57829](https://github.com/PaddlePaddle/Paddle/pull/57829),[#58135](https://github.com/PaddlePaddle/Paddle/pull/58135),[#58193](https://github.com/PaddlePaddle/Paddle/pull/58193),[#58207](https://github.com/PaddlePaddle/Paddle/pull/58207),[#58606](https://github.com/PaddlePaddle/Paddle/pull/58606),[#59437](https://github.com/PaddlePaddle/Paddle/pull/59437),[#59759](https://github.com/PaddlePaddle/Paddle/pull/59759),[#55075](https://github.com/PaddlePaddle/Paddle/pull/55075),[#56805](https://github.com/PaddlePaddle/Paddle/pull/56805),[#57764](https://github.com/PaddlePaddle/Paddle/pull/57764),[#58620](https://github.com/PaddlePaddle/Paddle/pull/58620),[#59769](https://github.com/PaddlePaddle/Paddle/pull/59769),[#58702](https://github.com/PaddlePaddle/Paddle/pull/58702),[#58749](https://github.com/PaddlePaddle/Paddle/pull/58749),[#59025](https://github.com/PaddlePaddle/Paddle/pull/59025),[#58820](https://github.com/PaddlePaddle/Paddle/pull/58820),[#58908](https://github.com/PaddlePaddle/Paddle/pull/58908),[#58169](https://github.com/PaddlePaddle/Paddle/pull/58169) -- 对 CINN 改善稳定作用的强约束组件。 [#55090](https://github.com/PaddlePaddle/Paddle/pull/55090),[#55705](https://github.com/PaddlePaddle/Paddle/pull/55705),[#57587](https://github.com/PaddlePaddle/Paddle/pull/57587),[#59501](https://github.com/PaddlePaddle/Paddle/pull/59501) -- CINN IR 和相关工具代码改善。 
[#55145](https://github.com/PaddlePaddle/Paddle/pull/55145),[#55955](https://github.com/PaddlePaddle/Paddle/pull/55955),[#56307](https://github.com/PaddlePaddle/Paddle/pull/56307),[#55519](https://github.com/PaddlePaddle/Paddle/pull/55519),[#56958](https://github.com/PaddlePaddle/Paddle/pull/56958),[#57019](https://github.com/PaddlePaddle/Paddle/pull/57019),[#57230](https://github.com/PaddlePaddle/Paddle/pull/57230),[#57531](https://github.com/PaddlePaddle/Paddle/pull/57531),[#57532](https://github.com/PaddlePaddle/Paddle/pull/57532),[#57524](https://github.com/PaddlePaddle/Paddle/pull/57524),[#58770](https://github.com/PaddlePaddle/Paddle/pull/58770),[#59337](https://github.com/PaddlePaddle/Paddle/pull/59337),[#59096](https://github.com/PaddlePaddle/Paddle/pull/59096),[#56274](https://github.com/PaddlePaddle/Paddle/pull/56274),[#56350](https://github.com/PaddlePaddle/Paddle/pull/56350),[#57312](https://github.com/PaddlePaddle/Paddle/pull/57312),[#55171](https://github.com/PaddlePaddle/Paddle/pull/55171) -- CINN Group Schedule 算子 Group 层面做更通用稳定的 Schedule 优化。 [#54982](https://github.com/PaddlePaddle/Paddle/pull/54982),[#57963](https://github.com/PaddlePaddle/Paddle/pull/57963),[#58220](https://github.com/PaddlePaddle/Paddle/pull/58220),[#55484](https://github.com/PaddlePaddle/Paddle/pull/55484),[#55935](https://github.com/PaddlePaddle/Paddle/pull/55935),[#55590](https://github.com/PaddlePaddle/Paddle/pull/55590),[#56530](https://github.com/PaddlePaddle/Paddle/pull/56530),[#58344](https://github.com/PaddlePaddle/Paddle/pull/58344),[#59810](https://github.com/PaddlePaddle/Paddle/pull/59810) -- CINN 架构功能改善,包括并行编译、低层存储分配方式、打印信息、Group 结构、Pass 结构等。[#56282](https://github.com/PaddlePaddle/Paddle/pull/56282), 
[#59014](https://github.com/PaddlePaddle/Paddle/pull/59014),[#59209](https://github.com/PaddlePaddle/Paddle/pull/59209),[#52660](https://github.com/PaddlePaddle/Paddle/pull/52660),[#54749](https://github.com/PaddlePaddle/Paddle/pull/54749),[#58694](https://github.com/PaddlePaddle/Paddle/pull/58694),[#58940](https://github.com/PaddlePaddle/Paddle/pull/58940),[#59504](https://github.com/PaddlePaddle/Paddle/pull/59504),[#56123](https://github.com/PaddlePaddle/Paddle/pull/56123) -- CINN 改善 codegen、jit instruction、dim args、host kernel 等以支持动态 Shape 功能。 [#58825](https://github.com/PaddlePaddle/Paddle/pull/58825),[#59395](https://github.com/PaddlePaddle/Paddle/pull/59395),[#59398](https://github.com/PaddlePaddle/Paddle/pull/59398),[#59540](https://github.com/PaddlePaddle/Paddle/pull/59540),[#59470](https://github.com/PaddlePaddle/Paddle/pull/59470),[#59640](https://github.com/PaddlePaddle/Paddle/pull/59640) -- CINN 报错优化。 [#54983](https://github.com/PaddlePaddle/Paddle/pull/54983),[#55544](https://github.com/PaddlePaddle/Paddle/pull/55544) -- CINN 其他代码清理改善、包括 CI、文件路径、C++17、Flags、第三方库、Docker 等 [#55018](https://github.com/PaddlePaddle/Paddle/pull/55018),[#55121](https://github.com/PaddlePaddle/Paddle/pull/55121),[#55009](https://github.com/PaddlePaddle/Paddle/pull/55009),[#55888](https://github.com/PaddlePaddle/Paddle/pull/55888),[#56168](https://github.com/PaddlePaddle/Paddle/pull/56168),[#56192](https://github.com/PaddlePaddle/Paddle/pull/56192),[#56896](https://github.com/PaddlePaddle/Paddle/pull/56896),[#53861](https://github.com/PaddlePaddle/Paddle/pull/53861),[#55208](https://github.com/PaddlePaddle/Paddle/pull/55208) - -#### 性能优化 - -- 对 vit attention 进行融合。 [#54139](https://github.com/PaddlePaddle/Paddle/pull/54139) -- 优化 block reduce。 [#58196](https://github.com/PaddlePaddle/Paddle/pull/58196) - -#### bug 修复 - -- 算子相关 bug 修复。 
[#56280](https://github.com/PaddlePaddle/Paddle/pull/56280),[#57767](https://github.com/PaddlePaddle/Paddle/pull/57767),[#58406](https://github.com/PaddlePaddle/Paddle/pull/58406),[#54406](https://github.com/PaddlePaddle/Paddle/pull/54406),[#54494](https://github.com/PaddlePaddle/Paddle/pull/54494),[#54751](https://github.com/PaddlePaddle/Paddle/pull/54751),[#55674](https://github.com/PaddlePaddle/Paddle/pull/55674),[#55684](https://github.com/PaddlePaddle/Paddle/pull/55684),[#55683](https://github.com/PaddlePaddle/Paddle/pull/55683),[#57798](https://github.com/PaddlePaddle/Paddle/pull/57798),[#57816](https://github.com/PaddlePaddle/Paddle/pull/57816),[#57687](https://github.com/PaddlePaddle/Paddle/pull/57687),[#56719](https://github.com/PaddlePaddle/Paddle/pull/56719),[#59756](https://github.com/PaddlePaddle/Paddle/pull/59756),[#59770](https://github.com/PaddlePaddle/Paddle/pull/59770),[#58811](https://github.com/PaddlePaddle/Paddle/pull/58811) -- 流程架构相关 bug 修复。 [#54899](https://github.com/PaddlePaddle/Paddle/pull/54899),[#59737](https://github.com/PaddlePaddle/Paddle/pull/59737),[#59356](https://github.com/PaddlePaddle/Paddle/pull/59356),[#56105](https://github.com/PaddlePaddle/Paddle/pull/56105),[#56662](https://github.com/PaddlePaddle/Paddle/pull/56662),[#58146](https://github.com/PaddlePaddle/Paddle/pull/58146),[#58910](https://github.com/PaddlePaddle/Paddle/pull/58910),[#58121](https://github.com/PaddlePaddle/Paddle/pull/58121),[#58943](https://github.com/PaddlePaddle/Paddle/pull/58943),[#58886](https://github.com/PaddlePaddle/Paddle/pull/58886),[#59642](https://github.com/PaddlePaddle/Paddle/pull/59642),[#56164](https://github.com/PaddlePaddle/Paddle/pull/56164),[#56338](https://github.com/PaddlePaddle/Paddle/pull/56338),[#56966](https://github.com/PaddlePaddle/Paddle/pull/56966),[#59112](https://github.com/PaddlePaddle/Paddle/pull/59112),[#55820](https://github.com/PaddlePaddle/Paddle/pull/55820),[#56660](https://github.com/PaddlePaddle/Paddle/pull/56660),[#
57307](https://github.com/PaddlePaddle/Paddle/pull/57307),[#57530](https://github.com/PaddlePaddle/Paddle/pull/57530),[#58236](https://github.com/PaddlePaddle/Paddle/pull/58236),[#55190](https://github.com/PaddlePaddle/Paddle/pull/55190),[#55043](https://github.com/PaddlePaddle/Paddle/pull/55043),[#55667](https://github.com/PaddlePaddle/Paddle/pull/55667) -- 其他 bug 修复。 [#57239](https://github.com/PaddlePaddle/Paddle/pull/57239),[#55530](https://github.com/PaddlePaddle/Paddle/pull/55530),[#56605](https://github.com/PaddlePaddle/Paddle/pull/56605),[#58243](https://github.com/PaddlePaddle/Paddle/pull/58243),[#58197](https://github.com/PaddlePaddle/Paddle/pull/58197),[#58197](https://github.com/PaddlePaddle/Paddle/pull/58197),[#56086](https://github.com/PaddlePaddle/Paddle/pull/56086),[#56065](https://github.com/PaddlePaddle/Paddle/pull/56065),[#58775](https://github.com/PaddlePaddle/Paddle/pull/58775),[#54750](https://github.com/PaddlePaddle/Paddle/pull/54750),[#58595](https://github.com/PaddlePaddle/Paddle/pull/58595),[#58873](https://github.com/PaddlePaddle/Paddle/pull/58873) - -#### 文档 - -- 增加 README 文件。 [#58349](https://github.com/PaddlePaddle/Paddle/pull/58349) - -## 4. 
部署方向(Paddle Inference) - -### 通用推理优化 - -本版本升级提升了推理引擎在 GPU 和 CPU 上性能和易用性,降低了用户使用成本和线上推理的应用成本。在 GPU 上支持了高性能的多线程异步执行器,各模型推理性能提升 5%~10%;同时支持新版本 TensorRT 和 BF16 推理能力,TensorRT 推理性能和易用性进一步提升;在 CPU 上,支持最新版本的 OneDNN 高性能推理,在 SwinTransformer、FastRCNN 等系列模型上性能大幅提升。 - -- matmul 支持 transpose、broadcast 操作。 [#56827](https://github.com/PaddlePaddle/Paddle/pull/56827) -- TruncatedNormal and Assign 支持 FP64 数据类型。[#57507](https://github.com/PaddlePaddle/Paddle/pull/57507) -- 支持 conv2d 显式量化推理。[#57160](https://github.com/PaddlePaddle/Paddle/pull/57160),[#58015](https://github.com/PaddlePaddle/Paddle/pull/58015) -- 新增 conv_fuse_pass,支持 conv + bn 融合,conv2d_fusion 融合重命名为 fused_conv2d_add_act。 [#58724](https://github.com/PaddlePaddle/Paddle/pull/58724),[#55374](https://github.com/PaddlePaddle/Paddle/pull/55374),[#54477](https://github.com/PaddlePaddle/Paddle/pull/54477),[#59431](https://github.com/PaddlePaddle/Paddle/pull/59431) -- 混合精度推理支持 OP 白名单。[#56535](https://github.com/PaddlePaddle/Paddle/pull/56535) -- 默认开启 OneDNN 优化,支持 SwinTransformer、FastRCNNd 等推理优化。[#58560](https://github.com/PaddlePaddle/Paddle/pull/58560),[#59394](https://github.com/PaddlePaddle/Paddle/pull/59394),[#59421](https://github.com/PaddlePaddle/Paddle/pull/59421),[#58435](https://github.com/PaddlePaddle/Paddle/pull/58435),[#58488](https://github.com/PaddlePaddle/Paddle/pull/58488),[#59259](https://github.com/PaddlePaddle/Paddle/pull/59259),[#56303](https://github.com/PaddlePaddle/Paddle/pull/56303),[#56782](https://github.com/PaddlePaddle/Paddle/pull/56782),[#57598](https://github.com/PaddlePaddle/Paddle/pull/57598),[#58361](https://github.com/PaddlePaddle/Paddle/pull/58361),[#59641](https://github.com/PaddlePaddle/Paddle/pull/59641),[#59527](https://github.com/PaddlePaddle/Paddle/pull/59527),[#59663](https://github.com/PaddlePaddle/Paddle/pull/59663),[#59744](https://github.com/PaddlePaddle/Paddle/pull/59744) -- 新增 share_data 支持传入指定数据。[#57933](https://github.com/PaddlePaddle/Paddle/pull/57933) - -### 大模型推理优化 - 
-实现了生成式大模型的细粒度融合推理优化,该优化方案既保证了高性能的推理能力,又具备良好的可拓展性。用户可以根据需要,灵活运用各种细粒度融合算子和飞桨原生算子,自由组合构建生成式大模型的网络结构,从而实现高效且低成本的推理。此外,我们的方案还支持主流的生成式大模型结构,显著降低了这类模型的推理部署成本,为生成式大模型的高效、低成本落地提供了有力支持。 - -- 支持 FMHA/MMHA 对 CacheKV 划分 block 调度。[#59462](https://github.com/PaddlePaddle/Paddle/pull/59462) -- RoPE 编码融合算子支持输入 sin/cos 值。[#55415](https://github.com/PaddlePaddle/Paddle/pull/55415) -- 新增细粒度融合算子支持生成式大模型高性能推理优化,新增 quant_linear、weight_quantize、linear_compress 等算子支持大模型量化推理。[#57852](https://github.com/PaddlePaddle/Paddle/pull/57852),[#55128](https://github.com/PaddlePaddle/Paddle/pull/55128),[#59090](https://github.com/PaddlePaddle/Paddle/pull/59090),[#56706](https://github.com/PaddlePaddle/Paddle/pull/56706),[#59951](https://github.com/PaddlePaddle/Paddle/pull/59951),[#55490](https://github.com/PaddlePaddle/Paddle/pull/55490),[#59291](https://github.com/PaddlePaddle/Paddle/pull/59291),[#59441](https://github.com/PaddlePaddle/Paddle/pull/59441),[#59778](https://github.com/PaddlePaddle/Paddle/pull/59778),[#59651](https://github.com/PaddlePaddle/Paddle/pull/59651)[#55301](https://github.com/PaddlePaddle/Paddle/pull/55301),[#58637](https://github.com/PaddlePaddle/Paddle/pull/58637),[#56673](https://github.com/PaddlePaddle/Paddle/pull/56673),[#56401](https://github.com/PaddlePaddle/Paddle/pull/56401) -- 支持变长推理系列 API。[#57948](https://github.com/PaddlePaddle/Paddle/pull/57948) -- 支持 GQA 推理。[#58472](https://github.com/PaddlePaddle/Paddle/pull/58472),[#58836](https://github.com/PaddlePaddle/Paddle/pull/58836) -- 新增 masked multihead attention 支持高性能 MMHA 推理。[#55344](https://github.com/PaddlePaddle/Paddle/pull/55344),[#56411](https://github.com/PaddlePaddle/Paddle/pull/56411),[#58134](https://github.com/PaddlePaddle/Paddle/pull/58134),[#57936](https://github.com/PaddlePaddle/Paddle/pull/57936) -- weight_quantize/weight_only_linear 支持 Volta 架构。[#58082](https://github.com/PaddlePaddle/Paddle/pull/58082) -- 新增 weight_only_linear_grad 支持大模型 weight only 
量化梯度回传。[#57685](https://github.com/PaddlePaddle/Paddle/pull/57685) -- 修复大模型动转静问题,优化静态图卡间通信初始化逻辑。[#56390](https://github.com/PaddlePaddle/Paddle/pull/56390),[#57169](https://github.com/PaddlePaddle/Paddle/pull/57169),[#56688](https://github.com/PaddlePaddle/Paddle/pull/56688),[#56592](https://github.com/PaddlePaddle/Paddle/pull/56592),[#58868](https://github.com/PaddlePaddle/Paddle/pull/58868) -- 优化 top_p_sampling 随机数生成逻辑。[#59494](https://github.com/PaddlePaddle/Paddle/pull/59494) - -### Paddle-TensorRT 推理优化 - -- elementwise_add 融合支持 NHWC 格式。 [#56795](https://github.com/PaddlePaddle/Paddle/pull/56795) -- conv2d 支持 filter 作为输入。[#55246](https://github.com/PaddlePaddle/Paddle/pull/55246)。 -- 支持 BF16、FP64 推理。[#59765](https://github.com/PaddlePaddle/Paddle/pull/59765),[#55520](https://github.com/PaddlePaddle/Paddle/pull/55520) -- 新增 MarkTrtEngineOutputs API 支持指定 TensorRT Engine 输出。 [#56858](https://github.com/PaddlePaddle/Paddle/pull/56858),[#56188](https://github.com/PaddlePaddle/Paddle/pull/56188),[#57407](https://github.com/PaddlePaddle/Paddle/pull/57407) -- 支持自定义 OP 自动生成 TensorRT Plugin。[#58976](https://github.com/PaddlePaddle/Paddle/pull/58976),[#56037](https://github.com/PaddlePaddle/Paddle/pull/56037) -- TensorRT 推理支持指定输入 hook,优化 shape 收集流程。[#59466](https://github.com/PaddlePaddle/Paddle/pull/59466),[#54841](https://github.com/PaddlePaddle/Paddle/pull/54841),[#57498](https://github.com/PaddlePaddle/Paddle/pull/57498),[#54861](https://github.com/PaddlePaddle/Paddle/pull/54861),[#54432](https://github.com/PaddlePaddle/Paddle/pull/54432),[#55503](https://github.com/PaddlePaddle/Paddle/pull/55503) -- TensorRT 推理支持保存 Tuning 后的推理模型。[#55893](https://github.com/PaddlePaddle/Paddle/pull/55893),[#56952](https://github.com/PaddlePaddle/Paddle/pull/56952),[#57031](https://github.com/PaddlePaddle/Paddle/pull/57031) -- 支持变长 Transformer 模型 PromptTuning。[#57034](https://github.com/PaddlePaddle/Paddle/pull/57034) -- 新增 
bitwise_and、bitwise_or、bitwise_not、cumsum、einsum、lookup_table、assign、flip、size、scatter、solve、unbind、reduce、argsort 算子支持,优化已有算子支持。[#59214](https://github.com/PaddlePaddle/Paddle/pull/59214),[#59293](https://github.com/PaddlePaddle/Paddle/pull/59293),[#54882](https://github.com/PaddlePaddle/Paddle/pull/54882),[#54097](https://github.com/PaddlePaddle/Paddle/pull/54097),[#54860](https://github.com/PaddlePaddle/Paddle/pull/54860),[#55426](https://github.com/PaddlePaddle/Paddle/pull/55426),[#54372](https://github.com/PaddlePaddle/Paddle/pull/54372),[#55688](https://github.com/PaddlePaddle/Paddle/pull/55688),[#56069](https://github.com/PaddlePaddle/Paddle/pull/56069),[#59563](https://github.com/PaddlePaddle/Paddle/pull/59563),[#59317](https://github.com/PaddlePaddle/Paddle/pull/59317),[#59424](https://github.com/PaddlePaddle/Paddle/pull/59424),[#55476](https://github.com/PaddlePaddle/Paddle/pull/55476),[#56043](https://github.com/PaddlePaddle/Paddle/pull/56043),[#58549](https://github.com/PaddlePaddle/Paddle/pull/58549),[#57326](https://github.com/PaddlePaddle/Paddle/pull/57326),[#59409](https://github.com/PaddlePaddle/Paddle/pull/59409)) -- TensorRT 默认开启显存共享。[#59495](https://github.com/PaddlePaddle/Paddle/pull/59495),[#58251](https://github.com/PaddlePaddle/Paddle/pull/58251) -- PrelnResidualBiasPluginDynamic 支持 4D 输入。[#56304](https://github.com/PaddlePaddle/Paddle/pull/56304) -- 新增 SM80 以下架构 Paddle-TRT 推理对 FlashAttention 的支持。[#56492](https://github.com/PaddlePaddle/Paddle/pull/56492) - -### 改造废弃 - -- OneDNN 中删除 fc_elementwise_add 融合。[#55504](https://github.com/PaddlePaddle/Paddle/pull/55504) -- 删除 redunant op。 [#54442](https://github.com/PaddlePaddle/Paddle/pull/54442) - -### Bug Fix - -- 修复 Inference so 链接 flags 冲突问题。[#59755](https://github.com/PaddlePaddle/Paddle/pull/59755) -- 修复 constant_folding pass 执行报错。[#55556](https://github.com/PaddlePaddle/Paddle/pull/55556) -- 修复 softmax 
前向速度问题及反向精度问题。[#56036](https://github.com/PaddlePaddle/Paddle/pull/56036),[#57858](https://github.com/PaddlePaddle/Paddle/pull/57858)[#57538](https://github.com/PaddlePaddle/Paddle/pull/57538) -- 修复自定义 OP while 报错及导出问题。[#58898](https://github.com/PaddlePaddle/Paddle/pull/58898),[#59318](https://github.com/PaddlePaddle/Paddle/pull/59318) -- 修复 Windows 平台 CUDA 12.0 编译问题。[#59852](https://github.com/PaddlePaddle/Paddle/pull/59852) -- 修复 TensorRT 版本大于等于 8.6 时推理部分算子报错问题。[#54379](https://github.com/PaddlePaddle/Paddle/pull/54379),[#54679](https://github.com/PaddlePaddle/Paddle/pull/54679),[#54251](https://github.com/PaddlePaddle/Paddle/pull/54251) -- 修复、删除推理融合 Pass。[#54846](https://github.com/PaddlePaddle/Paddle/pull/54846),[#54887](https://github.com/PaddlePaddle/Paddle/pull/54887),[#55573](https://github.com/PaddlePaddle/Paddle/pull/55573),[#56434](https://github.com/PaddlePaddle/Paddle/pull/56434),[#56326](https://github.com/PaddlePaddle/Paddle/pull/56326),[#56753](https://github.com/PaddlePaddle/Paddle/pull/56753),[#57491](https://github.com/PaddlePaddle/Paddle/pull/57491),[#56909](https://github.com/PaddlePaddle/Paddle/pull/56909),[#54536](https://github.com/PaddlePaddle/Paddle/pull/54536),[#55073](https://github.com/PaddlePaddle/Paddle/pull/55073),[#55081](https://github.com/PaddlePaddle/Paddle/pull/55081),[#55240](https://github.com/PaddlePaddle/Paddle/pull/55240),[#56439](https://github.com/PaddlePaddle/Paddle/pull/56439),[#59009](https://github.com/PaddlePaddle/Paddle/pull/59009) -- 修复多 Stream 推理上下文切换报错问题。[#57629](https://github.com/PaddlePaddle/Paddle/pull/57629),[#58048](https://github.com/PaddlePaddle/Paddle/pull/58048),[#54994](https://github.com/PaddlePaddle/Paddle/pull/54994) - -## 5. 
硬件适配 - -### 硬件适配方案 (Custom Device) - -在本次更新中,新增了对分布式高级策略、自定义算子和自定义融合策略的支持。通过升级分布式通信库,新增了对 MP、GroupShared、PP、SP 和 MOE 等多项高级分布式策略的支持。同时支持厂商灵活接入不同颗粒度的 Transformer 算子库并通过融合 Pass 修改计算图进行性能加速。 - -#### 新功能 - -- CustomDevice 升级对 Paddle 最新分布式通信库 CommContext 的支持,并新增了 GroupShared 和 MOE 等多种高级分布式策略。[#56301](https://github.com/PaddlePaddle/Paddle/pull/56301),[#54671](https://github.com/PaddlePaddle/Paddle/pull/54671),[#57957](https://github.com/PaddlePaddle/Paddle/pull/57957),[#56669](https://github.com/PaddlePaddle/Paddle/pull/56669),[#54384](https://github.com/PaddlePaddle/Paddle/pull/54384),[#54572](https://github.com/PaddlePaddle/Paddle/pull/54572),[#54573](https://github.com/PaddlePaddle/Paddle/pull/54573),[#54676](https://github.com/PaddlePaddle/Paddle/pull/54676) -- 新增 CustomDevice 对 CustomOP 的支持,并可注册 Paddle PHI 算子库中尚未定义的算子,同时新增 CustomDevice 通过 CAPI 支持 CustomOP。[#57038](https://github.com/PaddlePaddle/Paddle/pull/57038),[#55532](https://github.com/PaddlePaddle/Paddle/pull/55532),[#56755](https://github.com/PaddlePaddle/Paddle/pull/56755),[#55533](https://github.com/PaddlePaddle/Paddle/pull/55533),[#55659](https://github.com/PaddlePaddle/Paddle/pull/55659) -- 新增 CustomDevice 对 CustomPass 功能的支持,支持通过 Python API 修改计算图 IR。[#55511](https://github.com/PaddlePaddle/Paddle/pull/55511),[#55728](https://github.com/PaddlePaddle/Paddle/pull/55728) -- 新增 CustomDevice 对 Paddle run_check 健康功能检查的支持。[#56318](https://github.com/PaddlePaddle/Paddle/pull/56318) -- 新增 CustomDevice 对 StreamSafeAllocator 的支持。[#55393](https://github.com/PaddlePaddle/Paddle/pull/55393),[#56380](https://github.com/PaddlePaddle/Paddle/pull/56380),[#56536](https://github.com/PaddlePaddle/Paddle/pull/56536),[#58035](https://github.com/PaddlePaddle/Paddle/pull/58035) -- 新增 CustomDevice 对 DataTransform 的支持。[#56627](https://github.com/PaddlePaddle/Paddle/pull/56627) - -#### 功能优化 - -- 新增 CustomDevice,支持飞桨更多的接口,包括 
Variable.set_value,adamw,share_external_data,mp_allreduce_sum,tensor.numpy,get_paddle_place, GeneratorState。[#55272](https://github.com/PaddlePaddle/Paddle/pull/55272), [#56386](https://github.com/PaddlePaddle/Paddle/pull/56386), [#57253](https://github.com/PaddlePaddle/Paddle/pull/57253), [#56927](https://github.com/PaddlePaddle/Paddle/pull/56927),[#56189](https://github.com/PaddlePaddle/Paddle/pull/56189),[#55225](https://github.com/PaddlePaddle/Paddle/pull/55225),[#55247](https://github.com/PaddlePaddle/Paddle/pull/55247) -- 修改 CustomDevice 动态库加载方式,从 RTLD_NOW 改为 RTLD_LAZY,方便后续检查 Custom Device 相关软件栈版本的兼容性。 [#57544](https://github.com/PaddlePaddle/Paddle/pull/57544) -- 新增 CustomDevice 在混合精度训练下对 FP16 算子的检测功能。[#56053](https://github.com/PaddlePaddle/Paddle/pull/56053),[#56176](https://github.com/PaddlePaddle/Paddle/pull/56176) - -#### Bug Fix - -- 修复 CustomDevice 对分布式通信库支持上的一些问题。[#55293](https://github.com/PaddlePaddle/Paddle/pull/55293),[#58038](https://github.com/PaddlePaddle/Paddle/pull/58038),[#59800](https://github.com/PaddlePaddle/Paddle/pull/59800) -- 修复 CustomDevice 在部分算子上的问题,包括 c_softmax_with_cross_entropy,data loader,SplitDenseTensor,grad accumulation,atan2 grad。[#56486](https://github.com/PaddlePaddle/Paddle/pull/56486),[#55541](https://github.com/PaddlePaddle/Paddle/pull/55541),[#55615](https://github.com/PaddlePaddle/Paddle/pull/55615),[#56052](https://github.com/PaddlePaddle/Paddle/pull/56052),[#56067](https://github.com/PaddlePaddle/Paddle/pull/56067) -- 修复 CustomDevice 中设备管理的一些问题,包括设备异常 ([#56556](https://github.com/PaddlePaddle/Paddle/pull/56556),[#58639](https://github.com/PaddlePaddle/Paddle/pull/58639),[#55173](https://github.com/PaddlePaddle/Paddle/pull/55173)), 异常事件([#56745](https://github.com/PaddlePaddle/Paddle/pull/56745),[#58059](https://github.com/PaddlePaddle/Paddle/pull/58059)), 
显存异常([#56977](https://github.com/PaddlePaddle/Paddle/pull/56977),[#59247](https://github.com/PaddlePaddle/Paddle/pull/59247),[#54606](https://github.com/PaddlePaddle/Paddle/pull/54606)), 设备初始化 ([#57099](https://github.com/PaddlePaddle/Paddle/pull/57099),[#57994](https://github.com/PaddlePaddle/Paddle/pull/57994)),设备释放([#54932](https://github.com/PaddlePaddle/Paddle/pull/54932),[#55351](https://github.com/PaddlePaddle/Paddle/pull/55351),[#55783](https://github.com/PaddlePaddle/Paddle/pull/55783)),和设备资源池等。([#55229](https://github.com/PaddlePaddle/Paddle/pull/55229),[#56580](https://github.com/PaddlePaddle/Paddle/pull/56580)) -- 修复 CustomDevice 编译相关问题。[#56760](https://github.com/PaddlePaddle/Paddle/pull/56760),[#56766](https://github.com/PaddlePaddle/Paddle/pull/56766) - -### 昆仑芯 XPU - -#### 新功能 - -- 新增 XPTI (XPU Profiling Tool Interface) 支持运行时性能数据的采集和分析功能。[#54685](https://github.com/PaddlePaddle/Paddle/pull/54685),[#54690](https://github.com/PaddlePaddle/Paddle/pull/54690),[#54800](https://github.com/PaddlePaddle/Paddle/pull/54800) -- 完成对 Paddle 最新分布式通信库 CommContext 的支持。[#59418](https://github.com/PaddlePaddle/Paddle/pull/59418) -- 新增 XPU 融合算子包括 fast_where。[#55628](https://github.com/PaddlePaddle/Paddle/pull/55628) -- 新增 XPU Plugin 功能支持,方便用户通过 XTDK 编程方式开发 XPU 自定义算子。[#55101](https://github.com/PaddlePaddle/Paddle/pull/55101),[#59326](https://github.com/PaddlePaddle/Paddle/pull/59326) -- 新增 XPU 对 AutoGrowthAllocator 的支持。[#54121](https://github.com/PaddlePaddle/Paddle/pull/54121) -- 新增昆仑芯 3 的算子支持列表。[#57683](https://github.com/PaddlePaddle/Paddle/pull/57683) - -#### 功能优化 - -- 对 XPU Inference API 进行升级。[#54342](https://github.com/PaddlePaddle/Paddle/pull/54342) -- 优化部分 XPU 算子性能和新增部分 XPU 算子对 bf16 的支持,包括 unique/index_put,squeeze/unsqueeze kernels,swish/swish_grad,scatter_nd_add_grad/slice,rsqrt/bitwise_or/arange_tensor,where,collective 
算子等。[#56582](https://github.com/PaddlePaddle/Paddle/pull/56582),[#58161](https://github.com/PaddlePaddle/Paddle/pull/58161),[#58440](https://github.com/PaddlePaddle/Paddle/pull/58440),[#58580](https://github.com/PaddlePaddle/Paddle/pull/58580),[#58950](https://github.com/PaddlePaddle/Paddle/pull/58950),[#58616](https://github.com/PaddlePaddle/Paddle/pull/58616),[#59273](https://github.com/PaddlePaddle/Paddle/pull/59273) -- 优化 XPU 内存管理,避免内存泄漏。[#59334](https://github.com/PaddlePaddle/Paddle/pull/59334),[#54847](https://github.com/PaddlePaddle/Paddle/pull/54847) -- 支持 INT8 推理。[#57258](https://github.com/PaddlePaddle/Paddle/pull/57258) -- 新增 FP16 系列推理算子支持。[#55642](https://github.com/PaddlePaddle/Paddle/pull/55642),[#54410](https://github.com/PaddlePaddle/Paddle/pull/54410) -- 支持 share_external_memory 接口传入输入输出。[#55170](https://github.com/PaddlePaddle/Paddle/pull/55170) -- 开源量化模型 XPU 推理支持。[#58568](https://github.com/PaddlePaddle/Paddle/pull/58568) -- 新增 context_gm_size 配置代替在 Pass 中分配 global memory。[#54674](https://github.com/PaddlePaddle/Paddle/pull/54674) -- 新增 embedding、fast_gather_nd plugin。[#56488](https://github.com/PaddlePaddle/Paddle/pull/56488),[#56103](https://github.com/PaddlePaddle/Paddle/pull/56103) -- 支持 fast_layernorm + leaky_relu 融合。[#57113](https://github.com/PaddlePaddle/Paddle/pull/57113) -- KL1 和 KL2 精度下 elementwise_min/max/floordiv/where 推理支持。[#58422](https://github.com/PaddlePaddle/Paddle/pull/58422) -- 支持 fc 和 conv2d 算子 autotune 配置。[#58801](https://github.com/PaddlePaddle/Paddle/pull/58801) -- 支持 conv 和 fc 动态量化。[#59307](https://github.com/PaddlePaddle/Paddle/pull/59307) -- fc + act 融合支持 sigmoid、swish 和 relu6。[#54486](https://github.com/PaddlePaddle/Paddle/pull/54486) -- elementwise_sub/elementwise_div 支持 int 数据类型。[#55920](https://github.com/PaddlePaddle/Paddle/pull/55920) - -#### Bug Fix - -- 修复 XPU 通信库问题和部分算子问题包括 rnn、layer_norm_grad、yolo_box。 
[#55475](https://github.com/PaddlePaddle/Paddle/pull/55475),[#55515](https://github.com/PaddlePaddle/Paddle/pull/55515),[#55656](https://github.com/PaddlePaddle/Paddle/pull/55656),[#54669](https://github.com/PaddlePaddle/Paddle/pull/54669),[#55310](https://github.com/PaddlePaddle/Paddle/pull/55310) - -### 海光 DCU - -#### Bug Fix - -- 修复海光 DCU 部分算子问题,包括 rnn,concat/split,fft 等。[#59402](https://github.com/PaddlePaddle/Paddle/pull/59402),[#55821](https://github.com/PaddlePaddle/Paddle/pull/55821),[#56340](https://github.com/PaddlePaddle/Paddle/pull/56340) -- 修复海光 DCU 通信库相关问题。[#57110](https://github.com/PaddlePaddle/Paddle/pull/57110) -- 修复海光 DCU 编译相关问题。[#59775](https://github.com/PaddlePaddle/Paddle/pull/59775),[#55507](https://github.com/PaddlePaddle/Paddle/pull/55507),[#55612](https://github.com/PaddlePaddle/Paddle/pull/55612),[#54952](https://github.com/PaddlePaddle/Paddle/pull/54952),[#55076](https://github.com/PaddlePaddle/Paddle/pull/55076),[#56079](https://github.com/PaddlePaddle/Paddle/pull/56079),[#54874](https://github.com/PaddlePaddle/Paddle/pull/54874) -- 修复海光 DCU 对 BF16 数据类型的支持问题。[#56517](https://github.com/PaddlePaddle/Paddle/pull/56517) - -## 6. 
环境适配 - -采用模块化编译的方式优化了 CMake 代码的逻辑,提升了飞桨全量编译和增量编译的效率,提升了 RD 本地开发效率,同时支持了 Python3.12,CUDA12,Hopper 架构编译,并引入 Clang 等工具全面优化了代码格式。此外,将 C++单测从链接静态库的方式转变为链接动态库,减小编译体积。这些改进措施为用户提供更加流畅、高效地安装和开发体验。 - -- CMake 代码优化:分模块和目录编译成独立的静态库,并减少编译依赖,提升增量编译效率。[#59095](https://github.com/PaddlePaddle/Paddle/pull/59095), [#58960](https://github.com/PaddlePaddle/Paddle/pull/58960),[#56591](https://github.com/PaddlePaddle/Paddle/pull/56591),[#58484](https://github.com/PaddlePaddle/Paddle/pull/58484) -- CMake 编译分层:将公共组件拆分到公有 common 库,自下而上实现飞桨架构的编译分层,提高编译效率。[#56442](https://github.com/PaddlePaddle/Paddle/pull/56442),[#54729](https://github.com/PaddlePaddle/Paddle/pull/54729),[#55733](https://github.com/PaddlePaddle/Paddle/pull/55733),[#56352](https://github.com/PaddlePaddle/Paddle/pull/56352),[#55109](https://github.com/PaddlePaddle/Paddle/pull/55109),[#54992](https://github.com/PaddlePaddle/Paddle/pull/54992),[#57698](https://github.com/PaddlePaddle/Paddle/pull/57698),[#55147](https://github.com/PaddlePaddle/Paddle/pull/55147),[#55113](https://github.com/PaddlePaddle/Paddle/pull/55113),[#56691](https://github.com/PaddlePaddle/Paddle/pull/56691),[#58618](https://github.com/PaddlePaddle/Paddle/pull/58618),[#58899](https://github.com/PaddlePaddle/Paddle/pull/58899),[#59140](https://github.com/PaddlePaddle/Paddle/pull/59140),[#59129](https://github.com/PaddlePaddle/Paddle/pull/59129),[#59222](https://github.com/PaddlePaddle/Paddle/pull/59222),[#59105](https://github.com/PaddlePaddle/Paddle/pull/59105),[#59711](https://github.com/PaddlePaddle/Paddle/pull/59711) -- 第三方库离线编译:将第三方依赖库离线编译,CI/CE 系统无需每次编译重复下载第三方库,提升 CI/CE 
系统运行效率。[#54344](https://github.com/PaddlePaddle/Paddle/pull/54344),[#54370](https://github.com/PaddlePaddle/Paddle/pull/54370),[#54466](https://github.com/PaddlePaddle/Paddle/pull/54466),[#54438](https://github.com/PaddlePaddle/Paddle/pull/54438),[#54388](https://github.com/PaddlePaddle/Paddle/pull/54388),[#54436](https://github.com/PaddlePaddle/Paddle/pull/54436),[#54392](https://github.com/PaddlePaddle/Paddle/pull/54392),[#54646](https://github.com/PaddlePaddle/Paddle/pull/54646),[#54380](https://github.com/PaddlePaddle/Paddle/pull/54380),[#55501](https://github.com/PaddlePaddle/Paddle/pull/55501),[#55136](https://github.com/PaddlePaddle/Paddle/pull/55136),[#54451](https://github.com/PaddlePaddle/Paddle/pull/54451),[#55631](https://github.com/PaddlePaddle/Paddle/pull/55631),[#55549](https://github.com/PaddlePaddle/Paddle/pull/55549),[#56165](https://github.com/PaddlePaddle/Paddle/pull/56165),[#54391](https://github.com/PaddlePaddle/Paddle/pull/54391),[#54614](https://github.com/PaddlePaddle/Paddle/pull/54614),[#54522](https://github.com/PaddlePaddle/Paddle/pull/54522),[#54764](https://github.com/PaddlePaddle/Paddle/pull/54764),[#54400](https://github.com/PaddlePaddle/Paddle/pull/54400),[#54322](https://github.com/PaddlePaddle/Paddle/pull/54322) -- 飞桨支持 Python 3.12。[#59396](https://github.com/PaddlePaddle/Paddle/pull/59396),[#58069](https://github.com/PaddlePaddle/Paddle/pull/58069) -- 使用 Clang 
等工具对于源代码进行优化,提升代码质量。[#59626](https://github.com/PaddlePaddle/Paddle/pull/59626),[#55895](https://github.com/PaddlePaddle/Paddle/pull/55895),[#56632](https://github.com/PaddlePaddle/Paddle/pull/56632),[#54449](https://github.com/PaddlePaddle/Paddle/pull/54449),[#54523](https://github.com/PaddlePaddle/Paddle/pull/54523),[#54796](https://github.com/PaddlePaddle/Paddle/pull/54796),[#55847](https://github.com/PaddlePaddle/Paddle/pull/55847),[#55807](https://github.com/PaddlePaddle/Paddle/pull/55807),[#56261](https://github.com/PaddlePaddle/Paddle/pull/56261),[#57522](https://github.com/PaddlePaddle/Paddle/pull/57522),[#57868](https://github.com/PaddlePaddle/Paddle/pull/57868),[#57809](https://github.com/PaddlePaddle/Paddle/pull/57809),[#55658](https://github.com/PaddlePaddle/Paddle/pull/55658),[#58285](https://github.com/PaddlePaddle/Paddle/pull/58285),[#55491](https://github.com/PaddlePaddle/Paddle/pull/55491),[#55506](https://github.com/PaddlePaddle/Paddle/pull/55506),[#55279](https://github.com/PaddlePaddle/Paddle/pull/55279),[#55741](https://github.com/PaddlePaddle/Paddle/pull/55741),[#55894](https://github.com/PaddlePaddle/Paddle/pull/55894),[#55704](https://github.com/PaddlePaddle/Paddle/pull/55704),[#55800](https://github.com/PaddlePaddle/Paddle/pull/55800),[#55799](https://github.com/PaddlePaddle/Paddle/pull/55799),[#55983](https://github.com/PaddlePaddle/Paddle/pull/55983),[#55954](https://github.com/PaddlePaddle/Paddle/pull/55954),[#55764](https://github.com/PaddlePaddle/Paddle/pull/55764),[#56246](https://github.com/PaddlePaddle/Paddle/pull/56246),[#56219](https://github.com/PaddlePaddle/Paddle/pull/56219),[#56217](https://github.com/PaddlePaddle/Paddle/pull/56217),[#56216](https://github.com/PaddlePaddle/Paddle/pull/56216),[#56208](https://github.com/PaddlePaddle/Paddle/pull/56208),[#56134](https://github.com/PaddlePaddle/Paddle/pull/56134),[#56253](https://github.com/PaddlePaddle/Paddle/pull/56253),[#56255](https://github.com/PaddlePaddle/Paddle/pull/56255),
[#56693](https://github.com/PaddlePaddle/Paddle/pull/56693),[#56692](https://github.com/PaddlePaddle/Paddle/pull/56692),[#56637](https://github.com/PaddlePaddle/Paddle/pull/56637),[#56636](https://github.com/PaddlePaddle/Paddle/pull/56636),[#56647](https://github.com/PaddlePaddle/Paddle/pull/56647),[#56218](https://github.com/PaddlePaddle/Paddle/pull/56218),[#56640](https://github.com/PaddlePaddle/Paddle/pull/56640),[#56635](https://github.com/PaddlePaddle/Paddle/pull/56635),[#55675](https://github.com/PaddlePaddle/Paddle/pull/55675),[#56601](https://github.com/PaddlePaddle/Paddle/pull/56601),[#56485](https://github.com/PaddlePaddle/Paddle/pull/56485),[#56648](https://github.com/PaddlePaddle/Paddle/pull/56648),[#56747](https://github.com/PaddlePaddle/Paddle/pull/56747),[#56676](https://github.com/PaddlePaddle/Paddle/pull/56676),[#56649](https://github.com/PaddlePaddle/Paddle/pull/56649),[#56895](https://github.com/PaddlePaddle/Paddle/pull/56895),[#56994](https://github.com/PaddlePaddle/Paddle/pull/56994),[#56904](https://github.com/PaddlePaddle/Paddle/pull/56904),[#56744](https://github.com/PaddlePaddle/Paddle/pull/56744),[#56954](https://github.com/PaddlePaddle/Paddle/pull/56954),[#57114](https://github.com/PaddlePaddle/Paddle/pull/57114),[#57343](https://github.com/PaddlePaddle/Paddle/pull/57343),[#57483](https://github.com/PaddlePaddle/Paddle/pull/57483),[#57871](https://github.com/PaddlePaddle/Paddle/pull/57871),[#57861](https://github.com/PaddlePaddle/Paddle/pull/57861),[#58028](https://github.com/PaddlePaddle/Paddle/pull/58028),[#57627](https://github.com/PaddlePaddle/Paddle/pull/57627),[#59072](https://github.com/PaddlePaddle/Paddle/pull/59072) -- 
C++从链接静态库转变为链接动态库,减小编译体积,提升编译效率。[#59477](https://github.com/PaddlePaddle/Paddle/pull/59477),[#56630](https://github.com/PaddlePaddle/Paddle/pull/56630),[#57789](https://github.com/PaddlePaddle/Paddle/pull/57789),[#54257](https://github.com/PaddlePaddle/Paddle/pull/54257),[#59620](https://github.com/PaddlePaddle/Paddle/pull/59620),[#59384](https://github.com/PaddlePaddle/Paddle/pull/59384),[#59619](https://github.com/PaddlePaddle/Paddle/pull/59619),[#58583](https://github.com/PaddlePaddle/Paddle/pull/58583),[#58821](https://github.com/PaddlePaddle/Paddle/pull/58821),[#58710](https://github.com/PaddlePaddle/Paddle/pull/58710),[#58619](https://github.com/PaddlePaddle/Paddle/pull/58619) -- 修复源代码编译相关的问题,提升编译安装效率。[#56617](https://github.com/PaddlePaddle/Paddle/pull/56617),[#58195](https://github.com/PaddlePaddle/Paddle/pull/58195),[#56136](https://github.com/PaddlePaddle/Paddle/pull/56136),[#54540](https://github.com/PaddlePaddle/Paddle/pull/54540),[#57172](https://github.com/PaddlePaddle/Paddle/pull/57172),[#54429](https://github.com/PaddlePaddle/Paddle/pull/54429),[#55603](https://github.com/PaddlePaddle/Paddle/pull/55603),[#54807](https://github.com/PaddlePaddle/Paddle/pull/54807),[#56102](https://github.com/PaddlePaddle/Paddle/pull/56102),[#56829](https://github.com/PaddlePaddle/Paddle/pull/56829),[#56951](https://github.com/PaddlePaddle/Paddle/pull/56951),[#56555](https://github.com/PaddlePaddle/Paddle/pull/56555),[#57781](https://github.com/PaddlePaddle/Paddle/pull/57781),[#57836](https://github.com/PaddlePaddle/Paddle/pull/57836),[#58807](https://github.com/PaddlePaddle/Paddle/pull/58807),[#54535](https://github.com/PaddlePaddle/Paddle/pull/54535),[#54946](https://github.com/PaddlePaddle/Paddle/pull/54946),[#54437](https://github.com/PaddlePaddle/Paddle/pull/54437),[#54411](https://github.com/PaddlePaddle/Paddle/pull/54411),[#54411](https://github.com/PaddlePaddle/Paddle/pull/54411),[#54391](https://github.com/PaddlePaddle/Paddle/pull/54391),[#54466](https://github
.com/PaddlePaddle/Paddle/pull/54466),[#54480](https://github.com/PaddlePaddle/Paddle/pull/54480),[#54480](https://github.com/PaddlePaddle/Paddle/pull/54480),[#54724](https://github.com/PaddlePaddle/Paddle/pull/54724),[#59193](https://github.com/PaddlePaddle/Paddle/pull/59193),[#54735](https://github.com/PaddlePaddle/Paddle/pull/54735),[#54812](https://github.com/PaddlePaddle/Paddle/pull/54812),[#56430](https://github.com/PaddlePaddle/Paddle/pull/56430),[#56655](https://github.com/PaddlePaddle/Paddle/pull/56655),[#56684](https://github.com/PaddlePaddle/Paddle/pull/56684),[#56774](https://github.com/PaddlePaddle/Paddle/pull/56774),[#56936](https://github.com/PaddlePaddle/Paddle/pull/56936),[#56949](https://github.com/PaddlePaddle/Paddle/pull/56949),[#56974](https://github.com/PaddlePaddle/Paddle/pull/56974),[#57171](https://github.com/PaddlePaddle/Paddle/pull/57171),[#57712](https://github.com/PaddlePaddle/Paddle/pull/57712),[#56617](https://github.com/PaddlePaddle/Paddle/pull/56617),[#58181](https://github.com/PaddlePaddle/Paddle/pull/58181),[#58253](https://github.com/PaddlePaddle/Paddle/pull/58253),[#58268](https://github.com/PaddlePaddle/Paddle/pull/58268),[#59051](https://github.com/PaddlePaddle/Paddle/pull/59051),[#59048](https://github.com/PaddlePaddle/Paddle/pull/59048),[#59081](https://github.com/PaddlePaddle/Paddle/pull/59081),[#59076](https://github.com/PaddlePaddle/Paddle/pull/59076),[#59155](https://github.com/PaddlePaddle/Paddle/pull/59155),[#59253](https://github.com/PaddlePaddle/Paddle/pull/59253),[#59347](https://github.com/PaddlePaddle/Paddle/pull/59347),[#58957](https://github.com/PaddlePaddle/Paddle/pull/58957),[#59443](https://github.com/PaddlePaddle/Paddle/pull/59443),[#58998](https://github.com/PaddlePaddle/Paddle/pull/58998),[#57574](https://github.com/PaddlePaddle/Paddle/pull/57574),[#55889](https://github.com/PaddlePaddle/Paddle/pull/55889),[#59078](https://github.com/PaddlePaddle/Paddle/pull/59078),[#55762](https://github.com/PaddlePaddle/Pa
ddle/pull/55762),[#56252](https://github.com/PaddlePaddle/Paddle/pull/56252),[#56715](https://github.com/PaddlePaddle/Paddle/pull/56715),[#54905](https://github.com/PaddlePaddle/Paddle/pull/54905),[#56978](https://github.com/PaddlePaddle/Paddle/pull/56978),[#57032](https://github.com/PaddlePaddle/Paddle/pull/57032),[#57179](https://github.com/PaddlePaddle/Paddle/pull/57179),[#57179](https://github.com/PaddlePaddle/Paddle/pull/57179),[#58996](https://github.com/PaddlePaddle/Paddle/pull/58996),[#59915](https://github.com/PaddlePaddle/Paddle/pull/59915),[#54883](https://github.com/PaddlePaddle/Paddle/pull/54883),[#56746](https://github.com/PaddlePaddle/Paddle/pull/56746),[#57674](https://github.com/PaddlePaddle/Paddle/pull/57674),[#60117](https://github.com/PaddlePaddle/Paddle/pull/60117),[#55627](https://github.com/PaddlePaddle/Paddle/pull/55627),[#54568](https://github.com/PaddlePaddle/Paddle/pull/54568),[#54450](https://github.com/PaddlePaddle/Paddle/pull/54450),[#54513](https://github.com/PaddlePaddle/Paddle/pull/54513),[#54615](https://github.com/PaddlePaddle/Paddle/pull/54615),[#54913](https://github.com/PaddlePaddle/Paddle/pull/54913),[#54916](https://github.com/PaddlePaddle/Paddle/pull/54916),[#55148](https://github.com/PaddlePaddle/Paddle/pull/55148),[#55125](https://github.com/PaddlePaddle/Paddle/pull/55125),[#55479](https://github.com/PaddlePaddle/Paddle/pull/55479),[#55723](https://github.com/PaddlePaddle/Paddle/pull/55723),[#55831](https://github.com/PaddlePaddle/Paddle/pull/55831),[#55904](https://github.com/PaddlePaddle/Paddle/pull/55904),[#56085](https://github.com/PaddlePaddle/Paddle/pull/56085),[#56259](https://github.com/PaddlePaddle/Paddle/pull/56259),[#56366](https://github.com/PaddlePaddle/Paddle/pull/56366),[#56366](https://github.com/PaddlePaddle/Paddle/pull/56366),[#56546](https://github.com/PaddlePaddle/Paddle/pull/56546),[#56679](https://github.com/PaddlePaddle/Paddle/pull/56679),[#57222](https://github.com/PaddlePaddle/Paddle/pull/57222),[#5
7387](https://github.com/PaddlePaddle/Paddle/pull/57387),[#57993](https://github.com/PaddlePaddle/Paddle/pull/57993),[#59556](https://github.com/PaddlePaddle/Paddle/pull/59556),[#57931](https://github.com/PaddlePaddle/Paddle/pull/57931),[#58112](https://github.com/PaddlePaddle/Paddle/pull/58112),[#54228](https://github.com/PaddlePaddle/Paddle/pull/54228),[#56913](https://github.com/PaddlePaddle/Paddle/pull/56913),[#56993](https://github.com/PaddlePaddle/Paddle/pull/56993),[#55042](https://github.com/PaddlePaddle/Paddle/pull/55042),[#55305](https://github.com/PaddlePaddle/Paddle/pull/55305),[#55286](https://github.com/PaddlePaddle/Paddle/pull/55286),[#56634](https://github.com/PaddlePaddle/Paddle/pull/56634),[#57778](https://github.com/PaddlePaddle/Paddle/pull/57778),[#58374](https://github.com/PaddlePaddle/Paddle/pull/58374),[#58640](https://github.com/PaddlePaddle/Paddle/pull/58640),[#58822](https://github.com/PaddlePaddle/Paddle/pull/58822),[#59055](https://github.com/PaddlePaddle/Paddle/pull/59055),[#59303](https://github.com/PaddlePaddle/Paddle/pull/59303),[#59487](https://github.com/PaddlePaddle/Paddle/pull/59487),[#58400](https://github.com/PaddlePaddle/Paddle/pull/58400),[#59283](https://github.com/PaddlePaddle/Paddle/pull/59283),[#54791](https://github.com/PaddlePaddle/Paddle/pull/54791),[#59134](https://github.com/PaddlePaddle/Paddle/pull/59134),[#56206](https://github.com/PaddlePaddle/Paddle/pull/56206),[#56199](https://github.com/PaddlePaddle/Paddle/pull/56199),[#56670](https://github.com/PaddlePaddle/Paddle/pull/56670),[#58923](https://github.com/PaddlePaddle/Paddle/pull/58923) -- 修复 Paddle ARM 编译相关问题。[#55416](https://github.com/PaddlePaddle/Paddle/pull/55416),[#55548](https://github.com/PaddlePaddle/Paddle/pull/55548) - -## Thanks to Our Contributors - -Azure-Tang, zhaoyinglia, From00, JZ-LIANG, xysheng-baidu, SylarTiaNII, kuizhiqing, zhiqiu, FeixLiu, liuzhenhai93, GhostScreaming, pangengzheng, xiaoyewww, wanghuancoder, ForFishes, hitywt, danleifeng, 
tianshuo78520a, ykkk2333, houj04, lj970926, XiaociZhang, HarperCy, cqulilujia, runzhech, RuohengMa, Caozhou1995, kangguangli, heavyrain-lzy, zyfncg, SigureMo, YuanRisheng, lchdl, LiYuRio, AndSonder, Wennie396, zhangbo9674, liudongxue01, risemeup1, phlrain, winter-wang, yuanlehome, NALLEIN, Liujie0926, yuguo-Jack, gitliuyf, zh794390558, Aurelius84, 6clc, GGBond8488, xiaoguoguo626807, Wong4j, iosmers, xiaoxiaohehe001, LielinJiang, carryyu, Difers, yangxiaoyu14, xuxinyi389, cxxly, gongshaotian, jjyaoao, lijialin03, lxd-cumt, cyber-pioneer, HydrogenSulfate, MayYouBeProsperous, Charles-hit, Patrick-Star125, ScottWong98, huangjiyi, DrRyanHuang, jinyouzhi, BeingGod, Wanglongzhi2001, yangguohao, zyt1024, longranger2, 2742195759, megemini, thisjiang, kevincheng2, zhoutianzi666, Wangzheee, ming1753, tianhaodongbd, freeliuzc, zhenyun-li, MARD1NO, RichardWooSJTU, eee4017, leo0519, csy0225, wwbitejotunn, bukejiyu, jiweibo, iamsonderr, ckl117, ronny1996, zhanglirong1999, LLee233, ZHUI, wangxn12138, zhwesky2010, Courtesy-Xs, zoooo0820, llyyxx0413, Asthestarsfalll, zxcd, pkuzyc, idontkonwher, sneaxiy, hong19860320, ZibinGuo, leolishaohao, MuShangCC, zhupengyang, shentanyue, Travis-Lee, wz1qqx, frank-oops, newway, QingshuChen, zhangyk0314, HandSomeLEEw, Shixiaowei02, zhangyuqin1998, Xing-lil, zhhsplendid, jiahy0825, xinyu-intel, MarioLulab, 0x45f, Tom-Zheng, xingmingyyj, zhangbopd, gouzil, zeroRains, BiynXu, WintersMontagne10335, wuhuachaocoding, GreatV, chenwhql, deepllz, parap1uie-s, ozogxyz, FisherWY, changeyoung98, zhiboniu, YangQun1, dynamicheart, Xreki, liugddx, Lylinnnnn, YSF-A, zzjjay, YanhuiDua, lishicheng1996, USTCKAY, abenmao, cocoshe, HermitSun, ccsuzzh, sanbuphy, enkilee, RedContritio, Liyulingyue, zrr1999, chen2016013, Galaxy1458, chalsliu, mrcangye, XieYunshen, zhiheng-liu, haohongxiang, ZzSean, JamesLim-sy, yuehuayingxueluo, niuliling123, umiswing, sijunhe, littsk, SecretXV, zhurou603, zhangjun, caizejun, yangjianfengo1, vivienfanghuagood, Xinyu302, lizexu123, 
yghstill, Li-fAngyU, VigiZhang, co63oc, dhanush-2501, ooooo-create, PommesPeter, zeus2x7, akshatvishu, jzhang533, Sekiro-x, gumblex, BernieHuang2008, YibinLiu666, qiuwenbogdut, XavierZXY, MqLeet, zhangting2020, mingxu1067, Ainavo, SSKlearns, yuchen202, silverling, zade23, wenxiaohahaha, NKNaN, Tsaiyue, fsczz, Tomoko-hjf, rhmaaa, zbt78, Hhankyangg, wangzhen38, zhengqiwen1997, engineer1109, onepick, qili93, Rane2021, nemonameless, DesmonDay, RachelXu7, ceci3, lyuwenyu, liuruyan, LokeZhou, shiyutang, lanxianghit, feifei-111, Sahala08, sunzhongkai588, Kaedeharai, Candy2Tang, liyongchao911, whisky-12, InsaneOnion, yoyoIcy, KongAKun, linzeyang, MuhammadNizamani, eltociear, Ligoml, LUZY0726, Windfarer, FlyingQianMM, jeng1220, junelotus, zlsh80826, Vvsmile, Frida-a, TonibMw, guoshengCS, zhink, ZhangYulongg, AlbertVan, fengxin-hello, mjp9527, entired, DanGuge. - - -# 2.5.0 Release Note - -## 1. 重要更新 -- **动静统一新架构**:实现基础算子组合的动转静加编译器执行新模式,在 ResNet50&Bert 模型上完成动转静、组合算子、神经网络编译器优化加速全流程。动转静完成整图 fallback 核心功能开发,支持动转静失败时回退到动态图训练执行;组合算子设计一套包含 150 多个基础算子的基础算子体系,实现 python 层前向算子拆分机制和支持动、静态图的反向算子拆分机制,实现 70 多个常用前、反向算子的拆分;CINN 编译器修复正确性问题,开发关键 Pass,添加手工 Schedule 规则,实现内核代码自动生成,ResNet50 模型性能提升 12%,Bert 模型性能提升 10%。 -- **PHI 算子库算子架构统一**:将原算子体系下剩余的 350+算子内核全部统一到 PHI 算子库中,以及原算子体系中的算子定义方式也都统一为 PHI 算子库的算子定义形式(基于 YAML 配置定义算子),提升了架构统一性,降低了框架开发的理解成本;将 PHI 算子库依赖的 Fluid 头文件全部解耦,并独立编译为动态链接库,为框架的二次开发提供更轻量的算子库复用方式;继续对飞桨框架中不规范的算子以及算子内核进行规范化调整,便于开发者理解,降低了硬件的接入成本。 -- **静态图新执行器全面上线**:静态图新执行器实现多项功能和性能优化,完成对原有多套旧执行器的统一和替换,成为静态图单卡和分布式训练 python 端入口以及动转静、控制流、CINN 等后端默认使用的执行引擎,大幅提升框架调度性能,功能架构更加清晰,二次开发能力显著增强。 -- **Python API 支持 0 维 tensor**:为形状为`[1,]` 及形状为 `[]` 的张量定义了清晰的语义。 -- **新的环境适配**:适配了 CUDA 12,并支持使用 gcc12 进行编译。 - -## 2. 
不兼容升级 -- 飞桨 API 支持 0 维 tensor。飞桨之前用 shape 为[1]的 1 维 tensor 来替代 0 维 tensor,这种替代方式和当前主流习惯有差异,增加模型的开发调试成本,有时还会导致非预期错误。本版本对需支持 0 维 tensor 的 376 个 API 进行了修正,和社区广泛使用的工具如 EinOps 等实现。例如,在之前的情况下,模型训练中输出的 loss 为 1 维 tensor,如果要取出或打印 loss,往往需要使用 `loss.numpy()[0]` 这样的代码。经过本次修改后,模型训练中输出的 loss 为 0 维 tensor,使用 `loss.numpy()` 即可取出或打印 loss,代码简短、易懂且符合业界使用习惯。 -- `paddle.fluid` API 全面退场。按照上个版本已预告的计划,本次退场了 1116 个`paddle.fluid`API 及相关内部接口,剩余少量相关内部接口会在下个版本全部清理完成。fluid API 属于飞桨 2.0 本计划移除但考虑到兼容性等因素延缓清理的历史 API,本次退场清理不会影响基于飞桨 2.0 开发的程序,飞桨 API 体系也会更加简洁易懂。 -- 旧版动态图 Python 端代码完成清理。至此,Python 端仅使用新版动态图调用 C++核心逻辑。 -- 为统一静态图模型数据并行的训练方式,废弃原有的单进程多卡训练方式,包括 `paddle.static.ParallelExecutor` 和 `paddle.static.CompiledProgram().with_data_parallel()` 两个接口,原因是这套接口只支持单机多卡,不支持多机多卡,且底层执行性能较差。推荐统一使用多进程多卡训练方式,即 `paddle.distributed.launch` 接口来进行数据并行的分布式训练。该升级只影响静态图,不影响动态图和动转静训练,如果使用了废弃接口,请参考 [数据并行](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/06_distributed_training/cluster_quick_start_collective_cn.html) 的文档修改模型代码。[#50351](https://github.com/PaddlePaddle/Paddle/pull/50351),[#50501](https://github.com/PaddlePaddle/Paddle/pull/50501),[#51240](https://github.com/PaddlePaddle/Paddle/pull/51240),[#51701](https://github.com/PaddlePaddle/Paddle/pull/51701),[#51616](https://github.com/PaddlePaddle/Paddle/pull/51616),[#51369](https://github.com/PaddlePaddle/Paddle/pull/51369),[#52671](https://github.com/PaddlePaddle/Paddle/pull/52671) -- 移除框架中原有的昇腾 NPU 和寒武纪 MLU 的适配代码,全部升级为 CustomDevice 插件式适配方式,并将昇腾 NPU 和寒武纪 MLU 的适配代码迁移至 PaddleCustomDevice 仓库。 - -## 3. 
训练框架(含分布式) -### Python API -#### API 支持 0 维 tensor -- API 输入支持 0 维 tensor,涉及 `paddle.reshape`、`paddle.trace`、`paddle.linalg.norm` 等 286 个 API。[#53208](https://github.com/PaddlePaddle/Paddle/pull/53208), [#53592](https://github.com/PaddlePaddle/Paddle/pull/53592), [#47074](https://github.com/PaddlePaddle/Paddle/pull/47074), [#53186](https://github.com/PaddlePaddle/Paddle/pull/53186), [#47677](https://github.com/PaddlePaddle/Paddle/pull/47677), [#49357](https://github.com/PaddlePaddle/Paddle/pull/49357), [#50237](https://github.com/PaddlePaddle/Paddle/pull/50237), [#46555](https://github.com/PaddlePaddle/Paddle/pull/46555), [#47219](https://github.com/PaddlePaddle/Paddle/pull/47219), [#47501](https://github.com/PaddlePaddle/Paddle/pull/47501), [#47858](https://github.com/PaddlePaddle/Paddle/pull/47858), [#47961](https://github.com/PaddlePaddle/Paddle/pull/47961), [#48058](https://github.com/PaddlePaddle/Paddle/pull/48058), [#48007](https://github.com/PaddlePaddle/Paddle/pull/48007), [#49755](https://github.com/PaddlePaddle/Paddle/pull/49755), [#51024](https://github.com/PaddlePaddle/Paddle/pull/51024), [#51566](https://github.com/PaddlePaddle/Paddle/pull/51566), [#51899](https://github.com/PaddlePaddle/Paddle/pull/51899), [#49813](https://github.com/PaddlePaddle/Paddle/pull/49813), [#47812](https://github.com/PaddlePaddle/Paddle/pull/47812), [#47849](https://github.com/PaddlePaddle/Paddle/pull/47849), [#47251](https://github.com/PaddlePaddle/Paddle/pull/47251), [#53125](https://github.com/PaddlePaddle/Paddle/pull/53125), [#53828](https://github.com/PaddlePaddle/Paddle/pull/53828), [#51265](https://github.com/PaddlePaddle/Paddle/pull/51265), [#47689](https://github.com/PaddlePaddle/Paddle/pull/47689), [#48452](https://github.com/PaddlePaddle/Paddle/pull/48452), [#49072](https://github.com/PaddlePaddle/Paddle/pull/49072), [#48638](https://github.com/PaddlePaddle/Paddle/pull/48638), [#49175](https://github.com/PaddlePaddle/Paddle/pull/49175), 
[#49279](https://github.com/PaddlePaddle/Paddle/pull/49279), [#50857](https://github.com/PaddlePaddle/Paddle/pull/50857), [#49805](https://github.com/PaddlePaddle/Paddle/pull/49805), [#47734](https://github.com/PaddlePaddle/Paddle/pull/47734), [#45992](https://github.com/PaddlePaddle/Paddle/pull/45992), [#49616](https://github.com/PaddlePaddle/Paddle/pull/49616), [#49959](https://github.com/PaddlePaddle/Paddle/pull/49959), [#50536](https://github.com/PaddlePaddle/Paddle/pull/50536), [#49544](https://github.com/PaddlePaddle/Paddle/pull/49544), [#49842](https://github.com/PaddlePaddle/Paddle/pull/49842), [#46909](https://github.com/PaddlePaddle/Paddle/pull/46909), [#49361](https://github.com/PaddlePaddle/Paddle/pull/49361), [#50169](https://github.com/PaddlePaddle/Paddle/pull/50169), [#48314](https://github.com/PaddlePaddle/Paddle/pull/48314), [#48735](https://github.com/PaddlePaddle/Paddle/pull/48735), [#49122](https://github.com/PaddlePaddle/Paddle/pull/49122), [#49122](https://github.com/PaddlePaddle/Paddle/pull/49122), [#49177](https://github.com/PaddlePaddle/Paddle/pull/49177), [#49501](https://github.com/PaddlePaddle/Paddle/pull/49501), [#49562](https://github.com/PaddlePaddle/Paddle/pull/49562), [#49340](https://github.com/PaddlePaddle/Paddle/pull/49340), [#49550](https://github.com/PaddlePaddle/Paddle/pull/49550), [#49596](https://github.com/PaddlePaddle/Paddle/pull/49596), [#49730](https://github.com/PaddlePaddle/Paddle/pull/49730), [#49667](https://github.com/PaddlePaddle/Paddle/pull/49667), [#49692](https://github.com/PaddlePaddle/Paddle/pull/49692), [#49854](https://github.com/PaddlePaddle/Paddle/pull/49854), [#49845](https://github.com/PaddlePaddle/Paddle/pull/49845), [#49803](https://github.com/PaddlePaddle/Paddle/pull/49803), [#49889](https://github.com/PaddlePaddle/Paddle/pull/49889), [#49904](https://github.com/PaddlePaddle/Paddle/pull/49904), [#49518](https://github.com/PaddlePaddle/Paddle/pull/49518), 
[#49884](https://github.com/PaddlePaddle/Paddle/pull/49884), [#49880](https://github.com/PaddlePaddle/Paddle/pull/49880), [#49862](https://github.com/PaddlePaddle/Paddle/pull/49862), [#49921](https://github.com/PaddlePaddle/Paddle/pull/49921), [#49260](https://github.com/PaddlePaddle/Paddle/pull/49260), [#49929](https://github.com/PaddlePaddle/Paddle/pull/49929), [#49570](https://github.com/PaddlePaddle/Paddle/pull/49570), [#49882](https://github.com/PaddlePaddle/Paddle/pull/49882), [#50213](https://github.com/PaddlePaddle/Paddle/pull/50213), [#49780](https://github.com/PaddlePaddle/Paddle/pull/49780), [#50271](https://github.com/PaddlePaddle/Paddle/pull/50271), [#50289](https://github.com/PaddlePaddle/Paddle/pull/50289), [#50293](https://github.com/PaddlePaddle/Paddle/pull/50293), [#49735](https://github.com/PaddlePaddle/Paddle/pull/49735), [#50433](https://github.com/PaddlePaddle/Paddle/pull/50433), [#49847](https://github.com/PaddlePaddle/Paddle/pull/49847), [#50635](https://github.com/PaddlePaddle/Paddle/pull/50635), [#50950](https://github.com/PaddlePaddle/Paddle/pull/50950), [#50947](https://github.com/PaddlePaddle/Paddle/pull/50947), [#49460](https://github.com/PaddlePaddle/Paddle/pull/49460), [#53087](https://github.com/PaddlePaddle/Paddle/pull/53087), [#51687](https://github.com/PaddlePaddle/Paddle/pull/51687), [#52185](https://github.com/PaddlePaddle/Paddle/pull/52185), [#54649](https://github.com/PaddlePaddle/Paddle/pull/54649) -- API 输出支持 0 维 tensor,涉及 `paddle.sum`、`paddle.min/max`、`paddle.any/all` 等 90 个 API。[#52891](https://github.com/PaddlePaddle/Paddle/pull/52891), [#52861](https://github.com/PaddlePaddle/Paddle/pull/52861), [#52775](https://github.com/PaddlePaddle/Paddle/pull/52775), [#52850](https://github.com/PaddlePaddle/Paddle/pull/52850), [#52843](https://github.com/PaddlePaddle/Paddle/pull/52843), [#52857](https://github.com/PaddlePaddle/Paddle/pull/52857), [#51721](https://github.com/PaddlePaddle/Paddle/pull/51721), 
[#53051](https://github.com/PaddlePaddle/Paddle/pull/53051), [#53192](https://github.com/PaddlePaddle/Paddle/pull/53192), [#52739](https://github.com/PaddlePaddle/Paddle/pull/52739), [#52741](https://github.com/PaddlePaddle/Paddle/pull/52741), [#53175](https://github.com/PaddlePaddle/Paddle/pull/53175), [#51889](https://github.com/PaddlePaddle/Paddle/pull/51889), [#53199](https://github.com/PaddlePaddle/Paddle/pull/53199), [#53242](https://github.com/PaddlePaddle/Paddle/pull/53242), [#53421](https://github.com/PaddlePaddle/Paddle/pull/53421) -- 支持 0 维 tensor 后,修正原有不规范的代码,及对模型代码中的非规范用法进行提示和兼容。[#51562](https://github.com/PaddlePaddle/Paddle/pull/51562), [#51586](https://github.com/PaddlePaddle/Paddle/pull/51586), [#51757](https://github.com/PaddlePaddle/Paddle/pull/51757), [#52197](https://github.com/PaddlePaddle/Paddle/pull/52197), [#54117](https://github.com/PaddlePaddle/Paddle/pull/54117)。 - -#### new API -- 新增 jacobian 和 hessian API,用于科学计算。[#53331](https://github.com/PaddlePaddle/Paddle/pull/53331) -- 新增稀疏计算 API。例如 `paddle.sparse.reshape`、`paddle.sparse.sum` 和 `paddle.sparse.slice` 等。[#46694](https://github.com/PaddlePaddle/Paddle/pull/46694), [#51513](https://github.com/PaddlePaddle/Paddle/pull/51513), [#53794](https://github.com/PaddlePaddle/Paddle/pull/53794), [#51406](https://github.com/PaddlePaddle/Paddle/pull/51406) -- 新增其它 API。例如 `paddle.optimizer.LBFGS`、`paddle.index_put` 和 `paddle.logaddexp` 等。[#53314](https://github.com/PaddlePaddle/Paddle/pull/53314), [#51912](https://github.com/PaddlePaddle/Paddle/pull/51912), [#52886](https://github.com/PaddlePaddle/Paddle/pull/52886), [#50843](https://github.com/PaddlePaddle/Paddle/pull/50843), [#47282](https://github.com/PaddlePaddle/Paddle/pull/47282), [#52284](https://github.com/PaddlePaddle/Paddle/pull/52284) - -### 动态图 -#### 新功能 -- 新增了 paddle.nn.utils.clip_grad_norm_用于支持梯度裁剪和 paddle.Tensor.data_ptr 用于获取 Tensor 数据的内存/显存的地址 [PR49935](https://github.com/PaddlePaddle/Paddle/pull/49935)[, 
PR48235](https://github.com/PaddlePaddle/Paddle/pull/48235), [PR49173](https://github.com/PaddlePaddle/Paddle/pull/49173) -- 新增了 saved_tensors_hooks 机制,用于临时存放和取回用于反向计算使用的前向 Tensor。 [PR45763](https://github.com/PaddlePaddle/Paddle/pull/45763), [PR46215](https://github.com/PaddlePaddle/Paddle/pull/46215), [PR48124](https://github.com/PaddlePaddle/Paddle/pull/48124) -- Tensor 支持了 pickler,用于支持 Tensor 的序列化。 [PR47025](https://github.com/PaddlePaddle/Paddle/pull/47025), [PR48179](https://github.com/PaddlePaddle/Paddle/pull/48179) -- 新增了调试日志,反向出现 nan/inf 时打印前向 Python 堆栈 [PR53217](https://github.com/PaddlePaddle/Paddle/pull/53217) [PR52639](https://github.com/PaddlePaddle/Paddle/pull/52639) [PR52729](https://github.com/PaddlePaddle/Paddle/pull/52729) -- 新增了对 expand_v2, tile, concat, assign, slice 高阶微分的支持。[PR45941](https://github.com/PaddlePaddle/Paddle/pull/45941)[, PR45942](https://github.com/PaddlePaddle/Paddle/pull/45942)[, PR45940](https://github.com/PaddlePaddle/Paddle/pull/45940)[, PR45879](https://github.com/PaddlePaddle/Paddle/pull/45879), [PR45960](https://github.com/PaddlePaddle/Paddle/pull/45960) - -#### 功能优化 -- 优化了动态图的日志打印,包括日志内容优化、VLog 级别优化、报错内容优化等。[PR45783](https://github.com/PaddlePaddle/Paddle/pull/45783), [PR46349](https://github.com/PaddlePaddle/Paddle/pull/46349), [PR46934](https://github.com/PaddlePaddle/Paddle/pull/46934), [PR47724](https://github.com/PaddlePaddle/Paddle/pull/47724) -- 新增了 FLAGS_auto_growth_chunk_size_in_mb 用于 auto_growth_allocator 最小 chunk size 的设置 [PR52204](https://github.com/PaddlePaddle/Paddle/pull/52204) - -#### bug fix -- 修复了一些算子的 bug,包括:batch_norm, slice, set_value, scale, multinomial, adam, conv, transpose2_grad, conv2d_transpose_double_grad。[PR47802](https://github.com/PaddlePaddle/Paddle/pull/47802), [PR47634](https://github.com/PaddlePaddle/Paddle/pull/47634), [PR47349](https://github.com/PaddlePaddle/Paddle/pull/47349), [PR46124](https://github.com/PaddlePaddle/Paddle/pull/46124), 
[PR46147](https://github.com/PaddlePaddle/Paddle/pull/46147), [PR50388](https://github.com/PaddlePaddle/Paddle/pull/50388), [PR48626](https://github.com/PaddlePaddle/Paddle/pull/48626), [PR48519](https://github.com/PaddlePaddle/Paddle/pull/48519), [PR50386](https://github.com/PaddlePaddle/Paddle/pull/50386), [PR48432](https://github.com/PaddlePaddle/Paddle/pull/48432), [PR51851](https://github.com/PaddlePaddle/Paddle/pull/51851) -- 修复了 PyLayer 的一些错误问题。[PR51740](https://github.com/PaddlePaddle/Paddle/pull/51740), [PR47154](https://github.com/PaddlePaddle/Paddle/pull/47154), [PR47323](https://github.com/PaddlePaddle/Paddle/pull/47323), [PR54041](https://github.com/PaddlePaddle/Paddle/pull/54041), [PR48533](https://github.com/PaddlePaddle/Paddle/pull/48533) -- 确保 sync_batch_norm 在反向有序,防止错序导致 hang 或精度错误。[PR52268](https://github.com/PaddlePaddle/Paddle/pull/52268), [PR52860](https://github.com/PaddlePaddle/Paddle/pull/52860), [PR52779](https://github.com/PaddlePaddle/Paddle/pull/52779) -- 修复了 linspace 在 AMP 下的 bug。[PR46088](https://github.com/PaddlePaddle/Paddle/pull/46088) -- 修复了 Python C API 错误调用导致 Windows 崩溃的问题。[PR46833](https://github.com/PaddlePaddle/Paddle/pull/46833) -- 修复了 DataLoader 可能遗漏删除/dev/shm 的问题。[PR48511](https://github.com/PaddlePaddle/Paddle/pull/48511) -- 修复了 paddle.grad 的一些问题。[PR47151](https://github.com/PaddlePaddle/Paddle/pull/47151) -- 为不支持高阶微分的算子添加报错信息。[PR47231](https://github.com/PaddlePaddle/Paddle/pull/47231) -- 为 python 运算符添加 numpyarray 的支持。[PR48229](https://github.com/PaddlePaddle/Paddle/pull/48229) -- 有两处 element_size 接口,删除其中之一。[PR49631](https://github.com/PaddlePaddle/Paddle/pull/49631) -- 修复老动态图开 VLOG 崩溃问题。[PR47115](https://github.com/PaddlePaddle/Paddle/pull/47115) -- XPU,d2d 时,改成 d2h+h2d,规避多线程问题 。[PR48373](https://github.com/PaddlePaddle/Paddle/pull/48373) - -#### 性能优化 -- Python 运算符下沉到 C++实现,以提升 API 性能, 下沉后该类 API 有 3~6 倍性能提升。[PR45811](https://github.com/PaddlePaddle/Paddle/pull/45811), 
[PR46326](https://github.com/PaddlePaddle/Paddle/pull/46326), [PR46329](https://github.com/PaddlePaddle/Paddle/pull/46329), [PR46520](https://github.com/PaddlePaddle/Paddle/pull/46520), [PR46542](https://github.com/PaddlePaddle/Paddle/pull/46542), [PR46565](https://github.com/PaddlePaddle/Paddle/pull/46565), [PR47060](https://github.com/PaddlePaddle/Paddle/pull/47060), [PR47077](https://github.com/PaddlePaddle/Paddle/pull/47077), [PR47174](https://github.com/PaddlePaddle/Paddle/pull/47174), [PR47315](https://github.com/PaddlePaddle/Paddle/pull/47315) -- 优化了 Optimizer CPU 调度性能,可减少 Optimizer 阶段导致的 GPU Gap。 [PR49787](https://github.com/PaddlePaddle/Paddle/pull/49787), [PR50188](https://github.com/PaddlePaddle/Paddle/pull/50188)[, PR51340](https://github.com/PaddlePaddle/Paddle/pull/51340), [PR49864](https://github.com/PaddlePaddle/Paddle/pull/49864), [PR50158](https://github.com/PaddlePaddle/Paddle/pull/50158), [PR50335](https://github.com/PaddlePaddle/Paddle/pull/50335) -- API 中可下沉到 C++的逻辑,下沉到 C++,以提升 API 性能。[PR46412](https://github.com/PaddlePaddle/Paddle/pull/46412), [PR46190](https://github.com/PaddlePaddle/Paddle/pull/46190) -- 优化动态图下 Python 端不必要的调用逻辑,以提升 API 性能。[PR46221](https://github.com/PaddlePaddle/Paddle/pull/46221), [PR49473](https://github.com/PaddlePaddle/Paddle/pull/49473), [PR49574](https://github.com/PaddlePaddle/Paddle/pull/49574), [PR49589](https://github.com/PaddlePaddle/Paddle/pull/49589), [PR49612](https://github.com/PaddlePaddle/Paddle/pull/49612), [PR49717](https://github.com/PaddlePaddle/Paddle/pull/49717)[, PR49733](https://github.com/PaddlePaddle/Paddle/pull/49733), [PR49823](https://github.com/PaddlePaddle/Paddle/pull/49823)[, PR49508](https://github.com/PaddlePaddle/Paddle/pull/49508), [PR46840](https://github.com/PaddlePaddle/Paddle/pull/46840) -- 优化了 Allocator 的使用,以提升动态图 API 调度性能。[PR47125](https://github.com/PaddlePaddle/Paddle/pull/47125), [PR48548](https://github.com/PaddlePaddle/Paddle/pull/48548), 
[PR50995](https://github.com/PaddlePaddle/Paddle/pull/50995), [PR47731](https://github.com/PaddlePaddle/Paddle/pull/47731) -- 优化了 fused_attention 算子性能。[PR48902](https://github.com/PaddlePaddle/Paddle/pull/48902) -- optimizer 的_add_accumulator,如果 device 是 CPU,且在动态图下,直接使用 full 初始化 var。[PR48189](https://github.com/PaddlePaddle/Paddle/pull/48189) -- 对反向图不必要执行的 subgraph 进行剪枝以提升性能。[PR47827](https://github.com/PaddlePaddle/Paddle/pull/47827) -- 优化了 initalizers 的性能。[PR46033](https://github.com/PaddlePaddle/Paddle/pull/46033) -- 新增 fused dropout add 算子,提升 dropout 和 add 一起计算的性能。[#52903](https://github.com/PaddlePaddle/Paddle/pull/52903) - -### 静态图 -#### 静态图新执行器全面上线 -静态图新执行器实现多项功能和性能优化,完成对原有多套旧执行器的统一和替换,成为静态图单卡和分布式训练 python 端入口以及动转静、控制流、CINN 等后端默认使用的执行引擎,大幅提升框架调度性能,功能架构更加清晰,二次开发能力显著增强。[#45913](https://github.com/PaddlePaddle/Paddle/pull/45913),[#46025](https://github.com/PaddlePaddle/Paddle/pull/46025),[#48911](https://github.com/PaddlePaddle/Paddle/pull/48911),[#50239](https://github.com/PaddlePaddle/Paddle/pull/50239),[#45696](https://github.com/PaddlePaddle/Paddle/pull/45696),[#46092](https://github.com/PaddlePaddle/Paddle/pull/46092),[#48158](https://github.com/PaddlePaddle/Paddle/pull/48158),[#51389](https://github.com/PaddlePaddle/Paddle/pull/51389),[#49708](https://github.com/PaddlePaddle/Paddle/pull/49708),[#49275](https://github.com/PaddlePaddle/Paddle/pull/49275),[#48789](https://github.com/PaddlePaddle/Paddle/pull/48789),[#49939](https://github.com/PaddlePaddle/Paddle/pull/49939),[#51149](https://github.com/PaddlePaddle/Paddle/pull/51149),[#52652](https://github.com/PaddlePaddle/Paddle/pull/52652) - -### 算子库 -#### 自定义算子等功能增强 -包括:全新支持了自定义扩展机制,实现将 C++ 扩展的运算函数绑定至 Python 端使用,进一步提升了框架的二次开发能力;扩展支持自定义硬件上使用自定义算子机制,以满足硬件厂商实现非 Paddle 已有算子的需求;扩展支持了在自定义算子中实现`inplace`、`vector`输出、`optional`输入等高阶机制;优化了自定义算子在动态图模式下的调度性能,多输入参数的算子性能提升 25.4%;为自定义算子 Tensor 扩展新增了常用运算符及 API,支持链式调用,简化代码写法。对算子内核选择机制进行了优化;对部分算子内核进行了逻辑完善、支持数据类型增强以及性能优化;新增以及完善 XPU 内核 100+;修复各项 Bug 累计 170+。 
-[#49222](https://github.com/PaddlePaddle/Paddle/pull/49222), [#51773](https://github.com/PaddlePaddle/Paddle/pull/51773), [#51923](https://github.com/PaddlePaddle/Paddle/pull/51923), [#53080](https://github.com/PaddlePaddle/Paddle/pull/53080), [#50731](https://github.com/PaddlePaddle/Paddle/pull/50731), [#50563](https://github.com/PaddlePaddle/Paddle/pull/50563), [#50840](https://github.com/PaddlePaddle/Paddle/pull/50840), [#50983](https://github.com/PaddlePaddle/Paddle/pull/50983), [#51713](https://github.com/PaddlePaddle/Paddle/pull/51713), [#48733](https://github.com/PaddlePaddle/Paddle/pull/48733), [#50558](https://github.com/PaddlePaddle/Paddle/pull/50558), [#50764](https://github.com/PaddlePaddle/Paddle/pull/50764), [#51973](https://github.com/PaddlePaddle/Paddle/pull/51973), [#52216](https://github.com/PaddlePaddle/Paddle/pull/52216), [#51027](https://github.com/PaddlePaddle/Paddle/pull/51027), [#50745](https://github.com/PaddlePaddle/Paddle/pull/50745), [#50756](https://github.com/PaddlePaddle/Paddle/pull/50756), [#50886](https://github.com/PaddlePaddle/Paddle/pull/50886), [#50813](https://github.com/PaddlePaddle/Paddle/pull/50813), [#50869](https://github.com/PaddlePaddle/Paddle/pull/50869), [#51085](https://github.com/PaddlePaddle/Paddle/pull/51085), [#51646](https://github.com/PaddlePaddle/Paddle/pull/51646), [#51620](https://github.com/PaddlePaddle/Paddle/pull/51620), [#51844](https://github.com/PaddlePaddle/Paddle/pull/51844), [#52421](https://github.com/PaddlePaddle/Paddle/pull/52421), [#52872](https://github.com/PaddlePaddle/Paddle/pull/52872), [#52597](https://github.com/PaddlePaddle/Paddle/pull/52597), [#50582](https://github.com/PaddlePaddle/Paddle/pull/50582), [#52114](https://github.com/PaddlePaddle/Paddle/pull/52114), [#52915](https://github.com/PaddlePaddle/Paddle/pull/52915), [#50928](https://github.com/PaddlePaddle/Paddle/pull/50928), [#48272](https://github.com/PaddlePaddle/Paddle/pull/48272), 
[#48702](https://github.com/PaddlePaddle/Paddle/pull/48702), [#52191](https://github.com/PaddlePaddle/Paddle/pull/52191), [#52191](https://github.com/PaddlePaddle/Paddle/pull/52191), [#47374](https://github.com/PaddlePaddle/Paddle/pull/47374), [#47375](https://github.com/PaddlePaddle/Paddle/pull/47375), [#47378](https://github.com/PaddlePaddle/Paddle/pull/47378), [#54126](https://github.com/PaddlePaddle/Paddle/pull/54126), [#47638](https://github.com/PaddlePaddle/Paddle/pull/47638), [#47661](https://github.com/PaddlePaddle/Paddle/pull/47661), [#50606](https://github.com/PaddlePaddle/Paddle/pull/50606), [#53528](https://github.com/PaddlePaddle/Paddle/pull/53528), [#50599](https://github.com/PaddlePaddle/Paddle/pull/50599), [#51727](https://github.com/PaddlePaddle/Paddle/pull/51727), [#50825](https://github.com/PaddlePaddle/Paddle/pull/50825), [#50773](https://github.com/PaddlePaddle/Paddle/pull/50773), [#50979](https://github.com/PaddlePaddle/Paddle/pull/50979), [#53336](https://github.com/PaddlePaddle/Paddle/pull/53336), [#53555](https://github.com/PaddlePaddle/Paddle/pull/53555), [#53716](https://github.com/PaddlePaddle/Paddle/pull/53716), [#53753](https://github.com/PaddlePaddle/Paddle/pull/53753), [#53981](https://github.com/PaddlePaddle/Paddle/pull/53981), [#53977](https://github.com/PaddlePaddle/Paddle/pull/53977), [#53980](https://github.com/PaddlePaddle/Paddle/pull/53980), [#54043](https://github.com/PaddlePaddle/Paddle/pull/54043), [#54066](https://github.com/PaddlePaddle/Paddle/pull/54066), [#52866](https://github.com/PaddlePaddle/Paddle/pull/52866), [#53043](https://github.com/PaddlePaddle/Paddle/pull/53043), [#53325](https://github.com/PaddlePaddle/Paddle/pull/53325), [#54323](https://github.com/PaddlePaddle/Paddle/pull/54323), [#54367](https://github.com/PaddlePaddle/Paddle/pull/54367), [#51353](https://github.com/PaddlePaddle/Paddle/pull/51353), [#53749](https://github.com/PaddlePaddle/Paddle/pull/53749), 
[#50013](https://github.com/PaddlePaddle/Paddle/pull/50013), [#47570](https://github.com/PaddlePaddle/Paddle/pull/47570), [#50997](https://github.com/PaddlePaddle/Paddle/pull/50997), [#51241](https://github.com/PaddlePaddle/Paddle/pull/51241), [#49537](https://github.com/PaddlePaddle/Paddle/pull/49537) - -#### 算子体系架构统一 -具体包括:将原算子体系下剩余的 350+算子内核全部统一到 PHI 算子库中,以及原算子体系中的算子定义方式也都统一为 PHI 算子库的算子定义形式(基于 YAML 配置定义算子),提升了架构统一性,降低了框架开发的理解成本;将 PHI 算子库依赖的 Fluid 头文件全部解耦,并独立编译为动态链接库,为框架的二次开发提供更轻量的算子库复用方式;继续对飞桨框架中不规范的算子以及算子内核进行规范化调整,便于开发者理解,降低了硬件的接入成本。 -[#47856](https://github.com/PaddlePaddle/Paddle/pull/47856), [#49328](https://github.com/PaddlePaddle/Paddle/pull/49328), [#49138](https://github.com/PaddlePaddle/Paddle/pull/49138), [#52014](https://github.com/PaddlePaddle/Paddle/pull/52014), [#52044](https://github.com/PaddlePaddle/Paddle/pull/52044), [#52116](https://github.com/PaddlePaddle/Paddle/pull/52116), [#52486](https://github.com/PaddlePaddle/Paddle/pull/52486), [#52101](https://github.com/PaddlePaddle/Paddle/pull/52101), [#52882](https://github.com/PaddlePaddle/Paddle/pull/52882), [#53003](https://github.com/PaddlePaddle/Paddle/pull/53003), [#53034](https://github.com/PaddlePaddle/Paddle/pull/53034), [#51914](https://github.com/PaddlePaddle/Paddle/pull/51914), [#49116](https://github.com/PaddlePaddle/Paddle/pull/49116), [#52626](https://github.com/PaddlePaddle/Paddle/pull/52626), [#52878](https://github.com/PaddlePaddle/Paddle/pull/52878), [#52879](https://github.com/PaddlePaddle/Paddle/pull/52879), [#52880](https://github.com/PaddlePaddle/Paddle/pull/52880), [#52875](https://github.com/PaddlePaddle/Paddle/pull/52875), [#51600](https://github.com/PaddlePaddle/Paddle/pull/51600), [#51601](https://github.com/PaddlePaddle/Paddle/pull/51601), [#51590](https://github.com/PaddlePaddle/Paddle/pull/51590), [#51887](https://github.com/PaddlePaddle/Paddle/pull/51887), [#51891](https://github.com/PaddlePaddle/Paddle/pull/51891), 
[#52036](https://github.com/PaddlePaddle/Paddle/pull/52036), [#52130](https://github.com/PaddlePaddle/Paddle/pull/52130), [#52134](https://github.com/PaddlePaddle/Paddle/pull/52134), [#51951](https://github.com/PaddlePaddle/Paddle/pull/51951), [#51886](https://github.com/PaddlePaddle/Paddle/pull/51886), [#52274](https://github.com/PaddlePaddle/Paddle/pull/52274), [#52263](https://github.com/PaddlePaddle/Paddle/pull/52263), [#51913](https://github.com/PaddlePaddle/Paddle/pull/51913), [#52145](https://github.com/PaddlePaddle/Paddle/pull/52145), [#52347](https://github.com/PaddlePaddle/Paddle/pull/52347), [#52370](https://github.com/PaddlePaddle/Paddle/pull/52370), [#52437](https://github.com/PaddlePaddle/Paddle/pull/52437), [#52424](https://github.com/PaddlePaddle/Paddle/pull/52424), [#52231](https://github.com/PaddlePaddle/Paddle/pull/52231), [#52522](https://github.com/PaddlePaddle/Paddle/pull/52522), [#52529](https://github.com/PaddlePaddle/Paddle/pull/52529), [#52802](https://github.com/PaddlePaddle/Paddle/pull/52802), [#52799](https://github.com/PaddlePaddle/Paddle/pull/52799), [#52855](https://github.com/PaddlePaddle/Paddle/pull/52855), [#52711](https://github.com/PaddlePaddle/Paddle/pull/52711), [#52940](https://github.com/PaddlePaddle/Paddle/pull/52940), [#53309](https://github.com/PaddlePaddle/Paddle/pull/53309), [#47817](https://github.com/PaddlePaddle/Paddle/pull/47817), [#48001](https://github.com/PaddlePaddle/Paddle/pull/48001), [#48063](https://github.com/PaddlePaddle/Paddle/pull/48063), [#48049](https://github.com/PaddlePaddle/Paddle/pull/48049), [#48168](https://github.com/PaddlePaddle/Paddle/pull/48168), [#48415](https://github.com/PaddlePaddle/Paddle/pull/48415), [#48696](https://github.com/PaddlePaddle/Paddle/pull/48696), [#48970](https://github.com/PaddlePaddle/Paddle/pull/48970), [#50183](https://github.com/PaddlePaddle/Paddle/pull/50183), [#50407](https://github.com/PaddlePaddle/Paddle/pull/50407), 
[#50498](https://github.com/PaddlePaddle/Paddle/pull/50498), [#50419](https://github.com/PaddlePaddle/Paddle/pull/50419), [#50282](https://github.com/PaddlePaddle/Paddle/pull/50282), [#50870](https://github.com/PaddlePaddle/Paddle/pull/50870), [#50911](https://github.com/PaddlePaddle/Paddle/pull/50911), [#50865](https://github.com/PaddlePaddle/Paddle/pull/50865), [#51288](https://github.com/PaddlePaddle/Paddle/pull/51288), [#53735](https://github.com/PaddlePaddle/Paddle/pull/53735), [#47248](https://github.com/PaddlePaddle/Paddle/pull/47248), [#47787](https://github.com/PaddlePaddle/Paddle/pull/47787), [#52202](https://github.com/PaddlePaddle/Paddle/pull/52202), -[#47579](https://github.com/PaddlePaddle/Paddle/pull/47579), [#49444](https://github.com/PaddlePaddle/Paddle/pull/49444), [#45772](https://github.com/PaddlePaddle/Paddle/pull/45772), [#51264](https://github.com/PaddlePaddle/Paddle/pull/51264), [#51634](https://github.com/PaddlePaddle/Paddle/pull/51634), [#51631](https://github.com/PaddlePaddle/Paddle/pull/51631), [#47385](https://github.com/PaddlePaddle/Paddle/pull/47385), [#46342](https://github.com/PaddlePaddle/Paddle/pull/46342), [#47510](https://github.com/PaddlePaddle/Paddle/pull/47510), [#47532](https://github.com/PaddlePaddle/Paddle/pull/47532), [#47702](https://github.com/PaddlePaddle/Paddle/pull/47702), [#47860](https://github.com/PaddlePaddle/Paddle/pull/47860), [#49470](https://github.com/PaddlePaddle/Paddle/pull/49470), [#50358](https://github.com/PaddlePaddle/Paddle/pull/50358), [#49121](https://github.com/PaddlePaddle/Paddle/pull/49121), [#50190](https://github.com/PaddlePaddle/Paddle/pull/50190), [#52374](https://github.com/PaddlePaddle/Paddle/pull/52374), [#52372](https://github.com/PaddlePaddle/Paddle/pull/52372), [#52375](https://github.com/PaddlePaddle/Paddle/pull/52375), [#52371](https://github.com/PaddlePaddle/Paddle/pull/52371) - -### 动转静加组合算子 -#### 新功能 -- 组合算子添加 dropout, silu, stack, relu, expand, unsqueeze, pow, squeeze, meshgrid, 
batch_norm, layer_norm, group_norm, instance_norm, full_like, split, split_with_num, gelu, mean, flatten, rsqrt, hadswish 算子的组合规则 [#50497](https://github.com/PaddlePaddle/Paddle/pull/50497), [#50838](https://github.com/PaddlePaddle/Paddle/pull/50838), [#50861](https://github.com/PaddlePaddle/Paddle/pull/50861), [#50819](https://github.com/PaddlePaddle/Paddle/pull/50819), [#50810](https://github.com/PaddlePaddle/Paddle/pull/50810), [#51527](https://github.com/PaddlePaddle/Paddle/pull/51527), [#51070](https://github.com/PaddlePaddle/Paddle/pull/51070), [#51539](https://github.com/PaddlePaddle/Paddle/pull/51539), [#51061](https://github.com/PaddlePaddle/Paddle/pull/51061), [#49894](https://github.com/PaddlePaddle/Paddle/pull/49894), [#50422](https://github.com/PaddlePaddle/Paddle/pull/50422), [#51874](https://github.com/PaddlePaddle/Paddle/pull/51874), [#51341](https://github.com/PaddlePaddle/Paddle/pull/51341), [#50295](https://github.com/PaddlePaddle/Paddle/pull/50295), [#50298](https://github.com/PaddlePaddle/Paddle/pull/50298), [#50672](https://github.com/PaddlePaddle/Paddle/pull/50672), [#51432](https://github.com/PaddlePaddle/Paddle/pull/51432), [#51003](https://github.com/PaddlePaddle/Paddle/pull/51003) -- 组合算子添加 gather_nd, reduce_max, group_norm, relu, reduce_max, gather, topk, sqrt, elementwise_pow, softmax, batch_norm, prod, multiply, expand, div, relu, slice, cumsum, sigmoid, layer_norm, sin, cos, roll, instance_norm, abs, assign, tile, scatter_nd_add, erf, floor, log, silu, leaky_relu, pad 算子的 vjp 规则 [#50966](https://github.com/PaddlePaddle/Paddle/pull/50966), [#51653](https://github.com/PaddlePaddle/Paddle/pull/51653), [#52663](https://github.com/PaddlePaddle/Paddle/pull/52663), [#51742](https://github.com/PaddlePaddle/Paddle/pull/51742), [#52203](https://github.com/PaddlePaddle/Paddle/pull/52203), [#50794](https://github.com/PaddlePaddle/Paddle/pull/50794), [#50305](https://github.com/PaddlePaddle/Paddle/pull/50305), 
[#50786](https://github.com/PaddlePaddle/Paddle/pull/50786), [#50679](https://github.com/PaddlePaddle/Paddle/pull/50679), [#51045](https://github.com/PaddlePaddle/Paddle/pull/51045), [#51230](https://github.com/PaddlePaddle/Paddle/pull/51230), [#51474](https://github.com/PaddlePaddle/Paddle/pull/51474), [#51283](https://github.com/PaddlePaddle/Paddle/pull/51283), [#51238](https://github.com/PaddlePaddle/Paddle/pull/51238), [#49831](https://github.com/PaddlePaddle/Paddle/pull/49831), [#51838](https://github.com/PaddlePaddle/Paddle/pull/51838), [#50771](https://github.com/PaddlePaddle/Paddle/pull/50771), [#50565](https://github.com/PaddlePaddle/Paddle/pull/50565), [#51768](https://github.com/PaddlePaddle/Paddle/pull/51768), [#51750](https://github.com/PaddlePaddle/Paddle/pull/51750), [#51748](https://github.com/PaddlePaddle/Paddle/pull/51748), [#52532](https://github.com/PaddlePaddle/Paddle/pull/52532), [#52935](https://github.com/PaddlePaddle/Paddle/pull/52935), [#50963](https://github.com/PaddlePaddle/Paddle/pull/50963), [#51430](https://github.com/PaddlePaddle/Paddle/pull/51430), [#53141](https://github.com/PaddlePaddle/Paddle/pull/53141), [#52469](https://github.com/PaddlePaddle/Paddle/pull/52469), [#50436](https://github.com/PaddlePaddle/Paddle/pull/50436), [#51059](https://github.com/PaddlePaddle/Paddle/pull/51059), [#51296](https://github.com/PaddlePaddle/Paddle/pull/51296), [#52533](https://github.com/PaddlePaddle/Paddle/pull/52533), [#53374](https://github.com/PaddlePaddle/Paddle/pull/53374) -- 组合算子添加 matmul, tanh, elementwise 二阶微分规则 [#50452](https://github.com/PaddlePaddle/Paddle/pull/50452), [#52192](https://github.com/PaddlePaddle/Paddle/pull/52192), [#53014](https://github.com/PaddlePaddle/Paddle/pull/53014) -- 组合算子添加 exp, reduce_mean, softmax, divide, cast, layer_norm, prod, meshgrid, expand_as, dropout, concat, gather_nd, elementwise_max, elementwise_pow, reduce_max 组合算子 bf16 数据类型支持 [#54263](https://github.com/PaddlePaddle/Paddle/pull/54263), 
[#54236](https://github.com/PaddlePaddle/Paddle/pull/54236), [#53865](https://github.com/PaddlePaddle/Paddle/pull/53865), [#54175](https://github.com/PaddlePaddle/Paddle/pull/54175), [#54399](https://github.com/PaddlePaddle/Paddle/pull/54399) -- 动转静新增控制流中的容器添加赋值语义支持 [#51248](https://github.com/PaddlePaddle/Paddle/pull/51248) -- 动转静新增全图回退功能,当动转静转换失败时,可全图回退到动态图方式执行; 回退机制增加 set_eval_frame 接口 [#50111](https://github.com/PaddlePaddle/Paddle/pull/50111), [#52006](https://github.com/PaddlePaddle/Paddle/pull/52006) -- 动转静 to_static 支持算子组合机制;支持被 to_static 装饰下使用 register_hook 的场景; [#49836](https://github.com/PaddlePaddle/Paddle/pull/49836), [#52948](https://github.com/PaddlePaddle/Paddle/pull/52948), [#53572](https://github.com/PaddlePaddle/Paddle/pull/53572) -- 动转静 to_static 接口增加 backend 参数, 可以指定为 `CINN` 或者 None,当该参数指定为 `CINN` 时,将会使用 CINN 编译器来加速训练和推理 [#52596](https://github.com/PaddlePaddle/Paddle/pull/52596) -- 新增 primitive 接口代码自动生成功能,根据 ops.yaml 和 legacy_ops.yaml 中的算子定义;自动生成 primitive 接口的代码;自动生成 Tensor 运算接口 [#50315](https://github.com/PaddlePaddle/Paddle/pull/50315), [#49654](https://github.com/PaddlePaddle/Paddle/pull/49654), [#50642](https://github.com/PaddlePaddle/Paddle/pull/50642) -- 新增算子前向组合功能,通过注册前向算子的组合规则,实现将前向算子拆分成基础算子 [#49605](https://github.com/PaddlePaddle/Paddle/pull/49605) -- 新增组合算子开关,可以在 shell 中通过设置环境变量,实现算子按照不同方式进行拆分 [#50309](https://github.com/PaddlePaddle/Paddle/pull/50309) -- 添加`OpTest`新增组合测试功能,对算子精度进行保障;添加 elementwise 类基础算子单测;添加 batch_norm 的 CINN 单测 [#50509](https://github.com/PaddlePaddle/Paddle/pull/50509), [#50807](https://github.com/PaddlePaddle/Paddle/pull/50807), [#52815](https://github.com/PaddlePaddle/Paddle/pull/52815) - -#### 功能优化 -- 添加组合算子支持 FP16 运算和 AMP O1 运算;添加 softmax 和 layer_norm 算子 AMP 逻辑 [#52397](https://github.com/PaddlePaddle/Paddle/pull/52397), [#52598](https://github.com/PaddlePaddle/Paddle/pull/52598), [#51473](https://github.com/PaddlePaddle/Paddle/pull/51473) -- 简化组合算子 batch_norm 的组合规则和 vjp 规则 
[#54012](https://github.com/PaddlePaddle/Paddle/pull/54012), [#51827](https://github.com/PaddlePaddle/Paddle/pull/51827), [#51933](https://github.com/PaddlePaddle/Paddle/pull/51933), -- 组合算子优化组合规则,提升含 scalar 组合规则的性能;优化组合算子日志打印 [#51960](https://github.com/PaddlePaddle/Paddle/pull/51960), [#50160](https://github.com/PaddlePaddle/Paddle/pull/50160) -- 组合算子支持 jit.save 接口;新增自定义 VJP 规则接口 [#52344](https://github.com/PaddlePaddle/Paddle/pull/52344), [#50885](https://github.com/PaddlePaddle/Paddle/pull/50885) -- 组合算子 gather_grad 删除 overwrite 参数。 [#52707](https://github.com/PaddlePaddle/Paddle/pull/52707) -- 动转静代码风格清理,报错信息优化,规范日志 [#48637](https://github.com/PaddlePaddle/Paddle/pull/48637), [#46128](https://github.com/PaddlePaddle/Paddle/pull/46128), [#52527](https://github.com/PaddlePaddle/Paddle/pull/52527), [#46800](https://github.com/PaddlePaddle/Paddle/pull/46800),[#46415](https://github.com/PaddlePaddle/Paddle/pull/46415) -- 动转静通过调用 append backward 的方式获取`grad var name`以修复高阶梯度计算时的错误 [#53250](https://github.com/PaddlePaddle/Paddle/pull/53250) -- 动转静功能升级,清理 to_static 的临时目录以加速代码转换;增强 to_static 自动略过内部接口;支持在程序使用 to_static 装饰器 [#47102](https://github.com/PaddlePaddle/Paddle/pull/47102), [#50596](https://github.com/PaddlePaddle/Paddle/pull/50596), [#45768](https://github.com/PaddlePaddle/Paddle/pull/45768) -- 动转静优化`print`函数转换以支持在组网阶段打印 Tensor 参数;升级参数收集机制 [#48672](https://github.com/PaddlePaddle/Paddle/pull/48672), [#50336](https://github.com/PaddlePaddle/Paddle/pull/50336) - -#### bug fix -- 组合算子修复 cmake 编译错误;修复 cuda 12 测试错误;修复若干算子如 meshgird, expand_as, concat, conv, arrange 等错误[#49643](https://github.com/PaddlePaddle/Paddle/pull/49643), [#54622](https://github.com/PaddlePaddle/Paddle/pull/54622), [#53951](https://github.com/PaddlePaddle/Paddle/pull/53951), [#53951](https://github.com/PaddlePaddle/Paddle/pull/53951), [#53350](https://github.com/PaddlePaddle/Paddle/pull/53350), [#51486](https://github.com/PaddlePaddle/Paddle/pull/51486), 
[#52764](https://github.com/PaddlePaddle/Paddle/pull/52764) -- 组合算子修复若干 rank=1, shape=-1, amp, 多进程等场景下的 bug;[#51413](https://github.com/PaddlePaddle/Paddle/pull/51413), [#51435](https://github.com/PaddlePaddle/Paddle/pull/51435), [#50518](https://github.com/PaddlePaddle/Paddle/pull/50518), [#47301](https://github.com/PaddlePaddle/Paddle/pull/47301), -- 组合算子修复 composite grad maker 和 static prim api 自动代码生成 bug; 修复 op 创建属性丢失和部分组合规则不生效的 bug [#50854](https://github.com/PaddlePaddle/Paddle/pull/50854), [#51445](https://github.com/PaddlePaddle/Paddle/pull/51445), [#50780](https://github.com/PaddlePaddle/Paddle/pull/50780), [#52120](https://github.com/PaddlePaddle/Paddle/pull/52120) -- 组合算子修复一些其他 bug [#50086](https://github.com/PaddlePaddle/Paddle/pull/50086), [#51208](https://github.com/PaddlePaddle/Paddle/pull/51208), [#51577](https://github.com/PaddlePaddle/Paddle/pull/51577), [#53598](https://github.com/PaddlePaddle/Paddle/pull/53598), [#47500](https://github.com/PaddlePaddle/Paddle/pull/47500), [#52119](https://github.com/PaddlePaddle/Paddle/pull/52119), [#50397](https://github.com/PaddlePaddle/Paddle/pull/50397), [#50527](https://github.com/PaddlePaddle/Paddle/pull/50527), [#50788](https://github.com/PaddlePaddle/Paddle/pull/50788), [#51014](https://github.com/PaddlePaddle/Paddle/pull/51014), [#52154](https://github.com/PaddlePaddle/Paddle/pull/52154), [#52752](https://github.com/PaddlePaddle/Paddle/pull/52752) -- 动转静修复 dataloader, cond 输入 dict, transformer 导入, T5 模型内存泄露, grad var name 解析错误等 bug [#49821](https://github.com/PaddlePaddle/Paddle/pull/49821), [#47299](https://github.com/PaddlePaddle/Paddle/pull/47299), [#50776](https://github.com/PaddlePaddle/Paddle/pull/50776), [#50883](https://github.com/PaddlePaddle/Paddle/pull/50883), [#51100](https://github.com/PaddlePaddle/Paddle/pull/51100), [#51464](https://github.com/PaddlePaddle/Paddle/pull/51464), [#51966](https://github.com/PaddlePaddle/Paddle/pull/51966), 
[#52110](https://github.com/PaddlePaddle/Paddle/pull/52110), [#52821](https://github.com/PaddlePaddle/Paddle/pull/52821) -- 动转静修复 Lazy 初始化,Windows 训练,is_paddle_func 失效,recurrent op 删除 pass 失败等错误 [#50785](https://github.com/PaddlePaddle/Paddle/pull/50785), [#52580](https://github.com/PaddlePaddle/Paddle/pull/52580), [#51585](https://github.com/PaddlePaddle/Paddle/pull/51585), [#51763](https://github.com/PaddlePaddle/Paddle/pull/51763), [#51763](https://github.com/PaddlePaddle/Paddle/pull/51763) +#### 新增功能 +- 新增部分算子对 BF16 数据类型的支持,包括 compare_kernel 与 add reduce_all_kernel([#63602](https://github.com/PaddlePaddle/Paddle/pull/63602))、empty([#60212](https://github.com/PaddlePaddle/Paddle/pull/60212))、hybrid_parallel_optimizer([#60213](https://github.com/PaddlePaddle/Paddle/pull/60213))、reduce_max/reduce_min([#60453](https://github.com/PaddlePaddle/Paddle/pull/60453))、all_reduce/concat/split([#62364](https://github.com/PaddlePaddle/Paddle/pull/62364))、tile/tile_grad([#63075](https://github.com/PaddlePaddle/Paddle/pull/63075))、accuracy([#63863](https://github.com/PaddlePaddle/Paddle/pull/63863)), swiglu/set_value([#64070](https://github.com/PaddlePaddle/Paddle/pull/64070))、amp_master_grad([#63865](https://github.com/PaddlePaddle/Paddle/pull/63865))、c_concat ([#63403](https://github.com/PaddlePaddle/Paddle/pull/63403))、flatten ([#63997](https://github.com/PaddlePaddle/Paddle/pull/63997))、compare_op ([#64473](https://github.com/PaddlePaddle/Paddle/pull/64473))、moment1/moment2 ([#62688](https://github.com/PaddlePaddle/Paddle/pull/62688))、fused_rope ([#60064](https://github.com/PaddlePaddle/Paddle/pull/60064))、c_softmax_with_cross_entropy ([#60472](https://github.com/PaddlePaddle/Paddle/pull/60472))、elementwise_pow/square/sin/cos ([#60402](https://github.com/PaddlePaddle/Paddle/pull/60402))、strided_slice ([#60382](https://github.com/PaddlePaddle/Paddle/pull/60382))、tile/sigmoid_grad ([#60119](https://github.com/PaddlePaddle/Paddle/pull/60119))、 elementwise_sub/elementwise_div 
([#60386](https://github.com/PaddlePaddle/Paddle/pull/60386))、softmax_with_cross_entropy ([#63759](https://github.com/PaddlePaddle/Paddle/pull/63759)) +- 新增部分算子对 INT8 数据类型的支持,包括 multi_encoder_xpu ([#61212](https://github.com/PaddlePaddle/Paddle/pull/61212))、qkv_attention ([#63105](https://github.com/PaddlePaddle/Paddle/pull/63105)) +- 更新昆仑 SDK 版本包括 BKCL、XHPC、XCCL 等。 [#59895](https://github.com/PaddlePaddle/Paddle/pull/59895)、[#59888](https://github.com/PaddlePaddle/Paddle/pull/59888)、[#63624](https://github.com/PaddlePaddle/Paddle/pull/63624), [#60305](https://github.com/PaddlePaddle/Paddle/pull/60305), [#62076](https://github.com/PaddlePaddle/Paddle/pull/62076), [#62646](https://github.com/PaddlePaddle/Paddle/pull/62646), [#63520](https://github.com/PaddlePaddle/Paddle/pull/63520), [#64163](https://github.com/PaddlePaddle/Paddle/pull/64163), [#64326](https://github.com/PaddlePaddle/Paddle/pull/64326), [#60617](https://github.com/PaddlePaddle/Paddle/pull/60617), [#60377](https://github.com/PaddlePaddle/Paddle/pull/60377), [#60421](https://github.com/PaddlePaddle/Paddle/pull/60421), [#60598](https://github.com/PaddlePaddle/Paddle/pull/60598), [#61199](https://github.com/PaddlePaddle/Paddle/pull/61199) +- 新增对 memory stat 功能支持。[#61116](https://github.com/PaddlePaddle/Paddle/pull/61116) +- 新增多 stream 支持,且可以给每个 stream 分配默认的 l3/gm buffer 大小。 [#62729](https://github.com/PaddlePaddle/Paddle/pull/62729) +- 新增 nonzero 算子支持 simulator XPUSIM_SKIP_RUN 模式。[#60224](https://github.com/PaddlePaddle/Paddle/pull/60224)、[#60388](https://github.com/PaddlePaddle/Paddle/pull/60388) +- 新增 stride_slice 和 stride_slice_grad 算子支持 strides < 0。 [#62749](https://github.com/PaddlePaddle/Paddle/pull/62749) +- 新增 rotary_embedding 对 use_neox_rotary_style == True 的支持。[#64090](https://github.com/PaddlePaddle/Paddle/pull/64090) +- 新增融合 Pass 和融合算子,包括 cross_attention ([#63203](https://github.com/PaddlePaddle/Paddle/pull/63203))、fused_bias_act
([#62232](https://github.com/PaddlePaddle/Paddle/pull/62232))、fused_layernorm ([#62228](https://github.com/PaddlePaddle/Paddle/pull/62228))、group_norm_silu_xpu_fuse_pass ([#63342](https://github.com/PaddlePaddle/Paddle/pull/63342)) +- 新增对分布式策略 sharding stage3 的支持。 [#57457](https://github.com/PaddlePaddle/Paddle/pull/57457) +- 新增 tf32 fc quantization 模式的支持。[#62273](https://github.com/PaddlePaddle/Paddle/pull/62273) +- 新增 flash attention 算子。[#60065](https://github.com/PaddlePaddle/Paddle/pull/60065) +- 新增 roformer relative embedding pass & kernel 并支持 multi_encoder_xpu。[#62089](https://github.com/PaddlePaddle/Paddle/pull/62089) +- 新增 pp + sharding 策略支持。[#63640](https://github.com/PaddlePaddle/Paddle/pull/63640) +- 升级 XPU 通信库架构以支持动静统一的通信库功能。[#63817](https://github.com/PaddlePaddle/Paddle/pull/63817) #### 性能优化 -- 动转静调用 run_program_op 的执行过程中,增加 scope 缓存和复用机制,避免每个 step 都会传入新的 scope [#45813](https://github.com/PaddlePaddle/Paddle/pull/45813) - -### 分布式训练 -#### 动态图分布式 -- 去除旧动态图分布式 sharding 功能 API [#49334](https://github.com/PaddlePaddle/Paddle/pull/49334) -- fleet 升级到 distributed 目录 [#50834](https://github.com/PaddlePaddle/Paddle/pull/50834) -- 优化分布式策略的日志打印。[#47761](https://github.com/PaddlePaddle/Paddle/pull/47761) -- 重计算支持 hook 模式、inplace 功能、stop_gradient 模式,支持更灵活的使用。 [#48471](https://github.com/PaddlePaddle/Paddle/pull/48471), [#47985](https://github.com/PaddlePaddle/Paddle/pull/47985) -- 数据并行 - - 数据并行支持 no_sync 接口,用于屏蔽参数梯度通信;参数同步功能;添加 scale 接口,缩放参数。[#47536](https://github.com/PaddlePaddle/Paddle/pull/47536),[#51895](https://github.com/PaddlePaddle/Paddle/pull/51895),[#47519](https://github.com/PaddlePaddle/Paddle/pull/47519) - - 修复数据并行下显存泄露问题。[#47369](https://github.com/PaddlePaddle/Paddle/pull/47369),[#47444](https://github.com/PaddlePaddle/Paddle/pull/47444),[#48668](https://github.com/PaddlePaddle/Paddle/pull/48668) - - 支持 sparse 参数梯度同步。[#52785](https://github.com/PaddlePaddle/Paddle/pull/52785) -- 流水线并行 - - 优化流水线性能,去除通信等待,优化调度,通信 
overlap。[#46209](https://github.com/PaddlePaddle/Paddle/pull/46209),[#54003](https://github.com/PaddlePaddle/Paddle/pull/54003),[#54312](https://github.com/PaddlePaddle/Paddle/pull/54312),[#53384](https://github.com/PaddlePaddle/Paddle/pull/53384),[#54310](https://github.com/PaddlePaddle/Paddle/pull/54310),[#46399](https://github.com/PaddlePaddle/Paddle/pull/46399),[#46483](https://github.com/PaddlePaddle/Paddle/pull/46483),[#46780](https://github.com/PaddlePaddle/Paddle/pull/46780),[#46116](https://github.com/PaddlePaddle/Paddle/pull/46116) - - 支持自定义切分,日志打印,随机种子设置,timer 耗时打印。[#53344](https://github.com/PaddlePaddle/Paddle/pull/53344), [#47670](https://github.com/PaddlePaddle/Paddle/pull/47670),[#47336](https://github.com/PaddlePaddle/Paddle/pull/47336),[#52656](https://github.com/PaddlePaddle/Paddle/pull/52656),[#53831](https://github.com/PaddlePaddle/Paddle/pull/53831) - - 优化流水线调度中的显存释放逻辑,提前释放中间变量和数据。[#54557](https://github.com/PaddlePaddle/Paddle/pull/54557), [#47199](https://github.com/PaddlePaddle/Paddle/pull/47199),[#47497](https://github.com/PaddlePaddle/Paddle/pull/47497),[#48045](https://github.com/PaddlePaddle/Paddle/pull/48045),[#54672](https://github.com/PaddlePaddle/Paddle/pull/54672) - - 支持流水线并行的 VPP 模式,模型保存。[#54196](https://github.com/PaddlePaddle/Paddle/pull/54196), [#52927](https://github.com/PaddlePaddle/Paddle/pull/52927),[#47801](https://github.com/PaddlePaddle/Paddle/pull/47801),[#45922](https://github.com/PaddlePaddle/Paddle/pull/45922),[#47242](https://github.com/PaddlePaddle/Paddle/pull/47242) -- 分组切分并行 - - sharding stage2 并行支持量化功能,混合并行训练,梯度累加,XPU 硬件,BF16 低精度计算、优化器学习率设置、offload 功能、数据并行。[#47169](https://github.com/PaddlePaddle/Paddle/pull/47169),[#47535](https://github.com/PaddlePaddle/Paddle/pull/47535), 
[#46795](https://github.com/PaddlePaddle/Paddle/pull/46795),[#47711](https://github.com/PaddlePaddle/Paddle/pull/47711),[#48310](https://github.com/PaddlePaddle/Paddle/pull/48310),[#46846](https://github.com/PaddlePaddle/Paddle/pull/46846),[#48857](https://github.com/PaddlePaddle/Paddle/pull/48857),[#49196](https://github.com/PaddlePaddle/Paddle/pull/49196),[#49931](https://github.com/PaddlePaddle/Paddle/pull/49931),[#47114](https://github.com/PaddlePaddle/Paddle/pull/47114),[#49767](https://github.com/PaddlePaddle/Paddle/pull/49767) - - sharing stage2 性能优化,支持通信计算 overlap。[#46495](https://github.com/PaddlePaddle/Paddle/pull/46495),[#46894](https://github.com/PaddlePaddle/Paddle/pull/46894) - - sharding stage3 支持共享参数、不可训练参数。[#48695](https://github.com/PaddlePaddle/Paddle/pull/48695),[#48577](https://github.com/PaddlePaddle/Paddle/pull/48577) -- 张量模型并行 - - 张量模型并行性能优化,减少 stream 切流对性能的影响。[#47715](https://github.com/PaddlePaddle/Paddle/pull/47715),[#51617](https://github.com/PaddlePaddle/Paddle/pull/51617) - - 支持参数、优化器状体、梯度同步。[#51428](https://github.com/PaddlePaddle/Paddle/pull/51428),[#53254](https://github.com/PaddlePaddle/Paddle/pull/53254), [#53335](https://github.com/PaddlePaddle/Paddle/pull/53335),[#45803](https://github.com/PaddlePaddle/Paddle/pull/45803),[#46303](https://github.com/PaddlePaddle/Paddle/pull/46303),[#52293](https://github.com/PaddlePaddle/Paddle/pull/52293) - - 优化张量模型并行算子,如 c_embedding、softmax_with_corss_entropy。[#53197](https://github.com/PaddlePaddle/Paddle/pull/53197),[#53547](https://github.com/PaddlePaddle/Paddle/pull/53547),[#53541](https://github.com/PaddlePaddle/Paddle/pull/53541),[#52789](https://github.com/PaddlePaddle/Paddle/pull/52789),[#46491](https://github.com/PaddlePaddle/Paddle/pull/46491),[#52742](https://github.com/PaddlePaddle/Paddle/pull/52742),[#53419](https://github.com/PaddlePaddle/Paddle/pull/53419) -- Launch 启动 - - 支持分布式 Launch 
功能,保存独立日志。[#53207](https://github.com/PaddlePaddle/Paddle/pull/53207),[#50405](https://github.com/PaddlePaddle/Paddle/pull/50405) - - 新增框架打印环境变量功能,日志覆盖功能,日志返回,环境检查,便于 debug 环境变量的改动。[#53243](https://github.com/PaddlePaddle/Paddle/pull/53243),[#53243](https://github.com/PaddlePaddle/Paddle/pull/53243), [#51803](https://github.com/PaddlePaddle/Paddle/pull/51803), [#53990](https://github.com/PaddlePaddle/Paddle/pull/53990) -- 通信库 - - 增加自定义混合并行通信组,拓扑结构信息打印,自定义通信拓扑顺序。[#47021](https://github.com/PaddlePaddle/Paddle/pull/47021),[#54000](https://github.com/PaddlePaddle/Paddle/pull/54000),[#51781](https://github.com/PaddlePaddle/Paddle/pull/51781) - - 去除通信库对 Place 信息依赖 [#47857](https://github.com/PaddlePaddle/Paddle/pull/47857) - - 增加通信库对 GLOO 算子支持,支持 send/recv/gather。 [#52221](https://github.com/PaddlePaddle/Paddle/pull/52221), [#52334](https://github.com/PaddlePaddle/Paddle/pull/52334),[#49084](https://github.com/PaddlePaddle/Paddle/pull/49084) - - 禁止通信算子的反向计算。[#47636](https://github.com/PaddlePaddle/Paddle/pull/47636) - - 新增通信库静态 shape check,帮助判别通信量是否匹配。[#48256](https://github.com/PaddlePaddle/Paddle/pull/48256),[#48915](https://github.com/PaddlePaddle/Paddle/pull/48915),[#48646](https://github.com/PaddlePaddle/Paddle/pull/48646) - - 支持通信 python object 类型,BF16 类型,alltoall,reduce,allgather,group call,global gather,broadcast,scatter 通信方式,XPU 设备通信支持。[#51765](https://github.com/PaddlePaddle/Paddle/pull/51765),[#45844](https://github.com/PaddlePaddle/Paddle/pull/45844),[#48059](https://github.com/PaddlePaddle/Paddle/pull/48059),[#48115](https://github.com/PaddlePaddle/Paddle/pull/48115), 
[#48339](https://github.com/PaddlePaddle/Paddle/pull/48339),[#49252](https://github.com/PaddlePaddle/Paddle/pull/49252),[#49451](https://github.com/PaddlePaddle/Paddle/pull/49451),[#50085](https://github.com/PaddlePaddle/Paddle/pull/50085),[#50701](https://github.com/PaddlePaddle/Paddle/pull/50701),[#48208](https://github.com/PaddlePaddle/Paddle/pull/48208),[#48736](https://github.com/PaddlePaddle/Paddle/pull/48736),[#51762](https://github.com/PaddlePaddle/Paddle/pull/51762),[#52495](https://github.com/PaddlePaddle/Paddle/pull/52495),[#53514](https://github.com/PaddlePaddle/Paddle/pull/53514),[#48232](https://github.com/PaddlePaddle/Paddle/pull/48232),[#49896](https://github.com/PaddlePaddle/Paddle/pull/49896),[#49941](https://github.com/PaddlePaddle/Paddle/pull/49941),[#45584](https://github.com/PaddlePaddle/Paddle/pull/45584) - - 新增对计算流通信功能。[#46182](https://github.com/PaddlePaddle/Paddle/pull/46182),[#46023](https://github.com/PaddlePaddle/Paddle/pull/46023),[#46295](https://github.com/PaddlePaddle/Paddle/pull/46295),[#46761](https://github.com/PaddlePaddle/Paddle/pull/46761),[#47481](https://github.com/PaddlePaddle/Paddle/pull/47481),[#47740](https://github.com/PaddlePaddle/Paddle/pull/47740),[#47976](https://github.com/PaddlePaddle/Paddle/pull/47976),[#48163](https://github.com/PaddlePaddle/Paddle/pull/48163),[#48396](https://github.com/PaddlePaddle/Paddle/pull/48396),[#48308](https://github.com/PaddlePaddle/Paddle/pull/48308),[#47110](https://github.com/PaddlePaddle/Paddle/pull/47110),[#53089](https://github.com/PaddlePaddle/Paddle/pull/53089) - - 优化通信库 TCP 建联时间。[#49810](https://github.com/PaddlePaddle/Paddle/pull/49810),[#47184](https://github.com/PaddlePaddle/Paddle/pull/47184) - -#### 自动并行 -- 静态图半自动并行功能完善: - - 新增多个算子的 FLOPs 计算函数,并新增基于 FLOPs 的计算 Cost 建模 
[#48083](https://github.com/PaddlePaddle/Paddle/pull/48083),[#47978](https://github.com/PaddlePaddle/Paddle/pull/47978),[#47595](https://github.com/PaddlePaddle/Paddle/pull/47595),[#48083](https://github.com/PaddlePaddle/Paddle/pull/48083),[#48084](https://github.com/PaddlePaddle/Paddle/pull/48084),[#47816](https://github.com/PaddlePaddle/Paddle/pull/47816) - - 接口易用性提升,完善 DistAttr, Process Mesh, Engine API、信息打印、输入输出等模块;执行 Engine 新增 cost 接口,可用于理论分析模型运行的时间和显存开销 [#47503](https://github.com/PaddlePaddle/Paddle/pull/47503),[#46416](https://github.com/PaddlePaddle/Paddle/pull/46416),[#46554](https://github.com/PaddlePaddle/Paddle/pull/46554), [#46633](https://github.com/PaddlePaddle/Paddle/pull/46633),[#49214](https://github.com/PaddlePaddle/Paddle/pull/49214),[#53848](https://github.com/PaddlePaddle/Paddle/pull/53848),[#46552](https://github.com/PaddlePaddle/Paddle/pull/46552), [#47043](https://github.com/PaddlePaddle/Paddle/pull/47043), [#49665](https://github.com/PaddlePaddle/Paddle/pull/49665), [#52912](https://github.com/PaddlePaddle/Paddle/pull/52912), [#45776](https://github.com/PaddlePaddle/Paddle/pull/45776), [#47263](https://github.com/PaddlePaddle/Paddle/pull/47263) - - 优化 Pass 的通用性和易用性升级,支持更多场景、减少 Pass 预分析耗时 [#46519](https://github.com/PaddlePaddle/Paddle/pull/46519),[#47358](https://github.com/PaddlePaddle/Paddle/pull/47358),[#46391](https://github.com/PaddlePaddle/Paddle/pull/46391), [#51035](https://github.com/PaddlePaddle/Paddle/pull/51035) - - 调试能力增强,添加分布式随机性控制机制和混合并行精度对齐工具 [#52903](https://github.com/PaddlePaddle/Paddle/pull/52903),[#49865](https://github.com/PaddlePaddle/Paddle/pull/49865) - - 支持推理生成任务组网的自动切分, 适配生成模型中的控制流、conditional block 等特殊用法 [#46771](https://github.com/PaddlePaddle/Paddle/pull/46771), [#54067](https://github.com/PaddlePaddle/Paddle/pull/54067) - - 完善 grad_clip,支持了数据并行场景下的负载均衡。[#49510](https://github.com/PaddlePaddle/Paddle/pull/49510), [#49249](https://github.com/PaddlePaddle/Paddle/pull/49249) -- 静态图半自动并行性能提升: - - 新增 Sharding Pass 
自动化通信 Fuse 和 多流通信功能,GPT 6.7B 模型两机上吞吐性能提升 26% [#48604](https://github.com/PaddlePaddle/Paddle/pull/48604), [#47180](https://github.com/PaddlePaddle/Paddle/pull/47180),[#46180](https://github.com/PaddlePaddle/Paddle/pull/46180) - - 新增 Recompute 优化策略调优功能,支持根据显存和模型大小选择最优 recompute checkpoint 设置 [#48608](https://github.com/PaddlePaddle/Paddle/pull/48608),[#47846](https://github.com/PaddlePaddle/Paddle/pull/47846),[#49010](https://github.com/PaddlePaddle/Paddle/pull/49010) - - 流水线并行新增 1F1B 调度优化 Pass [#54260](https://github.com/PaddlePaddle/Paddle/pull/54260), [#45915](https://github.com/PaddlePaddle/Paddle/pull/45915) - - 数据并行优化,支持融合通信和通信计算 Overlap 等优化, GPT 1.3B 模型内性能提升 5% [#48092](https://github.com/PaddlePaddle/Paddle/pull/48092),[#45643](https://github.com/PaddlePaddle/Paddle/pull/45643),[#49744](https://github.com/PaddlePaddle/Paddle/pull/49744), [#47578](https://github.com/PaddlePaddle/Paddle/pull/47578) - - 优化 Reshard 模块 concate 性能,减少部分场景下 concate 次数。[#47809](https://github.com/PaddlePaddle/Paddle/pull/47809) - - 混合精度优化 Pass 性能升级, 支持 BF16 低精度, 适配 while 循环控制流的自动混合并行等 [#51285](https://github.com/PaddlePaddle/Paddle/pull/51285),[#51147](https://github.com/PaddlePaddle/Paddle/pull/51147), [#49219](https://github.com/PaddlePaddle/Paddle/pull/49219), [#49079](https://github.com/PaddlePaddle/Paddle/pull/49079) -- 静态图全自动并行功能完善: - - 新增基于规则的全自动搜索策略 [#51859](https://github.com/PaddlePaddle/Paddle/pull/51859),[#51908](https://github.com/PaddlePaddle/Paddle/pull/51908),[#52053](https://github.com/PaddlePaddle/Paddle/pull/52053),[#48316](https://github.com/PaddlePaddle/Paddle/pull/48316),[#48464](https://github.com/PaddlePaddle/Paddle/pull/48464), [#52041](https://github.com/PaddlePaddle/Paddle/pull/52041) - - 自动并行建模能力完善,丰富单节点内拓扑建模、通信量建模等。 [#52723](https://github.com/PaddlePaddle/Paddle/pull/52723),[#46387](https://github.com/PaddlePaddle/Paddle/pull/46387),[#47043](https://github.com/PaddlePaddle/Paddle/pull/47043) - -#### 参数服务器 -- 清空 ps 目录下 all 列表,其中 API 不暴露 
[#51289](https://github.com/PaddlePaddle/Paddle/pull/51289)
-- 清理 cvm 算子 [#48989](https://github.com/PaddlePaddle/Paddle/pull/48989)
-- GPUPS 新增对 AFS 支持。[#46611](https://github.com/PaddlePaddle/Paddle/pull/46611)
-- PGLBOX2.0 日志降级、修复 dense 参数卡住问题、修复 barrier 不生效的问题、增加 get_epoch_finish python 端接口。[#49946](https://github.com/PaddlePaddle/Paddle/pull/49946),[#50166](https://github.com/PaddlePaddle/Paddle/pull/50166),[#50349](https://github.com/PaddlePaddle/Paddle/pull/50349)
-- GPUPS 运行切换到指定模式。[#51115](https://github.com/PaddlePaddle/Paddle/pull/51115)
-- GPUPS 加入 benchmark。[#49587](https://github.com/PaddlePaddle/Paddle/pull/49587),[#49649](https://github.com/PaddlePaddle/Paddle/pull/49649)
-- GPUPS 优化器选择问题修复,修复 reader 读取问题,修复 RPC 编译问题。[#47026](https://github.com/PaddlePaddle/Paddle/pull/47026),[#47192](https://github.com/PaddlePaddle/Paddle/pull/47192),[#49878](https://github.com/PaddlePaddle/Paddle/pull/49878),[#46356](https://github.com/PaddlePaddle/Paddle/pull/46356),[#46575](https://github.com/PaddlePaddle/Paddle/pull/46575),[#49389](https://github.com/PaddlePaddle/Paddle/pull/49389),[#46258](https://github.com/PaddlePaddle/Paddle/pull/46258),[#50136](https://github.com/PaddlePaddle/Paddle/pull/50136)
-- 增加 rocksdb 编译方式。[#46074](https://github.com/PaddlePaddle/Paddle/pull/46074)
-
-### CUDA
-#### 新功能
-- 新增对 CUDA 12.0 的编译支持,并修复相关单测 ([#49539](https://github.com/PaddlePaddle/Paddle/pull/49539), [#54542](https://github.com/PaddlePaddle/Paddle/pull/54542))
-- 新增 CUDNN Frontend API 的编译支持及相关单测,可以使用 `WITH_CUDNN_FRONTEND=ON` 的编译选项进行开启。([#47524](https://github.com/PaddlePaddle/Paddle/pull/47524), [#47612](https://github.com/PaddlePaddle/Paddle/pull/47612))
-
-#### 功能优化
-- 混合精度策略及精度优化:
-  - 新增及优化了框架 200 余个算子的 FP16、BF16 数据类型支持,包括 logsumexp,reduce_max,cumprod,sync_batch_norm,compare 类 OP 等,并对所有 FP16、BF16 算子进行了精度优化及单测覆盖,针对低精度算子完善单测框架功能,确保在大模型训推过程中精度无损。([#51193](https://github.com/PaddlePaddle/Paddle/pull/51193), [#51114](https://github.com/PaddlePaddle/Paddle/pull/51114),
[#45817](https://github.com/PaddlePaddle/Paddle/pull/45817), [#52862](https://github.com/PaddlePaddle/Paddle/pull/52862), [#52919](https://github.com/PaddlePaddle/Paddle/pull/52919), [#52921](https://github.com/PaddlePaddle/Paddle/pull/52921), [#46413](https://github.com/PaddlePaddle/Paddle/pull/46413), [#48205](https://github.com/PaddlePaddle/Paddle/pull/48205), [#54193](https://github.com/PaddlePaddle/Paddle/pull/54193), [#48041](https://github.com/PaddlePaddle/Paddle/pull/48041), [#48121](https://github.com/PaddlePaddle/Paddle/pull/48121), [#46364](https://github.com/PaddlePaddle/Paddle/pull/46364), [#51153](https://github.com/PaddlePaddle/Paddle/pull/51153), [#53023](https://github.com/PaddlePaddle/Paddle/pull/53023), [#53079](https://github.com/PaddlePaddle/Paddle/pull/53079), [#53137](https://github.com/PaddlePaddle/Paddle/pull/53137), [#46212](https://github.com/PaddlePaddle/Paddle/pull/46212), [#50908](https://github.com/PaddlePaddle/Paddle/pull/50908), [#52555](https://github.com/PaddlePaddle/Paddle/pull/52555), [#51582](https://github.com/PaddlePaddle/Paddle/pull/51582), [#47897](https://github.com/PaddlePaddle/Paddle/pull/47897), [#45601](https://github.com/PaddlePaddle/Paddle/pull/45601), [#53522](https://github.com/PaddlePaddle/Paddle/pull/53522), [#52666](https://github.com/PaddlePaddle/Paddle/pull/52666), [#50101](https://github.com/PaddlePaddle/Paddle/pull/50101), [#48315](https://github.com/PaddlePaddle/Paddle/pull/48315), [#50847](https://github.com/PaddlePaddle/Paddle/pull/50847), [#50905](https://github.com/PaddlePaddle/Paddle/pull/50905), [#50906](https://github.com/PaddlePaddle/Paddle/pull/50906), [#50909](https://github.com/PaddlePaddle/Paddle/pull/50909), [#50916](https://github.com/PaddlePaddle/Paddle/pull/50916), [#50917](https://github.com/PaddlePaddle/Paddle/pull/50917), [#50920](https://github.com/PaddlePaddle/Paddle/pull/50920), [#50919](https://github.com/PaddlePaddle/Paddle/pull/50919), 
[#50904](https://github.com/PaddlePaddle/Paddle/pull/50904), [#50918](https://github.com/PaddlePaddle/Paddle/pull/50918), [#50938](https://github.com/PaddlePaddle/Paddle/pull/50938), [#50858](https://github.com/PaddlePaddle/Paddle/pull/50858), [#50933](https://github.com/PaddlePaddle/Paddle/pull/50933), [#50945](https://github.com/PaddlePaddle/Paddle/pull/50945), [#50936](https://github.com/PaddlePaddle/Paddle/pull/50936), [#51168](https://github.com/PaddlePaddle/Paddle/pull/51168), [#51493](https://github.com/PaddlePaddle/Paddle/pull/51493), [#50924](https://github.com/PaddlePaddle/Paddle/pull/50924), [#50923](https://github.com/PaddlePaddle/Paddle/pull/50923), [#50926](https://github.com/PaddlePaddle/Paddle/pull/50926), [#50925](https://github.com/PaddlePaddle/Paddle/pull/50925), [#50930](https://github.com/PaddlePaddle/Paddle/pull/50930), [#53284](https://github.com/PaddlePaddle/Paddle/pull/53284), [#53286](https://github.com/PaddlePaddle/Paddle/pull/53286), [#53285](https://github.com/PaddlePaddle/Paddle/pull/53285), [#50976](https://github.com/PaddlePaddle/Paddle/pull/50976), [#50915](https://github.com/PaddlePaddle/Paddle/pull/50915), [#50915](https://github.com/PaddlePaddle/Paddle/pull/50915), [#48192](https://github.com/PaddlePaddle/Paddle/pull/48192), [#50993](https://github.com/PaddlePaddle/Paddle/pull/50993), [#50998](https://github.com/PaddlePaddle/Paddle/pull/50998), [#51380](https://github.com/PaddlePaddle/Paddle/pull/51380), [#51137](https://github.com/PaddlePaddle/Paddle/pull/51137), [#51106](https://github.com/PaddlePaddle/Paddle/pull/51106), [#51197](https://github.com/PaddlePaddle/Paddle/pull/51197), [#51159](https://github.com/PaddlePaddle/Paddle/pull/51159), [#51552](https://github.com/PaddlePaddle/Paddle/pull/51552), [#51151](https://github.com/PaddlePaddle/Paddle/pull/51151), [#51005](https://github.com/PaddlePaddle/Paddle/pull/51005), [#51565](https://github.com/PaddlePaddle/Paddle/pull/51565), 
[#51036](https://github.com/PaddlePaddle/Paddle/pull/51036), [#51185](https://github.com/PaddlePaddle/Paddle/pull/51185), [#51791](https://github.com/PaddlePaddle/Paddle/pull/51791), [#51083](https://github.com/PaddlePaddle/Paddle/pull/51083), [#51694](https://github.com/PaddlePaddle/Paddle/pull/51694), [#51689](https://github.com/PaddlePaddle/Paddle/pull/51689), [#51009](https://github.com/PaddlePaddle/Paddle/pull/51009), [#51051](https://github.com/PaddlePaddle/Paddle/pull/51051), [#51532](https://github.com/PaddlePaddle/Paddle/pull/51532), [#51978](https://github.com/PaddlePaddle/Paddle/pull/51978), [#51903](https://github.com/PaddlePaddle/Paddle/pull/51903), [#51888](https://github.com/PaddlePaddle/Paddle/pull/51888), [#52016](https://github.com/PaddlePaddle/Paddle/pull/52016), [#52035](https://github.com/PaddlePaddle/Paddle/pull/52035), [#52184](https://github.com/PaddlePaddle/Paddle/pull/52184), [#52018](https://github.com/PaddlePaddle/Paddle/pull/52018), [#51787](https://github.com/PaddlePaddle/Paddle/pull/51787), [#51640](https://github.com/PaddlePaddle/Paddle/pull/51640), [#52172](https://github.com/PaddlePaddle/Paddle/pull/52172), [#52193](https://github.com/PaddlePaddle/Paddle/pull/52193), [#51160](https://github.com/PaddlePaddle/Paddle/pull/51160), [#51809](https://github.com/PaddlePaddle/Paddle/pull/51809), [#51678](https://github.com/PaddlePaddle/Paddle/pull/51678), [#52158](https://github.com/PaddlePaddle/Paddle/pull/52158), [#51015](https://github.com/PaddlePaddle/Paddle/pull/51015), [#52240](https://github.com/PaddlePaddle/Paddle/pull/52240), [#52276](https://github.com/PaddlePaddle/Paddle/pull/52276), [#52233](https://github.com/PaddlePaddle/Paddle/pull/52233), [#52220](https://github.com/PaddlePaddle/Paddle/pull/52220), [#52107](https://github.com/PaddlePaddle/Paddle/pull/52107), [#52282](https://github.com/PaddlePaddle/Paddle/pull/52282), [#52311](https://github.com/PaddlePaddle/Paddle/pull/52311), 
[#52315](https://github.com/PaddlePaddle/Paddle/pull/52315), [#52357](https://github.com/PaddlePaddle/Paddle/pull/52357), [#52256](https://github.com/PaddlePaddle/Paddle/pull/52256), [#51649](https://github.com/PaddlePaddle/Paddle/pull/51649), [#52413](https://github.com/PaddlePaddle/Paddle/pull/52413), [#52369](https://github.com/PaddlePaddle/Paddle/pull/52369), [#51837](https://github.com/PaddlePaddle/Paddle/pull/51837), [#52112](https://github.com/PaddlePaddle/Paddle/pull/52112), [#51819](https://github.com/PaddlePaddle/Paddle/pull/51819), [#52388](https://github.com/PaddlePaddle/Paddle/pull/52388), [#52411](https://github.com/PaddlePaddle/Paddle/pull/52411), [#52521](https://github.com/PaddlePaddle/Paddle/pull/52521), [#51300](https://github.com/PaddlePaddle/Paddle/pull/51300), [#51117](https://github.com/PaddlePaddle/Paddle/pull/51117), [#52380](https://github.com/PaddlePaddle/Paddle/pull/52380), [#52317](https://github.com/PaddlePaddle/Paddle/pull/52317), [#51263](https://github.com/PaddlePaddle/Paddle/pull/51263), [#52668](https://github.com/PaddlePaddle/Paddle/pull/52668), [#52259](https://github.com/PaddlePaddle/Paddle/pull/52259), [#50999](https://github.com/PaddlePaddle/Paddle/pull/50999), [#52407](https://github.com/PaddlePaddle/Paddle/pull/52407), [#52288](https://github.com/PaddlePaddle/Paddle/pull/52288), [#52845](https://github.com/PaddlePaddle/Paddle/pull/52845), [#50953](https://github.com/PaddlePaddle/Paddle/pull/50953), [#52667](https://github.com/PaddlePaddle/Paddle/pull/52667), [#52582](https://github.com/PaddlePaddle/Paddle/pull/52582), [#52426](https://github.com/PaddlePaddle/Paddle/pull/52426), [#51884](https://github.com/PaddlePaddle/Paddle/pull/51884), [#52630](https://github.com/PaddlePaddle/Paddle/pull/52630), [#52136](https://github.com/PaddlePaddle/Paddle/pull/52136), [#52604](https://github.com/PaddlePaddle/Paddle/pull/52604), [#51615](https://github.com/PaddlePaddle/Paddle/pull/51615), 
[#51275](https://github.com/PaddlePaddle/Paddle/pull/51275), [#52898](https://github.com/PaddlePaddle/Paddle/pull/52898), [#52918](https://github.com/PaddlePaddle/Paddle/pull/52918), [#52572](https://github.com/PaddlePaddle/Paddle/pull/52572), [#52683](https://github.com/PaddlePaddle/Paddle/pull/52683), [#52956](https://github.com/PaddlePaddle/Paddle/pull/52956), [#52963](https://github.com/PaddlePaddle/Paddle/pull/52963), [#52954](https://github.com/PaddlePaddle/Paddle/pull/52954), [#52444](https://github.com/PaddlePaddle/Paddle/pull/52444), [#52314](https://github.com/PaddlePaddle/Paddle/pull/52314), [#52887](https://github.com/PaddlePaddle/Paddle/pull/52887), [#52195](https://github.com/PaddlePaddle/Paddle/pull/52195), [#53100](https://github.com/PaddlePaddle/Paddle/pull/53100), [#52961](https://github.com/PaddlePaddle/Paddle/pull/52961), [#52953](https://github.com/PaddlePaddle/Paddle/pull/52953), [#53111](https://github.com/PaddlePaddle/Paddle/pull/53111), [#53549](https://github.com/PaddlePaddle/Paddle/pull/53549), [#53736](https://github.com/PaddlePaddle/Paddle/pull/53736), [#52920](https://github.com/PaddlePaddle/Paddle/pull/52920), [#53195](https://github.com/PaddlePaddle/Paddle/pull/53195), [#53535](https://github.com/PaddlePaddle/Paddle/pull/53535), [#53876](https://github.com/PaddlePaddle/Paddle/pull/53876), [#53785](https://github.com/PaddlePaddle/Paddle/pull/53785), [#53722](https://github.com/PaddlePaddle/Paddle/pull/53722), [#54285](https://github.com/PaddlePaddle/Paddle/pull/54285), [#54232](https://github.com/PaddlePaddle/Paddle/pull/54232), [#53922](https://github.com/PaddlePaddle/Paddle/pull/53922), [#47277](https://github.com/PaddlePaddle/Paddle/pull/47277), [#50811](https://github.com/PaddlePaddle/Paddle/pull/50811), [#54571](https://github.com/PaddlePaddle/Paddle/pull/54571), [#50129](https://github.com/PaddlePaddle/Paddle/pull/50129), [#50340](https://github.com/PaddlePaddle/Paddle/pull/50340), 
[#50848](https://github.com/PaddlePaddle/Paddle/pull/50848), [#50849](https://github.com/PaddlePaddle/Paddle/pull/50849), [#50868](https://github.com/PaddlePaddle/Paddle/pull/50868), [#50878](https://github.com/PaddlePaddle/Paddle/pull/50878), [#50929](https://github.com/PaddlePaddle/Paddle/pull/50929), [#50939](https://github.com/PaddlePaddle/Paddle/pull/50939), [#50973](https://github.com/PaddlePaddle/Paddle/pull/50973), [#50913](https://github.com/PaddlePaddle/Paddle/pull/50913), [#51145](https://github.com/PaddlePaddle/Paddle/pull/51145), [#51090](https://github.com/PaddlePaddle/Paddle/pull/51090), [#51098](https://github.com/PaddlePaddle/Paddle/pull/51098), [#51094](https://github.com/PaddlePaddle/Paddle/pull/51094), [#51216](https://github.com/PaddlePaddle/Paddle/pull/51216), [#51736](https://github.com/PaddlePaddle/Paddle/pull/51736), [#51684](https://github.com/PaddlePaddle/Paddle/pull/51684), [#51925](https://github.com/PaddlePaddle/Paddle/pull/51925), [#54030](https://github.com/PaddlePaddle/Paddle/pull/54030), [#50700](https://github.com/PaddlePaddle/Paddle/pull/50700), [#52264](https://github.com/PaddlePaddle/Paddle/pull/52264), [#51069](https://github.com/PaddlePaddle/Paddle/pull/51069), [#51101](https://github.com/PaddlePaddle/Paddle/pull/51101), [#51286](https://github.com/PaddlePaddle/Paddle/pull/51286), [#53582](https://github.com/PaddlePaddle/Paddle/pull/53582),[#49869](https://github.com/PaddlePaddle/Paddle/pull/49869))) -- 混合精度策略(AMP)优化:在混合精度训练的易用性、精度稳定性及可调试性方面进行了全面的升级和优化,能够更好的支持大模型训练加速。易用性方面统一了动静态图 API,并新增 model.float()、model.float16()、model.bfloat16()等转换接口;精度稳定性方面增强了针对 BF16 类型的策略自动调整,优化了黑名单设置,增强了优化器算子 Adagrad、Adamax、Adadelta、RMSProp 等对 multi_precision 功能的支持,在 O2 模式下,完善了 master grad 机制,并新增类型提升机制,以及新增参数对特定模块使用 float32 计算以保障精度;在可调式性方面,新增 paddle.amp.debugging 模块,提供算子统计、异常值检测、精度对比等功能。( [#50132](https://github.com/PaddlePaddle/Paddle/pull/50132), [#50078](https://github.com/PaddlePaddle/Paddle/pull/50078), 
[#50131](https://github.com/PaddlePaddle/Paddle/pull/50131), [#49705](https://github.com/PaddlePaddle/Paddle/pull/49705), [#52936](https://github.com/PaddlePaddle/Paddle/pull/52936), [#52871](https://github.com/PaddlePaddle/Paddle/pull/52871), [#53289](https://github.com/PaddlePaddle/Paddle/pull/53289), [#53362](https://github.com/PaddlePaddle/Paddle/pull/53362), [#54240](https://github.com/PaddlePaddle/Paddle/pull/54240), [#53768](https://github.com/PaddlePaddle/Paddle/pull/53768), [#48041](https://github.com/PaddlePaddle/Paddle/pull/48041), [#47672](https://github.com/PaddlePaddle/Paddle/pull/47672), [#48843](https://github.com/PaddlePaddle/Paddle/pull/48843), [#49391](https://github.com/PaddlePaddle/Paddle/pull/49391), [#51635](https://github.com/PaddlePaddle/Paddle/pull/51635), [#45541](https://github.com/PaddlePaddle/Paddle/pull/45541), [#53742](https://github.com/PaddlePaddle/Paddle/pull/53742), [#51020](https://github.com/PaddlePaddle/Paddle/pull/51020), [#51063](https://github.com/PaddlePaddle/Paddle/pull/51063), [#52514](https://github.com/PaddlePaddle/Paddle/pull/52514), [#50940](https://github.com/PaddlePaddle/Paddle/pull/50940), [#52936](https://github.com/PaddlePaddle/Paddle/pull/52936), [#53439](https://github.com/PaddlePaddle/Paddle/pull/53439), [#53712](https://github.com/PaddlePaddle/Paddle/pull/53712), [#48238](https://github.com/PaddlePaddle/Paddle/pull/48238), [#52215](https://github.com/PaddlePaddle/Paddle/pull/52215), [#53012](https://github.com/PaddlePaddle/Paddle/pull/53012), [#52918](https://github.com/PaddlePaddle/Paddle/pull/52918), [#54571](https://github.com/PaddlePaddle/Paddle/pull/54571)) -- GroupNorm 算子新增对 NHWC 数据格式的支持 ([#47533](https://github.com/PaddlePaddle/Paddle/pull/47533)) -- index_put 算子新增对 bool 和 int 的混合数据类型支持 ([#54195](https://github.com/PaddlePaddle/Paddle/pull/54195)) -- 新增 sparse.is_nan API 用于判断 sparse tensor 中是否含有 NaN 元素。 ([#51513](https://github.com/PaddlePaddle/Paddle/pull/51513)) - -#### bug fix -- 修复 
trace、roll、dropout_nd、log_softmax 等多个算子计算出错、栈溢出,以及部分单测问题。([#50243](https://github.com/PaddlePaddle/Paddle/pull/50243), [#52012](https://github.com/PaddlePaddle/Paddle/pull/52012), [#53795](https://github.com/PaddlePaddle/Paddle/pull/53795), [#53149](https://github.com/PaddlePaddle/Paddle/pull/53149), [#53654](https://github.com/PaddlePaddle/Paddle/pull/53654), [#51054](https://github.com/PaddlePaddle/Paddle/pull/51054), [#49373](https://github.com/PaddlePaddle/Paddle/pull/49373), [#53038](https://github.com/PaddlePaddle/Paddle/pull/53038)) -- 修复 conv 算子穷举搜索在部分场景不生效的问题。([#47065](https://github.com/PaddlePaddle/Paddle/pull/47065)) -- 修复 collective_reduce_scatter 等算子在 A100 上出现 timeout 的问题。([#54513](https://github.com/PaddlePaddle/Paddle/pull/54513)) -- 修复 FusedLinear 单测中属性错误的问题。 ([#50359](https://github.com/PaddlePaddle/Paddle/pull/50359)) -- 修复在使用 Profiler 时可能出现的 OOM 等问题 ([#46089](https://github.com/PaddlePaddle/Paddle/pull/46089)) - -#### 性能提升 -- 进一步优化框架大量算子的 GPU Kernel 以及 eigen 实现方式,包括 max_pool3d, dropout, adaptive_pooling, depthwise_conv2d、transpose, eigh, broadcast 类计算,reduce 类计算,prelu,logsumexp,以及 sparse 类算子等,在更多配置场景下达到更优性能。([#45820](https://github.com/PaddlePaddle/Paddle/pull/45820), [#45959](https://github.com/PaddlePaddle/Paddle/pull/45959), [#45934](https://github.com/PaddlePaddle/Paddle/pull/45934), [#46332](https://github.com/PaddlePaddle/Paddle/pull/46332), [#46287](https://github.com/PaddlePaddle/Paddle/pull/46287), [#47233](https://github.com/PaddlePaddle/Paddle/pull/47233), [#48855](https://github.com/PaddlePaddle/Paddle/pull/48855), [#48560](https://github.com/PaddlePaddle/Paddle/pull/48560), [#49419](https://github.com/PaddlePaddle/Paddle/pull/49419), [#49748](https://github.com/PaddlePaddle/Paddle/pull/49748), [#50348](https://github.com/PaddlePaddle/Paddle/pull/50348), [#52401](https://github.com/PaddlePaddle/Paddle/pull/52401), [#51131](https://github.com/PaddlePaddle/Paddle/pull/51131), [#51141](https://github.com/PaddlePaddle/Paddle/pull/51141), 
[#51479](https://github.com/PaddlePaddle/Paddle/pull/51479), [#51835](https://github.com/PaddlePaddle/Paddle/pull/51835), [#52509](https://github.com/PaddlePaddle/Paddle/pull/52509), [#52482](https://github.com/PaddlePaddle/Paddle/pull/52482), [#52700](https://github.com/PaddlePaddle/Paddle/pull/52700), [#53112](https://github.com/PaddlePaddle/Paddle/pull/53112), [#53659](https://github.com/PaddlePaddle/Paddle/pull/53659), [#53658](https://github.com/PaddlePaddle/Paddle/pull/53658), [#53154](https://github.com/PaddlePaddle/Paddle/pull/53154), [#54071](https://github.com/PaddlePaddle/Paddle/pull/54071), [#53622](https://github.com/PaddlePaddle/Paddle/pull/53622), [#52952](https://github.com/PaddlePaddle/Paddle/pull/52952), [#46046](https://github.com/PaddlePaddle/Paddle/pull/46046), [#46119](https://github.com/PaddlePaddle/Paddle/pull/46119), [#45946](https://github.com/PaddlePaddle/Paddle/pull/45946), [#47212](https://github.com/PaddlePaddle/Paddle/pull/47212), [#47791](https://github.com/PaddlePaddle/Paddle/pull/47791), [#47454](https://github.com/PaddlePaddle/Paddle/pull/47454), [#45230](https://github.com/PaddlePaddle/Paddle/pull/45230), [#48899](https://github.com/PaddlePaddle/Paddle/pull/48899), [#33051](https://github.com/PaddlePaddle/Paddle/pull/33051), [#49040](https://github.com/PaddlePaddle/Paddle/pull/49040), [#48992](https://github.com/PaddlePaddle/Paddle/pull/48992), [#49086](https://github.com/PaddlePaddle/Paddle/pull/49086), [#50808](https://github.com/PaddlePaddle/Paddle/pull/50808), [#46431](https://github.com/PaddlePaddle/Paddle/pull/46431), [#50931](https://github.com/PaddlePaddle/Paddle/pull/50931), [#48056](https://github.com/PaddlePaddle/Paddle/pull/48056), [#46071](https://github.com/PaddlePaddle/Paddle/pull/46071), [#49231](https://github.com/PaddlePaddle/Paddle/pull/49231), [#38660](https://github.com/PaddlePaddle/Paddle/pull/38660), [#50287](https://github.com/PaddlePaddle/Paddle/pull/50287), 
[#46111](https://github.com/PaddlePaddle/Paddle/pull/46111), [#46997](https://github.com/PaddlePaddle/Paddle/pull/46997), [#45854](https://github.com/PaddlePaddle/Paddle/pull/45854), [#47738](https://github.com/PaddlePaddle/Paddle/pull/47738), [#48635](https://github.com/PaddlePaddle/Paddle/pull/48635), [#50353](https://github.com/PaddlePaddle/Paddle/pull/50353), [#50362](https://github.com/PaddlePaddle/Paddle/pull/50362), [#51934](https://github.com/PaddlePaddle/Paddle/pull/51934), [#54045](https://github.com/PaddlePaddle/Paddle/pull/54045), [#46679](https://github.com/PaddlePaddle/Paddle/pull/46679), [#52093](https://github.com/PaddlePaddle/Paddle/pull/52093), [#52969](https://github.com/PaddlePaddle/Paddle/pull/52969)) -- 提供更多融合算子实现,以及相关融合 Pass,如 fused_feed_forward,gather-gemm-scatter,matmul + bias,layernorm_shift_partition + element_add,elementwise 类融合等模式,进一步提升使用该模式的模型性能。( [#50423](https://github.com/PaddlePaddle/Paddle/pull/50423), [#50091](https://github.com/PaddlePaddle/Paddle/pull/50091), [#50364](https://github.com/PaddlePaddle/Paddle/pull/50364), [#53017](https://github.com/PaddlePaddle/Paddle/pull/53017), [#50755](https://github.com/PaddlePaddle/Paddle/pull/50755), [#50050](https://github.com/PaddlePaddle/Paddle/pull/50050), [#47099](https://github.com/PaddlePaddle/Paddle/pull/47099), [#48848](https://github.com/PaddlePaddle/Paddle/pull/48848), [#49383](https://github.com/PaddlePaddle/Paddle/pull/49383), [#50809](https://github.com/PaddlePaddle/Paddle/pull/50809), [#52361](https://github.com/PaddlePaddle/Paddle/pull/52361), [#52028](https://github.com/PaddlePaddle/Paddle/pull/52028), [#48439](https://github.com/PaddlePaddle/Paddle/pull/48439), [#49009](https://github.com/PaddlePaddle/Paddle/pull/49009), [#51427](https://github.com/PaddlePaddle/Paddle/pull/51427), [#52731](https://github.com/PaddlePaddle/Paddle/pull/52731), [#51805](https://github.com/PaddlePaddle/Paddle/pull/51805)) - -#### 文档 -- 修复 index_put 文档中的错误 
([#53727](https://github.com/PaddlePaddle/Paddle/pull/53727)) - -### Intermediate Representation -为了飞桨 IR 体系存在的稳定性、降低研发成本问题,孵化了飞桨新的 IR 体系,完成了基础的数据结构定义、算子定义生成和执行体系适配。为了更好的支持科学计算场景的高阶需求,完成了 silu、cast 等算子的高阶适配。 -- 完成了 IR 数据数据结构定义,包含类型系统,算子定义;打通了和 phi kernel 的执行适配。[#51112](https://github.com/PaddlePaddle/Paddle/pull/51112), [#51992](https://github.com/PaddlePaddle/Paddle/pull/51992), [#50412](https://github.com/PaddlePaddle/Paddle/pull/50412), [#53557](https://github.com/PaddlePaddle/Paddle/pull/53557), [#53953](https://github.com/PaddlePaddle/Paddle/pull/53953), [#50959](https://github.com/PaddlePaddle/Paddle/pull/50959), [#54250](https://github.com/PaddlePaddle/Paddle/pull/54250), [#54197](https://github.com/PaddlePaddle/Paddle/pull/54197), [#54289](https://github.com/PaddlePaddle/Paddle/pull/54289), [#51636](https://github.com/PaddlePaddle/Paddle/pull/51636), [#52846](https://github.com/PaddlePaddle/Paddle/pull/52846), [#53988](https://github.com/PaddlePaddle/Paddle/pull/53988), [#54143](https://github.com/PaddlePaddle/Paddle/pull/54143), [#54035](https://github.com/PaddlePaddle/Paddle/pull/54035), [#54052](https://github.com/PaddlePaddle/Paddle/pull/54052), [#54340](https://github.com/PaddlePaddle/Paddle/pull/54340), [#54356](https://github.com/PaddlePaddle/Paddle/pull/54356), [#54068](https://github.com/PaddlePaddle/Paddle/pull/54068), [#53894](https://github.com/PaddlePaddle/Paddle/pull/53894), [#53707](https://github.com/PaddlePaddle/Paddle/pull/53707), [#54185](https://github.com/PaddlePaddle/Paddle/pull/54185), [#54031](https://github.com/PaddlePaddle/Paddle/pull/54031), [#54220](https://github.com/PaddlePaddle/Paddle/pull/54220), [#54275](https://github.com/PaddlePaddle/Paddle/pull/54275), [#54281](https://github.com/PaddlePaddle/Paddle/pull/54281), [#54186](https://github.com/PaddlePaddle/Paddle/pull/54186), [#54259](https://github.com/PaddlePaddle/Paddle/pull/54259), [#54124](https://github.com/PaddlePaddle/Paddle/pull/54124), 
[#54292](https://github.com/PaddlePaddle/Paddle/pull/54292), [#48068](https://github.com/PaddlePaddle/Paddle/pull/48068), [#53978](https://github.com/PaddlePaddle/Paddle/pull/53978)
-- 完善 pass 基础设置,包含基础的 pass 定义,pass 注册管理等。 [#54023](https://github.com/PaddlePaddle/Paddle/pull/54023), [#54170](https://github.com/PaddlePaddle/Paddle/pull/54170), [#54308](https://github.com/PaddlePaddle/Paddle/pull/54308), [#54348](https://github.com/PaddlePaddle/Paddle/pull/54348), [#54385](https://github.com/PaddlePaddle/Paddle/pull/54385)
-- 完善高阶算子的适配,主要包含基础模块改造和 silu、cast 算子适配等。 [#52005](https://github.com/PaddlePaddle/Paddle/pull/52005), [#53425](https://github.com/PaddlePaddle/Paddle/pull/53425), [#53417](https://github.com/PaddlePaddle/Paddle/pull/53417), [#53498](https://github.com/PaddlePaddle/Paddle/pull/53498), [#53171](https://github.com/PaddlePaddle/Paddle/pull/53171), [#53632](https://github.com/PaddlePaddle/Paddle/pull/53632), [#53605](https://github.com/PaddlePaddle/Paddle/pull/53605), [#53746](https://github.com/PaddlePaddle/Paddle/pull/53746), [#53874](https://github.com/PaddlePaddle/Paddle/pull/53874), [#54164](https://github.com/PaddlePaddle/Paddle/pull/54164), [#45888](https://github.com/PaddlePaddle/Paddle/pull/45888), [#46024](https://github.com/PaddlePaddle/Paddle/pull/46024), [#46446](https://github.com/PaddlePaddle/Paddle/pull/46446), [#46960](https://github.com/PaddlePaddle/Paddle/pull/46960)
-
-### CINN 编译器
-#### 新功能
-- 新增 CINN 对 0D-Tensor 的支持,目前为配合主框架升级,暂时采用增加 pass 的临时方案进行支持,后续会对该方案进行替换升级。 ([#53382](https://github.com/PaddlePaddle/Paddle/pull/53382), [#53955](https://github.com/PaddlePaddle/Paddle/pull/53955), [#54064](https://github.com/PaddlePaddle/Paddle/pull/54064), [#54118](https://github.com/PaddlePaddle/Paddle/pull/54118), [#54216](https://github.com/PaddlePaddle/Paddle/pull/54216),
[#53454](https://github.com/PaddlePaddle/Paddle/pull/53454))
-- 新增 CINN 对 int8/uint8/int16/uint16/bf16 等数据类型的支持 ([#50566](https://github.com/PaddlePaddle/Paddle/pull/50566), [#53637](https://github.com/PaddlePaddle/Paddle/pull/53637))
-- 新增 CINN expand 算子的支持 ([#46776](https://github.com/PaddlePaddle/Paddle/pull/46776))
-- 新增 CINN 对 PaddleInference 的支持。 ([#45009](https://github.com/PaddlePaddle/Paddle/pull/45009))
-
-#### 功能优化
-- CINN 编译器,传递 skip_gc_vars 属性到 CINN 子图;CINN 为 skip_gc_vars 添加 fetch 算子 [#49471](https://github.com/PaddlePaddle/Paddle/pull/49471), [#49553](https://github.com/PaddlePaddle/Paddle/pull/49553)
-- CINN 编译器,conv2d 和 conv2d_grad 默认不使用 cinn 算子 [#51645](https://github.com/PaddlePaddle/Paddle/pull/51645)
-- 将 build_cinn_pass 添加到 BuildStrategy,以便于在动转静中使用 ([#49496](https://github.com/PaddlePaddle/Paddle/pull/49496))
-- 增加 reshape 算子在组合算子机制下的单测 ([#51276](https://github.com/PaddlePaddle/Paddle/pull/51276))
-- 主框架联编 CINN 的版本从固定 commit 改为 develop ([#49775](https://github.com/PaddlePaddle/Paddle/pull/49775))
-- 为 CINN 设置默认 Target 参数 ([#50182](https://github.com/PaddlePaddle/Paddle/pull/50182))
-
-#### bug fix
-- 修复 CINN 符号化过程中拓扑排序后出现的算子顺序不一致的问题。 ([#52556](https://github.com/PaddlePaddle/Paddle/pull/52556))
-- 修复一些算子计算错误、精度下降,以及单测相关问题 ([#53859](https://github.com/PaddlePaddle/Paddle/pull/53859), [#54261](https://github.com/PaddlePaddle/Paddle/pull/54261), [#46801](https://github.com/PaddlePaddle/Paddle/pull/46801), [#53676](https://github.com/PaddlePaddle/Paddle/pull/53676), [#53772](https://github.com/PaddlePaddle/Paddle/pull/53772))
-- 修复 CINN 对 float16 类型支持的问题。([#48249](https://github.com/PaddlePaddle/Paddle/pull/48249))
-- 修复 build_cinn_pass 中的问题。 ([#46843](https://github.com/PaddlePaddle/Paddle/pull/46843))
-- 修复了组合算子+动转静在开启 CINN 时,出现反向因误被 GC 而导致的无数据区的问题 ([#50116](https://github.com/PaddlePaddle/Paddle/pull/50116))
-- 修复编译器 dropout amp 出错,组合算子跑 resnet 出错,inplace 变量未找到等问题 [#51688](https://github.com/PaddlePaddle/Paddle/pull/51688),
[#52813](https://github.com/PaddlePaddle/Paddle/pull/52813), [#51769](https://github.com/PaddlePaddle/Paddle/pull/51769) - -#### 性能提升 -- 优化 reshape 相关融合策略 ([#53066](https://github.com/PaddlePaddle/Paddle/pull/53066)) -- 优化 BuildCINNPass 的性能 ([#49696](https://github.com/PaddlePaddle/Paddle/pull/49696)) -- 优化子图检测模块的性能 ([#45040](https://github.com/PaddlePaddle/Paddle/pull/45040), [#46937](https://github.com/PaddlePaddle/Paddle/pull/46937)) - -### 硬件接入 -#### CustomDevice -- 训练侧新增分布式策略 MP/Sharding/PP/MoE 以及 recompute 重计算功能的支持,推理侧新增分布式策略 MP 的支持,支持通过 CustomDevice 接入的硬件昇腾 NPU 和寒武纪 MLU 无需修改任何代码即可自动继承 CustomDevice 新增的所有分布式策略。 [#52872](https://github.com/PaddlePaddle/Paddle/pull/52872), [#54384](https://github.com/PaddlePaddle/Paddle/pull/54384), [#53220](https://github.com/PaddlePaddle/Paddle/pull/53220), [#54572](https://github.com/PaddlePaddle/Paddle/pull/54572), [#54573](https://github.com/PaddlePaddle/Paddle/pull/54573), [#54676](https://github.com/PaddlePaddle/Paddle/pull/54676), [#53044](https://github.com/PaddlePaddle/Paddle/pull/53044), [#53719](https://github.com/PaddlePaddle/Paddle/pull/53719), [#53701](https://github.com/PaddlePaddle/Paddle/pull/53701), [#53702](https://github.com/PaddlePaddle/Paddle/pull/53702), [#53703](https://github.com/PaddlePaddle/Paddle/pull/53703) -- 新增 API paddle.device.is_compiled_with_custom_device,方便用户判断当前环境是否支持某硬件的插件式设备后端 [#49271](https://github.com/PaddlePaddle/Paddle/pull/49721) -- 增加环境变量 CUSTOM_DEVICE_BLACK_LIST 设置,支持黑名单内的算子自动异构到 CPU 上运行 [#50409](https://github.com/PaddlePaddle/Paddle/pull/50409), [#50666](https://github.com/PaddlePaddle/Paddle/pull/50666) -- 优化 CustomDevice 性能,减少对 runtime 中 get_device_count 接口的调用次数 [#46963](https://github.com/PaddlePaddle/Paddle/pull/46963) - -#### 昆仑芯 XPU -- 训练侧使用了新版动态图并新增分布式策略 MP/Sharding/PP 以及 recompute 重计算功能,通信库通信的支持;推理侧新增分布式策略 MP 的支持,并增加对 XPU FasterTransformer 算子加速库的支持;[#49531](https://github.com/PaddlePaddle/Paddle/pull/49531), [#49815](https://github.com/PaddlePaddle/Paddle/pull/49815), 
[#48897](https://github.com/PaddlePaddle/Paddle/pull/48897), [#50717](https://github.com/PaddlePaddle/Paddle/pull/50717), [#51082](https://github.com/PaddlePaddle/Paddle/pull/51082), [#49757](https://github.com/PaddlePaddle/Paddle/pull/49757), [#51399](https://github.com/PaddlePaddle/Paddle/pull/51399), [#50329](https://github.com/PaddlePaddle/Paddle/pull/50329), [#48369](https://github.com/PaddlePaddle/Paddle/pull/48369), [#47838](https://github.com/PaddlePaddle/Paddle/pull/47838),[#48076](https://github.com/PaddlePaddle/Paddle/pull/48076),[#47882](https://github.com/PaddlePaddle/Paddle/pull/47882),[#48961](https://github.com/PaddlePaddle/Paddle/pull/48961),[#49043](https://github.com/PaddlePaddle/Paddle/pull/49043),[#49749](https://github.com/PaddlePaddle/Paddle/pull/49749),[#49806](https://github.com/PaddlePaddle/Paddle/pull/49806),[#53427](https://github.com/PaddlePaddle/Paddle/pull/53427),[#48470](https://github.com/PaddlePaddle/Paddle/pull/48470),[#49207](https://github.com/PaddlePaddle/Paddle/pull/49207),[#52296](https://github.com/PaddlePaddle/Paddle/pull/52296),[#51785](https://github.com/PaddlePaddle/Paddle/pull/51785),[#47168](https://github.com/PaddlePaddle/Paddle/pull/47168),[#47445](https://github.com/PaddlePaddle/Paddle/pull/47445),[#50200](https://github.com/PaddlePaddle/Paddle/pull/50200),[#49934](https://github.com/PaddlePaddle/Paddle/pull/49934),[#50792](https://github.com/PaddlePaddle/Paddle/pull/50792),[#52228](https://github.com/PaddlePaddle/Paddle/pull/52228),[#53337](https://github.com/PaddlePaddle/Paddle/pull/53337),[#53389](https://github.com/PaddlePaddle/Paddle/pull/53389),[#53496](https://github.com/PaddlePaddle/Paddle/pull/53496),[#53609](https://github.com/PaddlePaddle/Paddle/pull/53609),[#53697](https://github.com/PaddlePaddle/Paddle/pull/53697),[#53496](https://github.com/PaddlePaddle/Paddle/pull/53496),[#53720](https://github.com/PaddlePaddle/Paddle/pull/53720),[#53734](https://github.com/PaddlePaddle/Paddle/pull/53734),[#54172](http
s://github.com/PaddlePaddle/Paddle/pull/54172),[PR46227](https://github.com/PaddlePaddle/Paddle/pull/46227) - -## 4. 部署方向(Paddle Inference) -### 新功能 -- 支持 Paddle TensorRT 多个子图 TensorRT engine 或者不同 Predictor 的之间的 TensorRT engine 共享显存,以便节约显存。[#45842](https://github.com/PaddlePaddle/Paddle/pull/45842) [#47631](https://github.com/PaddlePaddle/Paddle/pull/47631) -- C++ API 增加获取输入 Tensor 的 Shape 和数据类型接口,增加获取输出 Tensor 的 Shape 和数据类型接口。C API 增加 SetExecStream、EnableMkldnnInt8 等 C++已有接口,用于服务化部署。 [#49758](https://github.com/PaddlePaddle/Paddle/pull/49758) -- 新增 paddle.inference.Predictor.register_output_hook()接口,可支持调试时打印 GPU 推理下每层的输出,同时也支持在 While 等控制流模型中使用。注意此接口不支持 Paddle-TensorRT。[#54433](https://github.com/PaddlePaddle/Paddle/pull/54433) ,[#47050](https://github.com/PaddlePaddle/Paddle/pull/47050) , [#54254](https://github.com/PaddlePaddle/Paddle/pull/54254) 。 -- Paddle Inference 推理的 Predictor 接口支持 paddle::Tensor 作为输入和输出,以便用户直接复用飞桨动态图做推理前、后处理。 ([#50445](https://github.com/PaddlePaddle/Paddle/pull/50445)) -- 增强 Paddle TensorRT 动态 shape 运行能力,config.enable_tuned_tensorrt_dynamic_shape()接口,不传任何参数时,在运行时构建 TensorRT Engine。不再需要先收集 shape 信息再运行,但为了避免运行时的重新构建,需要在前几次运行时,覆盖最小及最大 Shape 的情况, [#52162](https://github.com/PaddlePaddle/Paddle/pull/52162) 。 -- Paddle-TensorRT 支持 NHWC 格式的模型输入,[#49633](https://github.com/PaddlePaddle/Paddle/pull/49633) 。 -- 扩展 config.Exp_DisableTensorRtOPs 接口通过指定 Tensor 变量的名字来禁止进入 TensorRT,[#49497](https://github.com/PaddlePaddle/Paddle/pull/49497) 。 +- 新增 XHPC buffer manager 以提升 Paddle 和 XHPC 内存协同性能。 [#63924](https://github.com/PaddlePaddle/Paddle/pull/63924) +- 提升 TensorSetConstantXPU 性能,并支持 BF16 数据类型。[#63920](https://github.com/PaddlePaddle/Paddle/pull/63920),[#61818](https://github.com/PaddlePaddle/Paddle/pull/61818) +- 融合多个 group norm + silu + conv 模块, 压缩显存。[#62892](https://github.com/PaddlePaddle/Paddle/pull/62892) +- 优化 comm manager 中 XPU 显存分配。[#64139](https://github.com/PaddlePaddle/Paddle/pull/64139) +- 优化算子性能,包括 mean_all_grad 
([#61148](https://github.com/PaddlePaddle/Paddle/pull/61148))、dropout_v2 ([#61029](https://github.com/PaddlePaddle/Paddle/pull/61029))、fused_rotary_position_embedding ([#62846](https://github.com/PaddlePaddle/Paddle/pull/62846))、cross_entropy ([#63159](https://github.com/PaddlePaddle/Paddle/pull/63159))、elementwise_add ([#64289](https://github.com/PaddlePaddle/Paddle/pull/64289))、fused_gemm_epilogue ([#61350](https://github.com/PaddlePaddle/Paddle/pull/61350))、check_nan_or_inf ([#60853](https://github.com/PaddlePaddle/Paddle/pull/60853)) +- XPU 硬件下新增 qk_qkv_attention_xpu_fuse_pass 和 qkv_attention_xpu_kernel。 [#60089](https://github.com/PaddlePaddle/Paddle/pull/60089) +- XPU 硬件下新增 rotary position 编码的融合算子支持 elementwise_mul + strided_slice + sin/cos + stack 融合为 1 个算子。 [#60025](https://github.com/PaddlePaddle/Paddle/pull/60025) +- 添加 group_norm_silu_xpu_fuse_pass。 [#62689](https://github.com/PaddlePaddle/Paddle/pull/62689) +- 添加 weight_only_linear_xpu_pass。 [#64185](https://github.com/PaddlePaddle/Paddle/pull/64185) +- 新增 block_multihead_attention 算子及 PASS,支持 LLaMA2 模型在 XPU 设备中的大模型推理。 [#65036](https://github.com/PaddlePaddle/Paddle/pull/65036) +- 支持 squeeze_excitation_block_xpu_kernel 的 float16 类型。 [#61023](https://github.com/PaddlePaddle/Paddle/pull/61023) -### 功能优化 -- GPU 混合精度推理(非 Paddle TensorRT 场景)功能增强,Config.enable_use_gpu 增强可设置精度类型。 [#47993](https://github.com/PaddlePaddle/Paddle/pull/47993) -- 支持 double 类型输入进行推理, [#51786](https://github.com/PaddlePaddle/Paddle/pull/51786) 。 -- 由于 TensorRT 算子不支持 INT64 类型导致模型中存在 INT64 数据类型时运行失败问题,Paddle-TensorRT 做了增强,当模型中包含 INT64 数据类型时,进行自动转换,降低到 INT32 类型运行。 [#45547](https://github.com/PaddlePaddle/Paddle/pull/45547) -- Paddle-TensorRT 支持更多算子进入 TensorRT 推理,包含: - - expand_v2,gather_nd,rsqrt,sign,not,onehot,arg_min,temporal_shift,expand_as_v2,setvalue,index_select,round,acosh,square,reduce_max,not_equal,reduce_min,reduce_prod,grid_sampler,elementwise_mod,pad3d 
,greater_equal,bitwise,cumsum,matmul_v2,reciprocal,where,bmm,take_along_axis,less_than,greater_than, logical_or, logical_xor, logical_and, less_equal,range,reduce_all,reduce_any ,fill_any_like ,pow - - [#47002](https://github.com/PaddlePaddle/Paddle/pull/47002) , [#47589](https://github.com/PaddlePaddle/Paddle/pull/47589) ,[#48223](https://github.com/PaddlePaddle/Paddle/pull/48223) ,[#48557](https://github.com/PaddlePaddle/Paddle/pull/48557) , [#48655](https://github.com/PaddlePaddle/Paddle/pull/48655) , [#49113](https://github.com/PaddlePaddle/Paddle/pull/49113) , [#51207](https://github.com/PaddlePaddle/Paddle/pull/51207) ,[#51028](https://github.com/PaddlePaddle/Paddle/pull/51028) ,[#50341](https://github.com/PaddlePaddle/Paddle/pull/50341) ,[#51498](https://github.com/PaddlePaddle/Paddle/pull/51498) ,[#48534](https://github.com/PaddlePaddle/Paddle/pull/48534) ,[#48684](https://github.com/PaddlePaddle/Paddle/pull/48684) , [#49393](https://github.com/PaddlePaddle/Paddle/pull/49393) , [#49615](https://github.com/PaddlePaddle/Paddle/pull/49615) ,[#50934](https://github.com/PaddlePaddle/Paddle/pull/50934) ,[#50974](https://github.com/PaddlePaddle/Paddle/pull/50974),[#50986](https://github.com/PaddlePaddle/Paddle/pull/50986) , [#52000](https://github.com/PaddlePaddle/Paddle/pull/52000) ,[#51971](https://github.com/PaddlePaddle/Paddle/pull/51971) , [#52518](https://github.com/PaddlePaddle/Paddle/pull/52518) ,[#44918](https://github.com/PaddlePaddle/Paddle/pull/44918) ,[#48230](https://github.com/PaddlePaddle/Paddle/pull/48230) ,[#47820](https://github.com/PaddlePaddle/Paddle/pull/47820) , [#46877](https://github.com/PaddlePaddle/Paddle/pull/46877) , [#48358](https://github.com/PaddlePaddle/Paddle/pull/48358) , [#48592](https://github.com/PaddlePaddle/Paddle/pull/48592) ,[#48697](https://github.com/PaddlePaddle/Paddle/pull/48697) , [#53088](https://github.com/PaddlePaddle/Paddle/pull/53088) , [#47974](https://github.com/PaddlePaddle/Paddle/pull/47974) , 
[#53462](https://github.com/PaddlePaddle/Paddle/pull/53462) -- 增强 Paddle-TensorRT 映射算子 strided_slice,instance_norm,prelu,argmax,cast,nearest_interp_v2,elementwise,bilinear 实现,[#46819](https://github.com/PaddlePaddle/Paddle/pull/46819) ,[#47998](https://github.com/PaddlePaddle/Paddle/pull/47998) ,[#48043](https://github.com/PaddlePaddle/Paddle/pull/48043) ,[#48998](https://github.com/PaddlePaddle/Paddle/pull/48998) , [#49675](https://github.com/PaddlePaddle/Paddle/pull/49675) , [#47495](https://github.com/PaddlePaddle/Paddle/pull/47495) -- Paddle-TensorRT 部分算子(scale, square, sum, swish, expand_as_v2, prelu, gelu, hard_swish, hard_sigmoid, leaky_relu,softmax, stack, clip, cast, flatten_contiguous_range,unary,equal, elementwise_op) 支持 0 维 Tensor,[#53660](https://github.com/PaddlePaddle/Paddle/pull/53660) ,[#53627](https://github.com/PaddlePaddle/Paddle/pull/53627) , [#53634](https://github.com/PaddlePaddle/Paddle/pull/53634) , [#53714](https://github.com/PaddlePaddle/Paddle/pull/53714) , [#53729](https://github.com/PaddlePaddle/Paddle/pull/53729) ,[#53769](https://github.com/PaddlePaddle/Paddle/pull/53769) ,[#53506](https://github.com/PaddlePaddle/Paddle/pull/53506) ,[#53704](https://github.com/PaddlePaddle/Paddle/pull/53704) -- 支持 GCC12 + CUDA 12.0 以下版本编译, [#50106](https://github.com/PaddlePaddle/Paddle/pull/50106) -- Paddle-TensorRT 的 DeformableConv 插件支持动态 Shape 输入,[#50698](https://github.com/PaddlePaddle/Paddle/pull/50698) -- Paddle-TensorRT 增加 lookup_table 算子的插件支持, [#46613](https://github.com/PaddlePaddle/Paddle/pull/46613) -- 新增 config.enable_low_precision_io()接口支持 Paddle-TensorRT 场景下低精度类型输入, [#52485](https://github.com/PaddlePaddle/Paddle/pull/52485) -- Paddle-TensorRT 的 LayerNorm 插件支持 FP16 计算, [#45043](https://github.com/PaddlePaddle/Paddle/pull/45043) -- Predictor 的输入数据 paddle_infer::Tensor 支持 bool 类型,[#49388](https://github.com/PaddlePaddle/Paddle/pull/49388) -- Paddle-TensorRT 增强 Convolution 实现采用 
ConvolutionNd,[#47653](https://github.com/PaddlePaddle/Paddle/pull/47653) -- conv2d_fusion 融合算子支持 NHWC 格式,[#49047](https://github.com/PaddlePaddle/Paddle/pull/49047) -- 调整 C++推理库下 Phi 算子相关目录结构,[#53091](https://github.com/PaddlePaddle/Paddle/pull/53091) -- 当 TensorRT 序列化和加载版本不匹配时,支持重新构建 TensorRT Engine,而不是报错,[#50775](https://github.com/PaddlePaddle/Paddle/pull/50775) 。 -- 优化 Paddle-TensorRT 运行时打印日志信息,[#50181](https://github.com/PaddlePaddle/Paddle/pull/50181) -- 基于 oneDNN 的 CPU 推理支持 elementwise 的 0 维 Tensor 输入,[#51656](https://github.com/PaddlePaddle/Paddle/pull/51656) -- 清理和规范化 Paddle-TensorRT 的 FC、matmul、matmul_v2 算子的支持,统一升级到使用 TensorRT 的 IMatrixMultiplyLayer 进行支持,[#52222](https://github.com/PaddlePaddle/Paddle/pull/52222) +#### Bug 修复 +- 修复 tile 算子对 0 维 Tensor 的支持。 [#64279](https://github.com/PaddlePaddle/Paddle/pull/64279) +- 修复 group_norm_silu_fuse_pass。 [#63449](https://github.com/PaddlePaddle/Paddle/pull/63449) +- 修复 XPU API GM 显存问题。[#60260](https://github.com/PaddlePaddle/Paddle/pull/60260),[#60387](https://github.com/PaddlePaddle/Paddle/pull/60387),[#62940](https://github.com/PaddlePaddle/Paddle/pull/62940) +- 修复分布式策略 Sharding stage1 v2 的错误。[#64209](https://github.com/PaddlePaddle/Paddle/pull/64209) +- 修复 XPU constant 问题。[#60763](https://github.com/PaddlePaddle/Paddle/pull/60763) +- 修复部分算子问题,包括 AdamW ([#62251](https://github.com/PaddlePaddle/Paddle/pull/62251))、dropout_v3 ([#62726](https://github.com/PaddlePaddle/Paddle/pull/62726))、softmax ([#63780](https://github.com/PaddlePaddle/Paddle/pull/63780))、fused rope embedding ([#62143](https://github.com/PaddlePaddle/Paddle/pull/62143))、elementwise_add ([#60252](https://github.com/PaddlePaddle/Paddle/pull/60252))、resnet_basic_block ([#62914](https://github.com/PaddlePaddle/Paddle/pull/62914)) +- 修复 XPU 运行和安装相关问题。[#60028](https://github.com/PaddlePaddle/Paddle/pull/60028),[#61970](https://github.com/PaddlePaddle/Paddle/pull/61970) +- 修复 XPU 编译 bug。[#63307](https://github.com/PaddlePaddle/Paddle/pull/63307) +- 修复 
XPU 通信库初始化时端侧内存相关的 bug。[#64396](https://github.com/PaddlePaddle/Paddle/pull/64396) -### 性能提升 -- 支持多个 lookup_tables 进入 Paddle-TensorRT 的 Embedding+Eltwise+LayerNorm 的融合 [#46243](https://github.com/PaddlePaddle/Paddle/pull/46243) ,[#46230](https://github.com/PaddlePaddle/Paddle/pull/46230) -- 增加 MoE 融合 Phi 算子,提升 MoE 模型推理性能, [#48703](https://github.com/PaddlePaddle/Paddle/pull/48703) -- 在 INT8 量化推理的场景下,Paddle-TensorRT 插件 fallback 到 FP16 计算而不是 FP32 计算,[#50554](https://github.com/PaddlePaddle/Paddle/pull/50554) -- 优化推理时内存、显存, [#49051](https://github.com/PaddlePaddle/Paddle/pull/49051) , [#49046](https://github.com/PaddlePaddle/Paddle/pull/49046) ,[#53930](https://github.com/PaddlePaddle/Paddle/pull/53930) -- Layout 排布优化 Pass 增强, [#52997](https://github.com/PaddlePaddle/Paddle/pull/52997) -- 支持对算子 Shape 推断进行缓存,提升模型推理性能, [#48312](https://github.com/PaddlePaddle/Paddle/pull/48312) -- 使用 half2 指令优化 bias+add+relu 融合,[#49048](https://github.com/PaddlePaddle/Paddle/pull/49048) -- 使用向量化操作优化多个输入的 Concat Kernel,[#49540](https://github.com/PaddlePaddle/Paddle/pull/49540) -- 基于 CUTLASS 实现 Convolution、Depthwise Convolution 及相关融合算子,提升推理速度。 [#47989](https://github.com/PaddlePaddle/Paddle/pull/47989) ,[#50603](https://github.com/PaddlePaddle/Paddle/pull/50603) ,[#51792](https://github.com/PaddlePaddle/Paddle/pull/51792) -- Paddle-TensorRT 支持 FlashAttention 的插件,提升 StableDiffusion 等模型的推理速度,[#49438](https://github.com/PaddlePaddle/Paddle/pull/49438) 。 -- 增加 Transpose+LayerNorm 的融合 PASS,提升 StableDiffusion 等模型的推理速度,[#50082](https://github.com/PaddlePaddle/Paddle/pull/50082) 。 -- 增加 Elementwise+Transpose 的融合,[#50081](https://github.com/PaddlePaddle/Paddle/pull/50081) -- 优化 Paddle-TensorRT Group Norm 插件实现 ,[#49160](https://github.com/PaddlePaddle/Paddle/pull/49160) -- Config.EnableTensorRtEngine()接口增加 use_cuda_graph 参数,可以支持开启 CUDA Graph,注意在使用时,需要保证模型输入 shape 不变,可以降低运行时耗时,[#53406](https://github.com/PaddlePaddle/Paddle/pull/53406) 
-- 支持对 Reshape 的 inplace 操作减少模型运行时的拷贝耗时, [#49146](https://github.com/PaddlePaddle/Paddle/pull/49146) -- 基于 oneDNN 优化 LayerNorm kernel 实现,[#47782](https://github.com/PaddlePaddle/Paddle/pull/47782) -- 基于 oneDNN 支持 quantize+transpose 以及 transpose+dequantize 融合,[#49509](https://github.com/PaddlePaddle/Paddle/pull/49509) -- CPU 推理下当开启 MKLDNN 时,默认开启 FC 相关的融合 Pass,提升性能,[#45704](https://github.com/PaddlePaddle/Paddle/pull/45704) -- CPU 的 oneDNN 推理支持 squeeze2 + transpose2 融合,[#47592](https://github.com/PaddlePaddle/Paddle/pull/47592) - -### XPU 推理提升和性能优化 -- 新增 ExpRunWithRuntimeConfig 接口与 XpuRuntimeConfig 允许推理期间设置外部流、L3 cache 等参数;GetExecStream 接口支持获得昆仑芯外部流对象;输入、输出支持昆仑芯设备内存减少 D2H 和 H2D 开销,[#53334](https://github.com/PaddlePaddle/Paddle/pull/53334)、 [#52466](https://github.com/PaddlePaddle/Paddle/pull/52466)、 [#53240](https://github.com/PaddlePaddle/Paddle/pull/53240) -- 新增 multi-encoder, fused_multi_transformer 算子和融合 pass,提升 ERNIE 和 Transformer 类模型性能,[#50570](https://github.com/PaddlePaddle/Paddle/pull/50570)、[#51346](https://github.com/PaddlePaddle/Paddle/pull/51346)、 [#50499](https://github.com/PaddlePaddle/Paddle/pull/50499)、[#53982](https://github.com/PaddlePaddle/Paddle/pull/53982)、[#50759](https://github.com/PaddlePaddle/Paddle/pull/50759)、[#51571](https://github.com/PaddlePaddle/Paddle/pull/51571)、 [#53144](https://github.com/PaddlePaddle/Paddle/pull/53144)、[#53306](https://github.com/PaddlePaddle/Paddle/pull/53306) -- 优化 BeamSearch 性能,当 beam_size=1 时对 write_read_array, gather 等细粒度算子进行变换、去除和融合提升模型性能,[#53130](https://github.com/PaddlePaddle/Paddle/pull/53130) -- 多个相同输入的 stack 算子变换为支持 broadcast 的 unsqueeze 算子,unsqueeze/squeeze 支持 inplace 计算, [#52099](https://github.com/PaddlePaddle/Paddle/pull/52099) -- 新增支持导出适用于昆仑芯的多卡推理模型, [#50490](https://github.com/PaddlePaddle/Paddle/pull/50490) -- 新增 embedding_with_eltwise_add 融合 pass 及算子 phi kernel,减小显存占用并提升推理性能, [#50590](https://github.com/PaddlePaddle/Paddle/pull/50590) -- interpolate 类算子 phi kernel 支持 FP16, 
[#52358](https://github.com/PaddlePaddle/Paddle/pull/52358) -- argmax 算子支持 INT32 类型输出, [#51303](https://github.com/PaddlePaddle/Paddle/pull/51303) -- 修复开启混合精度推理模式后, 保存序列化模型时只有 model 文件时的报错, [#52994](https://github.com/PaddlePaddle/Paddle/pull/52994) -- 修复 instance_norm 在 scale 和 bias 为空时出现的段错误, [#52627](https://github.com/PaddlePaddle/Paddle/pull/52627) -- conv_transpose 算子支持 FP16,[#53626](https://github.com/PaddlePaddle/Paddle/pull/53626) -- 添加 yolo_box_xpu 融合 pass 及算子 phi kernel,优化 YOLO 模型通用子结构, [#54163](https://github.com/PaddlePaddle/Paddle/pull/54163) -- 添加 conv2d_xpu 融合 pass 以及算子 phi kernel,并支持 FP16 推理,优化卷积操作推理耗时,[#52247](https://github.com/PaddlePaddle/Paddle/pull/52247) ,[#53626](https://github.com/PaddlePaddle/Paddle/pull/53626) -- 添加 sigmoid_elementmul 通用融合 pass,融合为 swish 算子以匹配 conv2d_fusion pass 提升 YOLO 模型推理性能, [#53580](https://github.com/PaddlePaddle/Paddle/pull/53580) -- 添加 act_add 融合 pass 及算子 phi kernel 提升推理性能,[#53965](https://github.com/PaddlePaddle/Paddle/pull/53965) -- 添加 fold_interp_outsize 融合 pass 提升推理性能, [#54245](https://github.com/PaddlePaddle/Paddle/pull/54245) -- 解决当 FC 存在共享 weight 时因重复融合导致结果错误的问题。 [#51108](https://github.com/PaddlePaddle/Paddle/pull/51108)、[#51039](https://github.com/PaddlePaddle/Paddle/pull/51039) -- 删除算子仅用于训练的 op_device 属性,防止在推理期间错误的选择训练时的 place, [#51029](https://github.com/PaddlePaddle/Paddle/pull/51029) -- 支持优化后模型的保存,允许再次推理时跳过 PASS 优化减少第一次推理时间, [#53696](https://github.com/PaddlePaddle/Paddle/pull/53696) -- 解决算子 Kernel 的 CPUPlace 输入被强制拷贝到 XPU 而导致的计算错误问题, [#51306](https://github.com/PaddlePaddle/Paddle/pull/51306) -- subblock 支持参数 H2D 提前拷贝以提升推理性能。[#51876](https://github.com/PaddlePaddle/Paddle/pull/51876) -- 修复昆仑芯 2 代芯片输出激活的 scale 存储空间大小。 [#53505](https://github.com/PaddlePaddle/Paddle/pull/53505) -- 新执行器昆仑芯 D2D 拷贝支持异步执行, [#51876](https://github.com/PaddlePaddle/Paddle/pull/51876) -- 删除只有一个输入的 concat 算子,[#52304](https://github.com/PaddlePaddle/Paddle/pull/52304) -- lookup_table_v2 支持 FP16 删除冗余 cast 算子, 
[#52888](https://github.com/PaddlePaddle/Paddle/pull/52888) -- 控制流 While 算子支持缓存 scope,降低每次新建 scope 的开销, [#52628](https://github.com/PaddlePaddle/Paddle/pull/52628) -- scatter 新增支持 FP16,删除冗余 cast 算子以及某一个输入为 1 的 elementwise_mul 算子。[#52831](https://github.com/PaddlePaddle/Paddle/pull/52831) - -### 模型量化 -- 动态图量化功能全面升级 - - 新增动态图模型下量化训练的 API 为 ```paddle.quantization.QAT``` ,支持通过配置传入量化相关参数,简化量化训练使用流程和二次开发难度 ([#49398](https://github.com/PaddlePaddle/Paddle/pull/49398)) - - 新增离线量化的 API 为 ```paddle.quantization.PTQ``` ,支持量化模型导出成推理支持的模型格式 ([#50107](https://github.com/PaddlePaddle/Paddle/pull/50107)) - - 新增 STUB 算子,在训练过程中模拟实际的量化操作([#50510](https://github.com/PaddlePaddle/Paddle/pull/50510)) -- 支持量化训练模型加载离线量化模型的参数,支持更多算子量化,包含 matmul, scale,conv1d,[#47892](https://github.com/PaddlePaddle/Paddle/pull/47892), [#45911](https://github.com/PaddlePaddle/Paddle/pull/45911),[#48912](https://github.com/PaddlePaddle/Paddle/pull/48912) -- 支持静态图量化训练的混合并行训练,[#52219](https://github.com/PaddlePaddle/Paddle/pull/52219) -- 修复动态图量化过程中的问题: - - 导出量化训练模型时候重复插入量化节点,[#48751](https://github.com/PaddlePaddle/Paddle/pull/48751) - - 修复给模型输入插入量化节点的问题,[#49926](https://github.com/PaddlePaddle/Paddle/pull/49926) - -## 5. 
环境适配 -为提升源码编译效率,完善和推广 setuptools + ninja 编译方式,提升开发效率,CPU 场景下,全量编译耗时减少 20min,编译速度提升 24.52%,GPU 场景下全量编译耗时减少 22min,编译速度提升 29.31%; 为了适配较为主流的开发环境,飞桨在源码编译支持了 gcc12 编译和 C++17 标准,适配了最新的 CUDA12; 代码质量完成了编译 warning 的清理,提升编译体验;第三方依赖层级,为减少依赖冲突,升级了底层的 protobuf 版本,并清理了一些低版本依赖库的废弃属性和老旧的代码格式,并移除了对于 python2.x 的支持。 -- ninja 编译适配,提升编译速度。[#52433](https://github.com/PaddlePaddle/Paddle/pull/52433),[#48932](https://github.com/PaddlePaddle/Paddle/pull/48932),[#49420](https://github.com/PaddlePaddle/Paddle/pull/49420),[#48435](https://github.com/PaddlePaddle/Paddle/pull/48435),[#49303](https://github.com/PaddlePaddle/Paddle/pull/49303),[#49448](https://github.com/PaddlePaddle/Paddle/pull/49448),[#49838](https://github.com/PaddlePaddle/Paddle/pull/49838),[#50067](https://github.com/PaddlePaddle/Paddle/pull/50067),[#52796](https://github.com/PaddlePaddle/Paddle/pull/52796),[#50431](https://github.com/PaddlePaddle/Paddle/pull/50431),[#49181](https://github.com/PaddlePaddle/Paddle/pull/49181),[#48867](https://github.com/PaddlePaddle/Paddle/pull/48867),[#48490](https://github.com/PaddlePaddle/Paddle/pull/48490),[#48211](https://github.com/PaddlePaddle/Paddle/pull/48211),[#49499](https://github.com/PaddlePaddle/Paddle/pull/49499),[#53076](https://github.com/PaddlePaddle/Paddle/pull/53076) -- setuptools 
编译打包一体化适配。[#48770](https://github.com/PaddlePaddle/Paddle/pull/48770),[#46957](https://github.com/PaddlePaddle/Paddle/pull/46957),[#49583](https://github.com/PaddlePaddle/Paddle/pull/49583),[#47602](https://github.com/PaddlePaddle/Paddle/pull/47602),[#48301](https://github.com/PaddlePaddle/Paddle/pull/48301),[#50800](https://github.com/PaddlePaddle/Paddle/pull/50800),[#42575](https://github.com/PaddlePaddle/Paddle/pull/42575)),[#49826](https://github.com/PaddlePaddle/Paddle/pull/49826),[#49002](https://github.com/PaddlePaddle/Paddle/pull/49002),[#51443](https://github.com/PaddlePaddle/Paddle/pull/51443),[#51528](https://github.com/PaddlePaddle/Paddle/pull/51528),[#52621](https://github.com/PaddlePaddle/Paddle/pull/52621),[#52465](https://github.com/PaddlePaddle/Paddle/pull/52465) -- gcc12 支持。[#52960](https://github.com/PaddlePaddle/Paddle/pull/52960),[#52265](https://github.com/PaddlePaddle/Paddle/pull/52265),[#46546](https://github.com/PaddlePaddle/Paddle/pull/46546),[#52318](https://github.com/PaddlePaddle/Paddle/pull/52318),[#46808](https://github.com/PaddlePaddle/Paddle/pull/46808),[#47466](https://github.com/PaddlePaddle/Paddle/pull/47466),[#52083](https://github.com/PaddlePaddle/Paddle/pull/52083),[#48176](https://github.com/PaddlePaddle/Paddle/pull/48176),[#49423](https://github.com/PaddlePaddle/Paddle/pull/49423),[#49452](https://github.com/PaddlePaddle/Paddle/pull/49452),[#51037](https://github.com/PaddlePaddle/Paddle/pull/51037),[#52007](https://github.com/PaddlePaddle/Paddle/pull/52007),[#52441](https://github.com/PaddlePaddle/Paddle/pull/52441),[#52085](https://github.com/PaddlePaddle/Paddle/pull/52085),[#50817](https://github.com/PaddlePaddle/Paddle/pull/50817),[#52646](https://github.com/PaddlePaddle/Paddle/pull/52646),[#50777](https://github.com/PaddlePaddle/Paddle/pull/50777),[#53288](https://github.com/PaddlePaddle/Paddle/pull/53288),[#54009](https://github.com/PaddlePaddle/Paddle/pull/54009) -- c++17 
标准支持。[#53345](https://github.com/PaddlePaddle/Paddle/pull/53345),[#53892](https://github.com/PaddlePaddle/Paddle/pull/53892),[#54282](https://github.com/PaddlePaddle/Paddle/pull/54282),[#49017](https://github.com/PaddlePaddle/Paddle/pull/49017),[#47635](https://github.com/PaddlePaddle/Paddle/pull/47635),[#54258](https://github.com/PaddlePaddle/Paddle/pull/54258) -- cuda12 支持。[#52285](https://github.com/PaddlePaddle/Paddle/pull/52285),[#49592](https://github.com/PaddlePaddle/Paddle/pull/49592),[#52232](https://github.com/PaddlePaddle/Paddle/pull/52232),[#52654](https://github.com/PaddlePaddle/Paddle/pull/52654),[#54641](https://github.com/PaddlePaddle/Paddle/pull/54641) -- CodeStyle。[#45909](https://github.com/PaddlePaddle/Paddle/pull/45909),[#47772](https://github.com/PaddlePaddle/Paddle/pull/47772),[#48538](https://github.com/PaddlePaddle/Paddle/pull/48538),[#49522](https://github.com/PaddlePaddle/Paddle/pull/49522),[#47264](https://github.com/PaddlePaddle/Paddle/pull/47264),[#49558](https://github.com/PaddlePaddle/Paddle/pull/49558) -- 编译 Warning 
消除。[#47163](https://github.com/PaddlePaddle/Paddle/pull/47163),[#47216](https://github.com/PaddlePaddle/Paddle/pull/47216),[#47309](https://github.com/PaddlePaddle/Paddle/pull/47309),[#47252](https://github.com/PaddlePaddle/Paddle/pull/47252),[#47341](https://github.com/PaddlePaddle/Paddle/pull/47341),[#47399](https://github.com/PaddlePaddle/Paddle/pull/47399),[#47513](https://github.com/PaddlePaddle/Paddle/pull/47513),[#47558](https://github.com/PaddlePaddle/Paddle/pull/47558),[#47706](https://github.com/PaddlePaddle/Paddle/pull/47706),[#52717](https://github.com/PaddlePaddle/Paddle/pull/52717),[#51203](https://github.com/PaddlePaddle/Paddle/pull/51203),[#51336](https://github.com/PaddlePaddle/Paddle/pull/51336),[#51608](https://github.com/PaddlePaddle/Paddle/pull/51608),[#51633](https://github.com/PaddlePaddle/Paddle/pull/51633),[#46644](https://github.com/PaddlePaddle/Paddle/pull/46644),[#53092](https://github.com/PaddlePaddle/Paddle/pull/53092),[#53185](https://github.com/PaddlePaddle/Paddle/pull/53185),[#53246](https://github.com/PaddlePaddle/Paddle/pull/53246),[#53650](https://github.com/PaddlePaddle/Paddle/pull/53650),[#53683](https://github.com/PaddlePaddle/Paddle/pull/53683),[#53687](https://github.com/PaddlePaddle/Paddle/pull/53687),[#53886](https://github.com/PaddlePaddle/Paddle/pull/53886),[#53689](https://github.com/PaddlePaddle/Paddle/pull/53689),[#53679](https://github.com/PaddlePaddle/Paddle/pull/53679),[#53681](https://github.com/PaddlePaddle/Paddle/pull/53681),[#53532](https://github.com/PaddlePaddle/Paddle/pull/53532),[#47137](https://github.com/PaddlePaddle/Paddle/pull/47137),[#47045](https://github.com/PaddlePaddle/Paddle/pull/47045),[#52186](https://github.com/PaddlePaddle/Paddle/pull/52186),[#52490](https://github.com/PaddlePaddle/Paddle/pull/52490),[#53924](https://github.com/PaddlePaddle/Paddle/pull/53924),[#53938](https://github.com/PaddlePaddle/Paddle/pull/53938),[#53945](https://github.com/PaddlePaddle/Paddle/pull/53945),[#53851](https://
github.com/PaddlePaddle/Paddle/pull/53851),[#53847](https://github.com/PaddlePaddle/Paddle/pull/53847),[#53818](https://github.com/PaddlePaddle/Paddle/pull/53818),[#53931](https://github.com/PaddlePaddle/Paddle/pull/53931) -- 支持 protobuf 升级。[#49875](https://github.com/PaddlePaddle/Paddle/pull/49875),[#48495](https://github.com/PaddlePaddle/Paddle/pull/48495),[#49673](https://github.com/PaddlePaddle/Paddle/pull/49673),[#52499](https://github.com/PaddlePaddle/Paddle/pull/52499),[#51161](https://github.com/PaddlePaddle/Paddle/pull/51161),[#49168](https://github.com/PaddlePaddle/Paddle/pull/49168) -- 支持第三方库离线编译。[#54326](https://github.com/PaddlePaddle/Paddle/pull/54326),[#54370](https://github.com/PaddlePaddle/Paddle/pull/54370),[#54335](https://github.com/PaddlePaddle/Paddle/pull/54335),[#54346](https://github.com/PaddlePaddle/Paddle/pull/54346),[#53744](https://github.com/PaddlePaddle/Paddle/pull/53744),[#54319](https://github.com/PaddlePaddle/Paddle/pull/54319),[#53915](https://github.com/PaddlePaddle/Paddle/pull/53915) -- phi 独立编译头文件依赖解耦。[#50456](https://github.com/PaddlePaddle/Paddle/pull/50456),[#47088](https://github.com/PaddlePaddle/Paddle/pull/47088),[#52573](https://github.com/PaddlePaddle/Paddle/pull/52573),[#52651](https://github.com/PaddlePaddle/Paddle/pull/52651) -- Python2.x 退场。[#48685](https://github.com/PaddlePaddle/Paddle/pull/48685) - -## 6. 
安全 -- 修复了诸如空指针使用、非法地址访问、内存越界、除 0、Python IndexError 等问题。[PR49976](https://github.com/PaddlePaddle/Paddle/pull/49976), [ PR49993](https://github.com/PaddlePaddle/Paddle/pull/49993)[, PR49942](https://github.com/PaddlePaddle/Paddle/pull/49942), [PR49965](https://github.com/PaddlePaddle/Paddle/pull/49965)[, PR50000](https://github.com/PaddlePaddle/Paddle/pull/50000)[, PR50005](https://github.com/PaddlePaddle/Paddle/pull/50005)[, PR49953](https://github.com/PaddlePaddle/Paddle/pull/49953)[, PR49995](https://github.com/PaddlePaddle/Paddle/pull/49995)[, PR49974](https://github.com/PaddlePaddle/Paddle/pull/49974)[, PR50015](https://github.com/PaddlePaddle/Paddle/pull/50015)[, PR50010](https://github.com/PaddlePaddle/Paddle/pull/50010), [PR49979](https://github.com/PaddlePaddle/Paddle/pull/49979), [PR49994](https://github.com/PaddlePaddle/Paddle/pull/49994), [PR49977](https://github.com/PaddlePaddle/Paddle/pull/49977)[, PR49968](https://github.com/PaddlePaddle/Paddle/pull/49968), [PR49984](https://github.com/PaddlePaddle/Paddle/pull/49984)[, PR49958](https://github.com/PaddlePaddle/Paddle/pull/49958)[, PR50008](https://github.com/PaddlePaddle/Paddle/pull/50008)[, PR51714](https://github.com/PaddlePaddle/Paddle/pull/51714), [PR51847](https://github.com/PaddlePaddle/Paddle/pull/51847), [PR51034](https://github.com/PaddlePaddle/Paddle/pull/51034)[, PR51088](https://github.com/PaddlePaddle/Paddle/pull/51088)[, PR51091](https://github.com/PaddlePaddle/Paddle/pull/51091)[, PR51092](https://github.com/PaddlePaddle/Paddle/pull/51092), [PR49966](https://github.com/PaddlePaddle/Paddle/pull/49966), [PR49656](https://github.com/PaddlePaddle/Paddle/pull/49656), [PR52161](https://github.com/PaddlePaddle/Paddle/pull/52161), [PR49548](https://github.com/PaddlePaddle/Paddle/pull/49548), [PR49546](https://github.com/PaddlePaddle/Paddle/pull/49546), [PR49547](https://github.com/PaddlePaddle/Paddle/pull/49547), [PR49549](https://github.com/PaddlePaddle/Paddle/pull/49549), 
[PR51850](https://github.com/PaddlePaddle/Paddle/pull/51850) - -## Thanks to our Contributors -This release contains contributions from: -1want2sleep, 201716010711, 404988613, 5u13, 6clc, Ackeraa, Aganlengzi, ahahahahahaha, Ainavo, Allen Guo, andyj, Asthestarsfalll, Aurelius84, Ayuan, BellaZYL, Bjmw3, Bo Zhang, bukejiyu, caozhou, carryyu, Ccc, ccrrong, ceci3, chalsliu, Chang Xu, CHANGer, Charles-hit, Chen Weihang, chenjian, Chenxiao Niu, chenxiao120660, chenxujun, Chitsing KUI, cifar10, co63oc, CollaborativeFiltering, csy0225, cxxly, cyber-pioneer, cyberslack_lee, czr-gc, Dandelight, danleifeng, Danyang Zhang, dasen, denglianbin, Difer, dongfangshenzhu, DrowFish19, duanboqiang, duanyanhui, engineer, engineer1109, Epsilon Luoo, feifei-111, Feiyu Chan, Feng Ni, feng_shuai, Fisher, FlyingQianMM, Frank Lin, Galaxy1458, GaoYuYang, gaoziyuan, gem5, GGBond8488, Ghost Screaming, gongenlei, gouzil, Guanghua Yu, Guo Sheng, Guoxia Wang, Hamid Zare, Hanchiao, handiz, Haohongxiang, haosicheng, haozi, Happyd99, heliqi, hellockx, hellolllw, heyanru, hg-1099255210, hh-qiao, hjyp, hong, HongyuJia, houj04, hua-zi, Huang Jiyi, Huang Zhengjie, huangjiyi, huangjun12, Hui Zhang, Huihuang Zheng, Hulek, hwa, HydrogenSulfate, Ikko Eltociear Ashimine, iLeGend, Infinity_lee, Infrared1029, Jacek Czaja, jakpiase, james, jameszhang, Jiabin Yang, jiahongyu, jiangcheng, jiangfan06, Jianghai, jiaqianjing, jingsongliu, JingZhuangzhuang, jjyaoao, joanna.wozna.intel, junxiu777, Jx-qi, JYChen, JZ-LIANG, jzhang533, Kai Song, Kai Xing, Kaipeng Deng, Kang Zhao, kangguangli, Kevin 吴嘉文, Kim, Kim Yann, knamg, kuizhiqing, lanxianghit, Leding Li, Leo Chen, Leo Guo, levi131, Li Min, Li-fAngyU, Ligoml, lijialin03, lijin23, limingshu, Lin Manhui, LinearTemporalLogic, Linjie Chen, lishicheng1996, Little-chick, littleforest, liu zhengxi, liulinduo, liuruyan, liuzhenhai93, LiYuRio, lj970926, LokeZhou, LoneRanger, lubiu, Lucas, lugimzzz, Lux et Veritas, lxsbupt, LyndonKong, lzy, lzydev, Mahmoud Ashraf, Manan Goel, 
Maple Xie, Matsumoto Ruko, mayang002, MayYouBeProsperous, megemini, mengziheng, Meteor Liu, mhy, mhy-666, Ming-Xu Huang, ming1753, minghaoBD, mjxs, Moqim, Mountagha, Mr.Juice, mrcangye, NetPunk, Netpunk, nihao, niuliling123, Nyakku Shigure, OccupyMars2025, Ouyang Chao, pangengzheng, pangyoki, parap1uie-s, Paulina Gacek, Piotr Paturej, PommesPeter, PPGitub, PPPPzhang, PuQing, Qi Li, Qi Shao, QingshuChen, qipengh, qizhaoaoe, Rayman, RedContritio, RichardWooSJTU, risemeup1, Roc, ronnywang, Ruibiao Chen, Ruibin Cheung, RuohengMa, Ryan, SaltFish11, Sanbu, Scotty, scotty, seemingwang, Shaojie WANG, ShenLiang, shentanyue, Shijie, Shuangchi He, Siming Dai, Sing_chan, sneaxiy, Sonder, sprouteer, Sqhttwl, sunli, superwinner1, supplyout, SylarTiaNII, Sylwester Fraczek, Sławomir Siwek, taixiurong, Tao Luo, Taylor-Layrose, TeFeng Chen, Thomas Young, thunder95, Thunderbrook, Tian, Tian Zheng, tiancaishaonvjituizi, tianshuo78520a, tifa, Tinson Lai, Tomasz Socha, Tony Cao, ucsk, umiswing, ustiniankw, Vegetable dog, Vigi Zhang, Vvsmile, Wang Bojun, Wang Xin, Wang Xinyu, wangfengsheng1999, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, wangshengxiang, wangxiaoning, wangxinxin08, Wangzheee, WangZhen, wangzhen38, wasupandceacar, wawltor, Wei Shengyu, Weilong Wu, weishengying, Wen Sun, wenbin, wentao yu, wenzhe.wang, westfish, whisky-12, whs, Wilber, will-jl944, winter-wang, Winters Montagne, WJJ1995, wuhuachaocoding, wuyefeilin, wz1qqx, XiangGao, xiaoguoguo626807, xiaohemaikoo, xiaoluomi, xiaoting, xiaoxiaohehe001, Xiaoxu Chen, xiaoyuanzi914, Xinger, Xinyu Chen, xiongkun, xjmxyt, xu98bin, xysheng-baidu, yangguohao, yangjianfengo1, YangQun, YangZhou, yeliang2258, YepKong, Yichen Zhang, yikaikkk, Yiqun Liu, yjphhw, ykkk2333, Young-Flash, yu wentao, Yuang Liu, Yuanle Liu, YuanRisheng, yuchen202, yuehuayingxueluo, YuhangLi, Yulong Ao, YUNSHEN XIE, yunyaoXYY, YuRonan, zachary sun, ZeKai Zhou, Zenghui Yuan, zengshao0622, Zero Rains, Zhan Rongrui, Zhang Jun, Zhang Na, Zhang Ting, 
Zhang Zheng, zhangbo9674, ZhangDY-6483, zhangkaihuo, zhangxin81, zhangyikun02, zhangyingying520, zhangyuqin1998, zhaocaibei123, zhaoyingli, Zhen Wang, Zheng-Bicheng, Zhenghai Zhang, Zheng_Bicheng, zhenyun, Zhibao Li, zhiboniu, Zhong Hui, Zhou Wei, ZhouMengLei1999, zhoutianzi666, zhouzj, zhupengyang, zhurou603, zhuyipin, zhwesky2010, ziyoujiyi, zlsh80826, Zman, zmxdream, zqw_1997, Zuza Gawrysiak, zxcd, zyfncg, ZZK, zzk0, 丁一, 傅剑寒, 六个骨头, 卢林, 周周周, 姜永久, 学渣戊, 张春乔, 张正海, 柠檬味~, 王明冬, 石晓伟, 超级码牛, 陈沧夜, 骑马小猫 - -# 2.4.2 Release Note - - This release fixes known issues and adds a small number of features. - -## Training Framework (Including Distributed) - - * Fixed an error raised when paddle.utils.dlpack.to_dlpack creates dlpack objects repeatedly inside a for loop, and fixed an incorrect reference count that caused the content the dlpack actually points to to be destructed. [#50138](https://github.com/PaddlePaddle/Paddle/pull/50138) - * Fixed out-of-bounds memory access in the paddle.multiplex API with multi-dimensional Input Tensors, and added a check mechanism. [#49368](https://github.com/PaddlePaddle/Paddle/pull/49368) - * Introduced cutlass to implement gemm+gather+scatter fusion; optimized training and inference performance of sparse conv; optimized inference performance of batch_norm on 1D input data. [#50118](https://github.com/PaddlePaddle/Paddle/pull/50118) - * Fixed a compilation failure in the gcc54 environment caused by the use of constexpr. [#50421](https://github.com/PaddlePaddle/Paddle/pull/50421) - * Migrated the Kernel of the sum op to the PHI operator library, and fixed a bug where SelectedRows could not obtain the correct dim in infermeta. [#49342](https://github.com/PaddlePaddle/Paddle/pull/49342) - * Fixed occasional compilation errors caused by an incorrect reference to eigen header files. [#48157](https://github.com/PaddlePaddle/Paddle/pull/48157) - * Fixed out-of-bounds memory access in the fold operator with large batch size inputs. [#49491](https://github.com/PaddlePaddle/Paddle/pull/49491) - * Added type checking to resolve pipeline parallelism hanging when the dimensions of sent tensors are inconsistent. [#50337](https://github.com/PaddlePaddle/Paddle/pull/50337) - * Fixed a bug where the output of the backward operator could be None when the parameter order of a custom operator's output gradients is not contiguous. [#48656](https://github.com/PaddlePaddle/Paddle/pull/48656) - * Fixed a bug in the paddle.squeeze_ API where the shape was modified repeatedly during inplace operation, producing wrong results. [#49903](https://github.com/PaddlePaddle/Paddle/pull/49903) - * Fixed a Layer without parameters being unable to call backward in dynamic-to-static mode. [#49812](https://github.com/PaddlePaddle/Paddle/pull/49812) - * Fixed a compilation issue with CUDA11.8 on Windows. [#50205](https://github.com/PaddlePaddle/Paddle/pull/50205) - * 修复 
`FusedDropoutActBiasGrad` 在 H100 上不支持的错误。 [#47285](https://github.com/PaddlePaddle/Paddle/pull/47285) - * Added the `debug_graphviz_path` option to `build_strategy`. [#46531](https://github.com/PaddlePaddle/Paddle/pull/46531) - * Fixed an unclosed `popen` object. [#47053](https://github.com/PaddlePaddle/Paddle/pull/47053) - -## Deployment (Paddle Inference) - - * Improved mixed-precision inference and its stability. Refactored the underlying implementation of the two-stage convert_to_mixed_precision interface, and added a precision parameter to enable_use_gpu to support the one-stage mode. [#49077](https://github.com/PaddlePaddle/Paddle/pull/49077), [#49239](https://github.com/PaddlePaddle/Paddle/pull/49239), [#49477](https://github.com/PaddlePaddle/Paddle/pull/49477) - * Added support for compiling on the jetson ampere architecture. [#49364](https://github.com/PaddlePaddle/Paddle/pull/49364) - * Fixed a precision issue in the fc kernel in low-precision mode. [#49781](https://github.com/PaddlePaddle/Paddle/pull/49781) - * Fixed the wrong type of the trt workspace parameter in the CAPI. [#48350](https://github.com/PaddlePaddle/Paddle/pull/48350) - * Fixed an inference error caused by arg_max and arg_min lacking the flatten and dtype parameters in Paddle 1.x. [#49771](https://github.com/PaddlePaddle/Paddle/pull/49771) - * Fixed missing lod logic information after the split infermeta refactoring. [#49745](https://github.com/PaddlePaddle/Paddle/pull/49745) - * Fixed an incorrect setting in the constant folding pass that left folded conv2d weights non-persistable, so they did not enter the TensorRT engine. [#50105](https://github.com/PaddlePaddle/Paddle/pull/50105) - -# 2.4.1 Release Note - - -Removed PaddlePaddle's dependency on python.so, and fixed a bug where execution failed in certain environments, including conda, because python.so could not be found. - - - -# 2.4.0 Release Note - -## 1. 
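Important Updates - -- **The new dynamic graph architecture is officially in effect**: the new dynamic graph framework greatly improves scheduling performance. Scheduling performance improves by more than 50% for over 90% of APIs, and model performance improves by more than 5% for over 50% of the suite models. The functional architecture is clearer, and secondary development capability and experience are significantly enhanced. - -- **Comprehensively improved PaddlePaddle's dynamic-static unification:** dynamic-to-static conversion supports a richer set of Python syntax, with Python syntax coverage reaching 90%. The syntax transcription logic has been optimized and control flow syntax is fully supported, giving a smoother one-step conversion to static graphs. With the newly upgraded static graph executor, dynamic-to-static training accelerates better, and tests on key models show performance close to the best level of static graphs. The extensibility of dynamic-to-static conversion is also improved: multi-function merged export and inference are newly supported, and users can do secondary development and flexible deployment with the PHI operator library, which effectively supports custom decoding for the U2++ model in the speech domain. - -- **New sparse computing APIs:** added 55 sparse APIs under `paddle.sparse.*`, covering mainstream sparse computing scenarios. They have been applied to sparse training and inference deployment for tasks such as 3D point cloud object detection and Sparse Transformers. In high-sparsity scenarios they are 105.75% faster than DenseTensor, and 4.01%~58.55% faster than sparse computing in comparable products. Computation on multiple sparse Tensor types (SparseCoo, SparseCsr, and others) is supported, which saves GPU memory to the greatest extent, while the API usage stays consistent with that of dense Tensors. - -- **Large-scale graph neural network GPU training engine:** heterogeneous hierarchical storage across SSD, host memory, and GPU memory breaks through the GPU memory bottleneck and supports all-GPU storage and training of ultra-large-scale graphs. It provides an all-GPU integrated solution for walking, sampling, and training, and trains more than 10 times faster than traditional distributed CPU solutions at the same cost. - -- **Environment adaptation:** added pre-compiled installation packages for CUDA11.7, and added support for running on Ubuntu 22.04 and later. - -### Forward-looking Announcements - -- PaddlePaddle will drop support for python 3.6 in version 2.5. -- PaddlePaddle will gradually deprecate the APIs under the python-side `paddle.fluid` namespace; in version 2.5, some APIs under this namespace will be removed directly. - -## 2. Incompatible Upgrades - -- The pre-compiled installation package for CUDA10.1 is no longer provided. -- The Tensor.clear_gradient(bool set_to_zero) interface no longer accepts values passed through kwargs; the set_to_zero bool can only be passed through args. -- To improve GPU memory utilization, the dynamic graph by default retains only the gradients of forward leaf node variables, such as the gradients of network parameters during training, and no longer retains the gradients of non-leaf nodes by default. To retain the gradient of a specific Tensor, call Tensor.retain_grads() before the backward pass. -- paddle.autograd.PyLayer no longer supports tuple inputs; to pass a group of Tensors as input, pass a list of Tensor. - -## 3. 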
训练框架(含分布式) - -### (1)新增 API 和增强 API 功能 -- **新增稀疏计算类 API**:paddle.sparse - - 新增 55 个稀疏 API,支持稀疏计算主流场景,已应用于 3D 点云目标检测、Sparse Transformers 等任务的稀疏训练和推理部署,高稀疏度场景下相比使用 DenseTensor 提速 105.75%,相比同类产品稀疏计算提速 4.01%~58.55%;支持多种稀疏 Tensor(SparseCoo 和 SparseCsr 等)的计算,极致节省显存;同时保持了一致的使用体验,和稠密 Tensor 的 API 使用方式一致。[#45849](https://github.com/PaddlePaddle/Paddle/pull/45849), [#46694](https://github.com/PaddlePaddle/Paddle/pull/46694), [#45086](https://github.com/PaddlePaddle/Paddle/pull/45086), [#41857](https://github.com/PaddlePaddle/Paddle/pull/41857), [#42935](https://github.com/PaddlePaddle/Paddle/pull/42935), [#43475](https://github.com/PaddlePaddle/Paddle/pull/43475), [#43668](https://github.com/PaddlePaddle/Paddle/pull/43668), [#43966](https://github.com/PaddlePaddle/Paddle/pull/43966), [#44022](https://github.com/PaddlePaddle/Paddle/pull/44022), [#44346](https://github.com/PaddlePaddle/Paddle/pull/44346), [#44432](https://github.com/PaddlePaddle/Paddle/pull/44432), [#44451](https://github.com/PaddlePaddle/Paddle/pull/44451), [#44743](https://github.com/PaddlePaddle/Paddle/pull/44743), [#42013](https://github.com/PaddlePaddle/Paddle/pull/42013), [#43520](https://github.com/PaddlePaddle/Paddle/pull/43520), [#41434](https://github.com/PaddlePaddle/Paddle/pull/41434), [#42130](https://github.com/PaddlePaddle/Paddle/pull/42130), [#41276](https://github.com/PaddlePaddle/Paddle/pull/41276), [#41857](https://github.com/PaddlePaddle/Paddle/pull/41857), [#41356](https://github.com/PaddlePaddle/Paddle/pull/41356) -- **新增语音领域 API:** paddle.audio - - 新增 MFCC、Spectrogram、LogMelSpectrogram 等特征提取 API,支持 GPU 计算,相比 CPU 实现处理性能提升 15x 倍以上,可大幅提升语音模型训练 GPU 利用率。[#45424](https://github.com/PaddlePaddle/Paddle/pull/45424) - - 新增窗函数、离散余弦变换等特征提取基础 API,方便用户自定义语音特征提取。[#45424](https://github.com/PaddlePaddle/Paddle/pull/45424) - - 新增语音 IO 模块,提供 2 种 音频 I/O backend,支持 6 种编解码,便捷地实现语音数据的加载。 [#45939](https://github.com/PaddlePaddle/Paddle/pull/45939) - - 新增 TESS,ESC50 
语音分类数据集,方便用户完成经典语音分类模型。[#45939](https://github.com/PaddlePaddle/Paddle/pull/45939) -- **新增图学习领域 API:** paddle.geometric - - 图学习逐渐成为机器学习领域的关键技术,飞桨新增 paddle.geometric 模块提供更好的图学习建模和训练开发体验。 - - 消息传递:图学习消息传递机制是图建模的基础,因此新增 7 个图学习消息传递 API,更方便完成进行图学习建模。其中,新增的 3 个消息传递融合算子可大幅减少图模型训练显存占用,稠密图场景下 GCN 系列模型可节省 50%+显存,训练速度可提升 20%+。[#44848](https://github.com/PaddlePaddle/Paddle/pull/44848), [#44580](https://github.com/PaddlePaddle/Paddle/pull/44580), [#43174](https://github.com/PaddlePaddle/Paddle/pull/43174), [#44970](https://github.com/PaddlePaddle/Paddle/pull/44970) - - 图采样:图采样是图模型训练的性能瓶颈,此次新增了高性能图采样算子,支持高并发图采样,GraphSage 的采样速度可提升 32 倍以上,模型训练速度可提升 12 倍以上。[#44970](https://github.com/PaddlePaddle/Paddle/pull/44970) -- **新增视觉领域 API** - - paddle.vision 新增目标检测领域算子 paddle.vision.distribute_fpn_proposals([#43736](https://github.com/PaddlePaddle/Paddle/pull/43736)), paddle.vision.generate_proposals([#43611](https://github.com/PaddlePaddle/Paddle/pull/43611)), paddle.vision.matrix_nms([#44357](https://github.com/PaddlePaddle/Paddle/pull/44357)), paddle.vision.prior_box 和 paddle.vision.box_coder([#47282](https://github.com/PaddlePaddle/Paddle/pull/47282))。 - -- - **新增其他 API** - - 新增 iinfo([#45321](https://github.com/PaddlePaddle/Paddle/pull/45321)), count_nonzero([#44169](https://github.com/PaddlePaddle/Paddle/pull/44169)), nanmedian([#42385](https://github.com/PaddlePaddle/Paddle/pull/42385)), remainder\_ ([#45266](https://github.com/PaddlePaddle/Paddle/pull/45266)), take([#44741](https://github.com/PaddlePaddle/Paddle/pull/44741)), triu_indices([#45168](https://github.com/PaddlePaddle/Paddle/pull/45168)), sgn([#44568](https://github.com/PaddlePaddle/Paddle/pull/44568)), bucketize([#44195](https://github.com/PaddlePaddle/Paddle/pull/44195)), nanquantile([#41343](https://github.com/PaddlePaddle/Paddle/pull/41343)), frac([#41226](https://github.com/PaddlePaddle/Paddle/pull/41226)), logcumsumexp([#42267](https://github.com/PaddlePaddle/Paddle/pull/42267)), 
pairwise_distance([#44161](https://github.com/PaddlePaddle/Paddle/pull/44161)), heaviside([#41872](https://github.com/PaddlePaddle/Paddle/pull/41872)), logspace([#41261](https://github.com/PaddlePaddle/Paddle/pull/41261)), corrcoef([#40690](https://github.com/PaddlePaddle/Paddle/pull/40690)) - - 新增 RReLU([#41823](https://github.com/PaddlePaddle/Paddle/pull/41823)), CyclicLR([#40698](https://github.com/PaddlePaddle/Paddle/pull/40698)), OneCycleLR([#41825](https://github.com/PaddlePaddle/Paddle/pull/41825)), Softmax2D([#40910](https://github.com/PaddlePaddle/Paddle/pull/40910)), SoftMarginLoss([#42364](https://github.com/PaddlePaddle/Paddle/pull/42364)), MultiLabelSoftMarginLoss([#41183](https://github.com/PaddlePaddle/Paddle/pull/41183)), TripletMarginLoss([#40487](https://github.com/PaddlePaddle/Paddle/pull/40487)), TripletMarginWithDistanceLoss([#40545](https://github.com/PaddlePaddle/Paddle/pull/40545)), CosineEmbeddingLoss 和 cosine_embedding_loss([#41680](https://github.com/PaddlePaddle/Paddle/pull/41680)), PixelUnshuffle([#40728](https://github.com/PaddlePaddle/Paddle/pull/40728)), ChannelShuffle([#40743](https://github.com/PaddlePaddle/Paddle/pull/40743)) -- **增强 API 功能** - - 增加 BatchNorm1D 的大 batch_size 计算功能 [#43072](https://github.com/PaddlePaddle/Paddle/pull/43072) -- **完善集合通信分布式训练 API** - - 完善`fleet.init`函数,增加`log_level`参数,方便用户查看运行过程中的日志 [#45909](https://github.com/PaddlePaddle/Paddle/pull/45909) - - 新增`paddle.distributed.fleet.recompute_sequential paddle.distributed.fleet.recompute_hybrid`接口,方便用户使用 recompute 功能[#45348](https://github.com/PaddlePaddle/Paddle/pull/45348) - - 新增`paddle.distributed.fleet.layers.mpu` package,方便用户使用张量并行功能 [#45803](https://github.com/PaddlePaddle/Paddle/pull/45803) - - 新增通信 API `paddle.distributed.destroy_process_group paddle.distributed.isend paddle.distributed.irecv paddle.distributed.all_to_all_single`,提升了通信的功能完备性和易用性 [#43918](https://github.com/PaddlePaddle/Paddle/pull/43918) - - 新增`paddle.distributed.stream` 通信 
package,性能比基础版本提升 5%到 10% [#46023](https://github.com/PaddlePaddle/Paddle/pull/46023) [#45282](https://github.com/PaddlePaddle/Paddle/pull/45282) - - 通信 API 新增多种数据类型`Char/Byte/Bool`等的支持,提升了通信的功能完备性和易用性 [#45574](https://github.com/PaddlePaddle/Paddle/pull/45574) [#45440](https://github.com/PaddlePaddle/Paddle/pull/45440) - - 通信 API 异步参数从`use_calc_stream`变成`sync_op`,增强了接口的语义可读性 [#46493](https://github.com/PaddlePaddle/Paddle/pull/46493) -- **增强高层 API** - - 高层 API 中视觉模型 ResNeXt 实现复用 ResNet 代码进行重构。 [#40588](https://github.com/PaddlePaddle/Paddle/pull/40588) - - 高层 API 中视觉模型 Inceptionv3、MobileNetv1、MobileNetv2、ShuffleNetv2 实现改进。[#40431](https://github.com/PaddlePaddle/Paddle/pull/40431) - -### (2)新功能及重要功能升级 - -- **新动态图架构正式上线**:新动态图框架调度性能大幅提升,相比原有架构大幅提升了调度性能,超 90%API 的调度性能提升超过 50%,超 50%套件模型性能提升超过 5%; 新动态图架构清晰,耦合度低,基于新架构实现 Hook、PyLayer 等扩展模块的学习与开发成本显著降低。[#37550](https://github.com/PaddlePaddle/Paddle/pull/37550),[#37574](https://github.com/PaddlePaddle/Paddle/pull/37574),[#37813](https://github.com/PaddlePaddle/Paddle/pull/37813),[#37926](https://github.com/PaddlePaddle/Paddle/pull/37926),[#39192](https://github.com/PaddlePaddle/Paddle/pull/39192),[#37599](https://github.com/PaddlePaddle/Paddle/pull/37599),[#37406](https://github.com/PaddlePaddle/Paddle/pull/37406),[#37466](https://github.com/PaddlePaddle/Paddle/pull/37466),[#37599](https://github.com/PaddlePaddle/Paddle/pull/37599),[#40945](https://github.com/PaddlePaddle/Paddle/pull/40945),[#39989](https://github.com/PaddlePaddle/Paddle/pull/39989) - -- **高阶自动微分机制**:为了更好支持科学计算等场景,飞桨框架针对高阶自动微分能力进一步完善优化。目前,已在`paddle.incubate.autograd` 目录下提供了支持前反向高阶自动微分相关试用功能及 API(当前处于孵化状态,相关功能及 API 签名可能会发生变化)。如果想自行实现相关模型、探索自动微分机制,请仔细阅读[高阶自动微分使用方法及限制](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/incubate/autograd/Overview_cn.html)。具体的升级包括: - 1. 
静态图高阶微分机制升级,通过基础算子体系和程序变换,支持高阶前向及反向微分,并打通编译器、分布式功能。[#41919](https://github.com/PaddlePaddle/Paddle/pull/41919), [#41201](https://github.com/PaddlePaddle/Paddle/pull/41201) - 2. 新增前向和反向高阶自动微分 API, `paddle.incubate.autograd.forward_grad`, `paddle.incubate.autograd.grad`。[#43354](https://github.com/PaddlePaddle/Paddle/pull/43354) - 3. 新增 18 个高阶自动微分算子`sin`, `cos`, `exp`, `erf`, `abs`, `log`, `cast`, `where`, `equal`, `not_equal`, `greater_than`, `greater_equal`, `elementwise_pow` `square`, `elementwise_max`, `gelu`, `reduce_mean`, `size`。[#46184](https://github.com/PaddlePaddle/Paddle/pull/46184), [#46024](https://github.com/PaddlePaddle/Paddle/pull/46024), [#45888](https://github.com/PaddlePaddle/Paddle/pull/45888), [#45338](https://github.com/PaddlePaddle/Paddle/pull/45338), [#44345](https://github.com/PaddlePaddle/Paddle/pull/44345) - 4. 修复现有`elementwise_div`, `reduce_sum`, `p_norm`等算子缺陷。[#46514](https://github.com/PaddlePaddle/Paddle/pull/46514), [#46184](https://github.com/PaddlePaddle/Paddle/pull/46184) - -- **通用异构参数服务器架构**: - - 参数服务器 GPUGraph 基础架构升级,满足大规模应用落地:针对传统 CPU 存储和训练大规模图神经网络的成本高,稳定性低,性能不足的问题打造了纯 GPU 图训练引擎(PGLBox),通过 SSD、内存、显存的异构层次化存储技术,支持超大规模图模型训练,同等成本下训练性能相对 CPU 图训练引擎提升 10+倍,任务失败率下降到极低。[#44594](https://github.com/PaddlePaddle/Paddle/pull/44594) - - 大规模联邦参数服务器架构:针对大规模个性化推荐场景,基于异构 PS 基础架构,开发了大规模联邦参数服务器训练,支持千亿参数下的横向纵向联邦,它包括两个特性:用户私有参数本地更新,公共参数在远端更新,用户可灵活配置私有参数和公共参数的切分策略;新增中心调度节点 Coordinator,用户可从基类进行二次开发,自定义 Client 选择策略。[#42682](https://github.com/PaddlePaddle/Paddle/pull/42682),[#44864](https://github.com/PaddlePaddle/Paddle/pull/44864),[#44327](https://github.com/PaddlePaddle/Paddle/pull/44327) -- **自适应并行** - - 设计并推出了完善的自动并行接口体系,支持自动动转静分布式训练、自动分布式数据加载、自动分布式保存与加载、自动参数转换、自定义切分标记和自定义执行过程等。用户只需要基于单机组网就可以非常容易获得自动分布式训练能力,支持数据并行、模型并行、流水线并行和混合并行。[#45776](https://github.com/PaddlePaddle/Paddle/pull/45776) 
,[#46552](https://github.com/PaddlePaddle/Paddle/pull/46552),[#44202](https://github.com/PaddlePaddle/Paddle/pull/44202),[#45840](https://github.com/PaddlePaddle/Paddle/pull/45840),[#45518](https://github.com/PaddlePaddle/Paddle/pull/45518),[#40528](https://github.com/PaddlePaddle/Paddle/pull/40528),[#42838](https://github.com/PaddlePaddle/Paddle/pull/42838),[#43093](https://github.com/PaddlePaddle/Paddle/pull/43093),[#43312](https://github.com/PaddlePaddle/Paddle/pull/43312),[#45053](https://github.com/PaddlePaddle/Paddle/pull/45053)。 - - 完善了自适应并行底层机制,包括升级分布式 cost model 设计和实现,为切分策略提供更好评价;为 Program IR 添加了原生分布式属性,丰富了 Cluster 功能。[#40457](https://github.com/PaddlePaddle/Paddle/pull/40457),[#42601](https://github.com/PaddlePaddle/Paddle/pull/42601),[#42727](https://github.com/PaddlePaddle/Paddle/pull/42727),[#42874](https://github.com/PaddlePaddle/Paddle/pull/42784),[#43114](https://github.com/PaddlePaddle/Paddle/pull/43114),[#44095](https://github.com/PaddlePaddle/Paddle/pull/44095),[#44146](https://github.com/PaddlePaddle/Paddle/pull/44146),[#44701](https://github.com/PaddlePaddle/Paddle/pull/44701),[#44973](https://github.com/PaddlePaddle/Paddle/pull/44973),[#45002](https://github.com/PaddlePaddle/Paddle/pull/45002),[#45118](https://github.com/PaddlePaddle/Paddle/pull/45118),[#45237](https://github.com/PaddlePaddle/Paddle/pull/45237),[#42576](https://github.com/PaddlePaddle/Paddle/pull/42576),[#41722](https://github.com/PaddlePaddle/Paddle/pull/41722),[#44150](https://github.com/PaddlePaddle/Paddle/pull/44150), [#44989](https://github.com/PaddlePaddle/Paddle/pull/44989), [#44951](https://github.com/PaddlePaddle/Paddle/pull/44951), [#44963](https://github.com/PaddlePaddle/Paddle/pull/44963)。 - - 新增数据并行下 Sharding stage1/2/3 自动调优功能,在保证满足显存约束情况下,自动选择吞吐最高的 Sharding stage 策略。[#43782](https://github.com/PaddlePaddle/Paddle/pull/43782)。 - -- **训练硬件接入-插件式方案**:新增了自定义 Runtime/Kernel/CCL/Graph/Pass 等方案,硬件厂商可以根据硬件特性按需选择实现哪些模块。 - -- **ONNX 格式导出** - - 支持量化模型导出,导出后的 ONNX 模型使用 
TensorRT 或 ONNXRuntime 加载推理,可获得 1.5~4 倍的推理加速 [#856](https://github.com/PaddlePaddle/Paddle2ONNX/pull/856),[#782](https://github.com/PaddlePaddle/Paddle2ONNX/pull/782) - - 新增大于 2GB 的大模型导出 [#942](https://github.com/PaddlePaddle/Paddle2ONNX/pull/942) - -### (3)功能优化 -- **动转静分析转换 & 扩展能力全面提升** - - 为了提升模型动转静转换成功率和使用体验,重构了控制流语法的转写逻辑,升级核心语法为 JIT (just-in-time)范式,实现与 Python 代码的等价转写,并完善了 break、return、continue 等语法功能。[#43666](https://github.com/PaddlePaddle/Paddle/pull/43666),[#43846](https://github.com/PaddlePaddle/Paddle/pull/43846),[#43848](https://github.com/PaddlePaddle/Paddle/pull/43848),[#43880](https://github.com/PaddlePaddle/Paddle/pull/43880),[#43957](https://github.com/PaddlePaddle/Paddle/pull/43957),[#43328](https://github.com/PaddlePaddle/Paddle/pull/43328),[#43348](https://github.com/PaddlePaddle/Paddle/pull/43348),[#43998](https://github.com/PaddlePaddle/Paddle/pull/43998),[#44465](https://github.com/PaddlePaddle/Paddle/pull/44465),[#44504](https://github.com/PaddlePaddle/Paddle/pull/44504),[#43713](https://github.com/PaddlePaddle/Paddle/pull/43713),[#43864](https://github.com/PaddlePaddle/Paddle/pull/43864),[#43967](https://github.com/PaddlePaddle/Paddle/pull/43967),[#44155](https://github.com/PaddlePaddle/Paddle/pull/44155),[#44487](https://github.com/PaddlePaddle/Paddle/pull/44487),[#44527](https://github.com/PaddlePaddle/Paddle/pull/44527),[#45105](https://github.com/PaddlePaddle/Paddle/pull/45105),[#45900](https://github.com/PaddlePaddle/Paddle/pull/45900) - - 为了支撑语音等场景自定义解码灵活部署场景,扩展了 jit.save/load 接口功能,支持用户多函数合并导出,并新增了 JITLayer 组件,支持类函数式调用,同时配合 PHI 算子库 C++ API 
实现了自定义推理部署功能。[#44283](https://github.com/PaddlePaddle/Paddle/pull/44283),[#41783](https://github.com/PaddlePaddle/Paddle/pull/41783),[#43607](https://github.com/PaddlePaddle/Paddle/pull/43607),[#43754](https://github.com/PaddlePaddle/Paddle/pull/43754),[#43758](https://github.com/PaddlePaddle/Paddle/pull/43758),[#43798](https://github.com/PaddlePaddle/Paddle/pull/43798),[#44010](https://github.com/PaddlePaddle/Paddle/pull/44010),[#44351](https://github.com/PaddlePaddle/Paddle/pull/44351),[#44465](https://github.com/PaddlePaddle/Paddle/pull/44465),[#44504](https://github.com/PaddlePaddle/Paddle/pull/44504),[#44597](https://github.com/PaddlePaddle/Paddle/pull/44597),[#44738](https://github.com/PaddlePaddle/Paddle/pull/44738),[#44984](https://github.com/PaddlePaddle/Paddle/pull/44984),[#46249](https://github.com/PaddlePaddle/Paddle/pull/46249) - - 为了统一 API 动静行为,升级了 20 个算子,支持在静态图中 Op 的 attribute 信息可变,保证动静行为一致,提升模型的动转静转换成功率。包括`pad2d`、`depthwise_conv2d_transpose`、`conv2d_transpose`、`adaptive_avg_pool2d`、`reverse`、`bincount`、`multinomial`、`reduce_sum`、`reduce_mean`、`reduce_prod`、`reduce_min`、`reduce_max`、`uniform`、`squeeze`、`max_unpool2d`、`dropout`、`cumsum`、`eye`、`argmin`、`argmax`,[#44737](https://github.com/PaddlePaddle/Paddle/pull/44737),[#45084](https://github.com/PaddlePaddle/Paddle/pull/45084),[#45189](https://github.com/PaddlePaddle/Paddle/pull/45189),[#45391](https://github.com/PaddlePaddle/Paddle/pull/45391),[#45417](https://github.com/PaddlePaddle/Paddle/pull/45417),[#45427](https://github.com/PaddlePaddle/Paddle/pull/45427)、[#45514](https://github.com/PaddlePaddle/Paddle/pull/45514)、[#45525](https://github.com/PaddlePaddle/Paddle/pull/45525)、[#45543](https://github.com/PaddlePaddle/Paddle/pull/45543)、[#45660](https://github.com/PaddlePaddle/Paddle/pull/45660)、[#46352](https://github.com/PaddlePaddle/Paddle/pull/46352/)、[#46433](https://github.com/PaddlePaddle/Paddle/pull/46433)、[#45078](https://github.com/PaddlePaddle/Paddle/pull/45078),[#45342](https://github.co
m/PaddlePaddle/Paddle/pull/45342),[#45372](https://github.com/PaddlePaddle/Paddle/pull/45372),[#45453](https://github.com/PaddlePaddle/Paddle/pull/45453),[#45522](https://github.com/PaddlePaddle/Paddle/pull/45522),[#45620](https://github.com/PaddlePaddle/Paddle/pull/45620) - - 为了解决用户动转静报错栈偶尔丢失问题,优化了报错模块的逻辑,提升了报错栈的可读性以及用户调试的使用体验。[#44054](https://github.com/PaddlePaddle/Paddle/pull/44054),[#44083](https://github.com/PaddlePaddle/Paddle/pull/44083),[#44781](https://github.com/PaddlePaddle/Paddle/pull/44781),[#44996](https://github.com/PaddlePaddle/Paddle/pull/44996) - - 为了全面支持 Python 类型 Type Hint 语法,新增了 TypeHint 语法识别和转写模块。[#47121](https://github.com/PaddlePaddle/Paddle/pull/47121) - -- **PHI 算子库覆盖全量运算类算子**:继续建设高可复用算子库 PHI,将剩余的飞桨 2.x 运算类 PythonAPI 关联的算子以及相关内核均迁移到 PHI 算子库,并改写为函数式,新增了约 180 个前反向算子的 CPU&GPU 内核,以及 170 个 Kunlun 专用算子内核,进一步提升了新增算子时可复用的内核函数集。同时,新增了 100 余个 C++运算类 API,可支持在自定义算子中使用,进一步提升了基于飞桨进行外部扩展开发的易用性。[#44577](https://github.com/PaddlePaddle/Paddle/pull/44577),[#44631](https://github.com/PaddlePaddle/Paddle/pull/44631),[#44434](https://github.com/PaddlePaddle/Paddle/pull/44434),[#44605](https://github.com/PaddlePaddle/Paddle/pull/44605),[#44676](https://github.com/PaddlePaddle/Paddle/pull/44676),[#44742](https://github.com/PaddlePaddle/Paddle/pull/44742),[#44436](https://github.com/PaddlePaddle/Paddle/pull/44436),[#45887](https://github.com/PaddlePaddle/Paddle/pull/45887),[#45851](https://github.com/PaddlePaddle/Paddle/pull/45851),[#45623](https://github.com/PaddlePaddle/Paddle/pull/45623),[#45397](https://github.com/PaddlePaddle/Paddle/pull/45397),[#45863](https://github.com/PaddlePaddle/Paddle/pull/45863) - -- **规范化算子定义,大幅提升模型简洁度**:针对飞桨 1.x 历史算子定义存在诸多冗余参数,理解适配成本高的问题,对约 150 个高频算子的冗余参数进行了集中清理,基本上将数学无关的参数清理完毕。这些冗余参数清理后,飞桨存储的推理模型中信息量明显减少,普遍裁减掉了约 40%的属性变量,显著提升了飞桨算子定义的清晰程度,提升了模型分析调试的体验;同时,也显著减小了飞桨存储推理模型的体积,普遍减小超过 70%,显著提升了飞桨模型的轻量化程度。[#44310](https://github.com/PaddlePaddle/Paddle/pull/44310) , [#45613](https://github.com/PaddlePaddle/Paddle/pull/45613) , 
[#45684](https://github.com/PaddlePaddle/Paddle/pull/45684) , [#45708](https://github.com/PaddlePaddle/Paddle/pull/45708) , [#45758](https://github.com/PaddlePaddle/Paddle/pull/45758) , [#45786](https://github.com/PaddlePaddle/Paddle/pull/45786) , [#45772](https://github.com/PaddlePaddle/Paddle/pull/45772) , [#45845](https://github.com/PaddlePaddle/Paddle/pull/45845) , [#45984](https://github.com/PaddlePaddle/Paddle/pull/45984) , [#46218](https://github.com/PaddlePaddle/Paddle/pull/46218) , [#46553](https://github.com/PaddlePaddle/Paddle/pull/46553) - -### (4) Performance Optimization - -- AMP performance and accuracy optimization - - Added FP16 data type support to more operators, including the elementwise series, the compare series, strided_slice, set_value, and uniform_random. ([#45504](https://github.com/PaddlePaddle/Paddle/pull/45504) [#44405](https://github.com/PaddlePaddle/Paddle/pull/44405) [#45496](https://github.com/PaddlePaddle/Paddle/pull/45496) [#46641](https://github.com/PaddlePaddle/Paddle/pull/46641) [#46906](https://github.com/PaddlePaddle/Paddle/pull/46906)) - - Optimized the FP16 Kernel implementation of the hard_swish operator so that no accuracy is lost. ([#35386](https://github.com/PaddlePaddle/Paddle/pull/35386)) - - Added BF16 data type support to more operators, including fused_linear, empty, selu, pow, adam, clip, embedding, gelu, pad3d, pixel_shuffle, tile, and where. [#46364](https://github.com/PaddlePaddle/Paddle/pull/46364), [#47177](https://github.com/PaddlePaddle/Paddle/pull/47177) -- Automatic tuning of single-machine training performance - - The Transpose OP supports an automatic Kernel selection mechanism that searches for the best-performing Kernel implementation for each model configuration, improving model performance. [#43310](https://github.com/PaddlePaddle/Paddle/pull/43310) (Transpose Op connected to the auto-tuning feature) - - Automatic AMP Layout switching supports the new dynamic graph mode. Models such as ResNet50, TSM, and DeepLabV3 gain 9%~21% in performance under the new dynamic graph through automatic Layout adjustment. ([#45409](https://github.com/PaddlePaddle/Paddle/pull/45409), [#45751](https://github.com/PaddlePaddle/Paddle/pull/45751), [#45826](https://github.com/PaddlePaddle/Paddle/pull/45826), [#46880](https://github.com/PaddlePaddle/Paddle/pull/46880)) -- General performance optimization for single-machine GPU training - - Optimized the Cache scheme for the cuDNN algorithms of Conv operators and cached the results of all algorithm acquisition methods, greatly reducing the CPU overhead of these operators. ([#41891](https://github.com/PaddlePaddle/Paddle/pull/41891) 
[#47197](https://github.com/PaddlePaddle/Paddle/pull/47197)) - - 进一步优化多个算子的 GPU Kernel 和 Python 端性能,包括 dist, poisson, depthwise_conv2d、transpose, eigh, broadcast 类计算,reduce 类计算,layer_norm,cross_entropy 等,在更多配置场景下达到更优性能。([#44946](https://github.com/PaddlePaddle/Paddle/pull/44946), [#45057](https://github.com/PaddlePaddle/Paddle/pull/45057), [#45160](https://github.com/PaddlePaddle/Paddle/pull/45160), [#42491](https://github.com/PaddlePaddle/Paddle/pull/42491), [#42704](https://github.com/PaddlePaddle/Paddle/pull/42704), [#42853](https://github.com/PaddlePaddle/Paddle/pull/42853), [#46287](https://github.com/PaddlePaddle/Paddle/pull/46287), [#46362](https://github.com/PaddlePaddle/Paddle/pull/46362), [#46490](https://github.com/PaddlePaddle/Paddle/pull/46490), [#46412](https://github.com/PaddlePaddle/Paddle/pull/46412), [#46623](https://github.com/PaddlePaddle/Paddle/pull/46623), [#40051](https://github.com/PaddlePaddle/Paddle/pull/40051)) -- 集合通信分布式训练性能优化 - - 为提高流水线并行调度效率,支持动态图 Interleaving 1F1B 调度策略,在 GPT-3 模型上性能提升 
3%~4%。[#45797](https://github.com/PaddlePaddle/Paddle/pull/45797),[#45869](https://github.com/PaddlePaddle/Paddle/pull/45869),[#45922](https://github.com/PaddlePaddle/Paddle/pull/45922),[#46209](https://github.com/PaddlePaddle/Paddle/pull/46209),[#45402](https://github.com/PaddlePaddle/Paddle/pull/45402),[#45444](https://github.com/PaddlePaddle/Paddle/pull/45444),[#45497](https://github.com/PaddlePaddle/Paddle/pull/45497),[#45797](https://github.com/PaddlePaddle/Paddle/pull/45797),[#45869](https://github.com/PaddlePaddle/Paddle/pull/45869),[#45922](https://github.com/PaddlePaddle/Paddle/pull/45922),[#46209](https://github.com/PaddlePaddle/Paddle/pull/46209),[#46399](https://github.com/PaddlePaddle/Paddle/pull/46399),[#46483](https://github.com/PaddlePaddle/Paddle/pull/46483),[#46876](https://github.com/PaddlePaddle/Paddle/pull/46876),[#47242](https://github.com/PaddlePaddle/Paddle/pull/47242),[#47249](https://github.com/PaddlePaddle/Paddle/pull/47249),[#47497](https://github.com/PaddlePaddle/Paddle/pull/47497),[#47517](https://github.com/PaddlePaddle/Paddle/pull/47517) - - 为提升 MLPerf BERT 模型的分布式训练性能,DistributedFusedLamb 分布式优化器支持分层 AllReduce,在 DCU 1024 卡上 MLPerf BERT 性能提升 17%。[#44821](https://github.com/PaddlePaddle/Paddle/pull/44821),[#44843](https://github.com/PaddlePaddle/Paddle/pull/44843) - - 为优化使用数据并行 Data Parallel 时的显存占用,支持 Tensor Fusion 时的 Buffer Lazy 初始化策略,可降低等于模型参数量的显存占用量。[#45631](https://github.com/PaddlePaddle/Paddle/pull/45631)。 - - 分布式并行策略 Data Parallel 和 Sharding 支持 BF16 训练。[#46846](https://github.com/PaddlePaddle/Paddle/pull/46846),[#47246](https://github.com/PaddlePaddle/Paddle/pull/47246) - - 为支持 Sequence Parallel 等策略,分布式流水线并行策略支持 enable_partial_send_recv 策略,支持传输 sequence parallel 切分后的 tensor。[#46992](https://github.com/PaddlePaddle/Paddle/pull/46992),[#47083](https://github.com/PaddlePaddle/Paddle/pull/47083) - - 为提升 sharding stage 2 策略的性能,实现了 sharding stage 2 optimizer broadcast 参数与下一个 step forward 的 overlap,并使用多 CUDA Stream 进行通信,GPT 6.7B 模型 16 
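卡训练性能提升 11%。[#46495](https://github.com/PaddlePaddle/Paddle/pull/46495), [#46656](https://github.com/PaddlePaddle/Paddle/pull/46656), [#47061](https://github.com/PaddlePaddle/Paddle/pull/47061) - -### (5) Bug Fixes - -- Dynamic-to-static - - Fixed an error in dynamic-to-static conversion when a Parameter has no gradient during multi-card training. [#44485](https://github.com/PaddlePaddle/Paddle/pull/44485) - - Fixed redundant framework logs being mistakenly printed to the terminal during dynamic-to-static conversion. [#45754](https://github.com/PaddlePaddle/Paddle/pull/45754), [#46800](https://github.com/PaddlePaddle/Paddle/pull/46800) - - Fixed an error in dynamic-to-static training when control flow in the model contains a Tensor that does not require gradients. [#43034](https://github.com/PaddlePaddle/Paddle/pull/43034) - - Fixed incorrect computed values during gradient aggregation in dynamic-to-static training. [#44893](https://github.com/PaddlePaddle/Paddle/pull/44893) - - Fixed an error in dynamic-to-static conversion when a function is decorated with @staticmethod. [#44983](https://github.com/PaddlePaddle/Paddle/pull/44983), [#45268](https://github.com/PaddlePaddle/Paddle/pull/45268), [#45277](https://github.com/PaddlePaddle/Paddle/pull/45277) - - Fixed excessive GPU memory usage in dynamic-to-static training in some scenarios where the model contains control flow. [#45380](https://github.com/PaddlePaddle/Paddle/pull/45380) - - Fixed a shape inference error at the network-building stage of dynamic-to-static conversion when the model contains complex control flow. [#45916](https://github.com/PaddlePaddle/Paddle/pull/45916), [#46020](https://github.com/PaddlePaddle/Paddle/pull/46020) -- Error reporting mechanism fixes - - Replaced self.assertTrue(np.allclose(...)) with np.testing.assert_allclose to obtain more informative error messages ([#44947](https://github.com/PaddlePaddle/Paddle/pull/44947), [#44988](https://github.com/PaddlePaddle/Paddle/pull/44988), [#45213](https://github.com/PaddlePaddle/Paddle/pull/45213)) -- Collective communication distributed training - - Fixed several bugs in communication library initialization and in the communication process, improving runtime stability [#44964](https://github.com/PaddlePaddle/Paddle/pull/44964) [#45100](https://github.com/PaddlePaddle/Paddle/pull/45100) [#44758](https://github.com/PaddlePaddle/Paddle/pull/44758) - - Fixed pipeline parallelism being prone to hanging, improving the usability of the strategy [#47201](https://github.com/PaddlePaddle/Paddle/pull/47201); enhanced the pipeline to support unbalanced inputs [#47199](https://github.com/PaddlePaddle/Paddle/pull/47199) - - Fixed performance under the MP/PP strategies being lower in the new dynamic graph than in the old dynamic graph [#47071](https://github.com/PaddlePaddle/Paddle/pull/47071) - - Fixed a bug where the sharding stage2 strategy incorrectly maintained the trainable attribute of parameters 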
[#47240](https://github.com/PaddlePaddle/Paddle/pull/47240) - - 修复一系列 OP 在 tensor numel 大于 INT32_MAX 时的 bug。[#45711](https://github.com/PaddlePaddle/Paddle/pull/45711),[#45741](https://github.com/PaddlePaddle/Paddle/pull/45741),[#45897](https://github.com/PaddlePaddle/Paddle/pull/45897),[#46158](https://github.com/PaddlePaddle/Paddle/pull/46158),[#46767](https://github.com/PaddlePaddle/Paddle/pull/46767),[#47191](https://github.com/PaddlePaddle/Paddle/pull/47191),[#46045](https://github.com/PaddlePaddle/Paddle/pull/46045),[#46160](https://github.com/PaddlePaddle/Paddle/pull/46160) - - 修复 FusedAttention 和 FusedFeedForward OP 显存占用过大的 bug。[#47236](https://github.com/PaddlePaddle/Paddle/pull/47236),[#47235](https://github.com/PaddlePaddle/Paddle/pull/47235) - - 修复 multi_tensor_adam 和 multi_tensor_momentum OP 在传入的 parameters 是 list of dict 时参数更新错误的 bug。[#47352](https://github.com/PaddlePaddle/Paddle/pull/47352),[#47372](https://github.com/PaddlePaddle/Paddle/pull/47372) - -## 4. 部署方向(Paddle Inference) - -### (1)新增特性 - -- 后端图引擎集成方案优化 - - 为了减少 Paddle-TensorRT 插件代码开发,以及减少 Paddle-TensorRT 子图数量从而降低资源占用率,开发了通用插件机制,可以自动对框架内丰富的 Phi 算子提供统一的 TensorRT 插件接口,在多数场景下可以有效减少显存占用。 [#46970](https://github.com/PaddlePaddle/Paddle/pull/46070),[#46179](https://github.com/PaddlePaddle/Paddle/pull/46179),[#46580](https://github.com/PaddlePaddle/Paddle/pull/46580) - - 为了方便用户在框架定制算子且能使得 Paddle-TensorRT 高效推理,进行功能升级支持升级框架自定义 Paddle-TensorRT 插件。[#46970](https://github.com/PaddlePaddle/Paddle/pull/46070) -- Inference 推理库构建系统优化,体积可按需裁剪 - - 预编译的安装包默认支持 TensorRT:训练用的预编译安装包与部署用的预编译安装包(Paddle Inference)统一为一个预编译安装包,且优化了构建系统,使得预编译的安装包默认支持 TensorRT,减少用户使用 PaddleTensorRT 时的切换成本。[#46008](https://github.com/PaddlePaddle/Paddle/pull/46008),[#45824](https://github.com/PaddlePaddle/Paddle/pull/45824),[#46058](https://github.com/PaddlePaddle/Paddle/pull/46058) - - 体积可按需裁剪:可依据模型算子进行裁剪。[#47033](https://github.com/PaddlePaddle/Paddle/pull/47033) , [#47049](https://github.com/PaddlePaddle/Paddle/pull/47049) , 
[#47047](https://github.com/PaddlePaddle/Paddle/pull/47047) -- Inference 支持原生 AMP - - 为了充分利用 GPU Tensor Core 计算能力,提升模型的推理性能,开发了模型精度转换工具,Inference GPU 原生支持了混合精度模型的推理。使用方式可参考[文档](https://github.com/PaddlePaddle/Paddle-Inference-Demo/blob/release/v2.4/docs-official/guides/nv_gpu_infer/gpu_mixed_precision.md)。[#43814](https://github.com/PaddlePaddle/Paddle/pull/43814),[#43881](https://github.com/PaddlePaddle/Paddle/pull/43881),[#44057](https://github.com/PaddlePaddle/Paddle/pull/44057),[#44307](https://github.com/PaddlePaddle/Paddle/pull/44307),[#44457](https://github.com/PaddlePaddle/Paddle/pull/44457),[#44866](https://github.com/PaddlePaddle/Paddle/pull/44866),[#45050](https://github.com/PaddlePaddle/Paddle/pull/45050),[#45346](https://github.com/PaddlePaddle/Paddle/pull/45346),[#45379](https://github.com/PaddlePaddle/Paddle/pull/45379),[#45406](https://github.com/PaddlePaddle/Paddle/pull/45406),[#45882](https://github.com/PaddlePaddle/Paddle/pull/45882) - - 为了提升混合精度下模型的推理性能,补充了未支持 FP16 计算的高频算子的 FP16 kernel,减少了由于输入精度不匹配插入 cast 算子的可能性,提升推理性能。[#44642](https://github.com/PaddlePaddle/Paddle/pull/44642),[#45061](https://github.com/PaddlePaddle/Paddle/pull/45061),[#44653](https://github.com/PaddlePaddle/Paddle/pull/44653),[#45504](https://github.com/PaddlePaddle/Paddle/pull/45504),[#45061](https://github.com/PaddlePaddle/Paddle/pull/45061),[#44969](https://github.com/PaddlePaddle/Paddle/pull/44969),[#44558](https://github.com/PaddlePaddle/Paddle/pull/44558),[#44710](https://github.com/PaddlePaddle/Paddle/pull/44710),[#43871](https://github.com/PaddlePaddle/Paddle/pull/43871),[#44792](https://github.com/PaddlePaddle/Paddle/pull/44792) -- 压缩与推理引擎打通升级 - - 升级量化模型存储格式,新格式支持 Paddle Inference、PaddleLite 和 Paddle2ONNX 3 种部署方式,支持芯片类型包括 X86 CPU、NVIDIA GPU、Arm CPU。([#46305](https://github.com/PaddlePaddle/Paddle/pull/46305) [#462832](https://github.com/PaddlePaddle/Paddle/pull/46283) [#46022](https://github.com/PaddlePaddle/Paddle/pull/46022)) - - 新增兼容 SoC/NPU 芯片的 INT8 全量化功能,可保证产出的 
INT8 量化模型在 SoC/NPU 芯片上有最佳推理加速和精度。 -- 推理引擎与飞桨编译器(CINN)打通升级 - - 升级飞桨框架与编译器的接口模块,支持推理模型通过 Paddle Inference 接入编译器进行优化([#44499](https://github.com/PaddlePaddle/Paddle/pull/44499) [#44708](https://github.com/PaddlePaddle/Paddle/pull/44708) ) - -### (2)底层优化 - -- **GPU 性能优化** - - 新增 matmul_v2、LSTM、reshape、fill_constant、swish、mulitclass_nms3、bilinear_interp_v2、split、silu、shuffle_channel 算子的 TensorRT 映射及完善动态 shape 的支持。多类重点模型性能提升 7%~90% 。([#46177](https://github.com/PaddlePaddle/Paddle/pull/46177),[#44678](https://github.com/PaddlePaddle/Paddle/pull/44678),[#44314](https://github.com/PaddlePaddle/Paddle/pull/44314),[#44561](https://github.com/PaddlePaddle/Paddle/pull/44561),[#45166](https://github.com/PaddlePaddle/Paddle/pull/45166), [#44411](https://github.com/PaddlePaddle/Paddle/pull/44411),[#43424](https://github.com/PaddlePaddle/Paddle/pull/43424), [#44516](https://github.com/PaddlePaddle/Paddle/pull/44516)) - - 增加常量折叠 PASS 进行推理性能优化,提升 SwinTransformer、HifiGAN、FastSpeech2 等模型的性能。([#45494](https://github.com/PaddlePaddle/Paddle/pull/45494)) - - 增加 conv_fusion workspacesize 的 cache,提升 conv_fusion 计算性能。([#45902](https://github.com/PaddlePaddle/Paddle/pull/45902)) -- **视觉 ViT 模型优化** - - 新增 ViT 模型 Attention 结构融合 PASS,并支持 OSS Plugin 和自动 padding,ViT 推理速度提升 30%-40% [#45019](https://github.com/PaddlePaddle/Paddle/pull/45019) [#45506](https://github.com/PaddlePaddle/Paddle/pull/45506) -- **大模型推理性能优化** - - 为提高超大生成模型推理速度以及显存节省,对多层 Transformer 融合算子(fused_multi_transformer_op)增加 INT8 实现(fused_multi_transformer_int8_op),支持生成模型的量化推理。结合矩阵乘算法选择、量化反量化 kernel 融合进行性能优化。 [#46169](https://github.com/PaddlePaddle/Paddle/pull/46169) - - 为了提升大模型推理使用 fused_multi_transformer 融合的易用性,增加 Pass 进行自动匹配融合。 -- **CPU 性能优化** - - 优化语音 U2++ 模型,FP32 模型推理速度提升 35%,INT8 模型推理速度提升 69% ([#47592](https://github.com/PaddlePaddle/Paddle/pull/47592) [#47127](https://github.com/PaddlePaddle/Paddle/pull/47127) [#47391](https://github.com/PaddlePaddle/Paddle/pull/47391) 
[#47234](https://github.com/PaddlePaddle/Paddle/pull/47234) [#47009](https://github.com/PaddlePaddle/Paddle/pull/47009) [#47080](https://github.com/PaddlePaddle/Paddle/pull/47080)) - - -### (3)问题修复 - -- TensorRT workspace size 大小设置支持 int64。([#44469](https://github.com/PaddlePaddle/Paddle/pull/44469)) -- Paddle-TRT 中,全面支持 Op 的输入为权重。([#45545](https://github.com/PaddlePaddle/Paddle/pull/45545)) -- Paddle-TRT 中,支持 conv2d_transpose/conv3d_transpose 含 output_padding 属性。([#45004](https://github.com/PaddlePaddle/Paddle/pull/45004)) -- Paddle-TRT 中,增强 strided_slice 对动态 shape 的支持。([#46819](https://github.com/PaddlePaddle/Paddle/pull/46819)) -- Paddle-TRT 中,优化了在多线程场景下运行时 context 的显存占用。([#45468](https://github.com/PaddlePaddle/Paddle/pull/45468)) -- Paddle-TRT 中,修复了多个模型在同一进程中运行时,当初始化顺序变动时,反复生成序列化文件的问题。([#43942](https://github.com/PaddlePaddle/Paddle/pull/43942)) -- 修复了同一进程中,多次初始化 Predictor 并运行时,偶发崩溃的问题。([#45203](https://github.com/PaddlePaddle/Paddle/pull/45203)) -- 修复 MobileNetV3_large、ERNIE 3.0-Medium 和 bert 等量化模型推理精度异常问题 ([#45416](https://github.com/PaddlePaddle/Paddle/pull/45416) [#46283](https://github.com/PaddlePaddle/Paddle/pull/46283) [#45920](https://github.com/PaddlePaddle/Paddle/pull/45920) [#47573](https://github.com/PaddlePaddle/Paddle/pull/47574)) - -## 5. 
环境适配 - -- 训练用的预编译安装包与部署用的预编译安装包(Paddle Inference)统一为一个预编译安装包,且优化了构建系统,使得预编译的安装包默认支持 TensorRT。 -- 取消了适配 CUDA10.1 版本的预编译安装包。 -- 新增了适配 CUDA11.7 版本的预编译安装包。 -- 源码编译时间缩短:减少模块间依赖,提升并行度,优化部分模块的编译速度,共同使的全量编译时间减少了约 20 分钟。 -- 支持在 windows 11、Centos 8、Ubuntu 22.04、Jetson 5.02 系统环境上运行飞桨,支持使用 WSL 2 工具在 windows 系统中运行飞桨 linux 安装包。 -- 修复飞桨在 glibc2.34+环境中运行错误的问题。 -- 优化了整个代码仓库中的 C++、Python、CMake 的代码风格,并引入或升级了以下的代码风格检查工具。 - - pre-commit 由 1.10.4 升级到 2.17.0: [#43103](https://github.com/PaddlePaddle/Paddle/pull/43103) - - pylint 由默认版本改为指定 2.12.0 版本: [#43103](https://github.com/PaddlePaddle/Paddle/pull/43103) - - remove-crlf 由 1.0.1 升级到 1.1.14: [#43103](https://github.com/PaddlePaddle/Paddle/pull/43103) - - cpplint 由默认版本改为指定 1.6.0 版本: [#43175](https://github.com/PaddlePaddle/Paddle/pull/43175),[#43978](https://github.com/PaddlePaddle/Paddle/pull/43978),[#43673](https://github.com/PaddlePaddle/Paddle/pull/43673),[#43679](https://github.com/PaddlePaddle/Paddle/pull/43679),[#43695](https://github.com/PaddlePaddle/Paddle/pull/43695),[#43733](https://github.com/PaddlePaddle/Paddle/pull/43733),[#43740](https://github.com/PaddlePaddle/Paddle/pull/43740) - - clang-format 由 3.8 升级到 13.0: [#42840](https://github.com/PaddlePaddle/Paddle/pull/42840),[#43248](https://github.com/PaddlePaddle/Paddle/pull/43248),[#43329](https://github.com/PaddlePaddle/Paddle/pull/43329),[#43333](https://github.com/PaddlePaddle/Paddle/pull/43333),[#43633](https://github.com/PaddlePaddle/Paddle/pull/43633),[#43678](https://github.com/PaddlePaddle/Paddle/pull/43678) - - 引入 black 工具进行 python 代码的风格检查:[#46014](https://github.com/PaddlePaddle/Paddle/pull/46014) - - 引入 cmakelint 工具用于 cmake 文件代码检查,版本为 1.4.2: [#43222](https://github.com/PaddlePaddle/Paddle/pull/43222),[#43406](https://github.com/PaddlePaddle/Paddle/pull/43406),[#43414](https://github.com/PaddlePaddle/Paddle/pull/43414),[#43428](https://github.com/PaddlePaddle/Paddle/pull/43428) - - 引入 cmake-format 用于 cmake 文件的自动格式化,版本为 0.6.13: 
[#43057](https://github.com/PaddlePaddle/Paddle/pull/43057) - -## 6. 硬件适配 ### 海光 DCU -- 增加在 DCU 上的 Profiler 功能,可以在 DCU 上对模型运行过程的性能数据进行收集、统计和展示,支持 kernel 层面的 DCU 占用率显示。 -### 昆仑芯 -- 增加在昆仑芯 2 代芯片上的 Profiler 功能,可以在昆仑芯 2 代芯片上对模型运行过程的性能数据进行收集、统计和展示,支持 kernel 层面的昆仑芯 2 代芯片占用率显示。 -- 昆仑芯 2 代芯片(昆仑芯 AI 加速卡 R200、R300、R200-8F、R200-8FS、RG800)训练/推理支持,已验证 PPYOLOE、PP-OCR、ERNIE 3.0、PP-TSM、PP-TTS、DLRM、PPO 等总计 51 个模型,支持静态图+动态图训练,支持混合精度训练,支持单机单卡、单机多卡训练,覆盖了智能视觉、自然语言处理、智能语音、智能推荐、强化学习 5 个领域。 -### 寒武纪 -- 寒武纪 MLU 芯片(MLU370 系列板卡)训练/推理支持,已验证 ResNet50、BERT、YoloV3、OCR-DB、Deeplabv3 等多个模型,支持静态图+动态图训练,支持混合精度训练,支持单机单卡、单机多卡训练。 -### Graphcore -- Graphcore IPU 芯片(包括 IPU Mk2 GC200 和 Bow IPU)训练/推理支持,支持 ResNet50、BERT 等模型,支持静态图和动转静模式训练,支持单芯片、单机、多机分布式训练。 -- 增加更多算子支持 -- 升级到 Poplar SDK v3.0.0 版本 [#46892](https://github.com/PaddlePaddle/Paddle/pull/46892) -* 支持使用动转静模式训练模型, 添加了一个新的 paddle.incubate.identity_loss op 用来辅助构图 [#43770](https://github.com/PaddlePaddle/Paddle/pull/43770) -* 支持 Paddle 原生的分布式训练 API paddle.distributed.launch [#43311](https://github.com/PaddlePaddle/Paddle/pull/43311) -* 支持使用混合精度训练模型 [#41733](https://github.com/PaddlePaddle/Paddle/pull/41733) -* Paddle Inference 支持使用 PopART 自定义算子 [#45235](https://github.com/PaddlePaddle/Paddle/pull/45235) - -### Intel -- 迁移 oneDNN 算子 transpose2_grad([#46139](https://github.com/PaddlePaddle/Paddle/pull/46139)), relu6_grad([#46501](https://github.com/PaddlePaddle/Paddle/pull/46501)), gaussian_random([#46747](https://github.com/PaddlePaddle/Paddle/pull/46747), [#45481](https://github.com/PaddlePaddle/Paddle/pull/45481)), sgd and stack([#46374](https://github.com/PaddlePaddle/Paddle/pull/46374)), concat+grad, expand+grad,fill_constant([#45863](https://github.com/PaddlePaddle/Paddle/pull/45863)), slice, slice_grad, split,pad and pad3d([#46101](https://github.com/PaddlePaddle/Paddle/pull/46101)), softmax_grad([#46257](https://github.com/PaddlePaddle/Paddle/pull/46257)), Shape([#46051](https://github.com/PaddlePaddle/Paddle/pull/46051)), 
Sum([#46239](https://github.com/PaddlePaddle/Paddle/pull/46239)), Transpose2_grad([#46139](https://github.com/PaddlePaddle/Paddle/pull/46139)), Cast, clip+grad andpool+grad([#45775](https://github.com/PaddlePaddle/Paddle/pull/45775)), Reduce sum+grad,mean+grad, min and max([#45536](https://github.com/PaddlePaddle/Paddle/pull/45536)), Relu and abs([#45397](https://github.com/PaddlePaddle/Paddle/pull/45397)), Gelu([#45596](https://github.com/PaddlePaddle/Paddle/pull/45596)), Scale([#45537](https://github.com/PaddlePaddle/Paddle/pull/45537)) -- 优化 fill_constant, fc, conv 等若干算子内核 -- 增加若干 Pass 融合优化 -- 优化 Adam-W CPU FP32 优化器 ([#42522](https://github.com/PaddlePaddle/Paddle/pull/42522)) -- 优化 pad3d fp32 onednn 算子内核实现 ([#43990](https://github.com/PaddlePaddle/Paddle/pull/43990)) -- 改进 matmul, FC andlookup_v2 内核的并发执行 ([#44023](https://github.com/PaddlePaddle/Paddle/pull/44023), [#44078](https://github.com/PaddlePaddle/Paddle/pull/444078), [#44640](https://github.com/PaddlePaddle/Paddle/pull/44640), [#44744](https://github.com/PaddlePaddle/Paddle/pull/44744), [#45249](https://github.com/PaddlePaddle/Paddle/pull/45249)) -- FC onednn 算子内核支持 bf16 ( [#42758](https://github.com/PaddlePaddle/Paddle/pull/42758), [#43154](https://github.com/PaddlePaddle/Paddle/pull/43154), [#43109](https://github.com/PaddlePaddle/Paddle/pull/43109)) -- 增加矩阵乘法和激活函数的融合([#43519](https://github.com/PaddlePaddle/Paddle/pull/43519), [#43198](https://github.com/PaddlePaddle/Paddle/pull/43198)) -- 支持卷积算子 int8 参数生产 IR passes ( [#44680](https://github.com/PaddlePaddle/Paddle/pull/44680), [#42625](https://github.com/PaddlePaddle/Paddle/pull/42625)) -- 增加 pool/avg 量化和 scales 修正 ([#44186](https://github.com/PaddlePaddle/Paddle/pull/44186)) -- 增加 matmul 和 elementwise onednn 算子内核融合([#45077](https://github.com/PaddlePaddle/Paddle/pull/45077)) -- 修复 QAT 精度问题 ([#43693](https://github.com/PaddlePaddle/Paddle/pull/43693), [#45936](https://github.com/PaddlePaddle/Paddle/pull/45936), 
[#46378](https://github.com/PaddlePaddle/Paddle/pull/46378)) -- 迁移 42 个 oneDNN 算子内核到 PHI 算子库 ([#46374](https://github.com/PaddlePaddle/Paddle/pull/46374), [#46101](https://github.com/PaddlePaddle/Paddle/pull/46101), [#45989](https://github.com/PaddlePaddle/Paddle/pull/45989), [#45863](https://github.com/PaddlePaddle/Paddle/pull/45863), [#45775](https://github.com/PaddlePaddle/Paddle/pull/45775), [#45626](https://github.com/PaddlePaddle/Paddle/pull/45626), [#45536](https://github.com/PaddlePaddle/Paddle/pull/45536), [#46501](https://github.com/PaddlePaddle/Paddle/pull/46501), [#46257](https://github.com/PaddlePaddle/Paddle/pull/46257), [#45596](https://github.com/PaddlePaddle/Paddle/pull/45596), [#45537](https://github.com/PaddlePaddle/Paddle/pull/45537), [#45481](https://github.com/PaddlePaddle/Paddle/pull/45481), [#45397](https://github.com/PaddlePaddle/Paddle/pull/45397), [#46239](https://github.com/PaddlePaddle/Paddle/pull/46239), [#46139](https://github.com/PaddlePaddle/Paddle/pull/46139), [#46051](https://github.com/PaddlePaddle/Paddle/pull/46051)) -- 量化 elementwise_sub 和 shape 算子内核 ([#42854](https://github.com/PaddlePaddle/Paddle/pull/42854), [#44124](https://github.com/PaddlePaddle/Paddle/pull/44124)) - -## Thanks to our Contributors - -This release contains contributions from: - -0x45f, Aganlengzi, Ainavo, Allen Guo, Asthestarsfalll, Aurelius84, Baibaifan, baoachun, BiynXu, Bo Zhang, BrilliantYuKaimin, cambriconhsq, caozhou, carryyu, ccrrong, ceci3, chalsliu, Chang Xu, Charles-hit, Chen Long, Chen Weihang, chenjian, chentianyu03, Chenxiao Niu, cifar10, crystal, csy0225, danleifeng, David Nicolas, dc-cheny, denglin-github, dongfangshenzhu, duanboqiang, duanyanhui, engineer, enzodechine, Fan Zhang, feifei-111, Feiyu Chan, Feng Ni, feng_shuai, FlyingQianMM, freeliuzc, furnace, fuyou765, fwenguang, Ghost Screaming, gongweibao, Guanghua Yu, guguguzi, Guoxia Wang, Haipeng Wang, handiz, Haohongxiang, haosicheng, helen88, heliqi, hong, HongyuJia, houj04, huangxu96, 
Hui Zhang, Huihuang Zheng, huzhiqiang, Jacek Czaja, Jack Zhou, jack603047588, Jackwaterveg, jakpiase, james, Jiabin Yang, jiangcheng, Jiaqi Liu, JingZhuangzhuang, joanna.wozna.intel, JYChen, JZ-LIANG, Kaipeng Deng, kangguangli, kuizhiqing, Leo Chen, Leo Guo, levi131, Li Min, Li-fAngyU, lidanqing, LielinJiang, Ligoml, Lijunhui, lilong12, limingshu, Lin Manhui, Linjie Chen, liqitong-a, littletomatodonkey, liu zhengxi, Liu-xiandong, liutiexing, Liyulingyue, LiYuRio, Lux et Veritas, lyq, Matsumoto Ruko, MayYouBeProsperous, mengqingchun02, Ming-Xu Huang, ming1753, minghaoBD, moyan, mrcangye, Netpunk, niuliling123, Nyakku Shigure, OccupyMars2025, onecatcn, pangyoki, parap1uie-s, peachlcy, piotrekobi, Qi Li, QingshuChen, qipengh, Rayman, Regan Yue, RichardWooSJTU, risemeup1, Roc, ronnywang, Rui Li, Ruibiao Chen, seemingwang, Shang Zhizhou, shangliang Xu, ShenLiang, shentanyue, Shijie, ShiningZhang, shixingbo, shiyutang, Shuangchi He, Siming Dai, Sing_chan, Skr Bang, SmirnovKol, sneaxiy, sprouteer, Sylwester Fraczek, Sławomir Siwek, taixiurong, Tao CHANG, TeFeng Chen, Thomas Young, thunder95, Thunderbrook, tiancaishaonvjituizi, tianshuo78520a, Tomasz Socha, TTerror, USTCKAY, Vigi Zhang, Walter, Wang Bojun, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, WangXi, wangxinxin08, Wangzheee, WangZhen, wangzhen38, wawltor, wbn, Wei Shengyu, Weilong Wu, weishengying, Wen Sun, wenbin, whs, Wilber, WJJ1995, wuhuachaocoding, wuhuanzhou, wuyefeilin, XiaoguangHu, xiaoguoguo626807, xiaohemaikoo, xiaoting, xiaoxiaohehe001, Xiaoxu Chen, xiayanming, Xingyuan Zhang, xiongkun, yang131313, yangguohao, YangZhou, Yanxing Shi, Yao Zihang, yaoxuefeng, yaozhixin, yeliang2258, Yilingyelu, Yiqun Liu, ykkk2333, Yuang Liu, Yuanle Liu, YuanRisheng, yuguo, Yulong Ao, Yulv-git, YUNSHEN XIE, Zhang Jun, Zhang Ting, Zhang Zheng, zhangbo9674, zhangbopd, zhangchunle, Zhangjingyu06, zhangkaihuo, zhangxiaoci, zhangyikun02, zhangzhenguo, Zhanlue Yang, zhaocaibei123, zhaoying9105, zhaoyingli, Zhen Wang, 
Zhengyang Song, zhiboniu, Zhong Hui, Zhou Wei, zhoutianzi666, zhupengyang, ziyoujiyi, zlsh80826, zmxdream, zn, Zuza Gawrysiak, zyfncg, 傅剑寒, 六个骨头, 津, 熊峻峰, 王明冬, 石晓伟 - - -# 2.3.1 Release Note - -## 1. 重要更新 - -- 2.3.1 版本是在 2.3 版本的基础上修复了已知问题,并且发布了支持 CUDA 11.6 的安装包。 - -## 2. 训练框架(含分布式) - -### (1)功能优化 - -#### API - -- 修改 `paddle.nn.initializer.KaimingUniform` 和 `paddle.nn.initializer.KaimingNormal` 两种初始化方式,使其支持多种类型的激活函数。([#43721](https://github.com/PaddlePaddle/Paddle/pull/43721), [#43827](https://github.com/PaddlePaddle/Paddle/pull/43827)) -- 优化 `paddle.io.DataLoader` 的数据预读取功能,使其支持设置了 `prefetch_factor` 设定的预读取数据的缓存数量,避免在读取大块数据时出现 IO 阻塞。([#43674](https://github.com/PaddlePaddle/Paddle/pull/43674)) - -#### 新动态图执行机制 - -- 修改新动态图 API 逻辑中 optional 类型 Tensor 的初始化方法,防止被提前析构导致数据异常。([#42561](https://github.com/PaddlePaddle/Paddle/pull/42561)) - -#### 全新静态图执行器 - -- 延迟初始化执行器中的线程池,避免只执行一轮的 `program`(如 `save、load、startup_program` 等)创建线程池。([#43768](https://github.com/PaddlePaddle/Paddle/pull/43768)) - -#### 混合精度训练 - -- 设置 `paddle.nn.Layer` 中 `set_state_dict` 中禁用 `state_dict` hook。([#43407](https://github.com/PaddlePaddle/Paddle/pull/43407)) - -#### 分布式训练 - -- 使 `paddle.incubate.nn.functional.fused_attention` 和 `paddle.incubate.nn.functional.fused_feedforward` 支持张量模型并行。([#43505](https://github.com/PaddlePaddle/Paddle/pull/43505)) - -#### 其他 - -- 调整框架算子内核打印字符串的格式,便于进行自动化拆分解析。([#42931](https://github.com/PaddlePaddle/Paddle/pull/42931)) -- 更新模型量化 API,支持 `rounding to nearest ties to even` 的四舍五入方式,支持量化取值范围 [-128, 127]。([#43829](https://github.com/PaddlePaddle/Paddle/pull/43829)) -- 量化感知训练适配支持 AMP 混合精度训练。([#43689](https://github.com/PaddlePaddle/Paddle/pull/43689)) -- 量化感知训练在启动时新增 `progress bar`,便于查看量化初始化进度,统计 out_threshold 时跳过 scale op,加速初始化过程。([#43454](https://github.com/PaddlePaddle/Paddle/pull/43454)) -- 动态图量化训练支持 `conv` 和 `bn` 融合,静态图离线量化支持设置 `skip_tensor_list` 来跳过某些层不做量化。([#43301](https://github.com/PaddlePaddle/Paddle/pull/43301)) - -### (2)性能优化 - -- 优化 
`paddle.incubate.nn.functional.fused_attention` 和`paddle.incubate.nn.functional.fused_feedforward` 算子,增加 `add_residual` 属性,用以控制最后一步是否进行加 `residual` 操作,CAE 模型性能提升 7.7%。([#43719](https://github.com/PaddlePaddle/Paddle/pull/43719)) -- 优化 `linspace` 算子,将 `start`、`stop`、`num` 三个输入 Tensor 初始化在 CPU 上,避免在算子中进行 GPU -> CPU 拷贝,SOLOv2 模型性能提升 6%。([#43746](https://github.com/PaddlePaddle/Paddle/pull/43746)) - -### (3)问题修复 - -#### API - -- 修复 `paddle.io.DataLoader` 在 `return_list=True` 时因多线程冲突小概率报错问题。([#43691](https://github.com/PaddlePaddle/Paddle/pull/43691)) -- 修复 `paddle.nn.Layer` 的参数存在 `None` 类型参数时 `to` 方法报 NoneType 不存在 device 属性的错误。([#43597](https://github.com/PaddlePaddle/Paddle/pull/43597)) -- 修复 cumsum op 在某些 `shape`下计算结果出错的问题。([#42500](https://github.com/PaddlePaddle/Paddle/pull/42500), [#43777](https://github.com/PaddlePaddle/Paddle/pull/43777)) -- 修复静态图下 `Tensor.__getitem__`在使用 `bool`索引时组网阶段输出结果维度为 0 的问题。([#43246](https://github.com/PaddlePaddle/Paddle/pull/43246)) -- 修复 `paddle.slice` 和 `paddle.strided_slice` 处理参数为负数时出现异常的问题。([#43432](https://github.com/PaddlePaddle/Paddle/pull/43432)) -- 修复 set_value op 在处理切片 `step`为负数时赋值结果异常的问题。([#43694](https://github.com/PaddlePaddle/Paddle/pull/43694)) -- 修复 C++ 端 `copy` 接口不能在多卡设备间拷贝的问题。([#43728](https://github.com/PaddlePaddle/Paddle/pull/43728)) -- 修改 `paddle.incubate.nn.functional.fused_attention` 和 `paddle.incubate.nn.functional.fused_feedforward` 中属性命名引发的推理时的问题。([#43505](https://github.com/PaddlePaddle/Paddle/pull/43505)) -- 修复 ConditionalBlockGrad op 处理不需要 `grad`的 Tensor 时异常的问题。([#43034](https://github.com/PaddlePaddle/Paddle/pull/43034)) -- 解决 C++ 的 einsum op 反向速度优化引起的显存增加问题,并将反向优化默认打开。([#43397](https://github.com/PaddlePaddle/Paddle/pull/43397)) -- 修复单卡下 `paddle.io.DataLoader`多进程数据读取在固定随机种子时数据无法固定的问题。([#43702](https://github.com/PaddlePaddle/Paddle/pull/43702)) -- 修复 softmax op 在 Tensor 元素超过 2G 时,触发 CUDNN_STATUS_NOT_SUPPORT 的错误。([#43719](https://github.com/PaddlePaddle/Paddle/pull/43719)) -- 修复 trace op `Event` 
字符串在不同算子无区分,导致性能分析不便利的问题。([#42789](https://github.com/PaddlePaddle/Paddle/pull/42789)) - -#### 其他 - -- 修复动转静多次 deepcopy 并保存导致的显存溢出问题。([#43141](https://github.com/PaddlePaddle/Paddle/pull/43141)) -- 修复自定义算子中使用的 PlaceType 类型升级引入的 device id 在多卡场景中出错的问题。([#43830](https://github.com/PaddlePaddle/Paddle/pull/43830)) -- 优化 `paddle.profiler.Profiler` timeline 可视化逻辑,将在 python 脚本中自定义的事件从 C++ 折叠层显示移动至 python 折叠层显示。([#42790](https://github.com/PaddlePaddle/Paddle/pull/42790)) - -## 3. 部署方向(Paddle Inference) - -### (1)新增特性 #### 新增功能 +- 新增对海光 DCU K100 支持。[#63535](https://github.com/PaddlePaddle/Paddle/pull/63535) +- 支持 complex64/128 数据类型,并支持 fused_bias_residual_layernorm、fused_bias_dropout_residual_layer_norm、rms_norm 等融合算子。 [#63217](https://github.com/PaddlePaddle/Paddle/pull/63217) -- CPU 上 ONNX Runtime 后端新增 PaddleSlim 量化模型支持。([#43774](https://github.com/PaddlePaddle/Paddle/pull/43774), [#43796](https://github.com/PaddlePaddle/Paddle/pull/43796)) - -### (2)底层优化 - -#### CPU 性能优化 - -- EnableMkldnn 配置中移除 `gpu_cpu_reshape2_matmul_fuse_pass`,修复 ResNet50 性能下降的问题。([#43750](https://github.com/PaddlePaddle/Paddle/pull/43750)) - -#### GPU 性能优化 - -- 添加 `bilinear_interp_v2` TensorRT convert 支持。([#43618](https://github.com/PaddlePaddle/Paddle/pull/43618)) -- 添加 `matmul_scale_fuse_pass`、`multihead_matmul_fuse_pass_v3`到 GPU pass,并添加单测。([#43765](https://github.com/PaddlePaddle/Paddle/pull/43765)) -- 添加 GPU handle 延迟初始化支持。([#43661](https://github.com/PaddlePaddle/Paddle/pull/43661)) - -### (3)问题修复 - -#### 框架及 API 修复 - -- 修复联编 Paddle-Lite XPU 时的编译报错问题。([#43178](https://github.com/PaddlePaddle/Paddle/pull/43178)) -- 修复 ERNIE 3.0 pass 误触发的问题。([#43948](https://github.com/PaddlePaddle/Paddle/pull/43948)) -- 修复 multihead op 中 int8 量化属性读不到的问题。([#43020](https://github.com/PaddlePaddle/Paddle/pull/43020)) - -#### 后端能力修复 - -- 修复 MKLDNN 中 elementwise_mul 和 matmul 两个 op 在运行量化推理过程中崩溃的问题。([#43725](https://github.com/PaddlePaddle/Paddle/pull/43725)) -- 修复同一模型在推理时 TensorRT 
子图序列化文件反复生成的问题。([#42945](https://github.com/PaddlePaddle/Paddle/pull/43945), [#42633](https://github.com/PaddlePaddle/Paddle/pull/42633)) -- 修复 ONNX Runtime 后端与外部使用的 protobuf 冲突问题。([#43159](https://github.com/PaddlePaddle/Paddle/pull/43159), [#43742](https://github.com/PaddlePaddle/Paddle/pull/43742)) -- 修复 python 预测库 ONNX Runtime 后端在多输入情况下推理报错问题。([#43621](https://github.com/PaddlePaddle/Paddle/pull/43621)) - -## 4. 环境适配 - -### 编译安装 - -- 完成对 CUDA 11.6 的验证和适配,并在官网发布 CUDA 11.6 的安装包。([#43935](https://github.com/PaddlePaddle/Paddle/pull/43935), [#44005](https://github.com/PaddlePaddle/Paddle/pull/44005)) -- 修复在 Windows 上使用 CUDA 11.6 编译时的 cub 报错问题。([#43935](https://github.com/PaddlePaddle/Paddle/pull/43935), [#44005](https://github.com/PaddlePaddle/Paddle/pull/44005)) -- 修复 elementwise、reduce op 编译时间较长的问题。([#43202](https://github.com/PaddlePaddle/Paddle/pull/43202), [#42779](https://github.com/PaddlePaddle/Paddle/pull/42779), [#43205](https://github.com/PaddlePaddle/Paddle/pull/43205)) - -### 新硬件适配 - -- 寒武纪 MLU 支持飞桨 Profiler。([#42115](https://github.com/PaddlePaddle/Paddle/pull/42115)) -- GraphCore IPU 支持显示编译进度。([#42078](https://github.com/PaddlePaddle/Paddle/pull/42078)) - -# 2.3.0 Release Note - -## 1. 
重要更新 - -我们很高兴地发布飞桨框架 2.3.0 版本,本版本包含如下重要更新。 - -### API - -- 新增 100 多个 API,覆盖自动微分、线性代数、概率分布、稀疏张量、框架性能分析、硬件设备管理、视觉领域等方面。 - -- 新增 4 个自动微分 API,11 个线性代数 API,21 个概率分布类 API,更好地支持科学计算、强化学习等场景。 - -- 新增 11 个 稀疏张量计算 API,支持创建 COO、CSR 格式的 Sparse Tensor 以及与 Tensor 互相转换等基础功能。 - -- 新增 9 个框架性能分析 API,以 `paddle.profiler.Profiler` 为核心,提供对训练、推理过程中性能数据的收集、导出和统计的功能。 - -- 新增 7 个硬件设备管理 API,更好支持硬件相关信息获取。 - -- 新增多个视觉、文本领域 API,方便复用 MobileNetV3, ResNeXt 等骨干网络,实现快速组网。 - -### 飞桨高可复用算子库 PHI - -- 发布飞桨高可复用算子库 PHI (Paddle HIgh reusability operator library),支持组合式算子功能复用、Primitive 算子内核复用、插件式硬件加速库复用。针对飞桨框架原算子库存在的算子接口不清晰、算子复用成本较高、调用性能不够快的问题,我们重构了飞桨框架的算子库,设计了灵活、高效的函数式算子库 Phi,可以通过对函数式算子接口组合调用的方式实现新算子。新算子库提供了 200 余个跟 python 开发接口保持一致的 C++ 运算类 API,以及近 500 个可供组合调用的前、反向函数式算子内核 Kernel,可大幅降低框架原生算子和自定义算子的开发成本。新算子库支持 Primitive API 方式开发算子内核,可支持不同硬件(比如 GPU 和 XPU)的算子内核复用。新算子库支持以插件方式接入硬件(比如 NPU)的加速库,实现低成本复用硬件加速库。 - -### 分布式训练 - -- 全面升级自适应分布式训练架构,含弹性扩缩容、异步流水执行器、异构通信、自动并行等多个模块,支持了多种异构硬件下自动感知的分布式训练及分布式推理。 - -- 动态图混合并行下新增 MoE 并行策略、GroupSharded 并行策略、Pure FP16 等,进一步支持了动态图下大模型的高效并行训练。 - -- 全面升级优化了通用异构参数服务器架构,进行各模块的抽象简化,如通信、存储等,提升了参数服务器的二次开发体验;GPU 参数服务器在千亿参数百亿数据分钟级流式训练下性能提升 2.38 倍。 - -### 编译安装 - -- 从 2.3.0 版本开始,飞桨对框架支持的 GPU 架构种类进行了调整和升级。 - -### 推理部署 - -- 新增 Java API 和 ONNX Runtime CPU 后端。 - -- 支持 TensorRT 8.0 / 8.2 和结构化稀疏,针对 ERNIE 类结构模型性能深度优化。 - -### 硬件适配 - -- 新增自定义新硬件接入:提供一种插件式扩展 PaddlePaddle 硬件后端的方式。 - -- 新增对华为昇腾 910 / GraphCore IPU / 寒武纪 MLU / 昆仑芯 2 代多种异构芯片的训练/推理支持。 - -### 框架架构 - -- 这个版本中,我们在框架的执行器也做了大量工作,详情请见:[新动态图执行机制](#%E6%96%B0%E5%8A%A8%E6%80%81%E5%9B%BE%E6%89%A7%E8%A1%8C%E6%9C%BA%E5%88%B6) 与 [全新静态图执行器](#%E5%85%A8%E6%96%B0%E9%9D%99%E6%80%81%E5%9B%BE%E6%89%A7%E8%A1%8C%E5%99%A8)。 - -## 2. 
不兼容升级 - -- 预编译安装包中移除 CUDA sm35 ARCH: 受到包体积大小的影响,在预编译的安装包中移除了 CUDA sm35 架构。([#41754](https://github.com/PaddlePaddle/Paddle/pull/41754)) - -- `paddle.to_tensor` 将一个 python int scalar 转换为 Tensor 时,在 Windows 上的默认数据类型由 int32 变为 int64,从而与 Linux/Mac 保持对齐。([#39662](https://github.com/PaddlePaddle/Paddle/pull/39662)) - -- 为了与 python3 下的除法行为保持一致,除法符号 `/` 从 rounding divide 变成 true divide,计算输出结果的数据类型从 int 切换成 float。([#40890](https://github.com/PaddlePaddle/Paddle/pull/40890)) - - - - - - - - - - - -
-2.2:
-
-```python
->>> import paddle
->>> a = paddle.to_tensor([327])
->>> b = paddle.to_tensor([80])
->>> a / b
-Tensor(shape=[1], dtype=int64, place=CUDAPlace(0), stop_gradient=True,
-      [4])
-```
-
-2.3.0:
-
-```python
->>> import paddle
->>> a = paddle.to_tensor([327])
->>> b = paddle.to_tensor([80])
->>> a / b
-Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=True,
-      [4.08750010])
-```
-
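The two snippets above show the `/` operator on integer tensors changing from an integer result to true division. The following is a minimal sketch in plain Python, with no Paddle involved. It assumes the 2.2 result coincides with floor division for these positive operands, which the release note does not state explicitly.

```python
# Plain-Python illustration of the two division semantics (no Paddle needed).
# Assumption: the 2.2 "rounding divide" matches floor division for positive ints.
a, b = 327, 80

floor_result = a // b  # integer-style result, comparable to the 2.2 output
true_result = a / b    # true division, comparable to the 2.3.0 output

print("floor division:", floor_result)
print("true division: ", true_result)
```

Code that depended on the old integer result may need an explicit floor division. Whether the `//` operator or `paddle.floor_divide` is the right replacement in a given model should be checked against the Paddle API documentation.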
- -- 修正 ELU 的公式,alpha < 0 时的计算方式与原论文对齐,从而修复小部分情况下的计算结果错误。同时,由于在 alpha < 0 无法在数学上仅从输出计算反向梯度,因此 elu_ 在 alpha < 0 时将报错。([#37316](https://github.com/PaddlePaddle/Paddle/pull/37316)) - - - - - - - - - - - -
-2.2:
-
-```python
-# elu(x) = max(0, x) + min(0, α ∗ (e^x − 1))
->>> import paddle
->>> x = paddle.to_tensor([-1., 6.])
->>> m = paddle.nn.ELU(-0.2)
->>> out = m(x)
->>> out
-Tensor(shape=[2], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
-       [ 0.         , -74.48576355])
->>> out = paddle.nn.functional.elu_(x, alpha=-0.2, name=None)
->>> out
-Tensor(shape=[2], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
-       [ 0.         , -74.48576355])
-```
-
-2.3.0:
-
-```python
-# elu(x) = x, if x > 0
-# elu(x) = α ∗ (e^x − 1), if x <= 0
->>> import paddle
->>> x = paddle.to_tensor([-1., 6.])
->>> m = paddle.nn.ELU(-0.2)
->>> out = m(x)
->>> out
-Tensor(shape=[2], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
-       [0.12642412,  6.        ])
->>> out = paddle.nn.functional.elu_(x, alpha=-0.2, name=None)
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-  File "/usr/local/lib/python3.7/dist-packages/decorator.py", line 232, in fun
-    return caller(func, *(extras + args), **kw)
-  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
-    return wrapped_func(*args, **kwargs)
-  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/inplace_utils.py", line 34, in __impl__
-    return func(*args, **kwargs)
-  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/functional/activation.py", line 89, in elu_
-    assert alpha >= 0., "elu_ only support alpha >= 0, please use elu instead."
-AssertionError: elu_ only support alpha >= 0, please use elu instead.
-```
-
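The two ELU formulas quoted in the comments above can be compared directly. The following is a minimal sketch in plain Python using only `math`. The function names `elu_2_2` and `elu_2_3` are hypothetical labels for the two quoted formulas and are not Paddle APIs.

```python
import math


def elu_2_2(x, alpha):
    """Formula quoted for 2.2: max(0, x) + min(0, alpha * (e^x - 1))."""
    return max(0.0, x) + min(0.0, alpha * (math.exp(x) - 1.0))


def elu_2_3(x, alpha):
    """Formula quoted for 2.3.0: x if x > 0, otherwise alpha * (e^x - 1)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


alpha = -0.2  # negative alpha is the case where the two formulas disagree
for x in (-1.0, 6.0):
    print(f"x={x}: 2.2 formula -> {elu_2_2(x, alpha)}, 2.3.0 formula -> {elu_2_3(x, alpha)}")
```

With `alpha = -0.2` and inputs `-1` and `6`, the two functions reproduce the tensor values listed above up to float32 rounding. For `alpha >= 0` the two formulas give the same result, which is consistent with the note that only a small number of cases were affected.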
- - ## 3. 训练框架(含分布式) - -### (1)新功能 - -#### API - -- 新增 4 个自动微分类 API,支持科学计算需求,具体列表如下:([#40692](https://github.com/PaddlePaddle/Paddle/pull/40692)) - - - `paddle.incubate.autograd.vjp`,计算向量-雅可比矩阵乘积。 - - - `paddle.incubate.autograd.jvp`,计算雅可比矩阵-向量乘积。 - - - `paddle.incubate.autograd.Jacobian`,计算雅可比矩阵。 - - - `paddle.incubate.autograd.Hessian`,计算海森矩阵。 - -- 新增线性代数类 API - - - 新增 `paddle.linalg.triangular_solve`,计算具有唯一解的三角系数线性方程组。([#36714](https://github.com/PaddlePaddle/Paddle/pull/36714)) - - - 新增 `paddle.linalg.eig`,计算一般方阵的特征分解。([#35764](https://github.com/PaddlePaddle/Paddle/pull/35764)) - - - 新增 `paddle.linalg.solve`,计算线性方程组的解。([#35715](https://github.com/PaddlePaddle/Paddle/pull/35715)) - - - 新增 `paddle.linalg.lstsq`,计算线性方程组的最小二乘解。([#38585](https://github.com/PaddlePaddle/Paddle/pull/38585), [#38621](https://github.com/PaddlePaddle/Paddle/pull/38621)) - - - 新增 `paddle.linalg.qr`,计算矩阵的 QR 分解。([#35742](https://github.com/PaddlePaddle/Paddle/pull/35742), [#38824](https://github.com/PaddlePaddle/Paddle/pull/38824)) - - - 新增 `paddle.inner`,计算矩阵内积。([#37706](https://github.com/PaddlePaddle/Paddle/pull/37706)) - - - 新增 `paddle.outer`,计算矩阵外积。([#37706](https://github.com/PaddlePaddle/Paddle/pull/37706)) - - - 新增 `paddle.linalg.cov`,计算向量间协方差。([#38392](https://github.com/PaddlePaddle/Paddle/pull/38392)) - - - 新增 `paddle.linalg.cholesky_solve`,计算方程 cholesky 解。([#38167](https://github.com/PaddlePaddle/Paddle/pull/38167)) - - - 新增 `paddle.linalg.lu`、 `paddle.linalg.lu_unpack`,计算矩阵 lu 分解、解压缩 lu 矩阵。([#38617](https://github.com/PaddlePaddle/Paddle/pull/38617), [#38559](https://github.com/PaddlePaddle/Paddle/pull/38559), [#38616](https://github.com/PaddlePaddle/Paddle/pull/38616)) - -- 新增 21 个概率分布类 API,包括 6 个随机变量分布,13 个随机变量变换,2 个 KL 散度计算,用于强化学习、变分推断、科学计算等场景,具体列表如下:([#40536](https://github.com/PaddlePaddle/Paddle/pull/40536), [#38820](https://github.com/PaddlePaddle/Paddle/pull/38820), [#38558](https://github.com/PaddlePaddle/Paddle/pull/38558/files), 
[#38445](https://github.com/PaddlePaddle/Paddle/pull/38445), [#38244](https://github.com/PaddlePaddle/Paddle/pull/38244), [#38047](https://github.com/PaddlePaddle/Paddle/pull/38047)) - - - `paddle.distribution.ExponentialFamily`,指数分布族基类。 - - - `paddle.distribution.Beta`,`Beta` 分布。 - - - `paddle.distribution.Dirichlet`,`Dirichlet` 分布。 - - - `paddle.distribution.Independent`,独立分布,用于创建高阶分布。 - - - `paddle.distribution.TransformedDistribution`,变换分布,用于通过基础分布及一系列变换生成高阶分布。 - - - `paddle.distribution.Multinomial`,多项分布。 - - - `paddle.distribution.Transform`,随机变量变换的基类。 - - - `paddle.distribution.AbsTransform`,取绝对值变换。 - - - `paddle.distribution.AffineTransform`,仿射变换。 - - - `paddle.distribution.ChainTransform`,变换的链式组合。 - - - `paddle.distribution.ExpTransform`,指数变换。 - - - `paddle.distribution.IndependentTransform`,独立变换,用于扩展变换定义域的 `event_dim`。 - - - `paddle.distribution.PowerTransform`,幂变换。 - - - `paddle.distribution.ReshapeTransform`,`reshape` 变换。 - - - `paddle.distribution.SigmoidTransform`,`sigmoid` 变换。 - - - `paddle.distribution.SoftmaxTransform`,`softmax` 变换。 - - - `paddle.distribution.StackTransform`,`stack` 变换,用于以 `stack` 方式组合多个变换。 - - - `paddle.distribution.StickBreakingTransform`, `stickbreaking` 变换。 - - - `paddle.distribution.TanhTransform`,`tanh` 变换。 - - - `paddle.distribution.kl_divergence`,计算 KL 散度。 - - - `paddle.distribution.register_kl`,注册用户自定义 KL 散度计算函数。 - -- 新增高层 API - - - 新增 `paddle.vision.models.AlexNet`、`paddle.vision.models.alexnet`,支持直接使用 AlexNet 模型。([#36058](https://github.com/PaddlePaddle/Paddle/pull/36058)) - - - 新增 `paddle.vision.models.DenseNet`、 `paddle.vision.models.densenet121`、 `paddle.vision.models.densenet161`、 `paddle.vision.models.densenet169`、 `paddle.vision.models.densenet201`、 `paddle.vision.models.densenet264`,支持直接使用 DenseNet 模型。([#36069](https://github.com/PaddlePaddle/Paddle/pull/36069)) - - - 新增 `paddle.vision.models.GoogLeNet`、`paddle.vision.models.googlenet`,支持直接使用 GoogLeNet 
模型。([#36034](https://github.com/PaddlePaddle/Paddle/pull/36034)) - - - 新增 `paddle.vision.models.InceptionV3`、`paddle.vision.models.inception_v3`,支持直接使用 InceptionV3 模型。([#36064](https://github.com/PaddlePaddle/Paddle/pull/36064)) - - - 新增 `paddle.vision.models.MobileNetV3Small`、 `paddle.vision.models.MobileNetV3Large`、`paddle.vision.models.mobilenet_v3_small`、`paddle.vision.models.mobilenet_v3_large`,支持直接使用 MobileNetV3 模型。([#38653](https://github.com/PaddlePaddle/Paddle/pull/38653)) - - - 新增 `paddle.vision.models.resnext50_32x4d`、 `paddle.vision.models.resnext50_64x4d`、`paddle.vision.models.resnext101_32x4d`、`paddle.vision.models.resnext101_64x4d`、`paddle.vision.models.resnext152_32x4d`、`paddle.vision.models.resnext152_64x4d`,支持直接使用 ResNeXt 模型。([#36070](https://github.com/PaddlePaddle/Paddle/pull/36070)) - - - 新增 `paddle.vision.models.ShuffleNetV2`、 `paddle.vision.models.shufflenet_v2_x0_25`、`paddle.vision.models.shufflenet_v2_x0_33`、`paddle.vision.models.shufflenet_v2_x0_5`、`paddle.vision.models.shufflenet_v2_x1_0`、`paddle.vision.models.shufflenet_v2_x1_5`、`paddle.vision.models.shufflenet_v2_x2_0`、`paddle.vision.models.shufflenet_v2_swish`,支持直接使用 ShuffleNetV2 模型。([#36067](https://github.com/PaddlePaddle/Paddle/pull/36067)) - - - 新增 `paddle.vision.models.SqueezeNet`、 `paddle.vision.models.squeezenet1_0`、`paddle.vision.models.squeezenet1_1`,支持直接使用 SqueezeNet 模型。([#36066](https://github.com/PaddlePaddle/Paddle/pull/36066)) - - - 新增 `paddle.vision.models.wide_resnet50_2`、`paddle.vision.models.wide_resnet101_2`,支持直接使用 WideResNet 模型。([#36952](https://github.com/PaddlePaddle/Paddle/pull/36952)) - - - 新增`paddle.vision.ops.nms` API,支持单类别和多类别非极大抑制(non-maximum supression, nms)算法,用于目标检测预测任务加速。([#40962](https://github.com/PaddlePaddle/Paddle/pull/40962)) - - - 新增`paddle.vision.ops.roi_pool` 和 `paddle.vision.ops.RoIPool`,支持检测任务中 RoI 区域池化操作。([#36154](https://github.com/PaddlePaddle/Paddle/pull/36154)) - - - 新增`paddle.vision.ops.roi_align` 和 `paddle.vision.ops.RoIAlign`,支持检测任务中 
RoI Align 操作。([#35102](https://github.com/PaddlePaddle/Paddle/pull/36154)) - - - 新增 `paddle.text.ViterbiDecoder`、`paddle.text.viterbi_decode` Viterbi 解码 API,主要用于序列标注模型的预测。([#35778](https://github.com/PaddlePaddle/Paddle/pull/35778)) - -- 新增 11 个 Sparse 类 API,支持创建 COO、CSR 格式的 Sparse Tensor,与 Tensor 互相转换等基础功能: - - - `paddle.sparse.sparse_coo_tensor`,创建 COO 格式的 Sparse Tensor。([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780)) - - - `paddle.sparse.sparse_csr_tensor`,创建 CSR 格式的 Sparse Tensor。([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780)) - - - `paddle.sparse.ReLU`,支持 SparseCooTensor 的 ReLU 激活层。([#40959](https://github.com/PaddlePaddle/Paddle/pull/40959)) - - - `paddle.sparse.functional.relu`,支持 SparseCooTensor 的 ReLU 函数。([#40959](https://github.com/PaddlePaddle/Paddle/pull/40959)) - - - `Tensor.values()`,获取 SparseCooTensor 或者 SparseCsrTensor 的非零元素方法。([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608)) - - - `Tensor.indices()`,获取 SparseCooTensor 的坐标信息的方法。([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608)) - - - `Tensor.crows()`,获取 SparseCsrTensor 的压缩行信息的方法。([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608)) - - - `Tensor.cols()`,获取 SparseCsrTensor 的列信息的方法。([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608)) - - - `Tensor.to_sparse_coo()`,将 DenseTensor 或者 SparseCsrTensor 转换为 SparseCooTensor。([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780)) - - - `Tensor.to_sparse_csr()`,将 DenseTensor 或者 SparseCooTensor 转换为 SparseCsrTensor。([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780)) - - - `Tensor.to_dense()`,将 SparseCooTensor 或者 SparseCsrTensor 转换为 DenseTensor。([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780)) - -- 新增硬件相关 API - - - 新增 `paddle.device.cuda.max_memory_allocated`、`paddle.device.cuda.max_memory_reserved`、 `paddle.device.cuda.memory_allocated` 和 `paddle.device.cuda.memory_reserved` 四个 GPU 显存监测相关 
API,方便实时查看和分析模型显存占用指标。([#38657](https://github.com/PaddlePaddle/Paddle/pull/38657)) - - - 新增 `paddle.device.cuda.get_device_properties`,支持返回 CUDA 设备属性信息。([#35661](https://github.com/PaddlePaddle/Paddle/pull/35661)) - - - 新增 `paddle.device.cuda.get_device_name` 和 `paddle.device.cuda.get_device_capability`,支持返回 GPU 设备名称信息和计算能力的主要和次要修订号。([#35672](https://github.com/PaddlePaddle/Paddle/pull/35672)) - -- 新增 Tensor 操作 API - - - 新增 `paddle.nansum`,沿 `axis` 对输入 Tensor 求和,且忽略掉 `NaNs` 值。([#38137](https://github.com/PaddlePaddle/Paddle/pull/38137)) - - - 新增 `paddle.nanmean`,沿 `axis`对输入 Tensor 求平均,且忽略掉 `NaNs` 值。([#40472](https://github.com/PaddlePaddle/Paddle/pull/40472)) - - - 新增 `paddle.clone`,返回输入 Tensor 的拷贝,并且提供梯度计算。([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020)) - - - 新增 `paddle.Tensor.element_size`,返回 Tensor 中的单个元素在计算机中所分配的 bytes 数量。([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020)) - - - 新增 `paddle.Tensor.to_uva_tensor`,支持将 numpy 对象转换为实际存储在 CPU,但可作为 CUDA 对象进行虚拟地址访问的功能。([#39146](https://github.com/PaddlePaddle/Paddle/pull/39146), [#38950](https://github.com/PaddlePaddle/Paddle/pull/38950)) - - - 新增`paddle.rot90`,沿 `axes` 指定的平面将 n 维 Tensor 旋转 90 度。([#37634](https://github.com/PaddlePaddle/Paddle/pull/37634)) - - - 新增`paddle.logit` 和 `paddle.Tensor.logit`,计算输入 Tensor 的 logit 函数值。([#37844](https://github.com/PaddlePaddle/Paddle/pull/37844)) - - - 新增 `paddle.repeat_interleave`,沿着指定轴对输入进行复制,创建并返回到一个新的 Tensor。([#37981](https://github.com/PaddlePaddle/Paddle/pull/37981)) - - - 新增 `paddle.renorm`,把 Tensor 在指定的 `axis` 切分成多块后分别进行 p norm 操作。([#38130](https://github.com/PaddlePaddle/Paddle/pull/38130), [#38459](https://github.com/PaddlePaddle/Paddle/pull/38459)) - - - 新增 `paddle.mode` 和 `paddle.Tensor.mode`,沿指定轴查找输入 Tensor 的众数及对应的索引。([#38446](https://github.com/PaddlePaddle/Paddle/pull/38446)) - - - 新增 `paddle.quantile` 和 `paddle.Tensor.quantile`,沿指定轴计算 Tensor 的 q 分位数。([#38567](https://github.com/PaddlePaddle/Paddle/pull/38567)) - - - 新增 `paddle.kthvalue` 和 
`paddle.Tensor.kthvalue`,查找 Tensor 中指定轴上第 k 小的数及对应的索引。([#38386](https://github.com/PaddlePaddle/Paddle/pull/38386)) - - - 新增 `paddle.is_floating_point` 和 `paddle.Tensor.is_floating_point`,判断输入 Tensor 是否为浮点类型。([#37885](https://github.com/PaddlePaddle/Paddle/pull/37885)) - - - 新增 `paddle.erfinv` 和 `paddle.Tensor.erfinv`,计算输入 Tensor 的逆误差函数。([#38295](https://github.com/PaddlePaddle/Paddle/pull/38295)) - - - 新增 `paddle.lerp` 和 `paddle.Tensor.lerp`,根据给定权重计算输入 Tensor 间的线性插值。([#37253](https://github.com/PaddlePaddle/Paddle/pull/37253)) - - - 新增 `paddle.angle`,用于计算复数 Tensor 的相位角。([#37689](https://github.com/PaddlePaddle/Paddle/pull/37689)) - - - 新增`paddle.rad2deg`和`paddle.Tensor.rad2deg`,将元素从弧度的角度转换为度。([#37598](https://github.com/PaddlePaddle/Paddle/pull/37598)) - - - 新增`paddle.deg2rad`和`paddle.Tensor.deg2rad`,将元素从度的角度转换为弧度。([#37598](https://github.com/PaddlePaddle/Paddle/pull/37598)) - - - 新增`paddle.gcd`和`paddle.Tensor.gcd`,计算两个输入的按元素绝对值的最大公约数。([#37819](https://github.com/PaddlePaddle/Paddle/pull/37819)) - - - 新增`paddle.lcm`和`paddle.Tensor.lcm`,计算两个输入的按元素绝对值的最小公倍数。([#37819](https://github.com/PaddlePaddle/Paddle/pull/37819)) - - - 新增`paddle.amax`和`paddle.Tensor.amax`,对指定维度上的 Tensor 元素求最大值,正向结果和 max 一样,有多个相等的最大值时,反向的梯度平均分到这多个值的位置上。([#38417](https://github.com/PaddlePaddle/Paddle/pull/38417)) - - - 新增`paddle.amin`和`paddle.Tensor.amin`,对指定维度上的 Tensor 元素求最小值,正向结果和 min 一样,有多个相等的最小值时,反向的梯度平均分到这多个值的位置上。([#38417](https://github.com/PaddlePaddle/Paddle/pull/38417)) - - - 新增`paddle.isclose`,用于判断两个 Tensor 的每个元素是否接近。([#37135](https://github.com/PaddlePaddle/Paddle/pull/37135)) - - - 新增`paddle.put_along_axis` 和`paddle.take_along_axis`,用于提取或放置指定索引下标的元素。([#38608](https://github.com/PaddlePaddle/Paddle/pull/38608)) - - - 新增 `paddle.bincount` 和 `paddle.Tensor.bincount`,用于统计 Tensor 中每个元素出现的次数。([#36317](https://github.com/PaddlePaddle/Paddle/pull/36317)) - - - 新增 `paddle.fmax`、 `paddle.fmin`,扩展了 max/min 的功能,支持比较的两个 Tensor 中有 NaN 值的情况,即如果对应位置上有 1 个 NaN 值,则返回那个非 NaN 值;如果对应位置上有 2 个 NaN 值,则返回 
NaN 值。([#37826](https://github.com/PaddlePaddle/Paddle/pull/37826)) - - - 新增 `paddle.diff`,用于计算沿给定维度的第 n 个前向差值,目前支持 n=1。([#37441](https://github.com/PaddlePaddle/Paddle/pull/37441)) - - - 新增 `paddle.asinh`、`paddle.acosh`、`paddle.atanh` 反双曲函数类 API。([#37076](https://github.com/PaddlePaddle/Paddle/pull/37076)) - - - 新增 `paddle.as_real`,`paddle.as_complex` 用于实数 Tensor 和复数 Tensor 之间的转换。([#37784](https://github.com/PaddlePaddle/Paddle/pull/37784)) - - - 新增 `paddle.complex` 用于给定实部和虚部构造复数 Tensor。([#37918](https://github.com/PaddlePaddle/Paddle/pull/37918), [#38272](https://github.com/PaddlePaddle/Paddle/pull/38272)) - - - 新增 `paddle.det` 与 `paddle.slogdet`,用于计算矩阵的行列式和行列式的自然对数。([#34992](https://github.com/PaddlePaddle/Paddle/pull/34992)) - - - 新增`paddle.nn.utils.parameters_to_vector`,可以将输入的多个 parameter 展平并连接为 1 个 1-D Tensor。([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020)) - - - 新增`paddle.nn.utils.vector_to_parameters`,将 1 个 1-D Tensor 按顺序切分给输入的多个 parameter。([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020)) - -- 新增组网类 API - - - 新增 `paddle.nn.Fold`、`paddle.nn.functional.fold`,支持将提取出的滑动局部区域块还原成 batch 的 Tensor。([#38613](https://github.com/PaddlePaddle/Paddle/pull/38613)) - - - 新增 `paddle.nn.CELU`、`paddle.nn.functional.celu`,支持 CELU 激活层。([#36088](https://github.com/PaddlePaddle/Paddle/pull/36088)) - - - 新增 `paddle.nn.HingeEmbeddingLoss`,增加计算 hinge embedding 损失的方式,通常用于学习 nonlinear embedding 或半监督学习。([#37540](https://github.com/PaddlePaddle/Paddle/pull/37540)) - - - 新增 `paddle.nn.ZeroPad2D` API,按照 padding 属性对输入进行零填充。([#37151](https://github.com/PaddlePaddle/Paddle/pull/37151)) - - - 新增 `paddle.nn.MaxUnPool3D` 和 `paddle.nn.MaxUnPool1D`,用于计算 3D 最大反池化和 1D 最大反池化。([#38716](https://github.com/PaddlePaddle/Paddle/pull/38716)) - - - 新增 `paddle.incubate.graph_khop_sampler`、`paddle.incubate.graph_sample_neighbors`、 `paddle.incubate.graph_reindex` API,支持图多阶邻居采样和图编号重索引操作,主要用于图神经网络模型训练。([#39146](https://github.com/PaddlePaddle/Paddle/pull/39146), 
[#40809](https://github.com/PaddlePaddle/Paddle/pull/40809)) - -- 新增随机数类 API - - - 新增 `paddle.poisson`,以输入 Tensor 为泊松分布的 lambda 参数,生成一个泊松分布的随机数 Tensor。([#38117](https://github.com/PaddlePaddle/Paddle/pull/38117)) - - - 新增 `paddle.randint_like` API,支持新建服从均匀分布的、范围在[low, high) 的随机 Tensor,输出的形状与输入的形状一致。([#36169](https://github.com/PaddlePaddle/Paddle/pull/36169)) - - - 新增 `paddle.Tensor.exponential_`,为 inplace 式 API,通过指数分布随机数来填充输入 Tensor。([#38256](https://github.com/PaddlePaddle/Paddle/pull/38256)) - -- 新增参数初始化类 API - - - 新增`paddle.nn.initializer.Dirac`,通过迪拉克 delta 函数来初始化 3D/4D/5D 参数,其常用于卷积层 Conv1D/Conv2D/Conv3D 的参数初始化。([#37389](https://github.com/PaddlePaddle/Paddle/pull/37389)) - - - 新增`paddle.nn.initializer.Orthogonal`,正交矩阵初始化,被初始化后的参数是(半)正交向量。([#37163](https://github.com/PaddlePaddle/Paddle/pull/37163)) - - - 新增`paddle.nn.initializer.calculate_gain`,获取激活函数的推荐增益值,增益值可用于设置某些初始化 API,以调整初始化范围。([#37163](https://github.com/PaddlePaddle/Paddle/pull/37163)) - -- 新增学习率类 API - - - 新增 `paddle.optimizer.lr.MultiplicativeDecay`,提供 `lambda` 函数设置学习率的策略。([#38250](https://github.com/PaddlePaddle/Paddle/pull/38250)) - -- 新增分布式相关 API - - - 新增 `paddle.incubate.optimizer.DistributedFusedLamb`,使得 Lamb 优化器可分布式更新参数。([#40011](https://github.com/PaddlePaddle/Paddle/pull/40011), [#39972](https://github.com/PaddlePaddle/Paddle/pull/39972), [#39900](https://github.com/PaddlePaddle/Paddle/pull/39900), [#39747](https://github.com/PaddlePaddle/Paddle/pull/39747), [#39148](https://github.com/PaddlePaddle/Paddle/pull/39148), [#39416](https://github.com/PaddlePaddle/Paddle/pull/39416)) - -- 新增优化器相关 API([#40710](https://github.com/PaddlePaddle/Paddle/pull/40710)) - - - `paddle.incubate.optimizer.functional.minimize_bfgs`,增加二阶优化器 BFGS。 - - - `paddle.incubate.optimizer.functional.minimize_lbfgs`,增加二阶优化器 L-BFGS。 - -- 新增 `paddle.incubate.multiprocessing`模块,支持 Tensor(CPU/GPU)在 python 进程间传输。([#37302](https://github.com/PaddlePaddle/Paddle/pull/37302), 
[#41339](https://github.com/PaddlePaddle/Paddle/pull/41339)) - -- 新增 `paddle.incubate.autotune.set_config` API,支持多版本 Kernel 自动选择、混合精度数据布局自动转换、DataLoader 的 num_workers 自动选择,以自动提升模型性能。([#42301](https://github.com/PaddlePaddle/Paddle/pull/42301)) - -- 新增 `paddle.incubate.nn.FusedMultiTransformer` 和 `paddle.incubate.nn.functional.fused_multi_transformer` API,可将多层 transformer 融合到一个 op 中,提升模型推理性能,注意:仅支持前向推理。([#42311](https://github.com/PaddlePaddle/Paddle/pull/42311)) - -- 新增动静统一的 einsum_v2 op,兼容原有 python 端 `paddle.einsum` 实现的同时支持动转静导出和更加完备的 Infershape 推导。([#42495](https://github.com/PaddlePaddle/Paddle/pull/42495), [#42327](https://github.com/PaddlePaddle/Paddle/pull/42327), [#42397](https://github.com/PaddlePaddle/Paddle/pull/42397), [#42105](https://github.com/PaddlePaddle/Paddle/pull/42105)) - -#### IR(Intermediate Representation) - -- 动态图转静态图 - - - 变量类型 StaticAnalysis 模块新增支持类似 `a, b = paddle.shape(x)` 的类型标记。([#39245](https://github.com/PaddlePaddle/Paddle/pull/39245)) - - - 新增支持 `InputSpec.name` 作为 Program 缓存 hash key 的计算字段。([#38273](https://github.com/PaddlePaddle/Paddle/pull/38273)) - - - 新增支持 `dict['key'] = x.shape` 语法。([#40611](https://github.com/PaddlePaddle/Paddle/pull/40611)) - - - 新增支持 Pure FP16 训练。([#36944](https://github.com/PaddlePaddle/Paddle/pull/36944)) - - - 新增支持 `for i in [x,y,z]` 语法。([#37259](https://github.com/PaddlePaddle/Paddle/pull/37259)) - - - 新增支持 python3 的 type hint 语法。([#36544](https://github.com/PaddlePaddle/Paddle/pull/36544)) - -- Pass 开发 - - - 新增基于 NVIDIA cuBlasLt Epilogue 的 FC + [relu|gelu] 的前向与反向融合。([#39437](https://github.com/PaddlePaddle/Paddle/pull/39437)) - -- Kernel Primitive API - - - 新增 GPU 平台 KP 算子,包括 cast、scale、clip、bce_loss、abs_grad、reduce_sum_grad、reduce_mean_grad、clip、bce_loss、full、full_like、distribution、 random、masked_select_kernel、where_index、masked_select_grad、dropout、sigmoid、where、abs_grad。([#36203](https://github.com/PaddlePaddle/Paddle/pull/36203), [#36423](https://github.com/PaddlePaddle/Paddle/pull/36423), 
[#39390](https://github.com/PaddlePaddle/Paddle/pull/39390), [#39734](https://github.com/PaddlePaddle/Paddle/pull/39734), [#38500](https://github.com/PaddlePaddle/Paddle/pull/38500), [#38959](https://github.com/PaddlePaddle/Paddle/pull/38959), [#39197](https://github.com/PaddlePaddle/Paddle/pull/39197/), [#39563](https://github.com/PaddlePaddle/Paddle/pull/39563), [#39666](https://github.com/PaddlePaddle/Paddle/pull/39666), [#40517](https://github.com/PaddlePaddle/Paddle/pull/40517), [#40617](https://github.com/PaddlePaddle/Paddle/pull/40617), [#40766](https://github.com/PaddlePaddle/Paddle/pull/40766), [#39898](https://github.com/PaddlePaddle/Paddle/pull/39898), [#39609](https://github.com/PaddlePaddle/Paddle/pull/39609)) - - - 新增支持 XPU2 源码编译模式。([#37254](https://github.com/PaddlePaddle/Paddle/pull/37254), [#40397](https://github.com/PaddlePaddle/Paddle/pull/40397), [#38455](https://github.com/PaddlePaddle/Paddle/pull/38455)) - - - 新增支持 KP 算子在 XPU2 和 GPU 中复用,包括 reduce、broadcast、elementwise_add、`exp、log、relu、sigmoid、leaky_relu、softplus、hard_swish、reciprocal`。([#36904](https://github.com/PaddlePaddle/Paddle/pull/36904), [#37226](https://github.com/PaddlePaddle/Paddle/pull/37226), [#38918](https://github.com/PaddlePaddle/Paddle/pull/38918), [#40560](https://github.com/PaddlePaddle/Paddle/pull/40560/), [#39787](https://github.com/PaddlePaddle/Paddle/pull/39787), [#39917](https://github.com/PaddlePaddle/Paddle/pull/39917), [#40002](https://github.com/PaddlePaddle/Paddle/pull/40002), [#40364](https://github.com/PaddlePaddle/Paddle/pull/40364)) - - - 新增 XPU2 平台 KP 算子单测,包括 `brelu、ceil、celu、elu、floor、hard_shrink、hard_sigmoid、log1p、logsigmoid、relu6、silu、soft_relu、softsign、sqrt、square、swish、thresholded_relu、softshrink`。([#40448](https://github.com/PaddlePaddle/Paddle/pull/40448), [#40524](https://github.com/PaddlePaddle/Paddle/pull/40524)) - - - 新增 XPU2 KP 模型支持,包括 resnet50、deepfm、wide_deep、yolov3-darknet53、det_mv3_db、bert、transformer、mobilenet_v3、GPT2。 - -#### 混合精度训练 - -- 
从混合精度训练 `paddle.amp.GradScaler` 的 `minimize` 中拆分出 `paddle.amp.GradScaler.unscale_` 方法,提供恢复 loss 的独立接口。([#35825](https://github.com/PaddlePaddle/Paddle/pull/35825)) - -- 为 `paddle.nn.ClipByGlobalNorm` 动态图模式添加 FP16 支持,为 clip op 添加 FP16 Kernel,使`clip`相关操作支持 FP16。([#36198](https://github.com/PaddlePaddle/Paddle/pull/36198), [#36577](https://github.com/PaddlePaddle/Paddle/pull/36577)) - -- 支持 `paddle.amp.decorate` 传入的`optimizer`参数为 None。([#37541](https://github.com/PaddlePaddle/Paddle/pull/37541)) - -- 为 merged_momentum op 添加支持输入多学习率、支持 use_nesterov 策略的计算、支持 regularization 计算。([#37527](https://github.com/PaddlePaddle/Paddle/pull/37527)) - -- 为`paddle.optimizer.Momentum`优化器添加 multi_tensor 策略、为`Optimizer`类的`clear_grad`添加`set_to_zero`分支。([#37564](https://github.com/PaddlePaddle/Paddle/pull/37564)) - -- 为`paddle.optimizer.Adam`优化器添加 multi_tensor 策略。([#38010](https://github.com/PaddlePaddle/Paddle/pull/38010)) - -- 为`paddle.optimizer.SGD`优化器添加 multi_precision 策略。([#38231](https://github.com/PaddlePaddle/Paddle/pull/38231)) - -- 为优化器 `state_dict` 方法添加存储 `master weight` 参数。([#39121](https://github.com/PaddlePaddle/Paddle/pull/39121)) - -- 添加支持 op CUDA bfloat16 混合精度训练,支持 O1、O2 模式,通过 `paddle.amp.auto_cast` 可开启上述训练模式。([#39029](https://github.com/PaddlePaddle/Paddle/pull/39029), [#39815](https://github.com/PaddlePaddle/Paddle/pull/39815)) - -- 为如下 ops 添加 bfloat16 CUDA Kernel:matmul、concat、split、dropout、reshape、slice、squeeze、stack、transpose、unbind、elementwise_max、elementwise_add、elementwise_mul、elementwise_sub、scale、sum、layer_norm、p_norm、reduce_sum、softmax、log_softmax、sigmoid、sqrt、softplus、square、gaussian_random、fill_constant、fill_any_like。([#39485](https://github.com/PaddlePaddle/Paddle/pull/39485), [#39380](https://github.com/PaddlePaddle/Paddle/pull/39380), [#39395](https://github.com/PaddlePaddle/Paddle/pull/39380), [#39402](https://github.com/PaddlePaddle/Paddle/pull/39402), [#39457](https://github.com/PaddlePaddle/Paddle/pull/39457), 
[#39461](https://github.com/PaddlePaddle/Paddle/pull/39461), [#39602](https://github.com/PaddlePaddle/Paddle/pull/39602), [#39716](https://github.com/PaddlePaddle/Paddle/pull/39716), [#39683](https://github.com/PaddlePaddle/Paddle/pull/39683), [#39843](https://github.com/PaddlePaddle/Paddle/pull/39843), [#39999](https://github.com/PaddlePaddle/Paddle/pull/39999), [#40004](https://github.com/PaddlePaddle/Paddle/pull/40004), [#40027](https://github.com/PaddlePaddle/Paddle/pull/40027)) - -- 为如下 ops 添加 bfloat16 CPU Kernel:dropout、reshape、slice、squeeze、unsqueeze、stack、transpose、unbind、elementwise_max、elementwise_mul、elementwise_sub、gather。([#39380](https://github.com/PaddlePaddle/Paddle/pull/39380), [#39395](https://github.com/PaddlePaddle/Paddle/pull/39380), [#39402](https://github.com/PaddlePaddle/Paddle/pull/39402), [#39457](https://github.com/PaddlePaddle/Paddle/pull/39457), [#39461](https://github.com/PaddlePaddle/Paddle/pull/39461), [#39602](https://github.com/PaddlePaddle/Paddle/pull/39602), [#39716](https://github.com/PaddlePaddle/Paddle/pull/39716), [#39683](https://github.com/PaddlePaddle/Paddle/pull/39683)) - -- 支持打印 bfloat16 类型的 Tensor。([#39375](https://github.com/PaddlePaddle/Paddle/pull/39375), [#39370](https://github.com/PaddlePaddle/Paddle/pull/39370)) - -- 为`p_norm`、`elementwise_max`、`fill_constant_batch_size_like`、`scatter`增加 FP16 计算支持。([#35888](https://github.com/PaddlePaddle/Paddle/pull/35888), [#39907](https://github.com/PaddlePaddle/Paddle/pull/39907), [#38136](https://github.com/PaddlePaddle/Paddle/pull/38136), [#38499](https://github.com/PaddlePaddle/Paddle/pull/38499)) - -- 为如下 ops 增加 int16_t 支持:cumsum、less_than、less_equal、greater_than、greater_equal、equal、not_equal、fill_any_like、gather_nd、reduce_sum、where_index、reshape、unsqueeze。([#39636](https://github.com/PaddlePaddle/Paddle/pull/39636)) - -- 为 cross_entropy op 增加 int16_t label 类型的支持。([#39409](https://github.com/PaddlePaddle/Paddle/pull/39409)) - -- 为 embedding op 增加 int16_t id 
类型的支持。([#39381](https://github.com/PaddlePaddle/Paddle/pull/39381)) - -- 为 reduce_mean op 增加 FP16 类型的支持。([#38289](https://github.com/PaddlePaddle/Paddle/pull/38289)) - -- 为 elementwise_min op 增加 FP16 类型的支持。([#38123](https://github.com/PaddlePaddle/Paddle/pull/38123)) - -- 更新 bfloat16 AMP oneDNN 默认支持列表。([#39304](https://github.com/PaddlePaddle/Paddle/pull/39304)) - -#### 飞桨高可复用算子库 PHI - -针对飞桨框架原算子库存在的算子接口不清晰、算子复用成本较高、调用性能不够快的问题,我们重构了飞桨框架的算子库,设计了灵活、高效的函数式算子库 PHI,可以通过对函数式算子接口组合调用的方式实现新算子。新算子库提供了 200 余个跟 python 开发接口保持一致的 C++ 运算类 API,以及近 500 个可供组合调用的前、反向函数式算子内核 Kernel,可大幅降低框架原生算子和自定义算子的开发成本。新算子库支持 Primitive API 方式开发算子内核,可支持不同硬件(比如 GPU 和 XPU)的算子内核复用。新算子库支持以插件方式接入硬件(比如 NPU)的加速库,实现低成本复用硬件加速库。主要可分为以下几部分工作: - -- **算子库基础架构、核心组件与机制实现**:合理规划新算子库的目录结构,设计实现了新算子库的公共基础数据结构、新的函数式 InferMeta 和 Kernel 开发范式以及相应的注册和管理组件,并且支持 Kernel 文件的自动化编译对象生成及编译依赖关系生成,使开发者仅需关注函数式 Kernel 的实现,开发范式简洁清晰。([#34425](https://github.com/PaddlePaddle/Paddle/pull/34425), [#37107](https://github.com/PaddlePaddle/Paddle/pull/37107), [#36946](https://github.com/PaddlePaddle/Paddle/pull/36946), [#36948](https://github.com/PaddlePaddle/Paddle/pull/36948), [#37876](https://github.com/PaddlePaddle/Paddle/pull/37876), [#37916](https://github.com/PaddlePaddle/Paddle/pull/37916), [#37977](https://github.com/PaddlePaddle/Paddle/pull/37977), [#38078](https://github.com/PaddlePaddle/Paddle/pull/38078), [#38861](https://github.com/PaddlePaddle/Paddle/pull/38861), [#39123](https://github.com/PaddlePaddle/Paddle/pull/39123), [#39131](https://github.com/PaddlePaddle/Paddle/pull/39131), [#39748](https://github.com/PaddlePaddle/Paddle/pull/39748), [#39790](https://github.com/PaddlePaddle/Paddle/pull/39790), [#39941](https://github.com/PaddlePaddle/Paddle/pull/39941), [#40239](https://github.com/PaddlePaddle/Paddle/pull/40239), [#40635](https://github.com/PaddlePaddle/Paddle/pull/40635), [#41091](https://github.com/PaddlePaddle/Paddle/pull/41091), [#37409](https://github.com/PaddlePaddle/Paddle/pull/37409), 
[#37942](https://github.com/PaddlePaddle/Paddle/pull/37942), [#39002](https://github.com/PaddlePaddle/Paddle/pull/39002), [#38109](https://github.com/PaddlePaddle/Paddle/pull/38109), [#37881](https://github.com/PaddlePaddle/Paddle/pull/37881), [#37517](https://github.com/PaddlePaddle/Paddle/pull/37517), [#39870](https://github.com/PaddlePaddle/Paddle/pull/39870), [#40975](https://github.com/PaddlePaddle/Paddle/pull/40975), [#39475](https://github.com/PaddlePaddle/Paddle/pull/39475), [#37304](https://github.com/PaddlePaddle/Paddle/pull/37304), #36910, #37120, #37146, #37215, #37255, #37369, #38258, #38257, #38355, #38853, #38937, #38977, #38946, #39085, #39153, #39228, #38301, #38275, #38506, #38607, #38473, #38632, #38811, #38880, #38996, #38914, #39101) - -- **算子库 C++ API 体系建设**:设计实现了基于 yaml 配置文件的算子定义范式、自动生成了 200 余个 C++运算类 API,供内外部开发者复用,降低了基础运算的重复开发成本。([#37668](https://github.com/PaddlePaddle/Paddle/pull/37668), [#36938](https://github.com/PaddlePaddle/Paddle/pull/36938), [#38172](https://github.com/PaddlePaddle/Paddle/pull/38172), [#38182](https://github.com/PaddlePaddle/Paddle/pull/38182), [#38311](https://github.com/PaddlePaddle/Paddle/pull/38311), [#38438](https://github.com/PaddlePaddle/Paddle/pull/38438), [#39057](https://github.com/PaddlePaddle/Paddle/pull/39057), [#39229](https://github.com/PaddlePaddle/Paddle/pull/39229), [#39281](https://github.com/PaddlePaddle/Paddle/pull/39281), [#39263](https://github.com/PaddlePaddle/Paddle/pull/39263), [#39408](https://github.com/PaddlePaddle/Paddle/pull/39408), [#39436](https://github.com/PaddlePaddle/Paddle/pull/39436), [#39482](https://github.com/PaddlePaddle/Paddle/pull/39482), [#39497](https://github.com/PaddlePaddle/Paddle/pull/39497), [#39651](https://github.com/PaddlePaddle/Paddle/pull/39651), [#39521](https://github.com/PaddlePaddle/Paddle/pull/39521), [#39760](https://github.com/PaddlePaddle/Paddle/pull/39760), [#40060](https://github.com/PaddlePaddle/Paddle/pull/40060), 
[#40196](https://github.com/PaddlePaddle/Paddle/pull/40196), [#40218](https://github.com/PaddlePaddle/Paddle/pull/40218), [#40640](https://github.com/PaddlePaddle/Paddle/pull/40640), [#40732](https://github.com/PaddlePaddle/Paddle/pull/40732), [#40729](https://github.com/PaddlePaddle/Paddle/pull/40729), [#40840](https://github.com/PaddlePaddle/Paddle/pull/40840), [#40867](https://github.com/PaddlePaddle/Paddle/pull/40867), [#41025](https://github.com/PaddlePaddle/Paddle/pull/41025), [#41368](https://github.com/PaddlePaddle/Paddle/pull/41368)) - -- **算子库兼容各执行体系**:实现新的 InferMeta 及 Kernel 接入原动静态图执行体系、支持原 OpKernel 注册安全移除并迁移为新的 Kernel 形式。([#34425](https://github.com/PaddlePaddle/Paddle/pull/34425), [#38825](https://github.com/PaddlePaddle/Paddle/pull/38825), [#38837](https://github.com/PaddlePaddle/Paddle/pull/38837), [#38842](https://github.com/PaddlePaddle/Paddle/pull/38842), [#38976](https://github.com/PaddlePaddle/Paddle/pull/38976), [#39134](https://github.com/PaddlePaddle/Paddle/pull/39134), [#39140](https://github.com/PaddlePaddle/Paddle/pull/39140), [#39135](https://github.com/PaddlePaddle/Paddle/pull/39135), [#39252](https://github.com/PaddlePaddle/Paddle/pull/39252), [#39222](https://github.com/PaddlePaddle/Paddle/pull/39222), [#39351](https://github.com/PaddlePaddle/Paddle/pull/39351)) - -- **算子库底层数据结构及工具函数与框架解耦**:解除 Phi 在核心数据结构上对 框架的依赖,为后续 Phi 独立编译奠定基础,支持 infrt、自定义 Kernel 等一系列基于 Phi 的建设工作。([#38583](https://github.com/PaddlePaddle/Paddle/pull/38583), [#39188](https://github.com/PaddlePaddle/Paddle/pull/39188), [#39560](https://github.com/PaddlePaddle/Paddle/pull/39560), [#39931](https://github.com/PaddlePaddle/Paddle/pull/39931), [#39169](https://github.com/PaddlePaddle/Paddle/pull/39169), [#38951](https://github.com/PaddlePaddle/Paddle/pull/38951), [#38898](https://github.com/PaddlePaddle/Paddle/pull/38898), [#38873](https://github.com/PaddlePaddle/Paddle/pull/38873), [#38696](https://github.com/PaddlePaddle/Paddle/pull/38696), 
[#38651](https://github.com/PaddlePaddle/Paddle/pull/38651), [#39359](https://github.com/PaddlePaddle/Paddle/pull/39359), [#39305](https://github.com/PaddlePaddle/Paddle/pull/39305), [#39234](https://github.com/PaddlePaddle/Paddle/pull/39234), [#39098](https://github.com/PaddlePaddle/Paddle/pull/39098), [#39120](https://github.com/PaddlePaddle/Paddle/pull/39120), [#38979](https://github.com/PaddlePaddle/Paddle/pull/38979), [#38899](https://github.com/PaddlePaddle/Paddle/pull/38899), [#38844](https://github.com/PaddlePaddle/Paddle/pull/38844), [#39714](https://github.com/PaddlePaddle/Paddle/pull/39714), [#39729](https://github.com/PaddlePaddle/Paddle/pull/39729), [#39889](https://github.com/PaddlePaddle/Paddle/pull/39889), [#39587](https://github.com/PaddlePaddle/Paddle/pull/39587), [#39558](https://github.com/PaddlePaddle/Paddle/pull/39558), [#39514](https://github.com/PaddlePaddle/Paddle/pull/39514), [#39502](https://github.com/PaddlePaddle/Paddle/pull/39502), [#39300](https://github.com/PaddlePaddle/Paddle/pull/39300), [#39246](https://github.com/PaddlePaddle/Paddle/pull/39246), [#39124](https://github.com/PaddlePaddle/Paddle/pull/39124)) - -- **自定义算子机制与 Phi 整合并完善**:支持在自定义算子编写时调用 Phi 自动生成的 200 余个 C++运算类 API,降低自定义算子开发成本,并进行一系列问题修复。([#37122](https://github.com/PaddlePaddle/Paddle/pull/37122), [#37276](https://github.com/PaddlePaddle/Paddle/pull/37276), [#37281](https://github.com/PaddlePaddle/Paddle/pull/37281), [#37262](https://github.com/PaddlePaddle/Paddle/pull/37281), [#37415](https://github.com/PaddlePaddle/Paddle/pull/37415), [#37423](https://github.com/PaddlePaddle/Paddle/pull/37423), [#37583](https://github.com/PaddlePaddle/Paddle/pull/37683), [#38776](https://github.com/PaddlePaddle/Paddle/pull/38776), [#39353](https://github.com/PaddlePaddle/Paddle/pull/39353), [#41072](https://github.com/PaddlePaddle/Paddle/pull/41072)) - -- **算子规模化迁移改写**:迁移了约 250 个高频算子的前、反向算子内核 Kernel 至新算子库,改写为函数式,支持在 C++端通过调用多个基础 Kernel 函数封装,快速组合实现高性能算子;同时,添加相应的 yaml 
算子定义,并接入新动态图执行体系,提升 python API 调度性能。迁移改写的算子包括: - - - sqrt ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - square ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - sin ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - sinh ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - elementwise_fmax ([#40140](https://github.com/PaddlePaddle/Paddle/pull/40140)) - - - elementwise_fmin ([#40140](https://github.com/PaddlePaddle/Paddle/pull/40140)) - - - pool2d ([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - max_pool2d_with_index ([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - pool3d ([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - max_pool3d_with_index ([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - fill_constant ([#36930](https://github.com/PaddlePaddle/Paddle/pull/36930), [#39465](https://github.com/PaddlePaddle/Paddle/pull/39465)) - - - p_norm ([#40819](https://github.com/PaddlePaddle/Paddle/pull/40819)) - - - fill_constant_batch_size_like ([#40784](https://github.com/PaddlePaddle/Paddle/pull/40784)) - - - conv2d ([#39354](https://github.com/PaddlePaddle/Paddle/pull/39354)) - - - conv2d_transpose ([#40675](https://github.com/PaddlePaddle/Paddle/pull/40675), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - conv3d ([#39354](https://github.com/PaddlePaddle/Paddle/pull/39354)) - - - conv3d_transpose ([#40675](https://github.com/PaddlePaddle/Paddle/pull/40675), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - mish ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - gather_nd ([#40090](https://github.com/PaddlePaddle/Paddle/pull/40090), 
[#40043](https://github.com/PaddlePaddle/Paddle/pull/40043)) - - - gather ([#40500](https://github.com/PaddlePaddle/Paddle/pull/40500)) - - - scatter ([#40090](https://github.com/PaddlePaddle/Paddle/pull/40090), [#40043](https://github.com/PaddlePaddle/Paddle/pull/40043)) - - - scatter_nd_add ([#40090](https://github.com/PaddlePaddle/Paddle/pull/40090), [#40043](https://github.com/PaddlePaddle/Paddle/pull/40043)) - - - sgd ([#40045](https://github.com/PaddlePaddle/Paddle/pull/40045)) - - - momentum ([#41319](https://github.com/PaddlePaddle/Paddle/pull/41319)) - - - rmsprop ([#40994](https://github.com/PaddlePaddle/Paddle/pull/40994)) - - - index_sample ([#38130](https://github.com/PaddlePaddle/Paddle/pull/38130), [#38459](https://github.com/PaddlePaddle/Paddle/pull/38459), [#39905](https://github.com/PaddlePaddle/Paddle/pull/39905)) - - - adam ([#40351](https://github.com/PaddlePaddle/Paddle/pull/40351)) - - - layer_norm ([#40193](https://github.com/PaddlePaddle/Paddle/pull/40193)) - - - adagrad ([#40994](https://github.com/PaddlePaddle/Paddle/pull/40994/)) - - - adamax ([#40173](https://github.com/PaddlePaddle/Paddle/pull/40173)) - - - adadelta ([#40173](https://github.com/PaddlePaddle/Paddle/pull/40173)) - - - clip ([#40602](https://github.com/PaddlePaddle/Paddle/pull/40602), [#41661](https://github.com/PaddlePaddle/Paddle/pull/41661), [#41675](https://github.com/PaddlePaddle/Paddle/pull/41675)) - - - ceil ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913)) - - - cos ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - atan ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - cosh ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - erf ([#40388](https://github.com/PaddlePaddle/Paddle/pull/40388)) - - - asin ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - acos ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - scale 
([#39278](https://github.com/PaddlePaddle/Paddle/pull/39278)) - - - elementwise_pow ([#40993](https://github.com/PaddlePaddle/Paddle/pull/40993)) - - - elementwise_sub ([#39225](https://github.com/PaddlePaddle/Paddle/pull/39225), [#37260](https://github.com/PaddlePaddle/Paddle/pull/37260)) - - - round ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913)) - - - floor ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913)) - - - pow ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913)) - - - elementwise_floordiv ([#40993](https://github.com/PaddlePaddle/Paddle/pull/40993)) - - - reciprocal ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - log1p ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785)) - - - allclose ([#40469](https://github.com/PaddlePaddle/Paddle/pull/40469)) - - - mul ([#40833](https://github.com/PaddlePaddle/Paddle/pull/40833)) - - - elementwise_max ([#40590](https://github.com/PaddlePaddle/Paddle/pull/40590)) - - - elementwise_min ([#40590](https://github.com/PaddlePaddle/Paddle/pull/40590)) - - - elementwise_mod ([#40590](https://github.com/PaddlePaddle/Paddle/pull/40590)) - - - elementwise_add ([#39048](https://github.com/PaddlePaddle/Paddle/pull/39048), [#37043](https://github.com/PaddlePaddle/Paddle/pull/37043)) - - - matmul_v2 ([#36844](https://github.com/PaddlePaddle/Paddle/pull/36844), [#38713](https://github.com/PaddlePaddle/Paddle/pull/38713)) - - - elementwise_mul ([#41042](https://github.com/PaddlePaddle/Paddle/pull/41042), [#40252](https://github.com/PaddlePaddle/Paddle/pull/40252), [#37471](https://github.com/PaddlePaddle/Paddle/pull/37471)) - - - elementwise_div ([#40172](https://github.com/PaddlePaddle/Paddle/pull/40172), [#40039](https://github.com/PaddlePaddle/Paddle/pull/40039), [#37418](https://github.com/PaddlePaddle/Paddle/pull/37418)) - - - SelectedRows ([#39037](https://github.com/PaddlePaddle/Paddle/pull/39037), [#39087](https://github.com/PaddlePaddle/Paddle/pull/39087), 
[#39128](https://github.com/PaddlePaddle/Paddle/pull/39128), [#39162](https://github.com/PaddlePaddle/Paddle/pull/39162), [#39236](https://github.com/PaddlePaddle/Paddle/pull/39236)) - - - fill_any_like ([#39807](https://github.com/PaddlePaddle/Paddle/pull/39807)) - - - dot ([#38359](https://github.com/PaddlePaddle/Paddle/pull/38359)) - - - sum ([#40873](https://github.com/PaddlePaddle/Paddle/pull/40873)) - - - cumsum ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200)) - - - diag_v2 ([#39914](https://github.com/PaddlePaddle/Paddle/pull/39914)) - - - auc ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200)) - - - log_loss ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200)) - - - one_hot_v2 ([#39876](https://github.com/PaddlePaddle/Paddle/pull/39876)) - - - sigmoid_cross_entropy_with_logits ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200)) - - - bce_loss ([#39868](https://github.com/PaddlePaddle/Paddle/pull/39868)) - - - argsort ([#40151](https://github.com/PaddlePaddle/Paddle/pull/40151)) - - - arg_max ([#40222](https://github.com/PaddlePaddle/Paddle/pull/40222)) - - - arg_min ([#40222](https://github.com/PaddlePaddle/Paddle/pull/40222)) - - - segment_pool ([#40099](https://github.com/PaddlePaddle/Paddle/pull/40099)) - - - frobenius_norm ([#40707](https://github.com/PaddlePaddle/Paddle/pull/40707), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - dist ([#40178](https://github.com/PaddlePaddle/Paddle/pull/40178)) - - - isnan_v2 ([#40076](https://github.com/PaddlePaddle/Paddle/pull/40076)) - - - logical_and ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942)) - - - logical_not ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942)) - - - isfinite_v2 
([#40076](https://github.com/PaddlePaddle/Paddle/pull/40076)) - - - logical_or ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942)) - - - isinf_v2 ([#40076](https://github.com/PaddlePaddle/Paddle/pull/40076)) - - - is_empty ([#39919](https://github.com/PaddlePaddle/Paddle/pull/39919)) - - - logical_xor ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942)) - - - less_than ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970)) - - - not_equal ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970)) - - - equal ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970)) - - - less_equal ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970)) - - - equal_all ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970)) - - - uniform_random ([#39937](https://github.com/PaddlePaddle/Paddle/pull/39937)) - - - randint ([#39876](https://github.com/PaddlePaddle/Paddle/pull/39876), [#41375](https://github.com/PaddlePaddle/Paddle/pull/41375)) - - - randperm ([#41265](https://github.com/PaddlePaddle/Paddle/pull/41265)) - - - unbind ([#39789](https://github.com/PaddlePaddle/Paddle/pull/39789)) - - - bernoulli ([#39590](https://github.com/PaddlePaddle/Paddle/pull/39590)) - - - increment ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913)) - - - multinomial ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913)) - - - addmm ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913)) - - - cholesky ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913)) - - - where ([#39811](https://github.com/PaddlePaddle/Paddle/pull/39811)) - - - log10 ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785)) - - - log2 ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785)) - - - expm1 
([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - atan2 ([#39806](https://github.com/PaddlePaddle/Paddle/pull/39806)) - - - gaussian_random ([#39932](https://github.com/PaddlePaddle/Paddle/pull/39932), [#40122](https://github.com/PaddlePaddle/Paddle/pull/40122), [#40191](https://github.com/PaddlePaddle/Paddle/pull/40191)) - - - empty ([#38334](https://github.com/PaddlePaddle/Paddle/pull/38334)) - - - truncated_gaussian_random ([#39971](https://github.com/PaddlePaddle/Paddle/pull/39971), [#40191](https://github.com/PaddlePaddle/Paddle/pull/40191)) - - - mv ([#39861](https://github.com/PaddlePaddle/Paddle/pull/39861), [#39954](https://github.com/PaddlePaddle/Paddle/pull/39954)) - - - tan ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - set_value ([#40195](https://github.com/PaddlePaddle/Paddle/pull/40195), [#40478](https://github.com/PaddlePaddle/Paddle/pull/40478), [#40636](https://github.com/PaddlePaddle/Paddle/pull/40636)) - - - bitwise_and ([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031)) - - - bitwise_not ([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031)) - - - bitwise_or ([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031)) - - - poisson ([#39814](https://github.com/PaddlePaddle/Paddle/pull/39814)) - - - cholesky_solve ([#40387](https://github.com/PaddlePaddle/Paddle/pull/40387)) - - - bitwise_xor ([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031)) - - - triangular_solve ([#40417](https://github.com/PaddlePaddle/Paddle/pull/40417)) - - - sigmoid ([#40626](https://github.com/PaddlePaddle/Paddle/pull/40626)) - - - atanh ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - softsign ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - thresholded_relu ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385)) - - - tanh_shrink ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565)) - - - stanh 
([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - reduce_mean ([#37559](https://github.com/PaddlePaddle/Paddle/pull/37559)) - - - reduce_max ([#40225](https://github.com/PaddlePaddle/Paddle/pull/40225)) - - - reduce_min ([#40374](https://github.com/PaddlePaddle/Paddle/pull/40374)) - - - mean ([#40872](https://github.com/PaddlePaddle/Paddle/pull/40872), [#41319](https://github.com/PaddlePaddle/Paddle/pull/41319)) - - - reduce_all ([#40374](https://github.com/PaddlePaddle/Paddle/pull/40374)) - - - reduce_any ([#40374](https://github.com/PaddlePaddle/Paddle/pull/40374)) - - - logsumexp ([#40790](https://github.com/PaddlePaddle/Paddle/pull/40790)) - - - softshrink ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565)) - - - range ([#41265](https://github.com/PaddlePaddle/Paddle/pull/41265), [#40581](https://github.com/PaddlePaddle/Paddle/pull/40851)) - - - stack ([#40581](https://github.com/PaddlePaddle/Paddle/pull/40851)) - - - tile ([#40371](https://github.com/PaddlePaddle/Paddle/pull/40371)) - - - unique ([#40581](https://github.com/PaddlePaddle/Paddle/pull/40851)) - - - unstack ([#40581](https://github.com/PaddlePaddle/Paddle/pull/40851)) - - - slice ([#40736](https://github.com/PaddlePaddle/Paddle/pull/40736)) - - - transpose2 ([#39327](https://github.com/PaddlePaddle/Paddle/pull/39327)) - - - unsqueeze2 ([#40596](https://github.com/PaddlePaddle/Paddle/pull/40596)) - - - squeeze2 ([#40596](https://github.com/PaddlePaddle/Paddle/pull/40596)) - - - strided_slice ([#40708](https://github.com/PaddlePaddle/Paddle/pull/40708)) - - - softmax ([#39547](https://github.com/PaddlePaddle/Paddle/pull/39547)) - - - leaky_relu ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385)) - - - gelu ([#40393](https://github.com/PaddlePaddle/Paddle/pull/40393)) - - - prelu ([#40393](https://github.com/PaddlePaddle/Paddle/pull/40393)) - - - log_softmax ([#40393](https://github.com/PaddlePaddle/Paddle/pull/40393)) - - - elu 
([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565)) - - - logsigmoid ([#40626](https://github.com/PaddlePaddle/Paddle/pull/40626)) - - - psroi_pool ([#40353](https://github.com/PaddlePaddle/Paddle/pull/40353), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173)) - - - kthvalue ([#40575](https://github.com/PaddlePaddle/Paddle/pull/40575)) - - - mode ([#40571](https://github.com/PaddlePaddle/Paddle/pull/40571)) - - - yolo_box ([#40112](https://github.com/PaddlePaddle/Paddle/pull/40112)) - - - yolov3_loss ([#40944](https://github.com/PaddlePaddle/Paddle/pull/40944)) - - - temporal_shift ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - depthwise_conv2d ([#39354](https://github.com/PaddlePaddle/Paddle/pull/39354)) - - - pad3d ([#40701](https://github.com/PaddlePaddle/Paddle/pull/40701)) - - - pad ([#40012](https://github.com/PaddlePaddle/Paddle/pull/40012)) - - - greater_equal ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970)) - - - kldiv_loss ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770)) - - - isclose ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770)) - - - silu ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565)) - - - unfold ([#39778](https://github.com/PaddlePaddle/Paddle/pull/39778)) - - - batch_norm ([#39347](https://github.com/PaddlePaddle/Paddle/pull/39347)) - - - norm ([#39324](https://github.com/PaddlePaddle/Paddle/pull/39324)) - - - roi_pool ([#40574](https://github.com/PaddlePaddle/Paddle/pull/40574), [#40682](https://github.com/PaddlePaddle/Paddle/pull/40682), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173)) - - - roi_align ([#40382](https://github.com/PaddlePaddle/Paddle/pull/40382), [#40556](https://github.com/PaddlePaddle/Paddle/pull/40556), [#41402](https://github.com/PaddlePaddle/Paddle/pull/41402)) - - - deformable_conv ([#40700](https://github.com/PaddlePaddle/Paddle/pull/40700), [#40794](https://github.com/PaddlePaddle/Paddle/pull/40794),
[#41644](https://github.com/PaddlePaddle/Paddle/pull/41644)) - - - deformable_conv_v1 ([#40794](https://github.com/PaddlePaddle/Paddle/pull/40794), [#41644](https://github.com/PaddlePaddle/Paddle/pull/41644)) - - - label_smooth ([#39796](https://github.com/PaddlePaddle/Paddle/pull/39796)) - - - grid_sampler ([#40585](https://github.com/PaddlePaddle/Paddle/pull/40585)) - - - greater_than ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970)) - - - pixel_shuffle ([#39949](https://github.com/PaddlePaddle/Paddle/pull/39949), [#39712](https://github.com/PaddlePaddle/Paddle/pull/39712)) - - - nearest_interp_v2 ([#40855](https://github.com/PaddlePaddle/Paddle/pull/40855)) - - - bilinear_interp_v2 ([#40855](https://github.com/PaddlePaddle/Paddle/pull/40855)) - - - softmax_with_cross_entropy ([#40832](https://github.com/PaddlePaddle/Paddle/pull/40832)) - - - rnn ([#41007](https://github.com/PaddlePaddle/Paddle/pull/41007)) - - - reverse ([#40791](https://github.com/PaddlePaddle/Paddle/pull/40791)) - - - trace ([#39510](https://github.com/PaddlePaddle/Paddle/pull/39510)) - - - kron ([#40427](https://github.com/PaddlePaddle/Paddle/pull/40427)) - - - accuracy ([#39982](https://github.com/PaddlePaddle/Paddle/pull/39982)) - - - gather_tree ([#40082](https://github.com/PaddlePaddle/Paddle/pull/40082), [#39844](https://github.com/PaddlePaddle/Paddle/pull/39844)) - - - dropout ([#40148](https://github.com/PaddlePaddle/Paddle/pull/40148)) - - - bincount ([#39947](https://github.com/PaddlePaddle/Paddle/pull/39947)) - - - warpctc ([#41389](https://github.com/PaddlePaddle/Paddle/pull/41389), [#40023](https://github.com/PaddlePaddle/Paddle/pull/40023)) - - - multiplex ([#40007](https://github.com/PaddlePaddle/Paddle/pull/40007), [#40102](https://github.com/PaddlePaddle/Paddle/pull/40102)) - - - qr ([#40007](https://github.com/PaddlePaddle/Paddle/pull/40007)) - - - assign_value
([#40967](https://github.com/PaddlePaddle/Paddle/pull/40967)) - - - assign ([#40022](https://github.com/PaddlePaddle/Paddle/pull/40022)) - - - cast ([#37610](https://github.com/PaddlePaddle/Paddle/pull/37610)) - - - tril_triu ([#40007](https://github.com/PaddlePaddle/Paddle/pull/40007), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - where_index ([#40255](https://github.com/PaddlePaddle/Paddle/pull/40255)) - - - index_select ([#40260](https://github.com/PaddlePaddle/Paddle/pull/40260), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - roll ([#40257](https://github.com/PaddlePaddle/Paddle/pull/40257), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - cumprod ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770)) - - - shard_index ([#40254](https://github.com/PaddlePaddle/Paddle/pull/40254)) - - - reshape2 ([#40914](https://github.com/PaddlePaddle/Paddle/pull/40914), [#39631](https://github.com/PaddlePaddle/Paddle/pull/39631), [#38833](https://github.com/PaddlePaddle/Paddle/pull/38833), [#37164](https://github.com/PaddlePaddle/Paddle/pull/37164)) - - - flip ([#39822](https://github.com/PaddlePaddle/Paddle/pull/39822), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974)) - - - eye ([#39712](https://github.com/PaddlePaddle/Paddle/pull/39712), [#40105](https://github.com/PaddlePaddle/Paddle/pull/40105), [#41476](https://github.com/PaddlePaddle/Paddle/pull/41476)) - - - lookup_table_v2 ([#39901](https://github.com/PaddlePaddle/Paddle/pull/39901)) - - - searchsorted ([#40520](https://github.com/PaddlePaddle/Paddle/pull/40520), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - adamw ([#40351](https://github.com/PaddlePaddle/Paddle/pull/40351)) - - - tanh ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385)) - - - cross ([#39829](https://github.com/PaddlePaddle/Paddle/pull/39829)) - - - concat ([#38955](https://github.com/PaddlePaddle/Paddle/pull/38955),
[#41112](https://github.com/PaddlePaddle/Paddle/pull/41112)) - - - split ([#39060](https://github.com/PaddlePaddle/Paddle/pull/39060)) - - - linspace ([#40124](https://github.com/PaddlePaddle/Paddle/pull/40124)) - - - huber_loss ([#39761](https://github.com/PaddlePaddle/Paddle/pull/39761)) - - - hierarchical_sigmoid ([#40553](https://github.com/PaddlePaddle/Paddle/pull/40553)) - - - nll_loss ([#39936](https://github.com/PaddlePaddle/Paddle/pull/39936)) - - - graph_send_recv ([#40092](https://github.com/PaddlePaddle/Paddle/pull/40092), [#40320](https://github.com/PaddlePaddle/Paddle/pull/40320)) - - - abs ([#39492](https://github.com/PaddlePaddle/Paddle/pull/39492), [#39762](https://github.com/PaddlePaddle/Paddle/pull/39762)) - - - exp ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - rsqrt ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - viterbi_decode ([#40186](https://github.com/PaddlePaddle/Paddle/pull/40186)) - - - conj ([#38247](https://github.com/PaddlePaddle/Paddle/pull/38247)) - - - real ([#39777](https://github.com/PaddlePaddle/Paddle/pull/39777), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173)) - - - imag ([#39777](https://github.com/PaddlePaddle/Paddle/pull/39777), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173)) - - - take_along_axis ([#39959](https://github.com/PaddlePaddle/Paddle/pull/39959), [#40270](https://github.com/PaddlePaddle/Paddle/pull/40270), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974)) - - - put_along_axis ([#39959](https://github.com/PaddlePaddle/Paddle/pull/39959), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974)) - - - lgamma ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770)) - - - relu ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - maxout ([#39959](https://github.com/PaddlePaddle/Paddle/pull/39959), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974)) - - - log
([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785)) - - - bilinear_tensor_product ([#39903](https://github.com/PaddlePaddle/Paddle/pull/39903)) - - - flatten_contiguous_range ([#38712](https://github.com/PaddlePaddle/Paddle/pull/38712), [#36957](https://github.com/PaddlePaddle/Paddle/pull/36957), [#41345](https://github.com/PaddlePaddle/Paddle/pull/41345)) - - - matrix_rank ([#40074](https://github.com/PaddlePaddle/Paddle/pull/40074), [#40519](https://github.com/PaddlePaddle/Paddle/pull/40519), [#41466](https://github.com/PaddlePaddle/Paddle/pull/41466)) - - - logit ([#37844](https://github.com/PaddlePaddle/Paddle/pull/37844)) - - - lerp ([#40105](https://github.com/PaddlePaddle/Paddle/pull/40105), [#39524](https://github.com/PaddlePaddle/Paddle/pull/39524)) - - - erfinv ([#39949](https://github.com/PaddlePaddle/Paddle/pull/39949), [#39712](https://github.com/PaddlePaddle/Paddle/pull/39712)) - - - broadcast_tensors ([#40047](https://github.com/PaddlePaddle/Paddle/pull/40047)) - - - gumbel_softmax ([#39873](https://github.com/PaddlePaddle/Paddle/pull/39873)) - - - diagonal ([#39575](https://github.com/PaddlePaddle/Paddle/pull/39575)) - - - trunc ([#39543](https://github.com/PaddlePaddle/Paddle/pull/39543), [#39772](https://github.com/PaddlePaddle/Paddle/pull/39772)) - - - multi_dot ([#40038](https://github.com/PaddlePaddle/Paddle/pull/40038)) - - - matrix_power ([#40231](https://github.com/PaddlePaddle/Paddle/pull/40231)) - - - digamma ([#39240](https://github.com/PaddlePaddle/Paddle/pull/39240)) - - - masked_select ([#39193](https://github.com/PaddlePaddle/Paddle/pull/39193)) - - - determinant ([#40539](https://github.com/PaddlePaddle/Paddle/pull/40539)) - - - eigh ([#40213](https://github.com/PaddlePaddle/Paddle/pull/40213)) - - - size ([#39949](https://github.com/PaddlePaddle/Paddle/pull/39949), [#39712](https://github.com/PaddlePaddle/Paddle/pull/39712)) - - - shape ([#40248](https://github.com/PaddlePaddle/Paddle/pull/40248)) - - - reduce_sum 
([#37559](https://github.com/PaddlePaddle/Paddle/pull/37559), [#41295](https://github.com/PaddlePaddle/Paddle/pull/41295)) - - - reduce_prod ([#39844](https://github.com/PaddlePaddle/Paddle/pull/39844)) - - - histogram ([#39496](https://github.com/PaddlePaddle/Paddle/pull/39496)) - - - meshgrid ([#41411](https://github.com/PaddlePaddle/Paddle/pull/41411)) - - - brelu ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385)) - - - hard_swish ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913)) - - - hard_shrink ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565)) - - - selu ([#39819](https://github.com/PaddlePaddle/Paddle/pull/39819)) - - - expand_v2 ([#39471](https://github.com/PaddlePaddle/Paddle/pull/39471)) - - - top_k_v2 ([#40064](https://github.com/PaddlePaddle/Paddle/pull/40064)) - - - expand_as_v2 ([#40373](https://github.com/PaddlePaddle/Paddle/pull/40373)) - - - swish ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913)) - - - hard_sigmoid ([#40626](https://github.com/PaddlePaddle/Paddle/pull/40626)) - - - exp, det, assign, gaussian_random, matrix_rank, eye, deformable_conv ([#41755](https://github.com/PaddlePaddle/Paddle/pull/41755), [#41737](https://github.com/PaddlePaddle/Paddle/pull/41737)) - -#### New dynamic graph execution mechanism - -To address the poor scheduling performance and limited support for secondary development in PaddlePaddle's original dynamic graph execution mechanism, we refactored the underlying execution mechanism of the dynamic graph. With the new call-and-execute path, combined with efficient runtime execution in the Phi operator library, operators supported by Phi see a considerable improvement in scheduling performance after switching to the new dynamic graph mode. However, because upgrading the execution mechanism of the whole framework is a very large effort that is tightly coupled with the Phi operator library work, this mode is still not enabled by default in this release. To try it, set the environment variable `FLAGS_enable_eager_mode=1`. Details are as follows: - -- **Infrastructure, core components, and mechanism implementation of the new dynamic graph execution mechanism**: the dynamic graph execution code is made static, turning the previously homogeneous operator construction into specialized calls for individual Phi APIs, which greatly reduces scheduling overhead. ([#36059](https://github.com/PaddlePaddle/Paddle/pull/36059), [#37323](https://github.com/PaddlePaddle/Paddle/pull/37323), [#37556](https://github.com/PaddlePaddle/Paddle/pull/37556), [#37555](https://github.com/PaddlePaddle/Paddle/pull/37555), [#37478](https://github.com/PaddlePaddle/Paddle/pull/37478),
[#37458](https://github.com/PaddlePaddle/Paddle/pull/37458), [#37479](https://github.com/PaddlePaddle/Paddle/pull/37479), [#37599](https://github.com/PaddlePaddle/Paddle/pull/37599), [#37659](https://github.com/PaddlePaddle/Paddle/pull/37659), [#37654](https://github.com/PaddlePaddle/Paddle/pull/37654), [#39200](https://github.com/PaddlePaddle/Paddle/pull/39200), [#39309](https://github.com/PaddlePaddle/Paddle/pull/39309), [#39319](https://github.com/PaddlePaddle/Paddle/pull/39319), [#39414](https://github.com/PaddlePaddle/Paddle/pull/39414), [#39504](https://github.com/PaddlePaddle/Paddle/pull/39504), [#39526](https://github.com/PaddlePaddle/Paddle/pull/39526), [#39878](https://github.com/PaddlePaddle/Paddle/pull/39878), [#39963](https://github.com/PaddlePaddle/Paddle/pull/39963)) - -- **Development and adaptation of sub-features for the new dynamic graph execution mechanism**: supports more flexible and more complete dynamic graph sub-features such as hook, pylayer, double_grad, inplace, and amp. ([#41396](https://github.com/PaddlePaddle/Paddle/pull/41396), [#40400](https://github.com/PaddlePaddle/Paddle/pull/40400), [#40695](https://github.com/PaddlePaddle/Paddle/pull/40695), [#41043](https://github.com/PaddlePaddle/Paddle/pull/41043), [#40915](https://github.com/PaddlePaddle/Paddle/pull/40915), [#41104](https://github.com/PaddlePaddle/Paddle/pull/41104), [#41350](https://github.com/PaddlePaddle/Paddle/pull/41350), [#41209](https://github.com/PaddlePaddle/Paddle/pull/41209), [#40830](https://github.com/PaddlePaddle/Paddle/pull/40830), [#40891](https://github.com/PaddlePaddle/Paddle/pull/40891), [#36814](https://github.com/PaddlePaddle/Paddle/pull/36814), [#37377](https://github.com/PaddlePaddle/Paddle/pull/37377), [#37193](https://github.com/PaddlePaddle/Paddle/pull/37193), [#36965](https://github.com/PaddlePaddle/Paddle/pull/36965), [#37810](https://github.com/PaddlePaddle/Paddle/pull/37810), [#36837](https://github.com/PaddlePaddle/Paddle/pull/36837), [#38488](https://github.com/PaddlePaddle/Paddle/pull/38488), [#39282](https://github.com/PaddlePaddle/Paddle/pull/39282),
[#39449](https://github.com/PaddlePaddle/Paddle/pull/39449), [#39531](https://github.com/PaddlePaddle/Paddle/pull/39531), [#39638](https://github.com/PaddlePaddle/Paddle/pull/39638), [#39674](https://github.com/PaddlePaddle/Paddle/pull/39674), [#39893](https://github.com/PaddlePaddle/Paddle/pull/39893), [#40170](https://github.com/PaddlePaddle/Paddle/pull/40170), [#40693](https://github.com/PaddlePaddle/Paddle/pull/40693), [#40937](https://github.com/PaddlePaddle/Paddle/pull/40937), [#41016](https://github.com/PaddlePaddle/Paddle/pull/41016), [#41051](https://github.com/PaddlePaddle/Paddle/pull/41051), [#41121](https://github.com/PaddlePaddle/Paddle/pull/41121), [#41198](https://github.com/PaddlePaddle/Paddle/pull/41198), [#41287](https://github.com/PaddlePaddle/Paddle/pull/41287), [#41380](https://github.com/PaddlePaddle/Paddle/pull/41380), [#41306](https://github.com/PaddlePaddle/Paddle/pull/41306), [#41387](https://github.com/PaddlePaddle/Paddle/pull/41387), [#40623](https://github.com/PaddlePaddle/Paddle/pull/40623), [#40945](https://github.com/PaddlePaddle/Paddle/pull/40945), [#39282](https://github.com/PaddlePaddle/Paddle/pull/39282), [#39449](https://github.com/PaddlePaddle/Paddle/pull/39449), [#38488](https://github.com/PaddlePaddle/Paddle/pull/38488)) - -- **Automatic code generation for the new dynamic graph execution mechanism**: splitting the computation and scheduling logic of a large number of homogeneous operators into specialized scheduling logic turned out to be a very large task, so we introduced a new automatic code generation step that produces this code and simplifies the dynamic graph runtime logic. To stay compatible with the various runtime logic already in the framework, we also use some more involved compilation techniques to obtain information at runtime and generate more accurate scheduling code. ([#37574](https://github.com/PaddlePaddle/Paddle/pull/37574), [#37575](https://github.com/PaddlePaddle/Paddle/pull/37575), [#37639](https://github.com/PaddlePaddle/Paddle/pull/37639), [#37723](https://github.com/PaddlePaddle/Paddle/pull/37723), [#37753](https://github.com/PaddlePaddle/Paddle/pull/37753), [#37812](https://github.com/PaddlePaddle/Paddle/pull/37812), [#37837](https://github.com/PaddlePaddle/Paddle/pull/37837), [#37910](https://github.com/PaddlePaddle/Paddle/pull/37910),
[#37943](https://github.com/PaddlePaddle/Paddle/pull/37943), [#37992](https://github.com/PaddlePaddle/Paddle/pull/37992), [#37959](https://github.com/PaddlePaddle/Paddle/pull/37959), [#38017](https://github.com/PaddlePaddle/Paddle/pull/38017), [#37969](https://github.com/PaddlePaddle/Paddle/pull/37969), [#38160](https://github.com/PaddlePaddle/Paddle/pull/38160), [#38085](https://github.com/PaddlePaddle/Paddle/pull/38085), [#38562](https://github.com/PaddlePaddle/Paddle/pull/38562), [#38573](https://github.com/PaddlePaddle/Paddle/pull/38573), [#39192](https://github.com/PaddlePaddle/Paddle/pull/39192), [#39215](https://github.com/PaddlePaddle/Paddle/pull/39215), [#39355](https://github.com/PaddlePaddle/Paddle/pull/39355), [#39358](https://github.com/PaddlePaddle/Paddle/pull/39358), [#39328](https://github.com/PaddlePaddle/Paddle/pull/39328), [#39233](https://github.com/PaddlePaddle/Paddle/pull/39233), [#39628](https://github.com/PaddlePaddle/Paddle/pull/39628), [#39767](https://github.com/PaddlePaddle/Paddle/pull/39767), [#39743](https://github.com/PaddlePaddle/Paddle/pull/39743), [#39897](https://github.com/PaddlePaddle/Paddle/pull/39897), [#39797](https://github.com/PaddlePaddle/Paddle/pull/39797), [#39997](https://github.com/PaddlePaddle/Paddle/pull/39997), [#40058](https://github.com/PaddlePaddle/Paddle/pull/40058), [#40080](https://github.com/PaddlePaddle/Paddle/pull/40080), [#40107](https://github.com/PaddlePaddle/Paddle/pull/40107), [#39962](https://github.com/PaddlePaddle/Paddle/pull/39962), [#40132](https://github.com/PaddlePaddle/Paddle/pull/40132), [#40276](https://github.com/PaddlePaddle/Paddle/pull/40276), [#40266](https://github.com/PaddlePaddle/Paddle/pull/40266), [#40480](https://github.com/PaddlePaddle/Paddle/pull/40480), [#40482](https://github.com/PaddlePaddle/Paddle/pull/40482), [#40368](https://github.com/PaddlePaddle/Paddle/pull/40368), [#40650](https://github.com/PaddlePaddle/Paddle/pull/40650), 
[#40815](https://github.com/PaddlePaddle/Paddle/pull/40815), [#40907](https://github.com/PaddlePaddle/Paddle/pull/40907), [#40935](https://github.com/PaddlePaddle/Paddle/pull/40935), [#41089](https://github.com/PaddlePaddle/Paddle/pull/41089)) - -- **Integration of the new dynamic graph execution mechanism into the main framework and joint debugging**: environment variables are currently used to distinguish static graph mode from dynamic graph mode (covering both the new and the legacy dynamic graph modes). Most dynamic graph logic has been adapted under these modes, but a large number of issues are still being fixed. ([#37638](https://github.com/PaddlePaddle/Paddle/pull/37638), [#37643](https://github.com/PaddlePaddle/Paddle/pull/37643), [#37653](https://github.com/PaddlePaddle/Paddle/pull/37653), [#38314](https://github.com/PaddlePaddle/Paddle/pull/38314), [#38337](https://github.com/PaddlePaddle/Paddle/pull/38337), [#38338](https://github.com/PaddlePaddle/Paddle/pull/38338), [#39164](https://github.com/PaddlePaddle/Paddle/pull/39164), [#39326](https://github.com/PaddlePaddle/Paddle/pull/39326), [#40391](https://github.com/PaddlePaddle/Paddle/pull/40391), [#40201](https://github.com/PaddlePaddle/Paddle/pull/40201), [#40854](https://github.com/PaddlePaddle/Paddle/pull/40854), [#40887](https://github.com/PaddlePaddle/Paddle/pull/40887)) - -- **Updated some mode-checking logic under dynamic graph to support a fast dynamic graph execution path in the compatibility state**: ([#40786](https://github.com/PaddlePaddle/Paddle/pull/40786)) - - - Non-static-graph mode (the current transitional solution): `_non_static_mode()`. - - - In dynamic graph mode and running the new dynamic graph (the recommended check): `_in_dygraph_mode()`. - - - In dynamic graph mode and running the legacy dynamic graph (not recommended; will be deprecated in a future release): `_in_legacy_dygraph()`. - - - In dynamic graph mode, enable the legacy dynamic graph and disable the new one: `_enable_legacy_dygraph()`, or exit `_test_eager_guard()`. - - - In dynamic graph mode, enable the new dynamic graph and disable the legacy one: `_disable_legacy_dygraph()`, or `with _test_eager_guard()`. - - - In static or dynamic graph mode, check whether the new dynamic graph is in use: `_in_eager_without_dygraph_check()`. - -- **Inplace strategy supported after the dynamic graph refactoring**: the input and output are the same Tensor. - - - Adapted the inplace strategy for the intermediate state of the dynamic graph refactoring. ([#40400](https://github.com/PaddlePaddle/Paddle/pull/40400)) - - - Adapted the inplace strategy for the final state of the dynamic graph refactoring. ([#40695](https://github.com/PaddlePaddle/Paddle/pull/40695)) - - - After the refactoring, added the inplace strategy to the PyLayer feature. ([#41043](https://github.com/PaddlePaddle/Paddle/pull/41043)) - - - After the refactoring, added the inplace strategy to the setitem feature of Tensor.
([#40915](https://github.com/PaddlePaddle/Paddle/pull/40915)) - - - After the refactoring, added the `_reset_grad_inplace_version` interface, which sets the inplace version of a Tensor's gradient to 0. ([#41101](https://github.com/PaddlePaddle/Paddle/pull/41101)) - - - If the value of a forward Tensor is not needed during backward computation (the no need buffer attribute), its inplace version does not need to be checked, so the inplace version check is skipped for Tensors marked no_need_buffer. ([#41350](https://github.com/PaddlePaddle/Paddle/pull/41350)) - - - Unified the error messages of the inplace version check before and after the dynamic graph refactoring. ([#41209](https://github.com/PaddlePaddle/Paddle/pull/41209)) - -- **View strategy supported after the dynamic graph refactoring**: the input and output Tensors share the underlying data. - - - Adapted the view mechanism for the intermediate state of the dynamic graph refactoring, covering the `reshape`, `squeeze`, `unsqueeze`, and `flatten` APIs. ([#40830](https://github.com/PaddlePaddle/Paddle/pull/40830)) - - - Adapted the view mechanism for the final state of the dynamic graph refactoring, covering the `reshape` API. ([#40891](https://github.com/PaddlePaddle/Paddle/pull/40891)) - -- **Added Python-side weakref support for the eager Tensor of the new dynamic graph**. ([#41797](https://github.com/PaddlePaddle/Paddle/pull/41797)) - -- **Enhanced DoubleGrad in the new dynamic graph**, supporting the basic DoubleGrad functionality. ([#41893](https://github.com/PaddlePaddle/Paddle/pull/41893), [#41894](https://github.com/PaddlePaddle/Paddle/pull/41894), [#41895](https://github.com/PaddlePaddle/Paddle/pull/41895)) - -- **Added the `core.eager.StringTensor` interface**, supporting construction of a StringTensor on the Python side and use of the StringTensor-related APIs. ([#41039](https://github.com/PaddlePaddle/Paddle/pull/41039)) - -- **Added the `_grad_name` and `_grad_value` APIs to `core.eager.Tensor`**, which return the name and value of the gradient. ([#41990](https://github.com/PaddlePaddle/Paddle/pull/41990)) - -- **Added handling of the no_need_buffer attribute for the intermediate state of the dynamic graph**. In the inplace backward check, Tensors with the no_need_buffer attribute are skipped. ([#41720](https://github.com/PaddlePaddle/Paddle/pull/41720)) - - -#### New static graph executor -To address the unsatisfactory scheduling performance of PaddlePaddle's original static graph executor in some scenarios, and the difficulty of extending it to multiple streams, we implemented a new static graph executor with better performance and easier extensibility, which makes full use of asynchronous scheduling across multiple streams and threads. The new executor is a compatible upgrade of the original one. It is already used by default in single-machine single-card scenarios, so users get it automatically without changing their training code. A way to switch back is also provided: set the environment variable `FLAGS_USE_STANDALONE_EXECUTOR=false` to return to the original executor. ([#41179](https://github.com/PaddlePaddle/Paddle/pull/41179)) The main changes are as follows: - -- Basic components: a high-performance thread pool for multi-threaded operator scheduling in the executor
([#35470](https://github.com/PaddlePaddle/Paddle/pull/35470), [#35930](https://github.com/PaddlePaddle/Paddle/pull/35930), [#36030](https://github.com/PaddlePaddle/Paddle/pull/36030), [#36480](https://github.com/PaddlePaddle/Paddle/pull/36480), [#36688](https://github.com/PaddlePaddle/Paddle/pull/36688), [#36740](https://github.com/PaddlePaddle/Paddle/pull/36740), [#38335](https://github.com/PaddlePaddle/Paddle/pull/38335), [#40770](https://github.com/PaddlePaddle/Paddle/pull/40770)) and thread coordination components ([#38779](https://github.com/PaddlePaddle/Paddle/pull/38779), [#40876](https://github.com/PaddlePaddle/Paddle/pull/40876), [#40912](https://github.com/PaddlePaddle/Paddle/pull/40912)), timely GPU memory reclamation after operator execution ([#37642](https://github.com/PaddlePaddle/Paddle/pull/37642), [#39617](https://github.com/PaddlePaddle/Paddle/pull/39617), [#40859](https://github.com/PaddlePaddle/Paddle/pull/40859)), a new dependency analysis algorithm for the parallel executor ([#37231](https://github.com/PaddlePaddle/Paddle/pull/37231)), and so on. - -- Scheduling logic: optimizes how operators are scheduled in the executor, supports a multi-stream, multi-threaded asynchronous scheduling mechanism, turns data type, device, and layout transforms into scheduled operators to improve performance, supports caching the selected operator kernel, and supports selecting the new Phi operators. ([#35024](https://github.com/PaddlePaddle/Paddle/pull/35024), [#34922](https://github.com/PaddlePaddle/Paddle/pull/34922), [#35711](https://github.com/PaddlePaddle/Paddle/pull/35711), [#35928](https://github.com/PaddlePaddle/Paddle/pull/35928), [#39458](https://github.com/PaddlePaddle/Paddle/pull/39458), [#36899](https://github.com/PaddlePaddle/Paddle/pull/36899)) - -- Interface compatibility: compatible with the user interfaces and functionality of the original executor, such as aligning with the Python-side `Executor.run()` and supporting Tensor management in Scope, so that users can switch to the new executor transparently. ([#37278](https://github.com/PaddlePaddle/Paddle/pull/37278), [#37379](https://github.com/PaddlePaddle/Paddle/pull/37379), [#37445](https://github.com/PaddlePaddle/Paddle/pull/37445), [#37510](https://github.com/PaddlePaddle/Paddle/pull/37510), [#40955](https://github.com/PaddlePaddle/Paddle/pull/40955), [#41778](https://github.com/PaddlePaddle/Paddle/pull/41178), [#41058](https://github.com/PaddlePaddle/Paddle/pull/41058),
[#38584](https://github.com/PaddlePaddle/Paddle/pull/38584), [#37957](https://github.com/PaddlePaddle/Paddle/pull/37957), [#37672](https://github.com/PaddlePaddle/Paddle/pull/37672), [#37474](https://github.com/PaddlePaddle/Paddle/pull/37474), [#37085](https://github.com/PaddlePaddle/Paddle/pull/37085), [#37061](https://github.com/PaddlePaddle/Paddle/pull/37061), [#36945](https://github.com/PaddlePaddle/Paddle/pull/36945)) - -- Enhanced debugging and error reporting in multi-threaded scenarios: errors raised in worker threads are captured and rethrown from the main thread, improving the user experience. ([#36692](https://github.com/PaddlePaddle/Paddle/pull/36692), [#36802](https://github.com/PaddlePaddle/Paddle/pull/36802)) - -- Fixed an issue where the communication stream of the new executor reset the stream cache information in the Allocator, reducing RecordStream overhead in cross-stream scenarios; after this optimization, DeepFM model performance improves by about 8%. ([#42046](https://github.com/PaddlePaddle/Paddle/pull/42046)) - -- Optimized the dependency analysis between operators in the new executor to improve runtime performance; established correct dependencies for the send/recv communication operators to support pipeline parallelism. ([#42009](https://github.com/PaddlePaddle/Paddle/pull/42009)) - - -#### Distributed training - -- Basic functions of multi-machine multi-card training with collective communication - - - Added elastic training (covering node failure, scale-out, and scale-in) to improve the fault tolerance of distributed training. ([#36684](https://github.com/PaddlePaddle/Paddle/pull/36684), [#37177](https://github.com/PaddlePaddle/Paddle/pull/37177), [#37781](https://github.com/PaddlePaddle/Paddle/pull/37781)) - - - Launch module: refactored, and added `master` coordination and the node count definition `nnodes`, making distributed launch easier to use. ([#40086](https://github.com/PaddlePaddle/Paddle/pull/40086), [#40568](https://github.com/PaddlePaddle/Paddle/pull/40568), [#40782](https://github.com/PaddlePaddle/Paddle/pull/40782), [#40844](https://github.com/PaddlePaddle/Paddle/pull/40844), [#40936](https://github.com/PaddlePaddle/Paddle/pull/40936), [#41190](https://github.com/PaddlePaddle/Paddle/pull/41190), [#41314](https://github.com/PaddlePaddle/Paddle/pull/41314)) - - - Added support for heterogeneous training across GPU, NPU, and XPU hardware. ([#37613](https://github.com/PaddlePaddle/Paddle/pull/37613), [#37998](https://github.com/PaddlePaddle/Paddle/pull/37998)) - - - Added the fleet_executor asynchronous pipeline executor. ([#36966](https://github.com/PaddlePaddle/Paddle/pull/36966), [#37049](https://github.com/PaddlePaddle/Paddle/pull/37049),
[#37087](https://github.com/PaddlePaddle/Paddle/pull/37087), [#37126](https://github.com/PaddlePaddle/Paddle/pull/37126), [#37150](https://github.com/PaddlePaddle/Paddle/pull/37150), [#37203](https://github.com/PaddlePaddle/Paddle/pull/37203), [#37167](https://github.com/PaddlePaddle/Paddle/pull/37167), [#37282](https://github.com/PaddlePaddle/Paddle/pull/37282), [#37319](https://github.com/PaddlePaddle/Paddle/pull/37319), [#37462](https://github.com/PaddlePaddle/Paddle/pull/37462), [#37507](https://github.com/PaddlePaddle/Paddle/pull/37507), [#37533](https://github.com/PaddlePaddle/Paddle/pull/37533), [#37576](https://github.com/PaddlePaddle/Paddle/pull/37576), [#37605](https://github.com/PaddlePaddle/Paddle/pull/37605), [#37691](https://github.com/PaddlePaddle/Paddle/pull/37691), [#37742](https://github.com/PaddlePaddle/Paddle/pull/37742), [#37783](https://github.com/PaddlePaddle/Paddle/pull/37783), [#37809](https://github.com/PaddlePaddle/Paddle/pull/37809), [#37862](https://github.com/PaddlePaddle/Paddle/pull/37862), [#37882](https://github.com/PaddlePaddle/Paddle/pull/37882), [#37934](https://github.com/PaddlePaddle/Paddle/pull/37934), [#38024](https://github.com/PaddlePaddle/Paddle/pull/38024), [#38083](https://github.com/PaddlePaddle/Paddle/pull/38083), [#38164](https://github.com/PaddlePaddle/Paddle/pull/38164), [#38261](https://github.com/PaddlePaddle/Paddle/pull/38261), [#38290](https://github.com/PaddlePaddle/Paddle/pull/38290), [#40607](https://github.com/PaddlePaddle/Paddle/pull/40607), [#37093](https://github.com/PaddlePaddle/Paddle/pull/37093), [#37106](https://github.com/PaddlePaddle/Paddle/pull/37106), [#37143](https://github.com/PaddlePaddle/Paddle/pull/37143), [#37338](https://github.com/PaddlePaddle/Paddle/pull/37338), [#37376](https://github.com/PaddlePaddle/Paddle/pull/37376), [#37485](https://github.com/PaddlePaddle/Paddle/pull/37485), [#37531](https://github.com/PaddlePaddle/Paddle/pull/37531), 
[#37623](https://github.com/PaddlePaddle/Paddle/pull/37623), [#37693](https://github.com/PaddlePaddle/Paddle/pull/37693), [#37755](https://github.com/PaddlePaddle/Paddle/pull/37755), [#37807](https://github.com/PaddlePaddle/Paddle/pull/37807), [#37889](https://github.com/PaddlePaddle/Paddle/pull/37889), [#38420](https://github.com/PaddlePaddle/Paddle/pull/38420), [#38539](https://github.com/PaddlePaddle/Paddle/pull/38539), [#36892](https://github.com/PaddlePaddle/Paddle/pull/36892), [#37084](https://github.com/PaddlePaddle/Paddle/pull/37084), [#37158](https://github.com/PaddlePaddle/Paddle/pull/37158), [#37361](https://github.com/PaddlePaddle/Paddle/pull/37361), [#37509](https://github.com/PaddlePaddle/Paddle/pull/37509), [#37603](https://github.com/PaddlePaddle/Paddle/pull/37603), [#37703](https://github.com/PaddlePaddle/Paddle/pull/37703), [#37824](https://github.com/PaddlePaddle/Paddle/pull/37824), [#38114](https://github.com/PaddlePaddle/Paddle/pull/38114), [#38322](https://github.com/PaddlePaddle/Paddle/pull/38322), [#38535](https://github.com/PaddlePaddle/Paddle/pull/38535), [#38650](https://github.com/PaddlePaddle/Paddle/pull/38650), [#38709](https://github.com/PaddlePaddle/Paddle/pull/38709), [#38799](https://github.com/PaddlePaddle/Paddle/pull/38799), [#38839](https://github.com/PaddlePaddle/Paddle/pull/38839), [#38904](https://github.com/PaddlePaddle/Paddle/pull/38904)) - - - 新增分布式大模型推理功能。([#38795](https://github.com/PaddlePaddle/Paddle/pull/38795), [#39012](https://github.com/PaddlePaddle/Paddle/pull/39012), [#39032](https://github.com/PaddlePaddle/Paddle/pull/39032), [#39076](https://github.com/PaddlePaddle/Paddle/pull/39076), [#39194](https://github.com/PaddlePaddle/Paddle/pull/39194), [#39207](https://github.com/PaddlePaddle/Paddle/pull/39207), [#39241](https://github.com/PaddlePaddle/Paddle/pull/39241), [#39603](https://github.com/PaddlePaddle/Paddle/pull/39603), [#39758](https://github.com/PaddlePaddle/Paddle/pull/39758), 
[#39992](https://github.com/PaddlePaddle/Paddle/pull/39992)) - -- 动态图混合并行 - - - 重构 `paddle.distributed.fleet.utils.recompute`,支持新动态图。([#41396](https://github.com/PaddlePaddle/Paddle/pull/41396)) - - - 支持 Pure FP16 训练。([#36420](https://github.com/PaddlePaddle/Paddle/pull/36420)) - - - 新增 MoE(Mixture of Experts)并行策略, 支持超大 MoE 模型训练。([#41092](https://github.com/PaddlePaddle/Paddle/pull/41092), [#40895](https://github.com/PaddlePaddle/Paddle/pull/40895), [#40850](https://github.com/PaddlePaddle/Paddle/pull/40580), [#39224](https://github.com/PaddlePaddle/Paddle/pull/39224)) - - - 新增 GroupSharded 并行策略,支持 stage1、stage2、stage3 三个阶段模型状态分组切片训练策略,支持同、异步通信,并可与 Recompute、AMP O1\O2、Offload、GroupShardedClipGrad、GroupShardedScaler 等基础功能组合使用。([#37489](https://github.com/PaddlePaddle/Paddle/pull/37489), [#37568](https://github.com/PaddlePaddle/Paddle/pull/37568), [#37707](https://github.com/PaddlePaddle/Paddle/pull/37707), [#37836](https://github.com/PaddlePaddle/Paddle/pull/37836), [#37947](https://github.com/PaddlePaddle/Paddle/pull/37947), [#38151](https://github.com/PaddlePaddle/Paddle/pull/38151), [#38407](https://github.com/PaddlePaddle/Paddle/pull/38407), [#38052](https://github.com/PaddlePaddle/Paddle/pull/38052), [#39112](https://github.com/PaddlePaddle/Paddle/pull/39112), [#38989](https://github.com/PaddlePaddle/Paddle/pull/38989), [#39171](https://github.com/PaddlePaddle/Paddle/pull/39171), [#39285](https://github.com/PaddlePaddle/Paddle/pull/39285), [#39334](https://github.com/PaddlePaddle/Paddle/pull/39334), [#39397](https://github.com/PaddlePaddle/Paddle/pull/39397), [#39581](https://github.com/PaddlePaddle/Paddle/pull/39581), [#39668](https://github.com/PaddlePaddle/Paddle/pull/39668), [#40129](https://github.com/PaddlePaddle/Paddle/pull/40129), [#40396](https://github.com/PaddlePaddle/Paddle/pull/40396), [#40488](https://github.com/PaddlePaddle/Paddle/pull/40488), 
[#40601](https://github.com/PaddlePaddle/Paddle/pull/40601), [#37725](https://github.com/PaddlePaddle/Paddle/pull/37725), [#37904](https://github.com/PaddlePaddle/Paddle/pull/37904), [#38064](https://github.com/PaddlePaddle/Paddle/pull/38064)) - -- 静态图混合并行 - - - 新增`scale_gradient`标志位至`gradient_scale_configs`,用于控制流水线并行下梯度聚合运算对梯度进行求平均运算的位置。([#36384](https://github.com/PaddlePaddle/Paddle/pull/36384)) - - - 张量模型并行下,dropout 支持设置确定性随机种子生成器,以确保非分布式变量的随机一致性和分布式变量的随机性。([#36228](https://github.com/PaddlePaddle/Paddle/pull/36228)) - - - NPU 混合并行支持 Offload,可节约 40% 显存。([#37224](https://github.com/PaddlePaddle/Paddle/pull/37224)) - - - 为 seed op 增加 `force_cpu` 可选参数,使 dropout 可以直接从 CPU 读取 seed 的值。([#35820](https://github.com/PaddlePaddle/Paddle/pull/35820)) - - - 完善 Automatic Sparsity (ASP) sharding 策略,支持根据 program 选择 sharding 策略。([#40028](https://github.com/PaddlePaddle/Paddle/pull/40028)) - -- 自动并行 - - - 新增逻辑进程与物理设备自动映射后的进程重新启动(relaunch)。([#37523](https://github.com/PaddlePaddle/Paddle/pull/37523), [#37326](https://github.com/PaddlePaddle/Paddle/pull/37326)) - - - 完善自动并行底层机制和接口,利于各个模块统一和添加优化 pass。([#36617](https://github.com/PaddlePaddle/Paddle/pull/36617), [#38132](https://github.com/PaddlePaddle/Paddle/pull/38132)) - - - 新增统一资源表示,支持逻辑进程与物理设备自动映射功能。([#37091](https://github.com/PaddlePaddle/Paddle/pull/37091), [#37482](https://github.com/PaddlePaddle/Paddle/pull/37482), [#37094](https://github.com/PaddlePaddle/Paddle/pull/37094)) - - - 完善自动并行计算图反向和更新部分的分布式属性补全功能。([#36744](https://github.com/PaddlePaddle/Paddle/pull/36744)) - - - 新增数据切分功能。([#36055](https://github.com/PaddlePaddle/Paddle/pull/36055)) - - - 新增张量重切分功能,根据张量和算子的分布式属性对张量进行重新切分。([#40865](https://github.com/PaddlePaddle/Paddle/pull/40865), [#41106](https://github.com/PaddlePaddle/Paddle/pull/41106)) - - - 新增资源数量或并行策略变化时分布式参数的自动转换功能。([#40434](https://github.com/PaddlePaddle/Paddle/pull/40434)) - - - 新增梯度累加功能(GradientMerge),减少通信次数,提升训练效率。([#38259](https://github.com/PaddlePaddle/Paddle/pull/38259), 
[#40737](https://github.com/PaddlePaddle/Paddle/pull/40737)) - - - 新增重计算功能(Recompute),优化显存。([#38920](https://github.com/PaddlePaddle/Paddle/pull/38920)) - - - 新增 Sharding 优化 pass,支持 p-g-os 3 个 stage 的切分优化。([#38502](https://github.com/PaddlePaddle/Paddle/pull/38502)) - - - 新增 AMP + FP16 优化 pass。([#38764](https://github.com/PaddlePaddle/Paddle/pull/38764), [#40615](https://github.com/PaddlePaddle/Paddle/pull/40615)) - - - 新增 Transformer 类模型的 QKV fuse 切分。([#39080](https://github.com/PaddlePaddle/Paddle/pull/39080)) - - - 新增 while op 的分布式属性推导功能,确保迭代推导算法能收敛。([#39939](https://github.com/PaddlePaddle/Paddle/pull/39939), [#39086](https://github.com/PaddlePaddle/Paddle/pull/39086), [#39014](https://github.com/PaddlePaddle/Paddle/pull/39014)) - - - 支持子 block 和 while op 控制流的训练和推理。([#39612](https://github.com/PaddlePaddle/Paddle/pull/39612), [#39895](https://github.com/PaddlePaddle/Paddle/pull/39895), [#40077](https://github.com/PaddlePaddle/Paddle/pull/40077)) - -- 参数服务器 - - - GPUPS 下,新增 NAN/INF 值检查工具。([#38131](https://github.com/PaddlePaddle/Paddle/pull/38131)) - - - GPUPS 下,新增 set_date 接口,适配增量训练。([#36194](https://github.com/PaddlePaddle/Paddle/pull/36194)) - - - GPUPS 下,新增异步 release dataset 功能。([#37790](https://github.com/PaddlePaddle/Paddle/pull/37790)) - - - GPUPS 下,支持 Dump 参数和中间层。([#36157](https://github.com/PaddlePaddle/Paddle/pull/36157)) - - - GPUPS 下,支持优化器参数配置。([#39783](https://github.com/PaddlePaddle/Paddle/pull/39783), [#39849](https://github.com/PaddlePaddle/Paddle/pull/39849)) - - - 统一参数服务器下,重构通信、存储等各个模块基类,使各模块更易于二次开发。([#41207](https://github.com/PaddlePaddle/Paddle/pull/41207), [#41022](https://github.com/PaddlePaddle/Paddle/pull/41022), [#40702](https://github.com/PaddlePaddle/Paddle/pull/40702), [#39341](https://github.com/PaddlePaddle/Paddle/pull/39341), [#39377](https://github.com/PaddlePaddle/Paddle/pull/39377), [#39191](https://github.com/PaddlePaddle/Paddle/pull/39191), [#39064](https://github.com/PaddlePaddle/Paddle/pull/39064)) - - - 
统一参数服务器下,新增评估指标模块,支持 AUC/WuAUC/MaskAuc 等评估指标计算及可自定义扩展。([#38789](https://github.com/PaddlePaddle/Paddle/pull/38789)) - - - 支持在昆仑芯 2 芯片上的 XPU 参数服务器训练。([#41917](https://github.com/PaddlePaddle/Paddle/pull/41917), [#42266](https://github.com/PaddlePaddle/Paddle/pull/42266), [#41916](https://github.com/PaddlePaddle/Paddle/pull/41916)) - -#### Profiler - -- Python 层新增性能分析模块 `paddle.profiler`:提供对训推过程中性能数据的收集,导出和统计的功能。([#40065](https://github.com/PaddlePaddle/Paddle/pull/40065), [#40357](https://github.com/PaddlePaddle/Paddle/pull/40357), [#40888](https://github.com/PaddlePaddle/Paddle/pull/40888)) - - - `paddle.profiler.Profiler`,性能分析器,用户交互的接口。([#41029](https://github.com/PaddlePaddle/Paddle/pull/41029), [#41524](https://github.com/PaddlePaddle/Paddle/pull/41524), [#41157](https://github.com/PaddlePaddle/Paddle/pull/41157), [#40249](https://github.com/PaddlePaddle/Paddle/pull/40249), [#40111](https://github.com/PaddlePaddle/Paddle/pull/40111), [#39964](https://github.com/PaddlePaddle/Paddle/pull/39964), [#40133](https://github.com/PaddlePaddle/Paddle/pull/40133)) - - - `paddle.profiler.RecordEvent`,提供自定义打点来记录时间的功能。([#39693](https://github.com/PaddlePaddle/Paddle/pull/39693), [#39694](https://github.com/PaddlePaddle/Paddle/pull/39694), [#39695](https://github.com/PaddlePaddle/Paddle/pull/39695), [#39675](https://github.com/PaddlePaddle/Paddle/pull/39675),[#41445](https://github.com/PaddlePaddle/Paddle/pull/41445), [#41132](https://github.com/PaddlePaddle/Paddle/pull/41132)) - - - `paddle.profiler.ProfilerTarget`,指定性能分析的目标设备。 - - - `paddle.profiler.ProfilerState`,表示性能分析器的状态。 - - - `paddle.profiler.SortedKeys`,指定统计表单内数据的排序方式。 - - - `paddle.profiler.make_scheduler`,生成性能分析器状态的调度器,实现采集范围的周期性控制。 - - - `paddle.profiler.export_chrome_tracing`,将性能数据保存到可供 chrome://tracing 插件查看的 google chrome tracing 文件。([#39316](https://github.com/PaddlePaddle/Paddle/pull/39316), [#39984](https://github.com/PaddlePaddle/Paddle/pull/39984), [#41029](https://github.com/PaddlePaddle/Paddle/pull/41029)) 
- - - `paddle.profiler.export_protobuf`,将性能数据保存到内部结构表示的 protobuf 文件。([#39519](https://github.com/PaddlePaddle/Paddle/pull/39519), [#39109](https://github.com/PaddlePaddle/Paddle/pull/39109), [#39474](https://github.com/PaddlePaddle/Paddle/pull/39474)) - - - `paddle.profiler.load_profiler_result`,载入已保存到 protobuf 文件的性能数据。 - - - `paddle.profiler.Profiler`通过指定 `timer_only` 参数,对模型进行数据读取、step 开销和吞吐量的统计。([#40386](https://github.com/PaddlePaddle/Paddle/pull/40386)) - -- C++层重构 Profiler 底层基础设施 - - - 重构 Profiler 的控制器架构。([#38826](https://github.com/PaddlePaddle/Paddle/pull/38826), [#39230](https://github.com/PaddlePaddle/Paddle/pull/39230), [#39779](https://github.com/PaddlePaddle/Paddle/pull/39779)) - - - 新增 Host Tracer,收集主机侧性能指标。([#37629](https://github.com/PaddlePaddle/Paddle/pull/39629), [#37766](https://github.com/PaddlePaddle/Paddle/pull/37766), [#37944](https://github.com/PaddlePaddle/Paddle/pull/37944), [#38280](https://github.com/PaddlePaddle/Paddle/pull/38280), [#39975](https://github.com/PaddlePaddle/Paddle/pull/39975), [#40460](https://github.com/PaddlePaddle/Paddle/pull/40460)) - - - 新增 CUDA Tracer,收集设备侧性能指标。([#39488](https://github.com/PaddlePaddle/Paddle/pull/39488)) - - - Profiler 支持分级。([#39926](https://github.com/PaddlePaddle/Paddle/pull/39926)) - -- 修改新动态图下 op 的打点名称和类型。([#41771](https://github.com/PaddlePaddle/Paddle/pull/41771)) - -- 添加 Kernel 表单,以及优化表单内容的展示方式。([#41989](https://github.com/PaddlePaddle/Paddle/pull/41989)) - -- 消除 Profiler 关闭情况下对模型前向计算造成性能下降的影响。([#42142](https://github.com/PaddlePaddle/Paddle/pull/42142)) - -#### CINN 编译器接入 - -飞桨的编译器功能在逐步丰富中,针对 CINN ([GitHub - PaddlePaddle/CINN: Compiler Infrastructure for Neural Networks](https://github.com/PaddlePaddle/CINN)) 的变更,Paddle 侧接入也进行了相对应的更改,以适配编译器 CINN 的功能。其中主要包括增加 Paddle-CINN 运行流程的子图管理相关功能、显存和速度性能的优化,以及开发过程中发现的 bug 修复。 - -- 功能开发: - - - 子图 op 相关: - - - 添加从计算图中找到并生成 CINN 子图的功能。([#36345](https://github.com/PaddlePaddle/Paddle/pull/36345)) - - - 新增 cinn_launch op 作为运行时接入 CINN 的入口,负责调度 CINN 
对子图进行编译、初始化数据空间、调度生成 Kernel 的执行。([#36600](https://github.com/PaddlePaddle/Paddle/pull/36600)) - - - 为 cinn_launch op 的 Kernel 实现添加辅助类 CinnLaunchContext 管理子图编译、运行的中间数据,提升可扩展性和代码可读性。([#37938](https://github.com/PaddlePaddle/Paddle/pull/37938)) - - - 为 CINN 子图添加额外的 fetch 结点,从而保证 CINN 外部结点能取到待 fetch 变量的值。([#37172](https://github.com/PaddlePaddle/Paddle/pull/37172), [#37190](https://github.com/PaddlePaddle/Paddle/pull/37190)) - - - 添加对 CINN 子图符号化的功能,符号化用于拓扑排序子图并返回 CINN 执行序列。([#36417](https://github.com/PaddlePaddle/Paddle/pull/36417)) - - - 新增 CinnCompiler 类,用于调用 CINN 编译模型中可使用 CINN 算子替换的子图。([#36562](https://github.com/PaddlePaddle/Paddle/pull/36562), [#36975](https://github.com/PaddlePaddle/Paddle/pull/36975)) - - - 为 CINN 符号化类新增获取子图 fetch 变量名的接口,防止编译优化中将 fetch 变量融合消除。([#37218](https://github.com/PaddlePaddle/Paddle/pull/37218)) - - - 程序开发检查、debug、API 变更相关: - - - 同步更新 CINN 中 NetBuilder API 名称的变化。([#40392](https://github.com/PaddlePaddle/Paddle/pull/40392)) - - - 为 Paddle-CINN 添加必要的用于 debug 的日志信息。([#36867](https://github.com/PaddlePaddle/Paddle/pull/36867)) - - - 添加 Paddle desc 与 CINN desc 互转函数。([#36100](https://github.com/PaddlePaddle/Paddle/pull/36100)) - - - 相比 Paddle,CINN 中实现的算子可能存在未使用到某些输入变量,因此在 cinn_launch op 中去除对输入变量必须被使用的检查。([#37119](https://github.com/PaddlePaddle/Paddle/pull/37119)) - - - 新增 cinn_instruction_run op 用于调用 CINN 执行单个生成指令,便于 Paddle 侧构建 Graph 调度运行子图。([#39435](https://github.com/PaddlePaddle/Paddle/pull/39435), [#39576](https://github.com/PaddlePaddle/Paddle/pull/39576)) - - - 在 Paddle 中添加编译 CINN 所需的 CUDA/CUBLAS/MKL/CINN pass 应用等控制宏。([#37066](https://github.com/PaddlePaddle/Paddle/pull/37066), [#36660](https://github.com/PaddlePaddle/Paddle/pull/36660)) - - - 增加 FLAGS_allow_cinn_ops 和 FLAGS_deny_cinn_ops 两个控制标记,用于控制 Paddle 训练中使用 CINN 算子代替原生算子的种类。([#36842](https://github.com/PaddlePaddle/Paddle/pull/36842)) - -- 性能优化: - - - 速度优化 - - - 优化 CinnCacheKey 的计算耗时。([#37786](https://github.com/PaddlePaddle/Paddle/pull/37786), 
[#37317](https://github.com/PaddlePaddle/Paddle/pull/37317)) - - - 缓存 CINN 编译子图的变量 scope,降低运行参数构造开销。([#37983](https://github.com/PaddlePaddle/Paddle/pull/37983)) - - - 子图编译时接入 CINN 自动调优,支持通过 flag 启用,便于后续进一步调优训练性能。([#41795](https://github.com/PaddlePaddle/Paddle/pull/41795)) - - - 重构子图编译时对编译结果的正确性校验,避免运行时重复检查,降低调度开销。([#41777](https://github.com/PaddlePaddle/Paddle/pull/41777)) - - - 在 Paddle-CINN 训练功能中默认启用 TransposeFolding 和 GemmRewriter 优化 pass。([#41084](https://github.com/PaddlePaddle/Paddle/pull/41084)) - - - 将 Paddle 中创建的 cuda stream 传入 CINN,使得 Paddle 和 CINN 执行计算时共用同一个 CUDA stream。([#37337](https://github.com/PaddlePaddle/Paddle/pull/37337)) - - - 将 CINN 优化 pass 应用逻辑从 Paddle 中移动到 CINN 中。([#42047](https://github.com/PaddlePaddle/Paddle/pull/42047), [#42070](https://github.com/PaddlePaddle/Paddle/pull/42070)) - - - 显存优化 - - - 为 cinn_launch op 添加 NoNeedBufferVars 声明无须 buffer 的输入变量列表,以便显存优化提前释放无效空间。([#38367](https://github.com/PaddlePaddle/Paddle/pull/38367)) - - - 传入子图外部变量的引用计数信息,便于 cinn_launch 内子图复用显存优化 pass,降低使用 CINN 的显存开销。([#39209](https://github.com/PaddlePaddle/Paddle/pull/39209), [#39622](https://github.com/PaddlePaddle/Paddle/pull/39622)) - - - 添加 CINN 编译生成的可执行指令集合转换为 Paddle Graph 的功能,支持复用 Paddle 调度器及显存优化 pass,进一步降低使用 CINN 的显存开销。([#39724](https://github.com/PaddlePaddle/Paddle/pull/39724), [#39911](https://github.com/PaddlePaddle/Paddle/pull/39911)) - - - 添加 cinn_instruction_run op 的 Kernel 支持根据编译结果推断的数据类型动态申请空间。([#40920](https://github.com/PaddlePaddle/Paddle/pull/40920)) - -- 问题修复: - - - 修复并优化 CINN 子图的生成逻辑。([#36503](https://github.com/PaddlePaddle/Paddle/pull/36503)) - - - 修复 Paddle-CINN 不支持无输入子图的问题。([#40814](https://github.com/PaddlePaddle/Paddle/pull/40814)) - - - 修复由于 CINN 无法处理 batch_norm 等算子中存在的无用输出而报错的问题。([#36996](https://github.com/PaddlePaddle/Paddle/pull/36996)) - - - 修复若干 CINN 子图划分以及符号化中存在的 bug,解决 Paddle 训练接入 CINN 全流程打通过程中遇到的问题。([#36739](https://github.com/PaddlePaddle/Paddle/pull/36739), [#36698](https://github.com/PaddlePaddle/Paddle/pull/36698) 
) - - - CINN 尚不支持控制流,添加遇控制流跳过的逻辑。([#40812](https://github.com/PaddlePaddle/Paddle/pull/40812)) - -#### 其他 - -- 模型量化 - - - 升级量化存储格式,并统一动、静态图量化格式。([#41041](https://github.com/PaddlePaddle/Paddle/pull/41041)) - - - 新增离线量化方法:EMD、Adaround。([#40421](https://github.com/PaddlePaddle/Paddle/pull/40421), [#38460](https://github.com/PaddlePaddle/Paddle/pull/38460)) - - - 支持更多 op 适配模 op 量化。([#40083](https://github.com/PaddlePaddle/Paddle/pull/40083)) - - - 支持控制流中的 OP 量化。([#37498](https://github.com/PaddlePaddle/Paddle/pull/37498)) - - - 新增支持 matmul_v2 OP 的量化。([#36469](https://github.com/PaddlePaddle/Paddle/pull/36469)) - - - 新增支持量化后的 matmul_v2 在 TensorRT 上的推理。([#36594](https://github.com/PaddlePaddle/Paddle/pull/36594)) - -- 显存优化 - - - 实现多 stream 安全 Allocator,支持在多 stream 异步计算场景下安全高效地使用显存。([#37290](https://github.com/PaddlePaddle/Paddle/pull/37290)) - - - 新增运行时显存监控模块(paddle.device.cuda.max_memory_allocated, paddle.device.cuda.max_memory_reserved, paddle.device.cuda.memory_allocated and paddle.device.cuda.memory_reserved),支持高性能地实时统计显存数据。([#38657](https://github.com/PaddlePaddle/Paddle/pull/38657)) - - - 实现 CPU-GPU 统一内存寻址(CUDA Managed Memory),支持在显存受限场景下训练超大模型。([#39075](https://github.com/PaddlePaddle/Paddle/pull/39075)) - - - C++底层新增 GetBasePtr 接口,用来获取设备接口 CUDAMalloc 创建的设备地址。([#37978](https://github.com/PaddlePaddle/Paddle/pull/37978)) - - - 减少 AutoGrowth Allocator 中 free blocks 的数量,提升显存分配性能。([#35732](https://github.com/PaddlePaddle/Paddle/pull/35732)) - - - 对于 `initializer.Normal` 和 `initializer.Constant` 数据类型是 FP16 的 Tensor 去除多余的 float32 临时 Tensor 以及 cast,节省 2 倍显存。([#38818](https://github.com/PaddlePaddle/Paddle/pull/38818)) - -- 动态图高阶导数组网测试 - - - 为动态图增加三阶导数组网测试,以及 Broadcast 情况的测试。([#36814](https://github.com/PaddlePaddle/Paddle/pull/36814), [#37377](https://github.com/PaddlePaddle/Paddle/pull/37377)) - -- 自定义 op:支持 ROCm(HIP) 平台进行自定义 op 注册。([#36771](https://github.com/PaddlePaddle/Paddle/pull/36771)) - -- Cost Model:增加基于运行 Profile 的 Cost 
Model。([#35774](https://github.com/PaddlePaddle/Paddle/pull/35774)) - -- 提供定制化层 (nn.Layer) 的自动稀疏训练支持,让用户可根据自定义的 Prune 函数来对其设计的层进行稀疏剪枝。([#40253](https://github.com/PaddlePaddle/Paddle/pull/40253)) - -- 新增字符串张量底层数据结构表示,使框架具备字符串张量表示和计算的能力。([#39830](https://github.com/PaddlePaddle/Paddle/pull/39830), [#40992](https://github.com/PaddlePaddle/Paddle/pull/40992)) - -- 新增或者升级 oneDNN FP32/int8/bfloat16 Kernel,包括: - - - ELU ([#37149](https://github.com/PaddlePaddle/Paddle/pull/37149)) - - - exp ([#38624](https://github.com/PaddlePaddle/Paddle/pull/38624)) - - - stack ([#37002](https://github.com/PaddlePaddle/Paddle/pull/37002)) - - - softplus ([#36382](https://github.com/PaddlePaddle/Paddle/pull/36382)) - - - round ([#39653](https://github.com/PaddlePaddle/Paddle/pull/39653)) - - - shape ([#36033](https://github.com/PaddlePaddle/Paddle/pull/36033)) - - - flatten 和 flatten2 ([#35892](https://github.com/PaddlePaddle/Paddle/pull/35892)) - - - slice ([#37630](https://github.com/PaddlePaddle/Paddle/pull/37630)) - - - elementwise_mul ([#40546](https://github.com/PaddlePaddle/Paddle/pull/40546)) - - - elementwise_add ([#38176](https://github.com/PaddlePaddle/Paddle/pull/38176)) - - - elementwise_div ([#36158](https://github.com/PaddlePaddle/Paddle/pull/36158)) - - - elementwise_sub ([#35662](https://github.com/PaddlePaddle/Paddle/pull/35662)) - - - roi_align ([#37848](https://github.com/PaddlePaddle/Paddle/pull/37848)) - - - nearest_interp 和 nearest_interp_v2 ([#37985](https://github.com/PaddlePaddle/Paddle/pull/37985), [#38622](https://github.com/PaddlePaddle/Paddle/pull/38622), [#39490](https://github.com/PaddlePaddle/Paddle/pull/39490)) - - - assembly optimized Adam ([#39158](https://github.com/PaddlePaddle/Paddle/pull/39158)) - - - logsoftmax ([#39793](https://github.com/PaddlePaddle/Paddle/pull/39793)) - - - activation ([#40721](https://github.com/PaddlePaddle/Paddle/pull/40721)) - - - mul ([#38552](https://github.com/PaddlePaddle/Paddle/pull/38552)) - - - mean 
([#37104](https://github.com/PaddlePaddle/Paddle/pull/37104)) - - - relu ([#36265](https://github.com/PaddlePaddle/Paddle/pull/36265)) - - - pool2d ([#37081](https://github.com/PaddlePaddle/Paddle/pull/37081)) - - - concat ([#35889](https://github.com/PaddlePaddle/Paddle/pull/35889)) - - - conv2d ([#38507](https://github.com/PaddlePaddle/Paddle/pull/38507), [#38938](https://github.com/PaddlePaddle/Paddle/pull/38938), [#36284](https://github.com/PaddlePaddle/Paddle/pull/36284)) - - - LayerNorm ([#40418](https://github.com/PaddlePaddle/Paddle/pull/40418)) - -- 增加基于 SSD-内存-GPU 显存的 3 级存储图检索引擎,支持大规模图神经网络训练。([#42472](https://github.com/PaddlePaddle/Paddle/pull/42472), [#42321](https://github.com/PaddlePaddle/Paddle/pull/42321), [#42027](https://github.com/PaddlePaddle/Paddle/pull/42027)) - -- 增加异构多云训练通信模块 switch,实现 Send/Recv 接口,支持多云异构通信。([#40965](https://github.com/PaddlePaddle/Paddle/pull/40965), [#40911](https://github.com/PaddlePaddle/Paddle/pull/40911)) - -### (2)功能优化 - -#### API - -- 为 `paddle.Model`新增支持混合精度训练 O2 模式,即支持原来动/静态图的 Pure FP16 训练模式。([#36441](https://github.com/PaddlePaddle/Paddle/pull/36441)) - -- 为 `paddle.nn.Layer` 支持 self chain 调用。([#36609](https://github.com/PaddlePaddle/Paddle/pull/36609)) - -- 为 `paddle.nn.Layer`的`to`方法添加`is_distributed`属性的设置,保证网络参数转换前后分布式属性保持一致。([#36221](https://github.com/PaddlePaddle/Paddle/pull/36221)) - -- 完善 `paddle.nn.Layer`的`to` 方法的参数转换逻辑,降低转换过程占用的峰值显存,提高转换成功率。([#36862](https://github.com/PaddlePaddle/Paddle/pull/36862)) - -- 为 `paddle.incubate.graph_send_recv`支持设置输出 Tensor 的 shape,有利于减少实际计算过程的显存占用。([#40509](https://github.com/PaddlePaddle/Paddle/pull/40509)) - -- 为 `paddle.incubate.segment_sum`、`segment_mean`、`segment_max`、`segment_min` 新增 int32、int64 数据类型支持。([#40577](https://github.com/PaddlePaddle/Paddle/pull/40577)) - -- 为 transpose op 新增 bool 类型支持。([#35886](https://github.com/PaddlePaddle/Paddle/pull/35886)) - -- 将 `paddle.mm` 底层算子从 matmul 切换到 matmul_v2。([#35770](https://github.com/PaddlePaddle/Paddle/pull/35770)) - -- 
为 `paddle.einsum` 支持静态图模式调用,支持未知 shape。([#40360](https://github.com/PaddlePaddle/Paddle/pull/40360)) - -- 为 `paddle.nn.functional.margin_cross_entropy` 和 `paddle.nn.functional.class_center_sample` 支持数据并行。([#39852](https://github.com/PaddlePaddle/Paddle/pull/39852)) - -- 为 `paddle.nn.functional.grid_sample`支持形状为[1]的输入。([#36183](https://github.com/PaddlePaddle/Paddle/pull/36183)) - -- 为 `paddle.nn.PRelu` 支持 `NHWC` 数据格式。([#37019](https://github.com/PaddlePaddle/Paddle/pull/37019)) - -- 为 `paddle.nn.functional.class_center_sample` 支持使用 `paddle.seed` 固定随机状态。([#38248](https://github.com/PaddlePaddle/Paddle/pull/38248)) - -- 为 `paddle.fft` 下所有 API 新增 ROCM 后端支持,并优化 CUFFT 后端报错信息。([#36415](https://github.com/PaddlePaddle/Paddle/pull/36415), [#36114](https://github.com/PaddlePaddle/Paddle/pull/36114/files)) - -- 为 `Tensor.getitem` 增加对切片部分维度为 0 的功能支持,即允许切片索引结果为空。([#37313](https://github.com/PaddlePaddle/Paddle/pull/37313)) - -- 为 `Tensor.setitem` 支持 int 和 bool 类型 Tensor 使用 bool 索引。([#37761](https://github.com/PaddlePaddle/Paddle/pull/37761)) - -- 为 `paddle.nn.functional.interpolate` 支持 nearest 模式时输入 shape 为 5D。([#38868](https://github.com/PaddlePaddle/Paddle/pull/38868)) - -- 为 `paddle.nn.Embedding`、`paddle.gather` 增加 int16 支持。([#40964](https://github.com/PaddlePaddle/Paddle/pull/40964), [#40052](https://github.com/PaddlePaddle/Paddle/pull/40052)) - -- 为 `paddle.distributed.spawn`添加 CPU 单机数据并行。([#35745](https://github.com/PaddlePaddle/Paddle/pull/35745), [#36758](https://github.com/PaddlePaddle/Paddle/pull/36758), [#36637](https://github.com/PaddlePaddle/Paddle/pull/36637)) - -- 新增`depthwise_conv2d`MKLDNN 算子。([#38484](https://github.com/PaddlePaddle/Paddle/pull/38484)) - -- 为`paddle.abs`、`paddle.transpose`、`paddle.squeeze`、`paddle.unsqueeze`、 `paddle.matmul`、`paddle.full` 静态图数据类型检测中增加复数类型。([#40113](https://github.com/PaddlePaddle/Paddle/pull/40113)) - -- 为 `paddle.autograd.PyLayer` 支持 tuple/list 类型的参数。([#38146](https://github.com/PaddlePaddle/Paddle/pull/38146)) - -- 为 
`paddle.autograd.PyLayer` 增加 inplace 策略下对输入叶子节点 Tensor 的检查报错机制。([#37931](https://github.com/PaddlePaddle/Paddle/pull/37931)) - -- 为 `paddle.autograd.PyLayer` 支持 HIP 库。([#38184](https://github.com/PaddlePaddle/Paddle/pull/38184)) - -- 为 `paddle.take_along_axis`、`paddle.put_along_axis` 支持更多 size 的输入,允许 index 矩阵的 shape size 大于 arr 矩阵的 shape size。([#39072](https://github.com/PaddlePaddle/Paddle/pull/39072)) - -- 优化 API `paddle.nn.Pad2D`在 replicate 为 0 时的报错信息。([#36510](https://github.com/PaddlePaddle/Paddle/pull/36510)) - -- 为 API `paddle.nn.Pad2D` 支持 tuple 格式的 pad 输入。([#35985](https://github.com/PaddlePaddle/Paddle/pull/35985)) - -- 新增 `paddle.distributed.InMemoryDataset` 中 tdm_sample API 以支持 TDM 算法中的采样操作。([#37044](https://github.com/PaddlePaddle/Paddle/pull/37044)) - -- 新增对于`paddle.jit.save`的 Pre-saving Hooks 机制。([#38186](https://github.com/PaddlePaddle/Paddle/pull/38186)) - -- 新增高阶微分相关 API: - - - `elementwise_add` 增加三阶 Kernel,支持三阶微分的计算。([#36508](https://github.com/PaddlePaddle/Paddle/pull/36508), [#36618](https://github.com/PaddlePaddle/Paddle/pull/36618)) - - - `matmul_v2` 增加三阶 Kernel,支持三阶微分的计算。([#36459](https://github.com/PaddlePaddle/Paddle/pull/36459)) - - - `elementwise_mul` 增加三阶 Kernel,支持三阶微分的计算。([#37152](https://github.com/PaddlePaddle/Paddle/pull/37547)) - -- 完善`paddle.amp.GradScaler`调用 check_finite_and_unscale op 的逻辑,消除该处创建 bool 变量所引入的 cudaMemcpy。([#37770](https://github.com/PaddlePaddle/Paddle/pull/37770)) - -- 为 unstack 和 unique op 新增对元素个数为 0 的 Tensor 的检查。([#36021](https://github.com/PaddlePaddle/Paddle/pull/36021)) - -- 新增支持昆仑芯 2 的多层、双向 LSTM 功能,完善 RNN 前反向 op,支持时序类模型训练使用。([#41781](https://github.com/PaddlePaddle/Paddle/pull/41781), [#42076](https://github.com/PaddlePaddle/Paddle/pull/42076)) - -- 新增支持昆仑芯 2 的 bce_loss 前反向 op。([#41610](https://github.com/PaddlePaddle/Paddle/pull/41610)) - -- 添加 `paddle.linalg.det` 的反向实现。([#36013](https://github.com/PaddlePaddle/Paddle/pull/36013)) - -#### IR(Intermediate Representation) - -- 动态图转静态图 - - - 优化动转静下 
`ProgramCache.last` 接口行为,使其返回最近使用的 Program,而非最后生成的 Program。([#39541](https://github.com/PaddlePaddle/Paddle/pull/39541)) - - - 优化动转静下 `paddle.reshape` API 的报错信息,新增推荐用法提示。([#40599](https://github.com/PaddlePaddle/Paddle/pull/40599)) - - - 优化动转静代码转写时 `is_api_in_module` 函数中异常捕获类型。([#40243](https://github.com/PaddlePaddle/Paddle/pull/40243)) - - - 优化动转静模块报错提示,默认隐藏 warning 信息。([#39730](https://github.com/PaddlePaddle/Paddle/pull/39730)) - - - 增加动转静对于 type hint 语法的支持,提高变量类型分析的准确性。([#39572](https://github.com/PaddlePaddle/Paddle/pull/39572)) - - - 优化 `paddle.cond` 功能,允许 bool、int 等基本类型支持值相等。([#37888](https://github.com/PaddlePaddle/Paddle/pull/37888)) - - - 优化动转静 `@to_static`,装饰普通函数时允许切换 train/eval 模式。([#37383](https://github.com/PaddlePaddle/Paddle/pull/37383)) - - - 优化动转静报错栈,突出用户相关代码,减少框架冗余报错栈。([#36741](https://github.com/PaddlePaddle/Paddle/pull/36741)) - - - 移除`paddle.cond` 返回值中 `no_value` 占位符。([#36513](https://github.com/PaddlePaddle/Paddle/pull/36513), [#36826](https://github.com/PaddlePaddle/Paddle/pull/36826)) - - - 为动转静 run_program op 适配新动态图模式。([#40198](https://github.com/PaddlePaddle/Paddle/pull/40198), [#40355](https://github.com/PaddlePaddle/Paddle/pull/40355)) - - - 新增对于 zip 语法的检查。([#37846](https://github.com/PaddlePaddle/Paddle/pull/37846)) - - - 修复 `paddle.signal.frame`、`paddle.signal.stft`、`paddle.signal.istft` 因维度和类型判断错误导致的动转静失败问题。([#40113](https://github.com/PaddlePaddle/Paddle/pull/40113)) - - - 为 mean、pad3d ops 新增注册复数类型 Kernel。([#40113](https://github.com/PaddlePaddle/Paddle/pull/40113)) - -#### 混合精度训练 - -- 为 amp 添加 GPU Compute Capability 环境检查,对无法产生训练加速效果的 GPU 环境添加使用警告。([#38086](https://github.com/PaddlePaddle/Paddle/pull/38086)) - -- 添加`paddle.amp.decorate`与`paddle.DataParallel`同时使用时调用顺序的检查。([#38785](https://github.com/PaddlePaddle/Paddle/pull/38785)) - -#### 分布式训练 - -- 分布式训练基础功能 - - - 优化 Fleet API 和 DistributedStrategy 
配置以使用动态图并行功能,提升动态图易用性。([#40408](https://github.com/PaddlePaddle/Paddle/pull/40408)) - - - 优化动态图混合并行 HybridParallelClipGrad 策略,支持 4D 混合并行 + Pure FP16 训练。([#36237](https://github.com/PaddlePaddle/Paddle/pull/36237), [#36555](https://github.com/PaddlePaddle/Paddle/pull/36555)) - - - 重构动态图数据并行策略,以支持新动态图和新通信库功能。([#40389](https://github.com/PaddlePaddle/Paddle/pull/40389), [#40593](https://github.com/PaddlePaddle/Paddle/pull/40593), [#40836](https://github.com/PaddlePaddle/Paddle/pull/40836), [#41119](https://github.com/PaddlePaddle/Paddle/pull/41119), [#41413](https://github.com/PaddlePaddle/Paddle/pull/41413), [#39987](https://github.com/PaddlePaddle/Paddle/pull/39987)) - - - 为 fused_attention op 支持分布式张量模型并行。([#40101](https://github.com/PaddlePaddle/Paddle/pull/40101)) - - - 为 fused_feedforward op 支持分布式张量模型并行。([#40160](https://github.com/PaddlePaddle/Paddle/pull/40160)) - -- 图检索引擎 - - - 优化图引擎的图采样接口返回的数据格式,采样速度提升 3 倍。([#37315](https://github.com/PaddlePaddle/Paddle/pull/37315)) - - - 减少图引擎线程量以提升性能。([#37098](https://github.com/PaddlePaddle/Paddle/pull/37098)) - - - 优化图引擎数据传输以提升性能。([#37341](https://github.com/PaddlePaddle/Paddle/pull/37341)) - - - 利用模型中 embedding op 的拓扑关系,优化 embedding op 的合并逻辑以提升性能。([#35942](https://github.com/PaddlePaddle/Paddle/pull/35942)) - -- 通信库:重构通信库,提升通信库的易扩展性和二次开发性,支持异构通信。([#41398](https://github.com/PaddlePaddle/Paddle/pull/41398), [#39720](https://github.com/PaddlePaddle/Paddle/pull/39720), [#40911](https://github.com/PaddlePaddle/Paddle/pull/40911), [#40579](https://github.com/PaddlePaddle/Paddle/pull/40579), [#40629](https://github.com/PaddlePaddle/Paddle/pull/40629), [#40437](https://github.com/PaddlePaddle/Paddle/pull/40437), [#40430](https://github.com/PaddlePaddle/Paddle/pull/40430), [#40228](https://github.com/PaddlePaddle/Paddle/pull/40228), [#40181](https://github.com/PaddlePaddle/Paddle/pull/40181), [#40100](https://github.com/PaddlePaddle/Paddle/pull/40100), [#40097](https://github.com/PaddlePaddle/Paddle/pull/40097), 
[#39892](https://github.com/PaddlePaddle/Paddle/pull/39892), [#39384](https://github.com/PaddlePaddle/Paddle/pull/39384), [#39737](https://github.com/PaddlePaddle/Paddle/pull/39737), [#40040](https://github.com/PaddlePaddle/Paddle/pull/40040)) - -- 公开 `paddle.incubate.distributed.models.moe` 中 MoE 相关接口(`moe.GShardGate`, `moe.BaseGate`, `moe.SwitchGate`, `moe.MoELayer`, `moe.ClipGradForMOEByGlobalNorm`)。([#42300](https://github.com/PaddlePaddle/Paddle/pull/42300)) - -- 修复 `paddle.incubate.distributed.models.moe.MoELayer` 中使用 recomputing 可能报错的问题。([#42128](https://github.com/PaddlePaddle/Paddle/pull/42128)) - -- 修复新动态图流水线并行因为数据类型不同导致的报错。([#41937](https://github.com/PaddlePaddle/Paddle/pull/41937), [#42053](https://github.com/PaddlePaddle/Paddle/pull/42053)) - -- 修复新动态图张量模型并行因为数据类型不同导致的报错。([#41960](https://github.com/PaddlePaddle/Paddle/pull/41960)) - -#### 自定义算子 - -- 增强 C++ 自定义算子机制的二阶反向算子编写功能,支持为二阶反向算子的梯度输入变量添加后缀作为输出使用。([#41781](https://github.com/PaddlePaddle/Paddle/pull/41781)) - -- 移除 Tensor API 成员方法中对废弃的枚举类型 PlaceType 的使用,进行相应兼容处理,并添加 deprecated warning 提示。([#41882](https://github.com/PaddlePaddle/Paddle/pull/41882)) - -- 为原 Tensor API 的一系列废弃接口,包括不完整构造函数、reshape、mutable_data、copy_to 方法添加 deprecated warning 提示。([#41882](https://github.com/PaddlePaddle/Paddle/pull/41882)) - -#### 其他 - -- 报错调试优化 - - - 优化 cross_entropy op 对 `label` 的边界检查报错信息。([#40001](https://github.com/PaddlePaddle/Paddle/pull/40001)) - - - 为动态图添加 op 执行时`infer_shape`和`compute`方法的 profile record,用于在 timeline 中展示其开销。([#39023](https://github.com/PaddlePaddle/Paddle/pull/39023)) - - - 替换了 Windows 下容易出现未知异常的 `pybind::index_error` 报错提示。([#40538](https://github.com/PaddlePaddle/Paddle/pull/40538)) - - - 添加用户 scatter op 越界检查的报错信息。([#37429](https://github.com/PaddlePaddle/Paddle/pull/37429)) - -- 下载工具:针对`paddle.utils.download.get_path_from_url`中解压含多文件目录速度慢的问题,将原先循环遍历目录下文件逐一解压的方式替换为在目录上调用 extractall 一次解压的方式,解压速度大幅提升。([#37311](https://github.com/PaddlePaddle/Paddle/pull/37311)) - -- 加速 
`fake_quantize_range_abs_max`、`fake_quantize_abs_max`、`fake_quantize_dequantize_abs_max`、 `fake_quantize_moving_average_abs_max` 等量化训练。([#40491](https://github.com/PaddlePaddle/Paddle/pull/40491)) - -### (3)性能优化 - -#### 分布式训练 - -- 混合并行优化器 sharding 支持 optimize_cast 优化,将前反向参数 cast 移到优化器阶段,性能提升 7%。([#35878](https://github.com/PaddlePaddle/Paddle/pull/35878)) - -- GPUPS 优化:支持梯度 fuse allreduce 训练,训练提升 20%。([#35131](https://github.com/PaddlePaddle/Paddle/pull/35131)) - -- GPUPS 优化:dump CPU 优化提速 3.21 倍。([#40068](https://github.com/PaddlePaddle/Paddle/pull/40068)) - -- CPU 参数服务器流式训练优化:支持稀疏参数统计量自动统计、稀疏参数增量保存等功能,训练性能提升 20%。([#36465](https://github.com/PaddlePaddle/Paddle/pull/36465), [#36601](https://github.com/PaddlePaddle/Paddle/pull/36601), [#36734](https://github.com/PaddlePaddle/Paddle/pull/36734), [#36909](https://github.com/PaddlePaddle/Paddle/pull/36909), [#36943](https://github.com/PaddlePaddle/Paddle/pull/36943), [#37181](https://github.com/PaddlePaddle/Paddle/pull/37181), [#37194](https://github.com/PaddlePaddle/Paddle/pull/37194), [#37515](https://github.com/PaddlePaddle/Paddle/pull/37515), [#37626](https://github.com/PaddlePaddle/Paddle/pull/37626), [#37995](https://github.com/PaddlePaddle/Paddle/pull/37995), [#38582](https://github.com/PaddlePaddle/Paddle/pull/38582), [#39250](https://github.com/PaddlePaddle/Paddle/pull/39250), [#40762](https://github.com/PaddlePaddle/Paddle/pull/40762), [#41234](https://github.com/PaddlePaddle/Paddle/pull/41234), [#41320](https://github.com/PaddlePaddle/Paddle/pull/41320), [#41400](https://github.com/PaddlePaddle/Paddle/pull/41400)) - -#### 算子优化 - -- 优化 `FasterTokenizer` 性能,性能与优化前相比提升 10%。([#36701](https://github.com/PaddlePaddle/Paddle/pull/36701)) - -- 优化 `index_select` 反向计算,性能较优化前有 3.7~25.2 倍提升。([#37055](https://github.com/PaddlePaddle/Paddle/pull/37055)) - -- 优化 `paddle.nn.ClipByGlobalNorm` 的性能,以 10*10 的 `paddle.nn.Linear` 为例,性能与优化前相比提升 30%左右。([#38209](https://github.com/PaddlePaddle/Paddle/pull/38209)) - -- 优化 `pnorm` 在 
`axis` 维度极大或极小情况下的性能,前向速度提升 31~96 倍,反向速度提升 1.1~19 倍。([#37685](https://github.com/PaddlePaddle/Paddle/pull/37685), [#38215](https://github.com/PaddlePaddle/Paddle/pull/38215), [#39011](https://github.com/PaddlePaddle/Paddle/pull/39011)) - -- 优化 `softmax` 前、反向性能,对于 `axis!=-1` 的配置加速比为 2 倍左右。([#38602](https://github.com/PaddlePaddle/Paddle/pull/38602), [#38609](https://github.com/PaddlePaddle/Paddle/pull/38609), [#32387](https://github.com/PaddlePaddle/Paddle/pull/32387), [#37927](https://github.com/PaddlePaddle/Paddle/pull/37927/files)) - -- 优化 `log_softmax` 前、反向性能,对于 `axis!=-1`的配置加速比为 6~20 倍左右。([#38992](https://github.com/PaddlePaddle/Paddle/pull/38992), [#40612](https://github.com/PaddlePaddle/Paddle/pull/40612)) - -- 优化 `softmax_with_cross_entropy` 前、反向性能,对于 `hard_label` 的配置加速比为 1.3 倍左右。([#39553](https://github.com/PaddlePaddle/Paddle/pull/39553), [#40424](https://github.com/PaddlePaddle/Paddle/pull/40424), [#40643](https://github.com/PaddlePaddle/Paddle/pull/40643)) - -- 优化 `top_k` 性能,对于一维且 `k` 较大时(k=5000)的配置加速比为 22 倍以上。([#40941](https://github.com/PaddlePaddle/Paddle/pull/40941)) - -- 优化 `elementwise_mul` 反向计算,较优化前有 1.85~12.16 倍性能提升。([#37728](https://github.com/PaddlePaddle/Paddle/pull/37728)) - -- 优化 `elementwise_min` 反向和 `elementwise_max` 反向,较优化前打平或有 1.05~18.75 倍性能提升。([#38236](https://github.com/PaddlePaddle/Paddle/pull/38236), [#37906](https://github.com/PaddlePaddle/Paddle/pull/37906)) - -- 优化 `nearest_interp` 前向和反向计算,前向较优化前性能有 1.5~2.3 倍提升;反向性能较优化前有 60%~1.8 倍提升。([#38528](https://github.com/PaddlePaddle/Paddle/pull/38528), [#39067](https://github.com/PaddlePaddle/Paddle/pull/39067)) - -- 优化 `bilinear_interp` 前向和反向计算,前向较优化前性能有 0.4~2.3 倍提升;反向性能较优化前有 10%~30%提升。([#39243](https://github.com/PaddlePaddle/Paddle/pull/39243), [#39423](https://github.com/PaddlePaddle/Paddle/pull/39423)) - -- 优化 `dropout` 前向和反向计算,性能提升约 20%。([#39795](https://github.com/PaddlePaddle/Paddle/pull/39795), [#38859](https://github.com/PaddlePaddle/Paddle/pull/38859), 
[#38279](https://github.com/PaddlePaddle/Paddle/pull/38279), [#40053](https://github.com/PaddlePaddle/Paddle/pull/40053)) - -- 优化 `grid_sampler`前向和反向计算,前向较优化前性能有 10%~30%提升;反向性能较优化前有 10%~60%提升。([#39751](https://github.com/PaddlePaddle/Paddle/pull/39751)) - -- 优化 `group_norm` 前向和反向计算,前向性能提升 1.04~2.35 倍,反向性能提升 1.12~1.18 倍。([#39944](https://github.com/PaddlePaddle/Paddle/pull/39944), [#40657](https://github.com/PaddlePaddle/Paddle/pull/40657), [#39596](https://github.com/PaddlePaddle/Paddle/pull/39596)) - -- 优化 `conv1d` 前向和反向计算,前向性能提升 1.00~2.01 倍,反向性能提升 1.01~474.56 倍。([#38425](https://github.com/PaddlePaddle/Paddle/pull/38425)) - -- 优化 `elementwise_div` 反向计算,反向性能提升 1.02~29.25 倍。([#38044](https://github.com/PaddlePaddle/Paddle/pull/38044)) - -- 优化 `gelu` 前向和反向计算,前向性能提升 1.13~1.43 倍,反向性能提升 1.10~1.55 倍。([#38188](https://github.com/PaddlePaddle/Paddle/pull/38188), [#38263](https://github.com/PaddlePaddle/Paddle/pull/38263)) - -- 优化 `elementwise_sub` 反向计算,反向性能提升 1.04~15.64 倍。([#37754](https://github.com/PaddlePaddle/Paddle/pull/37754)) - -- 优化 `flip` 在输入一维数据时前向性能,性能提升 100%。([#37825](https://github.com/PaddlePaddle/Paddle/pull/37825)) - -- 优化 `layer_norm` 前向和反向计算,前向较优化前提升 2-5 倍,反向较优化前提升 20%~50%。([#39167](https://github.com/PaddlePaddle/Paddle/pull/39167), [#39247](https://github.com/PaddlePaddle/Paddle/pull/39247)) - -- 优化 `embedding` 前向和反向计算,前向较优化前最大提升 1.51 倍,反向较优化前提升 1.03~7.79 倍。([#39856](https://github.com/PaddlePaddle/Paddle/pull/39856), [#39886](https://github.com/PaddlePaddle/Paddle/pull/398866)) - -- 优化 `gelu` FP16 前向和反向计算,前向较优化前提升 9%~12%,反向较优化前提升 2%~9%。([#38980](https://github.com/PaddlePaddle/Paddle/pull/38980)) - -- 移除 `gather_nd`前反向算子中的 CPU -> GPU 显式数据传输操作,移除 `index_select` 前反向算子中的显式同步操作,将 `scatter_nd` 中的 GPU -> GPU 数据传输由同步操作改成异步操作。([#40933](https://github.com/PaddlePaddle/Paddle/pull/40933)) - -- 优化 `Lars optimzier` 计算,优化后 Resnet50 PF16 模型训练性能较优化前提升 5.1%。([#35652](https://github.com/PaddlePaddle/Paddle/pull/35652), 
[#35476](https://github.com/PaddlePaddle/Paddle/pull/35476)) - -- 优化 `AvgPool2dGrad` 计算,优化后性能较优化前提升 2.6 倍。([#35389](https://github.com/PaddlePaddle/Paddle/pull/35389)) - -- 优化 `Elementwise` 类计算对于多元输出的功能支持,优化后计算性能较优化前提升最多可达 15%。([#38329](https://github.com/PaddlePaddle/Paddle/pull/38329), [#38410](https://github.com/PaddlePaddle/Paddle/pull/38410)) - -- 优化 `Categorical`的 `probs`计算,简化计算逻辑,性能提升 4 ~ 5 倍。([#42178](https://github.com/PaddlePaddle/Paddle/pull/42178)) - -- `paddle.sum` 性能优化,性能相比优化前提升约 20%。([#42309](https://github.com/PaddlePaddle/Paddle/pull/42309)) - -#### 自动调优 - -新增训练全流程硬件感知性能自动调优功能,在图像分类、分割、检测和图像生成任务上与模型默认参数配置下的性能相比提升约 3%~50%以上。通过 `paddle.incubate.autotune.set_config` API 设置自动调优状态,当前默认关闭。自动调优具体包括三个层次: - -- `paddle.io.DataLoader` 新增自动调优功能,根据训练数据和设备资源选择最佳的模型 num_workers。([#42004](https://github.com/PaddlePaddle/Paddle/pull/42004)) - -- 新增混合精度训练数据布局自动调优功能,根据设备类型和数据类型选择最佳数据布局,并在运行时自动转换。([#41964](https://github.com/PaddlePaddle/Paddle/pull/41964)) - -- 新增 Conv 运行时所需 workspace size 阈值自动调整功能,根据 GPU 当前可申请显存资源情况来自动设置;基于通用的 AlgorithmCache 设计和 Kernel 计时组件,新增 Conv cuDNN 算法自动选择功能,支持数据变长模型。([#41833](https://github.com/PaddlePaddle/Paddle/pull/41833)) - -#### 调度优化 - -- 移除 `paddle.nn.ClipGradByGlobalNorm` 中的 CudaStreamSync 隐藏操作,减少执行时的调度开销,在 ptb 模型上有 5%的性能提升。([#42170](https://github.com/PaddlePaddle/Paddle/pull/42170)) - -- 优化一系列底层数据结构及原动态图执行体系中的细节实现,提升原动态图的调度性能。([#42010](https://github.com/PaddlePaddle/Paddle/pull/42010), [#42171](https://github.com/PaddlePaddle/Paddle/pull/42171), [#42224](https://github.com/PaddlePaddle/Paddle/pull/42224), [#42256](https://github.com/PaddlePaddle/Paddle/pull/42256), [#42306](https://github.com/PaddlePaddle/Paddle/pull/42306), [#42329](https://github.com/PaddlePaddle/Paddle/pull/42329)[, #42340](https://github.com/PaddlePaddle/Paddle/pull/42340), [#42368](https://github.com/PaddlePaddle/Paddle/pull/42368), [#42425](https://github.com/PaddlePaddle/Paddle/pull/42425)) - -- 简化 `paddle.distribution.Categorical`的 probs 计算逻辑,提升性能 4 到 5 
倍。([#42178](https://github.com/PaddlePaddle/Paddle/pull/42178)) - -### (4)问题修复 - -#### API - -- 修复 `paddle.sum` 输入参数类型和输出参数类型不一致且 `axis` 轴对应的 reduce 元素个数为 1 时,输出类型错误问题。([#36123](https://github.com/PaddlePaddle/Paddle/pull/36123)) - -- 修复 `paddle.flops` 在 layer 输出类型为 tuple 时的 `AttributeError`。([#38850](https://github.com/PaddlePaddle/Paddle/pull/38850)) - -- 修复 `paddle.diag` 因为没有反向 Kernel 而无法传播梯度的问题。([#40447](https://github.com/PaddlePaddle/Paddle/pull/40447)) - -- 修复 `paddle.sort` 输入存在 NaN 值排序错误。([#41070](https://github.com/PaddlePaddle/Paddle/pull/41070)) - -- 修复 `paddle.full_like` 输入存在 Inf 值构建 Tensor 错误。([#40232](https://github.com/PaddlePaddle/Paddle/pull/40232)) - -- 修复 `paddle.strided_slice` 在输入 starts 中数据小于 -rank 时,strided_slice 结果与 slice 不一致的 bug。([#39066](https://github.com/PaddlePaddle/Paddle/pull/39066)) - -- 修复 `max_pool` 系列算子在返回 index 时 infer_shape 计算错误的问题,受影响的 API 有 `paddle.nn.functional.max_pool1d/2d/3d`, `paddle.nn.functional.adaptive_max_pool1d/2d/3d`, `paddle.nn.MaxPool1D/2D/3D`, `paddle.nn.AdaptiveMaxPool1D/2D/3D`。([#40139](https://github.com/PaddlePaddle/Paddle/pull/40139)) - -- 修复 `max_pool` 系列算子返回的 pooling_mask 的 dtype 错误的问题,现在 pooling_mask 的 dtype 为 int32,受影响的 API 有 `paddle.nn.functional.max_pool1d/2d/3d`, `paddle.nn.functional.adaptive_max_pool1d/2d/3d`, `paddle.nn.MaxPool1D/2D/3D`, `paddle.nn.AdaptiveMaxPool1D/2D/3D`。([#39314](https://github.com/PaddlePaddle/Paddle/pull/39314)) - -- 修复 `paddle.shape` 默认存在反向梯度导致计算错误的问题。([#37340](https://github.com/PaddlePaddle/Paddle/pull/37340)) - -- 修复 `paddle.nn.Layer` 的 `to` 方法同时转换 dtype 和 place 存在的 bug。([#37007](https://github.com/PaddlePaddle/Paddle/pull/38007)) - -- 修复 `paddle.amp.decorate` 无法对非叶子网络层的参数改写为 FP16 的 bug。([#38402](https://github.com/PaddlePaddle/Paddle/pull/38402)) - -- 修复 `paddle.amp.decorate` 将 `paddle.nn.BatchNorm1D`、`paddle.nn.BatchNorm2D`、`paddle.nn.BatchNorm3D` 非输入参数改写为 FP16 的 bug。([#38541](https://github.com/PaddlePaddle/Paddle/pull/38541)) - -- 修复 `paddle.amp.decorate` 将 
`paddle.nn.SyncBatchNorm` 非输入参数改写为 FP16 的 bug。([#40943](https://github.com/PaddlePaddle/Paddle/pull/40943)) - -- 修复 `paddle.nn.Layer.to` 当中多余的 warning。([#36700](https://github.com/PaddlePaddle/Paddle/pull/36700)) - -- 修复 `paddle.nn.RNN` 在控制流下使用报错的问题。([#41162](https://github.com/PaddlePaddle/Paddle/pull/41162)) - -- 修复 `paddle.to_tensor` 无法指定 Tensor 的 CUDA Place 的问题。([#39662](https://github.com/PaddlePaddle/Paddle/pull/39662)) - -- 修复 `paddle.nn.Identity` 没有公开的问题。([#39615](https://github.com/PaddlePaddle/Paddle/pull/39615)) - -- 修复动态图重构后,`fill_` 和 `zero_` inplace API 的输入在 CUDAPinned Place 上时,输出值不正确的 bug。([#41229](https://github.com/PaddlePaddle/Paddle/pull/41229)) - -- 动态图重构后,修复使用 append op 的方式调用 assign op 导致输出 Tensor 的 inplace version 值不正确的 bug,修改为使用 `_C_ops` 的方式调用 assign op。([#41118](https://github.com/PaddlePaddle/Paddle/pull/41118)) - -- 移除 `elementwise_add` 三阶 Kernel 中不合理的代码,修复组网过程未初始化问题。([#36618](https://github.com/PaddlePaddle/Paddle/pull/36618)) - -- 修复 `conv2d` 执行 cuDNN Kernel 时属性缺失的问题。([#38827](https://github.com/PaddlePaddle/Paddle/pull/38827)) - -- 修复 `multiclass_nms3` 输出 shape 不正确的问题。([#40059](https://github.com/PaddlePaddle/Paddle/pull/40059)) - -- 修复 `yolo_box` 输出 shape 不正确的问题。([#40056](https://github.com/PaddlePaddle/Paddle/pull/40056)) - -- 修复高阶微分 `gradients` 接口在指定 target_grad 时未按预期生效的问题。([#40940](https://github.com/PaddlePaddle/Paddle/pull/40940/)) - -- 修复动态图 op`_BatchNormBase` 基类中修改了 default_dtype,导致后续组网参数类型错误的问题,受影响的 API 有 `paddle.nn.BatchNorm1D`,`paddle.nn.BatchNorm2D`,`paddle.nn.BatchNorm3D`,`paddle.nn.SyncBatchNorm`。具体原因是当 `get_default_dtype() == 'float16'` 时,通过 `set_default_dtype('float32')`修改默认参数数据类型,动态图组网的参数类型是通过 default_dtype 来创建的,因此当默认参数类型被修改后导致后续的组网参数类型错误。([#36376](https://github.com/PaddlePaddle/Paddle/pull/36376)) - -- 修复 batchnorm op 中,当数据类型为 FP32,且数据维度 `dims = 2,data_layout = NHWC` 时,反向 op 内中间变量未定义问题。([#37020](https://github.com/PaddlePaddle/Paddle/pull/37020)) - -- 修复静态图模式下,`paddle.static.nn.prelu` 对于 `NHWC` 输入格式且 `mode==channel` 
权重的 shape 错误问题。([#38310](https://github.com/PaddlePaddle/Paddle/pull/38310)) - -- 修复多机情况下,`paddle.nn.functional.class_center_sample` CUDA 种子设置 bug。([#38815](https://github.com/PaddlePaddle/Paddle/pull/38815)) - -- 修复 `paddle.nn.functional.one_hot` 在输入不正确参数时,CUDA 版本无法正确报错的问题。([#41335](https://github.com/PaddlePaddle/Paddle/pull/41335)) - -- 修复 DCU 设备上回收显存的 callback 未及时触发导致显存 OOM 的问题。([#40445](https://github.com/PaddlePaddle/Paddle/pull/40445)) - -- 修复 `setitem` 索引赋值反向梯度传递异常以及动态图部分场景下 inplace 逻辑处理异常的问题。([#37023](https://github.com/PaddlePaddle/Paddle/pull/37023), [#38298](https://github.com/PaddlePaddle/Paddle/pull/38298)) - -- 修复动转静下 Tensor array 使用 Slice 索引异常的问题。([#39251](https://github.com/PaddlePaddle/Paddle/pull/39251)) - -- 修复 `paddle.Tensor.register_hook` 接口使用时临时变量未析构,从而导致内存或显存泄漏的问题。([#40716](https://github.com/PaddlePaddle/Paddle/pull/40716)) - -- 修复 `Tensor.getitem` 当索引是全为 False 的 bool Tensor 时无法取值的问题。([#41297](https://github.com/PaddlePaddle/Paddle/pull/41297)) - -- 修复 `Tensor.getitem` 当索引是 bool scalar Tensor 时无法取值的问题。([#40829](https://github.com/PaddlePaddle/Paddle/pull/40829)) - -- 修复 `paddle.index_select` 在 index 为 0-shape Tensor 时报错的问题。([#41383](https://github.com/PaddlePaddle/Paddle/pull/41383)) - -- 修复 `paddle.index_select`,`paddle.index_sample` 申请的 GPU 线程数超过有限机器资源时报错的问题。([#41127](https://github.com/PaddlePaddle/Paddle/pull/41127), [#37816](https://github.com/PaddlePaddle/Paddle/pull/37816), [#39736](https://github.com/PaddlePaddle/Paddle/pull/39736), [#41563](https://github.com/PaddlePaddle/Paddle/pull/41563)) - -- 修复 ReduceConfig、elemwise_grad、gather、gather_nd、scatter ops 申请 GPU 线程数超过有限机器资源时报错的问题。([#40813](https://github.com/PaddlePaddle/Paddle/pull/40813), [#41127](https://github.com/PaddlePaddle/Paddle/pull/41127)) - -- 修复 Kernel Primitive API 中 ReadData,ReadDataBc,ReadDataReduce 在 NX != 1 时访存越界的问题。([#36373](https://github.com/PaddlePaddle/Paddle/pull/36373)) - -- 修复 IndexRandom 
数据类型错误导致数据溢出计算结果异常的问题。([#39867](https://github.com/PaddlePaddle/Paddle/pull/39867), [#39891](https://github.com/PaddlePaddle/Paddle/pull/39891)) - -- 修复 reduce op 在 reduce_num = 1 计算结果返回错误的问题。([#38771](https://github.com/PaddlePaddle/Paddle/pull/38771)) - -- 修复 reduce op 在 HIP 环境下 reduce 中间维度出现访存越界的问题。([#41273](https://github.com/PaddlePaddle/Paddle/pull/41273)) - -- 修复 matmul op 两个 FP16 一维向量计算时 Kernel 无法正常释放的问题。 - -- 修复部分算子在 CUDA 上因整型计算溢出导致的问题,包括:bernoulli、gaussian_random、gumbel_softmax、multinomial、truncated_gaussian_random、uniform_random_inplace、uniform_random ops。([#37670](https://github.com/PaddlePaddle/Paddle/pull/37670)) - -- 修复 `paddle.nn.Sequential` 在 for 循环遍历 sublayers 时会报 KeyError 错误的 bug。([#39372](https://github.com/PaddlePaddle/Paddle/pull/39372)) - -- 修复 `paddle.nn.functional.unfold` 在静态图下编译时检查 shape 错误的 bug。([#38907](https://github.com/PaddlePaddle/Paddle/pull/38907), [#38819](https://github.com/PaddlePaddle/Paddle/pull/38819)) - -- 修复静态图使用 dropout 时如果指定了 `axis` 后会报错的问题。([#37223](https://github.com/PaddlePaddle/Paddle/pull/37223)) - -- 迁移 `paddle.nn.MultiHeadAttention`中 matmul 算子到 matmul_v2 算子。([#36222](https://github.com/PaddlePaddle/Paddle/pull/36222)) - -- 修复 `paddle.nn.functional.label_smooth`在输入为空 Tensor 时抛出 FPE 的问题。([#35861](https://github.com/PaddlePaddle/Paddle/pull/35861)) - -- 修复 reshape op 空 Tensor 形变问题, 支持将空 Tensor rehape 成[-1]。([#36087](https://github.com/PaddlePaddle/Paddle/pull/36087)) - -- 修复 `fill_diagonal`参数 offset 非零时会造成修改值跨行问题。([#36212](https://github.com/PaddlePaddle/Paddle/pull/36212)) - -- 修改动态图模式下 range op 返回 stop gradient 设置成 True。([#37486](https://github.com/PaddlePaddle/Paddle/pull/37486)) - -- 修复 Lamb 优化器当 Beta1Pow 和 Beta2Pow 在 GPU 上时更新错误的 bug。([#38518](https://github.com/PaddlePaddle/Paddle/pull/38518)) - -- 修复 conv2d 算子 FLAGS_cudnn_deterministic 设置不生效的问题。([#37173](https://github.com/PaddlePaddle/Paddle/pull/37173)) - -- 修复因早期版本的 cufft 没有定义 CUFFT_VERSION 引发的问题。([#37312](https://github.com/PaddlePaddle/Paddle/pull/37312)) - 
-- 修复 `paddle.ifftshit`, `paddle.fftshift` 计算错误问题。([#36834](https://github.com/PaddlePaddle/Paddle/pull/36834), [#36748](https://github.com/PaddlePaddle/Paddle/pull/36748)) - -- 修复 `paddle.fft` 系列 API 中的 `axis` 计算错误。([#36321](https://github.com/PaddlePaddle/Paddle/pull/36321)) - -- 修复 batch_norm_grad op 在 FP16 数据类型时输出数据类型注册的 bug,该 bug 会导致部分场景下编译失败,并且对 FP16 计算精度会有一定影响。([#42461](https://github.com/PaddlePaddle/Paddle/pull/42461)) - -- 修复 `paddle.nn.functional.pad` API 在模型动转静时,padding 为 Tensor 条件下的 Infershape 信息错误问题。([#42414](https://github.com/PaddlePaddle/Paddle/pull/42414)) - -- 修复 `paddle.distribution.StickBreakingTransform` 输入维度超过 2 时异常的问题。([#41762](https://github.com/PaddlePaddle/Paddle/pull/41672)) - -- 修复 fused_attention op 中 QK^T 计算出 nan/inf 的问题。([#42032](https://github.com/PaddlePaddle/Paddle/pull/42032)) - -- 修复 fused_attention op 中 FusedResidualDropoutBias 在 V100 上计算出 nan/inf 问题。([#42398](https://github.com/PaddlePaddle/Paddle/pull/42398)) - -- 修复 full_like op 在执行时引入的多余的 data transform 问题。([#41973](https://github.com/PaddlePaddle/Paddle/pull/41973)) - -- 修复 p_norm op 在 GPU 环境上计算 nan 的问题。([#41804](https://github.com/PaddlePaddle/Paddle/pull/41804)) - -- 修复 split op 在参数 sections 存在为 0 的 size 情况下,段错误的问题。([#41755](https://github.com/PaddlePaddle/Paddle/pull/41755)) - -- 修复 6 个 elementwise op(pow、complex、divide_double、multiply_double、fmax、fmin)在需要 broadcast 的情况下,多卡训练时报 Place(gpu:0) 不支持的问题。([#42332](https://github.com/PaddlePaddle/Paddle/pull/42332)) - -- 修复 import paddle 时由于 PIL 版本升级导致的废弃接口报 warning 的问题。([#42307](https://github.com/PaddlePaddle/Paddle/pull/42307)) - -- 修复静态图下 `paddle.linalg.matrix_rank`不支持 tol 为 FP64 Tensor 的问题。([#42085](https://github.com/PaddlePaddle/Paddle/pull/42085)) - -#### IR(Intermediate Representation) - -- 动态图转静态图 - - - 修复 `tensor_array` 搭配控制流使用时,在反向梯度累加时存在的类型推导错误问题。([#39585](https://github.com/PaddlePaddle/Paddle/pull/39585), [#39689](https://github.com/PaddlePaddle/Paddle/pull/39689)) - - - 修复动转静 AMP 
训练时参数梯度类型未被正确设置的问题。([#40938](https://github.com/PaddlePaddle/Paddle/pull/40938)) - - - 修复代码中存在错位注释时,动转静代码解析报错的问题。([#39035](https://github.com/PaddlePaddle/Paddle/pull/39035), [#38003](https://github.com/PaddlePaddle/Paddle/pull/38003)) - - - 修复动转静代码中调用非 forward 函数时,Tensor 未被正确转化为 Variable 的问题。([#37296](https://github.com/PaddlePaddle/Paddle/pull/37296), [#38540](https://github.com/PaddlePaddle/Paddle/pull/38540)) - - - 修复动转静代码转写时 `paddle` 被错误地作为变量传递的问题。([#37999](https://github.com/PaddlePaddle/Paddle/pull/37999)) - - - 修复模型动转静后调用 `paddle.flops` 时模型参数统计错误的问题。([#36852](https://github.com/PaddlePaddle/Paddle/pull/36852)) - - - 修复使用 `paddle.jit.save/load` 接口加载模型后,在 train 模式和 no_grad 上下文中,显存会一直增长的问题。([#36434](https://github.com/PaddlePaddle/Paddle/pull/36434)) - - - 添加在 convert_call 对 generator function 转换时的警告。([#35369](https://github.com/PaddlePaddle/Paddle/pull/35369)) - - - 修复 run_program op 依赖分析的问题。([#38470](https://github.com/PaddlePaddle/Paddle/pull/38470)) - - - 修复控制流 For 中返回单值时代码转换的问题。([#40683](https://github.com/PaddlePaddle/Paddle/pull/40683)) - - - 修复控制流 cond 的输入包含 LoDTensorArray 时,生成反向 op 会报错的问题。([#39585](https://github.com/PaddlePaddle/Paddle/pull/39585)) - - - 修复 `padddle.jit.save`在导出动转静模型时丢失顶层 Layer 的 forward_pre_hook 和 forward_post_hook 的问题。([#42273](https://github.com/PaddlePaddle/Paddle/pull/42273)) - - - 修复 `paddle.expand`中 shape 参数包含 Tensor 在动转静时会转换报错的问题。([#41973](https://github.com/PaddlePaddle/Paddle/pull/41973)) - -#### 分布式训练 - -- 分布式训练基础功能 - - - 修复分布式多机训练时,端口报错的问题。([#37274](https://github.com/PaddlePaddle/Paddle/pull/37274)) - - - 修复 brpc 编译依赖问题。([#37064](https://github.com/PaddlePaddle/Paddle/pull/37064)) - - - 修复 Fleet 启动时,由于 tcp 自连接产生的端口被占用的问题。([#38174](https://github.com/PaddlePaddle/Paddle/pull/38174)) - - - 修复数据并行下,由于 FP16 参数在多卡下初始化不一致,导致精度下降的问题。([#38838](https://github.com/PaddlePaddle/Paddle/pull/38838), [#38563](https://github.com/PaddlePaddle/Paddle/pull/38563), [#38405](https://github.com/PaddlePaddle/Paddle/pull/38405)) - - - 
修复数据并行下,由于 FP16 梯度同步时,没有除以卡数,导致精度下降的问题。([#38378](https://github.com/PaddlePaddle/Paddle/pull/38378)) - -- 动态图混合并行 - - - 修复在混合并行下,通过使用新 update 接口,FP16 模式不更新参数的问题。([#36017](https://github.com/PaddlePaddle/Paddle/pull/36017)) - -- 静态图混合并行 - - - 修复分布式 dp 模式下 grad merge 与 ClipGradientByGlobalNorm 不兼容的问题。([#36334](https://github.com/PaddlePaddle/Paddle/pull/36334)) - - - 修复混合并行下,张量模型并行的非分布式参数在初始化阶段未被广播,导致各卡非分布式参数不一致的问题。([#36186](https://github.com/PaddlePaddle/Paddle/pull/36186)) - - - 修复 sharding 开启 offload 时,sharding 的 save_persistables 接口未保存 FP16 参数和 offload 持久化变量的问题。([#40477](https://github.com/PaddlePaddle/Paddle/pull/40477)) - - - 修复开启 sharding 训练时,ema 参数在非 0 号卡上无法保存的问题。([#39860](https://github.com/PaddlePaddle/Paddle/pull/39860)) - - - 修复 FC 按照列切分梯度计算错误的问题。([#38724](https://github.com/PaddlePaddle/Paddle/pull/38724)) - - - 修复 DistributedStrategy 设置为 without_graph_optimizer 时和 rnn 一起使用报错的问题。([#36176](https://github.com/PaddlePaddle/Paddle/pull/36176)) - -- GPUPS 参数服务器训练 - - - 修复 GPUPS 宏定义触发 CPU 分支编译问题。([#37248](https://github.com/PaddlePaddle/Paddle/pull/37248)) - - - 修复 GPUPS 流水线训练时在保存 delta 和 pullsparse 并发时引发的偶发报错问题。([#37233](https://github.com/PaddlePaddle/Paddle/pull/37233)) - - - 修复 HDFSClient 查询目录未返回全路径,引发下载报错问题。([#36590](https://github.com/PaddlePaddle/Paddle/pull/36590)) - - - 修复 GPUPS 流水线训练时拉取老参数问题。([#36512](https://github.com/PaddlePaddle/Paddle/pull/36512)) - - - 修复 GPUPS 多流 allocation 问题。([#37476](https://github.com/PaddlePaddle/Paddle/pull/37476)) - - - 修复 GPUPS pybind 出 core 的问题。([#37287](https://github.com/PaddlePaddle/Paddle/pull/37287)) - -#### 其他 - -- 修复动态图量化训练保存模型时 clip_extra 的问题。([#38323](https://github.com/PaddlePaddle/Paddle/pull/38323)) - -- 修复动态图量化训练 abs_max scale 初始化的问题。([#39307](https://github.com/PaddlePaddle/Paddle/pull/39307)) - -- 修复动态图量化训练保存模型节点异常的问题。([#38102](https://github.com/PaddlePaddle/Paddle/pull/38102), [#38012](https://github.com/PaddlePaddle/Paddle/pull/38012)) - -- 修复离线量化 flatten op 
输出错误问题。([#37722](https://github.com/PaddlePaddle/Paddle/pull/37722)) - -- 修复了反量化 matmul op 时,维度对不上的问题。([#36982](https://github.com/PaddlePaddle/Paddle/pull/36982)) - -- 修复了量化无权重的 matmul_v2 时,错误添加量化 op 的问题。([#36593](https://github.com/PaddlePaddle/Paddle/pull/36593)) - -- 修复 conv op channel wise 量化在保存模型时 quant_axis 属性保存错误。([#39054](https://github.com/PaddlePaddle/Paddle/pull/39054)) - -- 修复 ChannelWise 量化训练速度慢的问题。([#40772](https://github.com/PaddlePaddle/Paddle/pull/40772)) - -- 修复量化训练初始化为 0 的 Tensor 出 NAN 的问题。([#36762](https://github.com/PaddlePaddle/Paddle/pull/36762)) - -- 修复多线程场景下混合精度 amp_level 设置错误问题。([#39198](https://github.com/PaddlePaddle/Paddle/pull/39198)) - -- 修复混合精度训练与 PyLayer,Recompute 等一起使用时,PyLayer 和 Recompute 中未正确设置混合精度的问题。([#39950](https://github.com/PaddlePaddle/Paddle/pull/39950), [#40042](https://github.com/PaddlePaddle/Paddle/pull/40042)) - -- 修复了 Mac 下编译自定义算子时 `D_GLIBCXX_USE_CXX11_ABI` 未生效的问题。([#37878](https://github.com/PaddlePaddle/Paddle/pull/37878)) - -- 修复 initializer 相关 API 在 block=None 时动静行为不统一的问题。([#37827](https://github.com/PaddlePaddle/Paddle/pull/37827)) - -- 修复 python3.6 环境下没有 fluid 模块的 bug。([#35862](https://github.com/PaddlePaddle/Paddle/pull/35862)) - -- 修复优化器 `paddle.optimizer.Adamw` 错误调用 adam op 的 bug。([#36028](https://github.com/PaddlePaddle/Paddle/pull/36028)) - -- 修复 multi tensor 策略下 `paddle.optimizer.Momentum` 优化器参数 `regularizer` 属性为 None 时的逻辑错误。([#38344](https://github.com/PaddlePaddle/Paddle/pull/38344)) - -- 修复 multi tensor 策略下 `paddle.optimizer.Momentum`、`paddle.optimizer.Adam` 优化器会对 `multi_precision` 属性进行修改的错误。([#38991](https://github.com/PaddlePaddle/Paddle/pull/38991)) - -- 修复最终态 API amp 与 optional 类型 Tensor 组合使用的代码编译错误。([#40980](https://github.com/PaddlePaddle/Paddle/pull/40980)) - -- 修复 paddle+lite+xpu 预测库调用 lite CPU 预测时会报错的 bug,修复 paddle+lite(without NNAdapter) 编译时会报错的 bug。([#37449](https://github.com/PaddlePaddle/Paddle/pull/37449)) - -- 修复 Debug 编译模式下 LoDTensorArray 因 Pybind11 绑定不一致导致 crash 的 
bug。([#37954](https://github.com/PaddlePaddle/Paddle/pull/37954)) - -- 修复 shape 参数为 Tensor 和 int 构成列表的极端情况下,无法正确构建 Tensor 的 bug。([#38284](https://github.com/PaddlePaddle/Paddle/pull/38284)) - -- 修复 `paddle.optimizer.AdamW` API 兼容性问题。([#37905](https://github.com/PaddlePaddle/Paddle/pull/37905)) - -- 修复 _InstanceNormBase 中 extra_repr 的返回错误。([#38537](https://github.com/PaddlePaddle/Paddle/pull/38537)) - -- 修复联编开启 -DWITH_DISTRIBUTED 生成 Paddle Inference 缺少符号 `paddle::distributed::TensorTable` 的问题。([#41128](https://github.com/PaddlePaddle/Paddle/pull/41128)) - -- matmul_v2 op 新增 shape check,在 shape 中存在 0 值进行信息报错。([#35791](https://github.com/PaddlePaddle/Paddle/pull/35791)) - -- 修复动态图 recompute 对于没有梯度输入提示信息反复打印,改成用 warning 只打印一次的方式。([#38293](https://github.com/PaddlePaddle/Paddle/pull/38293)) - -- 修复 gelu op 在视觉模型中训练后期在验证集上精度低的问题。([#38450](https://github.com/PaddlePaddle/Paddle/pull/38450)) - -- 修复 adamw op 在数值计算上误差问题。([#37746](https://github.com/PaddlePaddle/Paddle/pull/37746)) - -- 补充 sparse_momentum `_C_ops` 接口 MasterParam 和 MasterParamOut 参数。([#39969](https://github.com/PaddlePaddle/Paddle/pull/39969)) - -- 修复 python3.6 环境下没有 `distributed` 模块的 bug。([#35848](https://github.com/PaddlePaddle/Paddle/pull/35848)) - -- 修复 eigh 单元测试数据初始化问题。([#39568](https://github.com/PaddlePaddle/Paddle/pull/39568)) - -- 修复 eigvalsh 单元测试数据初始化问题。([#39841](https://github.com/PaddlePaddle/Paddle/pull/39841)) - -- 修复 segment op 在 V100 上寄存器使用过多导致不能正常运行的问题。([#38113](https://github.com/PaddlePaddle/Paddle/pull/38113)) - -- 修复 conv 相关算子稀疏化维度错误的问题。([#36054](https://github.com/PaddlePaddle/Paddle/pull/36054)) - -- 提供自动稀疏训练(Automatic SParsity)静态图相关功能 Alias 至 `Paddle.static.sparsity`。([#36525](https://github.com/PaddlePaddle/Paddle/pull/36525)) - -- 修复 divide op 整数除法还是整数的 bug。([#40890](https://github.com/PaddlePaddle/Paddle/pull/40890)) - -- 修复 `paddle.multiplex` 候选 Tensor 大小为 0 崩溃问题。([#34972](https://github.com/PaddlePaddle/Paddle/pull/34972)) - -- 修复 `paddle.kl_div` 参数 `reduction` 
给定情况下速度异常的问题。([#37283](https://github.com/PaddlePaddle/Paddle/pull/37283)) - -- 修复 Cifar 数据集加载 data source 无序的问题。([#37272](https://github.com/PaddlePaddle/Paddle/pull/37272)) - -- 修复 ProgressBar 类中 loss 从 uint16 到 float 的转换。([#39231](https://github.com/PaddlePaddle/Paddle/pull/39231)) - -- 修复 ShareBufferWith 共享数据类型的问题。([#37464](https://github.com/PaddlePaddle/Paddle/pull/37464), [#37247](https://github.com/PaddlePaddle/Paddle/pull/37247)) - -- 修复 `paddle.io.DataLoader` 使用 IterableDataset 并且 num_workers>0 时的性能问题。([#40541](https://github.com/PaddlePaddle/Paddle/pull/40541)) - -- 修复 `paddle.vision.ops.yolo_loss` 动态图返回值不全的问题。([#40185](https://github.com/PaddlePaddle/Paddle/pull/40185)) - -- 移出 `paddle.io.BatchSampler` 对输入参数 dataset 需要是 `paddle.io.Dataset` 类型的限制,扩大对用户自定义数据集的支持。([#40184](https://github.com/PaddlePaddle/Paddle/pull/40184)) - -- 修复 `paddle.summary` 报错 op_flops 不存在的问题。([#36489](https://github.com/PaddlePaddle/Paddle/pull/36489)) - -- 修复 lars_momentum op 在 lars_weight_decay=0 时公式错误的问题。([#40892](https://github.com/PaddlePaddle/Paddle/pull/40892)) - -- 修复 optimize-offload 无法保存 presistable var 的问题。([#36433](https://github.com/PaddlePaddle/Paddle/pull/36433)) - -- 修复 optimizer-offload 不支持 adamw op type 的问题。([#36432](https://github.com/PaddlePaddle/Paddle/pull/36432)) - -- 修复多线程场景下,Tracer 中 enable_program_desc_tracing_数据不安全的问题。([#39776](https://github.com/PaddlePaddle/Paddle/pull/39776)) - -- 修复模型读取时模型档案大小未初始化的问题。([#40518](https://github.com/PaddlePaddle/Paddle/pull/40518)) - -- 修复 Expand op 逻辑 bug,当输入 Tensor X 的维度,小于要拓展的 shape 时,可能导致取得 Out.Shape 是错误的。([#38677](https://github.com/PaddlePaddle/Paddle/pull/38677)) - -- 修复 Expand_As op 只取 y.shape,而没有 Y 变量输入时,导致的动转静报错。([#38677](https://github.com/PaddlePaddle/Paddle/pull/38677)) - -- 修复 Expand_As op 计算输出 shape 时逻辑的错误。([#38677](https://github.com/PaddlePaddle/Paddle/pull/38677)) - - -- 修复 `core.VarDesc.VarType.STRINGS` 类型的变量获取 `lod_level` 属性报错的问题,并且设置其 `lod_level` 为 
None。([#39077](https://github.com/PaddlePaddle/Paddle/pull/39077)) - -- 修复框架功能 `PyLayer` 不支持不同 dtype 的问题。([#37974](https://github.com/PaddlePaddle/Paddle/pull/37974)) - -- 修复了学习率衰减 API `paddle.optimizer.lr.PolynomialDecay` 的零除问题。([#38782](https://github.com/PaddlePaddle/Paddle/pull/38782)) - -- 修复调用 DisableGlogInfo() 接口后依旧残留部分日志的问题。([#36356](https://github.com/PaddlePaddle/Paddle/pull/36356)) - -- 修复 SimpleRNN、GRU 和 LSTM API CPU 训练时多层 RNN(dropout 设置为 0 时)反向计算出错的问题。([#37080](https://github.com/PaddlePaddle/Paddle/pull/37080)) - -- 为 cufft 和 hipfft 后端的 fft 添加了 cache。([#36646](https://github.com/PaddlePaddle/Paddle/pull/36646)) - -- 使 `paddle.roll` 的 shifts 参数支持传入 Tensor。([#36727](https://github.com/PaddlePaddle/Paddle/pull/36727)) - -- 为 fft 添加 onemkl 作为可选的计算后端。([#36414](https://github.com/PaddlePaddle/Paddle/pull/36414)) - -- 修复 mamtul_v2 和 elementwise_div 两个 op 在 bfloat16 类型下的精度问题。([#42479](https://github.com/PaddlePaddle/Paddle/pull/42479)) - -- 修复显存回收时 LoDTensorArray 只清理内部 Tensor 而未清空 Array 导致的下个 step 可能出错的问题。([#42398](https://github.com/PaddlePaddle/Paddle/pull/42398)) - -## 4. 
部署方向(Paddle Inference) - -### (1)新增特性 - -#### 新增 API - -- 增加 Java API,Java 开发者可以通过简单灵活的接口实现在服务端和云端的高性能推理。([#37162](https://github.com/PaddlePaddle/Paddle/pull/37162)) - -- 增加 `GetTrtCompileVersion` 和 `GetTrtRuntimeVersion` 接口,用于获取 TensorRT 版本信息。([#36429](https://github.com/PaddlePaddle/Paddle/pull/36429)) - -- 增加 `ShareExternalData` 接口,避免推理时对输入数据进行内存拷贝。([#39809](https://github.com/PaddlePaddle/Paddle/pull/39809)) - -#### 新增功能 - -- 新增 ONNX Runtime 后端支持,当前集成版本只支持 CPU。([#39988](https://github.com/PaddlePaddle/Paddle/pull/39988), [#40561](https://github.com/PaddlePaddle/Paddle/pull/40561)) - -- 基于 Paddle Lite 子图方式,新增昇腾 310 推理支持。([#35226](https://github.com/PaddlePaddle/Paddle/pull/35226)) - -- 新增原生 GPU FP16 推理功能。([#40531](https://github.com/PaddlePaddle/Paddle/pull/40531)) - -- switch_ir_debug 接口增加 dump 模型的功能。([#36581](https://github.com/PaddlePaddle/Paddle/pull/36581)) - -- 新增 TensorRT config 的配置接口:`void UpdateConfigInterleaved(paddle_infer::Config* c, bool with_interleaved)`,用于 int8 量化推理中特殊的数据排布。([#38884](https://github.com/PaddlePaddle/Paddle/pull/38884)) - -- log 中增加 TensorRT inspector 输出信息,仅在 TensorRT 8.2 及以上版本有效。([#38362](https://github.com/PaddlePaddle/Paddle/pull/38362),[#38200](https://github.com/PaddlePaddle/Paddle/pull/38200))) - -- 增加 TensorRT ASP 稀疏推理支持。([#36413](https://github.com/PaddlePaddle/Paddle/pull/36413)) - -### (2)底层优化 - -#### CPU 性能优化 - -- 优化 MKLDNN 的缓存机制。([#38336](https://github.com/PaddlePaddle/Paddle/pull/38336), [#36980](https://github.com/PaddlePaddle/Paddle/pull/36980), [#36695](https://github.com/PaddlePaddle/Paddle/pull/36695)) - -- 新增 matmul_scale_fuse pass。([#37962](https://github.com/PaddlePaddle/Paddle/pull/37962)) - -- 新增 MKLDNN reshape_transpose_matmul_v2_mkldnn_fuse_pass。([#37847](https://github.com/PaddlePaddle/Paddle/pull/37847), [#40948](https://github.com/PaddlePaddle/Paddle/pull/40948)) - -- 新增 MKLDNN conv_hard_sigmoid_mkldnn_fuse_pass。([#36869](https://github.com/PaddlePaddle/Paddle/pull/36869)) - -- 新增 MKLDNN 
matmul_v2_transpose_reshape_fuse_pass。([#36481](https://github.com/PaddlePaddle/Paddle/pull/36481)) - -- 新增 MKLDNN softplus_activation_mkldnn_fuse_pass。([#36657](https://github.com/PaddlePaddle/Paddle/pull/36657)) - -- 新增 MKLDNN elt_act_mkldnn_fuse_pass。([#36541](https://github.com/PaddlePaddle/Paddle/pull/36541)) - -- 新增 MKLDNN mish 算子及 conv_mish_mkldnn_fuse_pass。([#38623](https://github.com/PaddlePaddle/Paddle/pull/38623)) - -#### GPU 性能优化 - -- 将推理默认的显存分配策略由 `naive_best_fit` 变更为 `auto_growth`,解决部分模型占满 GPU 显存问题。([#41491](https://github.com/PaddlePaddle/Paddle/pull/41491)) - -- 支持 gelu、FC+gelu ops 使用 TensorRT 推理。([#38399](https://github.com/PaddlePaddle/Paddle/pull/38399))合作团队 - -- 支持 `deformable_conv` 在静态 shape 下使用 TensorRT 推理。([#36612](https://github.com/PaddlePaddle/Paddle/pull/36612) [#36850](https://github.com/PaddlePaddle/Paddle/pull/36850) [#37345](https://github.com/PaddlePaddle/Paddle/pull/37345)) - -- 支持 nearest_interp_v2 op 使用 TensorRT 推理。([#34126](https://github.com/PaddlePaddle/Paddle/pull/34126)) - -- 增加 `yolo_box`TensorRT plugin,支持输入参数 `iou_aware` 和 `iou_aware_factor`,使推理计算得到的 IoU 作为置信度的因子。([#34128](https://github.com/PaddlePaddle/Paddle/pull/34128)) - -- 支持 `elementwise_sub` 和 `elementwise_div` 调用 TensorRT 推理。([#40806](https://github.com/PaddlePaddle/Paddle/pull/40806) [#41253](https://github.com/PaddlePaddle/Paddle/pull/41253)) - -- 支持 `multiclass_nms3` 使用 TensorRT 推理。([#41181](https://github.com/PaddlePaddle/Paddle/pull/41181) [#41344](https://github.com/PaddlePaddle/Paddle/pull/41344)) - -- 支持 flatten_contiguous_rang op 使用 TensorRT 推理。([#38922](https://github.com/PaddlePaddle/Paddle/pull/38922)) - -- 支持 `pool2d` 属性 `padding` 的维度为 4、`global_pooling` 和 `ceil_mode` 为 True 情况下使用 TensorRT 推理。([#39545](https://github.com/PaddlePaddle/Paddle/pull/39545)) - -- 支持 batch_norm 和 elementwise_add 为 5 维时使用 TensorRT 推理。([#36446](https://github.com/PaddlePaddle/Paddle/pull/36446)) - -- 新增 pool3d 使用 TensorRT 
推理。([#36545](https://github.com/PaddlePaddle/Paddle/pull/36545), [#36783](https://github.com/PaddlePaddle/Paddle/pull/36783)) - -- 增加 `reduce` int32 和 float 类型使用 TensorRT 推理,增加 `reduce_mean` GPU 算子 int32、int64 注册。([#39088](https://github.com/PaddlePaddle/Paddle/pull/39088)) - -- 修改 MatmulV2ToMul pass,修改限定条件(不支持广播)和 op_teller 映射条件。([#36652](https://github.com/PaddlePaddle/Paddle/pull/36652)) - -- 增加 TenorRT plugin 接口 AddPluginV2IOExt 的支持。([#36493](https://github.com/PaddlePaddle/Paddle/pull/36493)) - -- 增加 roi_align op 中 aligned 属性并支持 TensorRT 推理。([#38905](https://github.com/PaddlePaddle/Paddle/pull/38905)) - -- 增加 concat 属性 `axis = -1` 时支持 TensorRT 推理。([#39096](https://github.com/PaddlePaddle/Paddle/pull/39096)) - -- 新增 TensorRT plugin :preln_emb_eltwise_layernorm、 preln_skip_la、rnorm ops, 用于 ERNIE 类模型性能优化。([#39570](https://github.com/PaddlePaddle/Paddle/pull/39570)) - -- 新增 TensorRT fuse pass:preln_embedding_eltwise_layernorm_fuse_pass, preln_skip_layernorm_fuse_pass,用于 ERNIE 类模型性能优化。([#39508](https://github.com/PaddlePaddle/Paddle/pull/39508)) - -- 将 matmul 融合相关的 pass 基于不同的后端(GPU、CPU、TensorRT)拆开,支持 FC 权重的转置功能。([#39369](https://github.com/PaddlePaddle/Paddle/pull/39369)) - -- 新增 roll、strided_slice、slice op 在动态 shape 的情况下对 TensorRT 的支持。([#41913](https://github.com/PaddlePaddle/Paddle/pull/41913), [#41573](https://github.com/PaddlePaddle/Paddle/pull/41573), [#41467](https://github.com/PaddlePaddle/Paddle/pull/41467)) - -- 新增 div op 对 TensorRT 的支持。([#41243](https://github.com/PaddlePaddle/Paddle/pull/41243)) - -- 量化支持 - - - `PostTrainingQuantization` API 新增支持`paddle.io.DataLoader` 对象或者 `Python Generator`的输入。([#38686](https://github.com/PaddlePaddle/Paddle/pull/38686)) - - - ERNIE 全量化模型推理支持 interleaved 数据排布。([#39424](https://github.com/PaddlePaddle/Paddle/pull/39424)) - - - 支持 PaddleSlim 新量化模型格式推理。([#41049](https://github.com/PaddlePaddle/Paddle/pull/41049)) - - - 新增 matmul int8 量化的推理 op converter 和 plugin。([#37285](https://github.com/PaddlePaddle/Paddle/pull/37285)) 
- - - 新增判断模型所有 op 能否支持 int8 量化的 pass。([#36042](https://github.com/PaddlePaddle/Paddle/pull/36042)) - - - 支持 multihead attention 非变长分支中 FC 部分的量化推理。([#39660](https://github.com/PaddlePaddle/Paddle/pull/39660)) - -#### 昇腾 NPU 相关功能 - -- - 重构 shape 算子前向计算逻辑,支持在 NPU 上执行。([#39613](https://github.com/PaddlePaddle/Paddle/pull/39613)) - - - 重构 reshape 算子前向计算逻辑,支持 ShapeTensor 输入。([#38748](https://github.com/PaddlePaddle/Paddle/pull/38748)) - - - 模型权重加载时精度类型统一。([#39160](https://github.com/PaddlePaddle/Paddle/pull/39160)) - -### (3)问题修复 - -#### 框架及 API 修复 - -- 修复保存静态图时模型剪裁的问题。([#37579](https://github.com/PaddlePaddle/Paddle/pull/37579)) - -- C API 增加对的字符串的封装 PD_Cstr,并提供构造和析构的方式,避免用户直接使用 C 运行时库来析构字符串。([#38667](https://github.com/PaddlePaddle/Paddle/pull/38667)) - -- 修复预测时内存复用的逻辑问题。([#37324](https://github.com/PaddlePaddle/Paddle/pull/37324)) - -- 修复多线程下内存复用报错问题。([#37894](https://github.com/PaddlePaddle/Paddle/pull/37894)) - -- 在没有权重文件时,允许传递空字符串进行推理。([#38579](https://github.com/PaddlePaddle/Paddle/pull/38579)) - -- 修复开启 TensorRT dynamic shape 后不支持 clone 问题。([#38520](https://github.com/PaddlePaddle/Paddle/pull/38520)) - -- 修复开启 TensorRT dynamic shape 后多线程 clone 报错问题。([#40067](https://github.com/PaddlePaddle/Paddle/pull/40067)) - -- 修复 TensorRT engine 析构问题。([#35842](https://github.com/PaddlePaddle/Paddle/pull/35842), [#35938](https://github.com/PaddlePaddle/Paddle/pull/35938)) - -- lite xpu 接口修复无法选择 xpu 卡的问题。([#36610](https://github.com/PaddlePaddle/Paddle/pull/36610)) - -- TensorRT 动态 shape 参数自动生成接口增加文件存在性检查。([#36628](https://github.com/PaddlePaddle/Paddle/pull/36628)) - -- 修复 MKLDNN 不支持 conv3d 的问题。([#42055](https://github.com/PaddlePaddle/Paddle/pull/42055)) - -#### 后端能力修复 - -- 修复预测时 cuDNN 默认算法选择配置,使用非 deterministic 策略。([#41491](https://github.com/PaddlePaddle/Paddle/pull/41491)) - -- 修复 deformable_conv op 在 TensorRT plugin 资源回收处理错误的问题。([#38374](https://github.com/PaddlePaddle/Paddle/pull/38374)) - -- 修复 deformable_conv op 在 TensorRT plugin 
序列化错误问题。([#38057](https://github.com/PaddlePaddle/Paddle/pull/38057)) - -- 适配 TensorRT 8.0 新的构建引擎和系列化 API。([#36769](https://github.com/PaddlePaddle/Paddle/pull/36769)) - -- 修复 Flatten2MatmulFusePass、Squeeze2MatmulFusePass、Reshape2MatmulFusePass 没有生效问题。([#37644](https://github.com/PaddlePaddle/Paddle/pull/37644)) - -- 修复 TensorRT 输入数据在上时报错的问题。([#37427](https://github.com/PaddlePaddle/Paddle/pull/37427)) - -- 增加输入维度错误时的报错信息。([#38962](https://github.com/PaddlePaddle/Paddle/pull/38962)) - -- 修复 EmbEltwiseLayernorm 输出类型错误的问题。([#40015](https://github.com/PaddlePaddle/Paddle/pull/40015)) - -- 删除 conv_affine_channel_fuse_pass 以及对应的单元测试。([#39817](https://github.com/PaddlePaddle/Paddle/pull/39817)) - -- 修复 adaptive_pool2d pass 错误替换 pool 属性的问题。([#39600](https://github.com/PaddlePaddle/Paddle/pull/39600)) - -- 修复 shuffle_channel_detect_pass 错误生成 shuffle_channel op 的问题。([#39242](https://github.com/PaddlePaddle/Paddle/pull/39242)) - -- 修复 transpose 参数错误。([#39006](https://github.com/PaddlePaddle/Paddle/pull/39006)) - -- 修复 nearest_interp_v2 输入 scale 维度小于 1 时崩溃的问题。([#38725](https://github.com/PaddlePaddle/Paddle/pull/38725)) - -- 修复 prelu 在 dynamic shape 时不支持一维输入的问题。([#39389](https://github.com/PaddlePaddle/Paddle/pull/39389)) - -- 修复 slice 的 special_slice_plugin 的核函数计算错误的问题。([#39875](https://github.com/PaddlePaddle/Paddle/pull/39875)) - -- 暂时禁用 skip_layernorm 变长下的 int8 分支,防止精度下降。([#39991](https://github.com/PaddlePaddle/Paddle/pull/39991)) - -- 修复关于支持 preln_ernie 模型的一些 bug。([#39733](https://github.com/PaddlePaddle/Paddle/pull/39733)) - -- 修复 slice 在 ERNIE 中 threads 可能超过限制的 bug,修复 spacial_slice 误触的 bug。([#39096](https://github.com/PaddlePaddle/Paddle/pull/39096)) - -- 修复 elementwise 在维度相同时不支持广播的问题。([#37908](https://github.com/PaddlePaddle/Paddle/pull/37908)) - -- 修复 nearest_interp op 当 align_corners 为 True 时,TensorRT layer 的结果和原生 op 的结果有 diff,底层实现不一样。([#37525](https://github.com/PaddlePaddle/Paddle/pull/37525)) - -- 修复 
qkv_plugin:核函数计算错误。([#37096](https://github.com/PaddlePaddle/Paddle/pull/37096)) - -- 修复动态量化的推理 pass 的问题。([#35879](https://github.com/PaddlePaddle/Paddle/pull/35879)) - -- 当 Tensor 请求的内存容量低于已分配的 size 时直接复用。([#37880](https://github.com/PaddlePaddle/Paddle/pull/37880)) - -- 修复 ERNIE 定长模型开启 TensorRT 出现的 hang 问题。([#37839](https://github.com/PaddlePaddle/Paddle/pull/37839)) - -- 修复 TensorRT int8 时缺失 dynamic range 信息崩溃问题。([#36900](https://github.com/PaddlePaddle/Paddle/pull/36900)) - -- 修复 slice 反序列化代码问题。([#36588](https://github.com/PaddlePaddle/Paddle/pull/36588)) - -- 修复 yolo box 计算公式错误问题。([#36240](https://github.com/PaddlePaddle/Paddle/pull/36240)) - -- 修复老版本模型在使用新版本 roi_align 时崩溃问题。([#38788](https://github.com/PaddlePaddle/Paddle/pull/38788)) 外部开发者 - -- 修复 softmax 在 python 和 C++上性能差异较大的问题。([#37130](https://github.com/PaddlePaddle/Paddle/pull/37130)) - -- 修复 matmul 在静态 shape 2 维输入和动态 shape 3 维输入情况下推理失败问题。([#36849](https://github.com/PaddlePaddle/Paddle/pull/36849)) - -- 修复 reshape_transpose_matmul_mkldnn_fuse_pass 对 shape 处理不当问题。([#36731](https://github.com/PaddlePaddle/Paddle/pull/36731)) - -- 修复输入为 2 维,但 TensorRT 获取到 4 维的问题。([#36614](https://github.com/PaddlePaddle/Paddle/pull/36614)) - -- 修复 interpolate_v2 MKLDNN 算子在 scale 属性为空时报错问题。([#36623](https://github.com/PaddlePaddle/Paddle/pull/36623)) - -- 修复 recurrent 算子在多线程场景性能差问题。([#36052](https://github.com/PaddlePaddle/Paddle/pull/36052)) - -- 移除 relu、sigmoid、tanh、relu6、batch_norm、clip、concat、gelu、hard_sigmoid、prelu、softmax、split、swish 对 TensorRT 2 维输入的限制。([#37097](https://github.com/PaddlePaddle/Paddle/pull/37097)) - -- 修复 reshape op 使用 TensorRT 推理。([#41090](https://github.com/PaddlePaddle/Paddle/pull/41090)) - -- 修复 matmul 相关 pass,兼容 matmul_v2。([#36424](https://github.com/PaddlePaddle/Paddle/pull/36424)) - -- 开启 TensorRT 时,conv2d 算子中 padding 方式支持 VALID 及 SAME 属性。([#38999](https://github.com/PaddlePaddle/Paddle/pull/38999)) - -- 修复 MKLDNN 多输入算子量化问题。([#39593](https://github.com/PaddlePaddle/Paddle/pull/39593), 
[#39346](https://github.com/PaddlePaddle/Paddle/pull/39346), [#40717](https://github.com/PaddlePaddle/Paddle/pull/40717)) - -- 修复 MKLDNN 量化场景下 conv+activation 的 scale 错误问题。([#38331](https://github.com/PaddlePaddle/Paddle/pull/38331)) - -- 修复 MKLDNN 无参数算子量化中,根据后续算子量化情况不同需做不同处理的问题。([#39342](https://github.com/PaddlePaddle/Paddle/pull/39342)) - -- 修复 MKLDNN cpu_bfloat16_placement_pass 中的数据类型相关问题。([#38702](https://github.com/PaddlePaddle/Paddle/pull/38702)) - -- 修复 MKLDNN bfloat16 推理中 split 算子执行问题。([#39548](https://github.com/PaddlePaddle/Paddle/pull/39548)) - -- 修复 MKLDNN matmul_v2 算子不支持 6 维问题。([#36342](https://github.com/PaddlePaddle/Paddle/pull/36342), [#38665](https://github.com/PaddlePaddle/Paddle/pull/38665)) - -- 修复 MKLDNN matmul_v2_transpose_reshape 中的 MKLDNN DeviceContext 错误问题。([#38554](https://github.com/PaddlePaddle/Paddle/pull/38554)) - -- 修复分割模型在 MKLDNN 推理场景计算结果错误问题。([#37310](https://github.com/PaddlePaddle/Paddle/pull/37310)) - -- 修复 MKLDNN bfloat16 placement 算子列表并添加缺失算子。([#36291](https://github.com/PaddlePaddle/Paddle/pull/36291)) - -- 修复 MKLDNN 算子的格式问题,包括:FC、conv_transpose、6 维 Tensor 报错问题、conv 对 `NHWC` 输入的输出 format 错误问题。([#38890](https://github.com/PaddlePaddle/Paddle/pull/38890), [#37344](https://github.com/PaddlePaddle/Paddle/pull/37344), [#37175](https://github.com/PaddlePaddle/Paddle/pull/37175), [#38553](https://github.com/PaddlePaddle/Paddle/pull/38553), [#40049](https://github.com/PaddlePaddle/Paddle/pull/40049), [#39097](https://github.com/PaddlePaddle/Paddle/pull/39097)) - -- 修复 MKLDNN 多线程推理场景因 cache 机制报错问题。([#36290](https://github.com/PaddlePaddle/Paddle/pull/36290), [#35884](https://github.com/PaddlePaddle/Paddle/pull/35884)) - -- 修复 MKLDNN 因 matmul 及 FC 引起的量化模型精度异常问题。([#38023](https://github.com/PaddlePaddle/Paddle/pull/38023), [#37618](https://github.com/PaddlePaddle/Paddle/pull/37618)) - -- 修复 MKLDNN 量化转换脚本因 pass 缺少引起的量化模型精度异常问题。([#37619](https://github.com/PaddlePaddle/Paddle/pull/37619), 
[#40542](https://github.com/PaddlePaddle/Paddle/pull/40542), - [#38912](https://github.com/PaddlePaddle/Paddle/pull/38912)) - -- 修复 MKLDNN 开启量 op 因为数据类型不匹配崩溃的问题。([#38133](https://github.com/PaddlePaddle/Paddle/pull/38133)) - -- 修复 MKLDNN 某些 op 修改 layout 后需要改回原 layout 的问题。([#39422](https://github.com/PaddlePaddle/Paddle/pull/39422)) - -- 修复针对昇腾 910 推理场景下,由于未释放 GIL 锁,导致与昇腾软件栈冲突,python API 下报错的问题。([#38605](https://github.com/PaddlePaddle/Paddle/pull/38605)) - -## 5. 环境适配 - -### 编译安装 - -- 从 2.3.0 版本开始,飞桨对框架支持的 GPU 架构种类进行了调整和升级。(更多请参考:[飞桨支持的 GPU 架构](https://www.paddlepaddle.org.cn/documentation/docs/zh/2.3rc/install/Tables.html#gpu)) - -备注: - -- PIP 源安装是指用 `pip install paddlepaddle` 或 `pip install paddlepaddle-gpu`从 PIP 官网下载安装包及依赖库的安装方式,支持架构种类少,安装包更轻量,下载源来自国外(相比 bos 源支持架构种类精简,安装包更轻量,只提供一种 CUDA 版本的安装包)。 - - - 2.3 版本之前,飞桨 PIP 源安装包(CUDA10.2)支持的 GPU 架构为:3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5。 - - - 2.3 版本之后,飞桨 PIP 源安装包(CUDA11.0)支持的 GPU 架构为:6.0, 6.1, 7.0, 7.5, 8.0 - -- 飞桨官网 bos 源是指从飞桨官网下载安装包及依赖库的安装方式,支持的 GPU 架构更多,下载源来自国内,速度较快。(相比 PIP 源支持架构种类多,提供多个 CUDA 版本的安装包): - - - 2.3 版本之前,飞桨官网 bos 源安装包支持的 GPU 架构: - - - CUDA10:3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5; - - - CUDA11:5.2,6.0,6.1,7.0,7.5,8.0。 - - - 2.3 版本之后,飞桨官网 bos 源安装包支持的 GPU 架构 - - - CUDA10:3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5; - - - CUDA11:3.5, 5.0, 6.0, 6.1, 7.0, 7.5, 8.0。 - -- 支持 Python 3.10,修复 Windows 下某些 PythonC API 变化导致的编译 bug。([#41180](https://github.com/PaddlePaddle/Paddle/pull/42180)) - -- Windows 平台支持 Visual Studio 2019 编译。([#38719](https://github.com/PaddlePaddle/Paddle/pull/38719)) - -- 消除 Windows 平台编译时出现的各种 warning。([#38034](https://github.com/PaddlePaddle/Paddle/pull/38034), [#37890](https://github.com/PaddlePaddle/Paddle/pull/37890), [#37442](https://github.com/PaddlePaddle/Paddle/pull/37442), [#37439](https://github.com/PaddlePaddle/Paddle/pull/37439), [#36857](https://github.com/PaddlePaddle/Paddle/pull/36857)) - -- 修复底层数据结构升级引入的 jetson 编译问题。([#39669](https://github.com/PaddlePaddle/Paddle/pull/39669), 
[#39441](https://github.com/PaddlePaddle/Paddle/pull/39441)) - - -### 新硬件适配 - -- 自定义新硬件接入:提供一种插件式扩展 PaddlePaddle 硬件后端的方式。通过该功能,开发者无需为特定硬件修改 PaddlePaddle 代码,只需实现标准接口,并编译成动态链接库,则可作为插件供 PaddlePaddle 调用。降低为 PaddlePaddle 添加新硬件后端的开发难度。当前支持自定义 Runtime 接入和自定义 Kernel 接入。 - -- 华为 NPU 芯片(Ascend910)训练/推理支持,支持 ResNet50、YoloV3、BERT、Transformer 等多个模型,支持静态图与混合精度训练,支持单卡、单机、多机分布式训练。 - -- Graphcore IPU 芯片(包括 IPU Mk2 GC200 和 Bow IPU)训练/推理支持,支持 ResNet50、BERT 等模型,支持静态图训练,支持单芯片、单机、多机分布式训练。 - -- 寒武纪 MLU 芯片(MLU370x4)训练/推理支持,支持 ResNet50 等模型,支持静态图+动态图训练,支持混合精度训练,支持单卡、单机、多机分布式训练。 - -- 昆仑芯 2 代芯片(昆仑芯 AI 加速卡 R200、R300)训练/推理支持,支持 ResNet50、YoloV3、OCR-DB、SSD、MobilnetV3、UNet、BERT、Transformer、GPT-2、Wide&Deep、DeepFM,支持静态图+动态图训练,支持混合精度训练,支持单机单卡、单机多卡训练。 - -## Thanks to our Contributors - -This release contains contributions from the project core team as well as: - -Adam Osewski, Allen Guo, arlesniak, chenenquan, chenyanlann, fengkuangxiaxia, fuqianya, fwenguang, guguguzi, helen88, houj04, Jacek Czaja, jakpiase, jianghaicheng, joanna.wozna.intel, joeqiao12, Leo Chen, Leo Guo, Li-fAngyU, lidanqing, Liyulingyue, Matsumoto GAO, maxhuiy, Ming-Xu Huang, Nyakku Shigure, piotrekobi, piotrekobiIntel, QingshuChen, qipengh, Skr Bang, Sylwester Fraczek, Sławomir Siwek, taixiurong, tanzhipeng, Tomasz Socha, TTerror, Webbley, yaozhixin, ykkk2333, yujun, Zhangjingyu06, zhangxiaoci, zhangyikun02, zhangyk0314, zlsh80826, zn, Zuza. 
+#### Bug Fixes
+- Fix compilation errors caused by DTK and ROCm version upgrades. [#62832](https://github.com/PaddlePaddle/Paddle/pull/62832), [#62931](https://github.com/PaddlePaddle/Paddle/pull/62931), [#61872](https://github.com/PaddlePaddle/Paddle/pull/61872), [#63738](https://github.com/PaddlePaddle/Paddle/pull/63738)
+
+## 10. Environment Updates
+In this release, PaddlePaddle synchronized the releases and updates of its base dependency libraries and removed old dependencies that are no longer maintained. Multiple optimizations improve build efficiency and compatibility, and CI pipeline monitoring has been improved to give users a better installation experience. Several known build issues were fixed, the paddle build system was improved, and support for some new features was added. Together, these changes further improve the build and installation experience of the framework for developers.
+
+### New Support
+- Support installing paddle without depending on a local CUDA and cuDNN installation, improving the installation experience. [#60841](https://github.com/PaddlePaddle/Paddle/pull/60841), [#61973](https://github.com/PaddlePaddle/Paddle/pull/61973), [#61862](https://github.com/PaddlePaddle/Paddle/pull/61862), [#61235](https://github.com/PaddlePaddle/Paddle/pull/61235), [#61209](https://github.com/PaddlePaddle/Paddle/pull/61209), [#61653](https://github.com/PaddlePaddle/Paddle/pull/61653), [#64083](https://github.com/PaddlePaddle/Paddle/pull/64083)
+- Fully support CUDA 12.3 and retire CUDA 10.2. [#63356](https://github.com/PaddlePaddle/Paddle/pull/63356), [#60299](https://github.com/PaddlePaddle/Paddle/pull/60299), [#64171](https://github.com/PaddlePaddle/Paddle/pull/64171), [#62189](https://github.com/PaddlePaddle/Paddle/pull/62189), [#63392](https://github.com/PaddlePaddle/Paddle/pull/63392), [#64228](https://github.com/PaddlePaddle/Paddle/pull/64228), [#62498](https://github.com/PaddlePaddle/Paddle/pull/62498), [#64298](https://github.com/PaddlePaddle/Paddle/pull/64298)
+- Fully support Python 3.12, which brings more powerful language features and performance optimizations, and retire Python 3.7. [#59875](https://github.com/PaddlePaddle/Paddle/pull/59875), [#59877](https://github.com/PaddlePaddle/Paddle/pull/59877), [#59876](https://github.com/PaddlePaddle/Paddle/pull/59876)
+- 其他 paddle
依赖的第三方库升级:[#63741](https://github.com/PaddlePaddle/Paddle/pull/63741),[#64447](https://github.com/PaddlePaddle/Paddle/pull/64447),[#60195](https://github.com/PaddlePaddle/Paddle/pull/60195),[#60110](https://github.com/PaddlePaddle/Paddle/pull/60110),[#61509](https://github.com/PaddlePaddle/Paddle/pull/61509)
+
+### Build Optimization
+- Optimized paddle's CMake code, significantly improving build efficiency and the build experience. [#59995](https://github.com/PaddlePaddle/Paddle/pull/59995), [#60167](https://github.com/PaddlePaddle/Paddle/pull/60167), [#61052](https://github.com/PaddlePaddle/Paddle/pull/61052), [#59607](https://github.com/PaddlePaddle/Paddle/pull/59607), [#63093](https://github.com/PaddlePaddle/Paddle/pull/63093), [#63887](https://github.com/PaddlePaddle/Paddle/pull/63887), [#62969](https://github.com/PaddlePaddle/Paddle/pull/62969), [#64007](https://github.com/PaddlePaddle/Paddle/pull/64007), [#59811](https://github.com/PaddlePaddle/Paddle/pull/59811), [#63045](https://github.com/PaddlePaddle/Paddle/pull/63045), [#60235](https://github.com/PaddlePaddle/Paddle/pull/60235), [#60240](https://github.com/PaddlePaddle/Paddle/pull/60240), [#61411](https://github.com/PaddlePaddle/Paddle/pull/61411), [#61944](https://github.com/PaddlePaddle/Paddle/pull/61944), [#61961](https://github.com/PaddlePaddle/Paddle/pull/61961), [#59990](https://github.com/PaddlePaddle/Paddle/pull/59990), [#59478](https://github.com/PaddlePaddle/Paddle/pull/59478), [#61501](https://github.com/PaddlePaddle/Paddle/pull/61501), [#60066](https://github.com/PaddlePaddle/Paddle/pull/60066), [#64133](https://github.com/PaddlePaddle/Paddle/pull/64133), [#64231](https://github.com/PaddlePaddle/Paddle/pull/64231), [#60087](https://github.com/PaddlePaddle/Paddle/pull/60087), [#60348](https://github.com/PaddlePaddle/Paddle/pull/60348), [#60737](https://github.com/PaddlePaddle/Paddle/pull/60737), [#61364](https://github.com/PaddlePaddle/Paddle/pull/61364), [#63214](https://
github.com/PaddlePaddle/Paddle/pull/63214), [#62454](https://github.com/PaddlePaddle/Paddle/pull/62454), [#62473](https://github.com/PaddlePaddle/Paddle/pull/62473), [#63692](https://github.com/PaddlePaddle/Paddle/pull/63692), [#63950](https://github.com/PaddlePaddle/Paddle/pull/63950)
+- Support linking C++ unit tests against dynamic libraries on Linux and Windows, greatly reducing the size of the C++ unit tests and of the whole build directory. [#60008](https://github.com/PaddlePaddle/Paddle/pull/60008), [#60960](https://github.com/PaddlePaddle/Paddle/pull/60960), [#60961](https://github.com/PaddlePaddle/Paddle/pull/60961), [#60831](https://github.com/PaddlePaddle/Paddle/pull/60831), [#60832](https://github.com/PaddlePaddle/Paddle/pull/60832), [#60833](https://github.com/PaddlePaddle/Paddle/pull/60833), [#61372](https://github.com/PaddlePaddle/Paddle/pull/61372), [#60834](https://github.com/PaddlePaddle/Paddle/pull/60834), [#61374](https://github.com/PaddlePaddle/Paddle/pull/61374), [#61463](https://github.com/PaddlePaddle/Paddle/pull/61463), [#61376](https://github.com/PaddlePaddle/Paddle/pull/61376), [#60830](https://github.com/PaddlePaddle/Paddle/pull/60830), [#61373](https://github.com/PaddlePaddle/Paddle/pull/61373), [#61672](https://github.com/PaddlePaddle/Paddle/pull/61672), [#61375](https://github.com/PaddlePaddle/Paddle/pull/61375), [#61676](https://github.com/PaddlePaddle/Paddle/pull/61676), [#62036](https://github.com/PaddlePaddle/Paddle/pull/62036), [#61945](https://github.com/PaddlePaddle/Paddle/pull/61945), [#61675](https://github.com/PaddlePaddle/Paddle/pull/61675), [#61674](https://github.com/PaddlePaddle/Paddle/pull/61674), [#62773](https://github.com/PaddlePaddle/Paddle/pull/62773), [#61238](https://github.com/PaddlePaddle/Paddle/pull/61238), [#59988](https://github.com/PaddlePaddle/Paddle/pull/59988), [#60307](https://github.com/PaddlePaddle/Paddle/pull/60307), [#59612](https://github.com/PaddlePaddle/Paddle/pull/59612), [#59942](https://github.com/PaddlePaddle/Paddle/pull/59942), [#59968](https://github.com/Pad
dlePaddle/Paddle/pull/59968), [#59978](https://github.com/PaddlePaddle/Paddle/pull/59978), [#60121](https://github.com/PaddlePaddle/Paddle/pull/60121), [#60149](https://github.com/PaddlePaddle/Paddle/pull/60149), [#60161](https://github.com/PaddlePaddle/Paddle/pull/60161), [#60160](https://github.com/PaddlePaddle/Paddle/pull/60160), [#60230](https://github.com/PaddlePaddle/Paddle/pull/60230), [#60154](https://github.com/PaddlePaddle/Paddle/pull/60154), [#60356](https://github.com/PaddlePaddle/Paddle/pull/60356), [#60392](https://github.com/PaddlePaddle/Paddle/pull/60392), [#60517](https://github.com/PaddlePaddle/Paddle/pull/60517), [#61131](https://github.com/PaddlePaddle/Paddle/pull/61131), [#60959](https://github.com/PaddlePaddle/Paddle/pull/60959)
+- Added support for the Clang compiler. Users can now build with Clang for faster compilation and clearer error messages. [#63382](https://github.com/PaddlePaddle/Paddle/pull/63382), [#63133](https://github.com/PaddlePaddle/Paddle/pull/63133), [#61705](https://github.com/PaddlePaddle/Paddle/pull/61705), [#63152](https://github.com/PaddlePaddle/Paddle/pull/63152), [#63373](https://github.com/PaddlePaddle/Paddle/pull/63373)
+
+### CI Pipeline Improvements
+- Improved the monitoring of merged code in the CI pipeline to ensure higher code quality and stability. Added a monitoring module that tracks CI pipeline metrics in real time, so that each stage runs smoothly and problems are found and resolved promptly. [#61384](https://github.com/PaddlePaddle/Paddle/pull/61384), [#62190](https://github.com/PaddlePaddle/Paddle/pull/62190), [#60758](https://github.com/PaddlePaddle/Paddle/pull/60758), [#60399](https://github.com/PaddlePaddle/Paddle/pull/60399), [#58623](https://github.com/PaddlePaddle/Paddle/pull/58623), [#62177](https://github.com/PaddlePaddle/Paddle/pull/62177), [#62361](https://github.com/PaddlePaddle/Paddle/pull/62361), [#62893](https://github.com/PaddlePaddle/Paddle/pull/62893), [#63705](https://github.com/PaddlePaddle/Paddle/pull/63705), [#64476](https://github.com/PaddlePaddle/Paddle/pull/64476), [#64752](https://github.com/PaddlePaddle/Paddle/pull/64752), [#64733](https://github.com/PaddlePaddle/Paddle/pull/64733), [#61914](https://github.com/PaddlePaddle/Paddle/pull/61914)
+
+### Code Cleanup
+- Removed some obsolete code. [#63580](https://github.com/PaddlePaddle/Paddle/pull/63580), [#62840](https://github.com/PaddlePaddle/Paddle/pull/62840), [#62886](https://github.com/PaddlePaddle/Paddle/pull/62886), [#63046](https://github.com/PaddlePaddle/Paddle/pull/63046), [#63004](https://github.com/PaddlePaddle/Paddle/pull/63004), [#63039](https://github.com/PaddlePaddle/Paddle/pull/63039), [#62733](https://github.com/PaddlePaddle/Paddle/pull/62733), [#62773](https://github.com/PaddlePaddle/Paddle/pull/62773), [#62768](https://github.com/PaddlePaddle/Paddle/pull/62768), [#62744](https://github.com/PaddlePaddle/Paddle/pull/62744), [#62861](https://github.com/PaddlePaddle/Paddle/pull/62861), [#62774](https://github.com/PaddlePaddle/Paddle/pull/62774), [#62851](https://github.com/PaddlePaddle/Paddle/pull/62851), [#62973](https://github.com/PaddlePaddle/Paddle/pull/62973), [#63273](https://github.com/PaddlePaddle/Paddle/pull/63273), [#62445](https://github.com/PaddlePaddle/Paddle/pull/62445), [#64382](https://github.com/PaddlePaddle/Paddle/pull/64382), [#64409](https://github.com/PaddlePaddle/Paddle/pull/64409), [#64391](https://github.com/PaddlePaddle/Paddle/pull/64391), [#64310](https://github.com/PaddlePaddle/Paddle/pull/64310), [#64348](https://github.com/PaddlePaddle/Paddle/pull/64348), [#64651](https://github.com/PaddlePaddle/Paddle/pull/64651), [#64709](https://github.com/PaddlePaddle/Paddle/pull/64709), [#61714](https://github.com/PaddlePaddle/Paddle/pull/61714), [#62109](https://github.com/PaddlePaddle/Paddle/pull/62109), [#61751](https://github.com/PaddlePaddle/Paddle/pull/61751), [#61691](https://github.com/PaddlePaddle/Paddle/pull/61691), [#61735](https://github.com/PaddlePaddle/Paddle/pull/61735)
+### Bug Fixes
+- 修复多个 paddle
框架的编译问题。[#63297](https://github.com/PaddlePaddle/Paddle/pull/63297),[#62994](https://github.com/PaddlePaddle/Paddle/pull/62994),[#62651](https://github.com/PaddlePaddle/Paddle/pull/62651),[#64408](https://github.com/PaddlePaddle/Paddle/pull/64408),[#60934](https://github.com/PaddlePaddle/Paddle/pull/60934),[#62899](https://github.com/PaddlePaddle/Paddle/pull/62899),[#60528](https://github.com/PaddlePaddle/Paddle/pull/60528),[#63158](https://github.com/PaddlePaddle/Paddle/pull/63158),[#64549](https://github.com/PaddlePaddle/Paddle/pull/64549),[#62351](https://github.com/PaddlePaddle/Paddle/pull/62351),[#61259](https://github.com/PaddlePaddle/Paddle/pull/61259),[#61281](https://github.com/PaddlePaddle/Paddle/pull/61281),[#62304](https://github.com/PaddlePaddle/Paddle/pull/62304),[#60736](https://github.com/PaddlePaddle/Paddle/pull/60736),[#60811](https://github.com/PaddlePaddle/Paddle/pull/60811),[#63949](https://github.com/PaddlePaddle/Paddle/pull/63949),[#59892](https://github.com/PaddlePaddle/Paddle/pull/59892),[#60767](https://github.com/PaddlePaddle/Paddle/pull/60767),[#60856](https://github.com/PaddlePaddle/Paddle/pull/60856),[#61286](https://github.com/PaddlePaddle/Paddle/pull/61286),[#61638](https://github.com/PaddlePaddle/Paddle/pull/61638),[#62079](https://github.com/PaddlePaddle/Paddle/pull/62079),[#62142](https://github.com/PaddlePaddle/Paddle/pull/62142),[#62823](https://github.com/PaddlePaddle/Paddle/pull/62823),[#62814](https://github.com/PaddlePaddle/Paddle/pull/62814),[#62425](https://github.com/PaddlePaddle/Paddle/pull/62425),[#62619](https://github.com/PaddlePaddle/Paddle/pull/62619),[#60207](https://github.com/PaddlePaddle/Paddle/pull/60207),[#60765](https://github.com/PaddlePaddle/Paddle/pull/60765),[#61870](https://github.com/PaddlePaddle/Paddle/pull/61870),[#61923](https://github.com/PaddlePaddle/Paddle/pull/61923),[#62144](https://github.com/PaddlePaddle/Paddle/pull/62144),[#62426](https://github.com/PaddlePaddle/Paddle/pull/62426),[#63848](htt
ps://github.com/PaddlePaddle/Paddle/pull/63848),[#60682](https://github.com/PaddlePaddle/Paddle/pull/60682),[#61369](https://github.com/PaddlePaddle/Paddle/pull/61369),[#62882](https://github.com/PaddlePaddle/Paddle/pull/62882),[#63944](https://github.com/PaddlePaddle/Paddle/pull/63944),[#64812](https://github.com/PaddlePaddle/Paddle/pull/64812),[#60654](https://github.com/PaddlePaddle/Paddle/pull/60654),[#60887](https://github.com/PaddlePaddle/Paddle/pull/60887),[#62058](https://github.com/PaddlePaddle/Paddle/pull/62058),[#64639](https://github.com/PaddlePaddle/Paddle/pull/64639),[#60115](https://github.com/PaddlePaddle/Paddle/pull/60115),[#61940](https://github.com/PaddlePaddle/Paddle/pull/61940),[#62614](https://github.com/PaddlePaddle/Paddle/pull/62614),[#59914](https://github.com/PaddlePaddle/Paddle/pull/59914),[#63762](https://github.com/PaddlePaddle/Paddle/pull/63762),[#60145](https://github.com/PaddlePaddle/Paddle/pull/60145),[#60285](https://github.com/PaddlePaddle/Paddle/pull/60285),[#60378](https://github.com/PaddlePaddle/Paddle/pull/60378),[#60393](https://github.com/PaddlePaddle/Paddle/pull/60393),[#61057](https://github.com/PaddlePaddle/Paddle/pull/61057),[#61058](https://github.com/PaddlePaddle/Paddle/pull/61058),[#61151](https://github.com/PaddlePaddle/Paddle/pull/61151),[#61347](https://github.com/PaddlePaddle/Paddle/pull/61347),[#61554](https://github.com/PaddlePaddle/Paddle/pull/61554),[#61844](https://github.com/PaddlePaddle/Paddle/pull/61844),[#62915](https://github.com/PaddlePaddle/Paddle/pull/62915),[#61852](https://github.com/PaddlePaddle/Paddle/pull/61852),[#61704](https://github.com/PaddlePaddle/Paddle/pull/61704),[#61991](https://github.com/PaddlePaddle/Paddle/pull/61991),[#62264](https://github.com/PaddlePaddle/Paddle/pull/62264),[#62762](https://github.com/PaddlePaddle/Paddle/pull/62762),[#63820](https://github.com/PaddlePaddle/Paddle/pull/63820),[#63864](https://github.com/PaddlePaddle/Paddle/pull/63864),[#65017](https://github.com/Padd
lePaddle/Paddle/pull/65017), [#61183](https://github.com/PaddlePaddle/Paddle/pull/61183), [#59866](https://github.com/PaddlePaddle/Paddle/pull/59866), [#61171](https://github.com/PaddlePaddle/Paddle/pull/61171), [#61290](https://github.com/PaddlePaddle/Paddle/pull/61290), [#61725](https://github.com/PaddlePaddle/Paddle/pull/61725), [#61614](https://github.com/PaddlePaddle/Paddle/pull/61614), [#61721](https://github.com/PaddlePaddle/Paddle/pull/61721), [#61494](https://github.com/PaddlePaddle/Paddle/pull/61494), [#61556](https://github.com/PaddlePaddle/Paddle/pull/61556), [#61689](https://github.com/PaddlePaddle/Paddle/pull/61689)
+
+## 11. Documentation Fixes
+- Alongside the API enhancement work, some API documents were also corrected and improved. [#62875](https://github.com/PaddlePaddle/Paddle/pull/62875), [#59793](https://github.com/PaddlePaddle/Paddle/pull/59793), [#60002](https://github.com/PaddlePaddle/Paddle/pull/60002), [#59985](https://github.com/PaddlePaddle/Paddle/pull/59985), [#63365](https://github.com/PaddlePaddle/Paddle/pull/63365), [#60962](https://github.com/PaddlePaddle/Paddle/pull/60962), [#60942](https://github.com/PaddlePaddle/Paddle/pull/60942), [#64232](https://github.com/PaddlePaddle/Paddle/pull/64232), [#63255](https://github.com/PaddlePaddle/Paddle/pull/63255)
+- Updated or supplemented API documentation: bernoulli_ ([#64504](https://github.com/PaddlePaddle/Paddle/pull/64504)), paddle.static.ctr_metric_bundle ([#60912](https://github.com/PaddlePaddle/Paddle/pull/60912)), LayerNorm ([#62928](https://github.com/PaddlePaddle/Paddle/pull/62928)), Sequential ([#63128](https://github.com/PaddlePaddle/Paddle/pull/63128)), paddle.summary ([#63121](https://github.com/PaddlePaddle/Paddle/pull/63121)), ShardOptimizer in AutoParallel ([#62933](https://github.com/PaddlePaddle/Paddle/pull/62933)), paddle.nccl.version ([#62480](https://github.com/PaddlePaddle/Paddle/pull/62480))
+- 更新 Readme
文件。[#59883](https://github.com/PaddlePaddle/Paddle/pull/59883),[#60691](https://github.com/PaddlePaddle/Paddle/pull/60691),[#60749](https://github.com/PaddlePaddle/Paddle/pull/60749)
+- Updated mkldnn to onednn. [#63199](https://github.com/PaddlePaddle/Paddle/pull/63199), [#63202](https://github.com/PaddlePaddle/Paddle/pull/63202), [#63215](https://github.com/PaddlePaddle/Paddle/pull/63215), [#63209](https://github.com/PaddlePaddle/Paddle/pull/63209)
+- Fixed documentation rendering errors. [#59725](https://github.com/PaddlePaddle/Paddle/pull/59725), [#60306](https://github.com/PaddlePaddle/Paddle/pull/60306)
+- Fixed a large number of typos in the code to improve source readability. [#60093](https://github.com/PaddlePaddle/Paddle/pull/60093), [#60603](https://github.com/PaddlePaddle/Paddle/pull/60603), [#60631](https://github.com/PaddlePaddle/Paddle/pull/60631), [#60679](https://github.com/PaddlePaddle/Paddle/pull/60679), [#60741](https://github.com/PaddlePaddle/Paddle/pull/60741), [#60770](https://github.com/PaddlePaddle/Paddle/pull/60770), [#60784](https://github.com/PaddlePaddle/Paddle/pull/60784), [#60825](https://github.com/PaddlePaddle/Paddle/pull/60825), [#60857](https://github.com/PaddlePaddle/Paddle/pull/60857), [#60891](https://github.com/PaddlePaddle/Paddle/pull/60891), [#60921](https://github.com/PaddlePaddle/Paddle/pull/60921), [#60920](https://github.com/PaddlePaddle/Paddle/pull/60920), [#60923](https://github.com/PaddlePaddle/Paddle/pull/60923), [#60928](https://github.com/PaddlePaddle/Paddle/pull/60928), [#60940](https://github.com/PaddlePaddle/Paddle/pull/60940), [#60936](https://github.com/PaddlePaddle/Paddle/pull/60936), [#60932](https://github.com/PaddlePaddle/Paddle/pull/60932), [#60935](https://github.com/PaddlePaddle/Paddle/pull/60935), [#60931](https://github.com/PaddlePaddle/Paddle/pull/60931), [#60951](https://github.com/PaddlePaddle/Paddle/pull/60951), [#60964](https://github.com/PaddlePaddle/Paddle/pull/60964), [#60965](https://github.com/PaddlePaddle/Paddle/pull/60965), [#60967](https://github.com/PaddlePaddle/Paddle/pull/60967), [#60972](https://g
ithub.com/PaddlePaddle/Paddle/pull/60972),[#60971](https://github.com/PaddlePaddle/Paddle/pull/60971),[#60980](https://github.com/PaddlePaddle/Paddle/pull/60980),[#60984](https://github.com/PaddlePaddle/Paddle/pull/60984),[#60985](https://github.com/PaddlePaddle/Paddle/pull/60985),[#60989](https://github.com/PaddlePaddle/Paddle/pull/60989),[#60990](https://github.com/PaddlePaddle/Paddle/pull/60990),[#60991](https://github.com/PaddlePaddle/Paddle/pull/60991),[#60992](https://github.com/PaddlePaddle/Paddle/pull/60992),[#60994](https://github.com/PaddlePaddle/Paddle/pull/60994),[#60995](https://github.com/PaddlePaddle/Paddle/pull/60995),[#60996](https://github.com/PaddlePaddle/Paddle/pull/60996),[#61001](https://github.com/PaddlePaddle/Paddle/pull/61001),[#61000](https://github.com/PaddlePaddle/Paddle/pull/61000),[#60999](https://github.com/PaddlePaddle/Paddle/pull/60999),[#60998](https://github.com/PaddlePaddle/Paddle/pull/60998),[#61026](https://github.com/PaddlePaddle/Paddle/pull/61026),[#61009](https://github.com/PaddlePaddle/Paddle/pull/61009),[#61034](https://github.com/PaddlePaddle/Paddle/pull/61034),[#61033](https://github.com/PaddlePaddle/Paddle/pull/61033),[#61020](https://github.com/PaddlePaddle/Paddle/pull/61020),[#61092](https://github.com/PaddlePaddle/Paddle/pull/61092),[#61066](https://github.com/PaddlePaddle/Paddle/pull/61066),[#61063](https://github.com/PaddlePaddle/Paddle/pull/61063),[#61089](https://github.com/PaddlePaddle/Paddle/pull/61089),[#61071](https://github.com/PaddlePaddle/Paddle/pull/61071),[#61129](https://github.com/PaddlePaddle/Paddle/pull/61129),[#61128](https://github.com/PaddlePaddle/Paddle/pull/61128),[#61126](https://github.com/PaddlePaddle/Paddle/pull/61126),[#61123](https://github.com/PaddlePaddle/Paddle/pull/61123),[#61113](https://github.com/PaddlePaddle/Paddle/pull/61113),[#61189](https://github.com/PaddlePaddle/Paddle/pull/61189),[#61175](https://github.com/PaddlePaddle/Paddle/pull/61175),[#61153](https://github.com/PaddlePadd
le/Paddle/pull/61153),[#61198](https://github.com/PaddlePaddle/Paddle/pull/61198),[#61206](https://github.com/PaddlePaddle/Paddle/pull/61206),[#61256](https://github.com/PaddlePaddle/Paddle/pull/61256),[#61255](https://github.com/PaddlePaddle/Paddle/pull/61255),[#61251](https://github.com/PaddlePaddle/Paddle/pull/61251),[#61246](https://github.com/PaddlePaddle/Paddle/pull/61246),[#61245](https://github.com/PaddlePaddle/Paddle/pull/61245),[#61231](https://github.com/PaddlePaddle/Paddle/pull/61231),[#61247](https://github.com/PaddlePaddle/Paddle/pull/61247),[#61265](https://github.com/PaddlePaddle/Paddle/pull/61265),[#61264](https://github.com/PaddlePaddle/Paddle/pull/61264),[#61266](https://github.com/PaddlePaddle/Paddle/pull/61266),[#61267](https://github.com/PaddlePaddle/Paddle/pull/61267),[#61268](https://github.com/PaddlePaddle/Paddle/pull/61268),[#61270](https://github.com/PaddlePaddle/Paddle/pull/61270),[#61334](https://github.com/PaddlePaddle/Paddle/pull/61334),[#61392](https://github.com/PaddlePaddle/Paddle/pull/61392),[#61404](https://github.com/PaddlePaddle/Paddle/pull/61404),[#61318](https://github.com/PaddlePaddle/Paddle/pull/61318),[#61383](https://github.com/PaddlePaddle/Paddle/pull/61383),[#61306](https://github.com/PaddlePaddle/Paddle/pull/61306),[#61324](https://github.com/PaddlePaddle/Paddle/pull/61324),[#61426](https://github.com/PaddlePaddle/Paddle/pull/61426),[#61390](https://github.com/PaddlePaddle/Paddle/pull/61390),[#61419](https://github.com/PaddlePaddle/Paddle/pull/61419),[#61420](https://github.com/PaddlePaddle/Paddle/pull/61420),[#61408](https://github.com/PaddlePaddle/Paddle/pull/61408),[#61425](https://github.com/PaddlePaddle/Paddle/pull/61425),[#61557](https://github.com/PaddlePaddle/Paddle/pull/61557),[#61628](https://github.com/PaddlePaddle/Paddle/pull/61628),[#61652](https://github.com/PaddlePaddle/Paddle/pull/61652),[#61602](https://github.com/PaddlePaddle/Paddle/pull/61602),[#61558](https://github.com/PaddlePaddle/Paddle/pull/61558
),[#61660](https://github.com/PaddlePaddle/Paddle/pull/61660),[#61423](https://github.com/PaddlePaddle/Paddle/pull/61423),[#61627](https://github.com/PaddlePaddle/Paddle/pull/61627),[#61685](https://github.com/PaddlePaddle/Paddle/pull/61685),[#61690](https://github.com/PaddlePaddle/Paddle/pull/61690),[#61727](https://github.com/PaddlePaddle/Paddle/pull/61727),[#61738](https://github.com/PaddlePaddle/Paddle/pull/61738),[#61740](https://github.com/PaddlePaddle/Paddle/pull/61740),[#61741](https://github.com/PaddlePaddle/Paddle/pull/61741),[#61743](https://github.com/PaddlePaddle/Paddle/pull/61743),[#61744](https://github.com/PaddlePaddle/Paddle/pull/61744),[#61745](https://github.com/PaddlePaddle/Paddle/pull/61745),[#61761](https://github.com/PaddlePaddle/Paddle/pull/61761),[#61762](https://github.com/PaddlePaddle/Paddle/pull/61762),[#61764](https://github.com/PaddlePaddle/Paddle/pull/61764),[#61767](https://github.com/PaddlePaddle/Paddle/pull/61767),[#61768](https://github.com/PaddlePaddle/Paddle/pull/61768),[#61774](https://github.com/PaddlePaddle/Paddle/pull/61774),[#61781](https://github.com/PaddlePaddle/Paddle/pull/61781),[#61783](https://github.com/PaddlePaddle/Paddle/pull/61783),[#61757](https://github.com/PaddlePaddle/Paddle/pull/61757),[#61732](https://github.com/PaddlePaddle/Paddle/pull/61732),[#61776](https://github.com/PaddlePaddle/Paddle/pull/61776),[#61780](https://github.com/PaddlePaddle/Paddle/pull/61780),[#61730](https://github.com/PaddlePaddle/Paddle/pull/61730),[#61728](https://github.com/PaddlePaddle/Paddle/pull/61728),[#61633](https://github.com/PaddlePaddle/Paddle/pull/61633),[#61720](https://github.com/PaddlePaddle/Paddle/pull/61720),[#61734](https://github.com/PaddlePaddle/Paddle/pull/61734),[#61779](https://github.com/PaddlePaddle/Paddle/pull/61779),[#61775](https://github.com/PaddlePaddle/Paddle/pull/61775),[#61773](https://github.com/PaddlePaddle/Paddle/pull/61773),[#61787](https://github.com/PaddlePaddle/Paddle/pull/61787),[#61687](https://g
ithub.com/PaddlePaddle/Paddle/pull/61687),[#61747](https://github.com/PaddlePaddle/Paddle/pull/61747),[#61760](https://github.com/PaddlePaddle/Paddle/pull/61760),[#61782](https://github.com/PaddlePaddle/Paddle/pull/61782),[#61800](https://github.com/PaddlePaddle/Paddle/pull/61800),[#61748](https://github.com/PaddlePaddle/Paddle/pull/61748),[#61772](https://github.com/PaddlePaddle/Paddle/pull/61772),[#61786](https://github.com/PaddlePaddle/Paddle/pull/61786),[#61880](https://github.com/PaddlePaddle/Paddle/pull/61880),[#61718](https://github.com/PaddlePaddle/Paddle/pull/61718),[#61742](https://github.com/PaddlePaddle/Paddle/pull/61742),[#61766](https://github.com/PaddlePaddle/Paddle/pull/61766),[#61835](https://github.com/PaddlePaddle/Paddle/pull/61835),[#61838](https://github.com/PaddlePaddle/Paddle/pull/61838),[#61754](https://github.com/PaddlePaddle/Paddle/pull/61754),[#61833](https://github.com/PaddlePaddle/Paddle/pull/61833),[#61749](https://github.com/PaddlePaddle/Paddle/pull/61749),[#61938](https://github.com/PaddlePaddle/Paddle/pull/61938),[#61919](https://github.com/PaddlePaddle/Paddle/pull/61919),[#61924](https://github.com/PaddlePaddle/Paddle/pull/61924),[#61778](https://github.com/PaddlePaddle/Paddle/pull/61778),[#61839](https://github.com/PaddlePaddle/Paddle/pull/61839),[#61879](https://github.com/PaddlePaddle/Paddle/pull/61879),[#61929](https://github.com/PaddlePaddle/Paddle/pull/61929),[#61801](https://github.com/PaddlePaddle/Paddle/pull/61801),[#61788](https://github.com/PaddlePaddle/Paddle/pull/61788),[#61999](https://github.com/PaddlePaddle/Paddle/pull/61999),[#61928](https://github.com/PaddlePaddle/Paddle/pull/61928),[#61958](https://github.com/PaddlePaddle/Paddle/pull/61958),[#61982](https://github.com/PaddlePaddle/Paddle/pull/61982),[#61996](https://github.com/PaddlePaddle/Paddle/pull/61996),[#61953](https://github.com/PaddlePaddle/Paddle/pull/61953),[#61998](https://github.com/PaddlePaddle/Paddle/pull/61998),[#62003](https://github.com/PaddlePadd
le/Paddle/pull/62003),[#61921](https://github.com/PaddlePaddle/Paddle/pull/61921),[#61881](https://github.com/PaddlePaddle/Paddle/pull/61881),[#61746](https://github.com/PaddlePaddle/Paddle/pull/61746),[#61955](https://github.com/PaddlePaddle/Paddle/pull/61955),[#62002](https://github.com/PaddlePaddle/Paddle/pull/62002),[#62001](https://github.com/PaddlePaddle/Paddle/pull/62001),[#61997](https://github.com/PaddlePaddle/Paddle/pull/61997),[#61765](https://github.com/PaddlePaddle/Paddle/pull/61765),[#61956](https://github.com/PaddlePaddle/Paddle/pull/61956),[#62004](https://github.com/PaddlePaddle/Paddle/pull/62004),[#62044](https://github.com/PaddlePaddle/Paddle/pull/62044),[#62040](https://github.com/PaddlePaddle/Paddle/pull/62040),[#62043](https://github.com/PaddlePaddle/Paddle/pull/62043),[#62042](https://github.com/PaddlePaddle/Paddle/pull/62042),[#62041](https://github.com/PaddlePaddle/Paddle/pull/62041),[#62039](https://github.com/PaddlePaddle/Paddle/pull/62039),[#62019](https://github.com/PaddlePaddle/Paddle/pull/62019),[#61910](https://github.com/PaddlePaddle/Paddle/pull/61910),[#61882](https://github.com/PaddlePaddle/Paddle/pull/61882),[#61836](https://github.com/PaddlePaddle/Paddle/pull/61836),[#62013](https://github.com/PaddlePaddle/Paddle/pull/62013),[#62055](https://github.com/PaddlePaddle/Paddle/pull/62055),[#62047](https://github.com/PaddlePaddle/Paddle/pull/62047),[#62000](https://github.com/PaddlePaddle/Paddle/pull/62000),[#62048](https://github.com/PaddlePaddle/Paddle/pull/62048),[#62075](https://github.com/PaddlePaddle/Paddle/pull/62075),[#62038](https://github.com/PaddlePaddle/Paddle/pull/62038),[#62045](https://github.com/PaddlePaddle/Paddle/pull/62045),[#62105](https://github.com/PaddlePaddle/Paddle/pull/62105),[#62214](https://github.com/PaddlePaddle/Paddle/pull/62214),[#62212](https://github.com/PaddlePaddle/Paddle/pull/62212),[#62183](https://github.com/PaddlePaddle/Paddle/pull/62183),[#62182](https://github.com/PaddlePaddle/Paddle/pull/62182
),[#62181](https://github.com/PaddlePaddle/Paddle/pull/62181),[#62179](https://github.com/PaddlePaddle/Paddle/pull/62179),[#62178](https://github.com/PaddlePaddle/Paddle/pull/62178),[#62172](https://github.com/PaddlePaddle/Paddle/pull/62172),[#62168](https://github.com/PaddlePaddle/Paddle/pull/62168),[#62163](https://github.com/PaddlePaddle/Paddle/pull/62163),[#62162](https://github.com/PaddlePaddle/Paddle/pull/62162),[#62161](https://github.com/PaddlePaddle/Paddle/pull/62161),[#62160](https://github.com/PaddlePaddle/Paddle/pull/62160),[#62046](https://github.com/PaddlePaddle/Paddle/pull/62046),[#62175](https://github.com/PaddlePaddle/Paddle/pull/62175),[#62259](https://github.com/PaddlePaddle/Paddle/pull/62259),[#62258](https://github.com/PaddlePaddle/Paddle/pull/62258),[#62213](https://github.com/PaddlePaddle/Paddle/pull/62213),[#62260](https://github.com/PaddlePaddle/Paddle/pull/62260),[#62290](https://github.com/PaddlePaddle/Paddle/pull/62290),[#62288](https://github.com/PaddlePaddle/Paddle/pull/62288),[#62323](https://github.com/PaddlePaddle/Paddle/pull/62323),[#62319](https://github.com/PaddlePaddle/Paddle/pull/62319),[#62331](https://github.com/PaddlePaddle/Paddle/pull/62331),[#62330](https://github.com/PaddlePaddle/Paddle/pull/62330),[#62329](https://github.com/PaddlePaddle/Paddle/pull/62329),[#62324](https://github.com/PaddlePaddle/Paddle/pull/62324),[#62317](https://github.com/PaddlePaddle/Paddle/pull/62317),[#62311](https://github.com/PaddlePaddle/Paddle/pull/62311),[#62310](https://github.com/PaddlePaddle/Paddle/pull/62310),[#62308](https://github.com/PaddlePaddle/Paddle/pull/62308),[#62289](https://github.com/PaddlePaddle/Paddle/pull/62289),[#62307](https://github.com/PaddlePaddle/Paddle/pull/62307),[#62315](https://github.com/PaddlePaddle/Paddle/pull/62315),[#62406](https://github.com/PaddlePaddle/Paddle/pull/62406),[#62458](https://github.com/PaddlePaddle/Paddle/pull/62458),[#62459](https://github.com/PaddlePaddle/Paddle/pull/62459),[#62481](https://g
ithub.com/PaddlePaddle/Paddle/pull/62481),[#62465](https://github.com/PaddlePaddle/Paddle/pull/62465),[#62462](https://github.com/PaddlePaddle/Paddle/pull/62462),[#62453](https://github.com/PaddlePaddle/Paddle/pull/62453),[#62496](https://github.com/PaddlePaddle/Paddle/pull/62496),[#62457](https://github.com/PaddlePaddle/Paddle/pull/62457),[#62537](https://github.com/PaddlePaddle/Paddle/pull/62537),[#62514](https://github.com/PaddlePaddle/Paddle/pull/62514),[#62548](https://github.com/PaddlePaddle/Paddle/pull/62548),[#62544](https://github.com/PaddlePaddle/Paddle/pull/62544),[#62575](https://github.com/PaddlePaddle/Paddle/pull/62575),[#62463](https://github.com/PaddlePaddle/Paddle/pull/62463),[#62643](https://github.com/PaddlePaddle/Paddle/pull/62643),[#62803](https://github.com/PaddlePaddle/Paddle/pull/62803),[#62924](https://github.com/PaddlePaddle/Paddle/pull/62924),[#63037](https://github.com/PaddlePaddle/Paddle/pull/63037),[#63102](https://github.com/PaddlePaddle/Paddle/pull/63102),[#63139](https://github.com/PaddlePaddle/Paddle/pull/63139),[#63092](https://github.com/PaddlePaddle/Paddle/pull/63092),[#63147](https://github.com/PaddlePaddle/Paddle/pull/63147),[#60518](https://github.com/PaddlePaddle/Paddle/pull/60518),[#60485](https://github.com/PaddlePaddle/Paddle/pull/60485),[#61273](https://github.com/PaddlePaddle/Paddle/pull/61273),[#63429](https://github.com/PaddlePaddle/Paddle/pull/63429),[#61954](https://github.com/PaddlePaddle/Paddle/pull/61954)
+
+## 12. Other Upgrades
+Changes unrelated to user-facing usage, including cleanup of deprecated code, cleanup of obsolete unit tests, and upgrades to debugging or monitoring mechanisms. [#63377](https://github.com/PaddlePaddle/Paddle/pull/63377),[#64106](https://github.com/PaddlePaddle/Paddle/pull/64106),[#64220](https://github.com/PaddlePaddle/Paddle/pull/64220),[#64293](https://github.com/PaddlePaddle/Paddle/pull/64293),[#64464](https://github.com/PaddlePaddle/Paddle/pull/64464),[#64944](https://github.com/PaddlePaddle/Paddle/pull/64944),[#63638](https://github.com/PaddlePaddle/Paddle/pull/63638),[#63732](https://github.com/PaddlePaddle/Paddle/pull/63732),[#63735](https://github.com/PaddlePaddle/Paddle/pull/63735),[#63826](https://github.com/PaddlePaddle/Paddle/pull/63826),[#63982](https://github.com/PaddlePaddle/Paddle/pull/63982),[#63737](https://github.com/PaddlePaddle/Paddle/pull/63737),[#64471](https://github.com/PaddlePaddle/Paddle/pull/64471),[#64574](https://github.com/PaddlePaddle/Paddle/pull/64574),[#64494](https://github.com/PaddlePaddle/Paddle/pull/64494),[#62775](https://github.com/PaddlePaddle/Paddle/pull/62775),[#63601](https://github.com/PaddlePaddle/Paddle/pull/63601),[#62564](https://github.com/PaddlePaddle/Paddle/pull/62564),[#63772](https://github.com/PaddlePaddle/Paddle/pull/63772),[#64719](https://github.com/PaddlePaddle/Paddle/pull/64719),[#61640](https://github.com/PaddlePaddle/Paddle/pull/61640),[#63459](https://github.com/PaddlePaddle/Paddle/pull/63459),[#64062](https://github.com/PaddlePaddle/Paddle/pull/64062),[#63480](https://github.com/PaddlePaddle/Paddle/pull/63480),[#63833](https://github.com/PaddlePaddle/Paddle/pull/63833),[#63673](https://github.com/PaddlePaddle/Paddle/pull/63673),[#63672](https://github.com/PaddlePaddle/Paddle/pull/63672),[#64131](https://github.com/PaddlePaddle/Paddle/pull/64131),[#64156](https://github.com/PaddlePaddle/Paddle/pull/64156),[#64155](https://github.com/PaddlePaddle/Paddle/pull/64155),[#64159](https://github.com/PaddlePaddle/Paddle/pull/64159),[#63902](https://github.com/PaddlePaddle/Paddle/pull/63902),[#64230](https://github.com/PaddlePaddle/
Paddle/pull/64230),[#64229](https://github.com/PaddlePaddle/Paddle/pull/64229),[#64236](https://github.com/PaddlePaddle/Paddle/pull/64236),[#64260](https://github.com/PaddlePaddle/Paddle/pull/64260),[#64175](https://github.com/PaddlePaddle/Paddle/pull/64175),[#64250](https://github.com/PaddlePaddle/Paddle/pull/64250),[#64269](https://github.com/PaddlePaddle/Paddle/pull/64269),[#64238](https://github.com/PaddlePaddle/Paddle/pull/64238),[#64349](https://github.com/PaddlePaddle/Paddle/pull/64349),[#64394](https://github.com/PaddlePaddle/Paddle/pull/64394),[#64402](https://github.com/PaddlePaddle/Paddle/pull/64402),[#64401](https://github.com/PaddlePaddle/Paddle/pull/64401),[#64388](https://github.com/PaddlePaddle/Paddle/pull/64388),[#64329](https://github.com/PaddlePaddle/Paddle/pull/64329),[#64502](https://github.com/PaddlePaddle/Paddle/pull/64502),[#64501](https://github.com/PaddlePaddle/Paddle/pull/64501),[#64515](https://github.com/PaddlePaddle/Paddle/pull/64515),[#64503](https://github.com/PaddlePaddle/Paddle/pull/64503),[#64514](https://github.com/PaddlePaddle/Paddle/pull/64514),[#64601](https://github.com/PaddlePaddle/Paddle/pull/64601),[#64564](https://github.com/PaddlePaddle/Paddle/pull/64564),[#64012](https://github.com/PaddlePaddle/Paddle/pull/64012),[#64697](https://github.com/PaddlePaddle/Paddle/pull/64697),[#64682](https://github.com/PaddlePaddle/Paddle/pull/64682),[#64051](https://github.com/PaddlePaddle/Paddle/pull/64051),[#63267](https://github.com/PaddlePaddle/Paddle/pull/63267),[#63426](https://github.com/PaddlePaddle/Paddle/pull/63426),[#63626](https://github.com/PaddlePaddle/Paddle/pull/63626),[#63257](https://github.com/PaddlePaddle/Paddle/pull/63257),[#63266](https://github.com/PaddlePaddle/Paddle/pull/63266),[#63468](https://github.com/PaddlePaddle/Paddle/pull/63468),[#63262](https://github.com/PaddlePaddle/Paddle/pull/63262),[#63248](https://github.com/PaddlePaddle/Paddle/pull/63248),[#63241](https://github.com/PaddlePaddle/Paddle/pull/63241),[
#63252](https://github.com/PaddlePaddle/Paddle/pull/63252),[#63258](https://github.com/PaddlePaddle/Paddle/pull/63258),[#63235](https://github.com/PaddlePaddle/Paddle/pull/63235),[#63399](https://github.com/PaddlePaddle/Paddle/pull/63399),[#63488](https://github.com/PaddlePaddle/Paddle/pull/63488),[#63487](https://github.com/PaddlePaddle/Paddle/pull/63487),[#63466](https://github.com/PaddlePaddle/Paddle/pull/63466),[#63464](https://github.com/PaddlePaddle/Paddle/pull/63464),[#63483](https://github.com/PaddlePaddle/Paddle/pull/63483),[#63486](https://github.com/PaddlePaddle/Paddle/pull/63486),[#63475](https://github.com/PaddlePaddle/Paddle/pull/63475),[#63489](https://github.com/PaddlePaddle/Paddle/pull/63489),[#63470](https://github.com/PaddlePaddle/Paddle/pull/63470),[#63457](https://github.com/PaddlePaddle/Paddle/pull/63457),[#63493](https://github.com/PaddlePaddle/Paddle/pull/63493),[#63561](https://github.com/PaddlePaddle/Paddle/pull/63561),[#63584](https://github.com/PaddlePaddle/Paddle/pull/63584),[#63587](https://github.com/PaddlePaddle/Paddle/pull/63587),[#63586](https://github.com/PaddlePaddle/Paddle/pull/63586),[#63569](https://github.com/PaddlePaddle/Paddle/pull/63569),[#63559](https://github.com/PaddlePaddle/Paddle/pull/63559),[#63558](https://github.com/PaddlePaddle/Paddle/pull/63558),[#63555](https://github.com/PaddlePaddle/Paddle/pull/63555),[#63543](https://github.com/PaddlePaddle/Paddle/pull/63543),[#63589](https://github.com/PaddlePaddle/Paddle/pull/63589),[#63583](https://github.com/PaddlePaddle/Paddle/pull/63583),[#63565](https://github.com/PaddlePaddle/Paddle/pull/63565),[#63564](https://github.com/PaddlePaddle/Paddle/pull/63564),[#63265](https://github.com/PaddlePaddle/Paddle/pull/63265),[#63562](https://github.com/PaddlePaddle/Paddle/pull/63562),[#63591](https://github.com/PaddlePaddle/Paddle/pull/63591),[#63460](https://github.com/PaddlePaddle/Paddle/pull/63460),[#63238](https://github.com/PaddlePaddle/Paddle/pull/63238),[#63631](https://gith
ub.com/PaddlePaddle/Paddle/pull/63631),[#63707](https://github.com/PaddlePaddle/Paddle/pull/63707),[#63714](https://github.com/PaddlePaddle/Paddle/pull/63714),[#63854](https://github.com/PaddlePaddle/Paddle/pull/63854),[#63929](https://github.com/PaddlePaddle/Paddle/pull/63929),[#63532](https://github.com/PaddlePaddle/Paddle/pull/63532),[#59628](https://github.com/PaddlePaddle/Paddle/pull/59628),[#62209](https://github.com/PaddlePaddle/Paddle/pull/62209),[#63742](https://github.com/PaddlePaddle/Paddle/pull/63742),[#60518](https://github.com/PaddlePaddle/Paddle/pull/60518),[#62078](https://github.com/PaddlePaddle/Paddle/pull/62078),[#62684](https://github.com/PaddlePaddle/Paddle/pull/62684),[#62723](https://github.com/PaddlePaddle/Paddle/pull/62723),[#64141](https://github.com/PaddlePaddle/Paddle/pull/64141),[#60404](https://github.com/PaddlePaddle/Paddle/pull/60404),[#64212](https://github.com/PaddlePaddle/Paddle/pull/64212),[#60652](https://github.com/PaddlePaddle/Paddle/pull/60652),[#64545](https://github.com/PaddlePaddle/Paddle/pull/64545),[#64477](https://github.com/PaddlePaddle/Paddle/pull/64477),[#64556](https://github.com/PaddlePaddle/Paddle/pull/64556),[#63160](https://github.com/PaddlePaddle/Paddle/pull/63160),[#63796](https://github.com/PaddlePaddle/Paddle/pull/63796),[#64693](https://github.com/PaddlePaddle/Paddle/pull/64693),[#64484](https://github.com/PaddlePaddle/Paddle/pull/64484),[#64677](https://github.com/PaddlePaddle/Paddle/pull/64677),[#64461](https://github.com/PaddlePaddle/Paddle/pull/64461),[#63189](https://github.com/PaddlePaddle/Paddle/pull/63189),[#63855](https://github.com/PaddlePaddle/Paddle/pull/63855),[#63896](https://github.com/PaddlePaddle/Paddle/pull/63896),[#63193](https://github.com/PaddlePaddle/Paddle/pull/63193),[#63200](https://github.com/PaddlePaddle/Paddle/pull/63200),[#63406](https://github.com/PaddlePaddle/Paddle/pull/63406),[#61283](https://github.com/PaddlePaddle/Paddle/pull/61283),[#63607](https://github.com/PaddlePaddle/
Paddle/pull/63607),[#64486](https://github.com/PaddlePaddle/Paddle/pull/64486),[#64004](https://github.com/PaddlePaddle/Paddle/pull/64004),[#63132](https://github.com/PaddlePaddle/Paddle/pull/63132),[#63553](https://github.com/PaddlePaddle/Paddle/pull/63553),[#63572](https://github.com/PaddlePaddle/Paddle/pull/63572),[#63794](https://github.com/PaddlePaddle/Paddle/pull/63794),[#63919](https://github.com/PaddlePaddle/Paddle/pull/63919),[#63980](https://github.com/PaddlePaddle/Paddle/pull/63980),[#62917](https://github.com/PaddlePaddle/Paddle/pull/62917),[#64451](https://github.com/PaddlePaddle/Paddle/pull/64451),[#63541](https://github.com/PaddlePaddle/Paddle/pull/63541),[#63703](https://github.com/PaddlePaddle/Paddle/pull/63703),[#64536](https://github.com/PaddlePaddle/Paddle/pull/64536),[#63264](https://github.com/PaddlePaddle/Paddle/pull/63264),[#63335](https://github.com/PaddlePaddle/Paddle/pull/63335),[#63841](https://github.com/PaddlePaddle/Paddle/pull/63841),[#64628](https://github.com/PaddlePaddle/Paddle/pull/64628),[#63419](https://github.com/PaddlePaddle/Paddle/pull/63419),[#62210](https://github.com/PaddlePaddle/Paddle/pull/62210),[#63557](https://github.com/PaddlePaddle/Paddle/pull/63557),[#63064](https://github.com/PaddlePaddle/Paddle/pull/63064),[#61442](https://github.com/PaddlePaddle/Paddle/pull/61442),[#63537](https://github.com/PaddlePaddle/Paddle/pull/63537),[#63839](https://github.com/PaddlePaddle/Paddle/pull/63839),[#60927](https://github.com/PaddlePaddle/Paddle/pull/60927),[#60566](https://github.com/PaddlePaddle/Paddle/pull/60566),[#60842](https://github.com/PaddlePaddle/Paddle/pull/60842),[#64612](https://github.com/PaddlePaddle/Paddle/pull/64612),[#60047](https://github.com/PaddlePaddle/Paddle/pull/60047),[#63898](https://github.com/PaddlePaddle/Paddle/pull/63898),[#60415](https://github.com/PaddlePaddle/Paddle/pull/60415),[#60474](https://github.com/PaddlePaddle/Paddle/pull/60474),[#60439](https://github.com/PaddlePaddle/Paddle/pull/60439),[
#60565](https://github.com/PaddlePaddle/Paddle/pull/60565),[#64414](https://github.com/PaddlePaddle/Paddle/pull/64414),[#62526](https://github.com/PaddlePaddle/Paddle/pull/62526),[#54183](https://github.com/PaddlePaddle/Paddle/pull/54183),[#64096](https://github.com/PaddlePaddle/Paddle/pull/64096),[#61325](https://github.com/PaddlePaddle/Paddle/pull/61325),[#60629](https://github.com/PaddlePaddle/Paddle/pull/60629),[#61051](https://github.com/PaddlePaddle/Paddle/pull/61051),[#62103](https://github.com/PaddlePaddle/Paddle/pull/62103),[#63594](https://github.com/PaddlePaddle/Paddle/pull/63594),[#60968](https://github.com/PaddlePaddle/Paddle/pull/60968),[#64613](https://github.com/PaddlePaddle/Paddle/pull/64613),[#64073](https://github.com/PaddlePaddle/Paddle/pull/64073),[#63816](https://github.com/PaddlePaddle/Paddle/pull/63816),[#64416](https://github.com/PaddlePaddle/Paddle/pull/64416),[#62499](https://github.com/PaddlePaddle/Paddle/pull/62499),[#64531](https://github.com/PaddlePaddle/Paddle/pull/64531),[#63827](https://github.com/PaddlePaddle/Paddle/pull/63827),[#59885](https://github.com/PaddlePaddle/Paddle/pull/59885),[#59949](https://github.com/PaddlePaddle/Paddle/pull/59949),[#63428](https://github.com/PaddlePaddle/Paddle/pull/63428),[#63218](https://github.com/PaddlePaddle/Paddle/pull/63218),[#63538](https://github.com/PaddlePaddle/Paddle/pull/63538),[#64497](https://github.com/PaddlePaddle/Paddle/pull/64497),[#63082](https://github.com/PaddlePaddle/Paddle/pull/63082),[#64395](https://github.com/PaddlePaddle/Paddle/pull/64395),[#60183](https://github.com/PaddlePaddle/Paddle/pull/60183),[#63691](https://github.com/PaddlePaddle/Paddle/pull/63691),[#64428](https://github.com/PaddlePaddle/Paddle/pull/64428),[#64648](https://github.com/PaddlePaddle/Paddle/pull/64648),[#64650](https://github.com/PaddlePaddle/Paddle/pull/64650),[#59926](https://github.com/PaddlePaddle/Paddle/pull/59926),[#59750](https://github.com/PaddlePaddle/Paddle/pull/59750),[#60080](https://gith
ub.com/PaddlePaddle/Paddle/pull/60080),[#60208](https://github.com/PaddlePaddle/Paddle/pull/60208),[#64124](https://github.com/PaddlePaddle/Paddle/pull/64124),[#64187](https://github.com/PaddlePaddle/Paddle/pull/64187),[#64166](https://github.com/PaddlePaddle/Paddle/pull/64166),[#64284](https://github.com/PaddlePaddle/Paddle/pull/64284),[#64253](https://github.com/PaddlePaddle/Paddle/pull/64253),[#64555](https://github.com/PaddlePaddle/Paddle/pull/64555),[#59878](https://github.com/PaddlePaddle/Paddle/pull/59878),[#64081](https://github.com/PaddlePaddle/Paddle/pull/64081)
+
+## 13. Contributors
+6clc, Android zhang, Asthestarsfalll, Ataf Fazledin Ahamed, Aurelius84, AyaseNana, Baizhou Zhang, bapijun, BiynXu, Botao Zhou, Bo Zhang, bukejiyu, caozhou, chalsliu, Chang Xu, Charles-hit, chen2016013, Chen Zhiyang, C.J.0_0, cmcamdy, co63oc, coco, cyber-pioneer, cyberslack_lee, danleifeng, diadestiny, Difer, Dmovic, Eddie-Wang, Eddie Zhang, engineer1109, enzodechine, fanhaoxuee, feifei-111, flying-forever, Frank Lin, freeliuzc, fsczz, Galaxy1458, GGBond8488, Ghost Screaming, gongweibao, gouzil, Guoxia Wang, handiz, HankYang, Haohongxiang, haosicheng, hess, hjyp, hong, Hongqing-work, Hongwen Xin, HongyuJia, houj04, huangjiyi, Huihuang Zheng, hxzd5568, hyDONG, HydrogenSulfate, idontkonwher, iLeGend, Jeng Bai-Cheng, Jianbang Yang, Jia Wenxuan, JYChen, jzhang533, JZ-LIANG, Kai Song, kangguangli, kevin, Kunbo Ding, lanxianghit, Leo Chen, Leo Guo, lijialin03, lijin23, linkk08, Liujie0926, Liuyinfeng, liu zhengxi, liuzhenhai93, liym27, LiYuRio, lizexu123, LoneRanger, Longzhi Wang, Lucas, Lu Qi, lzy, lzydev, MayYouBeProsperous, megemini, Meiyim, ming1753, Mingdong Wang, ndren, NeroLoh, NetPunk, Nguyen Cong Vinh, Nyakku Shigure, Omri Alon, onepick, ooo oo, pangengzheng, PommesPeter, Qi Li, QingshuChen, Qi Shao, RedContritio, Reese Wang, RichardWooSJTU, risemeup1, Roc, ronnywang, Ruibiao Chen, Ruibin Cheung, RuohengMa, Ryan, Shaopeng Ling, ShenLiang, Shijie, Shuhao Liang, Siming Dai, 
skywalker2012, smallpoxscattered, sneaxiy, Sonder, Sunny-bot1, Tao Luo, tc20042008, Terry, Tian, tianhaodongbd, tianshuo78520a, Tianyu Feng, Tian Zheng, Tongkai, Travis-Lee, unseenme, Vigi Zhang, walkalone20, Wang Bojun, wanghuancoder, wangna11BD, Wang Xin, Wangzheee, WangZhen, wanly young, wawltor, wendaxiao, Wen Sun, wentao yu, Wenyu, wenzhe.wang, Winters Montagne, winter-wang, WoWYoYLoL, Wu Chencan, Wu Fei, wuhuachaocoding, Xianduo Li, XiangGao, XiaociZhang, xiaoguoguo626807, xiaoxiaohehe001, Xiao Xiyuan, Xiaoxu Chen, xiaoyao0115, xiaoye, xingmingyyj, Xinyi_LI, Xinyu Yang, xiongkun, xuxinyi389, xysheng-baidu, yangguohao, YibLiu, Yichen Zhang, yinfan98, yinwei, Yiqun Liu, YKTian, Yuang Liu, Yuanle Liu, YuanRisheng, yuguo, yujun, yulangz, YUNSHEN XIE, zbt78, ZelinMa557, Zero Rains, Zeyu Chen, zhangbo9674, Zhang,Lirong, Zhang Ting, zhangyikun02, zhangyuqin1998, Zhan Rongrui, zhaohaixu, zhaoyingli, Zhenghai Zhang, zhengzhonghui, zhink, ZhouMengLei1999, zhouzj, zhupengyang, zhurou603, zhuyipin, zhwesky2010, Zichao, zxcd, zyfncg, zyt1024, 东百月, 傅剑寒, 周周周, 周波涛, 张春乔, 萧 diff --git a/docs/release_note_en.md b/docs/release_note_en.md index 3e6ea0cd50f..0052edd54df 100644 --- a/docs/release_note_en.md +++ b/docs/release_note_en.md @@ -1,3495 +1,537 @@ -# 2.6.0 Release Note +# 3.0 Beta Release Note -## 1. Important Updates +# Overview of PaddlePaddle 3.0 Beta -- **Paddle New generation IR(PIR)** : In order to further improve scalability of the PaddlePaddle framework, we have developed a new generation intermediate representaion. It abstracts underlying core concepts of the PaddlePaddle framework, such as Operation, Attribute and Type, providing developers with flexible and efficient basic components. By introducing Dialect mechanism, PIR can comprehensively and hierarchically satisfy needs of each module for intermediate representations to greatly enhancing scalability of the framework. 
PIR strictly follows Static Single Assignment (SSA) principle, ensuring unity of top-level structure and harmonious coexistence of "operator sequentiality" and "computational graph semantics". In addition, PIR provides a more concise and low-cost Pass development process, with a series of built-in rich and functional Pass optimization strategies. It provides technical support for the ultimate performance optimization of large-scale models. -- **Static graph construction and compiler Optimization Architecture**: In order to further improve performance of the framework, PaddlePaddle's dynamic to static training capability has been comprehensively upgraded to support adaptive graph construction capability. This has been tested on more than 700 PaddlePaddle industry-level models, with 100% success rate of one line code converter to start static training. Meanwhile, Compiler Infrastructure for Neural Networks (CINN) of PaddlePaddle framework is integrated into PaddlePaddle main Repo, making the compiler and PaddlePaddle more integrated. CINN completes architectural optimization and improvement of expansion capability, increasing system stability. Based on PIR framework, it is much more easied to bind dynamic to static, primitive operator, executor and compiler together, to providing more space for boosting overall performance of PaddlePaddle framework. -- **Enhanced dynamic graph distributed capability**: Large models pose higher demands on the distributed training performance of framework. PaddlePaddle has comprehensive optimizations in dimensions of communication library, graph analysis, distributed strategy and task enable/disable, enhancing distributed computing capability of PaddlePaddle's dynamic graph and providing support for efficient training of large models. 
In terms of performance, training performance is further improved by reducing pipelined GPU memory occupation, adopting TensorFusion technology, implementing communication computation overlap, and reducing non-essential data synchronization copies. Meanwhile, flexibility of hybrid-parallel debugging is improved through environment variable control Optimizer. In addition, stability of system is significantly improved by fixing related Bugs. -- **Auto parallel architecture with dynamic-static unification**: In order to further reduce difficulty of programming and optimizing large models, PaddlePaddle has fully optimized the Semi-Auto Parallel programming paradigm with dynamic-static unification, simplifying programming complexity for developers. Developers do not need to deeply understand complex concepts and APIs under the manual parallel programming paradigm, such as row-parallel, and column-parallel. They only need a small amount of tensor distribution annotations to implement the hybrid parallelism. The distribution specification will be propagated to all tensors and operators automatically, and the framework would handle the communication and synchronization needed by distributed training appropriately. Meanwhile, it supports dynamic-to-static distributed training by adding one extra code only, allowing developers to efficiently implement any mixed parallelism strategy and deeply simplify the development process of hybrid-parallel training paradigm. -- **Hardware Integration Solution (CustomDevice)**: With increased demand for parallel training on new hardware in large model scenarios, PaddlePaddle has added support for distributed advanced policies, custom operators, and custom fusion policies. Distributed communication library is upgraded, with newly added support for many advanced distributed policies such as MP, GroupShared, PP, SP and MOE. 
Moreover, it supports vendors to flexibly access Transformer operator libraries of different granularities and modify the computation graph through Fusion Pass for performance acceleration.
-- **Installation and development experience**: use of modular compilation optimizes logics of CMake codes, and improves efficiency of PaddlePaddle full compilation and incremental compilation. In addition, this can increase efficiency of RD development. It supports Python3.12, CUDA12, Hopper architecture compilation, with introduction of Clang and other tools to fully optimize code formats. In addition, C++ is changed from linking static libraries to linking dynamic libraries to reduce compilation volume. These optimizations provide users with a smoother and more efficient installation and development experience.
+The core features of this version include new technologies such as dynamic-static unified auto parallelism and automatic optimization by the neural network compiler, which aim to address new challenges in the deep learning field. PaddlePaddle Framework 3.0 Beta extends the design concepts of 2.x, such as dynamic-static unification and integrated training and inference. The development interface is fully compatible with version 2.x, which means that code developed for version 2.x can, in most cases, run directly on version 3.x without modification. Several key features are detailed as follows:
-## 2. Incompatible Upgrade
+- Dynamic-static unified auto parallel: To make parallel training of large models easier to program, PaddlePaddle has also optimized its dynamic-static unified semi-auto parallel programming paradigm. Developers do not need to delve into the complex concepts and APIs needed in manual parallel programming; a small amount of tensor sharding annotation is enough to build hybrid parallelism for large models. 
The framework automatically derives distributed sharding states and inserts communication operators, and it also supports one-click dynamic-to-static distributed training, which greatly simplifies the development of hybrid parallel training code. For dynamic-static unification, PaddlePaddle has comprehensively upgraded its dynamic-to-static training capability by adopting bytecode-based dynamic-to-static conversion, which supports adaptive graph construction. This has been verified on more than 700 PaddlePaddle industrial-grade models, with a 100% success rate for one-click dynamic-to-static training.
+- Automatic optimization by the neural network compiler: The PaddlePaddle Compiler Infrastructure for Neural Networks (CINN) is designed to be integrated with the framework and supports efficient training and dynamic-shape inference for generative models, scientific computing models, and other models. This provides a good balance between computational flexibility and high performance. Through automatic operator fusion and code generation, the inference performance of Llama2 and Stable Diffusion models has been improved by 30%.
+- High-order automatic differentiation: To better support scientific computing scenarios, PaddlePaddle Framework designs and implements high-order automatic differentiation based on the combinatorial operator mechanism, combined with the automatic optimization of the neural network compiler. We have tested more than 40 differential equations in scientific computing scenarios, and the solving speed is 70% faster than that of similar products in the industry.
+- Highly scalable intermediate representation: To improve the scalability of the PaddlePaddle framework, we have developed a highly scalable Paddle Intermediate Representation (PIR). This representation systematically abstracts the underlying core concepts and provides flexible and efficient components. 
PIR serves as the infrastructure to support a number of technologies such as dynamic-to-static, automatic differentiation, auto parallel, combinatorial operators, and graph optimization; it is widely used in scenarios such as distributed training, model compression, and inference deployment. With the Declarative Rewrite Rule (DRR) mechanism provided by PIR, the development cost of Pass can be reduced by 60%. We have tested over 900 model configurations and the results show that the overall performance of inference improves by more than 10% after using PIR. +- Multi-Hardware adaptation: PaddlePaddle provides a well-functioning and low-cost solution for large model hardware adaptation. New hardware only needs to adapt just over 30 interfaces to support training, compression and inference of large models. Meanwhile, PaddlePaddle provides a compiler-based hardware access mode, and hardware vendors only need to implement the compiler's code generation back-end in the form of plug-ins to achieve efficient adaptation with the PaddlePaddle framework. In this release, PaddlePaddle hardware access adds daily release support for four types of hardware: Kunlun XPU, Ascend NPU, Hygon DCU and Cambricon MLU. -- In order to avoid misuse, we removed the 0-dimensional Tensor compatibility state switch, to achieve the same API behaviors as industry's mainstream habits. In the previous version, we already supported 0-dimensional Tensor, but we added a compatibility state switch in order to avoid error reporting of some models, as much as possible. That is, in some scenarios where model suite is used frequently and modification is not completed, we still used 1-dimensional Tensor with only 1 element to replace the 0-dimensional Tensor by default. In this version, compatibility state switch is removed, so the 1-dimensional Tensor with only 1 element will no longer be used, to replace 0-dimensional Tensor in any scenario.
Behaviors of 376 APIs that should support the 0-dimensional Tensor have been corrected and unified, to thoroughly complete support for the 0-dimensional Tensor.[#57036](https://github.com/PaddlePaddle/Paddle/pull/57036), [#54581](https://github.com/PaddlePaddle/Paddle/pull/54581), [#54500](https://github.com/PaddlePaddle/Paddle/pull/54500) -- To improve API usability, paddle.nn.functional.diag_embed has been streamlined to paddle.diag_embed, with support of use of Tensor.diag_embed. [#58223](https://github.com/PaddlePaddle/Paddle/pull/58223) -- In order to solve the problem of differential computation error caused by Tensor index writing (e.g., tensor[0] = 10) under static graphs, and to comply with static graph specifications, this version introduces paddle.static.setitem API. In static graph environments, this API is recommended to support indexed write operations for tensor, instead of subscript operators. This change does not affect dynamic graph environments, where index write using subscript operators are still allowed. [#53682](https://github.com/PaddlePaddle/Paddle/pull/53682) -- paddle.fluid API is completely retired in this version. In this update, we completely removed all paddle.fluid APIs and deleted the fluid directory. Meanwhile, a small number of PaddlePaddle underlying public components have been consolidated into the paddle.base directory. 
It is unnecessary for PaddlePaddle users to pay attention to fluid-related concepts and APIs, further simplifying PaddlePaddle API system and improving readability.[#56576](https://github.com/PaddlePaddle/Paddle/pull/56576), [#54424](https://github.com/PaddlePaddle/Paddle/pull/54424), [#54829](https://github.com/PaddlePaddle/Paddle/pull/54829), [#53992](https://github.com/PaddlePaddle/Paddle/pull/53992), [#54806](https://github.com/PaddlePaddle/Paddle/pull/54806), [#55754](https://github.com/PaddlePaddle/Paddle/pull/55754), [#55986](https://github.com/PaddlePaddle/Paddle/pull/55986), [#55345](https://github.com/PaddlePaddle/Paddle/pull/55345), [#56099](https://github.com/PaddlePaddle/Paddle/pull/56099), [#51717](https://github.com/PaddlePaddle/Paddle/pull/51717), [#54152](https://github.com/PaddlePaddle/Paddle/pull/54152), [#55522](https://github.com/PaddlePaddle/Paddle/pull/55522), [#55757](https://github.com/PaddlePaddle/Paddle/pull/55757), [#58521](https://github.com/PaddlePaddle/Paddle/pull/58521), [#54936](https://github.com/PaddlePaddle/Paddle/pull/54936), [#55007](https://github.com/PaddlePaddle/Paddle/pull/55007), [#55661](https://github.com/PaddlePaddle/Paddle/pull/55661), [#55970](https://github.com/PaddlePaddle/Paddle/pull/55970) +This version includes the continuous improvement of some of the existing features of the framework 2.x. Meanwhile, the new features of this version bring significant improvements in terms of user experience, performance, ease of secondary development and hardware adaptability. 
In addition to the above core features, this version continues to enrich and enhance API functions to cover more scenarios at the user experience level; optimizes distributed parallel strategies and enhances inference functions for large model scenarios; thoroughly improves the ease of use of compilation and installation; upgrades the installation method and the versions of dependency packages in step; comprehensively strengthens system security; and performs comprehensive error-correction checks on the product documentation. We have also cleaned up some deprecated code to keep the architecture simple. PaddlePaddle 3.0 Beta remains mature and stable when the new features are not used, and each new feature provides a switch for flexible control, which makes it easy for users to quickly understand the related product features and compare the experience. -## 3. Training Framework (including Distributed) +## User Experience Upgrade -### Python API +### Incompatibility Upgrade -#### Upgrade Tensor indexing mechanism +- PaddlePaddle API supports type promotion. In the most common calculations such as addition, subtraction, multiplication, and division, if the two inputs are of different data types, the data type of the output must be determined. Historically, PaddlePaddle supported this only partially and the actual rules were not clear. In practice, this showed up as dynamic-static inconsistency, inconsistency between APIs and operator overloading, and results that did not obey the commutative law; unexpected and hard-to-fix problems arose especially when large models mixed bf16/fp16 with fp32 across a wide range of calculations.
Starting from the 3.0 beta, PaddlePaddle has clarified the [type promotion rules](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/advanced/auto_type_promotion_cn.html), and defined in detail the result types of Tensor-Tensor and Tensor-Scalar (a single number) computations, ensuring that computation obeys the commutative law, that operator overloading is consistent with the results of the binary APIs, and that dynamic graph results are consistent with static graph results. This is more in line with user understanding and industry practice. [#60638](https://github.com/PaddlePaddle/Paddle/pull/60638), [#63842](https://github.com/PaddlePaddle/Paddle/pull/63842), [#60011](https://github.com/PaddlePaddle/Paddle/pull/60011) -This version comprehensively optimizes basic index, advanced index and joint index functions of Tensor, to better comply with industry standards and user habits. Specifically, we added support for view in basic index, fixed some wrong behaviors in advanced index, and implemented read function of joint index. In addition, we have sunk index parsing to C++ level, improved performance of high-level indexing operators, and removed redundant computations in bool indexing. With these optimizations, performance of Tensor's basic, advanced and joint index has been improved comprehensively.
[#56893](https://github.com/PaddlePaddle/Paddle/pull/56893), [#58643](https://github.com/PaddlePaddle/Paddle/pull/58643), [#57986](https://github.com/PaddlePaddle/Paddle/pull/57986), [#56272](https://github.com/PaddlePaddle/Paddle/pull/56272), [#58856](https://github.com/PaddlePaddle/Paddle/pull/58856), [#55211](https://github.com/PaddlePaddle/Paddle/pull/55211), [#57023](https://github.com/PaddlePaddle/Paddle/pull/57023), [#56613](https://github.com/PaddlePaddle/Paddle/pull/56613), [#55602](https://github.com/PaddlePaddle/Paddle/pull/55602), [#59281](https://github.com/PaddlePaddle/Paddle/pull/59281), [#57737](https://github.com/PaddlePaddle/Paddle/pull/57737) +### Deprecated Features -#### Upgrade Inplace mechanism +- 0-dimensional Tensor has been stably supported for two versions. This version removes the switch `FLAGS_set_to_1d`, which in some cases converted a 0-dimensional Tensor to a 1-dimensional Tensor with only 1 element. The switch existed for compatibility with some kits that incorrectly used a 1-dimensional Tensor with only 1 element to represent a 0-dimensional Tensor. PaddlePaddle now fully distinguishes between the semantics of a 0-dimensional Tensor and a 1-dimensional Tensor with only 1 element; the two are not equivalent. [#61227](https://github.com/PaddlePaddle/Paddle/pull/61227) -In earlier versions, in order to ensure correctness of inverse differentiation calculations, when reverse calculation of an API depends on its forward input data, PaddlePaddle avoids using Inplace operation method, with possibly overwriting original input data. This mechanism simplifies implementation process, and also limits the ability of many APIs to implement Inplace functionality. As a result, user experience may be affected. -In this version, PaddlePaddle has fully upgraded the Inplace mechanism. It implements automatic detection of the dependency of reverse computation on forward inputs, to save input data when needed.
Therefore, more Inplace operations are supported. This improvement not only improves memory usage efficiency, but also enhances functionality of the API. -In addition, we have added 109 new APIs that support Inplace operations, including paddle.abs\_, paddle.sin\_/cos\_/tan\_, comparison operations such as paddle.greater_than\_/less_than\_/equal\_, logical operations such as paddle.logical_and\_/logical_or\_/logical_not\_, paddle.neg\_ and paddle.log\_. While enriching the feature set of PaddlePaddle, it improves users' efficiency and convenience in numerical computation and deep learning tasks. [#54683](https://github.com/PaddlePaddle/Paddle/pull/54683), [#55078](https://github.com/PaddlePaddle/Paddle/pull/55078), [#55576](https://github.com/PaddlePaddle/Paddle/pull/55576), [#56888](https://github.com/PaddlePaddle/Paddle/pull/56888), [#55509](https://github.com/PaddlePaddle/Paddle/pull/55509), [#57093](https://github.com/PaddlePaddle/Paddle/pull/57093) +### New API Features -#### Other new APIs +Compared with the previous version, this version adds 126 new APIs, with richer API functions to better support the needs of large models and scientific computing. The details are as follows: -- Added paddle.nn.functional.scaled_dot_product_attention. This significantly improves computational efficiency of the attention mechanism in large models, and meets demand for high-performance computation in large-scale deep learning models.
[#55242](https://github.com/PaddlePaddle/Paddle/pull/55242) -- Added a series of new scientific computing-related APIs, including paddle.cummax and paddle.cummin for cumulative maximum and minimum computation, paddle.index_fill and paddle.masked_fill for filling tensor by index or mask, paddle.linalg.pca_lowrank for low-rank principal component analysis, paddle.hypot for calculating length of the hypotenuses of right triangles, and paddle.atleast_1d, paddle.atleast_2d, and paddle.atleast_3d to ensure the tensor is at least one, two, or three dimensional. We also provide paddle.select_scatter and paddle.diagonal_scatter for more flexible selection and hashing of tensor data, and paddle.multigammaln for choosing the natural logarithm of multigamma function. In addition, new optimizer-related APIs are added in this version, including: paddle.optimizer.lr.LinearLR and paddle.optimizer.lr.CosineAnnealingWarmRestarts for learning rate scheduling strategies; introduction of paddle.io.SubsetRandomSampler to support random sampling from a subset of data. These added APIs will further enhance flexibility and efficiency of PaddlePaddle in various application scenarios. 
[#57416](https://github.com/PaddlePaddle/Paddle/pull/57416), [#53546](https://github.com/PaddlePaddle/Paddle/pull/53546), [#53743](https://github.com/PaddlePaddle/Paddle/pull/53743), [#57295](https://github.com/PaddlePaddle/Paddle/pull/57295), [#57726](https://github.com/PaddlePaddle/Paddle/pull/57726), [#58764](https://github.com/PaddlePaddle/Paddle/pull/58764), [#58323](https://github.com/PaddlePaddle/Paddle/pull/58323), [#57720](https://github.com/PaddlePaddle/Paddle/pull/57720), [#58209](https://github.com/PaddlePaddle/Paddle/pull/58209), [#58214](https://github.com/PaddlePaddle/Paddle/pull/58214), [#57792](https://github.com/PaddlePaddle/Paddle/pull/57792), [#51395](https://github.com/PaddlePaddle/Paddle/pull/51395), [#57724](https://github.com/PaddlePaddle/Paddle/pull/57724), [#57355](https://github.com/PaddlePaddle/Paddle/pull/57355), [#57744](https://github.com/PaddlePaddle/Paddle/pull/57744), [#58244](https://github.com/PaddlePaddle/Paddle/pull/58244), [#57599](https://github.com/PaddlePaddle/Paddle/pull/57599), [#59343](https://github.com/PaddlePaddle/Paddle/pull/59343), [#57879](https://github.com/PaddlePaddle/Paddle/pull/57879) +- Add Tensor computation API. 
`paddle.gammaln`, `paddle.gammainc`, `paddle.gammaincc`, `paddle.sinc`, `paddle.pdist`, `paddle.histogramdd`,`paddle.signbit`, `paddle.copysign`, `paddle.bitwise_right_shift/bitwise_left_shift`, `paddle.isposinf/isneginf/isreal`, `paddle.isin`, `paddle.hsplit/dsplit`, `paddle.column_stack/row_stack/dstack/hstack/vstack`, `paddle.slice_scatter`, `paddle.masked_scatter` [#60553](https://github.com/PaddlePaddle/Paddle/pull/60553), [#59311](https://github.com/PaddlePaddle/Paddle/pull/59311), [#59357](https://github.com/PaddlePaddle/Paddle/pull/59357), [#63521](https://github.com/PaddlePaddle/Paddle/pull/63521), [#57869](https://github.com/PaddlePaddle/Paddle/pull/57869), [#57880](https://github.com/PaddlePaddle/Paddle/pull/57880), [#57882](https://github.com/PaddlePaddle/Paddle/pull/57882), [#60150](https://github.com/PaddlePaddle/Paddle/pull/60150), [#57785](https://github.com/PaddlePaddle/Paddle/pull/57785), [#58092](https://github.com/PaddlePaddle/Paddle/pull/58092), [#63523](https://github.com/PaddlePaddle/Paddle/pull/63523), [#64001](https://github.com/PaddlePaddle/Paddle/pull/64001), [#58917](https://github.com/PaddlePaddle/Paddle/pull/58917), [#59127](https://github.com/PaddlePaddle/Paddle/pull/59127), [#59973](https://github.com/PaddlePaddle/Paddle/pull/59973), [#59383](https://github.com/PaddlePaddle/Paddle/pull/59383) +- Add probability distribution API. `paddle.distribution.ContinuousBernoulli`, `paddle.distribution.MultivariateNormal`, `paddle.distribution.Exponential`, `paddle.distribution.Gamma`, `paddle.distribution.Binomial`, `paddle.distribution.Poisson` [#58004](https://github.com/PaddlePaddle/Paddle/pull/58004), [#57899](https://github.com/PaddlePaddle/Paddle/pull/57899), [#57856](https://github.com/PaddlePaddle/Paddle/pull/57856) +- Add optimizer API. 
`paddle.optimizer.ASGD`, `paddle.optimizer.NAdam`, `paddle.optimizer.RAdam`, `paddle.optimizer.Rprop` [#58834](https://github.com/PaddlePaddle/Paddle/pull/58834), [#63671](https://github.com/PaddlePaddle/Paddle/pull/63671), [#58851](https://github.com/PaddlePaddle/Paddle/pull/58851) +- Add Linear Algebra API. `paddle.linalg.matrix_exp` [#59715](https://github.com/PaddlePaddle/Paddle/pull/59715) +- Add other APIs. `paddle.bernoulli_`, `paddle.nn.ZeroPad1D/ZeroPad3D`, `paddle.nn.AdaptiveLogSoftmaxWithLoss`, `paddle.Tensor.apply` [#64252](https://github.com/PaddlePaddle/Paddle/pull/64252), [#59690](https://github.com/PaddlePaddle/Paddle/pull/59690), [#63728](https://github.com/PaddlePaddle/Paddle/pull/63728), [#63302](https://github.com/PaddlePaddle/Paddle/pull/63302), [#59374](https://github.com/PaddlePaddle/Paddle/pull/59374),[#63227](https://github.com/PaddlePaddle/Paddle/pull/63227) -### New Generation of Paddle Intermediate Representation (PIR) +### Some API Enhancements -PIR systematically abstracts underlying core concepts such as Operation, Attribute and Type, to build a set of flexible and powerful base components for developers. In addition, PaddlePaddle can comprehensively and hierarchically manage requirements of each module on Intermediate Representation (IR) by introducing the concept of Dialect, and support developers to customize extension of Dialect according to specific needs to significantly improving scalability and adaptability of framework. In terms of designs, PIR strictly follows the Static Single Assignment (SSA) principle, unifies top-level structure, realizes compatibility of "Operator sequentiality" and "computational graph semantics". This provides a clear and consistent view of the complex computation process. In order to further optimize performance of large models, PIR also provides a set of more concise and low-cost Pass development processes, including Declarative Rewrite Rule (DRR) and Pattern Rewriter. 
In addition, a series of rich and full-featured Pass optimization strategies are built-in, to deeply optimize application according to characteristics of large models, thus providing strong support for ultimate performance of large models. Through these innovative designs and optimization methods, PIR lays a solid foundation for efficient operation and continuous expansion of the PaddlePaddle framework. +- Enhance about 30 APIs to support complex number computation, such as `paddle.log`, `paddle.log1p`, `paddle.square`, and `paddle.reciprocal`, extending support to more scientific computing scenarios. [#62448](https://github.com/PaddlePaddle/Paddle/pull/62448), [#60821](https://github.com/PaddlePaddle/Paddle/pull/60821), [#60897](https://github.com/PaddlePaddle/Paddle/pull/60897), [#62764](https://github.com/PaddlePaddle/Paddle/pull/62764), [#59536](https://github.com/PaddlePaddle/Paddle/pull/59536), [#59529](https://github.com/PaddlePaddle/Paddle/pull/59529), [#63207](https://github.com/PaddlePaddle/Paddle/pull/63207), [#62237](https://github.com/PaddlePaddle/Paddle/pull/62237), [#64684](https://github.com/PaddlePaddle/Paddle/pull/64684) +- Enhance 46 APIs to make existing APIs easier to use and to ease code conversion, including but not limited to adding API parameters, extending the data types supported by the APIs, and fixing existing unreasonable designs.
[#59890](https://github.com/PaddlePaddle/Paddle/pull/59890), [#63513](https://github.com/PaddlePaddle/Paddle/pull/63513), [#59674](https://github.com/PaddlePaddle/Paddle/pull/59674), [#62778](https://github.com/PaddlePaddle/Paddle/pull/62778), [#64110](https://github.com/PaddlePaddle/Paddle/pull/64110), [#63222](https://github.com/PaddlePaddle/Paddle/pull/63222), [#64331](https://github.com/PaddlePaddle/Paddle/pull/64331), [#64715](https://github.com/PaddlePaddle/Paddle/pull/64715), [#61155](https://github.com/PaddlePaddle/Paddle/pull/61155), [#60070](https://github.com/PaddlePaddle/Paddle/pull/60070), [#61974](https://github.com/PaddlePaddle/Paddle/pull/61974), [#62407](https://github.com/PaddlePaddle/Paddle/pull/62407), [#62672](https://github.com/PaddlePaddle/Paddle/pull/62672),[#62722](https://github.com/PaddlePaddle/Paddle/pull/62722), [#62876](https://github.com/PaddlePaddle/Paddle/pull/62876), [#63284](https://github.com/PaddlePaddle/Paddle/pull/63284), [#63860](https://github.com/PaddlePaddle/Paddle/pull/63860), [#60466](https://github.com/PaddlePaddle/Paddle/pull/60466), [#63690](https://github.com/PaddlePaddle/Paddle/pull/63690), [#63953](https://github.com/PaddlePaddle/Paddle/pull/63953), [#63901](https://github.com/PaddlePaddle/Paddle/pull/63901), [#62624](https://github.com/PaddlePaddle/Paddle/pull/62624), [#59857](https://github.com/PaddlePaddle/Paddle/pull/59857), [#60084](https://github.com/PaddlePaddle/Paddle/pull/60084), [#60766](https://github.com/PaddlePaddle/Paddle/pull/60766), [#62788](https://github.com/PaddlePaddle/Paddle/pull/62788), [#62937](https://github.com/PaddlePaddle/Paddle/pull/62937), [#63134](https://github.com/PaddlePaddle/Paddle/pull/63134), [#62966](https://github.com/PaddlePaddle/Paddle/pull/62966), [#63648](https://github.com/PaddlePaddle/Paddle/pull/63648), [#63881](https://github.com/PaddlePaddle/Paddle/pull/63881), [#64358](https://github.com/PaddlePaddle/Paddle/pull/64358), 
[#60503](https://github.com/PaddlePaddle/Paddle/pull/60503), [#63604](https://github.com/PaddlePaddle/Paddle/pull/63604), [#62338](https://github.com/PaddlePaddle/Paddle/pull/62338) +- Enhance the unit-test infrastructure for higher-order differentiation, making it easier to add unit-test cases for higher-order differentiation. [#62074](https://github.com/PaddlePaddle/Paddle/pull/62074) +### API Performance Improvements -#### New features -- Abstracted core concepts of IR bottom layer and provided developers with flexible base components, such as Operation, Attribute, Value, Type, Trait, and Interface. [#56354](https://github.com/PaddlePaddle/Paddle/pull/56354),[#57106](https://github.com/PaddlePaddle/Paddle/pull/57106),[#57349](https://github.com/PaddlePaddle/Paddle/pull/57349),[#54844](https://github.com/PaddlePaddle/Paddle/pull/54844),[#54984](https://github.com/PaddlePaddle/Paddle/pull/54984),[#54565](https://github.com/PaddlePaddle/Paddle/pull/54565),[#54562](https://github.com/PaddlePaddle/Paddle/pull/54562),[#57249](https://github.com/PaddlePaddle/Paddle/pull/57249),[#57550](https://github.com/PaddlePaddle/Paddle/pull/57550),[#59278](https://github.com/PaddlePaddle/Paddle/pull/59278),[#54875](https://github.com/PaddlePaddle/Paddle/pull/54875),[#55041](https://github.com/PaddlePaddle/Paddle/pull/55041),[#54987](https://github.com/PaddlePaddle/Paddle/pull/54987),[#55903](https://github.com/PaddlePaddle/Paddle/pull/55903),[#57582](https://github.com/PaddlePaddle/Paddle/pull/57582),[#57580](https://github.com/PaddlePaddle/Paddle/pull/57580),[#58052](https://github.com/PaddlePaddle/Paddle/pull/58052),[#55322](https://github.com/PaddlePaddle/Paddle/pull/55322),[#57418](https://github.com/PaddlePaddle/Paddle/pull/57418),[#57635](https://github.com/PaddlePaddle/Paddle/pull/57635),[#55328](https://github.com/PaddlePaddle/Paddle/pull/55328),[#57463](https://github.com/PaddlePaddle/Paddle/pull/57463),[#59791](https://github.com/PaddlePaddle/Paddle/pull/59791),[#5982
1](https://github.com/PaddlePaddle/Paddle/pull/59821),[#59115](https://github.com/PaddlePaddle/Paddle/pull/59115),[#57461](https://github.com/PaddlePaddle/Paddle/pull/57461),[#59392](https://github.com/PaddlePaddle/Paddle/pull/59392),[#57373](https://github.com/PaddlePaddle/Paddle/pull/57373),[#59118](https://github.com/PaddlePaddle/Paddle/pull/59118) -- Added Dialect mechanism to support comprehensive and hierarchical management of intermediate representation requirements of each module of framework. Through built-in Builtin Dialect, it supports developers to customize and extend Dialect according to their needs. [#56325](https://github.com/PaddlePaddle/Paddle/pull/56325),[#57539](https://github.com/PaddlePaddle/Paddle/pull/57539),[#54682](https://github.com/PaddlePaddle/Paddle/pull/54682),[#55381](https://github.com/PaddlePaddle/Paddle/pull/55381),[#56156](https://github.com/PaddlePaddle/Paddle/pull/56156),[#56431](https://github.com/PaddlePaddle/Paddle/pull/56431),[#56615](https://github.com/PaddlePaddle/Paddle/pull/56615),[#57103](https://github.com/PaddlePaddle/Paddle/pull/57103),[#57209](https://github.com/PaddlePaddle/Paddle/pull/57209) -- Normalized PaddlePaddle static graph operator system. Added OperatorDialect and KernelDialect. Managed conceptual differences of operators in the form of Dialect during compilation and execution, making Architecture clearer. 
[#56284](https://github.com/PaddlePaddle/Paddle/pull/56284),[#54469](https://github.com/PaddlePaddle/Paddle/pull/54469),[#58660](https://github.com/PaddlePaddle/Paddle/pull/58660),[#58975](https://github.com/PaddlePaddle/Paddle/pull/58975),[#56680](https://github.com/PaddlePaddle/Paddle/pull/56680),[#54790](https://github.com/PaddlePaddle/Paddle/pull/54790),[#54826](https://github.com/PaddlePaddle/Paddle/pull/54826),[#54840](https://github.com/PaddlePaddle/Paddle/pull/54840),[#55699](https://github.com/PaddlePaddle/Paddle/pull/55699),[#55648](https://github.com/PaddlePaddle/Paddle/pull/55648),[#55880](https://github.com/PaddlePaddle/Paddle/pull/55880),[#56101](https://github.com/PaddlePaddle/Paddle/pull/56101),[#56754](https://github.com/PaddlePaddle/Paddle/pull/56754),[#54944](https://github.com/PaddlePaddle/Paddle/pull/54944),[#56836](https://github.com/PaddlePaddle/Paddle/pull/56836),[#57185](https://github.com/PaddlePaddle/Paddle/pull/57185),[#58757](https://github.com/PaddlePaddle/Paddle/pull/58757),[#56243](https://github.com/PaddlePaddle/Paddle/pull/56243),[#56436](https://github.com/PaddlePaddle/Paddle/pull/56436),[#57741](https://github.com/PaddlePaddle/Paddle/pull/57741),[#59124](https://github.com/PaddlePaddle/Paddle/pull/59124),[#57054](https://github.com/PaddlePaddle/Paddle/pull/57054),[#56984](https://github.com/PaddlePaddle/Paddle/pull/56984),[#57403](https://github.com/PaddlePaddle/Paddle/pull/57403),[#57904](https://github.com/PaddlePaddle/Paddle/pull/57904),[#58031](https://github.com/PaddlePaddle/Paddle/pull/58031),[#56924](https://github.com/PaddlePaddle/Paddle/pull/56924),[#59270](https://github.com/PaddlePaddle/Paddle/pull/59270),[#55343](https://github.com/PaddlePaddle/Paddle/pull/55343),[#56557](https://github.com/PaddlePaddle/Paddle/pull/56557),[#55693](https://github.com/PaddlePaddle/Paddle/pull/55693),[#54428](https://github.com/PaddlePaddle/Paddle/pull/54428) -- Added ShapeDialect with built-in rich shape operation operators for 
constructing dynamic shape constraints and expressions for AI compilers. [#56727](https://github.com/PaddlePaddle/Paddle/pull/56727),[#59254](https://github.com/PaddlePaddle/Paddle/pull/59254),[#58368](https://github.com/PaddlePaddle/Paddle/pull/58368),[#57069](https://github.com/PaddlePaddle/Paddle/pull/57069),[#57337](https://github.com/PaddlePaddle/Paddle/pull/57337),[#56351](https://github.com/PaddlePaddle/Paddle/pull/56351),[#57029](https://github.com/PaddlePaddle/Paddle/pull/57029),[#58036](https://github.com/PaddlePaddle/Paddle/pull/58036),[#59032](https://github.com/PaddlePaddle/Paddle/pull/59032),[#57961](https://github.com/PaddlePaddle/Paddle/pull/57961),[#56427](https://github.com/PaddlePaddle/Paddle/pull/56427),[#57459](https://github.com/PaddlePaddle/Paddle/pull/57459) -- Unified top-level structure of Framework Program, supporting compatible representation of "operator sequentiality" and "computational graph semantics", decoupling dependency on ir::Graph, and strictly following the principle of Static Single Assignment (SSA). [#59369](https://github.com/PaddlePaddle/Paddle/pull/59369),[#54563](https://github.com/PaddlePaddle/Paddle/pull/54563),[#57051](https://github.com/PaddlePaddle/Paddle/pull/57051),[#57306](https://github.com/PaddlePaddle/Paddle/pull/57306),[#57857](https://github.com/PaddlePaddle/Paddle/pull/57857) -- Added IrPrinter and IrPaser components to support serialization and deserialization of PIR Programs, providing a friendly debugging experience for PIR development. 
[#55695](https://github.com/PaddlePaddle/Paddle/pull/55695),[#59449](https://github.com/PaddlePaddle/Paddle/pull/59449),[#54369](https://github.com/PaddlePaddle/Paddle/pull/54369),[#54499](https://github.com/PaddlePaddle/Paddle/pull/54499),[#55518](https://github.com/PaddlePaddle/Paddle/pull/55518),[#55784](https://github.com/PaddlePaddle/Paddle/pull/55784),[#57180](https://github.com/PaddlePaddle/Paddle/pull/57180),[#57471](https://github.com/PaddlePaddle/Paddle/pull/57471),[#54859](https://github.com/PaddlePaddle/Paddle/pull/54859),[#54968](https://github.com/PaddlePaddle/Paddle/pull/54968),[#55209](https://github.com/PaddlePaddle/Paddle/pull/55209),[#57314](https://github.com/PaddlePaddle/Paddle/pull/57314),[#57969](https://github.com/PaddlePaddle/Paddle/pull/57969) -- Built a new, simple and low-cost Pass development system based on Declarative Rewrite Rule (DDR) and Pattern Rewriter, with built-in a series of rich and full-featured Pass Optimization strategies, to accelerate training and inference execution process. [#54385](https://github.com/PaddlePaddle/Paddle/pull/54385),[#54738](https://github.com/PaddlePaddle/Paddle/pull/54738),[#55859](https://github.com/PaddlePaddle/Paddle/pull/55859),[#56638](https://github.com/PaddlePaddle/Paddle/pull/56638),[#57090](https://github.com/PaddlePaddle/Paddle/pull/57090),[#58673](https://github.com/PaddlePaddle/Paddle/pull/58673),[#59415](https://github.com/PaddlePaddle/Paddle/pull/59415),[#56729](https://github.com/PaddlePaddle/Paddle/pull/56729),[#58655](https://github.com/PaddlePaddle/Paddle/pull/58655) -- Added ProgramTranslator component, to support conversion from ProgramDesc to new generation of IR representations of PaddlePaddle by pressing one key, with provision of easy-to-use C++ and Python interfaces. 
[#55433](https://github.com/PaddlePaddle/Paddle/pull/55433),[#54470](https://github.com/PaddlePaddle/Paddle/pull/54470),[#58044](https://github.com/PaddlePaddle/Paddle/pull/58044),[#58390](https://github.com/PaddlePaddle/Paddle/pull/58390),[#58100](https://github.com/PaddlePaddle/Paddle/pull/58100),[#55403](https://github.com/PaddlePaddle/Paddle/pull/55403),[#55406](https://github.com/PaddlePaddle/Paddle/pull/55406),[#54719](https://github.com/PaddlePaddle/Paddle/pull/54719),[#56550](https://github.com/PaddlePaddle/Paddle/pull/56550),[#55448](https://github.com/PaddlePaddle/Paddle/pull/55448),[#55453](https://github.com/PaddlePaddle/Paddle/pull/55453),[#56294](https://github.com/PaddlePaddle/Paddle/pull/56294),[#56308](https://github.com/PaddlePaddle/Paddle/pull/56308),[#56842](https://github.com/PaddlePaddle/Paddle/pull/56842),[#58517](https://github.com/PaddlePaddle/Paddle/pull/58517) -- With help of automatic code generation technology, it can generate the full amount of static graph operator representations for PaddlePaddle framework by pressing one key. Sank static graph networking logic to C++ side and bind it to \_C_ops module. This can greatly streamline API code on Python side, realize ultimate dynamic-static unification of APIs of PaddlePaddle Framework, and upgrade a lot of Python APIs to support static graph networking of the new IR. 
[#56570](https://github.com/PaddlePaddle/Paddle/pull/56570),[#55745](https://github.com/PaddlePaddle/Paddle/pull/55745),[#56955](https://github.com/PaddlePaddle/Paddle/pull/56955),[#57298](https://github.com/PaddlePaddle/Paddle/pull/57298),[#57946](https://github.com/PaddlePaddle/Paddle/pull/57946),[#57248](https://github.com/PaddlePaddle/Paddle/pull/57248),[#56080](https://github.com/PaddlePaddle/Paddle/pull/56080),[#54396](https://github.com/PaddlePaddle/Paddle/pull/54396),[#54551](https://github.com/PaddlePaddle/Paddle/pull/54551),[#56520](https://github.com/PaddlePaddle/Paddle/pull/56520),[#55002](https://github.com/PaddlePaddle/Paddle/pull/55002),[#57067](https://github.com/PaddlePaddle/Paddle/pull/57067),[#59320](https://github.com/PaddlePaddle/Paddle/pull/59320),[#59348](https://github.com/PaddlePaddle/Paddle/pull/59348),[#57164](https://github.com/PaddlePaddle/Paddle/pull/57164),[#57267](https://github.com/PaddlePaddle/Paddle/pull/57267),[#59064](https://github.com/PaddlePaddle/Paddle/pull/59064),[#54340](https://github.com/PaddlePaddle/Paddle/pull/54340),[#54895](https://github.com/PaddlePaddle/Paddle/pull/54895),[#55004](https://github.com/PaddlePaddle/Paddle/pull/55004),[#56196](https://github.com/PaddlePaddle/Paddle/pull/56196),[#56862](https://github.com/PaddlePaddle/Paddle/pull/56862),[#58991](https://github.com/PaddlePaddle/Paddle/pull/58991),[#55428](https://github.com/PaddlePaddle/Paddle/pull/55428),[#55909](https://github.com/PaddlePaddle/Paddle/pull/55909),[#56241](https://github.com/PaddlePaddle/Paddle/pull/56241),[#56526](https://github.com/PaddlePaddle/Paddle/pull/56526),[#56571](https://github.com/PaddlePaddle/Paddle/pull/56571),[#56518](https://github.com/PaddlePaddle/Paddle/pull/56518),[#57016](https://github.com/PaddlePaddle/Paddle/pull/57016),[#56653](https://github.com/PaddlePaddle/Paddle/pull/56653),[#56809](https://github.com/PaddlePaddle/Paddle/pull/56809),[#57158](https://github.com/PaddlePaddle/Paddle/pull/57158),[#55422](https://git
hub.com/PaddlePaddle/Paddle/pull/55422),[#55458](https://github.com/PaddlePaddle/Paddle/pull/55458),[#55432](https://github.com/PaddlePaddle/Paddle/pull/55432),[#55467](https://github.com/PaddlePaddle/Paddle/pull/55467),[#55483](https://github.com/PaddlePaddle/Paddle/pull/55483),[#55419](https://github.com/PaddlePaddle/Paddle/pull/55419),[#55517](https://github.com/PaddlePaddle/Paddle/pull/55517),[#55500](https://github.com/PaddlePaddle/Paddle/pull/55500),[#56674](https://github.com/PaddlePaddle/Paddle/pull/56674),[#57693](https://github.com/PaddlePaddle/Paddle/pull/57693),[#55008](https://github.com/PaddlePaddle/Paddle/pull/55008),[#57166](https://github.com/PaddlePaddle/Paddle/pull/57166),[#57157](https://github.com/PaddlePaddle/Paddle/pull/57157),[#57159](https://github.com/PaddlePaddle/Paddle/pull/57159),[#57175](https://github.com/PaddlePaddle/Paddle/pull/57175),[#57325](https://github.com/PaddlePaddle/Paddle/pull/57325),[#57330](https://github.com/PaddlePaddle/Paddle/pull/57330),[#57415](https://github.com/PaddlePaddle/Paddle/pull/57415),[#57122](https://github.com/PaddlePaddle/Paddle/pull/57122),[#57393](https://github.com/PaddlePaddle/Paddle/pull/57393),[#57344](https://github.com/PaddlePaddle/Paddle/pull/57344),[#57667](https://github.com/PaddlePaddle/Paddle/pull/57667),[#57348](https://github.com/PaddlePaddle/Paddle/pull/57348),[#57700](https://github.com/PaddlePaddle/Paddle/pull/57700),[#58093](https://github.com/PaddlePaddle/Paddle/pull/58093),[#58005](https://github.com/PaddlePaddle/Paddle/pull/58005),[#58081](https://github.com/PaddlePaddle/Paddle/pull/58081),[#58094](https://github.com/PaddlePaddle/Paddle/pull/58094),[#58137](https://github.com/PaddlePaddle/Paddle/pull/58137),[#58287](https://github.com/PaddlePaddle/Paddle/pull/58287),[#58352](https://github.com/PaddlePaddle/Paddle/pull/58352),[#58340](https://github.com/PaddlePaddle/Paddle/pull/58340),[#58363](https://github.com/PaddlePaddle/Paddle/pull/58363),[#58331](https://github.com/PaddlePaddle
/Paddle/pull/58331),[#58343](https://github.com/PaddlePaddle/Paddle/pull/58343),[#58317](https://github.com/PaddlePaddle/Paddle/pull/58317),[#58450](https://github.com/PaddlePaddle/Paddle/pull/58450),[#58377](https://github.com/PaddlePaddle/Paddle/pull/58377),[#58466](https://github.com/PaddlePaddle/Paddle/pull/58466),[#58470](https://github.com/PaddlePaddle/Paddle/pull/58470),[#58491](https://github.com/PaddlePaddle/Paddle/pull/58491),[#58546](https://github.com/PaddlePaddle/Paddle/pull/58546),[#58587](https://github.com/PaddlePaddle/Paddle/pull/58587),[#58453](https://github.com/PaddlePaddle/Paddle/pull/58453),[#58634](https://github.com/PaddlePaddle/Paddle/pull/58634),[#58604](https://github.com/PaddlePaddle/Paddle/pull/58604),[#58605](https://github.com/PaddlePaddle/Paddle/pull/58605),[#58593](https://github.com/PaddlePaddle/Paddle/pull/58593),[#58675](https://github.com/PaddlePaddle/Paddle/pull/58675),[#58699](https://github.com/PaddlePaddle/Paddle/pull/58699),[#58384](https://github.com/PaddlePaddle/Paddle/pull/58384),[#58629](https://github.com/PaddlePaddle/Paddle/pull/58629),[#58579](https://github.com/PaddlePaddle/Paddle/pull/58579),[#58695](https://github.com/PaddlePaddle/Paddle/pull/58695),[#58548](https://github.com/PaddlePaddle/Paddle/pull/58548),[#58688](https://github.com/PaddlePaddle/Paddle/pull/58688),[#58792](https://github.com/PaddlePaddle/Paddle/pull/58792),[#58843](https://github.com/PaddlePaddle/Paddle/pull/58843),[#58840](https://github.com/PaddlePaddle/Paddle/pull/58840),[#58718](https://github.com/PaddlePaddle/Paddle/pull/58718),[#58883](https://github.com/PaddlePaddle/Paddle/pull/58883),[#58785](https://github.com/PaddlePaddle/Paddle/pull/58785),[#58608](https://github.com/PaddlePaddle/Paddle/pull/58608),[#58781](https://github.com/PaddlePaddle/Paddle/pull/58781),[#58783](https://github.com/PaddlePaddle/Paddle/pull/58783),[#58429](https://github.com/PaddlePaddle/Paddle/pull/58429),[#58685](https://github.com/PaddlePaddle/Paddle/pull/58685),
[#58696](https://github.com/PaddlePaddle/Paddle/pull/58696),[#58690](https://github.com/PaddlePaddle/Paddle/pull/58690),[#58831](https://github.com/PaddlePaddle/Paddle/pull/58831),[#58929](https://github.com/PaddlePaddle/Paddle/pull/58929),[#58740](https://github.com/PaddlePaddle/Paddle/pull/58740),[#58937](https://github.com/PaddlePaddle/Paddle/pull/58937),[#58782](https://github.com/PaddlePaddle/Paddle/pull/58782),[#58833](https://github.com/PaddlePaddle/Paddle/pull/58833),[#58882](https://github.com/PaddlePaddle/Paddle/pull/58882),[#58935](https://github.com/PaddlePaddle/Paddle/pull/58935),[#58931](https://github.com/PaddlePaddle/Paddle/pull/58931),[#59041](https://github.com/PaddlePaddle/Paddle/pull/59041),[#59040](https://github.com/PaddlePaddle/Paddle/pull/59040),[#58877](https://github.com/PaddlePaddle/Paddle/pull/58877),[#58888](https://github.com/PaddlePaddle/Paddle/pull/58888),[#59042](https://github.com/PaddlePaddle/Paddle/pull/59042),[#58780](https://github.com/PaddlePaddle/Paddle/pull/58780),[#58682](https://github.com/PaddlePaddle/Paddle/pull/58682),[#58815](https://github.com/PaddlePaddle/Paddle/pull/58815),[#58676](https://github.com/PaddlePaddle/Paddle/pull/58676),[#58678](https://github.com/PaddlePaddle/Paddle/pull/58678),[#58446](https://github.com/PaddlePaddle/Paddle/pull/58446),[#59077](https://github.com/PaddlePaddle/Paddle/pull/59077),[#59091](https://github.com/PaddlePaddle/Paddle/pull/59091),[#58661](https://github.com/PaddlePaddle/Paddle/pull/58661),[#58832](https://github.com/PaddlePaddle/Paddle/pull/58832),[#58642](https://github.com/PaddlePaddle/Paddle/pull/58642),[#58698](https://github.com/PaddlePaddle/Paddle/pull/58698),[#59313](https://github.com/PaddlePaddle/Paddle/pull/59313),[#59371](https://github.com/PaddlePaddle/Paddle/pull/59371),[#58700](https://github.com/PaddlePaddle/Paddle/pull/58700),[#58953](https://github.com/PaddlePaddle/Paddle/pull/58953),[#58879](https://github.com/PaddlePaddle/Paddle/pull/58879),[#59469](https://git
hub.com/PaddlePaddle/Paddle/pull/59469),[#59573](https://github.com/PaddlePaddle/Paddle/pull/59573),[#59481](https://github.com/PaddlePaddle/Paddle/pull/59481),[#59419](https://github.com/PaddlePaddle/Paddle/pull/59419),[#59509](https://github.com/PaddlePaddle/Paddle/pull/59509),[#58735](https://github.com/PaddlePaddle/Paddle/pull/58735),[#59616](https://github.com/PaddlePaddle/Paddle/pull/59616),[#59582](https://github.com/PaddlePaddle/Paddle/pull/59582),[#59420](https://github.com/PaddlePaddle/Paddle/pull/59420),[#59500](https://github.com/PaddlePaddle/Paddle/pull/59500),[#58911](https://github.com/PaddlePaddle/Paddle/pull/58911),[#59535](https://github.com/PaddlePaddle/Paddle/pull/59535),[#54891](https://github.com/PaddlePaddle/Paddle/pull/54891),[#56794](https://github.com/PaddlePaddle/Paddle/pull/56794),[#57477](https://github.com/PaddlePaddle/Paddle/pull/57477),[#57929](https://github.com/PaddlePaddle/Paddle/pull/57929),[#57765](https://github.com/PaddlePaddle/Paddle/pull/57765),[#58693](https://github.com/PaddlePaddle/Paddle/pull/58693),[#58603](https://github.com/PaddlePaddle/Paddle/pull/58603),[#56291](https://github.com/PaddlePaddle/Paddle/pull/56291),[#57123](https://github.com/PaddlePaddle/Paddle/pull/57123),[#57317](https://github.com/PaddlePaddle/Paddle/pull/57317),[#57341](https://github.com/PaddlePaddle/Paddle/pull/57341),[#57020](https://github.com/PaddlePaddle/Paddle/pull/57020),[#57324](https://github.com/PaddlePaddle/Paddle/pull/57324),[#57761](https://github.com/PaddlePaddle/Paddle/pull/57761),[#57762](https://github.com/PaddlePaddle/Paddle/pull/57762),[#57907](https://github.com/PaddlePaddle/Paddle/pull/57907),[#57909](https://github.com/PaddlePaddle/Paddle/pull/57909),[#58099](https://github.com/PaddlePaddle/Paddle/pull/58099),[#58110](https://github.com/PaddlePaddle/Paddle/pull/58110),[#58114](https://github.com/PaddlePaddle/Paddle/pull/58114),[#58139](https://github.com/PaddlePaddle/Paddle/pull/58139),[#58144](https://github.com/PaddlePaddle
/Paddle/pull/58144),[#58165](https://github.com/PaddlePaddle/Paddle/pull/58165),[#58194](https://github.com/PaddlePaddle/Paddle/pull/58194),[#58138](https://github.com/PaddlePaddle/Paddle/pull/58138),[#58113](https://github.com/PaddlePaddle/Paddle/pull/58113),[#58245](https://github.com/PaddlePaddle/Paddle/pull/58245),[#58318](https://github.com/PaddlePaddle/Paddle/pull/58318),[#58105](https://github.com/PaddlePaddle/Paddle/pull/58105),[#58348](https://github.com/PaddlePaddle/Paddle/pull/58348),[#58235](https://github.com/PaddlePaddle/Paddle/pull/58235),[#58354](https://github.com/PaddlePaddle/Paddle/pull/58354),[#58341](https://github.com/PaddlePaddle/Paddle/pull/58341),[#58445](https://github.com/PaddlePaddle/Paddle/pull/58445),[#58418](https://github.com/PaddlePaddle/Paddle/pull/58418),[#58239](https://github.com/PaddlePaddle/Paddle/pull/58239),[#58473](https://github.com/PaddlePaddle/Paddle/pull/58473),[#58239](https://github.com/PaddlePaddle/Paddle/pull/58239),[#58391](https://github.com/PaddlePaddle/Paddle/pull/58391),[#58501](https://github.com/PaddlePaddle/Paddle/pull/58501),[#58519](https://github.com/PaddlePaddle/Paddle/pull/58519),[#58416](https://github.com/PaddlePaddle/Paddle/pull/58416),[#58588](https://github.com/PaddlePaddle/Paddle/pull/58588),[#58531](https://github.com/PaddlePaddle/Paddle/pull/58531),[#58730](https://github.com/PaddlePaddle/Paddle/pull/58730),[#58773](https://github.com/PaddlePaddle/Paddle/pull/58773),[#58862](https://github.com/PaddlePaddle/Paddle/pull/58862),[#58946](https://github.com/PaddlePaddle/Paddle/pull/58946),[#58500](https://github.com/PaddlePaddle/Paddle/pull/58500),[#56585](https://github.com/PaddlePaddle/Paddle/pull/56585),[#57480](https://github.com/PaddlePaddle/Paddle/pull/57480),[#57433](https://github.com/PaddlePaddle/Paddle/pull/57433),[#58498](https://github.com/PaddlePaddle/Paddle/pull/58498) +- Focus on optimizing the performance of Tensor basic index, advanced index, and combined index, improving 
computational performance by 2X to 31X on GPUs and 1.8X to 1004X on CPUs. [#60254](https://github.com/PaddlePaddle/Paddle/pull/60254), [#60276](https://github.com/PaddlePaddle/Paddle/pull/60276), [#60452](https://github.com/PaddlePaddle/Paddle/pull/60452), [#60771](https://github.com/PaddlePaddle/Paddle/pull/60771), [#61021](https://github.com/PaddlePaddle/Paddle/pull/61021), [#60983](https://github.com/PaddlePaddle/Paddle/pull/60983), [#61060](https://github.com/PaddlePaddle/Paddle/pull/61060), [#60618](https://github.com/PaddlePaddle/Paddle/pull/60618) -#### Function optimization +### Bug Fixing -- Upgraded static graph executor to extend more Kernel Instruction types, and supported loading of PIR with efficiently scheduling execution. This has significant video memory and performance gains in training and inference. [#54570](https://github.com/PaddlePaddle/Paddle/pull/54570),[#58665](https://github.com/PaddlePaddle/Paddle/pull/58665),[#57291](https://github.com/PaddlePaddle/Paddle/pull/57291),[#54452](https://github.com/PaddlePaddle/Paddle/pull/54452),[#57431](https://github.com/PaddlePaddle/Paddle/pull/57431),[#54692](https://github.com/PaddlePaddle/Paddle/pull/54692),[#55112](https://github.com/PaddlePaddle/Paddle/pull/55112),[#55210](https://github.com/PaddlePaddle/Paddle/pull/55210),[#55401](https://github.com/PaddlePaddle/Paddle/pull/55401),[#55772](https://github.com/PaddlePaddle/Paddle/pull/55772),[#55828](https://github.com/PaddlePaddle/Paddle/pull/55828),[#56148](https://github.com/PaddlePaddle/Paddle/pull/56148),[#54763](https://github.com/PaddlePaddle/Paddle/pull/54763),[#56886](https://github.com/PaddlePaddle/Paddle/pull/56886),[#57284](https://github.com/PaddlePaddle/Paddle/pull/57284),[#57268](https://github.com/PaddlePaddle/Paddle/pull/57268),[#57791](https://github.com/PaddlePaddle/Paddle/pull/57791),[#56789](https://github.com/PaddlePaddle/Paddle/pull/56789),[#56704](https://github.com/PaddlePaddle/Paddle/pull/56704),[#57594](https://github.com/P
addlePaddle/Paddle/pull/57594),[#58397](https://github.com/PaddlePaddle/Paddle/pull/58397),[#58337](https://github.com/PaddlePaddle/Paddle/pull/58337),[#58756](https://github.com/PaddlePaddle/Paddle/pull/58756),[#58371](https://github.com/PaddlePaddle/Paddle/pull/58371) -- Reconstructed auto-differentiation module for PIR, migrate and adapted the high-order auto-differentiation function. Optimized Stop Gradient transfer mechanism, so logic is clearer and function is more robust. [#55660](https://github.com/PaddlePaddle/Paddle/pull/55660),[#57084](https://github.com/PaddlePaddle/Paddle/pull/57084),[#56890](https://github.com/PaddlePaddle/Paddle/pull/56890),[#58942](https://github.com/PaddlePaddle/Paddle/pull/58942),[#59373](https://github.com/PaddlePaddle/Paddle/pull/59373),[#57206](https://github.com/PaddlePaddle/Paddle/pull/57206),[#58145](https://github.com/PaddlePaddle/Paddle/pull/58145),[#55235](https://github.com/PaddlePaddle/Paddle/pull/55235),[#57255](https://github.com/PaddlePaddle/Paddle/pull/57255),[#56925](https://github.com/PaddlePaddle/Paddle/pull/56925),[#55957](https://github.com/PaddlePaddle/Paddle/pull/55957),[#56163](https://github.com/PaddlePaddle/Paddle/pull/56163),[#56316](https://github.com/PaddlePaddle/Paddle/pull/56316),[#57294](https://github.com/PaddlePaddle/Paddle/pull/57294),[#57449](https://github.com/PaddlePaddle/Paddle/pull/57449),[#59520](https://github.com/PaddlePaddle/Paddle/pull/59520),[#59565](https://github.com/PaddlePaddle/Paddle/pull/59565),[#56265](https://github.com/PaddlePaddle/Paddle/pull/56265),[#56512](https://github.com/PaddlePaddle/Paddle/pull/56512),[#56650](https://github.com/PaddlePaddle/Paddle/pull/56650),[#57183](https://github.com/PaddlePaddle/Paddle/pull/57183),[#57956](https://github.com/PaddlePaddle/Paddle/pull/57956),[#59100](https://github.com/PaddlePaddle/Paddle/pull/59100) -- Optimized design and representation of control flow forward and reverse operators, introduced ControlFlow Dialect, and supported 
conversion and execution from control flow operators to PIR under ProgramDesc. [#58729](https://github.com/PaddlePaddle/Paddle/pull/58729),[#57364](https://github.com/PaddlePaddle/Paddle/pull/57364),[#58625](https://github.com/PaddlePaddle/Paddle/pull/58625),[#57475](https://github.com/PaddlePaddle/Paddle/pull/57475),[#57265](https://github.com/PaddlePaddle/Paddle/pull/57265),[#56799](https://github.com/PaddlePaddle/Paddle/pull/56799),[#59033](https://github.com/PaddlePaddle/Paddle/pull/59033),[#57342](https://github.com/PaddlePaddle/Paddle/pull/57342),[#57801](https://github.com/PaddlePaddle/Paddle/pull/57801),[#57958](https://github.com/PaddlePaddle/Paddle/pull/57958),[#57949](https://github.com/PaddlePaddle/Paddle/pull/57949),[#57937](https://github.com/PaddlePaddle/Paddle/pull/57937),[#59231](https://github.com/PaddlePaddle/Paddle/pull/59231),[#59496](https://github.com/PaddlePaddle/Paddle/pull/59496),[#59321](https://github.com/PaddlePaddle/Paddle/pull/59321),[#58088](https://github.com/PaddlePaddle/Paddle/pull/58088),[#58198](https://github.com/PaddlePaddle/Paddle/pull/58198),[#58024](https://github.com/PaddlePaddle/Paddle/pull/58024),[#58089](https://github.com/PaddlePaddle/Paddle/pull/58089),[#58086](https://github.com/PaddlePaddle/Paddle/pull/58086),[#59175](https://github.com/PaddlePaddle/Paddle/pull/59175),[#59423](https://github.com/PaddlePaddle/Paddle/pull/59423),[#59567](https://github.com/PaddlePaddle/Paddle/pull/59567),[#58098](https://github.com/PaddlePaddle/Paddle/pull/58098),[#58163](https://github.com/PaddlePaddle/Paddle/pull/58163),[#58250](https://github.com/PaddlePaddle/Paddle/pull/58250),[#58277](https://github.com/PaddlePaddle/Paddle/pull/58277),[#58355](https://github.com/PaddlePaddle/Paddle/pull/58355),[#59020](https://github.com/PaddlePaddle/Paddle/pull/59020),[#59200](https://github.com/PaddlePaddle/Paddle/pull/59200),[#59585](https://github.com/PaddlePaddle/Paddle/pull/59585),[#58109](https://github.com/PaddlePaddle/Paddle/pull/58109) 
-- Upgraded dynamic to static execution flow to support PIR, optimized dynamic to static subgraph Pass mechanism, and supported users to try and use functions in the PIR system under the @to_static function. [#57566](https://github.com/PaddlePaddle/Paddle/pull/57566),[#55620](https://github.com/PaddlePaddle/Paddle/pull/55620),[#56791](https://github.com/PaddlePaddle/Paddle/pull/56791),[#57357](https://github.com/PaddlePaddle/Paddle/pull/57357),[#59152](https://github.com/PaddlePaddle/Paddle/pull/59152),[#59312](https://github.com/PaddlePaddle/Paddle/pull/59312),[#58630](https://github.com/PaddlePaddle/Paddle/pull/58630),[#56035](https://github.com/PaddlePaddle/Paddle/pull/56035),[#59447](https://github.com/PaddlePaddle/Paddle/pull/59447),[#57361](https://github.com/PaddlePaddle/Paddle/pull/57361),[#59261](https://github.com/PaddlePaddle/Paddle/pull/59261),[#59774](https://github.com/PaddlePaddle/Paddle/pull/59774) -- Upgraded combination operator function with introducing the concept of Backend to manage logic of combination operator module of dynamic and static graphs in a hierarchical way. Sank necessary components and operator splitting rules into C++, to dramatically reduce maintenance costs. 
[#58153](https://github.com/PaddlePaddle/Paddle/pull/58153),[#56391](https://github.com/PaddlePaddle/Paddle/pull/56391),[#56614](https://github.com/PaddlePaddle/Paddle/pull/56614),[#57030](https://github.com/PaddlePaddle/Paddle/pull/57030),[#57554](https://github.com/PaddlePaddle/Paddle/pull/57554),[#58018](https://github.com/PaddlePaddle/Paddle/pull/58018),[#58130](https://github.com/PaddlePaddle/Paddle/pull/58130),[#58581](https://github.com/PaddlePaddle/Paddle/pull/58581),[#58679](https://github.com/PaddlePaddle/Paddle/pull/58679),[#59054](https://github.com/PaddlePaddle/Paddle/pull/59054),[#55480](https://github.com/PaddlePaddle/Paddle/pull/55480),[#58451](https://github.com/PaddlePaddle/Paddle/pull/58451),[#55647](https://github.com/PaddlePaddle/Paddle/pull/55647),[#56342](https://github.com/PaddlePaddle/Paddle/pull/56342),[#56798](https://github.com/PaddlePaddle/Paddle/pull/56798),[#57561](https://github.com/PaddlePaddle/Paddle/pull/57561),[#58023](https://github.com/PaddlePaddle/Paddle/pull/58023),[#57722](https://github.com/PaddlePaddle/Paddle/pull/57722) +- Fix errors in `paddle.optimizer.LBFGS` caused by using non-Tensor computations [#60219](https://github.com/PaddlePaddle/Paddle/pull/60219) +- Fix the problem of random numbers not being fixed in `paddle.optimizer.LBFGS` [#60591](https://github.com/PaddlePaddle/Paddle/pull/60591) +- Fix the incorrect calculation of gradient of `set_value` operator [#59034](https://github.com/PaddlePaddle/Paddle/pull/59034) +- Fix the problem of Tensor basic index adapting to PIR [#60259](https://github.com/PaddlePaddle/Paddle/pull/60259), [#61103](https://github.com/PaddlePaddle/Paddle/pull/61103) +- Fix the problem of Tensor combined index assignment [problem](https://github.com/PaddlePaddle/Paddle/issues/60376) [#60447](https://github.com/PaddlePaddle/Paddle/pull/60447) +- Fix the problem when Tensor combined index takes values [problem] [#61922](https://github.com/PaddlePaddle/Paddle/pull/61922) +- Fix 
the stride calculation error in `paddle.flatten`, which makes it possible to add `paddle.flatten_` [#63084](https://github.com/PaddlePaddle/Paddle/pull/63084) +- Fix the inconsistent results between `paddle.index_fill` and `paddle.index_fill_` [#59863](https://github.com/PaddlePaddle/Paddle/pull/59863) +- Fix the error raised by `paddle.masked_scatter` [#60835](https://github.com/PaddlePaddle/Paddle/pull/60835) +- Fix the error raised by `paddle.histogramdd` on CPU [#61891](https://github.com/PaddlePaddle/Paddle/pull/61891) +- Fix the bug where consecutive calls to `paddle.cast_` on CPU lead to incorrect results [#60054](https://github.com/PaddlePaddle/Paddle/pull/60054) +- Fix the `paddle.put_along_axis` bug that occurs when the input size is very large [#60551](https://github.com/PaddlePaddle/Paddle/pull/60551) +- Fix the error raised by `paddle.nanmedian` on CPU [#63221](https://github.com/PaddlePaddle/Paddle/pull/63221) +- Fix the bug that `paddle.median` does not support inputs other than floating-point types in the min branch. [#64444](https://github.com/PaddlePaddle/Paddle/pull/64444) +- Fix the dataloader issue in distributed scenarios. [#62696](https://github.com/PaddlePaddle/Paddle/pull/62696), [#63378](https://github.com/PaddlePaddle/Paddle/pull/63378) +- Fix the formatting issue in error messages [#63106](https://github.com/PaddlePaddle/Paddle/pull/63106), [#63144](https://github.com/PaddlePaddle/Paddle/pull/63144) +- Fix the format issue under GLOG_v>=6. [#63345](https://github.com/PaddlePaddle/Paddle/pull/63345) -#### Performance optimization +### Security Improvements -- Added PIR Program operators such as DCE and constant_folding_pass, and structure-optimized Pass. 
[#54935](https://github.com/PaddlePaddle/Paddle/pull/54935),[#59430](https://github.com/PaddlePaddle/Paddle/pull/59430),[#58753](https://github.com/PaddlePaddle/Paddle/pull/58753),[#58732](https://github.com/PaddlePaddle/Paddle/pull/58732) +- Enhance the checking of parent_ids [#62826](https://github.com/PaddlePaddle/Paddle/pull/62826) -2. Added optimization operators fusing class Pass, such as fused_attention, fused_dropout_add, fused_gemm_epilogue_pass, fused_linear_param_grad_add_pass, fused_weight_only_linear_pass, and fused_softmax_mask_upper_triangle, to improve training and inference performance. [#57557](https://github.com/PaddlePaddle/Paddle/pull/57557),[#58272](https://github.com/PaddlePaddle/Paddle/pull/58272),[#58188](https://github.com/PaddlePaddle/Paddle/pull/58188),[#58401](https://github.com/PaddlePaddle/Paddle/pull/58401),[#59366](https://github.com/PaddlePaddle/Paddle/pull/59366),[#57655](https://github.com/PaddlePaddle/Paddle/pull/57655),[#57360](https://github.com/PaddlePaddle/Paddle/pull/57360),[#56672](https://github.com/PaddlePaddle/Paddle/pull/56672),[#58537](https://github.com/PaddlePaddle/Paddle/pull/58537),[#56247](https://github.com/PaddlePaddle/Paddle/pull/56247),[#59391](https://github.com/PaddlePaddle/Paddle/pull/59391),[#58897](https://github.com/PaddlePaddle/Paddle/pull/58897),[#54933](https://github.com/PaddlePaddle/Paddle/pull/54933) +## Basic Execution Architecture -### Dynamic to static capability enhancement +PIR basic functions have been upgraded and improved comprehensively, and the maturity level has been greatly improved. Based on PIR, the design of the PaddlePaddle infrastructure is more reasonable, ensuring the excellent performance and good scalability of the framework. 
In this version, we have completed the verification of PIR in multiple scenarios: in the single-machine scenario, the PIR back-end switch is complete for dynamic-to-static; in the inference scenario, all stock models have been verified, and 84.2% of them show a gain of more than 10%; in distributed scenarios, verification based on PIR is also complete. Meanwhile, based on PIR, we have completed the development and validation of core modules such as control flow, backward logic, save/load, and OneDNN adaptation, which lays a solid foundation for making PIR the default mode in PaddlePaddle. The functional completeness, execution efficiency, and stability of the PaddlePaddle framework operator system are further improved, giving developers a better usage and development experience. -Dynamic to static graph conversion is a key technology in deep learning frameworks. It allows developers to find the best balance between flexibility and training efficiency. This version of PaddlePaddle has fully upgraded core Dynamic to Static functionality. Success rate of dynamic to static training is up to 100% among 700+ models in PaddlePaddle industry-grade model library. +### Function Optimization -#### New features +- Improve the basic functions of PIR, including basic type system enhancement, debugging, printing, Pass development, and AMP support, to enhance the development efficiency of PIR. 
[#60723](https://github.com/PaddlePaddle/Paddle/pull/60723), [#60677](https://github.com/PaddlePaddle/Paddle/pull/60677), [#60783](https://github.com/PaddlePaddle/Paddle/pull/60783), [#60798](https://github.com/PaddlePaddle/Paddle/pull/60798), [#61053](https://github.com/PaddlePaddle/Paddle/pull/61053), [#61366](https://github.com/PaddlePaddle/Paddle/pull/61366), [#61446](https://github.com/PaddlePaddle/Paddle/pull/61446), [#60024](https://github.com/PaddlePaddle/Paddle/pull/60024), [#59939](https://github.com/PaddlePaddle/Paddle/pull/59939), [#63376](https://github.com/PaddlePaddle/Paddle/pull/63376), [#61853](https://github.com/PaddlePaddle/Paddle/pull/61853), [#63914](https://github.com/PaddlePaddle/Paddle/pull/63914), [#60170](https://github.com/PaddlePaddle/Paddle/pull/60170), [#60678](https://github.com/PaddlePaddle/Paddle/pull/60678), [#64093](https://github.com/PaddlePaddle/Paddle/pull/64093), [#64065](https://github.com/PaddlePaddle/Paddle/pull/64065), [#62451](https://github.com/PaddlePaddle/Paddle/pull/62451), [#59784](https://github.com/PaddlePaddle/Paddle/pull/59784), [#60136](https://github.com/PaddlePaddle/Paddle/pull/60136), [#63336](https://github.com/PaddlePaddle/Paddle/pull/63336), [#62108](https://github.com/PaddlePaddle/Paddle/pull/62108), [#60860](https://github.com/PaddlePaddle/Paddle/pull/60860), [#60536](https://github.com/PaddlePaddle/Paddle/pull/60536), [#60590](https://github.com/PaddlePaddle/Paddle/pull/60590), [#60752](https://github.com/PaddlePaddle/Paddle/pull/60752), [#61435](https://github.com/PaddlePaddle/Paddle/pull/61435), [#62977](https://github.com/PaddlePaddle/Paddle/pull/62977), [#62139](https://github.com/PaddlePaddle/Paddle/pull/62139), [#60432](https://github.com/PaddlePaddle/Paddle/pull/60432), [#61452](https://github.com/PaddlePaddle/Paddle/pull/61452), [#61978](https://github.com/PaddlePaddle/Paddle/pull/61978), [#62262](https://github.com/PaddlePaddle/Paddle/pull/62262), 
[#62422](https://github.com/PaddlePaddle/Paddle/pull/62422), [#60359](https://github.com/PaddlePaddle/Paddle/pull/60359), [#62989](https://github.com/PaddlePaddle/Paddle/pull/62989), [#61297](https://github.com/PaddlePaddle/Paddle/pull/61297), [#61399](https://github.com/PaddlePaddle/Paddle/pull/61399), [#61871](https://github.com/PaddlePaddle/Paddle/pull/61871), [#61496](https://github.com/PaddlePaddle/Paddle/pull/61496), [#62413](https://github.com/PaddlePaddle/Paddle/pull/62413) +- Optimize the execution logic of the PaddlePaddle executor, improve the Pass system, and enhance training and inference performance to better support running distributed parallel logic. [#60182](https://github.com/PaddlePaddle/Paddle/pull/60182), [#60516](https://github.com/PaddlePaddle/Paddle/pull/60516), [#63573](https://github.com/PaddlePaddle/Paddle/pull/63573), [#60181](https://github.com/PaddlePaddle/Paddle/pull/60181), [#59792](https://github.com/PaddlePaddle/Paddle/pull/59792), [#62025](https://github.com/PaddlePaddle/Paddle/pull/62025), [#61160](https://github.com/PaddlePaddle/Paddle/pull/61160), [#61188](https://github.com/PaddlePaddle/Paddle/pull/61188), [#61277](https://github.com/PaddlePaddle/Paddle/pull/61277), [#61669](https://github.com/PaddlePaddle/Paddle/pull/61669), [#60823](https://github.com/PaddlePaddle/Paddle/pull/60823), [#61310](https://github.com/PaddlePaddle/Paddle/pull/61310), [#60892](https://github.com/PaddlePaddle/Paddle/pull/60892), [#60578](https://github.com/PaddlePaddle/Paddle/pull/60578), [#61657](https://github.com/PaddlePaddle/Paddle/pull/61657), [#62638](https://github.com/PaddlePaddle/Paddle/pull/62638), [#63960](https://github.com/PaddlePaddle/Paddle/pull/63960), [#64234](https://github.com/PaddlePaddle/Paddle/pull/64234) -- Adopted Python Eval Frame and VM simulation execution technology to innovatively implement an adaptive Graph Break mechanism. This mechanism is especially designed for control flow scenarios. 
By introducing the CallLayer mechanism, it makes full use of the advantage of PaddlePaddle dynamic-static unification motion. Support hybrid mode of Abstract Syntax Tree (AST) and bytecode simulation. Efficiently captures control flow operators, thus dramatically improving ability of computational graph to be static. At cache optimization level, fuse advanced optimization technologies such as common sub-expression elimination, to significantly improve execution efficiency of Guard. These optimizations not only reduce redundant computations, but also improve overall system operation speed. To enhance robustness of the system, a simple and efficient data intermediate layer structure is designed. Structure supports correctness recovery of SideEffects, ensuring stability and reliability of system in complex environments. In addition, it is widely compatible with mainstream interpreter versions from Python 3.8 to 3.11, providing users with a wide range of applicability. [#57824](https://github.com/PaddlePaddle/Paddle/pull/57824),[#55887](https://github.com/PaddlePaddle/Paddle/pull/55887),[#58155](https://github.com/PaddlePaddle/Paddle/pull/58155),[#56107](https://github.com/PaddlePaddle/Paddle/pull/56107),[#57490](https://github.com/PaddlePaddle/Paddle/pull/57490),[#58829](https://github.com/PaddlePaddle/Paddle/pull/58829),[#57240](https://github.com/PaddlePaddle/Paddle/pull/57240),[#57588](https://github.com/PaddlePaddle/Paddle/pull/57588),[#58117](https://github.com/PaddlePaddle/Paddle/pull/58117),[#59823](https://github.com/PaddlePaddle/Paddle/pull/59823),[#56077](https://github.com/PaddlePaddle/Paddle/pull/56077),[#58956](https://github.com/PaddlePaddle/Paddle/pull/58956),[#57653](https://github.com/PaddlePaddle/Paddle/pull/57653),[#59855](https://github.com/PaddlePaddle/Paddle/pull/59855),[#59017](https://github.com/PaddlePaddle/Paddle/pull/59017),[#58424](https://github.com/PaddlePaddle/Paddle/pull/58424),[#58187](https://github.com/PaddlePaddle/Paddle/pull/58187),
[#57793](https://github.com/PaddlePaddle/Paddle/pull/57793),[#59698](https://github.com/PaddlePaddle/Paddle/pull/59698),[#59747](https://github.com/PaddlePaddle/Paddle/pull/59747),[#59710](https://github.com/PaddlePaddle/Paddle/pull/59710),[#59297](https://github.com/PaddlePaddle/Paddle/pull/59297),[#58423](https://github.com/PaddlePaddle/Paddle/pull/58423),[#56262](https://github.com/PaddlePaddle/Paddle/pull/56262),[#58103](https://github.com/PaddlePaddle/Paddle/pull/58103),[#58538](https://github.com/PaddlePaddle/Paddle/pull/58538),[#58771](https://github.com/PaddlePaddle/Paddle/pull/58771),[#59191](https://github.com/PaddlePaddle/Paddle/pull/59191),[#57754](https://github.com/PaddlePaddle/Paddle/pull/57754),[#59439](https://github.com/PaddlePaddle/Paddle/pull/59439),[#59816](https://github.com/PaddlePaddle/Paddle/pull/59816),[#59035](https://github.com/PaddlePaddle/Paddle/pull/59035) -- Added dynamic to static syntax transcription parsing for PyLayer functions, making PyLayer's conversion between dynamic and static graphs smoother. Users can now seamlessly carry out dynamic to static training on PyLayer, to easily export inference models. [#56108](https://github.com/PaddlePaddle/Paddle/pull/56108),[#56531](https://github.com/PaddlePaddle/Paddle/pull/56531),[#57066](https://github.com/PaddlePaddle/Paddle/pull/57066),[#57633](https://github.com/PaddlePaddle/Paddle/pull/57633) +### PIR New Features -#### Bug Fix +- Implement backward logic based on PIR, generating the backward computation graph directly while also supporting higher-order differentiation. 
[#60174](https://github.com/PaddlePaddle/Paddle/pull/60174), [#60328](https://github.com/PaddlePaddle/Paddle/pull/60328), [#60818](https://github.com/PaddlePaddle/Paddle/pull/60818), [#61352](https://github.com/PaddlePaddle/Paddle/pull/61352), [#61661](https://github.com/PaddlePaddle/Paddle/pull/61661), [#61927](https://github.com/PaddlePaddle/Paddle/pull/61927), [#62772](https://github.com/PaddlePaddle/Paddle/pull/62772), [#60360](https://github.com/PaddlePaddle/Paddle/pull/60360), [#60866](https://github.com/PaddlePaddle/Paddle/pull/60866), [#60970](https://github.com/PaddlePaddle/Paddle/pull/60970), [#60810](https://github.com/PaddlePaddle/Paddle/pull/60810), [#64696](https://github.com/PaddlePaddle/Paddle/pull/64696), [#59844](https://github.com/PaddlePaddle/Paddle/pull/59844), [#59999](https://github.com/PaddlePaddle/Paddle/pull/59999), [#60262](https://github.com/PaddlePaddle/Paddle/pull/60262), [#60338](https://github.com/PaddlePaddle/Paddle/pull/60338), [#59935](https://github.com/PaddlePaddle/Paddle/pull/59935), [#59982](https://github.com/PaddlePaddle/Paddle/pull/59982), [#60221](https://github.com/PaddlePaddle/Paddle/pull/60221), [#62621](https://github.com/PaddlePaddle/Paddle/pull/62621), [#60044](https://github.com/PaddlePaddle/Paddle/pull/60044), [#59790](https://github.com/PaddlePaddle/Paddle/pull/59790), [#60529](https://github.com/PaddlePaddle/Paddle/pull/60529), [#61378](https://github.com/PaddlePaddle/Paddle/pull/61378), [#61584](https://github.com/PaddlePaddle/Paddle/pull/61584) +- Implement control flow logic based on PIR to improve the expressive ability of PIR and better support multi-scenario services such as training and inference. 
[#61396](https://github.com/PaddlePaddle/Paddle/pull/61396), [#64045](https://github.com/PaddlePaddle/Paddle/pull/64045), [#60953](https://github.com/PaddlePaddle/Paddle/pull/60953), [#61091](https://github.com/PaddlePaddle/Paddle/pull/61091), [#61304](https://github.com/PaddlePaddle/Paddle/pull/61304), [#62093](https://github.com/PaddlePaddle/Paddle/pull/62093), [#64710](https://github.com/PaddlePaddle/Paddle/pull/64710), [#60668](https://github.com/PaddlePaddle/Paddle/pull/60668), [#60433](https://github.com/PaddlePaddle/Paddle/pull/60433), [#60963](https://github.com/PaddlePaddle/Paddle/pull/60963), [#61192](https://github.com/PaddlePaddle/Paddle/pull/61192), [#60895](https://github.com/PaddlePaddle/Paddle/pull/60895), [#60017](https://github.com/PaddlePaddle/Paddle/pull/60017), [#60369](https://github.com/PaddlePaddle/Paddle/pull/60369), [#60330](https://github.com/PaddlePaddle/Paddle/pull/60330), [#60364](https://github.com/PaddlePaddle/Paddle/pull/60364), [#61416](https://github.com/PaddlePaddle/Paddle/pull/61416), [#60460](https://github.com/PaddlePaddle/Paddle/pull/60460), [#60703](https://github.com/PaddlePaddle/Paddle/pull/60703), [#61027](https://github.com/PaddlePaddle/Paddle/pull/61027) +- Implement save/load logic based on PIR, connecting PIR with upstream and downstream training and inference workflows. 
[#63438](https://github.com/PaddlePaddle/Paddle/pull/63438), [#63574](https://github.com/PaddlePaddle/Paddle/pull/63574), [#64281](https://github.com/PaddlePaddle/Paddle/pull/64281), [#64327](https://github.com/PaddlePaddle/Paddle/pull/64327), [#63622](https://github.com/PaddlePaddle/Paddle/pull/63622), [#64507](https://github.com/PaddlePaddle/Paddle/pull/64507), [#63389](https://github.com/PaddlePaddle/Paddle/pull/63389), [#63539](https://github.com/PaddlePaddle/Paddle/pull/63539), [#63749](https://github.com/PaddlePaddle/Paddle/pull/63749), [#63957](https://github.com/PaddlePaddle/Paddle/pull/63957), [#64044](https://github.com/PaddlePaddle/Paddle/pull/64044), [#64121](https://github.com/PaddlePaddle/Paddle/pull/64121), [#64239](https://github.com/PaddlePaddle/Paddle/pull/64239), [#63818](https://github.com/PaddlePaddle/Paddle/pull/63818), [#63910](https://github.com/PaddlePaddle/Paddle/pull/63910),[#63380](https://github.com/PaddlePaddle/Paddle/pull/63380),[#63275](https://github.com/PaddlePaddle/Paddle/pull/63275),[#63663](https://github.com/PaddlePaddle/Paddle/pull/63663),[#64692](https://github.com/PaddlePaddle/Paddle/pull/64692),[#63958](https://github.com/PaddlePaddle/Paddle/pull/63958) +- Completed the development and validation of OneDNN related basic functions to prepare for the full-scale switch of OneDNN. 
[#60680](https://github.com/PaddlePaddle/Paddle/pull/60680), [#60665](https://github.com/PaddlePaddle/Paddle/pull/60665), [#63162](https://github.com/PaddlePaddle/Paddle/pull/63162), [#59917](https://github.com/PaddlePaddle/Paddle/pull/59917), [#62901](https://github.com/PaddlePaddle/Paddle/pull/62901), [#59918](https://github.com/PaddlePaddle/Paddle/pull/59918), [#60257](https://github.com/PaddlePaddle/Paddle/pull/60257), [#60502](https://github.com/PaddlePaddle/Paddle/pull/60502), [#61062](https://github.com/PaddlePaddle/Paddle/pull/61062), [#61170](https://github.com/PaddlePaddle/Paddle/pull/61170), [#61474](https://github.com/PaddlePaddle/Paddle/pull/61474), [#60874](https://github.com/PaddlePaddle/Paddle/pull/60874), [#61495](https://github.com/PaddlePaddle/Paddle/pull/61495), [#61664](https://github.com/PaddlePaddle/Paddle/pull/61664), [#61649](https://github.com/PaddlePaddle/Paddle/pull/61649), [#61592](https://github.com/PaddlePaddle/Paddle/pull/61592), [#61667](https://github.com/PaddlePaddle/Paddle/pull/61667), [#61137](https://github.com/PaddlePaddle/Paddle/pull/61137), [#60952](https://github.com/PaddlePaddle/Paddle/pull/60952), [#61651](https://github.com/PaddlePaddle/Paddle/pull/61651), [#62126](https://github.com/PaddlePaddle/Paddle/pull/62126), [#62187](https://github.com/PaddlePaddle/Paddle/pull/62187), [#61307](https://github.com/PaddlePaddle/Paddle/pull/61307), [#62734](https://github.com/PaddlePaddle/Paddle/pull/62734), [#60974](https://github.com/PaddlePaddle/Paddle/pull/60974), [#61451](https://github.com/PaddlePaddle/Paddle/pull/61451), [#61011](https://github.com/PaddlePaddle/Paddle/pull/61011), [#61218](https://github.com/PaddlePaddle/Paddle/pull/61218), [#61623](https://github.com/PaddlePaddle/Paddle/pull/61623), [#61893](https://github.com/PaddlePaddle/Paddle/pull/61893), [#61876](https://github.com/PaddlePaddle/Paddle/pull/61876), [#61892](https://github.com/PaddlePaddle/Paddle/pull/61892), 
[#62085](https://github.com/PaddlePaddle/Paddle/pull/62085), [#62220](https://github.com/PaddlePaddle/Paddle/pull/62220), [#62244](https://github.com/PaddlePaddle/Paddle/pull/62244), [#62265](https://github.com/PaddlePaddle/Paddle/pull/62265), [#60754](https://github.com/PaddlePaddle/Paddle/pull/60754), [#60896](https://github.com/PaddlePaddle/Paddle/pull/60896), [#61868](https://github.com/PaddlePaddle/Paddle/pull/61868), [#61659](https://github.com/PaddlePaddle/Paddle/pull/61659), [#62241](https://github.com/PaddlePaddle/Paddle/pull/62241), [#62471](https://github.com/PaddlePaddle/Paddle/pull/62471), [#61165](https://github.com/PaddlePaddle/Paddle/pull/61165),[#64441](https://github.com/PaddlePaddle/Paddle/pull/64441),[#63141](https://github.com/PaddlePaddle/Paddle/pull/63141),[#63145](https://github.com/PaddlePaddle/Paddle/pull/63145),[#63592](https://github.com/PaddlePaddle/Paddle/pull/63592),[#63617](https://github.com/PaddlePaddle/Paddle/pull/63617),[#63518](https://github.com/PaddlePaddle/Paddle/pull/63518),[#63726](https://github.com/PaddlePaddle/Paddle/pull/63726),[#63853](https://github.com/PaddlePaddle/Paddle/pull/63853),[#63812](https://github.com/PaddlePaddle/Paddle/pull/63812),[#63811](https://github.com/PaddlePaddle/Paddle/pull/63811),[#64524](https://github.com/PaddlePaddle/Paddle/pull/64524),[#62993](https://github.com/PaddlePaddle/Paddle/pull/62993),[#63516](https://github.com/PaddlePaddle/Paddle/pull/63516),[#62998](https://github.com/PaddlePaddle/Paddle/pull/62998),[#63151](https://github.com/PaddlePaddle/Paddle/pull/63151),[#64661](https://github.com/PaddlePaddle/Paddle/pull/64661),[#64433](https://github.com/PaddlePaddle/Paddle/pull/64433),[#64448](https://github.com/PaddlePaddle/Paddle/pull/64448),[#63201](https://github.com/PaddlePaddle/Paddle/pull/63201),[#63230](https://github.com/PaddlePaddle/Paddle/pull/63230),[#63233](https://github.com/PaddlePaddle/Paddle/pull/63233),[#63281](https://github.com/PaddlePaddle/Paddle/pull/63281),
[#64671](https://github.com/PaddlePaddle/Paddle/pull/64671),[#63274](https://github.com/PaddlePaddle/Paddle/pull/63274) +- Implement Sparse related logic based on PIR, including basic Type and operator expression, and complete the verification of Sparse key functions. [#62868](https://github.com/PaddlePaddle/Paddle/pull/62868), [#63015](https://github.com/PaddlePaddle/Paddle/pull/63015), [#62894](https://github.com/PaddlePaddle/Paddle/pull/62894) -- Fixed the issue that video memory is abnormal in some scenarios of dynamic to static in is_test=True mode. [#58350](https://github.com/PaddlePaddle/Paddle/pull/58350) -- Fixed the issue that function decorated by @to_static is exported to jit.save model in scenarios like foo(x,x,y). [#55963](https://github.com/PaddlePaddle/Paddle/pull/55963) -- Fixed the issue that dynamic and static logic of some API behaviors is not uniform. This improves success rate and user experience of dynamic to static graph conversion. [#56092](https://github.com/PaddlePaddle/Paddle/pull/56092) +### Dynamic-to-static Function Optimization -#### Fixed vulnerability +Optimize the dynamic-to-static basic capability, adapt to the dynamic dimension in SOT training scenarios, and support Python 3.12. -- Fixed a potential security vulnerability in use of eval() in dynamic to static syntax transcription module. [#60100](https://github.com/PaddlePaddle/Paddle/pull/60100) +- Complete the PIR adaptation in dynamic-to-static scenarios. 
[#60988](https://github.com/PaddlePaddle/Paddle/pull/60988), [#61936](https://github.com/PaddlePaddle/Paddle/pull/61936), [#59929](https://github.com/PaddlePaddle/Paddle/pull/59929), [#61790](https://github.com/PaddlePaddle/Paddle/pull/61790), [#64323](https://github.com/PaddlePaddle/Paddle/pull/64323), [#62030](https://github.com/PaddlePaddle/Paddle/pull/62030), [#61143](https://github.com/PaddlePaddle/Paddle/pull/61143), [#62680](https://github.com/PaddlePaddle/Paddle/pull/62680), [#63309](https://github.com/PaddlePaddle/Paddle/pull/63309), [#63311](https://github.com/PaddlePaddle/Paddle/pull/63311), [#62199](https://github.com/PaddlePaddle/Paddle/pull/62199) +- SOT adapts to Python 3.12 bytecode, and the dynamic-to-static SOT function can be used in Python 3.12. [#61414](https://github.com/PaddlePaddle/Paddle/pull/61414), [#59562](https://github.com/PaddlePaddle/Paddle/pull/59562), [#61031](https://github.com/PaddlePaddle/Paddle/pull/61031), [#61272](https://github.com/PaddlePaddle/Paddle/pull/61272), [#61412](https://github.com/PaddlePaddle/Paddle/pull/61412), [#61305](https://github.com/PaddlePaddle/Paddle/pull/61305), [#61964](https://github.com/PaddlePaddle/Paddle/pull/61964), [#62008](https://github.com/PaddlePaddle/Paddle/pull/62008), [#62028](https://github.com/PaddlePaddle/Paddle/pull/62028), [#61995](https://github.com/PaddlePaddle/Paddle/pull/61995), [#62073](https://github.com/PaddlePaddle/Paddle/pull/62073), [#62120](https://github.com/PaddlePaddle/Paddle/pull/62120), [#62218](https://github.com/PaddlePaddle/Paddle/pull/62218), [#62155](https://github.com/PaddlePaddle/Paddle/pull/62155) +- SOT now adapts to dynamic dimensions in training scenarios, which avoids repeated graph construction when dimensions change and improves runtime efficiency. 
[#64278](https://github.com/PaddlePaddle/Paddle/pull/64278), [#64435](https://github.com/PaddlePaddle/Paddle/pull/64435), [#64499](https://github.com/PaddlePaddle/Paddle/pull/64499), [#64500](https://github.com/PaddlePaddle/Paddle/pull/64500), [#62080](https://github.com/PaddlePaddle/Paddle/pull/62080) -### Enhanced distributed dynamic graph capability +### Operator Mechanisms -In order to meet the needs of large models, this version focuses on improving the distributed computing capability of the dynamic graph of the PaddlePaddle. Various improvements have been made in communication library, graph analysis, distributed policies and task enable/disable, to provide comprehensive support for large model training. In terms of performance, we further improved training performance by reducing streaming parallel GPU memory occupation, adopting TensorFusion technology, implementing communication computation overlap, and reducing non-essential data synchronization copies. Meanwhile, flexibility of hybrid-parallel debugging is improved through environment variable control Optimizer. In addition, stability of system is further improved by fixing related Bugs. +To address incomplete kernel implementations and inefficient computation logic, we improved and optimized some operator implementations and internal framework mechanisms, fixed some known problems, and added support for some new features. -#### New features +- For XPU kernels, we optimized data type support for `numel`, `concat`, and `slice`, as well as mixed-precision training support for the `AdamW` optimizer. 
[#63715](https://github.com/PaddlePaddle/Paddle/pull/63715), [#61617](https://github.com/PaddlePaddle/Paddle/pull/61617), [#61694](https://github.com/PaddlePaddle/Paddle/pull/61694), [#64542](https://github.com/PaddlePaddle/Paddle/pull/64542), [#63644](https://github.com/PaddlePaddle/Paddle/pull/63644), [#61340](https://github.com/PaddlePaddle/Paddle/pull/61340), [#63108](https://github.com/PaddlePaddle/Paddle/pull/63108) +- Improve the function and performance of some operators. [#59413](https://github.com/PaddlePaddle/Paddle/pull/59413), [#60295](https://github.com/PaddlePaddle/Paddle/pull/60295), [#64304](https://github.com/PaddlePaddle/Paddle/pull/64304), [#60979](https://github.com/PaddlePaddle/Paddle/pull/60979), [#63556](https://github.com/PaddlePaddle/Paddle/pull/63556), [#63061](https://github.com/PaddlePaddle/Paddle/pull/63061), [#62533](https://github.com/PaddlePaddle/Paddle/pull/62533) +- Improve the mechanism of composite operators, and optimize composite logic for some operators. 
[#59448](https://github.com/PaddlePaddle/Paddle/pull/59448), [#60505](https://github.com/PaddlePaddle/Paddle/pull/60505), [#59891](https://github.com/PaddlePaddle/Paddle/pull/59891), [#63161](https://github.com/PaddlePaddle/Paddle/pull/63161), [#63245](https://github.com/PaddlePaddle/Paddle/pull/63245), [#63782](https://github.com/PaddlePaddle/Paddle/pull/63782), [#64346](https://github.com/PaddlePaddle/Paddle/pull/64346), [#63156](https://github.com/PaddlePaddle/Paddle/pull/63156), [#63171](https://github.com/PaddlePaddle/Paddle/pull/63171), [#61315](https://github.com/PaddlePaddle/Paddle/pull/61315), [#61701](https://github.com/PaddlePaddle/Paddle/pull/61701), [#61874](https://github.com/PaddlePaddle/Paddle/pull/61874), [#61873](https://github.com/PaddlePaddle/Paddle/pull/61873), [#62059](https://github.com/PaddlePaddle/Paddle/pull/62059), [#61912](https://github.com/PaddlePaddle/Paddle/pull/61912), [#62112](https://github.com/PaddlePaddle/Paddle/pull/62112), [#63011](https://github.com/PaddlePaddle/Paddle/pull/63011), [#63009](https://github.com/PaddlePaddle/Paddle/pull/63009), [#64714](https://github.com/PaddlePaddle/Paddle/pull/64714) -- Added TraceHang function in communication library, to quickly locate the faulty node when cluster training has Hang problem. [#59217](https://github.com/PaddlePaddle/Paddle/pull/59217) -- In order to improve training efficiency and reduce memory, dynamic graph supports stride mechanism. 
[#55156](https://github.com/PaddlePaddle/Paddle/pull/55156),[#54762](https://github.com/PaddlePaddle/Paddle/pull/54762),[#55850](https://github.com/PaddlePaddle/Paddle/pull/55850),[#59190](https://github.com/PaddlePaddle/Paddle/pull/59190),[#57005](https://github.com/PaddlePaddle/Paddle/pull/57005),[#57005](https://github.com/PaddlePaddle/Paddle/pull/57005),[#57331](https://github.com/PaddlePaddle/Paddle/pull/57331),[#58033](https://github.com/PaddlePaddle/Paddle/pull/58033),[#58033](https://github.com/PaddlePaddle/Paddle/pull/58033),[#58303](https://github.com/PaddlePaddle/Paddle/pull/58303),[#57835](https://github.com/PaddlePaddle/Paddle/pull/57835),[#57189](https://github.com/PaddlePaddle/Paddle/pull/57189) -- Enhanced paddleviz function to facilitate analysis of computational graphs. [#56837](https://github.com/PaddlePaddle/Paddle/pull/56837),[#57626](https://github.com/PaddlePaddle/Paddle/pull/57626) -- In distributed Sharding strategies (Stage1,2,3), added main_grad function to support higher precision gradient accumulation, and reduce precision loss caused by low precision accumulation. [#57972](https://github.com/PaddlePaddle/Paddle/pull/57972),[#57934](https://github.com/PaddlePaddle/Paddle/pull/57934),[#57473](https://github.com/PaddlePaddle/Paddle/pull/57473),[#57537](https://github.com/PaddlePaddle/Paddle/pull/57537),[#59611](https://github.com/PaddlePaddle/Paddle/pull/59611),[#57960](https://github.com/PaddlePaddle/Paddle/pull/57960) -- In Sharding Stage1 strategy, added a switch variable to control whether to perform fusion calculation on Optimizer. [#58790](https://github.com/PaddlePaddle/Paddle/pull/58790) -- In Recompute function, added support for Tuple input parameters, enhancing calling ability of Recompute interface. [#56793](https://github.com/PaddlePaddle/Paddle/pull/56793) -- Enhanced Launch function, allowing distributed training without specifying endpoints in dynamic graphs. 
[#54636](https://github.com/PaddlePaddle/Paddle/pull/54636) +### Bug Fixing -#### Function optimization +- Fix the bugs related to PIR, executor, and dynamic-to-static. [#64442](https://github.com/PaddlePaddle/Paddle/pull/64442), [#60443](https://github.com/PaddlePaddle/Paddle/pull/60443), [#60122](https://github.com/PaddlePaddle/Paddle/pull/60122), [#60625](https://github.com/PaddlePaddle/Paddle/pull/60625), [#60607](https://github.com/PaddlePaddle/Paddle/pull/60607), [#60705](https://github.com/PaddlePaddle/Paddle/pull/60705), [#61110](https://github.com/PaddlePaddle/Paddle/pull/61110), [#61278](https://github.com/PaddlePaddle/Paddle/pull/61278), [#61448](https://github.com/PaddlePaddle/Paddle/pull/61448), [#61491](https://github.com/PaddlePaddle/Paddle/pull/61491), [#61692](https://github.com/PaddlePaddle/Paddle/pull/61692), [#62100](https://github.com/PaddlePaddle/Paddle/pull/62100), [#62239](https://github.com/PaddlePaddle/Paddle/pull/62239), [#62365](https://github.com/PaddlePaddle/Paddle/pull/62365), [#62758](https://github.com/PaddlePaddle/Paddle/pull/62758), [#63395](https://github.com/PaddlePaddle/Paddle/pull/63395), [#64272](https://github.com/PaddlePaddle/Paddle/pull/64272), [#62165](https://github.com/PaddlePaddle/Paddle/pull/62165), [#64151](https://github.com/PaddlePaddle/Paddle/pull/64151), [#64204](https://github.com/PaddlePaddle/Paddle/pull/64204), [#64815](https://github.com/PaddlePaddle/Paddle/pull/64815), [#63757](https://github.com/PaddlePaddle/Paddle/pull/63757), [#61972](https://github.com/PaddlePaddle/Paddle/pull/61972), [#64806](https://github.com/PaddlePaddle/Paddle/pull/64806), [#60010](https://github.com/PaddlePaddle/Paddle/pull/60010), [#60461](https://github.com/PaddlePaddle/Paddle/pull/60461), [#60310](https://github.com/PaddlePaddle/Paddle/pull/60310), [#62006](https://github.com/PaddlePaddle/Paddle/pull/62006), [#61591](https://github.com/PaddlePaddle/Paddle/pull/61591), [#60327](https://github.com/PaddlePaddle/Paddle/pull/60327), 
[#60720](https://github.com/PaddlePaddle/Paddle/pull/60720), [#64656](https://github.com/PaddlePaddle/Paddle/pull/64656), [#60236](https://github.com/PaddlePaddle/Paddle/pull/60236), [#60684](https://github.com/PaddlePaddle/Paddle/pull/60684), [#60790](https://github.com/PaddlePaddle/Paddle/pull/60790), [#60944](https://github.com/PaddlePaddle/Paddle/pull/60944), [#62056](https://github.com/PaddlePaddle/Paddle/pull/62056), [#62891](https://github.com/PaddlePaddle/Paddle/pull/62891), [#64676](https://github.com/PaddlePaddle/Paddle/pull/64676), [#60271](https://github.com/PaddlePaddle/Paddle/pull/60271), [#60634](https://github.com/PaddlePaddle/Paddle/pull/60634), [#60663](https://github.com/PaddlePaddle/Paddle/pull/60663), [#60827](https://github.com/PaddlePaddle/Paddle/pull/60827), [#60845](https://github.com/PaddlePaddle/Paddle/pull/60845), [#60905](https://github.com/PaddlePaddle/Paddle/pull/60905), [#60945](https://github.com/PaddlePaddle/Paddle/pull/60945), [#60949](https://github.com/PaddlePaddle/Paddle/pull/60949), [#61107](https://github.com/PaddlePaddle/Paddle/pull/61107), [#61111](https://github.com/PaddlePaddle/Paddle/pull/61111), [#61117](https://github.com/PaddlePaddle/Paddle/pull/61117), [#61158](https://github.com/PaddlePaddle/Paddle/pull/61158), [#61177](https://github.com/PaddlePaddle/Paddle/pull/61177), [#61355](https://github.com/PaddlePaddle/Paddle/pull/61355), [#61593](https://github.com/PaddlePaddle/Paddle/pull/61593), [#61666](https://github.com/PaddlePaddle/Paddle/pull/61666), [#61934](https://github.com/PaddlePaddle/Paddle/pull/61934), [#62216](https://github.com/PaddlePaddle/Paddle/pull/62216), [#62491](https://github.com/PaddlePaddle/Paddle/pull/62491), [#62515](https://github.com/PaddlePaddle/Paddle/pull/62515), [#62594](https://github.com/PaddlePaddle/Paddle/pull/62594), [#62605](https://github.com/PaddlePaddle/Paddle/pull/62605), [#62895](https://github.com/PaddlePaddle/Paddle/pull/62895), 
[#62913](https://github.com/PaddlePaddle/Paddle/pull/62913), [#64413](https://github.com/PaddlePaddle/Paddle/pull/64413), [#59947](https://github.com/PaddlePaddle/Paddle/pull/59947), [#60264](https://github.com/PaddlePaddle/Paddle/pull/60264), [#60721](https://github.com/PaddlePaddle/Paddle/pull/60721), [#63113](https://github.com/PaddlePaddle/Paddle/pull/63113), [#63629](https://github.com/PaddlePaddle/Paddle/pull/63629), [#64300](https://github.com/PaddlePaddle/Paddle/pull/64300), [#64450](https://github.com/PaddlePaddle/Paddle/pull/64450), [#64532](https://github.com/PaddlePaddle/Paddle/pull/64532), [#64561](https://github.com/PaddlePaddle/Paddle/pull/64561), [#64625](https://github.com/PaddlePaddle/Paddle/pull/64625), [#64731](https://github.com/PaddlePaddle/Paddle/pull/64731), [#60059](https://github.com/PaddlePaddle/Paddle/pull/60059), [#60487](https://github.com/PaddlePaddle/Paddle/pull/60487), [#60423](https://github.com/PaddlePaddle/Paddle/pull/60423), [#61599](https://github.com/PaddlePaddle/Paddle/pull/61599), [#62032](https://github.com/PaddlePaddle/Paddle/pull/62032), [#62686](https://github.com/PaddlePaddle/Paddle/pull/62686), [#64055](https://github.com/PaddlePaddle/Paddle/pull/64055), [#60751](https://github.com/PaddlePaddle/Paddle/pull/60751), [#61646](https://github.com/PaddlePaddle/Paddle/pull/61646), [#60454](https://github.com/PaddlePaddle/Paddle/pull/60454), [#62530](https://github.com/PaddlePaddle/Paddle/pull/62530), [#62821](https://github.com/PaddlePaddle/Paddle/pull/62821), [#64454](https://github.com/PaddlePaddle/Paddle/pull/64454), [#64754](https://github.com/PaddlePaddle/Paddle/pull/64754), [#59860](https://github.com/PaddlePaddle/Paddle/pull/59860), [#60280](https://github.com/PaddlePaddle/Paddle/pull/60280), [#60357](https://github.com/PaddlePaddle/Paddle/pull/60357), [#60363](https://github.com/PaddlePaddle/Paddle/pull/60363), [#60900](https://github.com/PaddlePaddle/Paddle/pull/60900), 
[#61185](https://github.com/PaddlePaddle/Paddle/pull/61185), [#61505](https://github.com/PaddlePaddle/Paddle/pull/61505), [#61644](https://github.com/PaddlePaddle/Paddle/pull/61644), [#62256](https://github.com/PaddlePaddle/Paddle/pull/62256), [#62396](https://github.com/PaddlePaddle/Paddle/pull/62396), [#63040](https://github.com/PaddlePaddle/Paddle/pull/63040), [#63409](https://github.com/PaddlePaddle/Paddle/pull/63409), [#63764](https://github.com/PaddlePaddle/Paddle/pull/63764), [#59571](https://github.com/PaddlePaddle/Paddle/pull/59571), [#59894](https://github.com/PaddlePaddle/Paddle/pull/59894), [#59569](https://github.com/PaddlePaddle/Paddle/pull/59569), [#59896](https://github.com/PaddlePaddle/Paddle/pull/59896), [#60015](https://github.com/PaddlePaddle/Paddle/pull/60015), [#60081](https://github.com/PaddlePaddle/Paddle/pull/60081), [#60164](https://github.com/PaddlePaddle/Paddle/pull/60164), [#60200](https://github.com/PaddlePaddle/Paddle/pull/60200), [#60211](https://github.com/PaddlePaddle/Paddle/pull/60211), [#60267](https://github.com/PaddlePaddle/Paddle/pull/60267), [#60458](https://github.com/PaddlePaddle/Paddle/pull/60458), [#60395](https://github.com/PaddlePaddle/Paddle/pull/60395), [#60907](https://github.com/PaddlePaddle/Paddle/pull/60907), [#60707](https://github.com/PaddlePaddle/Paddle/pull/60707), [#60993](https://github.com/PaddlePaddle/Paddle/pull/60993), [#61401](https://github.com/PaddlePaddle/Paddle/pull/61401), [#61433](https://github.com/PaddlePaddle/Paddle/pull/61433), [#61450](https://github.com/PaddlePaddle/Paddle/pull/61450), [#61577](https://github.com/PaddlePaddle/Paddle/pull/61577), [#61575](https://github.com/PaddlePaddle/Paddle/pull/61575), [#61703](https://github.com/PaddlePaddle/Paddle/pull/61703), [#61711](https://github.com/PaddlePaddle/Paddle/pull/61711), [#61883](https://github.com/PaddlePaddle/Paddle/pull/61883), [#61822](https://github.com/PaddlePaddle/Paddle/pull/61822), 
[#62012](https://github.com/PaddlePaddle/Paddle/pull/62012), [#61858](https://github.com/PaddlePaddle/Paddle/pull/61858), [#62176](https://github.com/PaddlePaddle/Paddle/pull/62176), [#62257](https://github.com/PaddlePaddle/Paddle/pull/62257), [#62470](https://github.com/PaddlePaddle/Paddle/pull/62470), [#62536](https://github.com/PaddlePaddle/Paddle/pull/62536), [#62606](https://github.com/PaddlePaddle/Paddle/pull/62606), [#62808](https://github.com/PaddlePaddle/Paddle/pull/62808), [#62854](https://github.com/PaddlePaddle/Paddle/pull/62854), [#62879](https://github.com/PaddlePaddle/Paddle/pull/62879), [#62864](https://github.com/PaddlePaddle/Paddle/pull/62864), [#63063](https://github.com/PaddlePaddle/Paddle/pull/63063), [#62958](https://github.com/PaddlePaddle/Paddle/pull/62958), [#63397](https://github.com/PaddlePaddle/Paddle/pull/63397), [#63805](https://github.com/PaddlePaddle/Paddle/pull/63805), [#63694](https://github.com/PaddlePaddle/Paddle/pull/63694), [#64168](https://github.com/PaddlePaddle/Paddle/pull/64168), [#64184](https://github.com/PaddlePaddle/Paddle/pull/64184), [#64174](https://github.com/PaddlePaddle/Paddle/pull/64174), [#64315](https://github.com/PaddlePaddle/Paddle/pull/64315), [#64362](https://github.com/PaddlePaddle/Paddle/pull/64362), [#64400](https://github.com/PaddlePaddle/Paddle/pull/64400), [#64475](https://github.com/PaddlePaddle/Paddle/pull/64475), [#64458](https://github.com/PaddlePaddle/Paddle/pull/64458), [#64548](https://github.com/PaddlePaddle/Paddle/pull/64548), [#59858](https://github.com/PaddlePaddle/Paddle/pull/59858), [#61132](https://github.com/PaddlePaddle/Paddle/pull/61132), [#62010](https://github.com/PaddlePaddle/Paddle/pull/62010), [#62069](https://github.com/PaddlePaddle/Paddle/pull/62069), [#62707](https://github.com/PaddlePaddle/Paddle/pull/62707), [#62921](https://github.com/PaddlePaddle/Paddle/pull/62921), [#63085](https://github.com/PaddlePaddle/Paddle/pull/63085), 
[#63321](https://github.com/PaddlePaddle/Paddle/pull/63321), [#63351](https://github.com/PaddlePaddle/Paddle/pull/63351), [#63549](https://github.com/PaddlePaddle/Paddle/pull/63549), [#64567](https://github.com/PaddlePaddle/Paddle/pull/64567), [#59936](https://github.com/PaddlePaddle/Paddle/pull/59936), [#60269](https://github.com/PaddlePaddle/Paddle/pull/60269), [#60879](https://github.com/PaddlePaddle/Paddle/pull/60879), [#61314](https://github.com/PaddlePaddle/Paddle/pull/61314), [#61391](https://github.com/PaddlePaddle/Paddle/pull/61391), [#61479](https://github.com/PaddlePaddle/Paddle/pull/61479), [#61789](https://github.com/PaddlePaddle/Paddle/pull/61789), [#61832](https://github.com/PaddlePaddle/Paddle/pull/61832), [#61864](https://github.com/PaddlePaddle/Paddle/pull/61864), [#61917](https://github.com/PaddlePaddle/Paddle/pull/61917), [#62052](https://github.com/PaddlePaddle/Paddle/pull/62052), [#62068](https://github.com/PaddlePaddle/Paddle/pull/62068), [#62293](https://github.com/PaddlePaddle/Paddle/pull/62293), [#62479](https://github.com/PaddlePaddle/Paddle/pull/62479), [#62506](https://github.com/PaddlePaddle/Paddle/pull/62506), [#59948](https://github.com/PaddlePaddle/Paddle/pull/59948), [#64118](https://github.com/PaddlePaddle/Paddle/pull/64118), [#64126](https://github.com/PaddlePaddle/Paddle/pull/64126), [#64195](https://github.com/PaddlePaddle/Paddle/pull/64195), [#64307](https://github.com/PaddlePaddle/Paddle/pull/64307), [#64314](https://github.com/PaddlePaddle/Paddle/pull/64314), [#64276](https://github.com/PaddlePaddle/Paddle/pull/64276), [#64312](https://github.com/PaddlePaddle/Paddle/pull/64312), [#64350](https://github.com/PaddlePaddle/Paddle/pull/64350), [#64319](https://github.com/PaddlePaddle/Paddle/pull/64319), [#64463](https://github.com/PaddlePaddle/Paddle/pull/64463), [#64457](https://github.com/PaddlePaddle/Paddle/pull/64457), [#64455](https://github.com/PaddlePaddle/Paddle/pull/64455), 
[#64487](https://github.com/PaddlePaddle/Paddle/pull/64487), [#64645](https://github.com/PaddlePaddle/Paddle/pull/64645), [#63155](https://github.com/PaddlePaddle/Paddle/pull/63155), [#59893](https://github.com/PaddlePaddle/Paddle/pull/59893), [#63332](https://github.com/PaddlePaddle/Paddle/pull/63332), [#64786](https://github.com/PaddlePaddle/Paddle/pull/64786), [#60515](https://github.com/PaddlePaddle/Paddle/pull/60515), [#60627](https://github.com/PaddlePaddle/Paddle/pull/60627), [#60863](https://github.com/PaddlePaddle/Paddle/pull/60863), [#60854](https://github.com/PaddlePaddle/Paddle/pull/60854), [#61447](https://github.com/PaddlePaddle/Paddle/pull/61447), [#61440](https://github.com/PaddlePaddle/Paddle/pull/61440), [#61932](https://github.com/PaddlePaddle/Paddle/pull/61932), [#62131](https://github.com/PaddlePaddle/Paddle/pull/62131), [#62252](https://github.com/PaddlePaddle/Paddle/pull/62252), [#62283](https://github.com/PaddlePaddle/Paddle/pull/62283), [#62358](https://github.com/PaddlePaddle/Paddle/pull/62358), [#62411](https://github.com/PaddlePaddle/Paddle/pull/62411), [#62424](https://github.com/PaddlePaddle/Paddle/pull/62424), [#62810](https://github.com/PaddlePaddle/Paddle/pull/62810), [#62811](https://github.com/PaddlePaddle/Paddle/pull/62811), [#62896](https://github.com/PaddlePaddle/Paddle/pull/62896), [#62947](https://github.com/PaddlePaddle/Paddle/pull/62947), [#63182](https://github.com/PaddlePaddle/Paddle/pull/63182), [#63190](https://github.com/PaddlePaddle/Paddle/pull/63190), [#63294](https://github.com/PaddlePaddle/Paddle/pull/63294), [#63306](https://github.com/PaddlePaddle/Paddle/pull/63306), [#63352](https://github.com/PaddlePaddle/Paddle/pull/63352), [#63404](https://github.com/PaddlePaddle/Paddle/pull/63404), [#63474](https://github.com/PaddlePaddle/Paddle/pull/63474), [#64013](https://github.com/PaddlePaddle/Paddle/pull/64013), 
[#64674](https://github.com/PaddlePaddle/Paddle/pull/64674),[#60055](https://github.com/PaddlePaddle/Paddle/pull/60055),[#62050](https://github.com/PaddlePaddle/Paddle/pull/62050),[#62770](https://github.com/PaddlePaddle/Paddle/pull/62770),[#63234](https://github.com/PaddlePaddle/Paddle/pull/63234),[#63374](https://github.com/PaddlePaddle/Paddle/pull/63374),[#64277](https://github.com/PaddlePaddle/Paddle/pull/64277), [#63420](https://github.com/PaddlePaddle/Paddle/pull/63420), [#60312](https://github.com/PaddlePaddle/Paddle/pull/60312), [#63810](https://github.com/PaddlePaddle/Paddle/pull/63810), [#64631](https://github.com/PaddlePaddle/Paddle/pull/64631), [#63970](https://github.com/PaddlePaddle/Paddle/pull/63970), [#63708](https://github.com/PaddlePaddle/Paddle/pull/63708), [#62062](https://github.com/PaddlePaddle/Paddle/pull/62062), [#60898](https://github.com/PaddlePaddle/Paddle/pull/60898), [#62373](https://github.com/PaddlePaddle/Paddle/pull/62373), [#59878](https://github.com/PaddlePaddle/Paddle/pull/59878) +- Fix some bugs in operator mechanism, operator implementation logic and related unit tests. 
[#63792](https://github.com/PaddlePaddle/Paddle/pull/63792), [#60570](https://github.com/PaddlePaddle/Paddle/pull/60570), [#61572](https://github.com/PaddlePaddle/Paddle/pull/61572), [#59971](https://github.com/PaddlePaddle/Paddle/pull/59971), [#61336](https://github.com/PaddlePaddle/Paddle/pull/61336), [#63276](https://github.com/PaddlePaddle/Paddle/pull/63276), [#63251](https://github.com/PaddlePaddle/Paddle/pull/63251), [#63697](https://github.com/PaddlePaddle/Paddle/pull/63697), [#63706](https://github.com/PaddlePaddle/Paddle/pull/63706), [#64685](https://github.com/PaddlePaddle/Paddle/pull/64685), [#64009](https://github.com/PaddlePaddle/Paddle/pull/64009), [#62461](https://github.com/PaddlePaddle/Paddle/pull/62461), [#61568](https://github.com/PaddlePaddle/Paddle/pull/61568), [#63912](https://github.com/PaddlePaddle/Paddle/pull/63912), [#60475](https://github.com/PaddlePaddle/Paddle/pull/60475), [#60222](https://github.com/PaddlePaddle/Paddle/pull/60222), [#63961](https://github.com/PaddlePaddle/Paddle/pull/63961), [#63593](https://github.com/PaddlePaddle/Paddle/pull/63593) -- Implemented new communication library with dynamic-static unification. Communication operators are fully adapted to PHI operator system, reducing development and maintenance costs to better support dynamic graphs and auto parallel architecture upgrade. 
[#54417](https://github.com/PaddlePaddle/Paddle/pull/54417),[#57768](https://github.com/PaddlePaddle/Paddle/pull/57768),[#57897](https://github.com/PaddlePaddle/Paddle/pull/57897),[#55537](https://github.com/PaddlePaddle/Paddle/pull/55537),[#56604](https://github.com/PaddlePaddle/Paddle/pull/56604),[#57519](https://github.com/PaddlePaddle/Paddle/pull/57519),[#56088](https://github.com/PaddlePaddle/Paddle/pull/56088),[#57153](https://github.com/PaddlePaddle/Paddle/pull/57153),[#57161](https://github.com/PaddlePaddle/Paddle/pull/57161),[#57252](https://github.com/PaddlePaddle/Paddle/pull/57252),[#57251](https://github.com/PaddlePaddle/Paddle/pull/57251),[#57208](https://github.com/PaddlePaddle/Paddle/pull/57208),[#57305](https://github.com/PaddlePaddle/Paddle/pull/57305),[#57424](https://github.com/PaddlePaddle/Paddle/pull/57424),[#57548](https://github.com/PaddlePaddle/Paddle/pull/57548),[#57560](https://github.com/PaddlePaddle/Paddle/pull/57560),[#57564](https://github.com/PaddlePaddle/Paddle/pull/57564),[#57233](https://github.com/PaddlePaddle/Paddle/pull/57233),[#55726](https://github.com/PaddlePaddle/Paddle/pull/55726),[#58073](https://github.com/PaddlePaddle/Paddle/pull/58073) -- TCPStore is changed to a single instance to support dynamic graphs and auto parallel more flexibly. [#55956](https://github.com/PaddlePaddle/Paddle/pull/55956) -- Improved maintainability and flexibility of distributed policies such as MP/PP/SP, including addition of printing warning and error messages, structural cleanup of code files, and optimization of PP restrictions on inputs. 
[#54448](https://github.com/PaddlePaddle/Paddle/pull/54448),[#59762](https://github.com/PaddlePaddle/Paddle/pull/59762),[#55462](https://github.com/PaddlePaddle/Paddle/pull/55462),[#54788](https://github.com/PaddlePaddle/Paddle/pull/54788),[#54664](https://github.com/PaddlePaddle/Paddle/pull/54664),[#56456](https://github.com/PaddlePaddle/Paddle/pull/56456),[#55540](https://github.com/PaddlePaddle/Paddle/pull/55540) -- In PP strategy, added support for P2P communication in computation flow, making communication mode more flexible. [#54747](https://github.com/PaddlePaddle/Paddle/pull/54747) -- Sharding strategy supports the reduce operation on gradients. [#58842](https://github.com/PaddlePaddle/Paddle/pull/58842),[#57967](https://github.com/PaddlePaddle/Paddle/pull/57967),[#55495](https://github.com/PaddlePaddle/Paddle/pull/55495) +### Developer Content -#### Performance optimization +- Developer-related content, including PRs for PIR switching, enabling unit tests, and function verification. [#60621](https://github.com/PaddlePaddle/Paddle/pull/60621), [#59703](https://github.com/PaddlePaddle/Paddle/pull/59703), [#59694](https://github.com/PaddlePaddle/Paddle/pull/59694), [#59717](https://github.com/PaddlePaddle/Paddle/pull/59717), [#59729](https://github.com/PaddlePaddle/Paddle/pull/59729), [#59730](https://github.com/PaddlePaddle/Paddle/pull/59730), [#60216](https://github.com/PaddlePaddle/Paddle/pull/60216), [#60238](https://github.com/PaddlePaddle/Paddle/pull/60238), [#60246](https://github.com/PaddlePaddle/Paddle/pull/60246), [#60343](https://github.com/PaddlePaddle/Paddle/pull/60343), [#60302](https://github.com/PaddlePaddle/Paddle/pull/60302), [#60870](https://github.com/PaddlePaddle/Paddle/pull/60870), [#59956](https://github.com/PaddlePaddle/Paddle/pull/59956), [#60795](https://github.com/PaddlePaddle/Paddle/pull/60795), [#62528](https://github.com/PaddlePaddle/Paddle/pull/62528), [#59932](https://github.com/PaddlePaddle/Paddle/pull/59932), 
[#59636](https://github.com/PaddlePaddle/Paddle/pull/59636), [#59959](https://github.com/PaddlePaddle/Paddle/pull/59959), [#59734](https://github.com/PaddlePaddle/Paddle/pull/59734), [#60287](https://github.com/PaddlePaddle/Paddle/pull/60287), [#60347](https://github.com/PaddlePaddle/Paddle/pull/60347), [#60335](https://github.com/PaddlePaddle/Paddle/pull/60335), [#60332](https://github.com/PaddlePaddle/Paddle/pull/60332), [#59631](https://github.com/PaddlePaddle/Paddle/pull/59631), [#60255](https://github.com/PaddlePaddle/Paddle/pull/60255), [#60329](https://github.com/PaddlePaddle/Paddle/pull/60329), [#60401](https://github.com/PaddlePaddle/Paddle/pull/60401), [#60522](https://github.com/PaddlePaddle/Paddle/pull/60522), [#60792](https://github.com/PaddlePaddle/Paddle/pull/60792), [#59617](https://github.com/PaddlePaddle/Paddle/pull/59617), [#60277](https://github.com/PaddlePaddle/Paddle/pull/60277), [#60584](https://github.com/PaddlePaddle/Paddle/pull/60584), [#60911](https://github.com/PaddlePaddle/Paddle/pull/60911), [#61322](https://github.com/PaddlePaddle/Paddle/pull/61322), [#60838](https://github.com/PaddlePaddle/Paddle/pull/60838), [#60602](https://github.com/PaddlePaddle/Paddle/pull/60602), [#61458](https://github.com/PaddlePaddle/Paddle/pull/61458), [#61607](https://github.com/PaddlePaddle/Paddle/pull/61607), [#61960](https://github.com/PaddlePaddle/Paddle/pull/61960), [#60484](https://github.com/PaddlePaddle/Paddle/pull/60484), [#61662](https://github.com/PaddlePaddle/Paddle/pull/61662), [#62263](https://github.com/PaddlePaddle/Paddle/pull/62263), [#62270](https://github.com/PaddlePaddle/Paddle/pull/62270), [#62469](https://github.com/PaddlePaddle/Paddle/pull/62469), [#62416](https://github.com/PaddlePaddle/Paddle/pull/62416), [#62443](https://github.com/PaddlePaddle/Paddle/pull/62443), [#62412](https://github.com/PaddlePaddle/Paddle/pull/62412), [#62541](https://github.com/PaddlePaddle/Paddle/pull/62541), 
[#62634](https://github.com/PaddlePaddle/Paddle/pull/62634), [#62369](https://github.com/PaddlePaddle/Paddle/pull/62369), [#60805](https://github.com/PaddlePaddle/Paddle/pull/60805), [#62644](https://github.com/PaddlePaddle/Paddle/pull/62644), [#62494](https://github.com/PaddlePaddle/Paddle/pull/62494), [#62767](https://github.com/PaddlePaddle/Paddle/pull/62767), [#62735](https://github.com/PaddlePaddle/Paddle/pull/62735), [#62802](https://github.com/PaddlePaddle/Paddle/pull/62802), [#62801](https://github.com/PaddlePaddle/Paddle/pull/62801), [#62783](https://github.com/PaddlePaddle/Paddle/pull/62783), [#62579](https://github.com/PaddlePaddle/Paddle/pull/62579), [#62833](https://github.com/PaddlePaddle/Paddle/pull/62833), [#62668](https://github.com/PaddlePaddle/Paddle/pull/62668), [#62972](https://github.com/PaddlePaddle/Paddle/pull/62972), [#62505](https://github.com/PaddlePaddle/Paddle/pull/62505), [#63005](https://github.com/PaddlePaddle/Paddle/pull/63005), [#62900](https://github.com/PaddlePaddle/Paddle/pull/62900), [#60577](https://github.com/PaddlePaddle/Paddle/pull/60577), [#60877](https://github.com/PaddlePaddle/Paddle/pull/60877), [#61076](https://github.com/PaddlePaddle/Paddle/pull/61076), [#61038](https://github.com/PaddlePaddle/Paddle/pull/61038), [#61112](https://github.com/PaddlePaddle/Paddle/pull/61112), [#61120](https://github.com/PaddlePaddle/Paddle/pull/61120), [#61582](https://github.com/PaddlePaddle/Paddle/pull/61582), [#61119](https://github.com/PaddlePaddle/Paddle/pull/61119), [#61036](https://github.com/PaddlePaddle/Paddle/pull/61036), [#61289](https://github.com/PaddlePaddle/Paddle/pull/61289), [#60695](https://github.com/PaddlePaddle/Paddle/pull/60695), [#61039](https://github.com/PaddlePaddle/Paddle/pull/61039), [#61963](https://github.com/PaddlePaddle/Paddle/pull/61963), [#62118](https://github.com/PaddlePaddle/Paddle/pull/62118), [#62797](https://github.com/PaddlePaddle/Paddle/pull/62797), 
[#62807](https://github.com/PaddlePaddle/Paddle/pull/62807), [#62887](https://github.com/PaddlePaddle/Paddle/pull/62887), [#62830](https://github.com/PaddlePaddle/Paddle/pull/62830), [#62849](https://github.com/PaddlePaddle/Paddle/pull/62849), [#62750](https://github.com/PaddlePaddle/Paddle/pull/62750), [#62965](https://github.com/PaddlePaddle/Paddle/pull/62965), [#59742](https://github.com/PaddlePaddle/Paddle/pull/59742), [#59867](https://github.com/PaddlePaddle/Paddle/pull/59867), [#60836](https://github.com/PaddlePaddle/Paddle/pull/60836), [#60902](https://github.com/PaddlePaddle/Paddle/pull/60902), [#61228](https://github.com/PaddlePaddle/Paddle/pull/61228), [#60037](https://github.com/PaddlePaddle/Paddle/pull/60037), [#60079](https://github.com/PaddlePaddle/Paddle/pull/60079), [#60173](https://github.com/PaddlePaddle/Paddle/pull/60173), [#60373](https://github.com/PaddlePaddle/Paddle/pull/60373), [#60380](https://github.com/PaddlePaddle/Paddle/pull/60380), [#60381](https://github.com/PaddlePaddle/Paddle/pull/60381), [#60750](https://github.com/PaddlePaddle/Paddle/pull/60750), [#61065](https://github.com/PaddlePaddle/Paddle/pull/61065), [#61122](https://github.com/PaddlePaddle/Paddle/pull/61122), [#61074](https://github.com/PaddlePaddle/Paddle/pull/61074), [#61204](https://github.com/PaddlePaddle/Paddle/pull/61204), [#61191](https://github.com/PaddlePaddle/Paddle/pull/61191), [#61182](https://github.com/PaddlePaddle/Paddle/pull/61182), [#61219](https://github.com/PaddlePaddle/Paddle/pull/61219), [#61296](https://github.com/PaddlePaddle/Paddle/pull/61296), [#61503](https://github.com/PaddlePaddle/Paddle/pull/61503), [#61484](https://github.com/PaddlePaddle/Paddle/pull/61484), [#61513](https://github.com/PaddlePaddle/Paddle/pull/61513), [#61476](https://github.com/PaddlePaddle/Paddle/pull/61476), [#61510](https://github.com/PaddlePaddle/Paddle/pull/61510), [#61511](https://github.com/PaddlePaddle/Paddle/pull/61511), 
[#61526](https://github.com/PaddlePaddle/Paddle/pull/61526), [#61524](https://github.com/PaddlePaddle/Paddle/pull/61524), [#61525](https://github.com/PaddlePaddle/Paddle/pull/61525), [#61466](https://github.com/PaddlePaddle/Paddle/pull/61466), [#61497](https://github.com/PaddlePaddle/Paddle/pull/61497), [#61538](https://github.com/PaddlePaddle/Paddle/pull/61538), [#61533](https://github.com/PaddlePaddle/Paddle/pull/61533), [#61530](https://github.com/PaddlePaddle/Paddle/pull/61530), [#61468](https://github.com/PaddlePaddle/Paddle/pull/61468), [#61527](https://github.com/PaddlePaddle/Paddle/pull/61527), [#61535](https://github.com/PaddlePaddle/Paddle/pull/61535), [#61512](https://github.com/PaddlePaddle/Paddle/pull/61512), [#61531](https://github.com/PaddlePaddle/Paddle/pull/61531), [#61539](https://github.com/PaddlePaddle/Paddle/pull/61539), [#61532](https://github.com/PaddlePaddle/Paddle/pull/61532), [#61521](https://github.com/PaddlePaddle/Paddle/pull/61521), [#61517](https://github.com/PaddlePaddle/Paddle/pull/61517), [#61518](https://github.com/PaddlePaddle/Paddle/pull/61518), [#61550](https://github.com/PaddlePaddle/Paddle/pull/61550), [#61545](https://github.com/PaddlePaddle/Paddle/pull/61545), [#61548](https://github.com/PaddlePaddle/Paddle/pull/61548), [#61519](https://github.com/PaddlePaddle/Paddle/pull/61519), [#61549](https://github.com/PaddlePaddle/Paddle/pull/61549), [#61574](https://github.com/PaddlePaddle/Paddle/pull/61574), [#61585](https://github.com/PaddlePaddle/Paddle/pull/61585), [#61581](https://github.com/PaddlePaddle/Paddle/pull/61581), [#61553](https://github.com/PaddlePaddle/Paddle/pull/61553), [#61504](https://github.com/PaddlePaddle/Paddle/pull/61504), [#61603](https://github.com/PaddlePaddle/Paddle/pull/61603), [#61534](https://github.com/PaddlePaddle/Paddle/pull/61534), [#61567](https://github.com/PaddlePaddle/Paddle/pull/61567), [#61523](https://github.com/PaddlePaddle/Paddle/pull/61523), 
[#61565](https://github.com/PaddlePaddle/Paddle/pull/61565), [#61564](https://github.com/PaddlePaddle/Paddle/pull/61564), [#61707](https://github.com/PaddlePaddle/Paddle/pull/61707), [#61560](https://github.com/PaddlePaddle/Paddle/pull/61560), [#61684](https://github.com/PaddlePaddle/Paddle/pull/61684), [#61706](https://github.com/PaddlePaddle/Paddle/pull/61706), [#61724](https://github.com/PaddlePaddle/Paddle/pull/61724), [#61719](https://github.com/PaddlePaddle/Paddle/pull/61719), [#61729](https://github.com/PaddlePaddle/Paddle/pull/61729), [#61763](https://github.com/PaddlePaddle/Paddle/pull/61763), [#61755](https://github.com/PaddlePaddle/Paddle/pull/61755), [#61737](https://github.com/PaddlePaddle/Paddle/pull/61737), [#61750](https://github.com/PaddlePaddle/Paddle/pull/61750), [#61753](https://github.com/PaddlePaddle/Paddle/pull/61753), [#61756](https://github.com/PaddlePaddle/Paddle/pull/61756), [#61777](https://github.com/PaddlePaddle/Paddle/pull/61777), [#61758](https://github.com/PaddlePaddle/Paddle/pull/61758), [#61731](https://github.com/PaddlePaddle/Paddle/pull/61731), [#61771](https://github.com/PaddlePaddle/Paddle/pull/61771), [#61739](https://github.com/PaddlePaddle/Paddle/pull/61739), [#61559](https://github.com/PaddlePaddle/Paddle/pull/61559), [#61717](https://github.com/PaddlePaddle/Paddle/pull/61717), [#61733](https://github.com/PaddlePaddle/Paddle/pull/61733), [#61563](https://github.com/PaddlePaddle/Paddle/pull/61563), [#61546](https://github.com/PaddlePaddle/Paddle/pull/61546), [#61566](https://github.com/PaddlePaddle/Paddle/pull/61566), [#61562](https://github.com/PaddlePaddle/Paddle/pull/61562), [#61793](https://github.com/PaddlePaddle/Paddle/pull/61793), [#61902](https://github.com/PaddlePaddle/Paddle/pull/61902), [#61905](https://github.com/PaddlePaddle/Paddle/pull/61905), [#61904](https://github.com/PaddlePaddle/Paddle/pull/61904), [#62227](https://github.com/PaddlePaddle/Paddle/pull/62227), 
[#62332](https://github.com/PaddlePaddle/Paddle/pull/62332), [#62653](https://github.com/PaddlePaddle/Paddle/pull/62653), [#62681](https://github.com/PaddlePaddle/Paddle/pull/62681), [#62709](https://github.com/PaddlePaddle/Paddle/pull/62709), [#62794](https://github.com/PaddlePaddle/Paddle/pull/62794), [#62938](https://github.com/PaddlePaddle/Paddle/pull/62938), [#63185](https://github.com/PaddlePaddle/Paddle/pull/63185), [#63754](https://github.com/PaddlePaddle/Paddle/pull/63754), [#63769](https://github.com/PaddlePaddle/Paddle/pull/63769), [#63793](https://github.com/PaddlePaddle/Paddle/pull/63793), [#63830](https://github.com/PaddlePaddle/Paddle/pull/63830), [#63939](https://github.com/PaddlePaddle/Paddle/pull/63939), [#64340](https://github.com/PaddlePaddle/Paddle/pull/64340), [#64657](https://github.com/PaddlePaddle/Paddle/pull/64657), [#62527](https://github.com/PaddlePaddle/Paddle/pull/62527), [#64088](https://github.com/PaddlePaddle/Paddle/pull/64088), [#60203](https://github.com/PaddlePaddle/Paddle/pull/60203), [#60372](https://github.com/PaddlePaddle/Paddle/pull/60372), [#60685](https://github.com/PaddlePaddle/Paddle/pull/60685), [#60815](https://github.com/PaddlePaddle/Paddle/pull/60815), [#60791](https://github.com/PaddlePaddle/Paddle/pull/60791), [#60864](https://github.com/PaddlePaddle/Paddle/pull/60864), [#60851](https://github.com/PaddlePaddle/Paddle/pull/60851), [#60844](https://github.com/PaddlePaddle/Paddle/pull/60844), [#60694](https://github.com/PaddlePaddle/Paddle/pull/60694), [#60855](https://github.com/PaddlePaddle/Paddle/pull/60855), [#60869](https://github.com/PaddlePaddle/Paddle/pull/60869), [#60948](https://github.com/PaddlePaddle/Paddle/pull/60948), [#61042](https://github.com/PaddlePaddle/Paddle/pull/61042), [#61455](https://github.com/PaddlePaddle/Paddle/pull/61455), [#61580](https://github.com/PaddlePaddle/Paddle/pull/61580), [#61589](https://github.com/PaddlePaddle/Paddle/pull/61589), 
[#61609](https://github.com/PaddlePaddle/Paddle/pull/61609), [#61616](https://github.com/PaddlePaddle/Paddle/pull/61616), [#61715](https://github.com/PaddlePaddle/Paddle/pull/61715), [#61716](https://github.com/PaddlePaddle/Paddle/pull/61716), [#61759](https://github.com/PaddlePaddle/Paddle/pull/61759), [#61555](https://github.com/PaddlePaddle/Paddle/pull/61555), [#61492](https://github.com/PaddlePaddle/Paddle/pull/61492), [#61805](https://github.com/PaddlePaddle/Paddle/pull/61805), [#61712](https://github.com/PaddlePaddle/Paddle/pull/61712), [#61615](https://github.com/PaddlePaddle/Paddle/pull/61615), [#61713](https://github.com/PaddlePaddle/Paddle/pull/61713), [#62129](https://github.com/PaddlePaddle/Paddle/pull/62129), [#59294](https://github.com/PaddlePaddle/Paddle/pull/59294), [#59865](https://github.com/PaddlePaddle/Paddle/pull/59865), [#60270](https://github.com/PaddlePaddle/Paddle/pull/60270), [#60547](https://github.com/PaddlePaddle/Paddle/pull/60547), [#60698](https://github.com/PaddlePaddle/Paddle/pull/60698), [#60762](https://github.com/PaddlePaddle/Paddle/pull/60762), [#60753](https://github.com/PaddlePaddle/Paddle/pull/60753), [#60966](https://github.com/PaddlePaddle/Paddle/pull/60966), [#60976](https://github.com/PaddlePaddle/Paddle/pull/60976), [#61100](https://github.com/PaddlePaddle/Paddle/pull/61100), [#61203](https://github.com/PaddlePaddle/Paddle/pull/61203), [#61210](https://github.com/PaddlePaddle/Paddle/pull/61210), [#61424](https://github.com/PaddlePaddle/Paddle/pull/61424), [#61213](https://github.com/PaddlePaddle/Paddle/pull/61213), [#61275](https://github.com/PaddlePaddle/Paddle/pull/61275), [#61276](https://github.com/PaddlePaddle/Paddle/pull/61276), [#61279](https://github.com/PaddlePaddle/Paddle/pull/61279), [#61292](https://github.com/PaddlePaddle/Paddle/pull/61292), [#61295](https://github.com/PaddlePaddle/Paddle/pull/61295), [#61298](https://github.com/PaddlePaddle/Paddle/pull/61298), 
[#61299](https://github.com/PaddlePaddle/Paddle/pull/61299), [#61301](https://github.com/PaddlePaddle/Paddle/pull/61301), [#61302](https://github.com/PaddlePaddle/Paddle/pull/61302), [#61329](https://github.com/PaddlePaddle/Paddle/pull/61329), [#61804](https://github.com/PaddlePaddle/Paddle/pull/61804), [#62745](https://github.com/PaddlePaddle/Paddle/pull/62745), [#62909](https://github.com/PaddlePaddle/Paddle/pull/62909), [#64247](https://github.com/PaddlePaddle/Paddle/pull/64247), [#64308](https://github.com/PaddlePaddle/Paddle/pull/64308), [#60690](https://github.com/PaddlePaddle/Paddle/pull/60690), [#61149](https://github.com/PaddlePaddle/Paddle/pull/61149), [#61145](https://github.com/PaddlePaddle/Paddle/pull/61145), [#61193](https://github.com/PaddlePaddle/Paddle/pull/61193), [#61207](https://github.com/PaddlePaddle/Paddle/pull/61207), [#61229](https://github.com/PaddlePaddle/Paddle/pull/61229), [#61236](https://github.com/PaddlePaddle/Paddle/pull/61236), [#61244](https://github.com/PaddlePaddle/Paddle/pull/61244), [#61242](https://github.com/PaddlePaddle/Paddle/pull/61242), [#61263](https://github.com/PaddlePaddle/Paddle/pull/61263), [#61370](https://github.com/PaddlePaddle/Paddle/pull/61370), [#61410](https://github.com/PaddlePaddle/Paddle/pull/61410), [#61480](https://github.com/PaddlePaddle/Paddle/pull/61480), [#61522](https://github.com/PaddlePaddle/Paddle/pull/61522), [#61540](https://github.com/PaddlePaddle/Paddle/pull/61540), [#61520](https://github.com/PaddlePaddle/Paddle/pull/61520), [#61625](https://github.com/PaddlePaddle/Paddle/pull/61625), [#61700](https://github.com/PaddlePaddle/Paddle/pull/61700), [#61708](https://github.com/PaddlePaddle/Paddle/pull/61708), [#61736](https://github.com/PaddlePaddle/Paddle/pull/61736), [#61889](https://github.com/PaddlePaddle/Paddle/pull/61889), [#61952](https://github.com/PaddlePaddle/Paddle/pull/61952), [#62033](https://github.com/PaddlePaddle/Paddle/pull/62033), 
[#62637](https://github.com/PaddlePaddle/Paddle/pull/62637), [#62777](https://github.com/PaddlePaddle/Paddle/pull/62777), [#62779](https://github.com/PaddlePaddle/Paddle/pull/62779), [#63226](https://github.com/PaddlePaddle/Paddle/pull/63226), [#63287](https://github.com/PaddlePaddle/Paddle/pull/63287), [#63398](https://github.com/PaddlePaddle/Paddle/pull/63398), [#63431](https://github.com/PaddlePaddle/Paddle/pull/63431), [#64000](https://github.com/PaddlePaddle/Paddle/pull/64000), [#64058](https://github.com/PaddlePaddle/Paddle/pull/64058), [#64059](https://github.com/PaddlePaddle/Paddle/pull/64059), [#64063](https://github.com/PaddlePaddle/Paddle/pull/64063), [#64066](https://github.com/PaddlePaddle/Paddle/pull/64066), [#64089](https://github.com/PaddlePaddle/Paddle/pull/64089), [#64170](https://github.com/PaddlePaddle/Paddle/pull/64170), [#64235](https://github.com/PaddlePaddle/Paddle/pull/64235), [#64237](https://github.com/PaddlePaddle/Paddle/pull/64237), [#64243](https://github.com/PaddlePaddle/Paddle/pull/64243), [#64242](https://github.com/PaddlePaddle/Paddle/pull/64242), [#64286](https://github.com/PaddlePaddle/Paddle/pull/64286), [#64322](https://github.com/PaddlePaddle/Paddle/pull/64322), [#64317](https://github.com/PaddlePaddle/Paddle/pull/64317), [#64490](https://github.com/PaddlePaddle/Paddle/pull/64490), [#60138](https://github.com/PaddlePaddle/Paddle/pull/60138), [#62384](https://github.com/PaddlePaddle/Paddle/pull/62384), [#59702](https://github.com/PaddlePaddle/Paddle/pull/59702), [#60341](https://github.com/PaddlePaddle/Paddle/pull/60341), [#60636](https://github.com/PaddlePaddle/Paddle/pull/60636), [#60714](https://github.com/PaddlePaddle/Paddle/pull/60714), [#60716](https://github.com/PaddlePaddle/Paddle/pull/60716), [#60700](https://github.com/PaddlePaddle/Paddle/pull/60700), [#60702](https://github.com/PaddlePaddle/Paddle/pull/60702), [#60704](https://github.com/PaddlePaddle/Paddle/pull/60704), 
[#60715](https://github.com/PaddlePaddle/Paddle/pull/60715), [#60713](https://github.com/PaddlePaddle/Paddle/pull/60713), [#60711](https://github.com/PaddlePaddle/Paddle/pull/60711), [#60724](https://github.com/PaddlePaddle/Paddle/pull/60724), [#60803](https://github.com/PaddlePaddle/Paddle/pull/60803), [#61331](https://github.com/PaddlePaddle/Paddle/pull/61331), [#63286](https://github.com/PaddlePaddle/Paddle/pull/63286), [#60473](https://github.com/PaddlePaddle/Paddle/pull/60473), [#61046](https://github.com/PaddlePaddle/Paddle/pull/61046), [#61859](https://github.com/PaddlePaddle/Paddle/pull/61859), [#60675](https://github.com/PaddlePaddle/Paddle/pull/60675), [#60719](https://github.com/PaddlePaddle/Paddle/pull/60719), [#62863](https://github.com/PaddlePaddle/Paddle/pull/62863), [#63013](https://github.com/PaddlePaddle/Paddle/pull/63013), [#61293](https://github.com/PaddlePaddle/Paddle/pull/61293), [#62781](https://github.com/PaddlePaddle/Paddle/pull/62781), [#62935](https://github.com/PaddlePaddle/Paddle/pull/62935), [#63014](https://github.com/PaddlePaddle/Paddle/pull/63014), [#64203](https://github.com/PaddlePaddle/Paddle/pull/64203), [#63349](https://github.com/PaddlePaddle/Paddle/pull/63349), [#59572](https://github.com/PaddlePaddle/Paddle/pull/59572), [#59911](https://github.com/PaddlePaddle/Paddle/pull/59911), [#59861](https://github.com/PaddlePaddle/Paddle/pull/59861), [#60014](https://github.com/PaddlePaddle/Paddle/pull/60014), [#59913](https://github.com/PaddlePaddle/Paddle/pull/59913), [#58889](https://github.com/PaddlePaddle/Paddle/pull/58889), [#60114](https://github.com/PaddlePaddle/Paddle/pull/60114), [#59928](https://github.com/PaddlePaddle/Paddle/pull/59928), [#60180](https://github.com/PaddlePaddle/Paddle/pull/60180), [#60168](https://github.com/PaddlePaddle/Paddle/pull/60168), [#60166](https://github.com/PaddlePaddle/Paddle/pull/60166), [#60250](https://github.com/PaddlePaddle/Paddle/pull/60250), 
[#60247](https://github.com/PaddlePaddle/Paddle/pull/60247), [#60172](https://github.com/PaddlePaddle/Paddle/pull/60172), [#59661](https://github.com/PaddlePaddle/Paddle/pull/59661), [#58880](https://github.com/PaddlePaddle/Paddle/pull/58880), [#60291](https://github.com/PaddlePaddle/Paddle/pull/60291), [#58881](https://github.com/PaddlePaddle/Paddle/pull/58881), [#58955](https://github.com/PaddlePaddle/Paddle/pull/58955), [#58684](https://github.com/PaddlePaddle/Paddle/pull/58684), [#58708](https://github.com/PaddlePaddle/Paddle/pull/58708), [#60323](https://github.com/PaddlePaddle/Paddle/pull/60323), [#58762](https://github.com/PaddlePaddle/Paddle/pull/58762), [#60048](https://github.com/PaddlePaddle/Paddle/pull/60048), [#60345](https://github.com/PaddlePaddle/Paddle/pull/60345), [#60325](https://github.com/PaddlePaddle/Paddle/pull/60325), [#59627](https://github.com/PaddlePaddle/Paddle/pull/59627), [#60416](https://github.com/PaddlePaddle/Paddle/pull/60416), [#60434](https://github.com/PaddlePaddle/Paddle/pull/60434), [#59801](https://github.com/PaddlePaddle/Paddle/pull/59801), [#60619](https://github.com/PaddlePaddle/Paddle/pull/60619), [#60445](https://github.com/PaddlePaddle/Paddle/pull/60445), [#60666](https://github.com/PaddlePaddle/Paddle/pull/60666), [#60353](https://github.com/PaddlePaddle/Paddle/pull/60353), [#60733](https://github.com/PaddlePaddle/Paddle/pull/60733), [#60693](https://github.com/PaddlePaddle/Paddle/pull/60693), [#60350](https://github.com/PaddlePaddle/Paddle/pull/60350), [#61096](https://github.com/PaddlePaddle/Paddle/pull/61096), [#61121](https://github.com/PaddlePaddle/Paddle/pull/61121), [#61164](https://github.com/PaddlePaddle/Paddle/pull/61164), [#62054](https://github.com/PaddlePaddle/Paddle/pull/62054), [#62136](https://github.com/PaddlePaddle/Paddle/pull/62136), [#62508](https://github.com/PaddlePaddle/Paddle/pull/62508), [#62988](https://github.com/PaddlePaddle/Paddle/pull/62988), 
[#63472](https://github.com/PaddlePaddle/Paddle/pull/63472), [#60193](https://github.com/PaddlePaddle/Paddle/pull/60193), [#60197](https://github.com/PaddlePaddle/Paddle/pull/60197), [#60198](https://github.com/PaddlePaddle/Paddle/pull/60198), [#60346](https://github.com/PaddlePaddle/Paddle/pull/60346), [#60318](https://github.com/PaddlePaddle/Paddle/pull/60318), [#60645](https://github.com/PaddlePaddle/Paddle/pull/60645), [#60650](https://github.com/PaddlePaddle/Paddle/pull/60650), [#60660](https://github.com/PaddlePaddle/Paddle/pull/60660), [#60706](https://github.com/PaddlePaddle/Paddle/pull/60706), [#60799](https://github.com/PaddlePaddle/Paddle/pull/60799), [#60837](https://github.com/PaddlePaddle/Paddle/pull/60837), [#60817](https://github.com/PaddlePaddle/Paddle/pull/60817), [#60820](https://github.com/PaddlePaddle/Paddle/pull/60820), [#60894](https://github.com/PaddlePaddle/Paddle/pull/60894), [#61079](https://github.com/PaddlePaddle/Paddle/pull/61079), [#61087](https://github.com/PaddlePaddle/Paddle/pull/61087), [#61073](https://github.com/PaddlePaddle/Paddle/pull/61073), [#61072](https://github.com/PaddlePaddle/Paddle/pull/61072), [#61127](https://github.com/PaddlePaddle/Paddle/pull/61127), [#61097](https://github.com/PaddlePaddle/Paddle/pull/61097), [#61365](https://github.com/PaddlePaddle/Paddle/pull/61365), [#61456](https://github.com/PaddlePaddle/Paddle/pull/61456), [#61846](https://github.com/PaddlePaddle/Paddle/pull/61846), [#62217](https://github.com/PaddlePaddle/Paddle/pull/62217), [#62519](https://github.com/PaddlePaddle/Paddle/pull/62519), [#62881](https://github.com/PaddlePaddle/Paddle/pull/62881), [#62880](https://github.com/PaddlePaddle/Paddle/pull/62880), [#59723](https://github.com/PaddlePaddle/Paddle/pull/59723), [#59722](https://github.com/PaddlePaddle/Paddle/pull/59722), [#59797](https://github.com/PaddlePaddle/Paddle/pull/59797), [#59960](https://github.com/PaddlePaddle/Paddle/pull/59960), 
[#59761](https://github.com/PaddlePaddle/Paddle/pull/59761), [#59996](https://github.com/PaddlePaddle/Paddle/pull/59996), [#60009](https://github.com/PaddlePaddle/Paddle/pull/60009), [#58896](https://github.com/PaddlePaddle/Paddle/pull/58896), [#60051](https://github.com/PaddlePaddle/Paddle/pull/60051), [#60410](https://github.com/PaddlePaddle/Paddle/pull/60410), [#60420](https://github.com/PaddlePaddle/Paddle/pull/60420), [#60548](https://github.com/PaddlePaddle/Paddle/pull/60548), [#60575](https://github.com/PaddlePaddle/Paddle/pull/60575), [#60726](https://github.com/PaddlePaddle/Paddle/pull/60726), [#60809](https://github.com/PaddlePaddle/Paddle/pull/60809), [#61346](https://github.com/PaddlePaddle/Paddle/pull/61346), [#61222](https://github.com/PaddlePaddle/Paddle/pull/61222), [#61099](https://github.com/PaddlePaddle/Paddle/pull/61099), [#62254](https://github.com/PaddlePaddle/Paddle/pull/62254), [#62269](https://github.com/PaddlePaddle/Paddle/pull/62269), [#62362](https://github.com/PaddlePaddle/Paddle/pull/62362) +- Improve the underlying error checking mechanism of PaddlePaddle to facilitate developers' debugging. [#62571](https://github.com/PaddlePaddle/Paddle/pull/62571), [#62602](https://github.com/PaddlePaddle/Paddle/pull/62602), [#60903](https://github.com/PaddlePaddle/Paddle/pull/60903), [#64695](https://github.com/PaddlePaddle/Paddle/pull/64695), [#59907](https://github.com/PaddlePaddle/Paddle/pull/59907), [#62018](https://github.com/PaddlePaddle/Paddle/pull/62018), [#62839](https://github.com/PaddlePaddle/Paddle/pull/62839), [#60651](https://github.com/PaddlePaddle/Paddle/pull/60651), [#61488](https://github.com/PaddlePaddle/Paddle/pull/61488), [#64064](https://github.com/PaddlePaddle/Paddle/pull/64064), [#63192](https://github.com/PaddlePaddle/Paddle/pull/63192), [#63525](https://github.com/PaddlePaddle/Paddle/pull/63525) -- Implemented timely release of last layer of PP strategy, to save video memory. 
[#54505](https://github.com/PaddlePaddle/Paddle/pull/54505) -- In MP strategy Tensor fusion, supported incoming params group to enhance Tensor fusion function. Improved allreduce asynchronous communication performance, and enhanced training performance through overlap of computation and communication. [#57690](https://github.com/PaddlePaddle/Paddle/pull/57690),[#55662](https://github.com/PaddlePaddle/Paddle/pull/55662) -- In Sharding strategy, carried out overlap for reverse computation and gradient communication, to improve training performance. For Sharding stage1, added Tensor fusion and fuse grad clip, and optimizer, to improve computational efficiency. Supported overlap between VPP and DP/Sharding Stage1, to improve communication and computation parallelism. Optimized performance of Sharding Stage1 under FP16. Check only gradient responsible for this sharding rank in the check finite stage, to reduce computation overhead; added environment variables to control whether Optimize is performed to save video memory, to achieve use of fewer resources for model training debugging. [#55598](https://github.com/PaddlePaddle/Paddle/pull/55598),[#55427](https://github.com/PaddlePaddle/Paddle/pull/55427),[#56063](https://github.com/PaddlePaddle/Paddle/pull/56063),[#55766](https://github.com/PaddlePaddle/Paddle/pull/55766),[#59848](https://github.com/PaddlePaddle/Paddle/pull/59848) -- In Hybrid Parallel strategy, arranged Tensor fusion under PP/VPP to pre-run, to solve the problem of extra overhead of runtime fuse on video memory. Improved model training performance by reducing non-essential synchronous memcpy. [#54403](https://github.com/PaddlePaddle/Paddle/pull/54403),[#57215](https://github.com/PaddlePaddle/Paddle/pull/57215) +### Vulnerability Fixing -#### Bug Fix +- Fix potential security vulnerabilities. 
[#59957](https://github.com/PaddlePaddle/Paddle/pull/59957), [#61032](https://github.com/PaddlePaddle/Paddle/pull/61032), [#61356](https://github.com/PaddlePaddle/Paddle/pull/61356), [#61573](https://github.com/PaddlePaddle/Paddle/pull/61573), [#61671](https://github.com/PaddlePaddle/Paddle/pull/61671), [#62345](https://github.com/PaddlePaddle/Paddle/pull/62345), [#60097](https://github.com/PaddlePaddle/Paddle/pull/60097), [#61161](https://github.com/PaddlePaddle/Paddle/pull/61161), [#61294](https://github.com/PaddlePaddle/Paddle/pull/61294), [#61349](https://github.com/PaddlePaddle/Paddle/pull/61349), [#61344](https://github.com/PaddlePaddle/Paddle/pull/61344), [#61162](https://github.com/PaddlePaddle/Paddle/pull/61162), [#61285](https://github.com/PaddlePaddle/Paddle/pull/61285), [#61826](https://github.com/PaddlePaddle/Paddle/pull/61826), [#59967](https://github.com/PaddlePaddle/Paddle/pull/59967), [#59976](https://github.com/PaddlePaddle/Paddle/pull/59976), [#59979](https://github.com/PaddlePaddle/Paddle/pull/59979), [#60527](https://github.com/PaddlePaddle/Paddle/pull/60527), [#60646](https://github.com/PaddlePaddle/Paddle/pull/60646), [#61827](https://github.com/PaddlePaddle/Paddle/pull/61827) -- Fixed 13 bugs in PP, Launch function, MP strategy, and fuse_rope, to enhance stability of distributed strategies. At mechanism level, fixed errors of inplace and tensor reference to improve stability. 
[#55116](https://github.com/PaddlePaddle/Paddle/pull/55116),[#55782](https://github.com/PaddlePaddle/Paddle/pull/55782),[#59609](https://github.com/PaddlePaddle/Paddle/pull/59609),[#57394](https://github.com/PaddlePaddle/Paddle/pull/57394),[#55864](https://github.com/PaddlePaddle/Paddle/pull/55864),[#58482](https://github.com/PaddlePaddle/Paddle/pull/58482),[#54571](https://github.com/PaddlePaddle/Paddle/pull/54571),[#55896](https://github.com/PaddlePaddle/Paddle/pull/55896),[#54648](https://github.com/PaddlePaddle/Paddle/pull/54648),[#58307](https://github.com/PaddlePaddle/Paddle/pull/58307),[#55679](https://github.com/PaddlePaddle/Paddle/pull/55679),[#58133](https://github.com/PaddlePaddle/Paddle/pull/58133),[#58408](https://github.com/PaddlePaddle/Paddle/pull/58408),[#59707](https://github.com/PaddlePaddle/Paddle/pull/59707),[#55342](https://github.com/PaddlePaddle/Paddle/pull/55342),[#54703](https://github.com/PaddlePaddle/Paddle/pull/54703),[#54869](https://github.com/PaddlePaddle/Paddle/pull/54869),[#55568](https://github.com/PaddlePaddle/Paddle/pull/55568),[#55233](https://github.com/PaddlePaddle/Paddle/pull/55233),[#56418](https://github.com/PaddlePaddle/Paddle/pull/56418),[#56428](https://github.com/PaddlePaddle/Paddle/pull/56428),[#56892](https://github.com/PaddlePaddle/Paddle/pull/56892),[#57192](https://github.com/PaddlePaddle/Paddle/pull/57192),[#59161](https://github.com/PaddlePaddle/Paddle/pull/59161),[#59340](https://github.com/PaddlePaddle/Paddle/pull/59340),[#57006](https://github.com/PaddlePaddle/Paddle/pull/57006),[#57353](https://github.com/PaddlePaddle/Paddle/pull/57353),[#57352](https://github.com/PaddlePaddle/Paddle/pull/57352),[#59088](https://github.com/PaddlePaddle/Paddle/pull/59088) -- Fixed bug that PP strategy can't release single-layer output in time. Fixed the bug that initialization process may Hang. 
[#54624](https://github.com/PaddlePaddle/Paddle/pull/54624),[#58844](https://github.com/PaddlePaddle/Paddle/pull/58844),[#54673](https://github.com/PaddlePaddle/Paddle/pull/54673),[#58376](https://github.com/PaddlePaddle/Paddle/pull/58376) -- Fixed the bug calculation is wrong when input data type is not uniform under MP strategy. Fixed the bug of parameter synchronization under MP strategy. Fixed the bug user input config is not used correctly. [#58858](https://github.com/PaddlePaddle/Paddle/pull/58858),[#57918](https://github.com/PaddlePaddle/Paddle/pull/57918),[#58037](https://github.com/PaddlePaddle/Paddle/pull/58037) -- Unified judgment method of dygraph and dynamic mode. [#54633](https://github.com/PaddlePaddle/Paddle/pull/54633) -- Fixed the bug shape of sin and cos in fuse_rope is not correct. [#56132](https://github.com/PaddlePaddle/Paddle/pull/56132) -- Fixed the bug task fails to due to long endpoints in Luanch distributed scenarios. Fixed the bug endpoints may be out of order. [#55011](https://github.com/PaddlePaddle/Paddle/pull/55011),[#55478](https://github.com/PaddlePaddle/Paddle/pull/55478) -- Fixed the bug MEA function may cause segmentation fault error. [#55408](https://github.com/PaddlePaddle/Paddle/pull/55408) +### Deprecated Features -### Auto parallel +- Clean up deprecated actuators and other logic to reduce redundant codes. [#64822](https://github.com/PaddlePaddle/Paddle/pull/64822), [#60941](https://github.com/PaddlePaddle/Paddle/pull/60941) -This release fully optimizes Auto Parallel programming paradigm with dynamic-static unification to simplify programming complexity for developers. Developers do not need to understand complex concepts and APIs in manual parallel programming paradigm, such as row-parallel, column-parallel, and so on. A small amount of tensor distribution annotations is required to build a hybrid parallel model. 
Framework will handle the derivation of distribution states of all tensors and operators, and adding appropriate communication operators. Meanwhile, it supports the dynamic to static distributed training by just one extra code changed, enabling developers to efficiently and easily implement any hybrid parallel strategy. This can significantly reduce development costs of hybrid parallel training codes. +## Compiler Infrastructure for Neural Networks (CINN) -#### Improved auto parallel core functions +In version 3.0, the compiler architecture has been significantly upgraded. Based on Shape Dialect, CINN builds a symbolic automatic derivation and simplification system, supports symbolic expression and constraint construction, and supports end-to-end compiler execution under dynamic shapes. Meanwhile, CINN has upgraded the automatic subgraph fusion and Pass Pipeline mechanisms and merged the core modules and iteration paths of dynamic and static shapes, so that the architecture is clear and unified. In this version, important back-end modules such as AST Compute, Schedule strategy, and Tiling have been refactored to improve the compiler's general optimization capability. Training and inference correctness and speedup under dynamic shapes have been verified on subgraphs of PaddlePaddle Industry Suite models and of the typical large models Llama2-7B and Stable Diffusion. -- Implemented auto parallel core APIs such as process_mesh, placement, shard_tensor, reshard, dtensor_from_fn, unshard_dtensor, shard_layer, to_static, and so on. 
[#55494](https://github.com/PaddlePaddle/Paddle/pull/55494),[#59059](https://github.com/PaddlePaddle/Paddle/pull/59059),[#56561](https://github.com/PaddlePaddle/Paddle/pull/56561),[#54425](https://github.com/PaddlePaddle/Paddle/pull/54425),[#59557](https://github.com/PaddlePaddle/Paddle/pull/59557),[#59682](https://github.com/PaddlePaddle/Paddle/pull/59682),[#56565](https://github.com/PaddlePaddle/Paddle/pull/56565),[#59862](https://github.com/PaddlePaddle/Paddle/pull/59862),[#59856](https://github.com/PaddlePaddle/Paddle/pull/59856),[#59342](https://github.com/PaddlePaddle/Paddle/pull/59342),[#59575](https://github.com/PaddlePaddle/Paddle/pull/59575),[#57604](https://github.com/PaddlePaddle/Paddle/pull/57604),[#57293](https://github.com/PaddlePaddle/Paddle/pull/57293),[#57278](https://github.com/PaddlePaddle/Paddle/pull/57278) -- Implemented Sharding derivation rules based on Enisum expressions, and completed 20+ classes of operator Sharding derivation rules, which covers LLaMA, GPT and other transformer-like large language models. 
[#55196](https://github.com/PaddlePaddle/Paddle/pull/55196),[#53863](https://github.com/PaddlePaddle/Paddle/pull/53863),[#56257](https://github.com/PaddlePaddle/Paddle/pull/56257),[#55394](https://github.com/PaddlePaddle/Paddle/pull/55394),[#54810](https://github.com/PaddlePaddle/Paddle/pull/54810),[#55508](https://github.com/PaddlePaddle/Paddle/pull/55508),[#56257](https://github.com/PaddlePaddle/Paddle/pull/56257),[#57813](https://github.com/PaddlePaddle/Paddle/pull/57813),[#58149](https://github.com/PaddlePaddle/Paddle/pull/58149),[#58506](https://github.com/PaddlePaddle/Paddle/pull/58506),[#58563](https://github.com/PaddlePaddle/Paddle/pull/58563),[#58360](https://github.com/PaddlePaddle/Paddle/pull/58360),[#58920](https://github.com/PaddlePaddle/Paddle/pull/58920),[#59050](https://github.com/PaddlePaddle/Paddle/pull/59050),[#58760](https://github.com/PaddlePaddle/Paddle/pull/58760),[#59083](https://github.com/PaddlePaddle/Paddle/pull/59083),[#59236](https://github.com/PaddlePaddle/Paddle/pull/59236),[#59350](https://github.com/PaddlePaddle/Paddle/pull/59350),[#59411](https://github.com/PaddlePaddle/Paddle/pull/59411),[#59260](https://github.com/PaddlePaddle/Paddle/pull/59260),[#54373](https://github.com/PaddlePaddle/Paddle/pull/54373),[#54991](https://github.com/PaddlePaddle/Paddle/pull/54991),[#55397](https://github.com/PaddlePaddle/Paddle/pull/55397),[#55350](https://github.com/PaddlePaddle/Paddle/pull/55350),[#55177](https://github.com/PaddlePaddle/Paddle/pull/55177),[#56443](https://github.com/PaddlePaddle/Paddle/pull/56443),[#58097](https://github.com/PaddlePaddle/Paddle/pull/58097),[#56509](https://github.com/PaddlePaddle/Paddle/pull/56509),[#56502](https://github.com/PaddlePaddle/Paddle/pull/56502),[#56504](https://github.com/PaddlePaddle/Paddle/pull/56504),[#56506](https://github.com/PaddlePaddle/Paddle/pull/56506),[#56507](https://github.com/PaddlePaddle/Paddle/pull/56507),[#56505](https://github.com/PaddlePaddle/Paddle/pull/56505),[#57176](https://git
hub.com/PaddlePaddle/Paddle/pull/57176),[#57374](https://github.com/PaddlePaddle/Paddle/pull/57374),[#57573](https://github.com/PaddlePaddle/Paddle/pull/57573),[#57545](https://github.com/PaddlePaddle/Paddle/pull/57545),[#57875](https://github.com/PaddlePaddle/Paddle/pull/57875),[#57866](https://github.com/PaddlePaddle/Paddle/pull/57866),[#58854](https://github.com/PaddlePaddle/Paddle/pull/58854),[#59109](https://github.com/PaddlePaddle/Paddle/pull/59109),[#59185](https://github.com/PaddlePaddle/Paddle/pull/59185),[#58913](https://github.com/PaddlePaddle/Paddle/pull/58913),[#59547](https://github.com/PaddlePaddle/Paddle/pull/59547),[#58296](https://github.com/PaddlePaddle/Paddle/pull/58296),[#59545](https://github.com/PaddlePaddle/Paddle/pull/59545),[#59039](https://github.com/PaddlePaddle/Paddle/pull/59039),[#59002](https://github.com/PaddlePaddle/Paddle/pull/59002),[#58087](https://github.com/PaddlePaddle/Paddle/pull/58087),[#56367](https://github.com/PaddlePaddle/Paddle/pull/56367),[#57877](https://github.com/PaddlePaddle/Paddle/pull/57877),[#56839](https://github.com/PaddlePaddle/Paddle/pull/56839),[#59003](https://github.com/PaddlePaddle/Paddle/pull/59003),[#57269](https://github.com/PaddlePaddle/Paddle/pull/57269),[#55130](https://github.com/PaddlePaddle/Paddle/pull/55130),[#58474](https://github.com/PaddlePaddle/Paddle/pull/58474),[#57197](https://github.com/PaddlePaddle/Paddle/pull/57197),[#57467](https://github.com/PaddlePaddle/Paddle/pull/57467),[#57259](https://github.com/PaddlePaddle/Paddle/pull/57259),[#57280](https://github.com/PaddlePaddle/Paddle/pull/57280),[#56508](https://github.com/PaddlePaddle/Paddle/pull/56508) -- Implemented distributed checkpoint storage and loading with dynamic-static unification. Supports ReShard upon arbitrary Sharding of storage and loading in a Sharding state. 
[#59659](https://github.com/PaddlePaddle/Paddle/pull/59659),[#59843](https://github.com/PaddlePaddle/Paddle/pull/59843),[#60033](https://github.com/PaddlePaddle/Paddle/pull/60033),[#60034](https://github.com/PaddlePaddle/Paddle/pull/60034) +### New Features -#### Enhanced semi-auto parallel capability of dynamic graph +1. Upgrade the new automatic subgraph fusion mechanism, and innovatively propose the TrivialOp and ReduceOp fusion theory, supporting a wider range of vertical fusion and horizontal fusion, ensuring the correctness and robustness of subgraph fusion, and giving full play to the fusion potential of the neural network compiler.([#63340](https://github.com/PaddlePaddle/Paddle/pull/63340)、[#63913](https://github.com/PaddlePaddle/Paddle/pull/63913)、[#63579](https://github.com/PaddlePaddle/Paddle/pull/63579)、[#63605](https://github.com/PaddlePaddle/Paddle/pull/63605)、[#60769](https://github.com/PaddlePaddle/Paddle/pull/60769)、[#62088](https://github.com/PaddlePaddle/Paddle/pull/62088)、[#63124](https://github.com/PaddlePaddle/Paddle/pull/63124)、[#63658](https://github.com/PaddlePaddle/Paddle/pull/63658)、[#64557](https://github.com/PaddlePaddle/Paddle/pull/64557)、[#63318](https://github.com/PaddlePaddle/Paddle/pull/63318)、[#62545](https://github.com/PaddlePaddle/Paddle/pull/62545)) +2. Add the symbol derivation function of dynamic shapes. 
Based on the Shape Dialect, realize the dynamic symbol construction, automatic derivation, constraint expression, symbol simplification and other mechanisms, introduce the DimExpr concept, upgrade the support for the PaddlePaddle framework of the InferSymbolicShape logic of the 150 + typical primitive operators, and provide more information for training and inference with compiler support for dynamic shapes.([#60843](https://github.com/PaddlePaddle/Paddle/pull/60843)、[#62662](https://github.com/PaddlePaddle/Paddle/pull/62662)、[#63790](https://github.com/PaddlePaddle/Paddle/pull/63790)、[#60098](https://github.com/PaddlePaddle/Paddle/pull/60098)、[#60511](https://github.com/PaddlePaddle/Paddle/pull/60511)、[#61232](https://github.com/PaddlePaddle/Paddle/pull/61232)、[#61939](https://github.com/PaddlePaddle/Paddle/pull/61939)、[#62798](https://github.com/PaddlePaddle/Paddle/pull/62798)、[#62955](https://github.com/PaddlePaddle/Paddle/pull/62955)、[#63029](https://github.com/PaddlePaddle/Paddle/pull/63029)、[#60572](https://github.com/PaddlePaddle/Paddle/pull/60572)、[#61035](https://github.com/PaddlePaddle/Paddle/pull/61035)、[#61224](https://github.com/PaddlePaddle/Paddle/pull/61224)、[#61587](https://github.com/PaddlePaddle/Paddle/pull/61587)、[#61937](https://github.com/PaddlePaddle/Paddle/pull/61937)、[#62314](https://github.com/PaddlePaddle/Paddle/pull/62314)、[#62394](https://github.com/PaddlePaddle/Paddle/pull/62394)、[#62569](https://github.com/PaddlePaddle/Paddle/pull/62569)、[#62495](https://github.com/PaddlePaddle/Paddle/pull/62495)、[#62844](https://github.com/PaddlePaddle/Paddle/pull/62844)、[#63000](https://github.com/PaddlePaddle/Paddle/pull/63000)、[#63016](https://github.com/PaddlePaddle/Paddle/pull/63016)、[#64222](https://github.com/PaddlePaddle/Paddle/pull/64222)、[#60129](https://github.com/PaddlePaddle/Paddle/pull/60129)、[#60899](https://github.com/PaddlePaddle/Paddle/pull/60899)、[#61342](https://github.com/PaddlePaddle/Paddle/pull/61342)、[#61439](https://github.com/
PaddlePaddle/Paddle/pull/61439)、[#62766](https://github.com/PaddlePaddle/Paddle/pull/62766)、[#61133](https://github.com/PaddlePaddle/Paddle/pull/61133)、[#61430](https://github.com/PaddlePaddle/Paddle/pull/61430)、[#61498](https://github.com/PaddlePaddle/Paddle/pull/61498)、[#61680](https://github.com/PaddlePaddle/Paddle/pull/61680)、[#63367](https://github.com/PaddlePaddle/Paddle/pull/63367)、[#62151](https://github.com/PaddlePaddle/Paddle/pull/62151)、[#62665](https://github.com/PaddlePaddle/Paddle/pull/62665)、[#61407](https://github.com/PaddlePaddle/Paddle/pull/61407)、[#61502](https://github.com/PaddlePaddle/Paddle/pull/61502)、[#61655](https://github.com/PaddlePaddle/Paddle/pull/61655)、[#64115](https://github.com/PaddlePaddle/Paddle/pull/64115)、[#61791](https://github.com/PaddlePaddle/Paddle/pull/61791)、[#62141](https://github.com/PaddlePaddle/Paddle/pull/62141)、[#63422](https://github.com/PaddlePaddle/Paddle/pull/63422)、[#63577](https://github.com/PaddlePaddle/Paddle/pull/63577)、[#63978](https://github.com/PaddlePaddle/Paddle/pull/63978)、[#63576](https://github.com/PaddlePaddle/Paddle/pull/63576)、[#63947](https://github.com/PaddlePaddle/Paddle/pull/63947)、[#64332](https://github.com/PaddlePaddle/Paddle/pull/64332)、[#63990](https://github.com/PaddlePaddle/Paddle/pull/63990)) +3. 
Add the Pass Pipline function, including PdToCinn, CinnPreprocess, BuildGroupOp, GroupClusterOp, CinnLowering, Accuracy Check and other Pass strategies, to support the Lowering and execution of subgraphs in dynamic and static shapes, with a clear architecture.([#61611](https://github.com/PaddlePaddle/Paddle/pull/61611)、[#62612](https://github.com/PaddlePaddle/Paddle/pull/62612)、[#64354](https://github.com/PaddlePaddle/Paddle/pull/64354)、[#61848](https://github.com/PaddlePaddle/Paddle/pull/61848)、[#62316](https://github.com/PaddlePaddle/Paddle/pull/62316)、[#64152](https://github.com/PaddlePaddle/Paddle/pull/64152)、[#61619](https://github.com/PaddlePaddle/Paddle/pull/61619)、[#62318](https://github.com/PaddlePaddle/Paddle/pull/62318)、[#61977](https://github.com/PaddlePaddle/Paddle/pull/61977)、[#62211](https://github.com/PaddlePaddle/Paddle/pull/62211)、[#63972](https://github.com/PaddlePaddle/Paddle/pull/63972)、[#63686](https://github.com/PaddlePaddle/Paddle/pull/63686)、[#64505](https://github.com/PaddlePaddle/Paddle/pull/64505)) +4. 
Add the support for BuketLower and DyShapeSchdule functions, to realize automatic bucket compilation and optimization according to the range of dynamic shapes; and adapt and upgrade the logic of CodeGen module to support the generation of InferShape function and the distribution of conditional branching function of Host function, so as to support the acceleration of training inference under the dynamic Shape of large models.([#62730](https://github.com/PaddlePaddle/Paddle/pull/62730)、[#61115](https://github.com/PaddlePaddle/Paddle/pull/61115)、[#59941](https://github.com/PaddlePaddle/Paddle/pull/59941)、[#62207](https://github.com/PaddlePaddle/Paddle/pull/62207)、[#64318](https://github.com/PaddlePaddle/Paddle/pull/64318)、[#64345](https://github.com/PaddlePaddle/Paddle/pull/64345)、[#60519](https://github.com/PaddlePaddle/Paddle/pull/60519)、[#62584](https://github.com/PaddlePaddle/Paddle/pull/62584)、[#60828](https://github.com/PaddlePaddle/Paddle/pull/60828)、[#60533](https://github.com/PaddlePaddle/Paddle/pull/60533)、[#61436](https://github.com/PaddlePaddle/Paddle/pull/61436)、[#62071](https://github.com/PaddlePaddle/Paddle/pull/62071)、[#63971](https://github.com/PaddlePaddle/Paddle/pull/63971)、[#61656](https://github.com/PaddlePaddle/Paddle/pull/61656)、[#63083](https://github.com/PaddlePaddle/Paddle/pull/63083)、[#64405](https://github.com/PaddlePaddle/Paddle/pull/64405)、[#63047](https://github.com/PaddlePaddle/Paddle/pull/63047)、[#64655](https://github.com/PaddlePaddle/Paddle/pull/64655)、[#63095](https://github.com/PaddlePaddle/Paddle/pull/63095)、[#63829](https://github.com/PaddlePaddle/Paddle/pull/63829)、[#63572](https://github.com/PaddlePaddle/Paddle/pull/63572)) +5. 
Add support for compilation caching strategy, to automatically recognize, merge and reuse compilation results of the same subgraph structure, improve compilation efficiency by using multi-threading, so as to enhance the user experience.([#62952](https://github.com/PaddlePaddle/Paddle/pull/62952)、[#63269](https://github.com/PaddlePaddle/Paddle/pull/63269)、[#64718](https://github.com/PaddlePaddle/Paddle/pull/64718)、[#61367](https://github.com/PaddlePaddle/Paddle/pull/61367)、[#63305](https://github.com/PaddlePaddle/Paddle/pull/63305)、[#63750](https://github.com/PaddlePaddle/Paddle/pull/63750)、[#63871](https://github.com/PaddlePaddle/Paddle/pull/63871)、[#64893](https://github.com/PaddlePaddle/Paddle/pull/64893)) +6. Add support for GenerateShape mechanism, add corresponding AST Compute operator definitions, support automatic resolution of dynamic symbols, and automatic generation of ShapeOp in the Lowering stage.([#64167](https://github.com/PaddlePaddle/Paddle/pull/64167)、[#64636](https://github.com/PaddlePaddle/Paddle/pull/64636)、[#61993](https://github.com/PaddlePaddle/Paddle/pull/61993)、[#64843](https://github.com/PaddlePaddle/Paddle/pull/64843)、[#62587](https://github.com/PaddlePaddle/Paddle/pull/62587)) -- Basic data structure supplementation: Added DistTensor, Placements and other distributed specific basic data structures on C++ end, and exposed to Python end. Supports debugging and printing of related attributes and values. 
[#58930](https://github.com/PaddlePaddle/Paddle/pull/58930),[#59068](https://github.com/PaddlePaddle/Paddle/pull/59068),[#55436](https://github.com/PaddlePaddle/Paddle/pull/55436),[#56449](https://github.com/PaddlePaddle/Paddle/pull/56449),[#59683](https://github.com/PaddlePaddle/Paddle/pull/59683),[#55593](https://github.com/PaddlePaddle/Paddle/pull/55593),[#58032](https://github.com/PaddlePaddle/Paddle/pull/58032),[#56368](https://github.com/PaddlePaddle/Paddle/pull/56368),[#59086](https://github.com/PaddlePaddle/Paddle/pull/59086) -- Added SPMD derivation and Reshard generation logic in execution flow for all operators, and adapted to multiple types of inputs and outputs such as vector and optional, as well as special mechanisms such as cpu fallback and multi-kernel selection. [#56602](https://github.com/PaddlePaddle/Paddle/pull/56602),[#57321](https://github.com/PaddlePaddle/Paddle/pull/57321),[#57092](https://github.com/PaddlePaddle/Paddle/pull/57092),[#56831](https://github.com/PaddlePaddle/Paddle/pull/56831),[#57119](https://github.com/PaddlePaddle/Paddle/pull/57119),[#58819](https://github.com/PaddlePaddle/Paddle/pull/58819),[#58254](https://github.com/PaddlePaddle/Paddle/pull/58254),[#55698](https://github.com/PaddlePaddle/Paddle/pull/55698),[#59241](https://github.com/PaddlePaddle/Paddle/pull/59241),[#59328](https://github.com/PaddlePaddle/Paddle/pull/59328),[#58644](https://github.com/PaddlePaddle/Paddle/pull/58644),[#56202](https://github.com/PaddlePaddle/Paddle/pull/56202),[#59159](https://github.com/PaddlePaddle/Paddle/pull/59159),[#58573](https://github.com/PaddlePaddle/Paddle/pull/58573),[#59246](https://github.com/PaddlePaddle/Paddle/pull/59246),[#59133](https://github.com/PaddlePaddle/Paddle/pull/59133),[#59186](https://github.com/PaddlePaddle/Paddle/pull/59186),[#57505](https://github.com/PaddlePaddle/Paddle/pull/57505),[#57241](https://github.com/PaddlePaddle/Paddle/pull/57241),[#58928](https://github.com/PaddlePaddle/Paddle/pull/58928) +### 
Function Optimization -- Adapted auto parallel execution logic for special types of operators, such as custom operators. Supports automatic conversion of DistTensor and DenseTensor as mixed inputs. [#57774](https://github.com/PaddlePaddle/Paddle/pull/57774),[#59108](https://github.com/PaddlePaddle/Paddle/pull/59108),[#58436](https://github.com/PaddlePaddle/Paddle/pull/58436),[#59523](https://github.com/PaddlePaddle/Paddle/pull/59523),[#59136](https://github.com/PaddlePaddle/Paddle/pull/59136),[#59352](https://github.com/PaddlePaddle/Paddle/pull/59352),[#59062](https://github.com/PaddlePaddle/Paddle/pull/59062),[#58434](https://github.com/PaddlePaddle/Paddle/pull/58434),[#59148](https://github.com/PaddlePaddle/Paddle/pull/59148),[#58553](https://github.com/PaddlePaddle/Paddle/pull/58553),[#58716](https://github.com/PaddlePaddle/Paddle/pull/58716),[#58369](https://github.com/PaddlePaddle/Paddle/pull/58369),[#59061](https://github.com/PaddlePaddle/Paddle/pull/59061),[#58841](https://github.com/PaddlePaddle/Paddle/pull/58841),[#59139](https://github.com/PaddlePaddle/Paddle/pull/59139),[#59141](https://github.com/PaddlePaddle/Paddle/pull/59141),[#58837](https://github.com/PaddlePaddle/Paddle/pull/58837),[#59137](https://github.com/PaddlePaddle/Paddle/pull/59137),[#59143](https://github.com/PaddlePaddle/Paddle/pull/59143) +1. Optimize BuildCinnPass logic, upgrade the compiler's perception strategy for black and white list operators, and improve the robustness of Pass logic.([#62372](https://github.com/PaddlePaddle/Paddle/pull/62372)、[#61081](https://github.com/PaddlePaddle/Paddle/pull/61081)、[#61225](https://github.com/PaddlePaddle/Paddle/pull/61225)、[#58863](https://github.com/PaddlePaddle/Paddle/pull/58863)) +2. Optimize the OpLoweringGroup data structure, remove unnecessary interfaces and members, and reduce the coupling between upstream and downstream modules.([#62339](https://github.com/PaddlePaddle/Paddle/pull/62339)) +3. 
Optimize the component design of the compiler on the architecture Arch, to abstract the concept of hardware, and reduce the cost of adapting to domestic hardware.([#63530](https://github.com/PaddlePaddle/Paddle/pull/63530)、[#64347](https://github.com/PaddlePaddle/Paddle/pull/64347)、[#64506](https://github.com/PaddlePaddle/Paddle/pull/64506)、[#64587](https://github.com/PaddlePaddle/Paddle/pull/64587)) +4. Upgrade the AST Compute module of the compiler's back-end operator, to adapt to support the computing logic of dynamic Shape.([#62488](https://github.com/PaddlePaddle/Paddle/pull/62488)、[#63581](https://github.com/PaddlePaddle/Paddle/pull/63581)、[#63687](https://github.com/PaddlePaddle/Paddle/pull/63687)、[#63654](https://github.com/PaddlePaddle/Paddle/pull/63654)、[#64217](https://github.com/PaddlePaddle/Paddle/pull/64217)) -- Optimized dynamic graph execution system: Adapted Autograd execution process. Supports dynamic graph's inverse gradient aggregation, AMP, Hook, PyLayer, View, custom operators, and other surrounding mechanisms. 
[#58437](https://github.com/PaddlePaddle/Paddle/pull/58437),[#58769](https://github.com/PaddlePaddle/Paddle/pull/58769),[#58796](https://github.com/PaddlePaddle/Paddle/pull/58796),[#58339](https://github.com/PaddlePaddle/Paddle/pull/58339),[#58409](https://github.com/PaddlePaddle/Paddle/pull/58409),[#58772](https://github.com/PaddlePaddle/Paddle/pull/58772),[#58380](https://github.com/PaddlePaddle/Paddle/pull/58380),[#58447](https://github.com/PaddlePaddle/Paddle/pull/58447),[#58706](https://github.com/PaddlePaddle/Paddle/pull/58706),[#58656](https://github.com/PaddlePaddle/Paddle/pull/58656),[#58172](https://github.com/PaddlePaddle/Paddle/pull/58172),[#59401](https://github.com/PaddlePaddle/Paddle/pull/59401),[#58727](https://github.com/PaddlePaddle/Paddle/pull/58727),[#58238](https://github.com/PaddlePaddle/Paddle/pull/58238),[#59243](https://github.com/PaddlePaddle/Paddle/pull/59243),[#58469](https://github.com/PaddlePaddle/Paddle/pull/58469),[#58442](https://github.com/PaddlePaddle/Paddle/pull/58442),[#58487](https://github.com/PaddlePaddle/Paddle/pull/58487),[#58476](https://github.com/PaddlePaddle/Paddle/pull/58476),[#59706](https://github.com/PaddlePaddle/Paddle/pull/59706) +### Performance Optimization -- Added support for Pipeline Parallelism, Sequence Parallelism and other distributed parallelism. [#58126](https://github.com/PaddlePaddle/Paddle/pull/58126),[#59766](https://github.com/PaddlePaddle/Paddle/pull/59766),[#59060](https://github.com/PaddlePaddle/Paddle/pull/59060),[#59841](https://github.com/PaddlePaddle/Paddle/pull/59841),[#58609](https://github.com/PaddlePaddle/Paddle/pull/58609),[#59688](https://github.com/PaddlePaddle/Paddle/pull/59688),[#58449](https://github.com/PaddlePaddle/Paddle/pull/58449)、[#59598](https://github.com/PaddlePaddle/Paddle/pull/59598) -- Added various Reshard strategies and support tensor conversions between different distributed states. 
[#58592](https://github.com/PaddlePaddle/Paddle/pull/58592),[#59138](https://github.com/PaddlePaddle/Paddle/pull/59138),[#59367](https://github.com/PaddlePaddle/Paddle/pull/59367),[#59621](https://github.com/PaddlePaddle/Paddle/pull/59621),[#59758](https://github.com/PaddlePaddle/Paddle/pull/59758),[#59777](https://github.com/PaddlePaddle/Paddle/pull/59777),[#56975](https://github.com/PaddlePaddle/Paddle/pull/56975),[#58550](https://github.com/PaddlePaddle/Paddle/pull/58550),[#58703](https://github.com/PaddlePaddle/Paddle/pull/58703),[#57210](https://github.com/PaddlePaddle/Paddle/pull/57210),[#58734](https://github.com/PaddlePaddle/Paddle/pull/58734),[#56833](https://github.com/PaddlePaddle/Paddle/pull/56833),[#59292](https://github.com/PaddlePaddle/Paddle/pull/59292),[#57432](https://github.com/PaddlePaddle/Paddle/pull/57432),[#57568](https://github.com/PaddlePaddle/Paddle/pull/57568),[#56553](https://github.com/PaddlePaddle/Paddle/pull/56553),[#58284](https://github.com/PaddlePaddle/Paddle/pull/58284),[#56039](https://github.com/PaddlePaddle/Paddle/pull/56039),[#55552](https://github.com/PaddlePaddle/Paddle/pull/55552),[#56149](https://github.com/PaddlePaddle/Paddle/pull/56149) +1. Optimize the Schedule logic of AST IR, restructure the core modules such as Vectorize, Unroll, AxisBind, and ComputeAt, and merged the iterative paths of dynamic and static shapes, so as to reduce the development and maintenance costs.([#60449](https://github.com/PaddlePaddle/Paddle/pull/60449)、[#60155](https://github.com/PaddlePaddle/Paddle/pull/60155)、[#60342](https://github.com/PaddlePaddle/Paddle/pull/60342)、[#60498](https://github.com/PaddlePaddle/Paddle/pull/60498)、[#60538](https://github.com/PaddlePaddle/Paddle/pull/60538)、[#60190](https://github.com/PaddlePaddle/Paddle/pull/60190)、[#61197](https://github.com/PaddlePaddle/Paddle/pull/61197)、[#63140](https://github.com/PaddlePaddle/Paddle/pull/63140)、[#61156](https://github.com/PaddlePaddle/Paddle/pull/61156)) +2. 
Optimize the Tiling strategy and temp Buffer function, support warp-level contiguous memory reads and the cache_read and cache_write functions, and improve the subgraph execution performance.([#64240](https://github.com/PaddlePaddle/Paddle/pull/64240)、[#60562](https://github.com/PaddlePaddle/Paddle/pull/60562)、[#64711](https://github.com/PaddlePaddle/Paddle/pull/64711)、[#62856](https://github.com/PaddlePaddle/Paddle/pull/62856)、[#61576](https://github.com/PaddlePaddle/Paddle/pull/61576)、[#61901](https://github.com/PaddlePaddle/Paddle/pull/61901)、[#62581](https://github.com/PaddlePaddle/Paddle/pull/62581)、[#61987](https://github.com/PaddlePaddle/Paddle/pull/61987)、[#60190](https://github.com/PaddlePaddle/Paddle/pull/60190)、[#63138](https://github.com/PaddlePaddle/Paddle/pull/63138)、[#62517](https://github.com/PaddlePaddle/Paddle/pull/62517)) +3. Support automatic search of Schedule configurations and an AOT offline saving mechanism to accelerate subgraph Kernel performance.([#64271](https://github.com/PaddlePaddle/Paddle/pull/64271)、[#64588](https://github.com/PaddlePaddle/Paddle/pull/64588)、[#64694](https://github.com/PaddlePaddle/Paddle/pull/64694)、[#64620](https://github.com/PaddlePaddle/Paddle/pull/64620)、[#64702](https://github.com/PaddlePaddle/Paddle/pull/64702)、[#63086](https://github.com/PaddlePaddle/Paddle/pull/63086)) +4. Support OptimizeReductionTactic optimization strategy to improve kernel performance in Reduce scenarios.([#60661](https://github.com/PaddlePaddle/Paddle/pull/60661)、[#61363](https://github.com/PaddlePaddle/Paddle/pull/61363)、[#60881](https://github.com/PaddlePaddle/Paddle/pull/60881)、[#63859](https://github.com/PaddlePaddle/Paddle/pull/63859)) +5. Enhance DCE Pass function, remove redundant If/For branch code and improve execution efficiency.([#61682](https://github.com/PaddlePaddle/Paddle/pull/61682)) +6. 
Add support for FuseParallelMatmulPass Pass, integrate multiple Matmul operators to achieve acceleration.([#63623](https://github.com/PaddlePaddle/Paddle/pull/63623)) -#### Enhanced semi-auto parallel for static graphs +### Bug Fixing -- Added Sequence Parallel Parallelism; added FThenB, Interleaved 1F1B, Eager 1F1B, VPP and other scheduling modes for Pipeline Parallel, and supported the hybrid parallel between the above new parallelism and original parallelism. Supported visualization of pipeline scheduling. Upgraded gradient synchronization mechanism which supports gradient synchronization when data is sharded on any broadcast dimension. [#57605](https://github.com/PaddlePaddle/Paddle/pull/57605),[#54727](https://github.com/PaddlePaddle/Paddle/pull/54727),[#54409](https://github.com/PaddlePaddle/Paddle/pull/54409),[#54787](https://github.com/PaddlePaddle/Paddle/pull/54787),[#58313](https://github.com/PaddlePaddle/Paddle/pull/58313),[#59179](https://github.com/PaddlePaddle/Paddle/pull/59179),[#59416](https://github.com/PaddlePaddle/Paddle/pull/59416),[#59719](https://github.com/PaddlePaddle/Paddle/pull/59719),[#59822](https://github.com/PaddlePaddle/Paddle/pull/59822),[#59057](https://github.com/PaddlePaddle/Paddle/pull/59057),[#59522](https://github.com/PaddlePaddle/Paddle/pull/59522),[#57061](https://github.com/PaddlePaddle/Paddle/pull/57061) -- Adapted the executor to PIR, and supported PIR optimization Pass. In distributed scenarios, supports fuse_linear fuse, and etc., to improve performance. 
[#58459](https://github.com/PaddlePaddle/Paddle/pull/58459),[#58528](https://github.com/PaddlePaddle/Paddle/pull/58528),[#55555](https://github.com/PaddlePaddle/Paddle/pull/55555),[#59757](https://github.com/PaddlePaddle/Paddle/pull/59757),[#59102](https://github.com/PaddlePaddle/Paddle/pull/59102),[#57917](https://github.com/PaddlePaddle/Paddle/pull/57917) -- Upgraded underlying architecture: upgraded the executor to reuse the results of data-flow dependency analysis and static kernel selection; upgraded entire graph based sharding completion mechanism, to switch to new sharding derivation rules and support some long-tailed cases; optimized the support of control flow under distributed static graph to adapt to more scenarios; reduced the graph compilation time and refined error message format to improve user experience. [#55389](https://github.com/PaddlePaddle/Paddle/pull/55389),[#55650](https://github.com/PaddlePaddle/Paddle/pull/55650),[#54938](https://github.com/PaddlePaddle/Paddle/pull/54938),[#57447](https://github.com/PaddlePaddle/Paddle/pull/57447),[#57751](https://github.com/PaddlePaddle/Paddle/pull/57751),[#57742](https://github.com/PaddlePaddle/Paddle/pull/57742),[#59524](https://github.com/PaddlePaddle/Paddle/pull/59524),[#59526](https://github.com/PaddlePaddle/Paddle/pull/59526),[#58669](https://github.com/PaddlePaddle/Paddle/pull/58669),[#57616](https://github.com/PaddlePaddle/Paddle/pull/57616),[#56511](https://github.com/PaddlePaddle/Paddle/pull/56511),[#55727](https://github.com/PaddlePaddle/Paddle/pull/55727),[#58906](https://github.com/PaddlePaddle/Paddle/pull/58906),[#56016](https://github.com/PaddlePaddle/Paddle/pull/56016),[#54897](https://github.com/PaddlePaddle/Paddle/pull/54897) -- Optimized the gpu memory usage in static graph mode, and added refined recomputing strategy; optimized auto mixed precision pass, and allows users to manually specify auto-cast region and fixed some bugs; supports parallel computation of cross-entropy; supports 
fusion operators such as scaled_dot_product_attention, fuse_rope, etc.; performs scheduling optimization to support better overlap between communication and computation in tensor parallelism and pipeline parallelsim. [#58421](https://github.com/PaddlePaddle/Paddle/pull/58421),[#58533](https://github.com/PaddlePaddle/Paddle/pull/58533),[#59498](https://github.com/PaddlePaddle/Paddle/pull/59498),[#59498](https://github.com/PaddlePaddle/Paddle/pull/59498),[#59187](https://github.com/PaddlePaddle/Paddle/pull/59187),[#59188](https://github.com/PaddlePaddle/Paddle/pull/59188),[#58172](https://github.com/PaddlePaddle/Paddle/pull/58172),[#58628](https://github.com/PaddlePaddle/Paddle/pull/58628),[#56185](https://github.com/PaddlePaddle/Paddle/pull/56185),[#56696](https://github.com/PaddlePaddle/Paddle/pull/56696),[#59497](https://github.com/PaddlePaddle/Paddle/pull/59497),[#58304](https://github.com/PaddlePaddle/Paddle/pull/58304),[#58977](https://github.com/PaddlePaddle/Paddle/pull/58977) +1. Fix the bug when Lowering some special operators to the compiler, to improve the end-to-end user experience.([#60800](https://github.com/PaddlePaddle/Paddle/pull/60800)、[#64720](https://github.com/PaddlePaddle/Paddle/pull/64720)、[#62593](https://github.com/PaddlePaddle/Paddle/pull/62593)、[#62661](https://github.com/PaddlePaddle/Paddle/pull/62661)、[#64626](https://github.com/PaddlePaddle/Paddle/pull/64626)、[#63320](https://github.com/PaddlePaddle/Paddle/pull/63320)、[#64581](https://github.com/PaddlePaddle/Paddle/pull/64581)、[#61608](https://github.com/PaddlePaddle/Paddle/pull/61608)、[#64135](https://github.com/PaddlePaddle/Paddle/pull/64135)、[#64659](https://github.com/PaddlePaddle/Paddle/pull/64659)、[#62391](https://github.com/PaddlePaddle/Paddle/pull/62391)、[#62490](https://github.com/PaddlePaddle/Paddle/pull/62490)、[#63891](https://github.com/PaddlePaddle/Paddle/pull/63891)、[#64529](https://github.com/PaddlePaddle/Paddle/pull/64529)) +2. 
Fix a bug in the symbolic derivation logic of some operators.([#62141](https://github.com/PaddlePaddle/Paddle/pull/62141)、[#62376](https://github.com/PaddlePaddle/Paddle/pull/62376)、[#62941](https://github.com/PaddlePaddle/Paddle/pull/62941)、[#63322](https://github.com/PaddlePaddle/Paddle/pull/63322)、[#64672](https://github.com/PaddlePaddle/Paddle/pull/64672)、[#64407](https://github.com/PaddlePaddle/Paddle/pull/64407)、[#60241](https://github.com/PaddlePaddle/Paddle/pull/60241)、[#60440](https://github.com/PaddlePaddle/Paddle/pull/60440)、[#62503](https://github.com/PaddlePaddle/Paddle/pull/62503)、[#62997](https://github.com/PaddlePaddle/Paddle/pull/62997)、[#63169](https://github.com/PaddlePaddle/Paddle/pull/63169)、[#61098](https://github.com/PaddlePaddle/Paddle/pull/61098)、[#63973](https://github.com/PaddlePaddle/Paddle/pull/63973)、[#62248](https://github.com/PaddlePaddle/Paddle/pull/62248)、[#62321](https://github.com/PaddlePaddle/Paddle/pull/62321)、[#63755](https://github.com/PaddlePaddle/Paddle/pull/63755)、[#63917](https://github.com/PaddlePaddle/Paddle/pull/63917)、[#63903](https://github.com/PaddlePaddle/Paddle/pull/63903)、[#64173](https://github.com/PaddlePaddle/Paddle/pull/64173)、[#64525](https://github.com/PaddlePaddle/Paddle/pull/64525)、[#64615](https://github.com/PaddlePaddle/Paddle/pull/64615)、[#62247](https://github.com/PaddlePaddle/Paddle/pull/62247)、[#62455](https://github.com/PaddlePaddle/Paddle/pull/62455)、[#62898](https://github.com/PaddlePaddle/Paddle/pull/62898)、[#62867](https://github.com/PaddlePaddle/Paddle/pull/62867)、[#63608](https://github.com/PaddlePaddle/Paddle/pull/63608)、[#63789](https://github.com/PaddlePaddle/Paddle/pull/63789)、[#64085](https://github.com/PaddlePaddle/Paddle/pull/64085)、[#64136](https://github.com/PaddlePaddle/Paddle/pull/64136)、[#64181](https://github.com/PaddlePaddle/Paddle/pull/64181)) +3. 
Fix the problems of compiler execution errors under dynamic and static shapes, to improve the robustness of the framework mechanism.([#60813](https://github.com/PaddlePaddle/Paddle/pull/60813)、[#61877](https://github.com/PaddlePaddle/Paddle/pull/61877)、[#61909](https://github.com/PaddlePaddle/Paddle/pull/61909)、[#62954](https://github.com/PaddlePaddle/Paddle/pull/62954)、[#63614](https://github.com/PaddlePaddle/Paddle/pull/63614)、[#60339](https://github.com/PaddlePaddle/Paddle/pull/60339)、[#60623](https://github.com/PaddlePaddle/Paddle/pull/60623)、[#60658](https://github.com/PaddlePaddle/Paddle/pull/60658)、[#60669](https://github.com/PaddlePaddle/Paddle/pull/60669)、[#58823](https://github.com/PaddlePaddle/Paddle/pull/58823)、[#62483](https://github.com/PaddlePaddle/Paddle/pull/62483)、[#62742](https://github.com/PaddlePaddle/Paddle/pull/62742)、[#61797](https://github.com/PaddlePaddle/Paddle/pull/61797)、[#63411](https://github.com/PaddlePaddle/Paddle/pull/63411)、[#64077](https://github.com/PaddlePaddle/Paddle/pull/64077)、[#62736](https://github.com/PaddlePaddle/Paddle/pull/62736)、[#62390](https://github.com/PaddlePaddle/Paddle/pull/62390)、[#63689](https://github.com/PaddlePaddle/Paddle/pull/63689)) -#### AutoTuner +### Deprecated Features -This release implements a profiling based automatic search and tuning tool named AutoTuner for parallel strategies, to automatically combine parallel and optimization strategies. Users can select effective combination configurations for experiments, and AutoTuner will search for the optimal configuration for large model training and inference given the model and hardware specification. In addition, AutoTuner implements a variety of pruning methods, including gpu memory modelling based pruning, so the search space and search time can be significantly reduced. 
[#54460](https://github.com/PaddlePaddle/Paddle/pull/54460),[#54668](https://github.com/PaddlePaddle/Paddle/pull/54668),[#59794](https://github.com/PaddlePaddle/Paddle/pull/59794),[#59727](https://github.com/PaddlePaddle/Paddle/pull/59727),[#59782](https://github.com/PaddlePaddle/Paddle/pull/59782),[#54834](https://github.com/PaddlePaddle/Paddle/pull/54834),[#58127](https://github.com/PaddlePaddle/Paddle/pull/58127),[#56968](https://github.com/PaddlePaddle/Paddle/pull/56968),[#55466](https://github.com/PaddlePaddle/Paddle/pull/55466),[#56939](https://github.com/PaddlePaddle/Paddle/pull/56939),[#58183](https://github.com/PaddlePaddle/Paddle/pull/58183),[#58314](https://github.com/PaddlePaddle/Paddle/pull/58314),[#55499](https://github.com/PaddlePaddle/Paddle/pull/55499),[#59748](https://github.com/PaddlePaddle/Paddle/pull/59748) +1. Remove useless symbol-related components such as adt DimExpr, SymbolicDimExpr and ShapedTypeInterface.([#60901](https://github.com/PaddlePaddle/Paddle/pull/60901)、[#60933](https://github.com/PaddlePaddle/Paddle/pull/60933)、[#60744](https://github.com/PaddlePaddle/Paddle/pull/60744)、[#64176](https://github.com/PaddlePaddle/Paddle/pull/64176)、[#64140](https://github.com/PaddlePaddle/Paddle/pull/64140)) +2. 
Remove the old Group Cluster, and the front-end representation under the old IR, to improve the simplicity of the architecture.([#63683](https://github.com/PaddlePaddle/Paddle/pull/63683)、[#64630](https://github.com/PaddlePaddle/Paddle/pull/64630)、[#61380](https://github.com/PaddlePaddle/Paddle/pull/61380)) -### Operator library +## Auto-Parallel Architecture -#### Incompatible upgrade +To further improve the usability of the Auto Parallel architecture in large model training scenarios, PaddlePaddle has improved the Auto Parallel functionality for dynamic and static graphs, adding parallel strategies such as sharding parallelism and interleaved pipeline parallelism, supporting lazy parameter initialization, and adding and enhancing the SPMD derivation rules for some operators. The auto-parallel architecture has been comprehensively verified on a number of mainstream large language models. Meanwhile, to build the new 3.0 architecture of PaddlePaddle, the static graph auto parallel architecture has been comprehensively upgraded based on PIR, the new generation intermediate representation of PaddlePaddle. The upgrade introduces DistDialect for distributed components, natively supports DistAttr and DistTensor in the computation graph representation, and smooths the transformation from static to dynamic graph, further unifying auto parallel usage across dynamic and static graph modes. Finally, a number of performance optimizations have been added and improved, including the zero bubble pipeline scheduling strategy, achieving the same or even better end-to-end training performance than manual parallelism on typical large models such as Llama-2 13B/70B. -In order to improve maintainability of PaddlePaddle framework, some deprecated operators in the framework (e.g. diag_v1, isfinite_v1, pad2d_v1, etc.)
have been removed, and models using these operators saved through the PaddlePaddle 1.x training will not be able to infer on new version of PaddlePaddle. [#57895](https://github.com/PaddlePaddle/Paddle/pull/57895),[#57892](https://github.com/PaddlePaddle/Paddle/pull/57892),[#57898](https://github.com/PaddlePaddle/Paddle/pull/57898),[#57730](https://github.com/PaddlePaddle/Paddle/pull/57730),[#57732](https://github.com/PaddlePaddle/Paddle/pull/57732),[#57810](https://github.com/PaddlePaddle/Paddle/pull/57810),[#57884](https://github.com/PaddlePaddle/Paddle/pull/57884),[#57794](https://github.com/PaddlePaddle/Paddle/pull/57794),[#57926](https://github.com/PaddlePaddle/Paddle/pull/57926),[#57925](https://github.com/PaddlePaddle/Paddle/pull/57925),[#57807](https://github.com/PaddlePaddle/Paddle/pull/57807),[#57808](https://github.com/PaddlePaddle/Paddle/pull/57808) +### Function Improvements -#### Operator library enhancements +- Add the dtensor_from_local interface for creating DistTensor from local tensor after sharding (correspondingly, shard_tensor is the created DistTensor from global tensor before sharding). [#60206](https://github.com/PaddlePaddle/Paddle/pull/60206) +- Add the unshard_tensor interface to convert DistTensor to global tensor, which is reciprocal operation to shard_tensor. [#60272](https://github.com/PaddlePaddle/Paddle/pull/60272) +- To reduce the GPU memory usage during training, add Sharding parallelism, and support stage1, stage2 and stage3 modes. [#61926](https://github.com/PaddlePaddle/Paddle/pull/61926), [#62711](https://github.com/PaddlePaddle/Paddle/pull/62711), [#62486](https://github.com/PaddlePaddle/Paddle/pull/62486), [#62230](https://github.com/PaddlePaddle/Paddle/pull/62230) +- To solve the problem of insufficient GPU memory when initializing parameters first and then sharding them, add the LazyInit function, to support slicing parameters first and then initializing them. 
[#60316](https://github.com/PaddlePaddle/Paddle/pull/60316), [#60441](https://github.com/PaddlePaddle/Paddle/pull/60441), [#60563](https://github.com/PaddlePaddle/Paddle/pull/60563), [#61792](https://github.com/PaddlePaddle/Paddle/pull/61792) +- To reduce the bubble of pipeline parallel, add interleaved pipeline parallelism, and support automatically converting the pipeline parallel of the user's networking to interleaved pipeline parallel through configuration, so that the user doesn't need to perform complicated marking in the networking. [#59751](https://github.com/PaddlePaddle/Paddle/pull/59751), [#60050](https://github.com/PaddlePaddle/Paddle/pull/60050), [#60467](https://github.com/PaddlePaddle/Paddle/pull/60467), [#60868](https://github.com/PaddlePaddle/Paddle/pull/60868), [#60187](https://github.com/PaddlePaddle/Paddle/pull/60187), [#62884](https://github.com/PaddlePaddle/Paddle/pull/62884), [#60560](https://github.com/PaddlePaddle/Paddle/pull/60560), [#61541](https://github.com/PaddlePaddle/Paddle/pull/61541) +- Add the SPMD derivation rules for stack, gather, scatter_grad, cumsum, unbind, swiglu, and fused_linear_param_grad. Improve and optimize the SPMD derivation rules of fused_rope, reshape, flatten, fused_rms_norm, slice, tile, flash_attn, cross_entropy and other operators, to solve incompatibility problems in some model networking scenarios.
[#62720](https://github.com/PaddlePaddle/Paddle/pull/62720), [#64202](https://github.com/PaddlePaddle/Paddle/pull/64202), [#63361](https://github.com/PaddlePaddle/Paddle/pull/63361), [#63290](https://github.com/PaddlePaddle/Paddle/pull/63290), [#61460](https://github.com/PaddlePaddle/Paddle/pull/61460), [#59986](https://github.com/PaddlePaddle/Paddle/pull/59986), [#61184](https://github.com/PaddlePaddle/Paddle/pull/61184), [#60144](https://github.com/PaddlePaddle/Paddle/pull/60144), [#62525](https://github.com/PaddlePaddle/Paddle/pull/62525), [#62053](https://github.com/PaddlePaddle/Paddle/pull/62053), [#60709](https://github.com/PaddlePaddle/Paddle/pull/60709), [#60111](https://github.com/PaddlePaddle/Paddle/pull/60111), [#63681](https://github.com/PaddlePaddle/Paddle/pull/63681), [#62180](https://github.com/PaddlePaddle/Paddle/pull/62180), [#60794](https://github.com/PaddlePaddle/Paddle/pull/60794), [#60632](https://github.com/PaddlePaddle/Paddle/pull/60632), [#62439](https://github.com/PaddlePaddle/Paddle/pull/62439) +- Improve the distributed checkpoint storage and loading function, support master_weights strategy, and fix the random hanging problem. [#60027](https://github.com/PaddlePaddle/Paddle/pull/60027), [#59872](https://github.com/PaddlePaddle/Paddle/pull/59872) +- In order to support the auto parallel of arbitrary shape tensor, add the non-uniform tensor sharding feature. [#62611](https://github.com/PaddlePaddle/Paddle/pull/62611), [#61432](https://github.com/PaddlePaddle/Paddle/pull/61432) +- In order to support users to use customized operators in the auto parallel networking, support user registration outside the framework to customize the SPMD derivation rules for this class of operators. [#60509](https://github.com/PaddlePaddle/Paddle/pull/60509) +- Improve the slice SPMD rule, and support the transition from any state to replicate and from replicate state to any state. 
[#60281](https://github.com/PaddlePaddle/Paddle/pull/60281), [#59869](https://github.com/PaddlePaddle/Paddle/pull/59869) +- Add MoE expert parallelism (experimental). Currently, only dynamic graph auto parallel is supported. [#63904](https://github.com/PaddlePaddle/Paddle/pull/63904) +- Fix some adaptation problems between auto parallel and the dynamic graph execution and dynamic-to-static processes. [#60214](https://github.com/PaddlePaddle/Paddle/pull/60214), [#60546](https://github.com/PaddlePaddle/Paddle/pull/60546), [#62082](https://github.com/PaddlePaddle/Paddle/pull/62082), [#61313](https://github.com/PaddlePaddle/Paddle/pull/61313), [#61840](https://github.com/PaddlePaddle/Paddle/pull/61840), [#60614](https://github.com/PaddlePaddle/Paddle/pull/60614), [#60234](https://github.com/PaddlePaddle/Paddle/pull/60234), [#64813](https://github.com/PaddlePaddle/Paddle/pull/64813), [#61606](https://github.com/PaddlePaddle/Paddle/pull/61606), [#63405](https://github.com/PaddlePaddle/Paddle/pull/63405), [#64334](https://github.com/PaddlePaddle/Paddle/pull/64334), [#60504](https://github.com/PaddlePaddle/Paddle/pull/60504) -- The complex kernels of PaddlePaddle PHI operator library have been further enhanced, and a total of 40+ complex kernels have been added.
[#55380](https://github.com/PaddlePaddle/Paddle/pull/55380), [#56349](https://github.com/PaddlePaddle/Paddle/pull/56349), [#56412](https://github.com/PaddlePaddle/Paddle/pull/56412), [#56323](https://github.com/PaddlePaddle/Paddle/pull/56323), [#56723](https://github.com/PaddlePaddle/Paddle/pull/56723), [#56457](https://github.com/PaddlePaddle/Paddle/pull/56457), [#56903](https://github.com/PaddlePaddle/Paddle/pull/56903)[#56914](https://github.com/PaddlePaddle/Paddle/pull/56914), [#57116](https://github.com/PaddlePaddle/Paddle/pull/57116), [#56048](https://github.com/PaddlePaddle/Paddle/pull/56048), [#57244](https://github.com/PaddlePaddle/Paddle/pull/57244), [#57639](https://github.com/PaddlePaddle/Paddle/pull/57639), [#57638](https://github.com/PaddlePaddle/Paddle/pull/57638), [#57540](https://github.com/PaddlePaddle/Paddle/pull/57540), [#58545](https://github.com/PaddlePaddle/Paddle/pull/58545), [#58336](https://github.com/PaddlePaddle/Paddle/pull/58336), [#58532](https://github.com/PaddlePaddle/Paddle/pull/58532), [#58839](https://github.com/PaddlePaddle/Paddle/pull/58839), [#59079](https://github.com/PaddlePaddle/Paddle/pull/59079), [#59277](https://github.com/PaddlePaddle/Paddle/pull/59277), [#59122](https://github.com/PaddlePaddle/Paddle/pull/59122), [#57058](https://github.com/PaddlePaddle/Paddle/pull/57058) +### Performance Optimization -- Optimized and added XPU kernels for some operators, and enhanced the support for data types such as bfloat16 on XPU kernel. 
[#54478](https://github.com/PaddlePaddle/Paddle/pull/54478), [#57740](https://github.com/PaddlePaddle/Paddle/pull/57740), [#58346](https://github.com/PaddlePaddle/Paddle/pull/58346), [#58456](https://github.com/PaddlePaddle/Paddle/pull/58456), [#58662](https://github.com/PaddlePaddle/Paddle/pull/58662), [#59066](https://github.com/PaddlePaddle/Paddle/pull/59066), [#59263](https://github.com/PaddlePaddle/Paddle/pull/59263)), [#59375](https://github.com/PaddlePaddle/Paddle/pull/59375), [#59505](https://github.com/PaddlePaddle/Paddle/pull/59505), [#59653](https://github.com/PaddlePaddle/Paddle/pull/59653), [#55001](https://github.com/PaddlePaddle/Paddle/pull/55001), [#57272](https://github.com/PaddlePaddle/Paddle/pull/57272), [#56169](https://github.com/PaddlePaddle/Paddle/pull/56169), [#59454](https://github.com/PaddlePaddle/Paddle/pull/59454), [#59480](https://github.com/PaddlePaddle/Paddle/pull/59480), [#55914](https://github.com/PaddlePaddle/Paddle/pull/55914), [#54758](https://github.com/PaddlePaddle/Paddle/pull/54758), [#54827](https://github.com/PaddlePaddle/Paddle/pull/54827), [#58364](https://github.com/PaddlePaddle/Paddle/pull/58364), [#58419](https://github.com/PaddlePaddle/Paddle/pull/58419), [#58982](https://github.com/PaddlePaddle/Paddle/pull/58982), [#57216](https://github.com/PaddlePaddle/Paddle/pull/57216), [#59166](https://github.com/PaddlePaddle/Paddle/pull/59166), [#55033](https://github.com/PaddlePaddle/Paddle/pull/55033), [#55375](https://github.com/PaddlePaddle/Paddle/pull/55375), [#58805](https://github.com/PaddlePaddle/Paddle/pull/58805), [#59389](https://github.com/PaddlePaddle/Paddle/pull/59389), [#57077](https://github.com/PaddlePaddle/Paddle/pull/57077), [#55166](https://github.com/PaddlePaddle/Paddle/pull/55166), [#56773](https://github.com/PaddlePaddle/Paddle/pull/56773) +- In order to reduce the bubble in pipeline parallel, support the reverse computation of parameter and activation splitting in backward, and add zero bubble pipeline 
scheduling strategy to improve the training performance. [#62865](https://github.com/PaddlePaddle/Paddle/pull/62865), [#62737](https://github.com/PaddlePaddle/Paddle/pull/62737), [#64534](https://github.com/PaddlePaddle/Paddle/pull/64534) +- To improve the performance of sequence parallel, perform fusion on related communication operations and computation operations, and optimize redundant transpose operations. [#64807](https://github.com/PaddlePaddle/Paddle/pull/64807), [#63948](https://github.com/PaddlePaddle/Paddle/pull/63948), [#64316](https://github.com/PaddlePaddle/Paddle/pull/64316), [#64119](https://github.com/PaddlePaddle/Paddle/pull/64119) +- Optimize the time consumption of auto parallel graph optimization for static graphs, to reduce the delay from the start of training to the completion of the first step. [#59912](https://github.com/PaddlePaddle/Paddle/pull/59912), [#61817](https://github.com/PaddlePaddle/Paddle/pull/61817), [#60022](https://github.com/PaddlePaddle/Paddle/pull/60022), [#60125](https://github.com/PaddlePaddle/Paddle/pull/60125) +- Optimize the time consumption of related communication operations in hybrid parallel scenarios. [#62157](https://github.com/PaddlePaddle/Paddle/pull/62157), [#61622](https://github.com/PaddlePaddle/Paddle/pull/61622) +- Optimize the redundant GPU memory consumption of parameters under auto parallel dynamic-to-static. [#62746](https://github.com/PaddlePaddle/Paddle/pull/62746) +- Improve the mixed precision training function of auto parallel, support the setting of local auto_cast and black/white list, support master grad function, and adapt to different parallel strategies.
[#60158](https://github.com/PaddlePaddle/Paddle/pull/60158), [#59987](https://github.com/PaddlePaddle/Paddle/pull/59987), [#62629](https://github.com/PaddlePaddle/Paddle/pull/62629), [#60385](https://github.com/PaddlePaddle/Paddle/pull/60385), [#62015](https://github.com/PaddlePaddle/Paddle/pull/62015), [#60514](https://github.com/PaddlePaddle/Paddle/pull/60514), [#61221](https://github.com/PaddlePaddle/Paddle/pull/61221), [#60779](https://github.com/PaddlePaddle/Paddle/pull/60779), [#63228](https://github.com/PaddlePaddle/Paddle/pull/63228) +- Optimize non-essential casts caused by type promotion and AMP to improve performance. [#63293](https://github.com/PaddlePaddle/Paddle/pull/63293), [#63228](https://github.com/PaddlePaddle/Paddle/pull/63228) -- Added some operators for optimizing large model training and inference performance. [#55758](https://github.com/PaddlePaddle/Paddle/pull/55758), [#54998](https://github.com/PaddlePaddle/Paddle/pull/54998), [#55400](https://github.com/PaddlePaddle/Paddle/pull/55400), [#54630](https://github.com/PaddlePaddle/Paddle/pull/54630), [#55969](https://github.com/PaddlePaddle/Paddle/pull/55969), [#55026](https://github.com/PaddlePaddle/Paddle/pull/55026), [#58986](https://github.com/PaddlePaddle/Paddle/pull/58986) +### Upgrade Static Graph Auto Parallel Architecture -- Improved mechanism of Tensor Strided in the operator library.
[#59422](https://github.com/PaddlePaddle/Paddle/pull/59422), [#59325](https://github.com/PaddlePaddle/Paddle/pull/59325), [#56863](https://github.com/PaddlePaddle/Paddle/pull/56863), [#56882](https://github.com/PaddlePaddle/Paddle/pull/56882), [#56947](https://github.com/PaddlePaddle/Paddle/pull/56947) +- Based on the new generation intermediate representation (PIR), add the new DistDialect, which natively supports DistAttr and DistTensor in the computation graph representation and binds distributed attributes directly to tensors and operators, making the auto-parallel architecture simpler and more unified. [#63828](https://github.com/PaddlePaddle/Paddle/pull/63828), [#64299](https://github.com/PaddlePaddle/Paddle/pull/64299), [#63870](https://github.com/PaddlePaddle/Paddle/pull/63870), [#64144](https://github.com/PaddlePaddle/Paddle/pull/64144), [#62524](https://github.com/PaddlePaddle/Paddle/pull/62524), [#62630](https://github.com/PaddlePaddle/Paddle/pull/62630), [#62897](https://github.com/PaddlePaddle/Paddle/pull/62897), [#60478](https://github.com/PaddlePaddle/Paddle/pull/60478), [#60574](https://github.com/PaddlePaddle/Paddle/pull/60574), [#63876](https://github.com/PaddlePaddle/Paddle/pull/63876), [#63798](https://github.com/PaddlePaddle/Paddle/pull/63798), [#62560](https://github.com/PaddlePaddle/Paddle/pull/62560), [#63676](https://github.com/PaddlePaddle/Paddle/pull/63676) +- Improve APIs such as shard_tensor, reshard, and to_static, so that users can convert the dynamic graph model networking directly into a PIR static computation graph for better performance.
[#62945](https://github.com/PaddlePaddle/Paddle/pull/62945), [#62356](https://github.com/PaddlePaddle/Paddle/pull/62356), [#60175](https://github.com/PaddlePaddle/Paddle/pull/60175), [#62654](https://github.com/PaddlePaddle/Paddle/pull/62654), [#63347](https://github.com/PaddlePaddle/Paddle/pull/63347) +- Optimize the auto-parallel graph optimization compilation process, and reduce the compilation and optimization time of static graphs by refactoring and optimizing the procedure of computation graph parallelization and communication resolution. [#64137](https://github.com/PaddlePaddle/Paddle/pull/64137), [#62201](https://github.com/PaddlePaddle/Paddle/pull/62201), [#64143](https://github.com/PaddlePaddle/Paddle/pull/64143), [#62560](https://github.com/PaddlePaddle/Paddle/pull/62560) +- Optimize the procedure of the SPMD derivation in static graphs to achieve the consistency results under dynamic-static graphs, which improves the unity and stability of the architecture. [#62659](https://github.com/PaddlePaddle/Paddle/pull/62659), [#62547](https://github.com/PaddlePaddle/Paddle/pull/62547), [#63117](https://github.com/PaddlePaddle/Paddle/pull/63117), [#63434](https://github.com/PaddlePaddle/Paddle/pull/63434), [#63770](https://github.com/PaddlePaddle/Paddle/pull/63770), [#64361](https://github.com/PaddlePaddle/Paddle/pull/64361), [#63073](https://github.com/PaddlePaddle/Paddle/pull/63073) +- Upgrade the implementation of Reshard conversion in static graphs, and use consistent conversion rules under dynamic-static graphs to ensure the consistency of the execution logic and results of tensor reshard conversion in dynamic-static graphs, so as to improve user experience. 
[#62718](https://github.com/PaddlePaddle/Paddle/pull/62718), [#62694](https://github.com/PaddlePaddle/Paddle/pull/62694), [#60215](https://github.com/PaddlePaddle/Paddle/pull/60215), [#63362](https://github.com/PaddlePaddle/Paddle/pull/63362), [#63072](https://github.com/PaddlePaddle/Paddle/pull/63072), [#63962](https://github.com/PaddlePaddle/Paddle/pull/63962), [#64223](https://github.com/PaddlePaddle/Paddle/pull/64223), [#61796](https://github.com/PaddlePaddle/Paddle/pull/61796), [#64465](https://github.com/PaddlePaddle/Paddle/pull/64465), [#64623](https://github.com/PaddlePaddle/Paddle/pull/64623), [#64418](https://github.com/PaddlePaddle/Paddle/pull/64418) -- Optimized function implementation and template function in some kernels to reduce size of complied library package. [#57083](https://github.com/PaddlePaddle/Paddle/pull/57083), [#57299](https://github.com/PaddlePaddle/Paddle/pull/57299), [#57261](https://github.com/PaddlePaddle/Paddle/pull/57261), [#57290](https://github.com/PaddlePaddle/Paddle/pull/57290), [#57118](https://github.com/PaddlePaddle/Paddle/pull/57118), [#57551](https://github.com/PaddlePaddle/Paddle/pull/57551), [#57509](https://github.com/PaddlePaddle/Paddle/pull/57509), [#57558](https://github.com/PaddlePaddle/Paddle/pull/57558), [#57064](https://github.com/PaddlePaddle/Paddle/pull/57064), [#57365](https://github.com/PaddlePaddle/Paddle/pull/57365), [#57327](https://github.com/PaddlePaddle/Paddle/pull/57327), [#57603](https://github.com/PaddlePaddle/Paddle/pull/57603), [#57671](https://github.com/PaddlePaddle/Paddle/pull/57671), [#57672](https://github.com/PaddlePaddle/Paddle/pull/57672), [#57631](https://github.com/PaddlePaddle/Paddle/pull/57631), [#57082](https://github.com/PaddlePaddle/Paddle/pull/57082), [#57721](https://github.com/PaddlePaddle/Paddle/pull/57721), [#57823](https://github.com/PaddlePaddle/Paddle/pull/57823), [#57821](https://github.com/PaddlePaddle/Paddle/pull/57821), 
[#57815](https://github.com/PaddlePaddle/Paddle/pull/57815), [#57822](https://github.com/PaddlePaddle/Paddle/pull/57822), [#57541](https://github.com/PaddlePaddle/Paddle/pull/57541), [#57817](https://github.com/PaddlePaddle/Paddle/pull/57817), [#57838](https://github.com/PaddlePaddle/Paddle/pull/57838) +### Automatic Search and Tuning of Training Strategies -#### Fixed bug +To improve the ease of use of the automatic training strategy search and tuning tool (AutoTuner), support user-defined search items, priority settings for search items, and user-configured illegal strategy combinations; comprehensively enhance the error information in runtime and post-run logs; and support AutoTuner on NPU devices. [#60101](https://github.com/PaddlePaddle/Paddle/pull/60101), [#60294](https://github.com/PaddlePaddle/Paddle/pull/60294), [#61898](https://github.com/PaddlePaddle/Paddle/pull/61898), [#60248](https://github.com/PaddlePaddle/Paddle/pull/60248), [#60417](https://github.com/PaddlePaddle/Paddle/pull/60417), [#60954](https://github.com/PaddlePaddle/Paddle/pull/60954), [#61499](https://github.com/PaddlePaddle/Paddle/pull/61499), [#62724](https://github.com/PaddlePaddle/Paddle/pull/62724), [#63693](https://github.com/PaddlePaddle/Paddle/pull/63693), [#62853](https://github.com/PaddlePaddle/Paddle/pull/62853), [#62984](https://github.com/PaddlePaddle/Paddle/pull/62984) -- Fixed some bugs with CUDA 12 adaptation of the PaddlePaddle framework.
[#54640](https://github.com/PaddlePaddle/Paddle/pull/54640), [#57820](https://github.com/PaddlePaddle/Paddle/pull/57820), [#58958](https://github.com/PaddlePaddle/Paddle/pull/58958), [#58179](https://github.com/PaddlePaddle/Paddle/pull/58179), [#55594](https://github.com/PaddlePaddle/Paddle/pull/55594) +## CUDA Training Performance Optimization -### CUDA +This upgrade improves large model training efficiency from multiple perspectives, such as operator computation efficiency, distributed communication optimization, and GPU memory optimization. -#### New features +### Function Improvements -- Added debugging class API paddle.amp.debugging.check_check_numerics. Calculated and returned number of outliers (NaN, Inf) and zero elements in this Tensor value. [#54301](https://github.com/PaddlePaddle/Paddle/pull/54301) -- Added fused_rope fusion operator to accelerate LLaMA class large model training.[#54351](https://github.com/PaddlePaddle/Paddle/pull/54351) -- Updated CUDNN Frontend API version to v0.9.1 and added fused_scale_bias_add_relu fusion operator to accelerate ResNet networks. Note this feature is in experimental period and is disabled by default. [#58367](https://github.com/PaddlePaddle/Paddle/pull/58367), [#54949](https://github.com/PaddlePaddle/Paddle/pull/54949), [#58504](https://github.com/PaddlePaddle/Paddle/pull/58504) -- Based on Flash-Attention v2, added Tensor-like Mask function support. Inverse operator supports deterministic computation for debugging. [#57276](https://github.com/PaddlePaddle/Paddle/pull/57276), [#56363](https://github.com/PaddlePaddle/Paddle/pull/56363) -- Modified sparse conv3d backend implementation to support 2d shapes, avoiding front-end reshape overhead. [#54707](https://github.com/PaddlePaddle/Paddle/pull/54707) -- Added matmul_int8 operator.
([#55228](https://github.com/PaddlePaddle/Paddle/pull/55228)) +- Enhance the FlashAttention operator function, including support for NVIDIA SM90 GPU compilation, support for Group Query Attention, support for cuDNN access, support for QKV-packed form inputs, and so on. [#59820](https://github.com/PaddlePaddle/Paddle/pull/59820),[#60776](https://github.com/PaddlePaddle/Paddle/pull/60776),[#58680](https://github.com/PaddlePaddle/Paddle/pull/58680),[#63289](https://github.com/PaddlePaddle/Paddle/pull/63289) +- In the Repeat_interleave operator, add support for BFloat16 data type. [#61854](https://github.com/PaddlePaddle/Paddle/pull/61854) +- For the issues of many interface parameters of ResNet-like models such as fused_scale_bias_add_relu, fused_scale_bias_relu_conv_bn, and fused_dconv_drelu_dbn, and the ease of use of operators, add the fuse_resunit pass, to support automatic fusion of the abovementioned operators, to achieve generic performance optimization. ([#59771](https://github.com/PaddlePaddle/Paddle/pull/59771)) -#### Function optimization +### Performance Improvement -- Optimized CUDA Graph’s support for random number operators.[#58310](https://github.com/PaddlePaddle/Paddle/pull/58310) -- Enhanced automatic mixed-precision training default functionality, including: - - Optimizing the experience of using automatic mixed precision training interface.[#58152](https://github.com/PaddlePaddle/Paddle/pull/58152),[#55364](https://github.com/PaddlePaddle/Paddle/pull/55364),[#57903](https://github.com/PaddlePaddle/Paddle/pull/57903) - - Added matrix computation class operators such as fused_attention, fused_feedforward, and fused_gemm_epilogue to framework's default whitelist, and unified default black and white list settings for dynamic and static graphs. 
[#55373](https://github.com/PaddlePaddle/Paddle/pull/55373), [#55713](https://github.com/PaddlePaddle/Paddle/pull/55713) - - The argsort, dist, erfinv, nanmedian, poisson operators and lamb optimizer operators support FP16 and BF16 low precision computing. [#51662](https://github.com/PaddlePaddle/Paddle/pull/51662), [#55105](https://github.com/PaddlePaddle/Paddle/pull/55105), [#55287](https://github.com/PaddlePaddle/Paddle/pull/55287), [#55824](https://github.com/PaddlePaddle/Paddle/pull/55824), [#56056](https://github.com/PaddlePaddle/Paddle/pull/56056), [#56184](https://github.com/PaddlePaddle/Paddle/pull/56184), [#55641](https://github.com/PaddlePaddle/Paddle/pull/55641) - - Fixed elementwise_max operator low-precision implementation. Changed to use FP32 type for numerical computing, and reduce precision loss. [#54799](https://github.com/PaddlePaddle/Paddle/pull/54799) - - Changed temporary result Tensor needed for Reduce class operator to FP32 type, to avoid precision loss caused by converting intermediate result to low precision. [#55709](https://github.com/PaddlePaddle/Paddle/pull/55709)) -- Optimized GPU codes for flip, roll & roll_grad, index_put & index_put_grad, etc. Removed unnecessary C++ templates to optimize compilation time and reduce compiled binary size without performance degradation. [#57309](https://github.com/PaddlePaddle/Paddle/pull/57309), [#57525](https://github.com/PaddlePaddle/Paddle/pull/57525) -- For the bernoulli operator, added a check on legitimacy of input probabilities. [#59174](https://github.com/PaddlePaddle/Paddle/pull/59174) +- To address the problem of large GPU memory consumption during the computation of SwiGLU activation module of the Llama models, add the SwiGLU fusion operator to save the memory consumption of intermediate variables, thus reducing the memory overhead during the training process of the large model, and reducing the recomputation to improve the performance. 
The performance of the Llama-70B model is improved by 9%. [#61508](https://github.com/PaddlePaddle/Paddle/pull/61508) +- To reduce the high share of communication time in Sequence Parallel, overlap Sequence Parallel backward communication with Matmul computation, improving the end-to-end performance of large model training by 1%~2%. [#62284](https://github.com/PaddlePaddle/Paddle/pull/62284),[#63531](https://github.com/PaddlePaddle/Paddle/pull/63531) +- To avoid the slowdown caused by dividing by nranks after sharding backward communication, fuse the backward communication with the division by nranks and support the ReduceScatter Average mode, improving the performance of large model training. [#62623](https://github.com/PaddlePaddle/Paddle/pull/62623) +- To address training speed jitter caused by input data broadcasting in tensor model parallel, remove the unnecessary synchronization between CPU and GPU in data broadcasting, keeping the training speed stable. [#60816](https://github.com/PaddlePaddle/Paddle/pull/60816) +- To address slow training caused by long P2P communication time in pipeline parallel, overlap P2P communication with forward and backward computation. The end-to-end training performance of large models is improved by 2%~3%. [#61935](https://github.com/PaddlePaddle/Paddle/pull/61935),[#62051](https://github.com/PaddlePaddle/Paddle/pull/62051) +- To address the low efficiency of bias gradient computation in the fused_linear_param_grad_add operator, optimize the bias gradient computation, improving the end-to-end training performance of large models by 0.2%.
[#63114](https://github.com/PaddlePaddle/Paddle/pull/63114) +- To address the long parameter broadcasting time after sharding backward computation ends, overlap parameter broadcasting with the computation of the next step. As a result, the end-to-end training performance of large models is improved by more than 2%. [#63945](https://github.com/PaddlePaddle/Paddle/pull/63945) +- To address the problem that gradients occupy too much GPU memory during pipeline parallel training, which slows training by introducing extra computation, implement the dynamic gradient release technique, improving the end-to-end training performance of large models by 3.4%. [#59739](https://github.com/PaddlePaddle/Paddle/pull/59739) -#### Performance optimization +### Bug Fixing -- Optimized BroadcastKernel's support for large Tensor. Change to call INT32 version implementation for multiple times for large Tensor Sharding, improving operator performance by 7.27x. [#57313](https://github.com/PaddlePaddle/Paddle/pull/57313), [#57996](https://github.com/PaddlePaddle/Paddle/pull/57996) -- Optimized performance of Tensor save interface by copying the Tensor to CPU and then converting to numpy, to avoid overhead of automatically converting the Tensor to a continuous Tensor when Tensor is not continuous. [#57040](https://github.com/PaddlePaddle/Paddle/pull/57040) +- Fix the CUDA Event resource leak in StreamSafeCUDAAllocator, which slowed down large model training. [#64621](https://github.com/PaddlePaddle/Paddle/pull/64621) +- Fix the backward computation error of the fused_rotary_position_embedding operator. [#60217](https://github.com/PaddlePaddle/Paddle/pull/60217) +- Fix the bug that custom operators cannot control the computation precision through black and white lists in AMP scenarios.
[#60052](https://github.com/PaddlePaddle/Paddle/pull/60052) +- Fix the bug that operators such as add_ and divide_, which natively support operations on different data types, perform unexpected type promotion when type promotion occurs. [#64302](https://github.com/PaddlePaddle/Paddle/pull/64302) -#### Bug Fix +## Distributed Strategy Enhancements -- Fixed bug of memmory_efficient_attention operator supporting the sm_90. [#58070](https://github.com/PaddlePaddle/Paddle/pull/58070) -- Fixed the NaN problem of softmax operator when axis=-1 and length is greater than 100000. [#57851](https://github.com/PaddlePaddle/Paddle/pull/57851) -- Fixed bug of GPU access error in some cases for set_constant operator. [#59905](https://github.com/PaddlePaddle/Paddle/pull/59905) -- Fixed GPU storage read/write contention issue in fast implementation version of layer_norm operator. [#56435](https://github.com/PaddlePaddle/Paddle/pull/56435) +This release strengthens PaddlePaddle's dynamic graph distributed computing, with functional improvements to AutoTuner and to parallel strategies such as pipeline parallel and sharding that make large model training more flexible. New features such as Flash Attention Mask significantly reduce the GPU memory usage of large model training, especially long-sequence training, and improve training performance. In addition, several bugs and potential security risks have been fixed, improving the overall stability of the system. -### Expanded Compiler Infrastructure for Neural Networks (CINN) +### Function Optimization -In this update, PaddlePaddle CINN focuses on optimization of architecture and comprehensive expansion of its capabilities.
In view of increasing demand for dynamic shapes for large models, effective operation and optimization strategies of compiler under dynamic shapes are initially explored and implemented. -At the architectural level, Python DSL is introduced, significantly improving CINN's development convenience and Debug capability and enabling developers to write and debug codes more efficiently. Meanwhile, logic of Schedule has been refactored to be dominated by GroupSchedule, enabling more general and stable optimization strategies at operator Group level. In order to enhance stability of CINN, a strong constraint component is explored and introduced. This can effectively reduce uncertainties and potential errors in the system. In addition, historical tool classes and software structure of CINN are systematically organized, optimized and improved, to further enhance readability and maintainability of codes. In terms of integration with other PaddlePaddle components, tight integration of CINN with PIR and Paddle has been further strengthened, making compiler more coherent with overall PaddlePaddle framework. This improvement not only enhances performance of the compiler, but also provides developers with a smoother and more unified development experience. +- Optimize the search space of Autotuner, which significantly improves the performance of search. [#62608](https://github.com/PaddlePaddle/Paddle/pull/62608) +- For the problem of pipeline parallel that the training may be wrong due to the checking of sending type in the eval process, add the training configuration, to skip the redundant receiving check of pipelined sending, featuring higher flexibility and better performance. [#63001](https://github.com/PaddlePaddle/Paddle/pull/63001) +- In the dynamic graph pipeline parallel, add the checking of the size and type of the sent and received data, and add the error message, making the robustness and debuggability better. 
[#59405](https://github.com/PaddlePaddle/Paddle/pull/59405) +- Support multiple loss functions that return multiple losses in the dynamic graph pipeline, improving its flexibility. [#63167](https://github.com/PaddlePaddle/Paddle/pull/63167) +- In the dynamic graph pipeline, add a pipeline cache clearing configuration option that clears the send and receive caches in time, to better support dynamic batch size training. [#62277](https://github.com/PaddlePaddle/Paddle/pull/62277) +- To address the problem that the sharding stage3 strategy cannot be aligned bit by bit, replace the unordered set with OrderedSet to avoid errors caused by the accumulation order, so that results align bit by bit. [#60085](https://github.com/PaddlePaddle/Paddle/pull/60085) +- To further reduce GPU memory usage in sequence parallel, add a method that recomputes allgather, reducing the memory held by allgather activations. [#64244](https://github.com/PaddlePaddle/Paddle/pull/64244) -#### Compatibility upgrade +### New Features for Dynamic Graphs -- Updated storage read interface to be compatible with Paddle 2.0. [#55836](https://github.com/PaddlePaddle/Paddle/pull/55836) -- Updated relu6 Op Mapper compatibility. [#55611](https://github.com/PaddlePaddle/Paddle/pull/55611) +- Add a refined recompute search dimension to the autotuner search space, making search results more accurate and lowering the threshold for model tuning. [#62430](https://github.com/PaddlePaddle/Paddle/pull/62430) +- To remove the limit on training batch size in virtual pipeline parallel, modify the pipeline scheduling method so that the batch size can be set flexibly.
[#61561](https://github.com/PaddlePaddle/Paddle/pull/61561),[#60314](https://github.com/PaddlePaddle/Paddle/pull/60134) +- When flash attention is used with a mask, the GPU memory occupied by the mask grows quadratically with sequence length and performance is low. Use a sparse mask to reduce the memory complexity of the mask from quadratic to linear in sequence length, which also reduces the number of memory accesses. In addition, use shared memory to accelerate memory access, greatly improving performance. [#62029](https://github.com/PaddlePaddle/Paddle/pull/62029) +- In the dynamic graph sharding parallel strategy, improve the communication and computation overlap function to improve training performance. [#60455](https://github.com/PaddlePaddle/Paddle/pull/60455) -#### Modification deprecation +### Communication Library Function Optimization -- Removed old Schedule form. [#55566](https://github.com/PaddlePaddle/Paddle/pull/55566),[#55391](https://github.com/PaddlePaddle/Paddle/pull/55391) -- Removed some obsolete tests. [#56245](https://github.com/PaddlePaddle/Paddle/pull/56245),[#57987](https://github.com/PaddlePaddle/Paddle/pull/57987) -- Removed the remove_nested_block Visitor tool that no longer works. [#56972](https://github.com/PaddlePaddle/Paddle/pull/56972) -- Removed other useless codes. [#55413](https://github.com/PaddlePaddle/Paddle/pull/55413) +- Enhance the NCCL communication library to support initializing customized NCCL libraries by passing additional parameters during initialization. [#62193](https://github.com/PaddlePaddle/Paddle/pull/62193) +- Add the NCCL library path search function to support more flexible NCCL library search methods.
[#62492](https://github.com/PaddlePaddle/Paddle/pull/62492) -#### New features +### Bug Fixing -- Added CINN paddle.framework.core.is_run_with_cinn() API on the PaddlePaddle side. [#54355](https://github.com/PaddlePaddle/Paddle/pull/54355) -- Added CINN related operator logics, including various combinatorial operator’s disassembly logic. [#56072](https://github.com/PaddlePaddle/Paddle/pull/56072),[#58210](https://github.com/PaddlePaddle/Paddle/pull/58210),[#58502](https://github.com/PaddlePaddle/Paddle/pull/58502), [#58591](https://github.com/PaddlePaddle/Paddle/pull/58591), [#58981](https://github.com/PaddlePaddle/Paddle/pull/58981), [#59135](https://github.com/PaddlePaddle/Paddle/pull/59135), [#59274](https://github.com/PaddlePaddle/Paddle/pull/59274), [#59306](https://github.com/PaddlePaddle/Paddle/pull/59306), [#59202](https://github.com/PaddlePaddle/Paddle/pull/59202), [#59176](https://github.com/PaddlePaddle/Paddle/pull/59176), [#59534](https://github.com/PaddlePaddle/Paddle/pull/59534), [#59713](https://github.com/PaddlePaddle/Paddle/pull/59713), [#59798](https://github.com/PaddlePaddle/Paddle/pull/59798); Supports bf16, amp and other forms [#54399](https://github.com/PaddlePaddle/Paddle/pull/54399), [#54368](https://github.com/PaddlePaddle/Paddle/pull/54368), [#54608](https://github.com/PaddlePaddle/Paddle/pull/54608); Supports operator zero-dimensional capability [#54892](https://github.com/PaddlePaddle/Paddle/pull/54892), [#54919](https://github.com/PaddlePaddle/Paddle/pull/54919), [#54907](https://github.com/PaddlePaddle/Paddle/pull/54907), [#54966](https://github.com/PaddlePaddle/Paddle/pull/54966) -- Supports CINN and PaddlePaddle PIR, and combinator operator junction operation mode, so new PIR and CINN operation is integrated. 
[#54732](https://github.com/PaddlePaddle/Paddle/pull/54732), [#56074](https://github.com/PaddlePaddle/Paddle/pull/56074), [#58216](https://github.com/PaddlePaddle/Paddle/pull/58216), [#55680](https://github.com/PaddlePaddle/Paddle/pull/55680), [#56302](https://github.com/PaddlePaddle/Paddle/pull/56302), [#59037](https://github.com/PaddlePaddle/Paddle/pull/59037), [#55186](https://github.com/PaddlePaddle/Paddle/pull/55186), [#58641](https://github.com/PaddlePaddle/Paddle/pull/58641) -- There are strongly constrained components to stabilize CINN changes. [#58719](https://github.com/PaddlePaddle/Paddle/pull/58719), [#59309](https://github.com/PaddlePaddle/Paddle/pull/59309), [#58993](https://github.com/PaddlePaddle/Paddle/pull/58993) -- Added Group Schedule related CINN architecture process. [#58399](https://github.com/PaddlePaddle/Paddle/pull/58399), [#56444](https://github.com/PaddlePaddle/Paddle/pull/56444) -- Added CUTLASS, error handling, and NVRTC Cubin Fmad options to CINN architecture functions preliminarily. [#58079](https://github.com/PaddlePaddle/Paddle/pull/58079), [#57198](https://github.com/PaddlePaddle/Paddle/pull/57198), [#58794](https://github.com/PaddlePaddle/Paddle/pull/58794) -- Added Python interface language for CINN. 
[#57731](https://github.com/PaddlePaddle/Paddle/pull/57731), [#57515](https://github.com/PaddlePaddle/Paddle/pull/57515), [#57644](https://github.com/PaddlePaddle/Paddle/pull/57644), [#57981](https://github.com/PaddlePaddle/Paddle/pull/57981), [#58009](https://github.com/PaddlePaddle/Paddle/pull/58009) -- Added dynamic shape functionality for CINN to cover ASTGen to generate dynamic shape symbols, to replace the ISL to generate dynamic shape signals [#56360](https://github.com/PaddlePaddle/Paddle/pull/56360), [#57207](https://github.com/PaddlePaddle/Paddle/pull/57207), [#57454](https://github.com/PaddlePaddle/Paddle/pull/57454); Added Bucket Conditional Compilation functionality [#59165](https://github.com/PaddlePaddle/Paddle/pull/59165); Added Schedule, Device, and IR level support for dynamic shape [#58988](https://github.com/PaddlePaddle/Paddle/pull/58988), [#59493](https://github.com/PaddlePaddle/Paddle/pull/59493), [#58717](https://github.com/PaddlePaddle/Paddle/pull/58717), [#58602](https://github.com/PaddlePaddle/Paddle/pull/58602), [#59196](https://github.com/PaddlePaddle/Paddle/pull/59196) -- Supports CINN Group Schedule operator – at Group level, perform more general and stable Schedule optimization. [#56122](https://github.com/PaddlePaddle/Paddle/pull/56122), [#57777](https://github.com/PaddlePaddle/Paddle/pull/57777), [#57569](https://github.com/PaddlePaddle/Paddle/pull/57569) +- Fix the dbias_out memory allocation problem in the fused_linear_param_grad_add_kernel operator, and add gradient address checking logic to make the error message easier to debug. [#63433](https://github.com/PaddlePaddle/Paddle/pull/63433),[#64460](https://github.com/PaddlePaddle/Paddle/pull/64460) +- Fix the problem that, with reduce_avg support, the sharding strategy does not scale the gradient when comm_overlap is turned off.
[#62702](https://github.com/PaddlePaddle/Paddle/pull/62702) +- Fix the bug related to fusion in the calculation order of main grad in Stage2. [#59142](https://github.com/PaddlePaddle/Paddle/pull/59142) +- Fix the bug that the switch attribute cannot be found when reduce_avg communication operation is turned on under the sharding strategy. [#62502](https://github.com/PaddlePaddle/Paddle/pull/62502) +- Fix the problem of setting stop_gradient=True for some parameters when Sharding stage1 training supports non-training parameter training. [#62616](https://github.com/PaddlePaddle/Paddle/pull/62616) +- Fix the bug of message printing when TCP is turned off, to prevent misleading users. [#62631](https://github.com/PaddlePaddle/Paddle/pull/62631) +- Fix the DataParallel training problem and solve multi-card training error when some gradients are not initialized and segmentation fault error occurs in data parallel training. [#62299](https://github.com/PaddlePaddle/Paddle/pull/62299) +- For the scenario of turning on sequence parallel, fix the bug caused by weight freezing in some models. [#63596](https://github.com/PaddlePaddle/Paddle/pull/63596) +- Fix some bugs for autotuner scenarios with single dp. [#60757](https://github.com/PaddlePaddle/Paddle/pull/60757) +- Fix aadiff bug of streaming parallel strategy. ([#64716](https://github.com/PaddlePaddle/Paddle/pull/64716)) +- Remove some distributed unit tests. ([#62762](https://github.com/PaddlePaddle/Paddle/pull/62762)) -#### Function optimization +### Security Risk Fixing -- Enriched or improved operator functionality, including improvements to various operator processes such as Repair Reverse, FP16, Infershape, Operator Single Test, etc. 
[#56320](https://github.com/PaddlePaddle/Paddle/pull/56320), [#56845](https://github.com/PaddlePaddle/Paddle/pull/56845), [#54939](https://github.com/PaddlePaddle/Paddle/pull/54939),[#54378](https://github.com/PaddlePaddle/Paddle/pull/54378),[#55321](https://github.com/PaddlePaddle/Paddle/pull/55321),[#55336](https://github.com/PaddlePaddle/Paddle/pull/55336),[#55337](https://github.com/PaddlePaddle/Paddle/pull/55337),[#55442](https://github.com/PaddlePaddle/Paddle/pull/55442),[#55470](https://github.com/PaddlePaddle/Paddle/pull/55470),[#55489](https://github.com/PaddlePaddle/Paddle/pull/55489),[#55510](https://github.com/PaddlePaddle/Paddle/pull/55510),[#55547](https://github.com/PaddlePaddle/Paddle/pull/55547),[#55505](https://github.com/PaddlePaddle/Paddle/pull/55505),[#55563](https://github.com/PaddlePaddle/Paddle/pull/55563),[#54280](https://github.com/PaddlePaddle/Paddle/pull/54280),[#59650](https://github.com/PaddlePaddle/Paddle/pull/59650),[#54862](https://github.com/PaddlePaddle/Paddle/pull/54862),[#55135](https://github.com/PaddlePaddle/Paddle/pull/55135),[#55292](https://github.com/PaddlePaddle/Paddle/pull/55292),[#55333](https://github.com/PaddlePaddle/Paddle/pull/55333),[#55316](https://github.com/PaddlePaddle/Paddle/pull/55316),[#55379](https://github.com/PaddlePaddle/Paddle/pull/55379),[#55326](https://github.com/PaddlePaddle/Paddle/pull/55326) -- Improved CINN, PaddlePaddle, PIR, combinator operator junction operation, including various and PIR and its actuator interface and CINN mutual support. 
[#59170](https://github.com/PaddlePaddle/Paddle/pull/59170),[#58766](https://github.com/PaddlePaddle/Paddle/pull/58766),[#59255](https://github.com/PaddlePaddle/Paddle/pull/59255),[#59203](https://github.com/PaddlePaddle/Paddle/pull/59203),[#59024](https://github.com/PaddlePaddle/Paddle/pull/59024),[#57829](https://github.com/PaddlePaddle/Paddle/pull/57829),[#58135](https://github.com/PaddlePaddle/Paddle/pull/58135),[#58193](https://github.com/PaddlePaddle/Paddle/pull/58193),[#58207](https://github.com/PaddlePaddle/Paddle/pull/58207),[#58606](https://github.com/PaddlePaddle/Paddle/pull/58606),[#59437](https://github.com/PaddlePaddle/Paddle/pull/59437),[#59759](https://github.com/PaddlePaddle/Paddle/pull/59759),[#55075](https://github.com/PaddlePaddle/Paddle/pull/55075),[#56805](https://github.com/PaddlePaddle/Paddle/pull/56805),[#57764](https://github.com/PaddlePaddle/Paddle/pull/57764),[#58620](https://github.com/PaddlePaddle/Paddle/pull/58620),[#59769](https://github.com/PaddlePaddle/Paddle/pull/59769),[#58702](https://github.com/PaddlePaddle/Paddle/pull/58702),[#58749](https://github.com/PaddlePaddle/Paddle/pull/58749),[#59025](https://github.com/PaddlePaddle/Paddle/pull/59025),[#58820](https://github.com/PaddlePaddle/Paddle/pull/58820),[#58908](https://github.com/PaddlePaddle/Paddle/pull/58908),[#58169](https://github.com/PaddlePaddle/Paddle/pull/58169) -- There are strongly constrained components to stabilize CINN changes. [#55090](https://github.com/PaddlePaddle/Paddle/pull/55090),[#55705](https://github.com/PaddlePaddle/Paddle/pull/55705),[#57587](https://github.com/PaddlePaddle/Paddle/pull/57587),[#59501](https://github.com/PaddlePaddle/Paddle/pull/59501) -- Improved CINN IR and related tool codes. 
[#55145](https://github.com/PaddlePaddle/Paddle/pull/55145),[#55955](https://github.com/PaddlePaddle/Paddle/pull/55955),[#56307](https://github.com/PaddlePaddle/Paddle/pull/56307),[#55519](https://github.com/PaddlePaddle/Paddle/pull/55519),[#56958](https://github.com/PaddlePaddle/Paddle/pull/56958),[#57019](https://github.com/PaddlePaddle/Paddle/pull/57019),[#57230](https://github.com/PaddlePaddle/Paddle/pull/57230),[#57531](https://github.com/PaddlePaddle/Paddle/pull/57531),[#57532](https://github.com/PaddlePaddle/Paddle/pull/57532),[#57524](https://github.com/PaddlePaddle/Paddle/pull/57524),[#58770](https://github.com/PaddlePaddle/Paddle/pull/58770),[#59337](https://github.com/PaddlePaddle/Paddle/pull/59337),[#59096](https://github.com/PaddlePaddle/Paddle/pull/59096),[#56274](https://github.com/PaddlePaddle/Paddle/pull/56274),[#56350](https://github.com/PaddlePaddle/Paddle/pull/56350),[#57312](https://github.com/PaddlePaddle/Paddle/pull/57312),[#55171](https://github.com/PaddlePaddle/Paddle/pull/55171) -- Supports CINN Group Schedule operator – at Group level, perform more general and stable Schedule optimization. [#54982](https://github.com/PaddlePaddle/Paddle/pull/54982),[#57963](https://github.com/PaddlePaddle/Paddle/pull/57963),[#58220](https://github.com/PaddlePaddle/Paddle/pull/58220),[#55484](https://github.com/PaddlePaddle/Paddle/pull/55484),[#55935](https://github.com/PaddlePaddle/Paddle/pull/55935),[#55590](https://github.com/PaddlePaddle/Paddle/pull/55590),[#56530](https://github.com/PaddlePaddle/Paddle/pull/56530),[#58344](https://github.com/PaddlePaddle/Paddle/pull/58344),[#59810](https://github.com/PaddlePaddle/Paddle/pull/59810) -- CINN architectural improvements, including parallel compilation, low-level storage allocation method, print information, Group structure, Pass structure, etc. 
[#56282](https://github.com/PaddlePaddle/Paddle/pull/56282), [#59014](https://github.com/PaddlePaddle/Paddle/pull/59014),[#59209](https://github.com/PaddlePaddle/Paddle/pull/59209),[#52660](https://github.com/PaddlePaddle/Paddle/pull/52660),[#54749](https://github.com/PaddlePaddle/Paddle/pull/54749),[#58694](https://github.com/PaddlePaddle/Paddle/pull/58694),[#58940](https://github.com/PaddlePaddle/Paddle/pull/58940),[#59504](https://github.com/PaddlePaddle/Paddle/pull/59504),[#56123](https://github.com/PaddlePaddle/Paddle/pull/56123) -- Improved CINN codegen, jit instruction, dim args, and host kernel to support dynamic shape. [#58825](https://github.com/PaddlePaddle/Paddle/pull/58825),[#59395](https://github.com/PaddlePaddle/Paddle/pull/59395),[#59398](https://github.com/PaddlePaddle/Paddle/pull/59398),[#59540](https://github.com/PaddlePaddle/Paddle/pull/59540),[#59470](https://github.com/PaddlePaddle/Paddle/pull/59470),[#59640](https://github.com/PaddlePaddle/Paddle/pull/59640) -- CINN error reporting optimization. [#54983](https://github.com/PaddlePaddle/Paddle/pull/54983),[#55544](https://github.com/PaddlePaddle/Paddle/pull/55544) -- Improved cleanup of CINN codes, including CI, file paths, C++17, Flags, third-party libraries, Docker, etc. [#55018](https://github.com/PaddlePaddle/Paddle/pull/55018),[#55121](https://github.com/PaddlePaddle/Paddle/pull/55121),[#55009](https://github.com/PaddlePaddle/Paddle/pull/55009),[#55888](https://github.com/PaddlePaddle/Paddle/pull/55888),[#56168](https://github.com/PaddlePaddle/Paddle/pull/56168),[#56192](https://github.com/PaddlePaddle/Paddle/pull/56192),[#56896](https://github.com/PaddlePaddle/Paddle/pull/56896),[#53861](https://github.com/PaddlePaddle/Paddle/pull/53861),[#55208](https://github.com/PaddlePaddle/Paddle/pull/55208) +- Fix security vulnerability against security leakage risk in prune_by_memory_estimation operator. 
[#61320](https://github.com/PaddlePaddle/Paddle/pull/61320) -#### Performance optimization +## Parameter Server -- Fusion of vit attention. [#54139](https://github.com/PaddlePaddle/Paddle/pull/54139) -- Optimized block reduce. [#58196](https://github.com/PaddlePaddle/Paddle/pull/58196) +This update mainly fixes several bugs in the process of using the parameter server as well as compilation and installation issues. -#### Fixed bug +### Bug Fixing -- Fixed operator-related bugs. [#56280](https://github.com/PaddlePaddle/Paddle/pull/56280),[#57767](https://github.com/PaddlePaddle/Paddle/pull/57767),[#58406](https://github.com/PaddlePaddle/Paddle/pull/58406),[#54406](https://github.com/PaddlePaddle/Paddle/pull/54406),[#54494](https://github.com/PaddlePaddle/Paddle/pull/54494),[#54751](https://github.com/PaddlePaddle/Paddle/pull/54751),[#55674](https://github.com/PaddlePaddle/Paddle/pull/55674),[#55684](https://github.com/PaddlePaddle/Paddle/pull/55684),[#55683](https://github.com/PaddlePaddle/Paddle/pull/55683),[#57798](https://github.com/PaddlePaddle/Paddle/pull/57798),[#57816](https://github.com/PaddlePaddle/Paddle/pull/57816),[#57687](https://github.com/PaddlePaddle/Paddle/pull/57687),[#56719](https://github.com/PaddlePaddle/Paddle/pull/56719),[#59756](https://github.com/PaddlePaddle/Paddle/pull/59756),[#59770](https://github.com/PaddlePaddle/Paddle/pull/59770),[#58811](https://github.com/PaddlePaddle/Paddle/pull/58811) -- Fixed process architecture-related bugs. 
[#54899](https://github.com/PaddlePaddle/Paddle/pull/54899),[#59737](https://github.com/PaddlePaddle/Paddle/pull/59737),[#59356](https://github.com/PaddlePaddle/Paddle/pull/59356),[#56105](https://github.com/PaddlePaddle/Paddle/pull/56105),[#56662](https://github.com/PaddlePaddle/Paddle/pull/56662),[#58146](https://github.com/PaddlePaddle/Paddle/pull/58146),[#58910](https://github.com/PaddlePaddle/Paddle/pull/58910),[#58121](https://github.com/PaddlePaddle/Paddle/pull/58121),[#58943](https://github.com/PaddlePaddle/Paddle/pull/58943),[#58886](https://github.com/PaddlePaddle/Paddle/pull/58886),[#59642](https://github.com/PaddlePaddle/Paddle/pull/59642),[#56164](https://github.com/PaddlePaddle/Paddle/pull/56164),[#56338](https://github.com/PaddlePaddle/Paddle/pull/56338),[#56966](https://github.com/PaddlePaddle/Paddle/pull/56966),[#59112](https://github.com/PaddlePaddle/Paddle/pull/59112),[#55820](https://github.com/PaddlePaddle/Paddle/pull/55820),[#56660](https://github.com/PaddlePaddle/Paddle/pull/56660),[#57307](https://github.com/PaddlePaddle/Paddle/pull/57307),[#57530](https://github.com/PaddlePaddle/Paddle/pull/57530),[#58236](https://github.com/PaddlePaddle/Paddle/pull/58236),[#55190](https://github.com/PaddlePaddle/Paddle/pull/55190),[#55043](https://github.com/PaddlePaddle/Paddle/pull/55043),[#55667](https://github.com/PaddlePaddle/Paddle/pull/55667) -- Other bugs. 
[#57239](https://github.com/PaddlePaddle/Paddle/pull/57239),[#55530](https://github.com/PaddlePaddle/Paddle/pull/55530),[#56605](https://github.com/PaddlePaddle/Paddle/pull/56605),[#58243](https://github.com/PaddlePaddle/Paddle/pull/58243),[#58197](https://github.com/PaddlePaddle/Paddle/pull/58197),[#58197](https://github.com/PaddlePaddle/Paddle/pull/58197),[#56086](https://github.com/PaddlePaddle/Paddle/pull/56086),[#56065](https://github.com/PaddlePaddle/Paddle/pull/56065),[#58775](https://github.com/PaddlePaddle/Paddle/pull/58775),[#54750](https://github.com/PaddlePaddle/Paddle/pull/54750),[#58595](https://github.com/PaddlePaddle/Paddle/pull/58595),[#58873](https://github.com/PaddlePaddle/Paddle/pull/58873) +- For the problem of reading and writing out of bounds of the unique operator, fix the problem of setting the wrong length in the calculation process of the unique operator to ensure the correctness of the operation of the unique operator. [#60840](https://github.com/PaddlePaddle/Paddle/pull/60840) +- Fixed some bugs in PGLBox save/load and compilation process to ensure the correctness of PGLBox function in response to the lack of save/load function and compilation error in PGLBox training process. [#63905](https://github.com/PaddlePaddle/Paddle/pull/63905) +- Fix the setting value of use_ps_gpu in CPUPS to ensure the correctness of the CPUPS training process, in response to the problem that the CPUPS training process triggers the GPUPS logic and causes the training to crash. [#61406](https://github.com/PaddlePaddle/Paddle/pull/61406) +- For the problem that the cudaErrorInvalidResourceHandle error occurs in GPUPS training in CUDA 12.3, add the device id switching mechanism, to ensure that the corresponding resource operation is carried out on the correct device. 
[#63391](https://github.com/PaddlePaddle/Paddle/pull/63391) +- To address garbled output in the PGLBox Embedding Dump process, fix the improper use of C++ std::string, ensuring the correctness of Embedding Dump results. [#65179](https://github.com/PaddlePaddle/Paddle/pull/65179) -#### Documentation +### Documentation Improvement -- Added README file. [#58349](https://github.com/PaddlePaddle/Paddle/pull/58349) +- Add security warnings to the RPC interface documentation, to remind users to use this interface only under secure network conditions. [#64100](https://github.com/PaddlePaddle/Paddle/pull/64100) -## 4. Deployment Direction (Paddle Inference) +### Security Enhancement -### General inference optimization +- Fix several code security issues to prevent malicious code injection. [#60023](https://github.com/PaddlePaddle/Paddle/pull/60023),[#60544](https://github.com/PaddlePaddle/Paddle/pull/60544),[#60615](https://github.com/PaddlePaddle/Paddle/pull/60615) + +## Inference Deployment + +The inference framework upgrades its PASS infrastructure to PIR on GPU, XPU, and CPU hardware, significantly reducing the amount of code compared with the previous version and improving development efficiency. The underlying executor is upgraded to the new asynchronous executor, improving inference performance on most models. Inference acceleration based on the CINN compiler is also integrated. Each of these features has a switch that users can turn on through settings. In addition, when running mixed inference with TensorRT subgraphs, Paddle Inference natively supports loading the optimized serialized model directly, reducing startup time. For Paddle-TensorRT, new interfaces flexibly control node computation precision and whether a subgraph enters TensorRT computation, which makes debugging more convenient.
For performance optimization, more Transformer and LLM acceleration fusion operators are added for GPU, XPU, and CPU, such as the group attention mechanism fusion operator, GQA structure support, and WINT4, all of which can be matched automatically by PASS. -This version of the upgrade improves performance and ease-of-use of the inference engine on GPU and CPU, reducing user cost and application cost of online inference. On GPU: A high-performance multi-threaded asynchronous executor is supported, and inference performance of each model is improved by 5%~10%. The new version of TensorRT and BF16 inference capabilities are also supported, and TensorRT inference performance and ease of use are further improved. On CPU: The latest version of OneDNN high-performance inference is supported. SwinTransformer, FastRCNN and other series of models have greatly improved performance. +### New Features -- matmul supports transpose and broadcast operations. [#56827](https://github.com/PaddlePaddle/Paddle/pull/56827) -- TruncatedNormal and Assign supports FP64 data types. [#57507](https://github.com/PaddlePaddle/Paddle/pull/57507) -- Supports conv2d explicit quantized inference. [#57160](https://github.com/PaddlePaddle/Paddle/pull/57160),[#58015](https://github.com/PaddlePaddle/Paddle/pull/58015) -- Added conv_fuse_pass. Support conv + bn fusion. The conv2d_fusion is renamed fused_conv2d_add_act. [#58724](https://github.com/PaddlePaddle/Paddle/pull/58724),[#55374](https://github.com/PaddlePaddle/Paddle/pull/55374),[#54477](https://github.com/PaddlePaddle/Paddle/pull/54477),[#59431](https://github.com/PaddlePaddle/Paddle/pull/59431) -- Mixed precision inference supports OP whitelisting. [#56535](https://github.com/PaddlePaddle/Paddle/pull/56535) -- OneDNN optimization is enabled by default. Supports SwinTransformer, FastRCNNd and other inference optimizations.
[#58560](https://github.com/PaddlePaddle/Paddle/pull/58560),[#59394](https://github.com/PaddlePaddle/Paddle/pull/59394),[#59421](https://github.com/PaddlePaddle/Paddle/pull/59421),[#58435](https://github.com/PaddlePaddle/Paddle/pull/58435),[#58488](https://github.com/PaddlePaddle/Paddle/pull/58488),[#59259](https://github.com/PaddlePaddle/Paddle/pull/59259),[#56303](https://github.com/PaddlePaddle/Paddle/pull/56303),[#56782](https://github.com/PaddlePaddle/Paddle/pull/56782),[#57598](https://github.com/PaddlePaddle/Paddle/pull/57598),[#58361](https://github.com/PaddlePaddle/Paddle/pull/58361),[#59641](https://github.com/PaddlePaddle/Paddle/pull/59641),[#59527](https://github.com/PaddlePaddle/Paddle/pull/59527),[#59663](https://github.com/PaddlePaddle/Paddle/pull/59663),[#59744](https://github.com/PaddlePaddle/Paddle/pull/59744) -- Added share_data and support for pass in specified data. [#57933](https://github.com/PaddlePaddle/Paddle/pull/57933) - -### Large model inference optimized - -The fine-grained fusion inference optimization of generative large models is realized. Optimization solution ensures high-performance inference capability and excellent expandability. Users can flexibly utilize various fine-grained fusion operators and PaddlePaddle native operators to build a network structure of generative large models in free combinations as required, thus achieving efficient and low-cost inference. In addition, our solution also supports mainstream generative large model structure, significantly reducing deployment cost of inference for such models and strongly supports efficient and low-cost implementation of generative large models. +- Paddle-TensorRT + - The API called at the underlying of Paddle-TensorRT is upgraded. When the version of TensorRT is later than 8.5, the EnqueueV2 API called (which will be deprecated in the future) is upgraded to the EnqueueV3 API. 
[#60807](https://github.com/PaddlePaddle/Paddle/pull/60807) + - Add config.exp_disable_tensorrt_subgraph() to keep specified subgraphs out of TensorRT. [#61967](https://github.com/PaddlePaddle/Paddle/pull/61967) + - Add config.exp_disable_tensorrt_dynamic_shape_ops() to keep operators with dynamic shape inputs out of TensorRT. The default value is False. [#62352](https://github.com/PaddlePaddle/Paddle/pull/62352) + - Add config.exp_specify_tensorrt_subgraph_precision() to run specified nodes with different precision types. [#62402](https://github.com/PaddlePaddle/Paddle/pull/62402) +- Add a switch to turn on the CINN compiler in inference. When configuring the inference config, turn on CINN through config.enable_cinn(). [#61949](https://github.com/PaddlePaddle/Paddle/pull/61949) +- Upgrade the way PIR is used in inference + - Add the enable_new_ir() interface to the config to enable PIR. [#61968](https://github.com/PaddlePaddle/Paddle/pull/61968) + - Add the set_optimization_level() interface to the config to set different optimization levels. [#61968](https://github.com/PaddlePaddle/Paddle/pull/61968) + - Under the PIR mechanism, the PASS function supports custom C++ PASS. [#62468](https://github.com/PaddlePaddle/Paddle/pull/62468) + - The inference library exposes PIR-related implementation header files, so that users can do secondary development based on PIR, such as custom Pass development. [#61863](https://github.com/PaddlePaddle/Paddle/pull/61863),[#62293](https://github.com/PaddlePaddle/Paddle/pull/62293) + - Under the PIR mechanism, the Predictor supports registering hooks on operator inputs and outputs. [#63101](https://github.com/PaddlePaddle/Paddle/pull/63101) +- The multi-layer Transformer fusion operator fused_multi_transformer_op supports GQA calculation.
[#64125](https://github.com/PaddlePaddle/Paddle/pull/64125) + +### Function Improvements + +- The inference supports loading optimized models directly, making it possible to skip IR optimization altogether. The deployment in this way can minimize framework overhead. [#61598](https://github.com/PaddlePaddle/Paddle/pull/61598) +- Re-specify the shape range information file when loading the saved IR PASS optimized model inference. [#60457](https://github.com/PaddlePaddle/Paddle/pull/60457) +- Collect the Shape information within the subgraph of the control flow operator, supporting the use of Paddle-TensorRT inference acceleration. [#60451](https://github.com/PaddlePaddle/Paddle/pull/60451) ,[#59588](https://github.com/PaddlePaddle/Paddle/pull/59588) +- The mixed-precision PASS (auto_mixed_precision_pass) for GPU-native inference supports the handling of sparse Tensor. [#62656](https://github.com/PaddlePaddle/Paddle/pull/62656) +- XPU hardware related function + - XPU's fused PASS for Conv and FC supports conversion from Float to INT31 type. [#59981](https://github.com/PaddlePaddle/Paddle/pull/59981) + - XPU's strided slice operator supports the setting of strides non-negative. [#62268](https://github.com/PaddlePaddle/Paddle/pull/62268) + - XPU's multi-layer Encoder fusion PASS is adaptive to sequence length and supports variable length. [#63825](https://github.com/PaddlePaddle/Paddle/pull/63825) +- Paddle TensorRT INT8 computation mode supports tile operator into TensorRT computation, to improve INT8 performance of some models. [#60189](https://github.com/PaddlePaddle/Paddle/pull/60189) + +### Model Compression + +Fix bugs and optimize functions mainly for Post Training Quantization (PTQ) and Quantization Aware Training (QAT). + +- Support the simulation quantization grouped by channel. [#61828](https://github.com/PaddlePaddle/Paddle/pull/61828) +- Support automatic saving of quantization scale to model parameter file under dynamic graphs. 
[#59441](https://github.com/PaddlePaddle/Paddle/pull/59441) +- Remove the restriction that the dataloader must be a DataLoader instance. [#61798](https://github.com/PaddlePaddle/Paddle/pull/61798) + +### Performance Optimization + +- Upgrade the inference executor to reduce the video memory usage at runtime while keeping the performance unchanged. This can be used through config.enable_use_executor(True). [#57920](https://github.com/PaddlePaddle/Paddle/pull/57920),[#58452](https://github.com/PaddlePaddle/Paddle/pull/58452),[#63350](https://github.com/PaddlePaddle/Paddle/pull/63350),[#64466](https://github.com/PaddlePaddle/Paddle/pull/64466) +- Upgrade oneDNN version of paddle inference to v3.4. Its overall performance has been improved compared with v3.3. [#64661](https://github.com/PaddlePaddle/Paddle/pull/64661) +- Upgrade the CUTLASS-based support for matrix multiplication and activation fusion calculation. ([#61925](https://github.com/PaddlePaddle/Paddle/pull/61925)) + +#### Add generic PASS in PIR mechanism + +- Add identity_op_clean_pass and matmul_scale_fuse_pass. [#59840](https://github.com/PaddlePaddle/Paddle/pull/59840) +- Add fused_flash_attn_pass. The pass can call flash_attention to replace the original attentions computation. [#64213](https://github.com/PaddlePaddle/Paddle/pull/64213),[#64707](https://github.com/PaddlePaddle/Paddle/pull/64707),[#63304](https://github.com/PaddlePaddle/Paddle/pull/63304) +- In the inference PIR new architecture, upgrade layout adjustment algorithm, support the NHWC inference of conv class and norm class. The performance tested on SD models is significantly improved. 
[#63628](https://github.com/PaddlePaddle/Paddle/pull/63628),[#64634](https://github.com/PaddlePaddle/Paddle/pull/64634),[#64658](https://github.com/PaddlePaddle/Paddle/pull/64658),[#64708](https://github.com/PaddlePaddle/Paddle/pull/64708),[#64830](https://github.com/PaddlePaddle/Paddle/pull/64830),[#64896](https://github.com/PaddlePaddle/Paddle/pull/64896) +- Add remove_redundant_transpose PASS. [#63357](https://github.com/PaddlePaddle/Paddle/pull/63357) +- Enable CSE PASS in inference to improve inference performance. [#64523](https://github.com/PaddlePaddle/Paddle/pull/64523) + +#### GPU Performance Optimizations + +Includes new fusion operators and new PASSes under the PIR mechanism. + +- Optimize the performance of the sparse convolution operator (sparse conv) to improve the inference performance of BEV and other models. [#63067](https://github.com/PaddlePaddle/Paddle/pull/63067) +- Add the fusion PASS based on flash attention. [#63220](https://github.com/PaddlePaddle/Paddle/pull/63220) +- Inference supports the elementwise_add + group_norm + silu activation operator fusion pattern and its corresponding fusion kernel. [#64199](https://github.com/PaddlePaddle/Paddle/pull/64199) +- Matrix multiplication supports groupwise weight-only INT4 computation. [#60422](https://github.com/PaddlePaddle/Paddle/pull/60422),[#63212](https://github.com/PaddlePaddle/Paddle/pull/63212),[#60204](https://github.com/PaddlePaddle/Paddle/pull/60204) +- The implementation of the group attention mechanism fusion operator block_multi_head_attention supports KV Cache quantization. [#59951](https://github.com/PaddlePaddle/Paddle/pull/59951) +- Inference uses a conv fusion operator reimplemented with CUTLASS, with support for automatic fusion by PASS, bias, and activation. Compared with the original cuDNN implementation, the new operator is significantly faster. Enable it through config.exp_enable_use_cutlass(True).
[#64201](https://github.com/PaddlePaddle/Paddle/pull/64201),[#64641](https://github.com/PaddlePaddle/Paddle/pull/64641) +- Add the blha_get_max_len operator and remove every call to get_max_len in block_multihead_attention. This is used to accelerate dynamic inference of large models. [#64246](https://github.com/PaddlePaddle/Paddle/pull/64246) +- Data layout optimization: the PASS no longer uses NHWC layout for the conv fusion operator at FP32 precision, because cuDNN performance degrades under this condition. [#63400](https://github.com/PaddlePaddle/Paddle/pull/63400) +- GPU peak video memory optimization: upgrade the underlying interface TryShrinkMemory to support releasing idle video memory in the pool on GPU place. In certain scenarios, peak video memory can be significantly reduced. [#61319](https://github.com/PaddlePaddle/Paddle/pull/61319) + +#### CPU Performance Optimization + +Includes new fusion operators, new PASSes under the PIR mechanism, and optimizations to some kernels. + +- Add scale_matmul_fuse_pass. [#63313](https://github.com/PaddlePaddle/Paddle/pull/63313) +- Add CPU implementations of fused_bias_residual_layernorm and fused_rms_norm to improve inference speed. [#63196](https://github.com/PaddlePaddle/Paddle/pull/63196),[#63165](https://github.com/PaddlePaddle/Paddle/pull/63165) +- Add cache optimization for the Deconvolution kernel, to greatly improve the execution speed of this operator. [#60922](https://github.com/PaddlePaddle/Paddle/pull/60922) +- In PIR, add the depthwise_conv fusion PASS, which converts the depthwise_conv operator to conv2d so that the oneDNN conv2d kernel optimization improves the inference speed of this operator.
[#63051](https://github.com/PaddlePaddle/Paddle/pull/63051) +- In PIR, add Conv and Activation Fusion PASS (conv_activation_mkldnn_fuse_pass), to support the fusion of conv and 13 kinds of activation functions, thus greatly improving the inference speed of conv-related operators. [#63145](https://github.com/PaddlePaddle/Paddle/pull/63145) +- In PIR, add the fusion PASS (operator_unsqueeze_onednn_fuse_pass) between multiple operators and unsqueeze, to improve inference speed. [#63592](https://github.com/PaddlePaddle/Paddle/pull/63592) +- In PIR, add PASS (operator_reshape_onednn_fuse_pass) to fuse reshape into multiple operators. [#63812](https://github.com/PaddlePaddle/Paddle/pull/63812) +- In PIR, add scale fusion PASS (operator_scale_onednn_fuse_pass). [#63811](https://github.com/PaddlePaddle/Paddle/pull/63811) +- In PIR, add PASS (conv2d_transpose_bias operator) that fuses conv and bias. [#62241](https://github.com/PaddlePaddle/Paddle/pull/62241) +- In PIR, add onednn_placement_pass, which supports 151 operators to convert from Phi operators to oneDNN operators, so that the oneDNN high-performance library can be used for optimization, to improve the inference speed. [#63982](https://github.com/PaddlePaddle/Paddle/pull/63982) +- In PIR, add the fusion between Elementwise type operators and 13 activation functions, to greatly improve the inference speed of enabling Onednn on the CPU. [#63516](https://github.com/PaddlePaddle/Paddle/pull/63516) +- In PIR, add the fusion of multiple conv + concat + activation functions and fused_conv + concat + activation functions, to greatly improve the inference speed when there are concat and activation functions in conv. [#62993](https://github.com/PaddlePaddle/Paddle/pull/62993)、 [#62713](https://github.com/PaddlePaddle/Paddle/pull/62713) +- In PIR, add matmul+add operator fusion PASS (matmul_elementwise_add_fuse_pass). 
[#62715](https://github.com/PaddlePaddle/Paddle/pull/62715) +- In PIR, add the scale parameter folding PASS (scale_matmul_fuse_pass). [#63313](https://github.com/PaddlePaddle/Paddle/pull/63313) +- In PIR, add the fusion PASS (softplus_activation_fuse_pass) between softplus and 12 activation functions. [#63617](https://github.com/PaddlePaddle/Paddle/pull/63617) +- In PIR, add fc operator conversion PASS (fc_onednn_enable_pass). [#63518](https://github.com/PaddlePaddle/Paddle/pull/63518) +- In PIR, add self-attention operator fusion PASS (self_attention_fuse_pass). [#63726](https://github.com/PaddlePaddle/Paddle/pull/63726) +- In PIR, add fusion PASS (fc_activation_fuse_pass) between fc and 12 activation functions. [#63853](https://github.com/PaddlePaddle/Paddle/pull/63853) +- In PIR, add the BatchNorm folding PASS (conv2d_bn_onednn_fuse_pass) to increase the chance that subsequent PASSes can fuse. [#64524](https://github.com/PaddlePaddle/Paddle/pull/64524) +- In PIR, add the fusion PASS (matmul_activation_fuse_pass) between matmul and 12 activation functions. [#62901](https://github.com/PaddlePaddle/Paddle/pull/62901) +- In PIR, add reshape + transpose + reshape fusion PASS (shuffle_channel_detect_pass), which fuses the pattern into a shuffle_channel operator under specific conditions. [#64053](https://github.com/PaddlePaddle/Paddle/pull/64053) +- In PIR, add reshape + transpose + matmul fusion PASS (reshape_transpose_matmul_fuse_pass). [#62998](https://github.com/PaddlePaddle/Paddle/pull/62998) +- In PIR, add matmul + transpose + reshape fusion PASS (matmul_transpose_reshape_fuse_pass) to significantly improve performance in some scenarios. [#63151](https://github.com/PaddlePaddle/Paddle/pull/63151) +- XPU hardware new fusion PASS optimization: + - Add qk_qkv_attention_xpu_fuse_pass and qkv_attention_xpu_kernel in XPU hardware.
[#60089](https://github.com/PaddlePaddle/Paddle/pull/60089) + - Add rotary position encoded fusion operator, to support elementwise_mul + strided_slice + sin/cos+ stack fusion to 1 operator in XPU hardware. [#60025](https://github.com/PaddlePaddle/Paddle/pull/60025) + - Add group_norm_silu_xpu_fuse_pass. [#62689](https://github.com/PaddlePaddle/Paddle/pull/62689) + - Add weight_only_linear_xpu_pass. [#64185](https://github.com/PaddlePaddle/Paddle/pull/64185) + - Add block_multihead_attention operator and PASS, to support large model inference for LLaMA2 models in XPU devices. [#65036](https://github.com/PaddlePaddle/Paddle/pull/65036) + - Support float16 type for squeeze_excitation_block_xpu_kernel. [#61023](https://github.com/PaddlePaddle/Paddle/pull/61023) + +### Bug Fixing + +- Fix mixed-precision conversions in models such as faster_rcnn_swin_tiny_fpn_1x_coco, and solve the mixed_precision_pass error. [#64673](https://github.com/PaddlePaddle/Paddle/pull/64673) +- Block fused_conv2d_add_act pass from being validated in activation functions that are sigmoid (fused conv2d and sigmoid cause performance degradation between cudnn versions 8.0 and 8.7). [#64717](https://github.com/PaddlePaddle/Paddle/pull/64717) +- Fix compilation issues with self_dp_attention and fused_layer_norm_avx_kernel in Clang12. [#63414](https://github.com/PaddlePaddle/Paddle/pull/63414) +- Fix the issue that scale and zeroPoints in the qdq operator of some models are deleted prematurely in the IR/Pass stage. [#62225](https://github.com/PaddlePaddle/Paddle/pull/62225) +- Fix the issue that causes an error to be reported when both Config.UseOptimizedModel() and config.EnableMemoryOptim() are turned on. [#62501](https://github.com/PaddlePaddle/Paddle/pull/62501) +- Add constraint on matmul_scale_fuse_pass, where input w must be a weight or the pass will not be matched. 
[#62850](https://github.com/PaddlePaddle/Paddle/pull/62850) +- Guarantee that the output key ordering of an inference model is the same as when the dynamic graph model is exported. [#63791](https://github.com/PaddlePaddle/Paddle/pull/63791) +- Fix the subgraph error that occurs in the constant folding PASS when "the folded op and its input and output are not in the same subgraph." [#62148](https://github.com/PaddlePaddle/Paddle/pull/62148) +- Fix several runtime problems in PaddleTRT mode. These include the failure to generate the quantization calibration table caused by the yolo_box operator in int8 mode, and the error caused by incorrect handling of the dim attribute data type in the reduce operator. [#61596](https://github.com/PaddlePaddle/Paddle/pull/61596) +- Fix several runtime errors in mixed-precision inference mode. These include errors caused by fused conv2d operators sharing weights without correctly converting the weight layout, the fused conv2d operator backend not being properly selected as cuDNN, the fused conv2d operator incorrectly handling the bias dimension under NHWC, and incorrect handling of the input data type of norm class operators. [#60955](https://github.com/PaddlePaddle/Paddle/pull/60955),[#60076](https://github.com/PaddlePaddle/Paddle/pull/60076),[#63007](https://github.com/PaddlePaddle/Paddle/pull/63007),[#63988](https://github.com/PaddlePaddle/Paddle/pull/63988) +- Fix the problem that the config.delete_pass function does not take effect. [#61056](https://github.com/PaddlePaddle/Paddle/pull/61056) +- Fix the GC mechanism of While control flow in PIR to recycle unwanted inputs in advance and reduce peak memory, for example, a 2GB memory reduction for the LLaMA 7B model. [#63062](https://github.com/PaddlePaddle/Paddle/pull/63062) +- Fix the OneDNN mean kernel rollback error. [#64676](https://github.com/PaddlePaddle/Paddle/pull/64676) +- Fix conv_bias_fuse_pass by adding strong constraints, e.g., the shape of the bias cannot be 1, to keep the inference results of the pass stable.
[#64412](https://github.com/PaddlePaddle/Paddle/pull/64412) +- Fix conv_elementwise_add_onednn_fuse_pass by adding strong constraints, e.g., conv2d_out and residual_param must have the same size, to keep the inference results of the pass stable. [#64448](https://github.com/PaddlePaddle/Paddle/pull/64448) +- Fix the problem of repeatedly inserting quantize and dequantize operators under certain circumstances. [#63082](https://github.com/PaddlePaddle/Paddle/pull/63082) + +## Hardware Adaptation + +### Adaptation Scheme (Custom Device) + +For PaddlePaddle hardware access, this release adds daily release support for four kinds of hardware: Kunlun XPU, Ascend NPU, Hygon DCU, and Cambricon MLU. Meanwhile, problems in distributed communication found during large model training and inference deployment have been fixed, and performance is optimized through video memory optimization and the overlap of computation and communication. Furthermore, each hardware adds support for a large number of BFloat16 operators, as well as many operator fusion Passes and fusion operators. By combining hardware and software, hardware large Transformer operator libraries are integrated to fully improve the performance of large models. + +#### New Features + +- Add the support for distributed strategy sharding stage1 v2. [#61500](https://github.com/PaddlePaddle/Paddle/pull/61500) +- Support the BF16 data type in the distributed communication module. Add BF16 support for some operators, such as empty and shape. [#60768](https://github.com/PaddlePaddle/Paddle/pull/60768),[#62140](https://github.com/PaddlePaddle/Paddle/pull/62140),[#62604](https://github.com/PaddlePaddle/Paddle/pull/62604) +- Add the support for get_comm_name interface, support for memory stat function, and support for Profiler to record memory time.
[#62556](https://github.com/PaddlePaddle/Paddle/pull/62556),[#61030](https://github.com/PaddlePaddle/Paddle/pull/61030),[#62292](https://github.com/PaddlePaddle/Paddle/pull/62292) +- Add support for some fusion strategies and operators, including silu_fuse_pass, conv_elementwise_add_act_fuse_pass, and generator offset. [#60595](https://github.com/PaddlePaddle/Paddle/pull/60595),[#60708](https://github.com/PaddlePaddle/Paddle/pull/60708),[#60616](https://github.com/PaddlePaddle/Paddle/pull/60616) + +#### Performance Optimization + +- The distributed strategy Sharding broadcasts parameters asynchronously, to improve the overlap between computation and communication. [#59745](https://github.com/PaddlePaddle/Paddle/pull/59745) +- Add the support for STRIDED Layout operator to improve the performance of the operator. [#62532](https://github.com/PaddlePaddle/Paddle/pull/62532),[#62697](https://github.com/PaddlePaddle/Paddle/pull/62697),[#62649](https://github.com/PaddlePaddle/Paddle/pull/62649) +- Optimize the memory usage of the elementwise_mul operator. [#62377](https://github.com/PaddlePaddle/Paddle/pull/62377) + +#### Bug Fixing + +- Fix the bug under the distributed strategy Sharding. [#61942](https://github.com/PaddlePaddle/Paddle/pull/61942),[#62236](https://github.com/PaddlePaddle/Paddle/pull/62236),[#62305](https://github.com/PaddlePaddle/Paddle/pull/62305),[#62535](https://github.com/PaddlePaddle/Paddle/pull/62535),[#62572](https://github.com/PaddlePaddle/Paddle/pull/62572),[#61601](https://github.com/PaddlePaddle/Paddle/pull/61601) +- Fix the problem that the c_embedding operator cannot be registered because it is not under the PHI namespace. [#60774](https://github.com/PaddlePaddle/Paddle/pull/60774) +- Fix the xccl_comm release issue. [#60465](https://github.com/PaddlePaddle/Paddle/pull/60465) +- Fix the data address error caused by the index_put operator falling back to CPU.
[#61842](https://github.com/PaddlePaddle/Paddle/pull/61842) +- Fix stream_safe_custom_device_allocator issue. [#63369](https://github.com/PaddlePaddle/Paddle/pull/63369) +- Fix the distributed worker port conflict issue. [#61409](https://github.com/PaddlePaddle/Paddle/pull/61409) +- Fix comm data type to improve device compatibility. [#62306](https://github.com/PaddlePaddle/Paddle/pull/62306) +- Unify the use of comm data type to phi::DataType. [#62464](https://github.com/PaddlePaddle/Paddle/pull/62464),[#62562](https://github.com/PaddlePaddle/Paddle/pull/62562) +- Fix the problem of missing precision parameter in PD_ConfigEnableCustomDevice. [#63702](https://github.com/PaddlePaddle/Paddle/pull/63702) + +### Kunlun XPU + +#### New Features + +- Add the support for BF16 data types for some operators, including compare_kernel and add reduce_all_kernel ([#63602](https://github.com/PaddlePaddle/Paddle/pull/63602)), empty([#60212](https://github.com/PaddlePaddle/Paddle/pull/60212)), hybrid_parallel_optimizer([#60213](https://github.com/PaddlePaddle/Paddle/pull/60213)), reduce_max/reduce_min([#60453](https://github.com/PaddlePaddle/Paddle/pull/60453)), all_reduce/concat/split([#62364](https://github.com/PaddlePaddle/Paddle/pull/62364)), tile/tile_grad([#63075](https://github.com/PaddlePaddle/Paddle/pull/63075)), accuracy([#63863](https://github.com/PaddlePaddle/Paddle/pull/63863)), swiglu/set_value([#64070](https://github.com/PaddlePaddle/Paddle/pull/64070)), amp_master_grad([#63865](https://github.com/PaddlePaddle/Paddle/pull/63865)), c_concat ([#63403](https://github.com/PaddlePaddle/Paddle/pull/63403)), flatten ([#63997](https://github.com/PaddlePaddle/Paddle/pull/63997)), compare_op ([#64473](https://github.com/PaddlePaddle/Paddle/pull/64473)), moment1/moment2 ([#62688](https://github.com/PaddlePaddle/Paddle/pull/62688)), fused_rope ([#60064](https://github.com/PaddlePaddle/Paddle/pull/60064)), c_softmax_with_cross_entropy 
([#60472](https://github.com/PaddlePaddle/Paddle/pull/60472)), elementwise_pow/square/sin/cos ([#60402](https://github.com/PaddlePaddle/Paddle/pull/60402)), strided_slice ([#60382](https://github.com/PaddlePaddle/Paddle/pull/60382)), tile/sigmoid_grad ([#60119](https://github.com/PaddlePaddle/Paddle/pull/60119)), elementwise_sub/elementwise_div ([#60386](https://github.com/PaddlePaddle/Paddle/pull/60386)), softmax_with_cross_entropy ([#63759](https://github.com/PaddlePaddle/Paddle/pull/63759)) +- Add the support for INT8 data types for some operators, including multi_encoder_xpu ([#61212](https://github.com/PaddlePaddle/Paddle/pull/61212)), qkv_attention ([#63105](https://github.com/PaddlePaddle/Paddle/pull/63105)) +- Update Kunlun SDK versions including BKCL, XHPC, XCCL, etc. [#59895](https://github.com/PaddlePaddle/Paddle/pull/59895)、[#59888](https://github.com/PaddlePaddle/Paddle/pull/59888)、[#63624](https://github.com/PaddlePaddle/Paddle/pull/63624), [#60305](https://github.com/PaddlePaddle/Paddle/pull/60305), [#62076](https://github.com/PaddlePaddle/Paddle/pull/62076), [#62646](https://github.com/PaddlePaddle/Paddle/pull/62646), [#63520](https://github.com/PaddlePaddle/Paddle/pull/63520), [#64163](https://github.com/PaddlePaddle/Paddle/pull/64163), [#64326](https://github.com/PaddlePaddle/Paddle/pull/64326), [#60617](https://github.com/PaddlePaddle/Paddle/pull/60617), [#60377](https://github.com/PaddlePaddle/Paddle/pull/60377), [#60421](https://github.com/PaddlePaddle/Paddle/pull/60421), [#60598](https://github.com/PaddlePaddle/Paddle/pull/60598), [#61199](https://github.com/PaddlePaddle/Paddle/pull/61199) +- Add the support for memory stat function. [#61116](https://github.com/PaddlePaddle/Paddle/pull/61116) +- Add multi-stream support, to assign default l3/gm buffer size to each stream. [#62729](https://github.com/PaddlePaddle/Paddle/pull/62729) +- Add nonzero operator, to support simulator XPUSIM_SKIP_RUN mode. 
[#60224](https://github.com/PaddlePaddle/Paddle/pull/60224),[#60388](https://github.com/PaddlePaddle/Paddle/pull/60388) +- Add stride_slice and stride_slice_grad operators, to support strides < 0. [#62749](https://github.com/PaddlePaddle/Paddle/pull/62749) +- Add rotary_embedding, to support use_neox_rotary_style == True. [#64090](https://github.com/PaddlePaddle/Paddle/pull/64090) +- Add fusion Pass and fusion operators including cross_attention ([#63203](https://github.com/PaddlePaddle/Paddle/pull/63203)), fused_bias_act ([#62232](https://github.com/PaddlePaddle/Paddle/pull/62232)), fused_layernorm ([#62228](https://github.com/PaddlePaddle/Paddle/pull/62228)), group_norm_silu_xpu_fuse_pass ([#63342](https://github.com/PaddlePaddle/Paddle/pull/63342)) +- Add the support for distributed strategy sharding stage3. [#57457](https://github.com/PaddlePaddle/Paddle/pull/57457) +- Add the support for tf32 fc quantization mode. [#62273](https://github.com/PaddlePaddle/Paddle/pull/62273) +- Add the flash attention operator. [#60065](https://github.com/PaddlePaddle/Paddle/pull/60065) +- Add the roformer relative embedding pass & kernel and support multi_encoder_xpu. [#62089](https://github.com/PaddlePaddle/Paddle/pull/62089) +- Add the support for pp + sharding strategy. [#63640](https://github.com/PaddlePaddle/Paddle/pull/63640) +- Upgrade the XPU communication library architecture to support dynamic-static unified communication library function. [#63817](https://github.com/PaddlePaddle/Paddle/pull/63817) + +#### Performance Optimization + +- Add XHPC buffer manager to improve the performance of Paddle and XHPC memory collaboration. [#63924](https://github.com/PaddlePaddle/Paddle/pull/63924) +- Enhance TensorSetConstantXPU performance and support BF16 data type. [#63920](https://github.com/PaddlePaddle/Paddle/pull/63920),[#61818](https://github.com/PaddlePaddle/Paddle/pull/61818) +- Fuse multiple group norm + silu + conv modules and reduce video memory usage.
[#62892](https://github.com/PaddlePaddle/Paddle/pull/62892) +- Optimize XPU memory allocation in comm manager. [#64139](https://github.com/PaddlePaddle/Paddle/pull/64139) +- Optimize operator performance, including mean_all_grad ([#61148](https://github.com/PaddlePaddle/Paddle/pull/61148)), dropout_v2 ([#61029](https://github.com/PaddlePaddle/Paddle/pull/61029)), fused_rotary_position_embedding ([#62846](https://github.com/PaddlePaddle/Paddle/pull/62846)), cross_entropy ([#63159](https://github.com/PaddlePaddle/Paddle/pull/63159)), elementwise_add ([#64289](https://github.com/PaddlePaddle/Paddle/pull/64289)), fused_gemm_epilogue ([#61350](https://github.com/PaddlePaddle/Paddle/pull/61350)), check_nan_or_inf ([#60853](https://github.com/PaddlePaddle/Paddle/pull/60853)) + +#### Bug Fixing + +- Fix the tile operator support for 0-dimensional Tensor. [#64279](https://github.com/PaddlePaddle/Paddle/pull/64279) +- Fix the group_norm_silu_fuse_pass. [#63449](https://github.com/PaddlePaddle/Paddle/pull/63449) +- Fix the XPU API GM memory issue. [#60260](https://github.com/PaddlePaddle/Paddle/pull/60260),[#60387](https://github.com/PaddlePaddle/Paddle/pull/60387),[#62940](https://github.com/PaddlePaddle/Paddle/pull/62940) +- Fix the distributed strategy Sharding stage1 v2 bug. [#64209](https://github.com/PaddlePaddle/Paddle/pull/64209) +- Fix the XPU constant issue. [#60763](https://github.com/PaddlePaddle/Paddle/pull/60763) +- Fix some operator issues, including AdamW ([#62251](https://github.com/PaddlePaddle/Paddle/pull/62251)), dropout_v3 ([#62726](https://github.com/PaddlePaddle/Paddle/pull/62726)), softmax ([#63780](https://github.com/PaddlePaddle/Paddle/pull/63780)), fused rope embedding ([#62143](https://github.com/PaddlePaddle/Paddle/pull/62143)), elementwise_add ([#60252](https://github.com/PaddlePaddle/Paddle/pull/60252)), resnet_basic_block ([#62914](https://github.com/PaddlePaddle/Paddle/pull/62914)) +- Fix XPU runtime and installation related issues.
[#60028](https://github.com/PaddlePaddle/Paddle/pull/60028),[#61970](https://github.com/PaddlePaddle/Paddle/pull/61970) +- Fix XPU compilation bugs. [#63307](https://github.com/PaddlePaddle/Paddle/pull/63307) +- Fix end-side memory related bugs when initializing XPU communication library. [#64396](https://github.com/PaddlePaddle/Paddle/pull/64396) -- Supports the FMHA/MMHA for CacheKV division block scheduling. [#59462](https://github.com/PaddlePaddle/Paddle/pull/59462) -- RoPE encoding fusion operator supports input sin/cos values. [#55415](https://github.com/PaddlePaddle/Paddle/pull/55415) -- Added fine-grained fusion operators. Supports high-performance inference optimization of generative large models. Added operators such as quant_linear, weight_quantize, and linear_compress for support of large model quantitative inference. [#57852](https://github.com/PaddlePaddle/Paddle/pull/57852),[#55128](https://github.com/PaddlePaddle/Paddle/pull/55128),[#59090](https://github.com/PaddlePaddle/Paddle/pull/59090),[#56706](https://github.com/PaddlePaddle/Paddle/pull/56706),[#59951](https://github.com/PaddlePaddle/Paddle/pull/59951),[#55490](https://github.com/PaddlePaddle/Paddle/pull/55490),[#59291](https://github.com/PaddlePaddle/Paddle/pull/59291),[#59441](https://github.com/PaddlePaddle/Paddle/pull/59441),[#59778](https://github.com/PaddlePaddle/Paddle/pull/59778),[#59651](https://github.com/PaddlePaddle/Paddle/pull/59651)[#55301](https://github.com/PaddlePaddle/Paddle/pull/55301),[#58637](https://github.com/PaddlePaddle/Paddle/pull/58637),[#56673](https://github.com/PaddlePaddle/Paddle/pull/56673),[#56401](https://github.com/PaddlePaddle/Paddle/pull/56401) -- Supports variable length inference series API. [#57948](https://github.com/PaddlePaddle/Paddle/pull/57948) -- Supports the GQA inference. [#58472](https://github.com/PaddlePaddle/Paddle/pull/58472),[#58836](https://github.com/PaddlePaddle/Paddle/pull/58836) -- Added masked multihead attention. 
Supports high performance MMHA inference. [#55344](https://github.com/PaddlePaddle/Paddle/pull/55344),[#56411](https://github.com/PaddlePaddle/Paddle/pull/56411),[#58134](https://github.com/PaddlePaddle/Paddle/pull/58134),[#57936](https://github.com/PaddlePaddle/Paddle/pull/57936) -- weight_quantize/weight_only_linear supports the Volta architecture. [#58082](https://github.com/PaddlePaddle/Paddle/pull/58082) -- Added weight_only_linear_grad for support of large model weight only quantization gradient transfer-back. [#57685](https://github.com/PaddlePaddle/Paddle/pull/57685) -- Fixed large model dynamic to static bug. Optimized communication initialization logic between static graph cards. [#56390](https://github.com/PaddlePaddle/Paddle/pull/56390),[#57169](https://github.com/PaddlePaddle/Paddle/pull/57169),[#56688](https://github.com/PaddlePaddle/Paddle/pull/56688),[#56592](https://github.com/PaddlePaddle/Paddle/pull/56592),[#58868](https://github.com/PaddlePaddle/Paddle/pull/58868) -- Optimized top_p_sampling random number generation logic. [#59494](https://github.com/PaddlePaddle/Paddle/pull/59494) - -### Paddle-TensorRT Inference Optimization - -- elementwise_add fusion supports NHWC format. [#56795](https://github.com/PaddlePaddle/Paddle/pull/56795) -- conv2d supports filter as input. [#55246](https://github.com/PaddlePaddle/Paddle/pull/55246)。 -- Supports BF16 and FP64 inference. [#59765](https://github.com/PaddlePaddle/Paddle/pull/59765),[#55520](https://github.com/PaddlePaddle/Paddle/pull/55520) -- Added MarkTrtEngineOutputs API. Users can specify TensorRT Engine outputs. [#56858](https://github.com/PaddlePaddle/Paddle/pull/56858),[#56188](https://github.com/PaddlePaddle/Paddle/pull/56188),[#57407](https://github.com/PaddlePaddle/Paddle/pull/57407) -- Customized OP can generate TensorRT Plugin automatically. 
[#58976](https://github.com/PaddlePaddle/Paddle/pull/58976),[#56037](https://github.com/PaddlePaddle/Paddle/pull/56037) -- TensorRT inference allows users to specify input hook to optimize shape collection process. [#59466](https://github.com/PaddlePaddle/Paddle/pull/59466),[#54841](https://github.com/PaddlePaddle/Paddle/pull/54841),[#57498](https://github.com/PaddlePaddle/Paddle/pull/57498),[#54861](https://github.com/PaddlePaddle/Paddle/pull/54861),[#54432](https://github.com/PaddlePaddle/Paddle/pull/54432),[#55503](https://github.com/PaddlePaddle/Paddle/pull/55503) -- TensorRT Inference supports inference model after saving Tuning. [#55893](https://github.com/PaddlePaddle/Paddle/pull/55893),[#56952](https://github.com/PaddlePaddle/Paddle/pull/56952),[#57031](https://github.com/PaddlePaddle/Paddle/pull/57031) -- Supports variable length Transformer model PromptTuning. [#57034](https://github.com/PaddlePaddle/Paddle/pull/57034) -- Added operators such as bitwise_and, bitwise_or, bitwise_not, cumsum, einsum, lookup_table, assign, flip, size, scatter, solve, unbind, reduce, and argsort. Optimized support of existing operators. 
[#59214](https://github.com/PaddlePaddle/Paddle/pull/59214),[#59293](https://github.com/PaddlePaddle/Paddle/pull/59293),[#54882](https://github.com/PaddlePaddle/Paddle/pull/54882),[#54097](https://github.com/PaddlePaddle/Paddle/pull/54097),[#54860](https://github.com/PaddlePaddle/Paddle/pull/54860),[#55426](https://github.com/PaddlePaddle/Paddle/pull/55426),[#54372](https://github.com/PaddlePaddle/Paddle/pull/54372),[#55688](https://github.com/PaddlePaddle/Paddle/pull/55688),[#56069](https://github.com/PaddlePaddle/Paddle/pull/56069),[#59563](https://github.com/PaddlePaddle/Paddle/pull/59563),[#59317](https://github.com/PaddlePaddle/Paddle/pull/59317),[#59424](https://github.com/PaddlePaddle/Paddle/pull/59424),[#55476](https://github.com/PaddlePaddle/Paddle/pull/55476),[#56043](https://github.com/PaddlePaddle/Paddle/pull/56043),[#58549](https://github.com/PaddlePaddle/Paddle/pull/58549),[#57326](https://github.com/PaddlePaddle/Paddle/pull/57326),[#59409](https://github.com/PaddlePaddle/Paddle/pull/59409) -- TensorRT enables video memory sharing by default. [#59495](https://github.com/PaddlePaddle/Paddle/pull/59495),[#58251](https://github.com/PaddlePaddle/Paddle/pull/58251) -- PrelnResidualBiasPluginDynamic supports 4D input. [#56304](https://github.com/PaddlePaddle/Paddle/pull/56304) -- Added support for FlashAttention for Paddle-TRT inference for architectures below SM80. [#56492](https://github.com/PaddlePaddle/Paddle/pull/56492) - -### Modification deprecation - -- Removed fc_elementwise_add fusion from OneDNN. [#55504](https://github.com/PaddlePaddle/Paddle/pull/55504) -- Removed redundant op. [#54442](https://github.com/PaddlePaddle/Paddle/pull/54442) - -### Bug Fix - -- Fixed “Inference so” link flags conflict issue. [#59755](https://github.com/PaddlePaddle/Paddle/pull/59755) -- Fixed constant_folding pass execution error. [#55556](https://github.com/PaddlePaddle/Paddle/pull/55556) -- Fixed softmax forward speed bug and backward accuracy bug.
[#56036](https://github.com/PaddlePaddle/Paddle/pull/56036),[#57858](https://github.com/PaddlePaddle/Paddle/pull/57858)[#57538](https://github.com/PaddlePaddle/Paddle/pull/57538) -- Fixed customized OP while error and export bug. [#58898](https://github.com/PaddlePaddle/Paddle/pull/58898),[#59318](https://github.com/PaddlePaddle/Paddle/pull/59318) -- Fixed CUDA 12.0 compilation problem on Windows platform. [#59852](https://github.com/PaddlePaddle/Paddle/pull/59852) -- Fixed bug of inference partial operator error when TensorRT version is later than 8.6. [#54379](https://github.com/PaddlePaddle/Paddle/pull/54379),[#54679](https://github.com/PaddlePaddle/Paddle/pull/54679),[#54251](https://github.com/PaddlePaddle/Paddle/pull/54251) -- Fixed and removed inference fusion Pass. [#54846](https://github.com/PaddlePaddle/Paddle/pull/54846),[#54887](https://github.com/PaddlePaddle/Paddle/pull/54887),[#55573](https://github.com/PaddlePaddle/Paddle/pull/55573),[#56434](https://github.com/PaddlePaddle/Paddle/pull/56434),[#56326](https://github.com/PaddlePaddle/Paddle/pull/56326),[#56753](https://github.com/PaddlePaddle/Paddle/pull/56753),[#57491](https://github.com/PaddlePaddle/Paddle/pull/57491),[#56909](https://github.com/PaddlePaddle/Paddle/pull/56909),[#54536](https://github.com/PaddlePaddle/Paddle/pull/54536),[#55073](https://github.com/PaddlePaddle/Paddle/pull/55073),[#55081](https://github.com/PaddlePaddle/Paddle/pull/55081),[#55240](https://github.com/PaddlePaddle/Paddle/pull/55240),[#56439](https://github.com/PaddlePaddle/Paddle/pull/56439),[#59009](https://github.com/PaddlePaddle/Paddle/pull/59009) -- Fixed error of multi-stream inference context switching. [#57629](https://github.com/PaddlePaddle/Paddle/pull/57629),[#58048](https://github.com/PaddlePaddle/Paddle/pull/58048),[#54994](https://github.com/PaddlePaddle/Paddle/pull/54994) - -## 5. 
Hardware Support - -### Hardware Integration Solution (Custom Device) - -This update adds support for advanced distributed strategies, custom operators, and custom fusion strategies. The upgraded distributed communication library supports MP, GroupShared, PP, SP, MOE, and other advanced distributed strategies. Vendors can also flexibly integrate Transformer operator libraries of different granularities and modify the computation graph through Fusion Pass for performance acceleration. - -#### New features - -- Upgraded CustomDevice to support Paddle's latest distributed communication library CommContext. Added a variety of advanced distributed strategies such as GroupShared and MOE. [#56301](https://github.com/PaddlePaddle/Paddle/pull/56301),[#54671](https://github.com/PaddlePaddle/Paddle/pull/54671),[#57957](https://github.com/PaddlePaddle/Paddle/pull/57957),[#56669](https://github.com/PaddlePaddle/Paddle/pull/56669),[#54384](https://github.com/PaddlePaddle/Paddle/pull/54384),[#54572](https://github.com/PaddlePaddle/Paddle/pull/54572),[#54573](https://github.com/PaddlePaddle/Paddle/pull/54573),[#54676](https://github.com/PaddlePaddle/Paddle/pull/54676) -- Upgraded CustomDevice to support CustomOP. Users can register operators that are not defined in the Paddle PHI operator library. CustomDevice supports CustomOP via the C API. [#57038](https://github.com/PaddlePaddle/Paddle/pull/57038),[#55532](https://github.com/PaddlePaddle/Paddle/pull/55532),[#56755](https://github.com/PaddlePaddle/Paddle/pull/56755),[#55532](https://github.com/PaddlePaddle/Paddle/pull/55532),[#55533](https://github.com/PaddlePaddle/Paddle/pull/55533),[#55659](https://github.com/PaddlePaddle/Paddle/pull/55659) -- Added CustomDevice's support for the CustomPass function. The computation graph IR can be modified through the Python API. [#55511](https://github.com/PaddlePaddle/Paddle/pull/55511),[#55728](https://github.com/PaddlePaddle/Paddle/pull/55728) -- Added CustomDevice’s support for Paddle run_check.
[#56318](https://github.com/PaddlePaddle/Paddle/pull/56318) -- Added CustomDevice’s support for StreamSafeAllocator. [#55393](https://github.com/PaddlePaddle/Paddle/pull/55393),[#56380](https://github.com/PaddlePaddle/Paddle/pull/56380),[#56536](https://github.com/PaddlePaddle/Paddle/pull/56536),[#58035](https://github.com/PaddlePaddle/Paddle/pull/58035) -- Added CustomDevice’s support for DataTransform. [#56627](https://github.com/PaddlePaddle/Paddle/pull/56627) - -#### Function optimization - -- Added CustomDevice’s support for more PaddlePaddle APIs such as Variable.set_value, adamw, share_external_data, mp_allreduce_sum, tensor.numpy, get_paddle_place, and GeneratorState. [#55272](https://github.com/PaddlePaddle/Paddle/pull/55272), [#56386](https://github.com/PaddlePaddle/Paddle/pull/56386), [#57253](https://github.com/PaddlePaddle/Paddle/pull/57253), [#56927](https://github.com/PaddlePaddle/Paddle/pull/56927),[#56189](https://github.com/PaddlePaddle/Paddle/pull/56189),[#55225](https://github.com/PaddlePaddle/Paddle/pull/55225),[#55247](https://github.com/PaddlePaddle/Paddle/pull/55247) -- Modified CustomDevice dynamic library loading method from RTLD_NOW to RTLD_LAZY, to facilitate subsequent checking of compatibility of CustomDevice related software stack version. [#57544](https://github.com/PaddlePaddle/Paddle/pull/57544) -- Added CustomDevice's detection function for FP16 operator under mixed precision training. [#56053](https://github.com/PaddlePaddle/Paddle/pull/56053),[#56176](https://github.com/PaddlePaddle/Paddle/pull/56176) - -#### Bug Fix - -- Fixed some problems in CustomDevice's support for distributed communication libraries. 
[#55293](https://github.com/PaddlePaddle/Paddle/pull/55293),[#58038](https://github.com/PaddlePaddle/Paddle/pull/58038),[#59800](https://github.com/PaddlePaddle/Paddle/pull/59800) -- Fixed some problems in CustomDevice on some operators, including c_softmax_with_cross_entropy,data loader,SplitDenseTensor,grad accumulation,atan2 grad.[#56486](https://github.com/PaddlePaddle/Paddle/pull/56486),[#55541](https://github.com/PaddlePaddle/Paddle/pull/55541),[#55615](https://github.com/PaddlePaddle/Paddle/pull/55615),[#56052](https://github.com/PaddlePaddle/Paddle/pull/56052),[#56067](https://github.com/PaddlePaddle/Paddle/pull/56067) -- Fixed some problems of device management in CustomDevice, including device exceptions ([#56556](https://github.com/PaddlePaddle/Paddle/pull/56556),[#58639](https://github.com/PaddlePaddle/Paddle/pull/58639),[#55173](https://github.com/PaddlePaddle/Paddle/pull/55173)), exception events ([#56745](https://github.com/PaddlePaddle/Paddle/pull/56745),[#58059](https://github.com/PaddlePaddle/Paddle/pull/58059)), video memory exception ([#56977](https://github.com/PaddlePaddle/Paddle/pull/56977),[#59247](https://github.com/PaddlePaddle/Paddle/pull/59247),[#54606](https://github.com/PaddlePaddle/Paddle/pull/54606)), device initialization ([#57099](https://github.com/PaddlePaddle/Paddle/pull/57099),[#57994](https://github.com/PaddlePaddle/Paddle/pull/57994)), device release ([#54932](https://github.com/PaddlePaddle/Paddle/pull/54932),[#55351](https://github.com/PaddlePaddle/Paddle/pull/55351),[#55783](https://github.com/PaddlePaddle/Paddle/pull/55783)), and device resource pooling, etc.([#55229](https://github.com/PaddlePaddle/Paddle/pull/55229),[#56580](https://github.com/PaddlePaddle/Paddle/pull/56580)) -- Fixed CustomDevice compilation-related issues. 
[#56760](https://github.com/PaddlePaddle/Paddle/pull/56760),[#56766](https://github.com/PaddlePaddle/Paddle/pull/56766) - -### Kunlunxin XPU - -#### New features - -- Added XPTI (XPU Profiling Tool Interface) to support collecting and analyzing runtime performance data. [#54685](https://github.com/PaddlePaddle/Paddle/pull/54685),[#54690](https://github.com/PaddlePaddle/Paddle/pull/54690),[#54800](https://github.com/PaddlePaddle/Paddle/pull/54800) -- Supports Paddle's latest distributed communication library CommContext. [#59418](https://github.com/PaddlePaddle/Paddle/pull/59418) -- Added XPU fusion operators, for example, fast_where. [#55628](https://github.com/PaddlePaddle/Paddle/pull/55628) -- Added support for the XPU Plugin function, so that users can develop XPU customized operators through XTDK programming. [#55101](https://github.com/PaddlePaddle/Paddle/pull/55101),[#59326](https://github.com/PaddlePaddle/Paddle/pull/59326) -- Added XPU’s support for AutoGrowthAllocator. [#54121](https://github.com/PaddlePaddle/Paddle/pull/54121) -- Added operator support list of Kunlun3. [#57683](https://github.com/PaddlePaddle/Paddle/pull/57683) - -#### Function optimization - -- Upgraded XPU Inference API. [#54342](https://github.com/PaddlePaddle/Paddle/pull/54342) -- Optimized performance of some XPU operators. Added support for bf16 in some XPU operators, including unique/index_put, squeeze/unsqueeze kernels, swish/swish_grad, scatter_nd_add_grad/slice, rsqrt/bitwise_or/arange_tensor, where, collective.
[#56582](https://github.com/PaddlePaddle/Paddle/pull/56582),[#58161](https://github.com/PaddlePaddle/Paddle/pull/58161),[#58440](https://github.com/PaddlePaddle/Paddle/pull/58440),[#58580](https://github.com/PaddlePaddle/Paddle/pull/58580),[#58950](https://github.com/PaddlePaddle/Paddle/pull/58950),[#58616](https://github.com/PaddlePaddle/Paddle/pull/58616),[#59273](https://github.com/PaddlePaddle/Paddle/pull/59273) -- Optimized XPU memory management to avoid memory leakage. [#59334](https://github.com/PaddlePaddle/Paddle/pull/59334),[#54847](https://github.com/PaddlePaddle/Paddle/pull/54847) -- Supports INT8 inference. [#57258](https://github.com/PaddlePaddle/Paddle/pull/57258) -- Added support for FP16 series inference operators. [#55642](https://github.com/PaddlePaddle/Paddle/pull/55642),[#54410](https://github.com/PaddlePaddle/Paddle/pull/54410) -- Supports share_external_memory interface to pass input and output. [#55170](https://github.com/PaddlePaddle/Paddle/pull/55170) -- Supports open source quantization model XPU inference. [#58568](https://github.com/PaddlePaddle/Paddle/pull/58568) -- Added context_gm_size configuration, instead of allocating global memory in Pass. [#54674](https://github.com/PaddlePaddle/Paddle/pull/54674) -- Added embedding and fast_gather_nd plugin. [#56488](https://github.com/PaddlePaddle/Paddle/pull/56488),[#56103](https://github.com/PaddlePaddle/Paddle/pull/56103) -- Supports fusion of fast_layternorm + leaky_relu. [#57113](https://github.com/PaddlePaddle/Paddle/pull/57113) -- Supports elementwise_min/max/floordiv/where inference in KL1 and KL2 precision. [#58422](https://github.com/PaddlePaddle/Paddle/pull/58422) -- Supports autotune configuration of fc and conv2d operator. [#58801](https://github.com/PaddlePaddle/Paddle/pull/58801) -- Supports conv and fc dynamic quantization. [#59307](https://github.com/PaddlePaddle/Paddle/pull/59307) -- fc + act fusion support for sigmoid, swish and relu6. 
[#54486](https://github.com/PaddlePaddle/Paddle/pull/54486) -- elementwise_sub/elementwise_div supports int data type. [#55920](https://github.com/PaddlePaddle/Paddle/pull/55920) - -#### Bug Fix - -- Fixed XPU communication library issues and some operator issues including rnn, layer_norm_grad, yolo_box. ([#55475](https://github.com/PaddlePaddle/Paddle/pull/55475),[#55515](https://github.com/PaddlePaddle/Paddle/pull/55515)) ([#55656](https://github.com/PaddlePaddle/Paddle/pull/55656),[#54669](https://github.com/PaddlePaddle/Paddle/pull/54669),[#55310](https://github.com/PaddlePaddle/Paddle/pull/55310)) - -### Hygon DCU - -#### Bug Fix - -- Fixed some operator bugs of Hygon DCU, including rnn, concat/split, fft, and so on. [#59402](https://github.com/PaddlePaddle/Paddle/pull/59402),[#55821](https://github.com/PaddlePaddle/Paddle/pull/55821),[#56340](https://github.com/PaddlePaddle/Paddle/pull/56340) -- Fixed issues related to communication library of Hygon DCU. [#57110](https://github.com/PaddlePaddle/Paddle/pull/57110) -- Fixed compilation-related problems of Hygon DCU. [#59775](https://github.com/PaddlePaddle/Paddle/pull/59775),[#55507](https://github.com/PaddlePaddle/Paddle/pull/55507),[#55612](https://github.com/PaddlePaddle/Paddle/pull/55612),[#54952](https://github.com/PaddlePaddle/Paddle/pull/54952),[#55076](https://github.com/PaddlePaddle/Paddle/pull/55076),[#56079](https://github.com/PaddlePaddle/Paddle/pull/56079),[#54874](https://github.com/PaddlePaddle/Paddle/pull/54874) -- Fixed support issue of Hygon DCU for BF16 data type. [#56517](https://github.com/PaddlePaddle/Paddle/pull/56517) - -## 6. Environment Adaptation - -This release adopts modular compilation to optimize the CMake code and improve the compilation efficiency of PaddlePaddle, which makes local development more efficient for R&D engineers. It also supports compilation with Python 3.12, CUDA 12, and the Hopper architecture, and uses Clang tools to comprehensively optimize code formatting.
In addition, C++ unitest is changed from linking static libraries to linking dynamic libraries to reduce compilation size. These improvements provide users with a smoother and more efficient installation and development experience. - -- CMake code optimization: stratify directories into independent static libraries, to improve incremental compilation efficiency. [#59095](https://github.com/PaddlePaddle/Paddle/pull/59095), [#58960](https://github.com/PaddlePaddle/Paddle/pull/58960),[#56591](https://github.com/PaddlePaddle/Paddle/pull/56591),[#58484](https://github.com/PaddlePaddle/Paddle/pull/58484) -- CMake compilation stratification: to realize compilation layering of PaddlePaddle architecture from bottom-up and improve compilation efficiency. [#56442](https://github.com/PaddlePaddle/Paddle/pull/56442),[#54729](https://github.com/PaddlePaddle/Paddle/pull/54729),[#55733](https://github.com/PaddlePaddle/Paddle/pull/55733),[#56352](https://github.com/PaddlePaddle/Paddle/pull/56352),[#55109](https://github.com/PaddlePaddle/Paddle/pull/55109),[#54992](https://github.com/PaddlePaddle/Paddle/pull/54992),[#57698](https://github.com/PaddlePaddle/Paddle/pull/57698),[#55147](https://github.com/PaddlePaddle/Paddle/pull/55147),[#55113](https://github.com/PaddlePaddle/Paddle/pull/55113),[#56691](https://github.com/PaddlePaddle/Paddle/pull/56691),[#58618](https://github.com/PaddlePaddle/Paddle/pull/58618),[#58899](https://github.com/PaddlePaddle/Paddle/pull/58899),[#59140](https://github.com/PaddlePaddle/Paddle/pull/59140),[#59129](https://github.com/PaddlePaddle/Paddle/pull/59129),[#59222](https://github.com/PaddlePaddle/Paddle/pull/59222),[#59105](https://github.com/PaddlePaddle/Paddle/pull/59105),[#59711](https://github.com/PaddlePaddle/Paddle/pull/59711) -- Offline compilation of third-party libraries: Third-party dependent libraries are compiled offline, so CI/CE system does not need to download third-party libraries repeatedly in every compilation, improving operation 
efficiency of the CI/CE system. [#54344](https://github.com/PaddlePaddle/Paddle/pull/54344),[#54370](https://github.com/PaddlePaddle/Paddle/pull/54370),[#54466](https://github.com/PaddlePaddle/Paddle/pull/54466),[#54438](https://github.com/PaddlePaddle/Paddle/pull/54438),[#54388](https://github.com/PaddlePaddle/Paddle/pull/54388),[#54436](https://github.com/PaddlePaddle/Paddle/pull/54436),[#54392](https://github.com/PaddlePaddle/Paddle/pull/54392),[#54646](https://github.com/PaddlePaddle/Paddle/pull/54646),[#54380](https://github.com/PaddlePaddle/Paddle/pull/54380),[#55501](https://github.com/PaddlePaddle/Paddle/pull/55501),[#55136](https://github.com/PaddlePaddle/Paddle/pull/55136),[#54451](https://github.com/PaddlePaddle/Paddle/pull/54451),[#55631](https://github.com/PaddlePaddle/Paddle/pull/55631),[#55549](https://github.com/PaddlePaddle/Paddle/pull/55549),[#56165](https://github.com/PaddlePaddle/Paddle/pull/56165),[#54391](https://github.com/PaddlePaddle/Paddle/pull/54391),[#54614](https://github.com/PaddlePaddle/Paddle/pull/54614),[#54522](https://github.com/PaddlePaddle/Paddle/pull/54522),[#54764](https://github.com/PaddlePaddle/Paddle/pull/54764),[#54400](https://github.com/PaddlePaddle/Paddle/pull/54400),[#54322](https://github.com/PaddlePaddle/Paddle/pull/54322) -- PaddlePaddle supports Python 3.12. [#59396](https://github.com/PaddlePaddle/Paddle/pull/59396),[#58069](https://github.com/PaddlePaddle/Paddle/pull/58069) -- Using Clang tool to optimize source codes and improve code quality. 
[#59626](https://github.com/PaddlePaddle/Paddle/pull/59626),[#55895](https://github.com/PaddlePaddle/Paddle/pull/55895),[#56632](https://github.com/PaddlePaddle/Paddle/pull/56632),[#54449](https://github.com/PaddlePaddle/Paddle/pull/54449),[#54523](https://github.com/PaddlePaddle/Paddle/pull/54523),[#54796](https://github.com/PaddlePaddle/Paddle/pull/54796),[#55847](https://github.com/PaddlePaddle/Paddle/pull/55847),[#55807](https://github.com/PaddlePaddle/Paddle/pull/55807),[#56261](https://github.com/PaddlePaddle/Paddle/pull/56261),[#57522](https://github.com/PaddlePaddle/Paddle/pull/57522),[#57868](https://github.com/PaddlePaddle/Paddle/pull/57868),[#57809](https://github.com/PaddlePaddle/Paddle/pull/57809),[#55658](https://github.com/PaddlePaddle/Paddle/pull/55658),[#58285](https://github.com/PaddlePaddle/Paddle/pull/58285),[#55491](https://github.com/PaddlePaddle/Paddle/pull/55491),[#55506](https://github.com/PaddlePaddle/Paddle/pull/55506),[#55279](https://github.com/PaddlePaddle/Paddle/pull/55279),[#55741](https://github.com/PaddlePaddle/Paddle/pull/55741),[#55894](https://github.com/PaddlePaddle/Paddle/pull/55894),[#55704](https://github.com/PaddlePaddle/Paddle/pull/55704),[#55800](https://github.com/PaddlePaddle/Paddle/pull/55800),[#55799](https://github.com/PaddlePaddle/Paddle/pull/55799),[#55983](https://github.com/PaddlePaddle/Paddle/pull/55983),[#55954](https://github.com/PaddlePaddle/Paddle/pull/55954),[#55764](https://github.com/PaddlePaddle/Paddle/pull/55764),[#56246](https://github.com/PaddlePaddle/Paddle/pull/56246),[#56219](https://github.com/PaddlePaddle/Paddle/pull/56219),[#56217](https://github.com/PaddlePaddle/Paddle/pull/56217),[#56216](https://github.com/PaddlePaddle/Paddle/pull/56216),[#56208](https://github.com/PaddlePaddle/Paddle/pull/56208),[#56134](https://github.com/PaddlePaddle/Paddle/pull/56134),[#56253](https://github.com/PaddlePaddle/Paddle/pull/56253),[#56255](https://github.com/PaddlePaddle/Paddle/pull/56255),[#56693](https://git
hub.com/PaddlePaddle/Paddle/pull/56693),[#56692](https://github.com/PaddlePaddle/Paddle/pull/56692),[#56637](https://github.com/PaddlePaddle/Paddle/pull/56637),[#56636](https://github.com/PaddlePaddle/Paddle/pull/56636),[#56647](https://github.com/PaddlePaddle/Paddle/pull/56647),[#56218](https://github.com/PaddlePaddle/Paddle/pull/56218),[#56640](https://github.com/PaddlePaddle/Paddle/pull/56640),[#56635](https://github.com/PaddlePaddle/Paddle/pull/56635),[#55675](https://github.com/PaddlePaddle/Paddle/pull/55675),[#56601](https://github.com/PaddlePaddle/Paddle/pull/56601),[#56485](https://github.com/PaddlePaddle/Paddle/pull/56485),[#56648](https://github.com/PaddlePaddle/Paddle/pull/56648),[#56747](https://github.com/PaddlePaddle/Paddle/pull/56747),[#56676](https://github.com/PaddlePaddle/Paddle/pull/56676),[#56649](https://github.com/PaddlePaddle/Paddle/pull/56649),[#56895](https://github.com/PaddlePaddle/Paddle/pull/56895),[#56994](https://github.com/PaddlePaddle/Paddle/pull/56994),[#56904](https://github.com/PaddlePaddle/Paddle/pull/56904),[#56744](https://github.com/PaddlePaddle/Paddle/pull/56744),[#56954](https://github.com/PaddlePaddle/Paddle/pull/56954),[#57114](https://github.com/PaddlePaddle/Paddle/pull/57114),[#57343](https://github.com/PaddlePaddle/Paddle/pull/57343),[#57483](https://github.com/PaddlePaddle/Paddle/pull/57483),[#57871](https://github.com/PaddlePaddle/Paddle/pull/57871),[#57861](https://github.com/PaddlePaddle/Paddle/pull/57861),[#58028](https://github.com/PaddlePaddle/Paddle/pull/58028),[#57627](https://github.com/PaddlePaddle/Paddle/pull/57627),[#59072](https://github.com/PaddlePaddle/Paddle/pull/59072) -- C++ unitest has changed from linking static libraries to linking dynamic libraries, reducing compilation size and improving compilation efficiency. 
[#59477](https://github.com/PaddlePaddle/Paddle/pull/59477),[#56630](https://github.com/PaddlePaddle/Paddle/pull/56630),[#57789](https://github.com/PaddlePaddle/Paddle/pull/57789),[#54257](https://github.com/PaddlePaddle/Paddle/pull/54257),[#59620](https://github.com/PaddlePaddle/Paddle/pull/59620),[#59384](https://github.com/PaddlePaddle/Paddle/pull/59384),[#59619](https://github.com/PaddlePaddle/Paddle/pull/59619),[#58583](https://github.com/PaddlePaddle/Paddle/pull/58583),[#58821](https://github.com/PaddlePaddle/Paddle/pull/58821),[#58710](https://github.com/PaddlePaddle/Paddle/pull/58710),[#58619](https://github.com/PaddlePaddle/Paddle/pull/58619) -- Fixed bug related to source code compilation, improving compilation efficiency. [#56617](https://github.com/PaddlePaddle/Paddle/pull/56617),[#58195](https://github.com/PaddlePaddle/Paddle/pull/58195),[#56136](https://github.com/PaddlePaddle/Paddle/pull/56136),[#54540](https://github.com/PaddlePaddle/Paddle/pull/54540),[#57172](https://github.com/PaddlePaddle/Paddle/pull/57172),[#54429](https://github.com/PaddlePaddle/Paddle/pull/54429),[#55603](https://github.com/PaddlePaddle/Paddle/pull/55603),[#54807](https://github.com/PaddlePaddle/Paddle/pull/54807),[#56102](https://github.com/PaddlePaddle/Paddle/pull/56102),[#56829](https://github.com/PaddlePaddle/Paddle/pull/56829),[#56951](https://github.com/PaddlePaddle/Paddle/pull/56951),[#56555](https://github.com/PaddlePaddle/Paddle/pull/56555),[#57781](https://github.com/PaddlePaddle/Paddle/pull/57781),[#57836](https://github.com/PaddlePaddle/Paddle/pull/57836),[#58807](https://github.com/PaddlePaddle/Paddle/pull/58807),[#54535](https://github.com/PaddlePaddle/Paddle/pull/54535),[#54946](https://github.com/PaddlePaddle/Paddle/pull/54946),[#54437](https://github.com/PaddlePaddle/Paddle/pull/54437),[#54411](https://github.com/PaddlePaddle/Paddle/pull/54411),[#54411](https://github.com/PaddlePaddle/Paddle/pull/54411),[#54391](https://github.com/PaddlePaddle/Paddle/pull/5439
1),[#54466](https://github.com/PaddlePaddle/Paddle/pull/54466),[#54480](https://github.com/PaddlePaddle/Paddle/pull/54480),[#54480](https://github.com/PaddlePaddle/Paddle/pull/54480),[#54724](https://github.com/PaddlePaddle/Paddle/pull/54724),[#59193](https://github.com/PaddlePaddle/Paddle/pull/59193),[#54735](https://github.com/PaddlePaddle/Paddle/pull/54735),[#54812](https://github.com/PaddlePaddle/Paddle/pull/54812),[#56430](https://github.com/PaddlePaddle/Paddle/pull/56430),[#56655](https://github.com/PaddlePaddle/Paddle/pull/56655),[#56684](https://github.com/PaddlePaddle/Paddle/pull/56684),[#56774](https://github.com/PaddlePaddle/Paddle/pull/56774),[#56936](https://github.com/PaddlePaddle/Paddle/pull/56936),[#56949](https://github.com/PaddlePaddle/Paddle/pull/56949),[#56974](https://github.com/PaddlePaddle/Paddle/pull/56974),[#57171](https://github.com/PaddlePaddle/Paddle/pull/57171),[#57712](https://github.com/PaddlePaddle/Paddle/pull/57712),[#56617](https://github.com/PaddlePaddle/Paddle/pull/56617),[#58181](https://github.com/PaddlePaddle/Paddle/pull/58181),[#58253](https://github.com/PaddlePaddle/Paddle/pull/58253),[#58268](https://github.com/PaddlePaddle/Paddle/pull/58268),[#59051](https://github.com/PaddlePaddle/Paddle/pull/59051),[#59048](https://github.com/PaddlePaddle/Paddle/pull/59048),[#59081](https://github.com/PaddlePaddle/Paddle/pull/59081),[#59076](https://github.com/PaddlePaddle/Paddle/pull/59076),[#59155](https://github.com/PaddlePaddle/Paddle/pull/59155),[#59253](https://github.com/PaddlePaddle/Paddle/pull/59253),[#59347](https://github.com/PaddlePaddle/Paddle/pull/59347),[#58957](https://github.com/PaddlePaddle/Paddle/pull/58957),[#59443](https://github.com/PaddlePaddle/Paddle/pull/59443),[#58998](https://github.com/PaddlePaddle/Paddle/pull/58998),[#57574](https://github.com/PaddlePaddle/Paddle/pull/57574),[#55889](https://github.com/PaddlePaddle/Paddle/pull/55889),[#59078](https://github.com/PaddlePaddle/Paddle/pull/59078),[#55762](https://
github.com/PaddlePaddle/Paddle/pull/55762),[#56252](https://github.com/PaddlePaddle/Paddle/pull/56252),[#56715](https://github.com/PaddlePaddle/Paddle/pull/56715),[#54905](https://github.com/PaddlePaddle/Paddle/pull/54905),[#56978](https://github.com/PaddlePaddle/Paddle/pull/56978),[#57032](https://github.com/PaddlePaddle/Paddle/pull/57032),[#57179](https://github.com/PaddlePaddle/Paddle/pull/57179),[#57179](https://github.com/PaddlePaddle/Paddle/pull/57179),[#58996](https://github.com/PaddlePaddle/Paddle/pull/58996),[#59915](https://github.com/PaddlePaddle/Paddle/pull/59915),[#54883](https://github.com/PaddlePaddle/Paddle/pull/54883),[#56746](https://github.com/PaddlePaddle/Paddle/pull/56746),[#57674](https://github.com/PaddlePaddle/Paddle/pull/57674),[#60117](https://github.com/PaddlePaddle/Paddle/pull/60117),[#55627](https://github.com/PaddlePaddle/Paddle/pull/55627),[#54568](https://github.com/PaddlePaddle/Paddle/pull/54568),[#54450](https://github.com/PaddlePaddle/Paddle/pull/54450),[#54513](https://github.com/PaddlePaddle/Paddle/pull/54513),[#54615](https://github.com/PaddlePaddle/Paddle/pull/54615),[#54913](https://github.com/PaddlePaddle/Paddle/pull/54913),[#54916](https://github.com/PaddlePaddle/Paddle/pull/54916),[#55148](https://github.com/PaddlePaddle/Paddle/pull/55148),[#55125](https://github.com/PaddlePaddle/Paddle/pull/55125),[#55479](https://github.com/PaddlePaddle/Paddle/pull/55479),[#55723](https://github.com/PaddlePaddle/Paddle/pull/55723),[#55831](https://github.com/PaddlePaddle/Paddle/pull/55831),[#55904](https://github.com/PaddlePaddle/Paddle/pull/55904),[#56085](https://github.com/PaddlePaddle/Paddle/pull/56085),[#56259](https://github.com/PaddlePaddle/Paddle/pull/56259),[#56366](https://github.com/PaddlePaddle/Paddle/pull/56366),[#56366](https://github.com/PaddlePaddle/Paddle/pull/56366),[#56546](https://github.com/PaddlePaddle/Paddle/pull/56546),[#56679](https://github.com/PaddlePaddle/Paddle/pull/56679),[#57222](https://github.com/PaddlePad
dle/Paddle/pull/57222),[#57387](https://github.com/PaddlePaddle/Paddle/pull/57387),[#57993](https://github.com/PaddlePaddle/Paddle/pull/57993),[#59556](https://github.com/PaddlePaddle/Paddle/pull/59556),[#57931](https://github.com/PaddlePaddle/Paddle/pull/57931),[#58112](https://github.com/PaddlePaddle/Paddle/pull/58112),[#54228](https://github.com/PaddlePaddle/Paddle/pull/54228),[#56913](https://github.com/PaddlePaddle/Paddle/pull/56913),[#56993](https://github.com/PaddlePaddle/Paddle/pull/56993),[#55042](https://github.com/PaddlePaddle/Paddle/pull/55042),[#55305](https://github.com/PaddlePaddle/Paddle/pull/55305),[#55286](https://github.com/PaddlePaddle/Paddle/pull/55286),[#56634](https://github.com/PaddlePaddle/Paddle/pull/56634),[#57778](https://github.com/PaddlePaddle/Paddle/pull/57778),[#58374](https://github.com/PaddlePaddle/Paddle/pull/58374),[#58640](https://github.com/PaddlePaddle/Paddle/pull/58640),[#58822](https://github.com/PaddlePaddle/Paddle/pull/58822),[#59055](https://github.com/PaddlePaddle/Paddle/pull/59055),[#59303](https://github.com/PaddlePaddle/Paddle/pull/59303),[#59487](https://github.com/PaddlePaddle/Paddle/pull/59487),[#58400](https://github.com/PaddlePaddle/Paddle/pull/58400),[#59283](https://github.com/PaddlePaddle/Paddle/pull/59283),[#54791](https://github.com/PaddlePaddle/Paddle/pull/54791),[#59134](https://github.com/PaddlePaddle/Paddle/pull/59134),[#56206](https://github.com/PaddlePaddle/Paddle/pull/56206),[#56199](https://github.com/PaddlePaddle/Paddle/pull/56199),[#56670](https://github.com/PaddlePaddle/Paddle/pull/56670),[#58923](https://github.com/PaddlePaddle/Paddle/pull/58923) -- Fixed bug related to Paddle ARM compilation. 
[#55416](https://github.com/PaddlePaddle/Paddle/pull/55416),[#55548](https://github.com/PaddlePaddle/Paddle/pull/55548) - -## Thanks to Our Contributors - -Azure-Tang, zhaoyinglia, From00, JZ-LIANG, xysheng-baidu, SylarTiaNII, kuizhiqing, zhiqiu, FeixLiu, liuzhenhai93, GhostScreaming, pangengzheng, xiaoyewww, wanghuancoder, ForFishes, hitywt, danleifeng, tianshuo78520a, ykkk2333, houj04, lj970926, XiaociZhang, HarperCy, cqulilujia, runzhech, RuohengMa, Caozhou1995, kangguangli, heavyrain-lzy, zyfncg, SigureMo, YuanRisheng, lchdl, LiYuRio, AndSonder, Wennie396, zhangbo9674, liudongxue01, risemeup1, phlrain, winter-wang, yuanlehome, NALLEIN, Liujie0926, yuguo-Jack, gitliuyf, zh794390558, Aurelius84, 6clc, GGBond8488, xiaoguoguo626807, Wong4j, iosmers, xiaoxiaohehe001, LielinJiang, carryyu, Difers, yangxiaoyu14, xuxinyi389, cxxly, gongshaotian, jjyaoao, lijialin03, lxd-cumt, cyber-pioneer, HydrogenSulfate, MayYouBeProsperous, Charles-hit, Patrick-Star125, ScottWong98, huangjiyi, DrRyanHuang, jinyouzhi, BeingGod, Wanglongzhi2001, yangguohao, zyt1024, longranger2, 2742195759, megemini, thisjiang, kevincheng2, zhoutianzi666, Wangzheee, ming1753, tianhaodongbd, freeliuzc, zhenyun-li, MARD1NO, RichardWooSJTU, eee4017, leo0519, csy0225, wwbitejotunn, bukejiyu, jiweibo, iamsonderr, ckl117, ronny1996, zhanglirong1999, LLee233, ZHUI, wangxn12138, zhwesky2010, Courtesy-Xs, zoooo0820, llyyxx0413, Asthestarsfalll, zxcd, pkuzyc, idontkonwher, sneaxiy, hong19860320, ZibinGuo, leolishaohao, MuShangCC, zhupengyang, shentanyue, Travis-Lee, wz1qqx, frank-oops, newway, QingshuChen, zhangyk0314, HandSomeLEEw, Shixiaowei02, zhangyuqin1998, Xing-lil, zhhsplendid, jiahy0825, xinyu-intel, MarioLulab, 0x45f, Tom-Zheng, xingmingyyj, zhangbopd, gouzil, zeroRains, BiynXu, WintersMontagne10335, wuhuachaocoding, GreatV, chenwhql, deepllz, parap1uie-s, ozogxyz, FisherWY, changeyoung98, zhiboniu, YangQun1 dynamicheart, Xreki, liugddx, Lylinnnnn, YSF-A, zzjjay, YanhuiDua, lishicheng1996, USTCKAY, 
abenmao, cocoshe, HermitSun, ccsuzzh, sanbuphy, enkilee, RedContritio, Liyulingyue, zrr1999, chen2016013, Galaxy1458, chalsliu, mrcangye, XieYunshen, zhiheng-liu, haohongxiang, ZzSean, JamesLim-sy, yuehuayingxueluo, niuliling123, umiswing, sijunhe, littsk, SecretXV, zhurou603, zhangjun, caizejun, yangjianfengo1, vivienfanghuagood, Xinyu302, lizexu123, yghstill, Li-fAngyU, VigiZhang, co63oc, dhanush-2501, ooooo-create, PommesPeter, zeus2x7, akshatvishu, jzhang533, Sekiro-x, gumblex, BernieHuang2008, YibinLiu666, qiuwenbogdut, XavierZXY, MqLeet, zhangting2020, mingxu1067, Ainavo, SSKlearns, yuchen202, silverling, zade23, wenxiaohahaha, NKNaN, Tsaiyue, fsczz, Tomoko-hjf, rhmaaa, zbt78, Hhankyangg, wangzhen38, zhengqiwen1997, engineer1109, onepick, qili93, Rane2021, nemonameless, DesmonDay, RachelXu7, ceci3, lyuwenyu, liuruyan, LokeZhou, shiyutang, lanxianghit, feifei-111, Sahala08, sunzhongkai588, Kaedeharai, Candy2Tang, liyongchao911, whisky-12, InsaneOnion, yoyoIcy, KongAKun, linzeyang, MuhammadNizamani, eltociear, Ligoml, LUZY0726, Windfarer, FlyingQianMM, jeng1220, junelotus, zlsh80826, Vvsmile, Frida-a, TonibMw, guoshengCS, zhink, ZhangYulongg, AlbertVan, fengxin-hello, mjp9527, entired, DanGuge. - -# 2.5.0 Release Note - -## 1. Highlights -- **New dynamic-static unification architecture**: Implement a new dynamic-to-static plus compiler execution model in combination with the basic operator, and complete the whole dynamic-to-static, combinator and neural network compiler optimization and acceleration process on the ResNet50&Bert model. For the dynamic-to-static, complete the whole graph fallback core function development, and support the fallback to dynamic graph training execution in case of dynamic-to-static failure. 
For the combinator, design a basic operator system containing more than 150 basic operators, implement the Python-layer forward operator splitting mechanism and the static graph backward operator splitting mechanism, and support splitting of more than 70 commonly used forward and backward operators. For the CINN compiler, fix correctness bugs, develop key Passes, add manual schedule rules, implement automatic generation of kernel code, and improve the performance of the ResNet50 model by 12% and the Bert model by 10%. -- **Operator architecture unification of PHI operator library**: Unify all remaining 350+ operator kernels under the original operator system into the PHI operator library. Unify the way operators are defined in the original operator system into the operator definition form of the PHI operator library (YAML-based operator definition configuration), improving the consistency of the architecture and reducing the comprehension cost of framework development. Decouple the PHI operator library from all the Fluid header files it depends on and compile it independently as a dynamic link library, providing a lighter-weight way to reuse the operator library for secondary development of the framework. Continue to standardize and adjust non-standard operators and operator kernels in the PaddlePaddle framework, making them easier for developers to understand and reducing the cost of hardware adaptation. -- **Full go-live of new executor for static graph**: The new executor for static graph implements a number of functions and performance optimizations, and completes the unification and replacement of the original multiple sets of old executors. The new executor becomes the default back-end execution engine for the Python-side entry points of static graph single-card and distributed training, as well as for dynamic-to-static, control flow, CINN, etc. This significantly improves the scheduling performance of the framework and makes the functional architecture clearer.
Secondary development capability is significantly enhanced. -- **Python API supporting 0-dimensional tensor**: Define clear semantics to distinguish a tensor of shape [1,] from a tensor of shape [], and fix the behavior of many APIs, such as `paddle.sum`, to support tensors of shape []. -- **New environment adaptation**: Adapt to CUDA 12. Compilation with gcc12 is supported. - -## **2. Incompatibility Upgrade** -- PaddlePaddle API supports 0-dimensional tensor. PaddlePaddle previously used a 1-dimensional tensor with a shape of [1] in place of a 0-dimensional tensor, which differs from current mainstream practice. This increased the development and debugging cost of models and sometimes led to unintended errors. This release fixes 376 APIs that need to support 0-dimensional tensor, and implements tools widely used by the community such as EinOps. For example, in previous versions, the output loss in model training was a 1-dimensional tensor, so taking out or printing the loss often required code like `loss.numpy()[0]`. After this modification, the output loss in model training is a 0-dimensional tensor, and users can take out or print the loss with `loss.numpy()`. The code is short, easy to understand, and in line with industry practice. -- `paddle.fluid` API is fully decommissioned. According to the plan previewed in the last version, 1116 `paddle.fluid` APIs and related internal interfaces have been decommissioned, and the few remaining related internal interfaces will be cleaned up in the next version. The fluid API is among the historical APIs that PaddlePaddle 2.0 had planned to remove, but the cleanup was delayed in consideration of compatibility and other factors. This decommissioning will not affect programs developed based on PaddlePaddle 2.0, and the PaddlePaddle API system will be more concise and easier to understand.
-- Complete the cleanup of old-version dynamic graph code on the Python side. The Python side now uses only the new version of dynamic graph to call the C++ core logic. -- In order to unify the data parallel training method for static graph models, the original single-process multi-card training method is abandoned, including the `paddle.static.ParallelExecutor` and `paddle.static.CompiledProgram().with_data_parallel()` APIs, because this set of APIs supports only single-machine multi-card training, does not support multi-machine multi-card training, and has poor underlying execution performance. It is recommended to use the multi-process multi-card training method uniformly, i.e., the `paddle.distributed.launch` API for distributed training with data parallel. This upgrade affects only static graphs, and does not affect dynamic graphs or dynamic-to-static training. If you use the decommissioned APIs, please refer to the documentation on [data parallel](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/06_distributed_training/cluster_quick_start_collective_cn.html) to modify your model code. [#50351](https://github.com/PaddlePaddle/Paddle/pull/50351),[#50501](https://github.com/PaddlePaddle/Paddle/pull/50501),[#51240](https://github.com/PaddlePaddle/Paddle/pull/51240),[#51701](https://github.com/PaddlePaddle/Paddle/pull/51701),[#51616](https://github.com/PaddlePaddle/Paddle/pull/51616),[#51369](https://github.com/PaddlePaddle/Paddle/pull/51369),[#52671](https://github.com/PaddlePaddle/Paddle/pull/52671) -- Remove the original adaptation code for Ascend NPU and Cambricon MLU from the framework, upgrade all of it to CustomDevice plug-in adaptation, and migrate the adaptation code for Ascend NPU and Cambricon MLU to the PaddleCustomDevice repository. - -## 3.
Training Framework (Including Distributed) -### Python API -#### API supporting 0-dimensional tensor -- API input supports 0-dimensional tensor, involving `paddle.reshape `, `paddle.trace `, `paddle.linalg.norm ` and other 286 APIs. [#53208](https://github.com/PaddlePaddle/Paddle/pull/53208), [#53592](https://github.com/PaddlePaddle/Paddle/pull/53592), [#47074](https://github.com/PaddlePaddle/Paddle/pull/47074), [#53186](https://github.com/PaddlePaddle/Paddle/pull/53186), [#47677](https://github.com/PaddlePaddle/Paddle/pull/47677), [#49357](https://github.com/PaddlePaddle/Paddle/pull/49357), [#50237](https://github.com/PaddlePaddle/Paddle/pull/50237), [#46555](https://github.com/PaddlePaddle/Paddle/pull/46555), [#47219](https://github.com/PaddlePaddle/Paddle/pull/47219), [#47501](https://github.com/PaddlePaddle/Paddle/pull/47501), [#47858](https://github.com/PaddlePaddle/Paddle/pull/47858), [#47961](https://github.com/PaddlePaddle/Paddle/pull/47961), [#48058](https://github.com/PaddlePaddle/Paddle/pull/48058), [#48007](https://github.com/PaddlePaddle/Paddle/pull/48007), [#49755](https://github.com/PaddlePaddle/Paddle/pull/49755), [#51024](https://github.com/PaddlePaddle/Paddle/pull/51024), [#51566](https://github.com/PaddlePaddle/Paddle/pull/51566), [#51899](https://github.com/PaddlePaddle/Paddle/pull/51899), [#49813](https://github.com/PaddlePaddle/Paddle/pull/49813), [#47812](https://github.com/PaddlePaddle/Paddle/pull/47812), [#47849](https://github.com/PaddlePaddle/Paddle/pull/47849), [#47251](https://github.com/PaddlePaddle/Paddle/pull/47251), [#53125](https://github.com/PaddlePaddle/Paddle/pull/53125), [#53828](https://github.com/PaddlePaddle/Paddle/pull/53828), [#51265](https://github.com/PaddlePaddle/Paddle/pull/51265), [#47689](https://github.com/PaddlePaddle/Paddle/pull/47689), [#48452](https://github.com/PaddlePaddle/Paddle/pull/48452), [#49072](https://github.com/PaddlePaddle/Paddle/pull/49072), 
[#48638](https://github.com/PaddlePaddle/Paddle/pull/48638), [#49175](https://github.com/PaddlePaddle/Paddle/pull/49175), [#49279](https://github.com/PaddlePaddle/Paddle/pull/49279), [#50857](https://github.com/PaddlePaddle/Paddle/pull/50857), [#49805](https://github.com/PaddlePaddle/Paddle/pull/49805), [#47734](https://github.com/PaddlePaddle/Paddle/pull/47734), [#45992](https://github.com/PaddlePaddle/Paddle/pull/45992), [#49616](https://github.com/PaddlePaddle/Paddle/pull/49616), [#49959](https://github.com/PaddlePaddle/Paddle/pull/49959), [#50536](https://github.com/PaddlePaddle/Paddle/pull/50536), [#49544](https://github.com/PaddlePaddle/Paddle/pull/49544), [#49842](https://github.com/PaddlePaddle/Paddle/pull/49842), [#46909](https://github.com/PaddlePaddle/Paddle/pull/46909), [#49361](https://github.com/PaddlePaddle/Paddle/pull/49361), [#50169](https://github.com/PaddlePaddle/Paddle/pull/50169), [#48314](https://github.com/PaddlePaddle/Paddle/pull/48314), [#48735](https://github.com/PaddlePaddle/Paddle/pull/48735), [#49122](https://github.com/PaddlePaddle/Paddle/pull/49122), [#49122](https://github.com/PaddlePaddle/Paddle/pull/49122), [#49177](https://github.com/PaddlePaddle/Paddle/pull/49177), [#49501](https://github.com/PaddlePaddle/Paddle/pull/49501), [#49562](https://github.com/PaddlePaddle/Paddle/pull/49562), [#49340](https://github.com/PaddlePaddle/Paddle/pull/49340), [#49550](https://github.com/PaddlePaddle/Paddle/pull/49550), [#49596](https://github.com/PaddlePaddle/Paddle/pull/49596), [#49730](https://github.com/PaddlePaddle/Paddle/pull/49730), [#49667](https://github.com/PaddlePaddle/Paddle/pull/49667), [#49692](https://github.com/PaddlePaddle/Paddle/pull/49692), [#49854](https://github.com/PaddlePaddle/Paddle/pull/49854), [#49845](https://github.com/PaddlePaddle/Paddle/pull/49845), [#49803](https://github.com/PaddlePaddle/Paddle/pull/49803), [#49889](https://github.com/PaddlePaddle/Paddle/pull/49889), 
[#49904](https://github.com/PaddlePaddle/Paddle/pull/49904), [#49518](https://github.com/PaddlePaddle/Paddle/pull/49518), [#49884](https://github.com/PaddlePaddle/Paddle/pull/49884), [#49880](https://github.com/PaddlePaddle/Paddle/pull/49880), [#49862](https://github.com/PaddlePaddle/Paddle/pull/49862), [#49921](https://github.com/PaddlePaddle/Paddle/pull/49921), [#49260](https://github.com/PaddlePaddle/Paddle/pull/49260), [#49929](https://github.com/PaddlePaddle/Paddle/pull/49929), [#49570](https://github.com/PaddlePaddle/Paddle/pull/49570), [#49882](https://github.com/PaddlePaddle/Paddle/pull/49882), [#50213](https://github.com/PaddlePaddle/Paddle/pull/50213), [#49780](https://github.com/PaddlePaddle/Paddle/pull/49780), [#50271](https://github.com/PaddlePaddle/Paddle/pull/50271), [#50289](https://github.com/PaddlePaddle/Paddle/pull/50289), [#50293](https://github.com/PaddlePaddle/Paddle/pull/50293), [#49735](https://github.com/PaddlePaddle/Paddle/pull/49735), [#50433](https://github.com/PaddlePaddle/Paddle/pull/50433), [#49847](https://github.com/PaddlePaddle/Paddle/pull/49847), [#50635](https://github.com/PaddlePaddle/Paddle/pull/50635), [#50950](https://github.com/PaddlePaddle/Paddle/pull/50950), [#50947](https://github.com/PaddlePaddle/Paddle/pull/50947), [#49460](https://github.com/PaddlePaddle/Paddle/pull/49460), [#53087](https://github.com/PaddlePaddle/Paddle/pull/53087), [#51687](https://github.com/PaddlePaddle/Paddle/pull/51687), [#52185](https://github.com/PaddlePaddle/Paddle/pull/52185), [#54649](https://github.com/PaddlePaddle/Paddle/pull/54649) -- API output supports 0-dimensional tensor, involving `paddle.sum `, `paddle.min/max `, `paddle.any/all ` and other 90 APIs. 
[#52891](https://github.com/PaddlePaddle/Paddle/pull/52891), [#52861](https://github.com/PaddlePaddle/Paddle/pull/52861), [#52775](https://github.com/PaddlePaddle/Paddle/pull/52775), [#52850](https://github.com/PaddlePaddle/Paddle/pull/52850), [#52843](https://github.com/PaddlePaddle/Paddle/pull/52843), [#52857](https://github.com/PaddlePaddle/Paddle/pull/52857), [#51721](https://github.com/PaddlePaddle/Paddle/pull/51721), [#53051](https://github.com/PaddlePaddle/Paddle/pull/53051), [#53192](https://github.com/PaddlePaddle/Paddle/pull/53192), [#52739](https://github.com/PaddlePaddle/Paddle/pull/52739), [#52741](https://github.com/PaddlePaddle/Paddle/pull/52741), [#53175](https://github.com/PaddlePaddle/Paddle/pull/53175), [#51889](https://github.com/PaddlePaddle/Paddle/pull/51889), [#53199](https://github.com/PaddlePaddle/Paddle/pull/53199), [#53242](https://github.com/PaddlePaddle/Paddle/pull/53242), [#53421](https://github.com/PaddlePaddle/Paddle/pull/53421) -- In addition to the support of 0-dimensional tensor, fix the original non-standard codes, and provide hints and compatibility for non-standard usage in the model codes. [#51562](https://github.com/PaddlePaddle/Paddle/pull/51562), [#51586](https://github.com/PaddlePaddle/Paddle/pull/51586), [#51757](https://github.com/PaddlePaddle/Paddle/pull/51757), [#52197](https://github.com/PaddlePaddle/Paddle/pull/52197), [#54117](https://github.com/PaddlePaddle/Paddle/pull/54117)。 - -#### new API -- Add `paddle.autograd.jacobian` and `paddle.autograd.hessian` APIs for scientific computing. [#53331](https://github.com/PaddlePaddle/Paddle/pull/53331) -- Add sparse computing API. For example, `paddle.sparse.reshape `, `paddle.sparse.sum ` and `paddle.sparse.slice `. 
[#46694](https://github.com/PaddlePaddle/Paddle/pull/46694), [#51513](https://github.com/PaddlePaddle/Paddle/pull/51513), [#53794](https://github.com/PaddlePaddle/Paddle/pull/53794), [#51406](https://github.com/PaddlePaddle/Paddle/pull/51406) -- Add APIs, for example `paddle.optimizer.LBFGS`, `paddle.index_put` and `paddle.logaddexp`. [#53314](https://github.com/PaddlePaddle/Paddle/pull/53314), [#51912](https://github.com/PaddlePaddle/Paddle/pull/51912), [#52886](https://github.com/PaddlePaddle/Paddle/pull/52886), [#50843](https://github.com/PaddlePaddle/Paddle/pull/50843), [#47282](https://github.com/PaddlePaddle/Paddle/pull/47282), [#52284](https://github.com/PaddlePaddle/Paddle/pull/52284) - -### Dynamic graphs -#### New features -- Add `paddle.nn.utils.clip_grad_norm_` for gradient clipping support and `paddle.Tensor.data_ptr` for getting the memory/GPU memory address of the Tensor data. [PR49935](https://github.com/PaddlePaddle/Paddle/pull/49935), [PR48235](https://github.com/PaddlePaddle/Paddle/pull/48235), [PR49173](https://github.com/PaddlePaddle/Paddle/pull/49173) -- Add the saved_tensors_hooks mechanism, for temporary storage and retrieval of forward Tensors used in backward computation. [PR45763](https://github.com/PaddlePaddle/Paddle/pull/45763), [PR46215](https://github.com/PaddlePaddle/Paddle/pull/46215), [PR48124](https://github.com/PaddlePaddle/Paddle/pull/48124) -- Tensor supports pickle, for serialization of Tensor. [PR47025](https://github.com/PaddlePaddle/Paddle/pull/47025), [PR48179](https://github.com/PaddlePaddle/Paddle/pull/48179) -- Add debug logs that print the forward Python stack when nan/inf appears in the backward pass. [PR53217](https://github.com/PaddlePaddle/Paddle/pull/53217), [PR52639](https://github.com/PaddlePaddle/Paddle/pull/52639), [PR52729](https://github.com/PaddlePaddle/Paddle/pull/52729) -- Add support for higher-order differentiation of expand_v2, tile, concat, assign, and slice.
[PR45941](https://github.com/PaddlePaddle/Paddle/pull/45941), [PR45942](https://github.com/PaddlePaddle/Paddle/pull/45942), [PR45940](https://github.com/PaddlePaddle/Paddle/pull/45940), [PR45879](https://github.com/PaddlePaddle/Paddle/pull/45879), [PR45960](https://github.com/PaddlePaddle/Paddle/pull/45960) - -#### Improvements -- Optimize log printing for dynamic graphs, including log content, VLog level, and error reporting content. [PR45783](https://github.com/PaddlePaddle/Paddle/pull/45783), [PR46349](https://github.com/PaddlePaddle/Paddle/pull/46349), [PR46934](https://github.com/PaddlePaddle/Paddle/pull/46934), [PR47724](https://github.com/PaddlePaddle/Paddle/pull/47724) -- Add FLAGS_auto_growth_chunk_size_in_mb for minimum chunk size settings of auto_growth_allocator. [PR52204](https://github.com/PaddlePaddle/Paddle/pull/52204) - -#### bug fix -- Fix bugs in some operators, including batch_norm, slice, set_value, scale, multinomial, adam, conv, transpose2_grad, conv2d_transpose_double_grad. [PR47802](https://github.com/PaddlePaddle/Paddle/pull/47802), [PR47634](https://github.com/PaddlePaddle/Paddle/pull/47634), [PR47349](https://github.com/PaddlePaddle/Paddle/pull/47349), [PR46124](https://github.com/PaddlePaddle/Paddle/pull/46124), [PR46147](https://github.com/PaddlePaddle/Paddle/pull/46147), [PR50388](https://github.com/PaddlePaddle/Paddle/pull/50388), [PR48626](https://github.com/PaddlePaddle/Paddle/pull/48626), [PR48519](https://github.com/PaddlePaddle/Paddle/pull/48519), [PR50386](https://github.com/PaddlePaddle/Paddle/pull/50386), [PR48432](https://github.com/PaddlePaddle/Paddle/pull/48432), [PR51851](https://github.com/PaddlePaddle/Paddle/pull/51851) -- Fix some PyLayer bugs. 
[PR51740](https://github.com/PaddlePaddle/Paddle/pull/51740), [PR47154](https://github.com/PaddlePaddle/Paddle/pull/47154), [PR47323](https://github.com/PaddlePaddle/Paddle/pull/47323), [PR54041](https://github.com/PaddlePaddle/Paddle/pull/54041), [PR48533](https://github.com/PaddlePaddle/Paddle/pull/48533) -- Ensure that sync_batch_norm executes in order in the backward pass, to avoid hangs or precision errors caused by misordering. [PR52268](https://github.com/PaddlePaddle/Paddle/pull/52268), [PR52860](https://github.com/PaddlePaddle/Paddle/pull/52860), [PR52779](https://github.com/PaddlePaddle/Paddle/pull/52779) -- Fix a bug of linspace under AMP. [PR46088](https://github.com/PaddlePaddle/Paddle/pull/46088) -- Fix an incorrect Python C API call that causes crashes on Windows. [PR46833](https://github.com/PaddlePaddle/Paddle/pull/46833) -- Fix the bug that DataLoader may miss deleting `/dev/shm`. [PR48511](https://github.com/PaddlePaddle/Paddle/pull/48511) -- Fix some bugs of paddle.grad. [PR47151](https://github.com/PaddlePaddle/Paddle/pull/47151) -- Add an error message for operators that do not support higher-order differentiation. [PR47231](https://github.com/PaddlePaddle/Paddle/pull/47231) -- Add numpy array support for Python operators. [PR48229](https://github.com/PaddlePaddle/Paddle/pull/48229) -- Delete one of the two element_size APIs. [PR49631](https://github.com/PaddlePaddle/Paddle/pull/49631) -- Fix the crash that occurs when enabling VLOG for the old dynamic graph. [PR47115](https://github.com/PaddlePaddle/Paddle/pull/47115) -- For XPU, replace d2d copies with d2h+h2d, to solve the multi-threading problem. [PR48373](https://github.com/PaddlePaddle/Paddle/pull/48373) - -#### Performance optimization -- Move the implementation of Python operators down to C++ to improve API performance. This class of APIs shows a 3x to 6x performance improvement after the move.
[PR45811](https://github.com/PaddlePaddle/Paddle/pull/45811), [PR46326](https://github.com/PaddlePaddle/Paddle/pull/46326), [PR46329](https://github.com/PaddlePaddle/Paddle/pull/46329), [PR46520](https://github.com/PaddlePaddle/Paddle/pull/46520), [PR46542](https://github.com/PaddlePaddle/Paddle/pull/46542), [PR46565](https://github.com/PaddlePaddle/Paddle/pull/46565), [PR47060](https://github.com/PaddlePaddle/Paddle/pull/47060), [PR47077](https://github.com/PaddlePaddle/Paddle/pull/47077), [PR47174](https://github.com/PaddlePaddle/Paddle/pull/47174), [PR47315](https://github.com/PaddlePaddle/Paddle/pull/47315) -- Optimize the Optimizer CPU scheduling performance to reduce GPU Gap caused by Optimizer phase. [PR49787](https://github.com/PaddlePaddle/Paddle/pull/49787), [PR50188](https://github.com/PaddlePaddle/Paddle/pull/50188)[, PR51340](https://github.com/PaddlePaddle/Paddle/pull/51340), [PR49864](https://github.com/PaddlePaddle/Paddle/pull/49864), [PR50158](https://github.com/PaddlePaddle/Paddle/pull/50158), [PR50335](https://github.com/PaddlePaddle/Paddle/pull/50335) -- According to the logic that API can be sunk to C++, API is sunk to C++ to improve API performance. [PR46412](https://github.com/PaddlePaddle/Paddle/pull/46412), [PR46190](https://github.com/PaddlePaddle/Paddle/pull/46190) -- Optimize unnecessary call logic on Python side under dynamic graph, to improve API performance. 
[PR46221](https://github.com/PaddlePaddle/Paddle/pull/46221), [PR49473](https://github.com/PaddlePaddle/Paddle/pull/49473), [PR49574](https://github.com/PaddlePaddle/Paddle/pull/49574), [PR49589](https://github.com/PaddlePaddle/Paddle/pull/49589), [PR49612](https://github.com/PaddlePaddle/Paddle/pull/49612), [PR49717](https://github.com/PaddlePaddle/Paddle/pull/49717), [PR49733](https://github.com/PaddlePaddle/Paddle/pull/49733), [PR49823](https://github.com/PaddlePaddle/Paddle/pull/49823), [PR49508](https://github.com/PaddlePaddle/Paddle/pull/49508), [PR46840](https://github.com/PaddlePaddle/Paddle/pull/46840) -- Optimize use of Allocator to improve dynamic graph API scheduling performance. [PR47125](https://github.com/PaddlePaddle/Paddle/pull/47125), [PR48548](https://github.com/PaddlePaddle/Paddle/pull/48548), [PR50995](https://github.com/PaddlePaddle/Paddle/pull/50995), [PR47731](https://github.com/PaddlePaddle/Paddle/pull/47731) -- Optimize fused_attention operator performance. [PR48902](https://github.com/PaddlePaddle/Paddle/pull/48902) -- For the optimizer's _add_accumulator, if the device is CPU and the mode is dynamic graph, use full to initialize var directly. [PR48189](https://github.com/PaddlePaddle/Paddle/pull/48189) -- Prune unnecessarily executed subgraphs of the backward graph to improve performance. [PR47827](https://github.com/PaddlePaddle/Paddle/pull/47827) -- Optimize performance of initializers. [PR46033](https://github.com/PaddlePaddle/Paddle/pull/46033) -- Add a fused dropout add operator to improve computation performance when dropout and add are used together. [#52903](https://github.com/PaddlePaddle/Paddle/pull/52903) - -### Static graphs -#### The new static graph executor is now fully live. -The new executor for static graph implements a number of functions and performance optimizations, and completes the unification and replacement of the original multiple sets of old executors.
The new executor becomes the default back-end execution engine for the Python-side entry points of static graph single-card and distributed training, as well as for dynamic-to-static, control flow, CINN, etc. This significantly improves the scheduling performance of the framework and makes the functional architecture clearer. Secondary development capability is significantly enhanced. [#45913](https://github.com/PaddlePaddle/Paddle/pull/45913),[#46025](https://github.com/PaddlePaddle/Paddle/pull/46025),[#48911](https://github.com/PaddlePaddle/Paddle/pull/48911),[#50239](https://github.com/PaddlePaddle/Paddle/pull/50239),[#45696](https://github.com/PaddlePaddle/Paddle/pull/45696),[#46092](https://github.com/PaddlePaddle/Paddle/pull/46092),[#48158](https://github.com/PaddlePaddle/Paddle/pull/48158),[#51389](https://github.com/PaddlePaddle/Paddle/pull/51389),[#49708](https://github.com/PaddlePaddle/Paddle/pull/49708),[#49275](https://github.com/PaddlePaddle/Paddle/pull/49275),[#48789](https://github.com/PaddlePaddle/Paddle/pull/48789),[#49939](https://github.com/PaddlePaddle/Paddle/pull/49939),[#51149](https://github.com/PaddlePaddle/Paddle/pull/51149),[#52652](https://github.com/PaddlePaddle/Paddle/pull/52652) - -### Operator library -#### Enhance functions of customized operators -Add support in the custom extension mechanism for binding C++ extension computation functions to the Python side, further enhancing the framework's secondary development capabilities. Custom hardware can now use the custom operator mechanism, meeting hardware manufacturers' need to implement operations that Paddle does not already provide. Custom operators now support advanced mechanisms such as `inplace`, `vector<Tensor>` output, and `optional<Tensor>` input.
Optimize the scheduling performance of custom operators in dynamic graph mode, with a 25.4% performance improvement for operators with multiple input parameters. Add commonly used operators and APIs to the Tensor extension for custom operators. Support chained calls to simplify code writing. Optimize the operator kernel selection mechanism. Improve the logic of some operator kernels, extend the supported data types, and optimize performance. Add and improve 100+ XPU kernels. Fix 170+ bugs. -[#49222](https://github.com/PaddlePaddle/Paddle/pull/49222), [#51773](https://github.com/PaddlePaddle/Paddle/pull/51773), [#51923](https://github.com/PaddlePaddle/Paddle/pull/51923), [#53080](https://github.com/PaddlePaddle/Paddle/pull/53080), [#50731](https://github.com/PaddlePaddle/Paddle/pull/50731), [#50563](https://github.com/PaddlePaddle/Paddle/pull/50563), [#50840](https://github.com/PaddlePaddle/Paddle/pull/50840), [#50983](https://github.com/PaddlePaddle/Paddle/pull/50983), [#51713](https://github.com/PaddlePaddle/Paddle/pull/51713), [#48733](https://github.com/PaddlePaddle/Paddle/pull/48733), [#50558](https://github.com/PaddlePaddle/Paddle/pull/50558), [#50764](https://github.com/PaddlePaddle/Paddle/pull/50764), [#51973](https://github.com/PaddlePaddle/Paddle/pull/51973), [#52216](https://github.com/PaddlePaddle/Paddle/pull/52216), [#51027](https://github.com/PaddlePaddle/Paddle/pull/51027), [#50745](https://github.com/PaddlePaddle/Paddle/pull/50745), [#50756](https://github.com/PaddlePaddle/Paddle/pull/50756), [#50886](https://github.com/PaddlePaddle/Paddle/pull/50886), [#50813](https://github.com/PaddlePaddle/Paddle/pull/50813), [#50869](https://github.com/PaddlePaddle/Paddle/pull/50869), [#51085](https://github.com/PaddlePaddle/Paddle/pull/51085), [#51646](https://github.com/PaddlePaddle/Paddle/pull/51646), [#51620](https://github.com/PaddlePaddle/Paddle/pull/51620), [#51844](https://github.com/PaddlePaddle/Paddle/pull/51844), 
[#52421](https://github.com/PaddlePaddle/Paddle/pull/52421), [#52872](https://github.com/PaddlePaddle/Paddle/pull/52872), [#52597](https://github.com/PaddlePaddle/Paddle/pull/52597), [#50582](https://github.com/PaddlePaddle/Paddle/pull/50582), [#52114](https://github.com/PaddlePaddle/Paddle/pull/52114), [#52915](https://github.com/PaddlePaddle/Paddle/pull/52915), [#50928](https://github.com/PaddlePaddle/Paddle/pull/50928), [#48272](https://github.com/PaddlePaddle/Paddle/pull/48272), [#48702](https://github.com/PaddlePaddle/Paddle/pull/48702), [#52191](https://github.com/PaddlePaddle/Paddle/pull/52191), [#52191](https://github.com/PaddlePaddle/Paddle/pull/52191), [#47374](https://github.com/PaddlePaddle/Paddle/pull/47374), [#47375](https://github.com/PaddlePaddle/Paddle/pull/47375), [#47378](https://github.com/PaddlePaddle/Paddle/pull/47378), [#54126](https://github.com/PaddlePaddle/Paddle/pull/54126), [#47638](https://github.com/PaddlePaddle/Paddle/pull/47638), [#47661](https://github.com/PaddlePaddle/Paddle/pull/47661), [#50606](https://github.com/PaddlePaddle/Paddle/pull/50606), [#53528](https://github.com/PaddlePaddle/Paddle/pull/53528), [#50599](https://github.com/PaddlePaddle/Paddle/pull/50599), [#51727](https://github.com/PaddlePaddle/Paddle/pull/51727), [#50825](https://github.com/PaddlePaddle/Paddle/pull/50825), [#50773](https://github.com/PaddlePaddle/Paddle/pull/50773), [#50979](https://github.com/PaddlePaddle/Paddle/pull/50979), [#53336](https://github.com/PaddlePaddle/Paddle/pull/53336), [#53555](https://github.com/PaddlePaddle/Paddle/pull/53555), [#53716](https://github.com/PaddlePaddle/Paddle/pull/53716), [#53753](https://github.com/PaddlePaddle/Paddle/pull/53753), [#53981](https://github.com/PaddlePaddle/Paddle/pull/53981), [#53977](https://github.com/PaddlePaddle/Paddle/pull/53977), [#53980](https://github.com/PaddlePaddle/Paddle/pull/53980), [#54043](https://github.com/PaddlePaddle/Paddle/pull/54043), 
[#54066](https://github.com/PaddlePaddle/Paddle/pull/54066), [#52866](https://github.com/PaddlePaddle/Paddle/pull/52866), [#53043](https://github.com/PaddlePaddle/Paddle/pull/53043), [#53325](https://github.com/PaddlePaddle/Paddle/pull/53325), [#54323](https://github.com/PaddlePaddle/Paddle/pull/54323), [#54367](https://github.com/PaddlePaddle/Paddle/pull/54367), [#51353](https://github.com/PaddlePaddle/Paddle/pull/51353), [#53749](https://github.com/PaddlePaddle/Paddle/pull/53749), [#50013](https://github.com/PaddlePaddle/Paddle/pull/50013), [#47570](https://github.com/PaddlePaddle/Paddle/pull/47570), [#50997](https://github.com/PaddlePaddle/Paddle/pull/50997), [#51241](https://github.com/PaddlePaddle/Paddle/pull/51241), [#49537](https://github.com/PaddlePaddle/Paddle/pull/49537) - -#### Unification of operator architecture -Unify all remaining 350+ operator kernels under the original operator system into PHI operator library. Unify the way of defining operator in the original operator system into the operator definition form of PHI operator library (configuration of operator definition based on YAML), enhancing unity of the architecture, and reducing comprehension cost of framework development. Decouple all Fluid header files the PHI operator library depends on and compile them independently as dynamic link libraries to provide a lighter reuse of the operator library for secondary development of the framework. Continue to standardize and adjust unspecified operators, as well as operator kernels in the PaddlePaddle framework. It is easy for developers to understand and reduce cost of accessing hardware. 
-[#47856](https://github.com/PaddlePaddle/Paddle/pull/47856), [#49328](https://github.com/PaddlePaddle/Paddle/pull/49328), [#49138](https://github.com/PaddlePaddle/Paddle/pull/49138), [#52014](https://github.com/PaddlePaddle/Paddle/pull/52014), [#52044](https://github.com/PaddlePaddle/Paddle/pull/52044), [#52116](https://github.com/PaddlePaddle/Paddle/pull/52116), [#52486](https://github.com/PaddlePaddle/Paddle/pull/52486), [#52101](https://github.com/PaddlePaddle/Paddle/pull/52101), [#52882](https://github.com/PaddlePaddle/Paddle/pull/52882), [#53003](https://github.com/PaddlePaddle/Paddle/pull/53003), [#53034](https://github.com/PaddlePaddle/Paddle/pull/53034), [#51914](https://github.com/PaddlePaddle/Paddle/pull/51914), [#49116](https://github.com/PaddlePaddle/Paddle/pull/49116), [#52626](https://github.com/PaddlePaddle/Paddle/pull/52626), [#52878](https://github.com/PaddlePaddle/Paddle/pull/52878), [#52879](https://github.com/PaddlePaddle/Paddle/pull/52879), [#52880](https://github.com/PaddlePaddle/Paddle/pull/52880), [#52875](https://github.com/PaddlePaddle/Paddle/pull/52875), [#51600](https://github.com/PaddlePaddle/Paddle/pull/51600), [#51601](https://github.com/PaddlePaddle/Paddle/pull/51601), [#51590](https://github.com/PaddlePaddle/Paddle/pull/51590), [#51887](https://github.com/PaddlePaddle/Paddle/pull/51887), [#51891](https://github.com/PaddlePaddle/Paddle/pull/51891), [#52036](https://github.com/PaddlePaddle/Paddle/pull/52036), [#52130](https://github.com/PaddlePaddle/Paddle/pull/52130), [#52134](https://github.com/PaddlePaddle/Paddle/pull/52134), [#51951](https://github.com/PaddlePaddle/Paddle/pull/51951), [#51886](https://github.com/PaddlePaddle/Paddle/pull/51886), [#52274](https://github.com/PaddlePaddle/Paddle/pull/52274), [#52263](https://github.com/PaddlePaddle/Paddle/pull/52263), [#51913](https://github.com/PaddlePaddle/Paddle/pull/51913), [#52145](https://github.com/PaddlePaddle/Paddle/pull/52145), 
[#52347](https://github.com/PaddlePaddle/Paddle/pull/52347), [#52370](https://github.com/PaddlePaddle/Paddle/pull/52370), [#52437](https://github.com/PaddlePaddle/Paddle/pull/52437), [#52424](https://github.com/PaddlePaddle/Paddle/pull/52424), [#52231](https://github.com/PaddlePaddle/Paddle/pull/52231), [#52522](https://github.com/PaddlePaddle/Paddle/pull/52522), [#52529](https://github.com/PaddlePaddle/Paddle/pull/52529), [#52802](https://github.com/PaddlePaddle/Paddle/pull/52802), [#52799](https://github.com/PaddlePaddle/Paddle/pull/52799), [#52855](https://github.com/PaddlePaddle/Paddle/pull/52855), [#52711](https://github.com/PaddlePaddle/Paddle/pull/52711), [#52940](https://github.com/PaddlePaddle/Paddle/pull/52940), [#53309](https://github.com/PaddlePaddle/Paddle/pull/53309), [#47817](https://github.com/PaddlePaddle/Paddle/pull/47817), [#48001](https://github.com/PaddlePaddle/Paddle/pull/48001), [#48063](https://github.com/PaddlePaddle/Paddle/pull/48063), [#48049](https://github.com/PaddlePaddle/Paddle/pull/48049), [#48168](https://github.com/PaddlePaddle/Paddle/pull/48168), [#48415](https://github.com/PaddlePaddle/Paddle/pull/48415), [#48696](https://github.com/PaddlePaddle/Paddle/pull/48696), [#48970](https://github.com/PaddlePaddle/Paddle/pull/48970), [#50183](https://github.com/PaddlePaddle/Paddle/pull/50183), [#50407](https://github.com/PaddlePaddle/Paddle/pull/50407), [#50498](https://github.com/PaddlePaddle/Paddle/pull/50498), [#50419](https://github.com/PaddlePaddle/Paddle/pull/50419), [#50282](https://github.com/PaddlePaddle/Paddle/pull/50282), [#50870](https://github.com/PaddlePaddle/Paddle/pull/50870), [#50911](https://github.com/PaddlePaddle/Paddle/pull/50911), [#50865](https://github.com/PaddlePaddle/Paddle/pull/50865), [#51288](https://github.com/PaddlePaddle/Paddle/pull/51288), [#53735](https://github.com/PaddlePaddle/Paddle/pull/53735), [#47248](https://github.com/PaddlePaddle/Paddle/pull/47248), 
[#47787](https://github.com/PaddlePaddle/Paddle/pull/47787), [#52202](https://github.com/PaddlePaddle/Paddle/pull/52202), -[#47579](https://github.com/PaddlePaddle/Paddle/pull/47579), [#49444](https://github.com/PaddlePaddle/Paddle/pull/49444), [#45772](https://github.com/PaddlePaddle/Paddle/pull/45772), [#51264](https://github.com/PaddlePaddle/Paddle/pull/51264), [#51634](https://github.com/PaddlePaddle/Paddle/pull/51634), [#51631](https://github.com/PaddlePaddle/Paddle/pull/51631), [#47385](https://github.com/PaddlePaddle/Paddle/pull/47385), [#46342](https://github.com/PaddlePaddle/Paddle/pull/46342), [#47510](https://github.com/PaddlePaddle/Paddle/pull/47510), [#47532](https://github.com/PaddlePaddle/Paddle/pull/47532), [#47702](https://github.com/PaddlePaddle/Paddle/pull/47702), [#47860](https://github.com/PaddlePaddle/Paddle/pull/47860), [#49470](https://github.com/PaddlePaddle/Paddle/pull/49470), [#50358](https://github.com/PaddlePaddle/Paddle/pull/50358), [#49121](https://github.com/PaddlePaddle/Paddle/pull/49121), [#50190](https://github.com/PaddlePaddle/Paddle/pull/50190), [#52374](https://github.com/PaddlePaddle/Paddle/pull/52374), [#52372](https://github.com/PaddlePaddle/Paddle/pull/52372), [#52375](https://github.com/PaddlePaddle/Paddle/pull/52375), [#52371](https://github.com/PaddlePaddle/Paddle/pull/52371) - -### Dynamic-to-static plus combinator -#### New features -- Add the combination rules for combinators such as dropout, silu, stack, relu, expand, unsqueeze, pow, squeeze, meshgrid, batch_norm, layer_norm, group_norm, instance_norm, full_like, split, split_with_num, gelu, mean, flatten, rsqrt, hardswish [#50497](https://github.com/PaddlePaddle/Paddle/pull/50497), [#50838](https://github.com/PaddlePaddle/Paddle/pull/50838), [#50861](https://github.com/PaddlePaddle/Paddle/pull/50861), [#50819](https://github.com/PaddlePaddle/Paddle/pull/50819), [#50810](https://github.com/PaddlePaddle/Paddle/pull/50810), 
[#51527](https://github.com/PaddlePaddle/Paddle/pull/51527), [#51070](https://github.com/PaddlePaddle/Paddle/pull/51070), [#51539](https://github.com/PaddlePaddle/Paddle/pull/51539), [#51061](https://github.com/PaddlePaddle/Paddle/pull/51061), [#49894](https://github.com/PaddlePaddle/Paddle/pull/49894), [#50422](https://github.com/PaddlePaddle/Paddle/pull/50422), [#51874](https://github.com/PaddlePaddle/Paddle/pull/51874), [#51341](https://github.com/PaddlePaddle/Paddle/pull/51341), [#50295](https://github.com/PaddlePaddle/Paddle/pull/50295), [#50298](https://github.com/PaddlePaddle/Paddle/pull/50298), [#50672](https://github.com/PaddlePaddle/Paddle/pull/50672), [#51432](https://github.com/PaddlePaddle/Paddle/pull/51432), [#51003](https://github.com/PaddlePaddle/Paddle/pull/51003) -- Add the vjp rule for combinators such as gather_nd, reduce_max, group_norm, relu, reduce_max, gather, topk, sqrt, elementwise_pow, softmax, batch_norm, prod, multiply, expand, div, relu, slice, cumsum, sigmoid, layer_norm, sin, cos, roll, instance_norm, abs, assign, tile, scatter_nd_add, erf, floor, log, silu, leaky_relu, pad [#50966](https://github.com/PaddlePaddle/Paddle/pull/50966), [#51653](https://github.com/PaddlePaddle/Paddle/pull/51653), [#52663](https://github.com/PaddlePaddle/Paddle/pull/52663), [#51742](https://github.com/PaddlePaddle/Paddle/pull/51742), [#52203](https://github.com/PaddlePaddle/Paddle/pull/52203), [#50794](https://github.com/PaddlePaddle/Paddle/pull/50794), [#50305](https://github.com/PaddlePaddle/Paddle/pull/50305), [#50786](https://github.com/PaddlePaddle/Paddle/pull/50786), [#50679](https://github.com/PaddlePaddle/Paddle/pull/50679), [#51045](https://github.com/PaddlePaddle/Paddle/pull/51045), [#51230](https://github.com/PaddlePaddle/Paddle/pull/51230), [#51474](https://github.com/PaddlePaddle/Paddle/pull/51474), [#51283](https://github.com/PaddlePaddle/Paddle/pull/51283), [#51238](https://github.com/PaddlePaddle/Paddle/pull/51238), 
[#49831](https://github.com/PaddlePaddle/Paddle/pull/49831), [#51838](https://github.com/PaddlePaddle/Paddle/pull/51838), [#50771](https://github.com/PaddlePaddle/Paddle/pull/50771), [#50565](https://github.com/PaddlePaddle/Paddle/pull/50565), [#51768](https://github.com/PaddlePaddle/Paddle/pull/51768), [#51750](https://github.com/PaddlePaddle/Paddle/pull/51750), [#51748](https://github.com/PaddlePaddle/Paddle/pull/51748), [#52532](https://github.com/PaddlePaddle/Paddle/pull/52532), [#52935](https://github.com/PaddlePaddle/Paddle/pull/52935), [#50963](https://github.com/PaddlePaddle/Paddle/pull/50963), [#51430](https://github.com/PaddlePaddle/Paddle/pull/51430), [#53141](https://github.com/PaddlePaddle/Paddle/pull/53141), [#52469](https://github.com/PaddlePaddle/Paddle/pull/52469), [#50436](https://github.com/PaddlePaddle/Paddle/pull/50436), [#51059](https://github.com/PaddlePaddle/Paddle/pull/51059), [#51296](https://github.com/PaddlePaddle/Paddle/pull/51296), [#52533](https://github.com/PaddlePaddle/Paddle/pull/52533), [#53374](https://github.com/PaddlePaddle/Paddle/pull/53374) -- Add the second-order differentiation rule for combinators such as matmul, tanh, and elementwise [#50452](https://github.com/PaddlePaddle/Paddle/pull/50452), [#52192](https://github.com/PaddlePaddle/Paddle/pull/52192), [#53014](https://github.com/PaddlePaddle/Paddle/pull/53014) -- Add the bf16 datatype support for combinators such as exp, reduce_mean, softmax, divide, cast, layer_norm, prod, meshgrid, expand_as, dropout, concat, gather_nd, elementwise_max, elementwise_pow, reduce_max [#54263](https://github.com/PaddlePaddle/Paddle/pull/54263), [#54236](https://github.com/PaddlePaddle/Paddle/pull/54236), [#53865](https://github.com/PaddlePaddle/Paddle/pull/53865), [#54175](https://github.com/PaddlePaddle/Paddle/pull/54175), [#54399](https://github.com/PaddlePaddle/Paddle/pull/54399) -- Add support for assigning semantics to containers in control flow in dynamic-to-static. 
[#51248](https://github.com/PaddlePaddle/Paddle/pull/51248) -- For to_static, add a whole-graph fallback. When dynamic-to-static conversion fails, the whole graph can fall back to dynamic graph execution. Add the set_eval_frame API for the fallback mechanism. [#50111](https://github.com/PaddlePaddle/Paddle/pull/50111), [#52006](https://github.com/PaddlePaddle/Paddle/pull/52006) -- For to_static, support the combinator mechanism, and support using register_hook in functions decorated with to_static. [#49836](https://github.com/PaddlePaddle/Paddle/pull/49836), [#52948](https://github.com/PaddlePaddle/Paddle/pull/52948), [#53572](https://github.com/PaddlePaddle/Paddle/pull/53572) -- Add a backend parameter to the to_static API. It can be set to `CINN` or None. When it is set to `CINN`, the CINN compiler is used to accelerate training and inference. [#52596](https://github.com/PaddlePaddle/Paddle/pull/52596) -- Add automatic code generation for the primitive API, based on the operator definitions in ops.yaml and legacy_ops.yaml. Automatically generate the Tensor computation API. [#50315](https://github.com/PaddlePaddle/Paddle/pull/50315), [#49654](https://github.com/PaddlePaddle/Paddle/pull/49654), [#50642](https://github.com/PaddlePaddle/Paddle/pull/50642) -- Add forward combination of operators. Registering combination rules for forward operators allows them to be split into base operators. [#49605](https://github.com/PaddlePaddle/Paddle/pull/49605) -- Add the combinator switch. Environment variables set in the shell control how operators are split. [#50309](https://github.com/PaddlePaddle/Paddle/pull/50309) -- Add an `OpTest` combination test function to guarantee operator accuracy. Add unit tests for elementwise base operators. Add a batch_norm CINN unit test. 
[#50509](https://github.com/PaddlePaddle/Paddle/pull/50509), [#50807](https://github.com/PaddlePaddle/Paddle/pull/50807), [#52815](https://github.com/PaddlePaddle/Paddle/pull/52815) - -#### Improvements -- Add FP16 and AMP O1 support to the combinator. Add AMP logic for the softmax and layer_norm operators. [#52397](https://github.com/PaddlePaddle/Paddle/pull/52397), [#52598](https://github.com/PaddlePaddle/Paddle/pull/52598), [#51473](https://github.com/PaddlePaddle/Paddle/pull/51473) -- Simplify the combination rules and vjp rules of the combinator batch_norm. [#54012](https://github.com/PaddlePaddle/Paddle/pull/54012), [#51827](https://github.com/PaddlePaddle/Paddle/pull/51827), [#51933](https://github.com/PaddlePaddle/Paddle/pull/51933) -- Optimize combination rules for combinators, and improve the performance of combination rules that contain scalars. Optimize log printing for combinators. [#51960](https://github.com/PaddlePaddle/Paddle/pull/51960), [#50160](https://github.com/PaddlePaddle/Paddle/pull/50160) -- Support the jit.save API in the combinator. Add a custom VJP rule API. [#52344](https://github.com/PaddlePaddle/Paddle/pull/52344), [#50885](https://github.com/PaddlePaddle/Paddle/pull/50885) -- Remove the overwrite parameter from combinator gather_grad. [#52707](https://github.com/PaddlePaddle/Paddle/pull/52707) -- Clean up dynamic-to-static code style, optimize error messages, and standardize logs. [#48637](https://github.com/PaddlePaddle/Paddle/pull/48637), [#46128](https://github.com/PaddlePaddle/Paddle/pull/46128), [#52527](https://github.com/PaddlePaddle/Paddle/pull/52527), [#46800](https://github.com/PaddlePaddle/Paddle/pull/46800), [#46415](https://github.com/PaddlePaddle/Paddle/pull/46415) -- For dynamic-to-static, call append backward to get `grad var name`, fixing the error in high-order gradient computation. 
[#53250](https://github.com/PaddlePaddle/Paddle/pull/53250) -- Upgrade the dynamic-to-static function, and clean up the temporary directory of to_static to speed up code conversion. Enhance to_static to automatically skip internal APIs. Support use of the to_static decorator in the program. [#47102](https://github.com/PaddlePaddle/Paddle/pull/47102), [#50596](https://github.com/PaddlePaddle/Paddle/pull/50596), [#45768](https://github.com/PaddlePaddle/Paddle/pull/45768) -- For dynamic-to-static, optimize `print` function conversion to support printing Tensor parameters at the networking stage. Upgrade the parameter collection mechanism. [#48672](https://github.com/PaddlePaddle/Paddle/pull/48672), [#50336](https://github.com/PaddlePaddle/Paddle/pull/50336) - -#### Bug fixes -- For the combinator, fix cmake compilation errors. Fix CUDA 12 test errors. Fix bugs in operators such as meshgrid, expand_as, concat, conv, and arange. [#49643](https://github.com/PaddlePaddle/Paddle/pull/49643), [#54622](https://github.com/PaddlePaddle/Paddle/pull/54622), [#53951](https://github.com/PaddlePaddle/Paddle/pull/53951), [#53350](https://github.com/PaddlePaddle/Paddle/pull/53350), [#51486](https://github.com/PaddlePaddle/Paddle/pull/51486), [#52764](https://github.com/PaddlePaddle/Paddle/pull/52764) -- For the combinator, fix bugs in scenarios such as rank=1, shape=-1, amp, and multi-process. [#51413](https://github.com/PaddlePaddle/Paddle/pull/51413), [#51435](https://github.com/PaddlePaddle/Paddle/pull/51435), [#50518](https://github.com/PaddlePaddle/Paddle/pull/50518), [#47301](https://github.com/PaddlePaddle/Paddle/pull/47301) -- For the combinator, fix bugs in automatic code generation of composite grad maker and static prim api. Fix bugs where op creation attributes are missing and where some combination rules do not take effect. 
[#50854](https://github.com/PaddlePaddle/Paddle/pull/50854), [#51445](https://github.com/PaddlePaddle/Paddle/pull/51445), [#50780](https://github.com/PaddlePaddle/Paddle/pull/50780), [#52120](https://github.com/PaddlePaddle/Paddle/pull/52120) -- Fix some other bugs for combinators [#50086](https://github.com/PaddlePaddle/Paddle/pull/50086), [#51208](https://github.com/PaddlePaddle/Paddle/pull/51208), [#51577](https://github.com/PaddlePaddle/Paddle/pull/51577), [#53598](https://github.com/PaddlePaddle/Paddle/pull/53598), [#47500](https://github.com/PaddlePaddle/Paddle/pull/47500), [#52119](https://github.com/PaddlePaddle/Paddle/pull/52119), [#50397](https://github.com/PaddlePaddle/Paddle/pull/50397), [#50527](https://github.com/PaddlePaddle/Paddle/pull/50527), [#50788](https://github.com/PaddlePaddle/Paddle/pull/50788), [#51014](https://github.com/PaddlePaddle/Paddle/pull/51014), [#52154](https://github.com/PaddlePaddle/Paddle/pull/52154), [#52752](https://github.com/PaddlePaddle/Paddle/pull/52752) -- For dynamic-to-static, fix the bugs of dataloader, cond input dict, transformer import, T5 model memory leak, and grad var name parsing error. [#49821](https://github.com/PaddlePaddle/Paddle/pull/49821), [#47299](https://github.com/PaddlePaddle/Paddle/pull/47299), [#50776](https://github.com/PaddlePaddle/Paddle/pull/50776), [#50883](https://github.com/PaddlePaddle/Paddle/pull/50883), [#51100](https://github.com/PaddlePaddle/Paddle/pull/51100), [#51464](https://github.com/PaddlePaddle/Paddle/pull/51464), [#51966](https://github.com/PaddlePaddle/Paddle/pull/51966), [#52110](https://github.com/PaddlePaddle/Paddle/pull/52110), [#52821](https://github.com/PaddlePaddle/Paddle/pull/52821) -- For dynamic-to-static, fix the bugs of Lazy initialization, Windows training, is_paddle_func failure, and recurrent op failure to delete pass. 
[#50785](https://github.com/PaddlePaddle/Paddle/pull/50785), [#52580](https://github.com/PaddlePaddle/Paddle/pull/52580), [#51585](https://github.com/PaddlePaddle/Paddle/pull/51585), [#51763](https://github.com/PaddlePaddle/Paddle/pull/51763), [#51763](https://github.com/PaddlePaddle/Paddle/pull/51763) - -#### Performance optimization -- Add scope caching and reuse mechanism during execution of run_program_op in dynamic-to-static, to avoid passing new scope for each step. [#45813](https://github.com/PaddlePaddle/Paddle/pull/45813) - -### Distributed training -#### Dynamic graph distributed training -- Remove the distributed sharding API in the old dynamic graphs. [#49334](https://github.com/PaddlePaddle/Paddle/pull/49334) -- Upgrade fleet to distributed directory. [#50834](https://github.com/PaddlePaddle/Paddle/pull/50834) -- Optimize log printing for distributed strategies. [#47761](https://github.com/PaddlePaddle/Paddle/pull/47761) -- For re-computation, support hook mode, inplace function, and stop_gradient mode. Support more flexible use. [#48471](https://github.com/PaddlePaddle/Paddle/pull/48471), [#47985](https://github.com/PaddlePaddle/Paddle/pull/47985) -- Data parallel - - For data parallel, support no_sync API for blocking parameter gradient communications. Support the parameter synchronization function. Add scale API to scale parameters. [#47536](https://github.com/PaddlePaddle/Paddle/pull/47536),[#51895](https://github.com/PaddlePaddle/Paddle/pull/51895),[#47519](https://github.com/PaddlePaddle/Paddle/pull/47519) - - Fix the problem of video memory leakage under data parallel. [#47369](https://github.com/PaddlePaddle/Paddle/pull/47369),[#47444](https://github.com/PaddlePaddle/Paddle/pull/47444),[#48668](https://github.com/PaddlePaddle/Paddle/pull/48668) - - Support sparse parameter gradient synchronization. [#52785](https://github.com/PaddlePaddle/Paddle/pull/52785) -- Pipeline parallel - - Optimize pipeline performance, and remove communication wait. 
Optimize scheduling and communication overlap. [#46209](https://github.com/PaddlePaddle/Paddle/pull/46209),[#54003](https://github.com/PaddlePaddle/Paddle/pull/54003),[#54312](https://github.com/PaddlePaddle/Paddle/pull/54312),[#53384](https://github.com/PaddlePaddle/Paddle/pull/53384),[#54310](https://github.com/PaddlePaddle/Paddle/pull/54310),[#46399](https://github.com/PaddlePaddle/Paddle/pull/46399),[#46483](https://github.com/PaddlePaddle/Paddle/pull/46483),[#46780](https://github.com/PaddlePaddle/Paddle/pull/46780),[#46116](https://github.com/PaddlePaddle/Paddle/pull/46116) - - Support custom sharding, log printing, random seed setting, and timer elapsed time printing. [#53344](https://github.com/PaddlePaddle/Paddle/pull/53344), [#47670](https://github.com/PaddlePaddle/Paddle/pull/47670),[#47336](https://github.com/PaddlePaddle/Paddle/pull/47336),[#52656](https://github.com/PaddlePaddle/Paddle/pull/52656),[#53831](https://github.com/PaddlePaddle/Paddle/pull/53831) - - Optimize video memory release logic in pipeline scheduling, and release intermediate variables and data in advance. [#54557](https://github.com/PaddlePaddle/Paddle/pull/54557), [#47199](https://github.com/PaddlePaddle/Paddle/pull/47199),[#47497](https://github.com/PaddlePaddle/Paddle/pull/47497),[#48045](https://github.com/PaddlePaddle/Paddle/pull/48045),[#54672](https://github.com/PaddlePaddle/Paddle/pull/54672) - - Support VPP mode and model saving for pipeline parallel. 
[#54196](https://github.com/PaddlePaddle/Paddle/pull/54196), [#52927](https://github.com/PaddlePaddle/Paddle/pull/52927), [#47801](https://github.com/PaddlePaddle/Paddle/pull/47801), [#45922](https://github.com/PaddlePaddle/Paddle/pull/45922), [#47242](https://github.com/PaddlePaddle/Paddle/pull/47242) -- Group sharded parallel - - Sharding stage2 parallel supports the quantization function, hybrid parallel training, gradient accumulation, XPU hardware, BF16 low-precision computation, optimizer learning rate setting, the offload function, and data parallel. [#47169](https://github.com/PaddlePaddle/Paddle/pull/47169), [#47535](https://github.com/PaddlePaddle/Paddle/pull/47535), [#46795](https://github.com/PaddlePaddle/Paddle/pull/46795), [#47711](https://github.com/PaddlePaddle/Paddle/pull/47711), [#48310](https://github.com/PaddlePaddle/Paddle/pull/48310), [#46846](https://github.com/PaddlePaddle/Paddle/pull/46846), [#48857](https://github.com/PaddlePaddle/Paddle/pull/48857), [#49196](https://github.com/PaddlePaddle/Paddle/pull/49196), [#49931](https://github.com/PaddlePaddle/Paddle/pull/49931), [#47114](https://github.com/PaddlePaddle/Paddle/pull/47114), [#49767](https://github.com/PaddlePaddle/Paddle/pull/49767) - - Optimize sharding stage2 performance. Support overlapping communication with computation. [#46495](https://github.com/PaddlePaddle/Paddle/pull/46495), [#46894](https://github.com/PaddlePaddle/Paddle/pull/46894) - - Sharding stage3 supports shared parameters and untrainable parameters. [#48695](https://github.com/PaddlePaddle/Paddle/pull/48695), [#48577](https://github.com/PaddlePaddle/Paddle/pull/48577) -- Tensor model parallel - - Optimize tensor model parallel performance to reduce the performance impact of stream sharding. [#47715](https://github.com/PaddlePaddle/Paddle/pull/47715), [#51617](https://github.com/PaddlePaddle/Paddle/pull/51617) - - Support synchronization of parameters, optimizer shapes, and gradients. 
[#51428](https://github.com/PaddlePaddle/Paddle/pull/51428), [#53254](https://github.com/PaddlePaddle/Paddle/pull/53254), [#53335](https://github.com/PaddlePaddle/Paddle/pull/53335), [#45803](https://github.com/PaddlePaddle/Paddle/pull/45803), [#46303](https://github.com/PaddlePaddle/Paddle/pull/46303), [#52293](https://github.com/PaddlePaddle/Paddle/pull/52293) - - Optimize tensor model parallel operators such as c_embedding and softmax_with_cross_entropy. [#53197](https://github.com/PaddlePaddle/Paddle/pull/53197), [#53547](https://github.com/PaddlePaddle/Paddle/pull/53547), [#53541](https://github.com/PaddlePaddle/Paddle/pull/53541), [#52789](https://github.com/PaddlePaddle/Paddle/pull/52789), [#46491](https://github.com/PaddlePaddle/Paddle/pull/46491), [#52742](https://github.com/PaddlePaddle/Paddle/pull/52742), [#53419](https://github.com/PaddlePaddle/Paddle/pull/53419) -- Launch - - Support the distributed Launch function, keeping independent logs. [#53207](https://github.com/PaddlePaddle/Paddle/pull/53207), [#50405](https://github.com/PaddlePaddle/Paddle/pull/50405) - - Add printing of framework environment variables, log overwrite, log return, and environment check, making it easy to change the debug environment variables. [#53243](https://github.com/PaddlePaddle/Paddle/pull/53243), [#51803](https://github.com/PaddlePaddle/Paddle/pull/51803), [#53990](https://github.com/PaddlePaddle/Paddle/pull/53990) -- Communication library - - Add custom mixed parallel communication groups, topology information printing, and custom communication topology order. [#47021](https://github.com/PaddlePaddle/Paddle/pull/47021), [#54000](https://github.com/PaddlePaddle/Paddle/pull/54000), [#51781](https://github.com/PaddlePaddle/Paddle/pull/51781) - - Remove the communication library's dependency on Place information. [#47857](https://github.com/PaddlePaddle/Paddle/pull/47857) - - Add GLOO operator support to the communication library. 
Support send/recv/gather. [#52221](https://github.com/PaddlePaddle/Paddle/pull/52221), [#52334](https://github.com/PaddlePaddle/Paddle/pull/52334),[#49084](https://github.com/PaddlePaddle/Paddle/pull/49084) - - Disable reverse computation of communication operator. [#47636](https://github.com/PaddlePaddle/Paddle/pull/47636) - - Add communication library static shape check, to help determine whether communication volume is matched. [#48256](https://github.com/PaddlePaddle/Paddle/pull/48256),[#48915](https://github.com/PaddlePaddle/Paddle/pull/48915),[#48646](https://github.com/PaddlePaddle/Paddle/pull/48646) - - Support communication python object type, BF16 type, alltoall, reduce, allgather, group call, global gather, broadcast, and scatter communication methods. Support XPU device communications. [#51765](https://github.com/PaddlePaddle/Paddle/pull/51765),[#45844](https://github.com/PaddlePaddle/Paddle/pull/45844),[#48059](https://github.com/PaddlePaddle/Paddle/pull/48059),[#48115](https://github.com/PaddlePaddle/Paddle/pull/48115), [#48339](https://github.com/PaddlePaddle/Paddle/pull/48339),[#49252](https://github.com/PaddlePaddle/Paddle/pull/49252),[#49451](https://github.com/PaddlePaddle/Paddle/pull/49451),[#50085](https://github.com/PaddlePaddle/Paddle/pull/50085),[#50701](https://github.com/PaddlePaddle/Paddle/pull/50701),[#48208](https://github.com/PaddlePaddle/Paddle/pull/48208),[#48736](https://github.com/PaddlePaddle/Paddle/pull/48736),[#51762](https://github.com/PaddlePaddle/Paddle/pull/51762),[#52495](https://github.com/PaddlePaddle/Paddle/pull/52495),[#53514](https://github.com/PaddlePaddle/Paddle/pull/53514),[#48232](https://github.com/PaddlePaddle/Paddle/pull/48232),[#49896](https://github.com/PaddlePaddle/Paddle/pull/49896),[#49941](https://github.com/PaddlePaddle/Paddle/pull/49941),[#45584](https://github.com/PaddlePaddle/Paddle/pull/45584) - - Add support for communications between computational streams. 
[#46182](https://github.com/PaddlePaddle/Paddle/pull/46182),[#46023](https://github.com/PaddlePaddle/Paddle/pull/46023),[#46295](https://github.com/PaddlePaddle/Paddle/pull/46295),[#46761](https://github.com/PaddlePaddle/Paddle/pull/46761),[#47481](https://github.com/PaddlePaddle/Paddle/pull/47481),[#47740](https://github.com/PaddlePaddle/Paddle/pull/47740),[#47976](https://github.com/PaddlePaddle/Paddle/pull/47976),[#48163](https://github.com/PaddlePaddle/Paddle/pull/48163),[#48396](https://github.com/PaddlePaddle/Paddle/pull/48396),[#48308](https://github.com/PaddlePaddle/Paddle/pull/48308),[#47110](https://github.com/PaddlePaddle/Paddle/pull/47110),[#53089](https://github.com/PaddlePaddle/Paddle/pull/53089) - - Optimize communication library TCP linking time. [#49810](https://github.com/PaddlePaddle/Paddle/pull/49810),[#47184](https://github.com/PaddlePaddle/Paddle/pull/47184) - -#### Automatic parallel -- Improve semi-automatic parallel for static graphs: - - Add FLOPs computation function for multiple operators, and add computation Cost modelling based on FLOPs. [#48083](https://github.com/PaddlePaddle/Paddle/pull/48083),[#47978](https://github.com/PaddlePaddle/Paddle/pull/47978),[#47595](https://github.com/PaddlePaddle/Paddle/pull/47595),[#48083](https://github.com/PaddlePaddle/Paddle/pull/48083),[#48084](https://github.com/PaddlePaddle/Paddle/pull/48084),[#47816](https://github.com/PaddlePaddle/Paddle/pull/47816) - - Improve API ease-of-use. Perfect the DistAttr, Process Mesh, Engine API, information printing, input and output modules. Implement the Engine new cost API. It can be used to theoretically analyze model running time and video memory overhead. 
[#47503](https://github.com/PaddlePaddle/Paddle/pull/47503),[#46416](https://github.com/PaddlePaddle/Paddle/pull/46416),[#46554](https://github.com/PaddlePaddle/Paddle/pull/46554), [#46633](https://github.com/PaddlePaddle/Paddle/pull/46633),[#49214](https://github.com/PaddlePaddle/Paddle/pull/49214),[#53848](https://github.com/PaddlePaddle/Paddle/pull/53848),[#46552](https://github.com/PaddlePaddle/Paddle/pull/46552), [#47043](https://github.com/PaddlePaddle/Paddle/pull/47043), [#49665](https://github.com/PaddlePaddle/Paddle/pull/49665), [#52912](https://github.com/PaddlePaddle/Paddle/pull/52912), [#45776](https://github.com/PaddlePaddle/Paddle/pull/45776), [#47263](https://github.com/PaddlePaddle/Paddle/pull/47263) - - Optimize the generality and ease of use of Pass. Support more scenarios, and reduce time spent on Pass pre-analysis. [#46519](https://github.com/PaddlePaddle/Paddle/pull/46519),[#47358](https://github.com/PaddlePaddle/Paddle/pull/47358),[#46391](https://github.com/PaddlePaddle/Paddle/pull/46391), [#51035](https://github.com/PaddlePaddle/Paddle/pull/51035) - - Enhance debugging capabilities with distributed randomness control mechanisms and hybrid parallel precision alignment tools. [#52903](https://github.com/PaddlePaddle/Paddle/pull/52903),[#49865](https://github.com/PaddlePaddle/Paddle/pull/49865) - - Support automatic sharding of inference generation task networking. Adapt special usage of control flow and conditional block in the generation model. [#46771](https://github.com/PaddlePaddle/Paddle/pull/46771), [#54067](https://github.com/PaddlePaddle/Paddle/pull/54067) - - Improve grad_clip to support load balancing in data parallel scenarios. 
[#49510](https://github.com/PaddlePaddle/Paddle/pull/49510), [#49249](https://github.com/PaddlePaddle/Paddle/pull/49249) -- Semi-automatic parallel performance improvement for static graphs: - - Add automated communication fusion and multi-stream communication to the Sharding Pass, improving throughput by 26% for the GPT 6.7B model on two machines. [#48604](https://github.com/PaddlePaddle/Paddle/pull/48604), [#47180](https://github.com/PaddlePaddle/Paddle/pull/47180), [#46180](https://github.com/PaddlePaddle/Paddle/pull/46180) - - Add tuning for the Recompute optimization strategy. Select optimal recompute checkpoint settings based on video memory and model size. [#48608](https://github.com/PaddlePaddle/Paddle/pull/48608), [#47846](https://github.com/PaddlePaddle/Paddle/pull/47846), [#49010](https://github.com/PaddlePaddle/Paddle/pull/49010) - - For pipeline parallel, add the 1F1B scheduling optimization Pass. [#54260](https://github.com/PaddlePaddle/Paddle/pull/54260), [#45915](https://github.com/PaddlePaddle/Paddle/pull/45915) - - Optimize data parallel. Support optimizations such as fused communication and overlapping communication with computation, improving performance by 5% for the GPT 1.3B model. [#48092](https://github.com/PaddlePaddle/Paddle/pull/48092), [#45643](https://github.com/PaddlePaddle/Paddle/pull/45643), [#49744](https://github.com/PaddlePaddle/Paddle/pull/49744), [#47578](https://github.com/PaddlePaddle/Paddle/pull/47578) - - Optimize concat performance in the Reshard module. Reduce the number of concat operations in some scenarios. [#47809](https://github.com/PaddlePaddle/Paddle/pull/47809) - - Upgrade the performance of the mixed precision optimization Pass, support BF16 low precision, and adapt automatic hybrid parallel to the while loop control flow. 
[#51285](https://github.com/PaddlePaddle/Paddle/pull/51285),[#51147](https://github.com/PaddlePaddle/Paddle/pull/51147), [#49219](https://github.com/PaddlePaddle/Paddle/pull/49219), [#49079](https://github.com/PaddlePaddle/Paddle/pull/49079)
-- Improve function of fully automatic parallel for static graphs:
-  - Add new rule-based fully automated search strategy. [#51859](https://github.com/PaddlePaddle/Paddle/pull/51859),[#51908](https://github.com/PaddlePaddle/Paddle/pull/51908),[#52053](https://github.com/PaddlePaddle/Paddle/pull/52053),[#48316](https://github.com/PaddlePaddle/Paddle/pull/48316),[#48464](https://github.com/PaddlePaddle/Paddle/pull/48464), [#52041](https://github.com/PaddlePaddle/Paddle/pull/52041)
-  - Improve automatic parallel modelling capability, enriching single-node topology modelling and communication volume modelling. [#52723](https://github.com/PaddlePaddle/Paddle/pull/52723),[#46387](https://github.com/PaddlePaddle/Paddle/pull/46387),[#47043](https://github.com/PaddlePaddle/Paddle/pull/47043)
-
-#### Parameter server
-- Clean up the all list in ps directory, in which API is not exposed [#51289](https://github.com/PaddlePaddle/Paddle/pull/51289)
-- Clean up cvm operator [#48989](https://github.com/PaddlePaddle/Paddle/pull/48989)
-- For GPUPS, add support for AFS. [#46611](https://github.com/PaddlePaddle/Paddle/pull/46611)
-- Degrade PGLBOX2.0 log, fix stuck issue of dense parameter, fix the bug that barrier does not take effect, and add get_epoch_finish python side interface [#49946](https://github.com/PaddlePaddle/Paddle/pull/49946),[#50166](https://github.com/PaddlePaddle/Paddle/pull/50166),[#50349](https://github.com/PaddlePaddle/Paddle/pull/50349)
-- GPUPs run to switch to specified mode. [#51115](https://github.com/PaddlePaddle/Paddle/pull/51115)
-- GPUPS is added to benchmark.
[#49587](https://github.com/PaddlePaddle/Paddle/pull/49587),[#49649](https://github.com/PaddlePaddle/Paddle/pull/49649)
-- Fix the GPUPS optimizer selection bug, fix reader reading problem, and fix RPC compilation problem. [#47026](https://github.com/PaddlePaddle/Paddle/pull/47026),[#47192](https://github.com/PaddlePaddle/Paddle/pull/47192),[#49878](https://github.com/PaddlePaddle/Paddle/pull/49878), [#46356](https://github.com/PaddlePaddle/Paddle/pull/46356),[#46575](https://github.com/PaddlePaddle/Paddle/pull/46575),[#49389](https://github.com/PaddlePaddle/Paddle/pull/49389),[#46258](https://github.com/PaddlePaddle/Paddle/pull/46258),[#50136](https://github.com/PaddlePaddle/Paddle/pull/50136)
-- Add rocksdb compilation method. [#46074](https://github.com/PaddlePaddle/Paddle/pull/46074)
-
-### CUDA
-#### New features
-- Add compilation support for CUDA 12.0. Fix related unit test. ([#49539](https://github.com/PaddlePaddle/Paddle/pull/49539), [#54542](https://github.com/PaddlePaddle/Paddle/pull/54542))
-- Add CUDNN Frontend API compilation support and related unit test. You can use `WITH_CUDNN_FRONTEND=ON ` compilation option for start. ([#47524](https://github.com/PaddlePaddle/Paddle/pull/47524), [#47612](https://github.com/PaddlePaddle/Paddle/pull/47612))
-
-#### Improvements
-- Add mixed precision strategy and optimize precision:
-  - Add and optimize FP16 and BF16 data type support for more than 200 operators in the framework, including logsumexp, reduce_max, cumprod, sync_batch_norm, compare class OP, etc. Carry out precision optimization and unit test for all FP16 and BF16 operators. Improve the unit test framework function for low-precision operators, to ensure there is no loss of accuracy in the process of large-model training.
([#51193](https://github.com/PaddlePaddle/Paddle/pull/51193), [#51114](https://github.com/PaddlePaddle/Paddle/pull/51114), [#45817](https://github.com/PaddlePaddle/Paddle/pull/45817), [#52862](https://github.com/PaddlePaddle/Paddle/pull/52862), [#52919](https://github.com/PaddlePaddle/Paddle/pull/52919), [#52921](https://github.com/PaddlePaddle/Paddle/pull/52921), [#46413](https://github.com/PaddlePaddle/Paddle/pull/46413), [#48205](https://github.com/PaddlePaddle/Paddle/pull/48205), [#54193](https://github.com/PaddlePaddle/Paddle/pull/54193), [#48041](https://github.com/PaddlePaddle/Paddle/pull/48041), [#48121](https://github.com/PaddlePaddle/Paddle/pull/48121), [#46364](https://github.com/PaddlePaddle/Paddle/pull/46364), [#51153](https://github.com/PaddlePaddle/Paddle/pull/51153), [#53023](https://github.com/PaddlePaddle/Paddle/pull/53023), [#53079](https://github.com/PaddlePaddle/Paddle/pull/53079), [#53137](https://github.com/PaddlePaddle/Paddle/pull/53137), [#46212](https://github.com/PaddlePaddle/Paddle/pull/46212), [#50908](https://github.com/PaddlePaddle/Paddle/pull/50908), [#52555](https://github.com/PaddlePaddle/Paddle/pull/52555), [#51582](https://github.com/PaddlePaddle/Paddle/pull/51582), [#47897](https://github.com/PaddlePaddle/Paddle/pull/47897), [#45601](https://github.com/PaddlePaddle/Paddle/pull/45601), [#53522](https://github.com/PaddlePaddle/Paddle/pull/53522), [#52666](https://github.com/PaddlePaddle/Paddle/pull/52666), [#50101](https://github.com/PaddlePaddle/Paddle/pull/50101), [#48315](https://github.com/PaddlePaddle/Paddle/pull/48315), [#50847](https://github.com/PaddlePaddle/Paddle/pull/50847), [#50905](https://github.com/PaddlePaddle/Paddle/pull/50905), [#50906](https://github.com/PaddlePaddle/Paddle/pull/50906), [#50909](https://github.com/PaddlePaddle/Paddle/pull/50909), [#50916](https://github.com/PaddlePaddle/Paddle/pull/50916), [#50917](https://github.com/PaddlePaddle/Paddle/pull/50917), 
[#50920](https://github.com/PaddlePaddle/Paddle/pull/50920), [#50919](https://github.com/PaddlePaddle/Paddle/pull/50919), [#50904](https://github.com/PaddlePaddle/Paddle/pull/50904), [#50918](https://github.com/PaddlePaddle/Paddle/pull/50918), [#50938](https://github.com/PaddlePaddle/Paddle/pull/50938), [#50858](https://github.com/PaddlePaddle/Paddle/pull/50858), [#50933](https://github.com/PaddlePaddle/Paddle/pull/50933), [#50945](https://github.com/PaddlePaddle/Paddle/pull/50945), [#50936](https://github.com/PaddlePaddle/Paddle/pull/50936), [#51168](https://github.com/PaddlePaddle/Paddle/pull/51168), [#51493](https://github.com/PaddlePaddle/Paddle/pull/51493), [#50924](https://github.com/PaddlePaddle/Paddle/pull/50924), [#50923](https://github.com/PaddlePaddle/Paddle/pull/50923), [#50926](https://github.com/PaddlePaddle/Paddle/pull/50926), [#50925](https://github.com/PaddlePaddle/Paddle/pull/50925), [#50930](https://github.com/PaddlePaddle/Paddle/pull/50930), [#53284](https://github.com/PaddlePaddle/Paddle/pull/53284), [#53286](https://github.com/PaddlePaddle/Paddle/pull/53286), [#53285](https://github.com/PaddlePaddle/Paddle/pull/53285), [#50976](https://github.com/PaddlePaddle/Paddle/pull/50976), [#50915](https://github.com/PaddlePaddle/Paddle/pull/50915), [#50915](https://github.com/PaddlePaddle/Paddle/pull/50915), [#48192](https://github.com/PaddlePaddle/Paddle/pull/48192), [#50993](https://github.com/PaddlePaddle/Paddle/pull/50993), [#50998](https://github.com/PaddlePaddle/Paddle/pull/50998), [#51380](https://github.com/PaddlePaddle/Paddle/pull/51380), [#51137](https://github.com/PaddlePaddle/Paddle/pull/51137), [#51106](https://github.com/PaddlePaddle/Paddle/pull/51106), [#51197](https://github.com/PaddlePaddle/Paddle/pull/51197), [#51159](https://github.com/PaddlePaddle/Paddle/pull/51159), [#51552](https://github.com/PaddlePaddle/Paddle/pull/51552), [#51151](https://github.com/PaddlePaddle/Paddle/pull/51151), 
[#51005](https://github.com/PaddlePaddle/Paddle/pull/51005), [#51565](https://github.com/PaddlePaddle/Paddle/pull/51565), [#51036](https://github.com/PaddlePaddle/Paddle/pull/51036), [#51185](https://github.com/PaddlePaddle/Paddle/pull/51185), [#51791](https://github.com/PaddlePaddle/Paddle/pull/51791), [#51083](https://github.com/PaddlePaddle/Paddle/pull/51083), [#51694](https://github.com/PaddlePaddle/Paddle/pull/51694), [#51689](https://github.com/PaddlePaddle/Paddle/pull/51689), [#51009](https://github.com/PaddlePaddle/Paddle/pull/51009), [#51051](https://github.com/PaddlePaddle/Paddle/pull/51051), [#51532](https://github.com/PaddlePaddle/Paddle/pull/51532), [#51978](https://github.com/PaddlePaddle/Paddle/pull/51978), [#51903](https://github.com/PaddlePaddle/Paddle/pull/51903), [#51888](https://github.com/PaddlePaddle/Paddle/pull/51888), [#52016](https://github.com/PaddlePaddle/Paddle/pull/52016), [#52035](https://github.com/PaddlePaddle/Paddle/pull/52035), [#52184](https://github.com/PaddlePaddle/Paddle/pull/52184), [#52018](https://github.com/PaddlePaddle/Paddle/pull/52018), [#51787](https://github.com/PaddlePaddle/Paddle/pull/51787), [#51640](https://github.com/PaddlePaddle/Paddle/pull/51640), [#52172](https://github.com/PaddlePaddle/Paddle/pull/52172), [#52193](https://github.com/PaddlePaddle/Paddle/pull/52193), [#51160](https://github.com/PaddlePaddle/Paddle/pull/51160), [#51809](https://github.com/PaddlePaddle/Paddle/pull/51809), [#51678](https://github.com/PaddlePaddle/Paddle/pull/51678), [#52158](https://github.com/PaddlePaddle/Paddle/pull/52158), [#51015](https://github.com/PaddlePaddle/Paddle/pull/51015), [#52240](https://github.com/PaddlePaddle/Paddle/pull/52240), [#52276](https://github.com/PaddlePaddle/Paddle/pull/52276), [#52233](https://github.com/PaddlePaddle/Paddle/pull/52233), [#52220](https://github.com/PaddlePaddle/Paddle/pull/52220), [#52107](https://github.com/PaddlePaddle/Paddle/pull/52107), 
[#52282](https://github.com/PaddlePaddle/Paddle/pull/52282), [#52311](https://github.com/PaddlePaddle/Paddle/pull/52311), [#52315](https://github.com/PaddlePaddle/Paddle/pull/52315), [#52357](https://github.com/PaddlePaddle/Paddle/pull/52357), [#52256](https://github.com/PaddlePaddle/Paddle/pull/52256), [#51649](https://github.com/PaddlePaddle/Paddle/pull/51649), [#52413](https://github.com/PaddlePaddle/Paddle/pull/52413), [#52369](https://github.com/PaddlePaddle/Paddle/pull/52369), [#51837](https://github.com/PaddlePaddle/Paddle/pull/51837), [#52112](https://github.com/PaddlePaddle/Paddle/pull/52112), [#51819](https://github.com/PaddlePaddle/Paddle/pull/51819), [#52388](https://github.com/PaddlePaddle/Paddle/pull/52388), [#52411](https://github.com/PaddlePaddle/Paddle/pull/52411), [#52521](https://github.com/PaddlePaddle/Paddle/pull/52521), [#51300](https://github.com/PaddlePaddle/Paddle/pull/51300), [#51117](https://github.com/PaddlePaddle/Paddle/pull/51117), [#52380](https://github.com/PaddlePaddle/Paddle/pull/52380), [#52317](https://github.com/PaddlePaddle/Paddle/pull/52317), [#51263](https://github.com/PaddlePaddle/Paddle/pull/51263), [#52668](https://github.com/PaddlePaddle/Paddle/pull/52668), [#52259](https://github.com/PaddlePaddle/Paddle/pull/52259), [#50999](https://github.com/PaddlePaddle/Paddle/pull/50999), [#52407](https://github.com/PaddlePaddle/Paddle/pull/52407), [#52288](https://github.com/PaddlePaddle/Paddle/pull/52288), [#52845](https://github.com/PaddlePaddle/Paddle/pull/52845), [#50953](https://github.com/PaddlePaddle/Paddle/pull/50953), [#52667](https://github.com/PaddlePaddle/Paddle/pull/52667), [#52582](https://github.com/PaddlePaddle/Paddle/pull/52582), [#52426](https://github.com/PaddlePaddle/Paddle/pull/52426), [#51884](https://github.com/PaddlePaddle/Paddle/pull/51884), [#52630](https://github.com/PaddlePaddle/Paddle/pull/52630), [#52136](https://github.com/PaddlePaddle/Paddle/pull/52136), 
[#52604](https://github.com/PaddlePaddle/Paddle/pull/52604), [#51615](https://github.com/PaddlePaddle/Paddle/pull/51615), [#51275](https://github.com/PaddlePaddle/Paddle/pull/51275), [#52898](https://github.com/PaddlePaddle/Paddle/pull/52898), [#52918](https://github.com/PaddlePaddle/Paddle/pull/52918), [#52572](https://github.com/PaddlePaddle/Paddle/pull/52572), [#52683](https://github.com/PaddlePaddle/Paddle/pull/52683), [#52956](https://github.com/PaddlePaddle/Paddle/pull/52956), [#52963](https://github.com/PaddlePaddle/Paddle/pull/52963), [#52954](https://github.com/PaddlePaddle/Paddle/pull/52954), [#52444](https://github.com/PaddlePaddle/Paddle/pull/52444), [#52314](https://github.com/PaddlePaddle/Paddle/pull/52314), [#52887](https://github.com/PaddlePaddle/Paddle/pull/52887), [#52195](https://github.com/PaddlePaddle/Paddle/pull/52195), [#53100](https://github.com/PaddlePaddle/Paddle/pull/53100), [#52961](https://github.com/PaddlePaddle/Paddle/pull/52961), [#52953](https://github.com/PaddlePaddle/Paddle/pull/52953), [#53111](https://github.com/PaddlePaddle/Paddle/pull/53111), [#53549](https://github.com/PaddlePaddle/Paddle/pull/53549), [#53736](https://github.com/PaddlePaddle/Paddle/pull/53736), [#52920](https://github.com/PaddlePaddle/Paddle/pull/52920), [#53195](https://github.com/PaddlePaddle/Paddle/pull/53195), [#53535](https://github.com/PaddlePaddle/Paddle/pull/53535), [#53876](https://github.com/PaddlePaddle/Paddle/pull/53876), [#53785](https://github.com/PaddlePaddle/Paddle/pull/53785), [#53722](https://github.com/PaddlePaddle/Paddle/pull/53722), [#54285](https://github.com/PaddlePaddle/Paddle/pull/54285), [#54232](https://github.com/PaddlePaddle/Paddle/pull/54232), [#53922](https://github.com/PaddlePaddle/Paddle/pull/53922), [#47277](https://github.com/PaddlePaddle/Paddle/pull/47277), [#50811](https://github.com/PaddlePaddle/Paddle/pull/50811), [#54571](https://github.com/PaddlePaddle/Paddle/pull/54571), 
[#50129](https://github.com/PaddlePaddle/Paddle/pull/50129), [#50340](https://github.com/PaddlePaddle/Paddle/pull/50340), [#50848](https://github.com/PaddlePaddle/Paddle/pull/50848), [#50849](https://github.com/PaddlePaddle/Paddle/pull/50849), [#50868](https://github.com/PaddlePaddle/Paddle/pull/50868), [#50878](https://github.com/PaddlePaddle/Paddle/pull/50878), [#50929](https://github.com/PaddlePaddle/Paddle/pull/50929), [#50939](https://github.com/PaddlePaddle/Paddle/pull/50939), [#50973](https://github.com/PaddlePaddle/Paddle/pull/50973), [#50913](https://github.com/PaddlePaddle/Paddle/pull/50913), [#51145](https://github.com/PaddlePaddle/Paddle/pull/51145), [#51090](https://github.com/PaddlePaddle/Paddle/pull/51090), [#51098](https://github.com/PaddlePaddle/Paddle/pull/51098), [#51094](https://github.com/PaddlePaddle/Paddle/pull/51094), [#51216](https://github.com/PaddlePaddle/Paddle/pull/51216), [#51736](https://github.com/PaddlePaddle/Paddle/pull/51736), [#51684](https://github.com/PaddlePaddle/Paddle/pull/51684), [#51925](https://github.com/PaddlePaddle/Paddle/pull/51925), [#54030](https://github.com/PaddlePaddle/Paddle/pull/54030), [#50700](https://github.com/PaddlePaddle/Paddle/pull/50700), [#52264](https://github.com/PaddlePaddle/Paddle/pull/52264), [#51069](https://github.com/PaddlePaddle/Paddle/pull/51069), [#51101](https://github.com/PaddlePaddle/Paddle/pull/51101), [#51286](https://github.com/PaddlePaddle/Paddle/pull/51286), [#53582](https://github.com/PaddlePaddle/Paddle/pull/53582),[#49869](https://github.com/PaddlePaddle/Paddle/pull/49869))) -- AMP optimization: Comprehensively upgrade and optimize ease of use, accuracy stability and debuggability of AMP training, to better support acceleration of large model training. In terms of ease of use, unify the API for dynamic and static graphs. Add new conversion interfaces such as model.float(), model.float16() and model.bfloat16(). 
In terms of accuracy stability, enhance automatic adjustment of the strategy for BF16 type. Optimize blacklist settings. Enhance support of the multi_precision function by optimizer operators Adagrad, Adamax, Adadelta, and RMSProp. In the O2 mode, improve master grad mechanism, add type promotion mechanism and a new parameter for the specific module to use float32 computation to guarantee accuracy. In terms of debuggability, add the paddle.amp.debugging module to provide operator statistics, outlier detection, and accuracy comparison. ( [#50132](https://github.com/PaddlePaddle/Paddle/pull/50132), [#50078](https://github.com/PaddlePaddle/Paddle/pull/50078), [#50131](https://github.com/PaddlePaddle/Paddle/pull/50131), [#49705](https://github.com/PaddlePaddle/Paddle/pull/49705), [#52936](https://github.com/PaddlePaddle/Paddle/pull/52936), [#52871](https://github.com/PaddlePaddle/Paddle/pull/52871), [#53289](https://github.com/PaddlePaddle/Paddle/pull/53289), [#53362](https://github.com/PaddlePaddle/Paddle/pull/53362), [#54240](https://github.com/PaddlePaddle/Paddle/pull/54240), [#53768](https://github.com/PaddlePaddle/Paddle/pull/53768), [#48041](https://github.com/PaddlePaddle/Paddle/pull/48041), [#47672](https://github.com/PaddlePaddle/Paddle/pull/47672), [#48843](https://github.com/PaddlePaddle/Paddle/pull/48843), [#49391](https://github.com/PaddlePaddle/Paddle/pull/49391), [#51635](https://github.com/PaddlePaddle/Paddle/pull/51635), [#45541](https://github.com/PaddlePaddle/Paddle/pull/45541), [#53742](https://github.com/PaddlePaddle/Paddle/pull/53742), [#51020](https://github.com/PaddlePaddle/Paddle/pull/51020), [#51063](https://github.com/PaddlePaddle/Paddle/pull/51063), [#52514](https://github.com/PaddlePaddle/Paddle/pull/52514), [#50940](https://github.com/PaddlePaddle/Paddle/pull/50940), [#52936](https://github.com/PaddlePaddle/Paddle/pull/52936), [#53439](https://github.com/PaddlePaddle/Paddle/pull/53439), 
[#53712](https://github.com/PaddlePaddle/Paddle/pull/53712), [#48238](https://github.com/PaddlePaddle/Paddle/pull/48238), [#52215](https://github.com/PaddlePaddle/Paddle/pull/52215), [#53012](https://github.com/PaddlePaddle/Paddle/pull/53012), [#52918](https://github.com/PaddlePaddle/Paddle/pull/52918), [#54571](https://github.com/PaddlePaddle/Paddle/pull/54571))
-- For GroupNorm operator, add support for NHWC data format. ([#47533](https://github.com/PaddlePaddle/Paddle/pull/47533))
-- For index_put operator, add support for mixed data types of bool and int. ([#54195](https://github.com/PaddlePaddle/Paddle/pull/54195))
-- Add sparse.is_nan API for determining whether a sparse tensor contains a NaN element. ([#51513](https://github.com/PaddlePaddle/Paddle/pull/51513))
-
-#### bug fix
-- Fix bugs of computation errors of several operators such as trace, roll, dropout_nd, and log_softmax, stack overflow, and some unit test error. ([#50243](https://github.com/PaddlePaddle/Paddle/pull/50243), [#52012](https://github.com/PaddlePaddle/Paddle/pull/52012), [#53795](https://github.com/PaddlePaddle/Paddle/pull/53795), [#53149](https://github.com/PaddlePaddle/Paddle/pull/53149), [#53654](https://github.com/PaddlePaddle/Paddle/pull/53654), [#51054](https://github.com/PaddlePaddle/Paddle/pull/51054), [#49373](https://github.com/PaddlePaddle/Paddle/pull/49373), [#53038](https://github.com/PaddlePaddle/Paddle/pull/53038))
-- Fix the problem that conv operator exhaustive search does not work in some scenarios. ([#47065](https://github.com/PaddlePaddle/Paddle/pull/47065))
-- Fix timeout problem of collective_reduce_scatter and other operators on A100. ([#54513](https://github.com/PaddlePaddle/Paddle/pull/54513))
-- Fix the problem of attribute error in FusedLinear unit test. ([#50359](https://github.com/PaddlePaddle/Paddle/pull/50359))
-- Fix the OOM problem that may occur when using Profiler.
([#46089](https://github.com/PaddlePaddle/Paddle/pull/46089))
-
-#### Performance optimization
-- Further optimize GPU Kernel and eigen implementations of the framework's large number of operators, including max_pool3d, dropout, adaptive_pooling, depthwise_conv2d, transpose, eigh, broadcast class computations, reduce class computations, prelu, logsumexp, and sparse, to achieve better performance in more configuration scenarios. ([#45820](https://github.com/PaddlePaddle/Paddle/pull/45820), [#45959](https://github.com/PaddlePaddle/Paddle/pull/45959), [#45934](https://github.com/PaddlePaddle/Paddle/pull/45934), [#46332](https://github.com/PaddlePaddle/Paddle/pull/46332), [#46287](https://github.com/PaddlePaddle/Paddle/pull/46287), [#47233](https://github.com/PaddlePaddle/Paddle/pull/47233), [#48855](https://github.com/PaddlePaddle/Paddle/pull/48855), [#48560](https://github.com/PaddlePaddle/Paddle/pull/48560), [#49419](https://github.com/PaddlePaddle/Paddle/pull/49419), [#49748](https://github.com/PaddlePaddle/Paddle/pull/49748), [#50348](https://github.com/PaddlePaddle/Paddle/pull/50348), [#52401](https://github.com/PaddlePaddle/Paddle/pull/52401), [#51131](https://github.com/PaddlePaddle/Paddle/pull/51131), [#51141](https://github.com/PaddlePaddle/Paddle/pull/51141), [#51479](https://github.com/PaddlePaddle/Paddle/pull/51479), [#51835](https://github.com/PaddlePaddle/Paddle/pull/51835), [#52509](https://github.com/PaddlePaddle/Paddle/pull/52509), [#52482](https://github.com/PaddlePaddle/Paddle/pull/52482), [#52700](https://github.com/PaddlePaddle/Paddle/pull/52700), [#53112](https://github.com/PaddlePaddle/Paddle/pull/53112), [#53659](https://github.com/PaddlePaddle/Paddle/pull/53659), [#53658](https://github.com/PaddlePaddle/Paddle/pull/53658), [#53154](https://github.com/PaddlePaddle/Paddle/pull/53154), [#54071](https://github.com/PaddlePaddle/Paddle/pull/54071), [#53622](https://github.com/PaddlePaddle/Paddle/pull/53622),
[#52952](https://github.com/PaddlePaddle/Paddle/pull/52952), [#46046](https://github.com/PaddlePaddle/Paddle/pull/46046), [#46119](https://github.com/PaddlePaddle/Paddle/pull/46119), [#45946](https://github.com/PaddlePaddle/Paddle/pull/45946), [#47212](https://github.com/PaddlePaddle/Paddle/pull/47212), [#47791](https://github.com/PaddlePaddle/Paddle/pull/47791), [#47454](https://github.com/PaddlePaddle/Paddle/pull/47454), [#45230](https://github.com/PaddlePaddle/Paddle/pull/45230), [#48899](https://github.com/PaddlePaddle/Paddle/pull/48899), [#33051](https://github.com/PaddlePaddle/Paddle/pull/33051), [#49040](https://github.com/PaddlePaddle/Paddle/pull/49040), [#48992](https://github.com/PaddlePaddle/Paddle/pull/48992), [#49086](https://github.com/PaddlePaddle/Paddle/pull/49086), [#50808](https://github.com/PaddlePaddle/Paddle/pull/50808), [#46431](https://github.com/PaddlePaddle/Paddle/pull/46431), [#50931](https://github.com/PaddlePaddle/Paddle/pull/50931), [#48056](https://github.com/PaddlePaddle/Paddle/pull/48056), [#46071](https://github.com/PaddlePaddle/Paddle/pull/46071), [#49231](https://github.com/PaddlePaddle/Paddle/pull/49231), [#38660](https://github.com/PaddlePaddle/Paddle/pull/38660), [#50287](https://github.com/PaddlePaddle/Paddle/pull/50287), [#46111](https://github.com/PaddlePaddle/Paddle/pull/46111), [#46997](https://github.com/PaddlePaddle/Paddle/pull/46997), [#45854](https://github.com/PaddlePaddle/Paddle/pull/45854), [#47738](https://github.com/PaddlePaddle/Paddle/pull/47738), [#48635](https://github.com/PaddlePaddle/Paddle/pull/48635), [#50353](https://github.com/PaddlePaddle/Paddle/pull/50353), [#50362](https://github.com/PaddlePaddle/Paddle/pull/50362), [#51934](https://github.com/PaddlePaddle/Paddle/pull/51934), [#54045](https://github.com/PaddlePaddle/Paddle/pull/54045), [#46679](https://github.com/PaddlePaddle/Paddle/pull/46679), [#52093](https://github.com/PaddlePaddle/Paddle/pull/52093), 
[#52969](https://github.com/PaddlePaddle/Paddle/pull/52969))
-- Provide more fusion implementations and related fusion pass, such as fused_feed_forward, gather-gemm-scatter, matmul + bias, layernorm_shift_partition + element_add, and elementwise class fusion, to further improve performance of models that use the mode. ( [#50423](https://github.com/PaddlePaddle/Paddle/pull/50423), [#50091](https://github.com/PaddlePaddle/Paddle/pull/50091), [#50364](https://github.com/PaddlePaddle/Paddle/pull/50364), [#53017](https://github.com/PaddlePaddle/Paddle/pull/53017), [#50755](https://github.com/PaddlePaddle/Paddle/pull/50755), [#50050](https://github.com/PaddlePaddle/Paddle/pull/50050), [#47099](https://github.com/PaddlePaddle/Paddle/pull/47099), [#48848](https://github.com/PaddlePaddle/Paddle/pull/48848), [#49383](https://github.com/PaddlePaddle/Paddle/pull/49383), [#50809](https://github.com/PaddlePaddle/Paddle/pull/50809), [#52361](https://github.com/PaddlePaddle/Paddle/pull/52361), [#52028](https://github.com/PaddlePaddle/Paddle/pull/52028), [#48439](https://github.com/PaddlePaddle/Paddle/pull/48439), [#49009](https://github.com/PaddlePaddle/Paddle/pull/49009), [#51427](https://github.com/PaddlePaddle/Paddle/pull/51427), [#52731](https://github.com/PaddlePaddle/Paddle/pull/52731), [#51805](https://github.com/PaddlePaddle/Paddle/pull/51805))
-
-### Intermediate Representation
-In order to guarantee stability and reduce R&D cost of the IR system, we have developed a new IR system for PaddlePaddle. Complete basic data structure definition, operator definition generation, and execution system adaptation. In order to better support higher-order requirements of scientific computing scenarios, complete higher-order adaptation of operators such as silu and cast.
-- Complete the definition of IR data structure, including type system and operator definition. Implement execution adaptation with phi kernel.
[#51112](https://github.com/PaddlePaddle/Paddle/pull/51112), [#51992](https://github.com/PaddlePaddle/Paddle/pull/51992), [#50412](https://github.com/PaddlePaddle/Paddle/pull/50412), [#53557](https://github.com/PaddlePaddle/Paddle/pull/53557), [#53953](https://github.com/PaddlePaddle/Paddle/pull/53953), [#50959](https://github.com/PaddlePaddle/Paddle/pull/50959), [#54250](https://github.com/PaddlePaddle/Paddle/pull/54250), [#54197](https://github.com/PaddlePaddle/Paddle/pull/54197), [#54289](https://github.com/PaddlePaddle/Paddle/pull/54289), [#51636](https://github.com/PaddlePaddle/Paddle/pull/51636), [#52846](https://github.com/PaddlePaddle/Paddle/pull/52846), [#53988](https://github.com/PaddlePaddle/Paddle/pull/53988), [#54143](https://github.com/PaddlePaddle/Paddle/pull/54143), [#54035](https://github.com/PaddlePaddle/Paddle/pull/54035), [#54052](https://github.com/PaddlePaddle/Paddle/pull/54052), [#54340](https://github.com/PaddlePaddle/Paddle/pull/54340), [#54356](https://github.com/PaddlePaddle/Paddle/pull/54356), [#54068](https://github.com/PaddlePaddle/Paddle/pull/54068), [#53894](https://github.com/PaddlePaddle/Paddle/pull/53894), [#53707](https://github.com/PaddlePaddle/Paddle/pull/53707), [#54185](https://github.com/PaddlePaddle/Paddle/pull/54185), [#54031](https://github.com/PaddlePaddle/Paddle/pull/54031), [#54220](https://github.com/PaddlePaddle/Paddle/pull/54220), [#54275](https://github.com/PaddlePaddle/Paddle/pull/54275), [#54281](https://github.com/PaddlePaddle/Paddle/pull/54281), [#54186](https://github.com/PaddlePaddle/Paddle/pull/54186), [#54259](https://github.com/PaddlePaddle/Paddle/pull/54259), [#54124](https://github.com/PaddlePaddle/Paddle/pull/54124), [#54292](https://github.com/PaddlePaddle/Paddle/pull/54292), [#48068](https://github.com/PaddlePaddle/Paddle/pull/48068), [#53978](https://github.com/PaddlePaddle/Paddle/pull/53978) -- Improve the basic pass setup, including basic pass definition, pass registration management. 
[#54023](https://github.com/PaddlePaddle/Paddle/pull/54023),[#54170](https://github.com/PaddlePaddle/Paddle/pull/54170), [#54170](https://github.com/PaddlePaddle/Paddle/pull/54170), [#54308](https://github.com/PaddlePaddle/Paddle/pull/54308), [#54348](https://github.com/PaddlePaddle/Paddle/pull/54348), [#54385](https://github.com/PaddlePaddle/Paddle/pull/54385)
-- Improve adaptation of high-level arithmetic, including modification of the basic module and adaptation of silu and cast arithmetic. [#52005](https://github.com/PaddlePaddle/Paddle/pull/52005), [#53425](https://github.com/PaddlePaddle/Paddle/pull/53425), [#53417](https://github.com/PaddlePaddle/Paddle/pull/53417), [#53417](https://github.com/PaddlePaddle/Paddle/pull/53417), [#53498](https://github.com/PaddlePaddle/Paddle/pull/53498), [#53171](https://github.com/PaddlePaddle/Paddle/pull/53171), [#53632](https://github.com/PaddlePaddle/Paddle/pull/53632), [#53605](https://github.com/PaddlePaddle/Paddle/pull/53605), [#53746](https://github.com/PaddlePaddle/Paddle/pull/53746), [#53874](https://github.com/PaddlePaddle/Paddle/pull/53874), [#54164](https://github.com/PaddlePaddle/Paddle/pull/54164), [#45888](https://github.com/PaddlePaddle/Paddle/pull/45888), [#46024](https://github.com/PaddlePaddle/Paddle/pull/46024), [#46446](https://github.com/PaddlePaddle/Paddle/pull/46446), [#46960](https://github.com/PaddlePaddle/Paddle/pull/46960)
-
-### CINN compiler
-#### New features
-- Add CINN support for 0D-Tensor. At present, in order to cooperate with the upgrade of the main framework, it is supported by adding pass temporarily. We will replace and upgrade the solution later.
([#53382](https://github.com/PaddlePaddle/Paddle/pull/53382), [#53955](https://github.com/PaddlePaddle/Paddle/pull/53955), [#54064](https://github.com/PaddlePaddle/Paddle/pull/54064), [#54118](https://github.com/PaddlePaddle/Paddle/pull/54118), [#54216](https://github.com/PaddlePaddle/Paddle/pull/54216), [#53454](https://github.com/PaddlePaddle/Paddle/pull/53454))
-- Add CINN support for int8/uint8/int16/uint16/bf16 data types. ([#50566](https://github.com/PaddlePaddle/Paddle/pull/50566), [#53637](https://github.com/PaddlePaddle/Paddle/pull/53637))
-- Add support for the CINN expand operator. ([#46776](https://github.com/PaddlePaddle/Paddle/pull/46776))
-- Add CINN support for PaddleInference. ([#45009](https://github.com/PaddlePaddle/Paddle/pull/45009))
-
-#### Improvements
-- For CINN compiler, pass skip_gc_vars attribute to CINN subgraph. CINN adds fetch operator for skip_gc_vars. [#49471](https://github.com/PaddlePaddle/Paddle/pull/49471), [#49553](https://github.com/PaddlePaddle/Paddle/pull/49553)
-- For CINN compiler, conv2d and conv2d_grad do not use cinn operator by default. [#51645](https://github.com/PaddlePaddle/Paddle/pull/51645)
-- Add build_cinn_pass to BuildStrategy for use in dynamic-to-static ([#49496](https://github.com/PaddlePaddle/Paddle/pull/49496))
-- Add reshape operator to perform unit test under combinator mechanism. ([#51276](https://github.com/PaddlePaddle/Paddle/pull/51276))
-- Change version of the main framework binding CINN from fixed commit to develop. ([#49775](https://github.com/PaddlePaddle/Paddle/pull/49775))
-- Set default Target parameter for CINN. ([#50182](https://github.com/PaddlePaddle/Paddle/pull/50182))
-
-#### bug fix
-- Fix the problem of inconsistent operator order after topology sorting during CINN symbolization. ([#52556](https://github.com/PaddlePaddle/Paddle/pull/52556))
-- Fix some operator computation errors, accuracy degradation, and unit test related problems.
([#53859](https://github.com/PaddlePaddle/Paddle/pull/53859), [#54261](https://github.com/PaddlePaddle/Paddle/pull/54261), [#46801](https://github.com/PaddlePaddle/Paddle/pull/46801), [#53676](https://github.com/PaddlePaddle/Paddle/pull/53676), [#53772](https://github.com/PaddlePaddle/Paddle/pull/53772))
-- Fix the problem of CINN support for float16 type. ([#48249](https://github.com/PaddlePaddle/Paddle/pull/48249))
-- Fix the problem in build_cinn_pass. ([#46843](https://github.com/PaddlePaddle/Paddle/pull/46843))
-- Fix the problem of no data area due to incorrect GC when CINN is turned on during combinator + dynamic-to-static. ([#50116](https://github.com/PaddlePaddle/Paddle/pull/50116))
-- Fix the problems of compiler dropout amp error, combinator resnet error, and inplace variable not found [#51688](https://github.com/PaddlePaddle/Paddle/pull/51688), [#52813](https://github.com/PaddlePaddle/Paddle/pull/52813), [#51769](https://github.com/PaddlePaddle/Paddle/pull/51769)
-
-#### Performance optimization
-- Optimize reshape related fusion strategy ([#53066](https://github.com/PaddlePaddle/Paddle/pull/53066))
-- Optimize performance of BuildCINNPass. ([#49696](https://github.com/PaddlePaddle/Paddle/pull/49696))
-- Optimize performance of subgraph detection module. ([#45040](https://github.com/PaddlePaddle/Paddle/pull/45040), [#46937](https://github.com/PaddlePaddle/Paddle/pull/46937))
-
-### Hardware support
-#### CustomDevice
-- Add support for the distributed strategy MP/Sharding/PP/MoE and recompute on the training side. Add support for the distributed strategy MP on the inference side. Support for hardware Ascend NPU and Cambricon MLU accessed through CustomDevice, without changing any codes, to automatically inherit all new distributed strategies added by CustomDevice.
[#52872](https://github.com/PaddlePaddle/Paddle/pull/52872), [#54384](https://github.com/PaddlePaddle/Paddle/pull/54384), [#53220](https://github.com/PaddlePaddle/Paddle/pull/53220), [#54572](https://github.com/PaddlePaddle/Paddle/pull/54572), [#54573](https://github.com/PaddlePaddle/Paddle/pull/54573), [#54676](https://github.com/PaddlePaddle/Paddle/pull/54676), [#53044](https://github.com/PaddlePaddle/Paddle/pull/53044), [#53719](https://github.com/PaddlePaddle/Paddle/pull/53719), [#53701](https://github.com/PaddlePaddle/Paddle/pull/53701), [#53702](https://github.com/PaddlePaddle/Paddle/pull/53702), [#53703](https://github.com/PaddlePaddle/Paddle/pull/53703) -- Add API paddle.device.is_compiled_with_custom_device. It is convenient for users to judge whether the current environment supports the plug-in device backend of a certain hardware. [#49271](https://github.com/PaddlePaddle/Paddle/pull/49721) -- Add environment variable CUSTOM_DEVICE_BLACK_LIST setting, to support automatic heterogeneous operation on CPU of blacklisted operators. [#50409](https://github.com/PaddlePaddle/Paddle/pull/50409), [#50666](https://github.com/PaddlePaddle/Paddle/pull/50666) -- Optimize CustomDevice performance by reducing number of calls to get_device_count interface in runtime. [#46963](https://github.com/PaddlePaddle/Paddle/pull/46963) - -#### KUNLUNXIN XPU -- For the training side, use a new version of dynamic graph, with adding support for distributed strategy MP/Sharding/PP and recompute function, and communication library. For the inference side, add support for distributed strategy MP and support for XPU FasterTransformer operator acceleration library. 
[#49531](https://github.com/PaddlePaddle/Paddle/pull/49531), [#49815](https://github.com/PaddlePaddle/Paddle/pull/49815), [#48897](https://github.com/PaddlePaddle/Paddle/pull/48897), [#50717](https://github.com/PaddlePaddle/Paddle/pull/50717), [#51082](https://github.com/PaddlePaddle/Paddle/pull/51082), [#49757](https://github.com/PaddlePaddle/Paddle/pull/49757), [#51399](https://github.com/PaddlePaddle/Paddle/pull/51399), [#50329](https://github.com/PaddlePaddle/Paddle/pull/50329), [#48369](https://github.com/PaddlePaddle/Paddle/pull/48369), [#47838](https://github.com/PaddlePaddle/Paddle/pull/47838),[#48076](https://github.com/PaddlePaddle/Paddle/pull/48076),[#47882](https://github.com/PaddlePaddle/Paddle/pull/47882),[#48961](https://github.com/PaddlePaddle/Paddle/pull/48961),[#49043](https://github.com/PaddlePaddle/Paddle/pull/49043),[#49749](https://github.com/PaddlePaddle/Paddle/pull/49749),[#49806](https://github.com/PaddlePaddle/Paddle/pull/49806),[#53427](https://github.com/PaddlePaddle/Paddle/pull/53427),[#48470](https://github.com/PaddlePaddle/Paddle/pull/48470),[#49207](https://github.com/PaddlePaddle/Paddle/pull/49207),[#52296](https://github.com/PaddlePaddle/Paddle/pull/52296),[#51785](https://github.com/PaddlePaddle/Paddle/pull/51785),[#47168](https://github.com/PaddlePaddle/Paddle/pull/47168),[#47445](https://github.com/PaddlePaddle/Paddle/pull/47445),[#50200](https://github.com/PaddlePaddle/Paddle/pull/50200),[#49934](https://github.com/PaddlePaddle/Paddle/pull/49934),[#50792](https://github.com/PaddlePaddle/Paddle/pull/50792),[#52228](https://github.com/PaddlePaddle/Paddle/pull/52228),[#53337](https://github.com/PaddlePaddle/Paddle/pull/53337),[#53389](https://github.com/PaddlePaddle/Paddle/pull/53389),[#53496](https://github.com/PaddlePaddle/Paddle/pull/53496),[#53609](https://github.com/PaddlePaddle/Paddle/pull/53609),[#53697](https://github.com/PaddlePaddle/Paddle/pull/53697),[#53496](https://github.com/PaddlePaddle/Paddle/pull/53496),[#53720](ht
tps://github.com/PaddlePaddle/Paddle/pull/53720),[#53734](https://github.com/PaddlePaddle/Paddle/pull/53734),[#54172](https://github.com/PaddlePaddle/Paddle/pull/54172),[PR46227](https://github.com/PaddlePaddle/Paddle/pull/46227) - -## 4. Deployment Direction (Paddle Inference) -### New features -- Allow multiple Paddle-TensorRT subgraph engines, or TensorRT engines in different Predictors, to share video memory in order to reduce video memory usage. [#45842](https://github.com/PaddlePaddle/Paddle/pull/45842) [#47631](https://github.com/PaddlePaddle/Paddle/pull/47631) -- For the C++ API, add APIs to obtain the Shape and data type of the input Tensor and of the output Tensor. For the C API, add SetExecStream, EnableMkldnnInt8 and other APIs that already exist in the C++ API, for service deployment. [#49758](https://github.com/PaddlePaddle/Paddle/pull/49758) -- Add paddle.inference.Predictor.register_output_hook() API. It supports printing the output of each layer under GPU inference for debugging, and can be used in control flow models such as While. Note that the API does not support Paddle-TensorRT. [#54433](https://github.com/PaddlePaddle/Paddle/pull/54433), [#47050](https://github.com/PaddlePaddle/Paddle/pull/47050), [#54254](https://github.com/PaddlePaddle/Paddle/pull/54254). -- Paddle Inference Predictor API supports paddle::Tensor as input and output, so users can directly reuse the PaddlePaddle dynamic graph for pre-inference and post-inference processing. ([#50445](https://github.com/PaddlePaddle/Paddle/pull/50445)) -- Enhance Paddle TensorRT dynamic shape support: when config.enable_tuned_tensorrt_dynamic_shape() is called without any parameters, the TensorRT Engine is built at runtime, so it is unnecessary to collect shape information before running. To avoid rebuilding at runtime, the first several runs should cover the minimum and maximum Shapes.
[#52162](https://github.com/PaddlePaddle/Paddle/pull/52162). -- Paddle-TensorRT supports model input in NHWC format. [#49633](https://github.com/PaddlePaddle/Paddle/pull/49633). -- Extend config.Exp_DisableTensorRtOPs API to disable access to TensorRT by specifying the name of the Tensor variable. [#49497](https://github.com/PaddlePaddle/Paddle/pull/49497). - -### Improvements -- Enhance GPU mixed-precision inference (non-Paddle TensorRT scenarios): Config.enable_use_gpu now allows setting the precision type. [#47993](https://github.com/PaddlePaddle/Paddle/pull/47993) -- Support double type input for inference. [#51786](https://github.com/PaddlePaddle/Paddle/pull/51786). -- Because TensorRT operators do not support the INT64 type, models containing INT64 data failed to run. Paddle-TensorRT now automatically converts such models to run in the INT32 type. [#45547](https://github.com/PaddlePaddle/Paddle/pull/45547) -- Paddle-TensorRT maps more operators to TensorRT for inference, including: - - expand_v2,gather_nd,rsqrt,sign,not,onehot,arg_min,temporal_shift,expend_as_v2,setvalue,index_select,round,acosh,square,reduce_max,not_equal,reduce_min,reduce_prod,grid_sampler,elementwise_mod,pad3d ,greater_equal,bitwise,cumsum,matmul_v2,reciprocal,where,bmm,take_along_axis,less_than,greater_than, logical_or, logical_xor, logical_and, less_equal,range,reduce_all,reduce_any ,fill_any_like ,pow - - [#47002](https://github.com/PaddlePaddle/Paddle/pull/47002) , [#47589](https://github.com/PaddlePaddle/Paddle/pull/47589) ,[#48223](https://github.com/PaddlePaddle/Paddle/pull/48223) ,[#48557](https://github.com/PaddlePaddle/Paddle/pull/48557) , [#48655](https://github.com/PaddlePaddle/Paddle/pull/48655) , [#49113](https://github.com/PaddlePaddle/Paddle/pull/49113) , [#51207](https://github.com/PaddlePaddle/Paddle/pull/51207) ,[#51028](https://github.com/PaddlePaddle/Paddle/pull/51028) 
,[#50341](https://github.com/PaddlePaddle/Paddle/pull/50341) ,[#51498](https://github.com/PaddlePaddle/Paddle/pull/51498) ,[#48534](https://github.com/PaddlePaddle/Paddle/pull/48534) ,[#48684](https://github.com/PaddlePaddle/Paddle/pull/48684) , [#49393](https://github.com/PaddlePaddle/Paddle/pull/49393) , [#49615](https://github.com/PaddlePaddle/Paddle/pull/49615) ,[#50934](https://github.com/PaddlePaddle/Paddle/pull/50934) ,[#50974](https://github.com/PaddlePaddle/Paddle/pull/50974),[#50986](https://github.com/PaddlePaddle/Paddle/pull/50986) , [#52000](https://github.com/PaddlePaddle/Paddle/pull/52000) ,[#51971](https://github.com/PaddlePaddle/Paddle/pull/51971) , [#52518](https://github.com/PaddlePaddle/Paddle/pull/52518) ,[#44918](https://github.com/PaddlePaddle/Paddle/pull/44918) ,[#48230](https://github.com/PaddlePaddle/Paddle/pull/48230) ,[#47820](https://github.com/PaddlePaddle/Paddle/pull/47820) , [#46877](https://github.com/PaddlePaddle/Paddle/pull/46877) , [#48358](https://github.com/PaddlePaddle/Paddle/pull/48358) , [#48592](https://github.com/PaddlePaddle/Paddle/pull/48592) ,[#48697](https://github.com/PaddlePaddle/Paddle/pull/48697) , [#53088](https://github.com/PaddlePaddle/Paddle/pull/53088) , [#47974](https://github.com/PaddlePaddle/Paddle/pull/47974) , [#53462](https://github.com/PaddlePaddle/Paddle/pull/53462) -- Enhance Paddle-TensorRT mapping operators strided_slice, instance_norm, prelu, argmax, cast, nearest_interp_v2, elementwise, bilinear. 
[#46819](https://github.com/PaddlePaddle/Paddle/pull/46819) ,[#47998](https://github.com/PaddlePaddle/Paddle/pull/47998) ,[#48043](https://github.com/PaddlePaddle/Paddle/pull/48043) ,[#48998](https://github.com/PaddlePaddle/Paddle/pull/48998) , [#49675](https://github.com/PaddlePaddle/Paddle/pull/49675) , [#47495](https://github.com/PaddlePaddle/Paddle/pull/47495) -- Some Paddle-TensorRT operators (scale, square, sum, swish, expand_as_v2, prelu, gelu, hard_swish, hard_sigmoid, leaky_relu, softmax, stack, clip, cast, flatten_contiguous_range, unary, equal, elementwise_op) support 0-dimensional Tensors. [#53660](https://github.com/PaddlePaddle/Paddle/pull/53660) ,[#53627](https://github.com/PaddlePaddle/Paddle/pull/53627) , [#53634](https://github.com/PaddlePaddle/Paddle/pull/53634) , [#53714](https://github.com/PaddlePaddle/Paddle/pull/53714) , [#53729](https://github.com/PaddlePaddle/Paddle/pull/53729) ,[#53769](https://github.com/PaddlePaddle/Paddle/pull/53769) ,[#53506](https://github.com/PaddlePaddle/Paddle/pull/53506) ,[#53704](https://github.com/PaddlePaddle/Paddle/pull/53704) -- Support compilation for versions earlier than GCC12 + CUDA 12.0. [#50106](https://github.com/PaddlePaddle/Paddle/pull/50106) -- Paddle-TensorRT's DeformableConv plugin supports dynamic Shape input. [#50698](https://github.com/PaddlePaddle/Paddle/pull/50698) -- For Paddle-TensorRT, add plugin support for lookup_table operator. [#46613](https://github.com/PaddlePaddle/Paddle/pull/46613) -- Add config.enable_low_precision_io() API to support low-precision type input in Paddle-TensorRT scenario. [#52485](https://github.com/PaddlePaddle/Paddle/pull/52485) -- Paddle-TensorRT's LayerNorm plugin supports FP16 computation. [#45043](https://github.com/PaddlePaddle/Paddle/pull/45043) -- Predictor's input data paddle_infer::Tensor supports bool type. [#49388](https://github.com/PaddlePaddle/Paddle/pull/49388) -- Paddle-TensorRT's enhanced Convolution implementation uses ConvolutionNd.
[#47653](https://github.com/PaddlePaddle/Paddle/pull/47653) -- conv2d_fusion operator supports NHWC format. [#49047](https://github.com/PaddlePaddle/Paddle/pull/49047) -- Adjust the directory structure related to Phi operators under C++ inference library. [#53091](https://github.com/PaddlePaddle/Paddle/pull/53091) -- Support rebuilding TensorRT Engine instead of reporting errors when TensorRT serialization and loading versions do not match. [#50775](https://github.com/PaddlePaddle/Paddle/pull/50775) 。 -- Optimize Paddle-TensorRT runtime to print log messages. [#50181](https://github.com/PaddlePaddle/Paddle/pull/50181) -- Support elementwise 0-dimensional Tensor inputs for oneDNN-based CPU inference. [#51656](https://github.com/PaddlePaddle/Paddle/pull/51656) -- Clean up and normalize support for Paddle-TensorRT's FC, matmul, matmul_v2 operators, and unify and upgrade to use TensorRT's IMatrixMultiplyLayer for support. [#52222](https://github.com/PaddlePaddle/Paddle/pull/52222) - -### Performance optimization -- Support multiple lookup_tables into Paddle-TensorRT's Embedding+Eltwise+LayerNorm fusion. [#46243](https://github.com/PaddlePaddle/Paddle/pull/46243) ,[#46230](https://github.com/PaddlePaddle/Paddle/pull/46230) -- Add MoE fusion Phi operator to improve inference performance of MoE model. [#48703](https://github.com/PaddlePaddle/Paddle/pull/48703) -- In the scenario of INT8 quantized inference, Paddle-TensorRT plugin can fall back to FP16 computation, instead of FP32 computation. [#50554](https://github.com/PaddlePaddle/Paddle/pull/50554) -- Optimize memory and video memory in case of inference. [#49051](https://github.com/PaddlePaddle/Paddle/pull/49051) , [#49046](https://github.com/PaddlePaddle/Paddle/pull/49046) ,[#53930](https://github.com/PaddlePaddle/Paddle/pull/53930) -- Optimize Layout and enhance Pass. [#52997](https://github.com/PaddlePaddle/Paddle/pull/52997) -- Support caching of operator Shape inferences to improve model inference performance. 
[#48312](https://github.com/PaddlePaddle/Paddle/pull/48312) -- Optimize bias+add+relu fusion using half2 instructions. [#49048](https://github.com/PaddlePaddle/Paddle/pull/49048) -- Optimize Concat Kernel for multiple inputs using vectorization operations. [#49540](https://github.com/PaddlePaddle/Paddle/pull/49540) -- Implement Convolution, Depthwise Convolution and related fusion operators based on CUTLASS to improve inference speed. [#47989](https://github.com/PaddlePaddle/Paddle/pull/47989) ,[#50603](https://github.com/PaddlePaddle/Paddle/pull/50603) ,[#51792](https://github.com/PaddlePaddle/Paddle/pull/51792) ,[#50603](https://github.com/PaddlePaddle/Paddle/pull/50603) -- Paddle-TensorRT supports a FlashAttention plugin, to improve inference speed of models such as StableDiffusion. [#49438](https://github.com/PaddlePaddle/Paddle/pull/49438). -- Add Transpose+LayerNorm fusion PASS, to improve inference speed of models such as StableDiffusion. [#50082](https://github.com/PaddlePaddle/Paddle/pull/50082). -- Add Elementwise+Transpose fusion. [#50081](https://github.com/PaddlePaddle/Paddle/pull/50081) -- Optimize Paddle-TensorRT Group Norm plugin implementation. [#49160](https://github.com/PaddlePaddle/Paddle/pull/49160) -- For Config.EnableTensorRtEngine() API, add use_cuda_graph parameter, which enables CUDA Graph to reduce runtime overhead. Note that the model input shape must remain unchanged during usage. [#53406](https://github.com/PaddlePaddle/Paddle/pull/53406) -- Support inplace operation of Reshape, to reduce copying time of the model at runtime. [#49146](https://github.com/PaddlePaddle/Paddle/pull/49146) -- Optimize LayerNorm kernel implementation based on oneDNN. [#47782](https://github.com/PaddlePaddle/Paddle/pull/47782) -- Support fusion of quantize+transpose and transpose+dequantize based on oneDNN.
[#49509](https://github.com/PaddlePaddle/Paddle/pull/49509) -- When MKLDNN is turned on in CPU inference, FC-related fusion pass is enabled by default, to improve performance. [#45704](https://github.com/PaddlePaddle/Paddle/pull/45704) -- CPU OneDNN inference supports squeeze2 + transpose2 fusion. [#47592](https://github.com/PaddlePaddle/Paddle/pull/47592) - -### XPU inference enhancement and performance optimization -- Add ExpRunWithRuntimeConfig API and XpuRuntimeConfig, to allow setting parameters such as external streams and L3 cache during inference. GetExecStream API supports obtaining Kunlun external stream objects. Input and output support Kunlun device memory, to reduce D2H and H2D overheads. [#53334](https://github.com/PaddlePaddle/Paddle/pull/53334), [#52466](https://github.com/PaddlePaddle/Paddle/pull/52466), [#53240](https://github.com/PaddlePaddle/Paddle/pull/53240) -- Add multi-encoder, fused_multi_transformer and fusion pass, to improve performance of ERNIE and Transformer class models. [#50570](https://github.com/PaddlePaddle/Paddle/pull/50570), [#51346](https://github.com/PaddlePaddle/Paddle/pull/51346), [#50499](https://github.com/PaddlePaddle/Paddle/pull/50499), [#53982](https://github.com/PaddlePaddle/Paddle/pull/53982), [#50759](https://github.com/PaddlePaddle/Paddle/pull/50759), [#51571](https://github.com/PaddlePaddle/Paddle/pull/51571), [#53144](https://github.com/PaddlePaddle/Paddle/pull/53144), [#53306](https://github.com/PaddlePaddle/Paddle/pull/53306) -- Optimize BeamSearch performance. Transform, remove and fuse fine-grained operators such as write_read_array and gather, to improve model performance when beam_size=1. [#53130](https://github.com/PaddlePaddle/Paddle/pull/53130) -- Transform multiple stack operators with the same input into unsqueeze operators that support broadcast. Unsqueeze/squeeze supports inplace computation.
[#52099](https://github.com/PaddlePaddle/Paddle/pull/52099) -- Add support for exporting multi-card inference models for Kunlunxin. [#50490](https://github.com/PaddlePaddle/Paddle/pull/50490) -- Add embedding_with_eltwise_add fusion pass and operator phi kernel, to reduce video memory usage and improve inference performance. [#50590](https://github.com/PaddlePaddle/Paddle/pull/50590) -- interpolate class operator phi kernel supports FP16. [#52358](https://github.com/PaddlePaddle/Paddle/pull/52358) -- argmax operator supports INT32 type output. [#51303](https://github.com/PaddlePaddle/Paddle/pull/51303) -- Fix the error of only model file when saving serialized model after turning on mixed-precision inference mode. [#52994](https://github.com/PaddlePaddle/Paddle/pull/52994) -- Fix segment error of instance_norm when scale and bias are empty. [#52627](https://github.com/PaddlePaddle/Paddle/pull/52627) -- conv_transpose operator supports FP16. [#53626](https://github.com/PaddlePaddle/Paddle/pull/53626) -- Add yolo_box_xpu fusion pass and operator phi kernel, to optimize YOLO model generic substructure. [#54163](https://github.com/PaddlePaddle/Paddle/pull/54163) -- Add conv2d_xpu fusion pass and operator phi kernel, and support FP16 inference, to optimize convolution operation inference consumption time. [#52247](https://github.com/PaddlePaddle/Paddle/pull/52247) ,[#53626](https://github.com/PaddlePaddle/Paddle/pull/53626) -- Add sigmoid_elementmul generic fusion pass, to fuse to swish operator to match conv2d_fusion pass to improve YOLO model inference performance. [#53580](https://github.com/PaddlePaddle/Paddle/pull/53580) -- Add act_add fusion pass and operator phi kernel to improve inference performance. [#53965](https://github.com/PaddlePaddle/Paddle/pull/53965) -- Add fold_interp_outsize fusion pass, to improve inference performance. 
[#54245](https://github.com/PaddlePaddle/Paddle/pull/54245) -- Solve the problem of incorrect results due to duplicate fusion when there is shared weight in FC. [#51108](https://github.com/PaddlePaddle/Paddle/pull/51108)、[#51039](https://github.com/PaddlePaddle/Paddle/pull/51039) -- Remove op_device attribute where operator is only used for training, to prevent wrong choice of place for training during inference. [#51029](https://github.com/PaddlePaddle/Paddle/pull/51029) -- Support saving of optimized models, allowing PASS optimization to be skipped in case of re-inference, to reduce first time inference time. [#53696](https://github.com/PaddlePaddle/Paddle/pull/53696) -- Solve the problem of computation error caused by the CPUPlace input of operator Kernel being forced to copy to XPU. [#51306](https://github.com/PaddlePaddle/Paddle/pull/51306) -- subblock supports early copying of H2D parameters to improve inference performance. [#51876](https://github.com/PaddlePaddle/Paddle/pull/51876) -- Fix scale memory size of the output activation of Kunlunxin 2nd generation chip. [#53505](https://github.com/PaddlePaddle/Paddle/pull/53505) -- In new executor Kunlunxin D2D copy, support asynchronous execution. [#51876](https://github.com/PaddlePaddle/Paddle/pull/51876) -- Remove concat operator with only one input. [#52304](https://github.com/PaddlePaddle/Paddle/pull/52304) -- lookup_table_v2 supports FP16 to remove redundant cast operator. [#52888](https://github.com/PaddlePaddle/Paddle/pull/52888) -- Control flow While operator supports caching scope, to reduce overhead of creating new scope every time. [#52628](https://github.com/PaddlePaddle/Paddle/pull/52628) -- Scatter newly supports FP16, to remove redundant cast operators and elementwise_mul operators with an input of 1. [#52831](https://github.com/PaddlePaddle/Paddle/pull/52831) - -### Model quantization -- Upgrade of dynamic graph quantization function. 
- - Add a new API for quantization training of dynamic graph models: ```paddle.quantization.QAT ```. Support passing quantization-related parameters through configuration, simplifying quantization training process and difficulty of secondary development. ([#49398](https://github.com/PaddlePaddle/Paddle/pull/49398)) - - Add a new offline quantization API: ```paddle.quantization.PTQ ```. Support exporting quantization model to model format supported by inference. ([#50107](https://github.com/PaddlePaddle/Paddle/pull/50107)) - - Add STUB operator to simulate actual quantization operation during training process. ([#50510](https://github.com/PaddlePaddle/Paddle/pull/50510)) -- Support quantization training model to load parameters of offline quantization model. Support more operators for quantization, including matmul, scale, and conv1d. [#47892](https://github.com/PaddlePaddle/Paddle/pull/47892), [#45911](https://github.com/PaddlePaddle/Paddle/pull/45911),[#48912](https://github.com/PaddlePaddle/Paddle/pull/48912) -- Support hybrid parallel training of static graph quantization training. [#52219](https://github.com/PaddlePaddle/Paddle/pull/52219) -- Fix the problem in the process of dynamic graph quantization: - - Repeat insertion of quantization nodes when exporting quantization training models. [#48751](https://github.com/PaddlePaddle/Paddle/pull/48751) - - Fix the problem of inserting quantization nodes into model input. [#49926](https://github.com/PaddlePaddle/Paddle/pull/49926) - -## 5. Environment Adaptation -Improve efficiency of source code compilation, and promote setuptools + ninja compilation method to increase development efficiency: In CPU scenarios, full amount of compilation time is reduced by 20 min, and compilation speed is increased by 24.52%. In GPU scenario, full amount of compilation time is reduced by 22 min, and compilation speed is increased by 29.31%. 
In order to adapt to mainstream development environments, PaddlePaddle supports gcc12 compilation and C++17 in the source code, and adapts to the latest CUDA12. In terms of code quality, complete cleanup of compilation warnings, to improve compilation experience. At the third-party dependency level, we have upgraded the version of underlying protobuf to reduce dependency, cleaned up deprecated attributes of some earlier versions of dependency libraries and old code formats, and removed support for Python 2.x. -- ninja compilation adaptation to improve compilation speed. [#52433](https://github.com/PaddlePaddle/Paddle/pull/52433),[#48932](https://github.com/PaddlePaddle/Paddle/pull/48932),[#49420](https://github.com/PaddlePaddle/Paddle/pull/49420),[#48435](https://github.com/PaddlePaddle/Paddle/pull/48435),[#49303](https://github.com/PaddlePaddle/Paddle/pull/49303),[#49448](https://github.com/PaddlePaddle/Paddle/pull/49448),[#49838](https://github.com/PaddlePaddle/Paddle/pull/49838),[#50067](https://github.com/PaddlePaddle/Paddle/pull/50067),[#52796](https://github.com/PaddlePaddle/Paddle/pull/52796),[#50431](https://github.com/PaddlePaddle/Paddle/pull/50431),[#49181](https://github.com/PaddlePaddle/Paddle/pull/49181),[#48867](https://github.com/PaddlePaddle/Paddle/pull/48867),[#48490](https://github.com/PaddlePaddle/Paddle/pull/48490),[#48211](https://github.com/PaddlePaddle/Paddle/pull/48211),[#49499](https://github.com/PaddlePaddle/Paddle/pull/49499),[#53076](https://github.com/PaddlePaddle/Paddle/pull/53076) -- setuptools compilation and package all-in-one adaptation. 
[#48770](https://github.com/PaddlePaddle/Paddle/pull/48770),[#46957](https://github.com/PaddlePaddle/Paddle/pull/46957),[#49583](https://github.com/PaddlePaddle/Paddle/pull/49583),[#47602](https://github.com/PaddlePaddle/Paddle/pull/47602),[#48301](https://github.com/PaddlePaddle/Paddle/pull/48301),[#50800](https://github.com/PaddlePaddle/Paddle/pull/50800),[#42575](https://github.com/PaddlePaddle/Paddle/pull/42575)),[#49826](https://github.com/PaddlePaddle/Paddle/pull/49826),[#49002](https://github.com/PaddlePaddle/Paddle/pull/49002),[#51443](https://github.com/PaddlePaddle/Paddle/pull/51443),[#51528](https://github.com/PaddlePaddle/Paddle/pull/51528),[#52621](https://github.com/PaddlePaddle/Paddle/pull/52621),[#52465](https://github.com/PaddlePaddle/Paddle/pull/52465) -- gcc12 support. [#52960](https://github.com/PaddlePaddle/Paddle/pull/52960),[#52265](https://github.com/PaddlePaddle/Paddle/pull/52265),[#46546](https://github.com/PaddlePaddle/Paddle/pull/46546),[#52318](https://github.com/PaddlePaddle/Paddle/pull/52318),[#46808](https://github.com/PaddlePaddle/Paddle/pull/46808),[#47466](https://github.com/PaddlePaddle/Paddle/pull/47466),[#52083](https://github.com/PaddlePaddle/Paddle/pull/52083),[#48176](https://github.com/PaddlePaddle/Paddle/pull/48176),[#49423](https://github.com/PaddlePaddle/Paddle/pull/49423),[#49452](https://github.com/PaddlePaddle/Paddle/pull/49452),[#51037](https://github.com/PaddlePaddle/Paddle/pull/51037),[#52007](https://github.com/PaddlePaddle/Paddle/pull/52007),[#52441](https://github.com/PaddlePaddle/Paddle/pull/52441),[#52085](https://github.com/PaddlePaddle/Paddle/pull/52085),[#50817](https://github.com/PaddlePaddle/Paddle/pull/50817),[#52646](https://github.com/PaddlePaddle/Paddle/pull/52646),[#50777](https://github.com/PaddlePaddle/Paddle/pull/50777),[#53288](https://github.com/PaddlePaddle/Paddle/pull/53288),[#54009](https://github.com/PaddlePaddle/Paddle/pull/54009) -- c++17 standard support. 
[#53345](https://github.com/PaddlePaddle/Paddle/pull/53345),[#53892](https://github.com/PaddlePaddle/Paddle/pull/53892),[#54282](https://github.com/PaddlePaddle/Paddle/pull/54282),[#49017](https://github.com/PaddlePaddle/Paddle/pull/49017),[#47635](https://github.com/PaddlePaddle/Paddle/pull/47635),[#54258](https://github.com/PaddlePaddle/Paddle/pull/54258) -- cuda12 support. [#52285](https://github.com/PaddlePaddle/Paddle/pull/52285),[#49592](https://github.com/PaddlePaddle/Paddle/pull/49592),[#52232](https://github.com/PaddlePaddle/Paddle/pull/52232),[#52654](https://github.com/PaddlePaddle/Paddle/pull/52654),[#54641](https://github.com/PaddlePaddle/Paddle/pull/54641) -- CodeStyle. [#45909](https://github.com/PaddlePaddle/Paddle/pull/45909),[#47772](https://github.com/PaddlePaddle/Paddle/pull/47772),[#48538](https://github.com/PaddlePaddle/Paddle/pull/48538),[#49522](https://github.com/PaddlePaddle/Paddle/pull/49522),[#47264](https://github.com/PaddlePaddle/Paddle/pull/47264),[#49558](https://github.com/PaddlePaddle/Paddle/pull/49558) -- Remove compilation warnings.
[#47163](https://github.com/PaddlePaddle/Paddle/pull/47163),[#47216](https://github.com/PaddlePaddle/Paddle/pull/47216),[#47309](https://github.com/PaddlePaddle/Paddle/pull/47309),[#47252](https://github.com/PaddlePaddle/Paddle/pull/47252),[#47341](https://github.com/PaddlePaddle/Paddle/pull/47341),[#47399](https://github.com/PaddlePaddle/Paddle/pull/47399),[#47513](https://github.com/PaddlePaddle/Paddle/pull/47513),[#47558](https://github.com/PaddlePaddle/Paddle/pull/47558),[#47706](https://github.com/PaddlePaddle/Paddle/pull/47706),[#52717](https://github.com/PaddlePaddle/Paddle/pull/52717),[#51203](https://github.com/PaddlePaddle/Paddle/pull/51203),[#51336](https://github.com/PaddlePaddle/Paddle/pull/51336),[#51608](https://github.com/PaddlePaddle/Paddle/pull/51608),[#51633](https://github.com/PaddlePaddle/Paddle/pull/51633),[#46644](https://github.com/PaddlePaddle/Paddle/pull/46644),[#53092](https://github.com/PaddlePaddle/Paddle/pull/53092),[#53185](https://github.com/PaddlePaddle/Paddle/pull/53185),[#53246](https://github.com/PaddlePaddle/Paddle/pull/53246),[#53650](https://github.com/PaddlePaddle/Paddle/pull/53650),[#53683](https://github.com/PaddlePaddle/Paddle/pull/53683),[#53687](https://github.com/PaddlePaddle/Paddle/pull/53687),[#53886](https://github.com/PaddlePaddle/Paddle/pull/53886),[#53689](https://github.com/PaddlePaddle/Paddle/pull/53689),[#53679](https://github.com/PaddlePaddle/Paddle/pull/53679),[#53681](https://github.com/PaddlePaddle/Paddle/pull/53681),[#53532](https://github.com/PaddlePaddle/Paddle/pull/53532),[#47137](https://github.com/PaddlePaddle/Paddle/pull/47137),[#47045](https://github.com/PaddlePaddle/Paddle/pull/47045),[#52186](https://github.com/PaddlePaddle/Paddle/pull/52186),[#52490](https://github.com/PaddlePaddle/Paddle/pull/52490),[#53924](https://github.com/PaddlePaddle/Paddle/pull/53924),[#53938](https://github.com/PaddlePaddle/Paddle/pull/53938),[#53945](https://github.com/PaddlePaddle/Paddle/pull/53945),[#53851](https://git
hub.com/PaddlePaddle/Paddle/pull/53851),[#53847](https://github.com/PaddlePaddle/Paddle/pull/53847),[#53818](https://github.com/PaddlePaddle/Paddle/pull/53818),[#53931](https://github.com/PaddlePaddle/Paddle/pull/53931) -- Support protobuf upgrade. [#49875](https://github.com/PaddlePaddle/Paddle/pull/49875),[#48495](https://github.com/PaddlePaddle/Paddle/pull/48495),[#49673](https://github.com/PaddlePaddle/Paddle/pull/49673),[#52499](https://github.com/PaddlePaddle/Paddle/pull/52499),[#51161](https://github.com/PaddlePaddle/Paddle/pull/51161),[#49168](https://github.com/PaddlePaddle/Paddle/pull/49168) -- Support offline compilation of third-party libraries. [#54326](https://github.com/PaddlePaddle/Paddle/pull/54326),[#54370](https://github.com/PaddlePaddle/Paddle/pull/54370),[#54335](https://github.com/PaddlePaddle/Paddle/pull/54335),[#54346](https://github.com/PaddlePaddle/Paddle/pull/54346),[#53744](https://github.com/PaddlePaddle/Paddle/pull/53744),[#54319](https://github.com/PaddlePaddle/Paddle/pull/54319),[#53915](https://github.com/PaddlePaddle/Paddle/pull/53915) -- Phi independent compilation header file dependency decoupling. [#50456](https://github.com/PaddlePaddle/Paddle/pull/50456),[#47088](https://github.com/PaddlePaddle/Paddle/pull/47088),[#52573](https://github.com/PaddlePaddle/Paddle/pull/52573),[#52651](https://github.com/PaddlePaddle/Paddle/pull/52651) -- Python2.x decommissioning. [#48685](https://github.com/PaddlePaddle/Paddle/pull/48685) - -## 6. 
Security -- Fix bugs such as null pointer usage, illegal address access, memory out of bounds, divide by 0, and Python IndexError [PR49976](https://github.com/PaddlePaddle/Paddle/pull/49976), [ PR49993](https://github.com/PaddlePaddle/Paddle/pull/49993)[, PR49942](https://github.com/PaddlePaddle/Paddle/pull/49942), [PR49965](https://github.com/PaddlePaddle/Paddle/pull/49965)[, PR50000](https://github.com/PaddlePaddle/Paddle/pull/50000)[, PR50005](https://github.com/PaddlePaddle/Paddle/pull/50005)[, PR49953](https://github.com/PaddlePaddle/Paddle/pull/49953)[, PR49995](https://github.com/PaddlePaddle/Paddle/pull/49995)[, PR49974](https://github.com/PaddlePaddle/Paddle/pull/49974)[, PR50015](https://github.com/PaddlePaddle/Paddle/pull/50015)[, PR50010](https://github.com/PaddlePaddle/Paddle/pull/50010), [PR49979](https://github.com/PaddlePaddle/Paddle/pull/49979), [PR49994](https://github.com/PaddlePaddle/Paddle/pull/49994), [PR49977](https://github.com/PaddlePaddle/Paddle/pull/49977)[, PR49968](https://github.com/PaddlePaddle/Paddle/pull/49968), [PR49984](https://github.com/PaddlePaddle/Paddle/pull/49984)[, PR49958](https://github.com/PaddlePaddle/Paddle/pull/49958)[, PR50008](https://github.com/PaddlePaddle/Paddle/pull/50008)[, PR51714](https://github.com/PaddlePaddle/Paddle/pull/51714), [PR51847](https://github.com/PaddlePaddle/Paddle/pull/51847), [PR51034](https://github.com/PaddlePaddle/Paddle/pull/51034)[, PR51088](https://github.com/PaddlePaddle/Paddle/pull/51088)[, PR51091](https://github.com/PaddlePaddle/Paddle/pull/51091)[, PR51092](https://github.com/PaddlePaddle/Paddle/pull/51092), [PR49966](https://github.com/PaddlePaddle/Paddle/pull/49966), [PR49656](https://github.com/PaddlePaddle/Paddle/pull/49656), [PR52161](https://github.com/PaddlePaddle/Paddle/pull/52161), [PR49548](https://github.com/PaddlePaddle/Paddle/pull/49548), [PR49546](https://github.com/PaddlePaddle/Paddle/pull/49546), [PR49547](https://github.com/PaddlePaddle/Paddle/pull/49547), 
[PR49549](https://github.com/PaddlePaddle/Paddle/pull/49549), [PR51850](https://github.com/PaddlePaddle/Paddle/pull/51850) - -## Thanks to our Contributors -This release contains contributions from: -1want2sleep, 201716010711, 404988613, 5u13, 6clc, Ackeraa, Aganlengzi, ahahahahahaha, Ainavo, Allen Guo, andyj, Asthestarsfalll, Aurelius84, Ayuan, BellaZYL, Bjmw3, Bo Zhang, bukejiyu, caozhou, carryyu, Ccc, ccrrong, ceci3, chalsliu, Chang Xu, CHANGer, Charles-hit, Chen Weihang, chenjian, Chenxiao Niu, chenxiao120660, chenxujun, Chitsing KUI, cifar10, co63oc, CollaborativeFiltering, csy0225, cxxly, cyber-pioneer, cyberslack_lee, czr-gc, Dandelight, danleifeng, Danyang Zhang, dasen, denglianbin, Difer, dongfangshenzhu, DrowFish19, duanboqiang, duanyanhui, engineer, engineer1109, Epsilon Luoo, feifei-111, Feiyu Chan, Feng Ni, feng_shuai, Fisher, FlyingQianMM, Frank Lin, Galaxy1458, GaoYuYang, gaoziyuan, gem5, GGBond8488, Ghost Screaming, gongenlei, gouzil, Guanghua Yu, Guo Sheng, Guoxia Wang, Hamid Zare, Hanchiao, handiz, Haohongxiang, haosicheng, haozi, Happyd99, heliqi, hellockx, hellolllw, heyanru, hg-1099255210, hh-qiao, hjyp, hong, HongyuJia, houj04, hua-zi, Huang Jiyi, Huang Zhengjie, huangjiyi, huangjun12, Hui Zhang, Huihuang Zheng, Hulek, hwa, HydrogenSulfate, Ikko Eltociear Ashimine, iLeGend, Infinity_lee, Infrared1029, Jacek Czaja, jakpiase, james, jameszhang, Jiabin Yang, jiahongyu, jiangcheng, jiangfan06, Jianghai, jiaqianjing, jingsongliu, JingZhuangzhuang, jjyaoao, joanna.wozna.intel, junxiu777, Jx-qi, JYChen, JZ-LIANG, jzhang533, Kai Song, Kai Xing, Kaipeng Deng, Kang Zhao, kangguangli, Kevin Wu Jiawen , Kim, Kim Yann, knamg, kuizhiqing, lanxianghit, Leding Li, Leo Chen, Leo Guo, levi131, Li Min, Li-fAngyU, Ligoml, lijialin03, lijin23, limingshu, Lin Manhui, LinearTemporalLogic, Linjie Chen, lishicheng1996, Little-chick, littleforest, liu zhengxi, liulinduo, liuruyan, liuzhenhai93, LiYuRio, lj970926, LokeZhou, LoneRanger, lubiu, Lucas, lugimzzz, Lux et 
Veritas, lxsbupt, LyndonKong, lzy, lzydev, Mahmoud Ashraf, Manan Goel, Maple Xie, Matsumoto Ruko, mayang002, MayYouBeProsperous, megemini, mengziheng, Meteor Liu, mhy, mhy-666, Ming-Xu Huang, ming1753, minghaoBD, mjxs, Moqim, Mountagha, Mr.Juice, mrcangye, NetPunk, Netpunk, nihao, niuliling123, Nyakku Shigure, OccupyMars2025, Ouyang Chao, pangengzheng, pangyoki, parap1uie-s, Paulina Gacek, Piotr Paturej, PommesPeter, PPGitub, PPPPzhang, PuQing, Qi Li, Qi Shao, QingshuChen, qipengh, qizhaoaoe, Rayman, RedContritio, RichardWooSJTU, risemeup1, Roc, ronnywang, Ruibiao Chen, Ruibin Cheung, RuohengMa, Ryan, SaltFish11, Sanbu, Scotty, scotty, seemingwang, Shaojie WANG, ShenLiang, shentanyue, Shijie, Shuangchi He, Siming Dai, Sing_chan, sneaxiy, Sonder, sprouteer, Sqhttwl, sunli, superwinner1, supplyout, SylarTiaNII, Sylwester Fraczek, Sławomir Siwek, taixiurong, Tao Luo, Taylor-Layrose, TeFeng Chen, Thomas Young, thunder95, Thunderbrook, Tian, Tian Zheng, tiancaishaonvjituizi, tianshuo78520a, tifa, Tinson Lai, Tomasz Socha, Tony Cao, ucsk, umiswing, ustiniankw, Vegetable dog, Vigi Zhang, Vvsmile, Wang Bojun, Wang Xin, Wang Xinyu, wangfengsheng1999, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, wangshengxiang, wangxiaoning, wangxinxin08, Wangzheee, WangZhen, wangzhen38, wasupandceacar, wawltor, Wei Shengyu, Weilong Wu, weishengying, Wen Sun, wenbin, wentao yu, wenzhe.wang, westfish, whisky-12, whs, Wilber, will-jl944, winter-wang, Winters Montagne, WJJ1995, wuhuachaocoding, wuyefeilin, wz1qqx, XiangGao, xiaoguoguo626807, xiaohemaikoo, xiaoluomi, xiaoting, xiaoxiaohehe001, Xiaoxu Chen, xiaoyuanzi914, Xinger, Xinyu Chen, xiongkun, xjmxyt, xu98bin, xysheng-baidu, yangguohao, yangjianfengo1, YangQun, YangZhou, yeliang2258, YepKong, Yichen Zhang, yikaikkk, Yiqun Liu, yjphhw, ykkk2333, Young-Flash, yu wentao, Yuang Liu, Yuanle Liu, YuanRisheng, yuchen202, yuehuayingxueluo, YuhangLi, Yulong Ao, YUNSHEN XIE, yunyaoXYY, YuRonan, zachary sun, ZeKai Zhou, Zenghui Yuan, 
zengshao0622, Zero Rains, Zhan Rongrui, Zhang Jun, Zhang Na, Zhang Ting, Zhang Zheng, zhangbo9674, ZhangDY-6483, zhangkaihuo, zhangxin81, zhangyikun02, zhangyingying520, zhangyuqin1998, zhaocaibei123, zhaoyingli, Zhen Wang, Zheng-Bicheng, Zhenghai Zhang, Zheng_Bicheng, zhenyun, Zhibao Li, zhiboniu, Zhong Hui, Zhou Wei, ZhouMengLei1999, zhoutianzi666, zhouzj, zhupengyang, zhurou603, zhuyipin, zhwesky2010, ziyoujiyi, zlsh80826, Zman, zmxdream, zqw_1997, Zuza Gawrysiak, zxcd, zyfncg, ZZK, zzk0, Ding Yi, Fu Jianhan, Liu Ge Gu Tou, Lu Lin, Zhou Zhouzhou, Jiang Yongyong, Xue Zhawu, Zhang Chunqiao, Zhang Zhenghai, Ning Meng Wei, Wang Mingdong, Shi Xiaowei, Chao Ji Ma Niu, Chen Cangye, Qi Ma Xiao Mao - -# 2.4.2 Release Note - - V2.4.2 fixed known bugs, and added a tiny set of features. - -## Training Framework (distributed included) - - - Fix the problem while using paddle.utils.dlpack.to_dlpack API to create dlpack objects multiple times in the for loop, and fix the bug that the reference counting error causes the memory actually pointed by dlpack to be destructed unexpectedly. [#50138](https://github.com/PaddlePaddle/Paddle/pull/50138) - - Fixed the issue of out-of-bounds memory access when the input tensor is multi-dimensional in paddle.multiplex API. [#49368](https://github.com/PaddlePaddle/Paddle/pull/49368) - - Fix the occasional compilation error caused by incorrect referencing of the Eigen header file. 
[#48157](https://github.com/PaddlePaddle/Paddle/pull/48157) - - Fixed the bug that the output value of the backward operator may be None when the output gradient parameter order of the custom operator is not continuous.[#48656](https://github.com/PaddlePaddle/Paddle/pull/48656) - - Add cutlass and implement the fusion kernel of gather+gemm+scatter; Optimize training and inference performance of sparse convolution; Optimize inference performance of batch_norm under 1D input data.[#50118](https://github.com/PaddlePaddle/Paddle/pull/50118) - - Fix compilation failure in gcc54 environment caused by using constexpr. [#50421](https://github.com/PaddlePaddle/Paddle/pull/50421) - - Move sum op kernel to PHI and fix bug that can't get correct SelectedRows' dims when run infermeta.[#49342](https://github.com/PaddlePaddle/Paddle/pull/49342) - - Fixed the issue that the fold operator accesses memory out of bounds under large bs input.[#49491](https://github.com/PaddlePaddle/Paddle/pull/49491) - - Fix the problem that no parameter Layer cannot call backward under dynamic to static mode.[#49812](https://github.com/PaddlePaddle/Paddle/pull/49812) - - Fix the compile problem of CUDA11.8 on windows platform.[#50205](https://github.com/PaddlePaddle/Paddle/pull/50205) - - Fix the unsupported error for `FusedDropoutActBiasGrad` on H100.[#47285](https://github.com/PaddlePaddle/Paddle/pull/47285) - - Add `debug_graphviz_path` option into `build_strategy`.[#46531](https://github.com/PaddlePaddle/Paddle/pull/46531) - - Fix the not closed `popen` object.[#47053](https://github.com/PaddlePaddle/Paddle/pull/47053) - -## Deployment Direction (Paddle Inference) - - - Improve the functionality and stability of mixed-precision inference. 
Reconstruct the implementation of interface convert_to_mixed_precision and add parameter precision to interface enable_use_gpu.[#49077](https://github.com/PaddlePaddle/Paddle/pull/49077), [#49239](https://github.com/PaddlePaddle/Paddle/pull/49239), [#49477](https://github.com/PaddlePaddle/Paddle/pull/49477) - - Support compilation under jetson ampere architecture.[#49364](https://github.com/PaddlePaddle/Paddle/pull/49364) - - Fixed fc kernel diff.[#49781](https://github.com/PaddlePaddle/Paddle/pull/49781) - - Fixed the error of trt workspace parameter type under CAPI. [#48350](https://github.com/PaddlePaddle/Paddle/pull/48350) - - Fixed the error caused by arg_max/arg_min without flatten dtype parameter in Paddle 1.x version. [#49771](https://github.com/PaddlePaddle/Paddle/pull/49771) - - Fixed the bug of missing information about lod logic after split infermeta's refactoring. [#49745](https://github.com/PaddlePaddle/Paddle/pull/49745) - - Fixed the bug of the constant-folding pass, which causes the conv2d weight to be non-persistent after folding and not enter the TensorRT engine. [#50105](https://github.com/PaddlePaddle/Paddle/pull/50105) - -# 2.4.1 Release Note - - -Remove the dependence of the Paddle on python.so, and fix the bug that fails to execute due to the inability to find python.so in specific environments, including conda. - - -# 2.4.0 Release Note - -## 1. Important Updates - -- **New dynamic graph architecture is officially effective**: The new dynamic graph framework has significantly improved the scheduling performance. The scheduling performance of more than 90% of APIs is improved by over 50%, and the model performance of more than 50% of kits is improved by over 5%. The functional architecture is clearer, and the secondary development capability and experience are significantly enhanced. 
- -- **Comprehensive improvement of the dynamic-static unification ability of the PaddlePaddle**: The dynamic-to-static function is provided with richer Python syntax support. The Python syntax coverage of the PaddlePaddle reaches 90%. The syntax transcription logic is mainly optimized to completely support the control flow syntax, with providing smooth dynamic-to-static graph experiences by pressing one key. With the newly upgraded static graph executor, the dynamic-to-static training has better acceleration capability, and the key model test shows that it is close to the best level of the static graph. The dynamic-to-static scalability is improved, with newly supporting multi-function merge export and inference. Users can use the PHI operator library for secondary development and flexible deployment. This can effectively support the custom decoding of U2++ featured models in the speech domain. - -- **Add sparse computing APIs**: Add 55 sparse APIs `paddle.sparse.*` and support mainstream sparse computing scenarios. The APIs have been applied to sparse training and inference deployment for 3D point cloud target detection, Sparse Transformers, and other tasks, with a speedup of 105.75% compared to DenseTensor in high sparse scenarios. In contrast to similar products, the speed of sparse computing is increased by 4.01%-58.55%. Support the computing of a variety of sparse Tensors (SparseCoo and SparseCsr). This is the ultimate saving of video memory. Meanwhile, it maintains a consistent usage experience, with the same usage method of the dense Tensor API. - -- **Large-scale graph neural network GPU training engine**: Through the heterogeneous hierarchical storage technology of SSD, memory, and video memory, it breaks through the video memory bottleneck and supports all-GPU storage and training of super-large-scale graphs. It realizes the all-GPU integrated solution of walk, sampling and training. 
This can increase the training speed by more than 10x under the same costs, compared to the traditional distributed CPU solution. - -- **Environment adaptation**: Add pre-compiled installer adapted to CUDA version 11.7. It newly supports the running in Ubuntu 22.04 or later. - -### Forward-looking forecast - -- PaddlePaddle Framework will deprecate support for python 3.6 in version 2.5. -- The PaddlePaddle framework will gradually deprecate the API under the `paddle.fluid` namespace on the python side, and some of the APIs under this namespace will be directly removed in version 2.5. - -## 2. Incompatibility upgrade - -- The pre-compiled installer for CUDA version 10.1 is cancelled. -- The Tensor.clear_gradient(bool set_to_zero) interface will not take the value passed by kwargs, and will have to pass the bool variable of set_to_zero through args. -- In order to improve the utilization efficiency of video memory, only the gradients of forward leaf node variables, such as the gradients of network parameters in training, are retained in the dynamic graph by default, instead of the gradients of non-leaf nodes. If you need to preserve a specific Tensor gradient, you can call the Tensor.retain_grads() interface before reverse execution. -- paddle.autograd.PyLayer will no longer support the case where the input is a tuple; pass in a list of Tensor if you want a group of them. - -## 3. Training framework (including the distributed feature) - -### (1)New APIs and enhanced API functions -- **Add the sparse computing class API**:paddle.sparse - - Add 55 sparse APIs and support mainstream sparse computing scenarios. The APIs have been applied to sparse training and inference deployment for 3D point cloud target detection, Sparse Transformers, and other tasks, with a speedup of 105.75% compared to DenseTensor in high sparse scenarios. In contrast to similar products, the speed of sparse computing is increased by 4.01%-58.55%. 
Support the computing of a variety of sparse Tensors (SparseCoo and SparseCsr). This is the ultimate saving of video memory. Meanwhile, it maintains a consistent usage experience, with the same usage method of the dense Tensor API.[#45849](https://github.com/PaddlePaddle/Paddle/pull/45849), [#46694](https://github.com/PaddlePaddle/Paddle/pull/46694), [#45086](https://github.com/PaddlePaddle/Paddle/pull/45086), [#41857](https://github.com/PaddlePaddle/Paddle/pull/41857), [#42935](https://github.com/PaddlePaddle/Paddle/pull/42935), [#43475](https://github.com/PaddlePaddle/Paddle/pull/43475), [#43668](https://github.com/PaddlePaddle/Paddle/pull/43668), [#43966](https://github.com/PaddlePaddle/Paddle/pull/43966), [#44022](https://github.com/PaddlePaddle/Paddle/pull/44022), [#44346](https://github.com/PaddlePaddle/Paddle/pull/44346), [#44432](https://github.com/PaddlePaddle/Paddle/pull/44432), [#44451](https://github.com/PaddlePaddle/Paddle/pull/44451), [#44743](https://github.com/PaddlePaddle/Paddle/pull/44743), [#42013](https://github.com/PaddlePaddle/Paddle/pull/42013), [#43520](https://github.com/PaddlePaddle/Paddle/pull/43520), [#41434](https://github.com/PaddlePaddle/Paddle/pull/41434), [#42130](https://github.com/PaddlePaddle/Paddle/pull/42130), [#41276](https://github.com/PaddlePaddle/Paddle/pull/41276), [#41857](https://github.com/PaddlePaddle/Paddle/pull/41857), [#41356](https://github.com/PaddlePaddle/Paddle/pull/41356) -- **Add the audio field API:** paddle.audio - - Add the feature extraction APIs such as MFCC, Spectrogram, and LogMelSpectrogram. Support the GPU computing. The performance increases by more than 15x compared to the CPU. This can significantly improve the GPU utilization in speech model training.[#45424](https://github.com/PaddlePaddle/Paddle/pull/45424) - - Add the feature extraction basic APIs such as Window Function and Discrete Cosine Transform. 
This can facilitate users to customize the speech feature extraction.[#45424](https://github.com/PaddlePaddle/Paddle/pull/45424) - - Add the speech I/O module. It provides 2 types of audio I/O backend and supports 6 types of codecs for convenient loading of speech data. [#45939](https://github.com/PaddlePaddle/Paddle/pull/45939) - - Add TESS and ESC50 speech classification datasets. It is convenient for users to complete the classical speech classification model.[#45939](https://github.com/PaddlePaddle/Paddle/pull/45939) -- **Add the graph learning domain API:** paddle.geometric - - Graph learning is gradually becoming a key technology in the field of machine learning. The new paddle.geometric module of PaddlePaddle provides a better modeling and training development experience of graph learning. - - Message passing: The message passing mechanism of the graph learning is the basis of graph modeling. We add 7 graph learning message passing APIs to make it more convenient to complete the modeling of the graph learning. Among them, 3 newly added message passing fusion operators can significantly reduce the GPU memory consumption in the GNN model training. In the dense graph scenarios, more than 50% of GPU memory can be saved in the models of GCN series, and the training speed can increase by more than 20%.[#44848](https://github.com/PaddlePaddle/Paddle/pull/44848), [#44580](https://github.com/PaddlePaddle/Paddle/pull/44580), [#43174](https://github.com/PaddlePaddle/Paddle/pull/43174), [#44970](https://github.com/PaddlePaddle/Paddle/pull/44970) - - Graph sampling: Graph sampling is the performance bottleneck of GNN model training. This newly added high-performance graph sampling operator supports high concurrent graph sampling. 
It can increase the sampling speed of GraphSage by more than 32 times and the model training speed by more than 12 times.[#44970](https://github.com/PaddlePaddle/Paddle/pull/44970) -- **Add the vision domain API** - - The paddle.vision is added with target detection domain operators.([#43736](https://github.com/PaddlePaddle/Paddle/pull/43736)), paddle.vision.generate_proposals([#43611](https://github.com/PaddlePaddle/Paddle/pull/43611)), paddle.vision.matrix_nms([#44357](https://github.com/PaddlePaddle/Paddle/pull/44357)), paddle.vision.prior_box and paddle.vision.box_coder( [#47282](https://github.com/PaddlePaddle/Paddle/pull/47282) ). - -- - **Add other API** - - Add the iinfo([#45321](https://github.com/PaddlePaddle/Paddle/pull/45321)), count_nonzero([#44169](https://github.com/PaddlePaddle/Paddle/pull/44169)), nanmedian([#42385](https://github.com/PaddlePaddle/Paddle/pull/42385)), remainder\_ ([#45266](https://github.com/PaddlePaddle/Paddle/pull/45266)), take([#44741](https://github.com/PaddlePaddle/Paddle/pull/44741)), triu_indices([#45168](https://github.com/PaddlePaddle/Paddle/pull/45168)), sgn([#44568](https://github.com/PaddlePaddle/Paddle/pull/44568)), bucketize([#44195](https://github.com/PaddlePaddle/Paddle/pull/44195)), nanquantile([#41343](https://github.com/PaddlePaddle/Paddle/pull/41343)), frac([#41226](https://github.com/PaddlePaddle/Paddle/pull/41226)), logcumsumexp([#42267](https://github.com/PaddlePaddle/Paddle/pull/42267)), pairwise_distance([#44161](https://github.com/PaddlePaddle/Paddle/pull/44161)), heaviside([#41872](https://github.com/PaddlePaddle/Paddle/pull/41872)), logspace([#41261](https://github.com/PaddlePaddle/Paddle/pull/41261)), corrcoef([#40690](https://github.com/PaddlePaddle/Paddle/pull/40690)) - - Add the RReLU([#41823](https://github.com/PaddlePaddle/Paddle/pull/41823)), CyclicLR([#40698](https://github.com/PaddlePaddle/Paddle/pull/40698)), OneCycleLR([#41825](https://github.com/PaddlePaddle/Paddle/pull/41825)), 
Softmax2D([#40910](https://github.com/PaddlePaddle/Paddle/pull/40910)), SoftMarginLoss([#42364](https://github.com/PaddlePaddle/Paddle/pull/42364)), MultiLabelSoftMarginLoss([#41183](https://github.com/PaddlePaddle/Paddle/pull/41183)), TripletMarginLoss([#40487](https://github.com/PaddlePaddle/Paddle/pull/40487)), TripletMarginWithDistanceLoss([#40545](https://github.com/PaddlePaddle/Paddle/pull/40545)), CosineEmbeddingLoss and cosine_embedding_loss([#41680](https://github.com/PaddlePaddle/Paddle/pull/41680)), PixelUnshuffle([#40728](https://github.com/PaddlePaddle/Paddle/pull/40728)), ChannelShuffle([#40743](https://github.com/PaddlePaddle/Paddle/pull/40743)) -- **Enhanced API functions** - - Add the large batch_size calculation function of BatchNorm1D [#43072](https://github.com/PaddlePaddle/Paddle/pull/43072) -- **Optimize the collective communications distributed training API** - - Optimize the `fleet.init` function, and add the `log_level` parameter so that users can view logs during operation [#45909](https://github.com/PaddlePaddle/Paddle/pull/45909) - - Add the `paddle.distributed.fleet.recompute_sequential` and `paddle.distributed.fleet.recompute_hybrid` interfaces. It is convenient for users to use the recompute function [#45348](https://github.com/PaddlePaddle/Paddle/pull/45348) - - Add the `paddle.distributed.fleet.layers.mpu` package. It is convenient for users to use tensor parallel function [#45803](https://github.com/PaddlePaddle/Paddle/pull/45803) - - Add the communication APIs `paddle.distributed.destroy_process_group`, `paddle.distributed.isend`, `paddle.distributed.irecv`, and `paddle.distributed.all_to_all_single`. It improves the completeness and ease of use of communication [#43918](https://github.com/PaddlePaddle/Paddle/pull/43918) - - Add the `paddle.distributed.stream` package. 
The performance is increased by 5% to 10% compared to the base version [#46023](https://github.com/PaddlePaddle/Paddle/pull/46023) [#45282](https://github.com/PaddlePaddle/Paddle/pull/45282) - - The communication APIs now support multiple data types such as `Char/Byte/Bool`. It improves the completeness and ease of use of communication [#45574](https://github.com/PaddlePaddle/Paddle/pull/45574) [#45440](https://github.com/PaddlePaddle/Paddle/pull/45440) - - The communication API asynchronous parameter is changed from `use_calc_stream` to `sync_op`. This improves the semantic readability of the interface [#46493](https://github.com/PaddlePaddle/Paddle/pull/46493) -- **Enhanced high-level API** - - The visual model ResNeXt in the high-level API implements the reuse of the ResNet code for refactoring. [#40588](https://github.com/PaddlePaddle/Paddle/pull/40588) - - The visual models Inceptionv3, MobileNetv1, MobileNetv2, and ShuffleNetv2 in the high-level API are improved.[#40431](https://github.com/PaddlePaddle/Paddle/pull/40431) - -### (2)New functions and important upgrades - -- **The new dynamic graph architecture is officially launched**:The scheduling performance of the new dynamic graph framework is greatly improved. Compared with the original architecture, the scheduling performance is significantly enhanced. The scheduling performance of more than 90% of APIs is improved by over 50%, and the model performance of more than 50% of kits is improved by over 5%. The new dynamic graph architecture is clear, and the coupling is low. The learning and development costs of extension modules such as Hook and PyLayer are significantly reduced based on the new architecture. 
[#37550](https://github.com/PaddlePaddle/Paddle/pull/37550) , [#37574](https://github.com/PaddlePaddle/Paddle/pull/37574) , [#37813](https://github.com/PaddlePaddle/Paddle/pull/37813) , [#37926](https://github.com/PaddlePaddle/Paddle/pull/37926) , [#39192](https://github.com/PaddlePaddle/Paddle/pull/39192) , [#37599](https://github.com/PaddlePaddle/Paddle/pull/37599) , [#37406](https://github.com/PaddlePaddle/Paddle/pull/37406) , [#37466](https://github.com/PaddlePaddle/Paddle/pull/37466) , [#37599](https://github.com/PaddlePaddle/Paddle/pull/37599) , [#40945](https://github.com/PaddlePaddle/Paddle/pull/40945) , [#39989](https://github.com/PaddlePaddle/Paddle/pull/39989) - -- **High-order auto-differentiation mechanism**:In order to better support scientific computing and other scenarios, the PaddlePaddle framework has been further improved and optimized for higher-order auto-differentiation capabilities. At present, the `paddle.incubate.autograd` directory has provided relevant trial functions and APIs for forward/reverse higher-order auto-differentiation (Currently they are in incubation, and related functions and API signatures may change).If you intend to implement related models and explore the auto-differentiation mechanism by yourself, please read the [usage and limitations of higher-order auto-differentiation](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/incubate/autograd/Overview_cn.html) carefully. Specific upgrades include: - 1. Static graph higher-order differentiation mechanism upgrade. Through the base operator system and program transformation, it supports higher-order forward and reverse differentiation, with the availability of the compiler and distributed functions.[#41919](https://github.com/PaddlePaddle/Paddle/pull/41919), [#41201](https://github.com/PaddlePaddle/Paddle/pull/41201) - 2. 
Add the forward and reverse higher-order auto-differentiation APIs, `paddle.incubate.autograd.forward_grad` and `paddle.incubate.autograd.grad`. [#43354](https://github.com/PaddlePaddle/Paddle/pull/43354) - 3. Add 18 higher-order auto-differentiation operators: `sin`, `cos`, `exp`, `erf`, `abs`, `log`, `cast`, `where`, `equal`, `not_equal`, `greater_than`, `greater_equal`, `elementwise_pow`, `square`, `elementwise_max`, `gelu`, `reduce_mean`, `size`. [#46184](https://github.com/PaddlePaddle/Paddle/pull/46184), [#46024](https://github.com/PaddlePaddle/Paddle/pull/46024), [#45888](https://github.com/PaddlePaddle/Paddle/pull/45888), [#45338](https://github.com/PaddlePaddle/Paddle/pull/45338), [#44345](https://github.com/PaddlePaddle/Paddle/pull/44345) - 4. Fix the existing bugs of operators such as `elementwise_div`, `reduce_sum`, `p_norm`. [#46514](https://github.com/PaddlePaddle/Paddle/pull/46514), [#46184](https://github.com/PaddlePaddle/Paddle/pull/46184) -- **Generic heterogeneous parameter server architecture**: - - Parameter server GPUGraph infrastructure upgraded to meet the implementation needs of large-scale applications: The storage and training of large-scale graph neural networks on traditional CPUs suffer from high cost, low stability, and poor performance. To overcome these problems, we have built a pure GPU graph training engine (PGLBox). Through the heterogeneous hierarchical storage technology of SSD, memory and video memory, it supports the training of ultra-large scale graph models. The training performance is improved by more than 10x compared with the CPU graph training engine at equal cost. 
The task failure rate is extremely low.[#44594](https://github.com/PaddlePaddle/Paddle/pull/44594) - - Large-scale federation parameter server architecture: For large-scale personalized recommendation scenarios, the large-scale federation parameter server training is developed based on the heterogeneous PS infrastructure, to support horizontal and vertical federation under hundreds of billions of parameters. It includes two features: User private parameters updated locally and public parameters updated remotely. Users can flexibly configure the slicing policy for private and public parameters. A new central scheduling node Coordinator is added. Users can perform secondary development from the base class to customize the Client selection policy. [#42682](https://github.com/PaddlePaddle/Paddle/pull/42682) , [#44864](https://github.com/PaddlePaddle/Paddle/pull/44864) , [#44327](https://github.com/PaddlePaddle/Paddle/pull/44327) -- **Adaptive parallel** - - Design and launch a complete automatic parallelism interface system: Support automatic dynamic-to-static distributed training, automatic distributed data loading, automatic distributed saving and loading, automatic parameter conversion, custom slice marker and custom execution process. Users can easily obtain the automatic distributed training capability based on a single machine networking. It supports data parallel, model parallel, pipeline parallel, and hybrid parallel. 
[#45776](https://github.com/PaddlePaddle/Paddle/pull/45776) ,[#46552](https://github.com/PaddlePaddle/Paddle/pull/46552) , [#44202](https://github.com/PaddlePaddle/Paddle/pull/44202) , [#45840](https://github.com/PaddlePaddle/Paddle/pull/45840) , [#45518](https://github.com/PaddlePaddle/Paddle/pull/45518) , [#40528](https://github.com/PaddlePaddle/Paddle/pull/40528), [#42838](https://github.com/PaddlePaddle/Paddle/pull/42838), [#43093](https://github.com/PaddlePaddle/Paddle/pull/43093), [#43312](https://github.com/PaddlePaddle/Paddle/pull/43312), [#45053](https://github.com/PaddlePaddle/Paddle/pull/45053). - - Improve the underlying adaptive parallel mechanism, including the upgrade of the distributed costmodel design and implementation, to provide better evaluation of the slice policy. Add the native distributed properties to ProgramIR and enrich the Cluster functions. [#40457](https://github.com/PaddlePaddle/Paddle/pull/40457) , [#42601](https://github.com/PaddlePaddle/Paddle/pull/42601) , [#42727](https://github.com/PaddlePaddle/Paddle/pull/42727) , [#42874](https://github.com/PaddlePaddle/Paddle/pull/42784) , [#43114](https://github.com/PaddlePaddle/Paddle/pull/43114) , [#44095](https://github.com/PaddlePaddle/Paddle/pull/44095) , [#44146](https://github.com/PaddlePaddle/Paddle/pull/44146) , [#44701](https://github.com/PaddlePaddle/Paddle/pull/44701) , [#44973](https://github.com/PaddlePaddle/Paddle/pull/44973) , [#45002](https://github.com/PaddlePaddle/Paddle/pull/45002) , [#45118](https://github.com/PaddlePaddle/Paddle/pull/45118) , [#45237](https://github.com/PaddlePaddle/Paddle/pull/45237) , [#42576](https://github.com/PaddlePaddle/Paddle/pull/42576) , [#41722](https://github.com/PaddlePaddle/Paddle/pull/41722) , [#44150](https://github.com/PaddlePaddle/Paddle/pull/44150) , [#44989](https://github.com/PaddlePaddle/Paddle/pull/44989), [#44951](https://github.com/PaddlePaddle/Paddle/pull/44951), [#44963](https://github.com/PaddlePaddle/Paddle/pull/44963) . 
- - Add the Shardingstage1/2/3 AutoTuning feature under data parallel. This allows to automatically select the highest throughput Shardingstage policy while ensuring that the video memory constraints are met. [#43782](https://github.com/PaddlePaddle/Paddle/pull/43782) . - -- **Training hardware access - Plug-in solutions**:Add custom Runtime/Kernel/CCL/Graph/Pass solutions. The hardware vendors can choose which modules to implement on-demand based on hardware characteristics. - -- **ONNX format export** - - Support the quantized model export. The exported ONNX model uses TensorRT or ONNXRuntime to load inference. About 1.5~4 times inference acceleration can be obtained [#856](https://github.com/PaddlePaddle/Paddle2ONNX/pull/856), [#782](https://github.com/PaddlePaddle/Paddle2ONNX/pull/782) - - Add the export of a large model greater than 2GB [#942](https://github.com/PaddlePaddle/Paddle2ONNX/pull/942) - -### (3)Function optimization -- **Comprehensive increase of dynamic-to-static analysis conversion & extension capabilities** - - In order to improve the success rate and experience of model dynamic-to-static conversion, the transcription logic of control flow syntax is reconstructed. The core syntax has been upgraded to JIT (just-in-time) paradigm to achieve equivalent transcription with Python codes. 
The syntax functions such as break, return and continue are improved.[#43666](https://github.com/PaddlePaddle/Paddle/pull/43666) , [#43846](https://github.com/PaddlePaddle/Paddle/pull/43846) , [#43848](https://github.com/PaddlePaddle/Paddle/pull/43848) , [#43880](https://github.com/PaddlePaddle/Paddle/pull/43880) , [#43957](https://github.com/PaddlePaddle/Paddle/pull/43957) , [#43328](https://github.com/PaddlePaddle/Paddle/pull/43328) , [#43348](https://github.com/PaddlePaddle/Paddle/pull/43348) , [#43998](https://github.com/PaddlePaddle/Paddle/pull/43998) , [#44465](https://github.com/PaddlePaddle/Paddle/pull/44465) , [#44504](https://github.com/PaddlePaddle/Paddle/pull/44504) , [#43713](https://github.com/PaddlePaddle/Paddle/pull/43713) , [#43864](https://github.com/PaddlePaddle/Paddle/pull/43864) , [#43967](https://github.com/PaddlePaddle/Paddle/pull/43967) , [#44155](https://github.com/PaddlePaddle/Paddle/pull/44155) , [#44487](https://github.com/PaddlePaddle/Paddle/pull/44487) , [#44527](https://github.com/PaddlePaddle/Paddle/pull/44527) , [#45105](https://github.com/PaddlePaddle/Paddle/pull/45105) , [#45900](https://github.com/PaddlePaddle/Paddle/pull/45900) - - In order to support the voice custom decoding flexible deployment scenarios, the jit.save/load interface function is extended to support user multi-function merge and export. A new JITLayer component is added to support the invocation of class functions. Meanwhile, the custom inference deployment function is implemented with the PHI operator library C++ API. 
[#44283](https://github.com/PaddlePaddle/Paddle/pull/44283), [#41783](https://github.com/PaddlePaddle/Paddle/pull/41783), [#43607](https://github.com/PaddlePaddle/Paddle/pull/43607), [#43754](https://github.com/PaddlePaddle/Paddle/pull/43754), [#43758](https://github.com/PaddlePaddle/Paddle/pull/43758), [#43798](https://github.com/PaddlePaddle/Paddle/pull/43798), [#44010](https://github.com/PaddlePaddle/Paddle/pull/44010), [#44351](https://github.com/PaddlePaddle/Paddle/pull/44351), [#44465](https://github.com/PaddlePaddle/Paddle/pull/44465), [#44504](https://github.com/PaddlePaddle/Paddle/pull/44504), [#44597](https://github.com/PaddlePaddle/Paddle/pull/44597), [#44738](https://github.com/PaddlePaddle/Paddle/pull/44738), [#44984](https://github.com/PaddlePaddle/Paddle/pull/44984), [#46249](https://github.com/PaddlePaddle/Paddle/pull/46249) - - In order to unify API dynamic and static behaviors, 20 operators are upgraded to support variable attribute information of Op in static graphs, to ensure consistent dynamic and static behaviors and improve the success rate of dynamic-to-static conversion of models. Include `pad2d`,`depthwise_conv2d_transpose`,`conv2d_transpose`,`adaptive_avg_pool2d`,`reverse`,`bincount`,`multinomial`,`reduce_sum`,`reduce_mean`,`reduce_prod`,`reduce_min`,`reduce_max`,`uniform`,`squeeze`,`max_unpool2d`,`dropout`,`cumsum`,`eye`,`argmin`,`argmax`. 
[#44737](https://github.com/PaddlePaddle/Paddle/pull/44737), [#45084](https://github.com/PaddlePaddle/Paddle/pull/45084), [#45189](https://github.com/PaddlePaddle/Paddle/pull/45189), [#45391](https://github.com/PaddlePaddle/Paddle/pull/45391), [#45417](https://github.com/PaddlePaddle/Paddle/pull/45417), [#45427](https://github.com/PaddlePaddle/Paddle/pull/45427), [#45514](https://github.com/PaddlePaddle/Paddle/pull/45514), [#45525](https://github.com/PaddlePaddle/Paddle/pull/45525), [#45543](https://github.com/PaddlePaddle/Paddle/pull/45543), [#45660](https://github.com/PaddlePaddle/Paddle/pull/45660), [#46352](https://github.com/PaddlePaddle/Paddle/pull/46352/), [#46433](https://github.com/PaddlePaddle/Paddle/pull/46433), [#45078](https://github.com/PaddlePaddle/Paddle/pull/45078), [#45342](https://github.com/PaddlePaddle/Paddle/pull/45342), [#45372](https://github.com/PaddlePaddle/Paddle/pull/45372), [#45453](https://github.com/PaddlePaddle/Paddle/pull/45453), [#45522](https://github.com/PaddlePaddle/Paddle/pull/45522), [#45620](https://github.com/PaddlePaddle/Paddle/pull/45620) - - In order to solve the problem of occasional loss of error reporting stack for user dynamic-to-static, the logic of the error reporting module is optimized to improve the readability of the error reporting stack and the user debugging experience. [#44054](https://github.com/PaddlePaddle/Paddle/pull/44054), [#44083](https://github.com/PaddlePaddle/Paddle/pull/44083), [#44781](https://github.com/PaddlePaddle/Paddle/pull/44781), [#44996](https://github.com/PaddlePaddle/Paddle/pull/44996) - - Add the TypeHint syntax recognition and transcription module to fully support Python Type Hint syntax. [#47121](https://github.com/PaddlePaddle/Paddle/pull/47121) - -- **PHI operator library covers the full amount of arithmetic class operators**:Continuously build the highly reusable operator library PHI. 
The remaining PaddlePaddle 2.x arithmetic class PythonAPI-associated operators and related kernels are migrated to the PHI operators library and rewritten as functional expression. Add about 180 forward/reverse operator CPU&GPU kernels, and 170 Kunlun-specific arithmetic kernels. This further enhances the kernel function sets that can be reused when new operators are added. In addition, add more than 100 C++ arithmetic class APIs. These APIs can be used in the custom operators, further enhancing the ease of use for external extension development based on the PaddlePaddle. [#44577](https://github.com/PaddlePaddle/Paddle/pull/44577), [#44631](https://github.com/PaddlePaddle/Paddle/pull/44631), [#44434](https://github.com/PaddlePaddle/Paddle/pull/44434), [#44605](https://github.com/PaddlePaddle/Paddle/pull/44605), [#44676](https://github.com/PaddlePaddle/Paddle/pull/44676), [#44742](https://github.com/PaddlePaddle/Paddle/pull/44742), [#44436](https://github.com/PaddlePaddle/Paddle/pull/44436) , [#45887](https://github.com/PaddlePaddle/Paddle/pull/45887), [#45851](https://github.com/PaddlePaddle/Paddle/pull/45851), [#45623](https://github.com/PaddlePaddle/Paddle/pull/45623), [#45397](https://github.com/PaddlePaddle/Paddle/pull/45397), [#45863](https://github.com/PaddlePaddle/Paddle/pull/45863) - -- **Normalized operator definitions with significantly improving the model simplicity**:For the problems of many redundant parameters in the historical operator definitions of PaddlePaddle 1.x and the high cost of understanding the adaptation, the redundant parameters of about 150 high-frequency operators are cleaned up centrally. Basically, the mathematically irrelevant parameters are removed. After these redundant parameters are cleaned up, the amount of information in the inference model stored in the PaddlePaddle is significantly reduced. 
Generally, about 40% of the attribute variables are removed, which makes the PaddlePaddle operator definitions clearer and improves the experience of model analysis and debugging. Meanwhile, the size of the inference model stored by PaddlePaddle is reduced by more than 70%, which makes PaddlePaddle models significantly more lightweight. [#44310](https://github.com/PaddlePaddle/Paddle/pull/44310), [#45613](https://github.com/PaddlePaddle/Paddle/pull/45613), [#45684](https://github.com/PaddlePaddle/Paddle/pull/45684), [#45708](https://github.com/PaddlePaddle/Paddle/pull/45708), [#45758](https://github.com/PaddlePaddle/Paddle/pull/45758), [#45786](https://github.com/PaddlePaddle/Paddle/pull/45786), [#45772](https://github.com/PaddlePaddle/Paddle/pull/45772), [#45845](https://github.com/PaddlePaddle/Paddle/pull/45845), [#45984](https://github.com/PaddlePaddle/Paddle/pull/45984), [#46218](https://github.com/PaddlePaddle/Paddle/pull/46218), [#46553](https://github.com/PaddlePaddle/Paddle/pull/46553)
-
-### (4)Performance optimization
-
-- AMP performance and accuracy optimization
-  - Add FP16 data type support for more operators, including elementwise series operators, compare series operators, strided_slice, set_value, uniform_random, etc. ([#45504](https://github.com/PaddlePaddle/Paddle/pull/45504), [#44405](https://github.com/PaddlePaddle/Paddle/pull/44405), [#45496](https://github.com/PaddlePaddle/Paddle/pull/45496), [#46641](https://github.com/PaddlePaddle/Paddle/pull/46641), [#46906](https://github.com/PaddlePaddle/Paddle/pull/46906))
-  - Optimize the implementation scheme of the hard_swish operator FP16 Kernel to guarantee the accuracy without loss. 
([#35386](https://github.com/PaddlePaddle/Paddle/pull/35386))
-  - Add BF16 data type support for more operators, including fused_linear, empty, selu, pow, adam, clip, embedding, gelu, pad3d, pixel_shuffle, tile, where, etc. [#46364](https://github.com/PaddlePaddle/Paddle/pull/46364), [#47177](https://github.com/PaddlePaddle/Paddle/pull/47177)
-- AutoTuning of single machine training performance
-  - The Transpose OP supports an automatic Kernel selection mechanism. This allows the automatic search for the best Kernel implementation for different model configurations, improving the model performance. [#43310](https://github.com/PaddlePaddle/Paddle/pull/43310) (Transpose Op access AutoTuning function)
-  - AMP Layout auto-switching supports the new dynamic graph mode. For the ResNet50, TSM, and DeepLabV3 models, Layout AutoTuning improves performance by 9%-21% in the new dynamic graph. ([#45409](https://github.com/PaddlePaddle/Paddle/pull/45409), [#45751](https://github.com/PaddlePaddle/Paddle/pull/45751), [#45826](https://github.com/PaddlePaddle/Paddle/pull/45826), [#46880](https://github.com/PaddlePaddle/Paddle/pull/46880))
-- Generic performance optimization of GPU single machine training
-  - Optimize the Cache scheme of the Conv operator cuDNN algorithm and cache the results in all algorithm acquisition methods. This can significantly reduce the CPU overhead of the operator. ([#41891](https://github.com/PaddlePaddle/Paddle/pull/41891), [#47197](https://github.com/PaddlePaddle/Paddle/pull/47197))
-  - Further optimize the GPU Kernel and Python side performance of multiple operators, including dist, poisson, depthwise_conv2d, transpose, eigh, broadcast computation, reduce computation, layer_norm, cross_entropy, etc. This can achieve better performance in more configuration scenarios. 
([#44946](https://github.com/PaddlePaddle/Paddle/pull/44946), [#45057](https://github.com/PaddlePaddle/Paddle/pull/45057), [#45160](https://github.com/PaddlePaddle/Paddle/pull/45160), [#42491](https://github.com/PaddlePaddle/Paddle/pull/42491), [#42704](https://github.com/PaddlePaddle/Paddle/pull/42704), [#42853](https://github.com/PaddlePaddle/Paddle/pull/42853), [#46287](https://github.com/PaddlePaddle/Paddle/pull/46287), [#46362](https://github.com/PaddlePaddle/Paddle/pull/46362), [#46490](https://github.com/PaddlePaddle/Paddle/pull/46490), [#46412](https://github.com/PaddlePaddle/Paddle/pull/46412), [#46623](https://github.com/PaddlePaddle/Paddle/pull/46623), [#40051](https://github.com/PaddlePaddle/Paddle/pull/40051) ) -- Performance optimization of distributed training for collective communications - - To improve pipeline parallel scheduling efficiency, support the dynamic graph Interleaving1F1B scheduling policy. In the GPT-3 model, the performance is improved by 3%-4%. [#45797](https://github.com/PaddlePaddle/Paddle/pull/45797) , [#45869](https://github.com/PaddlePaddle/Paddle/pull/45869) , [#45922](https://github.com/PaddlePaddle/Paddle/pull/45922) , [#46209](https://github.com/PaddlePaddle/Paddle/pull/46209) , [#45402](https://github.com/PaddlePaddle/Paddle/pull/45402) , [#45444](https://github.com/PaddlePaddle/Paddle/pull/45444) , [#45497](https://github.com/PaddlePaddle/Paddle/pull/45497) , [#45797](https://github.com/PaddlePaddle/Paddle/pull/45797) , [#45869](https://github.com/PaddlePaddle/Paddle/pull/45869) , [#45922](https://github.com/PaddlePaddle/Paddle/pull/45922), [#46209](https://github.com/PaddlePaddle/Paddle/pull/46209), [#46399](https://github.com/PaddlePaddle/Paddle/pull/46399) , [#46483](https://github.com/PaddlePaddle/Paddle/pull/46483) , [#46876](https://github.com/PaddlePaddle/Paddle/pull/46876) , [#47242](https://github.com/PaddlePaddle/Paddle/pull/47242) , [#47249](https://github.com/PaddlePaddle/Paddle/pull/47249) , 
[#47497](https://github.com/PaddlePaddle/Paddle/pull/47497) , [#47517](https://github.com/PaddlePaddle/Paddle/pull/47517) - - To improve the distributed training performance of the MLPerfBERT model, the DistributedFusedLamb distributed optimizer supports hierarchical AllReduce. It improves MLPerfBERT performance by 17% on the DCU1024 card. [#44821](https://github.com/PaddlePaddle/Paddle/pull/44821) , [#44843](https://github.com/PaddlePaddle/Paddle/pull/44843) - - To optimize the video memory footprint when using DataParallel, the Buffer Lazy initialization policy for Tensor Fusion is supported, thus reducing the video memory footprint by an amount equal to the number of model parameters. [#45631](https://github.com/PaddlePaddle/Paddle/pull/45631). - - Distributed parallel policies DataParallel and Sharding support BF16 training. [#46846](https://github.com/PaddlePaddle/Paddle/pull/46846) , [#47246](https://github.com/PaddlePaddle/Paddle/pull/47246) - - To support the Sequence Parallel policy, the Distributed Pipeline Parallel supports enable_partial_send_recv policy, and supports the tensor after slice of the transmission sequence parallel. [#46992](https://github.com/PaddlePaddle/Paddle/pull/46992) , [#47083](https://github.com/PaddlePaddle/Paddle/pull/47083) - - To improve the performance of sharding stage 2 policy, implement the overlap of sharding stage 2 optimizer broadcast parameters with next step forward and use multi-CUDA Stream for communication. In the GPT 6.7B model, the 16-card training performance is improved by 11%. [#46495](https://github.com/PaddlePaddle/Paddle/pull/46495) , [#46656](https://github.com/PaddlePaddle/Paddle/pull/46656) , [#47061](https://github.com/PaddlePaddle/Paddle/pull/47061) - -### (5)Bug fix - -- Dynamic-to-static - - Fix the bug of reporting an error in dynamic-to-static of the model in a Parameter no-gradient scenario during multi-card training. 
[#44485](https://github.com/PaddlePaddle/Paddle/pull/44485)
-  - Fix the bug where redundant frame logs are mistakenly output to the terminal during dynamic-to-static conversion. [#45754](https://github.com/PaddlePaddle/Paddle/pull/45754), [#46800](https://github.com/PaddlePaddle/Paddle/pull/46800)
-  - Fix the bug of reporting an error in the dynamic-to-static training when the control flow in the model contains a Tensor that does not require a gradient. [#43034](https://github.com/PaddlePaddle/Paddle/pull/43034)
-  - Fix the bug of incorrect computation value during gradient aggregation in the dynamic-to-static training. [#44893](https://github.com/PaddlePaddle/Paddle/pull/44893)
-  - Fix the bug of reporting an error in the dynamic-to-static when the function is decorated with @staticmethod. [#44983](https://github.com/PaddlePaddle/Paddle/pull/44983), [#45268](https://github.com/PaddlePaddle/Paddle/pull/45268), [#45277](https://github.com/PaddlePaddle/Paddle/pull/45277)
-  - Fix the bug of too much video memory footprint in some scenarios where the model contains the dynamic-to-static training. [#45380](https://github.com/PaddlePaddle/Paddle/pull/45380)
-  - Fix the bug of reporting an error of dynamic-to-static shape derivation in the networking phase when the model contains a complex control flow. [#45916](https://github.com/PaddlePaddle/Paddle/pull/45916), [#46020](https://github.com/PaddlePaddle/Paddle/pull/46020)
-- Fix the error report mechanism
-  - Replace `self.assertTrue(np.allclose(...))` with `np.testing.assert_allclose` to get fuller error reporting information ([#44947](https://github.com/PaddlePaddle/Paddle/pull/44947), [#44988](https://github.com/PaddlePaddle/Paddle/pull/44988), [#45213](https://github.com/PaddlePaddle/Paddle/pull/45213))
-- Distributed training in collective communications
-  - Fix several bugs in communication library initialization and communication process, and enhance the system operation stability. 
[#44964](https://github.com/PaddlePaddle/Paddle/pull/44964) [#45100](https://github.com/PaddlePaddle/Paddle/pull/45100) [#44758](https://github.com/PaddlePaddle/Paddle/pull/44758)
-  - Fix the bug of frequent hangs in pipeline parallel, and enhance the ease of use of the policy [#47201](https://github.com/PaddlePaddle/Paddle/pull/47201); enhance the pipeline function to support unbalanced input. [#47199](https://github.com/PaddlePaddle/Paddle/pull/47199)
-  - Fix the bug that the performance of the new dynamic graph MP/PP policy is lower than that of the old dynamic graph. [#47071](https://github.com/PaddlePaddle/Paddle/pull/47071)
-  - Fix the bug that the sharding stage 2 policy incorrectly maintains the parameter trainable property. [#47240](https://github.com/PaddlePaddle/Paddle/pull/47240)
-  - Fix the bug in a series of OPs when the tensor numel is greater than INT32_MAX. [#45711](https://github.com/PaddlePaddle/Paddle/pull/45711), [#45741](https://github.com/PaddlePaddle/Paddle/pull/45741), [#45897](https://github.com/PaddlePaddle/Paddle/pull/45897), [#46158](https://github.com/PaddlePaddle/Paddle/pull/46158), [#46767](https://github.com/PaddlePaddle/Paddle/pull/46767), [#47191](https://github.com/PaddlePaddle/Paddle/pull/47191), [#46045](https://github.com/PaddlePaddle/Paddle/pull/46045), [#46160](https://github.com/PaddlePaddle/Paddle/pull/46160)
-  - Fix the bug of too much video memory footprint in FusedAttention and FusedFeedForward OP. [#47236](https://github.com/PaddlePaddle/Paddle/pull/47236), [#47235](https://github.com/PaddlePaddle/Paddle/pull/47235)
-  - Fix the bug of incorrect parameter update in multi_tensor_adam and multi_tensor_momentum OP when the parameters passed in are a list of dict. [#47352](https://github.com/PaddlePaddle/Paddle/pull/47352), [#47372](https://github.com/PaddlePaddle/Paddle/pull/47372)
-
-## 4. 
Deployment direction (Paddle Inference)
-
-### (1)New features
-
-- Optimize the back-end graph engine integration scheme
-  - To reduce Paddle-TensorRT plugin development effort and the number of Paddle-TensorRT subgraphs, and thus resource usage, a generic plugin mechanism has been developed that automatically provides a unified TensorRT plugin interface for the rich set of PHI operators in the framework. As a result, the video memory footprint can be effectively reduced in most scenarios. [#46970](https://github.com/PaddlePaddle/Paddle/pull/46070), [#46179](https://github.com/PaddlePaddle/Paddle/pull/46179), [#46580](https://github.com/PaddlePaddle/Paddle/pull/46580)
-  - To make it easier for users to customize operators in the framework and have Paddle-TensorRT perform efficient inference, the function is upgraded to support framework custom Paddle-TensorRT plugins. [#46970](https://github.com/PaddlePaddle/Paddle/pull/46070)
-- Optimize the Inference library build system. The size can be pruned on demand
-  - Pre-compiled installer supports TensorRT by default: The pre-compiled installer for training and the pre-compiled installer for deployment (Paddle Inference) are unified into one pre-compiled installer. The build system is optimized so that the pre-compiled installer supports TensorRT by default, reducing the switching cost for users using Paddle-TensorRT. [#46008](https://github.com/PaddlePaddle/Paddle/pull/46008), [#45824](https://github.com/PaddlePaddle/Paddle/pull/45824), [#46058](https://github.com/PaddlePaddle/Paddle/pull/46058)
-  - The size can be pruned on demand: Pruned according to the model operator. 
[#47033](https://github.com/PaddlePaddle/Paddle/pull/47033), [#47049](https://github.com/PaddlePaddle/Paddle/pull/47049), [#47047](https://github.com/PaddlePaddle/Paddle/pull/47047)
-- Inference supports native AMP
-  - To make full use of GPU Tensor Core computation capability and improve model inference performance, a model precision conversion tool has been developed, and Inference GPU natively supports the inference of mixed precision models. For usage, refer to the [documentation](https://github.com/PaddlePaddle/Paddle-Inference-Demo/blob/release/v2.4/docs-official/guides/nv_gpu_infer/gpu_mixed_precision.md). [#43814](https://github.com/PaddlePaddle/Paddle/pull/43814), [#43881](https://github.com/PaddlePaddle/Paddle/pull/43881), [#44057](https://github.com/PaddlePaddle/Paddle/pull/44057), [#44307](https://github.com/PaddlePaddle/Paddle/pull/44307), [#44457](https://github.com/PaddlePaddle/Paddle/pull/44457), [#44866](https://github.com/PaddlePaddle/Paddle/pull/44866), [#45050](https://github.com/PaddlePaddle/Paddle/pull/45050), [#45346](https://github.com/PaddlePaddle/Paddle/pull/45346), [#45379](https://github.com/PaddlePaddle/Paddle/pull/45379), [#45406](https://github.com/PaddlePaddle/Paddle/pull/45406), [#45882](https://github.com/PaddlePaddle/Paddle/pull/45882)
-  - To improve the inference performance of mixed precision models, FP16 kernels are added for high-frequency operators that previously did not support FP16 computation. This reduces the likelihood of cast operators being inserted due to input precision mismatch and improves inference performance. 
[#44642](https://github.com/PaddlePaddle/Paddle/pull/44642), [#45061](https://github.com/PaddlePaddle/Paddle/pull/45061), [#44653](https://github.com/PaddlePaddle/Paddle/pull/44653), [#45504](https://github.com/PaddlePaddle/Paddle/pull/45504), [#45061](https://github.com/PaddlePaddle/Paddle/pull/45061), [#44969](https://github.com/PaddlePaddle/Paddle/pull/44969), [#44558](https://github.com/PaddlePaddle/Paddle/pull/44558), [#44710](https://github.com/PaddlePaddle/Paddle/pull/44710), [#43871](https://github.com/PaddlePaddle/Paddle/pull/43871), [#44792](https://github.com/PaddlePaddle/Paddle/pull/44792)
-- Upgrade the compression and inference engine
-  - Upgrade the quantization model storage format. The new format supports three deployment methods: Paddle Inference, PaddleLite, and Paddle2ONNX. The supported chips include X86 CPU, NVIDIA GPU, and Arm CPU. ([#46305](https://github.com/PaddlePaddle/Paddle/pull/46305), [#46283](https://github.com/PaddlePaddle/Paddle/pull/46283), [#46022](https://github.com/PaddlePaddle/Paddle/pull/46022))
-  - Add the INT8 full quantization function compatible with SoC/NPU chips. This can ensure the output INT8 quantization model has the best inference acceleration and precision on SoC/NPU chips.
-- Add the INT8 full quantization function compatible with SoC/NPU chips. This can ensure the output INT8 quantization model has the best inference acceleration and precision on SoC/NPU chips.
-  - Upgrade the interface module between the PaddlePaddle framework and compiler, to support inference models to access the compiler for optimization via Paddle Inference. ([#44499](https://github.com/PaddlePaddle/Paddle/pull/44499), [#44708](https://github.com/PaddlePaddle/Paddle/pull/44708))
-
-### (2)Underlying optimization
-
-- **GPU performance optimization**
-  - Add the TensorRT mapping for operators such as matmul_v2, LSTM, reshape, fill_constant, swish, multiclass_nms3, bilinear_interp_v2, split, silu, and shuffle_channel. 
Optimize the support for the dynamic shape. Performance improves by 7% to 90% on several categories of key models. ([#46177](https://github.com/PaddlePaddle/Paddle/pull/46177), [#44678](https://github.com/PaddlePaddle/Paddle/pull/44678), [#44314](https://github.com/PaddlePaddle/Paddle/pull/44314), [#44561](https://github.com/PaddlePaddle/Paddle/pull/44561), [#45166](https://github.com/PaddlePaddle/Paddle/pull/45166), [#44411](https://github.com/PaddlePaddle/Paddle/pull/44411), [#43424](https://github.com/PaddlePaddle/Paddle/pull/43424), [#44516](https://github.com/PaddlePaddle/Paddle/pull/44516))
-  - Add constant folding PASS for inference performance optimization, to improve the performance of SwinTransformer, HifiGAN, FastSpeech2, and other models. ([#45494](https://github.com/PaddlePaddle/Paddle/pull/45494))
-  - Add cache of conv_fusion workspace size, to improve the computation performance of conv_fusion. ([#45902](https://github.com/PaddlePaddle/Paddle/pull/45902))
-- **Vision ViT model optimization**
-  - Add the ViT model Attention structure fusion PASS, and support OSS Plugin and auto padding. The ViT inference speed increases by 30%-40%. [#45019](https://github.com/PaddlePaddle/Paddle/pull/45019) [#45506](https://github.com/PaddlePaddle/Paddle/pull/45506)
-- **Inference performance optimization of large model**
-  - To improve the inference speed of very large generative models and save the video memory, add INT8 implementation (fused_multi_transformer_int8_op) to the multi-layer Transformer fusion operator (fused_multi_transformer_op), and support quantized inference of generative models. Performance is optimized through matrix multiplication algorithm selection and fusion of the quantize/de-quantize kernels. [#46169](https://github.com/PaddlePaddle/Paddle/pull/46169)
-  - Add Pass for automatic matching fusion in order to improve the ease of use of fused_multi_transformer fusion for large model inference. 
-- **CPU performance optimization** - - Optimize the speech U2++ model. The FP32 model inference speed is improved by 35%. The INT8 model inference speed is improved by 69%. ([#47592](https://github.com/PaddlePaddle/Paddle/pull/47592), [#47127](https://github.com/PaddlePaddle/Paddle/pull/47127), [#47391](https://github.com/PaddlePaddle/Paddle/pull/47391), [#47234](https://github.com/PaddlePaddle/Paddle/pull/47234), [#47009](https://github.com/PaddlePaddle/Paddle/pull/47009), [#47080](https://github.com/PaddlePaddle/Paddle/pull/47080)) - - -### (3)Bug fix - -- TensorRT workspace size supports int64. ([#44469](https://github.com/PaddlePaddle/Paddle/pull/44469) ) -- In Paddle-TRT, fully support Op's input as weight.([#45545](https://github.com/PaddlePaddle/Paddle/pull/45545) ) -- In Paddle-TRT, support conv2d_transpose/conv3d_transpose to have the output_padding attribute.([#45004](https://github.com/PaddlePaddle/Paddle/pull/45004) ) -- In Paddle-TRT, enhance the strided_slice support for dynamic shape. 
([#46819](https://github.com/PaddlePaddle/Paddle/pull/46819))
-- In Paddle-TRT, optimize the video memory footprint of context when running in multi-thread scenarios. ([#45468](https://github.com/PaddlePaddle/Paddle/pull/45468))
-- In Paddle-TRT, fix the bug of repeatedly generating serialization files in case of change of initialization sequences when multiple models run in the same process. ([#43942](https://github.com/PaddlePaddle/Paddle/pull/43942))
-- Fix the bug of occasional crash when Predictor is initialized and run multiple times in the same process. ([#45203](https://github.com/PaddlePaddle/Paddle/pull/45203))
-- Fix the bug of abnormal inference accuracy of quantization models such as MobileNetV3_large, ERNIE 3.0-Medium and bert ([#45416](https://github.com/PaddlePaddle/Paddle/pull/45416), [#46283](https://github.com/PaddlePaddle/Paddle/pull/46283), [#45920](https://github.com/PaddlePaddle/Paddle/pull/45920) [#47573](https://github.com/PaddlePaddle/Paddle/pull/47574))
-
-## 5. Environment adaptation
-
-- The pre-compiled installer for training and the pre-compiled installer for deployment (Paddle Inference) are unified into one pre-compiled installer. The build system is optimized so that the pre-compiled installer supports TensorRT by default.
-- The pre-compiled installer for CUDA version 10.1 is discontinued.
-- Add the pre-compiled installer for CUDA 11.7.
-- Reduce source code compilation time: reduce inter-module dependencies, improve build parallelism, and optimize the compilation speed of some modules. The full compilation time is reduced by about 20 minutes in total.
-- Support running PaddlePaddle on Windows 11, CentOS 8, Ubuntu 22.04, and Jetson 5.02 system environments. Support running the PaddlePaddle Linux installer on Windows by using the WSL 2 tool.
-- Fix the runtime error of PaddlePaddle in glibc 2.34+ environments.
-- Optimize the code style of C++, Python, CMake in the whole code repository. 
Introduce or upgrade the following code style checking tools.
-  - pre-commit is upgraded from 1.10.4 to 2.17.0: [#43103](https://github.com/PaddlePaddle/Paddle/pull/43103)
-  - pylint is pinned to a specific version instead of the default version: [#43103](https://github.com/PaddlePaddle/Paddle/pull/43103)
-  - remove-crlf is upgraded from 1.0.1 to 1.1.14: [#43103](https://github.com/PaddlePaddle/Paddle/pull/43103)
-  - cpplint is pinned to version 1.6.0 instead of the default version: [#43175](https://github.com/PaddlePaddle/Paddle/pull/43175), [#43978](https://github.com/PaddlePaddle/Paddle/pull/43978), [#43673](https://github.com/PaddlePaddle/Paddle/pull/43673), [#43679](https://github.com/PaddlePaddle/Paddle/pull/43679), [#43695](https://github.com/PaddlePaddle/Paddle/pull/43695), [#43733](https://github.com/PaddlePaddle/Paddle/pull/43733), [#43740](https://github.com/PaddlePaddle/Paddle/pull/43740)
-  - clang-format is upgraded from 3.8 to 13.0: [#42840](https://github.com/PaddlePaddle/Paddle/pull/42840), [#43248](https://github.com/PaddlePaddle/Paddle/pull/43248), [#43329](https://github.com/PaddlePaddle/Paddle/pull/43329), [#43333](https://github.com/PaddlePaddle/Paddle/pull/43333), [#43633](https://github.com/PaddlePaddle/Paddle/pull/43633), [#43678](https://github.com/PaddlePaddle/Paddle/pull/43678)
-  - Introduce the black tool for Python code style checking: [#46014](https://github.com/PaddlePaddle/Paddle/pull/46014)
-  - Introduce the cmakelint tool for CMake file code checking. Version is 1.4.2: [#43222](https://github.com/PaddlePaddle/Paddle/pull/43222), [#43406](https://github.com/PaddlePaddle/Paddle/pull/43406), [#43414](https://github.com/PaddlePaddle/Paddle/pull/43414), [#43428](https://github.com/PaddlePaddle/Paddle/pull/43428)
-  - Introduce cmake-format for automatic formatting of CMake files. Version is 0.6.13: [#43057](https://github.com/PaddlePaddle/Paddle/pull/43057)
-
-## 6. 
Hardware adaptation
-### Hygon DCU
-- Add the Profiler function on DCU, to collect, count and display performance data of the model running process on DCU, and support DCU occupancy display at kernel level.
-### Kunlunxin Chip
-- Add the Profiler function on Kunlunxin 2nd-generation chips, to collect, count and display performance data of the model running process, and support occupancy display at kernel level.
-- Support training/inference on Kunlunxin 2nd-generation chips (Kunlunxin AI accelerator cards R200, R300, R200-8F, R200-8FS, RG800). A total of 51 models, such as PPYOLOE, PP-OCR, ERNIE3.0, PP-TSM, PP-TTS, DLRM, and PPO, have been verified, covering five fields: intelligent vision, natural language processing, intelligent speech, intelligent recommendation, and reinforcement learning. Support static graph + dynamic graph training, mixed precision training, and single machine single card and single machine multi-card training.
-### Cambricon
-- Support the training/inference of Cambricon MLU chip (MLU370 series of boards): The ResNet50, BERT, YoloV3, OCR-DB, Deeplabv3 and many other models are verified. Support the static graph + dynamic graph training. Support mixed precision training. Support the single machine single card and single machine multi-card training.
-### Graphcore
-- Support the training/inference of Graphcore IPU chip (including IPU Mk2 GC200 and Bow IPU). Support ResNet50, BERT and other models. Support the static graph and dynamic-to-static graph mode training. Support the single chip, single machine, and multi-machine distributed training.
-- Add the support of more operators
-- Upgrade to Poplar SDK v3.0.0 [#46892](https://github.com/PaddlePaddle/Paddle/pull/46892)
-* Support training models by using the dynamic-to-static graph mode. 
Add a new paddle.incubate.identity_loss op to assist with composition [#43770](https://github.com/PaddlePaddle/Paddle/pull/43770)
-* Support the Paddle native distributed training API: paddle.distributed.launch [#43311](https://github.com/PaddlePaddle/Paddle/pull/43311)
-* Support training models with mixed precision [#41733](https://github.com/PaddlePaddle/Paddle/pull/41733)
-* Paddle Inference supports custom operators by using PopART [#45235](https://github.com/PaddlePaddle/Paddle/pull/45235)
-
-### Intel
-- Migrate oneDNN operators: transpose2_grad([#46139](https://github.com/PaddlePaddle/Paddle/pull/46139)), relu6_grad([#46501](https://github.com/PaddlePaddle/Paddle/pull/46501)), gaussian_random([#46747](https://github.com/PaddlePaddle/Paddle/pull/46747), [#45481](https://github.com/PaddlePaddle/Paddle/pull/45481)), sgd and stack([#46374](https://github.com/PaddlePaddle/Paddle/pull/46374)), concat+grad, expand+grad, fill_constant([#45863](https://github.com/PaddlePaddle/Paddle/pull/45863)), slice, slice_grad, split, pad and pad3d([#46101](https://github.com/PaddlePaddle/Paddle/pull/46101)), softmax_grad([#46257](https://github.com/PaddlePaddle/Paddle/pull/46257)), Shape([#46051](https://github.com/PaddlePaddle/Paddle/pull/46051)), Sum([#46239](https://github.com/PaddlePaddle/Paddle/pull/46239)), Transpose2_grad([#46139](https://github.com/PaddlePaddle/Paddle/pull/46139)), Cast, clip+grad and pool+grad([#45775](https://github.com/PaddlePaddle/Paddle/pull/45775)), Reduce sum+grad, mean+grad, min and max([#45536](https://github.com/PaddlePaddle/Paddle/pull/45536)), Relu and abs([#45397](https://github.com/PaddlePaddle/Paddle/pull/45397)), Gelu([#45596](https://github.com/PaddlePaddle/Paddle/pull/45596)), Scale([#45537](https://github.com/PaddlePaddle/Paddle/pull/45537))
-- Optimize kernels of fill_constant, fc, conv, and a number of operators
-- Add several Pass fusion optimizations
-- Optimize the Adam-W CPU FP32 optimizer 
([#42522](https://github.com/PaddlePaddle/Paddle/pull/42522))
-- Optimize pad3d fp32 onednn operator kernel implementation ([#43990](https://github.com/PaddlePaddle/Paddle/pull/43990))
-- Optimize the concurrent execution of matmul, FC and lookup_v2 kernels ([#44023](https://github.com/PaddlePaddle/Paddle/pull/44023), [#44078](https://github.com/PaddlePaddle/Paddle/pull/44078), [#44640](https://github.com/PaddlePaddle/Paddle/pull/44640), [#44744](https://github.com/PaddlePaddle/Paddle/pull/44744), [#45249](https://github.com/PaddlePaddle/Paddle/pull/45249))
-- FC onednn operator kernel supports bf16 ([#42758](https://github.com/PaddlePaddle/Paddle/pull/42758), [#43154](https://github.com/PaddlePaddle/Paddle/pull/43154), [#43109](https://github.com/PaddlePaddle/Paddle/pull/43109))
-- Add the fusion of matrix multiplication and activation functions ([#43519](https://github.com/PaddlePaddle/Paddle/pull/43519), [#43198](https://github.com/PaddlePaddle/Paddle/pull/43198))
-- Support convolution operator int8 parameter production IR passes ([#44680](https://github.com/PaddlePaddle/Paddle/pull/44680), [#42625](https://github.com/PaddlePaddle/Paddle/pull/42625))
-- Add pool/avg quantization and scales correction ([#44186](https://github.com/PaddlePaddle/Paddle/pull/44186))
-- Add the matmul and elementwise onednn operator kernel fusion ([#45077](https://github.com/PaddlePaddle/Paddle/pull/45077))
-- Fix the QAT precision bug ([#43693](https://github.com/PaddlePaddle/Paddle/pull/43693), [#45936](https://github.com/PaddlePaddle/Paddle/pull/45936), [#46378](https://github.com/PaddlePaddle/Paddle/pull/46378))
-- Migrate 42 oneDNN operator kernels to PHI operator library ([#46374](https://github.com/PaddlePaddle/Paddle/pull/46374), [#46101](https://github.com/PaddlePaddle/Paddle/pull/46101), [#45989](https://github.com/PaddlePaddle/Paddle/pull/45989), [#45863](https://github.com/PaddlePaddle/Paddle/pull/45863), [#45775](https://github.com/PaddlePaddle/Paddle/pull/45775), 
[#45626](https://github.com/PaddlePaddle/Paddle/pull/45626), [#45536](https://github.com/PaddlePaddle/Paddle/pull/45536), [#46501](https://github.com/PaddlePaddle/Paddle/pull/46501), [#46257](https://github.com/PaddlePaddle/Paddle/pull/46257), [#45596](https://github.com/PaddlePaddle/Paddle/pull/45596), [#45537](https://github.com/PaddlePaddle/Paddle/pull/45537), [#45481](https://github.com/PaddlePaddle/Paddle/pull/45481), [#45397](https://github.com/PaddlePaddle/Paddle/pull/45397), [#46239](https://github.com/PaddlePaddle/Paddle/pull/46239), [#46139](https://github.com/PaddlePaddle/Paddle/pull/46139), [#46051](https://github.com/PaddlePaddle/Paddle/pull/46051)) -- Quantize the elementwise_sub and shape operator kernels ([#42854](https://github.com/PaddlePaddle/Paddle/pull/42854), [#44124](https://github.com/PaddlePaddle/Paddle/pull/44124)) - -## Thanks to our Contributors - -This release contains contributions from: - -0x45f, Aganlengzi, Ainavo, Allen Guo, Asthestarsfalll, Aurelius84, Baibaifan, baoachun, BiynXu, Bo Zhang, BrilliantYuKaimin, cambriconhsq, caozhou, carryyu, ccrrong, ceci3, chalsliu, Chang Xu, Charles-hit, Chen Long, Chen Weihang, chenjian, chentianyu03, Chenxiao Niu, cifar10, crystal, csy0225, danleifeng, David Nicolas, dc-cheny, denglin-github, dongfangshenzhu, duanboqiang, duanyanhui, engineer, enzodechine, Fan Zhang, feifei-111, Feiyu Chan, Feng Ni, feng_shuai, FlyingQianMM, freeliuzc, furnace, fuyou765, fwenguang, Ghost Screaming, gongweibao, Guanghua Yu, guguguzi, Guoxia Wang, Haipeng Wang, handiz, Haohongxiang, haosicheng, helen88, heliqi, hong, HongyuJia, houj04, huangxu96, Hui Zhang, Huihuang Zheng, huzhiqiang, Jacek Czaja, Jack Zhou, jack603047588, Jackwaterveg, jakpiase, james, Jiabin Yang, jiangcheng, Jiaqi Liu, JingZhuangzhuang, joanna.wozna.intel, JYChen, JZ-LIANG, Kaipeng Deng, kangguangli, kuizhiqing, Leo Chen, Leo Guo, levi131, Li Min, Li-fAngyU, lidanqing, LielinJiang, Ligoml, Lijunhui, lilong12, limingshu, Lin Manhui, Linjie Chen, 
liqitong-a, littletomatodonkey, liu zhengxi, Liu-xiandong, liutiexing, Liyulingyue, LiYuRio, Lux et Veritas, lyq, Matsumoto Ruko, MayYouBeProsperous, mengqingchun02, Ming-Xu Huang, ming1753, minghaoBD, moyan, mrcangye, Netpunk, niuliling123, Nyakku Shigure, OccupyMars2025, onecatcn, pangyoki, parap1uie-s, peachlcy, piotrekobi, Qi Li, QingshuChen, qipengh, Rayman, Regan Yue, RichardWooSJTU, risemeup1, Roc, ronnywang, Rui Li, Ruibiao Chen, seemingwang, Shang Zhizhou, shangliang Xu, ShenLiang, shentanyue, Shijie, ShiningZhang, shixingbo, shiyutang, Shuangchi He, Siming Dai, Sing_chan, Skr Bang, SmirnovKol, sneaxiy, sprouteer, Sylwester Fraczek, Sławomir Siwek, taixiurong, Tao CHANG, TeFeng Chen, Thomas Young, thunder95, Thunderbrook, tiancaishaonvjituizi, tianshuo78520a, Tomasz Socha, TTerror, USTCKAY, Vigi Zhang, Walter, Wang Bojun, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, WangXi, wangxinxin08, Wangzheee, WangZhen, wangzhen38, wawltor, wbn, Wei Shengyu, Weilong Wu, weishengying, Wen Sun, wenbin, whs, Wilber, WJJ1995, wuhuachaocoding, wuhuanzhou, wuyefeilin, XiaoguangHu, xiaoguoguo626807, xiaohemaikoo, xiaoting, xiaoxiaohehe001, Xiaoxu Chen, xiayanming, Xingyuan Zhang, xiongkun, yang131313, yangguohao, YangZhou, Yanxing Shi, Yao Zihang, yaoxuefeng, yaozhixin, yeliang2258, Yilingyelu, Yiqun Liu, ykkk2333, Yuang Liu, Yuanle Liu, YuanRisheng, yuguo, Yulong Ao, Yulv-git, YUNSHEN XIE, Zhang Jun, Zhang Ting, Zhang Zheng, zhangbo9674, zhangbopd, zhangchunle, Zhangjingyu06, zhangkaihuo, zhangxiaoci, zhangyikun02, zhangzhenguo, Zhanlue Yang, zhaocaibei123, zhaoying9105, zhaoyingli, Zhen Wang, Zhengyang Song, zhiboniu, Zhong Hui, Zhou Wei, zhoutianzi666, zhupengyang, ziyoujiyi, zlsh80826, zmxdream, zn, Zuza Gawrysiak, zyfncg, 傅剑寒, 六个骨头, 津, 熊峻峰, 王明冬, 石晓伟 - -# 2.3.1 Release Note - -## **1. Important Updates** - -- V2.3.1 is built on V2.3 by fixing known issues and releasing precompiled binary that supports CUDA 11.6. - -## **2. 
Training Framework (distributed included)** - -### **(1) Function Optimization** - -#### API - -- Modify two initialization modes of `paddle.nn.initializer.KaimingUniform` and `paddle.nn.initializer.KaimingNormal`, to support multiple types of activation functions. ([#43721](https://github.com/PaddlePaddle/Paddle/pull/43721), [#43827](https://github.com/PaddlePaddle/Paddle/pull/43827)) -- Optimize the data pre-fetching function of `paddle.io.DataLoader`, so that it can support the setting of the `prefetch_factor` to set the cache size of pre-fetched data. This can avoid IO blocking when reading large blocks of data. ([#43674](https://github.com/PaddlePaddle/Paddle/pull/43674)) - -#### **New dynamic graph execution mechanism** - -- Modify the initialization method of optional type Tensor in the new dynamic graph API logic to prevent data exceptions caused by early destruction. ([#42561](https://github.com/PaddlePaddle/Paddle/pull/42561)) - -#### **New static graph executor** - -- Defer initialization of the thread pools in the executor, to avoid creating thread pools for `programs` that execute only once (e.g.,`save, load, startup_program`, etc.). ([#43768](https://github.com/PaddlePaddle/Paddle/pull/43768)) - -#### **Mixed precision training** - -- Disabling `state_dict` hook in `set_state_dict` in `paddle.nn.Layer`. ([#43407](https://github.com/PaddlePaddle/Paddle/pull/43407)) - -#### **Distributed training** - -- Enabling tensor parallelism in `paddle.incubate.nn.functional.fused_attention` and `paddle.incubate.nn.functional.fused_feedforward`. ([#43505](https://github.com/PaddlePaddle/Paddle/pull/43505)) - -#### **Others** - -- Adjust print format of the framework operator kernels to facilitate automated splitting and parsing. ([#42931](https://github.com/PaddlePaddle/Paddle/pull/42931)) -- Update the model quantization API to support the round-off in `rounding to nearest ties to even`, and support quantization in the range [-128, 127]. 
([#43829](https://github.com/PaddlePaddle/Paddle/pull/43829)) -- Support AMP mixed precision training in quantization-aware training. ([#43689](https://github.com/PaddlePaddle/Paddle/pull/43689)) -- Add the `progress bar` at the beginning of quantization-aware training, so that it is easy to check the progress of quantization initialization. Skip the scale op when counting out_threshold to speed up the initialization process. ([#43454](https://github.com/PaddlePaddle/Paddle/pull/43454)) -- Support `conv` and `bn` fusion in the dynamic graph quantization training. Support the settings of skip_tensor_list in the static graph offline quantization, to skip some layers without quantization. ([#43301](https://github.com/PaddlePaddle/Paddle/pull/43301)) - -### **(2) Performance Optimization** - -- Optimize`paddle.incubate.nn.functional.fused_attention` and `paddle.incubate.nn.functional.fused_feedforward`operators. Add `add_residual` property to control whether to perform add-`residual` operation in the last step. The performance of CAE model is improved by 7.7%. ([#43719](https://github.com/PaddlePaddle/Paddle/pull/43719)) -- Optimize `linspace` operator. Initialize three input Tensor of `start`,`stop` and `num` on CPU, to avoid GPU->CPU copy in the operator. This can speed up SOLOv2 model performance by 6%. ([#43746](https://github.com/PaddlePaddle/Paddle/pull/43746)) - -### **(3) Bug Fix** - -#### API - -- Fix the error reported by `paddle.io.DataLoader` when `return_list=True` due to multi-thread conflict. ([#43691](https://github.com/PaddlePaddle/Paddle/pull/43691)) -- Fix the error that the `to` method reports NoneType does not have the device attribute when the `paddle.nn.Layer` parameter has the `None` type parameter. ([#43597](https://github.com/PaddlePaddle/Paddle/pull/43597)) -- Fix the bug that the calculation result of cumsum op is wrong in some `shape` settings. 
([#42500](https://github.com/PaddlePaddle/Paddle/pull/42500), [#43777](https://github.com/PaddlePaddle/Paddle/pull/43777)) -- Fix the bug that the output result dimension of `Tensor.__getitem__` is 0 in the networking stage when using `bool` index in the static graph. ([#43246](https://github.com/PaddlePaddle/Paddle/pull/43246)) -- Fix the bug occurred when `paddle.slice` and `paddle.strided_slice` handle negative parameters. ([#43432](https://github.com/PaddlePaddle/Paddle/pull/43432)) -- Fix the bug that the assignment result of set_value op is abnormal when the processing slice `step` is negative. ([#43694](https://github.com/PaddlePaddle/Paddle/pull/43694)) -- Fix the bug that the `copy` interface in C++ cannot copy between multiple cards. ([#43728](https://github.com/PaddlePaddle/Paddle/pull/43728)) -- Fix the bug in inference stage caused by attribute naming in `paddle.incubate.nn.functional.fused_attention`and `paddle.incubate.nn.functional.fused_feedforward`. ([#43505](https://github.com/PaddlePaddle/Paddle/pull/43505)) -- Fix an exception in ConditionalBlockGrad op when processing Tensor that does not require `grad`. ([#43034](https://github.com/PaddlePaddle/Paddle/pull/43034)) -- Fix the bug of device memory increase caused by einsum op in the speed optimization of backward computation. By default, this optimization is enabled. ([#43397](https://github.com/PaddlePaddle/Paddle/pull/43397)) -- Fix the bug that data fails to be fixed when `paddle.io.DataLoader` multi-process data reads the fixing random seeds under a single card. ([#43702](https://github.com/PaddlePaddle/Paddle/pull/43702)) -- Fix the bug that softmax op triggers CUDNN_STATUS_NOT_SUPPORT when the Tensor exceeds 2G. ([#43719](https://github.com/PaddlePaddle/Paddle/pull/43719)) -- Fix the bug that the trace op `Event` string is indistinguishable among different operators that cause the inconvenient performance analysis. 
([#42789](https://github.com/PaddlePaddle/Paddle/pull/42789)) - -#### **Others** - -- Fix the bug of overflowing device memory caused by multiple deepcopy and saving in case of dynamic-to-static. ([#43141](https://github.com/PaddlePaddle/Paddle/pull/43141)) -- Fix the bug that the device id introduced by the upgrade of PlaceType used in the custom operator is wrong in the multi-card scenario. ([#43830](https://github.com/PaddlePaddle/Paddle/pull/43830)) -- Optimize the `paddle.profiler.Profiler` timeline visualization logic, move events customized in python scripts from C++ folding display to python folding display. ([#42790](https://github.com/PaddlePaddle/Paddle/pull/42790)) - -## **3.** Deployment Direction (Paddle Inference) - -### **(1) New Features** - -#### **New functions** - -- Add the support of the PaddleSlim quantization model for ONNX Runtime backends on CPUs. ([#43774](https://github.com/PaddlePaddle/Paddle/pull/43774), [#43796](https://github.com/PaddlePaddle/Paddle/pull/43796)) - -### **(2) Underlying Optimization** - -#### **CPU performance optimization** - -- Remove `gpu_cpu_reshape2_matmul_fuse_pass` from EnableMkldnn configuration to fix the bug of ResNet50 performance degradation. ([#43750](https://github.com/PaddlePaddle/Paddle/pull/43750)) - -#### **GPU performance optimization** - -- Add the support of `bilinear_interp_v2` TensorRT convert. ([#43618](https://github.com/PaddlePaddle/Paddle/pull/43618)) -- Add `matmul_scale_fuse_pass` and `multihead_matmul_fuse_pass_v3` to GPU pass. ([#43765](https://github.com/PaddlePaddle/Paddle/pull/43765)) -- Add the support of the GPU handle deferred initialization. ([#43661](https://github.com/PaddlePaddle/Paddle/pull/43661)) - -### **(3) Bug Fixing** - -#### **Framework and API fixing** - -- Fix the compile error problem when binding Paddle-Lite XPU. ([#43178](https://github.com/PaddlePaddle/Paddle/pull/43178)) -- Fix the bug of false trigger of ERNIE 3.0 pass. 
([#43948](https://github.com/PaddlePaddle/Paddle/pull/43948)) -- Fix the bug that int8 quantization attribute in multihead op cannot be read. ([#43020](https://github.com/PaddlePaddle/Paddle/pull/43020)) - -#### **Backend capability fixing** - -- Fix the bug that the elementwise_mul and matmul ops in MKLDNN crash during quantized inference. ([#43725](https://github.com/PaddlePaddle/Paddle/pull/43725)) -- Fix a bug where TensorRT subgraph serialization files are repeatedly generated for the same model during inference. ([#42945](https://github.com/PaddlePaddle/Paddle/pull/43945), [#42633](https://github.com/PaddlePaddle/Paddle/pull/42633)) -- Fix a conflict between the ONNX Runtime backend and the external use of protobuf. ([#43159](https://github.com/PaddlePaddle/Paddle/pull/43159), [#43742](https://github.com/PaddlePaddle/Paddle/pull/43742)) -- Fix an error reported by the Python prediction library when using the ONNX Runtime backend with multiple inputs. ([#43621](https://github.com/PaddlePaddle/Paddle/pull/43621)) - -## **4. Environment Adaptation** - -### **Compile and install** - -- Complete verification and adaptation of CUDA 11.6, and release CUDA 11.6 precompiled binary. ([#43935](https://github.com/PaddlePaddle/Paddle/pull/43935), [#44005](https://github.com/PaddlePaddle/Paddle/pull/44005)) -- Fix a cub error when compiling with CUDA 11.6 on Windows. ([#43935](https://github.com/PaddlePaddle/Paddle/pull/43935), [#44005](https://github.com/PaddlePaddle/Paddle/pull/44005)) -- Fix the bug of long compilation time for elementwise and reduce op. ([#43202](https://github.com/PaddlePaddle/Paddle/pull/43202), [#42779](https://github.com/PaddlePaddle/Paddle/pull/42779), [#43205](https://github.com/PaddlePaddle/Paddle/pull/43205)) - -### **New hardware adaptation** - -- Cambricon MLU supports PaddlePaddle Profiler. ([#42115](https://github.com/PaddlePaddle/Paddle/pull/42115)) -- GraphCore IPU supports visualization of compilation progress. 
([#42078](https://github.com/PaddlePaddle/Paddle/pull/42078)) - -# 2.3.0 Release Note - -## 1. **Important Updates** - -We are excited to release the PaddlePaddle Framework V2.3.0. This version contains the following highlights. - -### API - -- Added more than 100 new APIs, covering automatic differentiation, linear algebra, probability distribution, sparse tensor, framework performance analysis, hardware device management, vision domain, etc. - -- Added 4 new automatic differentiation APIs, 11 new linear algebra APIs, and 21 new probability distribution APIs to better support use cases in scientific computing, reinforcement learning, and other application areas. - -- Added 11 new Sparse Tensor APIs including basic functions of sparse tensor construction and conversion. The COO and CSR formats are supported. - -- Added 9 new framework performance analysis APIs. The new performance profiling APIs, centered around `paddle.profiler.Profiler`, help users collect and analyze performance statistics during training and inference. - -- Added 7 APIs for device management, facilitating hardware information acquisition. - -- Added several vision and text domain APIs to facilitate reuse of MobileNetV3, ResNeXt and other backbone networks, enabling fast network construction. - - -### **Paddle HIgh reusability operator library** - -- We announce PHI as the new Paddle HIgh reusability operator library. PHI provides Primitive API, enabling kernel reuse for operator development. As a refactored functional operator library, PHI aims to solve legacy problems that harm the framework's performance and reusability, in particular in operator development. Such problems include inefficient cross-use of operators, unclear operator interfaces, and the lack of direct calls to the operator library in C++. With PHI, new operators can be easily implemented by composing functions available in the functional library. 
The library provides over 200 C++ operator class APIs and nearly 500 kernels. Composing new operators through these built-in functions can greatly reduce the user's development effort. PHI supports different types of hardware (e.g., GPU and XPU). In addition, PHI is extensible with plugins for accommodating third-party accelerators (such as NPU) in a low-cost and reusable fashion. In short, PHI supports low-level operator composability, the reuse of kernels through Primitives, and accelerators through plugins. - -### **Distributed Training** - -- Fully upgrade the adaptive distributed training architecture, including multiple modules such as elastic resource management, asynchronous pipelined executor, heterogeneous communication, and automatic parallelism, and support hardware-aware distributed training and inference on a variety of heterogeneous hardware. - -- Add MoE parallel strategy, GroupSharded parallel strategy, and Pure FP16 under dynamic graph hybrid parallelism, which further supports the efficient distributed training of large models under the dynamic graph. - -- Comprehensively upgrade and optimize the architecture of general heterogeneous parameter server, and simplify each module, such as communication and storage, to improve the secondary development experience of parameter server. The performance of GPU parameter server is improved by 2.38 times under 100 billion parameters and 10 billion data. - - -### **Compile and Install** - -- Starting from version 2.3.0, PaddlePaddle updates the set of supported GPU architectures. - - -### **Inference Deployment** - -- Add the Java API and ONNX Runtime CPU backend. - -- Support TensorRT 8.0 / 8.2 and structured sparsity, with deep performance optimization for ERNIE-like structural models. - - -### **Hardware Backend Extension** - -- Add custom device support: provide a plug-in way to extend PaddlePaddle hardware backend. 
- -- Add training/inference support for multiple heterogeneous chips such as HUAWEI Ascend 910 / GraphCore IPU / Cambricon MLU / KUNLUNXIN 2. - - -### **Framework Architecture** - -- In this version, we did a lot of work on the framework executor. For details, please see [New Dynamic Graph Execution Mechanism](#new-dynamic-graph-execution-mechanism) and [New Static Graph Executor](#new-static-graph-executor). - -## **2. Incompatibility Upgrade** - -- Due to the binary size limit, sm35 CUDA ARCH is dropped in pre-compiled binaries. ([#41754](https://github.com/PaddlePaddle/Paddle/pull/41754)) - -- When `paddle.to_tensor` converts a Python int scalar to a Tensor, the default data type on Windows changes from int32 to int64, aligning with Linux/Mac. ([#39662](https://github.com/PaddlePaddle/Paddle/pull/39662)) - -- To stay consistent with division behavior under Python 3, the division symbol `/` has been changed from “rounding divide” to “true divide”, and the data type of the computed output has been switched from int to float. ([#40890](https://github.com/PaddlePaddle/Paddle/pull/40890)) - - - - - - - - - - - - - 
-2.2 - -2.3.0 -
-
-
-```python
->>> import paddle
->>> a = paddle.to_tensor([327])
->>> b = paddle.to_tensor([80])
->>> a / b
-Tensor(shape=[1], dtype=int64, place=CUDAPlace(0), stop_gradient=True,
-      [4])
-```
-
-
-
-
-```python
->>> import paddle
->>> a = paddle.to_tensor([327])
->>> b = paddle.to_tensor([80])
->>> a / b
-Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=True,
-      [4.08750010])
-```
-
-
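The `/` change above mirrors Python 3's own operator semantics. The following is a plain-Python sketch only: it makes no Paddle calls, reuses the operand values from the table, and the variable names are illustrative. It shows true division next to floor division, which reproduces the 2.2 result for these positive inputs:

```python
# Plain-Python illustration of the two division behaviours shown in the table.
# No Paddle API is used here; this only mirrors the Python 3 semantics
# that the 2.3.0 note says `/` is aligned with.
a, b = 327, 80

true_div = a / b    # "true divide": the result is a float
floor_div = a // b  # integer quotient; equals the 2.2 output for these positive inputs

print(true_div, type(true_div).__name__)
print(floor_div, type(floor_div).__name__)
```

Whether the old behaviour matches `//` for negative operands is not stated in the note, so the sketch deliberately sticks to the positive values from the example.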
- -- Revise the ELU's formula. The computing method in case of alpha <0 aligns with the original paper, thus fixing a small number of cases where the results are incorrectly calculated. Meanwhile, elu_ will report an error in case of alpha <0, because it is not mathematically possible to compute the inverse gradient from the output only at alpha <0. ([#37316](https://github.com/PaddlePaddle/Paddle/pull/37316)) - - - - - - - - - - - -
-2.2 - -2.3.0 -
-
-
-```python
-# elu(x) = max(0, x) + min(0, α ∗ (e^x − 1))
->>> import paddle
->>> x = paddle.to_tensor([-1., 6.])
->>> m = paddle.nn.ELU(-0.2)
->>> out = m(x)
->>> out
-Tensor(shape=[2], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
-       [ 0.        , -74.48576355])
->>> out = paddle.nn.functional.elu_(x, alpha=-0.2, name=None)
->>> out
-Tensor(shape=[2], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
-       [ 0.        , -74.48576355])
-```
-
-
-
-
-```python
-# elu(x) = x, if x > 0
-# elu(x) = α ∗ (e^x − 1), if x <= 0
->>> import paddle
->>> x = paddle.to_tensor([-1., 6.])
->>> m = paddle.nn.ELU(-0.2)
->>> out = m(x)
->>> out
-Tensor(shape=[2], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
-       [0.12642412,  6.        ])
->>> out = paddle.nn.functional.elu_(x, alpha=-0.2, name=None)
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-  File "/usr/local/lib/python3.7/dist-packages/decorator.py", line 232, in fun
-    return caller(func, *(extras + args), **kw)
-  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
-    return wrapped_func(*args, **kwargs)
-  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/inplace_utils.py", line 34, in __impl__
-    return func(*args, **kwargs)
-  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/functional/activation.py", line 89, in elu_
-    assert alpha >= 0., "elu_ only support alpha >= 0, please use elu instead."
-AssertionError: elu_ only support alpha >= 0, please use elu instead.
-```
-
-
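The two ELU formulas quoted in the code comments above can be checked without Paddle. The sketch below is an assumption-laden illustration: `elu_v22` and `elu_v23` are hypothetical helper names, not Paddle functions, and they implement only the formulas as written in the table, evaluated in double precision with `alpha = -0.2`.

```python
import math


def elu_v22(x, alpha):
    # Formula quoted for 2.2: max(0, x) + min(0, alpha * (e^x - 1))
    return max(0.0, x) + min(0.0, alpha * (math.exp(x) - 1.0))


def elu_v23(x, alpha):
    # Formula quoted for 2.3.0: x if x > 0, otherwise alpha * (e^x - 1)
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


alpha = -0.2
inputs = (-1.0, 6.0)
old = [elu_v22(x, alpha) for x in inputs]
new = [elu_v23(x, alpha) for x in inputs]

print(old)  # close to the 2.2 column of the table
print(new)  # close to the 2.3.0 column of the table
```

With a negative `alpha`, the old formula sends the positive input 6 to a large negative value, while the revised piecewise form leaves it unchanged, which is the discrepancy the note describes.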
- -## **3. Training Framework (with the distributed function)** - -### **(1) New functions** - -#### API - -- Add 4 new automatic differentiation APIs to support scientific computing, as listed below: ([#40692](https://github.com/PaddlePaddle/Paddle/pull/40692)) - - - `paddle.incubate.autograd.vjp`, compute the vector-Jacobian product. - - - `paddle.incubate.autograd.jvp`, compute the Jacobian-vector product. - - - `paddle.incubate.autograd.Jacobian`, compute the Jacobian matrix. - - - `paddle.incubate.autograd.Hessian`, compute the Hessian matrix. - -- Add linear algebra class API - - - Add `paddle.linalg.triangular_solve`, to solve a system of linear equations with a unique solution and a triangular coefficient matrix. ([#36714](https://github.com/PaddlePaddle/Paddle/pull/36714)) - - - Add `paddle.linalg.eig`, to compute the eigendecomposition of a general square matrix. ([#35764](https://github.com/PaddlePaddle/Paddle/pull/35764)) - - - Add `paddle.linalg.solve`, to compute solutions to systems of linear equations. ([#35715](https://github.com/PaddlePaddle/Paddle/pull/35715)) - - - Add `paddle.linalg.lstsq`, to compute least-squares solutions to systems of linear equations. ([#38585](https://github.com/PaddlePaddle/Paddle/pull/38585), [#38621](https://github.com/PaddlePaddle/Paddle/pull/38621)) - - - Add `paddle.linalg.qr`, to compute the QR decomposition of a matrix. ([#35742](https://github.com/PaddlePaddle/Paddle/pull/35742), [#38824](https://github.com/PaddlePaddle/Paddle/pull/38824)) - - - Add `paddle.inner`, to compute the inner product of matrices. ([#37706](https://github.com/PaddlePaddle/Paddle/pull/37706)) - - - Add `paddle.outer`, to compute the outer product of matrices. ([#37706](https://github.com/PaddlePaddle/Paddle/pull/37706)) - - - Add `paddle.linalg.cov`, to compute covariance between vectors. ([#38392](https://github.com/PaddlePaddle/Paddle/pull/38392)) - - - Add `paddle.linalg.cholesky_solve`, to solve a linear system using the Cholesky factorization. 
([#38167](https://github.com/PaddlePaddle/Paddle/pull/38167)) - - - Add `paddle.linalg.lu` and `paddle.linalg.lu_unpack`, to compute the LU decomposition of a matrix and to unpack the LU result. ([#38617](https://github.com/PaddlePaddle/Paddle/pull/38617), [#38559](https://github.com/PaddlePaddle/Paddle/pull/38559), [#38616](https://github.com/PaddlePaddle/Paddle/pull/38616)) - -- Add 21 new probability distribution class APIs for reinforcement learning, variational inference, scientific computing, and other scenarios. These include 6 random variable distributions, 13 random variable transformations, and 2 KL divergence APIs, as listed below: ([#40536](https://github.com/PaddlePaddle/Paddle/pull/40536), [#38820](https://github.com/PaddlePaddle/Paddle/pull/38820), [#38558](https://github.com/PaddlePaddle/Paddle/pull/38558/files), [#38445](https://github.com/PaddlePaddle/Paddle/pull/38445), [#38244](https://github.com/PaddlePaddle/Paddle/pull/38244), [#38047](https://github.com/PaddlePaddle/Paddle/pull/38047)) - - - `paddle.distribution.ExponentialFamily`, exponential distribution family base class. - - - `paddle.distribution.Beta`, `Beta` distribution. - - - `paddle.distribution.Dirichlet`, `Dirichlet` distribution. - - - `paddle.distribution.Independent`, Independent distribution, used to create higher-order distributions. - - - `paddle.distribution.TransformedDistribution`, Transform distribution, used to generate higher-order distributions through the base distribution and a series of transformations. - - - `paddle.distribution.Multinomial`, a multinomial distribution. - - - `paddle.distribution.Transform`, base class for transforming random variables. - - - `paddle.distribution.AbsTransform`, absolute value transform. - - - `paddle.distribution.AffineTransform`, affine transform. - - - `paddle.distribution.ChainTransform`, chain combination of transforms. - - - `paddle.distribution.ExpTransform`, exponential transform. 
- - - `paddle.distribution.IndependentTransform`, independent transform, used to extend the `event_dim` of the transform definition field. - - - `paddle.distribution.PowerTransform`, power transform. - - - `paddle.distribution.ReshapeTransform`, `reshape` transform. - - - `paddle.distribution.SigmoidTransform`, `sigmoid` transform. - - - `paddle.distribution.SoftmaxTransform`, `softmax` transform. - - - `paddle.distribution.StackTransform`, `stack` transform, used to combine multiple transforms in a `stack` method. - - - `paddle.distribution.StickBreakingTransform`, `stickbreaking` transform. - - - `paddle.distribution.TanhTransform`, `tanh` transform. - - - `paddle.distribution.kl_divergence`, compute KL divergence. - - - `paddle.distribution.register_kl`, register user-defined KL divergence calculation function. - -- Add high-level API - - - Add `paddle.vision.models.AlexNet` and `paddle.vision.models.alexnet`, to use AlexNet models directly. ([#36058](https://github.com/PaddlePaddle/Paddle/pull/36058)) - - - Add `paddle.vision.models.DenseNet`, `paddle.vision.models.densenet121`, `paddle.vision.models.densenet161`, `paddle.vision.models.densenet169`, `paddle.vision.models.densenet201`, and `paddle.vision.models.densenet264`, to use DenseNet models directly. ([#36069](https://github.com/PaddlePaddle/Paddle/pull/36069)) - - - Add `paddle.vision.models.GoogLeNet` and `paddle.vision.models.googlenet`, to use GoogLeNet models directly. ([#36034](https://github.com/PaddlePaddle/Paddle/pull/36034)) - - - Add `paddle.vision.models.InceptionV3`, `paddle.vision.models.inception_v3`, to use InceptionV3 models directly. ([#36064](https://github.com/PaddlePaddle/Paddle/pull/36064)) - - - Add `paddle.vision.models.MobileNetV3Small`, `paddle.vision.models.MobileNetV3Large`, `paddle.vision.models.mobilenet_v3_small`, and `paddle.vision.models.mobilenet_v3_large`, to use MobileNetV3 models directly. 
([#38653](https://github.com/PaddlePaddle/Paddle/pull/38653)) - - - Add `paddle.vision.models.resnext50_32x4d`, `paddle.vision.models.resnext50_64x4d`, `paddle.vision.models.resnext101_32x4d`, `paddle.vision.models.resnext101_64x4d`, `paddle.vision.models.resnext152_32x4d`, and `paddle.vision.models.resnext152_64x4d`, to use ResNeXt models directly. ([#36070](https://github.com/PaddlePaddle/Paddle/pull/36070)) - - - Add `paddle.vision.models.ShuffleNetV2`, `paddle.vision.models.shufflenet_v2_x0_25`, `paddle.vision.models.shufflenet_v2_x0_33`, `paddle.vision.models.shufflenet_v2_x0_5`, `paddle.vision.models.shufflenet_v2_x1_0`, `paddle.vision.models.shufflenet_v2_x1_5`, `paddle.vision.models.shufflenet_v2_x2_0`, and `paddle.vision.models.shufflenet_v2_swish`, to use ShuffleNetV2 models directly. ([#36067](https://github.com/PaddlePaddle/Paddle/pull/36067)) - - - Add `paddle.vision.models.SqueezeNet`, `paddle.vision.models.squeezenet1_0`, and `paddle.vision.models.squeezenet1_1`, to use SqueezeNet models directly. ([#36066](https://github.com/PaddlePaddle/Paddle/pull/36066)) - - - Add `paddle.vision.models.wide_resnet50_2`, and `paddle.vision.models.wide_resnet101_2`, to use WideResNet models directly. ([#36952](https://github.com/PaddlePaddle/Paddle/pull/36952)) - - - Add `paddle.vision.ops.nms` API, to support single-category and multi-category non-maximum suppression (NMS) algorithms for accelerating object detection and prediction tasks. ([#40962](https://github.com/PaddlePaddle/Paddle/pull/40962)) - - - Add `paddle.vision.ops.roi_pool` and `paddle.vision.ops.RoIPool`, to support RoI region pooling operations in detection tasks. ([#36154](https://github.com/PaddlePaddle/Paddle/pull/36154)) - - - Add `paddle.vision.ops.roi_align` and `paddle.vision.ops.RoIAlign`, to support RoI Align operations in detection tasks. 
([#35102](https://github.com/PaddlePaddle/Paddle/pull/36154)) - - - Add `paddle.text.ViterbiDecoder`, and `paddle.text.viterbi_decode` Viterbi decoding API, mainly for sequence tagging model prediction. ([#35778](https://github.com/PaddlePaddle/Paddle/pull/35778)) - -- Add 11 Sparse class APIs, to support basic functions such as creating Sparse Tensor in COO and CSR formats, and add C++ methods for converting to and from Tensor. - - - `paddle.sparse.sparse_coo_tensor`, create Sparse Tensor in COO format. ([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780)) - - - `paddle.sparse.sparse_csr_tensor`, create Sparse Tensor in CSR format. ([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780)) - - - `paddle.sparse.ReLU`, support ReLU activation layer for SparseCooTensor. ([#40959](https://github.com/PaddlePaddle/Paddle/pull/40959)) - - - `paddle.sparse.functional.relu`, support ReLU function of SparseCooTensor. ([#40959](https://github.com/PaddlePaddle/Paddle/pull/40959)) - - - `Tensor.values()`, C++ method to get non-zero elements of a SparseCooTensor or SparseCsrTensor. ([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608)) - - - `Tensor.indices()`, C++ method to get the coordinate information of a SparseCooTensor. ([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608)) - - - `Tensor.crows()`, C++ method to get the compressed row information of the SparseCsrTensor. ([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608)) - - - `Tensor.cols()`, C++ method to get the column information of the SparseCsrTensor. ([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608)) - - - `Tensor.to_sparse_coo()`, C++ method to convert a DenseTensor or SparseCsrTensor to a SparseCooTensor. ([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780)) - - - `Tensor.to_sparse_csr()`, C++ method to convert a DenseTensor or SparseCooTensor to a SparseCsrTensor. 
([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780)) - - - `Tensor.to_dense()`, C++ method to convert a SparseCooTensor or SparseCsrTensor to a DenseTensor. ([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780)) - -- Add hardware related APIs - - - Add four GPU memory monitoring related APIs: `paddle.device.cuda.max_memory_allocated`, `paddle.device.cuda.max_memory_reserved`, `paddle.device.cuda.memory_allocated`, and `paddle.device.cuda.memory_reserved`, to view and analyze the GPU memory usage in real-time. ([#38657](https://github.com/PaddlePaddle/Paddle/pull/38657)) - - - Add `paddle.device.cuda.get_device_properties`, to return the properties of the GPU device. ([#35661](https://github.com/PaddlePaddle/Paddle/pull/35661)) - - - Add `paddle.device.cuda.get_device_name` and `paddle.device.cuda.get_device_capability`, to return the name and compute capability of the GPU device. ([#35672](https://github.com/PaddlePaddle/Paddle/pull/35672)) - -- Add Tensor operation API - - - Add `paddle.nansum`, to sum the input Tensor along `axis` while ignoring `NaN` values. ([#38137](https://github.com/PaddlePaddle/Paddle/pull/38137)) - - - Add `paddle.nanmean`, to average the input Tensor along `axis` while ignoring `NaN` values. ([#40472](https://github.com/PaddlePaddle/Paddle/pull/40472)) - - - Add `paddle.clone`, to return a copy of the input Tensor and provide gradient calculation. ([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020)) - - - Add `paddle.Tensor.element_size`, to return the number of bytes allocated for a single element in a Tensor. ([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020)) - - - Add `paddle.Tensor.to_uva_tensor`, to convert numpy objects into objects that CUDA can access through virtual addresses while the data physically stays in CPU memory. 
([#39146](https://github.com/PaddlePaddle/Paddle/pull/39146), [#38950](https://github.com/PaddlePaddle/Paddle/pull/38950)) - - - Add `paddle.rot90`, to rotate the n-dimensional Tensor by 90 degrees along the plane specified by `axes`. ([#37634](https://github.com/PaddlePaddle/Paddle/pull/37634)) - - - Add `paddle.logit` and `paddle.Tensor.logit`, to compute the logit function values for input Tensor. ([#37844](https://github.com/PaddlePaddle/Paddle/pull/37844)) - - - Add `paddle.repeat_interleave`, to copy the input along the specified axis, and return a new Tensor. ([#37981](https://github.com/PaddlePaddle/Paddle/pull/37981)) - - - Add `paddle.renorm`, to split the Tensor into multiple pieces at the specified `axis` and then perform p norm operations separately. ([#38130](https://github.com/PaddlePaddle/Paddle/pull/38130), [#38459](https://github.com/PaddlePaddle/Paddle/pull/38459)) - - - Add `paddle.mode` and `paddle.Tensor.mode`, to search the values and indices of the input Tensor along the specified axis. ([#38446](https://github.com/PaddlePaddle/Paddle/pull/38446)) - - - Add `paddle.quantile` and `paddle.Tensor.quantile`, to compute the q-quantile of a Tensor along the specified axis. ([#38567](https://github.com/PaddlePaddle/Paddle/pull/38567)) - - - Add `paddle.kthvalue` and `paddle.Tensor.kthvalue`, to find the values and indices of the kth smallest at the specified axis. ([#38386](https://github.com/PaddlePaddle/Paddle/pull/38386)) - - - Add `paddle.is_floating_point` and `paddle.Tensor.is_floating_point`, to determine if the input Tensor is the floating point type. ([#37885](https://github.com/PaddlePaddle/Paddle/pull/37885)) - - - Add `paddle.erfinv` and `paddle.Tensor.erfinv`, to compute the inverse error function of the input Tensor. ([#38295](https://github.com/PaddlePaddle/Paddle/pull/38295)) - - - Add `paddle.lerp` and `paddle.Tensor.lerp`, to compute linear interpolation among the input Tensors based on the given weights. 
([#37253](https://github.com/PaddlePaddle/Paddle/pull/37253)) - - - Add `paddle.angle`, to compute the phase angle of a complex Tensor. ([#37689](https://github.com/PaddlePaddle/Paddle/pull/37689)) - - - Add `paddle.rad2deg` and `paddle.Tensor.rad2deg`, to convert each element of the input from radians to degrees. ([#37598](https://github.com/PaddlePaddle/Paddle/pull/37598)) - - - Add `paddle.deg2rad` and `paddle.Tensor.deg2rad`, to convert each element of the input from degrees to radians. ([#37598](https://github.com/PaddlePaddle/Paddle/pull/37598)) - - - Add `paddle.gcd` and `paddle.Tensor.gcd`, to compute the greatest common divisors of the absolute values of two inputs by element. ([#37819](https://github.com/PaddlePaddle/Paddle/pull/37819)) - - - Add `paddle.lcm` and `paddle.Tensor.lcm`, to compute the least common multiple of the absolute values of two inputs by element. ([#37819](https://github.com/PaddlePaddle/Paddle/pull/37819)) - - - Add `paddle.amax` and `paddle.Tensor.amax`, to get the maximum value of Tensor elements along the specified dimension. ([#38417](https://github.com/PaddlePaddle/Paddle/pull/38417)) - - - Add `paddle.amin` and `paddle.Tensor.amin`, to get the minimum value of Tensor elements along the specified dimension. ([#38417](https://github.com/PaddlePaddle/Paddle/pull/38417)) - - - Add `paddle.isclose`, to determine whether the corresponding elements of two Tensors are close to each other. ([#37135](https://github.com/PaddlePaddle/Paddle/pull/37135)) - - - Add `paddle.put_along_axis` and `paddle.take_along_axis`, for placing or extracting elements at specified index subscripts. ([#38608](https://github.com/PaddlePaddle/Paddle/pull/38608)) - - - Add `paddle.bincount` and `paddle.Tensor.bincount`, for counting the number of occurrences of each element in a Tensor. 
([#36317](https://github.com/PaddlePaddle/Paddle/pull/36317)) - - - Add `paddle.fmax` and `paddle.fmin`, to extend the max/min function to support the case of NaN values in the two Tensors. If there is one NaN value in the corresponding position, return that non-NaN value; if there are two NaN values in the corresponding position, return the NaN value. ([#37826](https://github.com/PaddlePaddle/Paddle/pull/37826)) - - - Add `paddle.diff`, for computing the nth forward difference along a given dimension. It currently supports n=1. ([#37441](https://github.com/PaddlePaddle/Paddle/pull/37441)) - - - Add inverse hyperbolic functions: `paddle.asinh`, `paddle.acosh`, and `paddle.atanh`. ([#37076](https://github.com/PaddlePaddle/Paddle/pull/37076)) - - - Add `paddle.as_real` and `paddle.as_complex` for conversion between real Tensor and complex Tensor. ([#37784](https://github.com/PaddlePaddle/Paddle/pull/37784)) - - - Add `paddle.complex`, for constructing a complex Tensor with the given real and imaginary parts. ([#37918](https://github.com/PaddlePaddle/Paddle/pull/37918), [#38272](https://github.com/PaddlePaddle/Paddle/pull/38272)) - - - Add `paddle.det` and `paddle.slogdet`, to compute the determinant of a matrix and the natural logarithm of the determinant. ([#34992](https://github.com/PaddlePaddle/Paddle/pull/34992)) - - - Add `paddle.nn.utils.parameters_to_vector`, to flatten parameters to a 1-D Tensor. ([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020)) - - - Add `paddle.nn.utils.vector_to_parameters`, to transform a Tensor with 1-D shape to the parameters. ([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020)) - -- Add networking class APIs - - - Add `paddle.nn.Fold` and `paddle.nn.functional.fold`, to extract sliding local area blocks for the Tensors of a batch. ([#38613](https://github.com/PaddlePaddle/Paddle/pull/38613)) - - - Add `paddle.nn.CELU` and `paddle.nn.functional.celu`, to support the CELU activation layer. 
([#36088](https://github.com/PaddlePaddle/Paddle/pull/36088)) - - - Add `paddle.nn.HingeEmbeddingLoss`, to compute the hinge embedding loss. It is usually used for nonlinear embedding or semi-supervised learning. ([#37540](https://github.com/PaddlePaddle/Paddle/pull/37540)) - - - Add `paddle.nn.ZeroPad2D` API, for zero-padding according to the padding property. ([#37151](https://github.com/PaddlePaddle/Paddle/pull/37151)) - - - Add `paddle.nn.MaxUnPool3D` and `paddle.nn.MaxUnPool1D`, for computing 3D and 1D max unpooling. ([#38716](https://github.com/PaddlePaddle/Paddle/pull/38716)) - - - Add `paddle.incubate.graph_khop_sampler`, `paddle.incubate.graph_sample_neighbors`, and `paddle.incubate.graph_reindex` APIs, to support graph multi-order neighbor sampling and graph reindexing operations. They are mainly used for graph neural network model training. ([#39146](https://github.com/PaddlePaddle/Paddle/pull/39146), [#40809](https://github.com/PaddlePaddle/Paddle/pull/40809)) - -- Add random number class APIs - - - Add `paddle.poisson`, to generate a Tensor that follows a Poisson distribution with the given lambda parameter. ([#38117](https://github.com/PaddlePaddle/Paddle/pull/38117)) - - - Add `paddle.randint_like` API, to generate a new Tensor that obeys uniform distribution in the range [low, high), with the shape of the output matching the shape of the input. ([#36169](https://github.com/PaddlePaddle/Paddle/pull/36169)) - - - Add `paddle.Tensor.exponential_`. It is an inplace style API that populates the input Tensor with exponentially distributed random numbers. ([#38256](https://github.com/PaddlePaddle/Paddle/pull/38256)) - -- Add parameter initialization class APIs - - - Add `paddle.nn.initializer.Dirac`, to initialize 3D/4D/5D parameters with Dirac delta functions. It is commonly used for initialization of Conv1D/Conv2D/Conv3D parameters in the convolution layer. 
([#37389](https://github.com/PaddlePaddle/Paddle/pull/37389)) - - - Add `paddle.nn.initializer.Orthogonal` for orthogonal matrix initialization. The initialized parameter is the (semi-) orthogonal vector. ([#37163](https://github.com/PaddlePaddle/Paddle/pull/37163)) - - - Add `paddle.nn.initializer.calculate_gain`, to get the recommended gain value for the activation function. The gain value can be used to set certain initialization APIs to adjust the initialization range. ([#37163](https://github.com/PaddlePaddle/Paddle/pull/37163)) - -- Add learning rate class API - - - Add `paddle.optimizer.lr.MultiplicativeDecay`, to provide the `lambda` function to set the learning rate. ([#38250](https://github.com/PaddlePaddle/Paddle/pull/38250)) -- Add distributed-related APIs - - - Add `paddle.incubate.optimizer.DistributedFusedLamb`, to allow the Lamb optimizer to update parameters in a distributed manner. ([#40011](https://github.com/PaddlePaddle/Paddle/pull/40011), [#39972](https://github.com/PaddlePaddle/Paddle/pull/39972), [#39900](https://github.com/PaddlePaddle/Paddle/pull/39900), [#39747](https://github.com/PaddlePaddle/Paddle/pull/39747), [#39148](https://github.com/PaddlePaddle/Paddle/pull/39148), [#39416](https://github.com/PaddlePaddle/Paddle/pull/39416)) -- Add new optimizer-related APIs ([#40710](https://github.com/PaddlePaddle/Paddle/pull/40710)) - - - `paddle.incubate.optimizer.functional.minimize_bfgs`, add second-order optimizer BFGS. - - - `paddle.incubate.optimizer.functional.minimize_lbfgs`, add second-order optimizer L-BFGS. - -- Add `paddle.incubate.multiprocessing` module, to provide Tensor (CPU/GPU) data transfer between Python processes. 
([#37302](https://github.com/PaddlePaddle/Paddle/pull/37302), [#41339](https://github.com/PaddlePaddle/Paddle/pull/41339)) - -- Add `paddle.incubate.autotune.set_config` API, to support multi-version Kernel auto-selection, mixed precision data layout auto-conversion, and num_workers auto-selection for DataLoader to automatically improve model performance. ([#42301](https://github.com/PaddlePaddle/Paddle/pull/42301)) - -- Add `paddle.incubate.nn.FusedMultiTransformer` and `paddle.incubate.nn.functional.fused_multi_transformer` API, to fuse multiple layers of transformers into a single op to improve model inference performance. It should be noted that only forward is supported. ([#42311](https://github.com/PaddlePaddle/Paddle/pull/42311)) - -- Add einsum_v2 operators for consistent interface between dynamic graph mode and static graph mode. It is compatible with the `paddle.einsum` implementation at the original python side, while supporting dynamic to static export and more complete Infershape inference. ([#42495](https://github.com/PaddlePaddle/Paddle/pull/42495), [#42327](https://github.com/PaddlePaddle/Paddle/pull/42327), [#42397](https://github.com/PaddlePaddle/Paddle/pull/42397), [#42105](https://github.com/PaddlePaddle/Paddle/pull/42105)) - - -#### IR(Intermediate Representation) - -- Dynamic graph to static graph - - - For the variable type StaticAnalysis module, add support for type tag similar to `a, b = paddle.shape(x)`. ([#39245](https://github.com/PaddlePaddle/Paddle/pull/39245)) - - - Add a computed field, supporting `InputSpec.name` as the Program cache hash key. ([#38273](https://github.com/PaddlePaddle/Paddle/pull/38273)) - - - Add syntax for supporting `dict['key'] = x.shape`. ([#40611](https://github.com/PaddlePaddle/Paddle/pull/40611)) - - - Add the support for Pure FP16 training. ([#36944](https://github.com/PaddlePaddle/Paddle/pull/36944)) - - - Add the support `for i in [x,y,z]` syntax. 
([#37259](https://github.com/PaddlePaddle/Paddle/pull/37259)) - - - Add the support for type hint syntax of python3. ([#36544](https://github.com/PaddlePaddle/Paddle/pull/36544)) - -- Pass development - - - Add forward and backward fusion for FC + [relu|gelu] based on NVIDIA cuBlasLt Epilogue. ([#39437](https://github.com/PaddlePaddle/Paddle/pull/39437)) -- Kernel Primitive API - - - Add KP operators on GPU platform, including cast, scale, clip, bce_loss, abs_grad, reduce_sum_grad, reduce_mean_grad, clip, bce_loss, full, full_like, distribution, random, masked_select_kernel, where_index, masked_select_grad, dropout, sigmoid, where, and abs_grad. ([#36203](https://github.com/PaddlePaddle/Paddle/pull/36203), [#36423](https://github.com/PaddlePaddle/Paddle/pull/36423), [#39390](https://github.com/PaddlePaddle/Paddle/pull/39390), [#39734](https://github.com/PaddlePaddle/Paddle/pull/39734), [#38500](https://github.com/PaddlePaddle/Paddle/pull/38500), [#38959](https://github.com/PaddlePaddle/Paddle/pull/38959), [#39197](https://github.com/PaddlePaddle/Paddle/pull/39197/), [#39563](https://github.com/PaddlePaddle/Paddle/pull/39563), [#39666](https://github.com/PaddlePaddle/Paddle/pull/39666), [#40517](https://github.com/PaddlePaddle/Paddle/pull/40517), [#40617](https://github.com/PaddlePaddle/Paddle/pull/40617), [#40766](https://github.com/PaddlePaddle/Paddle/pull/40766), [#39898](https://github.com/PaddlePaddle/Paddle/pull/39898), [#39609](https://github.com/PaddlePaddle/Paddle/pull/39609)) - - - Add the support for XPU2 source code compilation mode. 
([#37254](https://github.com/PaddlePaddle/Paddle/pull/37254), [#40397](https://github.com/PaddlePaddle/Paddle/pull/40397), [#38455](https://github.com/PaddlePaddle/Paddle/pull/38455)) - - - Add the support for KP operator reuse on XPU2 and GPU, including reduce, broadcast, elementwise_add, exp, log, relu, sigmoid, leaky_relu, softplus, hard_swish, and reciprocal. ([#36904](https://github.com/PaddlePaddle/Paddle/pull/36904), [#37226](https://github.com/PaddlePaddle/Paddle/pull/37226), [#38918](https://github.com/PaddlePaddle/Paddle/pull/38918), [#40560](https://github.com/PaddlePaddle/Paddle/pull/40560/), [#39787](https://github.com/PaddlePaddle/Paddle/pull/39787), [#39917](https://github.com/PaddlePaddle/Paddle/pull/39917), [#40002](https://github.com/PaddlePaddle/Paddle/pull/40002), [#40364](https://github.com/PaddlePaddle/Paddle/pull/40364)) - - - Add unit tests of KP operators on the XPU2 platform, including brelu, ceil, celu, elu, floor, hard_shrink, hard_sigmoid, log1p, logsigmoid, relu6, silu, soft_relu, softsign, sqrt, square, swish, thresholded_relu, and softshrink. ([#40448](https://github.com/PaddlePaddle/Paddle/pull/40448), [#40524](https://github.com/PaddlePaddle/Paddle/pull/40524)) - - - Add the support for XPU2 KP models, including resnet50, deepfm, wide_deep, yolov3-darknet53, det_mv3_db, bert, transformer, mobilenet_v3, and GPT2. - - -#### **Mixed Precision Training** - -- Split the `paddle.amp.GradScaler.unscale_` method from the `minimize` of the mixed precision training `paddle.amp.GradScaler`, to provide a separate interface for recovering the loss. ([#35825](https://github.com/PaddlePaddle/Paddle/pull/35825)) - -- Add the FP16 support for `paddle.nn.ClipByGlobalNorm` dynamic graph mode. Add FP16 Kernel for clip op to enable clip-related operations to support FP16 compute. 
([#36198](https://github.com/PaddlePaddle/Paddle/pull/36198), [#36577](https://github.com/PaddlePaddle/Paddle/pull/36577)) - -- Support the case that the `optimizer` parameter passed to `paddle.amp.decorate` is `None`. ([#37541](https://github.com/PaddlePaddle/Paddle/pull/37541)) - -- For the merged_momentum op, add support for multiple input learning rates, for computing with the use_nesterov policy, and for regularization computing. ([#37527](https://github.com/PaddlePaddle/Paddle/pull/37527)) - -- Add multi_tensor policy to `paddle.optimizer.Momentum` optimizer. Add `set_to_zero` branch to `clear_grad` of `Optimizer` class. ([#37564](https://github.com/PaddlePaddle/Paddle/pull/37564)) - -- Add multi_tensor policy to `paddle.optimizer.Adam`. ([#38010](https://github.com/PaddlePaddle/Paddle/pull/38010)) - -- Add multi_precision policy to `paddle.optimizer.SGD` optimizer. ([#38231](https://github.com/PaddlePaddle/Paddle/pull/38231)) - -- Add the storage `master weight` parameter to the optimizer `state_dict` method. ([#39121](https://github.com/PaddlePaddle/Paddle/pull/39121)) - -- Add support for op CUDA bfloat16 mixed precision training. Support for O1 and O2 modes. Enable the above training modes via `paddle.amp.auto_cast`. ([#39029](https://github.com/PaddlePaddle/Paddle/pull/39029), [#39815](https://github.com/PaddlePaddle/Paddle/pull/39815)) - -- Add bfloat16 CUDA Kernel for the following ops: matmul, concat, split, dropout, reshape, slice, squeeze, stack, transpose, unbind, elementwise_max, elementwise_add, elementwise_mul, elementwise_sub, scale, sum, layer_norm, p_norm, reduce_sum, softmax, log_softmax, sigmoid, sqrt, softplus, square, gaussian_random, fill_constant, and fill_any_like. 
([#39485](https://github.com/PaddlePaddle/Paddle/pull/39485), [#39380](https://github.com/PaddlePaddle/Paddle/pull/39380), [#39395](https://github.com/PaddlePaddle/Paddle/pull/39380), [#39402](https://github.com/PaddlePaddle/Paddle/pull/39402), [#39457](https://github.com/PaddlePaddle/Paddle/pull/39457), [#39461](https://github.com/PaddlePaddle/Paddle/pull/39461), [#39602](https://github.com/PaddlePaddle/Paddle/pull/39602), [#39716](https://github.com/PaddlePaddle/Paddle/pull/39716), [#39683](https://github.com/PaddlePaddle/Paddle/pull/39683), [#39843](https://github.com/PaddlePaddle/Paddle/pull/39843), [#39999](https://github.com/PaddlePaddle/Paddle/pull/39999), [#40004](https://github.com/PaddlePaddle/Paddle/pull/40004), [#40027](https://github.com/PaddlePaddle/Paddle/pull/40027)) - -- Add bfloat16 CPU Kernel for the following ops: dropout, reshape, slice, squeeze, unsqueeze, stack, transpose, unbind, elementwise_max, elementwise_mul, elementwise_sub, and gather. ([#39380](https://github.com/PaddlePaddle/Paddle/pull/39380), [#39395](https://github.com/PaddlePaddle/Paddle/pull/39380), [#39402](https://github.com/PaddlePaddle/Paddle/pull/39402), [#39457](https://github.com/PaddlePaddle/Paddle/pull/39457), [#39461](https://github.com/PaddlePaddle/Paddle/pull/39461), [#39602](https://github.com/PaddlePaddle/Paddle/pull/39602), [#39716](https://github.com/PaddlePaddle/Paddle/pull/39716), [#39683](https://github.com/PaddlePaddle/Paddle/pull/39683)) - -- Support printing of Tensor with data of bfloat16. ([#39375](https://github.com/PaddlePaddle/Paddle/pull/39375), [#39370](https://github.com/PaddlePaddle/Paddle/pull/39370)) - -- Add support for FP16 computation for `p_norm`, `elementwise_max`, `fill_constant_batch_size_like`, and `scatter`. 
([#35888](https://github.com/PaddlePaddle/Paddle/pull/35888), [#39907](https://github.com/PaddlePaddle/Paddle/pull/39907), [#38136](https://github.com/PaddlePaddle/Paddle/pull/38136), [#38499](https://github.com/PaddlePaddle/Paddle/pull/38499)) - -- Add support for int16_t for the following ops: cumsum, less_than, less_equal, greater_than, greater_equal, equal, not_equal, fill_any_like, gather_nd, reduce_sum, where_index, reshape, and unsqueeze. ([#39636](https://github.com/PaddlePaddle/Paddle/pull/39636)) - -- Add support for int16_t label type for cross_entropy op. ([#39409](https://github.com/PaddlePaddle/Paddle/pull/39409)) - -- Add support for int16_t id type for embedding op. ([#39381](https://github.com/PaddlePaddle/Paddle/pull/39381)) - -- Add support for FP16 type for reduce_mean op. ([#38289](https://github.com/PaddlePaddle/Paddle/pull/38289)) - -- Add support for FP16 type for elementwise_min op. ([#38123](https://github.com/PaddlePaddle/Paddle/pull/38123)) - -- Update bfloat16 AMP oneDNN default support list. ([#39304](https://github.com/PaddlePaddle/Paddle/pull/39304)) - - -#### **Paddle HIgh reusability operator library** - -We announce PHI as the new Paddle HIgh reusability operator library. PHI provides a Primitive API, enabling kernel reuse for operator development. As a refactored functional operator library, PHI aims to solve legacy problems that harm the framework's performance and reusability, particularly in operator development. Such problems include inefficient cross-operator calls, unclear operator interfaces, and the lack of direct calls to the operator library from C++. With PHI, new operators can be easily implemented by composing functions available in the functional library. The library provides over 200 C++ operator class APIs and nearly 500 kernels. Composing new operators through these built-in functions can greatly reduce the user's development effort. PHI supports different types of hardware (e.g., GPU and XPU). 
In addition, PHI is extensible with plugins for accommodating third-party accelerators (such as NPU) in a low-cost and reusable fashion. In short, PHI supports low-level operator composability, the reuse of kernels through Primitives, and accelerators through plugins. The main work consists of the following six parts: - -- **The implementation of the operator library infrastructure, core components and mechanisms**: Plan the directory structure of the new operator library; design and implement the common base data structures of the new operator library, the new functional InferMeta and Kernel development paradigm, and the corresponding registration and management components. Support the automated compilation object generation and compilation dependency generation of Kernel files, allowing developers to focus only on the functional Kernel implementation, and making the development paradigm clear and concise. ([#34425](https://github.com/PaddlePaddle/Paddle/pull/34425), [#37107](https://github.com/PaddlePaddle/Paddle/pull/37107), [#36946](https://github.com/PaddlePaddle/Paddle/pull/36946), [#36948](https://github.com/PaddlePaddle/Paddle/pull/36948), [#37876](https://github.com/PaddlePaddle/Paddle/pull/37876), [#37916](https://github.com/PaddlePaddle/Paddle/pull/37916), [#37977](https://github.com/PaddlePaddle/Paddle/pull/37977), [#38078](https://github.com/PaddlePaddle/Paddle/pull/38078), [#38861](https://github.com/PaddlePaddle/Paddle/pull/38861), [#39123](https://github.com/PaddlePaddle/Paddle/pull/39123), [#39131](https://github.com/PaddlePaddle/Paddle/pull/39131), [#39748](https://github.com/PaddlePaddle/Paddle/pull/39748), [#39790](https://github.com/PaddlePaddle/Paddle/pull/39790), [#39941](https://github.com/PaddlePaddle/Paddle/pull/39941), [#40239](https://github.com/PaddlePaddle/Paddle/pull/40239), [#40635](https://github.com/PaddlePaddle/Paddle/pull/40635), [#41091](https://github.com/PaddlePaddle/Paddle/pull/41091), 
[#37409](https://github.com/PaddlePaddle/Paddle/pull/37409), [#37942](https://github.com/PaddlePaddle/Paddle/pull/37942), [#39002](https://github.com/PaddlePaddle/Paddle/pull/39002), [#38109](https://github.com/PaddlePaddle/Paddle/pull/38109), [#37881](https://github.com/PaddlePaddle/Paddle/pull/37881), [#37517](https://github.com/PaddlePaddle/Paddle/pull/37517), [#39870](https://github.com/PaddlePaddle/Paddle/pull/39870), [#40975](https://github.com/PaddlePaddle/Paddle/pull/40975), [#39475](https://github.com/PaddlePaddle/Paddle/pull/39475), [#37304](https://github.com/PaddlePaddle/Paddle/pull/37304), #36910, #37120, #37146, #37215, #37255, #37369, #38258, #38257, #38355, #38853, #38937, #38977, #38946, #39085, #39153, #39228, #38301, #38275, #38506, #38607, #38473, #38632, #38811, #38880, #38996, #38914, #39101) - -- **Operator library C++ API system construction**: design and implement yaml configuration file-based operator definition paradigm, to automatically generate more than 200 C++ operator class APIs for internal and external developers to reuse. This reduces the cost of repeated development of basic operators. 
([#37668](https://github.com/PaddlePaddle/Paddle/pull/37668), [#36938](https://github.com/PaddlePaddle/Paddle/pull/36938), [#38172](https://github.com/PaddlePaddle/Paddle/pull/38172), [#38182](https://github.com/PaddlePaddle/Paddle/pull/38182), [#38311](https://github.com/PaddlePaddle/Paddle/pull/38311), [#38438](https://github.com/PaddlePaddle/Paddle/pull/38438), [#39057](https://github.com/PaddlePaddle/Paddle/pull/39057), [#39229](https://github.com/PaddlePaddle/Paddle/pull/39229), [#39281](https://github.com/PaddlePaddle/Paddle/pull/39281), [#39263](https://github.com/PaddlePaddle/Paddle/pull/39263), [#39408](https://github.com/PaddlePaddle/Paddle/pull/39408), [#39436](https://github.com/PaddlePaddle/Paddle/pull/39436), [#39482](https://github.com/PaddlePaddle/Paddle/pull/39482), [#39497](https://github.com/PaddlePaddle/Paddle/pull/39497), [#39651](https://github.com/PaddlePaddle/Paddle/pull/39651), [#39521](https://github.com/PaddlePaddle/Paddle/pull/39521), [#39760](https://github.com/PaddlePaddle/Paddle/pull/39760), [#40060](https://github.com/PaddlePaddle/Paddle/pull/40060), [#40196](https://github.com/PaddlePaddle/Paddle/pull/40196), [#40218](https://github.com/PaddlePaddle/Paddle/pull/40218), [#40640](https://github.com/PaddlePaddle/Paddle/pull/40640), [#40732](https://github.com/PaddlePaddle/Paddle/pull/40732), [#40729](https://github.com/PaddlePaddle/Paddle/pull/40729), [#40840](https://github.com/PaddlePaddle/Paddle/pull/40840), [#40867](https://github.com/PaddlePaddle/Paddle/pull/40867), [#41025](https://github.com/PaddlePaddle/Paddle/pull/41025), [#41368](https://github.com/PaddlePaddle/Paddle/pull/41368)) - -- **Operator library compatible with various execution systems**: Implement new InferMeta and Kernel to access the original dynamic and static graph execution system. Support the safe removal of the original OpKernel registration and migration to the new Kernel form. 
([#34425](https://github.com/PaddlePaddle/Paddle/pull/34425), [#38825](https://github.com/PaddlePaddle/Paddle/pull/38825), [#38837](https://github.com/PaddlePaddle/Paddle/pull/38837), [#38842](https://github.com/PaddlePaddle/Paddle/pull/38842), [#38976](https://github.com/PaddlePaddle/Paddle/pull/38976), [#39134](https://github.com/PaddlePaddle/Paddle/pull/39134), [#39140](https://github.com/PaddlePaddle/Paddle/pull/39140), [#39135](https://github.com/PaddlePaddle/Paddle/pull/39135), [#39252](https://github.com/PaddlePaddle/Paddle/pull/39252), [#39222](https://github.com/PaddlePaddle/Paddle/pull/39222), [#39351](https://github.com/PaddlePaddle/Paddle/pull/39351)) - -- **Decouple the underlying data structures and tool functions of the operator library from the framework**: Relieve PHI's dependence on the framework for core data structures, lay the foundation for subsequent independent compilation of PHI, and support infrt, custom Kernel, and a series of Phi-based construction work ([#38583](https://github.com/PaddlePaddle/Paddle/pull/38583), [#39188](https://github.com/PaddlePaddle/Paddle/pull/39188), [#39560](https://github.com/PaddlePaddle/Paddle/pull/39560), [#39931](https://github.com/PaddlePaddle/Paddle/pull/39931), [#39169](https://github.com/PaddlePaddle/Paddle/pull/39169), [#38951](https://github.com/PaddlePaddle/Paddle/pull/38951), [#38898](https://github.com/PaddlePaddle/Paddle/pull/38898), [#38873](https://github.com/PaddlePaddle/Paddle/pull/38873), [#38696](https://github.com/PaddlePaddle/Paddle/pull/38696), [#38651](https://github.com/PaddlePaddle/Paddle/pull/38651), [#39359](https://github.com/PaddlePaddle/Paddle/pull/39359), [#39305](https://github.com/PaddlePaddle/Paddle/pull/39305), [#39234](https://github.com/PaddlePaddle/Paddle/pull/39234), [#39098](https://github.com/PaddlePaddle/Paddle/pull/39098), [#39120](https://github.com/PaddlePaddle/Paddle/pull/39120), [#38979](https://github.com/PaddlePaddle/Paddle/pull/38979), 
[#38899](https://github.com/PaddlePaddle/Paddle/pull/38899), [#38844](https://github.com/PaddlePaddle/Paddle/pull/38844), [#39714](https://github.com/PaddlePaddle/Paddle/pull/39714), [#39729](https://github.com/PaddlePaddle/Paddle/pull/39729), [#39889](https://github.com/PaddlePaddle/Paddle/pull/39889), [#39587](https://github.com/PaddlePaddle/Paddle/pull/39587), [#39558](https://github.com/PaddlePaddle/Paddle/pull/39558), [#39514](https://github.com/PaddlePaddle/Paddle/pull/39514), [#39502](https://github.com/PaddlePaddle/Paddle/pull/39502), [#39300](https://github.com/PaddlePaddle/Paddle/pull/39300), [#39246](https://github.com/PaddlePaddle/Paddle/pull/39246), [#39124](https://github.com/PaddlePaddle/Paddle/pull/39124)) - -- **Integration between custom operator mechanism and Phi with improvement**: support for calling over 200 C++ operator class APIs automatically generated by PHI when writing custom operators. This reduces custom operator development costs. A series of bugs are fixed. ([#37122](https://github.com/PaddlePaddle/Paddle/pull/37122), [#37276](https://github.com/PaddlePaddle/Paddle/pull/37276), [#37281](https://github.com/PaddlePaddle/Paddle/pull/37281), [#37262](https://github.com/PaddlePaddle/Paddle/pull/37281), [#37415](https://github.com/PaddlePaddle/Paddle/pull/37415), [#37423](https://github.com/PaddlePaddle/Paddle/pull/37423), [#37583](https://github.com/PaddlePaddle/Paddle/pull/37683), [#38776](https://github.com/PaddlePaddle/Paddle/pull/38776), [#39353](https://github.com/PaddlePaddle/Paddle/pull/39353), [#41072](https://github.com/PaddlePaddle/Paddle/pull/41072)) - -- **Operator scale migration and refactoring**: migrate about 250 high-frequency forward and backward operator Kernel to the new operator library and refactor them as a single function. Achieve the high-performance operator by encapsulating multiple base Kernel functions on the C++ side for the fast combination. 
Meanwhile, add the corresponding yaml operator definition, and access to the new dynamic graph execution system to improve the python API scheduling performance. The migrated and refactored operators include: - - - sqrt ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - square([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - sin ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - sinh ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - elementwise_fmax([#40140](https://github.com/PaddlePaddle/Paddle/pull/40140)) - - - elementwise_fmin([#40140](https://github.com/PaddlePaddle/Paddle/pull/40140)) - - - pool2d([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - max_pool2d_with_index([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - pool3d([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - max_pool3d_with_index([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - fill_constant ([#36930](https://github.com/PaddlePaddle/Paddle/pull/36930), [#39465](https://github.com/PaddlePaddle/Paddle/pull/39465)) - - - p_norm ([#40819](https://github.com/PaddlePaddle/Paddle/pull/40819)) - - - fill_constant_batch_size_like ([#40784](https://github.com/PaddlePaddle/Paddle/pull/40784)) - - - conv2d([#39354](https://github.com/PaddlePaddle/Paddle/pull/39354)) - - - conv2d_transpose([#40675](https://github.com/PaddlePaddle/Paddle/pull/40675), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - conv3d([#39354](https://github.com/PaddlePaddle/Paddle/pull/39354)) - - - conv3d_transpose([#40675](https://github.com/PaddlePaddle/Paddle/pull/40675), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - 
mish([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - gather_nd ([#40090](https://github.com/PaddlePaddle/Paddle/pull/40090), [#40043](https://github.com/PaddlePaddle/Paddle/pull/40043)) - - - gather ([#40500](https://github.com/PaddlePaddle/Paddle/pull/40500)) - - - scatter ([#40090](https://github.com/PaddlePaddle/Paddle/pull/40090), [#40043](https://github.com/PaddlePaddle/Paddle/pull/40043)) - - - scatter_nd_add ([#40090](https://github.com/PaddlePaddle/Paddle/pull/40090), [#40043](https://github.com/PaddlePaddle/Paddle/pull/40043)) - - - sgd([40045](https://github.com/PaddlePaddle/Paddle/pull/40045)) - - - momentum ([#41319](https://github.com/PaddlePaddle/Paddle/pull/41319)) - - - rmsprop([#40994](https://github.com/PaddlePaddle/Paddle/pull/40994)) - - - index_sample([#38130](https://github.com/PaddlePaddle/Paddle/pull/38130), [#38459](https://github.com/PaddlePaddle/Paddle/pull/38459),[#39905](https://github.com/PaddlePaddle/Paddle/pull/39905)) - - - adam ([#40351](https://github.com/PaddlePaddle/Paddle/pull/40351)) - - - layer_norm([#40193](https://github.com/PaddlePaddle/Paddle/pull/40193)) - - - adagrad([#40994](https://github.com/PaddlePaddle/Paddle/pull/40994/)) - - - adamax ([#40173](https://github.com/PaddlePaddle/Paddle/pull/40173)) - - - adadelta ([#40173](https://github.com/PaddlePaddle/Paddle/pull/40173)) - - - clip([#40602](https://github.com/PaddlePaddle/Paddle/pull/40602), [#41661](https://github.com/PaddlePaddle/Paddle/pull/41661), [#41675](https://github.com/PaddlePaddle/Paddle/pull/41675)) - - - ceil ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913)) - - - cos ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - atan ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - cosh ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - erf([#40388](https://github.com/PaddlePaddle/Paddle/pull/40388)) - - - asin ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - 
acos ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - scale ([#39278](https://github.com/PaddlePaddle/Paddle/pull/39278)) - - - elementwise_pow ([#40993](https://github.com/PaddlePaddle/Paddle/pull/40993)) - - - elementwise_sub ([#39225](https://github.com/PaddlePaddle/Paddle/pull/39225), [#37260](https://github.com/PaddlePaddle/Paddle/pull/37260)) - - - round ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913)) - - - floor ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913)) - - - pow ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913)) - - - elementwise_floordiv ([#40993](https://github.com/PaddlePaddle/Paddle/pull/40993)) - - - reciprocal([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - log1p ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785)) - - - allclose ([#40469](https://github.com/PaddlePaddle/Paddle/pull/40469)) - - - mul ([#40833](https://github.com/PaddlePaddle/Paddle/pull/40833)) - - - elementwise_max ([#40590](https://github.com/PaddlePaddle/Paddle/pull/40590)) - - - elementwise_min ([#40590](https://github.com/PaddlePaddle/Paddle/pull/40590)) - - - elementwise_mod ([#40590](https://github.com/PaddlePaddle/Paddle/pull/40590)) - - - elementwise_add ([#39048](https://github.com/PaddlePaddle/Paddle/pull/39048), [#37043](https://github.com/PaddlePaddle/Paddle/pull/37043)) - - - matmul_v2 ([#36844](https://github.com/PaddlePaddle/Paddle/pull/36844), [#38713](https://github.com/PaddlePaddle/Paddle/pull/38713)) - - - elementwise_mul ([#41042](https://github.com/PaddlePaddle/Paddle/pull/41042), [#40252](https://github.com/PaddlePaddle/Paddle/pull/40252), [#37471](https://github.com/PaddlePaddle/Paddle/pull/37471)) - - - elementwise_div ([#40172](https://github.com/PaddlePaddle/Paddle/pull/40172), [#40039](https://github.com/PaddlePaddle/Paddle/pull/40039), [#37418](https://github.com/PaddlePaddle/Paddle/pull/37418)) - - - SelectedRows 
([#39037](https://github.com/PaddlePaddle/Paddle/pull/39037), [#39087](https://github.com/PaddlePaddle/Paddle/pull/39087), [#39128](https://github.com/PaddlePaddle/Paddle/pull/39128), [#39162](https://github.com/PaddlePaddle/Paddle/pull/39162), [#39236](https://github.com/PaddlePaddle/Paddle/pull/39236)) - - - fill_any_like ([#39807](https://github.com/PaddlePaddle/Paddle/pull/39807)) - - - dot([#38359](https://github.com/PaddlePaddle/Paddle/pull/38359)) - - - sum ([#40873](https://github.com/PaddlePaddle/Paddle/pull/40873)) - - - cumsum ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200)) - - - diag_v2 ([#39914](https://github.com/PaddlePaddle/Paddle/pull/39914)) - - - auc ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200)) - - - log_loss ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200)) - - - one_hot_v2([39876](https://github.com/PaddlePaddle/Paddle/pull/39876)) - - - sigmoid_cross_entropy_with_logits ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200)) - - - bce_loss ([#39868](https://github.com/PaddlePaddle/Paddle/pull/39868)) - - - argsort ([#40151](https://github.com/PaddlePaddle/Paddle/pull/40151)) - - - arg_max ([#40222](https://github.com/PaddlePaddle/Paddle/pull/40222)) - - - arg_min ([#40222](https://github.com/PaddlePaddle/Paddle/pull/40222)) - - - segment_pool ([#40099](https://github.com/PaddlePaddle/Paddle/pull/40099)) - - - frobenius_norm([#40707](https://github.com/PaddlePaddle/Paddle/pull/40707), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - dist ([#40178](https://github.com/PaddlePaddle/Paddle/pull/40178)) - - - isnan_v2 ([#40076](https://github.com/PaddlePaddle/Paddle/pull/40076)) - - - logical_and 
([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942)) - - - logical_not ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942)) - - - isfinite_v2 ([#40076](https://github.com/PaddlePaddle/Paddle/pull/40076)) - - - logical_or ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942)) - - - isinf_v2 ([#40076](https://github.com/PaddlePaddle/Paddle/pull/40076)) - - - is_empty ([#39919](https://github.com/PaddlePaddle/Paddle/pull/39919)) - - - logical_xor ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942)) - - - less_than([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970)) - - - not_equal([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970)) - - - equal([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970)) - - - less_equal([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970)) - - - equal_all([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970)) - - - uniform_random ([#39937](https://github.com/PaddlePaddle/Paddle/pull/39937)) - - - randint ([#39876](https://github.com/PaddlePaddle/Paddle/pull/39876), [#41375](https://github.com/PaddlePaddle/Paddle/pull/41375)) - - - randperm ([#41265](https://github.com/PaddlePaddle/Paddle/pull/41265)) - - - unbind ([#39789](https://github.com/PaddlePaddle/Paddle/pull/39789)) - - - bernoulli ([#39590](https://github.com/PaddlePaddle/Paddle/pull/39590)) - - - increment ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913)) - - - multinomial ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913)) - - - addmm ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913)) - - - cholesky ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913)) - - - where ([#39811](https://github.com/PaddlePaddle/Paddle/pull/39811)) - - - 
log10 ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785)) - - - log2 ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785)) - - - expm1([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - atan2 ([#39806](https://github.com/PaddlePaddle/Paddle/pull/39806)) - - - gaussian_random ([#39932](https://github.com/PaddlePaddle/Paddle/pull/39932), [#40122](https://github.com/PaddlePaddle/Paddle/pull/40122), [#40191](https://github.com/PaddlePaddle/Paddle/pull/40191)) - - - empty ([#38334](https://github.com/PaddlePaddle/Paddle/pull/38334)) - - - truncated_gaussian_random ([#39971](https://github.com/PaddlePaddle/Paddle/pull/39971), [#40191](https://github.com/PaddlePaddle/Paddle/pull/40191)) - - - mv ([#39861](https://github.com/PaddlePaddle/Paddle/pull/39861), [#39954](https://github.com/PaddlePaddle/Paddle/pull/39954)) - - - tan ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - set_value ([#40195](https://github.com/PaddlePaddle/Paddle/pull/40195), [#40478](https://github.com/PaddlePaddle/Paddle/pull/40478), [#40636](https://github.com/PaddlePaddle/Paddle/pull/40636)) - - - bitwise_and ([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031)) - - - bitwise_not([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031)) - - - bitwise_or([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031)) - - - poisson([#39814](https://github.com/PaddlePaddle/Paddle/pull/39814)) - - - cholesky_solve([#40387](https://github.com/PaddlePaddle/Paddle/pull/40387)) - - - bitwise_xor([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031)) - - - triangular_solve([#40417](https://github.com/PaddlePaddle/Paddle/pull/40417)) - - - sigmoid ([#40626](https://github.com/PaddlePaddle/Paddle/pull/40626)) - - - atanh ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - softsign([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - thresholded_relu ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385)) - - 
- tanh_shrink ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565)) - - - stanh([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - reduce_mean ([#37559](https://github.com/PaddlePaddle/Paddle/pull/37559)) - - - reduce_max([#40225](https://github.com/PaddlePaddle/Paddle/pull/40225)) - - - reduce_min ([#40374](https://github.com/PaddlePaddle/Paddle/pull/40374)) - - - mean ([#40872](https://github.com/PaddlePaddle/Paddle/pull/40872), [#41319](https://github.com/PaddlePaddle/Paddle/pull/41319)) - - - reduce_all ([#40374](https://github.com/PaddlePaddle/Paddle/pull/40374)) - - - reduce_any ([#40374](https://github.com/PaddlePaddle/Paddle/pull/40374)) - - - logsumexp ([#40790](https://github.com/PaddlePaddle/Paddle/pull/40790)) - - - softshrink([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565)) - - - range ([#41265](https://github.com/PaddlePaddle/Paddle/pull/41265), [#40581](https://github.com/PaddlePaddle/Paddle/pull/40851)) - - - stack([#40581](https://github.com/PaddlePaddle/Paddle/pull/40851)) - - - tile ([#40371](https://github.com/PaddlePaddle/Paddle/pull/40371)) - - - unique([#40581](https://github.com/PaddlePaddle/Paddle/pull/40851)) - - - unstack([#40581](https://github.com/PaddlePaddle/Paddle/pull/40851)) - - - slice([#40736](https://github.com/PaddlePaddle/Paddle/pull/40736)) - - - transpose2([#39327](https://github.com/PaddlePaddle/Paddle/pull/39327)) - - - unsqueeze2( [#40596](https://github.com/PaddlePaddle/Paddle/pull/40596)) - - - squeeze2( [#40596](https://github.com/PaddlePaddle/Paddle/pull/40596)) - - - strided_slice ([#40708](https://github.com/PaddlePaddle/Paddle/pull/40708)) - - - softmax ([#39547](https://github.com/PaddlePaddle/Paddle/pull/39547)) - - - leaky_relu ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385)) - - - gelu ([#40393](https://github.com/PaddlePaddle/Paddle/pull/40393)) - - - prelu ([#40393](https://github.com/PaddlePaddle/Paddle/pull/40393)) - - - log_softmax 
([#40393](https://github.com/PaddlePaddle/Paddle/pull/40393)) - - - elu ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565)) - - - logsigmoid ([#40626](https://github.com/PaddlePaddle/Paddle/pull/40626)) - - - psroi_pool ([#40353](https://github.com/PaddlePaddle/Paddle/pull/40353), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173)) - - - kthvalue([#40575](https://github.com/PaddlePaddle/Paddle/pull/40575)) - - - mode ([#40571](https://github.com/PaddlePaddle/Paddle/pull/40571)) - - - yolo_box([#40112](https://github.com/PaddlePaddle/Paddle/pull/40112)) - - - yolov3_loss ([#40944](https://github.com/PaddlePaddle/Paddle/pull/40944)) - - - temporal_shift([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - depthwise_conv2d([#39354](https://github.com/PaddlePaddle/Paddle/pull/39354)) - - - pad3d ([#40701](https://github.com/PaddlePaddle/Paddle/pull/40701)) - - - pad( [#40012](https://github.com/PaddlePaddle/Paddle/pull/40012)) - - - greater_equal([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970)) - - - kldiv_loss ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770)) - - - isclose ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770)) - - - silu ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565)) - - - unfold ([#39778](https://github.com/PaddlePaddle/Paddle/pull/39778)) - - - batch_norm([39347](https://github.com/PaddlePaddle/Paddle/pull/39347)) - - - norm([#39324](https://github.com/PaddlePaddle/Paddle/pull/39324)) - - - roi_pool ([#40574](https://github.com/PaddlePaddle/Paddle/pull/40574), [#40682](https://github.com/PaddlePaddle/Paddle/pull/40682), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173)) - - - roi_align ([#40382](https://github.com/PaddlePaddle/Paddle/pull/40382), [#40556](https://github.com/PaddlePaddle/Paddle/pull/40556), [#41402](https://github.com/PaddlePaddle/Paddle/pull/41402)) - - - deformable_conv ([#40700](https://github.com/PaddlePaddle/Paddle/pull/40700), 
[#40794](https://github.com/PaddlePaddle/Paddle/pull/40794), [#41644](https://github.com/PaddlePaddle/Paddle/pull/41644)) - - - deformable_conv_v1 ([#40794](https://github.com/PaddlePaddle/Paddle/pull/40794), [#41644](https://github.com/PaddlePaddle/Paddle/pull/41644)) - - - label_smooth ([#39796](https://github.com/PaddlePaddle/Paddle/pull/39796)) - - - grid_sampler ([#40585](https://github.com/PaddlePaddle/Paddle/pull/40585)) - - - greater_than ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970)) - - - pixel_shuffle ([#39949](https://github.com/PaddlePaddle/Paddle/pull/39949), [#39712](https://github.com/PaddlePaddle/Paddle/pull/39712)) - - - nearest_interp_v2 ([#40855](https://github.com/PaddlePaddle/Paddle/pull/40855)) - - - bilinear_interp_v2 ([#40855](https://github.com/PaddlePaddle/Paddle/pull/40855)) - - - softmax_with_cross_entropy ([#40832](https://github.com/PaddlePaddle/Paddle/pull/40832)) - - - rnn ([#41007](https://github.com/PaddlePaddle/Paddle/pull/41007)) - - - reverse ([#40791](https://github.com/PaddlePaddle/Paddle/pull/40791)) - - - trace ([#39510](https://github.com/PaddlePaddle/Paddle/pull/39510)) - - - kron ([#40427](https://github.com/PaddlePaddle/Paddle/pull/40427)) - - - accuracy ([#39982](https://github.com/PaddlePaddle/Paddle/pull/39982)) - - - gather_tree ([#40082](https://github.com/PaddlePaddle/Paddle/pull/40082), [#39844](https://github.com/PaddlePaddle/Paddle/pull/39844)) - - - dropout ([#40148](https://github.com/PaddlePaddle/Paddle/pull/40148)) - - - bincount ([#39947](https://github.com/PaddlePaddle/Paddle/pull/39947)) - - - warpctc ([#41389](https://github.com/PaddlePaddle/Paddle/pull/41389), [#40023](https://github.com/PaddlePaddle/Paddle/pull/40023)) - - - multiplex ([#40007](https://github.com/PaddlePaddle/Paddle/pull/40007), [#40102](https://github.com/PaddlePaddle/Paddle/pull/40102)) - - - qr([#40007](https://github.com/PaddlePaddle/Paddle/pull/40007), 
[#40007](https://github.com/PaddlePaddle/Paddle/pull/40007)) - - - assign_value ([#40967](https://github.com/PaddlePaddle/Paddle/pull/40967)) - - - assign ([#40022](https://github.com/PaddlePaddle/Paddle/pull/40022)) - - - cast ([#37610](https://github.com/PaddlePaddle/Paddle/pull/37610)) - - - tril_triu([#40007](https://github.com/PaddlePaddle/Paddle/pull/40007), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - where_index ([#40255](https://github.com/PaddlePaddle/Paddle/pull/40255)) - - - index_select ([#40260](https://github.com/PaddlePaddle/Paddle/pull/40260), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - roll ([#40257](https://github.com/PaddlePaddle/Paddle/pull/40257), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - cumprod (Xiong Kun [#39770](https://github.com/PaddlePaddle/Paddle/pull/39770)) - - - shard_index ([#40254](https://github.com/PaddlePaddle/Paddle/pull/40254)) - - - reshape2 ([#40914](https://github.com/PaddlePaddle/Paddle/pull/40914), [#39631](https://github.com/PaddlePaddle/Paddle/pull/39631), [#38833](https://github.com/PaddlePaddle/Paddle/pull/38833), [#37164](https://github.com/PaddlePaddle/Paddle/pull/37164)) - - - flip ([#39822](https://github.com/PaddlePaddle/Paddle/pull/39822), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974)) - - - eye ([#39712](https://github.com/PaddlePaddle/Paddle/pull/39712), [#40105](https://github.com/PaddlePaddle/Paddle/pull/40105), [#41476](https://github.com/PaddlePaddle/Paddle/pull/41476)) - - - lookup_table_v2([#39901](https://github.com/PaddlePaddle/Paddle/pull/39901)) - - - searchsorted([#40520](https://github.com/PaddlePaddle/Paddle/pull/40520), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053)) - - - adamw ([#40351](https://github.com/PaddlePaddle/Paddle/pull/40351)) - - - tanh ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385)) - - - cross ([#39829](https://github.com/PaddlePaddle/Paddle/pull/39829)) - - - concat 
([#38955](https://github.com/PaddlePaddle/Paddle/pull/38955), [#41112](https://github.com/PaddlePaddle/Paddle/pull/41112)) - - - split ([#39060](https://github.com/PaddlePaddle/Paddle/pull/39060)) - - - linspace ([#40124](https://github.com/PaddlePaddle/Paddle/pull/40124)) - - - huber_loss ([#39761](https://github.com/PaddlePaddle/Paddle/pull/39761)) - - - hierarchical_sigmoid ([#40553](https://github.com/PaddlePaddle/Paddle/pull/40553)) - - - nll_loss ([#39936](https://github.com/PaddlePaddle/Paddle/pull/39936)) - - - graph_send_recv ([#40092](https://github.com/PaddlePaddle/Paddle/pull/40092), [#40320](https://github.com/PaddlePaddle/Paddle/pull/40320)) - - - abs ([#39492](https://github.com/PaddlePaddle/Paddle/pull/39492), [#39762](https://github.com/PaddlePaddle/Paddle/pull/39762)) - - - exp ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - rsqrt ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727)) - - - viterbi_decode ([#40186](https://github.com/PaddlePaddle/Paddle/pull/40186)) - - - conj ([#38247](https://github.com/PaddlePaddle/Paddle/pull/38247)) - - - real ([#39777](https://github.com/PaddlePaddle/Paddle/pull/39777), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173)) - - - imag ([#39777](https://github.com/PaddlePaddle/Paddle/pull/39777), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173)) - - - take_along_axis ([#39959](https://github.com/PaddlePaddle/Paddle/pull/39959), [#40270](https://github.com/PaddlePaddle/Paddle/pull/40270), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974)) - - - put_along_axis ([#39959](https://github.com/PaddlePaddle/Paddle/pull/39959), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974)) - - - lgamma ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770)) - - - relu ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175)) - - - maxout ([#39959](https://github.com/PaddlePaddle/Paddle/pull/39959), 
[#40974](https://github.com/PaddlePaddle/Paddle/pull/40974)) - - - log ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785)) - - - bilinear_tensor_product([#39903](https://github.com/PaddlePaddle/Paddle/pull/39903)) - - - flatten_contiguous_range ([#38712](https://github.com/PaddlePaddle/Paddle/pull/38712), [#36957](https://github.com/PaddlePaddle/Paddle/pull/36957), [#41345](https://github.com/PaddlePaddle/Paddle/pull/41345)) - - - matrix_rank ([#40074](https://github.com/PaddlePaddle/Paddle/pull/40074), [#40519](https://github.com/PaddlePaddle/Paddle/pull/40519), [#41466](https://github.com/PaddlePaddle/Paddle/pull/41466)) - - - logit ([#37844](https://github.com/PaddlePaddle/Paddle/pull/37844)) - - - lerp ([#40105](https://github.com/PaddlePaddle/Paddle/pull/40105), [#39524](https://github.com/PaddlePaddle/Paddle/pull/39524)) - - - erfinv ([#39949](https://github.com/PaddlePaddle/Paddle/pull/39949), [#39712](https://github.com/PaddlePaddle/Paddle/pull/39712)) - - - broadcast_tensors([#40047](https://github.com/PaddlePaddle/Paddle/pull/40047)) - - - gumbel_softmax([#39873](https://github.com/PaddlePaddle/Paddle/pull/39873)) - - - diagonal ([#39575](https://github.com/PaddlePaddle/Paddle/pull/39575)) - - - trunc ([#39543](https://github.com/PaddlePaddle/Paddle/pull/39543), [#39772](https://github.com/PaddlePaddle/Paddle/pull/39772)) - - - multi_dot ([#40038](https://github.com/PaddlePaddle/Paddle/pull/40038)) - - - matrix_power ([#40231](https://github.com/PaddlePaddle/Paddle/pull/40231)) - - - digamma([#39240](https://github.com/PaddlePaddle/Paddle/pull/39240)) - - - masked_select([#39193](https://github.com/PaddlePaddle/Paddle/pull/39193)) - - - determinant ([#40539](https://github.com/PaddlePaddle/Paddle/pull/40539)) - - - eigh ([#40213](https://github.com/PaddlePaddle/Paddle/pull/40213)) - - - size ([#39949](https://github.com/PaddlePaddle/Paddle/pull/39949), [#39712](https://github.com/PaddlePaddle/Paddle/pull/39712)) - - - shape 
([#40248](https://github.com/PaddlePaddle/Paddle/pull/40248)) - - - reduce_sum ([#37559](https://github.com/PaddlePaddle/Paddle/pull/37559), [#41295](https://github.com/PaddlePaddle/Paddle/pull/41295)) - - - reduce_prod ([#39844](https://github.com/PaddlePaddle/Paddle/pull/39844)) - - - histogram ([#39496](https://github.com/PaddlePaddle/Paddle/pull/39496)) - - - meshgrid ([#41411](https://github.com/PaddlePaddle/Paddle/pull/41411)) - - - brelu ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385)) - - - hard_swish ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913)) - - - hard_shrink ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565)) - - - selu ([#39819](https://github.com/PaddlePaddle/Paddle/pull/39819)) - - - expand_v2 ([#39471](https://github.com/PaddlePaddle/Paddle/pull/39471)) - - - top_k_v2 ([#40064](https://github.com/PaddlePaddle/Paddle/pull/40064)) - - - expand_as_v2 ([#40373](https://github.com/PaddlePaddle/Paddle/pull/40373)) - - - swish ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913)) - - - hard_sigmoid ([#40626](https://github.com/PaddlePaddle/Paddle/pull/40626)) - - - exp, det, assign, gaussian_random, matrix_rank, eye, and deformable_conv. ([#41755](https://github.com/PaddlePaddle/Paddle/pull/41755), [#41737](https://github.com/PaddlePaddle/Paddle/pull/41737)) - -#### **New Dynamic Graph Execution Mechanism** - -To improve the scheduling performance and custom development capability of PaddlePaddle's dynamic graph execution mechanism, we have refactored the underlying execution mechanism of the dynamic graph. With the new execution method, the PHI operator library can be used for efficient runtime execution. For operators supported by the PHI operator library, switching to the new dynamic graph mode yields a significant improvement in scheduling performance. 
However, because upgrading the overall framework execution mechanism is a huge workload and this work is tightly coupled with the PHI operator library, this execution method is still not enabled by default in this version. To try it, set the environment variable `FLAGS_enable_eager_mode=1`. The details are as follows: - -- **Implementation of dynamic graph execution infrastructure, core components and mechanism**: By making the dynamic graph execution code static, the original uniform operator construction is converted into specific calls to different PHI APIs, which greatly reduces scheduling overhead. ([#36059](https://github.com/PaddlePaddle/Paddle/pull/36059), [#37323](https://github.com/PaddlePaddle/Paddle/pull/37323), [#37556](https://github.com/PaddlePaddle/Paddle/pull/37556), [#37555](https://github.com/PaddlePaddle/Paddle/pull/37555), [#37478](https://github.com/PaddlePaddle/Paddle/pull/37478), [#37458](https://github.com/PaddlePaddle/Paddle/pull/37458), [#37479](https://github.com/PaddlePaddle/Paddle/pull/37479), [#37599](https://github.com/PaddlePaddle/Paddle/pull/37599), [#37659](https://github.com/PaddlePaddle/Paddle/pull/37659), [#37654](https://github.com/PaddlePaddle/Paddle/pull/37654), [#39200](https://github.com/PaddlePaddle/Paddle/pull/39200), [#39309](https://github.com/PaddlePaddle/Paddle/pull/39309), [#39319](https://github.com/PaddlePaddle/Paddle/pull/39319), [#39414](https://github.com/PaddlePaddle/Paddle/pull/39414), [#39504](https://github.com/PaddlePaddle/Paddle/pull/39504), [#39526](https://github.com/PaddlePaddle/Paddle/pull/39526), [#39878](https://github.com/PaddlePaddle/Paddle/pull/39878), [#39963](https://github.com/PaddlePaddle/Paddle/pull/39963)) - -- **New dynamic graph execution mechanism sub-function development and adaptation**: Support more flexible and complete dynamic graph sub-functions such as hook, pylayer, double_grad, inplace, and amp. 
([#41396](https://github.com/PaddlePaddle/Paddle/pull/41396), [#40400](https://github.com/PaddlePaddle/Paddle/pull/40400), [#40695](https://github.com/PaddlePaddle/Paddle/pull/40695), [#41043](https://github.com/PaddlePaddle/Paddle/pull/41043), [#40915](https://github.com/PaddlePaddle/Paddle/pull/40915), [#41104](https://github.com/PaddlePaddle/Paddle/pull/41104), [#41350](https://github.com/PaddlePaddle/Paddle/pull/41350), [#41209](https://github.com/PaddlePaddle/Paddle/pull/41209), [#40830](https://github.com/PaddlePaddle/Paddle/pull/40830), [#40891](https://github.com/PaddlePaddle/Paddle/pull/40891), [#36814](https://github.com/PaddlePaddle/Paddle/pull/36814), [#37377](https://github.com/PaddlePaddle/Paddle/pull/37377), [#37193](https://github.com/PaddlePaddle/Paddle/pull/37193), [#36965](https://github.com/PaddlePaddle/Paddle/pull/36965), [#37810](https://github.com/PaddlePaddle/Paddle/pull/37810), [#36837](https://github.com/PaddlePaddle/Paddle/pull/36837), [#38488](https://github.com/PaddlePaddle/Paddle/pull/38488), [#39282](https://github.com/PaddlePaddle/Paddle/pull/39282), [#39449](https://github.com/PaddlePaddle/Paddle/pull/39449), [#39531](https://github.com/PaddlePaddle/Paddle/pull/39531), [#39638](https://github.com/PaddlePaddle/Paddle/pull/39638), [#39674](https://github.com/PaddlePaddle/Paddle/pull/39674), [#39893](https://github.com/PaddlePaddle/Paddle/pull/39893), [#40170](https://github.com/PaddlePaddle/Paddle/pull/40170), [#40693](https://github.com/PaddlePaddle/Paddle/pull/40693), [#40937](https://github.com/PaddlePaddle/Paddle/pull/40937), [#41016](https://github.com/PaddlePaddle/Paddle/pull/41016), [#41051](https://github.com/PaddlePaddle/Paddle/pull/41051), [#41121](https://github.com/PaddlePaddle/Paddle/pull/41121), [#41198](https://github.com/PaddlePaddle/Paddle/pull/41198), [#41287](https://github.com/PaddlePaddle/Paddle/pull/41287), [#41380](https://github.com/PaddlePaddle/Paddle/pull/41380), 
[#41306](https://github.com/PaddlePaddle/Paddle/pull/41306), [#41387](https://github.com/PaddlePaddle/Paddle/pull/41387), [#40623](https://github.com/PaddlePaddle/Paddle/pull/40623), [#40945](https://github.com/PaddlePaddle/Paddle/pull/40945)) - -- **Automatic code generation mechanism for new dynamic graph execution**: Splitting the computation and scheduling logic of a large number of homogeneous operators into separate, specific scheduling logic by hand turned out to be a huge workload, so we introduced a new automatic code generation mechanism to generate the code and simplify the runtime logic of dynamic graphs. Meanwhile, to adapt to the various kinds of runtime logic in the previous framework, we also use some complex compilation techniques to obtain information at runtime and generate more accurate scheduling code. 
([#37574](https://github.com/PaddlePaddle/Paddle/pull/37574), [#37575](https://github.com/PaddlePaddle/Paddle/pull/37575), [#37639](https://github.com/PaddlePaddle/Paddle/pull/37639), [#37723](https://github.com/PaddlePaddle/Paddle/pull/37723), [#37753](https://github.com/PaddlePaddle/Paddle/pull/37753), [#37812](https://github.com/PaddlePaddle/Paddle/pull/37812), [#37837](https://github.com/PaddlePaddle/Paddle/pull/37837), [#37910](https://github.com/PaddlePaddle/Paddle/pull/37910), [#37943](https://github.com/PaddlePaddle/Paddle/pull/37943), [#37992](https://github.com/PaddlePaddle/Paddle/pull/37992), [#37959](https://github.com/PaddlePaddle/Paddle/pull/37959), [#38017](https://github.com/PaddlePaddle/Paddle/pull/38017), [#37969](https://github.com/PaddlePaddle/Paddle/pull/37969), [#38160](https://github.com/PaddlePaddle/Paddle/pull/38160), [#38085](https://github.com/PaddlePaddle/Paddle/pull/38085), [#38562](https://github.com/PaddlePaddle/Paddle/pull/38562), [#38573](https://github.com/PaddlePaddle/Paddle/pull/38573), [#39192](https://github.com/PaddlePaddle/Paddle/pull/39192), [#39215](https://github.com/PaddlePaddle/Paddle/pull/39215), [#39355](https://github.com/PaddlePaddle/Paddle/pull/39355), [#39358](https://github.com/PaddlePaddle/Paddle/pull/39358), [#39328](https://github.com/PaddlePaddle/Paddle/pull/39328), [#39233](https://github.com/PaddlePaddle/Paddle/pull/39233), [#39628](https://github.com/PaddlePaddle/Paddle/pull/39628), [#39767](https://github.com/PaddlePaddle/Paddle/pull/39767), [#39743](https://github.com/PaddlePaddle/Paddle/pull/39743), [#39897](https://github.com/PaddlePaddle/Paddle/pull/39897), [#39797](https://github.com/PaddlePaddle/Paddle/pull/39797), [#39997](https://github.com/PaddlePaddle/Paddle/pull/39997), [#40058](https://github.com/PaddlePaddle/Paddle/pull/40058), [#40080](https://github.com/PaddlePaddle/Paddle/pull/40080), [#40107](https://github.com/PaddlePaddle/Paddle/pull/40107), 
[#39962](https://github.com/PaddlePaddle/Paddle/pull/39962), [#40132](https://github.com/PaddlePaddle/Paddle/pull/40132), [#40276](https://github.com/PaddlePaddle/Paddle/pull/40276), [#40266](https://github.com/PaddlePaddle/Paddle/pull/40266), [#40480](https://github.com/PaddlePaddle/Paddle/pull/40480), [#40482](https://github.com/PaddlePaddle/Paddle/pull/40482), [#40368](https://github.com/PaddlePaddle/Paddle/pull/40368), [#40650](https://github.com/PaddlePaddle/Paddle/pull/40650), [#40815](https://github.com/PaddlePaddle/Paddle/pull/40815), [#40907](https://github.com/PaddlePaddle/Paddle/pull/40907), [#40935](https://github.com/PaddlePaddle/Paddle/pull/40935), [#41089](https://github.com/PaddlePaddle/Paddle/pull/41089)) - -- **Integration of the new dynamic graph execution mechanism into the main framework, and integration testing**: We currently use environment variables to distinguish between static graph mode and dynamic graph mode (both the new and the old dynamic graph modes). Most dynamic graph logic has been adapted to these modes, but many issues are still being fixed. 
([#37638](https://github.com/PaddlePaddle/Paddle/pull/37638), [#37643](https://github.com/PaddlePaddle/Paddle/pull/37643), [#37653](https://github.com/PaddlePaddle/Paddle/pull/37653), [#38314](https://github.com/PaddlePaddle/Paddle/pull/38314), [#38337](https://github.com/PaddlePaddle/Paddle/pull/38337), [#38338](https://github.com/PaddlePaddle/Paddle/pull/38338), [#39164](https://github.com/PaddlePaddle/Paddle/pull/39164), [#39326](https://github.com/PaddlePaddle/Paddle/pull/39326), [#40391](https://github.com/PaddlePaddle/Paddle/pull/40391), [#40201](https://github.com/PaddlePaddle/Paddle/pull/40201), [#40854](https://github.com/PaddlePaddle/Paddle/pull/40854), [#40887](https://github.com/PaddlePaddle/Paddle/pull/40887)) - -- **Update some mode-checking logic under dynamic graphs to support fast execution paths for dynamic graphs in a compatible way**: ([#40786](https://github.com/PaddlePaddle/Paddle/pull/40786)) - - - Check for non-static graph mode (current transition scheme): `_non_static_mode()`. - - - Check for the new dynamic graph in dynamic graph mode (recommended): `_in_dygraph_mode()`. - - - Check for the old dynamic graph in dynamic graph mode (not recommended; it will be deprecated in future versions): `_in_legacy_dygraph()`. - - - Enable the old dynamic graph and disable the new dynamic graph in dynamic graph mode: call `_enable_legacy_dygraph()` or exit `_test_eager_guard()`. - - - Enable the new dynamic graph and disable the old dynamic graph in dynamic graph mode: call `_disable_legacy_dygraph()` or use `with _test_eager_guard()`. - - - Check whether the new dynamic graph is in use, in either static or dynamic graph mode: `_in_eager_without_dygraph_check()`. - -- **Support inplace after dynamic graph reconstruction**: input and output are the same Tensor. - - - Adapt the inplace strategy for dynamic graph reconstruction intermediate states. ([#40400](https://github.com/PaddlePaddle/Paddle/pull/40400)) - - - Adapt the inplace strategy to the final state of the dynamic graph reconstruction. 
([#40695](https://github.com/PaddlePaddle/Paddle/pull/40695)) - - - Add inplace strategy to PyLayer function after dynamic graph reconstruction. ([#41043](https://github.com/PaddlePaddle/Paddle/pull/41043)) - - - Add inplace strategy for Tensor's setitem function after dynamic graph reconstruction. ([#40915](https://github.com/PaddlePaddle/Paddle/pull/40915)) - - - Add the `_reset_grad_inplace_version` interface after dynamic graph reconstruction, to set the inplace version of the Tensor's gradient to 0. ([#41101](https://github.com/PaddlePaddle/Paddle/pull/41101)) - - - If the value of the forward Tensor is not needed during backward computation (the no_need_buffer property), skip the inplace version check for that Tensor. ([#41350](https://github.com/PaddlePaddle/Paddle/pull/41350)) - - - Unify error messages for inplace version checks before and after reconstruction of dynamic graphs. ([#41209](https://github.com/PaddlePaddle/Paddle/pull/41209)) - -- **Support view strategy after dynamic graph reconstruction**: input and output Tensors share underlying data. - - - Adapt the view strategy for dynamic graph reconstruction intermediate states. This covers the `reshape`, `squeeze`, `unsqueeze`, and `flatten` APIs. ([#40830](https://github.com/PaddlePaddle/Paddle/pull/40830)) - - - Adapt the view strategy for dynamic graph reconstruction final state. This covers the `reshape` API. ([#40891](https://github.com/PaddlePaddle/Paddle/pull/40891)) - -- **Add support for weakref on the Python side of the new dynamic graph eager Tensor.** ([#41797](https://github.com/PaddlePaddle/Paddle/pull/41797)) - -- **Enhance the new dynamic graph DoubleGrad function** to support the basic DoubleGrad feature. 
([#41893](https://github.com/PaddlePaddle/Paddle/pull/41893), [#41894](https://github.com/PaddlePaddle/Paddle/pull/41894), [#41895](https://github.com/PaddlePaddle/Paddle/pull/41895)) - -- **Add `core.eager.StringTensor` interface**, to support the construction of StringTensor on the Python side and the use of the StringTensor related APIs. ([#41039](https://github.com/PaddlePaddle/Paddle/pull/41039)) - -- **Add `_grad_name` and `_grad_value`** to `core.eager.Tensor` to return the name and value of a gradient. ([#41990](https://github.com/PaddlePaddle/Paddle/pull/41990)) - -- **Add the processing of the no_need_buffer attribute for dynamic graph intermediate state.** The Tensor with the no_need_buffer attribute is skipped in the inplace backward check operation. ([#41720](https://github.com/PaddlePaddle/Paddle/pull/41720)) - - -#### **New Static Graph Executor** - -The original static graph executor of PaddlePaddle schedules poorly in some scenarios and makes it hard to use multiple streams. To address this, we have implemented a new, higher-performance static graph executor that makes it easy to use the asynchronous scheduling capabilities of multiple streams and multiple threads. The new executor is a compatible upgrade of the original one and is currently used by default in single-card scenarios, so users do not need to change their training code. Users can switch back to the original executor by setting the environment variable `FLAGS_USE_STANDALONE_EXECUTOR=false`. ([#41179](https://github.com/PaddlePaddle/Paddle/pull/41179)) The main contents are as follows.
- -- Basic components: a high-performance thread pool for multi-threaded scheduling in the executor ([#35470](https://github.com/PaddlePaddle/Paddle/pull/35470), [#35930](https://github.com/PaddlePaddle/Paddle/pull/35930), [#36030](https://github.com/PaddlePaddle/Paddle/pull/36030), [#36480](https://github.com/PaddlePaddle/Paddle/pull/36480), [#36688](https://github.com/PaddlePaddle/Paddle/pull/36688), [#36740](https://github.com/PaddlePaddle/Paddle/pull/36740), [#38335](https://github.com/PaddlePaddle/Paddle/pull/38335), [#40770](https://github.com/PaddlePaddle/Paddle/pull/40770)) and a thread cooperation component ([#38779](https://github.com/PaddlePaddle/Paddle/pull/38779), [#40876](https://github.com/PaddlePaddle/Paddle/pull/40876), [#40912](https://github.com/PaddlePaddle/Paddle/pull/40912)); timely memory recovery after operator execution ([#37642](https://github.com/PaddlePaddle/Paddle/pull/37642), [#39617](https://github.com/PaddlePaddle/Paddle/pull/39617), [#40859](https://github.com/PaddlePaddle/Paddle/pull/40859)); a new dependency analysis algorithm for the parallel executor ([#37231](https://github.com/PaddlePaddle/Paddle/pull/37231)), etc. - -- Scheduling logic: Optimize how operators are scheduled in the executor. Support a multi-stream, multi-threaded asynchronous scheduling mechanism. Schedule data type, device, and layout transforms as operators to improve performance. Support caching the selected operator Kernel. Support selecting new PHI operators.
([#35024](https://github.com/PaddlePaddle/Paddle/pull/35024), [#34922](https://github.com/PaddlePaddle/Paddle/pull/34922), [#35711](https://github.com/PaddlePaddle/Paddle/pull/35711), [#35928](https://github.com/PaddlePaddle/Paddle/pull/35928), [#39458](https://github.com/PaddlePaddle/Paddle/pull/39458), [#36899](https://github.com/PaddlePaddle/Paddle/pull/36899)) - -- Interface compatibility: Compatible with the user interface and functionality of the original executor, such as alignment with the Python interface `Executor.run()`, support for managing Tensor in Scope, etc. This ensures that users can switch to the new executor without noticing any difference. ([#37278](https://github.com/PaddlePaddle/Paddle/pull/37278), [#37379](https://github.com/PaddlePaddle/Paddle/pull/37379), [#37445](https://github.com/PaddlePaddle/Paddle/pull/37445), [#37510](https://github.com/PaddlePaddle/Paddle/pull/37510), [#40955](https://github.com/PaddlePaddle/Paddle/pull/40955), [#41778](https://github.com/PaddlePaddle/Paddle/pull/41178), [#41058](https://github.com/PaddlePaddle/Paddle/pull/41058), [#38584](https://github.com/PaddlePaddle/Paddle/pull/38584), [#37957](https://github.com/PaddlePaddle/Paddle/pull/37957), [#37672](https://github.com/PaddlePaddle/Paddle/pull/37672), [#37474](https://github.com/PaddlePaddle/Paddle/pull/37474), [#37085](https://github.com/PaddlePaddle/Paddle/pull/37085), [#37061](https://github.com/PaddlePaddle/Paddle/pull/37061), [#36945](https://github.com/PaddlePaddle/Paddle/pull/36945)) - -- Enhance debugging and error reporting in multi-threaded scenarios by capturing error reports from sub-threads and throwing them uniformly in the main thread, which improves the user experience. ([#36692](https://github.com/PaddlePaddle/Paddle/pull/36692), [#36802](https://github.com/PaddlePaddle/Paddle/pull/36802)) - -- Fix the bug that the communication stream of the new executor resets the stream cache information in the allocator, to reduce RecordStream overhead in cross-stream scenarios.
This optimization improves the performance of DeepFM models by about 8%. ([#42046](https://github.com/PaddlePaddle/Paddle/pull/42046)) - -- Optimize the dependency analysis method between new executor operators to improve runtime performance. Establish correct dependencies for send/recv communication operators to support pipeline parallelism. ([#42009](https://github.com/PaddlePaddle/Paddle/pull/42009)) - - - -#### **Distributed Training** - -- Basic functions of multi-machine multi-card parallel training based on collective communication - - - Add support for elastic training, which allows the number of workers to scale up and down and the training process to resume after a node failure, improving the fault tolerance of distributed training. ([#36684](https://github.com/PaddlePaddle/Paddle/pull/36684), [#37177](https://github.com/PaddlePaddle/Paddle/pull/37177), [#37781](https://github.com/PaddlePaddle/Paddle/pull/37781)) - - - Refactor the launch startup module, adding `master` collaboration and the node number `nnodes` definition, to make distributed startup easier to use. ([#40086](https://github.com/PaddlePaddle/Paddle/pull/40086), [#40568](https://github.com/PaddlePaddle/Paddle/pull/40568), [#40782](https://github.com/PaddlePaddle/Paddle/pull/40782), [#40844](https://github.com/PaddlePaddle/Paddle/pull/40844), [#40936](https://github.com/PaddlePaddle/Paddle/pull/40936), [#41190](https://github.com/PaddlePaddle/Paddle/pull/41190), [#41314](https://github.com/PaddlePaddle/Paddle/pull/41314)) - - - Add support for GPU/NPU/XPU multi-hardware heterogeneous training. ([#37613](https://github.com/PaddlePaddle/Paddle/pull/37613), [#37998](https://github.com/PaddlePaddle/Paddle/pull/37998)) - - - Add fleet_executor asynchronous pipeline executor.
([#36966](https://github.com/PaddlePaddle/Paddle/pull/36966), [#37049](https://github.com/PaddlePaddle/Paddle/pull/37049), [#37087](https://github.com/PaddlePaddle/Paddle/pull/37087), [#37126](https://github.com/PaddlePaddle/Paddle/pull/37126), [#37150](https://github.com/PaddlePaddle/Paddle/pull/37150), [#37203](https://github.com/PaddlePaddle/Paddle/pull/37203), [#37167](https://github.com/PaddlePaddle/Paddle/pull/37167), [#37282](https://github.com/PaddlePaddle/Paddle/pull/37282), [#37319](https://github.com/PaddlePaddle/Paddle/pull/37319), [#37462](https://github.com/PaddlePaddle/Paddle/pull/37462), [#37507](https://github.com/PaddlePaddle/Paddle/pull/37507), [#37533](https://github.com/PaddlePaddle/Paddle/pull/37533), [#37576](https://github.com/PaddlePaddle/Paddle/pull/37576), [#37605](https://github.com/PaddlePaddle/Paddle/pull/37605), [#37691](https://github.com/PaddlePaddle/Paddle/pull/37691), [#37742](https://github.com/PaddlePaddle/Paddle/pull/37742), [#37783](https://github.com/PaddlePaddle/Paddle/pull/37783), [#37809](https://github.com/PaddlePaddle/Paddle/pull/37809), [#37862](https://github.com/PaddlePaddle/Paddle/pull/37862), [#37882](https://github.com/PaddlePaddle/Paddle/pull/37882), [#37934](https://github.com/PaddlePaddle/Paddle/pull/37934), [#38024](https://github.com/PaddlePaddle/Paddle/pull/38024), [#38083](https://github.com/PaddlePaddle/Paddle/pull/38083), [#38164](https://github.com/PaddlePaddle/Paddle/pull/38164), [#38261](https://github.com/PaddlePaddle/Paddle/pull/38261), [#38290](https://github.com/PaddlePaddle/Paddle/pull/38290), [#40607](https://github.com/PaddlePaddle/Paddle/pull/40607), [#37093](https://github.com/PaddlePaddle/Paddle/pull/37093), [#37106](https://github.com/PaddlePaddle/Paddle/pull/37106), [#37143](https://github.com/PaddlePaddle/Paddle/pull/37143), [#37338](https://github.com/PaddlePaddle/Paddle/pull/37338), [#37376](https://github.com/PaddlePaddle/Paddle/pull/37376), 
[#37485](https://github.com/PaddlePaddle/Paddle/pull/37485), [#37531](https://github.com/PaddlePaddle/Paddle/pull/37531), [#37623](https://github.com/PaddlePaddle/Paddle/pull/37623), [#37693](https://github.com/PaddlePaddle/Paddle/pull/37693), [#37755](https://github.com/PaddlePaddle/Paddle/pull/37755), [#37807](https://github.com/PaddlePaddle/Paddle/pull/37807), [#37889](https://github.com/PaddlePaddle/Paddle/pull/37889), [#38420](https://github.com/PaddlePaddle/Paddle/pull/38420), [#38539](https://github.com/PaddlePaddle/Paddle/pull/38539), [#36892](https://github.com/PaddlePaddle/Paddle/pull/36892), [#37084](https://github.com/PaddlePaddle/Paddle/pull/37084), [#37158](https://github.com/PaddlePaddle/Paddle/pull/37158), [#37361](https://github.com/PaddlePaddle/Paddle/pull/37361), [#37509](https://github.com/PaddlePaddle/Paddle/pull/37509), [#37603](https://github.com/PaddlePaddle/Paddle/pull/37603), [#37703](https://github.com/PaddlePaddle/Paddle/pull/37703), [#37824](https://github.com/PaddlePaddle/Paddle/pull/37824), [#38114](https://github.com/PaddlePaddle/Paddle/pull/38114), [#38322](https://github.com/PaddlePaddle/Paddle/pull/38322), [#38535](https://github.com/PaddlePaddle/Paddle/pull/38535), [#38650](https://github.com/PaddlePaddle/Paddle/pull/38650), [#38709](https://github.com/PaddlePaddle/Paddle/pull/38709), [#38799](https://github.com/PaddlePaddle/Paddle/pull/38799), [#38839](https://github.com/PaddlePaddle/Paddle/pull/38839), [#38904](https://github.com/PaddlePaddle/Paddle/pull/38904)) - - - Add distributed inference function for large-scale model. 
([#38795](https://github.com/PaddlePaddle/Paddle/pull/38795), [#39012](https://github.com/PaddlePaddle/Paddle/pull/39012), [#39032](https://github.com/PaddlePaddle/Paddle/pull/39032), [#39076](https://github.com/PaddlePaddle/Paddle/pull/39076), [#39194](https://github.com/PaddlePaddle/Paddle/pull/39194), [#39207](https://github.com/PaddlePaddle/Paddle/pull/39207), [#39241](https://github.com/PaddlePaddle/Paddle/pull/39241), [#39603](https://github.com/PaddlePaddle/Paddle/pull/39603), [#39758](https://github.com/PaddlePaddle/Paddle/pull/39758), [#39992](https://github.com/PaddlePaddle/Paddle/pull/39992)). - -- Dynamic graph hybrid parallelism - - - Reconstruct `paddle.distributed.fleet.utils.recompute`, to support new dynamic computational graph. ([#41396](https://github.com/PaddlePaddle/Paddle/pull/41396)) - - - Add pure FP16 training to support data parallelism. ([#36420](https://github.com/PaddlePaddle/Paddle/pull/36420)) - - - Add MoE (Mixture of Experts) parallel strategy, to support large-scale MoE model training. ([#41092](https://github.com/PaddlePaddle/Paddle/pull/41092), [#40895](https://github.com/PaddlePaddle/Paddle/pull/40895), [#40850](https://github.com/PaddlePaddle/Paddle/pull/40580), [#39224](https://github.com/PaddlePaddle/Paddle/pull/39224)) - - - Add GroupSharded parallel strategy. Support stage1, stage2, stage3, and it supports synchronous and asynchronous communication. It can be used together with the basic function combinations such as Recompute, AMP O1\O2, Offload, GroupShardedClipGrad, and GroupShardedScaler. 
([#37489](https://github.com/PaddlePaddle/Paddle/pull/37489), [#37568](https://github.com/PaddlePaddle/Paddle/pull/37568), [#37707](https://github.com/PaddlePaddle/Paddle/pull/37707), [#37836](https://github.com/PaddlePaddle/Paddle/pull/37836), [#37947](https://github.com/PaddlePaddle/Paddle/pull/37947), [#38151](https://github.com/PaddlePaddle/Paddle/pull/38151), [#38407](https://github.com/PaddlePaddle/Paddle/pull/38407), [#38052](https://github.com/PaddlePaddle/Paddle/pull/38052), [#39112](https://github.com/PaddlePaddle/Paddle/pull/39112), [#38989](https://github.com/PaddlePaddle/Paddle/pull/38989), [#39171](https://github.com/PaddlePaddle/Paddle/pull/39171), [#39285](https://github.com/PaddlePaddle/Paddle/pull/39285), [#39334](https://github.com/PaddlePaddle/Paddle/pull/39334), [#39397](https://github.com/PaddlePaddle/Paddle/pull/39397), [#39581](https://github.com/PaddlePaddle/Paddle/pull/39581), [#39668](https://github.com/PaddlePaddle/Paddle/pull/39668), [#40129](https://github.com/PaddlePaddle/Paddle/pull/40129), [#40396](https://github.com/PaddlePaddle/Paddle/pull/40396), [#40488](https://github.com/PaddlePaddle/Paddle/pull/40488), [#40601](https://github.com/PaddlePaddle/Paddle/pull/40601),[#37725](https://github.com/PaddlePaddle/Paddle/pull/37725),[#37904](https://github.com/PaddlePaddle/Paddle/pull/37904), [#38064](https://github.com/PaddlePaddle/Paddle/pull/38064)) - -- Static graph hybrid parallelism - - - Add `scale_gradient` flag bit to `gradient_scale_configs` to control the position where the gradient aggregation operation averages the gradients under pipeline parallelism. ([#36384](https://github.com/PaddlePaddle/Paddle/pull/36384)) - - - Under tensor parallelism, the dropout op supports the settings of deterministic random seed generators, to ensure random consistency for non-distributed variables and randomness of distributed variables. 
([#36228](https://github.com/PaddlePaddle/Paddle/pull/36228)) - - - NPU hybrid parallelism supports Offload, with saving 40% of NPU memory. ([#37224](https://github.com/PaddlePaddle/Paddle/pull/37224)) - - - Add `force_cpu` optional parameter to the seed op, to allow dropout to read seed values directly from CPU. ([#35820](https://github.com/PaddlePaddle/Paddle/pull/35820)) - - - Improve the Automatic Sparsity (ASP) sharding strategy and support the selection of sharding strategy according to the program. ([#40028](https://github.com/PaddlePaddle/Paddle/pull/40028)) - -- Automatic parallel - - - Add the process restart (relaunch) after automatic mapping between logical processes and physical devices. ([#37523](https://github.com/PaddlePaddle/Paddle/pull/37523), [#37326](https://github.com/PaddlePaddle/Paddle/pull/37326)) - - - Improve the underlying mechanism and interface for automatic parallel to facilitate the unification of modules and add the optimized pass. ([#36617](https://github.com/PaddlePaddle/Paddle/pull/36617), [#38132](https://github.com/PaddlePaddle/Paddle/pull/38132)) - - - Add unified resource representation, to support for automatic mapping between logical processes and physical devices. ([#37091](https://github.com/PaddlePaddle/Paddle/pull/37091), [#37482](https://github.com/PaddlePaddle/Paddle/pull/37482), [#37094](https://github.com/PaddlePaddle/Paddle/pull/37094)) - - - Improve the distributed attribute complementation for the backward and update parts of the computation graph. ([#36744](https://github.com/PaddlePaddle/Paddle/pull/36744)) - - - Add data slicing function. ([#36055](https://github.com/PaddlePaddle/Paddle/pull/36055)) - - - Add tensor resharding function to reshard the tensor according to the distributed properties of the tensor and operator. 
([#40865](https://github.com/PaddlePaddle/Paddle/pull/40865), [#41106](https://github.com/PaddlePaddle/Paddle/pull/41106)) - - - Add the automatic conversion pass of distributed parameters when the number of resources or parallel policy changes. ([#40434](https://github.com/PaddlePaddle/Paddle/pull/40434)) - - - Add GradientMerge pass to reduce the number of communications and improve training efficiency. ([#38259](https://github.com/PaddlePaddle/Paddle/pull/38259), [#40737](https://github.com/PaddlePaddle/Paddle/pull/40737)) - - - Add Recompute pass to reduce the activation memory storage. ([#38920](https://github.com/PaddlePaddle/Paddle/pull/38920)) - - - Add Sharding optimization pass, to support p-g-os 3 stage optimization. ([#38502](https://github.com/PaddlePaddle/Paddle/pull/38502)) - - - Add AMP + FP16 optimization pass. ([#38764](https://github.com/PaddlePaddle/Paddle/pull/38764), [#40615](https://github.com/PaddlePaddle/Paddle/pull/40615)) - - - Add fused QKV parallelization for Transformer class model. ([#39080](https://github.com/PaddlePaddle/Paddle/pull/39080)) - - - Improve the sharding propagation for while op to ensure convergence of the fix-point algorithm. ([#39939](https://github.com/PaddlePaddle/Paddle/pull/39939), [#39086](https://github.com/PaddlePaddle/Paddle/pull/39086), [#39014](https://github.com/PaddlePaddle/Paddle/pull/39014)) - - - Support training and inference for sub-block and while op control flow. ([#39612](https://github.com/PaddlePaddle/Paddle/pull/39612), [#39895](https://github.com/PaddlePaddle/Paddle/pull/39895), [#40077](https://github.com/PaddlePaddle/Paddle/pull/40077)) - -- Parameter Server - - - Add NaN/Inf value checking tool under GPUPS. ([#38131](https://github.com/PaddlePaddle/Paddle/pull/38131)) - - - Under GPUPS, add set_date interface to adapt incremental training. ([#36194](https://github.com/PaddlePaddle/Paddle/pull/36194)) - - - Under GPUPS, add asynchronous release dataset function. 
([#37790](https://github.com/PaddlePaddle/Paddle/pull/37790)) - - - Under GPUPS, support dumping parameters and intermediate layers. ([#36157](https://github.com/PaddlePaddle/Paddle/pull/36157)) - - - Under GPUPS, support the optimizer parameter configuration. ([#39783](https://github.com/PaddlePaddle/Paddle/pull/39783), [#39849](https://github.com/PaddlePaddle/Paddle/pull/39849)) - - - Under the Unified Parameter Server, refactor the base classes of each module such as communication and storage, to improve the ease of secondary development of each module. ([#41207](https://github.com/PaddlePaddle/Paddle/pull/41207), [#41022](https://github.com/PaddlePaddle/Paddle/pull/41022), [#40702](https://github.com/PaddlePaddle/Paddle/pull/40702), [#39341](https://github.com/PaddlePaddle/Paddle/pull/39341), [#39377](https://github.com/PaddlePaddle/Paddle/pull/39377), [#39191](https://github.com/PaddlePaddle/Paddle/pull/39191), [#39064](https://github.com/PaddlePaddle/Paddle/pull/39064)) - - - Add evaluation metrics module under the Unified Parameter Server, to support AUC/WuAUC/MaskAUC and other evaluation metrics calculation and customizable extensions. ([#38789](https://github.com/PaddlePaddle/Paddle/pull/38789)) - - - Support XPU parameter server training on KUNLUNXIN 2. ([#41917](https://github.com/PaddlePaddle/Paddle/pull/41917), [#42266](https://github.com/PaddlePaddle/Paddle/pull/42266), [#41916](https://github.com/PaddlePaddle/Paddle/pull/41916)) - -#### Profiler - -- Add the performance analysis module `paddle.profiler` in the Python layer: Provide the ability to collect, export, and summarize performance data during training and inference. ([#40065](https://github.com/PaddlePaddle/Paddle/pull/40065), [#40357](https://github.com/PaddlePaddle/Paddle/pull/40357), [#40888](https://github.com/PaddlePaddle/Paddle/pull/40888)) - - - `paddle.profiler.Profiler`: performance analyzer, interface for user interaction.
([#41029](https://github.com/PaddlePaddle/Paddle/pull/41029), [#41524](https://github.com/PaddlePaddle/Paddle/pull/41524), [#41157](https://github.com/PaddlePaddle/Paddle/pull/41157), [#40249](https://github.com/PaddlePaddle/Paddle/pull/40249), [#40111](https://github.com/PaddlePaddle/Paddle/pull/40111), [#39964](https://github.com/PaddlePaddle/Paddle/pull/39964), [#40133](https://github.com/PaddlePaddle/Paddle/pull/40133)) - - - `paddle.profiler.RecordEvent`: provide custom instrumentation points for recording time. ([#39693](https://github.com/PaddlePaddle/Paddle/pull/39693), [#39694](https://github.com/PaddlePaddle/Paddle/pull/39694), [#39695](https://github.com/PaddlePaddle/Paddle/pull/39695), [#39675](https://github.com/PaddlePaddle/Paddle/pull/39675), [#41445](https://github.com/PaddlePaddle/Paddle/pull/41445), [#41132](https://github.com/PaddlePaddle/Paddle/pull/41132)) - - - `paddle.profiler.ProfilerTarget`: specify the target device for performance analysis. - - - `paddle.profiler.ProfilerState`: indicate the state of the performance analyzer. - - - `paddle.profiler.SortedKeys`: specify how the data in the summary table is sorted. - - - `paddle.profiler.make_scheduler`: generate a scheduler for the performance analyzer state, implementing periodic control of the collection scope. - - - `paddle.profiler.export_chrome_tracing`: save performance data to a Google Chrome tracing file viewable by the chrome://tracing plugin. ([#39316](https://github.com/PaddlePaddle/Paddle/pull/39316), [#39984](https://github.com/PaddlePaddle/Paddle/pull/39984), [#41029](https://github.com/PaddlePaddle/Paddle/pull/41029)) - - - `paddle.profiler.export_protobuf`: save performance data to a protobuf file represented by internal structure.
([#39519](https://github.com/PaddlePaddle/Paddle/pull/39519), [#39109](https://github.com/PaddlePaddle/Paddle/pull/39109), [#39474](https://github.com/PaddlePaddle/Paddle/pull/39474)) - - - `paddle.profiler.load_profiler_result`: load the performance data saved to a protobuf file. - - - `paddle.profiler.Profiler`: generate statistics on data reading, step overhead, and throughput for model training when the `timer_only` parameter is specified. ([#40386](https://github.com/PaddlePaddle/Paddle/pull/40386)) - -- Refactor Profiler underlying infrastructure in C++ layer - - - Refactor the Profiler's controller architecture. ([#38826](https://github.com/PaddlePaddle/Paddle/pull/38826), [#39230](https://github.com/PaddlePaddle/Paddle/pull/39230), [#39779](https://github.com/PaddlePaddle/Paddle/pull/39779)) - - - Add Host Tracer to collect host-side performance metrics. ([#37629](https://github.com/PaddlePaddle/Paddle/pull/39629), [#37766](https://github.com/PaddlePaddle/Paddle/pull/37766), [#37944](https://github.com/PaddlePaddle/Paddle/pull/37944), [#38280](https://github.com/PaddlePaddle/Paddle/pull/38280), [#39975](https://github.com/PaddlePaddle/Paddle/pull/39975), [#40460](https://github.com/PaddlePaddle/Paddle/pull/40460)) - - - Add CUDA Tracer to collect device-side performance metrics. ([#39488](https://github.com/PaddlePaddle/Paddle/pull/39488)) - - - Profiler support for grading. ([#39926](https://github.com/PaddlePaddle/Paddle/pull/39926)) - -- Modify the name and type of logging for op under new dynamic graph. ([#41771](https://github.com/PaddlePaddle/Paddle/pull/41771/)) - -- Add Kernel running statistics into profilers' summarization and optimize the summarization. ([#41989](https://github.com/PaddlePaddle/Paddle/pull/41989)) - -- Remove the performance side effect on forward computation when Profiler is off.
([#42142](https://github.com/PaddlePaddle/Paddle/pull/42142)) - -#### **CINN compiler adoption** - -With the recent development of PaddlePaddle's compiler, a.k.a. CINN ([GitHub - PaddlePaddle/CINN: Compiler Infrastructure for Neural Networks](https://github.com/PaddlePaddle/CINN)), the Paddle framework has also been changed to adapt to CINN's features. These changes include the subgraph management related functions for the Paddle-CINN runtime, optimization of memory and speed performance, and bug fixing during development. - -- Functions developed: - - - Subgraph op related functions: - - - Add the function to find and generate CINN subgraphs from computational graphs. ([#36345](https://github.com/PaddlePaddle/Paddle/pull/36345)) - - - Add cinn_launch op as a runtime entry point to CINN. It is responsible for scheduling CINN to compile the subgraph, to initialize the data, and to execute the generated kernels. ([#36600](https://github.com/PaddlePaddle/Paddle/pull/36600)) - - - Add a helper class `CinnLaunchContext` to the kernel implementation of cinn_launch op to manage the intermediate data for compiling and running subgraphs, to improve scalability and code readability. ([#37938](https://github.com/PaddlePaddle/Paddle/pull/37938)) - - - Add additional fetch nodes to CINN subgraphs, thus ensuring that CINN external nodes can fetch the values of variables. ([#37172](https://github.com/PaddlePaddle/Paddle/pull/37172), [#37190](https://github.com/PaddlePaddle/Paddle/pull/37190)) - - - Add the function to symbolize a CINN subgraph, which is used to topologically sort the subgraphs and return the CINN execution sequence. ([#36417](https://github.com/PaddlePaddle/Paddle/pull/36417)) - - - Add `CinnCompiler` class for invoking subgraphs in the CINN compiled graph that can be replaced by using CINN operators.
([#36562](https://github.com/PaddlePaddle/Paddle/pull/36562), [#36975](https://github.com/PaddlePaddle/Paddle/pull/36975)) - - - Add the interface to CINN symbolization class to get the names of subgraph fetched variables to prevent fetched variables from being eliminated in compilation optimizations. ([#37218](https://github.com/PaddlePaddle/Paddle/pull/37218)) - - - Checking, debugging, and API changes related: - - - Synchronize the update of NetBuilder API name changes in CINN. ([#40392](https://github.com/PaddlePaddle/Paddle/pull/40392)) - - - Add necessary log information to Paddle-CINN for better debugging. ([#36867](https://github.com/PaddlePaddle/Paddle/pull/36867)) - - - Add the bidirectional conversion function between Paddle desc and CINN desc. ([#36100](https://github.com/PaddlePaddle/Paddle/pull/36100)) - - - The operator implemented in CINN may not use some input variables compared to Paddle. Therefore, remove the check that the input variables must be used in the cinn_launch op. ([#37119](https://github.com/PaddlePaddle/Paddle/pull/37119)) - - - Add cinn_instruction_run op for invoking CINN to execute a single generated instruction, facilitating the construction of scheduling run subgraphs on the Paddle side. ([#39435](https://github.com/PaddlePaddle/Paddle/pull/39435), [#39576](https://github.com/PaddlePaddle/Paddle/pull/39576)) - - - Add control macros to Paddle for CUDA/CUBLAS/MKL/CINN pass application required to compile CINN. ([#37066](https://github.com/PaddlePaddle/Paddle/pull/37066), [#36660](https://github.com/PaddlePaddle/Paddle/pull/36660)) - - - Add two control flags FLAGS_allow_cinn_ops and FLAGS_deny_cinn_ops to control the categories of CINN operators used to replace native operators during Paddle training. ([#36842](https://github.com/PaddlePaddle/Paddle/pull/36842)) - -- Performance optimization: - - - Speed optimization - - - Optimize the computational time consumed by CinnCacheKey.
([#37786](https://github.com/PaddlePaddle/Paddle/pull/37786), [#37317](https://github.com/PaddlePaddle/Paddle/pull/37317)) - - - Cache variable scope for CINN compiled subgraphs to reduce runtime parameter construction overhead. ([#37983](https://github.com/PaddlePaddle/Paddle/pull/37983)) - - - Utilize CINN's auto-tuning in case of subgraph compilation, could be enabled by flag, for further tuning of training performance. ([#41795](https://github.com/PaddlePaddle/Paddle/pull/41795)) - - - Refactor the correctness check of compilation results in case of subgraph compilation to avoid repeated checks at runtime and reduce the scheduling overhead. ([#41777](https://github.com/PaddlePaddle/Paddle/pull/41777)) - - - Enable TransposeFolding and GemmRewriter optimization passes by default in Paddle-CINN training. ([#41084](https://github.com/PaddlePaddle/Paddle/pull/41084)) - - - Pass the cuda stream created in Paddle into CINN so that Paddle and CINN can use the same CUDA stream in cuda computing. ([#37337](https://github.com/PaddlePaddle/Paddle/pull/37337)) - - - Move CINN optimization pass application logic from Paddle to CINN. ([#42047](https://github.com/PaddlePaddle/Paddle/pull/42047), [#42070](https://github.com/PaddlePaddle/Paddle/pull/42070)) - - - Device memory optimization - - - Add NoNeedBufferVars to cinn_launch op to declare a list of input variables that do not require a buffer, so that the memory can be freed in advance. ([#38367](https://github.com/PaddlePaddle/Paddle/pull/38367)) - - - Pass in reference count information for external variables to the subgraph, so that subgraphs within cinn_launch can reuse memory optimization passes and reduce the memory overhead in using CINN. 
([#39209](https://github.com/PaddlePaddle/Paddle/pull/39209), [#39622](https://github.com/PaddlePaddle/Paddle/pull/39622)) - - - Add the function to convert a collection of executable instructions generated by CINN compilation to a Paddle Graph, supporting reuse of the Paddle scheduler and memory optimization pass, further reducing the memory overhead in using CINN. ([#39724](https://github.com/PaddlePaddle/Paddle/pull/39724), [#39911](https://github.com/PaddlePaddle/Paddle/pull/39911)) - - - Add Kernel of cinn_instruction_run op, to support dynamic device memory requests based on data types inferred from compilation results. ([#40920](https://github.com/PaddlePaddle/Paddle/pull/40920)) - -- Bug fixing: - - - Fix and optimize the generation logic of CINN subgraphs. ([#36503](https://github.com/PaddlePaddle/Paddle/pull/36503)) - - - Fix the bug that Paddle-CINN does not support no-input subgraphs. ([#40814](https://github.com/PaddlePaddle/Paddle/pull/40814)) - - - Fix an error reported due to CINN not being able to handle useless outputs in operators such as batch_norm. ([#36996](https://github.com/PaddlePaddle/Paddle/pull/36996)) - - - Fix several bugs in CINN subgraph partitioning and symbolization, and resolve problems in connecting Paddle training to CINN. ([#36739](https://github.com/PaddlePaddle/Paddle/pull/36739), [#36698](https://github.com/PaddlePaddle/Paddle/pull/36698)) - - - CINN does not support control flow yet. Add logic to skip control flow when encountered. ([#40812](https://github.com/PaddlePaddle/Paddle/pull/40812)) - -#### **Other** - -- Model quantization - - - Upgrade quantization storage format to unify quantization formats for dynamic and static graphs. ([#41041](https://github.com/PaddlePaddle/Paddle/pull/41041)) - - - Add new post-training quantization (PTQ) methods: EMD and Adaround.
([#40421](https://github.com/PaddlePaddle/Paddle/pull/40421), [#38460](https://github.com/PaddlePaddle/Paddle/pull/38460)) - - - Support quantizing more operations in PTQ and QAT, such as crop, split, ab, unsqueeze etc. ([#40083](https://github.com/PaddlePaddle/Paddle/pull/40083)) - - - Support quantizing operators in control flow. ([#37498](https://github.com/PaddlePaddle/Paddle/pull/37498)) - - - Support quantization of matmul_v2 operator. ([#36469](https://github.com/PaddlePaddle/Paddle/pull/36469)) - - - Add support for quantized matmul_v2 inference on TensorRT. ([#36594](https://github.com/PaddlePaddle/Paddle/pull/36594)) - -- CUDA memory optimization - - - Implement multi-stream safe Allocator to support safe and efficient use of CUDA memory in asynchronous computing scenarios. ([#37290](https://github.com/PaddlePaddle/Paddle/pull/37290)) - - - Add new APIs (paddle.device.cuda.max_memory_allocated, paddle.device.cuda.max_memory_reserved, paddle.device.cuda.memory_allocated and paddle.device.cuda.memory_reserved) for GPU memory monitoring at runtime. ([#38657](https://github.com/PaddlePaddle/Paddle/pull/38657)) - - - Support allocating CUDA Managed Memory to train super large models in memory-constrained scenarios. ([#39075](https://github.com/PaddlePaddle/Paddle/pull/39075)) - - - Add GetBasePtr interface in C++ to get device address created with *cudaMalloc*. ([#37978](https://github.com/PaddlePaddle/Paddle/pull/37978)) - - - Reduce the number of free blocks in AutoGrowth Allocator to improve memory allocation performance. ([#35732](https://github.com/PaddlePaddle/Paddle/pull/35732)) - - - Remove redundant Float32 temporary tensor and cast operation for tensor with data type FP16 in `initializer.Normal` and `initializer.Constant` to save 2x memory. ([#38818](https://github.com/PaddlePaddle/Paddle/pull/38818)) - -- High-order derivative testing for models in dynamic graphs. - - - Add third-order derivative testing for network in dynamic graphs.
([#36814](https://github.com/PaddlePaddle/Paddle/pull/36814), [#37377](https://github.com/PaddlePaddle/Paddle/pull/37377)) -- Custom op: Support custom op on the ROCm (HIP) platform. ([#36771](https://github.com/PaddlePaddle/Paddle/pull/36771)) - -- Cost Model: Add basic Cost Model based on profiling information. ([#35774](https://github.com/PaddlePaddle/Paddle/pull/35774)) - -- Add a function to allow users to add their own layers and the corresponding pruning methods to ASP support. ([#40253](https://github.com/PaddlePaddle/Paddle/pull/40253)) - -- Add string tensor data structure, allowing the framework to represent and process strings. ([#39830](https://github.com/PaddlePaddle/Paddle/pull/39830), [#40992](https://github.com/PaddlePaddle/Paddle/pull/40992)) - -- Add or upgrade oneDNN FP32/int8/bfloat16 Kernel, including: - - - ELU ([#37149](https://github.com/PaddlePaddle/Paddle/pull/37149)) - - - exp ([#38624](https://github.com/PaddlePaddle/Paddle/pull/38624)) - - - stack ([#37002](https://github.com/PaddlePaddle/Paddle/pull/37002)) - - - softplus ([#36382](https://github.com/PaddlePaddle/Paddle/pull/36382)) - - - round ([#39653](https://github.com/PaddlePaddle/Paddle/pull/39653)) - - - shape ([#36033](https://github.com/PaddlePaddle/Paddle/pull/36033)) - - - flatten and flatten2 ([#35892](https://github.com/PaddlePaddle/Paddle/pull/35892)) - - - slice ([#37630](https://github.com/PaddlePaddle/Paddle/pull/37630)) - - - elementwise_mul ([#40546](https://github.com/PaddlePaddle/Paddle/pull/40546)) - - - elementwise_add ([#38176](https://github.com/PaddlePaddle/Paddle/pull/38176)) - - - elementwise_div ([#36158](https://github.com/PaddlePaddle/Paddle/pull/36158)) - - - elementwise_sub ([#35662](https://github.com/PaddlePaddle/Paddle/pull/35662)) - - - roi_align ([#37848](https://github.com/PaddlePaddle/Paddle/pull/37848)) - - - nearest_interp and nearest_interp_v2 
([#37985](https://github.com/PaddlePaddle/Paddle/pull/37985), [#38622](https://github.com/PaddlePaddle/Paddle/pull/38622), [#39490](https://github.com/PaddlePaddle/Paddle/pull/39490)) - - - assembly optimized Adam ([#39158](https://github.com/PaddlePaddle/Paddle/pull/39158)) - - - logsoftmax ([#39793](https://github.com/PaddlePaddle/Paddle/pull/39793)) - - - activation ([#40721](https://github.com/PaddlePaddle/Paddle/pull/40721)) - - - mul ([#38552](https://github.com/PaddlePaddle/Paddle/pull/38552)) - - - mean ([#37104](https://github.com/PaddlePaddle/Paddle/pull/37104)) - - - relu ([#36265](https://github.com/PaddlePaddle/Paddle/pull/36265)) - - - pool2d ([#37081](https://github.com/PaddlePaddle/Paddle/pull/37081)) - - - concat ([#35889](https://github.com/PaddlePaddle/Paddle/pull/35889)) - - - conv2d ([#38507](https://github.com/PaddlePaddle/Paddle/pull/38507), [#38938](https://github.com/PaddlePaddle/Paddle/pull/38938), [#36284](https://github.com/PaddlePaddle/Paddle/pull/36284)) - - - LayerNorm ([#40418](https://github.com/PaddlePaddle/Paddle/pull/40418)) - -- Add the 3-stage storage graph retrieval engine based on SSD - host memory - GPU device memory, to support large-scale graph neural network training. ([#42472](https://github.com/PaddlePaddle/Paddle/pull/42472), [#42321](https://github.com/PaddlePaddle/Paddle/pull/42321), [#42027](https://github.com/PaddlePaddle/Paddle/pull/42027)) - -- Add heterogeneous multi-cloud training communication module switch, implement the Send/Recv interface function, and support multiple heterogeneous cloud communication. ([#40965](https://github.com/PaddlePaddle/Paddle/pull/40965), [#40911](https://github.com/PaddlePaddle/Paddle/pull/40911)) - -### **(2) Function optimization** - -#### API - -- Add backward implementation of `paddle.linalg.det`. 
([#36013](https://github.com/PaddlePaddle/Paddle/pull/36013)) - -- Add support for mixed precision training O2 mode for `paddle.Model`, i.e., support for Pure FP16 training mode of the original dynamic/static graphs. ([#36441](https://github.com/PaddlePaddle/Paddle/pull/36441)) - -- Support for self chain calls for `paddle.nn.Layer`. ([#36609](https://github.com/PaddlePaddle/Paddle/pull/36609)) - -- Add settings of `is_distributed` property for the `to` method of `paddle.nn.Layer` to ensure that the distributed properties remain consistent before and after network parameter transform. ([#36221](https://github.com/PaddlePaddle/Paddle/pull/36221)) - -- Improve the parameter conversion logic of the `to` method of `paddle.nn.Layer`, to reduce the peak memory consumption of the conversion process and improve the conversion success rate. ([#36862](https://github.com/PaddlePaddle/Paddle/pull/36862)) - -- Support settings of the shape of the output Tensor for `paddle.incubate.graph_send_recv` to reduce the memory usage during the actual computation. ([#40509](https://github.com/PaddlePaddle/Paddle/pull/40509)) - -- Add the support of int32 and int64 data types for `paddle.incubate.segment_sum`, `segment_mean`, `segment_max`, and `segment_min`. ([#40577](https://github.com/PaddlePaddle/Paddle/pull/40577)) - -- Add the support of the bool type for transpose op. ([#35886](https://github.com/PaddlePaddle/Paddle/pull/35886)) - -- Switch the `paddle.mm` underlying operator from matmul to matmul_v2. ([#35770](https://github.com/PaddlePaddle/Paddle/pull/35770)) - -- Support static graph mode and support the unknown shape for `paddle.einsum`. ([#40360](https://github.com/PaddlePaddle/Paddle/pull/40360)) - -- Support data parallelism for `paddle.nn.functional.margin_cross_entropy` and `paddle.nn.functional.class_center_sample`. ([#39852](https://github.com/PaddlePaddle/Paddle/pull/39852)) - -- Support input of shape [1] for `paddle.nn.functional.grid_sample`. 
([#36183](https://github.com/PaddlePaddle/Paddle/pull/36183)) - -- Support NHWC data format for `paddle.nn.PRelu`. ([#37019](https://github.com/PaddlePaddle/Paddle/pull/37019)) - -- Support the fixed random state using `paddle.seed` for `paddle.nn.functional.class_center_sample`. ([#38248](https://github.com/PaddlePaddle/Paddle/pull/38248)) - -- Add ROCM backend support for all APIs under `paddle.fft`, and optimize CUFFT backend error messages. ([#36415](https://github.com/PaddlePaddle/Paddle/pull/36415), [#36114](https://github.com/PaddlePaddle/Paddle/pull/36114/files)) - -- Support the case where the slicing dimension is 0, that is, allow slicing index results to be empty. ([#37313](https://github.com/PaddlePaddle/Paddle/pull/37313)) - -- Support int and bool type Tensors with bool indexing for `Tensor.setitem`. ([#37761](https://github.com/PaddlePaddle/Paddle/pull/37761)) - -- Support nearest mode for `paddle.nn.functional.interpolate` when the input shape is 5D. ([#38868](https://github.com/PaddlePaddle/Paddle/pull/38868)) - -- Add the support of int16 for `paddle.nn.Embedding` and `paddle.gather`. ([#40964](https://github.com/PaddlePaddle/Paddle/pull/40964), [#40052](https://github.com/PaddlePaddle/Paddle/pull/40052)) - -- Support data parallelism on a single machine on the CPU platform in `paddle.distributed.spawn`. ([#35745](https://github.com/PaddlePaddle/Paddle/pull/35745), [#36758](https://github.com/PaddlePaddle/Paddle/pull/36758), [#36637](https://github.com/PaddlePaddle/Paddle/pull/36637)) - -- Add `depthwise_conv2d` MKLDNN operator. ([#38484](https://github.com/PaddlePaddle/Paddle/pull/38484)) - -- Add complex types check in the static graph mode for API `paddle.abs`, `paddle.transpose`, `paddle.squeeze`, `paddle.unsqueeze`, `paddle.matmul`, and `paddle.full`. ([#40113](https://github.com/PaddlePaddle/Paddle/pull/40113)) - -- Support tuple and list type arguments for `paddle.autograd.PyLayer`. 
([#38146](https://github.com/PaddlePaddle/Paddle/pull/38146)) - -- Add a check for whether a tensor is inplace and leaf when calculating gradients. ([#37931](https://github.com/PaddlePaddle/Paddle/pull/37931)) - -- Support HIP library for `paddle.autograd.PyLayer`. ([#38184](https://github.com/PaddlePaddle/Paddle/pull/38184)) - -- Support more size inputs for `paddle.take_along_axis` and `paddle.put_along_axis`, and allow index matrix shape size to be larger than array matrix shape size. ([#39072](https://github.com/PaddlePaddle/Paddle/pull/39072)) - -- Optimize the error report message of API `paddle.nn.Pad2D` when replicate is 0. ([#36510](https://github.com/PaddlePaddle/Paddle/pull/36510/files)) - -- Support pad input in tuple format for API `paddle.nn.Pad2D`. ([#35985](https://github.com/PaddlePaddle/Paddle/pull/35985/files)) - -- Add tdm_sample API in `paddle.distributed.InMemoryDataset` to support sampling operations in TDM algorithms. ([#37044](https://github.com/PaddlePaddle/Paddle/pull/37044)) - -- Add Pre-saving Hooks mechanism for `paddle.jit.save`. ([#38186](https://github.com/PaddlePaddle/Paddle/pull/38186)) - -- Add new higher-order differentiation-related APIs. - - - `elementwise_add`: add third-order Kernel, to support computation of third-order differentiation. ([#36508](https://github.com/PaddlePaddle/Paddle/pull/36508), [#36618](https://github.com/PaddlePaddle/Paddle/pull/36618)) - - - `matmul_v2`: add third-order Kernel, to support computation of third-order differentiation. ([#36459](https://github.com/PaddlePaddle/Paddle/pull/36459)) - - - `elementwise_mul`: add third-order Kernel, to support computation of third-order differentiation. ([#37152](https://github.com/PaddlePaddle/Paddle/pull/37547)) - -- Improve the logic of the `paddle.amp.GradScaler` to call check_finite_and_unscale op, to eliminate the cudaMemcpy introduced by the creation of the bool variable. 
([#37770](https://github.com/PaddlePaddle/Paddle/pull/37770)) - -- Add check for unstack and unique op in case of input Tensor with 0 elements. ([#36021](https://github.com/PaddlePaddle/Paddle/pull/36021)) - -- Add new multi-layer, bi-directional LSTM function that supports KUNLUNXIN 2, to improve RNN forward/backward ops, and support the use of temporal model training. ([#](https://github.com/PaddlePaddle/Paddle/pull/41781)[42076](https://github.com/PaddlePaddle/Paddle/pull/42076)) - -- Add bce_loss forward/backward ops for KUNLUNXIN 2. ([#41610](https://github.com/PaddlePaddle/Paddle/pull/41610)) - -- Add backward implementation of `paddle.linalg.det`. ([#36013](https://github.com/PaddlePaddle/Paddle/pull/36013)) - -#### IR(Intermediate Representation) - -- Dynamic Graphs to Static Graphs - - - Optimize the behavior of the `ProgramCache.last` interface for dynamic graph to static graph so that it returns the most recently used Program instead of the final generated Program. ([#39541](https://github.com/PaddlePaddle/Paddle/pull/39541)) - - - Optimize the error report message for the `paddle.reshape` API for dynamic graph to static graph, and add a new recommended usage hint. ([#40599](https://github.com/PaddlePaddle/Paddle/pull/40599)) - - - Optimize the type of exception caught in the `is_api_in_module` function when transcribing dynamic code to static code. ([#40243](https://github.com/PaddlePaddle/Paddle/pull/40243)) - - - Optimize the error message hints for dynamic graph to static graph, and hide warning information by default. ([#39730](https://github.com/PaddlePaddle/Paddle/pull/39730)) - - - Add the support of type hint syntax for dynamic graph to static graph to improve the accuracy of variable type analysis. ([#39572](https://github.com/PaddlePaddle/Paddle/pull/39572)) - - - Optimize the `paddle.cond` function to allow values to be equal for basic types such as bool and int. 
([#37888](https://github.com/PaddlePaddle/Paddle/pull/37888)) - - - Optimize the decorator `@to_static` to allow the switch of the train/eval mode. ([#37383](https://github.com/PaddlePaddle/Paddle/pull/37383)) - - - Optimize the error report stack for dynamic graph to static graph, to highlight user-related code and reduce redundant framework error stacks. ([#36741](https://github.com/PaddlePaddle/Paddle/pull/36741)) - - - Remove `no_value` placeholder from the return value of `paddle.cond`. ([#36513](https://github.com/PaddlePaddle/Paddle/pull/36513), [#36826](https://github.com/PaddlePaddle/Paddle/pull/36826)) - - - Adapt the run_program op to the new dynamic graph mode. ([#40198](https://github.com/PaddlePaddle/Paddle/pull/40198), [#40355](https://github.com/PaddlePaddle/Paddle/pull/40355)) - - - Add check for zip syntax. ([#37846](https://github.com/PaddlePaddle/Paddle/pull/37846)) - - - Fix the dynamic graph to static graph failure due to the error of dimension and type judgment in the `paddle.signal.frame`, `paddle.signal.stft` and `paddle.signal.istft`. ([#40113](https://github.com/PaddlePaddle/Paddle/pull/40113)) - - - Add registration of complex type Kernel for mean, pad3d ops. ([#40113](https://github.com/PaddlePaddle/Paddle/pull/40113)) - - -#### **Mixed Precision Training** - -- Add GPU Compute Capability environment check for amp. Add a usage warning for GPU environments that fail to accelerate training. ([#38086](https://github.com/PaddlePaddle/Paddle/pull/38086)) - -- Add check of calling order when using `paddle.amp.decorate` and `paddle.DataParallel` at the same time. ([#38785](https://github.com/PaddlePaddle/Paddle/pull/38785)) - - -#### **Distributed Training** - -- Basic functions of the distributed training - - - Optimize Fleet API and DistributedStrategy configuration to use dynamic graph parallel function conveniently. 
([#40408](https://github.com/PaddlePaddle/Paddle/pull/40408)) - - - Optimize Dynamic Graph mixed parallel HybridParallelClipGrad strategy, support 4D hybrid parallel and Pure FP16 training. ([#36237](https://github.com/PaddlePaddle/Paddle/pull/36237), [#36555](https://github.com/PaddlePaddle/Paddle/pull/36555)) - - - Restructure dynamic graph data parallel strategy, to support new dynamic graph and communication. ([#40389](https://github.com/PaddlePaddle/Paddle/pull/40389), [#40593](https://github.com/PaddlePaddle/Paddle/pull/40593), [#40836](https://github.com/PaddlePaddle/Paddle/pull/40836), [#41119](https://github.com/PaddlePaddle/Paddle/pull/41119), [#41413](https://github.com/PaddlePaddle/Paddle/pull/41413), [#39987](https://github.com/PaddlePaddle/Paddle/pull/39987)) - - - Support distributed tensor model parallel for fused_attention op. ([#40101](https://github.com/PaddlePaddle/Paddle/pull/40101)) - - - Support the distributed tensor model parallel for fused_feedforward op. ([#40160](https://github.com/PaddlePaddle/Paddle/pull/40160)) - -- Graph retrieval engine - - - Optimize the data format returned by the graph sampling interface of the graph engine, with a 3x improvement of the sampling speed. ([#37315](https://github.com/PaddlePaddle/Paddle/pull/37315)) - - - Reduce the amount of graph engine threads to improve performance. ([#37098](https://github.com/PaddlePaddle/Paddle/pull/37098)) - - - Optimize graph engine data transfer to improve performance. ([#37341](https://github.com/PaddlePaddle/Paddle/pull/37341)) - - - Optimize the merge logic of embedding op to improve performance by exploiting the topological relationship of embedding op in the model. [(#35942)](https://github.com/PaddlePaddle/Paddle/pull/35942) - -- Communication library: restructure the communication library to improve the scalability and development of the communication library, and support heterogeneous communication. 
([#41398](https://github.com/PaddlePaddle/Paddle/pull/41398), [#39720](https://github.com/PaddlePaddle/Paddle/pull/39720), [#40911](https://github.com/PaddlePaddle/Paddle/pull/40911), [#40579](https://github.com/PaddlePaddle/Paddle/pull/40579), [#40629](https://github.com/PaddlePaddle/Paddle/pull/40629), [#40437](https://github.com/PaddlePaddle/Paddle/pull/40437), [#40430](https://github.com/PaddlePaddle/Paddle/pull/40430), [#40228](https://github.com/PaddlePaddle/Paddle/pull/40228), [#40181](https://github.com/PaddlePaddle/Paddle/pull/40181), [#40100](https://github.com/PaddlePaddle/Paddle/pull/40100), [#40097](https://github.com/PaddlePaddle/Paddle/pull/40097), [#39892](https://github.com/PaddlePaddle/Paddle/pull/39892), [#39384](https://github.com/PaddlePaddle/Paddle/pull/39384), [#39737](https://github.com/PaddlePaddle/Paddle/pull/39737), [#40040](https://github.com/PaddlePaddle/Paddle/pull/40040)) - -- Support the publication of MoE-related interfaces in `paddle.incubate.distributed.models.moe` (`moe.GShardGate`, `moe.BaseGate`, `moe.SwitchGate`, `moe.MoELayer`, and `moe.ClipGradForMOEByGlobalNorm`). ([#42300](https://github.com/PaddlePaddle/Paddle/pull/42300)) - -- Fix the error report in the use of recomputing in `paddle.incubate.distributed.models.moe.MoELayer`. ([#42128](https://github.com/PaddlePaddle/Paddle/pull/42128)) - -- Fix the error report in the new dynamic graph pipeline parallel caused by different data types. ([#41937](https://github.com/PaddlePaddle/Paddle/pull/41937), [#42053](https://github.com/PaddlePaddle/Paddle/pull/42053)) - -- Fix the error report in the new dynamic graph tensor model parallel due to different data types. ([#41960](https://github.com/PaddlePaddle/Paddle/pull/41960)) - -#### **Custom operator** - -- Enhance the C++ custom operator mechanism for writing second-order gradient operators, to support adding suffixes to the gradient input variables of second-order gradient operators for use as outputs. 
([#41781](https://github.com/PaddlePaddle/Paddle/pull/41781)) - -- Remove the use of the deprecated enumeration type `PlaceType` from the Tensor API member methods, make it compatible, and add a deprecation warning. ([#41882](https://github.com/PaddlePaddle/Paddle/pull/41882)) - -- Add deprecation warnings for a number of deprecated interfaces of the original Tensor API, including the incomplete constructor, reshape, mutable_data, and copy_to methods. ([#41882](https://github.com/PaddlePaddle/Paddle/pull/41882)) - -#### **Other** - -- Error report and debugging optimization - - - Optimize the error message of the `label` boundary check for the cross_entropy op. ([#40001](https://github.com/PaddlePaddle/Paddle/pull/40001)) - - - Add profile record for `infer_shape` and `compute` methods of op execution of dynamic graphs, and show their cost in the timeline. ([#39023](https://github.com/PaddlePaddle/Paddle/pull/39023)) - - - Replace `pybind::index_error` error hint on Windows for unknown exceptions. ([#40538](https://github.com/PaddlePaddle/Paddle/pull/40538)) - - - Add the error message in the out-of-bounds checks for user scatter op. ([#37429](https://github.com/PaddlePaddle/Paddle/pull/37429)) - -- Download tool: For the problem of slow decompression of directories with multiple files in `paddle.utils.download.get_path_from_url`, replace the original way (traverse directory in loop) of decompressing files in directories one by one by calling extractall on the directory, which greatly improves the decompression speed. ([#37311](https://github.com/PaddlePaddle/Paddle/pull/37311)) - -- Speed up the quantization training for `fake_quantize_range_abs_max`, `fake_quantize_abs_max`, `fake_quantize_dequantize_abs_max`, `fake_quantize_moving_average_abs_max`, etc. 
([#40491](https://github.com/PaddlePaddle/Paddle/pull/40491)) - - -### **(3) Performance optimization** - -#### **Distributed Training** - -- Hybrid parallel optimizer `sharding_optimizer` supports `optimize_cast` optimization, which moves the parameter cast from the forward and backward stages to the optimizer stage. This improves performance by 7%. ([#35878](https://github.com/PaddlePaddle/Paddle/pull/35878)) - -- GPUPS optimization: support for gradient fuse allreduce training. This improves training performance by 20%. ([#35131](https://github.com/PaddlePaddle/Paddle/pull/35131)) - -- GPUPS optimization: dump CPU optimization speed improves by 3.21x. ([#40068](https://github.com/PaddlePaddle/Paddle/pull/40068)) - -- CPU parameter server streaming training optimization: support for automatic collection of sparse parameter statistics, incremental saving of sparse parameters, etc. The training performance improves by 20%. ([#36465](https://github.com/PaddlePaddle/Paddle/pull/36465), [#36601](https://github.com/PaddlePaddle/Paddle/pull/36601), [#36734](https://github.com/PaddlePaddle/Paddle/pull/36734), [#36909](https://github.com/PaddlePaddle/Paddle/pull/36909), [#36943](https://github.com/PaddlePaddle/Paddle/pull/36943), [#37181](https://github.com/PaddlePaddle/Paddle/pull/37181), [#37194](https://github.com/PaddlePaddle/Paddle/pull/37194), [#37515](https://github.com/PaddlePaddle/Paddle/pull/37515), [#37626](https://github.com/PaddlePaddle/Paddle/pull/37626), [#37995](https://github.com/PaddlePaddle/Paddle/pull/37995), [#38582](https://github.com/PaddlePaddle/Paddle/pull/38582), [#39250](https://github.com/PaddlePaddle/Paddle/pull/39250), [#40762](https://github.com/PaddlePaddle/Paddle/pull/40762), [#41234](https://github.com/PaddlePaddle/Paddle/pull/41234), [#41320](https://github.com/PaddlePaddle/Paddle/pull/41320), [#41400](https://github.com/PaddlePaddle/Paddle/pull/41400)) - -#### **Auto-tuning** - -Add hardware-aware automatic performance tuning for the full 
training process, with performance improvements of about 3% to 50% or more on image classification, segmentation, detection, and image generation tasks compared to the model's default configuration. The auto-tuning status is set via the `paddle.incubate.autotune.set_config` API. By default, it is currently disabled. Auto-tuning has three specific levels: - -- Add the auto-tuning function to `paddle.io.DataLoader`, to select the best num_workers based on training data and device resources. ([#42004](https://github.com/PaddlePaddle/Paddle/pull/42004)) - -- Add mixed-precision training data layout auto-tuning feature, to select the best data layout based on device type and data type, and automatically convert it at runtime. ([#41964](https://github.com/PaddlePaddle/Paddle/pull/41964)) - -- Add the automatic tuning of the required workspace size threshold for Conv, which is automatically set based on the GPU's currently available requested device memory resources. Add the automatic selection of Conv cuDNN algorithms based on the generic AlgorithmCache design and Kernel timing component, which supports models with variable-length data. ([#41833](https://github.com/PaddlePaddle/Paddle/pull/41833)) - -#### **Operator Optimization** - -- Optimize `FasterTokenizer` performance, with a 10% performance improvement compared to pre-optimization. ([#36701](https://github.com/PaddlePaddle/Paddle/pull/36701)) - -- Optimize `index_select` inverse computation, with 3.7~25.2x performance improvement over pre-optimization. ([#37055](https://github.com/PaddlePaddle/Paddle/pull/37055)) - -- Optimize the performance of `paddle.nn.ClipByGlobalNorm`. Take 10*10 `paddle.nn.Linear` as an example. In contrast to pre-optimization, the performance improves by about 30%. 
([#38209](https://github.com/PaddlePaddle/Paddle/pull/38209)) - -- Optimize the performance of `pnorm` with very large or very small `axis` dimensions, with 31-96x improvement in forward speed and 1.1-19x improvement in backward speed. ([#37685](https://github.com/PaddlePaddle/Paddle/pull/37685), [#38215](https://github.com/PaddlePaddle/Paddle/pull/38215), [#39011](https://github.com/PaddlePaddle/Paddle/pull/39011)) - -- Optimize `softmax` forward and backward performance, with a speedup ratio of about 2x for the `axis!=-1` configuration. ([#38602](https://github.com/PaddlePaddle/Paddle/pull/38602), [#38609](https://github.com/PaddlePaddle/Paddle/pull/38609), [#32387](https://github.com/PaddlePaddle/Paddle/pull/32387), [#37927](https://github.com/PaddlePaddle/Paddle/pull/37927/files)) - -- Optimize `log_softmax` forward and backward performance, with a speedup ratio of about 6x to 20x for `axis!=-1` configurations. ([#38992](https://github.com/PaddlePaddle/Paddle/pull/38992), [#40612](https://github.com/PaddlePaddle/Paddle/pull/40612)) - -- Optimize `softmax_with_cross_entropy` forward and backward performance, with a speedup ratio of about 1.3x for the `hard_label` configuration. ([#39553](https://github.com/PaddlePaddle/Paddle/pull/39553), [#40424](https://github.com/PaddlePaddle/Paddle/pull/40424), [#40643](https://github.com/PaddlePaddle/Paddle/pull/40643)) - -- Optimize `top_k` performance, with a speedup ratio of more than 22x for one-dimension and larger `k` (k=5000) configuration. ([#40941](https://github.com/PaddlePaddle/Paddle/pull/40941)) - -- Optimize `elementwise_mul` backward computation, with 1.85~12.16x performance improvement over pre-optimization. ([#37728](https://github.com/PaddlePaddle/Paddle/pull/37728)) - -- Optimize `elementwise_min` and `elementwise_max` backward computation, to equalize or improve performance by 1.05x to 18.75x over pre-optimization. 
([#38236](https://github.com/PaddlePaddle/Paddle/pull/38236), [#37906](https://github.com/PaddlePaddle/Paddle/pull/37906)) - -- Optimize `nearest_interp` forward and backward computation, with forward performance improvement by 1.5x to 2.3x over pre-optimization, and backward performance improvement by 60% to 1.8x over pre-optimization. ([#38528](https://github.com/PaddlePaddle/Paddle/pull/38528), [#39067](https://github.com/PaddlePaddle/Paddle/pull/39067)) - -- Optimize `bilinear_interp` forward and backward computation, with forward performance improvement by 0.4x to 2.3x over pre-optimization, and backward performance improvement by 10%-30% over pre-optimization. ([#39243](https://github.com/PaddlePaddle/Paddle/pull/39243), [#39423](https://github.com/PaddlePaddle/Paddle/pull/39423)) - -- Optimize `dropout` forward and backward computation, with performance improvement by about 20%. ([#39795](https://github.com/PaddlePaddle/Paddle/pull/39795), [#38859](https://github.com/PaddlePaddle/Paddle/pull/38859), [#38279](https://github.com/PaddlePaddle/Paddle/pull/38279), [#40053](https://github.com/PaddlePaddle/Paddle/pull/40053)) - -- Optimize `grid_sampler` forward and backward computation, with forward performance improvement by 10% to 30% over pre-optimization, and backward performance improvement by 10% to 60% over pre-optimization. ([#39751](https://github.com/PaddlePaddle/Paddle/pull/39751)) - -- Optimize `group_norm` forward and backward computation, with the forward performance improvement by 1.04x to 2.35x, and backward performance improvement by 1.12x to 1.18x. ([#39944](https://github.com/PaddlePaddle/Paddle/pull/39944), [#40657](https://github.com/PaddlePaddle/Paddle/pull/40657), [#39596](https://github.com/PaddlePaddle/Paddle/pull/39596)) - -- Optimize `conv1d` forward and backward computation, with the forward performance improvement by 1.00x to 2.01x, and backward performance improvement by 1.01x to 474.56x. 
([#38425](https://github.com/PaddlePaddle/Paddle/pull/38425)) - -- Optimize `elementwise_div` backward computation, with the backward performance improvement by 1.02x to 29.25x. ([#38044](https://github.com/PaddlePaddle/Paddle/pull/38044)) - -- Optimize `gelu` forward and backward computation, with the forward performance improvement by 1.13x to 1.43x, and backward performance improvement by 1.10x to 1.55x. ([#38188](https://github.com/PaddlePaddle/Paddle/pull/38188), [#38263](https://github.com/PaddlePaddle/Paddle/pull/38263)) - -- Optimize `elementwise_sub` backward computation, with the backward performance improvement by 1.04x to 15.64x. ([#37754](https://github.com/PaddlePaddle/Paddle/pull/37754)) - -- Optimize the forward performance of `flip` on one-dimensional data input, with the performance improvement by 100%. ([#37825](https://github.com/PaddlePaddle/Paddle/pull/37825)) - -- Optimize `layer_norm` forward and backward computation, with the forward performance improvement by 2x to 5x over pre-optimization, and backward performance improvement by 20% to 50% over pre-optimization. ([#39167](https://github.com/PaddlePaddle/Paddle/pull/39167), [#39247](https://github.com/PaddlePaddle/Paddle/pull/39247)) - -- Optimize `embedding` forward and backward computation, with a maximum improvement of 1.51x in forward performance and 1.03x to 7.79x in backward performance. ([#39856](https://github.com/PaddlePaddle/Paddle/pull/39856), [#39886](https://github.com/PaddlePaddle/Paddle/pull/39886)) - -- Optimize `gelu` FP16 forward and backward calculations, with forward performance improvement by 9% to 12% over pre-optimization, and backward performance improvement by 2% to 9% over pre-optimization. ([#38980](https://github.com/PaddlePaddle/Paddle/pull/38980)) - -- Remove CPU -> GPU explicit data transfer operation in `gather_nd` forward and backward operators, and remove the explicit synchronous operation in `index_select` forward and backward operators. 
Change GPU -> GPU data transfer in `scatter_nd` from synchronous operation to asynchronous operation. ([#40933](https://github.com/PaddlePaddle/Paddle/pull/40933)) - -- Optimize `Lars optimizer` computation, with the training performance improvement of Resnet50 FP16 model by 5.1% over pre-optimization. ([#35652](https://github.com/PaddlePaddle/Paddle/pull/35652), [#35476](https://github.com/PaddlePaddle/Paddle/pull/35476)) - -- Optimize `AvgPool2dGrad` computation, with the performance improvement by 2.6x over pre-optimization. ([#35389](https://github.com/PaddlePaddle/Paddle/pull/35389)) - -- Optimize `Elementwise` computation for multivariate output, improving performance by up to 15% over pre-optimization. ([#38329](https://github.com/PaddlePaddle/Paddle/pull/38329), [#38410](https://github.com/PaddlePaddle/Paddle/pull/38410)) - -- Optimize the probs computation of `Categorical`, simplify the computation logic, and improve the performance by 4x to 5x. ([#42178](https://github.com/PaddlePaddle/Paddle/pull/42178)) - -- Optimize the `paddle.sum` performance, with performance improvement by about 20%. ([#42309](https://github.com/PaddlePaddle/Paddle/pull/42309)) - -- Remove CudaStreamSync operation from `paddle.nn.ClipGradByGlobalNorm` to reduce scheduling overhead during execution, with 5% performance improvement on ptb models. ([#42170](https://github.com/PaddlePaddle/Paddle/pull/42170)) - -- Optimize a series of underlying data structures and detailed implementations in the original dynamic graph execution system to improve the scheduling performance of the original dynamic graph. 
([#42010](https://github.com/PaddlePaddle/Paddle/pull/42010), [#42171](https://github.com/PaddlePaddle/Paddle/pull/42171), [#42224](https://github.com/PaddlePaddle/Paddle/pull/42224), [#42256](https://github.com/PaddlePaddle/Paddle/pull/42256), [#42306](https://github.com/PaddlePaddle/Paddle/pull/42306), [#42329](https://github.com/PaddlePaddle/Paddle/pull/42329), [#42340](https://github.com/PaddlePaddle/Paddle/pull/42340), [#42368](https://github.com/PaddlePaddle/Paddle/pull/42368), [#42425](https://github.com/PaddlePaddle/Paddle/pull/42425)) - -- Simplify the probs calculation logic of `paddle.distribution.Categorical`, to improve performance by 4x to 5x. ([#42178](https://github.com/PaddlePaddle/Paddle/pull/42178)) - -### **(4) Bug fixing** - -#### API - -- Fix the output type error with `paddle.sum` when the input parameter type and output parameter type do not match and the number of reduce elements on the `axis` is 1. ([#36123](https://github.com/PaddlePaddle/Paddle/pull/36123)) - -- Fix an `AttributeError` in `paddle.flops` when the layer output type is tuple. ([#38850](https://github.com/PaddlePaddle/Paddle/pull/38850)) - -- Fix `paddle.diag` failing to propagate gradients because there is no backward kernel. ([#40447](https://github.com/PaddlePaddle/Paddle/pull/40447)) - -- Fix an error in `paddle.sort` when sorting input with NaN values. ([#41070](https://github.com/PaddlePaddle/Paddle/pull/41070)) - -- Fix the error when the input of `paddle.full_like` contains INF value. ([#40232](https://github.com/PaddlePaddle/Paddle/pull/40232)) - -- Fix the bug in `paddle.strided_slice`: the strided_slice result is not consistent with slice when the data in the input of starts is less than -rank. ([#39066](https://github.com/PaddlePaddle/Paddle/pull/39066)) - -- Fix the bug in the `max_pool` family of operators where infer_shape is calculated incorrectly when index is returned. 
This affects the APIs: `paddle.nn.functional.max_pool1d/2d/3d`, `paddle.nn.functional.adaptive_max_pool1d/2d/3d`, `paddle.nn.MaxPool1D/2D/3D`, `paddle.nn.AdaptiveMaxPool1D/2D/3D`. ([#40139](https://github.com/PaddlePaddle/Paddle/pull/40139)) - -- Fix an issue where the dtype of pooling_mask returned by the `max_pool` family of operators is incorrect. Now the dtype of pooling_mask is int32. The affected APIs are `paddle.nn.functional.max_pool1d/2d/3d`, `paddle.nn.functional.adaptive_max_pool1d/2d/3d`, `paddle.nn.MaxPool1D/2D/3D`, `paddle.nn.AdaptiveMaxPool1D/2D/3D`. ([#39314](https://github.com/PaddlePaddle/Paddle/pull/39314) ) - -- Fix the bug with `paddle.shape` where the backward gradient by default causes a computation error. ([#37340](https://github.com/PaddlePaddle/Paddle/pull/37340)) - -- Fix the bug in `paddle.nn.Layer's` `to` method when converting both dtype and place at the same time. ([#37007](https://github.com/PaddlePaddle/Paddle/pull/38007)) - -- Fix the bug that `paddle.amp.decorate` fails to rewrite the parameters of non-leaf network layers to FP16. ([#38402](https://github.com/PaddlePaddle/Paddle/pull/38402)) - -- Fix the bug that the `paddle.amp.decorate` rewrites the non-input parameter in `paddle.nn.BatchNorm1D`, `paddle.nn.BatchNorm2D`, and `paddle.nn.BatchNorm3D` to FP16. ([#38541](https://github.com/PaddlePaddle/Paddle/pull/38541)) - -- Fix the bug that the `paddle.amp.decorate` rewrites the non-input parameter in `paddle.nn.SyncBatchNorm` to FP16. ([#40943](https://github.com/PaddlePaddle/Paddle/pull/40943)) - -- Fix redundant warnings in `paddle.nn.Layer.to`. ([#36700](https://github.com/PaddlePaddle/Paddle/pull/36700)) - -- Fix the bug in `paddle.nn.RNN` when being used inside control flow. ([#41162](https://github.com/PaddlePaddle/Paddle/pull/41162)) - -- Fix the bug that the `paddle.to_tensor` fails to specify the CUDAPlace of the Tensor. 
([#39662](https://github.com/PaddlePaddle/Paddle/pull/39662)) - -- Fix the issue that`paddle.nn.Identity` is not exposed. ([#39615](https://github.com/PaddlePaddle/Paddle/pull/39615)) - -- Fix the bug where the output values of the `fill_` and `zero_` inplace APIs are incorrect when the input is on a CUDAPinned Place after dynamic graph reconstruction. ([#41229](https://github.com/PaddlePaddle/Paddle/pull/41229)) - -- After refactoring the dynamic graph, fix the bug of incorrect inplace version value of the output Tensor when calling assign op using the append op. Change it to call assign op using the `_C_ops`. ([#41118](https://github.com/PaddlePaddle/Paddle/pull/41118)) - -- Remove unreasonable codes in the `elementwise_add` 's third-order kernel, and fix an uninitialized issue in the network creation process. ([#36618](https://github.com/PaddlePaddle/Paddle/pull/36618)) - -- Fix the missing attribute bug in `conv2d` execution of cuDNN Kernel. ([#38827](https://github.com/PaddlePaddle/Paddle/pull/38827)) - -- Fix an issue where `multiclass_nms3` output shape is incorrect. ([#40059](https://github.com/PaddlePaddle/Paddle/pull/40059)) - -- Fix an issue with `yolo_box` outputting incorrect shape. ([#40056](https://github.com/PaddlePaddle/Paddle/pull/40056)) - -- Fix an issue where the higher-order differentiation `gradients` interface does not take effect as expected when target_grad is specified. ([#40940](https://github.com/PaddlePaddle/Paddle/pull/40940/)) - -- Fix an issue that the network parameter type is incorrect when the default_dtype is modified in the op`_BatchNormBase` base class in the dynamic graph mode. The affected APIs are `paddle.nn.BatchNorm1D`,`paddle.nn.BatchNorm2D`,`paddle.nn.BatchNorm3D`, and `paddle.nn.SyncBatchNorm`. Specific reason: when `get_default_dtype() == 'float16'`, the default parameter data type is modified by `set_default_dtype('float32')`. 
The parameter type in dynamic graph mode is created by default_dtype; therefore, the change of the default parameter type causes the subsequent networking Parameter type error. ([#36376](https://github.com/PaddlePaddle/Paddle/pull/36376)) - -- Fix the bug of the undefined intermediate variable in the backward op in batchnorm op in case that the data type is FP32 and the data dimension is `dims = 2 and data_layout = NHWC`. ([#37020](https://github.com/PaddlePaddle/Paddle/pull/37020)) - -- Fix the bug that shape of weights is incorrect, when using`paddle.static.nn.prelu` in static graph mode, and input format is`NHWC`, `mode==channel`. ([#38310](https://github.com/PaddlePaddle/Paddle/pull/38310)) - -- Fix the bug of `paddle.nn.functional.class_center_sample`: CUDA seed setting issue in multi-machine case. ([#38815](https://github.com/PaddlePaddle/Paddle/pull/38815)) - -- Fix the bug of failing to report error when the input of`paddle.nn.functional.one_hot`is incorrect. ([#41335](https://github.com/PaddlePaddle/Paddle/pull/41335)) - -- Fix an issue where a callback to reclaim device memory on a DCU device is not triggered in time, resulting in an OOM of the device memory. ([#40445](https://github.com/PaddlePaddle/Paddle/pull/40445)) - -- Fix the bugs of `setitem` backward gradient abnormal and inplace logic handling abnormal in some dynamic graph scenarios. ([#37023](https://github.com/PaddlePaddle/Paddle/pull/37023), [#38298](https://github.com/PaddlePaddle/Paddle/pull/38298)) - -- Fix the bug of index abnormal when Tensor array uses the Slice to index in the dynamic to static scenarios. ([#39251](https://github.com/PaddlePaddle/Paddle/pull/39251)) - -- Fix the bug of memory or device memory leaks caused by some temporary variables not being correctly destructed when `paddle.Tensor.register_hook` interface is used. 
([#40716](https://github.com/PaddlePaddle/Paddle/pull/40716)) - -- Fix the bug that `Tensor.getitem` cannot get the value when the index is a bool Tensor with all False. ([#41297](https://github.com/PaddlePaddle/Paddle/pull/41297)) - -- Fix the bug that `Tensor.getitem` cannot get the value when the index is a bool scalar Tensor. ([#40829](https://github.com/PaddlePaddle/Paddle/pull/40829)) - -- Fix the bug in `paddle.index_select` when index is a 0-shape Tensor. ([#41383](https://github.com/PaddlePaddle/Paddle/pull/41383)) - -- Fix the bug when the number of GPU threads requested by `paddle.index_select` and `paddle.index_sample` exceeds the limited machine resources. ([#41127](https://github.com/PaddlePaddle/Paddle/pull/41127), [#37816](https://github.com/PaddlePaddle/Paddle/pull/37816), [#39736](https://github.com/PaddlePaddle/Paddle/pull/39736), [#41563](https://github.com/PaddlePaddle/Paddle/pull/41563)) - -- Fix the bug when ReduceConfig, elemwise_grad, gather, gather_nd, and scatter ops request more GPU threads than the limited machine resources. ([#40813](https://github.com/PaddlePaddle/Paddle/pull/40813), [#41127](https://github.com/PaddlePaddle/Paddle/pull/41127)) - -- Fix the bug where memory access is out of bounds when NX != 1 in ReadData, ReadDataBc, and ReadDataReduce in the Kernel Primitive API. ([#36373](https://github.com/PaddlePaddle/Paddle/pull/36373)) - -- Fix the abnormal computation result due to data overflow caused by the IndexRandom data type error. ([#39867](https://github.com/PaddlePaddle/Paddle/pull/39867), [#39891](https://github.com/PaddlePaddle/Paddle/pull/39891)) - -- Fix the incorrect computing result returned by reduce op when reduce_num = 1. ([#38771](https://github.com/PaddlePaddle/Paddle/pull/38771)) - -- Fix the out-of-bounds memory access of reduce op in the middle dimension of reduce in HIP environments.
([#41273](https://github.com/PaddlePaddle/Paddle/pull/41273)) - -- Fix the bug where the Kernel fails to release properly in the computation of two FP16 one-dimensional vectors in matmul op. - -- Fix the bug caused by CUDA integer computation overflow for some operators, including: bernoulli, gaussian_random, gumbel_softmax, multinomial, truncated_gaussian_random, uniform_random_inplace, and uniform_random ops. ([#37670](https://github.com/PaddlePaddle/Paddle/pull/37670)) - -- Fix the bug where `paddle.nn.Sequential` reports a KeyError when traversing sublayers in a for loop. ([#39372](https://github.com/PaddlePaddle/Paddle/pull/39372)) - -- Fix the shape check error in `paddle.nn.functional.unfold` when compiling in static graphs. ([#38907](https://github.com/PaddlePaddle/Paddle/pull/38907), [#38819](https://github.com/PaddlePaddle/Paddle/pull/38819)) - -- Fix the bug of reporting an error if `axis` is specified when using dropout for static graphs. ([#37223](https://github.com/PaddlePaddle/Paddle/pull/37223)) - -- Migrate the matmul operator in `paddle.nn.MultiHeadAttention` to the matmul_v2 operator. ([#36222](https://github.com/PaddlePaddle/Paddle/pull/36222)) - -- Fix the FPE thrown when an empty Tensor is used in `paddle.nn.functional.label_smooth`. ([#35861](https://github.com/PaddlePaddle/Paddle/pull/35861)) - -- Fix the reshape op bug when the input is an empty Tensor. Support reshaping an empty Tensor to [-1]. ([#36087](https://github.com/PaddlePaddle/Paddle/pull/36087)) - -- Fix the bug where the modified values incorrectly override other rows when the `fill_diagonal` input parameter offset is non-zero. ([#36212](https://github.com/PaddlePaddle/Paddle/pull/36212)) - -- Change the stop_gradient of the output returned by the range op to be set to True in dynamic graph mode.
([#37486](https://github.com/PaddlePaddle/Paddle/pull/37486)) - -- Fix the bug where Lamb optimizer is updated incorrectly when Beta1Pow and Beta2Pow are on the GPU. ([#38518](https://github.com/PaddlePaddle/Paddle/pull/38518)) - -- Fix the bug where the conv2d operator does not respect FLAGS_cudnn_deterministic. ([#37173](https://github.com/PaddlePaddle/Paddle/pull/37173)) - -- Fix the bug caused by an earlier version of cufft that does not define CUFFT_VERSION. ([#37312](https://github.com/PaddlePaddle/Paddle/pull/37312)) - -- Fix the computing error of `paddle.ifftshift` and `paddle.fftshift`. ([#36834](https://github.com/PaddlePaddle/Paddle/pull/36834), [#36748](https://github.com/PaddlePaddle/Paddle/pull/36748)) - -- Fix the `axis` computation error in the `paddle.fft` series of APIs. ([#36321](https://github.com/PaddlePaddle/Paddle/pull/36321)) - -- Fix an output data type registration bug of batch_norm_grad op in case of FP16 data type. This bug causes compilation failure in some scenarios. It also affects FP16 computational precision. ([#42461](https://github.com/PaddlePaddle/Paddle/pull/42461)) - -- Fix the incorrect Infershape information bug in the `paddle.nn.functional.pad` API when the padding is a Tensor in dynamic to static conversion. ([#42414](https://github.com/PaddlePaddle/Paddle/pull/42414)) - -- Fix an exception in `paddle.distribution.StickBreakingTransform` when the input dimension exceeds 2. ([#41762](https://github.com/PaddlePaddle/Paddle/pull/41672)) - -- Fix a nan/inf bug calculated with QK^T in fused_attention op. ([#42032](https://github.com/PaddlePaddle/Paddle/pull/42032)) - -- Fix a nan/inf bug calculated in fused_attention op with FusedResidualDropoutBias on V100. ([#42398](https://github.com/PaddlePaddle/Paddle/pull/42398)) - -- Fix a redundant data transform bug introduced by the full_like op during execution.
([#41973](https://github.com/PaddlePaddle/Paddle/pull/41973)) - -- Fix a problem with p_norm op calculating nan on GPU environments. ([#41804](https://github.com/PaddlePaddle/Paddle/pull/41804)) - -- Fix a section error of split op when the sections parameter has a size of 0. ([#41755](https://github.com/PaddlePaddle/Paddle/pull/41755)) - -- Fix the bug of reporting not supporting Place (gpu:0) in multi-card training when broadcast is required in 6 elementwise ops (pow, complex, divide_double, multiply_double, fmax, and fmin). ([#42332](https://github.com/PaddlePaddle/Paddle/pull/42332)) - -- Fix the bug that the deprecated interface reports a warning in case of `import paddle` due to a PIL version update. ([#42307](https://github.com/PaddlePaddle/Paddle/pull/42307)) - -- Fix the bug that `paddle.linalg.matrix_rank ` does not support tol as FP64 Tensor under static graph. ([#42085](https://github.com/PaddlePaddle/Paddle/pull/42085)) - -#### IR(Intermediate Representation) - -- Dynamic to static graphs - - - Fix a type derivation error in reverse gradient accumulation when the `tensor_array` is used with the control flow. ([#39585](https://github.com/PaddlePaddle/Paddle/pull/39585), [#39689](https://github.com/PaddlePaddle/Paddle/pull/39689)) - - - Fix an issue where the parameter gradient type is not set correctly during dynamic to static AMP training. ([#40938](https://github.com/PaddlePaddle/Paddle/pull/40938)) - - - Fix an issue of reporting an error in the dynamic to static transcription when there are misplaced annotations in the codes. ([#39035](https://github.com/PaddlePaddle/Paddle/pull/39035), [#38003](https://github.com/PaddlePaddle/Paddle/pull/38003)) - - - Fix an issue where Tensor is not properly converted to Variable when calling a non-forward function in dynamic to static codes. 
([#37296](https://github.com/PaddlePaddle/Paddle/pull/37296), [#38540](https://github.com/PaddlePaddle/Paddle/pull/38540)) - - - Fix an issue where `paddle` is incorrectly passed as a variable during dynamic to static transcription. ([#37999](https://github.com/PaddlePaddle/Paddle/pull/37999)) - - - Fix an issue where model parameters are incorrectly counted when calling `paddle.flops` after model dynamic to static conversion. ([#36852](https://github.com/PaddlePaddle/Paddle/pull/36852)) - - - Fix an issue where GPU memory will keep growing in train mode and no_grad contexts after loading models using the `paddle.jit.save/load` interface. ([#36434](https://github.com/PaddlePaddle/Paddle/pull/36434)) - - - Add a warning in the convert_call function when converting a generator function. ([#35369](https://github.com/PaddlePaddle/Paddle/pull/35369)) - - - Fix the run_program op dependency analysis bug. ([#38470](https://github.com/PaddlePaddle/Paddle/pull/38470)) - - - Fix the code conversion bug when returning a single value in control flow For. ([#40683](https://github.com/PaddlePaddle/Paddle/pull/40683)) - - - Fix the bug when generating a reverse op when the input to conditional_block op contains LoDTensorArray. ([#39585](https://github.com/PaddlePaddle/Paddle/pull/39585)) - - - Fix the bug that `paddle.jit.save` loses the forward_pre_hook and forward_post_hook of the top Layer when exporting a dynamic-to-static graph model. ([#42273](https://github.com/PaddlePaddle/Paddle/pull/42273)) - - - Fix the dynamic to static conversion error reported when the shape parameter in `paddle.expand` contains a Tensor. ([#41973](https://github.com/PaddlePaddle/Paddle/pull/41973)) - - -#### **Distributed Training** - -- Distributed training basic functions - - - Fix the bug of a port reporting error in the distributed multi-machine training. ([#37274](https://github.com/PaddlePaddle/Paddle/pull/37274)) - - - Fix the brpc compilation dependency bug.
([#37064](https://github.com/PaddlePaddle/Paddle/pull/37064)) - - - Fix an occupied port issue due to tcp self-connections when Fleet starts. ([#38174](https://github.com/PaddlePaddle/Paddle/pull/38174)) - - - Fix the precision degradation bug under data parallel due to inconsistent initialization of FP16 parameters under multiple cards. ([#38838](https://github.com/PaddlePaddle/Paddle/pull/38838), [#38563](https://github.com/PaddlePaddle/Paddle/pull/38563), [#38405](https://github.com/PaddlePaddle/Paddle/pull/38405)) - - - Fix the precision degradation under data parallel due to FP16 gradient synchronization without dividing by the number of cards. ([#38378](https://github.com/PaddlePaddle/Paddle/pull/38378)) - -- Dynamic graph mixing parallel - - - Fix the bug where parameters are not updated in FP16 mode under mixed parallel by using the new update interface. ([#36017](https://github.com/PaddlePaddle/Paddle/pull/36017)) -- Static graph mixing parallel - - - Fix an issue where grad merge is not compatible with ClipGradientByGlobalNorm in distributed dp mode. ([#36334](https://github.com/PaddlePaddle/Paddle/pull/36334)) - - - Fix an issue under hybrid parallelism where the non-distributed parameters of tensor model parallelism are not broadcast during the initialization phase, resulting in inconsistent non-distributed parameters across cards. ([#36186](https://github.com/PaddlePaddle/Paddle/pull/36186)) - - - Fix the issue that sharding's save_persistables interface does not save FP16 parameters and offload persistent variables when sharding is enabled with offload. ([#40477](https://github.com/PaddlePaddle/Paddle/pull/40477)) - - - Fix the bug where ema parameters are not saved on non-0 cards when sharding is enabled for training. ([#39860](https://github.com/PaddlePaddle/Paddle/pull/39860)) - - - Fix an issue where FC incorrectly calculates gradients according to column cuts. 
([#38724](https://github.com/PaddlePaddle/Paddle/pull/38724)) - - - Fix the bug reported when DistributedStrategy is set to without_graph_optimizer when used with rnn. ([#36176](https://github.com/PaddlePaddle/Paddle/pull/36176)) - -- GPUPS Parameter Server Training - - - Fix the CPU branch compilation bug triggered by the GPUPS macro definition. ([#37248](https://github.com/PaddlePaddle/Paddle/pull/37248)) - - - Fix an occasional error raised when saving delta and pullsparse concurrency during GPUPS streamline training. ([#37233](https://github.com/PaddlePaddle/Paddle/pull/37233)) - - - Fix a download error issue caused by HDFSClient querying a directory without returning the full path. ([#36590](https://github.com/PaddlePaddle/Paddle/pull/36590)) - - - Fix the bug with pulling old parameters in GPUPS streamline training. ([#36512](https://github.com/PaddlePaddle/Paddle/pull/36512)) - - - Fix a GPUPS multi-stream allocation issue. ([#37476](https://github.com/PaddlePaddle/Paddle/pull/37476)) - - - Fix the bug of the GPUPS pybind out of core. ([#37287](https://github.com/PaddlePaddle/Paddle/pull/37287)) - - -#### **Other** - -- Fix the clip_extra issue when saving models for dynamic graph quantization training. ([#38323](https://github.com/PaddlePaddle/Paddle/pull/38323)) - -- Fix an issue with abs_max scale initialization for dynamic graph quantization training. ([#39307](https://github.com/PaddlePaddle/Paddle/pull/39307)) - -- Fix an issue of exceptions in saving model in dynamic graph quantization training. ([#38102](https://github.com/PaddlePaddle/Paddle/pull/38102), [#38012](https://github.com/PaddlePaddle/Paddle/pull/38012)) - -- Fix the offline quantization flatten op output error. ([#37722](https://github.com/PaddlePaddle/Paddle/pull/37722)) - -- Fix the non-matching dimension bug in case of inverse quantization matmul op. 
([#36982](https://github.com/PaddlePaddle/Paddle/pull/36982)) - -- Fix the bug of adding quantization op when quantizing matmul_v2 without weights. ([#36593](https://github.com/PaddlePaddle/Paddle/pull/36593)) - -- Fix the error of saving the quant_axis attribute in the conv op channel-wise quantization when saving the models. ([#39054](https://github.com/PaddlePaddle/Paddle/pull/39054)) - -- Fix the slow training of channel-wise quantization. ([#40772](https://github.com/PaddlePaddle/Paddle/pull/40772)) - -- Fix the bug of quantization training when dividing by tensor(initialized as 0) leads to nan. ([#36762](https://github.com/PaddlePaddle/Paddle/pull/36762)) - -- Fix incorrect settings of amp_level for mixed precision in multi-threaded scenarios. ([#39198](https://github.com/PaddlePaddle/Paddle/pull/39198)) - -- Fix an issue where PyLayer and Recompute is not set mixed precision correctly when mixed precision training is used with PyLayer and Recompute. ([#39950](https://github.com/PaddlePaddle/Paddle/pull/39950), [#40042](https://github.com/PaddlePaddle/Paddle/pull/40042)) - -- Fix an issue where `D_GLIBCXX_USE_CXX11_ABI` does not take effect when compiling custom operators under Mac. ([#37878](https://github.com/PaddlePaddle/Paddle/pull/37878)) - -- Fix the bug of inconsistent dynamic and static behaviors in case of block=None the initializer-related API. ([#37827](https://github.com/PaddlePaddle/Paddle/pull/37827)) - -- Fix the bug in python 3.6 where there is no fluid module. ([#35862](https://github.com/PaddlePaddle/Paddle/pull/35862)) - -- Fix the bug where optimizer `paddle.optimizer.Adamw` incorrectly calls adam op. ([#36028](https://github.com/PaddlePaddle/Paddle/pull/36028)) - -- Fix a logic error when the `paddle.optimizer.Momentum` optimizer parameter `regularizer` property is None under the multi tensor policy. 
([#38344](https://github.com/PaddlePaddle/Paddle/pull/38344)) - -- Fix the bug that the `paddle.optimizer.Momentum` and `paddle.optimizer.Adam` optimizers modify the `multi_precision` property under the multi tensor policy. ([#38991](https://github.com/PaddlePaddle/Paddle/pull/38991)) - -- Fix the code compilation error when using final-state API amp in combination with optional Tensor. ([#40980](https://github.com/PaddlePaddle/Paddle/pull/40980)) - -- Fix the bug where paddle+lite+xpu prediction library would report an error when calling lite CPU prediction, and fix the bug where paddle+lite (without NNAdapter) would report an error when compiling. ([#37449](https://github.com/PaddlePaddle/Paddle/pull/37449)) - -- Fix the bug in Debug compile mode where LoDTensorArray crashes due to inconsistent Pybind11 bindings. ([#37954](https://github.com/PaddlePaddle/Paddle/pull/37954)) - -- Fix the bug that prevents correct construction of Tensor in the extreme case where the shape parameter is a list mixing Tensor and int. ([#38284](https://github.com/PaddlePaddle/Paddle/pull/38284)) - -- Fix a compatibility issue with the `paddle.optimizer.AdamW` API. ([#37905](https://github.com/PaddlePaddle/Paddle/pull/37905)) - -- Fix the bug in _InstanceNormBase where the return value of extra_repr is incorrect. ([#38537](https://github.com/PaddlePaddle/Paddle/pull/38537)) - -- Fix the bug that Paddle Inference lacks the symbol `paddle::distributed::TensorTable` when -DWITH_DISTRIBUTED is used. ([#41128](https://github.com/PaddlePaddle/Paddle/pull/41128)) - -- Fix the bug where matmul_v2 op reports an error when there is a 0 value in the shape. ([#35791](https://github.com/PaddlePaddle/Paddle/pull/35791)) - -- Fix the repeated printing of the no-gradient-input hint message of recompute in dynamic graphs. It is now printed only once, as a warning.
([#38293](https://github.com/PaddlePaddle/Paddle/pull/38293)) - -- Fix the gelu op bug that causes low accuracy on the validation set in later training epochs of visual models. ([#38450](https://github.com/PaddlePaddle/Paddle/pull/38450)) - -- Fix adamw op error in numerical computation. ([#37746](https://github.com/PaddlePaddle/Paddle/pull/37746)) - -- Add the parameters in the sparse_momentum `_C_ops` interface. ([#39969](https://github.com/PaddlePaddle/Paddle/pull/39969)) - -- Fix the bug where there is no `distributed` module in python 3.6. ([#35848](https://github.com/PaddlePaddle/Paddle/pull/35848)) - -- Fix the eigh unit test data initialization problem. ([#39568](https://github.com/PaddlePaddle/Paddle/pull/39568)) - -- Fix the eigvalsh unit test data initialization problem. ([#39841](https://github.com/PaddlePaddle/Paddle/pull/39841)) - -- Fix the bug where segment op does not work properly on V100 due to excessive register usage. ([#38113](https://github.com/PaddlePaddle/Paddle/pull/38113)) - -- Fix the bug where the dimension is set incorrectly in conv-related op sparsification. ([#36054](https://github.com/PaddlePaddle/Paddle/pull/36054)) - -- Provide aliases for the Automatic SParsity training functions for static graphs under `paddle.static.sparsity`. ([#36525](https://github.com/PaddlePaddle/Paddle/pull/36525)) - -- Fix the bug where divide op’s integer division is still an integer. ([#40890](https://github.com/PaddlePaddle/Paddle/pull/40890)) - -- Fix the crash bug of `paddle.multiplex` when input Tensor value is 0. ([#34972](https://github.com/PaddlePaddle/Paddle/pull/34972)) - -- Fix the abnormal speed when the `reduction` parameter is set in `paddle.nn.functional.kl_div`. ([#37283](https://github.com/PaddlePaddle/Paddle/pull/37283)) - -- Fix the data source unsorted bug in loading the Cifar dataset. ([#37272](https://github.com/PaddlePaddle/Paddle/pull/37272)) - -- Fix the conversion of loss from uint16 to float in the ProgressBar class.
([#39231](https://github.com/PaddlePaddle/Paddle/pull/39231)) - -- Fix the ShareBufferWith shared data type problem. ([#37464](https://github.com/PaddlePaddle/Paddle/pull/37464), [#37247](https://github.com/PaddlePaddle/Paddle/pull/37247)) - -- Fix the performance issue when `paddle.io.DataLoader` uses IterableDataset and num_workers>0. ([#40541](https://github.com/PaddlePaddle/Paddle/pull/40541)) - -- Fix the bug where `paddle.vision.ops.yolo_loss` returns incomplete values in dynamic graph mode. ([#40185](https://github.com/PaddlePaddle/Paddle/pull/40185)) - -- Remove the restriction that the input parameter dataset of `paddle.io.BatchSampler` needs to be the `paddle.io.Dataset` type, to expand the support for user-defined datasets. ([#40184](https://github.com/PaddlePaddle/Paddle/pull/40184)) - -- Fix the bug of `paddle.summary` reporting that op_flops does not exist. ([#36489](https://github.com/PaddlePaddle/Paddle/pull/36489)) - -- Fix the formula error of lars_momentum op when lars_weight_decay=0. ([#40892](https://github.com/PaddlePaddle/Paddle/pull/40892)) - -- Fix the bug that optimizer-offload cannot save persistable vars. ([#36433](https://github.com/PaddlePaddle/Paddle/pull/36433)) - -- Fix an issue where optimizer-offload does not support adamw op type. ([#36432](https://github.com/PaddlePaddle/Paddle/pull/36432)) - -- Fix an issue where enable_program_desc_tracing_data in Tracer is not safe in multi-threaded scenarios. ([#39776](https://github.com/PaddlePaddle/Paddle/pull/39776)) - -- Fix an issue where the model file size is not initialized when the model is read. ([#40518](https://github.com/PaddlePaddle/Paddle/pull/40518)) - -- Fix the logic bug of the Expand op. When the dimension of the input Tensor X is smaller than the shape to be expanded, it may result in the incorrect Out.Shape.
([#38677](https://github.com/PaddlePaddle/Paddle/pull/38677)) - -- Fix the dynamic to static transcription error when the Expand_As op takes only y.shape without Y variable entered. ([#38677](https://github.com/PaddlePaddle/Paddle/pull/38677)) - -- Fix the logic error when Expand_As op computes the output shape. ([#38677](https://github.com/PaddlePaddle/Paddle/pull/38677)) - -- Fix the bug that the variables of the `core.VarDesc.VarType.STRINGS` type report an error when getting the `lod_level` property and setting its `lod_level` to None. ([#39077](https://github.com/PaddlePaddle/Paddle/pull/39077)) - -- Fix an issue where the framework function `PyLayer` does not support different dtypes. ([#37974](https://github.com/PaddlePaddle/Paddle/pull/37974)) - -- Fix the division-by-zero bug of the learning rate decay API `paddle.optimizer.lr.PolynomialDecay`. ([#38782](https://github.com/PaddlePaddle/Paddle/pull/38782)) - -- Fix the issue where some logs remained after calling the DisableGlogInfo() interface. ([#36356](https://github.com/PaddlePaddle/Paddle/pull/36356)) - -- Fix an error in backward of multi-layer RNN (when dropout is set to 0) in the training of SimpleRNN, GRU and LSTM API CPU. ([#37080](https://github.com/PaddlePaddle/Paddle/pull/37080)) - -- Add cache for fft on the backend of cufft and hipfft. ([#36646](https://github.com/PaddlePaddle/Paddle/pull/36646)) - -- Enable the shifts parameter of `paddle.roll` to accept a Tensor. ([#36727](https://github.com/PaddlePaddle/Paddle/pull/36727)) - -- Add onemkl to fft as an optional computation backend. ([#36414](https://github.com/PaddlePaddle/Paddle/pull/36414)) - -- Fix the precision bug in the bfloat16 type under two ops, matmul_v2 and elementwise_div. ([#42479](https://github.com/PaddlePaddle/Paddle/pull/42479)) - -- Fix a possible error in the next step caused by LoDTensorArray clearing only the internal Tensor and not clearing the Array during device memory recycling.
([#42398](https://github.com/PaddlePaddle/Paddle/pull/42398)) - - -## **4. Deployment Direction (Paddle Inference)** - -### **(1) New features** - -#### **New APIs** - -- Add the Java API so that Java developers can implement high performance inference on the server and in the cloud through a simple and flexible interface. ([#37162](https://github.com/PaddlePaddle/Paddle/pull/37162)) - -- Add `GetTrtCompileVersion` and `GetTrtRuntimeVersion` interfaces for getting TensorRT version information. ([#36429](https://github.com/PaddlePaddle/Paddle/pull/36429)) - -- Add the `ShareExternalData` interface to avoid memory copy of input data during inference. ([#39809](https://github.com/PaddlePaddle/Paddle/pull/39809)) - - -#### **New functions** - -- Add ONNX Runtime backend support. Currently it supports only CPU in the integrated version. ([#39988](https://github.com/PaddlePaddle/Paddle/pull/39988), [#40561](https://github.com/PaddlePaddle/Paddle/pull/40561)) - -- Add support for Ascend 310 inference based on the Paddle Lite subgraph approach. ([#35226](https://github.com/PaddlePaddle/Paddle/pull/35226)) - -- Add the native GPU FP16 inference. ([#40531](https://github.com/PaddlePaddle/Paddle/pull/40531)) - -- For the switch_ir_debug interface, add the dump model function. ([#36581](https://github.com/PaddlePaddle/Paddle/pull/36581)) - -- Add the configuration interface for TensorRT config: `void UpdateConfigInterleaved(paddle_infer::Config* c, bool with_interleaved)` for special data layout in int8 quantization inference. ([#38884](https://github.com/PaddlePaddle/Paddle/pull/38884)) - -- Add TensorRT inspector output information to the log. It is valid only for TensorRT 8.2 or later. ([#38362](https://github.com/PaddlePaddle/Paddle/pull/38362), [#38200](https://github.com/PaddlePaddle/Paddle/pull/38200)) - -- Add the support of the TensorRT ASP sparse inference.
([#36413](https://github.com/PaddlePaddle/Paddle/pull/36413)) - - -### **(2) Underlying optimization** - -#### **CPU performance optimization** - -- Optimize the caching mechanism of MKLDNN. ([#38336](https://github.com/PaddlePaddle/Paddle/pull/38336), [#36980](https://github.com/PaddlePaddle/Paddle/pull/36980), [#36695](https://github.com/PaddlePaddle/Paddle/pull/36695)) - -- Add matmul_scale_fuse pass. ([#37962](https://github.com/PaddlePaddle/Paddle/pull/37962)) - -- Add MKLDNN reshape_transpose_matmul_v2_mkldnn_fuse_pass. ([#37847](https://github.com/PaddlePaddle/Paddle/pull/37847), [#40948](https://github.com/PaddlePaddle/Paddle/pull/40948)) - -- Add MKLDNN conv_hard_sigmoid_mkldnn_fuse_pass. ([#36869](https://github.com/PaddlePaddle/Paddle/pull/36869)) - -- Add MKLDNN matmul_v2_transpose_reshape_fuse_pass. ([#36481](https://github.com/PaddlePaddle/Paddle/pull/36481)) - -- Add MKLDNN softplus_activation_mkldnn_fuse_pass. ([#36657](https://github.com/PaddlePaddle/Paddle/pull/36657)) - -- Add MKLDNN elt_act_mkldnn_fuse_pass. ([#36541](https://github.com/PaddlePaddle/Paddle/pull/36541)) - -- Add MKLDNN mish operator and conv_mish_mkldnn_fuse_pass. ([#38623](https://github.com/PaddlePaddle/Paddle/pull/38623)) - - -#### **GPU performance optimization** - -- Change the inference default video memory allocation policy from `naive_best_fit` to `auto_growth`, to solve the problem of some models filled up with the GPU video memory. ([#41491](https://github.com/PaddlePaddle/Paddle/pull/41491)) - -- Support gelu and FC+gelu ops using TensorRT inference. ([#38399](https://github.com/PaddlePaddle/Paddle/pull/38399)) - -- Support `deformable_conv` inference using TensorRT under static shape. ([#36612](https://github.com/PaddlePaddle/Paddle/pull/36612) [#36850](https://github.com/PaddlePaddle/Paddle/pull/36850) [#37345](https://github.com/PaddlePaddle/Paddle/pull/37345)) - -- Support nearest_interp_v2 op using TensorRT inference. 
([#34126](https://github.com/PaddlePaddle/Paddle/pull/34126)) - -- Add `yolo_box` TensorRT plugin to support input parameters `iou_aware` and `iou_aware_factor` so that the IoU computed by inference is used as a factor for confidence. ([#34128](https://github.com/PaddlePaddle/Paddle/pull/34128)) - -- Support `elementwise_sub` and `elementwise_div` calling for TensorRT inference. ([#40806](https://github.com/PaddlePaddle/Paddle/pull/40806) [#41253](https://github.com/PaddlePaddle/Paddle/pull/41253)) - -- Support `multiclass_nms3` using TensorRT inference. ([#41181](https://github.com/PaddlePaddle/Paddle/pull/41181) [#41344](https://github.com/PaddlePaddle/Paddle/pull/41344)) - -- Support flatten_contiguous_rang op using TensorRT inference. ([#38922](https://github.com/PaddlePaddle/Paddle/pull/38922)) - -- Support for `pool2d` attribute `padding` using TensorRT inference when dimension is 4, and `global_pooling` and `ceil_mode` are True. ([#39545](https://github.com/PaddlePaddle/Paddle/pull/39545)) - -- Support batch_norm and elementwise_add using TensorRT inference when dimension is 5. ([#36446](https://github.com/PaddlePaddle/Paddle/pull/36446)) - -- Add pool3d to use TensorRT inference. ([#36545](https://github.com/PaddlePaddle/Paddle/pull/36545), [#36783](https://github.com/PaddlePaddle/Paddle/pull/36783)) - -- Add the `reduce` int32 and float types to use TensorRT inference. Add `reduce_mean` GPU operator int32 and int64 registration. ([#39088](https://github.com/PaddlePaddle/Paddle/pull/39088)) - -- Modify MatmulV2ToMul pass. Modify the qualifier (not support of broadcast) and op_teller mapping condition. ([#36652](https://github.com/PaddlePaddle/Paddle/pull/36652)) - -- Add the support for TenorRT plugin interface AddPluginV2IOExt. ([#36493](https://github.com/PaddlePaddle/Paddle/pull/36493)) - -- Add the aligned attribute in roi_align op and support for TensorRT inference. 
([#38905](https://github.com/PaddlePaddle/Paddle/pull/38905)) - -- Add the support for TensorRT inference with concat attribute `axis = -1`. ([#39096](https://github.com/PaddlePaddle/Paddle/pull/39096)) - -- Add TensorRT plugin: preln_emb_eltwise_layernorm, preln_skip_la, and rnorm ops, for ERNIE-like model performance optimization. ([#39570](https://github.com/PaddlePaddle/Paddle/pull/39570)) - -- Add TensorRT fuse pass: preln_embedding_eltwise_layernorm_fuse_pass, preln_skip_layernorm_fuse_pass, for ERNIE-like model performance optimization. ([#39508](https://github.com/PaddlePaddle/Paddle/pull/39508)) - -- Split matmul fusion-related passes based on different backends (GPU, CPU, TensorRT), to support transpose function for FC weights. ([#39369](https://github.com/PaddlePaddle/Paddle/pull/39369)) - -- Add the support to TensorRT by roll, strided_slice, and slice op in case of dynamic shapes. ([#41913](https://github.com/PaddlePaddle/Paddle/pull/41913), [#41573](https://github.com/PaddlePaddle/Paddle/pull/41573), [#41467](https://github.com/PaddlePaddle/Paddle/pull/41467)) - -- Add div op support for TensorRT. ([#41243](https://github.com/PaddlePaddle/Paddle/pull/41243)) - -- Quantization support - - - For the `PostTrainingQuantization` API, add the support for `paddle.io.DataLoader` object or `Python Generator` input. ([#38686](https://github.com/PaddlePaddle/Paddle/pull/38686)) - - - ERNIE full quantization model inference supports for interleaved data layout. ([#39424](https://github.com/PaddlePaddle/Paddle/pull/39424)) - - - Support for PaddleSlim new quantile model format inference. ([#41049](https://github.com/PaddlePaddle/Paddle/pull/41049)) - - - Add matmul int8 quantization inference op converter and plugin. ([#37285](https://github.com/PaddlePaddle/Paddle/pull/37285)) - - - Add pass to determine if all ops in the model can support int8 quantization. 
([#36042](https://github.com/PaddlePaddle/Paddle/pull/36042)) - - - Support quantization inference for the FC part of the multihead attention of the non-variable-length branch. ([#39660](https://github.com/PaddlePaddle/Paddle/pull/39660)) - - -#### **Ascend NPU Related Features** - -- - Refactor shape operator forward computation logic to support execution on NPU. ([#39613](https://github.com/PaddlePaddle/Paddle/pull/39613)) - - - Refactor reshape operator forward computation logic to support ShapeTensor input. ([#38748](https://github.com/PaddlePaddle/Paddle/pull/38748)) - - - Uniform accuracy type when loading model weights. ([#39160](https://github.com/PaddlePaddle/Paddle/pull/39160)) - - -### **(3) Bug fixing** - -#### **Framework and API fixing** - -- Fix the bug of model clipping when saving static graphs. ([#37579](https://github.com/PaddlePaddle/Paddle/pull/37579)) - -- For the C API, add wrapper PD_Cstr for strings, and provide construction and destructing methods to avoid users to use C runtime library to destruct strings directly. ([#38667](https://github.com/PaddlePaddle/Paddle/pull/38667)) - -- Fix the logic bug with memory reuse at prediction time. ([#37324](https://github.com/PaddlePaddle/Paddle/pull/37324)) - -- Fix memory reuse error reporting in multi-threading. ([#37894](https://github.com/PaddlePaddle/Paddle/pull/37894)) - -- Allow passing empty strings for inference when no weight file is available. ([#38579](https://github.com/PaddlePaddle/Paddle/pull/38579)) - -- Fix an issue of clone not being supported when TensorRT dynamic shape is enabled. ([#38520](https://github.com/PaddlePaddle/Paddle/pull/38520)) - -- Fix multi-threaded clone error after TensorRT dynamic shape is enabled. ([#40067](https://github.com/PaddlePaddle/Paddle/pull/40067)) - -- Fix a TensorRT engine destructing issue. 
([#35842](https://github.com/PaddlePaddle/Paddle/pull/35842), [#35938](https://github.com/PaddlePaddle/Paddle/pull/35938)) - -- For the lite xpu interface, fix an issue where the xpu card cannot be selected. ([#36610](https://github.com/PaddlePaddle/Paddle/pull/36610)) - -- The TensorRT dynamic shape parameter automatically generate the interface, to add the file existence check. ([#36628](https://github.com/PaddlePaddle/Paddle/pull/36628)) - -- Fix the bug that the MKLDNN does not support conv3d. ([#42055](https://github.com/PaddlePaddle/Paddle/pull/42055)) - -#### **Backend Capability Fixing** - -- Fix cuDNN default algorithm selection configuration for prediction, with using non-deterministic policies. ([#41491](https://github.com/PaddlePaddle/Paddle/pull/41491)) - -- Fix the bug with deformable_conv op in TensorRT plugin resource recovery handling error. ([#38374](https://github.com/PaddlePaddle/Paddle/pull/38374)) - -- Fix a serialization error in the TensorRT plugin for deformable_conv op. ([#38057](https://github.com/PaddlePaddle/Paddle/pull/38057)) - -- Adapt the new refactor engine and serialization API of TensorRT 8.0. ([#36769](https://github.com/PaddlePaddle/Paddle/pull/36769)) - -- Fix the bug that the Flatten2MatmulFusePass, Squeeze2MatmulFusePass, and Reshape2MatmulFusePass do not take effect. ([#37644](https://github.com/PaddlePaddle/Paddle/pull/37644)) - -- Fix the bug with TensorRT input data reporting errors. ([#37427](https://github.com/PaddlePaddle/Paddle/pull/37427)) - -- Add error message when input dimension is wrong. ([#38962](https://github.com/PaddlePaddle/Paddle/pull/38962)) - -- Fix the bug with EmbEltwiseLayernorm output type error. ([#40015](https://github.com/PaddlePaddle/Paddle/pull/40015)) - -- Remove conv_affine_channel_fuse_pass and the corresponding unit test. ([#39817](https://github.com/PaddlePaddle/Paddle/pull/39817)) - -- Fix an issue where the adaptive_pool2d pass incorrectly replaces the pool attribute. 
([#39600](https://github.com/PaddlePaddle/Paddle/pull/39600)) - -- Fix the bug that shuffle_channel_detect_pass incorrectly generates shuffle_channel op. ([#39242](https://github.com/PaddlePaddle/Paddle/pull/39242)) - -- Fix transpose parameter error. ([#39006](https://github.com/PaddlePaddle/Paddle/pull/39006)) - -- Fix the crash bug when nearest_interp_v2 input scale dimension is less than 1. ([#38725](https://github.com/PaddlePaddle/Paddle/pull/38725)) - -- Fix the bug that the prelu does not support one-dimensional input in dynamic shape. ([#39389](https://github.com/PaddlePaddle/Paddle/pull/39389)) - -- Fix the bug in the kernel function of slice's special_slice_plugin. ([#39875](https://github.com/PaddlePaddle/Paddle/pull/39875)) - -- Temporarily disable int8 branch under skip_layernorm variable length to prevent accuracy degradation. ([#39991](https://github.com/PaddlePaddle/Paddle/pull/39991)) - -- Fix some bugs regarding support for preln_ernie models. ([#39733](https://github.com/PaddlePaddle/Paddle/pull/39733)) - -- Fix the bug that slice may exceed threads limit in ERNIE. Fix the bug that the spacial_slice is incorrectly triggered. ([#39096](https://github.com/PaddlePaddle/Paddle/pull/39096)) - -- Fix the bug that the elementwise does not support broadcast when the dimension is the same. ([#37908](https://github.com/PaddlePaddle/Paddle/pull/37908)) - -- Fix the problem that the underlying implementation is different in the nearest_interp op when align_corners is True and TensorRT layer results and native op have diff. ([#37525](https://github.com/PaddlePaddle/Paddle/pull/37525)) - -- Fix qkv_plugin: Kernel function computation error. ([#37096](https://github.com/PaddlePaddle/Paddle/pull/37096)) - -- Fix the bug with inference pass for dynamic quantization. ([#35879](https://github.com/PaddlePaddle/Paddle/pull/35879)) - -- Reuse directly when Tensor requests less memory than the allocated size. 
([#37880](https://github.com/PaddlePaddle/Paddle/pull/37880)) - -- Fix the hang bug when ERNIE fixed-length model is enabled with TensorRT. ([#37839](https://github.com/PaddlePaddle/Paddle/pull/37839)) - -- Fix the crash bug when TensorRT int8 lacks of dynamic range information. ([#36900](https://github.com/PaddlePaddle/Paddle/pull/36900)) - -- Fix the bug with slice deserialization code. ([#36588](https://github.com/PaddlePaddle/Paddle/pull/36588)) - -- Fix yolo box calculation formula error. ([#36240](https://github.com/PaddlePaddle/Paddle/pull/36240)) - -- Fix the crash bug when the earlier version model uses a later version of roi_align. ([#38788](https://github.com/PaddlePaddle/Paddle/pull/38788)) External Developers - -- Fix the bug of a large performance difference of softmax between python and C++. ([#37130](https://github.com/PaddlePaddle/Paddle/pull/37130)) - -- Fix matmul inference failure on static shape 2-dimensional input and dynamic shape 3-dimensional input. ([#36849](https://github.com/PaddlePaddle/Paddle/pull/36849)) - -- Fix reshape_transpose_matmul_mkldnn_fuse_pass mishandling of shapes. ([#36731](https://github.com/PaddlePaddle/Paddle/pull/36731)) - -- Fix an issue where TensorRT gets 4 dimensions when the input is 2 dimensions. ([#36614](https://github.com/PaddlePaddle/Paddle/pull/36614)) - -- Fix the bug report when the interpolate_v2 MKLDNN operator is null in the scale attribute. ([#36623](https://github.com/PaddlePaddle/Paddle/pull/36623)) - -- Fix poor performance of the recurrent operator in multi-threaded scenarios. ([#36052](https://github.com/PaddlePaddle/Paddle/pull/36052)) - -- Remove restrictions of relu, sigmoid, tanh, relu6, batch_norm, clip, concat, gelu, hard_sigmoid, prelu, softmax, split, and swish on TensorRT 2-dimensional inputs. ([#37097](https://github.com/PaddlePaddle/Paddle/pull/37097)) - -- Fix reshape op to use TensorRT inference. 
([#41090](https://github.com/PaddlePaddle/Paddle/pull/41090)) - -- Fix matmul related pass, which is compatible with matmul_v2. ([#36424](https://github.com/PaddlePaddle/Paddle/pull/36424)) - -- Support VALID and SAME attributes in the padding method of the conv2d operator when TensorRT is enabled. ([#38999](https://github.com/PaddlePaddle/Paddle/pull/38999)) - -- Fix MKLDNN multi-input operator quantization problem. ([#39593](https://github.com/PaddlePaddle/Paddle/pull/39593), [#39346](https://github.com/PaddlePaddle/Paddle/pull/39346), [#40717](https://github.com/PaddlePaddle/Paddle/pull/40717)) - -- Fix scale error of conv+activation in MKLDNN quantization scenarios. ([#38331](https://github.com/PaddlePaddle/Paddle/pull/38331)) - -- Fix the bug in MKLDNN quantization without parameters where the quantization of subsequent operators is handled differently. ([#39342](https://github.com/PaddlePaddle/Paddle/pull/39342)) - -- Fix a data type related issue in MKLDNN cpu_bfloat16_placement_pass. ([#38702](https://github.com/PaddlePaddle/Paddle/pull/38702)) - -- Fix a split operator execution issue in MKLDNN bfloat16 inference. ([#39548](https://github.com/PaddlePaddle/Paddle/pull/39548)) - -- Fix the bug with MKLDNN matmul_v2 operator not supporting 6 dimensions. ([#36342](https://github.com/PaddlePaddle/Paddle/pull/36342), [#38665](https://github.com/PaddlePaddle/Paddle/pull/38665)) - -- Fix MKLDNN DeviceContext error in MKLDNN matmul_v2_transpose_reshape. ([#38554](https://github.com/PaddlePaddle/Paddle/pull/38554)) - -- Fix incorrectly calculated results for segmentation models in MKLDNN inference scenarios. ([#37310](https://github.com/PaddlePaddle/Paddle/pull/37310)) - -- Fix MKLDNN bfloat16 placement operator list and add the missing operator. ([#36291](https://github.com/PaddlePaddle/Paddle/pull/36291)) - -- Fix the format bug of MKLDNN operators, including: FC, conv_transpose, 6-dimensional Tensor error reporting, and wrong output format of conv to NHWC input. 
([#38890](https://github.com/PaddlePaddle/Paddle/pull/38890), [#37344](https://github.com/PaddlePaddle/Paddle/pull/37344), [#37175](https://github.com/PaddlePaddle/Paddle/pull/37175), [#38553](https://github.com/PaddlePaddle/Paddle/pull/38553), [#40049](https://github.com/PaddlePaddle/Paddle/pull/40049), [#39097](https://github.com/PaddlePaddle/Paddle/pull/39097)) - -- Fix MKLDNN multi-threaded reasoning scenario error due to cache mechanism. ([#36290](https://github.com/PaddlePaddle/Paddle/pull/36290), [#35884](https://github.com/PaddlePaddle/Paddle/pull/35884)) - -- Fix MKLDNN quantization model accuracy anomaly caused by matmul and FC. ([#38023](https://github.com/PaddlePaddle/Paddle/pull/38023), [#37618](https://github.com/PaddlePaddle/Paddle/pull/37618)) - -- Fix the abnormal quantization model accuracy issue in MKLDNN quantization conversion scripts caused by missing passes. ([#37619](https://github.com/PaddlePaddle/Paddle/pull/37619), [#40542](https://github.com/PaddlePaddle/Paddle/pull/40542),[#38912](https://github.com/PaddlePaddle/Paddle/pull/38912)) - -- Fix the crash bug in MKLDNN enabling volume op due to data type mismatch. ([#38133](https://github.com/PaddlePaddle/Paddle/pull/38133)) - -- Fix an issue where some MKLDNN ops need to change back to the original layout after modifying the layout. ([#39422](https://github.com/PaddlePaddle/Paddle/pull/39422)) - -- Fix the bug of Python API error report due to conflict with Ascend software stack, because the GIL lock is not released in the Ascend 910 inference scenario. ([#38605](https://github.com/PaddlePaddle/Paddle/pull/38605)) - - -## **5. Environment Adaptation** - -### **Compile and Install** - -- From version 2.3.0, PaddlePaddle has adjusted and upgraded the types of GPU architectures supported by the framework. 
(For more information, please refer to: [GPU architectures supported by PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/zh/2.3rc/install/Tables.html#gpu)) - - -Notes: - -- PIP source installation means downloading the installation package and dependency libraries from the official PIP website using `pip install paddlepaddle` or `pip install paddlepaddle-gpu`. Compared with the BOS source, it supports fewer architecture types, has a lighter installation package, and provides only one CUDA version of the installation package. - - - Prior to version 2.3, the PIP source installer (CUDA10.2) supports the following GPU architectures: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, and 7.5. - - - Later than version 2.3, the PIP source installer (CUDA11.0) supports the following GPU architectures: 6.0, 6.1, 7.0, 7.5, and 8.0. - -- The BOS source is a way to download the installation package and dependency libraries from the official website of PaddlePaddle. The download source is hosted in China and is much faster. Compared with the PIP source, it supports more kinds of GPU architectures and provides multiple CUDA versions of installation packages. - - - Prior to version 2.3, the GPU architectures supported by the BOS source installer on the PaddlePaddle website: - - CUDA10: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5; +#### New Features - - CUDA11: 5.2, 6.0, 6.1, 7.0, 7.5, 8.0. +- Add support for Hygon DCU K100. [#63535](https://github.com/PaddlePaddle/Paddle/pull/63535) +- Support the complex64/128 data types and fusion operators such as fused_bias_residual_layernorm, fused_bias_dropout_residual_layer_norm, and rms_norm. [#63217](https://github.com/PaddlePaddle/Paddle/pull/63217) - - Later than version 2.3, the GPU architectures supported by the BOS source installer on the PaddlePaddle website: +#### Bug Fixing - - CUDA10: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5; +- Fix compilation errors in DTK and ROCm version upgrades.
[#62832](https://github.com/PaddlePaddle/Paddle/pull/62832),[#62931](https://github.com/PaddlePaddle/Paddle/pull/62931),[#61872](https://github.com/PaddlePaddle/Paddle/pull/61872),[#63738](https://github.com/PaddlePaddle/Paddle/pull/63738) - - CUDA11: 3.5, 5.0, 6.0, 6.1, 7.0, 7.5, 8.0. +## Environment Updates -- Support Python 3.10. Fix compilation bugs caused by some Python C API changes on Windows. ([#41180](https://github.com/PaddlePaddle/Paddle/pull/42180)) +In this PaddlePaddle version, we complete the release and update synchronization of the basic dependency libraries and remove old dependency libraries that are no longer updated. We complete a number of optimizations to improve compilation efficiency and compatibility, and improve the CI pipeline monitoring function to enhance the user installation experience. We also fix several known compilation problems, improve the compilation system of Paddle, and add some new features. Together, these optimizations further improve the compilation and installation experience of the PaddlePaddle framework for developers. -- The Windows platform supports the compilation through Visual Studio 2019. ([#38719](https://github.com/PaddlePaddle/Paddle/pull/38719)) +### New Support -- Eliminate various warnings when compiling on the Windows platform. ([#38034](https://github.com/PaddlePaddle/Paddle/pull/38034), [#37890](https://github.com/PaddlePaddle/Paddle/pull/37890), [#37442](https://github.com/PaddlePaddle/Paddle/pull/37442), [#37439](https://github.com/PaddlePaddle/Paddle/pull/37439), [#36857](https://github.com/PaddlePaddle/Paddle/pull/36857)) +- Support installing Paddle without relying on a local CUDA and cuDNN installation, improving the user installation experience.
[#60841](https://github.com/PaddlePaddle/Paddle/pull/60841),[#61973](https://github.com/PaddlePaddle/Paddle/pull/61973),[#61862](https://github.com/PaddlePaddle/Paddle/pull/61862),[#61235](https://github.com/PaddlePaddle/Paddle/pull/61235),[#61209](https://github.com/PaddlePaddle/Paddle/pull/61209),[#61653](https://github.com/PaddlePaddle/Paddle/pull/61653),[#64083](https://github.com/PaddlePaddle/Paddle/pull/64083) +- Support CUDA 12.3 completely. Complete the retirement of cuda10.2. [#63356](https://github.com/PaddlePaddle/Paddle/pull/63356),[#60299](https://github.com/PaddlePaddle/Paddle/pull/60299),[#64171](https://github.com/PaddlePaddle/Paddle/pull/64171),[#62189](https://github.com/PaddlePaddle/Paddle/pull/62189),[#63392](https://github.com/PaddlePaddle/Paddle/pull/63392),[#64228](https://github.com/PaddlePaddle/Paddle/pull/64228),[#62498](https://github.com/PaddlePaddle/Paddle/pull/62498),[#64298](https://github.com/PaddlePaddle/Paddle/pull/64298) +- Support Python 3.12 completely, bringing more powerful language features and performance optimizations. Complete the retirement of python3.7. [#59875](https://github.com/PaddlePaddle/Paddle/pull/59875),[#59877](https://github.com/PaddlePaddle/Paddle/pull/59877),[#59876](https://github.com/PaddlePaddle/Paddle/pull/59876) +- Upgrade of other paddle-dependent third-party libraries: [#63741](https://github.com/PaddlePaddle/Paddle/pull/63741),[#64447](https://github.com/PaddlePaddle/Paddle/pull/64447),[#60195](https://github.com/PaddlePaddle/Paddle/pull/60195),[#60110](https://github.com/PaddlePaddle/Paddle/pull/60110),[#61509](https://github.com/PaddlePaddle/Paddle/pull/61509) -- Fix jetson compilation issues introduced by the underlying data structure upgrade. ([#39669](https://github.com/PaddlePaddle/Paddle/pull/39669), [#39441](https://github.com/PaddlePaddle/Paddle/pull/39441)) +### Compilation Optimizations +- Optimize paddle's CMake codes, significantly improving compilation efficiency and experience. 
[#59995](https://github.com/PaddlePaddle/Paddle/pull/59995),[#60167](https://github.com/PaddlePaddle/Paddle/pull/60167),[#61052](https://github.com/PaddlePaddle/Paddle/pull/61052),[#59995](https://github.com/PaddlePaddle/Paddle/pull/59995),[#59607](https://github.com/PaddlePaddle/Paddle/pull/59607),[#63093](https://github.com/PaddlePaddle/Paddle/pull/63093),[#63887](https://github.com/PaddlePaddle/Paddle/pull/63887),[#62969](https://github.com/PaddlePaddle/Paddle/pull/62969),[#64007](https://github.com/PaddlePaddle/Paddle/pull/64007),[#59811](https://github.com/PaddlePaddle/Paddle/pull/59811),[#63045](https://github.com/PaddlePaddle/Paddle/pull/63045),[#60235](https://github.com/PaddlePaddle/Paddle/pull/60235),[#60240](https://github.com/PaddlePaddle/Paddle/pull/60240),[#60235](https://github.com/PaddlePaddle/Paddle/pull/60235),[#61411](https://github.com/PaddlePaddle/Paddle/pull/61411),[#61944](https://github.com/PaddlePaddle/Paddle/pull/61944),[#61961](https://github.com/PaddlePaddle/Paddle/pull/61961),[#59990](https://github.com/PaddlePaddle/Paddle/pull/59990),[#59478](https://github.com/PaddlePaddle/Paddle/pull/59478),[#61501](https://github.com/PaddlePaddle/Paddle/pull/61501),[#60066](https://github.com/PaddlePaddle/Paddle/pull/60066),[#64133](https://github.com/PaddlePaddle/Paddle/pull/64133),[#64231](https://github.com/PaddlePaddle/Paddle/pull/64231),[#60087](https://github.com/PaddlePaddle/Paddle/pull/60087),[#60348](https://github.com/PaddlePaddle/Paddle/pull/60348),[#60737](https://github.com/PaddlePaddle/Paddle/pull/60737),[#61364](https://github.com/PaddlePaddle/Paddle/pull/61364),[#63214](https://github.com/PaddlePaddle/Paddle/pull/63214),[#62454](https://github.com/PaddlePaddle/Paddle/pull/62454),[#62473](https://github.com/PaddlePaddle/Paddle/pull/62473),[#63692](https://github.com/PaddlePaddle/Paddle/pull/63692),[#63950](https://github.com/PaddlePaddle/Paddle/pull/63950) +- Support linking C++ unit tests against dynamic libraries on Linux and Windows, greatly
reducing the size of C++ unit test and the size of the entire build directory. [#60008](https://github.com/PaddlePaddle/Paddle/pull/60008),[#60960](https://github.com/PaddlePaddle/Paddle/pull/60960),[#60960](https://github.com/PaddlePaddle/Paddle/pull/60960),[#60961](https://github.com/PaddlePaddle/Paddle/pull/60961),[#60831](https://github.com/PaddlePaddle/Paddle/pull/60831),[#60832](https://github.com/PaddlePaddle/Paddle/pull/60832),[#60833](https://github.com/PaddlePaddle/Paddle/pull/60833),[#61372](https://github.com/PaddlePaddle/Paddle/pull/61372),[#60834](https://github.com/PaddlePaddle/Paddle/pull/60834),[#61374](https://github.com/PaddlePaddle/Paddle/pull/61374),[#61463](https://github.com/PaddlePaddle/Paddle/pull/61463),[#61376](https://github.com/PaddlePaddle/Paddle/pull/61376),[#60830](https://github.com/PaddlePaddle/Paddle/pull/60830),[#61373](https://github.com/PaddlePaddle/Paddle/pull/61373),[#61672](https://github.com/PaddlePaddle/Paddle/pull/61672),[#61375](https://github.com/PaddlePaddle/Paddle/pull/61375),[#61676](https://github.com/PaddlePaddle/Paddle/pull/61676),[#62036](https://github.com/PaddlePaddle/Paddle/pull/62036),[#61945](https://github.com/PaddlePaddle/Paddle/pull/61945),[#61675](https://github.com/PaddlePaddle/Paddle/pull/61675),[#61674](https://github.com/PaddlePaddle/Paddle/pull/61674),[#62773](https://github.com/PaddlePaddle/Paddle/pull/62773),[#61238](https://github.com/PaddlePaddle/Paddle/pull/61238),[#59988](https://github.com/PaddlePaddle/Paddle/pull/59988),[#60307](https://github.com/PaddlePaddle/Paddle/pull/60307),[#59612](https://github.com/PaddlePaddle/Paddle/pull/59612),[#59942](https://github.com/PaddlePaddle/Paddle/pull/59942),[#59968](https://github.com/PaddlePaddle/Paddle/pull/59968),[#59978](https://github.com/PaddlePaddle/Paddle/pull/59978),[#60121](https://github.com/PaddlePaddle/Paddle/pull/60121),[#60149](https://github.com/PaddlePaddle/Paddle/pull/60149),[#60161](https://github.com/PaddlePaddle/Paddle/pull/60161),[
#60160](https://github.com/PaddlePaddle/Paddle/pull/60160),[#60230](https://github.com/PaddlePaddle/Paddle/pull/60230),[#60154](https://github.com/PaddlePaddle/Paddle/pull/60154),[#60356](https://github.com/PaddlePaddle/Paddle/pull/60356),[#60392](https://github.com/PaddlePaddle/Paddle/pull/60392),[#60517](https://github.com/PaddlePaddle/Paddle/pull/60517),[#61131](https://github.com/PaddlePaddle/Paddle/pull/61131),[#60959](https://github.com/PaddlePaddle/Paddle/pull/60959) +- Add support for the Clang compiler. Users can now compile with Clang, enjoying faster compilation and better error messages. [#63382](https://github.com/PaddlePaddle/Paddle/pull/63382),[#63133](https://github.com/PaddlePaddle/Paddle/pull/63133),[#61705](https://github.com/PaddlePaddle/Paddle/pull/61705),[#63152](https://github.com/PaddlePaddle/Paddle/pull/63152),[#63373](https://github.com/PaddlePaddle/Paddle/pull/63373) -### **New Hardware Backend Extension** +### CI Pipeline Improvements -- Custom device support: provide a plug-in way to extend the PaddlePaddle hardware backend. With this function, developers do not need to modify PaddlePaddle code for specific hardware, but simply implement the standard interface and compile it into a dynamic link library that can be called by PaddlePaddle as a plug-in. This reduces the development effort of adding a new hardware backend to PaddlePaddle. Currently it supports custom Runtime and custom Kernel. +- Improve the merge-in code monitoring mechanism in the CI pipeline to ensure higher code quality and stability. Add a function monitoring module that monitors various indicators of the CI pipeline in real time, ensuring smooth execution of each stage so that issues are identified and resolved in a timely manner.
[#61384](https://github.com/PaddlePaddle/Paddle/pull/61384),[#62190](https://github.com/PaddlePaddle/Paddle/pull/62190),[#60758](https://github.com/PaddlePaddle/Paddle/pull/60758),[#60399](https://github.com/PaddlePaddle/Paddle/pull/60399),[#58623](https://github.com/PaddlePaddle/Paddle/pull/58623),[#62177](https://github.com/PaddlePaddle/Paddle/pull/62177),[#62361](https://github.com/PaddlePaddle/Paddle/pull/62361),[#62893](https://github.com/PaddlePaddle/Paddle/pull/62893),[#63705](https://github.com/PaddlePaddle/Paddle/pull/63705),[#64476](https://github.com/PaddlePaddle/Paddle/pull/64476),[#64752](https://github.com/PaddlePaddle/Paddle/pull/64752),[#64733](https://github.com/PaddlePaddle/Paddle/pull/64733),[#61914](https://github.com/PaddlePaddle/Paddle/pull/61914) -- Support Huawei NPU chip (Ascend910) training/inference. Support ResNet50, YoloV3, BERT, Transformer and many other models. Support static + dynamic graph and auto-mixed precision training. Support single card, and distribute training across multiple cards, multiple machines. +### Code Cleanup -- Support Graphcore IPU chip (including IPU Mk2 GC200 and Bow IPU) training/inference. Support ResNet50, BERT and other models. Support static graph training. Support single card, and distribute training across multiple cards, multiple machines. +- Remove some old codes. 
[#63580](https://github.com/PaddlePaddle/Paddle/pull/63580),[#62840](https://github.com/PaddlePaddle/Paddle/pull/62840),[#62886](https://github.com/PaddlePaddle/Paddle/pull/62886),[#63046](https://github.com/PaddlePaddle/Paddle/pull/63046),[#63004](https://github.com/PaddlePaddle/Paddle/pull/63004),[#63039](https://github.com/PaddlePaddle/Paddle/pull/63039),[#62733](https://github.com/PaddlePaddle/Paddle/pull/62733),[#62773](https://github.com/PaddlePaddle/Paddle/pull/62773),[#62768](https://github.com/PaddlePaddle/Paddle/pull/62768),[#62744](https://github.com/PaddlePaddle/Paddle/pull/62744),[#62861](https://github.com/PaddlePaddle/Paddle/pull/62861),[#62774](https://github.com/PaddlePaddle/Paddle/pull/62774),[#62851](https://github.com/PaddlePaddle/Paddle/pull/62851),[#62973](https://github.com/PaddlePaddle/Paddle/pull/62973),[#63273](https://github.com/PaddlePaddle/Paddle/pull/63273),[#62445](https://github.com/PaddlePaddle/Paddle/pull/62445),[#64382](https://github.com/PaddlePaddle/Paddle/pull/64382),[#64409](https://github.com/PaddlePaddle/Paddle/pull/64409),[#64391](https://github.com/PaddlePaddle/Paddle/pull/64391),[#64310](https://github.com/PaddlePaddle/Paddle/pull/64310),[#64348](https://github.com/PaddlePaddle/Paddle/pull/64348),[#64651](https://github.com/PaddlePaddle/Paddle/pull/64651),[#64709](https://github.com/PaddlePaddle/Paddle/pull/64709),[#61714](https://github.com/PaddlePaddle/Paddle/pull/61714),[#62109](https://github.com/PaddlePaddle/Paddle/pull/62109),[#61751](https://github.com/PaddlePaddle/Paddle/pull/61751),[#61691](https://github.com/PaddlePaddle/Paddle/pull/61691),[#61735](https://github.com/PaddlePaddle/Paddle/pull/61735) -- Support cambricon MLU chip (MLU370x4) training/inference. Support models such as ResNet50. Support static graph + dynamic graph training. Support auto-mixed precision training. Support single card, and distribute training across multiple cards, multiple machines. 
+### Bug Fixing -- Support KUNLUNXIN 2 chips (KUNLUNXIN AI acceleration cards R200, R300) training/inference. Support ResNet50, YoloV3, OCR-DB, SSD, MobilnetV3, UNet, BERT, Transformer, GPT-2, Wide&Deep, and DeepFM. Support static graph + dynamic graph training. Support auto-mixed precision training. Support single card, and distribute training across multiple cards, multiple machines. +- Fix several compilation issues of paddle framework. [#63297](https://github.com/PaddlePaddle/Paddle/pull/63297),[#62994](https://github.com/PaddlePaddle/Paddle/pull/62994),[#62651](https://github.com/PaddlePaddle/Paddle/pull/62651),[#64408](https://github.com/PaddlePaddle/Paddle/pull/64408),[#60934](https://github.com/PaddlePaddle/Paddle/pull/60934),[#62899](https://github.com/PaddlePaddle/Paddle/pull/62899),[#60528](https://github.com/PaddlePaddle/Paddle/pull/60528),[#63158](https://github.com/PaddlePaddle/Paddle/pull/63158),[#64549](https://github.com/PaddlePaddle/Paddle/pull/64549),[#62351](https://github.com/PaddlePaddle/Paddle/pull/62351),[#61259](https://github.com/PaddlePaddle/Paddle/pull/61259),[#61281](https://github.com/PaddlePaddle/Paddle/pull/61281),[#62304](https://github.com/PaddlePaddle/Paddle/pull/62304),[#60736](https://github.com/PaddlePaddle/Paddle/pull/60736),[#60811](https://github.com/PaddlePaddle/Paddle/pull/60811),[#63949](https://github.com/PaddlePaddle/Paddle/pull/63949),[#59892](https://github.com/PaddlePaddle/Paddle/pull/59892),[#60767](https://github.com/PaddlePaddle/Paddle/pull/60767),[#60856](https://github.com/PaddlePaddle/Paddle/pull/60856),[#61286](https://github.com/PaddlePaddle/Paddle/pull/61286),[#61638](https://github.com/PaddlePaddle/Paddle/pull/61638),[#62079](https://github.com/PaddlePaddle/Paddle/pull/62079),[#62142](https://github.com/PaddlePaddle/Paddle/pull/62142),[#62823](https://github.com/PaddlePaddle/Paddle/pull/62823),[#62814](https://github.com/PaddlePaddle/Paddle/pull/62814),[#62425](https://github.com/PaddlePaddle/Paddle/pull/624
25),[#62619](https://github.com/PaddlePaddle/Paddle/pull/62619),[#60207](https://github.com/PaddlePaddle/Paddle/pull/60207),[#60765](https://github.com/PaddlePaddle/Paddle/pull/60765),[#61870](https://github.com/PaddlePaddle/Paddle/pull/61870),[#61923](https://github.com/PaddlePaddle/Paddle/pull/61923),[#62144](https://github.com/PaddlePaddle/Paddle/pull/62144),[#62426](https://github.com/PaddlePaddle/Paddle/pull/62426),[#63848](https://github.com/PaddlePaddle/Paddle/pull/63848),[#60682](https://github.com/PaddlePaddle/Paddle/pull/60682),[#61369](https://github.com/PaddlePaddle/Paddle/pull/61369),[#62882](https://github.com/PaddlePaddle/Paddle/pull/62882),[#63944](https://github.com/PaddlePaddle/Paddle/pull/63944),[#64812](https://github.com/PaddlePaddle/Paddle/pull/64812),[#60654](https://github.com/PaddlePaddle/Paddle/pull/60654),[#60887](https://github.com/PaddlePaddle/Paddle/pull/60887),[#62058](https://github.com/PaddlePaddle/Paddle/pull/62058),[#64639](https://github.com/PaddlePaddle/Paddle/pull/64639),[#60115](https://github.com/PaddlePaddle/Paddle/pull/60115),[#61940](https://github.com/PaddlePaddle/Paddle/pull/61940),[#62614](https://github.com/PaddlePaddle/Paddle/pull/62614),[#59914](https://github.com/PaddlePaddle/Paddle/pull/59914),[#63762](https://github.com/PaddlePaddle/Paddle/pull/63762),[#60145](https://github.com/PaddlePaddle/Paddle/pull/60145),[#60285](https://github.com/PaddlePaddle/Paddle/pull/60285),[#60378](https://github.com/PaddlePaddle/Paddle/pull/60378),[#60393](https://github.com/PaddlePaddle/Paddle/pull/60393),[#61057](https://github.com/PaddlePaddle/Paddle/pull/61057),[#61058](https://github.com/PaddlePaddle/Paddle/pull/61058),[#61151](https://github.com/PaddlePaddle/Paddle/pull/61151),[#61347](https://github.com/PaddlePaddle/Paddle/pull/61347),[#61554](https://github.com/PaddlePaddle/Paddle/pull/61554),[#61844](https://github.com/PaddlePaddle/Paddle/pull/61844),[#62915](https://github.com/PaddlePaddle/Paddle/pull/62915),[#61852](https:/
/github.com/PaddlePaddle/Paddle/pull/61852),[#61704](https://github.com/PaddlePaddle/Paddle/pull/61704),[#61991](https://github.com/PaddlePaddle/Paddle/pull/61991),[#62264](https://github.com/PaddlePaddle/Paddle/pull/62264),[#62762](https://github.com/PaddlePaddle/Paddle/pull/62762),[#63820](https://github.com/PaddlePaddle/Paddle/pull/63820),[#63864](https://github.com/PaddlePaddle/Paddle/pull/63864),[#65017](https://github.com/PaddlePaddle/Paddle/pull/65017),[#61183](https://github.com/PaddlePaddle/Paddle/pull/61183),[#59866](https://github.com/PaddlePaddle/Paddle/pull/59866),[#61171](https://github.com/PaddlePaddle/Paddle/pull/61171),[#61290](https://github.com/PaddlePaddle/Paddle/pull/61290),[#61725](https://github.com/PaddlePaddle/Paddle/pull/61725),[#61614](https://github.com/PaddlePaddle/Paddle/pull/61614),[#61721](https://github.com/PaddlePaddle/Paddle/pull/61721),[#61494](https://github.com/PaddlePaddle/Paddle/pull/61494),[#61556](https://github.com/PaddlePaddle/Paddle/pull/61556),[#61689](https://github.com/PaddlePaddle/Paddle/pull/61689)
+## Documentation-related Bug Fixing
-## Thanks to our Contributors
+- Alongside the API feature enhancements, some API documentation has been fixed and improved. [#62875](https://github.com/PaddlePaddle/Paddle/pull/62875), [#59793](https://github.com/PaddlePaddle/Paddle/pull/59793), [#60002](https://github.com/PaddlePaddle/Paddle/pull/60002), [#59985](https://github.com/PaddlePaddle/Paddle/pull/59985), [#63365](https://github.com/PaddlePaddle/Paddle/pull/63365), [#60962](https://github.com/PaddlePaddle/Paddle/pull/60962), [#60942](https://github.com/PaddlePaddle/Paddle/pull/60942), [#64232](https://github.com/PaddlePaddle/Paddle/pull/64232), [#63255](https://github.com/PaddlePaddle/Paddle/pull/63255)
+- Update/supplement API documentation. 
bernoulli_ ([#64504](https://github.com/PaddlePaddle/Paddle/pull/64504)), paddle.static.ctr_metric_bundle ([#60912](https://github.com/PaddlePaddle/Paddle/pull/60912)), LayerNorm ([#62928](https://github.com/PaddlePaddle/Paddle/pull/62928)), Sequential ([#63128](https://github.com/PaddlePaddle/Paddle/pull/63128)), paddle.summary ([#63121](https://github.com/PaddlePaddle/Paddle/pull/63121)), ShardOptimizer in AutoParallel ([#62933](https://github.com/PaddlePaddle/Paddle/pull/62933)), paddle.nccl.version ([#62480](https://github.com/PaddlePaddle/Paddle/pull/62480))
+- Update the README file. [#59883](https://github.com/PaddlePaddle/Paddle/pull/59883),[#60691](https://github.com/PaddlePaddle/Paddle/pull/60691),[#60749](https://github.com/PaddlePaddle/Paddle/pull/60749)
+- Update mkldnn to onednn. [#63199](https://github.com/PaddlePaddle/Paddle/pull/63199),[#63202](https://github.com/PaddlePaddle/Paddle/pull/63202),[#63215](https://github.com/PaddlePaddle/Paddle/pull/63215),[#63209](https://github.com/PaddlePaddle/Paddle/pull/63209)
+- Fix document rendering bugs. [#59725](https://github.com/PaddlePaddle/Paddle/pull/59725),[#60306](https://github.com/PaddlePaddle/Paddle/pull/60306)
+- Fix many typos in the code to improve source readability. 
[#60093](https://github.com/PaddlePaddle/Paddle/pull/60093),[#60603](https://github.com/PaddlePaddle/Paddle/pull/60603),[#60631](https://github.com/PaddlePaddle/Paddle/pull/60631),[#60679](https://github.com/PaddlePaddle/Paddle/pull/60679),[#60741](https://github.com/PaddlePaddle/Paddle/pull/60741),[#60770](https://github.com/PaddlePaddle/Paddle/pull/60770),[#60784](https://github.com/PaddlePaddle/Paddle/pull/60784),[#60825](https://github.com/PaddlePaddle/Paddle/pull/60825),[#60857](https://github.com/PaddlePaddle/Paddle/pull/60857),[#60891](https://github.com/PaddlePaddle/Paddle/pull/60891),[#60921](https://github.com/PaddlePaddle/Paddle/pull/60921),[#60920](https://github.com/PaddlePaddle/Paddle/pull/60920),[#60923](https://github.com/PaddlePaddle/Paddle/pull/60923),[#60928](https://github.com/PaddlePaddle/Paddle/pull/60928),[#60940](https://github.com/PaddlePaddle/Paddle/pull/60940),[#60936](https://github.com/PaddlePaddle/Paddle/pull/60936),[#60932](https://github.com/PaddlePaddle/Paddle/pull/60932),[#60935](https://github.com/PaddlePaddle/Paddle/pull/60935),[#60931](https://github.com/PaddlePaddle/Paddle/pull/60931),[#60951](https://github.com/PaddlePaddle/Paddle/pull/60951),[#60964](https://github.com/PaddlePaddle/Paddle/pull/60964),[#60965](https://github.com/PaddlePaddle/Paddle/pull/60965),[#60967](https://github.com/PaddlePaddle/Paddle/pull/60967),[#60972](https://github.com/PaddlePaddle/Paddle/pull/60972),[#60971](https://github.com/PaddlePaddle/Paddle/pull/60971),[#60980](https://github.com/PaddlePaddle/Paddle/pull/60980),[#60984](https://github.com/PaddlePaddle/Paddle/pull/60984),[#60985](https://github.com/PaddlePaddle/Paddle/pull/60985),[#60989](https://github.com/PaddlePaddle/Paddle/pull/60989),[#60990](https://github.com/PaddlePaddle/Paddle/pull/60990),[#60991](https://github.com/PaddlePaddle/Paddle/pull/60991),[#60992](https://github.com/PaddlePaddle/Paddle/pull/60992),[#60994](https://github.com/PaddlePaddle/Paddle/pull/60994),[#60995](https://git
hub.com/PaddlePaddle/Paddle/pull/60995),[#60996](https://github.com/PaddlePaddle/Paddle/pull/60996),[#61001](https://github.com/PaddlePaddle/Paddle/pull/61001),[#61000](https://github.com/PaddlePaddle/Paddle/pull/61000),[#60999](https://github.com/PaddlePaddle/Paddle/pull/60999),[#60998](https://github.com/PaddlePaddle/Paddle/pull/60998),[#61026](https://github.com/PaddlePaddle/Paddle/pull/61026),[#61009](https://github.com/PaddlePaddle/Paddle/pull/61009),[#61034](https://github.com/PaddlePaddle/Paddle/pull/61034),[#61033](https://github.com/PaddlePaddle/Paddle/pull/61033),[#61020](https://github.com/PaddlePaddle/Paddle/pull/61020),[#61092](https://github.com/PaddlePaddle/Paddle/pull/61092),[#61066](https://github.com/PaddlePaddle/Paddle/pull/61066),[#61063](https://github.com/PaddlePaddle/Paddle/pull/61063),[#61089](https://github.com/PaddlePaddle/Paddle/pull/61089),[#61071](https://github.com/PaddlePaddle/Paddle/pull/61071),[#61129](https://github.com/PaddlePaddle/Paddle/pull/61129),[#61128](https://github.com/PaddlePaddle/Paddle/pull/61128),[#61126](https://github.com/PaddlePaddle/Paddle/pull/61126),[#61123](https://github.com/PaddlePaddle/Paddle/pull/61123),[#61113](https://github.com/PaddlePaddle/Paddle/pull/61113),[#61189](https://github.com/PaddlePaddle/Paddle/pull/61189),[#61175](https://github.com/PaddlePaddle/Paddle/pull/61175),[#61153](https://github.com/PaddlePaddle/Paddle/pull/61153),[#61198](https://github.com/PaddlePaddle/Paddle/pull/61198),[#61206](https://github.com/PaddlePaddle/Paddle/pull/61206),[#61256](https://github.com/PaddlePaddle/Paddle/pull/61256),[#61255](https://github.com/PaddlePaddle/Paddle/pull/61255),[#61251](https://github.com/PaddlePaddle/Paddle/pull/61251),[#61246](https://github.com/PaddlePaddle/Paddle/pull/61246),[#61245](https://github.com/PaddlePaddle/Paddle/pull/61245),[#61231](https://github.com/PaddlePaddle/Paddle/pull/61231),[#61247](https://github.com/PaddlePaddle/Paddle/pull/61247),[#61265](https://github.com/PaddlePaddle
/Paddle/pull/61265),[#61264](https://github.com/PaddlePaddle/Paddle/pull/61264),[#61266](https://github.com/PaddlePaddle/Paddle/pull/61266),[#61267](https://github.com/PaddlePaddle/Paddle/pull/61267),[#61268](https://github.com/PaddlePaddle/Paddle/pull/61268),[#61270](https://github.com/PaddlePaddle/Paddle/pull/61270),[#61334](https://github.com/PaddlePaddle/Paddle/pull/61334),[#61392](https://github.com/PaddlePaddle/Paddle/pull/61392),[#61404](https://github.com/PaddlePaddle/Paddle/pull/61404),[#61318](https://github.com/PaddlePaddle/Paddle/pull/61318),[#61383](https://github.com/PaddlePaddle/Paddle/pull/61383),[#61306](https://github.com/PaddlePaddle/Paddle/pull/61306),[#61324](https://github.com/PaddlePaddle/Paddle/pull/61324),[#61426](https://github.com/PaddlePaddle/Paddle/pull/61426),[#61390](https://github.com/PaddlePaddle/Paddle/pull/61390),[#61419](https://github.com/PaddlePaddle/Paddle/pull/61419),[#61420](https://github.com/PaddlePaddle/Paddle/pull/61420),[#61408](https://github.com/PaddlePaddle/Paddle/pull/61408),[#61425](https://github.com/PaddlePaddle/Paddle/pull/61425),[#61557](https://github.com/PaddlePaddle/Paddle/pull/61557),[#61628](https://github.com/PaddlePaddle/Paddle/pull/61628),[#61652](https://github.com/PaddlePaddle/Paddle/pull/61652),[#61602](https://github.com/PaddlePaddle/Paddle/pull/61602),[#61558](https://github.com/PaddlePaddle/Paddle/pull/61558),[#61660](https://github.com/PaddlePaddle/Paddle/pull/61660),[#61423](https://github.com/PaddlePaddle/Paddle/pull/61423),[#61627](https://github.com/PaddlePaddle/Paddle/pull/61627),[#61685](https://github.com/PaddlePaddle/Paddle/pull/61685),[#61690](https://github.com/PaddlePaddle/Paddle/pull/61690),[#61727](https://github.com/PaddlePaddle/Paddle/pull/61727),[#61738](https://github.com/PaddlePaddle/Paddle/pull/61738),[#61740](https://github.com/PaddlePaddle/Paddle/pull/61740),[#61741](https://github.com/PaddlePaddle/Paddle/pull/61741),[#61743](https://github.com/PaddlePaddle/Paddle/pull/61743),
[#61744](https://github.com/PaddlePaddle/Paddle/pull/61744),[#61745](https://github.com/PaddlePaddle/Paddle/pull/61745),[#61761](https://github.com/PaddlePaddle/Paddle/pull/61761),[#61762](https://github.com/PaddlePaddle/Paddle/pull/61762),[#61764](https://github.com/PaddlePaddle/Paddle/pull/61764),[#61767](https://github.com/PaddlePaddle/Paddle/pull/61767),[#61768](https://github.com/PaddlePaddle/Paddle/pull/61768),[#61774](https://github.com/PaddlePaddle/Paddle/pull/61774),[#61781](https://github.com/PaddlePaddle/Paddle/pull/61781),[#61783](https://github.com/PaddlePaddle/Paddle/pull/61783),[#61757](https://github.com/PaddlePaddle/Paddle/pull/61757),[#61732](https://github.com/PaddlePaddle/Paddle/pull/61732),[#61776](https://github.com/PaddlePaddle/Paddle/pull/61776),[#61780](https://github.com/PaddlePaddle/Paddle/pull/61780),[#61730](https://github.com/PaddlePaddle/Paddle/pull/61730),[#61728](https://github.com/PaddlePaddle/Paddle/pull/61728),[#61633](https://github.com/PaddlePaddle/Paddle/pull/61633),[#61720](https://github.com/PaddlePaddle/Paddle/pull/61720),[#61734](https://github.com/PaddlePaddle/Paddle/pull/61734),[#61779](https://github.com/PaddlePaddle/Paddle/pull/61779),[#61775](https://github.com/PaddlePaddle/Paddle/pull/61775),[#61773](https://github.com/PaddlePaddle/Paddle/pull/61773),[#61787](https://github.com/PaddlePaddle/Paddle/pull/61787),[#61687](https://github.com/PaddlePaddle/Paddle/pull/61687),[#61747](https://github.com/PaddlePaddle/Paddle/pull/61747),[#61760](https://github.com/PaddlePaddle/Paddle/pull/61760),[#61782](https://github.com/PaddlePaddle/Paddle/pull/61782),[#61800](https://github.com/PaddlePaddle/Paddle/pull/61800),[#61748](https://github.com/PaddlePaddle/Paddle/pull/61748),[#61772](https://github.com/PaddlePaddle/Paddle/pull/61772),[#61786](https://github.com/PaddlePaddle/Paddle/pull/61786),[#61880](https://github.com/PaddlePaddle/Paddle/pull/61880),[#61718](https://github.com/PaddlePaddle/Paddle/pull/61718),[#61742](https://git
hub.com/PaddlePaddle/Paddle/pull/61742),[#61766](https://github.com/PaddlePaddle/Paddle/pull/61766),[#61835](https://github.com/PaddlePaddle/Paddle/pull/61835),[#61838](https://github.com/PaddlePaddle/Paddle/pull/61838),[#61754](https://github.com/PaddlePaddle/Paddle/pull/61754),[#61833](https://github.com/PaddlePaddle/Paddle/pull/61833),[#61749](https://github.com/PaddlePaddle/Paddle/pull/61749),[#61938](https://github.com/PaddlePaddle/Paddle/pull/61938),[#61919](https://github.com/PaddlePaddle/Paddle/pull/61919),[#61924](https://github.com/PaddlePaddle/Paddle/pull/61924),[#61778](https://github.com/PaddlePaddle/Paddle/pull/61778),[#61839](https://github.com/PaddlePaddle/Paddle/pull/61839),[#61879](https://github.com/PaddlePaddle/Paddle/pull/61879),[#61929](https://github.com/PaddlePaddle/Paddle/pull/61929),[#61801](https://github.com/PaddlePaddle/Paddle/pull/61801),[#61788](https://github.com/PaddlePaddle/Paddle/pull/61788),[#61999](https://github.com/PaddlePaddle/Paddle/pull/61999),[#61928](https://github.com/PaddlePaddle/Paddle/pull/61928),[#61958](https://github.com/PaddlePaddle/Paddle/pull/61958),[#61982](https://github.com/PaddlePaddle/Paddle/pull/61982),[#61996](https://github.com/PaddlePaddle/Paddle/pull/61996),[#61953](https://github.com/PaddlePaddle/Paddle/pull/61953),[#61998](https://github.com/PaddlePaddle/Paddle/pull/61998),[#62003](https://github.com/PaddlePaddle/Paddle/pull/62003),[#61921](https://github.com/PaddlePaddle/Paddle/pull/61921),[#61881](https://github.com/PaddlePaddle/Paddle/pull/61881),[#61746](https://github.com/PaddlePaddle/Paddle/pull/61746),[#61955](https://github.com/PaddlePaddle/Paddle/pull/61955),[#62002](https://github.com/PaddlePaddle/Paddle/pull/62002),[#62001](https://github.com/PaddlePaddle/Paddle/pull/62001),[#61997](https://github.com/PaddlePaddle/Paddle/pull/61997),[#61765](https://github.com/PaddlePaddle/Paddle/pull/61765),[#61956](https://github.com/PaddlePaddle/Paddle/pull/61956),[#62004](https://github.com/PaddlePaddle
/Paddle/pull/62004),[#62044](https://github.com/PaddlePaddle/Paddle/pull/62044),[#62040](https://github.com/PaddlePaddle/Paddle/pull/62040),[#62043](https://github.com/PaddlePaddle/Paddle/pull/62043),[#62042](https://github.com/PaddlePaddle/Paddle/pull/62042),[#62041](https://github.com/PaddlePaddle/Paddle/pull/62041),[#62039](https://github.com/PaddlePaddle/Paddle/pull/62039),[#62019](https://github.com/PaddlePaddle/Paddle/pull/62019),[#61910](https://github.com/PaddlePaddle/Paddle/pull/61910),[#61882](https://github.com/PaddlePaddle/Paddle/pull/61882),[#61836](https://github.com/PaddlePaddle/Paddle/pull/61836),[#62013](https://github.com/PaddlePaddle/Paddle/pull/62013),[#62055](https://github.com/PaddlePaddle/Paddle/pull/62055),[#62047](https://github.com/PaddlePaddle/Paddle/pull/62047),[#62000](https://github.com/PaddlePaddle/Paddle/pull/62000),[#62048](https://github.com/PaddlePaddle/Paddle/pull/62048),[#62075](https://github.com/PaddlePaddle/Paddle/pull/62075),[#62038](https://github.com/PaddlePaddle/Paddle/pull/62038),[#62045](https://github.com/PaddlePaddle/Paddle/pull/62045),[#62105](https://github.com/PaddlePaddle/Paddle/pull/62105),[#62214](https://github.com/PaddlePaddle/Paddle/pull/62214),[#62212](https://github.com/PaddlePaddle/Paddle/pull/62212),[#62183](https://github.com/PaddlePaddle/Paddle/pull/62183),[#62182](https://github.com/PaddlePaddle/Paddle/pull/62182),[#62181](https://github.com/PaddlePaddle/Paddle/pull/62181),[#62179](https://github.com/PaddlePaddle/Paddle/pull/62179),[#62178](https://github.com/PaddlePaddle/Paddle/pull/62178),[#62172](https://github.com/PaddlePaddle/Paddle/pull/62172),[#62168](https://github.com/PaddlePaddle/Paddle/pull/62168),[#62163](https://github.com/PaddlePaddle/Paddle/pull/62163),[#62162](https://github.com/PaddlePaddle/Paddle/pull/62162),[#62161](https://github.com/PaddlePaddle/Paddle/pull/62161),[#62160](https://github.com/PaddlePaddle/Paddle/pull/62160),[#62046](https://github.com/PaddlePaddle/Paddle/pull/62046),
[#62175](https://github.com/PaddlePaddle/Paddle/pull/62175),[#62259](https://github.com/PaddlePaddle/Paddle/pull/62259),[#62258](https://github.com/PaddlePaddle/Paddle/pull/62258),[#62213](https://github.com/PaddlePaddle/Paddle/pull/62213),[#62260](https://github.com/PaddlePaddle/Paddle/pull/62260),[#62290](https://github.com/PaddlePaddle/Paddle/pull/62290),[#62288](https://github.com/PaddlePaddle/Paddle/pull/62288),[#62323](https://github.com/PaddlePaddle/Paddle/pull/62323),[#62319](https://github.com/PaddlePaddle/Paddle/pull/62319),[#62331](https://github.com/PaddlePaddle/Paddle/pull/62331),[#62330](https://github.com/PaddlePaddle/Paddle/pull/62330),[#62329](https://github.com/PaddlePaddle/Paddle/pull/62329),[#62324](https://github.com/PaddlePaddle/Paddle/pull/62324),[#62317](https://github.com/PaddlePaddle/Paddle/pull/62317),[#62311](https://github.com/PaddlePaddle/Paddle/pull/62311),[#62310](https://github.com/PaddlePaddle/Paddle/pull/62310),[#62308](https://github.com/PaddlePaddle/Paddle/pull/62308),[#62289](https://github.com/PaddlePaddle/Paddle/pull/62289),[#62307](https://github.com/PaddlePaddle/Paddle/pull/62307),[#62315](https://github.com/PaddlePaddle/Paddle/pull/62315),[#62406](https://github.com/PaddlePaddle/Paddle/pull/62406),[#62458](https://github.com/PaddlePaddle/Paddle/pull/62458),[#62459](https://github.com/PaddlePaddle/Paddle/pull/62459),[#62481](https://github.com/PaddlePaddle/Paddle/pull/62481),[#62465](https://github.com/PaddlePaddle/Paddle/pull/62465),[#62462](https://github.com/PaddlePaddle/Paddle/pull/62462),[#62453](https://github.com/PaddlePaddle/Paddle/pull/62453),[#62496](https://github.com/PaddlePaddle/Paddle/pull/62496),[#62457](https://github.com/PaddlePaddle/Paddle/pull/62457),[#62537](https://github.com/PaddlePaddle/Paddle/pull/62537),[#62514](https://github.com/PaddlePaddle/Paddle/pull/62514),[#62548](https://github.com/PaddlePaddle/Paddle/pull/62548),[#62544](https://github.com/PaddlePaddle/Paddle/pull/62544),[#62575](https://git
hub.com/PaddlePaddle/Paddle/pull/62575),[#62463](https://github.com/PaddlePaddle/Paddle/pull/62463),[#62643](https://github.com/PaddlePaddle/Paddle/pull/62643),[#62803](https://github.com/PaddlePaddle/Paddle/pull/62803),[#62924](https://github.com/PaddlePaddle/Paddle/pull/62924),[#63037](https://github.com/PaddlePaddle/Paddle/pull/63037),[#63102](https://github.com/PaddlePaddle/Paddle/pull/63102),[#63139](https://github.com/PaddlePaddle/Paddle/pull/63139),[#63092](https://github.com/PaddlePaddle/Paddle/pull/63092),[#63147](https://github.com/PaddlePaddle/Paddle/pull/63147),[#60518](https://github.com/PaddlePaddle/Paddle/pull/60518),[#60485](https://github.com/PaddlePaddle/Paddle/pull/60485),[#61273](https://github.com/PaddlePaddle/Paddle/pull/61273),[#63429](https://github.com/PaddlePaddle/Paddle/pull/63429),[#61954](https://github.com/PaddlePaddle/Paddle/pull/61954)
-This release contains contributions from the project core team as well as:
+## Others
-Adam Osewski, Allen Guo, arlesniak, chenenquan, chenyanlann, fengkuangxiaxia, fuqianya, fwenguang, guguguzi, helen88, houj04, Jacek Czaja, jakpiase, jianghaicheng, joanna.wozna.intel, joeqiao12, Leo Chen, Leo Guo, Li-fAngyU, lidanqing, Liyulingyue, Matsumoto GAO, maxhuiy, Ming-Xu Huang, Nyakku Shigure, piotrekobi, piotrekobiIntel, QingshuChen, qipengh, Skr Bang, Sylwester Fraczek, Sławomir Siwek, taixiurong, tanzhipeng, Tomasz Socha, TTerror, Webbley, yaozhixin, ykkk2333, yujun, Zhangjingyu06, zhangxiaoci, zhangyikun02, zhangyk0314, zlsh80826, zn, Zuza.
+Non-user-facing changes, including cleanup of deprecated code, cleanup of useless unit tests, and debugging or upgrading of the monitoring mechanism. 
[#63377](https://github.com/PaddlePaddle/Paddle/pull/63377),[#64106](https://github.com/PaddlePaddle/Paddle/pull/64106),[#64220](https://github.com/PaddlePaddle/Paddle/pull/64220),[#64293](https://github.com/PaddlePaddle/Paddle/pull/64293),[#64464](https://github.com/PaddlePaddle/Paddle/pull/64464),[#64944](https://github.com/PaddlePaddle/Paddle/pull/64944),[#63638](https://github.com/PaddlePaddle/Paddle/pull/63638),[#63732](https://github.com/PaddlePaddle/Paddle/pull/63732),[#63735](https://github.com/PaddlePaddle/Paddle/pull/63735),[#63826](https://github.com/PaddlePaddle/Paddle/pull/63826),[#63982](https://github.com/PaddlePaddle/Paddle/pull/63982),[#63737](https://github.com/PaddlePaddle/Paddle/pull/63737),[#64471](https://github.com/PaddlePaddle/Paddle/pull/64471),[#64574](https://github.com/PaddlePaddle/Paddle/pull/64574),[#64494](https://github.com/PaddlePaddle/Paddle/pull/64494),[#62775](https://github.com/PaddlePaddle/Paddle/pull/62775),[#63601](https://github.com/PaddlePaddle/Paddle/pull/63601),[#62564](https://github.com/PaddlePaddle/Paddle/pull/62564),[#63772](https://github.com/PaddlePaddle/Paddle/pull/63772),[#64719](https://github.com/PaddlePaddle/Paddle/pull/64719),[#61640](https://github.com/PaddlePaddle/Paddle/pull/61640),[#63459](https://github.com/PaddlePaddle/Paddle/pull/63459),[#64062](https://github.com/PaddlePaddle/Paddle/pull/64062),[#63480](https://github.com/PaddlePaddle/Paddle/pull/63480),[#63833](https://github.com/PaddlePaddle/Paddle/pull/63833),[#63673](https://github.com/PaddlePaddle/Paddle/pull/63673),[#63672](https://github.com/PaddlePaddle/Paddle/pull/63672),[#64131](https://github.com/PaddlePaddle/Paddle/pull/64131),[#64156](https://github.com/PaddlePaddle/Paddle/pull/64156),[#64155](https://github.com/PaddlePaddle/Paddle/pull/64155),[#64159](https://github.com/PaddlePaddle/Paddle/pull/64159),[#63902](https://github.com/PaddlePaddle/Paddle/pull/63902),[#64230](https://github.com/PaddlePaddle/Paddle/pull/64230),[#64229](https://gith
ub.com/PaddlePaddle/Paddle/pull/64229),[#64236](https://github.com/PaddlePaddle/Paddle/pull/64236),[#64260](https://github.com/PaddlePaddle/Paddle/pull/64260),[#64175](https://github.com/PaddlePaddle/Paddle/pull/64175),[#64250](https://github.com/PaddlePaddle/Paddle/pull/64250),[#64269](https://github.com/PaddlePaddle/Paddle/pull/64269),[#64238](https://github.com/PaddlePaddle/Paddle/pull/64238),[#64349](https://github.com/PaddlePaddle/Paddle/pull/64349),[#64394](https://github.com/PaddlePaddle/Paddle/pull/64394),[#64402](https://github.com/PaddlePaddle/Paddle/pull/64402),[#64401](https://github.com/PaddlePaddle/Paddle/pull/64401),[#64388](https://github.com/PaddlePaddle/Paddle/pull/64388),[#64329](https://github.com/PaddlePaddle/Paddle/pull/64329),[#64502](https://github.com/PaddlePaddle/Paddle/pull/64502),[#64501](https://github.com/PaddlePaddle/Paddle/pull/64501),[#64515](https://github.com/PaddlePaddle/Paddle/pull/64515),[#64503](https://github.com/PaddlePaddle/Paddle/pull/64503),[#64514](https://github.com/PaddlePaddle/Paddle/pull/64514),[#64601](https://github.com/PaddlePaddle/Paddle/pull/64601),[#64564](https://github.com/PaddlePaddle/Paddle/pull/64564),[#64012](https://github.com/PaddlePaddle/Paddle/pull/64012),[#64697](https://github.com/PaddlePaddle/Paddle/pull/64697),[#64682](https://github.com/PaddlePaddle/Paddle/pull/64682),[#64051](https://github.com/PaddlePaddle/Paddle/pull/64051),[#63267](https://github.com/PaddlePaddle/Paddle/pull/63267),[#63426](https://github.com/PaddlePaddle/Paddle/pull/63426),[#63626](https://github.com/PaddlePaddle/Paddle/pull/63626),[#63257](https://github.com/PaddlePaddle/Paddle/pull/63257),[#63266](https://github.com/PaddlePaddle/Paddle/pull/63266),[#63468](https://github.com/PaddlePaddle/Paddle/pull/63468),[#63262](https://github.com/PaddlePaddle/Paddle/pull/63262),[#63248](https://github.com/PaddlePaddle/Paddle/pull/63248),[#63241](https://github.com/PaddlePaddle/Paddle/pull/63241),[#63252](https://github.com/PaddlePaddle/
Paddle/pull/63252),[#63258](https://github.com/PaddlePaddle/Paddle/pull/63258),[#63235](https://github.com/PaddlePaddle/Paddle/pull/63235),[#63399](https://github.com/PaddlePaddle/Paddle/pull/63399),[#63488](https://github.com/PaddlePaddle/Paddle/pull/63488),[#63487](https://github.com/PaddlePaddle/Paddle/pull/63487),[#63466](https://github.com/PaddlePaddle/Paddle/pull/63466),[#63464](https://github.com/PaddlePaddle/Paddle/pull/63464),[#63483](https://github.com/PaddlePaddle/Paddle/pull/63483),[#63486](https://github.com/PaddlePaddle/Paddle/pull/63486),[#63475](https://github.com/PaddlePaddle/Paddle/pull/63475),[#63489](https://github.com/PaddlePaddle/Paddle/pull/63489),[#63470](https://github.com/PaddlePaddle/Paddle/pull/63470),[#63457](https://github.com/PaddlePaddle/Paddle/pull/63457),[#63493](https://github.com/PaddlePaddle/Paddle/pull/63493),[#63561](https://github.com/PaddlePaddle/Paddle/pull/63561),[#63584](https://github.com/PaddlePaddle/Paddle/pull/63584),[#63587](https://github.com/PaddlePaddle/Paddle/pull/63587),[#63586](https://github.com/PaddlePaddle/Paddle/pull/63586),[#63569](https://github.com/PaddlePaddle/Paddle/pull/63569),[#63559](https://github.com/PaddlePaddle/Paddle/pull/63559),[#63558](https://github.com/PaddlePaddle/Paddle/pull/63558),[#63555](https://github.com/PaddlePaddle/Paddle/pull/63555),[#63543](https://github.com/PaddlePaddle/Paddle/pull/63543),[#63589](https://github.com/PaddlePaddle/Paddle/pull/63589),[#63583](https://github.com/PaddlePaddle/Paddle/pull/63583),[#63565](https://github.com/PaddlePaddle/Paddle/pull/63565),[#63564](https://github.com/PaddlePaddle/Paddle/pull/63564),[#63265](https://github.com/PaddlePaddle/Paddle/pull/63265),[#63562](https://github.com/PaddlePaddle/Paddle/pull/63562),[#63591](https://github.com/PaddlePaddle/Paddle/pull/63591),[#63460](https://github.com/PaddlePaddle/Paddle/pull/63460),[#63238](https://github.com/PaddlePaddle/Paddle/pull/63238),[#63631](https://github.com/PaddlePaddle/Paddle/pull/63631),[
#63707](https://github.com/PaddlePaddle/Paddle/pull/63707),[#63714](https://github.com/PaddlePaddle/Paddle/pull/63714),[#63854](https://github.com/PaddlePaddle/Paddle/pull/63854),[#63929](https://github.com/PaddlePaddle/Paddle/pull/63929),[#63532](https://github.com/PaddlePaddle/Paddle/pull/63532),[#59628](https://github.com/PaddlePaddle/Paddle/pull/59628),[#62209](https://github.com/PaddlePaddle/Paddle/pull/62209),[#63742](https://github.com/PaddlePaddle/Paddle/pull/63742),[#60518](https://github.com/PaddlePaddle/Paddle/pull/60518),[#62078](https://github.com/PaddlePaddle/Paddle/pull/62078),[#62684](https://github.com/PaddlePaddle/Paddle/pull/62684),[#62723](https://github.com/PaddlePaddle/Paddle/pull/62723),[#64141](https://github.com/PaddlePaddle/Paddle/pull/64141),[#60404](https://github.com/PaddlePaddle/Paddle/pull/60404),[#64212](https://github.com/PaddlePaddle/Paddle/pull/64212),[#60652](https://github.com/PaddlePaddle/Paddle/pull/60652),[#64545](https://github.com/PaddlePaddle/Paddle/pull/64545),[#64477](https://github.com/PaddlePaddle/Paddle/pull/64477),[#64556](https://github.com/PaddlePaddle/Paddle/pull/64556),[#63160](https://github.com/PaddlePaddle/Paddle/pull/63160),[#63796](https://github.com/PaddlePaddle/Paddle/pull/63796),[#64693](https://github.com/PaddlePaddle/Paddle/pull/64693),[#64484](https://github.com/PaddlePaddle/Paddle/pull/64484),[#64677](https://github.com/PaddlePaddle/Paddle/pull/64677),[#64461](https://github.com/PaddlePaddle/Paddle/pull/64461),[#63189](https://github.com/PaddlePaddle/Paddle/pull/63189),[#63855](https://github.com/PaddlePaddle/Paddle/pull/63855),[#63896](https://github.com/PaddlePaddle/Paddle/pull/63896),[#63193](https://github.com/PaddlePaddle/Paddle/pull/63193),[#63200](https://github.com/PaddlePaddle/Paddle/pull/63200),[#63406](https://github.com/PaddlePaddle/Paddle/pull/63406),[#61283](https://github.com/PaddlePaddle/Paddle/pull/61283),[#63607](https://github.com/PaddlePaddle/Paddle/pull/63607),[#64486](https://gith
ub.com/PaddlePaddle/Paddle/pull/64486),[#64004](https://github.com/PaddlePaddle/Paddle/pull/64004),[#63132](https://github.com/PaddlePaddle/Paddle/pull/63132),[#63553](https://github.com/PaddlePaddle/Paddle/pull/63553),[#63572](https://github.com/PaddlePaddle/Paddle/pull/63572),[#63794](https://github.com/PaddlePaddle/Paddle/pull/63794),[#63919](https://github.com/PaddlePaddle/Paddle/pull/63919),[#63980](https://github.com/PaddlePaddle/Paddle/pull/63980),[#62917](https://github.com/PaddlePaddle/Paddle/pull/62917),[#64451](https://github.com/PaddlePaddle/Paddle/pull/64451),[#63541](https://github.com/PaddlePaddle/Paddle/pull/63541),[#63703](https://github.com/PaddlePaddle/Paddle/pull/63703),[#64536](https://github.com/PaddlePaddle/Paddle/pull/64536),[#63264](https://github.com/PaddlePaddle/Paddle/pull/63264),[#63335](https://github.com/PaddlePaddle/Paddle/pull/63335),[#63841](https://github.com/PaddlePaddle/Paddle/pull/63841),[#64628](https://github.com/PaddlePaddle/Paddle/pull/64628),[#63419](https://github.com/PaddlePaddle/Paddle/pull/63419),[#62210](https://github.com/PaddlePaddle/Paddle/pull/62210),[#63557](https://github.com/PaddlePaddle/Paddle/pull/63557),[#63064](https://github.com/PaddlePaddle/Paddle/pull/63064),[#61442](https://github.com/PaddlePaddle/Paddle/pull/61442),[#63537](https://github.com/PaddlePaddle/Paddle/pull/63537),[#63839](https://github.com/PaddlePaddle/Paddle/pull/63839),[#60927](https://github.com/PaddlePaddle/Paddle/pull/60927),[#60566](https://github.com/PaddlePaddle/Paddle/pull/60566),[#60842](https://github.com/PaddlePaddle/Paddle/pull/60842),[#64612](https://github.com/PaddlePaddle/Paddle/pull/64612),[#60047](https://github.com/PaddlePaddle/Paddle/pull/60047),[#63898](https://github.com/PaddlePaddle/Paddle/pull/63898),[#60415](https://github.com/PaddlePaddle/Paddle/pull/60415),[#60474](https://github.com/PaddlePaddle/Paddle/pull/60474),[#60439](https://github.com/PaddlePaddle/Paddle/pull/60439),[#60565](https://github.com/PaddlePaddle/
Paddle/pull/60565),[#64414](https://github.com/PaddlePaddle/Paddle/pull/64414),[#62526](https://github.com/PaddlePaddle/Paddle/pull/62526),[#54183](https://github.com/PaddlePaddle/Paddle/pull/54183),[#64096](https://github.com/PaddlePaddle/Paddle/pull/64096),[#61325](https://github.com/PaddlePaddle/Paddle/pull/61325),[#60629](https://github.com/PaddlePaddle/Paddle/pull/60629),[#61051](https://github.com/PaddlePaddle/Paddle/pull/61051),[#62103](https://github.com/PaddlePaddle/Paddle/pull/62103),[#63594](https://github.com/PaddlePaddle/Paddle/pull/63594),[#60968](https://github.com/PaddlePaddle/Paddle/pull/60968),[#64613](https://github.com/PaddlePaddle/Paddle/pull/64613),[#64073](https://github.com/PaddlePaddle/Paddle/pull/64073),[#63816](https://github.com/PaddlePaddle/Paddle/pull/63816),[#64416](https://github.com/PaddlePaddle/Paddle/pull/64416),[#62499](https://github.com/PaddlePaddle/Paddle/pull/62499),[#64531](https://github.com/PaddlePaddle/Paddle/pull/64531),[#63827](https://github.com/PaddlePaddle/Paddle/pull/63827),[#59885](https://github.com/PaddlePaddle/Paddle/pull/59885),[#59949](https://github.com/PaddlePaddle/Paddle/pull/59949),[#63428](https://github.com/PaddlePaddle/Paddle/pull/63428),[#63218](https://github.com/PaddlePaddle/Paddle/pull/63218),[#63538](https://github.com/PaddlePaddle/Paddle/pull/63538),[#64497](https://github.com/PaddlePaddle/Paddle/pull/64497),[#63082](https://github.com/PaddlePaddle/Paddle/pull/63082),[#64395](https://github.com/PaddlePaddle/Paddle/pull/64395),[#60183](https://github.com/PaddlePaddle/Paddle/pull/60183),[#63691](https://github.com/PaddlePaddle/Paddle/pull/63691),[#64428](https://github.com/PaddlePaddle/Paddle/pull/64428),[#64648](https://github.com/PaddlePaddle/Paddle/pull/64648),[#64650](https://github.com/PaddlePaddle/Paddle/pull/64650),[#59926](https://github.com/PaddlePaddle/Paddle/pull/59926),[#59750](https://github.com/PaddlePaddle/Paddle/pull/59750),[#60080](https://github.com/PaddlePaddle/Paddle/pull/60080),[
#60208](https://github.com/PaddlePaddle/Paddle/pull/60208),[#64124](https://github.com/PaddlePaddle/Paddle/pull/64124),[#64187](https://github.com/PaddlePaddle/Paddle/pull/64187),[#64166](https://github.com/PaddlePaddle/Paddle/pull/64166),[#64284](https://github.com/PaddlePaddle/Paddle/pull/64284),[#64253](https://github.com/PaddlePaddle/Paddle/pull/64253),[#64555](https://github.com/PaddlePaddle/Paddle/pull/64555),[#59878](https://github.com/PaddlePaddle/Paddle/pull/59878),[#64081](https://github.com/PaddlePaddle/Paddle/pull/64081)