-
-
@@ -706,17 +461,17 @@ platform tag: 类似 'linux_x86_64', 'any'
cd /home/work
```
```
-docker run -it -v $PWD:/work registry.baidubce.com/paddlepaddle/paddle /work/train.py
+docker run -it -v $PWD:/work ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle /work/train.py
```
上述命令中,`-it` 参数说明容器已交互式运行;`-v $PWD:/work`
指定将当前路径(Linux 中 PWD 变量会展开为当前路径的绝对路径)挂载到容器内部的:`/work`
-目录: `registry.baidubce.com/paddlepaddle/paddle` 指定需要使用的容器; 最后`/work/train.py`为容器内执行的命令,即运行训练程序。
+目录: `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle` 指定需要使用的容器; 最后`/work/train.py`为容器内执行的命令,即运行训练程序。
当然,您也可以进入到 Docker 容器中,以交互式的方式执行或调试您的代码:
```
-docker run -it -v $PWD:/work registry.baidubce.com/paddlepaddle/paddle /bin/bash
+docker run -it -v $PWD:/work ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle /bin/bash
```
```
cd /work
@@ -740,13 +495,13 @@ PaddlePaddle Book 是为用户和开发者制作的一个交互式的 Jupyter No
我们提供可以直接运行 PaddlePaddle Book 的 Docker 镜像,直接运行:
```
-docker run -p 8888:8888 registry.baidubce.com/paddlepaddle/book
+docker run -p 8888:8888 ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/book
```
国内用户可以使用下面的镜像源来加速访问:
```
-docker run -p 8888:8888 registry.baidubce.com/paddlepaddle/book
+docker run -p 8888:8888 ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/book
```
然后在浏览器中输入以下网址:
@@ -765,7 +520,7 @@ http://localhost:8888/
请不要忘记提前在物理机上安装 GPU 最新驱动。
```
-nvidia-docker run -it -v $PWD:/work registry.baidubce.com/paddlepaddle/paddle:latest-gpu /bin/bash
+nvidia-docker run -it -v $PWD:/work ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-gpu /bin/bash
```
**注: 如果没有安装 nvidia-docker,可以尝试以下的方法,将 CUDA 库和 Linux 设备挂载到 Docker 容器内:**
@@ -775,5 +530,5 @@ export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') \
$(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')"
export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
docker run ${CUDA_SO} \
-${DEVICES} -it registry.baidubce.com/paddlepaddle/paddle:latest-gpu
+${DEVICES} -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-gpu
```
diff --git a/docs/install/Tables_en.md b/docs/install/Tables_en.md
index 9a651c5cd0a..2949939f18c 100644
--- a/docs/install/Tables_en.md
+++ b/docs/install/Tables_en.md
@@ -282,11 +282,11 @@ PaddePaddle implements references to various BLAS/CUDA/cuDNN libraries by specif
-
paddlepaddle==[version code] such as paddlepaddle==2.6.1
+
paddlepaddle==[version code] such as paddlepaddle==3.0.0b2
Only support the corresponding version of the CPU PaddlePaddle, please refer to Pypi for the specific version.
-
paddlepaddle-gpu==[version code], such as paddlepaddle-gpu==2.6.1
+
paddlepaddle-gpu==[version code], such as paddlepaddle-gpu==3.0.0b2
The default installation supports the PaddlePaddle installation package corresponding to [version number] of CUDA 11.2 and cuDNN 8
@@ -294,9 +294,8 @@ PaddePaddle implements references to various BLAS/CUDA/cuDNN libraries by specif
You can find various distributions of PaddlePaddle-gpu in [the Release History](https://pypi.org/project/paddlepaddle-gpu/#history).
-> 'postxx' corresponds to CUDA and cuDNN versions, and the number before 'postxx' represents the version of Paddle
-Please note that: in the commands, paddlepaddle-gpu==2.6.1 will install the installation package of PaddlePaddle that supports CUDA 11.2 and cuDNN 8 by default under Windows environment.
+Please note: on Windows, paddlepaddle-gpu==3.0.0b2 installs by default the PaddlePaddle package that supports CUDA 11.2 and cuDNN 8.
@@ -320,181 +319,86 @@ Please note that: in the commands, paddlepaddle-gpu==2.6.1 will i
@@ -537,155 +441,6 @@ abi tag: similar to 'cp33m', 'abi3', 'none'
platform tag: similar to 'linux_x86_64', 'any'
-
-
-## **Multi-version whl package list - dev**
-
-
-
-
@@ -701,16 +456,16 @@ Suppose you have written a PaddlePaddle program in the current directory (such a
cd /home/work
```
```
-docker run -it -v $PWD:/work registry.baidubce.com/paddlepaddle/paddle /work/train.py
+docker run -it -v $PWD:/work ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle /work/train.py
```
-In the above commands, the `-it` parameter indicates that the container has been run interactively; `-v $PWD:/work` specifies that the current path (the absolute path where the PWD variable in Linux will expand to the current path) is mounted to the `:/work` directory inside the container: `registry.baidubce.com/paddlepaddle/paddle` specifies the container to be used; finally `/work/train.py` is the command executed inside the container, ie. the training program.
+In the above command, the `-it` parameter runs the container interactively; `-v $PWD:/work` mounts the current directory (on Linux, the PWD variable expands to the absolute path of the current directory) at `/work` inside the container; `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle` specifies the image to use; finally, `/work/train.py` is the command executed inside the container, i.e. the training program.
Of course, you can also enter into the Docker container and execute or debug your code interactively:
```
-docker run -it -v $PWD:/work registry.baidubce.com/paddlepaddle/paddle /bin/bash
+docker run -it -v $PWD:/work ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle /bin/bash
```
```
cd /work
@@ -732,13 +487,13 @@ Use Docker to quickly launch a local Jupyter Notebook containing the PaddlePaddl
We provide a Docker image that can run the PaddlePaddle Book directly, running directly:
```
-docker run -p 8888:8888 registry.baidubce.com/paddlepaddle/book
+docker run -p 8888:8888 ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/book
```
Domestic users can use the following image source to speed up access:
```
-docker run -p 8888:8888 registry.baidubce.com/paddlepaddle/book
+docker run -p 8888:8888 ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/book
```
Then enter the following URL in your browser:
@@ -756,7 +511,7 @@ http://localhost:8888/
In order to ensure that the GPU driver works properly in the image, we recommend using [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) to run the image. Don't forget to install the latest GPU drivers on your physical machine in advance.
```
-Nvidia-docker run -it -v $PWD:/work registry.baidubce.com/paddlepaddle/paddle:latest-gpu /bin/bash
+nvidia-docker run -it -v $PWD:/work ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-gpu /bin/bash
```
**Note: If you don't have nvidia-docker installed, you can try the following to mount the CUDA library and Linux devices into the Docker container:**
@@ -766,5 +521,5 @@ export CUDA_SO="$(\ls /usr/lib64/libcuda* | xargs -I{} echo '-v {}:{}') \
$(\ls /usr/lib64/libnvidia* | xargs -I{} echo '-v {}:{}')"
export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
docker run ${CUDA_SO} \
-${DEVICES} -it registry.baidubce.com/paddlepaddle/paddle:latest-gpu
+${DEVICES} -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-gpu
```
diff --git a/docs/install/compile/linux-compile-by-make.md b/docs/install/compile/linux-compile-by-make.md
index d896d387019..eb1dc506b7a 100644
--- a/docs/install/compile/linux-compile-by-make.md
+++ b/docs/install/compile/linux-compile-by-make.md
@@ -60,12 +60,12 @@ cd Paddle
* CPU 版的 PaddlePaddle:
```
- docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev
+ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev
```
* GPU 版的 PaddlePaddle:
```
- docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2
+ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2
```
如果您的机器不在中国大陆地区,可以直接从 [DockerHub 中的 paddle 镜像仓库](https://hub.docker.com/r/paddlepaddle/paddle/tags) 拉取镜像:
@@ -90,7 +90,7 @@ cd Paddle
用从百度拉取的镜像创建容器:
```
- docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash
+ docker run --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash
```
- `--name paddle-test`:为您创建的 Docker 容器命名为 paddle-test;
@@ -99,7 +99,7 @@ cd Paddle
- `-it`: 与宿主机保持交互状态;
- - `registry.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`registry.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。
+ - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。
若使用的是从 DockerHub 拉取的镜像创建容器,则修改镜像名即可:
```
@@ -110,7 +110,7 @@ cd Paddle
用从百度拉取的镜像创建容器
```
- docker run --gpus all --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash
+ docker run --gpus all --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash
```
- `--gpus all`: 在 Docker 容器中允许使用 gpu;
@@ -121,7 +121,7 @@ cd Paddle
- `-it`: 与宿主机保持交互状态;
- - `registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`:使用名为`registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。
+ - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`:使用名为`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。
若使用的是从 DockerHub 拉取的镜像创建容器,则修改镜像名即可:
```
diff --git a/docs/install/compile/linux-compile-by-make_en.md b/docs/install/compile/linux-compile-by-make_en.md
index f5d6db2479c..0f1a9c42827 100644
--- a/docs/install/compile/linux-compile-by-make_en.md
+++ b/docs/install/compile/linux-compile-by-make_en.md
@@ -58,12 +58,12 @@ For domestic users, when downloading docker is slow due to network problems, you
* CPU version of PaddlePaddle:
```
- docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev
+ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev
```
* GPU version of PaddlePaddle:
```
- docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2
+ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2
```
If your machine is not in mainland China, you can pull the image directly from DockerHub:
@@ -90,7 +90,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
Using the image pulled from Baidu.
```
- docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash
+ docker run --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash
```
- `--name paddle-test`: names the Docker container you created as paddle-test;
@@ -101,7 +101,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
- `-it`: keeps interaction with the host;
- - `registry.baidubce.com/paddlepaddle/paddle:latest-dev`: use the image named `registry.baidubce.com/paddlepaddle/paddle:latest-dev` to create Docker container, /bin/bash start the /bin/bash command after entering the container.
+ - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`: creates the Docker container from the image named `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`; the trailing `/bin/bash` is the command started once the container is entered.
If you are using the image pulled from DockerHub, just modify the image name.
```
@@ -113,7 +113,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
Using the image pulled from Baidu.
```
- docker run --gpus all --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash
+ docker run --gpus all --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash
```
- `--gpus all`: gpu resources can be used in Docker container;
@@ -127,7 +127,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
- `-it`: keeps interaction with the host;
- - `registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`: use the image named `registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2` to create Docker container, /bin/bash start the /bin/bash command after entering the container.
+ - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`: creates the Docker container from the image named `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`; the trailing `/bin/bash` is the command started once the container is entered.
If you are using the image pulled from DockerHub, just modify the image name.
```
diff --git a/docs/install/compile/linux-compile-by-ninja.md b/docs/install/compile/linux-compile-by-ninja.md
index 018bd28e267..74ade0b7be4 100644
--- a/docs/install/compile/linux-compile-by-ninja.md
+++ b/docs/install/compile/linux-compile-by-ninja.md
@@ -60,12 +60,12 @@ cd Paddle
* CPU 版的 PaddlePaddle:
```
- docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev
+ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev
```
* GPU 版的 PaddlePaddle:
```
- docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2
+ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2
```
如果您的机器不在中国大陆地区,可以直接从 [DockerHub 中的 paddle 镜像仓库](https://hub.docker.com/r/paddlepaddle/paddle/tags) 拉取镜像:
@@ -90,7 +90,7 @@ cd Paddle
用从百度拉取的镜像创建容器
```
- docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash
+ docker run --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash
```
- `--name paddle-test`:为您创建的 Docker 容器命名为 paddle-test;
@@ -99,7 +99,7 @@ cd Paddle
- `-it`: 与宿主机保持交互状态;
- - `registry.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`registry.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。
+ - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。
若使用的是从 DockerHub 拉取的镜像创建容器,则修改镜像名即可:
```
@@ -110,7 +110,7 @@ cd Paddle
用从百度拉取的镜像创建容器
```
- docker run --gpus all --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash
+ docker run --gpus all --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash
```
- `--gpus all`: 在 Docker 容器中允许使用 gpu;
@@ -121,7 +121,7 @@ cd Paddle
- `-it`: 与宿主机保持交互状态;
- - `registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`:使用名为`registry.baidubce.com/paddlepaddle/paddle`, tag 为`latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。
+ - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`:使用名为`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle`, tag 为`latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令。
若使用的是从 DockerHub 拉取的镜像创建容器,则修改镜像名即可:
```
diff --git a/docs/install/compile/macos-compile-make.md b/docs/install/compile/macos-compile-make.md
index 813df2634cf..12abdda7698 100644
--- a/docs/install/compile/macos-compile-make.md
+++ b/docs/install/compile/macos-compile-make.md
@@ -48,7 +48,7 @@ cd Paddle
* CPU 版的 PaddlePaddle:
```
- docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev
+ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev
```
如果您的机器不在中国大陆地区,可以直接从 DockerHub 拉取镜像:
@@ -64,7 +64,7 @@ cd Paddle
#### 5. 创建并进入满足编译环境的 Docker 容器:
```
-docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash
+docker run --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash
```
- `--name paddle-test`:为您创建的 Docker 容器命名为 paddle-test
@@ -73,7 +73,7 @@ docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidub
- `-it`:与宿主机保持交互状态
-- `registry.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`registry.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令
+- `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令
注意:请确保至少为 docker 分配 4g 以上的内存,否则编译过程可能因内存不足导致失败。您可以在 docker 用户界面的“Preferences-Resources”中设置容器的内存分配上限。
diff --git a/docs/install/compile/macos-compile-make_en.md b/docs/install/compile/macos-compile-make_en.md
index 4cc2fbbea1f..6b7da5879eb 100644
--- a/docs/install/compile/macos-compile-make_en.md
+++ b/docs/install/compile/macos-compile-make_en.md
@@ -49,7 +49,7 @@ For domestic users, when downloading docker is slow due to network problems, you
* CPU version of PaddlePaddle:
```
-docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev
```
If your machine is not in mainland China, you can pull the image directly from DockerHub:
@@ -65,7 +65,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
#### 5. Create and enter a Docker container that meets the compilation environment:
```
-docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash
+docker run --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash
```
- `--name paddle-test`: name the Docker container you created as paddle-test,
@@ -74,7 +74,7 @@ docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidub
- `-it`: keeps interacting with the host;
-- `registry.baidubce.com/paddlepaddle/paddle:latest-dev`: creates a Docker container with a mirror named `registry.baidubce.com/paddlepaddle/paddle:latest-dev`, /bin /bash starts the /bin/bash command after entering the container.
+- `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`: creates the Docker container from the image named `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`; the trailing `/bin/bash` is the command started once the container is entered.
Note:
diff --git a/docs/install/compile/macos-compile-ninja.md b/docs/install/compile/macos-compile-ninja.md
index bed6f9378f2..ab2c12801e3 100644
--- a/docs/install/compile/macos-compile-ninja.md
+++ b/docs/install/compile/macos-compile-ninja.md
@@ -48,7 +48,7 @@ cd Paddle
* CPU 版的 PaddlePaddle:
```
- docker pull registry.baidubce.com/paddlepaddle/paddle:latest-dev
+ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev
```
如果您的机器不在中国大陆地区,可以直接从 DockerHub 拉取镜像:
* CPU 版的 PaddlePaddle:
@@ -58,12 +58,12 @@ cd Paddle
您可以访问[DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/)获取与您机器适配的镜像。
#### 5. 创建并进入满足编译环境的 Docker 容器:
```
-docker run --name paddle-test -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash
+docker run --name paddle-test -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev /bin/bash
```
- `--name paddle-test`:为您创建的 Docker 容器命名为 paddle-test
- `-v:$PWD:/paddle`:将当前目录挂载到 Docker 容器中的/paddle 目录下(Linux 中 PWD 变量会展开为当前路径的[绝对路径](https://baike.baidu.com/item/绝对路径/481185))
- `-it`:与宿主机保持交互状态
-- `registry.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`registry.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令
+- `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`:使用名为`ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:latest-dev`的镜像创建 Docker 容器,/bin/bash 进入容器后启动/bin/bash 命令
注意:请确保至少为 docker 分配 4g 以上的内存,否则编译过程可能因内存不足导致失败。您可以在 docker 用户界面的“Preferences-Resources”中设置容器的内存分配上限。
#### 6. 进入 Docker 后进入 paddle 目录下:
```
diff --git a/docs/install/conda/linux-conda.md b/docs/install/conda/linux-conda.md
index fc55e0a4929..caece6b81dc 100644
--- a/docs/install/conda/linux-conda.md
+++ b/docs/install/conda/linux-conda.md
@@ -86,39 +86,29 @@ python3 -c "import platform;print(platform.architecture()[0]);print(platform.mac
#### CPU 版的 PaddlePaddle
+
如果您的计算机没有 NVIDIA® GPU,请安装 CPU 版的 PaddlePaddle
```
-conda install paddlepaddle==2.6.1 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
+conda install paddlepaddle==3.0.0b2 -c paddle
```
-
#### GPU 版的 PaddlePaddle
-
-* 对于 `CUDA 11.2`,需要搭配 cuDNN 8.2.1(多卡环境下 NCCL>=2.7),安装命令为:
+* 对于 `CUDA 11.8` 安装命令为:
```
- conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
+ conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=11.8 -c paddle -c nvidia
```
-* 对于 `CUDA 11.6`,需要搭配 cuDNN 8.4.0(多卡环境下 NCCL>=2.7),安装命令为:
+* 对于 `CUDA 12.3` 安装命令为:
```
- conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.6 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
+ conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=12.3 -c paddle -c nvidia
```
-* 对于 `CUDA 11.7`,需要搭配 cuDNN 8.4.1(多卡环境下 NCCL>=2.7),安装命令为:
-
- ```
- conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.7 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
- ```
-
-您可参考 NVIDIA 官方文档了解 CUDA 和 CUDNN 的安装流程和配置方法,请见[CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/)
-
-
## **三、验证安装**
diff --git a/docs/install/conda/linux-conda_en.md b/docs/install/conda/linux-conda_en.md
index 14916f84957..fefbee05785 100644
--- a/docs/install/conda/linux-conda_en.md
+++ b/docs/install/conda/linux-conda_en.md
@@ -91,37 +91,29 @@ You can choose the following version of PaddlePaddle to start installation:
#### CPU Version of PaddlePaddle
+
If your computer doesn't have NVIDIA® GPU, please install `the CPU Version of PaddlePaddle`
```
-conda install paddlepaddle==2.6.1 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
+conda install paddlepaddle==3.0.0b2 -c paddle
```
-
#### GPU Version of PaddlePaddle
-* If you are using CUDA 11.2,cuDNN 8.2.1(for multi card support, NCCL>=2.7):
+* If you are using CUDA 11.8:
```
- conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
+ conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=11.8 -c paddle -c nvidia
```
-* If you are using CUDA 11.6,cuDNN 8.4.0(for multi card support, NCCL>=2.7):
+* If you are using CUDA 12.3:
```
- conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.6 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
+ conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=12.3 -c paddle -c nvidia
```
-* If you are using CUDA 11.7,cuDNN 8.4.1(for multi card support, NCCL>=2.7):
-
- ```
- conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.7 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
- ```
-
-You can refer to NVIDIA official documents for installation process and configuration method of CUDA and cudnn. Please refer to [CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/)
-
## Verify installation
diff --git a/docs/install/conda/macos-conda.md b/docs/install/conda/macos-conda.md
index 30600afbf54..3a43c093759 100644
--- a/docs/install/conda/macos-conda.md
+++ b/docs/install/conda/macos-conda.md
@@ -83,7 +83,7 @@ python3 -c "import platform;print(platform.architecture()[0]);print(platform.mac
* 目前在 macOS 环境仅支持 CPU 版 PaddlePaddle,请参考如下命令安装 Paddle:
```
- conda install paddlepaddle==2.6.1 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
+ conda install paddlepaddle==3.0.0b2 -c paddle
```
## **三、验证安装**
diff --git a/docs/install/conda/macos-conda_en.md b/docs/install/conda/macos-conda_en.md
index 371e218972e..ac3eff46eb5 100644
--- a/docs/install/conda/macos-conda_en.md
+++ b/docs/install/conda/macos-conda_en.md
@@ -87,7 +87,7 @@ conda config --set show_channel_urls yes
* Currently, only the CPU version of PaddlePaddle is supported in the macOS environment. Please use the following command to install PaddlePaddle:
```
- conda install paddlepaddle==2.6.1 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
+ conda install paddlepaddle==3.0.0b2 -c paddle
```
diff --git a/docs/install/conda/windows-conda.md b/docs/install/conda/windows-conda.md
index 29891f4c291..6edf9dea2f6 100644
--- a/docs/install/conda/windows-conda.md
+++ b/docs/install/conda/windows-conda.md
@@ -90,35 +90,27 @@ python -c "import platform;print(platform.architecture()[0]);print(platform.mach
如果您的计算机没有 NVIDIA® GPU,请安装 CPU 版的 PaddlePaddle
+
```
-conda install paddlepaddle==2.6.1 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
+conda install paddlepaddle==3.0.0b2 -c paddle
```
-
#### GPU 版的 PaddlePaddle
-* 对于 `CUDA 11.2`,需要搭配 cuDNN 8.2.1,安装命令为:
+* 对于 `CUDA 11.8` 安装命令为:
```
- conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
+ conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=11.8 -c paddle -c nvidia
```
-* 对于 `CUDA 11.6`,需要搭配 cuDNN 8.4.0,安装命令为:
+* 对于 `CUDA 12.3` 安装命令为:
```
- conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.6 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
+ conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=12.3 -c paddle -c nvidia
```
-* 对于 `CUDA 11.7`,需要搭配 cuDNN 8.4.1,安装命令为:
-
- ```
- conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.7 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
- ```
-
-您可参考 NVIDIA 官方文档了解 CUDA 和 CUDNN 的安装流程和配置方法,请见[CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/)
-
## **三、验证安装**
diff --git a/docs/install/conda/windows-conda_en.md b/docs/install/conda/windows-conda_en.md
index 736e0da59a7..a7be323dfe5 100644
--- a/docs/install/conda/windows-conda_en.md
+++ b/docs/install/conda/windows-conda_en.md
@@ -93,10 +93,11 @@ You can choose the following version of PaddlePaddle to start installation:
#### CPU Version of PaddlePaddle
+
If your computer doesn't have NVIDIA® GPU, please install `the CPU Version of PaddlePaddle`
```
-conda install paddlepaddle==2.6.1 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
+conda install paddlepaddle==3.0.0b2 -c paddle
```
@@ -105,26 +106,18 @@ conda install paddlepaddle==2.6.1 --channel https://mirrors.tuna.tsinghua.edu.cn
#### GPU Version of PaddlePaddle
-* If you are using CUDA 11.2,cuDNN 8.2.1:
-
- ```
- conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
- ```
-
-* If you are using CUDA 11.6,cuDNN 8.4.0:
+* If you are using CUDA 11.8:
```
- conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.6 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
+ conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=11.8 -c paddle -c nvidia
```
-* If you are using CUDA 11.7,cuDNN 8.4.1:
+* If you are using CUDA 12.3:
```
- conda install paddlepaddle-gpu==2.6.1 cudatoolkit=11.7 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
+ conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=12.3 -c paddle -c nvidia
```
-You can refer to NVIDIA official documents for installation process and configuration method of CUDA and cudnn. Please refer to [CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/)
-
## Verify installation
diff --git a/docs/install/docker/docker_list.md b/docs/install/docker/docker_list.md
index d1a48be1885..60179698128 100644
--- a/docs/install/docker/docker_list.md
+++ b/docs/install/docker/docker_list.md
@@ -18,7 +18,7 @@
diff --git a/docs/install/docker/docker_list_en.md b/docs/install/docker/docker_list_en.md
index c70f84b5817..ac4eb66e8f5 100644
--- a/docs/install/docker/docker_list_en.md
+++ b/docs/install/docker/docker_list_en.md
@@ -18,7 +18,7 @@ This document introduces the Docker environment commonly used by PaddlePaddle
diff --git a/docs/install/docker/linux-docker_en.md b/docs/install/docker/linux-docker_en.md
index 71f66e14a35..8c08f70bca5 100644
--- a/docs/install/docker/linux-docker_en.md
+++ b/docs/install/docker/linux-docker_en.md
@@ -21,46 +21,40 @@ For domestic users, when downloading docker is slow due to network problems, you
* CPU version of PaddlePaddle:
```
- docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1
+ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2
```
* CPU version of PaddlePaddle, and the image is pre-installed with jupyter:
```
- docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter
+ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter
```
* GPU version of PaddlePaddle(**Latest version of gpu image is recommended, and make sure NVIDIA Container Toolkit is installed successfully**):
```
- docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0
+ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6
```
```
- docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda11.7-cudnn8.4-trt8.4
- ```
- ```
- docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6
+ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda11.8-cudnn8.6-trt8.5
```
If your machine is not in mainland China, you can pull the image directly from DockerHub:
* CPU version of PaddlePaddle:
```
- docker pull paddlepaddle/paddle:2.6.1
+ docker pull paddlepaddle/paddle:3.0.0b2
```
* CPU version of PaddlePaddle, and the image is pre-installed with jupyter:
```
- docker pull paddlepaddle/paddle:2.6.1-jupyter
+ docker pull paddlepaddle/paddle:3.0.0b2-jupyter
```
* GPU version of PaddlePaddle(**Latest version of gpu image is recommended, and make sure NVIDIA Container Toolkit is installed successfully**):
```
- docker pull paddlepaddle/paddle:2.6.1-gpu-cuda11.2-cudnn8.2-trt8.0
- ```
+ docker pull paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6
```
- docker pull paddlepaddle/paddle:2.6.1-gpu-cuda11.7-cudnn8.4-trt8.4
```
- ```
- docker pull paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6
+ docker pull paddlepaddle/paddle:3.0.0b2-gpu-cuda11.8-cudnn8.6-trt8.5
```
You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to get more images.
@@ -72,7 +66,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
```
- docker run --name paddle_docker -it -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1 /bin/bash
+ docker run --name paddle_docker -it -v $PWD:/paddle ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2 /bin/bash
```
- `--name paddle_docker`: Sets the name of the container; `paddle_docker` is the name you choose;
@@ -83,7 +77,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
- `-v $PWD:/paddle`: Mounts the current directory of the host (on Linux, the `PWD` variable expands to its absolute path) at the `/paddle` directory inside the container;
- - `registry.baidubce.com/paddlepaddle/paddle:2.6.1`: Specify the name of the image to be used. You can view it through the 'docker images' command. /bin/Bash is the command to be executed in Docker
+ - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2`: Specifies the image to use; you can list local images with the `docker images` command. `/bin/bash` is the command to run inside the container
* Use GPU version of PaddlePaddle:
@@ -91,7 +85,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
```
- docker run --gpus all --name paddle_docker -v $PWD:/paddle --network=host -it registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 /bin/bash
+ docker run --gpus all --name paddle_docker -v $PWD:/paddle --network=host -it ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6 /bin/bash
```
- `--gpus all`: Makes all GPUs on the host available inside the container;
@@ -104,7 +98,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
- `-v $PWD:/paddle`: Mounts the current directory of the host (on Linux, the `PWD` variable expands to its absolute path) at the `/paddle` directory inside the container;
- - `registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2`: Specify the name of the image to be used. You can view it through the 'docker images' command. /bin/Bash is the command to be executed in Docker
+ - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6`: Specifies the image to use; you can list local images with the `docker images` command. `/bin/bash` is the command to run inside the container
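
The flags explained above can be assembled programmatically. The following is a minimal sketch, not part of the PaddlePaddle docs: the helper `build_docker_run` and its parameter names are hypothetical, it only builds the argument list (it does not invoke Docker), and it uses nothing beyond the Python standard library.

```python
import os
import shlex


def build_docker_run(image, name="paddle_docker", mount_target="/paddle",
                     use_gpu=False, host_network=False,
                     command="/bin/bash", host_dir=None):
    """Return the argv list for a `docker run` call like the ones in this guide.

    Hypothetical helper for illustration only; it mirrors the flags
    documented above and does not execute anything.
    """
    host_dir = host_dir or os.getcwd()      # same role as $PWD in the shell
    argv = ["docker", "run"]
    if use_gpu:
        argv += ["--gpus", "all"]           # expose all host GPUs
    argv += ["--name", name]                # container name
    argv += ["-v", f"{host_dir}:{mount_target}"]  # bind-mount host dir
    if host_network:
        argv.append("--network=host")       # share the host network stack
    argv += ["-it", image, command]         # interactive TTY, image, command
    return argv


if __name__ == "__main__":
    gpu_image = ("ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/"
                 "paddle:3.0.0b2-gpu-cuda12.3-cudnn9.0-trt8.6")
    # Print a copy-pasteable command line (shlex.join needs Python 3.8+).
    print(shlex.join(build_docker_run(gpu_image, use_gpu=True,
                                      host_network=True)))
```

Passing the resulting list to `subprocess.run` would start the container; that step is left out here so the sketch stays side-effect free.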
* Use CPU version of PaddlePaddle with jupyter:
@@ -120,7 +114,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
cd ./jupyter_docker
```
```
- docker run -p 80:80 --rm --env USER_PASSWD="password you set" -v $PWD:/home/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter
+ docker run -p 80:80 --rm --env USER_PASSWD="password you set" -v $PWD:/home/paddle ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter
```
- `--rm`: Removes the container automatically when it exits;
@@ -131,7 +125,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
- `-v $PWD:/home/paddle`: Mounts the current directory (the `PWD` variable expands to its absolute path) at the `/home/paddle` directory inside the container;
- - `registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter`: Specify the name of the image to be used, you can view it through the `docker images` command
+ - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter`: Specifies the image to use; you can list local images with the `docker images` command
Now you have successfully used Docker to install PaddlePaddle. For more information about using Docker, see the [official Docker documentation](https://docs.docker.com).
@@ -149,24 +143,20 @@ Now you have successfully used Docker to install PaddlePaddle. For more informat
diff --git a/docs/install/docker/macos-docker_en.md b/docs/install/docker/macos-docker_en.md
index fadf8883c33..80148a14a04 100644
--- a/docs/install/docker/macos-docker_en.md
+++ b/docs/install/docker/macos-docker_en.md
@@ -19,24 +19,24 @@ For domestic users, when downloading docker is slow due to network problems, you
* CPU version of PaddlePaddle:
```
- docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1
+ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2
```
* CPU version of PaddlePaddle, with Jupyter pre-installed in the image:
```
- docker pull registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter
+ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter
```
If your machine is not in mainland China, you can pull the image directly from DockerHub:
* CPU version of PaddlePaddle:
```
- docker pull paddlepaddle/paddle:2.6.1
+ docker pull paddlepaddle/paddle:3.0.0b2
```
* CPU version of PaddlePaddle, with Jupyter pre-installed in the image:
```
- docker pull paddlepaddle/paddle:2.6.1-jupyter
+ docker pull paddlepaddle/paddle:3.0.0b2-jupyter
```
See [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) for more images.
@@ -48,7 +48,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
```
- docker run --name paddle_docker -it -v $PWD:/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1 /bin/bash
+ docker run --name paddle_docker -it -v $PWD:/paddle ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2 /bin/bash
```
- `--name paddle_docker`: Sets the name of the container; `paddle_docker` is the name you choose;
@@ -59,7 +59,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
- `-v $PWD:/paddle`: Mounts the current directory of the host (the `PWD` variable expands to its absolute path) at the `/paddle` directory inside the container;
- - `registry.baidubce.com/paddlepaddle/paddle:2.6.1`: Specify the name of the image to be used. You can view it through the 'docker images' command. /bin/Bash is the command to be executed in Docker
+ - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2`: Specifies the image to use; you can list local images with the `docker images` command. `/bin/bash` is the command to run inside the container
* Use CPU version of PaddlePaddle with jupyter:
@@ -75,7 +75,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
cd ./jupyter_docker
```
```
- docker run -p 80:80 --rm --env USER_PASSWD="password you set" -v $PWD:/home/paddle registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter
+ docker run -p 80:80 --rm --env USER_PASSWD="password you set" -v $PWD:/home/paddle ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter
```
- `--rm`: Removes the container automatically when it exits;
@@ -86,7 +86,7 @@ You can see [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags/) to g
- `-v $PWD:/home/paddle`: Mounts the current directory (the `PWD` variable expands to its absolute path) at the `/home/paddle` directory inside the container;
- - `registry.baidubce.com/paddlepaddle/paddle:2.6.1-jupyter`: Specify the name of the image to be used, you can view it through the `docker images` command
+ - `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0b2-jupyter`: Specifies the image to use; you can list local images with the `docker images` command
@@ -105,12 +105,12 @@ Now you have successfully used Docker to install PaddlePaddle. For more informat
-
-```python
-# elu(x) = max(0, x) + min(0, α ∗ (e^x − 1))
->>> import paddle
->>> x = paddle.to_tensor([-1., 6.])
->>> m = paddle.nn.ELU(-0.2)
->>> out = m(x)
->>> out
-Tensor(shape=[2], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
- [ 0. , -74.48576355])
->>> out = paddle.nn.functional.elu_(x, alpha=-0.2, name=None)
->>> out
-Tensor(shape=[2], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
- [ 0. , -74.48576355])
-```
-
-
-
-
-
-```python
-# elu(x) = x, if x > 0
-# elu(x) = α ∗ (e^x − 1), if x <= 0
->>> import paddle
->>> x = paddle.to_tensor([-1., 6.])
->>> m = paddle.nn.ELU(-0.2)
->>> out = m(x)
->>> out
-Tensor(shape=[2], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
- [0.12642412, 6. ])
->>> out = paddle.nn.functional.elu_(x, alpha=-0.2, name=None)
-Traceback (most recent call last):
- File "<stdin>", line 1, in <module>
- File "/usr/local/lib/python3.7/dist-packages/decorator.py", line 232, in fun
- return caller(func, *(extras + args), **kw)
- File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
- return wrapped_func(*args, **kwargs)
- File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/inplace_utils.py", line 34, in __impl__
- return func(*args, **kwargs)
- File "/usr/local/lib/python3.7/dist-packages/paddle/nn/functional/activation.py", line 89, in elu_
- assert alpha >= 0., "elu_ only support alpha >= 0, please use elu instead."
-AssertionError: elu_ only support alpha >= 0, please use elu instead.
-```
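
The two formula comments above agree for non-negative α but diverge when α is negative, which is why the two outputs differ. The following is a minimal pure-Python sketch (no Paddle required; the function names `elu_minmax` and `elu_piecewise` are illustrative, not Paddle APIs) that reproduces the numbers shown in both outputs for `alpha=-0.2`:

```python
import math


def elu_minmax(x: float, alpha: float) -> float:
    """elu(x) = max(0, x) + min(0, alpha * (e^x - 1))."""
    return max(0.0, x) + min(0.0, alpha * (math.exp(x) - 1.0))


def elu_piecewise(x: float, alpha: float) -> float:
    """elu(x) = x if x > 0, else alpha * (e^x - 1)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


if __name__ == "__main__":
    for x in (-1.0, 6.0):
        # With alpha < 0 the min/max form clamps the negative branch to 0
        # and subtracts a large term on the positive branch.
        print(x, elu_minmax(x, -0.2), elu_piecewise(x, -0.2))
```

With a non-negative `alpha` the two functions return the same value for every input, so the difference only matters for negative `alpha`.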
-
-
-
-
-
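-## 3. Training Framework (Including Distributed)
-
-### (1) New Features
-
-#### API
-
-- Added 4 automatic differentiation APIs to support scientific computing: ([#40692](https://github.com/PaddlePaddle/Paddle/pull/40692))
-
- - `paddle.incubate.autograd.vjp`: computes the vector-Jacobian product.
-
- - `paddle.incubate.autograd.jvp`: computes the Jacobian-vector product.
-
- - `paddle.incubate.autograd.Jacobian`: computes the Jacobian matrix.
-
- - `paddle.incubate.autograd.Hessian`: computes the Hessian matrix.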
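
To make the four quantities listed above concrete, here is a minimal pure-Python sketch that approximates them with central finite differences. It is an illustration of the mathematics only: it does not use or imitate the `paddle.incubate.autograd` implementation, and the function names and the `eps` step sizes are assumptions chosen for this example.

```python
from typing import Callable, List

Vector = List[float]


def jacobian(f: Callable[[Vector], Vector], x: Vector,
             eps: float = 1e-6) -> List[Vector]:
    """J[i][j] = d f_i / d x_j, approximated by central differences."""
    m, n = len(f(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * eps)
    return J


def jvp(f, x: Vector, v: Vector) -> Vector:
    """Jacobian-vector product J @ v (v has the shape of the input)."""
    return [sum(Jij * vj for Jij, vj in zip(row, v)) for row in jacobian(f, x)]


def vjp(f, x: Vector, u: Vector) -> Vector:
    """Vector-Jacobian product u^T @ J (u has the shape of the output)."""
    J = jacobian(f, x)
    return [sum(u[i] * J[i][j] for i in range(len(u))) for j in range(len(x))]


def hessian(g: Callable[[Vector], float], x: Vector,
            eps: float = 1e-4) -> List[Vector]:
    """Hessian of a scalar function: the Jacobian of its gradient."""
    def grad(p: Vector) -> Vector:
        return jacobian(lambda q: [g(q)], p, eps)[0]
    return jacobian(grad, x, eps)


if __name__ == "__main__":
    f = lambda p: [p[0] * p[1], p[0] + p[1]]   # R^2 -> R^2
    g = lambda p: p[0] ** 2 * p[1]             # R^2 -> R
    print(jacobian(f, [2.0, 3.0]))
    print(jvp(f, [2.0, 3.0], [1.0, 0.0]))
    print(vjp(f, [2.0, 3.0], [1.0, 0.0]))
    print(hessian(g, [2.0, 3.0]))
```

An automatic differentiation library computes `jvp` and `vjp` without materializing the full Jacobian; the explicit matrix here is only for readability.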
-
-- Added linear algebra APIs
-
- - Added `paddle.linalg.triangular_solve`, which solves a system of linear equations with a triangular coefficient matrix that has a unique solution. ([#36714](https://github.com/PaddlePaddle/Paddle/pull/36714))
-
- - Added `paddle.linalg.eig`, which computes the eigendecomposition of a general square matrix. ([#35764](https://github.com/PaddlePaddle/Paddle/pull/35764))
-
- - Added `paddle.linalg.solve`, which computes the solution of a system of linear equations. ([#35715](https://github.com/PaddlePaddle/Paddle/pull/35715))
-
- - Added `paddle.linalg.lstsq`, which computes the least-squares solution of a system of linear equations. ([#38585](https://github.com/PaddlePaddle/Paddle/pull/38585), [#38621](https://github.com/PaddlePaddle/Paddle/pull/38621))
-
- - Added `paddle.linalg.qr`, which computes the QR decomposition of a matrix. ([#35742](https://github.com/PaddlePaddle/Paddle/pull/35742), [#38824](https://github.com/PaddlePaddle/Paddle/pull/38824))
-
- - Added `paddle.inner`, which computes the inner product of matrices. ([#37706](https://github.com/PaddlePaddle/Paddle/pull/37706))
-
- - Added `paddle.outer`, which computes the outer product of matrices. ([#37706](https://github.com/PaddlePaddle/Paddle/pull/37706))
-
- - Added `paddle.linalg.cov`, which computes the covariance between vectors. ([#38392](https://github.com/PaddlePaddle/Paddle/pull/38392))
-
- - Added `paddle.linalg.cholesky_solve`, which solves a system of equations using the Cholesky factorization. ([#38167](https://github.com/PaddlePaddle/Paddle/pull/38167))
-
- - Added `paddle.linalg.lu` and `paddle.linalg.lu_unpack`, which compute the LU decomposition of a matrix and unpack the LU result. ([#38617](https://github.com/PaddlePaddle/Paddle/pull/38617), [#38559](https://github.com/PaddlePaddle/Paddle/pull/38559), [#38616](https://github.com/PaddlePaddle/Paddle/pull/38616))
-
-- Added 21 probability distribution APIs, including 6 random variable distributions, 13 random variable transforms, and 2 KL divergence computations, for scenarios such as reinforcement learning, variational inference, and scientific computing: ([#40536](https://github.com/PaddlePaddle/Paddle/pull/40536), [#38820](https://github.com/PaddlePaddle/Paddle/pull/38820), [#38558](https://github.com/PaddlePaddle/Paddle/pull/38558/files), [#38445](https://github.com/PaddlePaddle/Paddle/pull/38445), [#38244](https://github.com/PaddlePaddle/Paddle/pull/38244), [#38047](https://github.com/PaddlePaddle/Paddle/pull/38047))
-
- - `paddle.distribution.ExponentialFamily`: base class of the exponential family of distributions.
-
- - `paddle.distribution.Beta`: the `Beta` distribution.
-
- - `paddle.distribution.Dirichlet`: the `Dirichlet` distribution.
-
- - `paddle.distribution.Independent`: independent distribution, used to create higher-order distributions.
-
- - `paddle.distribution.TransformedDistribution`: transformed distribution, used to generate a higher-order distribution from a base distribution and a sequence of transforms.
-
- - `paddle.distribution.Multinomial`: the multinomial distribution.
-
- - `paddle.distribution.Transform`: base class of random variable transforms.
-
- - `paddle.distribution.AbsTransform`: absolute value transform.
-
- - `paddle.distribution.AffineTransform`: affine transform.
-
- - `paddle.distribution.ChainTransform`: chained composition of transforms.
-
- - `paddle.distribution.ExpTransform`: exponential transform.
-
- - `paddle.distribution.IndependentTransform`: independent transform, used to extend the `event_dim` of a transform's domain.
-
- - `paddle.distribution.PowerTransform`: power transform.
-
- - `paddle.distribution.ReshapeTransform`: `reshape` transform.
-
- - `paddle.distribution.SigmoidTransform`: `sigmoid` transform.
-
- - `paddle.distribution.SoftmaxTransform`: `softmax` transform.
-
- - `paddle.distribution.StackTransform`: `stack` transform, used to combine multiple transforms in a `stack` manner.
-
- - `paddle.distribution.StickBreakingTransform`: `stickbreaking` transform.
-
- - `paddle.distribution.TanhTransform`: `tanh` transform.
-
- - `paddle.distribution.kl_divergence`: computes the KL divergence.
-
- - `paddle.distribution.register_kl`: registers a user-defined KL divergence function.
-
-- Added high-level APIs
-
- - Added `paddle.vision.models.AlexNet` and `paddle.vision.models.alexnet` for direct use of the AlexNet model. ([#36058](https://github.com/PaddlePaddle/Paddle/pull/36058))
-
- - Added `paddle.vision.models.DenseNet`, `paddle.vision.models.densenet121`, `paddle.vision.models.densenet161`, `paddle.vision.models.densenet169`, `paddle.vision.models.densenet201`, and `paddle.vision.models.densenet264` for direct use of the DenseNet models. ([#36069](https://github.com/PaddlePaddle/Paddle/pull/36069))
-
- - Added `paddle.vision.models.GoogLeNet` and `paddle.vision.models.googlenet` for direct use of the GoogLeNet model. ([#36034](https://github.com/PaddlePaddle/Paddle/pull/36034))
-
- - Added `paddle.vision.models.InceptionV3` and `paddle.vision.models.inception_v3` for direct use of the InceptionV3 model. ([#36064](https://github.com/PaddlePaddle/Paddle/pull/36064))
-
- - Added `paddle.vision.models.MobileNetV3Small`, `paddle.vision.models.MobileNetV3Large`, `paddle.vision.models.mobilenet_v3_small`, and `paddle.vision.models.mobilenet_v3_large` for direct use of the MobileNetV3 models. ([#38653](https://github.com/PaddlePaddle/Paddle/pull/38653))
-
- - Added `paddle.vision.models.resnext50_32x4d`, `paddle.vision.models.resnext50_64x4d`, `paddle.vision.models.resnext101_32x4d`, `paddle.vision.models.resnext101_64x4d`, `paddle.vision.models.resnext152_32x4d`, and `paddle.vision.models.resnext152_64x4d` for direct use of the ResNeXt models. ([#36070](https://github.com/PaddlePaddle/Paddle/pull/36070))
-
- - Added `paddle.vision.models.ShuffleNetV2`, `paddle.vision.models.shufflenet_v2_x0_25`, `paddle.vision.models.shufflenet_v2_x0_33`, `paddle.vision.models.shufflenet_v2_x0_5`, `paddle.vision.models.shufflenet_v2_x1_0`, `paddle.vision.models.shufflenet_v2_x1_5`, `paddle.vision.models.shufflenet_v2_x2_0`, and `paddle.vision.models.shufflenet_v2_swish` for direct use of the ShuffleNetV2 models. ([#36067](https://github.com/PaddlePaddle/Paddle/pull/36067))
-
- - Added `paddle.vision.models.SqueezeNet`, `paddle.vision.models.squeezenet1_0`, and `paddle.vision.models.squeezenet1_1` for direct use of the SqueezeNet models. ([#36066](https://github.com/PaddlePaddle/Paddle/pull/36066))
-
- - Added `paddle.vision.models.wide_resnet50_2` and `paddle.vision.models.wide_resnet101_2` for direct use of the WideResNet models. ([#36952](https://github.com/PaddlePaddle/Paddle/pull/36952))
-
- - Added the `paddle.vision.ops.nms` API, which supports single-class and multi-class non-maximum suppression (NMS) to speed up object detection inference. ([#40962](https://github.com/PaddlePaddle/Paddle/pull/40962))
-
- - Added `paddle.vision.ops.roi_pool` and `paddle.vision.ops.RoIPool`, which support RoI pooling in detection tasks. ([#36154](https://github.com/PaddlePaddle/Paddle/pull/36154))
-
- - Added `paddle.vision.ops.roi_align` and `paddle.vision.ops.RoIAlign`, which support RoI Align in detection tasks. ([#35102](https://github.com/PaddlePaddle/Paddle/pull/36154))
-
- - Added the Viterbi decoding APIs `paddle.text.ViterbiDecoder` and `paddle.text.viterbi_decode`, mainly used for inference in sequence labeling models. ([#35778](https://github.com/PaddlePaddle/Paddle/pull/35778))
-
-- Added 11 Sparse APIs, which support basic functions such as creating Sparse Tensors in COO and CSR formats and converting to and from Tensor:
-
- - `paddle.sparse.sparse_coo_tensor`: creates a Sparse Tensor in COO format. ([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780))
-
- - `paddle.sparse.sparse_csr_tensor`: creates a Sparse Tensor in CSR format. ([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780))
-
- - `paddle.sparse.ReLU`: ReLU activation layer that supports SparseCooTensor. ([#40959](https://github.com/PaddlePaddle/Paddle/pull/40959))
-
- - `paddle.sparse.functional.relu`: ReLU function that supports SparseCooTensor. ([#40959](https://github.com/PaddlePaddle/Paddle/pull/40959))
-
- - `Tensor.values()`: returns the non-zero elements of a SparseCooTensor or SparseCsrTensor. ([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608))
-
- - `Tensor.indices()`: returns the coordinate information of a SparseCooTensor. ([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608))
-
- - `Tensor.crows()`: returns the compressed row information of a SparseCsrTensor. ([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608))
-
- - `Tensor.cols()`: returns the column information of a SparseCsrTensor. ([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608))
-
- - `Tensor.to_sparse_coo()`: converts a DenseTensor or SparseCsrTensor to a SparseCooTensor. ([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780))
-
- - `Tensor.to_sparse_csr()`: converts a DenseTensor or SparseCooTensor to a SparseCsrTensor. ([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780))
-
- - `Tensor.to_dense()`: converts a SparseCooTensor or SparseCsrTensor to a DenseTensor. ([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780))
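
For readers unfamiliar with the two layouts, the following pure-Python sketch shows what the indices, values, compressed rows (`crows`), and columns (`cols`) mentioned above contain for a small 2-D matrix. It illustrates the storage formats only; the helper names are hypothetical and this is not how Paddle implements the conversions.

```python
def dense_to_coo(dense):
    """Return (indices, values): indices = [row_ids, col_ids] of non-zeros."""
    rows, cols, values = [], [], []
    for r, row in enumerate(dense):
        for c, v in enumerate(row):
            if v != 0:
                rows.append(r)
                cols.append(c)
                values.append(v)
    return [rows, cols], values


def coo_to_csr(indices, values, num_rows):
    """Compress the row ids into row offsets (crows).

    Assumes the COO entries are sorted by row, which dense_to_coo guarantees.
    """
    rows, cols = indices
    crows = [0] * (num_rows + 1)
    for r in rows:
        crows[r + 1] += 1              # count non-zeros per row
    for i in range(num_rows):
        crows[i + 1] += crows[i]       # prefix sum gives row offsets
    return crows, list(cols), list(values)


def csr_to_dense(crows, cols, values, num_cols):
    """Rebuild the dense matrix from the CSR arrays."""
    dense = []
    for r in range(len(crows) - 1):
        row = [0] * num_cols
        for k in range(crows[r], crows[r + 1]):
            row[cols[k]] = values[k]
        dense.append(row)
    return dense


if __name__ == "__main__":
    dense = [[0, 1, 0],
             [2, 0, 3],
             [0, 0, 0]]
    indices, values = dense_to_coo(dense)
    crows, cols, vals = coo_to_csr(indices, values, num_rows=len(dense))
    print(indices, values)     # COO: coordinates and non-zero values
    print(crows, cols, vals)   # CSR: row offsets, column ids, values
```

COO stores one coordinate pair per non-zero, while CSR replaces the row ids with `num_rows + 1` offsets, which is more compact when rows hold many non-zeros.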
-
-- Added hardware-related APIs
-
- - Added four GPU memory monitoring APIs, `paddle.device.cuda.max_memory_allocated`, `paddle.device.cuda.max_memory_reserved`, `paddle.device.cuda.memory_allocated`, and `paddle.device.cuda.memory_reserved`, for viewing and analyzing a model's GPU memory usage in real time. ([#38657](https://github.com/PaddlePaddle/Paddle/pull/38657))
-
- - Added `paddle.device.cuda.get_device_properties`, which returns the properties of a CUDA device. ([#35661](https://github.com/PaddlePaddle/Paddle/pull/35661))
-
- - Added `paddle.device.cuda.get_device_name` and `paddle.device.cuda.get_device_capability`, which return the GPU device name and the major and minor revision numbers of its compute capability. ([#35672](https://github.com/PaddlePaddle/Paddle/pull/35672))
-
-- Added Tensor operation APIs
-
- - Added `paddle.nansum`, which sums the input Tensor along `axis` and ignores `NaN` values. ([#38137](https://github.com/PaddlePaddle/Paddle/pull/38137))
-
- - Added `paddle.nanmean`, which averages the input Tensor along `axis` and ignores `NaN` values. ([#40472](https://github.com/PaddlePaddle/Paddle/pull/40472))
-
- - Added `paddle.clone`, which returns a copy of the input Tensor and supports gradient computation. ([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020))
-
- - Added `paddle.Tensor.element_size`, which returns the number of bytes allocated for a single element of the Tensor. ([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020))
-
- - Added `paddle.Tensor.to_uva_tensor`, which converts a numpy object into one that is physically stored on the CPU but can be accessed through virtual addresses as a CUDA object. ([#39146](https://github.com/PaddlePaddle/Paddle/pull/39146), [#38950](https://github.com/PaddlePaddle/Paddle/pull/38950))
-
- - Added `paddle.rot90`, which rotates an n-dimensional Tensor by 90 degrees in the plane specified by `axes`. ([#37634](https://github.com/PaddlePaddle/Paddle/pull/37634))
-
- - Added `paddle.logit` and `paddle.Tensor.logit`, which compute the logit function of the input Tensor. ([#37844](https://github.com/PaddlePaddle/Paddle/pull/37844))
-
- - Added `paddle.repeat_interleave`, which repeats the input along the specified axis and returns a new Tensor. ([#37981](https://github.com/PaddlePaddle/Paddle/pull/37981))
-
- - Added `paddle.renorm`, which splits the Tensor into blocks along the specified `axis` and applies the p-norm operation to each block. ([#38130](https://github.com/PaddlePaddle/Paddle/pull/38130), [#38459](https://github.com/PaddlePaddle/Paddle/pull/38459))
-
- - Added `paddle.mode` and `paddle.Tensor.mode`, which find the mode of the input Tensor along the specified axis together with the corresponding indices. ([#38446](https://github.com/PaddlePaddle/Paddle/pull/38446))
-
- - Added `paddle.quantile` and `paddle.Tensor.quantile`, which compute the q-quantile of a Tensor along the specified axis. ([#38567](https://github.com/PaddlePaddle/Paddle/pull/38567))
-
- - Added `paddle.kthvalue` and `paddle.Tensor.kthvalue`, which find the k-th smallest value of a Tensor along the specified axis together with the corresponding index. ([#38386](https://github.com/PaddlePaddle/Paddle/pull/38386))
-
- - Added `paddle.is_floating_point` and `paddle.Tensor.is_floating_point`, which check whether the input Tensor has a floating-point type. ([#37885](https://github.com/PaddlePaddle/Paddle/pull/37885))
-
- - Added `paddle.erfinv` and `paddle.Tensor.erfinv`, which compute the inverse error function of the input Tensor. ([#38295](https://github.com/PaddlePaddle/Paddle/pull/38295))
-
- - Added `paddle.lerp` and `paddle.Tensor.lerp`, which compute the linear interpolation between input Tensors according to the given weight. ([#37253](https://github.com/PaddlePaddle/Paddle/pull/37253))
-
- - Added `paddle.angle`, which computes the phase angle of a complex Tensor. ([#37689](https://github.com/PaddlePaddle/Paddle/pull/37689))
-
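- - Added `paddle.rad2deg` and `paddle.Tensor.rad2deg`, which convert elements from radians to degrees. ([#37598](https://github.com/PaddlePaddle/Paddle/pull/37598))
-
- - Added `paddle.deg2rad` and `paddle.Tensor.deg2rad`, which convert elements from degrees to radians. ([#37598](https://github.com/PaddlePaddle/Paddle/pull/37598))
-
- - Added `paddle.gcd` and `paddle.Tensor.gcd`, which compute the element-wise greatest common divisor of the absolute values of two inputs. ([#37819](https://github.com/PaddlePaddle/Paddle/pull/37819))
-
- - Added `paddle.lcm` and `paddle.Tensor.lcm`, which compute the element-wise least common multiple of the absolute values of two inputs. ([#37819](https://github.com/PaddlePaddle/Paddle/pull/37819))
-
- - Added `paddle.amax` and `paddle.Tensor.amax`, which take the maximum of the Tensor elements along the specified dimension. The forward result is the same as `max`; when several elements are equal to the maximum, the backward gradient is distributed evenly among their positions. ([#38417](https://github.com/PaddlePaddle/Paddle/pull/38417))
-
- - Added `paddle.amin` and `paddle.Tensor.amin`, which take the minimum of the Tensor elements along the specified dimension. The forward result is the same as `min`; when several elements are equal to the minimum, the backward gradient is distributed evenly among their positions. ([#38417](https://github.com/PaddlePaddle/Paddle/pull/38417))
-
- - Added `paddle.isclose`, which checks whether each pair of elements of two Tensors is close. ([#37135](https://github.com/PaddlePaddle/Paddle/pull/37135))
-
- - Added `paddle.put_along_axis` and `paddle.take_along_axis`, which place or extract elements at the specified indices. ([#38608](https://github.com/PaddlePaddle/Paddle/pull/38608))
-
- - Added `paddle.bincount` and `paddle.Tensor.bincount`, which count the number of occurrences of each element in a Tensor. ([#36317](https://github.com/PaddlePaddle/Paddle/pull/36317))
-
- - Added `paddle.fmax` and `paddle.fmin`, which extend max/min to handle NaN values in the two Tensors being compared: if exactly one of the two corresponding elements is NaN, the non-NaN value is returned; if both are NaN, NaN is returned. ([#37826](https://github.com/PaddlePaddle/Paddle/pull/37826))
-
- - Added `paddle.diff`, which computes the n-th forward difference along the given dimension; currently n=1 is supported. ([#37441](https://github.com/PaddlePaddle/Paddle/pull/37441))
-
- - Added the inverse hyperbolic function APIs `paddle.asinh`, `paddle.acosh`, and `paddle.atanh`. ([#37076](https://github.com/PaddlePaddle/Paddle/pull/37076))
-
- - Added `paddle.as_real` and `paddle.as_complex` for conversion between real Tensors and complex Tensors. ([#37784](https://github.com/PaddlePaddle/Paddle/pull/37784))
-
- - Added `paddle.complex`, which constructs a complex Tensor from the given real and imaginary parts. ([#37918](https://github.com/PaddlePaddle/Paddle/pull/37918), [#38272](https://github.com/PaddlePaddle/Paddle/pull/38272))
-
- - Added `paddle.det` and `paddle.slogdet`, which compute the determinant of a matrix and the natural logarithm of the determinant. ([#34992](https://github.com/PaddlePaddle/Paddle/pull/34992))
-
- - Added `paddle.nn.utils.parameters_to_vector`, which flattens multiple input parameters and concatenates them into one 1-D Tensor. ([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020))
-
- - Added `paddle.nn.utils.vector_to_parameters`, which splits one 1-D Tensor, in order, among multiple input parameters. ([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020))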
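
The NaN rule described above for `paddle.fmax` and `paddle.fmin` can be stated in a few lines. This is a pure-Python sketch of the element-wise semantics only, written for scalars and flat lists; the function names mirror the Paddle APIs for readability but this is not the Paddle implementation.

```python
import math


def fmax(a: float, b: float) -> float:
    """Maximum that ignores a single NaN; NaN only if both inputs are NaN."""
    if math.isnan(a):
        return b          # also covers the both-NaN case (b is NaN)
    if math.isnan(b):
        return a
    return max(a, b)


def fmin(a: float, b: float) -> float:
    """Minimum that ignores a single NaN; NaN only if both inputs are NaN."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)


def elementwise(op, xs, ys):
    """Apply a binary op to two equally long lists, element by element."""
    return [op(x, y) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    nan = float("nan")
    print(elementwise(fmax, [1.0, nan, nan], [nan, 2.0, nan]))
    print(elementwise(fmin, [1.0, nan, 5.0], [nan, 2.0, 3.0]))
```

Python's built-in `max` and `min` give order-dependent results when one argument is NaN, which is why the NaN checks come first.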
-
-- Added network-building APIs
-
- - Added `paddle.nn.Fold` and `paddle.nn.functional.fold`, which restore extracted sliding local blocks into a batched Tensor. ([#38613](https://github.com/PaddlePaddle/Paddle/pull/38613))
-
- - Added `paddle.nn.CELU` and `paddle.nn.functional.celu`, which support the CELU activation layer. ([#36088](https://github.com/PaddlePaddle/Paddle/pull/36088))
-
- - Added `paddle.nn.HingeEmbeddingLoss`, which computes the hinge embedding loss, commonly used for learning nonlinear embeddings or for semi-supervised learning. ([#37540](https://github.com/PaddlePaddle/Paddle/pull/37540))
-
- - Added the `paddle.nn.ZeroPad2D` API, which zero-pads the input according to the padding attribute. ([#37151](https://github.com/PaddlePaddle/Paddle/pull/37151))
-
- - Added `paddle.nn.MaxUnPool3D` and `paddle.nn.MaxUnPool1D`, which compute 3D and 1D max unpooling. ([#38716](https://github.com/PaddlePaddle/Paddle/pull/38716))
-
- - Added the `paddle.incubate.graph_khop_sampler`, `paddle.incubate.graph_sample_neighbors`, and `paddle.incubate.graph_reindex` APIs, which support multi-hop neighbor sampling and graph reindexing, mainly for training graph neural network models. ([#39146](https://github.com/PaddlePaddle/Paddle/pull/39146), [#40809](https://github.com/PaddlePaddle/Paddle/pull/40809))
-
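-- Added random number APIs
-
- - Added `paddle.poisson`, which generates a Tensor of Poisson-distributed random numbers using the input Tensor as the lambda parameter of the Poisson distribution. ([#38117](https://github.com/PaddlePaddle/Paddle/pull/38117))
-
- - Added the `paddle.randint_like` API, which creates a random Tensor drawn from a uniform distribution over [low, high) with the same shape as the input. ([#36169](https://github.com/PaddlePaddle/Paddle/pull/36169))
-
- - Added `paddle.Tensor.exponential_`, an inplace API that fills the input Tensor with random numbers drawn from an exponential distribution. ([#38256](https://github.com/PaddlePaddle/Paddle/pull/38256))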
-
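-- Added parameter initialization APIs
-
- - Added `paddle.nn.initializer.Dirac`, which initializes 3D/4D/5D parameters with the Dirac delta function; it is commonly used to initialize the parameters of the Conv1D/Conv2D/Conv3D convolution layers. ([#37389](https://github.com/PaddlePaddle/Paddle/pull/37389))
-
- - Added `paddle.nn.initializer.Orthogonal` for orthogonal matrix initialization; the initialized parameters are (semi-)orthogonal vectors. ([#37163](https://github.com/PaddlePaddle/Paddle/pull/37163))
-
- - Added `paddle.nn.initializer.calculate_gain`, which returns the recommended gain value for an activation function; the gain can be passed to some initialization APIs to adjust the initialization range. ([#37163](https://github.com/PaddlePaddle/Paddle/pull/37163))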
-
-- Added learning rate APIs
-
- - Added `paddle.optimizer.lr.MultiplicativeDecay`, a strategy that sets the learning rate through a `lambda` function. ([#38250](https://github.com/PaddlePaddle/Paddle/pull/38250))
-
-- Added distributed APIs
-
- - Added `paddle.incubate.optimizer.DistributedFusedLamb`, which allows the Lamb optimizer to update parameters in a distributed manner. ([#40011](https://github.com/PaddlePaddle/Paddle/pull/40011), [#39972](https://github.com/PaddlePaddle/Paddle/pull/39972), [#39900](https://github.com/PaddlePaddle/Paddle/pull/39900), [#39747](https://github.com/PaddlePaddle/Paddle/pull/39747), [#39148](https://github.com/PaddlePaddle/Paddle/pull/39148), [#39416](https://github.com/PaddlePaddle/Paddle/pull/39416))
-
-- Added optimizer APIs ([#40710](https://github.com/PaddlePaddle/Paddle/pull/40710))
-
- - `paddle.incubate.optimizer.functional.minimize_bfgs`: adds the second-order optimizer BFGS.
-
- - `paddle.incubate.optimizer.functional.minimize_lbfgs`: adds the second-order optimizer L-BFGS.
-
-- Added the `paddle.incubate.multiprocessing` module, which supports transferring Tensors (CPU/GPU) between Python processes. ([#37302](https://github.com/PaddlePaddle/Paddle/pull/37302), [#41339](https://github.com/PaddlePaddle/Paddle/pull/41339))
-
-- Added the `paddle.incubate.autotune.set_config` API, which supports automatic selection among multiple Kernel versions, automatic data layout conversion for mixed precision, and automatic selection of the DataLoader `num_workers`, to improve model performance automatically. ([#42301](https://github.com/PaddlePaddle/Paddle/pull/42301))
-
-- Added the `paddle.incubate.nn.FusedMultiTransformer` and `paddle.incubate.nn.functional.fused_multi_transformer` APIs, which fuse multiple transformer layers into a single op to improve model inference performance. Note: only forward inference is supported. ([#42311](https://github.com/PaddlePaddle/Paddle/pull/42311))
-
-- Added the einsum_v2 op, unified across dynamic and static graphs; it is compatible with the original Python-side `paddle.einsum` implementation while supporting dynamic-to-static export and more complete InferShape inference. ([#42495](https://github.com/PaddlePaddle/Paddle/pull/42495), [#42327](https://github.com/PaddlePaddle/Paddle/pull/42327), [#42397](https://github.com/PaddlePaddle/Paddle/pull/42397), [#42105](https://github.com/PaddlePaddle/Paddle/pull/42105))
-
-#### IR (Intermediate Representation)
-
-- Dynamic graph to static graph
-
- - The variable type StaticAnalysis module now supports type annotation for statements such as `a, b = paddle.shape(x)`. ([#39245](https://github.com/PaddlePaddle/Paddle/pull/39245))
-
- - Added support for `InputSpec.name` as a field in computing the hash key of the Program cache. ([#38273](https://github.com/PaddlePaddle/Paddle/pull/38273))
-
- - Added support for the `dict['key'] = x.shape` syntax. ([#40611](https://github.com/PaddlePaddle/Paddle/pull/40611))
-
- - Added support for Pure FP16 training. ([#36944](https://github.com/PaddlePaddle/Paddle/pull/36944))
-
- - Added support for the `for i in [x,y,z]` syntax. ([#37259](https://github.com/PaddlePaddle/Paddle/pull/37259))
-
- - Added support for Python 3 type hint syntax. ([#36544](https://github.com/PaddlePaddle/Paddle/pull/36544))
-
-- Pass development
-
- - Added forward and backward fusion of FC + [relu|gelu] based on NVIDIA cuBlasLt Epilogue. ([#39437](https://github.com/PaddlePaddle/Paddle/pull/39437))
-
-- Kernel Primitive API
-
- - Added KP operators on the GPU platform, including cast, scale, clip, bce_loss, abs_grad, reduce_sum_grad, reduce_mean_grad, clip, bce_loss, full, full_like, distribution, random, masked_select_kernel, where_index, masked_select_grad, dropout, sigmoid, where, abs_grad. ([#36203](https://github.com/PaddlePaddle/Paddle/pull/36203), [#36423](https://github.com/PaddlePaddle/Paddle/pull/36423), [#39390](https://github.com/PaddlePaddle/Paddle/pull/39390), [#39734](https://github.com/PaddlePaddle/Paddle/pull/39734), [#38500](https://github.com/PaddlePaddle/Paddle/pull/38500), [#38959](https://github.com/PaddlePaddle/Paddle/pull/38959), [#39197](https://github.com/PaddlePaddle/Paddle/pull/39197/), [#39563](https://github.com/PaddlePaddle/Paddle/pull/39563), [#39666](https://github.com/PaddlePaddle/Paddle/pull/39666), [#40517](https://github.com/PaddlePaddle/Paddle/pull/40517), [#40617](https://github.com/PaddlePaddle/Paddle/pull/40617), [#40766](https://github.com/PaddlePaddle/Paddle/pull/40766), [#39898](https://github.com/PaddlePaddle/Paddle/pull/39898), [#39609](https://github.com/PaddlePaddle/Paddle/pull/39609))
-
- - Added support for the XPU2 source code compilation mode. ([#37254](https://github.com/PaddlePaddle/Paddle/pull/37254), [#40397](https://github.com/PaddlePaddle/Paddle/pull/40397), [#38455](https://github.com/PaddlePaddle/Paddle/pull/38455))
-
- - Added support for reusing KP operators across XPU2 and GPU, including reduce, broadcast, elementwise_add, `exp, log, relu, sigmoid, leaky_relu, softplus, hard_swish, reciprocal`. ([#36904](https://github.com/PaddlePaddle/Paddle/pull/36904), [#37226](https://github.com/PaddlePaddle/Paddle/pull/37226), [#38918](https://github.com/PaddlePaddle/Paddle/pull/38918), [#40560](https://github.com/PaddlePaddle/Paddle/pull/40560/), [#39787](https://github.com/PaddlePaddle/Paddle/pull/39787), [#39917](https://github.com/PaddlePaddle/Paddle/pull/39917), [#40002](https://github.com/PaddlePaddle/Paddle/pull/40002), [#40364](https://github.com/PaddlePaddle/Paddle/pull/40364))
-
- - Added unit tests for KP operators on the XPU2 platform, including `brelu, ceil, celu, elu, floor, hard_shrink, hard_sigmoid, log1p, logsigmoid, relu6, silu, soft_relu, softsign, sqrt, square, swish, thresholded_relu, softshrink`. ([#40448](https://github.com/PaddlePaddle/Paddle/pull/40448), [#40524](https://github.com/PaddlePaddle/Paddle/pull/40524))
-
- - Added XPU2 KP model support, including resnet50, deepfm, wide_deep, yolov3-darknet53, det_mv3_db, bert, transformer, mobilenet_v3, GPT2.
-
-#### Mixed Precision Training
-
-- Split the `paddle.amp.GradScaler.unscale_` method out of `minimize` in the mixed precision training class `paddle.amp.GradScaler`, providing a standalone interface for restoring the loss. ([#35825](https://github.com/PaddlePaddle/Paddle/pull/35825))
-
-- Added FP16 support to `paddle.nn.ClipByGlobalNorm` in dynamic graph mode and an FP16 Kernel to the clip op, so that `clip`-related operations support FP16. ([#36198](https://github.com/PaddlePaddle/Paddle/pull/36198), [#36577](https://github.com/PaddlePaddle/Paddle/pull/36577))
-
-- The `optimizer` argument passed to `paddle.amp.decorate` may now be None. ([#37541](https://github.com/PaddlePaddle/Paddle/pull/37541))
-
-- For the merged_momentum op, added support for multiple learning rates as input, for the use_nesterov strategy, and for regularization. ([#37527](https://github.com/PaddlePaddle/Paddle/pull/37527))
-
-- Added the multi_tensor strategy to the `paddle.optimizer.Momentum` optimizer and a `set_to_zero` branch to `clear_grad` of the `Optimizer` class. ([#37564](https://github.com/PaddlePaddle/Paddle/pull/37564))
-
-- Added the multi_tensor strategy to the `paddle.optimizer.Adam` optimizer. ([#38010](https://github.com/PaddlePaddle/Paddle/pull/38010))
-
-- Added the multi_precision strategy to the `paddle.optimizer.SGD` optimizer. ([#38231](https://github.com/PaddlePaddle/Paddle/pull/38231))
-
-- The optimizer `state_dict` method now stores the `master weight` parameters. ([#39121](https://github.com/PaddlePaddle/Paddle/pull/39121))
-
-- Added support for op CUDA bfloat16 mixed precision training in O1 and O2 modes, which can be enabled through `paddle.amp.auto_cast`. ([#39029](https://github.com/PaddlePaddle/Paddle/pull/39029), [#39815](https://github.com/PaddlePaddle/Paddle/pull/39815))
-
-- 为如下 ops 添加 bfloat16 CUDA Kernel:matmul、concat、split、dropout、reshape、slice、squeeze、stack、transpose、unbind、elementwize_max、elementwize_add、elementwize_mul、elementwize_sub、scale、sum、layer_norm、p_norm、reduce_sum、softmax、log_softmax、sigmoid、sqrt、softplus、square、gaussian_random、fill_constant、fill_any_like。([#39485](https://github.com/PaddlePaddle/Paddle/pull/39485), [#39380](https://github.com/PaddlePaddle/Paddle/pull/39380), [#39395](https://github.com/PaddlePaddle/Paddle/pull/39380), [#39402](https://github.com/PaddlePaddle/Paddle/pull/39402), [#39457](https://github.com/PaddlePaddle/Paddle/pull/39457), [#39461](https://github.com/PaddlePaddle/Paddle/pull/39461), [#39602](https://github.com/PaddlePaddle/Paddle/pull/39602), [#39716](https://github.com/PaddlePaddle/Paddle/pull/39716), [#39683](https://github.com/PaddlePaddle/Paddle/pull/39683), [#39843](https://github.com/PaddlePaddle/Paddle/pull/39843), [#39999](https://github.com/PaddlePaddle/Paddle/pull/39999), [#40004](https://github.com/PaddlePaddle/Paddle/pull/40004), [#40027](https://github.com/PaddlePaddle/Paddle/pull/40027))
-
-- 为如下 ops 添加 bfloat16 CPU Kernel:dropout、reshape、slice、squeeze、unsqueeze、stack、transpose、unbind、elementwize_max、elementwise_mul、elementwise_sub、gather。([#39380](https://github.com/PaddlePaddle/Paddle/pull/39380), [#39395](https://github.com/PaddlePaddle/Paddle/pull/39380), [#39402](https://github.com/PaddlePaddle/Paddle/pull/39402), [#39457](https://github.com/PaddlePaddle/Paddle/pull/39457), [#39461](https://github.com/PaddlePaddle/Paddle/pull/39461), [#39602](https://github.com/PaddlePaddle/Paddle/pull/39602), [#39716](https://github.com/PaddlePaddle/Paddle/pull/39716), [#39683](https://github.com/PaddlePaddle/Paddle/pull/39683))
-
-- Added support for printing bfloat16 Tensors. ([#39375](https://github.com/PaddlePaddle/Paddle/pull/39375), [#39370](https://github.com/PaddlePaddle/Paddle/pull/39370))
-
-- Added FP16 computation support for `p_norm`, `elementwise_max`, `fill_constant_batch_size_like`, and `scatter`. ([#35888](https://github.com/PaddlePaddle/Paddle/pull/35888), [#39907](https://github.com/PaddlePaddle/Paddle/pull/39907), [#38136](https://github.com/PaddlePaddle/Paddle/pull/38136), [#38499](https://github.com/PaddlePaddle/Paddle/pull/38499))
-
-- Added int16_t support for the following ops: cumsum, less_than, less_equal, greater_than, greater_equal, equal, not_equal, fill_any_like, gather_nd, reduce_sum, where_index, reshape, unsqueeze. ([#39636](https://github.com/PaddlePaddle/Paddle/pull/39636))
-
-- Added support for int16_t labels in the cross_entropy op. ([#39409](https://github.com/PaddlePaddle/Paddle/pull/39409))
-
-- Added support for int16_t ids in the embedding op. ([#39381](https://github.com/PaddlePaddle/Paddle/pull/39381))
-
-- Added FP16 support for the reduce_mean op. ([#38289](https://github.com/PaddlePaddle/Paddle/pull/38289))
-
-- Added FP16 support for the elementwise_min op. ([#38123](https://github.com/PaddlePaddle/Paddle/pull/38123))
-
-- Updated the default bfloat16 AMP oneDNN support list. ([#39304](https://github.com/PaddlePaddle/Paddle/pull/39304))
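For readers unfamiliar with the bfloat16 type that the kernels above target, the following is a minimal, framework-independent sketch of how a float32 value maps to bfloat16 (the upper 16 bits of the float32 encoding, here with round-to-nearest-even). The function names are hypothetical and chosen for illustration; this is not the code used inside the Paddle kernels.

```python
import struct


def float32_to_bfloat16_bits(value: float) -> int:
    """Return the 16-bit bfloat16 pattern for value (round to nearest, ties to even)."""
    # Reinterpret the float32 encoding of value as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # NaN: keep it a NaN by forcing a mantissa bit, since truncation could clear them all.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0:
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add a bias so that dropping the low 16 bits rounds to nearest, ties to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float32(bits16: int) -> float:
    """Expand a bfloat16 bit pattern back to a Python float by zero-filling the low bits."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


def round_trip(value: float) -> float:
    """Show the precision loss of storing value as bfloat16."""
    return bfloat16_bits_to_float32(float32_to_bfloat16_bits(value))


if __name__ == "__main__":
    for v in (1.0, 3.14159, -2.0):
        print(v, "->", round_trip(v))
```

bfloat16 keeps the 8-bit exponent of float32 but only 7 explicit mantissa bits, so it preserves range while giving up precision, which is why it is commonly paired with AMP training.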
-
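-#### PHI, PaddlePaddle's highly reusable operator library
-
-To address the unclear operator interfaces, high operator reuse cost, and slow call performance of the original PaddlePaddle operator library, we refactored it and designed PHI, a flexible and efficient functional operator library in which new operators can be implemented by composing calls to functional operator interfaces. The new library provides more than 200 C++ compute APIs that are consistent with the Python development interfaces, and nearly 500 forward and backward functional kernels that can be composed, which substantially lowers the development cost of both native and custom operators. The new library supports developing kernels through the Primitive API, so that kernels can be reused across different hardware (such as GPU and XPU). It also supports plugging in hardware acceleration libraries (for example, for NPU) as plugins, so that those libraries can be reused at low cost. The work falls into the following parts: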
-
-- **Operator library infrastructure, core components, and mechanisms**: planned the directory structure of the new operator library, designed and implemented its common basic data structures, the new functional InferMeta and Kernel development paradigm, and the corresponding registration and management components. Compilation targets and compilation dependencies for Kernel files are generated automatically, so developers only need to focus on implementing the functional Kernel, which keeps the development paradigm concise and clear. ([#34425](https://github.com/PaddlePaddle/Paddle/pull/34425), [#37107](https://github.com/PaddlePaddle/Paddle/pull/37107), [#36946](https://github.com/PaddlePaddle/Paddle/pull/36946), [#36948](https://github.com/PaddlePaddle/Paddle/pull/36948), [#37876](https://github.com/PaddlePaddle/Paddle/pull/37876), [#37916](https://github.com/PaddlePaddle/Paddle/pull/37916), [#37977](https://github.com/PaddlePaddle/Paddle/pull/37977), [#38078](https://github.com/PaddlePaddle/Paddle/pull/38078), [#38861](https://github.com/PaddlePaddle/Paddle/pull/38861), [#39123](https://github.com/PaddlePaddle/Paddle/pull/39123), [#39131](https://github.com/PaddlePaddle/Paddle/pull/39131), [#39748](https://github.com/PaddlePaddle/Paddle/pull/39748), [#39790](https://github.com/PaddlePaddle/Paddle/pull/39790), [#39941](https://github.com/PaddlePaddle/Paddle/pull/39941), [#40239](https://github.com/PaddlePaddle/Paddle/pull/40239), [#40635](https://github.com/PaddlePaddle/Paddle/pull/40635), [#41091](https://github.com/PaddlePaddle/Paddle/pull/41091), [#37409](https://github.com/PaddlePaddle/Paddle/pull/37409), [#37942](https://github.com/PaddlePaddle/Paddle/pull/37942), [#39002](https://github.com/PaddlePaddle/Paddle/pull/39002), [#38109](https://github.com/PaddlePaddle/Paddle/pull/38109), [#37881](https://github.com/PaddlePaddle/Paddle/pull/37881), [#37517](https://github.com/PaddlePaddle/Paddle/pull/37517), [#39870](https://github.com/PaddlePaddle/Paddle/pull/39870), [#40975](https://github.com/PaddlePaddle/Paddle/pull/40975), [#39475](https://github.com/PaddlePaddle/Paddle/pull/39475), [#37304](https://github.com/PaddlePaddle/Paddle/pull/37304), #36910, #37120, #37146, #37215, #37255, #37369, #38258, #38257, #38355, #38853, #38937, #38977, #38946, #39085, #39153, #39228, #38301, #38275, #38506, #38607, #38473, #38632, #38811, #38880, #38996, #38914, #39101)
-
-- **Operator library C++ API system**: designed and implemented an operator definition paradigm based on yaml configuration files, and automatically generated more than 200 C++ compute APIs for internal and external developers to reuse, reducing the cost of repeatedly developing basic operations. ([#37668](https://github.com/PaddlePaddle/Paddle/pull/37668), [#36938](https://github.com/PaddlePaddle/Paddle/pull/36938), [#38172](https://github.com/PaddlePaddle/Paddle/pull/38172), [#38182](https://github.com/PaddlePaddle/Paddle/pull/38182), [#38311](https://github.com/PaddlePaddle/Paddle/pull/38311), [#38438](https://github.com/PaddlePaddle/Paddle/pull/38438), [#39057](https://github.com/PaddlePaddle/Paddle/pull/39057), [#39229](https://github.com/PaddlePaddle/Paddle/pull/39229), [#39281](https://github.com/PaddlePaddle/Paddle/pull/39281), [#39263](https://github.com/PaddlePaddle/Paddle/pull/39263), [#39408](https://github.com/PaddlePaddle/Paddle/pull/39408), [#39436](https://github.com/PaddlePaddle/Paddle/pull/39436), [#39482](https://github.com/PaddlePaddle/Paddle/pull/39482), [#39497](https://github.com/PaddlePaddle/Paddle/pull/39497), [#39651](https://github.com/PaddlePaddle/Paddle/pull/39651), [#39521](https://github.com/PaddlePaddle/Paddle/pull/39521), [#39760](https://github.com/PaddlePaddle/Paddle/pull/39760), [#40060](https://github.com/PaddlePaddle/Paddle/pull/40060), [#40196](https://github.com/PaddlePaddle/Paddle/pull/40196), [#40218](https://github.com/PaddlePaddle/Paddle/pull/40218), [#40640](https://github.com/PaddlePaddle/Paddle/pull/40640), [#40732](https://github.com/PaddlePaddle/Paddle/pull/40732), [#40729](https://github.com/PaddlePaddle/Paddle/pull/40729), [#40840](https://github.com/PaddlePaddle/Paddle/pull/40840), [#40867](https://github.com/PaddlePaddle/Paddle/pull/40867), [#41025](https://github.com/PaddlePaddle/Paddle/pull/41025), [#41368](https://github.com/PaddlePaddle/Paddle/pull/41368))
-
-- **Compatibility with each execution system**: connected the new InferMeta and Kernel to the original dynamic and static graph execution systems, and supported safely removing the original OpKernel registrations and migrating them to the new Kernel form. ([#34425](https://github.com/PaddlePaddle/Paddle/pull/34425), [#38825](https://github.com/PaddlePaddle/Paddle/pull/38825), [#38837](https://github.com/PaddlePaddle/Paddle/pull/38837), [#38842](https://github.com/PaddlePaddle/Paddle/pull/38842), [#38976](https://github.com/PaddlePaddle/Paddle/pull/38976), [#39134](https://github.com/PaddlePaddle/Paddle/pull/39134), [#39140](https://github.com/PaddlePaddle/Paddle/pull/39140), [#39135](https://github.com/PaddlePaddle/Paddle/pull/39135), [#39252](https://github.com/PaddlePaddle/Paddle/pull/39252), [#39222](https://github.com/PaddlePaddle/Paddle/pull/39222), [#39351](https://github.com/PaddlePaddle/Paddle/pull/39351))
-
-- **Decoupling the library's underlying data structures and utility functions from the framework**: removed Phi's dependency on the framework for core data structures, laying the foundation for compiling Phi independently and supporting a series of Phi-based efforts such as infrt and custom Kernels. ([#38583](https://github.com/PaddlePaddle/Paddle/pull/38583), [#39188](https://github.com/PaddlePaddle/Paddle/pull/39188), [#39560](https://github.com/PaddlePaddle/Paddle/pull/39560), [#39931](https://github.com/PaddlePaddle/Paddle/pull/39931), [#39169](https://github.com/PaddlePaddle/Paddle/pull/39169), [#38951](https://github.com/PaddlePaddle/Paddle/pull/38951), [#38898](https://github.com/PaddlePaddle/Paddle/pull/38898), [#38873](https://github.com/PaddlePaddle/Paddle/pull/38873), [#38696](https://github.com/PaddlePaddle/Paddle/pull/38696), [#38651](https://github.com/PaddlePaddle/Paddle/pull/38651), [#39359](https://github.com/PaddlePaddle/Paddle/pull/39359), [#39305](https://github.com/PaddlePaddle/Paddle/pull/39305), [#39234](https://github.com/PaddlePaddle/Paddle/pull/39234), [#39098](https://github.com/PaddlePaddle/Paddle/pull/39098), [#39120](https://github.com/PaddlePaddle/Paddle/pull/39120), [#38979](https://github.com/PaddlePaddle/Paddle/pull/38979), [#38899](https://github.com/PaddlePaddle/Paddle/pull/38899), [#38844](https://github.com/PaddlePaddle/Paddle/pull/38844), [#39714](https://github.com/PaddlePaddle/Paddle/pull/39714), [#39729](https://github.com/PaddlePaddle/Paddle/pull/39729), [#39889](https://github.com/PaddlePaddle/Paddle/pull/39889), [#39587](https://github.com/PaddlePaddle/Paddle/pull/39587), [#39558](https://github.com/PaddlePaddle/Paddle/pull/39558), [#39514](https://github.com/PaddlePaddle/Paddle/pull/39514), [#39502](https://github.com/PaddlePaddle/Paddle/pull/39502), [#39300](https://github.com/PaddlePaddle/Paddle/pull/39300), [#39246](https://github.com/PaddlePaddle/Paddle/pull/39246), [#39124](https://github.com/PaddlePaddle/Paddle/pull/39124))
-
-- **Integrating the custom operator mechanism with Phi and improving it**: custom operators can now call the more than 200 C++ compute APIs that Phi generates automatically, which lowers the cost of developing custom operators; a series of related issues was also fixed. ([#37122](https://github.com/PaddlePaddle/Paddle/pull/37122), [#37276](https://github.com/PaddlePaddle/Paddle/pull/37276), [#37281](https://github.com/PaddlePaddle/Paddle/pull/37281), [#37262](https://github.com/PaddlePaddle/Paddle/pull/37262), [#37415](https://github.com/PaddlePaddle/Paddle/pull/37415), [#37423](https://github.com/PaddlePaddle/Paddle/pull/37423), [#37583](https://github.com/PaddlePaddle/Paddle/pull/37683), [#38776](https://github.com/PaddlePaddle/Paddle/pull/38776), [#39353](https://github.com/PaddlePaddle/Paddle/pull/39353), [#41072](https://github.com/PaddlePaddle/Paddle/pull/41072))
-
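-- **Large-scale operator migration and rewriting**: migrated the forward and backward kernels of about 250 high-frequency operators to the new operator library and rewrote them in functional form, so that high-performance operators can be assembled quickly on the C++ side by wrapping calls to multiple basic Kernel functions. The corresponding yaml operator definitions were also added and connected to the new dynamic graph execution system, improving Python API scheduling performance. The migrated and rewritten operators include: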
-
- - sqrt ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - square ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - sin ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - sinh ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - elementwise_fmax ([#40140](https://github.com/PaddlePaddle/Paddle/pull/40140))
-
- - elementwise_fmin ([#40140](https://github.com/PaddlePaddle/Paddle/pull/40140))
-
- - pool2d ([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - max_pool2d_with_index ([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - pool3d ([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - max_pool3d_with_index ([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - fill_constant ([#36930](https://github.com/PaddlePaddle/Paddle/pull/36930), [#39465](https://github.com/PaddlePaddle/Paddle/pull/39465))
-
- - p_norm ([#40819](https://github.com/PaddlePaddle/Paddle/pull/40819))
-
- - fill_constant_batch_size_like ([#40784](https://github.com/PaddlePaddle/Paddle/pull/40784))
-
- - conv2d ([#39354](https://github.com/PaddlePaddle/Paddle/pull/39354))
-
- - conv2d_transpose ([#40675](https://github.com/PaddlePaddle/Paddle/pull/40675), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - conv3d ([#39354](https://github.com/PaddlePaddle/Paddle/pull/39354))
-
- - conv3d_transpose ([#40675](https://github.com/PaddlePaddle/Paddle/pull/40675), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - mish ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - gather_nd ([#40090](https://github.com/PaddlePaddle/Paddle/pull/40090), [#40043](https://github.com/PaddlePaddle/Paddle/pull/40043))
-
- - gather ([#40500](https://github.com/PaddlePaddle/Paddle/pull/40500))
-
- - scatter ([#40090](https://github.com/PaddlePaddle/Paddle/pull/40090), [#40043](https://github.com/PaddlePaddle/Paddle/pull/40043))
-
- - scatter_nd_add ([#40090](https://github.com/PaddlePaddle/Paddle/pull/40090), [#40043](https://github.com/PaddlePaddle/Paddle/pull/40043))
-
- - sgd ([#40045](https://github.com/PaddlePaddle/Paddle/pull/40045))
-
- - momentum ([#41319](https://github.com/PaddlePaddle/Paddle/pull/41319))
-
- - rmsprop ([#40994](https://github.com/PaddlePaddle/Paddle/pull/40994))
-
- - index_sample ([#38130](https://github.com/PaddlePaddle/Paddle/pull/38130), [#38459](https://github.com/PaddlePaddle/Paddle/pull/38459), [#39905](https://github.com/PaddlePaddle/Paddle/pull/39905))
-
- - adam ([#40351](https://github.com/PaddlePaddle/Paddle/pull/40351))
-
- - layer_norm ([#40193](https://github.com/PaddlePaddle/Paddle/pull/40193))
-
- - adagrad ([#40994](https://github.com/PaddlePaddle/Paddle/pull/40994))
-
- - adamax ([#40173](https://github.com/PaddlePaddle/Paddle/pull/40173))
-
- - adadelta ([#40173](https://github.com/PaddlePaddle/Paddle/pull/40173))
-
- - clip ([#40602](https://github.com/PaddlePaddle/Paddle/pull/40602), [#41661](https://github.com/PaddlePaddle/Paddle/pull/41661), [#41675](https://github.com/PaddlePaddle/Paddle/pull/41675))
-
- - ceil ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913))
-
- - cos ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - atan ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - cosh ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - erf ([#40388](https://github.com/PaddlePaddle/Paddle/pull/40388))
-
- - asin ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - acos ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - scale ([#39278](https://github.com/PaddlePaddle/Paddle/pull/39278))
-
- - elementwise_pow ([#40993](https://github.com/PaddlePaddle/Paddle/pull/40993))
-
- - elementwise_sub ([#39225](https://github.com/PaddlePaddle/Paddle/pull/39225), [#37260](https://github.com/PaddlePaddle/Paddle/pull/37260))
-
- - round ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913))
-
- - floor ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913))
-
- - pow ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913))
-
- - elementwise_floordiv ([#40993](https://github.com/PaddlePaddle/Paddle/pull/40993))
-
- - reciprocal ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - log1p ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785))
-
- - allclose ([#40469](https://github.com/PaddlePaddle/Paddle/pull/40469))
-
- - mul ([#40833](https://github.com/PaddlePaddle/Paddle/pull/40833))
-
- - elementwise_max ([#40590](https://github.com/PaddlePaddle/Paddle/pull/40590))
-
- - elementwise_min ([#40590](https://github.com/PaddlePaddle/Paddle/pull/40590))
-
- - elementwise_mod ([#40590](https://github.com/PaddlePaddle/Paddle/pull/40590))
-
- - elementwise_add ([#39048](https://github.com/PaddlePaddle/Paddle/pull/39048), [#37043](https://github.com/PaddlePaddle/Paddle/pull/37043))
-
- - matmul_v2 ([#36844](https://github.com/PaddlePaddle/Paddle/pull/36844), [#38713](https://github.com/PaddlePaddle/Paddle/pull/38713))
-
- - elementwise_mul ([#41042](https://github.com/PaddlePaddle/Paddle/pull/41042), [#40252](https://github.com/PaddlePaddle/Paddle/pull/40252), [#37471](https://github.com/PaddlePaddle/Paddle/pull/37471))
-
- - elementwise_div ([#40172](https://github.com/PaddlePaddle/Paddle/pull/40172), [#40039](https://github.com/PaddlePaddle/Paddle/pull/40039), [#37418](https://github.com/PaddlePaddle/Paddle/pull/37418))
-
- - SelectedRows ([#39037](https://github.com/PaddlePaddle/Paddle/pull/39037), [#39087](https://github.com/PaddlePaddle/Paddle/pull/39087), [#39128](https://github.com/PaddlePaddle/Paddle/pull/39128), [#39162](https://github.com/PaddlePaddle/Paddle/pull/39162), [#39236](https://github.com/PaddlePaddle/Paddle/pull/39236))
-
- - fill_any_like ([#39807](https://github.com/PaddlePaddle/Paddle/pull/39807))
-
- - dot ([#38359](https://github.com/PaddlePaddle/Paddle/pull/38359))
-
- - sum ([#40873](https://github.com/PaddlePaddle/Paddle/pull/40873))
-
- - cumsum ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200))
-
- - diag_v2 ([#39914](https://github.com/PaddlePaddle/Paddle/pull/39914))
-
- - auc ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200))
-
- - log_loss ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200))
-
- - one_hot_v2 ([#39876](https://github.com/PaddlePaddle/Paddle/pull/39876))
-
- - sigmoid_cross_entropy_with_logits ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200))
-
- - bce_loss ([#39868](https://github.com/PaddlePaddle/Paddle/pull/39868))
-
- - argsort ([#40151](https://github.com/PaddlePaddle/Paddle/pull/40151))
-
- - arg_max ([#40222](https://github.com/PaddlePaddle/Paddle/pull/40222))
-
- - arg_min ([#40222](https://github.com/PaddlePaddle/Paddle/pull/40222))
-
- - segment_pool ([#40099](https://github.com/PaddlePaddle/Paddle/pull/40099))
-
- - frobenius_norm ([#40707](https://github.com/PaddlePaddle/Paddle/pull/40707), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - dist ([#40178](https://github.com/PaddlePaddle/Paddle/pull/40178))
-
- - isnan_v2 ([#40076](https://github.com/PaddlePaddle/Paddle/pull/40076))
-
- - logical_and ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942))
-
- - logical_not ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942))
-
- - isfinite_v2 ([#40076](https://github.com/PaddlePaddle/Paddle/pull/40076))
-
- - logical_or ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942))
-
- - isinf_v2 ([#40076](https://github.com/PaddlePaddle/Paddle/pull/40076))
-
- - is_empty ([#39919](https://github.com/PaddlePaddle/Paddle/pull/39919))
-
- - logical_xor ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942))
-
- - less_than ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970))
-
- - not_equal ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970))
-
- - equal ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970))
-
- - less_equal ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970))
-
- - equal_all ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970))
-
- - uniform_random ([#39937](https://github.com/PaddlePaddle/Paddle/pull/39937))
-
- - randint ([#39876](https://github.com/PaddlePaddle/Paddle/pull/39876), [#41375](https://github.com/PaddlePaddle/Paddle/pull/41375))
-
- - randperm ([#41265](https://github.com/PaddlePaddle/Paddle/pull/41265))
-
- - unbind ([#39789](https://github.com/PaddlePaddle/Paddle/pull/39789))
-
- - bernoulli ([#39590](https://github.com/PaddlePaddle/Paddle/pull/39590))
-
- - increment ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913))
-
- - multinomial ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913))
-
- - addmm ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913))
-
- - cholesky ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913))
-
- - where ([#39811](https://github.com/PaddlePaddle/Paddle/pull/39811))
-
- - log10 ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785))
-
- - log2 ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785))
-
- - expm1 ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - atan2 ([#39806](https://github.com/PaddlePaddle/Paddle/pull/39806))
-
- - gaussian_random ([#39932](https://github.com/PaddlePaddle/Paddle/pull/39932), [#40122](https://github.com/PaddlePaddle/Paddle/pull/40122), [#40191](https://github.com/PaddlePaddle/Paddle/pull/40191))
-
- - empty ([#38334](https://github.com/PaddlePaddle/Paddle/pull/38334))
-
- - truncated_gaussian_random ([#39971](https://github.com/PaddlePaddle/Paddle/pull/39971), [#40191](https://github.com/PaddlePaddle/Paddle/pull/40191))
-
- - mv ([#39861](https://github.com/PaddlePaddle/Paddle/pull/39861), [#39954](https://github.com/PaddlePaddle/Paddle/pull/39954))
-
- - tan ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - set_value ([#40195](https://github.com/PaddlePaddle/Paddle/pull/40195), [#40478](https://github.com/PaddlePaddle/Paddle/pull/40478), [#40636](https://github.com/PaddlePaddle/Paddle/pull/40636))
-
- - bitwise_and ([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031))
-
- - bitwise_not ([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031))
-
- - bitwise_or ([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031))
-
- - poisson ([#39814](https://github.com/PaddlePaddle/Paddle/pull/39814))
-
- - cholesky_solve ([#40387](https://github.com/PaddlePaddle/Paddle/pull/40387))
-
- - bitwise_xor ([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031))
-
- - triangular_solve ([#40417](https://github.com/PaddlePaddle/Paddle/pull/40417))
-
- - sigmoid ([#40626](https://github.com/PaddlePaddle/Paddle/pull/40626))
-
- - atanh ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - softsign ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - thresholded_relu ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385))
-
- - tanh_shrink ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565))
-
- - stanh ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - reduce_mean ([#37559](https://github.com/PaddlePaddle/Paddle/pull/37559))
-
- - reduce_max ([#40225](https://github.com/PaddlePaddle/Paddle/pull/40225))
-
- - reduce_min ([#40374](https://github.com/PaddlePaddle/Paddle/pull/40374))
-
- - mean ([#40872](https://github.com/PaddlePaddle/Paddle/pull/40872), [#41319](https://github.com/PaddlePaddle/Paddle/pull/41319))
-
- - reduce_all ([#40374](https://github.com/PaddlePaddle/Paddle/pull/40374))
-
- - reduce_any ([#40374](https://github.com/PaddlePaddle/Paddle/pull/40374))
-
- - logsumexp ([#40790](https://github.com/PaddlePaddle/Paddle/pull/40790))
-
- - softshrink ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565))
-
- - range ([#41265](https://github.com/PaddlePaddle/Paddle/pull/41265), [#40581](https://github.com/PaddlePaddle/Paddle/pull/40851))
-
- - stack ([#40581](https://github.com/PaddlePaddle/Paddle/pull/40851))
-
- - tile ([#40371](https://github.com/PaddlePaddle/Paddle/pull/40371))
-
- - unique ([#40581](https://github.com/PaddlePaddle/Paddle/pull/40851))
-
- - unstack ([#40581](https://github.com/PaddlePaddle/Paddle/pull/40851))
-
- - slice ([#40736](https://github.com/PaddlePaddle/Paddle/pull/40736))
-
- - transpose2 ([#39327](https://github.com/PaddlePaddle/Paddle/pull/39327))
-
- - unsqueeze2 ([#40596](https://github.com/PaddlePaddle/Paddle/pull/40596))
-
- - squeeze2 ([#40596](https://github.com/PaddlePaddle/Paddle/pull/40596))
-
- - strided_slice ([#40708](https://github.com/PaddlePaddle/Paddle/pull/40708))
-
- - softmax ([#39547](https://github.com/PaddlePaddle/Paddle/pull/39547))
-
- - leaky_relu ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385))
-
- - gelu ([#40393](https://github.com/PaddlePaddle/Paddle/pull/40393))
-
- - prelu ([#40393](https://github.com/PaddlePaddle/Paddle/pull/40393))
-
- - log_softmax ([#40393](https://github.com/PaddlePaddle/Paddle/pull/40393))
-
- - elu ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565))
-
- - logsigmoid ([#40626](https://github.com/PaddlePaddle/Paddle/pull/40626))
-
- - psroi_pool ([#40353](https://github.com/PaddlePaddle/Paddle/pull/40353), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173))
-
- - kthvalue ([#40575](https://github.com/PaddlePaddle/Paddle/pull/40575))
-
- - mode ([#40571](https://github.com/PaddlePaddle/Paddle/pull/40571))
-
- - yolo_box ([#40112](https://github.com/PaddlePaddle/Paddle/pull/40112))
-
- - yolov3_loss ([#40944](https://github.com/PaddlePaddle/Paddle/pull/40944))
-
- - temporal_shift ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - depthwise_conv2d ([#39354](https://github.com/PaddlePaddle/Paddle/pull/39354))
-
- - pad3d ([#40701](https://github.com/PaddlePaddle/Paddle/pull/40701))
-
- - pad ([#40012](https://github.com/PaddlePaddle/Paddle/pull/40012))
-
- - greater_equal ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970))
-
- - kldiv_loss ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770))
-
- - isclose ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770))
-
- - silu ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565))
-
- - unfold ([#39778](https://github.com/PaddlePaddle/Paddle/pull/39778))
-
- - batch_norm ([#39347](https://github.com/PaddlePaddle/Paddle/pull/39347))
-
- - norm ([#39324](https://github.com/PaddlePaddle/Paddle/pull/39324))
-
- - roi_pool ([#40574](https://github.com/PaddlePaddle/Paddle/pull/40574), [#40682](https://github.com/PaddlePaddle/Paddle/pull/40682), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173))
-
- - roi_align ([#40382](https://github.com/PaddlePaddle/Paddle/pull/40382), [#40556](https://github.com/PaddlePaddle/Paddle/pull/40556), [#41402](https://github.com/PaddlePaddle/Paddle/pull/41402))
-
- - deformable_conv ([#40700](https://github.com/PaddlePaddle/Paddle/pull/40700), [#40794](https://github.com/PaddlePaddle/Paddle/pull/40794), [#41644](https://github.com/PaddlePaddle/Paddle/pull/41644))
-
- - deformable_conv_v1 ([#40794](https://github.com/PaddlePaddle/Paddle/pull/40794), [#41644](https://github.com/PaddlePaddle/Paddle/pull/41644))
-
- - label_smooth ([#39796](https://github.com/PaddlePaddle/Paddle/pull/39796))
-
- - grid_sampler ([#40585](https://github.com/PaddlePaddle/Paddle/pull/40585))
-
- - greater_than ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970))
-
- - pixel_shuffle ([#39949](https://github.com/PaddlePaddle/Paddle/pull/39949), [#39712](https://github.com/PaddlePaddle/Paddle/pull/39712))
-
- - nearest_interp_v2 ([#40855](https://github.com/PaddlePaddle/Paddle/pull/40855))
-
- - bilinear_interp_v2 ([#40855](https://github.com/PaddlePaddle/Paddle/pull/40855))
-
- - softmax_with_cross_entropy ([#40832](https://github.com/PaddlePaddle/Paddle/pull/40832))
-
- - rnn ([#41007](https://github.com/PaddlePaddle/Paddle/pull/41007))
-
- - reverse ([#40791](https://github.com/PaddlePaddle/Paddle/pull/40791))
-
- - trace ([#39510](https://github.com/PaddlePaddle/Paddle/pull/39510))
-
- - kron ([#40427](https://github.com/PaddlePaddle/Paddle/pull/40427))
-
- - accuracy ([#39982](https://github.com/PaddlePaddle/Paddle/pull/39982))
-
- - gather_tree ([#40082](https://github.com/PaddlePaddle/Paddle/pull/40082), [#39844](https://github.com/PaddlePaddle/Paddle/pull/39844))
-
- - dropout ([#40148](https://github.com/PaddlePaddle/Paddle/pull/40148))
-
- - bincount ([#39947](https://github.com/PaddlePaddle/Paddle/pull/39947))
-
- - warpctc ([#41389](https://github.com/PaddlePaddle/Paddle/pull/41389), [#40023](https://github.com/PaddlePaddle/Paddle/pull/40023))
-
- - multiplex ([#40007](https://github.com/PaddlePaddle/Paddle/pull/40007), [#40102](https://github.com/PaddlePaddle/Paddle/pull/40102))
-
- - qr ([#40007](https://github.com/PaddlePaddle/Paddle/pull/40007))
-
- - assign_value ([#40967](https://github.com/PaddlePaddle/Paddle/pull/40967))
-
- - assign ([#40022](https://github.com/PaddlePaddle/Paddle/pull/40022))
-
- - cast ([#37610](https://github.com/PaddlePaddle/Paddle/pull/37610))
-
- - tril_triu ([#40007](https://github.com/PaddlePaddle/Paddle/pull/40007), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - where_index ([#40255](https://github.com/PaddlePaddle/Paddle/pull/40255))
-
- - index_select ([#40260](https://github.com/PaddlePaddle/Paddle/pull/40260), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - roll ([#40257](https://github.com/PaddlePaddle/Paddle/pull/40257), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - cumprod ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770))
-
- - shard_index ([#40254](https://github.com/PaddlePaddle/Paddle/pull/40254))
-
- - reshape2 ([#40914](https://github.com/PaddlePaddle/Paddle/pull/40914), [#39631](https://github.com/PaddlePaddle/Paddle/pull/39631), [#38833](https://github.com/PaddlePaddle/Paddle/pull/38833), [#37164](https://github.com/PaddlePaddle/Paddle/pull/37164))
-
- - flip ([#39822](https://github.com/PaddlePaddle/Paddle/pull/39822), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974))
-
- - eye ([#39712](https://github.com/PaddlePaddle/Paddle/pull/39712), [#40105](https://github.com/PaddlePaddle/Paddle/pull/40105), [#41476](https://github.com/PaddlePaddle/Paddle/pull/41476))
-
- - lookup_table_v2 ([#39901](https://github.com/PaddlePaddle/Paddle/pull/39901))
-
- - searchsorted ([#40520](https://github.com/PaddlePaddle/Paddle/pull/40520), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - adamw ([#40351](https://github.com/PaddlePaddle/Paddle/pull/40351))
-
- - tanh ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385))
-
- - cross ([#39829](https://github.com/PaddlePaddle/Paddle/pull/39829))
-
- - concat ([#38955](https://github.com/PaddlePaddle/Paddle/pull/38955), [#41112](https://github.com/PaddlePaddle/Paddle/pull/41112))
-
- - split ([#39060](https://github.com/PaddlePaddle/Paddle/pull/39060))
-
- - linspace ([#40124](https://github.com/PaddlePaddle/Paddle/pull/40124))
-
- - huber_loss ([#39761](https://github.com/PaddlePaddle/Paddle/pull/39761))
-
- - hierarchical_sigmoid ([#40553](https://github.com/PaddlePaddle/Paddle/pull/40553))
-
- - nll_loss ([#39936](https://github.com/PaddlePaddle/Paddle/pull/39936))
-
- - graph_send_recv ([#40092](https://github.com/PaddlePaddle/Paddle/pull/40092), [#40320](https://github.com/PaddlePaddle/Paddle/pull/40320))
-
- - abs ([#39492](https://github.com/PaddlePaddle/Paddle/pull/39492), [#39762](https://github.com/PaddlePaddle/Paddle/pull/39762))
-
- - exp ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - rsqrt ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - viterbi_decode ([#40186](https://github.com/PaddlePaddle/Paddle/pull/40186))
-
- - conj ([#38247](https://github.com/PaddlePaddle/Paddle/pull/38247))
-
- - real ([#39777](https://github.com/PaddlePaddle/Paddle/pull/39777), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173))
-
- - imag ([#39777](https://github.com/PaddlePaddle/Paddle/pull/39777), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173))
-
- - take_along_axis ([#39959](https://github.com/PaddlePaddle/Paddle/pull/39959), [#40270](https://github.com/PaddlePaddle/Paddle/pull/40270), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974))
-
- - put_along_axis ([#39959](https://github.com/PaddlePaddle/Paddle/pull/39959), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974))
-
- - lgamma ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770))
-
- - relu ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - maxout ([#39959](https://github.com/PaddlePaddle/Paddle/pull/39959), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974))
-
- - log ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785))
-
- - bilinear_tensor_product ([#39903](https://github.com/PaddlePaddle/Paddle/pull/39903))
-
- - flatten_contiguous_range ([#38712](https://github.com/PaddlePaddle/Paddle/pull/38712), [#36957](https://github.com/PaddlePaddle/Paddle/pull/36957), [#41345](https://github.com/PaddlePaddle/Paddle/pull/41345))
-
- - matrix_rank ([#40074](https://github.com/PaddlePaddle/Paddle/pull/40074), [#40519](https://github.com/PaddlePaddle/Paddle/pull/40519), [#41466](https://github.com/PaddlePaddle/Paddle/pull/41466))
-
- - logit ([#37844](https://github.com/PaddlePaddle/Paddle/pull/37844))
-
- - lerp ([#40105](https://github.com/PaddlePaddle/Paddle/pull/40105), [#39524](https://github.com/PaddlePaddle/Paddle/pull/39524))
-
- - erfinv ([#39949](https://github.com/PaddlePaddle/Paddle/pull/39949), [#39712](https://github.com/PaddlePaddle/Paddle/pull/39712))
-
- - broadcast_tensors ([#40047](https://github.com/PaddlePaddle/Paddle/pull/40047))
-
- - gumbel_softmax ([#39873](https://github.com/PaddlePaddle/Paddle/pull/39873))
-
- - diagonal ([#39575](https://github.com/PaddlePaddle/Paddle/pull/39575))
-
- - trunc ([#39543](https://github.com/PaddlePaddle/Paddle/pull/39543), [#39772](https://github.com/PaddlePaddle/Paddle/pull/39772))
-
- - multi_dot ([#40038](https://github.com/PaddlePaddle/Paddle/pull/40038))
-
- - matrix_power ([#40231](https://github.com/PaddlePaddle/Paddle/pull/40231))
-
- - digamma ([#39240](https://github.com/PaddlePaddle/Paddle/pull/39240))
-
- - masked_select ([#39193](https://github.com/PaddlePaddle/Paddle/pull/39193))
-
- - determinant ([#40539](https://github.com/PaddlePaddle/Paddle/pull/40539))
-
- - eigh ([#40213](https://github.com/PaddlePaddle/Paddle/pull/40213))
-
- - size ([#39949](https://github.com/PaddlePaddle/Paddle/pull/39949), [#39712](https://github.com/PaddlePaddle/Paddle/pull/39712))
-
- - shape ([#40248](https://github.com/PaddlePaddle/Paddle/pull/40248))
-
- - reduce_sum ([#37559](https://github.com/PaddlePaddle/Paddle/pull/37559), [#41295](https://github.com/PaddlePaddle/Paddle/pull/41295))
-
- - reduce_prod ([#39844](https://github.com/PaddlePaddle/Paddle/pull/39844))
-
- - histogram ([#39496](https://github.com/PaddlePaddle/Paddle/pull/39496))
-
- - meshgrid ([#41411](https://github.com/PaddlePaddle/Paddle/pull/41411))
-
- - brelu ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385))
-
- - hard_swish ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913))
-
- - hard_shrink ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565))
-
- - selu (熊昆 [#39819](https://github.com/PaddlePaddle/Paddle/pull/39819))
-
- - expand_v2 ([#39471](https://github.com/PaddlePaddle/Paddle/pull/39471))
-
- - top_k_v2 ([#40064](https://github.com/PaddlePaddle/Paddle/pull/40064))
-
- - expand_as_v2 ([#40373](https://github.com/PaddlePaddle/Paddle/pull/40373))
-
- - swish ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913))
-
- - hard_sigmoid ([#40626](https://github.com/PaddlePaddle/Paddle/pull/40626))
-
- - exp, det, assign, gaussian_random, matrix_rank, eye, deformable_conv ([#41755](https://github.com/PaddlePaddle/Paddle/pull/41755), [#41737](https://github.com/PaddlePaddle/Paddle/pull/41737))
-
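-#### New Dynamic Graph Execution Mechanism
-
-To address the poor scheduling performance and limited extensibility of Paddle's original dynamic graph execution mechanism, we refactored its underlying execution mechanism. The new call-and-execute path works with the Phi operator library for efficient runtime execution; for operators that the Phi library supports, switching to the new dynamic graph mode noticeably improves scheduling performance. Because upgrading the framework's overall execution mechanism is a very large effort that is tightly coupled with the Phi operator library work, this mode is still not the default in this release. To try it, set the environment variable `FLAGS_enable_eager_mode=1`. The work includes the following: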
-
-- **Infrastructure, core components, and mechanisms of the new dynamic graph execution mechanism**: The dynamic graph execution code is now generated statically, turning the previously homogeneous operator construction into calls specialized for each Phi API, which greatly reduces scheduling overhead. ([#36059](https://github.com/PaddlePaddle/Paddle/pull/36059), [#37323](https://github.com/PaddlePaddle/Paddle/pull/37323), [#37556](https://github.com/PaddlePaddle/Paddle/pull/37556), [#37555](https://github.com/PaddlePaddle/Paddle/pull/37555), [#37478](https://github.com/PaddlePaddle/Paddle/pull/37478), [#37458](https://github.com/PaddlePaddle/Paddle/pull/37458), [#37479](https://github.com/PaddlePaddle/Paddle/pull/37479), [#37599](https://github.com/PaddlePaddle/Paddle/pull/37599), [#37659](https://github.com/PaddlePaddle/Paddle/pull/37659), [#37654](https://github.com/PaddlePaddle/Paddle/pull/37654), [#39200](https://github.com/PaddlePaddle/Paddle/pull/39200), [#39309](https://github.com/PaddlePaddle/Paddle/pull/39309), [#39319](https://github.com/PaddlePaddle/Paddle/pull/39319), [#39414](https://github.com/PaddlePaddle/Paddle/pull/39414), [#39504](https://github.com/PaddlePaddle/Paddle/pull/39504), [#39526](https://github.com/PaddlePaddle/Paddle/pull/39526), [#39878](https://github.com/PaddlePaddle/Paddle/pull/39878), [#39963](https://github.com/PaddlePaddle/Paddle/pull/39963))
-
-- **Sub-feature development and adaptation for the new dynamic graph execution mechanism**: Supports more flexible and more complete dynamic graph sub-features such as hook, pylayer, double_grad, inplace, and amp. ([#41396](https://github.com/PaddlePaddle/Paddle/pull/41396), [#40400](https://github.com/PaddlePaddle/Paddle/pull/40400), [#40695](https://github.com/PaddlePaddle/Paddle/pull/40695), [#41043](https://github.com/PaddlePaddle/Paddle/pull/41043), [#40915](https://github.com/PaddlePaddle/Paddle/pull/40915), [#41104](https://github.com/PaddlePaddle/Paddle/pull/41104), [#41350](https://github.com/PaddlePaddle/Paddle/pull/41350), [#41209](https://github.com/PaddlePaddle/Paddle/pull/41209), [#40830](https://github.com/PaddlePaddle/Paddle/pull/40830), [#40891](https://github.com/PaddlePaddle/Paddle/pull/40891), [#36814](https://github.com/PaddlePaddle/Paddle/pull/36814), [#37377](https://github.com/PaddlePaddle/Paddle/pull/37377), [#37193](https://github.com/PaddlePaddle/Paddle/pull/37193), [#36965](https://github.com/PaddlePaddle/Paddle/pull/36965), [#37810](https://github.com/PaddlePaddle/Paddle/pull/37810), [#36837](https://github.com/PaddlePaddle/Paddle/pull/36837), [#38488](https://github.com/PaddlePaddle/Paddle/pull/38488), [#39282](https://github.com/PaddlePaddle/Paddle/pull/39282), [#39449](https://github.com/PaddlePaddle/Paddle/pull/39449), [#39531](https://github.com/PaddlePaddle/Paddle/pull/39531), [#39638](https://github.com/PaddlePaddle/Paddle/pull/39638), [#39674](https://github.com/PaddlePaddle/Paddle/pull/39674), [#39893](https://github.com/PaddlePaddle/Paddle/pull/39893), [#40170](https://github.com/PaddlePaddle/Paddle/pull/40170), [#40693](https://github.com/PaddlePaddle/Paddle/pull/40693), [#40937](https://github.com/PaddlePaddle/Paddle/pull/40937), [#41016](https://github.com/PaddlePaddle/Paddle/pull/41016), [#41051](https://github.com/PaddlePaddle/Paddle/pull/41051), [#41121](https://github.com/PaddlePaddle/Paddle/pull/41121), [#41198](https://github.com/PaddlePaddle/Paddle/pull/41198), [#41287](https://github.com/PaddlePaddle/Paddle/pull/41287), [#41380](https://github.com/PaddlePaddle/Paddle/pull/41380), [#41306](https://github.com/PaddlePaddle/Paddle/pull/41306), [#41387](https://github.com/PaddlePaddle/Paddle/pull/41387), [#40623](https://github.com/PaddlePaddle/Paddle/pull/40623), [#40945](https://github.com/PaddlePaddle/Paddle/pull/40945))
-
-- **Automatic code generation for the new dynamic graph execution**: Splitting the computation and scheduling logic of a large number of homogeneous operators into specialized scheduling logic turned out to be a huge amount of work, so we introduced a new automatic code generation mechanism that generates this code and simplifies the dynamic graph runtime logic. To stay compatible with the various runtime logic already in the framework, we also use some complex compilation techniques to obtain information at runtime and generate more accurate scheduling code. ([#37574](https://github.com/PaddlePaddle/Paddle/pull/37574), [#37575](https://github.com/PaddlePaddle/Paddle/pull/37575), [#37639](https://github.com/PaddlePaddle/Paddle/pull/37639), [#37723](https://github.com/PaddlePaddle/Paddle/pull/37723), [#37753](https://github.com/PaddlePaddle/Paddle/pull/37753), [#37812](https://github.com/PaddlePaddle/Paddle/pull/37812), [#37837](https://github.com/PaddlePaddle/Paddle/pull/37837), [#37910](https://github.com/PaddlePaddle/Paddle/pull/37910), [#37943](https://github.com/PaddlePaddle/Paddle/pull/37943), [#37992](https://github.com/PaddlePaddle/Paddle/pull/37992), [#37959](https://github.com/PaddlePaddle/Paddle/pull/37959), [#38017](https://github.com/PaddlePaddle/Paddle/pull/38017), [#37969](https://github.com/PaddlePaddle/Paddle/pull/37969), [#38160](https://github.com/PaddlePaddle/Paddle/pull/38160), [#38085](https://github.com/PaddlePaddle/Paddle/pull/38085), [#38562](https://github.com/PaddlePaddle/Paddle/pull/38562), [#38573](https://github.com/PaddlePaddle/Paddle/pull/38573), [#39192](https://github.com/PaddlePaddle/Paddle/pull/39192), [#39215](https://github.com/PaddlePaddle/Paddle/pull/39215), [#39355](https://github.com/PaddlePaddle/Paddle/pull/39355), [#39358](https://github.com/PaddlePaddle/Paddle/pull/39358), [#39328](https://github.com/PaddlePaddle/Paddle/pull/39328), [#39233](https://github.com/PaddlePaddle/Paddle/pull/39233), [#39628](https://github.com/PaddlePaddle/Paddle/pull/39628), [#39767](https://github.com/PaddlePaddle/Paddle/pull/39767), [#39743](https://github.com/PaddlePaddle/Paddle/pull/39743), [#39897](https://github.com/PaddlePaddle/Paddle/pull/39897), [#39797](https://github.com/PaddlePaddle/Paddle/pull/39797), [#39997](https://github.com/PaddlePaddle/Paddle/pull/39997), [#40058](https://github.com/PaddlePaddle/Paddle/pull/40058), [#40080](https://github.com/PaddlePaddle/Paddle/pull/40080), [#40107](https://github.com/PaddlePaddle/Paddle/pull/40107), [#39962](https://github.com/PaddlePaddle/Paddle/pull/39962), [#40132](https://github.com/PaddlePaddle/Paddle/pull/40132), [#40276](https://github.com/PaddlePaddle/Paddle/pull/40276), [#40266](https://github.com/PaddlePaddle/Paddle/pull/40266), [#40480](https://github.com/PaddlePaddle/Paddle/pull/40480), [#40482](https://github.com/PaddlePaddle/Paddle/pull/40482), [#40368](https://github.com/PaddlePaddle/Paddle/pull/40368), [#40650](https://github.com/PaddlePaddle/Paddle/pull/40650), [#40815](https://github.com/PaddlePaddle/Paddle/pull/40815), [#40907](https://github.com/PaddlePaddle/Paddle/pull/40907), [#40935](https://github.com/PaddlePaddle/Paddle/pull/40935), [#41089](https://github.com/PaddlePaddle/Paddle/pull/41089))
-
-- **Integration of the new dynamic graph execution mechanism into the main framework and joint debugging**: Environment variables currently distinguish static graph mode from dynamic graph mode (covering both the new and the legacy dynamic graph). Most dynamic graph logic has been adapted under these modes, but a large number of issues are still being fixed. ([#37638](https://github.com/PaddlePaddle/Paddle/pull/37638), [#37643](https://github.com/PaddlePaddle/Paddle/pull/37643), [#37653](https://github.com/PaddlePaddle/Paddle/pull/37653), [#38314](https://github.com/PaddlePaddle/Paddle/pull/38314), [#38337](https://github.com/PaddlePaddle/Paddle/pull/38337), [#38338](https://github.com/PaddlePaddle/Paddle/pull/38338), [#39164](https://github.com/PaddlePaddle/Paddle/pull/39164), [#39326](https://github.com/PaddlePaddle/Paddle/pull/39326), [#40391](https://github.com/PaddlePaddle/Paddle/pull/40391), [#40201](https://github.com/PaddlePaddle/Paddle/pull/40201), [#40854](https://github.com/PaddlePaddle/Paddle/pull/40854), [#40887](https://github.com/PaddlePaddle/Paddle/pull/40887))
-
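-- **Updated some mode checks under dynamic graph to support a fast execution path for dynamic graph in compatibility mode** ([#40786](https://github.com/PaddlePaddle/Paddle/pull/40786)):
-
- - Non-static graph mode (the current transitional solution): `_non_static_mode()`.
-
- - In dynamic graph mode and in the new dynamic graph (the recommended check): `_in_dygraph_mode()`.
-
- - In dynamic graph mode and in the legacy dynamic graph (not recommended; to be deprecated in a future release): `_in_legacy_dygraph()`.
-
- - Enable the legacy dynamic graph and disable the new dynamic graph in dynamic graph mode: `_enable_legacy_dygraph()`, or exit `_test_eager_guard()`.
-
- - Enable the new dynamic graph and disable the legacy dynamic graph in dynamic graph mode: `_disable_legacy_dygraph()`, or use `with _test_eager_guard()`.
-
- - Check for the new dynamic graph in either static graph or dynamic graph mode: `_in_eager_without_dygraph_check()`.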
-
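-- **Inplace strategy supported after the dynamic graph refactoring**: the input and the output are the same Tensor.
-
- - Adapted the inplace strategy for the intermediate state of the dynamic graph refactoring. ([#40400](https://github.com/PaddlePaddle/Paddle/pull/40400))
-
- - Adapted the inplace strategy for the final state of the dynamic graph refactoring. ([#40695](https://github.com/PaddlePaddle/Paddle/pull/40695))
-
- - After the refactoring, added the inplace strategy to PyLayer. ([#41043](https://github.com/PaddlePaddle/Paddle/pull/41043))
-
- - After the refactoring, added the inplace strategy to Tensor's setitem. ([#40915](https://github.com/PaddlePaddle/Paddle/pull/40915))
-
- - After the refactoring, added the `_reset_grad_inplace_version` interface, which sets the inplace version of a Tensor's gradient to 0. ([#41101](https://github.com/PaddlePaddle/Paddle/pull/41101))
-
- - If the backward computation does not need the value of a forward Tensor (the no_need_buffer property), the inplace version check is unnecessary for that Tensor, so it is skipped for Tensors marked no_need_buffer. ([#41350](https://github.com/PaddlePaddle/Paddle/pull/41350))
-
- - Unified the error messages of the inplace version check before and after the dynamic graph refactoring. ([#41209](https://github.com/PaddlePaddle/Paddle/pull/41209))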
-
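-- **View strategy supported after the dynamic graph refactoring**: the input and output Tensors share the underlying data.
-
- - Adapted the view mechanism for the intermediate state of the dynamic graph refactoring, covering the `reshape`, `squeeze`, `unsqueeze`, and `flatten` APIs. ([#40830](https://github.com/PaddlePaddle/Paddle/pull/40830))
-
- - Adapted the view mechanism for the final state of the dynamic graph refactoring, covering the `reshape` API. ([#40891](https://github.com/PaddlePaddle/Paddle/pull/40891))
-
-- **Added weakref support on the Python side for the eager Tensor of the new dynamic graph.** ([#41797](https://github.com/PaddlePaddle/Paddle/pull/41797))
-
-- **Enhanced DoubleGrad in the new dynamic graph**, supporting the basic DoubleGrad feature. ([#41893](https://github.com/PaddlePaddle/Paddle/pull/41893), [#41894](https://github.com/PaddlePaddle/Paddle/pull/41894), [#41895](https://github.com/PaddlePaddle/Paddle/pull/41895))
-
-- **Added the `core.eager.StringTensor` interface**, which supports constructing a StringTensor on the Python side and using the StringTensor-related APIs. ([#41039](https://github.com/PaddlePaddle/Paddle/pull/41039))
-
-- **Added the `_grad_name` and `_grad_value` APIs to `core.eager.Tensor`**, which return the name and value of the gradient. ([#41990](https://github.com/PaddlePaddle/Paddle/pull/41990))
-
-- **Added handling of the no_need_buffer property for the intermediate state of the dynamic graph.** The inplace check in the backward pass skips Tensors that have the no_need_buffer property. ([#41720](https://github.com/PaddlePaddle/Paddle/pull/41720))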
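
The inplace version check mentioned in the items above can be pictured with a small plain-Python model. This is only a sketch of the idea, not Paddle's implementation: `ToyTensor`, `SavedInput`, and `check_before_backward` are hypothetical names invented for the illustration, and the real check lives in the C++ autograd code.

```python
class ToyTensor:
    """Hypothetical stand-in for a Tensor that counts inplace writes."""

    def __init__(self, value):
        self.value = value
        self.inplace_version = 0  # bumped on every inplace modification

    def add_(self, delta):
        # Inplace op: input and output are the same object.
        self.value += delta
        self.inplace_version += 1
        return self


class SavedInput:
    """Hypothetical record of a forward input kept for the backward pass."""

    def __init__(self, tensor, no_need_buffer=False):
        self.tensor = tensor
        self.saved_version = tensor.inplace_version  # snapshot at forward time
        self.no_need_buffer = no_need_buffer

    def check_before_backward(self):
        # Backward does not read the value of a no_need_buffer input,
        # so a later inplace write cannot corrupt the gradient.
        if self.no_need_buffer:
            return
        if self.tensor.inplace_version != self.saved_version:
            raise RuntimeError(
                "tensor was modified inplace after being saved for backward: "
                f"saved version {self.saved_version}, "
                f"current version {self.tensor.inplace_version}"
            )


x = ToyTensor(1.0)
needs_value = SavedInput(x)                       # backward needs x's value
shape_only = SavedInput(x, no_need_buffer=True)   # backward does not
x.add_(2.0)                                       # inplace write after saving

shape_only.check_before_backward()                # passes: check is skipped
try:
    needs_value.check_before_backward()
    detected = False
except RuntimeError:
    detected = True
```

In this model, resetting a version counter to 0 (as `_reset_grad_inplace_version` is described as doing for a gradient) would simply make later snapshots start from a clean count.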
-
-
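-#### New Static Graph Executor
-To address the original static graph executor's unsatisfactory scheduling performance in some scenarios and the difficulty of extending it to multiple streams, we implemented a new static graph executor that performs better and is easier to extend, making full use of the asynchronous scheduling capabilities of multiple streams and multiple threads. The new executor is a compatible upgrade of the original one. It is already the default in single-machine, single-card scenarios, so users get it automatically without changing their training code. To switch back to the original executor, set the environment variable `FLAGS_USE_STANDALONE_EXECUTOR=false`. ([#41179](https://github.com/PaddlePaddle/Paddle/pull/41179)) The main work is as follows: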
-
-- Basic components: a high-performance thread pool for multi-threaded operator scheduling in the executor ([#35470](https://github.com/PaddlePaddle/Paddle/pull/35470), [#35930](https://github.com/PaddlePaddle/Paddle/pull/35930), [#36030](https://github.com/PaddlePaddle/Paddle/pull/36030), [#36480](https://github.com/PaddlePaddle/Paddle/pull/36480), [#36688](https://github.com/PaddlePaddle/Paddle/pull/36688), [#36740](https://github.com/PaddlePaddle/Paddle/pull/36740), [#38335](https://github.com/PaddlePaddle/Paddle/pull/38335), [#40770](https://github.com/PaddlePaddle/Paddle/pull/40770)) and thread coordination components ([#38779](https://github.com/PaddlePaddle/Paddle/pull/38779), [#40876](https://github.com/PaddlePaddle/Paddle/pull/40876), [#40912](https://github.com/PaddlePaddle/Paddle/pull/40912)); timely GPU memory reclamation after operator execution ([#37642](https://github.com/PaddlePaddle/Paddle/pull/37642), [#39617](https://github.com/PaddlePaddle/Paddle/pull/39617), [#40859](https://github.com/PaddlePaddle/Paddle/pull/40859)); a new dependency analysis algorithm for the parallel executor ([#37231](https://github.com/PaddlePaddle/Paddle/pull/37231)); and more.
-
-- Scheduling logic: optimized how the executor schedules operators, added a multi-threaded asynchronous scheduling mechanism with multi-stream support, changed data type, device, and layout transformations into operator scheduling to improve performance, and added support for caching operator Kernel selection and for selecting the new Phi operators. ([#35024](https://github.com/PaddlePaddle/Paddle/pull/35024), [#34922](https://github.com/PaddlePaddle/Paddle/pull/34922), [#35711](https://github.com/PaddlePaddle/Paddle/pull/35711), [#35928](https://github.com/PaddlePaddle/Paddle/pull/35928), [#39458](https://github.com/PaddlePaddle/Paddle/pull/39458), [#36899](https://github.com/PaddlePaddle/Paddle/pull/36899))
-
-- Interface compatibility: compatible with the user interfaces and functionality of the original executor, such as aligning with the Python-side `Executor.run()` and supporting Tensor management in Scope, so that users can switch to the new executor transparently. ([#37278](https://github.com/PaddlePaddle/Paddle/pull/37278), [#37379](https://github.com/PaddlePaddle/Paddle/pull/37379), [#37445](https://github.com/PaddlePaddle/Paddle/pull/37445), [#37510](https://github.com/PaddlePaddle/Paddle/pull/37510), [#40955](https://github.com/PaddlePaddle/Paddle/pull/40955), [#41778](https://github.com/PaddlePaddle/Paddle/pull/41178), [#41058](https://github.com/PaddlePaddle/Paddle/pull/41058), [#38584](https://github.com/PaddlePaddle/Paddle/pull/38584), [#37957](https://github.com/PaddlePaddle/Paddle/pull/37957), [#37672](https://github.com/PaddlePaddle/Paddle/pull/37672), [#37474](https://github.com/PaddlePaddle/Paddle/pull/37474), [#37085](https://github.com/PaddlePaddle/Paddle/pull/37085), [#37061](https://github.com/PaddlePaddle/Paddle/pull/37061), [#36945](https://github.com/PaddlePaddle/Paddle/pull/36945))
-
-- Enhanced debugging and error reporting in multi-threaded scenarios: errors from child threads are captured and raised together in the main thread to improve the user experience. ([#36692](https://github.com/PaddlePaddle/Paddle/pull/36692), [#36802](https://github.com/PaddlePaddle/Paddle/pull/36802))
-
-- Fixed an issue where the new executor's communication stream reset the stream cache information in the Allocator, reducing RecordStream overhead in cross-stream scenarios; after this optimization, DeepFM model performance improves by about 8%. ([#42046](https://github.com/PaddlePaddle/Paddle/pull/42046))
-
-- Optimized the dependency analysis between operators in the new executor to improve runtime performance, and established correct dependencies for the send/recv communication operators to support pipeline parallelism. ([#42009](https://github.com/PaddlePaddle/Paddle/pull/42009))
-
-
-#### Distributed Training
-
-- Basic functions of multi-machine, multi-card training with collective communication
-
- - Added elastic training (node failure, scale-out, and scale-in) to improve the fault tolerance of distributed training. ([#36684](https://github.com/PaddlePaddle/Paddle/pull/36684), [#37177](https://github.com/PaddlePaddle/Paddle/pull/37177), [#37781](https://github.com/PaddlePaddle/Paddle/pull/37781))
-
- - Refactored the Launch startup module and added `master` coordination and the node count definition `nnodes` to make distributed launch easier to use. ([#40086](https://github.com/PaddlePaddle/Paddle/pull/40086), [#40568](https://github.com/PaddlePaddle/Paddle/pull/40568), [#40782](https://github.com/PaddlePaddle/Paddle/pull/40782), [#40844](https://github.com/PaddlePaddle/Paddle/pull/40844), [#40936](https://github.com/PaddlePaddle/Paddle/pull/40936), [#41190](https://github.com/PaddlePaddle/Paddle/pull/41190), [#41314](https://github.com/PaddlePaddle/Paddle/pull/41314))
-
- - Added support for heterogeneous training across GPU, NPU, and XPU hardware. ([#37613](https://github.com/PaddlePaddle/Paddle/pull/37613), [#37998](https://github.com/PaddlePaddle/Paddle/pull/37998))
-
- - Added the fleet_executor asynchronous pipeline executor. ([#36966](https://github.com/PaddlePaddle/Paddle/pull/36966), [#37049](https://github.com/PaddlePaddle/Paddle/pull/37049), [#37087](https://github.com/PaddlePaddle/Paddle/pull/37087), [#37126](https://github.com/PaddlePaddle/Paddle/pull/37126), [#37150](https://github.com/PaddlePaddle/Paddle/pull/37150), [#37203](https://github.com/PaddlePaddle/Paddle/pull/37203), [#37167](https://github.com/PaddlePaddle/Paddle/pull/37167), [#37282](https://github.com/PaddlePaddle/Paddle/pull/37282), [#37319](https://github.com/PaddlePaddle/Paddle/pull/37319), [#37462](https://github.com/PaddlePaddle/Paddle/pull/37462), [#37507](https://github.com/PaddlePaddle/Paddle/pull/37507), [#37533](https://github.com/PaddlePaddle/Paddle/pull/37533), [#37576](https://github.com/PaddlePaddle/Paddle/pull/37576), [#37605](https://github.com/PaddlePaddle/Paddle/pull/37605), [#37691](https://github.com/PaddlePaddle/Paddle/pull/37691), [#37742](https://github.com/PaddlePaddle/Paddle/pull/37742), [#37783](https://github.com/PaddlePaddle/Paddle/pull/37783), [#37809](https://github.com/PaddlePaddle/Paddle/pull/37809), [#37862](https://github.com/PaddlePaddle/Paddle/pull/37862), [#37882](https://github.com/PaddlePaddle/Paddle/pull/37882), [#37934](https://github.com/PaddlePaddle/Paddle/pull/37934), [#38024](https://github.com/PaddlePaddle/Paddle/pull/38024), [#38083](https://github.com/PaddlePaddle/Paddle/pull/38083), [#38164](https://github.com/PaddlePaddle/Paddle/pull/38164), [#38261](https://github.com/PaddlePaddle/Paddle/pull/38261), [#38290](https://github.com/PaddlePaddle/Paddle/pull/38290), [#40607](https://github.com/PaddlePaddle/Paddle/pull/40607), [#37093](https://github.com/PaddlePaddle/Paddle/pull/37093), [#37106](https://github.com/PaddlePaddle/Paddle/pull/37106), [#37143](https://github.com/PaddlePaddle/Paddle/pull/37143), [#37338](https://github.com/PaddlePaddle/Paddle/pull/37338), [#37376](https://github.com/PaddlePaddle/Paddle/pull/37376), [#37485](https://github.com/PaddlePaddle/Paddle/pull/37485), [#37531](https://github.com/PaddlePaddle/Paddle/pull/37531), [#37623](https://github.com/PaddlePaddle/Paddle/pull/37623), [#37693](https://github.com/PaddlePaddle/Paddle/pull/37693), [#37755](https://github.com/PaddlePaddle/Paddle/pull/37755), [#37807](https://github.com/PaddlePaddle/Paddle/pull/37807), [#37889](https://github.com/PaddlePaddle/Paddle/pull/37889), [#38420](https://github.com/PaddlePaddle/Paddle/pull/38420), [#38539](https://github.com/PaddlePaddle/Paddle/pull/38539), [#36892](https://github.com/PaddlePaddle/Paddle/pull/36892), [#37084](https://github.com/PaddlePaddle/Paddle/pull/37084), [#37158](https://github.com/PaddlePaddle/Paddle/pull/37158), [#37361](https://github.com/PaddlePaddle/Paddle/pull/37361), [#37509](https://github.com/PaddlePaddle/Paddle/pull/37509), [#37603](https://github.com/PaddlePaddle/Paddle/pull/37603), [#37703](https://github.com/PaddlePaddle/Paddle/pull/37703), [#37824](https://github.com/PaddlePaddle/Paddle/pull/37824), [#38114](https://github.com/PaddlePaddle/Paddle/pull/38114), [#38322](https://github.com/PaddlePaddle/Paddle/pull/38322), [#38535](https://github.com/PaddlePaddle/Paddle/pull/38535), [#38650](https://github.com/PaddlePaddle/Paddle/pull/38650), [#38709](https://github.com/PaddlePaddle/Paddle/pull/38709), [#38799](https://github.com/PaddlePaddle/Paddle/pull/38799), [#38839](https://github.com/PaddlePaddle/Paddle/pull/38839), [#38904](https://github.com/PaddlePaddle/Paddle/pull/38904))
-
- - Added distributed inference for large models. ([#38795](https://github.com/PaddlePaddle/Paddle/pull/38795), [#39012](https://github.com/PaddlePaddle/Paddle/pull/39012), [#39032](https://github.com/PaddlePaddle/Paddle/pull/39032), [#39076](https://github.com/PaddlePaddle/Paddle/pull/39076), [#39194](https://github.com/PaddlePaddle/Paddle/pull/39194), [#39207](https://github.com/PaddlePaddle/Paddle/pull/39207), [#39241](https://github.com/PaddlePaddle/Paddle/pull/39241), [#39603](https://github.com/PaddlePaddle/Paddle/pull/39603), [#39758](https://github.com/PaddlePaddle/Paddle/pull/39758), [#39992](https://github.com/PaddlePaddle/Paddle/pull/39992))
-
-- Dynamic graph hybrid parallelism
-
- - Refactored `paddle.distributed.fleet.utils.recompute` to support the new dynamic graph. ([#41396](https://github.com/PaddlePaddle/Paddle/pull/41396))
-
- - Added support for pure FP16 training. ([#36420](https://github.com/PaddlePaddle/Paddle/pull/36420))
-
- - Added the MoE (Mixture of Experts) parallel strategy, which supports training very large MoE models. ([#41092](https://github.com/PaddlePaddle/Paddle/pull/41092), [#40895](https://github.com/PaddlePaddle/Paddle/pull/40895), [#40850](https://github.com/PaddlePaddle/Paddle/pull/40580), [#39224](https://github.com/PaddlePaddle/Paddle/pull/39224))
-
- - Added the GroupSharded parallel strategy, which supports group-sharded training of model states at three stages (stage1, stage2, and stage3), supports synchronous and asynchronous communication, and can be combined with basic features such as Recompute, AMP O1/O2, Offload, GroupShardedClipGrad, and GroupShardedScaler. ([#37489](https://github.com/PaddlePaddle/Paddle/pull/37489), [#37568](https://github.com/PaddlePaddle/Paddle/pull/37568), [#37707](https://github.com/PaddlePaddle/Paddle/pull/37707), [#37836](https://github.com/PaddlePaddle/Paddle/pull/37836), [#37947](https://github.com/PaddlePaddle/Paddle/pull/37947), [#38151](https://github.com/PaddlePaddle/Paddle/pull/38151), [#38407](https://github.com/PaddlePaddle/Paddle/pull/38407), [#38052](https://github.com/PaddlePaddle/Paddle/pull/38052), [#39112](https://github.com/PaddlePaddle/Paddle/pull/39112), [#38989](https://github.com/PaddlePaddle/Paddle/pull/38989), [#39171](https://github.com/PaddlePaddle/Paddle/pull/39171), [#39285](https://github.com/PaddlePaddle/Paddle/pull/39285), [#39334](https://github.com/PaddlePaddle/Paddle/pull/39334), [#39397](https://github.com/PaddlePaddle/Paddle/pull/39397), [#39581](https://github.com/PaddlePaddle/Paddle/pull/39581), [#39668](https://github.com/PaddlePaddle/Paddle/pull/39668), [#40129](https://github.com/PaddlePaddle/Paddle/pull/40129), [#40396](https://github.com/PaddlePaddle/Paddle/pull/40396), [#40488](https://github.com/PaddlePaddle/Paddle/pull/40488), [#40601](https://github.com/PaddlePaddle/Paddle/pull/40601), [#37725](https://github.com/PaddlePaddle/Paddle/pull/37725), [#37904](https://github.com/PaddlePaddle/Paddle/pull/37904), [#38064](https://github.com/PaddlePaddle/Paddle/pull/38064))
-
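-- Static graph hybrid parallelism
-
- - Added the `scale_gradient` flag to `gradient_scale_configs` to control where gradients are averaged during gradient aggregation under pipeline parallelism. ([#36384](https://github.com/PaddlePaddle/Paddle/pull/36384))
-
- - Under tensor model parallelism, dropout supports setting a deterministic random seed generator, so that non-distributed variables are randomized consistently while distributed variables remain random. ([#36228](https://github.com/PaddlePaddle/Paddle/pull/36228))
-
- - NPU hybrid parallelism supports Offload, which can save 40% of device memory. ([#37224](https://github.com/PaddlePaddle/Paddle/pull/37224))
-
- - Added the optional `force_cpu` parameter to the seed op so that dropout can read the seed value directly from the CPU. ([#35820](https://github.com/PaddlePaddle/Paddle/pull/35820))
-
- - Improved the Automatic Sparsity (ASP) sharding strategy to support selecting the sharding strategy according to the program. ([#40028](https://github.com/PaddlePaddle/Paddle/pull/40028))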
-
-- Automatic parallelism
-
- - Added process relaunch after the automatic mapping between logical processes and physical devices. ([#37523](https://github.com/PaddlePaddle/Paddle/pull/37523), [#37326](https://github.com/PaddlePaddle/Paddle/pull/37326))
-
- - Improved the underlying mechanisms and interfaces of automatic parallelism, making it easier to unify the modules and add optimization passes. ([#36617](https://github.com/PaddlePaddle/Paddle/pull/36617), [#38132](https://github.com/PaddlePaddle/Paddle/pull/38132))
-
- - Added a unified resource representation to support automatic mapping between logical processes and physical devices. ([#37091](https://github.com/PaddlePaddle/Paddle/pull/37091), [#37482](https://github.com/PaddlePaddle/Paddle/pull/37482), [#37094](https://github.com/PaddlePaddle/Paddle/pull/37094))
-
- - Improved distributed attribute completion for the backward and update parts of the computation graph in automatic parallelism. ([#36744](https://github.com/PaddlePaddle/Paddle/pull/36744))
-
- - Added data sharding. ([#36055](https://github.com/PaddlePaddle/Paddle/pull/36055))
-
- - Added tensor resharding, which reshards a tensor according to the distributed attributes of the tensor and the operator. ([#40865](https://github.com/PaddlePaddle/Paddle/pull/40865), [#41106](https://github.com/PaddlePaddle/Paddle/pull/41106))
-
- - Added automatic conversion of distributed parameters when the number of resources or the parallel strategy changes. ([#40434](https://github.com/PaddlePaddle/Paddle/pull/40434))
-
- - Added gradient accumulation (GradientMerge) to reduce the number of communications and improve training efficiency. ([#38259](https://github.com/PaddlePaddle/Paddle/pull/38259), [#40737](https://github.com/PaddlePaddle/Paddle/pull/40737))
-
- - Added recomputation (Recompute) to optimize GPU memory usage. ([#38920](https://github.com/PaddlePaddle/Paddle/pull/38920))
-
- - Added the Sharding optimization pass, which supports sharding optimization at the three p-g-os stages. ([#38502](https://github.com/PaddlePaddle/Paddle/pull/38502))
-
- - Added the AMP + FP16 optimization pass. ([#38764](https://github.com/PaddlePaddle/Paddle/pull/38764), [#40615](https://github.com/PaddlePaddle/Paddle/pull/40615))
-
- - Added QKV fuse sharding for Transformer-style models. ([#39080](https://github.com/PaddlePaddle/Paddle/pull/39080))
-
- - Added distributed attribute inference for the while op, ensuring that the iterative inference algorithm converges. ([#39939](https://github.com/PaddlePaddle/Paddle/pull/39939), [#39086](https://github.com/PaddlePaddle/Paddle/pull/39086), [#39014](https://github.com/PaddlePaddle/Paddle/pull/39014))
-
- - Added support for training and inference with sub-blocks and while-op control flow. ([#39612](https://github.com/PaddlePaddle/Paddle/pull/39612), [#39895](https://github.com/PaddlePaddle/Paddle/pull/39895), [#40077](https://github.com/PaddlePaddle/Paddle/pull/40077))
-
-- Parameter server
-
- - Under GPUPS, added a NAN/INF value checking tool. ([#38131](https://github.com/PaddlePaddle/Paddle/pull/38131))
-
- - Under GPUPS, added the set_date interface to support incremental training. ([#36194](https://github.com/PaddlePaddle/Paddle/pull/36194))
-
- - Under GPUPS, added asynchronous release of the dataset. ([#37790](https://github.com/PaddlePaddle/Paddle/pull/37790))
-
- - Under GPUPS, added support for dumping parameters and intermediate layers. ([#36157](https://github.com/PaddlePaddle/Paddle/pull/36157))
-
- - Under GPUPS, added support for configuring optimizer parameters. ([#39783](https://github.com/PaddlePaddle/Paddle/pull/39783), [#39849](https://github.com/PaddlePaddle/Paddle/pull/39849))
-
- - Under the unified parameter server, refactored the base classes of the communication, storage, and other modules to make each module easier to extend. ([#41207](https://github.com/PaddlePaddle/Paddle/pull/41207), [#41022](https://github.com/PaddlePaddle/Paddle/pull/41022), [#40702](https://github.com/PaddlePaddle/Paddle/pull/40702), [#39341](https://github.com/PaddlePaddle/Paddle/pull/39341), [#39377](https://github.com/PaddlePaddle/Paddle/pull/39377), [#39191](https://github.com/PaddlePaddle/Paddle/pull/39191), [#39064](https://github.com/PaddlePaddle/Paddle/pull/39064))
-
- - Under the unified parameter server, added an evaluation metrics module that supports computing metrics such as AUC/WuAUC/MaskAuc and can be extended with custom metrics. ([#38789](https://github.com/PaddlePaddle/Paddle/pull/38789))
-
- - Added support for XPU parameter server training on the Kunlunxin 2 chip. ([#41917](https://github.com/PaddlePaddle/Paddle/pull/41917), [#42266](https://github.com/PaddlePaddle/Paddle/pull/42266), [#41916](https://github.com/PaddlePaddle/Paddle/pull/41916))
-
-#### Profiler
-
-- Added the performance analysis module `paddle.profiler` in the Python layer, which collects, exports, and summarizes performance data during training and inference. ([#40065](https://github.com/PaddlePaddle/Paddle/pull/40065), [#40357](https://github.com/PaddlePaddle/Paddle/pull/40357), [#40888](https://github.com/PaddlePaddle/Paddle/pull/40888))
-
- - `paddle.profiler.Profiler`: the profiler, the interface users interact with. ([#41029](https://github.com/PaddlePaddle/Paddle/pull/41029), [#41524](https://github.com/PaddlePaddle/Paddle/pull/41524), [#41157](https://github.com/PaddlePaddle/Paddle/pull/41157), [#40249](https://github.com/PaddlePaddle/Paddle/pull/40249), [#40111](https://github.com/PaddlePaddle/Paddle/pull/40111), [#39964](https://github.com/PaddlePaddle/Paddle/pull/39964), [#40133](https://github.com/PaddlePaddle/Paddle/pull/40133))
-
- - `paddle.profiler.RecordEvent`: records time at custom instrumentation points. ([#39693](https://github.com/PaddlePaddle/Paddle/pull/39693), [#39694](https://github.com/PaddlePaddle/Paddle/pull/39694), [#39695](https://github.com/PaddlePaddle/Paddle/pull/39695), [#39675](https://github.com/PaddlePaddle/Paddle/pull/39675), [#41445](https://github.com/PaddlePaddle/Paddle/pull/41445), [#41132](https://github.com/PaddlePaddle/Paddle/pull/41132))
-
- - `paddle.profiler.ProfilerTarget`: specifies the target device for profiling.
-
- - `paddle.profiler.ProfilerState`: represents the state of the profiler.
-
- - `paddle.profiler.SortedKeys`: specifies how the data in the summary tables is sorted.
-
- - `paddle.profiler.make_scheduler`: generates a scheduler of profiler states, which controls the collection range periodically.
-
- - `paddle.profiler.export_chrome_tracing`: saves performance data to a Google Chrome tracing file that can be viewed with the chrome://tracing plugin. ([#39316](https://github.com/PaddlePaddle/Paddle/pull/39316), [#39984](https://github.com/PaddlePaddle/Paddle/pull/39984), [#41029](https://github.com/PaddlePaddle/Paddle/pull/41029))
-
- - `paddle.profiler.export_protobuf`: saves performance data to a protobuf file in an internal structured representation. ([#39519](https://github.com/PaddlePaddle/Paddle/pull/39519), [#39109](https://github.com/PaddlePaddle/Paddle/pull/39109), [#39474](https://github.com/PaddlePaddle/Paddle/pull/39474))
-
- - `paddle.profiler.load_profiler_result`: loads performance data that was saved to a protobuf file.
-
- - `paddle.profiler.Profiler`: with the `timer_only` parameter set, reports statistics on data reading, per-step cost, and throughput for the model. ([#40386](https://github.com/PaddlePaddle/Paddle/pull/40386))
-
-- Refactored the underlying Profiler infrastructure in the C++ layer
-
- - Refactored the controller architecture of the Profiler. ([#38826](https://github.com/PaddlePaddle/Paddle/pull/38826), [#39230](https://github.com/PaddlePaddle/Paddle/pull/39230), [#39779](https://github.com/PaddlePaddle/Paddle/pull/39779))
-
- - Added the Host Tracer to collect host-side performance metrics. ([#37629](https://github.com/PaddlePaddle/Paddle/pull/39629), [#37766](https://github.com/PaddlePaddle/Paddle/pull/37766), [#37944](https://github.com/PaddlePaddle/Paddle/pull/37944), [#38280](https://github.com/PaddlePaddle/Paddle/pull/38280), [#39975](https://github.com/PaddlePaddle/Paddle/pull/39975), [#40460](https://github.com/PaddlePaddle/Paddle/pull/40460))
-
- - Added the CUDA Tracer to collect device-side performance metrics. ([#39488](https://github.com/PaddlePaddle/Paddle/pull/39488))
-
- - Added support for leveled profiling in the Profiler. ([#39926](https://github.com/PaddlePaddle/Paddle/pull/39926))
-
-- Changed the names and types of op instrumentation points under the new dynamic graph. ([#41771](https://github.com/PaddlePaddle/Paddle/pull/41771/))
-
-- Added the Kernel table and improved how table contents are displayed. ([#41989](https://github.com/PaddlePaddle/Paddle/pull/41989))
-
-- Eliminated the performance degradation of the model's forward computation when the Profiler is disabled. ([#42142](https://github.com/PaddlePaddle/Paddle/pull/42142))
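
The periodic control of the collection range attributed to `paddle.profiler.make_scheduler` above can be illustrated with a plain-Python model that maps a step number to a profiler state. This is a sketch under assumptions, not Paddle's code: the function name `make_cycle_scheduler`, the `State` enum, and the exact meaning given to the `closed`, `ready`, `record`, `repeat`, and `skip_first` parameters are chosen for the illustration and should be checked against the Paddle API documentation.

```python
from enum import Enum


class State(Enum):
    """Illustrative profiler states (assumed, not imported from Paddle)."""
    CLOSED = 0             # profiler off
    READY = 1              # warming up, data discarded
    RECORD = 2             # collecting data
    RECORD_AND_RETURN = 3  # last collecting step of a period


def make_cycle_scheduler(closed, ready, record, repeat=0, skip_first=0):
    """Return a function mapping a step index to a State.

    Each period is `closed` steps off, then `ready` warm-up steps,
    then `record` collecting steps. `repeat=0` means repeat forever.
    """
    if record < 1:
        raise ValueError("record must be at least 1")
    period = closed + ready + record

    def scheduler(step):
        if step < skip_first:
            return State.CLOSED
        step -= skip_first
        if repeat > 0 and step // period >= repeat:
            return State.CLOSED  # all requested periods are done
        pos = step % period
        if pos < closed:
            return State.CLOSED
        if pos < closed + ready:
            return State.READY
        if pos < period - 1:
            return State.RECORD
        return State.RECORD_AND_RETURN

    return scheduler


sched = make_cycle_scheduler(closed=1, ready=1, record=2, repeat=1, skip_first=1)
timeline = [sched(step).name for step in range(6)]
```

With these settings the profiler would stay off for the skipped step and one closed step, warm up for one step, collect for two steps, and then stay off.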
-
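-#### CINN Compiler Integration
-
-Paddle's compiler capabilities are gradually being enriched. Following changes in CINN ([GitHub - PaddlePaddle/CINN: Compiler Infrastructure for Neural Networks](https://github.com/PaddlePaddle/CINN)), the Paddle-side integration was updated accordingly to match CINN's functionality. This mainly includes new subgraph management features for the Paddle-CINN execution flow, GPU memory and speed optimizations, and fixes for bugs found during development.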
-
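-- Feature development:
-
- - Subgraph op related:
-
- - Added finding and generating CINN subgraphs from the computation graph. ([#36345](https://github.com/PaddlePaddle/Paddle/pull/36345))
-
- - Added the cinn_launch op as the runtime entry point to CINN. It schedules CINN to compile the subgraph, initializes the data space, and schedules execution of the generated Kernels. ([#36600](https://github.com/PaddlePaddle/Paddle/pull/36600))
-
- - Added the helper class CinnLaunchContext to the Kernel implementation of the cinn_launch op to manage the intermediate data of subgraph compilation and execution, improving extensibility and code readability. ([#37938](https://github.com/PaddlePaddle/Paddle/pull/37938))
-
- - Added extra fetch nodes to CINN subgraphs so that nodes outside CINN can obtain the values of the variables to be fetched. ([#37172](https://github.com/PaddlePaddle/Paddle/pull/37172), [#37190](https://github.com/PaddlePaddle/Paddle/pull/37190))
-
- - Added symbolization of CINN subgraphs, which is used to topologically sort a subgraph and return the CINN execution sequence. ([#36417](https://github.com/PaddlePaddle/Paddle/pull/36417))
-
- - Added the CinnCompiler class, which invokes CINN to compile the subgraphs in a model that can be replaced with CINN operators. ([#36562](https://github.com/PaddlePaddle/Paddle/pull/36562), [#36975](https://github.com/PaddlePaddle/Paddle/pull/36975))
-
- - Added an interface to the CINN symbolization class for getting the names of a subgraph's fetch variables, preventing fetch variables from being fused away during compilation optimization. ([#37218](https://github.com/PaddlePaddle/Paddle/pull/37218))
-
- - Development checks, debugging, and API changes:
-
- - Synchronized with the renaming of the NetBuilder API in CINN. ([#40392](https://github.com/PaddlePaddle/Paddle/pull/40392))
-
- - Added the necessary debug log messages to Paddle-CINN. ([#36867](https://github.com/PaddlePaddle/Paddle/pull/36867))
-
- - Added conversion functions between Paddle desc and CINN desc. ([#36100](https://github.com/PaddlePaddle/Paddle/pull/36100))
-
- - Operators implemented in CINN may leave some input variables unused compared with Paddle, so the check that every input variable must be used was removed from the cinn_launch op. ([#37119](https://github.com/PaddlePaddle/Paddle/pull/37119))
-
- - Added the cinn_instruction_run op, which invokes CINN to execute a single generated instruction and makes it easier for Paddle to build a Graph that schedules and runs the subgraph. ([#39435](https://github.com/PaddlePaddle/Paddle/pull/39435), [#39576](https://github.com/PaddlePaddle/Paddle/pull/39576))
-
- - Added control macros to Paddle for the CUDA/CUBLAS/MKL/CINN pass application needed to compile CINN. ([#37066](https://github.com/PaddlePaddle/Paddle/pull/37066), [#36660](https://github.com/PaddlePaddle/Paddle/pull/36660))
-
- - Added two control flags, FLAGS_allow_cinn_ops and FLAGS_deny_cinn_ops, to control which kinds of native operators are replaced by CINN operators during Paddle training. ([#36842](https://github.com/PaddlePaddle/Paddle/pull/36842))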
-
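-- Performance optimization:
-
- - Speed optimization
-
- - Reduced the time spent computing CinnCacheKey. ([#37786](https://github.com/PaddlePaddle/Paddle/pull/37786), [#37317](https://github.com/PaddlePaddle/Paddle/pull/37317))
-
- - Cached the variable scope of CINN-compiled subgraphs to reduce the cost of constructing runtime parameters. ([#37983](https://github.com/PaddlePaddle/Paddle/pull/37983))
-
- - Integrated CINN auto-tuning into subgraph compilation, enabled through a flag, to ease further tuning of training performance. ([#41795](https://github.com/PaddlePaddle/Paddle/pull/41795))
-
- - Refactored the correctness check of compilation results during subgraph compilation to avoid repeated checks at runtime and reduce scheduling overhead. ([#41777](https://github.com/PaddlePaddle/Paddle/pull/41777))
-
- - Enabled the TransposeFolding and GemmRewriter optimization passes by default in Paddle-CINN training. ([#41084](https://github.com/PaddlePaddle/Paddle/pull/41084))
-
- - Passed the CUDA stream created in Paddle into CINN so that Paddle and CINN share the same CUDA stream when computing. ([#37337](https://github.com/PaddlePaddle/Paddle/pull/37337))
-
- - Moved the logic that applies CINN optimization passes from Paddle into CINN. ([#42047](https://github.com/PaddlePaddle/Paddle/pull/42047), [#42070](https://github.com/PaddlePaddle/Paddle/pull/42070))
-
- - GPU memory optimization
-
- - Added NoNeedBufferVars to the cinn_launch op to declare the input variables that do not need a buffer, so that memory optimization can release unused space early. ([#38367](https://github.com/PaddlePaddle/Paddle/pull/38367))
-
- - Passed in the reference count information of variables external to the subgraph so that the subgraph inside cinn_launch can reuse the memory optimization pass, reducing the GPU memory overhead of using CINN. ([#39209](https://github.com/PaddlePaddle/Paddle/pull/39209), [#39622](https://github.com/PaddlePaddle/Paddle/pull/39622))
-
- - Added conversion of the set of executable instructions generated by CINN compilation into a Paddle Graph, which allows reusing the Paddle scheduler and the memory optimization pass and further reduces the GPU memory overhead of using CINN. ([#39724](https://github.com/PaddlePaddle/Paddle/pull/39724), [#39911](https://github.com/PaddlePaddle/Paddle/pull/39911))
-
- - Added support in the Kernel of the cinn_instruction_run op for dynamically allocating space according to the data types inferred from the compilation result. ([#40920](https://github.com/PaddlePaddle/Paddle/pull/40920))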
-
-- 问题修复:
-
- - 修复并优化 CINN 子图的生成逻辑。([#36503](https://github.com/PaddlePaddle/Paddle/pull/36503))
-
- - 修复 Paddle-CINN 不支持无输入子图的问题。([#40814](https://github.com/PaddlePaddle/Paddle/pull/40814))
-
- - 修复由于 CINN 无法处理 batch_norm 等算子中存在的无用输出而报错的问题。([#36996](https://github.com/PaddlePaddle/Paddle/pull/36996))
-
- - 修复若干 CINN 子图划分以及符号化中存在的 bug,解决 Paddle 训练接入 CINN 全流程打通过程中遇到的问题。([#36739](https://github.com/PaddlePaddle/Paddle/pull/36739), [#36698](https://github.com/PaddlePaddle/Paddle/pull/36698) )
-
- - CINN 尚不支持控制流,添加遇控制流跳过的逻辑。([#40812](https://github.com/PaddlePaddle/Paddle/pull/40812))
-
-#### 其他
-
-- 模型量化
-
- - 升级量化存储格式,并统一动、静态图量化格式。([#41041](https://github.com/PaddlePaddle/Paddle/pull/41041))
-
- - 新增离线量化方法:EMD、Adaround。([#40421](https://github.com/PaddlePaddle/Paddle/pull/40421), [#38460](https://github.com/PaddlePaddle/Paddle/pull/38460))
-
- - 支持更多 op 适配模 op 量化。([#40083](https://github.com/PaddlePaddle/Paddle/pull/40083))
-
- - 支持控制流中的 OP 量化。([#37498](https://github.com/PaddlePaddle/Paddle/pull/37498))
-
- - 新增支持 matmul_v2 OP 的量化。([#36469](https://github.com/PaddlePaddle/Paddle/pull/36469))
-
- - 新增支持量化后的 matmul_v2 在 TensorRT 上的推理。([#36594](https://github.com/PaddlePaddle/Paddle/pull/36594))
-
-- 显存优化
-
- - 实现多 stream 安全 Allocator,支持在多 stream 异步计算场景下安全高效地使用显存。([#37290](https://github.com/PaddlePaddle/Paddle/pull/37290))
-
- - 新增运行时显存监控模块(`paddle.device.cuda.max_memory_allocated`、`paddle.device.cuda.max_memory_reserved`、`paddle.device.cuda.memory_allocated` 和 `paddle.device.cuda.memory_reserved`),支持高性能地实时统计显存数据。([#38657](https://github.com/PaddlePaddle/Paddle/pull/38657))
-
- - 实现 CPU-GPU 统一内存寻址(CUDA Managed Memory),支持在显存受限场景下训练超大模型。([#39075](https://github.com/PaddlePaddle/Paddle/pull/39075))
-
- - C++底层新增 GetBasePtr 接口,用来获取设备接口 CUDAMalloc 创建的设备地址。([#37978](https://github.com/PaddlePaddle/Paddle/pull/37978))
-
- - 减少 AutoGrowth Allocator 中 free blocks 的数量,提升显存分配性能。([#35732](https://github.com/PaddlePaddle/Paddle/pull/35732))
-
- - 对于 `initializer.Normal` 和 `initializer.Constant` 数据类型是 FP16 的 Tensor 去除多余的 float32 临时 Tensor 以及 cast,节省 2 倍显存。([#38818](https://github.com/PaddlePaddle/Paddle/pull/38818))
-
-- 动态图高阶导数组网测试
-
- - 为动态图增加三阶导数组网测试,以及 Broadcast 情况的测试。([#36814](https://github.com/PaddlePaddle/Paddle/pull/36814), [#37377](https://github.com/PaddlePaddle/Paddle/pull/37377))
-
-- 自定义 op:支持 ROCm(HIP) 平台进行自定义 op 注册。([#36771](https://github.com/PaddlePaddle/Paddle/pull/36771))
-
-- Cost Model:增加基于运行 Profile 的 Cost Model。([#35774](https://github.com/PaddlePaddle/Paddle/pull/35774))
-
-- 提供定制化层 (nn.Layer)的自动稀疏训练支持,让用户可根据自定义的 Prune 函数来对其设计的层进行稀疏剪枝。([#40253](https://github.com/PaddlePaddle/Paddle/pull/40253))
-
-- 新增字符串张量底层数据结构表示,使框架具备字符串张量表示和计算的能力。([#39830](https://github.com/PaddlePaddle/Paddle/pull/39830), [#40992](https://github.com/PaddlePaddle/Paddle/pull/40992))
-
-- 新增或者升级 oneDNN FP32/int8/bfloat16 Kernel,包括:
-
- - ELU ([#37149](https://github.com/PaddlePaddle/Paddle/pull/37149))
-
- - exp ([#38624](https://github.com/PaddlePaddle/Paddle/pull/38624))
-
- - stack ([#37002](https://github.com/PaddlePaddle/Paddle/pull/37002))
-
- - softplus ([#36382](https://github.com/PaddlePaddle/Paddle/pull/36382))
-
- - round ([#39653](https://github.com/PaddlePaddle/Paddle/pull/39653))
-
- - shape ([#36033](https://github.com/PaddlePaddle/Paddle/pull/36033))
-
- - flatten and flatten2 ([#35892](https://github.com/PaddlePaddle/Paddle/pull/35892))
-
- - slice ([#37630](https://github.com/PaddlePaddle/Paddle/pull/37630))
-
- - elementwise_mul ([#40546](https://github.com/PaddlePaddle/Paddle/pull/40546))
-
- - elementwise_add ([#38176](https://github.com/PaddlePaddle/Paddle/pull/38176))
-
- - elementwise_div ([#36158](https://github.com/PaddlePaddle/Paddle/pull/36158))
-
- - elementwise_sub ([#35662](https://github.com/PaddlePaddle/Paddle/pull/35662))
-
- - roi_align ([#37848](https://github.com/PaddlePaddle/Paddle/pull/37848))
-
- - nearest_interp and nearest_interp_v2 ([#37985](https://github.com/PaddlePaddle/Paddle/pull/37985),[#38622](https://github.com/PaddlePaddle/Paddle/pull/38622),[#39490](https://github.com/PaddlePaddle/Paddle/pull/39490))
-
- - assembly optimized Adam ([#39158](https://github.com/PaddlePaddle/Paddle/pull/39158))
-
- - logsoftmax ([#39793](https://github.com/PaddlePaddle/Paddle/pull/39793))
-
- - activation ([#40721](https://github.com/PaddlePaddle/Paddle/pull/40721))
-
- - mul ([#38552](https://github.com/PaddlePaddle/Paddle/pull/38552))
-
- - mean ([#37104](https://github.com/PaddlePaddle/Paddle/pull/37104))
-
- - relu ([#36265](https://github.com/PaddlePaddle/Paddle/pull/36265))
-
- - pool2d ([#37081](https://github.com/PaddlePaddle/Paddle/pull/37081))
-
- - concat ([#35889](https://github.com/PaddlePaddle/Paddle/pull/35889))
-
- - conv2d ([#38507](https://github.com/PaddlePaddle/Paddle/pull/38507),[#38938](https://github.com/PaddlePaddle/Paddle/pull/38938),[#36284](https://github.com/PaddlePaddle/Paddle/pull/36284))
-
- - LayerNorm ([#40418](https://github.com/PaddlePaddle/Paddle/pull/40418))
-
-- 增加基于 SSD-内存-GPU 显存 的 3 级存储图检索引擎,支持大规模图神经网络训练。([#42472](https://github.com/PaddlePaddle/Paddle/pull/42472), [#42321](https://github.com/PaddlePaddle/Paddle/pull/42321), [#42027](https://github.com/PaddlePaddle/Paddle/pull/42027))
-
-- 增加异构多云训练通信模块 switch,实现 Send/Recv 接口,支持多云异构通信。([#40965](https://github.com/PaddlePaddle/Paddle/pull/40965), [#40911](https://github.com/PaddlePaddle/Paddle/pull/40911))
-
-### (2)功能优化
-
-#### API
-
-- 为 `paddle.Model` 新增支持混合精度训练 O2 模式,即支持原来动/静态图的 Pure FP16 训练模式。([#36441](https://github.com/PaddlePaddle/Paddle/pull/36441))
-
-- 为 `paddle.nn.Layer` 支持 self chain 调用。([#36609](https://github.com/PaddlePaddle/Paddle/pull/36609))
-
-- 为 `paddle.nn.Layer`的`to`方法添加`is_distributed`属性的设置,保证网络参数转换前后分布式属性保持一致。([#36221](https://github.com/PaddlePaddle/Paddle/pull/36221))
-
-- 完善 `paddle.nn.Layer`的`to` 方法的参数转换逻辑,降低转换过程占用的峰值显存,提高转换成功率。([#36862](https://github.com/PaddlePaddle/Paddle/pull/36862))
-
-- 为 `paddle.incubate.graph_send_recv`支持设置输出 Tensor 的 shape,有利于减少实际计算过程的显存占用。([#40509](https://github.com/PaddlePaddle/Paddle/pull/40509))
-
-- 为 `paddle.incubate.segment_sum`、`segment_mean`、`segment_max`、`segment_min` 新增 int32、int64 数据类型支持。([#40577](https://github.com/PaddlePaddle/Paddle/pull/40577))
-
-- 为 transpose op 新增 bool 类型支持。([#35886](https://github.com/PaddlePaddle/Paddle/pull/35886))
-
-- 将 `paddle.mm` 底层算子从 matmul 切换到 matmul_v2。([#35770](https://github.com/PaddlePaddle/Paddle/pull/35770))
-
-- 为 `paddle.einsum` 支持静态图模式调用,支持未知 shape。([#40360](https://github.com/PaddlePaddle/Paddle/pull/40360))
-
-- 为 `paddle.nn.functional.margin_cross_entropy` 和 `paddle.nn.functional.class_center_sample` 支持数据并行。([#39852](https://github.com/PaddlePaddle/Paddle/pull/39852))
-
-- 为 `paddle.nn.functional.grid_sample`支持形状为[1]的输入。([#36183](https://github.com/PaddlePaddle/Paddle/pull/36183))
-
-- 为 `paddle.nn.PRelu` 支持 `NHWC` 数据格式。([#37019](https://github.com/PaddlePaddle/Paddle/pull/37019))
-
-- 为 `paddle.nn.functional.class_center_sample` 支持使用 `paddle.seed` 固定随机状态。([#38248](https://github.com/PaddlePaddle/Paddle/pull/38248))
-
-- 为 `paddle.fft` 下所有 API 新增 ROCM 后端支持,并优化 CUFFT 后端报错信息。([#36415](https://github.com/PaddlePaddle/Paddle/pull/36415), [#36114](https://github.com/PaddlePaddle/Paddle/pull/36114/files))
-
-- 为 `Tensor.getitem` 增加对切片部分维度为 0 的功能支持,即允许切片索引结果为空。([#37313](https://github.com/PaddlePaddle/Paddle/pull/37313))
-
-- 为 `Tensor.setitem` 支持 int 和 bool 类型 Tensor 使用 bool 索引。([#37761](https://github.com/PaddlePaddle/Paddle/pull/37761))
-
-- 为 `paddle.nn.functional.interpolate` 支持 nearest 模式时输入 shape 为 5D。([#38868](https://github.com/PaddlePaddle/Paddle/pull/38868))
-
-- 为 `paddle.nn.Embedding`、`paddle.gather` 增加 int16 支持。([#40964](https://github.com/PaddlePaddle/Paddle/pull/40964), [#40052](https://github.com/PaddlePaddle/Paddle/pull/40052))
-
-- 为 `paddle.distributed.spawn`添加 CPU 单机数据并行。([#35745](https://github.com/PaddlePaddle/Paddle/pull/35745), [#36758](https://github.com/PaddlePaddle/Paddle/pull/36758), [#36637](https://github.com/PaddlePaddle/Paddle/pull/36637))
-
-- 新增`depthwise_conv2d`MKLDNN 算子。([#38484](https://github.com/PaddlePaddle/Paddle/pull/38484))
-
-- 为`paddle.abs`、`paddle.transpose`、`paddle.squeeze`、`paddle.unsqueeze`、 `paddle.matmul`、`paddle.full` 静态图数据类型检测中增加复数类型。([#40113](https://github.com/PaddlePaddle/Paddle/pull/40113))
-
-- 为 `paddle.autograd.PyLayer` 支持 tuple/list 类型的参数。([#38146](https://github.com/PaddlePaddle/Paddle/pull/38146))
-
-- 为 `paddle.autograd.PyLayer` 增加 inplace 策略下对输入为叶子节点的 Tensor 的检查报错机制。([#37931](https://github.com/PaddlePaddle/Paddle/pull/37931))
-
-- 为 `paddle.autograd.PyLayer` 支持 HIP 库。([#38184](https://github.com/PaddlePaddle/Paddle/pull/38184))
-
-- 为 `paddle.take_along_axis`、`paddle.put_along_axis` 支持更多 size 的输入,允许 index 矩阵的 shape size 大于 arr 矩阵的 shape size。([#39072](https://github.com/PaddlePaddle/Paddle/pull/39072))
-
-- 优化 API `paddle.nn.Pad2D`在 replicate 为 0 时的报错信息。([#36510](https://github.com/PaddlePaddle/Paddle/pull/36510/files))
-
-- 支持 API `paddle.nn.Pad2D`在 tuple 格式的 pad 输入。([#35985](https://github.com/PaddlePaddle/Paddle/pull/35985/files))
-
-- 新增 `paddle.distributed.InMemoryDataset` 中 tdm_sample API 以支持 TDM 算法中的采样操作。([#37044](https://github.com/PaddlePaddle/Paddle/pull/37044))
-
-- 新增对于`paddle.jit.save`的 Pre-saving Hooks 机制。([#38186](https://github.com/PaddlePaddle/Paddle/pull/38186))
-
-- 新增高阶微分相关 API:
-
- - `elementwise_add` 增加三阶 Kernel,支持三阶微分的计算。([#36508](https://github.com/PaddlePaddle/Paddle/pull/36508), [#36618](https://github.com/PaddlePaddle/Paddle/pull/36618))
-
- - `matmul_v2` 增加三阶 Kernel,支持三阶微分的计算。([#36459](https://github.com/PaddlePaddle/Paddle/pull/36459))
-
- - `elementwise_mul` 增加三阶 Kernel,支持三阶微分的计算。([#37152](https://github.com/PaddlePaddle/Paddle/pull/37547))
-
-- 完善`paddle.amp.GradScaler`调用 check_finite_and_unscale op 的逻辑,消除该处创建 bool 变量所引入的 cudaMemcpy。([#37770](https://github.com/PaddlePaddle/Paddle/pull/37770))
-
-- 为 unstack 和 unique op 新增对元素个数为 0 的 Tensor 的检查。([#36021](https://github.com/PaddlePaddle/Paddle/pull/36021))
-
-- 新增支持昆仑芯 2 的多层、双向 LSTM 功能,完善 RNN 前反向 op,支持时序类模型训练使用。([#42076](https://github.com/PaddlePaddle/Paddle/pull/42076))
-
-- 新增支持昆仑芯 2 的 bce_loss 前反向 op。([#41610](https://github.com/PaddlePaddle/Paddle/pull/41610))
-
-- 添加 `paddle.linalg.det` 的反向实现。([#36013](https://github.com/PaddlePaddle/Paddle/pull/36013))
-
-#### IR(Intermediate Representation)
-
-- 动态图转静态图
-
- - 优化动转静下 `ProgramCache.last` 接口行为,使其返回最近使用的 Program,而非最后生成的 Program。([#39541](https://github.com/PaddlePaddle/Paddle/pull/39541))
-
- - 优化动转静下 `paddle.reshape` API 的报错信息,新增推荐用法提示。([#40599](https://github.com/PaddlePaddle/Paddle/pull/40599))
-
- - 优化动转静代码转写时 `is_api_in_module` 函数中异常捕获类型。([#40243](https://github.com/PaddlePaddle/Paddle/pull/40243))
-
- - 优化动转静模块报错提示,默认隐藏 warning 信息。([#39730](https://github.com/PaddlePaddle/Paddle/pull/39730))
-
- - 增加动转静对于 type hint 语法的支持,提高变量类型分析的准确性。([#39572](https://github.com/PaddlePaddle/Paddle/pull/39572))
-
- - 优化 `paddle.cond` 功能,允许 bool、int 等基本类型支持值相等。([#37888](https://github.com/PaddlePaddle/Paddle/pull/37888))
-
- - 优化动转静`@to_static` 装饰普通函数时,允许切换 train/eval 模式。([#37383](https://github.com/PaddlePaddle/Paddle/pull/37383))
-
- - 优化动转静报错栈,突出用户相关代码,减少框架冗余报错栈。([#36741](https://github.com/PaddlePaddle/Paddle/pull/36741))
-
- - 移除`paddle.cond` 返回值中 `no_value` 占位符。([#36513](https://github.com/PaddlePaddle/Paddle/pull/36513)、[#36826](https://github.com/PaddlePaddle/Paddle/pull/36826))
-
- - 为动转静 run_program op 适配新动态图模式。([#40198](https://github.com/PaddlePaddle/Paddle/pull/40198), [#40355](https://github.com/PaddlePaddle/Paddle/pull/40355))
-
- - 新增对于 zip 语法的检查。([#37846](https://github.com/PaddlePaddle/Paddle/pull/37846))
-
- - 修复 `paddle.signal.frame`、`paddle.signal.stft`、`paddle.signal.istft` 因维度和类型判断错误导致的动转静失败问题。([#40113](https://github.com/PaddlePaddle/Paddle/pull/40113))
-
- - 为 mean、pad3d ops 新增注册复数类型 Kernel。([#40113](https://github.com/PaddlePaddle/Paddle/pull/40113))
-
-#### 混合精度训练
-
-- 为 amp 添加 GPU Compute Capability 环境检查,对无法产生训练加速效果的 GPU 环境添加使用警告。([#38086](https://github.com/PaddlePaddle/Paddle/pull/38086))
-
-- 添加`paddle.amp.decorate`与`paddle.DataParallel`同时使用时调用顺序的检查。([#38785](https://github.com/PaddlePaddle/Paddle/pull/38785))
-
-#### 分布式训练
-
-- 分布式训练基础功能
-
- - 优化 Fleet API 和 DistributedStrategy 配置以使用动态图并行功能,提升动态图易用性。([#40408](https://github.com/PaddlePaddle/Paddle/pull/40408))
-
- - 优化动态图混合并行 HybridParallelClipGrad 策略,支持 4D 混合并行 + Pure FP16 训练。([#36237](https://github.com/PaddlePaddle/Paddle/pull/36237), [#36555](https://github.com/PaddlePaddle/Paddle/pull/36555))
-
- - 重构动态图数据并行策略,以支持新动态图和新通信库功能。([#40389](https://github.com/PaddlePaddle/Paddle/pull/40389), [#40593](https://github.com/PaddlePaddle/Paddle/pull/40593), [#40836](https://github.com/PaddlePaddle/Paddle/pull/40836), [#41119](https://github.com/PaddlePaddle/Paddle/pull/41119), [#41413](https://github.com/PaddlePaddle/Paddle/pull/41413), [#39987](https://github.com/PaddlePaddle/Paddle/pull/39987))
-
- - 为 fused_attention op 支持分布式张量模型并行。([#40101](https://github.com/PaddlePaddle/Paddle/pull/40101))
-
- - 为 fused_feedforward op 支持分布式张量模型并行。([#40160](https://github.com/PaddlePaddle/Paddle/pull/40160))
-
-- 图检索引擎
-
- - 优化图引擎的图采样接口返回的数据格式,采样速度提升 3 倍。([#37315](https://github.com/PaddlePaddle/Paddle/pull/37315))
-
- - 减少图引擎线程量以提升性能。([#37098](https://github.com/PaddlePaddle/Paddle/pull/37098))
-
- - 优化图引擎数据传输以提升性能。([#37341](https://github.com/PaddlePaddle/Paddle/pull/37341))
-
- - 利用模型中 embedding op 的拓扑关系,优化 embedding op 的合并逻辑以提升性能。([#35942](https://github.com/PaddlePaddle/Paddle/pull/35942))
-
-- 通信库:重构通信库,提升通信库的易扩展性和二次开发性,支持异构通信。([#41398](https://github.com/PaddlePaddle/Paddle/pull/41398), [#39720](https://github.com/PaddlePaddle/Paddle/pull/39720), [#40911](https://github.com/PaddlePaddle/Paddle/pull/40911), [#40579](https://github.com/PaddlePaddle/Paddle/pull/40579), [#40629](https://github.com/PaddlePaddle/Paddle/pull/40629), [#40437](https://github.com/PaddlePaddle/Paddle/pull/40437), [#40430](https://github.com/PaddlePaddle/Paddle/pull/40430), [#40228](https://github.com/PaddlePaddle/Paddle/pull/40228), [#40181](https://github.com/PaddlePaddle/Paddle/pull/40181), [#40100](https://github.com/PaddlePaddle/Paddle/pull/40100), [#40097](https://github.com/PaddlePaddle/Paddle/pull/40097), [#39892](https://github.com/PaddlePaddle/Paddle/pull/39892), [#39384](https://github.com/PaddlePaddle/Paddle/pull/39384), [#39737](https://github.com/PaddlePaddle/Paddle/pull/39737), [#40040](https://github.com/PaddlePaddle/Paddle/pull/40040))
-
-- 支持 `paddle.incubate.distributed.models.moe`中 MoE 相关接口(`moe.GShardGate`, `moe.BaseGate`, `moe.SwitchGate`, `moe.MoELayer`, `moe.ClipGradForMOEByGlobalNorm` )的公开。([#42300](https://github.com/PaddlePaddle/Paddle/pull/42300))
-
-- 修复 `paddle.incubate.distributed.models.moe.MoELayer` 中使用 recomputing 可能报错的问题。([#42128](https://github.com/PaddlePaddle/Paddle/pull/42128))
-
-- 修复新动态图流水线并行因为数据类型不同导致的报错。([#41937](https://github.com/PaddlePaddle/Paddle/pull/41937), [#42053](https://github.com/PaddlePaddle/Paddle/pull/42053))
-
-- 修复新动态图张量模型并行因为数据类型不同导致的报错。([#41960](https://github.com/PaddlePaddle/Paddle/pull/41960))
-
-#### 自定义算子
-
-- 增强 C++自定义算子机制对二阶反向算子编写功能,支持为二阶反向算子的梯度输入变量添加后缀作为输出使用。([#41781](https://github.com/PaddlePaddle/Paddle/pull/41781))
-
-- 移除 Tensor API 成员方法中对废弃的枚举类型 PlaceType 的使用,进行相应兼容处理,并添加 deprecated warning 提示。([#41882](https://github.com/PaddlePaddle/Paddle/pull/41882))
-
-- 为原 Tensor API 的一系列废弃接口,包括不完整构造函数、reshape、mutable_data、copy_to 方法添加 deprecated warning 提示。([#41882](https://github.com/PaddlePaddle/Paddle/pull/41882))
-
-#### 其他
-
-- 报错调试优化
-
- - 优化 cross_entropy op 对 `label` 的边界检查报错信息。([#40001](https://github.com/PaddlePaddle/Paddle/pull/40001))
-
- - 为动态图添加 op 执行时`infer_shape`和`compute`方法的 profile record,用于在 timeline 中展示其开销。([#39023](https://github.com/PaddlePaddle/Paddle/pull/39023))
-
- - 替换了 Windows 下容易出现未知异常的 `pybind::index_error` 报错提示。([#40538](https://github.com/PaddlePaddle/Paddle/pull/40538))
-
- - 添加用户 scatter op 越界检查的报错信息。([#37429](https://github.com/PaddlePaddle/Paddle/pull/37429))
-
-- 下载工具:针对`paddle.utils.download.get_path_from_url`中解压含多文件目录速度慢的问题,将原先循环遍历目录下文件逐一解压的方式替换为在目录上调用 extractall 一次解压的方式,解压速度大幅提升。([#37311](https://github.com/PaddlePaddle/Paddle/pull/37311))
-
-- 加速 `fake_quantize_range_abs_max`、`fake_quantize_abs_max`、`fake_quantize_dequantize_abs_max`、 `fake_quantize_moving_average_abs_max` 等量化训练。([#40491](https://github.com/PaddlePaddle/Paddle/pull/40491))
-
-### (3)性能优化
-
-#### 分布式训练
-
-- 混合并行优化器 sharding 支持 optimize_cast 优化,将前反向参数 cast 移到优化器阶段,性能提升 7%。([#35878](https://github.com/PaddlePaddle/Paddle/pull/35878))
-
-- GPUPS 优化:支持梯度 fuse allreduce 训练,训练提升 20%。([#35131](https://github.com/PaddlePaddle/Paddle/pull/35131))
-
-- GPUPS 优化:dump CPU 优化提速 3.21 倍。([#40068](https://github.com/PaddlePaddle/Paddle/pull/40068))
-
-- CPU 参数服务器流式训练优化:支持稀疏参数统计量自动统计、稀疏参数增量保存等功能,训练性能提升 20%。([#36465](https://github.com/PaddlePaddle/Paddle/pull/36465), [#36601](https://github.com/PaddlePaddle/Paddle/pull/36601), [#36734](https://github.com/PaddlePaddle/Paddle/pull/36734), [#36909](https://github.com/PaddlePaddle/Paddle/pull/36909), [#36943](https://github.com/PaddlePaddle/Paddle/pull/36943), [#37181](https://github.com/PaddlePaddle/Paddle/pull/37181), [#37194](https://github.com/PaddlePaddle/Paddle/pull/37194), [#37515](https://github.com/PaddlePaddle/Paddle/pull/37515), [#37626](https://github.com/PaddlePaddle/Paddle/pull/37626), [#37995](https://github.com/PaddlePaddle/Paddle/pull/37995), [#38582](https://github.com/PaddlePaddle/Paddle/pull/38582), [#39250](https://github.com/PaddlePaddle/Paddle/pull/39250), [#40762](https://github.com/PaddlePaddle/Paddle/pull/40762), [#41234](https://github.com/PaddlePaddle/Paddle/pull/41234), [#41320](https://github.com/PaddlePaddle/Paddle/pull/41320), [#41400](https://github.com/PaddlePaddle/Paddle/pull/41400))
-
-#### 算子优化
-
-- 优化 `FasterTokenizer` 性能,性能与优化前相比提升 10%。([#36701](https://github.com/PaddlePaddle/Paddle/pull/36701))
-
-- 优化 `index_select` 反向计算,性能较优化前有 3.7~25.2 倍提升。([#37055](https://github.com/PaddlePaddle/Paddle/pull/37055))
-
-- 优化 `paddle.nn.ClipGradByGlobalNorm` 的性能,以 10*10 的 `paddle.nn.Linear` 为例,性能与优化前相比提升 30%左右。([#38209](https://github.com/PaddlePaddle/Paddle/pull/38209))
-
-- 优化 `pnorm` 在 `axis` 维度极大或极小情况下的性能,前向速度提升 31~96 倍,反向速度提升 1.1~19 倍。([#37685](https://github.com/PaddlePaddle/Paddle/pull/37685), [#38215](https://github.com/PaddlePaddle/Paddle/pull/38215), [#39011](https://github.com/PaddlePaddle/Paddle/pull/39011))
-
-- 优化 `softmax` 前、反向性能,对于 `axis!=-1` 的配置加速比为 2 倍左右。([#38602](https://github.com/PaddlePaddle/Paddle/pull/38602), [#38609](https://github.com/PaddlePaddle/Paddle/pull/38609), [#32387](https://github.com/PaddlePaddle/Paddle/pull/32387), [#37927](https://github.com/PaddlePaddle/Paddle/pull/37927/files))
-
-- 优化 `log_softmax` 前、反向性能,对于 `axis!=-1`的配置加速比为 6~20 倍左右。([#38992](https://github.com/PaddlePaddle/Paddle/pull/38992), [#40612](https://github.com/PaddlePaddle/Paddle/pull/40612))
-
-- 优化 `softmax_with_cross_entropy` 前、反向性能,对于 `hard_label` 的配置加速比为 1.3 倍左右。([#39553](https://github.com/PaddlePaddle/Paddle/pull/39553), [#40424](https://github.com/PaddlePaddle/Paddle/pull/40424), [#40643](https://github.com/PaddlePaddle/Paddle/pull/40643))
-
-- 优化 `top_k` 性能,对于一维且 `k` 较大时(k=5000)的配置加速比为 22 倍以上。([#40941](https://github.com/PaddlePaddle/Paddle/pull/40941))
-
-- 优化 `elementwise_mul` 反向计算,较优化前有 1.85~12.16 倍性能提升。([#37728](https://github.com/PaddlePaddle/Paddle/pull/37728))
-
-- 优化 `elementwise_min` 反向和 `elementwise_max` 反向,较优化前打平或有 1.05~18.75 倍性能提升。([#38236](https://github.com/PaddlePaddle/Paddle/pull/38236), [#37906](https://github.com/PaddlePaddle/Paddle/pull/37906))
-
-- 优化 `nearest_interp` 前向和反向计算,前向较优化前性能有 1.5~2.3 倍提升;反向性能较优化前有 60%~1.8 倍提升。([#38528](https://github.com/PaddlePaddle/Paddle/pull/38528), [#39067](https://github.com/PaddlePaddle/Paddle/pull/39067))
-
-- 优化 `bilinear_interp` 前向和反向计算,前向较优化前性能有 0.4~2.3 倍提升;反向性能较优化前有 10%~30%提升。([#39243](https://github.com/PaddlePaddle/Paddle/pull/39243), [#39423](https://github.com/PaddlePaddle/Paddle/pull/39423))
-
-- 优化 `dropout` 前向和反向计算,性能提升约 20%。([#39795](https://github.com/PaddlePaddle/Paddle/pull/39795), [#38859](https://github.com/PaddlePaddle/Paddle/pull/38859), [#38279](https://github.com/PaddlePaddle/Paddle/pull/38279), [#40053](https://github.com/PaddlePaddle/Paddle/pull/40053))
-
-- 优化 `grid_sampler`前向和反向计算,前向较优化前性能有 10%~30%提升;反向性能较优化前有 10%~60%提升。([#39751](https://github.com/PaddlePaddle/Paddle/pull/39751))
-
-- 优化 `group_norm` 前向和反向计算,前向性能提升 1.04~2.35 倍,反向性能提升 1.12~1.18 倍。([#39944](https://github.com/PaddlePaddle/Paddle/pull/39944), [#40657](https://github.com/PaddlePaddle/Paddle/pull/40657), [#39596](https://github.com/PaddlePaddle/Paddle/pull/39596))
-
-- 优化 `conv1d` 前向和反向计算,前向性能提升 1.00~2.01 倍,反向性能提升 1.01~474.56 倍。([#38425](https://github.com/PaddlePaddle/Paddle/pull/38425))
-
-- 优化 `elementwise_div` 反向计算,反向性能提升 1.02~29.25 倍。([#38044](https://github.com/PaddlePaddle/Paddle/pull/38044))
-
-- 优化 `gelu` 前向和反向计算,前向性能提升 1.13~1.43 倍,反向性能提升 1.10~1.55 倍。([#38188](https://github.com/PaddlePaddle/Paddle/pull/38188), [#38263](https://github.com/PaddlePaddle/Paddle/pull/38263))
-
-- 优化 `elementwise_sub` 反向计算,反向性能提升 1.04~15.64 倍。([#37754](https://github.com/PaddlePaddle/Paddle/pull/37754))
-
-- 优化 `flip` 在输入一维数据时前向性能,性能提升 100%。([#37825](https://github.com/PaddlePaddle/Paddle/pull/37825))
-
-- 优化 `layer_norm` 前向和反向计算,前向较优化前提升 2-5 倍,反向较优化前提升 20%~50%。([#39167](https://github.com/PaddlePaddle/Paddle/pull/39167), [#39247](https://github.com/PaddlePaddle/Paddle/pull/39247))
-
-- 优化 `embedding` 前向和反向计算,前向较优化前最大提升 1.51 倍,反向较优化前提升 1.03~7.79 倍。([#39856](https://github.com/PaddlePaddle/Paddle/pull/39856), [#39886](https://github.com/PaddlePaddle/Paddle/pull/39886))
-
-- 优化 `gelu` FP16 前向和反向计算,前向较优化前提升 9%~12%,反向较优化前提升 2%~9%。([#38980](https://github.com/PaddlePaddle/Paddle/pull/38980))
-
-- 移除 `gather_nd`前反向算子中的 CPU -> GPU 显式数据传输操作,移除 `index_select` 前反向算子中的显式同步操作,将 `scatter_nd` 中的 GPU -> GPU 数据传输由同步操作改成异步操作。([#40933](https://github.com/PaddlePaddle/Paddle/pull/40933))
-
-- 优化 `Lars optimizer` 计算,优化后 Resnet50 FP16 模型训练性能较优化前提升 5.1%。([#35652](https://github.com/PaddlePaddle/Paddle/pull/35652), [#35476](https://github.com/PaddlePaddle/Paddle/pull/35476))
-
-- 优化 `AvgPool2dGrad` 计算,优化后性能较优化前提升 2.6 倍。([#35389](https://github.com/PaddlePaddle/Paddle/pull/35389))
-
-- 优化 `Elementwise` 类计算对于多元输出的功能支持,优化后计算性能较优化前提升最多可达 15%。([#38329](https://github.com/PaddlePaddle/Paddle/pull/38329), [#38410](https://github.com/PaddlePaddle/Paddle/pull/38410))
-
-- 优化 `Categorical`的 `probs`计算,简化计算逻辑,性能提升 4 ~ 5 倍。([#42178](https://github.com/PaddlePaddle/Paddle/pull/42178))
-
-- `paddle.sum` 性能优化,性能相比优化前提升约 20%。([#42309](https://github.com/PaddlePaddle/Paddle/pull/42309))
-
-#### 自动调优
-
-新增训练全流程硬件感知性能自动调优功能,在图像分类、分割、检测和图像生成任务上与模型默认参数配置下的性能相比提升约 3%~50%以上。通过 `paddle.incubate.autotune.set_config` API 设置自动调优状态,当前默认关闭。自动调优具体包括三个层次:
-
-- `paddle.io.DataLoader` 新增自动调优功能,根据训练数据和设备资源选择最佳的模型 num_workers。([#42004](https://github.com/PaddlePaddle/Paddle/pull/42004))
-
-- 新增混合精度训练数据布局自动调优功能,根据设备类型和数据类型选择最佳数据布局,并在运行时自动转换。([#41964](https://github.com/PaddlePaddle/Paddle/pull/41964))
-
-- 新增 Conv 运行时所需 workspace size 阈值自动调整功能,根据 GPU 当前可申请显存资源情况来自动设置;基于通用的 AlgorithmCache 设计和 Kernel 计时组件,新增 Conv cuDNN 算法自动选择功能,支持数据变长模型。([#41833](https://github.com/PaddlePaddle/Paddle/pull/41833))
-
-#### 调度优化
-
-- 移除 `paddle.nn.ClipGradByGlobalNorm` 中的 CudaStreamSync 隐藏操作,减少执行时的调度开销,在 ptb 模型上有 5%的性能提升。([#42170](https://github.com/PaddlePaddle/Paddle/pull/42170))
-
-- 优化一系列底层数据结构及原动态图执行体系中的细节实现,提升原动态图的调度性能。([#42010](https://github.com/PaddlePaddle/Paddle/pull/42010), [#42171](https://github.com/PaddlePaddle/Paddle/pull/42171), [#42224](https://github.com/PaddlePaddle/Paddle/pull/42224), [#42256](https://github.com/PaddlePaddle/Paddle/pull/42256), [#42306](https://github.com/PaddlePaddle/Paddle/pull/42306), [#42329](https://github.com/PaddlePaddle/Paddle/pull/42329), [#42340](https://github.com/PaddlePaddle/Paddle/pull/42340), [#42368](https://github.com/PaddlePaddle/Paddle/pull/42368), [#42425](https://github.com/PaddlePaddle/Paddle/pull/42425))
-
-- 简化 `paddle.distribution.Categorical`的 probs 计算逻辑,提升性能 4 到 5 倍。([#42178](https://github.com/PaddlePaddle/Paddle/pull/42178))
-
-### (4)问题修复
-
-#### API
-
-- 修复 `paddle.sum` 输入参数类型和输出参数类型不一致且 `axis` 轴对应的 reduce 元素个数为 1 时,输出类型错误问题。([#36123](https://github.com/PaddlePaddle/Paddle/pull/36123))
-
-- 修复 `paddle.flops` 在 layer 输出类型为 tuple 时的 `AttributeError`。([#38850](https://github.com/PaddlePaddle/Paddle/pull/38850))
-
-- 修复 `paddle.diag` 因为没有反向 Kernel 而无法传播梯度的问题。([#40447](https://github.com/PaddlePaddle/Paddle/pull/40447))
-
-- 修复 `paddle.sort` 输入存在 NaN 值排序错误。([#41070](https://github.com/PaddlePaddle/Paddle/pull/41070))
-
-- 修复 `paddle.full_like` 输入存在 Inf 值构建 Tensor 错误。([#40232](https://github.com/PaddlePaddle/Paddle/pull/40232))
-
-- 修复 `paddle.strided_slice` 在输入 starts 中数据小于 -rank 时,strided_slice 结果与 slice 不一致的 bug。([#39066](https://github.com/PaddlePaddle/Paddle/pull/39066))
-
-- 修复 `max_pool` 系列算子在返回 index 时 infer_shape 计算错误的问题,受影响的 API 有 `paddle.nn.functional.max_pool1d/2d/3d`, `paddle.nn.functional.adaptive_max_pool1d/2d/3d`, `paddle.nn.MaxPool1D/2D/3D`, `paddle.nn.AdaptiveMaxPool1D/2D/3D`。([#40139](https://github.com/PaddlePaddle/Paddle/pull/40139))
-
-- 修复 `max_pool` 系列算子返回的 pooling_mask 的 dtype 错误的问题,现在 pooling_mask 的 dtype 为 int32,受影响的 API 有 `paddle.nn.functional.max_pool1d/2d/3d`, `paddle.nn.functional.adaptive_max_pool1d/2d/3d`, `paddle.nn.MaxPool1D/2D/3D`, `paddle.nn.AdaptiveMaxPool1D/2D/3D`。([#39314](https://github.com/PaddlePaddle/Paddle/pull/39314))
-
-- 修复 `paddle.shape` 默认存在反向梯度导致计算错误的问题。([#37340](https://github.com/PaddlePaddle/Paddle/pull/37340))
-
-- 修复 `paddle.nn.Layer` 的 `to` 方法同时转换 dtype 和 place 存在的 bug。([#37007](https://github.com/PaddlePaddle/Paddle/pull/38007))
-
-- 修复 `paddle.amp.decorate` 无法对非叶子网络层的参数改写为 FP16 的 bug。([#38402](https://github.com/PaddlePaddle/Paddle/pull/38402))
-
-- 修复 `paddle.amp.decorate` 将 `paddle.nn.BatchNorm1D`、`paddle.nn.BatchNorm2D`、`paddle.nn.BatchNorm3D` 非输入参数改写为 FP16 的 bug。([#38541](https://github.com/PaddlePaddle/Paddle/pull/38541))
-
-- 修复 `paddle.amp.decorate` 将 `paddle.nn.SyncBatchNorm` 非输入参数改写为 FP16 的 bug。([#40943](https://github.com/PaddlePaddle/Paddle/pull/40943))
-
-- 修复 `paddle.nn.Layer.to` 当中多余的 warning。([#36700](https://github.com/PaddlePaddle/Paddle/pull/36700))
-
-- 修复 `paddle.nn.RNN` 在控制流下使用报错的问题。([#41162](https://github.com/PaddlePaddle/Paddle/pull/41162))
-
-- 修复 `paddle.to_tensor` 无法指定 Tensor 的 CUDA Place 的问题。([#39662](https://github.com/PaddlePaddle/Paddle/pull/39662))
-
-- 修复 `paddle.nn.Identity` 没有公开的问题。([#39615](https://github.com/PaddlePaddle/Paddle/pull/39615))
-
-- 修复动态图重构后,`fill_` 和 `zero_` inplace API 的输入在 CUDAPinned Place 上时,输出值不正确的 bug。([#41229](https://github.com/PaddlePaddle/Paddle/pull/41229))
-
-- 动态图重构后,修复使用 append op 的方式调用 assign op 导致输出 Tensor 的 inplace version 值不正确的 bug,修改为使用 `_C_ops` 的方式调用 assign op。([#41118](https://github.com/PaddlePaddle/Paddle/pull/41118))
-
-- 移除 `elementwise_add` 三阶 Kernel 中不合理的代码,修复组网过程未初始化问题。([#36618](https://github.com/PaddlePaddle/Paddle/pull/36618))
-
-- 修复 `conv2d` 执行 cuDNN Kernel 时属性缺失的问题。([#38827](https://github.com/PaddlePaddle/Paddle/pull/38827))
-
-- 修复 `multiclass_nms3` 输出 shape 不正确的问题。([#40059](https://github.com/PaddlePaddle/Paddle/pull/40059))
-
-- 修复 `yolo_box` 输出 shape 不正确的问题。([#40056](https://github.com/PaddlePaddle/Paddle/pull/40056))
-
-- 修复高阶微分 `gradients` 接口在指定 target_grad 时未按预期生效的问题。([#40940](https://github.com/PaddlePaddle/Paddle/pull/40940/))
-
-- 修复动态图 op`_BatchNormBase` 基类中修改了 default_dtype,导致后续组网参数类型错误的问题,受影响的 API 有 `paddle.nn.BatchNorm1D`,`paddle.nn.BatchNorm2D`,`paddle.nn.BatchNorm3D`,`paddle.nn.SyncBatchNorm`。具体原因是当 `get_default_dtype() == 'float16'` 时,通过 `set_default_dtype('float32')`修改默认参数数据类型,动态图组网的参数类型是通过 default_dtype 来创建的,因此当默认参数类型被修改后导致后续的组网参数类型错误。([#36376](https://github.com/PaddlePaddle/Paddle/pull/36376))
-
-- 修复 batchnorm op 中,当数据类型为 FP32,且数据维度 `dims = 2,data_layout = NHWC` 时,反向 op 内中间变量未定义问题。([#37020](https://github.com/PaddlePaddle/Paddle/pull/37020))
-
-- 修复静态图模式下,`paddle.static.nn.prelu` 对于 `NHWC` 输入格式且 `mode==channel` 权重的 shape 错误问题。([#38310](https://github.com/PaddlePaddle/Paddle/pull/38310))
-
-- 修复多机情况下,`paddle.nn.functional.class_center_sample` CUDA 种子设置 bug。([#38815](https://github.com/PaddlePaddle/Paddle/pull/38815))
-
-- 修复 `paddle.nn.functional.one_hot` 在输入不正确参数时,CUDA 版本无法正确报错的问题。([#41335](https://github.com/PaddlePaddle/Paddle/pull/41335))
-
-- 修复 DCU 设备上回收显存的 callback 未及时触发导致显存 OOM 的问题。([#40445](https://github.com/PaddlePaddle/Paddle/pull/40445))
-
-- 修复 `setitem` 索引赋值反向梯度传递异常以及动态图部分场景下 inplace 逻辑处理异常的问题。([#37023](https://github.com/PaddlePaddle/Paddle/pull/37023), [#38298](https://github.com/PaddlePaddle/Paddle/pull/38298))
-
-- 修复动转静下 Tensor array 使用 Slice 索引异常的问题。([#39251](https://github.com/PaddlePaddle/Paddle/pull/39251))
-
-- 修复 `paddle.Tensor.register_hook` 接口使用时临时变量未析构,从而导致内存或显存泄漏的问题。([#40716](https://github.com/PaddlePaddle/Paddle/pull/40716))
-
-- 修复 `Tensor.getitem` 当索引是全为 False 的 bool Tensor 时无法取值的问题。([#41297](https://github.com/PaddlePaddle/Paddle/pull/41297))
-
-- 修复 `Tensor.getitem` 当索引是 bool scalar Tensor 时无法取值的问题。([#40829](https://github.com/PaddlePaddle/Paddle/pull/40829))
-
-- 修复 `paddle.index_select` 在 index 为 0-shape Tensor 时报错的问题。([#41383](https://github.com/PaddlePaddle/Paddle/pull/41383))
-
-- 修复 `paddle.index_select`,`paddle.index_sample` 申请的 GPU 线程数超过有限机器资源时报错的问题。([#41127](https://github.com/PaddlePaddle/Paddle/pull/41127), [#37816](https://github.com/PaddlePaddle/Paddle/pull/37816), [#39736](https://github.com/PaddlePaddle/Paddle/pull/39736), [#41563](https://github.com/PaddlePaddle/Paddle/pull/41563))
-
-- 修复 ReduceConfig、elemwise_grad、gather、gather_nd、scatter ops 申请 GPU 线程数超过有限机器资源时报错的问题。([#40813](https://github.com/PaddlePaddle/Paddle/pull/40813), [#41127](https://github.com/PaddlePaddle/Paddle/pull/41127))
-
-- 修复 Kernel Primitive API 中 ReadData,ReadDataBc,ReadDataReduce 在 NX != 1 时访存越界的问题。([#36373](https://github.com/PaddlePaddle/Paddle/pull/36373))
-
-- 修复 IndexRandom 数据类型错误导致数据溢出计算结果异常的问题。([#39867](https://github.com/PaddlePaddle/Paddle/pull/39867), [#39891](https://github.com/PaddlePaddle/Paddle/pull/39891))
-
-- 修复 reduce op 在 reduce_num = 1 计算结果返回错误的问题。([#38771](https://github.com/PaddlePaddle/Paddle/pull/38771))
-
-- 修复 reduce op 在 HIP 环境下 reduce 中间维度出现访存越界的问题。([#41273](https://github.com/PaddlePaddle/Paddle/pull/41273))
-
-- 修复 matmul op 两个 FP16 一维向量计算时 Kernel 无法正常释放的问题。
-
-- 修复部分算子在 CUDA 上因整型计算溢出导致的问题,包括:bernoulli、gaussian_random、gumbel_softmax、multinomial、truncated_gaussian_random、uniform_random_inplace、uniform_random ops。([#37670](https://github.com/PaddlePaddle/Paddle/pull/37670))
-
-- Fix a KeyError raised by `paddle.nn.Sequential` when iterating over sublayers in a for loop. ([#39372](https://github.com/PaddlePaddle/Paddle/pull/39372))
-
-- Fix a wrong compile-time shape check in `paddle.nn.functional.unfold` in static graph mode. ([#38907](https://github.com/PaddlePaddle/Paddle/pull/38907), [#38819](https://github.com/PaddlePaddle/Paddle/pull/38819))
-
-- Fix an error raised when dropout is used in static graph mode with `axis` specified. ([#37223](https://github.com/PaddlePaddle/Paddle/pull/37223))
-
-- Migrate the matmul operator in `paddle.nn.MultiHeadAttention` to the matmul_v2 operator. ([#36222](https://github.com/PaddlePaddle/Paddle/pull/36222))
-
-- Fix an FPE (floating-point exception) thrown by `paddle.nn.functional.label_smooth` when the input is an empty Tensor. ([#35861](https://github.com/PaddlePaddle/Paddle/pull/35861))
-
-- Fix reshaping of empty Tensors in the reshape op; an empty Tensor can now be reshaped to [-1]. ([#36087](https://github.com/PaddlePaddle/Paddle/pull/36087))
-
-- Fix modified values spilling across rows in `fill_diagonal` when the offset parameter is non-zero. ([#36212](https://github.com/PaddlePaddle/Paddle/pull/36212))
-
-- Set stop gradient to True for the output of the range op in dynamic graph mode. ([#37486](https://github.com/PaddlePaddle/Paddle/pull/37486))
-
-- Fix a wrong update in the Lamb optimizer when Beta1Pow and Beta2Pow are on the GPU. ([#38518](https://github.com/PaddlePaddle/Paddle/pull/38518))
-
-- Fix the FLAGS_cudnn_deterministic setting not taking effect for the conv2d operator. ([#37173](https://github.com/PaddlePaddle/Paddle/pull/37173))
-
-- Fix an issue caused by early versions of cufft not defining CUFFT_VERSION. ([#37312](https://github.com/PaddlePaddle/Paddle/pull/37312))
-
-- Fix wrong results from `paddle.ifftshift` and `paddle.fftshift`. ([#36834](https://github.com/PaddlePaddle/Paddle/pull/36834), [#36748](https://github.com/PaddlePaddle/Paddle/pull/36748))
-
-- Fix the wrong `axis` computation in the `paddle.fft` family of APIs. ([#36321](https://github.com/PaddlePaddle/Paddle/pull/36321))
-
-- Fix an output data type registration bug in the batch_norm_grad op for FP16. The bug caused compilation failures in some scenarios and had some impact on FP16 precision. ([#42461](https://github.com/PaddlePaddle/Paddle/pull/42461))
-
-- Fix wrong InferShape information in the `paddle.nn.functional.pad` API when padding is a Tensor during dynamic-to-static conversion. ([#42414](https://github.com/PaddlePaddle/Paddle/pull/42414))
-
-- Fix an exception in `paddle.distribution.StickBreakingTransform` when the input has more than 2 dimensions. ([#41762](https://github.com/PaddlePaddle/Paddle/pull/41672))
-
-- Fix nan/inf produced by the QK^T computation in the fused_attention op. ([#42032](https://github.com/PaddlePaddle/Paddle/pull/42032))
-
-- Fix nan/inf produced by FusedResidualDropoutBias in the fused_attention op on V100. ([#42398](https://github.com/PaddlePaddle/Paddle/pull/42398))
-
-- Fix a redundant data transform introduced when the full_like op executes. ([#41973](https://github.com/PaddlePaddle/Paddle/pull/41973))
-
-- Fix the p_norm op producing nan on GPU. ([#41804](https://github.com/PaddlePaddle/Paddle/pull/41804))
-
-- Fix a segmentation fault in the split op when the sections parameter contains a size of 0. ([#41755](https://github.com/PaddlePaddle/Paddle/pull/41755))
-
-- Fix 6 elementwise ops (pow, complex, divide_double, multiply_double, fmax, fmin) reporting that Place(gpu:0) is not supported during multi-card training when broadcasting is required. ([#42332](https://github.com/PaddlePaddle/Paddle/pull/42332))
-
-- Fix a deprecated-interface warning on `import paddle` caused by a PIL version upgrade. ([#42307](https://github.com/PaddlePaddle/Paddle/pull/42307))
-
-- Fix `paddle.linalg.matrix_rank` not supporting tol as an FP64 Tensor in static graph mode. ([#42085](https://github.com/PaddlePaddle/Paddle/pull/42085))
-
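The reshape fix above (reshaping an empty Tensor to `[-1]`) comes down to how the `-1` dimension is inferred from the element count. The following is a minimal sketch in plain Python of one way such inference can be written so that a zero-element input works; `infer_reshape` is a hypothetical helper for illustration only and is not Paddle's implementation.

```python
from math import prod
from typing import List, Sequence


def infer_reshape(in_shape: Sequence[int], target: Sequence[int]) -> List[int]:
    """Resolve a single -1 in `target` from the element count of `in_shape`."""
    numel = prod(in_shape)  # 0 for an empty tensor such as shape (0, 3)
    unknown = [i for i, d in enumerate(target) if d == -1]
    if len(unknown) > 1:
        raise ValueError("at most one dimension may be -1")
    out = list(target)
    if unknown:
        # Product of the known target dims; prod of an empty sequence is 1.
        known = prod(d for d in target if d != -1)
        if known == 0:
            raise ValueError("cannot infer -1 when another target dim is 0")
        if numel % known != 0:
            raise ValueError("element count is not divisible by known dims")
        out[unknown[0]] = numel // known  # 0 // 1 == 0 for an empty input
    elif prod(out) != numel:
        raise ValueError("target shape does not match element count")
    return out


if __name__ == "__main__":
    print(infer_reshape([0, 3], [-1]))     # empty input flattened
    print(infer_reshape([2, 3], [-1, 2]))  # ordinary case
```

The only special handling needed for the empty case in this sketch is refusing to divide by a zero product of known dimensions; flattening to `[-1]` itself needs no extra branch.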
-#### IR (Intermediate Representation)
-
-- Dynamic graph to static graph
-
-  - Fix a type inference error during backward gradient accumulation when `tensor_array` is used with control flow. ([#39585](https://github.com/PaddlePaddle/Paddle/pull/39585), [#39689](https://github.com/PaddlePaddle/Paddle/pull/39689))
-
-  - Fix parameter gradient types not being set correctly during dynamic-to-static AMP training. ([#40938](https://github.com/PaddlePaddle/Paddle/pull/40938))
-
-  - Fix a parsing error in dynamic-to-static conversion when the code contains misplaced comments. ([#39035](https://github.com/PaddlePaddle/Paddle/pull/39035), [#38003](https://github.com/PaddlePaddle/Paddle/pull/38003))
-
-  - Fix Tensors not being correctly converted to Variables when non-forward functions are called in dynamic-to-static code. ([#37296](https://github.com/PaddlePaddle/Paddle/pull/37296), [#38540](https://github.com/PaddlePaddle/Paddle/pull/38540))
-
-  - Fix `paddle` being wrongly passed as a variable when dynamic-to-static code is transcribed. ([#37999](https://github.com/PaddlePaddle/Paddle/pull/37999))
-
-  - Fix wrong model parameter statistics when `paddle.flops` is called on a model after dynamic-to-static conversion. ([#36852](https://github.com/PaddlePaddle/Paddle/pull/36852))
-
-  - Fix GPU memory growing continuously in train mode and in a no_grad context after a model is loaded through the `paddle.jit.save/load` interface. ([#36434](https://github.com/PaddlePaddle/Paddle/pull/36434))
-
-  - Add a warning when convert_call converts a generator function. ([#35369](https://github.com/PaddlePaddle/Paddle/pull/35369))
-
-  - Fix a dependency analysis issue in the run_program op. ([#38470](https://github.com/PaddlePaddle/Paddle/pull/38470))
-
-  - Fix a code conversion issue when a For control flow returns a single value. ([#40683](https://github.com/PaddlePaddle/Paddle/pull/40683))
-
-  - Fix an error when generating backward ops for a cond control flow whose input contains a LoDTensorArray. ([#39585](https://github.com/PaddlePaddle/Paddle/pull/39585))
-
-  - Fix `paddle.jit.save` losing the forward_pre_hook and forward_post_hook of the top-level Layer when exporting a dynamic-to-static model. ([#42273](https://github.com/PaddlePaddle/Paddle/pull/42273))
-
-  - Fix a conversion error in dynamic-to-static mode when the shape parameter of `paddle.expand` contains a Tensor. ([#41973](https://github.com/PaddlePaddle/Paddle/pull/41973))
-
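-#### Distributed training
-
-- Basic functions of distributed training
-
-  - Fix a port error during multi-machine distributed training. ([#37274](https://github.com/PaddlePaddle/Paddle/pull/37274))
-
-  - Fix a brpc compilation dependency issue. ([#37064](https://github.com/PaddlePaddle/Paddle/pull/37064))
-
-  - Fix ports being occupied because of tcp self-connections when Fleet starts. ([#38174](https://github.com/PaddlePaddle/Paddle/pull/38174))
-
-  - Fix accuracy degradation under data parallelism caused by FP16 parameters being initialized inconsistently across cards. ([#38838](https://github.com/PaddlePaddle/Paddle/pull/38838), [#38563](https://github.com/PaddlePaddle/Paddle/pull/38563), [#38405](https://github.com/PaddlePaddle/Paddle/pull/38405))
-
-  - Fix accuracy degradation under data parallelism caused by FP16 gradients not being divided by the number of cards during synchronization. ([#38378](https://github.com/PaddlePaddle/Paddle/pull/38378))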
-
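-- Dynamic graph hybrid parallelism
-
-  - Fix parameters not being updated in FP16 mode under hybrid parallelism by using the new update interface. ([#36017](https://github.com/PaddlePaddle/Paddle/pull/36017))
-
-- Static graph hybrid parallelism
-
-  - Fix the incompatibility between grad merge and ClipGradientByGlobalNorm in distributed dp mode. ([#36334](https://github.com/PaddlePaddle/Paddle/pull/36334))
-
-  - Fix non-distributed parameters of tensor model parallelism not being broadcast during initialization under hybrid parallelism, which left them inconsistent across cards. ([#36186](https://github.com/PaddlePaddle/Paddle/pull/36186))
-
-  - Fix the save_persistables interface of sharding not saving FP16 parameters and offload persistent variables when offload is enabled. ([#40477](https://github.com/PaddlePaddle/Paddle/pull/40477))
-
-  - Fix ema parameters failing to be saved on cards other than card 0 when sharding training is enabled. ([#39860](https://github.com/PaddlePaddle/Paddle/pull/39860))
-
-  - Fix wrong gradient computation when FC is split by column. ([#38724](https://github.com/PaddlePaddle/Paddle/pull/38724))
-
-  - Fix an error when DistributedStrategy is set to without_graph_optimizer and used together with rnn. ([#36176](https://github.com/PaddlePaddle/Paddle/pull/36176))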
-
-- GPUPS parameter server training
-
-  - Fix the GPUPS macro definition triggering compilation of the CPU branch. ([#37248](https://github.com/PaddlePaddle/Paddle/pull/37248))
-
-  - Fix an occasional error in GPUPS pipeline training when saving delta and pullsparse run concurrently. ([#37233](https://github.com/PaddlePaddle/Paddle/pull/37233))
-
-  - Fix download errors caused by HDFSClient directory queries not returning full paths. ([#36590](https://github.com/PaddlePaddle/Paddle/pull/36590))
-
-  - Fix old parameters being pulled during GPUPS pipeline training. ([#36512](https://github.com/PaddlePaddle/Paddle/pull/36512))
-
-  - Fix a GPUPS multi-stream allocation issue. ([#37476](https://github.com/PaddlePaddle/Paddle/pull/37476))
-
-  - Fix a core dump in the GPUPS pybind. ([#37287](https://github.com/PaddlePaddle/Paddle/pull/37287))
-
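One of the data-parallel fixes listed above concerns FP16 gradients that were synchronized across cards without being divided by the number of cards. The sketch below shows, in plain Python with no Paddle dependency, why that division matters: an all-reduce sum followed by division by the world size yields the mean gradient, while omitting the division scales the update by the number of cards. The function names and the list-based "all-reduce" are hypothetical stand-ins for illustration, not Paddle APIs.

```python
from typing import List


def all_reduce_sum(per_card_grads: List[List[float]]) -> List[float]:
    """Element-wise sum of gradients across cards (stand-in for an all-reduce)."""
    return [sum(values) for values in zip(*per_card_grads)]


def sync_gradients(per_card_grads: List[List[float]], average: bool = True) -> List[float]:
    """Synchronize gradients; divide by the card count when `average` is True."""
    world_size = len(per_card_grads)
    summed = all_reduce_sum(per_card_grads)
    if not average:
        # Mirrors the behaviour described in the fix: the sum is never rescaled.
        return summed
    return [g / world_size for g in summed]


if __name__ == "__main__":
    grads = [[0.5, -1.0], [1.5, 3.0]]  # two cards, two parameters each
    print(sync_gradients(grads))                 # mean gradient per parameter
    print(sync_gradients(grads, average=False))  # unscaled sum per parameter
```

With N cards, the unscaled variant applies an update N times larger than the averaged one, which is consistent with the accuracy degradation the note describes.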
-#### Others
-
-- Fix the clip_extra issue when saving models in dynamic graph quantization training. ([#38323](https://github.com/PaddlePaddle/Paddle/pull/38323))
-
-- Fix the abs_max scale initialization issue in dynamic graph quantization training. ([#39307](https://github.com/PaddlePaddle/Paddle/pull/39307))
-
-- Fix abnormal nodes in models saved from dynamic graph quantization training. ([#38102](https://github.com/PaddlePaddle/Paddle/pull/38102), [#38012](https://github.com/PaddlePaddle/Paddle/pull/38012))
-
-- Fix wrong output of the flatten op in offline quantization. ([#37722](https://github.com/PaddlePaddle/Paddle/pull/37722))
-
-- Fix a dimension mismatch when dequantizing the matmul op. ([#36982](https://github.com/PaddlePaddle/Paddle/pull/36982))
-
-- Fix quantization ops being wrongly added when quantizing a matmul_v2 without weights. ([#36593](https://github.com/PaddlePaddle/Paddle/pull/36593))
-
-- Fix the quant_axis attribute being saved incorrectly for channel-wise quantization of the conv op when saving models. ([#39054](https://github.com/PaddlePaddle/Paddle/pull/39054))
-
-- Fix slow ChannelWise quantization training. ([#40772](https://github.com/PaddlePaddle/Paddle/pull/40772))
-
-- Fix NaN produced by Tensors initialized to 0 in quantization training. ([#36762](https://github.com/PaddlePaddle/Paddle/pull/36762))
-
-- Fix wrong amp_level settings for mixed precision in multi-threaded scenarios. ([#39198](https://github.com/PaddlePaddle/Paddle/pull/39198))
-
-- Fix mixed precision not being set correctly inside PyLayer and Recompute when mixed precision training is used together with them. ([#39950](https://github.com/PaddlePaddle/Paddle/pull/39950), [#40042](https://github.com/PaddlePaddle/Paddle/pull/40042))
-
-- Fix `D_GLIBCXX_USE_CXX11_ABI` not taking effect when compiling custom operators on Mac. ([#37878](https://github.com/PaddlePaddle/Paddle/pull/37878))
-
-- Fix inconsistent dynamic and static behavior of initializer-related APIs when block=None. ([#37827](https://github.com/PaddlePaddle/Paddle/pull/37827))
-
-- Fix the missing fluid module in a python3.6 environment. ([#35862](https://github.com/PaddlePaddle/Paddle/pull/35862))
-
-- Fix the optimizer `paddle.optimizer.AdamW` wrongly calling the adam op. ([#36028](https://github.com/PaddlePaddle/Paddle/pull/36028))
-
-- Fix a logic error in the `paddle.optimizer.Momentum` optimizer under the multi tensor strategy when the `regularizer` attribute of a parameter is None. ([#38344](https://github.com/PaddlePaddle/Paddle/pull/38344))
-
-- Fix the `paddle.optimizer.Momentum` and `paddle.optimizer.Adam` optimizers modifying the `multi_precision` attribute under the multi tensor strategy. ([#38991](https://github.com/PaddlePaddle/Paddle/pull/38991))
-
-- Fix a compilation error when the final-state API amp is used with optional Tensors. ([#40980](https://github.com/PaddlePaddle/Paddle/pull/40980))
-
-- Fix an error when the paddle+lite+xpu inference library calls lite CPU inference, and fix a compilation error for paddle+lite (without NNAdapter). ([#37449](https://github.com/PaddlePaddle/Paddle/pull/37449))
-
-- Fix a crash of LoDTensorArray in Debug builds caused by inconsistent Pybind11 bindings. ([#37954](https://github.com/PaddlePaddle/Paddle/pull/37954))
-
-- Fix Tensors not being constructed correctly in the corner case where the shape parameter is a list mixing Tensors and ints. ([#38284](https://github.com/PaddlePaddle/Paddle/pull/38284))
-
-- Fix a compatibility issue in the `paddle.optimizer.AdamW` API. ([#37905](https://github.com/PaddlePaddle/Paddle/pull/37905))
-
-- Fix the wrong return value of extra_repr in _InstanceNormBase. ([#38537](https://github.com/PaddlePaddle/Paddle/pull/38537))
-
-- Fix the missing symbol `paddle::distributed::TensorTable` in Paddle Inference built jointly with -DWITH_DISTRIBUTED enabled. ([#41128](https://github.com/PaddlePaddle/Paddle/pull/41128))
-
-- Add a shape check to the matmul_v2 op that reports an error when the shape contains a 0 value. ([#35791](https://github.com/PaddlePaddle/Paddle/pull/35791))
-
-- Fix dynamic graph recompute repeatedly printing a message for inputs without gradients; the message is now printed only once as a warning. ([#38293](https://github.com/PaddlePaddle/Paddle/pull/38293))
-
-- Fix the gelu op causing low accuracy on the validation set in the late stage of training vision models. ([#38450](https://github.com/PaddlePaddle/Paddle/pull/38450))
-
-- Fix a numerical error in the adamw op. ([#37746](https://github.com/PaddlePaddle/Paddle/pull/37746))
-
-- Add the MasterParam and MasterParamOut parameters to the sparse_momentum `_C_ops` interface. ([#39969](https://github.com/PaddlePaddle/Paddle/pull/39969))
-
-- Fix the missing `distributed` module in a python3.6 environment. ([#35848](https://github.com/PaddlePaddle/Paddle/pull/35848))
-
-- Fix the data initialization issue in the eigh unit test. ([#39568](https://github.com/PaddlePaddle/Paddle/pull/39568))
-
-- Fix the data initialization issue in the eigvalsh unit test. ([#39841](https://github.com/PaddlePaddle/Paddle/pull/39841))
-
-- Fix the segment op failing to run on V100 because it used too many registers. ([#38113](https://github.com/PaddlePaddle/Paddle/pull/38113))
-
-- Fix the wrong sparsity dimension in conv-related operators. ([#36054](https://github.com/PaddlePaddle/Paddle/pull/36054))
-
-- Alias the static graph functionality of Automatic SParsity (automatic sparse training) to `paddle.static.sparsity`. ([#36525](https://github.com/PaddlePaddle/Paddle/pull/36525))
-
-- Fix the divide op returning an integer result for integer division. ([#40890](https://github.com/PaddlePaddle/Paddle/pull/40890))
-
-- Fix a crash in `paddle.multiplex` when the size of the candidate Tensors is 0. ([#34972](https://github.com/PaddlePaddle/Paddle/pull/34972))
-
-- Fix abnormal speed of `paddle.kl_div` when the `reduction` parameter is given. ([#37283](https://github.com/PaddlePaddle/Paddle/pull/37283))
-
-- Fix the unordered data source when loading the Cifar dataset. ([#37272](https://github.com/PaddlePaddle/Paddle/pull/37272))
-
-- Fix the conversion of loss from uint16 to float in the ProgressBar class. ([#39231](https://github.com/PaddlePaddle/Paddle/pull/39231))
-
-- Fix a data type issue when sharing buffers with ShareBufferWith. ([#37464](https://github.com/PaddlePaddle/Paddle/pull/37464), [#37247](https://github.com/PaddlePaddle/Paddle/pull/37247))
-
-- Fix a performance issue in `paddle.io.DataLoader` when an IterableDataset is used with num_workers>0. ([#40541](https://github.com/PaddlePaddle/Paddle/pull/40541))
-
-- Fix incomplete return values of `paddle.vision.ops.yolo_loss` in dynamic graph mode. ([#40185](https://github.com/PaddlePaddle/Paddle/pull/40185))
-
-- Remove the restriction that the dataset argument of `paddle.io.BatchSampler` must be of type `paddle.io.Dataset`, broadening support for user-defined datasets. ([#40184](https://github.com/PaddlePaddle/Paddle/pull/40184))
-
-- Fix `paddle.summary` reporting that op_flops does not exist. ([#36489](https://github.com/PaddlePaddle/Paddle/pull/36489))
-
-- Fix the wrong formula in the lars_momentum op when lars_weight_decay=0. ([#40892](https://github.com/PaddlePaddle/Paddle/pull/40892))
-
-- Fix optimizer-offload being unable to save persistable vars. ([#36433](https://github.com/PaddlePaddle/Paddle/pull/36433))
-
-- Fix optimizer-offload not supporting the adamw op type. ([#36432](https://github.com/PaddlePaddle/Paddle/pull/36432))
-
-- Fix thread-unsafe access to enable_program_desc_tracing_ in Tracer in multi-threaded scenarios. ([#39776](https://github.com/PaddlePaddle/Paddle/pull/39776))
-
-- Fix the model file size not being initialized when a model is loaded. ([#40518](https://github.com/PaddlePaddle/Paddle/pull/40518))
-
-- Fix a logic bug in the Expand op: when the input Tensor X has fewer dimensions than the target shape, the resulting Out.Shape could be wrong. ([#38677](https://github.com/PaddlePaddle/Paddle/pull/38677))
-
-- Fix a dynamic-to-static error in the Expand_As op when only y.shape is taken and no Y variable is passed in. ([#38677](https://github.com/PaddlePaddle/Paddle/pull/38677))
-
-- Fix a logic error in the Expand_As op when computing the output shape. ([#38677](https://github.com/PaddlePaddle/Paddle/pull/38677))
-
-
-- Fix an error when getting the `lod_level` attribute of variables of type `core.VarDesc.VarType.STRINGS`, and set their `lod_level` to None. ([#39077](https://github.com/PaddlePaddle/Paddle/pull/39077))
-
-- Fix the framework feature `PyLayer` not supporting different dtypes. ([#37974](https://github.com/PaddlePaddle/Paddle/pull/37974))
-
-- Fix a division-by-zero issue in the learning rate decay API `paddle.optimizer.lr.PolynomialDecay`. ([#38782](https://github.com/PaddlePaddle/Paddle/pull/38782))
-
-- Fix some logs still remaining after the DisableGlogInfo() interface is called. ([#36356](https://github.com/PaddlePaddle/Paddle/pull/36356))
-
-- Fix a backward computation error in multi-layer RNNs (with dropout set to 0) when training the SimpleRNN, GRU, and LSTM APIs on CPU. ([#37080](https://github.com/PaddlePaddle/Paddle/pull/37080))
-
-- Add a cache for fft on the cufft and hipfft backends. ([#36646](https://github.com/PaddlePaddle/Paddle/pull/36646))
-
-- Allow the shifts parameter of `paddle.roll` to accept a Tensor. ([#36727](https://github.com/PaddlePaddle/Paddle/pull/36727))
-
-- Add onemkl as an optional computation backend for fft. ([#36414](https://github.com/PaddlePaddle/Paddle/pull/36414))
-
-- Fix precision issues of the matmul_v2 and elementwise_div ops under the bfloat16 type. ([#42479](https://github.com/PaddlePaddle/Paddle/pull/42479))
-
-- Fix a possible error in the next step caused by GPU memory reclamation clearing only the internal Tensors of a LoDTensorArray without emptying the Array itself. ([#42398](https://github.com/PaddlePaddle/Paddle/pull/42398))
-
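The `PolynomialDecay` item above mentions a division-by-zero fix without saying which input triggered it. As an illustration only, here is a minimal plain-Python sketch of a polynomial learning-rate decay formula with guards against two ways the divisor can become zero. The function name, the guards, and the default values are assumptions made for this sketch; it is not the code from the referenced pull request.

```python
import math


def polynomial_decay(lr: float, step: int, decay_steps: int,
                     end_lr: float = 0.0001, power: float = 1.0,
                     cycle: bool = False) -> float:
    """Polynomial decay from `lr` towards `end_lr` over `decay_steps` steps."""
    if decay_steps <= 0:
        # Guard 1 (assumed): a non-positive horizon would zero the divisor.
        raise ValueError("decay_steps must be positive")
    if cycle:
        # Guard 2 (assumed): ceil(0 / decay_steps) is 0 at step 0, which would
        # zero the divisor, so at least one period is always used.
        periods = max(1, math.ceil(step / decay_steps))
        total = decay_steps * periods
    else:
        total = decay_steps
        step = min(step, decay_steps)  # hold at end_lr after the horizon
    return (lr - end_lr) * (1 - step / total) ** power + end_lr


if __name__ == "__main__":
    for s in (0, 5, 10, 15):
        print(s, polynomial_decay(0.1, s, 10, end_lr=0.0, cycle=True))
```

In cycle mode the horizon is stretched to the next multiple of `decay_steps`, so the rate restarts its descent from a lower point instead of staying at `end_lr`.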
-## 4. Deployment (Paddle Inference)
-
-### (1) New features
-
-#### New APIs
-
-- Add a Java API so that Java developers can run high-performance inference on servers and in the cloud through a simple and flexible interface. ([#37162](https://github.com/PaddlePaddle/Paddle/pull/37162))
-
-- Add the `GetTrtCompileVersion` and `GetTrtRuntimeVersion` interfaces for getting TensorRT version information. ([#36429](https://github.com/PaddlePaddle/Paddle/pull/36429))
-
-- Add the `ShareExternalData` interface to avoid memory copies of input data during inference. ([#39809](https://github.com/PaddlePaddle/Paddle/pull/39809))
-
-#### New functionality
-
-- Add ONNX Runtime backend support. The currently integrated version supports CPU only. ([#39988](https://github.com/PaddlePaddle/Paddle/pull/39988), [#40561](https://github.com/PaddlePaddle/Paddle/pull/40561))
-
-- Add Ascend 310 inference support based on the Paddle Lite subgraph approach. ([#35226](https://github.com/PaddlePaddle/Paddle/pull/35226))
-
-- Add native GPU FP16 inference. ([#40531](https://github.com/PaddlePaddle/Paddle/pull/40531))
-
-- Add model dumping to the switch_ir_debug interface. ([#36581](https://github.com/PaddlePaddle/Paddle/pull/36581))
-
-- Add a TensorRT config interface, `void UpdateConfigInterleaved(paddle_infer::Config* c, bool with_interleaved)`, for the special data layout used in int8 quantized inference. ([#38884](https://github.com/PaddlePaddle/Paddle/pull/38884))
-
-- Add TensorRT inspector output to the log. This takes effect only with TensorRT 8.2 or later. ([#38362](https://github.com/PaddlePaddle/Paddle/pull/38362), [#38200](https://github.com/PaddlePaddle/Paddle/pull/38200))
-
-- Add TensorRT ASP sparse inference support. ([#36413](https://github.com/PaddlePaddle/Paddle/pull/36413))
-
-### (2) Underlying optimization
-
-#### CPU performance optimization
-
-- Optimize the caching mechanism of MKLDNN. ([#38336](https://github.com/PaddlePaddle/Paddle/pull/38336), [#36980](https://github.com/PaddlePaddle/Paddle/pull/36980), [#36695](https://github.com/PaddlePaddle/Paddle/pull/36695))
-
-- Add the matmul_scale_fuse pass. ([#37962](https://github.com/PaddlePaddle/Paddle/pull/37962))
-
-- Add MKLDNN reshape_transpose_matmul_v2_mkldnn_fuse_pass. ([#37847](https://github.com/PaddlePaddle/Paddle/pull/37847), [#40948](https://github.com/PaddlePaddle/Paddle/pull/40948))
-
-- Add MKLDNN conv_hard_sigmoid_mkldnn_fuse_pass. ([#36869](https://github.com/PaddlePaddle/Paddle/pull/36869))
-
-- Add MKLDNN matmul_v2_transpose_reshape_fuse_pass. ([#36481](https://github.com/PaddlePaddle/Paddle/pull/36481))
-
-- Add MKLDNN softplus_activation_mkldnn_fuse_pass. ([#36657](https://github.com/PaddlePaddle/Paddle/pull/36657))
-
-- Add MKLDNN elt_act_mkldnn_fuse_pass. ([#36541](https://github.com/PaddlePaddle/Paddle/pull/36541))
-
-- Add the MKLDNN mish operator and conv_mish_mkldnn_fuse_pass. ([#38623](https://github.com/PaddlePaddle/Paddle/pull/38623))
-
-#### GPU performance optimization
-
-- Change the default GPU memory allocation strategy for inference from `naive_best_fit` to `auto_growth`, which resolves the issue of some models filling up GPU memory. ([#41491](https://github.com/PaddlePaddle/Paddle/pull/41491))
-
-- Support TensorRT inference for the gelu and FC+gelu ops. ([#38399](https://github.com/PaddlePaddle/Paddle/pull/38399))
-
-- Support TensorRT inference for `deformable_conv` with static shapes. ([#36612](https://github.com/PaddlePaddle/Paddle/pull/36612), [#36850](https://github.com/PaddlePaddle/Paddle/pull/36850), [#37345](https://github.com/PaddlePaddle/Paddle/pull/37345))
-
-- Support TensorRT inference for the nearest_interp_v2 op. ([#34126](https://github.com/PaddlePaddle/Paddle/pull/34126))
-
-- Add a `yolo_box` TensorRT plugin that supports the input parameters `iou_aware` and `iou_aware_factor`, so that the IoU computed during inference is used as a factor of the confidence. ([#34128](https://github.com/PaddlePaddle/Paddle/pull/34128))
-
-- Support TensorRT inference for `elementwise_sub` and `elementwise_div`. ([#40806](https://github.com/PaddlePaddle/Paddle/pull/40806), [#41253](https://github.com/PaddlePaddle/Paddle/pull/41253))
-
-- Support TensorRT inference for `multiclass_nms3`. ([#41181](https://github.com/PaddlePaddle/Paddle/pull/41181), [#41344](https://github.com/PaddlePaddle/Paddle/pull/41344))
-
-- Support TensorRT inference for the flatten_contiguous_range op. ([#38922](https://github.com/PaddlePaddle/Paddle/pull/38922))
-
-- Support TensorRT inference for `pool2d` when the `padding` attribute has 4 dimensions and when `global_pooling` and `ceil_mode` are True. ([#39545](https://github.com/PaddlePaddle/Paddle/pull/39545))
-
-- Support TensorRT inference for batch_norm and elementwise_add with 5-dimensional inputs. ([#36446](https://github.com/PaddlePaddle/Paddle/pull/36446))
-
-- Add TensorRT inference for pool3d. ([#36545](https://github.com/PaddlePaddle/Paddle/pull/36545), [#36783](https://github.com/PaddlePaddle/Paddle/pull/36783))
-
-- Add TensorRT inference for `reduce` with int32 and float types, and register int32 and int64 for the `reduce_mean` GPU operator. ([#39088](https://github.com/PaddlePaddle/Paddle/pull/39088))
-
-- Modify the MatmulV2ToMul pass: revise its qualifying conditions (broadcast is not supported) and the op_teller mapping conditions. ([#36652](https://github.com/PaddlePaddle/Paddle/pull/36652))
-
-- Add support for the TensorRT plugin interface AddPluginV2IOExt. ([#36493](https://github.com/PaddlePaddle/Paddle/pull/36493))
-
-- Add the aligned attribute to the roi_align op and support TensorRT inference for it. ([#38905](https://github.com/PaddlePaddle/Paddle/pull/38905))
-
-- Support TensorRT inference for concat when the attribute `axis = -1`. ([#39096](https://github.com/PaddlePaddle/Paddle/pull/39096))
-
-- Add TensorRT plugins: preln_emb_eltwise_layernorm, preln_skip_la, rnorm ops, for performance optimization of ERNIE-like models. ([#39570](https://github.com/PaddlePaddle/Paddle/pull/39570))
-
-- Add TensorRT fuse passes: preln_embedding_eltwise_layernorm_fuse_pass and preln_skip_layernorm_fuse_pass, for performance optimization of ERNIE-like models. ([#39508](https://github.com/PaddlePaddle/Paddle/pull/39508))
-
-- Split the matmul fusion passes by backend (GPU, CPU, TensorRT), and support transposing FC weights. ([#39369](https://github.com/PaddlePaddle/Paddle/pull/39369))
-
-- Add TensorRT support for the roll, strided_slice, and slice ops with dynamic shapes. ([#41913](https://github.com/PaddlePaddle/Paddle/pull/41913), [#41573](https://github.com/PaddlePaddle/Paddle/pull/41573), [#41467](https://github.com/PaddlePaddle/Paddle/pull/41467))
-
-- Add TensorRT support for the div op. ([#41243](https://github.com/PaddlePaddle/Paddle/pull/41243))
-
-- Quantization support
-
-  - The `PostTrainingQuantization` API now accepts a `paddle.io.DataLoader` object or a `Python Generator` as input. ([#38686](https://github.com/PaddlePaddle/Paddle/pull/38686))
-
-  - Inference of fully quantized ERNIE models supports the interleaved data layout. ([#39424](https://github.com/PaddlePaddle/Paddle/pull/39424))
-
-  - Support inference with the new PaddleSlim quantized model format. ([#41049](https://github.com/PaddlePaddle/Paddle/pull/41049))
-
-  - Add an inference op converter and plugin for matmul int8 quantization. ([#37285](https://github.com/PaddlePaddle/Paddle/pull/37285))
-
-  - Add a pass that checks whether all ops in a model support int8 quantization. ([#36042](https://github.com/PaddlePaddle/Paddle/pull/36042))
-
-  - Support quantized inference of the FC part in the non-variable-length branch of multihead attention. ([#39660](https://github.com/PaddlePaddle/Paddle/pull/39660))
-
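-#### Ascend NPU related features
-
-- Refactor the forward computation logic of the shape operator to support execution on the NPU. ([#39613](https://github.com/PaddlePaddle/Paddle/pull/39613))
-
-- Refactor the forward computation logic of the reshape operator to support ShapeTensor input. ([#38748](https://github.com/PaddlePaddle/Paddle/pull/38748))
-
-- Unify the precision type when loading model weights. ([#39160](https://github.com/PaddlePaddle/Paddle/pull/39160))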
-
-### (3) Bug fixes
-
-#### Framework and API fixes
-
-- Fix a model clipping issue when saving static graphs. ([#37579](https://github.com/PaddlePaddle/Paddle/pull/37579))
-
-- Add PD_Cstr, a string wrapper, to the C API, together with ways to construct and destruct it, so that users do not have to destruct strings directly with the C runtime library. ([#38667](https://github.com/PaddlePaddle/Paddle/pull/38667))
-
-- Fix a logic issue in memory reuse during inference. ([#37324](https://github.com/PaddlePaddle/Paddle/pull/37324))
-
-- Fix a memory reuse error in multi-threaded scenarios. ([#37894](https://github.com/PaddlePaddle/Paddle/pull/37894))
-
-- Allow an empty string to be passed for inference when there is no weight file. ([#38579](https://github.com/PaddlePaddle/Paddle/pull/38579))
-
-- Fix clone not being supported after TensorRT dynamic shape is enabled. ([#38520](https://github.com/PaddlePaddle/Paddle/pull/38520))
-
-- Fix a multi-threaded clone error after TensorRT dynamic shape is enabled. ([#40067](https://github.com/PaddlePaddle/Paddle/pull/40067))
-
-- Fix a TensorRT engine destruction issue. ([#35842](https://github.com/PaddlePaddle/Paddle/pull/35842), [#35938](https://github.com/PaddlePaddle/Paddle/pull/35938))
-
-- Fix the lite xpu interface being unable to select an xpu card. ([#36610](https://github.com/PaddlePaddle/Paddle/pull/36610))
-
-- Add a file existence check to the interface that auto-generates TensorRT dynamic shape parameters. ([#36628](https://github.com/PaddlePaddle/Paddle/pull/36628))
-
-- Fix MKLDNN not supporting conv3d. ([#42055](https://github.com/PaddlePaddle/Paddle/pull/42055))
-
-#### Backend capability fixes
-
-- Fix the default cuDNN algorithm selection configuration for inference to use a non-deterministic strategy. ([#41491](https://github.com/PaddlePaddle/Paddle/pull/41491))
-
-- Fix wrong resource reclamation for the deformable_conv op in the TensorRT plugin. ([#38374](https://github.com/PaddlePaddle/Paddle/pull/38374))
-
-- Fix a serialization error for the deformable_conv op in the TensorRT plugin. ([#38057](https://github.com/PaddlePaddle/Paddle/pull/38057))
-
-- Adapt to the new engine building and serialization APIs of TensorRT 8.0. ([#36769](https://github.com/PaddlePaddle/Paddle/pull/36769))
-
-- Fix Flatten2MatmulFusePass, Squeeze2MatmulFusePass, and Reshape2MatmulFusePass not taking effect. ([#37644](https://github.com/PaddlePaddle/Paddle/pull/37644))
-
-- Fix an error reported by TensorRT that depended on where its input data was located. ([#37427](https://github.com/PaddlePaddle/Paddle/pull/37427))
-
-- Add an error message for wrong input dimensions. ([#38962](https://github.com/PaddlePaddle/Paddle/pull/38962))
-
-- Fix the wrong output type of EmbEltwiseLayernorm. ([#40015](https://github.com/PaddlePaddle/Paddle/pull/40015))
-
-- Remove conv_affine_channel_fuse_pass and its unit tests. ([#39817](https://github.com/PaddlePaddle/Paddle/pull/39817))
-
-- Fix the adaptive_pool2d pass wrongly replacing pool attributes. ([#39600](https://github.com/PaddlePaddle/Paddle/pull/39600))
-
-- Fix shuffle_channel_detect_pass wrongly generating a shuffle_channel op. ([#39242](https://github.com/PaddlePaddle/Paddle/pull/39242))
-
-- Fix wrong transpose parameters. ([#39006](https://github.com/PaddlePaddle/Paddle/pull/39006))
-
-- Fix a crash in nearest_interp_v2 when the input scale has fewer than 1 dimension. ([#38725](https://github.com/PaddlePaddle/Paddle/pull/38725))
-
-- Fix prelu not supporting one-dimensional input with dynamic shape. ([#39389](https://github.com/PaddlePaddle/Paddle/pull/39389))
-
-- Fix a kernel computation error in special_slice_plugin of slice. ([#39875](https://github.com/PaddlePaddle/Paddle/pull/39875))
-
-- Temporarily disable the int8 branch of skip_layernorm under variable length to prevent accuracy loss. ([#39991](https://github.com/PaddlePaddle/Paddle/pull/39991))
-
-- Fix some bugs in the support for preln_ernie models. ([#39733](https://github.com/PaddlePaddle/Paddle/pull/39733))
-
-- Fix a bug where the number of threads used by slice in ERNIE could exceed the limit, and fix special_slice being triggered by mistake. ([#39096](https://github.com/PaddlePaddle/Paddle/pull/39096))
-
-- Fix elementwise not supporting broadcast when the dimensions are the same. ([#37908](https://github.com/PaddlePaddle/Paddle/pull/37908))
-
-- Fix the nearest_interp op producing a TensorRT layer result that differs from the native op when align_corners is True, because the underlying implementations are different. ([#37525](https://github.com/PaddlePaddle/Paddle/pull/37525))
-
-- Fix a kernel computation error in qkv_plugin. ([#37096](https://github.com/PaddlePaddle/Paddle/pull/37096))
-
-- Fix an issue in the inference pass for dynamic quantization. ([#35879](https://github.com/PaddlePaddle/Paddle/pull/35879))
-
-- Reuse memory directly when the capacity requested by a Tensor is smaller than the already allocated size. ([#37880](https://github.com/PaddlePaddle/Paddle/pull/37880))
-
-- Fix a hang when TensorRT is enabled for ERNIE fixed-length models. ([#37839](https://github.com/PaddlePaddle/Paddle/pull/37839))
-
-- Fix a crash when dynamic range information is missing in TensorRT int8 mode. ([#36900](https://github.com/PaddlePaddle/Paddle/pull/36900))
-
-- Fix an issue in the slice deserialization code. ([#36588](https://github.com/PaddlePaddle/Paddle/pull/36588))
-
-- Fix the wrong yolo box calculation formula. ([#36240](https://github.com/PaddlePaddle/Paddle/pull/36240))
-
-- Fix a crash when models from older versions use the new version of roi_align. ([#38788](https://github.com/PaddlePaddle/Paddle/pull/38788))
-
-- Fix the large performance difference of softmax between python and C++. ([#37130](https://github.com/PaddlePaddle/Paddle/pull/37130))
-
-- Fix matmul inference failures with 2-dimensional input under static shape and 3-dimensional input under dynamic shape. ([#36849](https://github.com/PaddlePaddle/Paddle/pull/36849))
-
-- Fix improper shape handling in reshape_transpose_matmul_mkldnn_fuse_pass. ([#36731](https://github.com/PaddlePaddle/Paddle/pull/36731))
-
-- Fix TensorRT receiving 4 dimensions when the input has 2 dimensions. ([#36614](https://github.com/PaddlePaddle/Paddle/pull/36614))
-
-- Fix an error in the interpolate_v2 MKLDNN operator when the scale attribute is empty. ([#36623](https://github.com/PaddlePaddle/Paddle/pull/36623))
-
-- Fix poor performance of the recurrent operator in multi-threaded scenarios. ([#36052](https://github.com/PaddlePaddle/Paddle/pull/36052))
-
-- Remove the restriction on 2-dimensional TensorRT input for relu, sigmoid, tanh, relu6, batch_norm, clip, concat, gelu, hard_sigmoid, prelu, softmax, split, and swish. ([#37097](https://github.com/PaddlePaddle/Paddle/pull/37097))
-
-- Fix TensorRT inference for the reshape op. ([#41090](https://github.com/PaddlePaddle/Paddle/pull/41090))
-
-- Fix matmul-related passes to be compatible with matmul_v2. ([#36424](https://github.com/PaddlePaddle/Paddle/pull/36424))
-
-- When TensorRT is enabled, the padding mode of the conv2d operator supports the VALID and SAME attributes. ([#38999](https://github.com/PaddlePaddle/Paddle/pull/38999))
-
-- Fix a quantization issue for MKLDNN multi-input operators. ([#39593](https://github.com/PaddlePaddle/Paddle/pull/39593), [#39346](https://github.com/PaddlePaddle/Paddle/pull/39346), [#40717](https://github.com/PaddlePaddle/Paddle/pull/40717))
-
-- Fix the wrong scale of conv+activation in MKLDNN quantization scenarios. ([#38331](https://github.com/PaddlePaddle/Paddle/pull/38331))
-
-- Fix the quantization of MKLDNN parameter-free operators, which must be handled differently depending on how the subsequent operators are quantized. ([#39342](https://github.com/PaddlePaddle/Paddle/pull/39342))
-
-- Fix data type issues in MKLDNN cpu_bfloat16_placement_pass. ([#38702](https://github.com/PaddlePaddle/Paddle/pull/38702))
-
-- Fix an execution issue of the split operator in MKLDNN bfloat16 inference. ([#39548](https://github.com/PaddlePaddle/Paddle/pull/39548))
-
-- Fix the MKLDNN matmul_v2 operator not supporting 6 dimensions. ([#36342](https://github.com/PaddlePaddle/Paddle/pull/36342), [#38665](https://github.com/PaddlePaddle/Paddle/pull/38665))
-
-- Fix an MKLDNN DeviceContext error in MKLDNN matmul_v2_transpose_reshape. ([#38554](https://github.com/PaddlePaddle/Paddle/pull/38554))
-
-- Fix wrong results of segmentation models in MKLDNN inference scenarios. ([#37310](https://github.com/PaddlePaddle/Paddle/pull/37310))
-
-- Fix the MKLDNN bfloat16 placement operator list and add the missing operators. ([#36291](https://github.com/PaddlePaddle/Paddle/pull/36291))
-
-- Fix format issues of MKLDNN operators, including FC, conv_transpose, errors with 6-dimensional Tensors, and the wrong output format of conv for `NHWC` input. ([#38890](https://github.com/PaddlePaddle/Paddle/pull/38890), [#37344](https://github.com/PaddlePaddle/Paddle/pull/37344), [#37175](https://github.com/PaddlePaddle/Paddle/pull/37175), [#38553](https://github.com/PaddlePaddle/Paddle/pull/38553), [#40049](https://github.com/PaddlePaddle/Paddle/pull/40049), [#39097](https://github.com/PaddlePaddle/Paddle/pull/39097))
-
-- Fix errors caused by the cache mechanism in MKLDNN multi-threaded inference. ([#36290](https://github.com/PaddlePaddle/Paddle/pull/36290), [#35884](https://github.com/PaddlePaddle/Paddle/pull/35884))
-
-- Fix abnormal accuracy of MKLDNN quantized models caused by matmul and FC. ([#38023](https://github.com/PaddlePaddle/Paddle/pull/38023), [#37618](https://github.com/PaddlePaddle/Paddle/pull/37618))
-
-- Fix abnormal accuracy of quantized models caused by missing passes in the MKLDNN quantization conversion script. ([#37619](https://github.com/PaddlePaddle/Paddle/pull/37619), [#40542](https://github.com/PaddlePaddle/Paddle/pull/40542), [#38912](https://github.com/PaddlePaddle/Paddle/pull/38912))
-
-- Fix a crash caused by a data type mismatch when MKLDNN quantized ops are enabled. ([#38133](https://github.com/PaddlePaddle/Paddle/pull/38133))
-
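-- Fix some MKLDNN ops needing to restore the original layout after modifying it. ([#39422](https://github.com/PaddlePaddle/Paddle/pull/39422))
-
-- Fix an error under the python API in Ascend 910 inference scenarios, where the GIL was not released and this conflicted with the Ascend software stack. ([#38605](https://github.com/PaddlePaddle/Paddle/pull/38605))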
-
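-## 5. Environment adaptation
-
-### Compile and install
-
-- Starting from version 2.3.0, PaddlePaddle has adjusted and upgraded the set of GPU architectures supported by the framework. (For details, see [GPU architectures supported by PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/zh/2.3rc/install/Tables.html#gpu))
-
-Notes:
-
-- PIP source installation means installing with `pip install paddlepaddle` or `pip install paddlepaddle-gpu`, which downloads the installation package and its dependencies from the official PIP website. The download source is outside China. Compared with the bos source, it supports fewer architectures, the package is more lightweight, and only one CUDA version of the package is provided.
-
-  - Before version 2.3, the PaddlePaddle PIP source package (CUDA10.2) supported the GPU architectures 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5.
-
-  - From version 2.3 on, the PaddlePaddle PIP source package (CUDA11.0) supports the GPU architectures 6.0, 6.1, 7.0, 7.5, 8.0.
-
-- The PaddlePaddle official website bos source means downloading the installation package and its dependencies from the PaddlePaddle official website. The download source is inside China and is faster. Compared with the PIP source, it supports more GPU architectures and provides packages for multiple CUDA versions:
-
-  - Before version 2.3, the bos source packages supported the following GPU architectures:
-
-    - CUDA10: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5;
-
-    - CUDA11: 5.2, 6.0, 6.1, 7.0, 7.5, 8.0.
-
-  - From version 2.3 on, the bos source packages support the following GPU architectures:
-
-    - CUDA10: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5;
-
-    - CUDA11: 3.5, 5.0, 6.0, 6.1, 7.0, 7.5, 8.0.
-
-- Support Python 3.10, and fix compilation bugs on Windows caused by changes in some PythonC APIs. ([#41180](https://github.com/PaddlePaddle/Paddle/pull/42180))
-
-- Support compiling with Visual Studio 2019 on Windows. ([#38719](https://github.com/PaddlePaddle/Paddle/pull/38719))
-
-- Eliminate various warnings that appeared when compiling on Windows. ([#38034](https://github.com/PaddlePaddle/Paddle/pull/38034), [#37890](https://github.com/PaddlePaddle/Paddle/pull/37890), [#37442](https://github.com/PaddlePaddle/Paddle/pull/37442), [#37439](https://github.com/PaddlePaddle/Paddle/pull/37439), [#36857](https://github.com/PaddlePaddle/Paddle/pull/36857))
-
-- Fix a jetson compilation issue introduced by the upgrade of the underlying data structures. ([#39669](https://github.com/PaddlePaddle/Paddle/pull/39669), [#39441](https://github.com/PaddlePaddle/Paddle/pull/39441))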
-
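The GPU architecture lists in the notes above can be read as a small lookup table. The sketch below transcribes them into a Python dictionary and adds a helper that answers whether a given compute capability is covered by a given package. The key layout, the labels `before_2.3` and `from_2.3`, and the helper name are choices made for this sketch, not anything defined by PaddlePaddle.

```python
from typing import Dict, Tuple

# (package source, release era, CUDA label) -> supported compute capabilities,
# transcribed from the notes above. The era labels are this sketch's own.
_CUDA10_ARCHS = ("3.5", "5.0", "5.2", "6.0", "6.1", "7.0", "7.5")

SUPPORTED_ARCHS: Dict[Tuple[str, str, str], Tuple[str, ...]] = {
    ("pip", "before_2.3", "10.2"): _CUDA10_ARCHS,
    ("pip", "from_2.3", "11.0"): ("6.0", "6.1", "7.0", "7.5", "8.0"),
    ("bos", "before_2.3", "10"): _CUDA10_ARCHS,
    ("bos", "before_2.3", "11"): ("5.2", "6.0", "6.1", "7.0", "7.5", "8.0"),
    ("bos", "from_2.3", "10"): _CUDA10_ARCHS,
    ("bos", "from_2.3", "11"): ("3.5", "5.0", "6.0", "6.1", "7.0", "7.5", "8.0"),
}


def is_supported(source: str, era: str, cuda: str, arch: str) -> bool:
    """Return True if `arch` is listed for the given package; False if unknown."""
    return arch in SUPPORTED_ARCHS.get((source, era, cuda), ())


if __name__ == "__main__":
    # A compute capability 3.5 card: not in the 2.3 PIP package, but in bos CUDA11.
    print(is_supported("pip", "from_2.3", "11.0", "3.5"))
    print(is_supported("bos", "from_2.3", "11", "3.5"))
```

Unknown combinations return False rather than raising, which keeps the helper usable as a simple pre-install check.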
-
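-### New hardware adaptation
-
-- Custom new hardware access: provides a plug-in way to extend PaddlePaddle hardware backends. With this feature, developers do not need to modify PaddlePaddle code for specific hardware. They only need to implement the standard interfaces and compile them into a dynamic link library, which PaddlePaddle can then call as a plug-in. This lowers the difficulty of adding a new hardware backend to PaddlePaddle. Custom Runtime access and custom Kernel access are currently supported.
-
-- Training and inference support for the Huawei NPU chip (Ascend910), covering multiple models such as ResNet50, YoloV3, BERT, and Transformer, with static graph and mixed precision training, and single-card, single-machine, and multi-machine distributed training.
-
-- Training and inference support for Graphcore IPU chips (including IPU Mk2 GC200 and Bow IPU), covering models such as ResNet50 and BERT, with static graph training, and single-chip, single-machine, and multi-machine distributed training.
-
-- Training and inference support for the Cambricon MLU chip (MLU370x4), covering models such as ResNet50, with static graph and dynamic graph training, mixed precision training, and single-card, single-machine, and multi-machine distributed training.
-
-- Training and inference support for the second-generation Kunlunxin chips (Kunlunxin AI accelerator cards R200 and R300), covering ResNet50, YoloV3, OCR-DB, SSD, MobileNetV3, UNet, BERT, Transformer, GPT-2, Wide&Deep, and DeepFM, with static graph and dynamic graph training, mixed precision training, and single-machine single-card and single-machine multi-card training.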
-
-## Thanks to our Contributors
-
-This release contains contributions from the project core team as well as:
-
-Adam Osewski, Allen Guo, arlesniak, chenenquan, chenyanlann, fengkuangxiaxia, fuqianya, fwenguang, guguguzi, helen88, houj04, Jacek Czaja, jakpiase, jianghaicheng, joanna.wozna.intel, joeqiao12, Leo Chen, Leo Guo, Li-fAngyU, lidanqing, Liyulingyue, Matsumoto GAO, maxhuiy, Ming-Xu Huang, Nyakku Shigure, piotrekobi, piotrekobiIntel, QingshuChen, qipengh, Skr Bang, Sylwester Fraczek, Sławomir Siwek, taixiurong, tanzhipeng, Tomasz Socha, TTerror, Webbley, yaozhixin, ykkk2333, yujun, Zhangjingyu06, zhangxiaoci, zhangyikun02, zhangyk0314, zlsh80826, zn, Zuza.
+#### Bug fixes
+- Fix compilation errors caused by DTK and ROCM version upgrades. [#62832](https://github.com/PaddlePaddle/Paddle/pull/62832), [#62931](https://github.com/PaddlePaddle/Paddle/pull/62931), [#61872](https://github.com/PaddlePaddle/Paddle/pull/61872), [#63738](https://github.com/PaddlePaddle/Paddle/pull/63738)
+
+## 10. Environment updates
+In this release, PaddlePaddle completed the release and synchronized update of its basic dependency libraries and removed old dependency libraries that are no longer maintained. A number of optimizations improve compilation efficiency and compatibility, and CI pipeline monitoring was improved to enhance the installation experience. Several known compilation issues were fixed, the paddle build system was improved, and support for some new features was added. Together, these changes further improve the experience of compiling and installing the framework for both users and developers.
+
+### Newly added support
+- Support installing paddle without depending on a local cuda and cudnn, which improves the installation experience. [#60841](https://github.com/PaddlePaddle/Paddle/pull/60841), [#61973](https://github.com/PaddlePaddle/Paddle/pull/61973), [#61862](https://github.com/PaddlePaddle/Paddle/pull/61862), [#61235](https://github.com/PaddlePaddle/Paddle/pull/61235), [#61209](https://github.com/PaddlePaddle/Paddle/pull/61209), [#61653](https://github.com/PaddlePaddle/Paddle/pull/61653), [#64083](https://github.com/PaddlePaddle/Paddle/pull/64083)
+- Fully support CUDA 12.3, and retire cuda10.2. [#63356](https://github.com/PaddlePaddle/Paddle/pull/63356), [#60299](https://github.com/PaddlePaddle/Paddle/pull/60299), [#64171](https://github.com/PaddlePaddle/Paddle/pull/64171), [#62189](https://github.com/PaddlePaddle/Paddle/pull/62189), [#63392](https://github.com/PaddlePaddle/Paddle/pull/63392), [#64228](https://github.com/PaddlePaddle/Paddle/pull/64228), [#62498](https://github.com/PaddlePaddle/Paddle/pull/62498), [#64298](https://github.com/PaddlePaddle/Paddle/pull/64298)
+- Fully support Python 3.12, which brings more powerful language features and performance optimizations, and retire python3.7. [#59875](https://github.com/PaddlePaddle/Paddle/pull/59875), [#59877](https://github.com/PaddlePaddle/Paddle/pull/59877), [#59876](https://github.com/PaddlePaddle/Paddle/pull/59876)
+- Upgrade other third-party libraries that paddle depends on: [#63741](https://github.com/PaddlePaddle/Paddle/pull/63741), [#64447](https://github.com/PaddlePaddle/Paddle/pull/64447), [#60195](https://github.com/PaddlePaddle/Paddle/pull/60195), [#60110](https://github.com/PaddlePaddle/Paddle/pull/60110), [#61509](https://github.com/PaddlePaddle/Paddle/pull/61509)
+
+### Build optimization
+- Optimized paddle's CMake code, significantly improving build efficiency and the build experience. [#59995](https://github.com/PaddlePaddle/Paddle/pull/59995), [#60167](https://github.com/PaddlePaddle/Paddle/pull/60167), [#61052](https://github.com/PaddlePaddle/Paddle/pull/61052), [#59607](https://github.com/PaddlePaddle/Paddle/pull/59607), [#63093](https://github.com/PaddlePaddle/Paddle/pull/63093), [#63887](https://github.com/PaddlePaddle/Paddle/pull/63887), [#62969](https://github.com/PaddlePaddle/Paddle/pull/62969), [#64007](https://github.com/PaddlePaddle/Paddle/pull/64007), [#59811](https://github.com/PaddlePaddle/Paddle/pull/59811), [#63045](https://github.com/PaddlePaddle/Paddle/pull/63045), [#60235](https://github.com/PaddlePaddle/Paddle/pull/60235), [#60240](https://github.com/PaddlePaddle/Paddle/pull/60240), [#61411](https://github.com/PaddlePaddle/Paddle/pull/61411), [#61944](https://github.com/PaddlePaddle/Paddle/pull/61944), [#61961](https://github.com/PaddlePaddle/Paddle/pull/61961), [#59990](https://github.com/PaddlePaddle/Paddle/pull/59990), [#59478](https://github.com/PaddlePaddle/Paddle/pull/59478), [#61501](https://github.com/PaddlePaddle/Paddle/pull/61501), [#60066](https://github.com/PaddlePaddle/Paddle/pull/60066), [#64133](https://github.com/PaddlePaddle/Paddle/pull/64133), [#64231](https://github.com/PaddlePaddle/Paddle/pull/64231), [#60087](https://github.com/PaddlePaddle/Paddle/pull/60087), [#60348](https://github.com/PaddlePaddle/Paddle/pull/60348), [#60737](https://github.com/PaddlePaddle/Paddle/pull/60737), [#61364](https://github.com/PaddlePaddle/Paddle/pull/61364), [#63214](https://github.com/PaddlePaddle/Paddle/pull/63214), [#62454](https://github.com/PaddlePaddle/Paddle/pull/62454), [#62473](https://github.com/PaddlePaddle/Paddle/pull/62473), [#63692](https://github.com/PaddlePaddle/Paddle/pull/63692), [#63950](https://github.com/PaddlePaddle/Paddle/pull/63950)
+- C++ unit tests can now link against dynamic libraries on Linux and Windows, which greatly reduces the size of the C++ unit tests and of the whole build directory. [#60008](https://github.com/PaddlePaddle/Paddle/pull/60008), [#60960](https://github.com/PaddlePaddle/Paddle/pull/60960), [#60961](https://github.com/PaddlePaddle/Paddle/pull/60961), [#60831](https://github.com/PaddlePaddle/Paddle/pull/60831), [#60832](https://github.com/PaddlePaddle/Paddle/pull/60832), [#60833](https://github.com/PaddlePaddle/Paddle/pull/60833), [#61372](https://github.com/PaddlePaddle/Paddle/pull/61372), [#60834](https://github.com/PaddlePaddle/Paddle/pull/60834), [#61374](https://github.com/PaddlePaddle/Paddle/pull/61374), [#61463](https://github.com/PaddlePaddle/Paddle/pull/61463), [#61376](https://github.com/PaddlePaddle/Paddle/pull/61376), [#60830](https://github.com/PaddlePaddle/Paddle/pull/60830), [#61373](https://github.com/PaddlePaddle/Paddle/pull/61373), [#61672](https://github.com/PaddlePaddle/Paddle/pull/61672), [#61375](https://github.com/PaddlePaddle/Paddle/pull/61375), [#61676](https://github.com/PaddlePaddle/Paddle/pull/61676), [#62036](https://github.com/PaddlePaddle/Paddle/pull/62036), [#61945](https://github.com/PaddlePaddle/Paddle/pull/61945), [#61675](https://github.com/PaddlePaddle/Paddle/pull/61675), [#61674](https://github.com/PaddlePaddle/Paddle/pull/61674), [#62773](https://github.com/PaddlePaddle/Paddle/pull/62773), [#61238](https://github.com/PaddlePaddle/Paddle/pull/61238), [#59988](https://github.com/PaddlePaddle/Paddle/pull/59988), [#60307](https://github.com/PaddlePaddle/Paddle/pull/60307), [#59612](https://github.com/PaddlePaddle/Paddle/pull/59612), [#59942](https://github.com/PaddlePaddle/Paddle/pull/59942), [#59968](https://github.com/PaddlePaddle/Paddle/pull/59968), [#59978](https://github.com/PaddlePaddle/Paddle/pull/59978), [#60121](https://github.com/PaddlePaddle/Paddle/pull/60121), [#60149](https://github.com/PaddlePaddle/Paddle/pull/60149), [#60161](https://github.com/PaddlePaddle/Paddle/pull/60161), [#60160](https://github.com/PaddlePaddle/Paddle/pull/60160), [#60230](https://github.com/PaddlePaddle/Paddle/pull/60230), [#60154](https://github.com/PaddlePaddle/Paddle/pull/60154), [#60356](https://github.com/PaddlePaddle/Paddle/pull/60356), [#60392](https://github.com/PaddlePaddle/Paddle/pull/60392), [#60517](https://github.com/PaddlePaddle/Paddle/pull/60517), [#61131](https://github.com/PaddlePaddle/Paddle/pull/61131), [#60959](https://github.com/PaddlePaddle/Paddle/pull/60959)
+- Added support for the Clang compiler. Users can now build with Clang and benefit from faster compilation and clearer error messages. [#63382](https://github.com/PaddlePaddle/Paddle/pull/63382), [#63133](https://github.com/PaddlePaddle/Paddle/pull/63133), [#61705](https://github.com/PaddlePaddle/Paddle/pull/61705), [#63152](https://github.com/PaddlePaddle/Paddle/pull/63152), [#63373](https://github.com/PaddlePaddle/Paddle/pull/63373)
+
+### CI pipeline improvements
+- Improved the monitoring of merged code in the CI pipeline to ensure higher code quality and stability. Added a monitoring module that tracks CI pipeline metrics in real time, so that each stage runs smoothly and problems are found and resolved promptly. [#61384](https://github.com/PaddlePaddle/Paddle/pull/61384), [#62190](https://github.com/PaddlePaddle/Paddle/pull/62190), [#60758](https://github.com/PaddlePaddle/Paddle/pull/60758), [#60399](https://github.com/PaddlePaddle/Paddle/pull/60399), [#58623](https://github.com/PaddlePaddle/Paddle/pull/58623), [#62177](https://github.com/PaddlePaddle/Paddle/pull/62177), [#62361](https://github.com/PaddlePaddle/Paddle/pull/62361), [#62893](https://github.com/PaddlePaddle/Paddle/pull/62893), [#63705](https://github.com/PaddlePaddle/Paddle/pull/63705), [#64476](https://github.com/PaddlePaddle/Paddle/pull/64476), [#64752](https://github.com/PaddlePaddle/Paddle/pull/64752), [#64733](https://github.com/PaddlePaddle/Paddle/pull/64733), [#61914](https://github.com/PaddlePaddle/Paddle/pull/61914)
+
+### Code cleanup
+- Removed some obsolete code. [#63580](https://github.com/PaddlePaddle/Paddle/pull/63580), [#62840](https://github.com/PaddlePaddle/Paddle/pull/62840), [#62886](https://github.com/PaddlePaddle/Paddle/pull/62886), [#63046](https://github.com/PaddlePaddle/Paddle/pull/63046), [#63004](https://github.com/PaddlePaddle/Paddle/pull/63004), [#63039](https://github.com/PaddlePaddle/Paddle/pull/63039), [#62733](https://github.com/PaddlePaddle/Paddle/pull/62733), [#62773](https://github.com/PaddlePaddle/Paddle/pull/62773), [#62768](https://github.com/PaddlePaddle/Paddle/pull/62768), [#62744](https://github.com/PaddlePaddle/Paddle/pull/62744), [#62861](https://github.com/PaddlePaddle/Paddle/pull/62861), [#62774](https://github.com/PaddlePaddle/Paddle/pull/62774), [#62851](https://github.com/PaddlePaddle/Paddle/pull/62851), [#62973](https://github.com/PaddlePaddle/Paddle/pull/62973), [#63273](https://github.com/PaddlePaddle/Paddle/pull/63273), [#62445](https://github.com/PaddlePaddle/Paddle/pull/62445), [#64382](https://github.com/PaddlePaddle/Paddle/pull/64382), [#64409](https://github.com/PaddlePaddle/Paddle/pull/64409), [#64391](https://github.com/PaddlePaddle/Paddle/pull/64391), [#64310](https://github.com/PaddlePaddle/Paddle/pull/64310), [#64348](https://github.com/PaddlePaddle/Paddle/pull/64348), [#64651](https://github.com/PaddlePaddle/Paddle/pull/64651), [#64709](https://github.com/PaddlePaddle/Paddle/pull/64709), [#61714](https://github.com/PaddlePaddle/Paddle/pull/61714), [#62109](https://github.com/PaddlePaddle/Paddle/pull/62109), [#61751](https://github.com/PaddlePaddle/Paddle/pull/61751), [#61691](https://github.com/PaddlePaddle/Paddle/pull/61691), [#61735](https://github.com/PaddlePaddle/Paddle/pull/61735)
+
+### Bug fixes
+- Fixed multiple compilation issues in the paddle framework. [#63297](https://github.com/PaddlePaddle/Paddle/pull/63297),[#62994](https://github.com/PaddlePaddle/Paddle/pull/62994),[#62651](https://github.com/PaddlePaddle/Paddle/pull/62651),[#64408](https://github.com/PaddlePaddle/Paddle/pull/64408),[#60934](https://github.com/PaddlePaddle/Paddle/pull/60934),[#62899](https://github.com/PaddlePaddle/Paddle/pull/62899),[#60528](https://github.com/PaddlePaddle/Paddle/pull/60528),[#63158](https://github.com/PaddlePaddle/Paddle/pull/63158),[#64549](https://github.com/PaddlePaddle/Paddle/pull/64549),[#62351](https://github.com/PaddlePaddle/Paddle/pull/62351),[#61259](https://github.com/PaddlePaddle/Paddle/pull/61259),[#61281](https://github.com/PaddlePaddle/Paddle/pull/61281),[#62304](https://github.com/PaddlePaddle/Paddle/pull/62304),[#60736](https://github.com/PaddlePaddle/Paddle/pull/60736),[#60811](https://github.com/PaddlePaddle/Paddle/pull/60811),[#63949](https://github.com/PaddlePaddle/Paddle/pull/63949),[#59892](https://github.com/PaddlePaddle/Paddle/pull/59892),[#60767](https://github.com/PaddlePaddle/Paddle/pull/60767),[#60856](https://github.com/PaddlePaddle/Paddle/pull/60856),[#61286](https://github.com/PaddlePaddle/Paddle/pull/61286),[#61638](https://github.com/PaddlePaddle/Paddle/pull/61638),[#62079](https://github.com/PaddlePaddle/Paddle/pull/62079),[#62142](https://github.com/PaddlePaddle/Paddle/pull/62142),[#62823](https://github.com/PaddlePaddle/Paddle/pull/62823),[#62814](https://github.com/PaddlePaddle/Paddle/pull/62814),[#62425](https://github.com/PaddlePaddle/Paddle/pull/62425),[#62619](https://github.com/PaddlePaddle/Paddle/pull/62619),[#60207](https://github.com/PaddlePaddle/Paddle/pull/60207),[#60765](https://github.com/PaddlePaddle/Paddle/pull/60765),[#61870](https://github.com/PaddlePaddle/Paddle/pull/61870),[#61923](https://github.com/PaddlePaddle/Paddle/pull/61923),[#62144](https://github.com/PaddlePaddle/Paddle/pull/62144),[#62426](https://github.com/PaddlePaddle/Paddle/pull/6242
6),[#63848](https://github.com/PaddlePaddle/Paddle/pull/63848),[#60682](https://github.com/PaddlePaddle/Paddle/pull/60682),[#61369](https://github.com/PaddlePaddle/Paddle/pull/61369),[#62882](https://github.com/PaddlePaddle/Paddle/pull/62882),[#63944](https://github.com/PaddlePaddle/Paddle/pull/63944),[#64812](https://github.com/PaddlePaddle/Paddle/pull/64812),[#60654](https://github.com/PaddlePaddle/Paddle/pull/60654),[#60887](https://github.com/PaddlePaddle/Paddle/pull/60887),[#62058](https://github.com/PaddlePaddle/Paddle/pull/62058),[#64639](https://github.com/PaddlePaddle/Paddle/pull/64639),[#60115](https://github.com/PaddlePaddle/Paddle/pull/60115),[#61940](https://github.com/PaddlePaddle/Paddle/pull/61940),[#62614](https://github.com/PaddlePaddle/Paddle/pull/62614),[#59914](https://github.com/PaddlePaddle/Paddle/pull/59914),[#63762](https://github.com/PaddlePaddle/Paddle/pull/63762),[#60145](https://github.com/PaddlePaddle/Paddle/pull/60145),[#60285](https://github.com/PaddlePaddle/Paddle/pull/60285),[#60378](https://github.com/PaddlePaddle/Paddle/pull/60378),[#60393](https://github.com/PaddlePaddle/Paddle/pull/60393),[#61057](https://github.com/PaddlePaddle/Paddle/pull/61057),[#61058](https://github.com/PaddlePaddle/Paddle/pull/61058),[#61151](https://github.com/PaddlePaddle/Paddle/pull/61151),[#61347](https://github.com/PaddlePaddle/Paddle/pull/61347),[#61554](https://github.com/PaddlePaddle/Paddle/pull/61554),[#61844](https://github.com/PaddlePaddle/Paddle/pull/61844),[#62915](https://github.com/PaddlePaddle/Paddle/pull/62915),[#61852](https://github.com/PaddlePaddle/Paddle/pull/61852),[#61704](https://github.com/PaddlePaddle/Paddle/pull/61704),[#61991](https://github.com/PaddlePaddle/Paddle/pull/61991),[#62264](https://github.com/PaddlePaddle/Paddle/pull/62264),[#62762](https://github.com/PaddlePaddle/Paddle/pull/62762),[#63820](https://github.com/PaddlePaddle/Paddle/pull/63820),[#63864](https://github.com/PaddlePaddle/Paddle/pull/63864),[#65017](https://
github.com/PaddlePaddle/Paddle/pull/65017),[#61183](https://github.com/PaddlePaddle/Paddle/pull/61183),[#59866](https://github.com/PaddlePaddle/Paddle/pull/59866),[#61171](https://github.com/PaddlePaddle/Paddle/pull/61171),[#61290](https://github.com/PaddlePaddle/Paddle/pull/61290),[#61725](https://github.com/PaddlePaddle/Paddle/pull/61725),[#61614](https://github.com/PaddlePaddle/Paddle/pull/61614),[#61721](https://github.com/PaddlePaddle/Paddle/pull/61721),[#61494](https://github.com/PaddlePaddle/Paddle/pull/61494),[#61556](https://github.com/PaddlePaddle/Paddle/pull/61556),[#61689](https://github.com/PaddlePaddle/Paddle/pull/61689)
+
+## 11. Documentation fixes
+- As API functionality was enhanced, the documentation of the affected APIs was corrected and improved accordingly. [#62875](https://github.com/PaddlePaddle/Paddle/pull/62875), [#59793](https://github.com/PaddlePaddle/Paddle/pull/59793), [#60002](https://github.com/PaddlePaddle/Paddle/pull/60002), [#59985](https://github.com/PaddlePaddle/Paddle/pull/59985), [#63365](https://github.com/PaddlePaddle/Paddle/pull/63365), [#60962](https://github.com/PaddlePaddle/Paddle/pull/60962), [#60942](https://github.com/PaddlePaddle/Paddle/pull/60942), [#64232](https://github.com/PaddlePaddle/Paddle/pull/64232), [#63255](https://github.com/PaddlePaddle/Paddle/pull/63255)
+- Updated or supplemented API documentation: `bernoulli_` ([#64504](https://github.com/PaddlePaddle/Paddle/pull/64504)), `paddle.static.ctr_metric_bundle` ([#60912](https://github.com/PaddlePaddle/Paddle/pull/60912)), `LayerNorm` ([#62928](https://github.com/PaddlePaddle/Paddle/pull/62928)), `Sequential` ([#63128](https://github.com/PaddlePaddle/Paddle/pull/63128)), `paddle.summary` ([#63121](https://github.com/PaddlePaddle/Paddle/pull/63121)), `ShardOptimizer` in AutoParallel ([#62933](https://github.com/PaddlePaddle/Paddle/pull/62933)), `paddle.nccl.version` ([#62480](https://github.com/PaddlePaddle/Paddle/pull/62480))
+- Updated the README files. [#59883](https://github.com/PaddlePaddle/Paddle/pull/59883), [#60691](https://github.com/PaddlePaddle/Paddle/pull/60691), [#60749](https://github.com/PaddlePaddle/Paddle/pull/60749)
+- Updated references from mkldnn to onednn. [#63199](https://github.com/PaddlePaddle/Paddle/pull/63199), [#63202](https://github.com/PaddlePaddle/Paddle/pull/63202), [#63215](https://github.com/PaddlePaddle/Paddle/pull/63215), [#63209](https://github.com/PaddlePaddle/Paddle/pull/63209)
+- Fixed documentation rendering errors. [#59725](https://github.com/PaddlePaddle/Paddle/pull/59725), [#60306](https://github.com/PaddlePaddle/Paddle/pull/60306)
+- Corrected a large number of typos in the code to improve source readability. [#60093](https://github.com/PaddlePaddle/Paddle/pull/60093),[#60603](https://github.com/PaddlePaddle/Paddle/pull/60603),[#60631](https://github.com/PaddlePaddle/Paddle/pull/60631),[#60679](https://github.com/PaddlePaddle/Paddle/pull/60679),[#60741](https://github.com/PaddlePaddle/Paddle/pull/60741),[#60770](https://github.com/PaddlePaddle/Paddle/pull/60770),[#60784](https://github.com/PaddlePaddle/Paddle/pull/60784),[#60825](https://github.com/PaddlePaddle/Paddle/pull/60825),[#60857](https://github.com/PaddlePaddle/Paddle/pull/60857),[#60891](https://github.com/PaddlePaddle/Paddle/pull/60891),[#60921](https://github.com/PaddlePaddle/Paddle/pull/60921),[#60920](https://github.com/PaddlePaddle/Paddle/pull/60920),[#60923](https://github.com/PaddlePaddle/Paddle/pull/60923),[#60928](https://github.com/PaddlePaddle/Paddle/pull/60928),[#60940](https://github.com/PaddlePaddle/Paddle/pull/60940),[#60936](https://github.com/PaddlePaddle/Paddle/pull/60936),[#60932](https://github.com/PaddlePaddle/Paddle/pull/60932),[#60935](https://github.com/PaddlePaddle/Paddle/pull/60935),[#60931](https://github.com/PaddlePaddle/Paddle/pull/60931),[#60951](https://github.com/PaddlePaddle/Paddle/pull/60951),[#60964](https://github.com/PaddlePaddle/Paddle/pull/60964),[#60965](https://github.com/PaddlePaddle/Paddle/pull/60965),[#60967](https://github.com/PaddlePaddle/Paddle/pull/60967),[#60972](https://github.com/PaddlePaddle/Paddle/pull/60972),[#60971](https://github.com/PaddlePaddle/Paddle/pull/60971),[#60980](https://github.com/PaddlePaddle/Paddle/pull/60980),[#60984](https://github.com/PaddlePaddle/Paddle/pull/60984),[#60985](https://github.com/PaddlePaddle/Paddle/pull/60985),[#60989](https://github.com/PaddlePaddle/Paddle/pull/60989),[#60990](https://github.com/PaddlePaddle/Paddle/pull/60990),[#60991](https://github.com/PaddlePaddle/Paddle/pull/60991),[#60992](https://github.com/PaddlePaddle/Paddle/pull/60992),[#60994](https://github.com/PaddlePaddle/Paddle/pull/609
94),[#60995](https://github.com/PaddlePaddle/Paddle/pull/60995),[#60996](https://github.com/PaddlePaddle/Paddle/pull/60996),[#61001](https://github.com/PaddlePaddle/Paddle/pull/61001),[#61000](https://github.com/PaddlePaddle/Paddle/pull/61000),[#60999](https://github.com/PaddlePaddle/Paddle/pull/60999),[#60998](https://github.com/PaddlePaddle/Paddle/pull/60998),[#61026](https://github.com/PaddlePaddle/Paddle/pull/61026),[#61009](https://github.com/PaddlePaddle/Paddle/pull/61009),[#61034](https://github.com/PaddlePaddle/Paddle/pull/61034),[#61033](https://github.com/PaddlePaddle/Paddle/pull/61033),[#61020](https://github.com/PaddlePaddle/Paddle/pull/61020),[#61092](https://github.com/PaddlePaddle/Paddle/pull/61092),[#61066](https://github.com/PaddlePaddle/Paddle/pull/61066),[#61063](https://github.com/PaddlePaddle/Paddle/pull/61063),[#61089](https://github.com/PaddlePaddle/Paddle/pull/61089),[#61071](https://github.com/PaddlePaddle/Paddle/pull/61071),[#61129](https://github.com/PaddlePaddle/Paddle/pull/61129),[#61128](https://github.com/PaddlePaddle/Paddle/pull/61128),[#61126](https://github.com/PaddlePaddle/Paddle/pull/61126),[#61123](https://github.com/PaddlePaddle/Paddle/pull/61123),[#61113](https://github.com/PaddlePaddle/Paddle/pull/61113),[#61189](https://github.com/PaddlePaddle/Paddle/pull/61189),[#61175](https://github.com/PaddlePaddle/Paddle/pull/61175),[#61153](https://github.com/PaddlePaddle/Paddle/pull/61153),[#61198](https://github.com/PaddlePaddle/Paddle/pull/61198),[#61206](https://github.com/PaddlePaddle/Paddle/pull/61206),[#61256](https://github.com/PaddlePaddle/Paddle/pull/61256),[#61255](https://github.com/PaddlePaddle/Paddle/pull/61255),[#61251](https://github.com/PaddlePaddle/Paddle/pull/61251),[#61246](https://github.com/PaddlePaddle/Paddle/pull/61246),[#61245](https://github.com/PaddlePaddle/Paddle/pull/61245),[#61231](https://github.com/PaddlePaddle/Paddle/pull/61231),[#61247](https://github.com/PaddlePaddle/Paddle/pull/61247),[#61265](https:/
/github.com/PaddlePaddle/Paddle/pull/61265),[#61264](https://github.com/PaddlePaddle/Paddle/pull/61264),[#61266](https://github.com/PaddlePaddle/Paddle/pull/61266),[#61267](https://github.com/PaddlePaddle/Paddle/pull/61267),[#61268](https://github.com/PaddlePaddle/Paddle/pull/61268),[#61270](https://github.com/PaddlePaddle/Paddle/pull/61270),[#61334](https://github.com/PaddlePaddle/Paddle/pull/61334),[#61392](https://github.com/PaddlePaddle/Paddle/pull/61392),[#61404](https://github.com/PaddlePaddle/Paddle/pull/61404),[#61318](https://github.com/PaddlePaddle/Paddle/pull/61318),[#61383](https://github.com/PaddlePaddle/Paddle/pull/61383),[#61306](https://github.com/PaddlePaddle/Paddle/pull/61306),[#61324](https://github.com/PaddlePaddle/Paddle/pull/61324),[#61426](https://github.com/PaddlePaddle/Paddle/pull/61426),[#61390](https://github.com/PaddlePaddle/Paddle/pull/61390),[#61419](https://github.com/PaddlePaddle/Paddle/pull/61419),[#61420](https://github.com/PaddlePaddle/Paddle/pull/61420),[#61408](https://github.com/PaddlePaddle/Paddle/pull/61408),[#61425](https://github.com/PaddlePaddle/Paddle/pull/61425),[#61557](https://github.com/PaddlePaddle/Paddle/pull/61557),[#61628](https://github.com/PaddlePaddle/Paddle/pull/61628),[#61652](https://github.com/PaddlePaddle/Paddle/pull/61652),[#61602](https://github.com/PaddlePaddle/Paddle/pull/61602),[#61558](https://github.com/PaddlePaddle/Paddle/pull/61558),[#61660](https://github.com/PaddlePaddle/Paddle/pull/61660),[#61423](https://github.com/PaddlePaddle/Paddle/pull/61423),[#61627](https://github.com/PaddlePaddle/Paddle/pull/61627),[#61685](https://github.com/PaddlePaddle/Paddle/pull/61685),[#61690](https://github.com/PaddlePaddle/Paddle/pull/61690),[#61727](https://github.com/PaddlePaddle/Paddle/pull/61727),[#61738](https://github.com/PaddlePaddle/Paddle/pull/61738),[#61740](https://github.com/PaddlePaddle/Paddle/pull/61740),[#61741](https://github.com/PaddlePaddle/Paddle/pull/61741),[#61743](https://github.com/PaddlePa
ddle/Paddle/pull/61743),[#61744](https://github.com/PaddlePaddle/Paddle/pull/61744),[#61745](https://github.com/PaddlePaddle/Paddle/pull/61745),[#61761](https://github.com/PaddlePaddle/Paddle/pull/61761),[#61762](https://github.com/PaddlePaddle/Paddle/pull/61762),[#61764](https://github.com/PaddlePaddle/Paddle/pull/61764),[#61767](https://github.com/PaddlePaddle/Paddle/pull/61767),[#61768](https://github.com/PaddlePaddle/Paddle/pull/61768),[#61774](https://github.com/PaddlePaddle/Paddle/pull/61774),[#61781](https://github.com/PaddlePaddle/Paddle/pull/61781),[#61783](https://github.com/PaddlePaddle/Paddle/pull/61783),[#61757](https://github.com/PaddlePaddle/Paddle/pull/61757),[#61732](https://github.com/PaddlePaddle/Paddle/pull/61732),[#61776](https://github.com/PaddlePaddle/Paddle/pull/61776),[#61780](https://github.com/PaddlePaddle/Paddle/pull/61780),[#61730](https://github.com/PaddlePaddle/Paddle/pull/61730),[#61728](https://github.com/PaddlePaddle/Paddle/pull/61728),[#61633](https://github.com/PaddlePaddle/Paddle/pull/61633),[#61720](https://github.com/PaddlePaddle/Paddle/pull/61720),[#61734](https://github.com/PaddlePaddle/Paddle/pull/61734),[#61779](https://github.com/PaddlePaddle/Paddle/pull/61779),[#61775](https://github.com/PaddlePaddle/Paddle/pull/61775),[#61773](https://github.com/PaddlePaddle/Paddle/pull/61773),[#61787](https://github.com/PaddlePaddle/Paddle/pull/61787),[#61687](https://github.com/PaddlePaddle/Paddle/pull/61687),[#61747](https://github.com/PaddlePaddle/Paddle/pull/61747),[#61760](https://github.com/PaddlePaddle/Paddle/pull/61760),[#61782](https://github.com/PaddlePaddle/Paddle/pull/61782),[#61800](https://github.com/PaddlePaddle/Paddle/pull/61800),[#61748](https://github.com/PaddlePaddle/Paddle/pull/61748),[#61772](https://github.com/PaddlePaddle/Paddle/pull/61772),[#61786](https://github.com/PaddlePaddle/Paddle/pull/61786),[#61880](https://github.com/PaddlePaddle/Paddle/pull/61880),[#61718](https://github.com/PaddlePaddle/Paddle/pull/617
18),[#61742](https://github.com/PaddlePaddle/Paddle/pull/61742),[#61766](https://github.com/PaddlePaddle/Paddle/pull/61766),[#61835](https://github.com/PaddlePaddle/Paddle/pull/61835),[#61838](https://github.com/PaddlePaddle/Paddle/pull/61838),[#61754](https://github.com/PaddlePaddle/Paddle/pull/61754),[#61833](https://github.com/PaddlePaddle/Paddle/pull/61833),[#61749](https://github.com/PaddlePaddle/Paddle/pull/61749),[#61938](https://github.com/PaddlePaddle/Paddle/pull/61938),[#61919](https://github.com/PaddlePaddle/Paddle/pull/61919),[#61924](https://github.com/PaddlePaddle/Paddle/pull/61924),[#61778](https://github.com/PaddlePaddle/Paddle/pull/61778),[#61839](https://github.com/PaddlePaddle/Paddle/pull/61839),[#61879](https://github.com/PaddlePaddle/Paddle/pull/61879),[#61929](https://github.com/PaddlePaddle/Paddle/pull/61929),[#61801](https://github.com/PaddlePaddle/Paddle/pull/61801),[#61788](https://github.com/PaddlePaddle/Paddle/pull/61788),[#61999](https://github.com/PaddlePaddle/Paddle/pull/61999),[#61928](https://github.com/PaddlePaddle/Paddle/pull/61928),[#61958](https://github.com/PaddlePaddle/Paddle/pull/61958),[#61982](https://github.com/PaddlePaddle/Paddle/pull/61982),[#61996](https://github.com/PaddlePaddle/Paddle/pull/61996),[#61953](https://github.com/PaddlePaddle/Paddle/pull/61953),[#61998](https://github.com/PaddlePaddle/Paddle/pull/61998),[#62003](https://github.com/PaddlePaddle/Paddle/pull/62003),[#61921](https://github.com/PaddlePaddle/Paddle/pull/61921),[#61881](https://github.com/PaddlePaddle/Paddle/pull/61881),[#61746](https://github.com/PaddlePaddle/Paddle/pull/61746),[#61955](https://github.com/PaddlePaddle/Paddle/pull/61955),[#62002](https://github.com/PaddlePaddle/Paddle/pull/62002),[#62001](https://github.com/PaddlePaddle/Paddle/pull/62001),[#61997](https://github.com/PaddlePaddle/Paddle/pull/61997),[#61765](https://github.com/PaddlePaddle/Paddle/pull/61765),[#61956](https://github.com/PaddlePaddle/Paddle/pull/61956),[#62004](https:/
/github.com/PaddlePaddle/Paddle/pull/62004),[#62044](https://github.com/PaddlePaddle/Paddle/pull/62044),[#62040](https://github.com/PaddlePaddle/Paddle/pull/62040),[#62043](https://github.com/PaddlePaddle/Paddle/pull/62043),[#62042](https://github.com/PaddlePaddle/Paddle/pull/62042),[#62041](https://github.com/PaddlePaddle/Paddle/pull/62041),[#62039](https://github.com/PaddlePaddle/Paddle/pull/62039),[#62019](https://github.com/PaddlePaddle/Paddle/pull/62019),[#61910](https://github.com/PaddlePaddle/Paddle/pull/61910),[#61882](https://github.com/PaddlePaddle/Paddle/pull/61882),[#61836](https://github.com/PaddlePaddle/Paddle/pull/61836),[#62013](https://github.com/PaddlePaddle/Paddle/pull/62013),[#62055](https://github.com/PaddlePaddle/Paddle/pull/62055),[#62047](https://github.com/PaddlePaddle/Paddle/pull/62047),[#62000](https://github.com/PaddlePaddle/Paddle/pull/62000),[#62048](https://github.com/PaddlePaddle/Paddle/pull/62048),[#62075](https://github.com/PaddlePaddle/Paddle/pull/62075),[#62038](https://github.com/PaddlePaddle/Paddle/pull/62038),[#62045](https://github.com/PaddlePaddle/Paddle/pull/62045),[#62105](https://github.com/PaddlePaddle/Paddle/pull/62105),[#62214](https://github.com/PaddlePaddle/Paddle/pull/62214),[#62212](https://github.com/PaddlePaddle/Paddle/pull/62212),[#62183](https://github.com/PaddlePaddle/Paddle/pull/62183),[#62182](https://github.com/PaddlePaddle/Paddle/pull/62182),[#62181](https://github.com/PaddlePaddle/Paddle/pull/62181),[#62179](https://github.com/PaddlePaddle/Paddle/pull/62179),[#62178](https://github.com/PaddlePaddle/Paddle/pull/62178),[#62172](https://github.com/PaddlePaddle/Paddle/pull/62172),[#62168](https://github.com/PaddlePaddle/Paddle/pull/62168),[#62163](https://github.com/PaddlePaddle/Paddle/pull/62163),[#62162](https://github.com/PaddlePaddle/Paddle/pull/62162),[#62161](https://github.com/PaddlePaddle/Paddle/pull/62161),[#62160](https://github.com/PaddlePaddle/Paddle/pull/62160),[#62046](https://github.com/PaddlePa
ddle/Paddle/pull/62046),[#62175](https://github.com/PaddlePaddle/Paddle/pull/62175),[#62259](https://github.com/PaddlePaddle/Paddle/pull/62259),[#62258](https://github.com/PaddlePaddle/Paddle/pull/62258),[#62213](https://github.com/PaddlePaddle/Paddle/pull/62213),[#62260](https://github.com/PaddlePaddle/Paddle/pull/62260),[#62290](https://github.com/PaddlePaddle/Paddle/pull/62290),[#62288](https://github.com/PaddlePaddle/Paddle/pull/62288),[#62323](https://github.com/PaddlePaddle/Paddle/pull/62323),[#62319](https://github.com/PaddlePaddle/Paddle/pull/62319),[#62331](https://github.com/PaddlePaddle/Paddle/pull/62331),[#62330](https://github.com/PaddlePaddle/Paddle/pull/62330),[#62329](https://github.com/PaddlePaddle/Paddle/pull/62329),[#62324](https://github.com/PaddlePaddle/Paddle/pull/62324),[#62317](https://github.com/PaddlePaddle/Paddle/pull/62317),[#62311](https://github.com/PaddlePaddle/Paddle/pull/62311),[#62310](https://github.com/PaddlePaddle/Paddle/pull/62310),[#62308](https://github.com/PaddlePaddle/Paddle/pull/62308),[#62289](https://github.com/PaddlePaddle/Paddle/pull/62289),[#62307](https://github.com/PaddlePaddle/Paddle/pull/62307),[#62315](https://github.com/PaddlePaddle/Paddle/pull/62315),[#62406](https://github.com/PaddlePaddle/Paddle/pull/62406),[#62458](https://github.com/PaddlePaddle/Paddle/pull/62458),[#62459](https://github.com/PaddlePaddle/Paddle/pull/62459),[#62481](https://github.com/PaddlePaddle/Paddle/pull/62481),[#62465](https://github.com/PaddlePaddle/Paddle/pull/62465),[#62462](https://github.com/PaddlePaddle/Paddle/pull/62462),[#62453](https://github.com/PaddlePaddle/Paddle/pull/62453),[#62496](https://github.com/PaddlePaddle/Paddle/pull/62496),[#62457](https://github.com/PaddlePaddle/Paddle/pull/62457),[#62537](https://github.com/PaddlePaddle/Paddle/pull/62537),[#62514](https://github.com/PaddlePaddle/Paddle/pull/62514),[#62548](https://github.com/PaddlePaddle/Paddle/pull/62548),[#62544](https://github.com/PaddlePaddle/Paddle/pull/625
44),[#62575](https://github.com/PaddlePaddle/Paddle/pull/62575),[#62463](https://github.com/PaddlePaddle/Paddle/pull/62463),[#62643](https://github.com/PaddlePaddle/Paddle/pull/62643),[#62803](https://github.com/PaddlePaddle/Paddle/pull/62803),[#62924](https://github.com/PaddlePaddle/Paddle/pull/62924),[#63037](https://github.com/PaddlePaddle/Paddle/pull/63037),[#63102](https://github.com/PaddlePaddle/Paddle/pull/63102),[#63139](https://github.com/PaddlePaddle/Paddle/pull/63139),[#63092](https://github.com/PaddlePaddle/Paddle/pull/63092),[#63147](https://github.com/PaddlePaddle/Paddle/pull/63147),[#60518](https://github.com/PaddlePaddle/Paddle/pull/60518),[#60485](https://github.com/PaddlePaddle/Paddle/pull/60485),[#61273](https://github.com/PaddlePaddle/Paddle/pull/61273),[#63429](https://github.com/PaddlePaddle/Paddle/pull/63429),[#61954](https://github.com/PaddlePaddle/Paddle/pull/61954)
+
+## 12. Other updates
+Changes that do not affect user-facing behavior, including cleanup of deprecated code, removal of unused unit tests, and upgrades to debugging or monitoring mechanisms. [#63377](https://github.com/PaddlePaddle/Paddle/pull/63377),[#64106](https://github.com/PaddlePaddle/Paddle/pull/64106),[#64220](https://github.com/PaddlePaddle/Paddle/pull/64220),[#64293](https://github.com/PaddlePaddle/Paddle/pull/64293),[#64464](https://github.com/PaddlePaddle/Paddle/pull/64464),[#64944](https://github.com/PaddlePaddle/Paddle/pull/64944),[#63638](https://github.com/PaddlePaddle/Paddle/pull/63638),[#63732](https://github.com/PaddlePaddle/Paddle/pull/63732),[#63735](https://github.com/PaddlePaddle/Paddle/pull/63735),[#63826](https://github.com/PaddlePaddle/Paddle/pull/63826),[#63982](https://github.com/PaddlePaddle/Paddle/pull/63982),[#63737](https://github.com/PaddlePaddle/Paddle/pull/63737),[#64471](https://github.com/PaddlePaddle/Paddle/pull/64471),[#64574](https://github.com/PaddlePaddle/Paddle/pull/64574),[#64494](https://github.com/PaddlePaddle/Paddle/pull/64494),[#62775](https://github.com/PaddlePaddle/Paddle/pull/62775),[#63601](https://github.com/PaddlePaddle/Paddle/pull/63601),[#62564](https://github.com/PaddlePaddle/Paddle/pull/62564),[#63772](https://github.com/PaddlePaddle/Paddle/pull/63772),[#64719](https://github.com/PaddlePaddle/Paddle/pull/64719),[#61640](https://github.com/PaddlePaddle/Paddle/pull/61640),[#63459](https://github.com/PaddlePaddle/Paddle/pull/63459),[#64062](https://github.com/PaddlePaddle/Paddle/pull/64062),[#63480](https://github.com/PaddlePaddle/Paddle/pull/63480),[#63833](https://github.com/PaddlePaddle/Paddle/pull/63833),[#63673](https://github.com/PaddlePaddle/Paddle/pull/63673),[#63672](https://github.com/PaddlePaddle/Paddle/pull/63672),[#64131](https://github.com/PaddlePaddle/Paddle/pull/64131),[#64156](https://github.com/PaddlePaddle/Paddle/pull/64156),[#64155](https://github.com/PaddlePaddle/Paddle/pull/64155),[#64159](https://github.com/PaddlePaddle/Paddle/pull/64159),[#63902](https://github.com/PaddlePaddle/Paddle/pull/63902),[#64230](https://github.com/PaddlePaddle/
Paddle/pull/64230),[#64229](https://github.com/PaddlePaddle/Paddle/pull/64229),[#64236](https://github.com/PaddlePaddle/Paddle/pull/64236),[#64260](https://github.com/PaddlePaddle/Paddle/pull/64260),[#64175](https://github.com/PaddlePaddle/Paddle/pull/64175),[#64250](https://github.com/PaddlePaddle/Paddle/pull/64250),[#64269](https://github.com/PaddlePaddle/Paddle/pull/64269),[#64238](https://github.com/PaddlePaddle/Paddle/pull/64238),[#64349](https://github.com/PaddlePaddle/Paddle/pull/64349),[#64394](https://github.com/PaddlePaddle/Paddle/pull/64394),[#64402](https://github.com/PaddlePaddle/Paddle/pull/64402),[#64401](https://github.com/PaddlePaddle/Paddle/pull/64401),[#64388](https://github.com/PaddlePaddle/Paddle/pull/64388),[#64329](https://github.com/PaddlePaddle/Paddle/pull/64329),[#64502](https://github.com/PaddlePaddle/Paddle/pull/64502),[#64501](https://github.com/PaddlePaddle/Paddle/pull/64501),[#64515](https://github.com/PaddlePaddle/Paddle/pull/64515),[#64503](https://github.com/PaddlePaddle/Paddle/pull/64503),[#64514](https://github.com/PaddlePaddle/Paddle/pull/64514),[#64601](https://github.com/PaddlePaddle/Paddle/pull/64601),[#64564](https://github.com/PaddlePaddle/Paddle/pull/64564),[#64012](https://github.com/PaddlePaddle/Paddle/pull/64012),[#64697](https://github.com/PaddlePaddle/Paddle/pull/64697),[#64682](https://github.com/PaddlePaddle/Paddle/pull/64682),[#64051](https://github.com/PaddlePaddle/Paddle/pull/64051),[#63267](https://github.com/PaddlePaddle/Paddle/pull/63267),[#63426](https://github.com/PaddlePaddle/Paddle/pull/63426),[#63626](https://github.com/PaddlePaddle/Paddle/pull/63626),[#63257](https://github.com/PaddlePaddle/Paddle/pull/63257),[#63266](https://github.com/PaddlePaddle/Paddle/pull/63266),[#63468](https://github.com/PaddlePaddle/Paddle/pull/63468),[#63262](https://github.com/PaddlePaddle/Paddle/pull/63262),[#63248](https://github.com/PaddlePaddle/Paddle/pull/63248),[#63241](https://github.com/PaddlePaddle/Paddle/pull/63241),[
#63252](https://github.com/PaddlePaddle/Paddle/pull/63252),[#63258](https://github.com/PaddlePaddle/Paddle/pull/63258),[#63235](https://github.com/PaddlePaddle/Paddle/pull/63235),[#63399](https://github.com/PaddlePaddle/Paddle/pull/63399),[#63488](https://github.com/PaddlePaddle/Paddle/pull/63488),[#63487](https://github.com/PaddlePaddle/Paddle/pull/63487),[#63466](https://github.com/PaddlePaddle/Paddle/pull/63466),[#63464](https://github.com/PaddlePaddle/Paddle/pull/63464),[#63483](https://github.com/PaddlePaddle/Paddle/pull/63483),[#63486](https://github.com/PaddlePaddle/Paddle/pull/63486),[#63475](https://github.com/PaddlePaddle/Paddle/pull/63475),[#63489](https://github.com/PaddlePaddle/Paddle/pull/63489),[#63470](https://github.com/PaddlePaddle/Paddle/pull/63470),[#63457](https://github.com/PaddlePaddle/Paddle/pull/63457),[#63493](https://github.com/PaddlePaddle/Paddle/pull/63493),[#63561](https://github.com/PaddlePaddle/Paddle/pull/63561),[#63584](https://github.com/PaddlePaddle/Paddle/pull/63584),[#63587](https://github.com/PaddlePaddle/Paddle/pull/63587),[#63586](https://github.com/PaddlePaddle/Paddle/pull/63586),[#63569](https://github.com/PaddlePaddle/Paddle/pull/63569),[#63559](https://github.com/PaddlePaddle/Paddle/pull/63559),[#63558](https://github.com/PaddlePaddle/Paddle/pull/63558),[#63555](https://github.com/PaddlePaddle/Paddle/pull/63555),[#63543](https://github.com/PaddlePaddle/Paddle/pull/63543),[#63589](https://github.com/PaddlePaddle/Paddle/pull/63589),[#63583](https://github.com/PaddlePaddle/Paddle/pull/63583),[#63565](https://github.com/PaddlePaddle/Paddle/pull/63565),[#63564](https://github.com/PaddlePaddle/Paddle/pull/63564),[#63265](https://github.com/PaddlePaddle/Paddle/pull/63265),[#63562](https://github.com/PaddlePaddle/Paddle/pull/63562),[#63591](https://github.com/PaddlePaddle/Paddle/pull/63591),[#63460](https://github.com/PaddlePaddle/Paddle/pull/63460),[#63238](https://github.com/PaddlePaddle/Paddle/pull/63238),[#63631](https://gith
ub.com/PaddlePaddle/Paddle/pull/63631),[#63707](https://github.com/PaddlePaddle/Paddle/pull/63707),[#63714](https://github.com/PaddlePaddle/Paddle/pull/63714),[#63854](https://github.com/PaddlePaddle/Paddle/pull/63854),[#63929](https://github.com/PaddlePaddle/Paddle/pull/63929),[#63532](https://github.com/PaddlePaddle/Paddle/pull/63532),[#59628](https://github.com/PaddlePaddle/Paddle/pull/59628),[#62209](https://github.com/PaddlePaddle/Paddle/pull/62209),[#63742](https://github.com/PaddlePaddle/Paddle/pull/63742),[#60518](https://github.com/PaddlePaddle/Paddle/pull/60518),[#62078](https://github.com/PaddlePaddle/Paddle/pull/62078),[#62684](https://github.com/PaddlePaddle/Paddle/pull/62684),[#62723](https://github.com/PaddlePaddle/Paddle/pull/62723),[#64141](https://github.com/PaddlePaddle/Paddle/pull/64141),[#60404](https://github.com/PaddlePaddle/Paddle/pull/60404),[#64212](https://github.com/PaddlePaddle/Paddle/pull/64212),[#60652](https://github.com/PaddlePaddle/Paddle/pull/60652),[#64545](https://github.com/PaddlePaddle/Paddle/pull/64545),[#64477](https://github.com/PaddlePaddle/Paddle/pull/64477),[#64556](https://github.com/PaddlePaddle/Paddle/pull/64556),[#63160](https://github.com/PaddlePaddle/Paddle/pull/63160),[#63796](https://github.com/PaddlePaddle/Paddle/pull/63796),[#64693](https://github.com/PaddlePaddle/Paddle/pull/64693),[#64484](https://github.com/PaddlePaddle/Paddle/pull/64484),[#64677](https://github.com/PaddlePaddle/Paddle/pull/64677),[#64461](https://github.com/PaddlePaddle/Paddle/pull/64461),[#63189](https://github.com/PaddlePaddle/Paddle/pull/63189),[#63855](https://github.com/PaddlePaddle/Paddle/pull/63855),[#63896](https://github.com/PaddlePaddle/Paddle/pull/63896),[#63193](https://github.com/PaddlePaddle/Paddle/pull/63193),[#63200](https://github.com/PaddlePaddle/Paddle/pull/63200),[#63406](https://github.com/PaddlePaddle/Paddle/pull/63406),[#61283](https://github.com/PaddlePaddle/Paddle/pull/61283),[#63607](https://github.com/PaddlePaddle/
Paddle/pull/63607),[#64486](https://github.com/PaddlePaddle/Paddle/pull/64486),[#64004](https://github.com/PaddlePaddle/Paddle/pull/64004),[#63132](https://github.com/PaddlePaddle/Paddle/pull/63132),[#63553](https://github.com/PaddlePaddle/Paddle/pull/63553),[#63572](https://github.com/PaddlePaddle/Paddle/pull/63572),[#63794](https://github.com/PaddlePaddle/Paddle/pull/63794),[#63919](https://github.com/PaddlePaddle/Paddle/pull/63919),[#63980](https://github.com/PaddlePaddle/Paddle/pull/63980),[#62917](https://github.com/PaddlePaddle/Paddle/pull/62917),[#64451](https://github.com/PaddlePaddle/Paddle/pull/64451),[#63541](https://github.com/PaddlePaddle/Paddle/pull/63541),[#63703](https://github.com/PaddlePaddle/Paddle/pull/63703),[#64536](https://github.com/PaddlePaddle/Paddle/pull/64536),[#63264](https://github.com/PaddlePaddle/Paddle/pull/63264),[#63335](https://github.com/PaddlePaddle/Paddle/pull/63335),[#63841](https://github.com/PaddlePaddle/Paddle/pull/63841),[#64628](https://github.com/PaddlePaddle/Paddle/pull/64628),[#63419](https://github.com/PaddlePaddle/Paddle/pull/63419),[#62210](https://github.com/PaddlePaddle/Paddle/pull/62210),[#63557](https://github.com/PaddlePaddle/Paddle/pull/63557),[#63064](https://github.com/PaddlePaddle/Paddle/pull/63064),[#61442](https://github.com/PaddlePaddle/Paddle/pull/61442),[#63537](https://github.com/PaddlePaddle/Paddle/pull/63537),[#63839](https://github.com/PaddlePaddle/Paddle/pull/63839),[#60927](https://github.com/PaddlePaddle/Paddle/pull/60927),[#60566](https://github.com/PaddlePaddle/Paddle/pull/60566),[#60842](https://github.com/PaddlePaddle/Paddle/pull/60842),[#64612](https://github.com/PaddlePaddle/Paddle/pull/64612),[#60047](https://github.com/PaddlePaddle/Paddle/pull/60047),[#63898](https://github.com/PaddlePaddle/Paddle/pull/63898),[#60415](https://github.com/PaddlePaddle/Paddle/pull/60415),[#60474](https://github.com/PaddlePaddle/Paddle/pull/60474),[#60439](https://github.com/PaddlePaddle/Paddle/pull/60439),[
#60565](https://github.com/PaddlePaddle/Paddle/pull/60565),[#64414](https://github.com/PaddlePaddle/Paddle/pull/64414),[#62526](https://github.com/PaddlePaddle/Paddle/pull/62526),[#54183](https://github.com/PaddlePaddle/Paddle/pull/54183),[#64096](https://github.com/PaddlePaddle/Paddle/pull/64096),[#61325](https://github.com/PaddlePaddle/Paddle/pull/61325),[#60629](https://github.com/PaddlePaddle/Paddle/pull/60629),[#61051](https://github.com/PaddlePaddle/Paddle/pull/61051),[#62103](https://github.com/PaddlePaddle/Paddle/pull/62103),[#63594](https://github.com/PaddlePaddle/Paddle/pull/63594),[#60968](https://github.com/PaddlePaddle/Paddle/pull/60968),[#64613](https://github.com/PaddlePaddle/Paddle/pull/64613),[#64073](https://github.com/PaddlePaddle/Paddle/pull/64073),[#63816](https://github.com/PaddlePaddle/Paddle/pull/63816),[#64416](https://github.com/PaddlePaddle/Paddle/pull/64416),[#62499](https://github.com/PaddlePaddle/Paddle/pull/62499),[#64531](https://github.com/PaddlePaddle/Paddle/pull/64531),[#63827](https://github.com/PaddlePaddle/Paddle/pull/63827),[#59885](https://github.com/PaddlePaddle/Paddle/pull/59885),[#59949](https://github.com/PaddlePaddle/Paddle/pull/59949),[#63428](https://github.com/PaddlePaddle/Paddle/pull/63428),[#63218](https://github.com/PaddlePaddle/Paddle/pull/63218),[#63538](https://github.com/PaddlePaddle/Paddle/pull/63538),[#64497](https://github.com/PaddlePaddle/Paddle/pull/64497),[#63082](https://github.com/PaddlePaddle/Paddle/pull/63082),[#64395](https://github.com/PaddlePaddle/Paddle/pull/64395),[#60183](https://github.com/PaddlePaddle/Paddle/pull/60183),[#63691](https://github.com/PaddlePaddle/Paddle/pull/63691),[#64428](https://github.com/PaddlePaddle/Paddle/pull/64428),[#64648](https://github.com/PaddlePaddle/Paddle/pull/64648),[#64650](https://github.com/PaddlePaddle/Paddle/pull/64650),[#59926](https://github.com/PaddlePaddle/Paddle/pull/59926),[#59750](https://github.com/PaddlePaddle/Paddle/pull/59750),[#60080](https://gith
ub.com/PaddlePaddle/Paddle/pull/60080),[#60208](https://github.com/PaddlePaddle/Paddle/pull/60208),[#64124](https://github.com/PaddlePaddle/Paddle/pull/64124),[#64187](https://github.com/PaddlePaddle/Paddle/pull/64187),[#64166](https://github.com/PaddlePaddle/Paddle/pull/64166),[#64284](https://github.com/PaddlePaddle/Paddle/pull/64284),[#64253](https://github.com/PaddlePaddle/Paddle/pull/64253),[#64555](https://github.com/PaddlePaddle/Paddle/pull/64555),[#59878](https://github.com/PaddlePaddle/Paddle/pull/59878),[#64081](https://github.com/PaddlePaddle/Paddle/pull/64081)
+
+## 13. Contributors
+6clc, Android zhang, Asthestarsfalll, Ataf Fazledin Ahamed, Aurelius84, AyaseNana, Baizhou Zhang, bapijun, BiynXu, Botao Zhou, Bo Zhang, bukejiyu, caozhou, chalsliu, Chang Xu, Charles-hit, chen2016013, Chen Zhiyang, C.J.0_0, cmcamdy, co63oc, coco, cyber-pioneer, cyberslack_lee, danleifeng, diadestiny, Difer, Dmovic, Eddie-Wang, Eddie Zhang, engineer1109, enzodechine, fanhaoxuee, feifei-111, flying-forever, Frank Lin, freeliuzc, fsczz, Galaxy1458, GGBond8488, Ghost Screaming, gongweibao, gouzil, Guoxia Wang, handiz, HankYang, Haohongxiang, haosicheng, hess, hjyp, hong, Hongqing-work, Hongwen Xin, HongyuJia, houj04, huangjiyi, Huihuang Zheng, hxzd5568, hyDONG, HydrogenSulfate, idontkonwher, iLeGend, Jeng Bai-Cheng, Jianbang Yang, Jia Wenxuan, JYChen, jzhang533, JZ-LIANG, Kai Song, kangguangli, kevin, Kunbo Ding, lanxianghit, Leo Chen, Leo Guo, lijialin03, lijin23, linkk08, Liujie0926, Liuyinfeng, liu zhengxi, liuzhenhai93, liym27, LiYuRio, lizexu123, LoneRanger, Longzhi Wang, Lucas, Lu Qi, lzy, lzydev, MayYouBeProsperous, megemini, Meiyim, ming1753, Mingdong Wang, ndren, NeroLoh, NetPunk, Nguyen Cong Vinh, Nyakku Shigure, Omri Alon, onepick, ooo oo, pangengzheng, PommesPeter, Qi Li, QingshuChen, Qi Shao, RedContritio, Reese Wang, RichardWooSJTU, risemeup1, Roc, ronnywang, Ruibiao Chen, Ruibin Cheung, RuohengMa, Ryan, Shaopeng Ling, ShenLiang, Shijie, Shuhao Liang, Siming Dai, skywalker2012, smallpoxscattered, sneaxiy, Sonder, Sunny-bot1, Tao Luo, tc20042008, Terry, Tian, tianhaodongbd, tianshuo78520a, Tianyu Feng, Tian Zheng, Tongkai, Travis-Lee, unseenme, Vigi Zhang, walkalone20, Wang Bojun, wanghuancoder, wangna11BD, Wang Xin, Wangzheee, WangZhen, wanly young, wawltor, wendaxiao, Wen Sun, wentao yu, Wenyu, wenzhe.wang, Winters Montagne, winter-wang, WoWYoYLoL, Wu Chencan, Wu Fei, wuhuachaocoding, Xianduo Li, XiangGao, XiaociZhang, xiaoguoguo626807, xiaoxiaohehe001, Xiao Xiyuan, Xiaoxu Chen, xiaoyao0115, xiaoye, xingmingyyj, Xinyi_LI, Xinyu Yang, xiongkun, 
xuxinyi389, xysheng-baidu, yangguohao, YibLiu, Yichen Zhang, yinfan98, yinwei, Yiqun Liu, YKTian, Yuang Liu, Yuanle Liu, YuanRisheng, yuguo, yujun, yulangz, YUNSHEN XIE, zbt78, ZelinMa557, Zero Rains, Zeyu Chen, zhangbo9674, Zhang,Lirong, Zhang Ting, zhangyikun02, zhangyuqin1998, Zhan Rongrui, zhaohaixu, zhaoyingli, Zhenghai Zhang, zhengzhonghui, zhink, ZhouMengLei1999, zhouzj, zhupengyang, zhurou603, zhuyipin, zhwesky2010, Zichao, zxcd, zyfncg, zyt1024, 东百月, 傅剑寒, 周周周, 周波涛, 张春乔, 萧
diff --git a/docs/release_note_en.md b/docs/release_note_en.md
index 3e6ea0cd50f..0052edd54df 100644
--- a/docs/release_note_en.md
+++ b/docs/release_note_en.md
@@ -1,3495 +1,537 @@
-# 2.6.0 Release Note
+# 3.0 Beta Release Note
-## 1. Important Updates
+# Overview of PaddlePaddle 3.0 Beta
-- **Paddle New generation IR(PIR)** : In order to further improve scalability of the PaddlePaddle framework, we have developed a new generation intermediate representaion. It abstracts underlying core concepts of the PaddlePaddle framework, such as Operation, Attribute and Type, providing developers with flexible and efficient basic components. By introducing Dialect mechanism, PIR can comprehensively and hierarchically satisfy needs of each module for intermediate representations to greatly enhancing scalability of the framework. PIR strictly follows Static Single Assignment (SSA) principle, ensuring unity of top-level structure and harmonious coexistence of "operator sequentiality" and "computational graph semantics". In addition, PIR provides a more concise and low-cost Pass development process, with a series of built-in rich and functional Pass optimization strategies. It provides technical support for the ultimate performance optimization of large-scale models.
-- **Static graph construction and compiler Optimization Architecture**: In order to further improve performance of the framework, PaddlePaddle's dynamic to static training capability has been comprehensively upgraded to support adaptive graph construction capability. This has been tested on more than 700 PaddlePaddle industry-level models, with 100% success rate of one line code converter to start static training. Meanwhile, Compiler Infrastructure for Neural Networks (CINN) of PaddlePaddle framework is integrated into PaddlePaddle main Repo, making the compiler and PaddlePaddle more integrated. CINN completes architectural optimization and improvement of expansion capability, increasing system stability. Based on PIR framework, it is much more easied to bind dynamic to static, primitive operator, executor and compiler together, to providing more space for boosting overall performance of PaddlePaddle framework.
-- **Enhanced dynamic graph distributed capability**: Large models pose higher demands on the distributed training performance of framework. PaddlePaddle has comprehensive optimizations in dimensions of communication library, graph analysis, distributed strategy and task enable/disable, enhancing distributed computing capability of PaddlePaddle's dynamic graph and providing support for efficient training of large models. In terms of performance, training performance is further improved by reducing pipelined GPU memory occupation, adopting TensorFusion technology, implementing communication computation overlap, and reducing non-essential data synchronization copies. Meanwhile, flexibility of hybrid-parallel debugging is improved through environment variable control Optimizer. In addition, stability of system is significantly improved by fixing related Bugs.
-- **Auto parallel architecture with dynamic-static unification**: In order to further reduce difficulty of programming and optimizing large models, PaddlePaddle has fully optimized the Semi-Auto Parallel programming paradigm with dynamic-static unification, simplifying programming complexity for developers. Developers do not need to deeply understand complex concepts and APIs under the manual parallel programming paradigm, such as row-parallel, and column-parallel. They only need a small amount of tensor distribution annotations to implement the hybrid parallelism. The distribution specification will be propagated to all tensors and operators automatically, and the framework would handle the communication and synchronization needed by distributed training appropriately. Meanwhile, it supports dynamic-to-static distributed training by adding one extra code only, allowing developers to efficiently implement any mixed parallelism strategy and deeply simplify the development process of hybrid-parallel training paradigm.
-- **Hardware Integration Solution (CustomDevice)**: With increased demand for parallel training on new hardware in large model scenarios, PaddlePaddle has added support for distributed advanced policies, custom operators, and custom fusion policies. Distributed communication library is upgraded, with newly added support for many advanced distributed policies such as MP, GroupShared, PP, SP and MOE. Moreover, it supports vendors to flexibly access Transformer operator libraries of different granularities and modify the computation graph through Fusion Pass for performance acceleration.
-- **Installation and development experience**: use of modular compilation optimizes logics of CMake codes, and improves efficiency of PaddlePaddle full compilation and incremental compilation. In addition, this can increase efficiency of RD development. It supports Python3.12, CUDA12, Hopper architecture compilation, with introduction of Clang and other tools to fully optimize code formats. In addition, C++ is changed from linking static libraries to linking dynamic libraries to reduce compilation volume. These optimizations provide users with a smoother and more efficient installation and development experience.
+The core features of this version are new technologies such as dynamic-static unified auto parallel and automatic optimization by the neural network compiler, which aim to address new challenges in the current deep learning field. PaddlePaddle Framework 3.0 Beta extends the design concepts of 2.x, such as dynamic-static unification and integrated training and inference, and its development interface is fully compatible with 2.x. This means that code developed for 2.x can, in most cases, run directly on 3.x without modification. Several key features are detailed below:
-## 2. Incompatible Upgrade
+- Dynamic-static unified auto parallel: To make parallel training of large models easier to program, PaddlePaddle has optimized the semi-auto parallel programming paradigm with dynamic-static unification. Developers do not need to delve into the complex concepts and APIs of manual parallel programming; a small amount of tensor sharding annotation is enough to build hybrid parallelism for large models. The framework automatically derives the distributed sharding states and adds communication operators, and it also supports one-click dynamic-to-static distributed training, which dramatically simplifies the development of hybrid parallel training code. In terms of dynamic-static unification, PaddlePaddle has comprehensively upgraded its dynamic-to-static training capability by adopting bytecode-based dynamic-to-static conversion, which supports adaptive graph construction. It has been verified on more than 700 PaddlePaddle industrial-grade models, with a 100% success rate for one-click dynamic-to-static training.
+- Automatic optimization by the neural network compiler: The PaddlePaddle Compiler Infrastructure for Neural Networks (CINN) is designed to be integrated with the framework and supports efficient training and dynamic-shape inference for generative models, scientific computing models, and other models, striking a good balance between computational flexibility and high performance. Through automatic operator fusion and code generation, the inference performance of the Llama2 and Stable Diffusion models has been improved by 30%.
+- High-order automatic differentiation: To better support scientific computing scenarios, PaddlePaddle designs and implements high-order automatic differentiation based on the combinatorial operator mechanism, combined with the automatic optimization of the neural network compiler. We have tested more than 40 differential equations in scientific computing scenarios, and the solving speed is 70% ahead of similar products in the industry.
+- Highly scalable intermediate representation: To improve the scalability of the PaddlePaddle framework, we have developed a highly scalable Paddle Intermediate Representation (PIR). This representation systematically abstracts the underlying core concepts and provides flexible and efficient components. PIR serves as the infrastructure for a number of technologies, such as dynamic-to-static, automatic differentiation, auto parallel, combinatorial operators, and graph optimization, and it is widely used in scenarios such as distributed training, model compression, and inference deployment. With the Declarative Rewrite Rule (DRR) mechanism provided by PIR, the development cost of a Pass can be reduced by 60%. We have tested over 900 model configurations, and the results show that overall inference performance improves by more than 10% after using PIR.
+- Multi-hardware adaptation: PaddlePaddle provides a full-featured, low-cost solution for adapting hardware to large models. New hardware only needs to adapt more than 30 interfaces to support training, compression, and inference of large models. Meanwhile, PaddlePaddle provides a compiler-based hardware access mode: hardware vendors only need to implement the compiler's code generation back end as a plug-in to adapt efficiently to the PaddlePaddle framework. This release also adds daily release support for four hardware platforms: Kunlun XPU, Ascend NPU, Hygon DCU and Cambricon MLU.
-- In order to avoid misuse, we removed the 0-dimensional Tensor compatibility state switch, to achieve the same API behaviors as industry's mainstream habits. In the previous version, we already supported 0-dimensional Tensor, but we added a compatibility state switch in order to avoid error reporting of some models, as much as possible. That is, in some scenarios where model suite is used frequently and modification is not completed, we still used 1-dimensional Tensor with only 1 element to replace the 0-dimensional Tensor by default. In this version, compatibility state switch is removed, so the 1-dimensional Tensor with only 1 element will no longer be used, to replace 0-dimensional Tensor in any scenario. Behaviors of 376 APIs that should support the 0-dimensional Tensor have been corrected and unified, to thoroughly complete support for the 0-dimensional Tensor.[#57036](https://github.com/PaddlePaddle/Paddle/pull/57036), [#54581](https://github.com/PaddlePaddle/Paddle/pull/54581), [#54500](https://github.com/PaddlePaddle/Paddle/pull/54500)
-- To improve API usability, paddle.nn.functional.diag_embed has been streamlined to paddle.diag_embed, with support of use of Tensor.diag_embed. [#58223](https://github.com/PaddlePaddle/Paddle/pull/58223)
-- In order to solve the problem of differential computation error caused by Tensor index writing (e.g., tensor[0] = 10) under static graphs, and to comply with static graph specifications, this version introduces paddle.static.setitem API. In static graph environments, this API is recommended to support indexed write operations for tensor, instead of subscript operators. This change does not affect dynamic graph environments, where index write using subscript operators are still allowed. [#53682](https://github.com/PaddlePaddle/Paddle/pull/53682)
-- paddle.fluid API is completely retired in this version. In this update, we completely removed all paddle.fluid APIs and deleted the fluid directory. Meanwhile, a small number of PaddlePaddle underlying public components have been consolidated into the paddle.base directory. It is unnecessary for PaddlePaddle users to pay attention to fluid-related concepts and APIs, further simplifying PaddlePaddle API system and improving readability.[#56576](https://github.com/PaddlePaddle/Paddle/pull/56576), [#54424](https://github.com/PaddlePaddle/Paddle/pull/54424), [#54829](https://github.com/PaddlePaddle/Paddle/pull/54829), [#53992](https://github.com/PaddlePaddle/Paddle/pull/53992), [#54806](https://github.com/PaddlePaddle/Paddle/pull/54806), [#55754](https://github.com/PaddlePaddle/Paddle/pull/55754), [#55986](https://github.com/PaddlePaddle/Paddle/pull/55986), [#55345](https://github.com/PaddlePaddle/Paddle/pull/55345), [#56099](https://github.com/PaddlePaddle/Paddle/pull/56099), [#51717](https://github.com/PaddlePaddle/Paddle/pull/51717), [#54152](https://github.com/PaddlePaddle/Paddle/pull/54152), [#55522](https://github.com/PaddlePaddle/Paddle/pull/55522), [#55757](https://github.com/PaddlePaddle/Paddle/pull/55757), [#58521](https://github.com/PaddlePaddle/Paddle/pull/58521), [#54936](https://github.com/PaddlePaddle/Paddle/pull/54936), [#55007](https://github.com/PaddlePaddle/Paddle/pull/55007), [#55661](https://github.com/PaddlePaddle/Paddle/pull/55661), [#55970](https://github.com/PaddlePaddle/Paddle/pull/55970)
+This version also continues to improve some existing features of framework 2.x, while the new features bring significant improvements in user experience, performance, ease of secondary development, and hardware adaptability. In addition to the core features above, this version continues to enrich and enhance API functionality to cover more scenarios; optimizes distributed parallel strategies and enhances inference functions for large model scenarios; thoroughly improves the ease of use of compilation and installation; synchronously upgrades the installation methods and the versions of dependency packages; comprehensively strengthens system security; and performs comprehensive error-correction checks on the product documentation. We have also cleaned up some deprecated code to keep the architecture simple. PaddlePaddle 3.0 Beta remains mature and stable when the new features are not used, and each new feature provides a switch for flexible control, making it easy for users to learn about the related features and compare the experience.
-## 3. Training Framework (including Distributed)
+## User Experience Upgrade
-### Python API
+### Incompatibility Upgrade
-#### Upgrade Tensor indexing mechanism
+- PaddlePaddle APIs support type promotion. In the most common calculations, such as addition, subtraction, multiplication, and division, the output data type must be determined when the two inputs have different data types. Historically, PaddlePaddle supported this only partially and the actual rules were unclear. In practice, this showed up as dynamic-static inconsistency, inconsistency between APIs and operator overloading, and results that violated commutativity; unexpected and hard-to-fix problems were especially likely in large models that mix bf16/fp16 with fp32 across a wide range of calculations. Starting from 3.0 Beta, PaddlePaddle has clarified the [type promotion rules](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/advanced/auto_type_promotion_cn.html) and defined in detail the result types of Tensor-Tensor and Tensor-Scalar (a single number) computations, ensuring that computation is commutative, that operator overloading is consistent with the results of the binary APIs, and that dynamic graph results are consistent with static graph results. This is more in line with user expectations and industry practice. [#60638](https://github.com/PaddlePaddle/Paddle/pull/60638), [#63842](https://github.com/PaddlePaddle/Paddle/pull/63842), [#60011](https://github.com/PaddlePaddle/Paddle/pull/60011)
-This version comprehensively optimizes basic index, advanced index and joint index functions of Tensor, to better comply with industry standards and user habits. Specifically, we added support for view in basic index, fixed some wrong behaviors in advanced index, and implemented read function of joint index. In addition, we have sunk index parsing to C++ level, improved performance of high-level indexing operators, and removed redundant computations in bool indexing. With these optimizations, performance of Tensor's basic, advanced and joint index has been improved comprehensively. [#56893](https://github.com/PaddlePaddle/Paddle/pull/56893), [#58643](https://github.com/PaddlePaddle/Paddle/pull/58643), [#57986](https://github.com/PaddlePaddle/Paddle/pull/57986), [#56272](https://github.com/PaddlePaddle/Paddle/pull/56272), [#58856](https://github.com/PaddlePaddle/Paddle/pull/58856), [#55211](https://github.com/PaddlePaddle/Paddle/pull/55211), [#57023](https://github.com/PaddlePaddle/Paddle/pull/57023), [#56613](https://github.com/PaddlePaddle/Paddle/pull/56613), [#55602](https://github.com/PaddlePaddle/Paddle/pull/55602), [#59281](https://github.com/PaddlePaddle/Paddle/pull/59281), [#57737](https://github.com/PaddlePaddle/Paddle/pull/57737)
+### Deprecated Features
-#### Upgrade Inplace mechanism
+- 0-dimensional Tensors have been stably supported for two versions. This version removes the switch `FLAGS_set_to_1d`, which in some cases converted a 0-dimensional Tensor to a 1-dimensional Tensor with only 1 element. The switch existed for compatibility with some kits that incorrectly used a 1-element 1-dimensional Tensor to represent a 0-dimensional Tensor. PaddlePaddle now fully distinguishes the semantics of a 0-dimensional Tensor from those of a 1-dimensional Tensor with only 1 element; the two are not equivalent. [#61227](https://github.com/PaddlePaddle/Paddle/pull/61227)
-In earlier versions, in order to ensure correctness of inverse differentiation calculations, when reverse calculation of an API depends on its forward input data, PaddlePaddle avoids using Inplace operation method, with possibly overwriting original input data. This mechanism simplifies implementation process, and also limits the ability of many APIs to implement Inplace functionality. As a result, user experience may be affected.
-In this version, PaddlePaddle has fully upgraded the Inplace mechanism. It implements automatic detection of the dependency of reverse computation on forward inputs, to save input data when needed. Therefore, more Inplace operations are supported. This improvement not only improves memory usage efficiency, but also enhances functionality of the API.
-In addition, we have added 109 new APIs that support Inplace operations, including paddle.abs\_, paddle.sin\_/cos\_/tan\_, comparison operations such as paddle.greater_than\_/less_than\_/equal\_, logical operations such as paddle.logical_and\_/logical_or\_/logical_not\_, paddle.neg\_ and paddle.log\_. While enriching the feature set of PaddlePaddle, it improves users' efficiency and convenience in numerical computation and deep learning tasks. [#54683](https://github.com/PaddlePaddle/Paddle/pull/54683), [#55078](https://github.com/PaddlePaddle/Paddle/pull/55078), [#55576](https://github.com/PaddlePaddle/Paddle/pull/55576), [#56888](https://github.com/PaddlePaddle/Paddle/pull/56888), [#55509](https://github.com/PaddlePaddle/Paddle/pull/55509), [#57093](https://github.com/PaddlePaddle/Paddle/pull/57093)
+### New API Features
-#### Other new APIs
+Compared with the previous version, this version adds 126 new APIs, enriching API functionality to better support the needs of large models and scientific computing. The details are as follows:
-- Added paddle.nn.functional.scaled_dot_product_attention. This significantly improves computational efficiency of the attention mechanism in large models, and meets demand for high-performance computation in large-scale deep learning models. [#55242](https://github.com/PaddlePaddle/Paddle/pull/55242)
-- Added a series of new scientific computing-related APIs, including paddle.cummax and paddle.cummin for cumulative maximum and minimum computation, paddle.index_fill and paddle.masked_fill for filling a tensor by index or mask, paddle.linalg.pca_lowrank for low-rank principal component analysis, paddle.hypot for calculating the length of the hypotenuse of a right triangle, and paddle.atleast_1d, paddle.atleast_2d, and paddle.atleast_3d to ensure that a tensor has at least one, two, or three dimensions. We also provide paddle.select_scatter and paddle.diagonal_scatter for more flexible selection and scattering of tensor data, and paddle.multigammaln for computing the natural logarithm of the multigamma function. In addition, this version adds paddle.optimizer.lr.LinearLR and paddle.optimizer.lr.CosineAnnealingWarmRestarts for learning rate scheduling, and paddle.io.SubsetRandomSampler to support random sampling from a subset of data. These APIs further enhance the flexibility and efficiency of PaddlePaddle in various application scenarios. [#57416](https://github.com/PaddlePaddle/Paddle/pull/57416), [#53546](https://github.com/PaddlePaddle/Paddle/pull/53546), [#53743](https://github.com/PaddlePaddle/Paddle/pull/53743), [#57295](https://github.com/PaddlePaddle/Paddle/pull/57295), [#57726](https://github.com/PaddlePaddle/Paddle/pull/57726), [#58764](https://github.com/PaddlePaddle/Paddle/pull/58764), [#58323](https://github.com/PaddlePaddle/Paddle/pull/58323), [#57720](https://github.com/PaddlePaddle/Paddle/pull/57720), [#58209](https://github.com/PaddlePaddle/Paddle/pull/58209), [#58214](https://github.com/PaddlePaddle/Paddle/pull/58214), [#57792](https://github.com/PaddlePaddle/Paddle/pull/57792), [#51395](https://github.com/PaddlePaddle/Paddle/pull/51395), [#57724](https://github.com/PaddlePaddle/Paddle/pull/57724), [#57355](https://github.com/PaddlePaddle/Paddle/pull/57355), [#57744](https://github.com/PaddlePaddle/Paddle/pull/57744), [#58244](https://github.com/PaddlePaddle/Paddle/pull/58244), [#57599](https://github.com/PaddlePaddle/Paddle/pull/57599), [#59343](https://github.com/PaddlePaddle/Paddle/pull/59343), [#57879](https://github.com/PaddlePaddle/Paddle/pull/57879)
+- Add Tensor computation APIs: `paddle.gammaln`, `paddle.gammainc`, `paddle.gammaincc`, `paddle.sinc`, `paddle.pdist`, `paddle.histogramdd`, `paddle.signbit`, `paddle.copysign`, `paddle.bitwise_right_shift/bitwise_left_shift`, `paddle.isposinf/isneginf/isreal`, `paddle.isin`, `paddle.hsplit/dsplit`, `paddle.column_stack/row_stack/dstack/hstack/vstack`, `paddle.slice_scatter`, `paddle.masked_scatter` [#60553](https://github.com/PaddlePaddle/Paddle/pull/60553), [#59311](https://github.com/PaddlePaddle/Paddle/pull/59311), [#59357](https://github.com/PaddlePaddle/Paddle/pull/59357), [#63521](https://github.com/PaddlePaddle/Paddle/pull/63521), [#57869](https://github.com/PaddlePaddle/Paddle/pull/57869), [#57880](https://github.com/PaddlePaddle/Paddle/pull/57880), [#57882](https://github.com/PaddlePaddle/Paddle/pull/57882), [#60150](https://github.com/PaddlePaddle/Paddle/pull/60150), [#57785](https://github.com/PaddlePaddle/Paddle/pull/57785), [#58092](https://github.com/PaddlePaddle/Paddle/pull/58092), [#63523](https://github.com/PaddlePaddle/Paddle/pull/63523), [#64001](https://github.com/PaddlePaddle/Paddle/pull/64001), [#58917](https://github.com/PaddlePaddle/Paddle/pull/58917), [#59127](https://github.com/PaddlePaddle/Paddle/pull/59127), [#59973](https://github.com/PaddlePaddle/Paddle/pull/59973), [#59383](https://github.com/PaddlePaddle/Paddle/pull/59383)
+- Add probability distribution APIs: `paddle.distribution.ContinuousBernoulli`, `paddle.distribution.MultivariateNormal`, `paddle.distribution.Exponential`, `paddle.distribution.Gamma`, `paddle.distribution.Binomial`, `paddle.distribution.Poisson` [#58004](https://github.com/PaddlePaddle/Paddle/pull/58004), [#57899](https://github.com/PaddlePaddle/Paddle/pull/57899), [#57856](https://github.com/PaddlePaddle/Paddle/pull/57856)
+- Add optimizer APIs: `paddle.optimizer.ASGD`, `paddle.optimizer.NAdam`, `paddle.optimizer.RAdam`, `paddle.optimizer.Rprop` [#58834](https://github.com/PaddlePaddle/Paddle/pull/58834), [#63671](https://github.com/PaddlePaddle/Paddle/pull/63671), [#58851](https://github.com/PaddlePaddle/Paddle/pull/58851)
+- Add linear algebra API: `paddle.linalg.matrix_exp` [#59715](https://github.com/PaddlePaddle/Paddle/pull/59715)
+- Add other APIs: `paddle.bernoulli_`, `paddle.nn.ZeroPad1D/ZeroPad3D`, `paddle.nn.AdaptiveLogSoftmaxWithLoss`, `paddle.Tensor.apply` [#64252](https://github.com/PaddlePaddle/Paddle/pull/64252), [#59690](https://github.com/PaddlePaddle/Paddle/pull/59690), [#63728](https://github.com/PaddlePaddle/Paddle/pull/63728), [#63302](https://github.com/PaddlePaddle/Paddle/pull/63302), [#59374](https://github.com/PaddlePaddle/Paddle/pull/59374), [#63227](https://github.com/PaddlePaddle/Paddle/pull/63227)
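As a rough guide to what some of the scalar math functions in the list above compute, the sketch below reproduces the usual mathematical definitions with the Python standard library only. It does not call Paddle. The helper names `sinc` and `signbit` are hypothetical stand-ins, and it is an assumption here that `paddle.sinc` uses the normalized form and that the Paddle APIs apply these definitions element-wise.

```python
import math

# Reference definitions built on the standard library (no Paddle calls).
# Assumption: the Paddle APIs of the same name apply these element-wise.

def sinc(x: float) -> float:
    """Normalized sinc: sin(pi*x) / (pi*x), with sinc(0) defined as 1."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)

def signbit(x: float) -> bool:
    """True when the sign bit of x is set; this includes -0.0."""
    return math.copysign(1.0, x) < 0.0

# ln(Gamma(5)) = ln(4!) = ln(24), the quantity a "gammaln" function returns.
log_gamma_5 = math.lgamma(5.0)

# copysign keeps the magnitude of the first argument and the sign of the second.
flipped = math.copysign(3.0, -0.0)
```

Values like these can serve as a sanity check when comparing against tensor results, provided the assumptions above hold for the installed Paddle version.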
-### New Generation of Paddle Intermediate Representation (PIR)
+### Some API Enhancements
-PIR systematically abstracts underlying core concepts such as Operation, Attribute, and Type, building a set of flexible and powerful base components for developers. In addition, by introducing the concept of Dialect, PaddlePaddle can manage the Intermediate Representation (IR) requirements of each module comprehensively and hierarchically, and developers can customize and extend Dialects according to specific needs, which significantly improves the scalability and adaptability of the framework. In terms of design, PIR strictly follows the Static Single Assignment (SSA) principle and unifies the top-level structure, making the "operator sequentiality" and "computational graph semantics" representations compatible. This provides a clear and consistent view of complex computation processes. To further optimize the performance of large models, PIR also provides a more concise and low-cost Pass development process, including Declarative Rewrite Rule (DRR) and Pattern Rewriter. In addition, a series of full-featured Pass optimization strategies are built in to optimize applications according to the characteristics of large models, providing strong support for their performance. Through these designs and optimization methods, PIR lays a solid foundation for the efficient operation and continued expansion of the PaddlePaddle framework.
+- Enhance about 30 APIs to support complex number computation, such as `paddle.log`, `paddle.log1p`, `paddle.square`, and `paddle.reciprocal`, to extend the support for more scientific computing scenarios. [#62448](https://github.com/PaddlePaddle/Paddle/pull/62448), [#60821](https://github.com/PaddlePaddle/Paddle/pull/60821), [#60897](https://github.com/PaddlePaddle/Paddle/pull/60897), [#62764](https://github.com/PaddlePaddle/Paddle/pull/62764), [#59536](https://github.com/PaddlePaddle/Paddle/pull/59536), [#59529](https://github.com/PaddlePaddle/Paddle/pull/59529), [#63207](https://github.com/PaddlePaddle/Paddle/pull/63207), [#62237](https://github.com/PaddlePaddle/Paddle/pull/62237), [#64684](https://github.com/PaddlePaddle/Paddle/pull/64684)
+- Enhance 46 APIs to make them easier to use and to simplify code conversion, including but not limited to adding API parameters, extending the data types supported by the APIs, and correcting unreasonable aspects of existing designs. [#59890](https://github.com/PaddlePaddle/Paddle/pull/59890), [#63513](https://github.com/PaddlePaddle/Paddle/pull/63513), [#59674](https://github.com/PaddlePaddle/Paddle/pull/59674), [#62778](https://github.com/PaddlePaddle/Paddle/pull/62778), [#64110](https://github.com/PaddlePaddle/Paddle/pull/64110), [#63222](https://github.com/PaddlePaddle/Paddle/pull/63222), [#64331](https://github.com/PaddlePaddle/Paddle/pull/64331), [#64715](https://github.com/PaddlePaddle/Paddle/pull/64715), [#61155](https://github.com/PaddlePaddle/Paddle/pull/61155), [#60070](https://github.com/PaddlePaddle/Paddle/pull/60070), [#61974](https://github.com/PaddlePaddle/Paddle/pull/61974), [#62407](https://github.com/PaddlePaddle/Paddle/pull/62407), [#62672](https://github.com/PaddlePaddle/Paddle/pull/62672), [#62722](https://github.com/PaddlePaddle/Paddle/pull/62722), [#62876](https://github.com/PaddlePaddle/Paddle/pull/62876), [#63284](https://github.com/PaddlePaddle/Paddle/pull/63284), [#63860](https://github.com/PaddlePaddle/Paddle/pull/63860), [#60466](https://github.com/PaddlePaddle/Paddle/pull/60466), [#63690](https://github.com/PaddlePaddle/Paddle/pull/63690), [#63953](https://github.com/PaddlePaddle/Paddle/pull/63953), [#63901](https://github.com/PaddlePaddle/Paddle/pull/63901), [#62624](https://github.com/PaddlePaddle/Paddle/pull/62624), [#59857](https://github.com/PaddlePaddle/Paddle/pull/59857), [#60084](https://github.com/PaddlePaddle/Paddle/pull/60084), [#60766](https://github.com/PaddlePaddle/Paddle/pull/60766), [#62788](https://github.com/PaddlePaddle/Paddle/pull/62788), [#62937](https://github.com/PaddlePaddle/Paddle/pull/62937), [#63134](https://github.com/PaddlePaddle/Paddle/pull/63134), [#62966](https://github.com/PaddlePaddle/Paddle/pull/62966), [#63648](https://github.com/PaddlePaddle/Paddle/pull/63648), [#63881](https://github.com/PaddlePaddle/Paddle/pull/63881), [#64358](https://github.com/PaddlePaddle/Paddle/pull/64358), [#60503](https://github.com/PaddlePaddle/Paddle/pull/60503), [#63604](https://github.com/PaddlePaddle/Paddle/pull/63604), [#62338](https://github.com/PaddlePaddle/Paddle/pull/62338)
+- Enhance the unit-test infrastructure for higher-order differentiation, making it easier to add unit-test cases for higher-order differentiation. [#62074](https://github.com/PaddlePaddle/Paddle/pull/62074)
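To make the complex-number support mentioned in this list more concrete, here is a small sketch of the values that functions such as log, log1p, square, and reciprocal take on a complex input. It uses only Python's `cmath` module, not Paddle. The input `z` is an arbitrary illustrative value, and it is an assumption that the Paddle APIs return the principal branch of the logarithm in the same way.

```python
import cmath

# Illustrative complex input; |z| = 5 makes the results easy to check by hand.
z = complex(3.0, 4.0)

# Complex logarithm (principal branch): ln|z| + i*arg(z).
log_z = cmath.log(z)

# log1p extended to complex inputs: ln(1 + z).
log1p_z = cmath.log(1 + z)

# Square: (3 + 4i)^2 = 9 - 16 + 24i = -7 + 24i.
square_z = z * z

# Reciprocal: conj(z) / |z|^2 = (3 - 4i) / 25 = 0.12 - 0.16i.
reciprocal_z = 1 / z
```

Reference values like these are convenient for spot-checking element-wise results on complex tensors.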
-#### New features
+### API Performance Improvements
-- Abstracted core concepts of IR bottom layer and provided developers with flexible base components, such as Operation, Attribute, Value, Type, Trait, and Interface. [#56354](https://github.com/PaddlePaddle/Paddle/pull/56354),[#57106](https://github.com/PaddlePaddle/Paddle/pull/57106),[#57349](https://github.com/PaddlePaddle/Paddle/pull/57349),[#54844](https://github.com/PaddlePaddle/Paddle/pull/54844),[#54984](https://github.com/PaddlePaddle/Paddle/pull/54984),[#54565](https://github.com/PaddlePaddle/Paddle/pull/54565),[#54562](https://github.com/PaddlePaddle/Paddle/pull/54562),[#57249](https://github.com/PaddlePaddle/Paddle/pull/57249),[#57550](https://github.com/PaddlePaddle/Paddle/pull/57550),[#59278](https://github.com/PaddlePaddle/Paddle/pull/59278),[#54875](https://github.com/PaddlePaddle/Paddle/pull/54875),[#55041](https://github.com/PaddlePaddle/Paddle/pull/55041),[#54987](https://github.com/PaddlePaddle/Paddle/pull/54987),[#55903](https://github.com/PaddlePaddle/Paddle/pull/55903),[#57582](https://github.com/PaddlePaddle/Paddle/pull/57582),[#57580](https://github.com/PaddlePaddle/Paddle/pull/57580),[#58052](https://github.com/PaddlePaddle/Paddle/pull/58052),[#55322](https://github.com/PaddlePaddle/Paddle/pull/55322),[#57418](https://github.com/PaddlePaddle/Paddle/pull/57418),[#57635](https://github.com/PaddlePaddle/Paddle/pull/57635),[#55328](https://github.com/PaddlePaddle/Paddle/pull/55328),[#57463](https://github.com/PaddlePaddle/Paddle/pull/57463),[#59791](https://github.com/PaddlePaddle/Paddle/pull/59791),[#59821](https://github.com/PaddlePaddle/Paddle/pull/59821),[#59115](https://github.com/PaddlePaddle/Paddle/pull/59115),[#57461](https://github.com/PaddlePaddle/Paddle/pull/57461),[#59392](https://github.com/PaddlePaddle/Paddle/pull/59392),[#57373](https://github.com/PaddlePaddle/Paddle/pull/57373),[#59118](https://github.com/PaddlePaddle/Paddle/pull/59118)
-- Added Dialect mechanism to support comprehensive and hierarchical management of intermediate representation requirements of each module of framework. Through built-in Builtin Dialect, it supports developers to customize and extend Dialect according to their needs. [#56325](https://github.com/PaddlePaddle/Paddle/pull/56325),[#57539](https://github.com/PaddlePaddle/Paddle/pull/57539),[#54682](https://github.com/PaddlePaddle/Paddle/pull/54682),[#55381](https://github.com/PaddlePaddle/Paddle/pull/55381),[#56156](https://github.com/PaddlePaddle/Paddle/pull/56156),[#56431](https://github.com/PaddlePaddle/Paddle/pull/56431),[#56615](https://github.com/PaddlePaddle/Paddle/pull/56615),[#57103](https://github.com/PaddlePaddle/Paddle/pull/57103),[#57209](https://github.com/PaddlePaddle/Paddle/pull/57209)
-- Normalized the PaddlePaddle static graph operator system by adding OperatorDialect and KernelDialect, which manage the conceptual differences between operators at compile time and at execution time in the form of Dialects, making the architecture clearer. [#56284](https://github.com/PaddlePaddle/Paddle/pull/56284),[#54469](https://github.com/PaddlePaddle/Paddle/pull/54469),[#58660](https://github.com/PaddlePaddle/Paddle/pull/58660),[#58975](https://github.com/PaddlePaddle/Paddle/pull/58975),[#56680](https://github.com/PaddlePaddle/Paddle/pull/56680),[#54790](https://github.com/PaddlePaddle/Paddle/pull/54790),[#54826](https://github.com/PaddlePaddle/Paddle/pull/54826),[#54840](https://github.com/PaddlePaddle/Paddle/pull/54840),[#55699](https://github.com/PaddlePaddle/Paddle/pull/55699),[#55648](https://github.com/PaddlePaddle/Paddle/pull/55648),[#55880](https://github.com/PaddlePaddle/Paddle/pull/55880),[#56101](https://github.com/PaddlePaddle/Paddle/pull/56101),[#56754](https://github.com/PaddlePaddle/Paddle/pull/56754),[#54944](https://github.com/PaddlePaddle/Paddle/pull/54944),[#56836](https://github.com/PaddlePaddle/Paddle/pull/56836),[#57185](https://github.com/PaddlePaddle/Paddle/pull/57185),[#58757](https://github.com/PaddlePaddle/Paddle/pull/58757),[#56243](https://github.com/PaddlePaddle/Paddle/pull/56243),[#56436](https://github.com/PaddlePaddle/Paddle/pull/56436),[#57741](https://github.com/PaddlePaddle/Paddle/pull/57741),[#59124](https://github.com/PaddlePaddle/Paddle/pull/59124),[#57054](https://github.com/PaddlePaddle/Paddle/pull/57054),[#56984](https://github.com/PaddlePaddle/Paddle/pull/56984),[#57403](https://github.com/PaddlePaddle/Paddle/pull/57403),[#57904](https://github.com/PaddlePaddle/Paddle/pull/57904),[#58031](https://github.com/PaddlePaddle/Paddle/pull/58031),[#56924](https://github.com/PaddlePaddle/Paddle/pull/56924),[#59270](https://github.com/PaddlePaddle/Paddle/pull/59270),[#55343](https://github.com/PaddlePaddle/Paddle/pull/55343),[#56557](https://github.com/PaddlePaddle/Paddle/pull/56557),[#55693](https://github.com/PaddlePaddle/Paddle/pull/55693),[#54428](https://github.com/PaddlePaddle/Paddle/pull/54428)
-- Added ShapeDialect with built-in rich shape operation operators for constructing dynamic shape constraints and expressions for AI compilers. [#56727](https://github.com/PaddlePaddle/Paddle/pull/56727),[#59254](https://github.com/PaddlePaddle/Paddle/pull/59254),[#58368](https://github.com/PaddlePaddle/Paddle/pull/58368),[#57069](https://github.com/PaddlePaddle/Paddle/pull/57069),[#57337](https://github.com/PaddlePaddle/Paddle/pull/57337),[#56351](https://github.com/PaddlePaddle/Paddle/pull/56351),[#57029](https://github.com/PaddlePaddle/Paddle/pull/57029),[#58036](https://github.com/PaddlePaddle/Paddle/pull/58036),[#59032](https://github.com/PaddlePaddle/Paddle/pull/59032),[#57961](https://github.com/PaddlePaddle/Paddle/pull/57961),[#56427](https://github.com/PaddlePaddle/Paddle/pull/56427),[#57459](https://github.com/PaddlePaddle/Paddle/pull/57459)
-- Unified top-level structure of Framework Program, supporting compatible representation of "operator sequentiality" and "computational graph semantics", decoupling dependency on ir::Graph, and strictly following the principle of Static Single Assignment (SSA). [#59369](https://github.com/PaddlePaddle/Paddle/pull/59369),[#54563](https://github.com/PaddlePaddle/Paddle/pull/54563),[#57051](https://github.com/PaddlePaddle/Paddle/pull/57051),[#57306](https://github.com/PaddlePaddle/Paddle/pull/57306),[#57857](https://github.com/PaddlePaddle/Paddle/pull/57857)
-- Added IrPrinter and IrParser components to support serialization and deserialization of PIR Programs, providing a friendly debugging experience for PIR development. [#55695](https://github.com/PaddlePaddle/Paddle/pull/55695),[#59449](https://github.com/PaddlePaddle/Paddle/pull/59449),[#54369](https://github.com/PaddlePaddle/Paddle/pull/54369),[#54499](https://github.com/PaddlePaddle/Paddle/pull/54499),[#55518](https://github.com/PaddlePaddle/Paddle/pull/55518),[#55784](https://github.com/PaddlePaddle/Paddle/pull/55784),[#57180](https://github.com/PaddlePaddle/Paddle/pull/57180),[#57471](https://github.com/PaddlePaddle/Paddle/pull/57471),[#54859](https://github.com/PaddlePaddle/Paddle/pull/54859),[#54968](https://github.com/PaddlePaddle/Paddle/pull/54968),[#55209](https://github.com/PaddlePaddle/Paddle/pull/55209),[#57314](https://github.com/PaddlePaddle/Paddle/pull/57314),[#57969](https://github.com/PaddlePaddle/Paddle/pull/57969)
-- Built a new, simple, and low-cost Pass development system based on Declarative Rewrite Rule (DRR) and Pattern Rewriter, with a series of built-in, full-featured Pass optimization strategies to accelerate training and inference execution. [#54385](https://github.com/PaddlePaddle/Paddle/pull/54385),[#54738](https://github.com/PaddlePaddle/Paddle/pull/54738),[#55859](https://github.com/PaddlePaddle/Paddle/pull/55859),[#56638](https://github.com/PaddlePaddle/Paddle/pull/56638),[#57090](https://github.com/PaddlePaddle/Paddle/pull/57090),[#58673](https://github.com/PaddlePaddle/Paddle/pull/58673),[#59415](https://github.com/PaddlePaddle/Paddle/pull/59415),[#56729](https://github.com/PaddlePaddle/Paddle/pull/56729),[#58655](https://github.com/PaddlePaddle/Paddle/pull/58655)
-- Added the ProgramTranslator component to support one-step conversion from ProgramDesc to PaddlePaddle's new-generation IR representation, with easy-to-use C++ and Python interfaces. [#55433](https://github.com/PaddlePaddle/Paddle/pull/55433),[#54470](https://github.com/PaddlePaddle/Paddle/pull/54470),[#58044](https://github.com/PaddlePaddle/Paddle/pull/58044),[#58390](https://github.com/PaddlePaddle/Paddle/pull/58390),[#58100](https://github.com/PaddlePaddle/Paddle/pull/58100),[#55403](https://github.com/PaddlePaddle/Paddle/pull/55403),[#55406](https://github.com/PaddlePaddle/Paddle/pull/55406),[#54719](https://github.com/PaddlePaddle/Paddle/pull/54719),[#56550](https://github.com/PaddlePaddle/Paddle/pull/56550),[#55448](https://github.com/PaddlePaddle/Paddle/pull/55448),[#55453](https://github.com/PaddlePaddle/Paddle/pull/55453),[#56294](https://github.com/PaddlePaddle/Paddle/pull/56294),[#56308](https://github.com/PaddlePaddle/Paddle/pull/56308),[#56842](https://github.com/PaddlePaddle/Paddle/pull/56842),[#58517](https://github.com/PaddlePaddle/Paddle/pull/58517)
-- With help of automatic code generation technology, it can generate the full amount of static graph operator representations for PaddlePaddle framework by pressing one key. Sank static graph networking logic to C++ side and bind it to \_C_ops module. This can greatly streamline API code on Python side, realize ultimate dynamic-static unification of APIs of PaddlePaddle Framework, and upgrade a lot of Python APIs to support static graph networking of the new IR. [#56570](https://github.com/PaddlePaddle/Paddle/pull/56570),[#55745](https://github.com/PaddlePaddle/Paddle/pull/55745),[#56955](https://github.com/PaddlePaddle/Paddle/pull/56955),[#57298](https://github.com/PaddlePaddle/Paddle/pull/57298),[#57946](https://github.com/PaddlePaddle/Paddle/pull/57946),[#57248](https://github.com/PaddlePaddle/Paddle/pull/57248),[#56080](https://github.com/PaddlePaddle/Paddle/pull/56080),[#54396](https://github.com/PaddlePaddle/Paddle/pull/54396),[#54551](https://github.com/PaddlePaddle/Paddle/pull/54551),[#56520](https://github.com/PaddlePaddle/Paddle/pull/56520),[#55002](https://github.com/PaddlePaddle/Paddle/pull/55002),[#57067](https://github.com/PaddlePaddle/Paddle/pull/57067),[#59320](https://github.com/PaddlePaddle/Paddle/pull/59320),[#59348](https://github.com/PaddlePaddle/Paddle/pull/59348),[#57164](https://github.com/PaddlePaddle/Paddle/pull/57164),[#57267](https://github.com/PaddlePaddle/Paddle/pull/57267),[#59064](https://github.com/PaddlePaddle/Paddle/pull/59064),[#54340](https://github.com/PaddlePaddle/Paddle/pull/54340),[#54895](https://github.com/PaddlePaddle/Paddle/pull/54895),[#55004](https://github.com/PaddlePaddle/Paddle/pull/55004),[#56196](https://github.com/PaddlePaddle/Paddle/pull/56196),[#56862](https://github.com/PaddlePaddle/Paddle/pull/56862),[#58991](https://github.com/PaddlePaddle/Paddle/pull/58991),[#55428](https://github.com/PaddlePaddle/Paddle/pull/55428),[#55909](https://github.com/PaddlePaddle/Paddle/pull/55909),[#56241](https://github.com/Padd
lePaddle/Paddle/pull/56241),[#56526](https://github.com/PaddlePaddle/Paddle/pull/56526),[#56571](https://github.com/PaddlePaddle/Paddle/pull/56571),[#56518](https://github.com/PaddlePaddle/Paddle/pull/56518),[#57016](https://github.com/PaddlePaddle/Paddle/pull/57016),[#56653](https://github.com/PaddlePaddle/Paddle/pull/56653),[#56809](https://github.com/PaddlePaddle/Paddle/pull/56809),[#57158](https://github.com/PaddlePaddle/Paddle/pull/57158),[#55422](https://github.com/PaddlePaddle/Paddle/pull/55422),[#55458](https://github.com/PaddlePaddle/Paddle/pull/55458),[#55432](https://github.com/PaddlePaddle/Paddle/pull/55432),[#55467](https://github.com/PaddlePaddle/Paddle/pull/55467),[#55483](https://github.com/PaddlePaddle/Paddle/pull/55483),[#55419](https://github.com/PaddlePaddle/Paddle/pull/55419),[#55517](https://github.com/PaddlePaddle/Paddle/pull/55517),[#55500](https://github.com/PaddlePaddle/Paddle/pull/55500),[#56674](https://github.com/PaddlePaddle/Paddle/pull/56674),[#57693](https://github.com/PaddlePaddle/Paddle/pull/57693),[#55008](https://github.com/PaddlePaddle/Paddle/pull/55008),[#57166](https://github.com/PaddlePaddle/Paddle/pull/57166),[#57157](https://github.com/PaddlePaddle/Paddle/pull/57157),[#57159](https://github.com/PaddlePaddle/Paddle/pull/57159),[#57175](https://github.com/PaddlePaddle/Paddle/pull/57175),[#57325](https://github.com/PaddlePaddle/Paddle/pull/57325),[#57330](https://github.com/PaddlePaddle/Paddle/pull/57330),[#57415](https://github.com/PaddlePaddle/Paddle/pull/57415),[#57122](https://github.com/PaddlePaddle/Paddle/pull/57122),[#57393](https://github.com/PaddlePaddle/Paddle/pull/57393),[#57344](https://github.com/PaddlePaddle/Paddle/pull/57344),[#57667](https://github.com/PaddlePaddle/Paddle/pull/57667),[#57348](https://github.com/PaddlePaddle/Paddle/pull/57348),[#57700](https://github.com/PaddlePaddle/Paddle/pull/57700),[#58093](https://github.com/PaddlePaddle/Paddle/pull/58093),[#58005](https://github.com/PaddlePaddle/Paddle/pull
/58005),[#58081](https://github.com/PaddlePaddle/Paddle/pull/58081),[#58094](https://github.com/PaddlePaddle/Paddle/pull/58094),[#58137](https://github.com/PaddlePaddle/Paddle/pull/58137),[#58287](https://github.com/PaddlePaddle/Paddle/pull/58287),[#58352](https://github.com/PaddlePaddle/Paddle/pull/58352),[#58340](https://github.com/PaddlePaddle/Paddle/pull/58340),[#58363](https://github.com/PaddlePaddle/Paddle/pull/58363),[#58331](https://github.com/PaddlePaddle/Paddle/pull/58331),[#58343](https://github.com/PaddlePaddle/Paddle/pull/58343),[#58317](https://github.com/PaddlePaddle/Paddle/pull/58317),[#58450](https://github.com/PaddlePaddle/Paddle/pull/58450),[#58377](https://github.com/PaddlePaddle/Paddle/pull/58377),[#58466](https://github.com/PaddlePaddle/Paddle/pull/58466),[#58470](https://github.com/PaddlePaddle/Paddle/pull/58470),[#58491](https://github.com/PaddlePaddle/Paddle/pull/58491),[#58546](https://github.com/PaddlePaddle/Paddle/pull/58546),[#58587](https://github.com/PaddlePaddle/Paddle/pull/58587),[#58453](https://github.com/PaddlePaddle/Paddle/pull/58453),[#58634](https://github.com/PaddlePaddle/Paddle/pull/58634),[#58604](https://github.com/PaddlePaddle/Paddle/pull/58604),[#58605](https://github.com/PaddlePaddle/Paddle/pull/58605),[#58593](https://github.com/PaddlePaddle/Paddle/pull/58593),[#58675](https://github.com/PaddlePaddle/Paddle/pull/58675),[#58699](https://github.com/PaddlePaddle/Paddle/pull/58699),[#58384](https://github.com/PaddlePaddle/Paddle/pull/58384),[#58629](https://github.com/PaddlePaddle/Paddle/pull/58629),[#58579](https://github.com/PaddlePaddle/Paddle/pull/58579),[#58695](https://github.com/PaddlePaddle/Paddle/pull/58695),[#58548](https://github.com/PaddlePaddle/Paddle/pull/58548),[#58688](https://github.com/PaddlePaddle/Paddle/pull/58688),[#58792](https://github.com/PaddlePaddle/Paddle/pull/58792),[#58843](https://github.com/PaddlePaddle/Paddle/pull/58843),[#58840](https://github.com/PaddlePaddle/Paddle/pull/58840),[#58718](htt
ps://github.com/PaddlePaddle/Paddle/pull/58718),[#58883](https://github.com/PaddlePaddle/Paddle/pull/58883),[#58785](https://github.com/PaddlePaddle/Paddle/pull/58785),[#58608](https://github.com/PaddlePaddle/Paddle/pull/58608),[#58781](https://github.com/PaddlePaddle/Paddle/pull/58781),[#58783](https://github.com/PaddlePaddle/Paddle/pull/58783),[#58429](https://github.com/PaddlePaddle/Paddle/pull/58429),[#58685](https://github.com/PaddlePaddle/Paddle/pull/58685),[#58696](https://github.com/PaddlePaddle/Paddle/pull/58696),[#58690](https://github.com/PaddlePaddle/Paddle/pull/58690),[#58831](https://github.com/PaddlePaddle/Paddle/pull/58831),[#58929](https://github.com/PaddlePaddle/Paddle/pull/58929),[#58740](https://github.com/PaddlePaddle/Paddle/pull/58740),[#58937](https://github.com/PaddlePaddle/Paddle/pull/58937),[#58782](https://github.com/PaddlePaddle/Paddle/pull/58782),[#58833](https://github.com/PaddlePaddle/Paddle/pull/58833),[#58882](https://github.com/PaddlePaddle/Paddle/pull/58882),[#58935](https://github.com/PaddlePaddle/Paddle/pull/58935),[#58931](https://github.com/PaddlePaddle/Paddle/pull/58931),[#59041](https://github.com/PaddlePaddle/Paddle/pull/59041),[#59040](https://github.com/PaddlePaddle/Paddle/pull/59040),[#58877](https://github.com/PaddlePaddle/Paddle/pull/58877),[#58888](https://github.com/PaddlePaddle/Paddle/pull/58888),[#59042](https://github.com/PaddlePaddle/Paddle/pull/59042),[#58780](https://github.com/PaddlePaddle/Paddle/pull/58780),[#58682](https://github.com/PaddlePaddle/Paddle/pull/58682),[#58815](https://github.com/PaddlePaddle/Paddle/pull/58815),[#58676](https://github.com/PaddlePaddle/Paddle/pull/58676),[#58678](https://github.com/PaddlePaddle/Paddle/pull/58678),[#58446](https://github.com/PaddlePaddle/Paddle/pull/58446),[#59077](https://github.com/PaddlePaddle/Paddle/pull/59077),[#59091](https://github.com/PaddlePaddle/Paddle/pull/59091),[#58661](https://github.com/PaddlePaddle/Paddle/pull/58661),[#58832](https://github.com/Padd
lePaddle/Paddle/pull/58832),[#58642](https://github.com/PaddlePaddle/Paddle/pull/58642),[#58698](https://github.com/PaddlePaddle/Paddle/pull/58698),[#59313](https://github.com/PaddlePaddle/Paddle/pull/59313),[#59371](https://github.com/PaddlePaddle/Paddle/pull/59371),[#58700](https://github.com/PaddlePaddle/Paddle/pull/58700),[#58953](https://github.com/PaddlePaddle/Paddle/pull/58953),[#58879](https://github.com/PaddlePaddle/Paddle/pull/58879),[#59469](https://github.com/PaddlePaddle/Paddle/pull/59469),[#59573](https://github.com/PaddlePaddle/Paddle/pull/59573),[#59481](https://github.com/PaddlePaddle/Paddle/pull/59481),[#59419](https://github.com/PaddlePaddle/Paddle/pull/59419),[#59509](https://github.com/PaddlePaddle/Paddle/pull/59509),[#58735](https://github.com/PaddlePaddle/Paddle/pull/58735),[#59616](https://github.com/PaddlePaddle/Paddle/pull/59616),[#59582](https://github.com/PaddlePaddle/Paddle/pull/59582),[#59420](https://github.com/PaddlePaddle/Paddle/pull/59420),[#59500](https://github.com/PaddlePaddle/Paddle/pull/59500),[#58911](https://github.com/PaddlePaddle/Paddle/pull/58911),[#59535](https://github.com/PaddlePaddle/Paddle/pull/59535),[#54891](https://github.com/PaddlePaddle/Paddle/pull/54891),[#56794](https://github.com/PaddlePaddle/Paddle/pull/56794),[#57477](https://github.com/PaddlePaddle/Paddle/pull/57477),[#57929](https://github.com/PaddlePaddle/Paddle/pull/57929),[#57765](https://github.com/PaddlePaddle/Paddle/pull/57765),[#58693](https://github.com/PaddlePaddle/Paddle/pull/58693),[#58603](https://github.com/PaddlePaddle/Paddle/pull/58603),[#56291](https://github.com/PaddlePaddle/Paddle/pull/56291),[#57123](https://github.com/PaddlePaddle/Paddle/pull/57123),[#57317](https://github.com/PaddlePaddle/Paddle/pull/57317),[#57341](https://github.com/PaddlePaddle/Paddle/pull/57341),[#57020](https://github.com/PaddlePaddle/Paddle/pull/57020),[#57324](https://github.com/PaddlePaddle/Paddle/pull/57324),[#57761](https://github.com/PaddlePaddle/Paddle/pull
/57761),[#57762](https://github.com/PaddlePaddle/Paddle/pull/57762),[#57907](https://github.com/PaddlePaddle/Paddle/pull/57907),[#57909](https://github.com/PaddlePaddle/Paddle/pull/57909),[#58099](https://github.com/PaddlePaddle/Paddle/pull/58099),[#58110](https://github.com/PaddlePaddle/Paddle/pull/58110),[#58114](https://github.com/PaddlePaddle/Paddle/pull/58114),[#58139](https://github.com/PaddlePaddle/Paddle/pull/58139),[#58144](https://github.com/PaddlePaddle/Paddle/pull/58144),[#58165](https://github.com/PaddlePaddle/Paddle/pull/58165),[#58194](https://github.com/PaddlePaddle/Paddle/pull/58194),[#58138](https://github.com/PaddlePaddle/Paddle/pull/58138),[#58113](https://github.com/PaddlePaddle/Paddle/pull/58113),[#58245](https://github.com/PaddlePaddle/Paddle/pull/58245),[#58318](https://github.com/PaddlePaddle/Paddle/pull/58318),[#58105](https://github.com/PaddlePaddle/Paddle/pull/58105),[#58348](https://github.com/PaddlePaddle/Paddle/pull/58348),[#58235](https://github.com/PaddlePaddle/Paddle/pull/58235),[#58354](https://github.com/PaddlePaddle/Paddle/pull/58354),[#58341](https://github.com/PaddlePaddle/Paddle/pull/58341),[#58445](https://github.com/PaddlePaddle/Paddle/pull/58445),[#58418](https://github.com/PaddlePaddle/Paddle/pull/58418),[#58239](https://github.com/PaddlePaddle/Paddle/pull/58239),[#58473](https://github.com/PaddlePaddle/Paddle/pull/58473),[#58239](https://github.com/PaddlePaddle/Paddle/pull/58239),[#58391](https://github.com/PaddlePaddle/Paddle/pull/58391),[#58501](https://github.com/PaddlePaddle/Paddle/pull/58501),[#58519](https://github.com/PaddlePaddle/Paddle/pull/58519),[#58416](https://github.com/PaddlePaddle/Paddle/pull/58416),[#58588](https://github.com/PaddlePaddle/Paddle/pull/58588),[#58531](https://github.com/PaddlePaddle/Paddle/pull/58531),[#58730](https://github.com/PaddlePaddle/Paddle/pull/58730),[#58773](https://github.com/PaddlePaddle/Paddle/pull/58773),[#58862](https://github.com/PaddlePaddle/Paddle/pull/58862),[#58946](htt
ps://github.com/PaddlePaddle/Paddle/pull/58946),[#58500](https://github.com/PaddlePaddle/Paddle/pull/58500),[#56585](https://github.com/PaddlePaddle/Paddle/pull/56585),[#57480](https://github.com/PaddlePaddle/Paddle/pull/57480),[#57433](https://github.com/PaddlePaddle/Paddle/pull/57433),[#58498](https://github.com/PaddlePaddle/Paddle/pull/58498)
+- Optimized the performance of Tensor basic indexing, advanced indexing, and combined indexing, improving computational performance by 2X to 31X on GPUs and 1.8X to 1004X on CPUs. [#60254](https://github.com/PaddlePaddle/Paddle/pull/60254), [#60276](https://github.com/PaddlePaddle/Paddle/pull/60276), [#60452](https://github.com/PaddlePaddle/Paddle/pull/60452), [#60771](https://github.com/PaddlePaddle/Paddle/pull/60771), [#61021](https://github.com/PaddlePaddle/Paddle/pull/61021), [#60983](https://github.com/PaddlePaddle/Paddle/pull/60983), [#61060](https://github.com/PaddlePaddle/Paddle/pull/61060), [#60618](https://github.com/PaddlePaddle/Paddle/pull/60618)
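The three indexing kinds named above can be told apart with a small pure-Python sketch on nested lists. It does not use Paddle or measure performance. The tensor-style expressions in the comments (such as `x[[2, 0], ::2]`) assume NumPy-style indexing semantics, and the variable names are illustrative only.

```python
# A 3x4 "tensor" stored as nested lists; each value equals its flat position.
data = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11]]

# Basic indexing: integers and slices only, like x[0:2, 1:3].
basic = [row[1:3] for row in data[0:2]]

# Advanced indexing: an index list selects (and may reorder) rows, like x[[2, 0]].
row_ids = [2, 0]
advanced = [data[i] for i in row_ids]

# Combined indexing: an index list on one axis and a slice on another,
# like x[[2, 0], ::2].
combined = [data[i][::2] for i in row_ids]
```

Basic indexing only describes a regular sub-block, while advanced and combined indexing gather arbitrary positions, which is why the latter are typically the more expensive cases to optimize.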
-#### Function optimization
+### Bug Fixes
-- Upgraded the static graph executor with more Kernel Instruction types and support for loading PIR and scheduling its execution efficiently, which brings significant GPU memory and performance gains in training and inference. [#54570](https://github.com/PaddlePaddle/Paddle/pull/54570),[#58665](https://github.com/PaddlePaddle/Paddle/pull/58665),[#57291](https://github.com/PaddlePaddle/Paddle/pull/57291),[#54452](https://github.com/PaddlePaddle/Paddle/pull/54452),[#57431](https://github.com/PaddlePaddle/Paddle/pull/57431),[#54692](https://github.com/PaddlePaddle/Paddle/pull/54692),[#55112](https://github.com/PaddlePaddle/Paddle/pull/55112),[#55210](https://github.com/PaddlePaddle/Paddle/pull/55210),[#55401](https://github.com/PaddlePaddle/Paddle/pull/55401),[#55772](https://github.com/PaddlePaddle/Paddle/pull/55772),[#55828](https://github.com/PaddlePaddle/Paddle/pull/55828),[#56148](https://github.com/PaddlePaddle/Paddle/pull/56148),[#54763](https://github.com/PaddlePaddle/Paddle/pull/54763),[#56886](https://github.com/PaddlePaddle/Paddle/pull/56886),[#57284](https://github.com/PaddlePaddle/Paddle/pull/57284),[#57268](https://github.com/PaddlePaddle/Paddle/pull/57268),[#57791](https://github.com/PaddlePaddle/Paddle/pull/57791),[#56789](https://github.com/PaddlePaddle/Paddle/pull/56789),[#56704](https://github.com/PaddlePaddle/Paddle/pull/56704),[#57594](https://github.com/PaddlePaddle/Paddle/pull/57594),[#58397](https://github.com/PaddlePaddle/Paddle/pull/58397),[#58337](https://github.com/PaddlePaddle/Paddle/pull/58337),[#58756](https://github.com/PaddlePaddle/Paddle/pull/58756),[#58371](https://github.com/PaddlePaddle/Paddle/pull/58371)
-- Reconstructed auto-differentiation module for PIR, migrate and adapted the high-order auto-differentiation function. Optimized Stop Gradient transfer mechanism, so logic is clearer and function is more robust. [#55660](https://github.com/PaddlePaddle/Paddle/pull/55660),[#57084](https://github.com/PaddlePaddle/Paddle/pull/57084),[#56890](https://github.com/PaddlePaddle/Paddle/pull/56890),[#58942](https://github.com/PaddlePaddle/Paddle/pull/58942),[#59373](https://github.com/PaddlePaddle/Paddle/pull/59373),[#57206](https://github.com/PaddlePaddle/Paddle/pull/57206),[#58145](https://github.com/PaddlePaddle/Paddle/pull/58145),[#55235](https://github.com/PaddlePaddle/Paddle/pull/55235),[#57255](https://github.com/PaddlePaddle/Paddle/pull/57255),[#56925](https://github.com/PaddlePaddle/Paddle/pull/56925),[#55957](https://github.com/PaddlePaddle/Paddle/pull/55957),[#56163](https://github.com/PaddlePaddle/Paddle/pull/56163),[#56316](https://github.com/PaddlePaddle/Paddle/pull/56316),[#57294](https://github.com/PaddlePaddle/Paddle/pull/57294),[#57449](https://github.com/PaddlePaddle/Paddle/pull/57449),[#59520](https://github.com/PaddlePaddle/Paddle/pull/59520),[#59565](https://github.com/PaddlePaddle/Paddle/pull/59565),[#56265](https://github.com/PaddlePaddle/Paddle/pull/56265),[#56512](https://github.com/PaddlePaddle/Paddle/pull/56512),[#56650](https://github.com/PaddlePaddle/Paddle/pull/56650),[#57183](https://github.com/PaddlePaddle/Paddle/pull/57183),[#57956](https://github.com/PaddlePaddle/Paddle/pull/57956),[#59100](https://github.com/PaddlePaddle/Paddle/pull/59100)
-- Optimized design and representation of control flow forward and reverse operators, introduced ControlFlow Dialect, and supported conversion and execution from control flow operators to PIR under ProgramDesc. [#58729](https://github.com/PaddlePaddle/Paddle/pull/58729),[#57364](https://github.com/PaddlePaddle/Paddle/pull/57364),[#58625](https://github.com/PaddlePaddle/Paddle/pull/58625),[#57475](https://github.com/PaddlePaddle/Paddle/pull/57475),[#57265](https://github.com/PaddlePaddle/Paddle/pull/57265),[#56799](https://github.com/PaddlePaddle/Paddle/pull/56799),[#59033](https://github.com/PaddlePaddle/Paddle/pull/59033),[#57342](https://github.com/PaddlePaddle/Paddle/pull/57342),[#57801](https://github.com/PaddlePaddle/Paddle/pull/57801),[#57958](https://github.com/PaddlePaddle/Paddle/pull/57958),[#57949](https://github.com/PaddlePaddle/Paddle/pull/57949),[#57937](https://github.com/PaddlePaddle/Paddle/pull/57937),[#59231](https://github.com/PaddlePaddle/Paddle/pull/59231),[#59496](https://github.com/PaddlePaddle/Paddle/pull/59496),[#59321](https://github.com/PaddlePaddle/Paddle/pull/59321),[#58088](https://github.com/PaddlePaddle/Paddle/pull/58088),[#58198](https://github.com/PaddlePaddle/Paddle/pull/58198),[#58024](https://github.com/PaddlePaddle/Paddle/pull/58024),[#58089](https://github.com/PaddlePaddle/Paddle/pull/58089),[#58086](https://github.com/PaddlePaddle/Paddle/pull/58086),[#59175](https://github.com/PaddlePaddle/Paddle/pull/59175),[#59423](https://github.com/PaddlePaddle/Paddle/pull/59423),[#59567](https://github.com/PaddlePaddle/Paddle/pull/59567),[#58098](https://github.com/PaddlePaddle/Paddle/pull/58098),[#58163](https://github.com/PaddlePaddle/Paddle/pull/58163),[#58250](https://github.com/PaddlePaddle/Paddle/pull/58250),[#58277](https://github.com/PaddlePaddle/Paddle/pull/58277),[#58355](https://github.com/PaddlePaddle/Paddle/pull/58355),[#59020](https://github.com/PaddlePaddle/Paddle/pull/59020),[#59200](https://github.com/PaddlePaddle/Paddle/pull/59200),[#59585](https://github.com/PaddlePaddle/Paddle/pull/59585),[#58109](https://github.com/PaddlePaddle/Paddle/pull/58109)
-- Upgraded dynamic to static execution flow to support PIR, optimized dynamic to static subgraph Pass mechanism, and supported users to try and use functions in the PIR system under the @to_static function. [#57566](https://github.com/PaddlePaddle/Paddle/pull/57566),[#55620](https://github.com/PaddlePaddle/Paddle/pull/55620),[#56791](https://github.com/PaddlePaddle/Paddle/pull/56791),[#57357](https://github.com/PaddlePaddle/Paddle/pull/57357),[#59152](https://github.com/PaddlePaddle/Paddle/pull/59152),[#59312](https://github.com/PaddlePaddle/Paddle/pull/59312),[#58630](https://github.com/PaddlePaddle/Paddle/pull/58630),[#56035](https://github.com/PaddlePaddle/Paddle/pull/56035),[#59447](https://github.com/PaddlePaddle/Paddle/pull/59447),[#57361](https://github.com/PaddlePaddle/Paddle/pull/57361),[#59261](https://github.com/PaddlePaddle/Paddle/pull/59261),[#59774](https://github.com/PaddlePaddle/Paddle/pull/59774)
-- Upgraded combination operator function with introducing the concept of Backend to manage logic of combination operator module of dynamic and static graphs in a hierarchical way. Sank necessary components and operator splitting rules into C++, to dramatically reduce maintenance costs. [#58153](https://github.com/PaddlePaddle/Paddle/pull/58153),[#56391](https://github.com/PaddlePaddle/Paddle/pull/56391),[#56614](https://github.com/PaddlePaddle/Paddle/pull/56614),[#57030](https://github.com/PaddlePaddle/Paddle/pull/57030),[#57554](https://github.com/PaddlePaddle/Paddle/pull/57554),[#58018](https://github.com/PaddlePaddle/Paddle/pull/58018),[#58130](https://github.com/PaddlePaddle/Paddle/pull/58130),[#58581](https://github.com/PaddlePaddle/Paddle/pull/58581),[#58679](https://github.com/PaddlePaddle/Paddle/pull/58679),[#59054](https://github.com/PaddlePaddle/Paddle/pull/59054),[#55480](https://github.com/PaddlePaddle/Paddle/pull/55480),[#58451](https://github.com/PaddlePaddle/Paddle/pull/58451),[#55647](https://github.com/PaddlePaddle/Paddle/pull/55647),[#56342](https://github.com/PaddlePaddle/Paddle/pull/56342),[#56798](https://github.com/PaddlePaddle/Paddle/pull/56798),[#57561](https://github.com/PaddlePaddle/Paddle/pull/57561),[#58023](https://github.com/PaddlePaddle/Paddle/pull/58023),[#57722](https://github.com/PaddlePaddle/Paddle/pull/57722)
+- Fix errors in `paddle.optimizer.LBFGS` caused by using non-Tensor computations [#60219](https://github.com/PaddlePaddle/Paddle/pull/60219)
+- Fix the problem that random numbers could not be made deterministic in `paddle.optimizer.LBFGS` [#60591](https://github.com/PaddlePaddle/Paddle/pull/60591)
+- Fix the incorrect gradient calculation of the `set_value` operator [#59034](https://github.com/PaddlePaddle/Paddle/pull/59034)
+- Fix problems in adapting Tensor basic indexing to PIR [#60259](https://github.com/PaddlePaddle/Paddle/pull/60259), [#61103](https://github.com/PaddlePaddle/Paddle/pull/61103)
+- Fix a Tensor combined index assignment [problem](https://github.com/PaddlePaddle/Paddle/issues/60376) [#60447](https://github.com/PaddlePaddle/Paddle/pull/60447)
+- Fix a problem that occurs when reading values through a Tensor combined index [#61922](https://github.com/PaddlePaddle/Paddle/pull/61922)
+- Fix the stride calculation error in `paddle.flatten`, which makes it possible to add `paddle.flatten_` [#63084](https://github.com/PaddlePaddle/Paddle/pull/63084)
+- Fix inconsistent results between `paddle.index_fill` and `paddle.index_fill_` [#59863](https://github.com/PaddlePaddle/Paddle/pull/59863)
+- Fix an error raised by `paddle.masked_scatter` [#60835](https://github.com/PaddlePaddle/Paddle/pull/60835)
+- Fix an error raised by `paddle.histogramdd` on CPU [#61891](https://github.com/PaddlePaddle/Paddle/pull/61891)
+- Fix the bug where repeated use of `paddle.cast_` on CPU leads to incorrect results [#60054](https://github.com/PaddlePaddle/Paddle/pull/60054)
+- Fix a `paddle.put_along_axis` bug that occurs when the input size is very large [#60551](https://github.com/PaddlePaddle/Paddle/pull/60551)
+- Fix an error raised by `paddle.nanmedian` on CPU [#63221](https://github.com/PaddlePaddle/Paddle/pull/63221)
+- Fix the bug that `paddle.median` does not support inputs other than floating-point types in the min branch. [#64444](https://github.com/PaddlePaddle/Paddle/pull/64444)
+- Fix the dataloader issue in distributed scenarios. [#62696](https://github.com/PaddlePaddle/Paddle/pull/62696), [#63378](https://github.com/PaddlePaddle/Paddle/pull/63378)
+- Fix the formatting issue in error messages [#63106](https://github.com/PaddlePaddle/Paddle/pull/63106), [#63144](https://github.com/PaddlePaddle/Paddle/pull/63144)
+- Fix the format issue under GLOG_v>=6. [#63345](https://github.com/PaddlePaddle/Paddle/pull/63345)
-#### Performance optimization
+### Security Improvements
-- Added PIR Program operators such as DCE and constant_folding_pass, and structure-optimized Pass. [#54935](https://github.com/PaddlePaddle/Paddle/pull/54935),[#59430](https://github.com/PaddlePaddle/Paddle/pull/59430),[#58753](https://github.com/PaddlePaddle/Paddle/pull/58753),[#58732](https://github.com/PaddlePaddle/Paddle/pull/58732)
+- Enhance the checking of parent_ids [#62826](https://github.com/PaddlePaddle/Paddle/pull/62826)
-2. Added optimization operators fusing class Pass, such as fused_attention, fused_dropout_add, fused_gemm_epilogue_pass, fused_linear_param_grad_add_pass, fused_weight_only_linear_pass, and fused_softmax_mask_upper_triangle, to improve training and inference performance. [#57557](https://github.com/PaddlePaddle/Paddle/pull/57557),[#58272](https://github.com/PaddlePaddle/Paddle/pull/58272),[#58188](https://github.com/PaddlePaddle/Paddle/pull/58188),[#58401](https://github.com/PaddlePaddle/Paddle/pull/58401),[#59366](https://github.com/PaddlePaddle/Paddle/pull/59366),[#57655](https://github.com/PaddlePaddle/Paddle/pull/57655),[#57360](https://github.com/PaddlePaddle/Paddle/pull/57360),[#56672](https://github.com/PaddlePaddle/Paddle/pull/56672),[#58537](https://github.com/PaddlePaddle/Paddle/pull/58537),[#56247](https://github.com/PaddlePaddle/Paddle/pull/56247),[#59391](https://github.com/PaddlePaddle/Paddle/pull/59391),[#58897](https://github.com/PaddlePaddle/Paddle/pull/58897),[#54933](https://github.com/PaddlePaddle/Paddle/pull/54933)
+## Basic Execution Architecture
-### Dynamic to static capability enhancement
+PIR's basic functions have been comprehensively upgraded and improved, and its maturity has increased substantially. Building on PIR makes the design of the PaddlePaddle infrastructure more sound, ensuring strong performance and good scalability of the framework. In this version, we completed verification of PIR in multiple scenarios: for single-machine scenarios, we completed PIR back-end switching in dynamic-to-static scenarios; for inference scenarios, we completed verification of all stock models, and 84.2% of the models show a gain of 10%+; we also completed verification of distributed scenarios based on PIR. Meanwhile, based on PIR, we completed the development and validation of core modules such as control flow, backward logic, save/load, and OneDNN adaptation, which lays a solid foundation for switching PaddlePaddle to PIR as the default mode. The functional completeness, execution efficiency, and stability of the PaddlePaddle operator system are further improved, bringing a better usage and development experience to developers.
-Dynamic to static graph conversion is a key technology in deep learning frameworks. It allows developers to find the best balance between flexibility and training efficiency. This version of PaddlePaddle has fully upgraded core Dynamic to Static functionality. Success rate of dynamic to static training is up to 100% among 700+ models in PaddlePaddle industry-grade model library.
+### Function Optimization
-#### New features
+- Improve the basic functions of PIR, including basic type system enhancement, debugging, printing, Pass development, and AMP support, to enhance the development efficiency of PIR. [#60723](https://github.com/PaddlePaddle/Paddle/pull/60723), [#60677](https://github.com/PaddlePaddle/Paddle/pull/60677), [#60783](https://github.com/PaddlePaddle/Paddle/pull/60783), [#60798](https://github.com/PaddlePaddle/Paddle/pull/60798), [#61053](https://github.com/PaddlePaddle/Paddle/pull/61053), [#61366](https://github.com/PaddlePaddle/Paddle/pull/61366), [#61446](https://github.com/PaddlePaddle/Paddle/pull/61446), [#60024](https://github.com/PaddlePaddle/Paddle/pull/60024), [#59939](https://github.com/PaddlePaddle/Paddle/pull/59939), [#63376](https://github.com/PaddlePaddle/Paddle/pull/63376), [#61853](https://github.com/PaddlePaddle/Paddle/pull/61853), [#63914](https://github.com/PaddlePaddle/Paddle/pull/63914), [#60170](https://github.com/PaddlePaddle/Paddle/pull/60170), [#60678](https://github.com/PaddlePaddle/Paddle/pull/60678), [#64093](https://github.com/PaddlePaddle/Paddle/pull/64093), [#64065](https://github.com/PaddlePaddle/Paddle/pull/64065), [#62451](https://github.com/PaddlePaddle/Paddle/pull/62451), [#59784](https://github.com/PaddlePaddle/Paddle/pull/59784), [#60136](https://github.com/PaddlePaddle/Paddle/pull/60136), [#63336](https://github.com/PaddlePaddle/Paddle/pull/63336), [#62108](https://github.com/PaddlePaddle/Paddle/pull/62108), [#60860](https://github.com/PaddlePaddle/Paddle/pull/60860), [#60536](https://github.com/PaddlePaddle/Paddle/pull/60536), [#60590](https://github.com/PaddlePaddle/Paddle/pull/60590), [#60752](https://github.com/PaddlePaddle/Paddle/pull/60752), [#61435](https://github.com/PaddlePaddle/Paddle/pull/61435), [#62977](https://github.com/PaddlePaddle/Paddle/pull/62977), [#62139](https://github.com/PaddlePaddle/Paddle/pull/62139), [#60432](https://github.com/PaddlePaddle/Paddle/pull/60432), [#61452](https://github.com/PaddlePaddle/Paddle/pull/61452), [#61978](https://github.com/PaddlePaddle/Paddle/pull/61978), [#62262](https://github.com/PaddlePaddle/Paddle/pull/62262), [#62422](https://github.com/PaddlePaddle/Paddle/pull/62422), [#60359](https://github.com/PaddlePaddle/Paddle/pull/60359), [#62989](https://github.com/PaddlePaddle/Paddle/pull/62989), [#61297](https://github.com/PaddlePaddle/Paddle/pull/61297), [#61399](https://github.com/PaddlePaddle/Paddle/pull/61399), [#61871](https://github.com/PaddlePaddle/Paddle/pull/61871), [#61496](https://github.com/PaddlePaddle/Paddle/pull/61496), [#62413](https://github.com/PaddlePaddle/Paddle/pull/62413)
+- Optimize the execution logic of the PaddlePaddle executor, improve the Pass system, and enhance training and inference performance, to better support running distributed parallel logic. [#60182](https://github.com/PaddlePaddle/Paddle/pull/60182), [#60516](https://github.com/PaddlePaddle/Paddle/pull/60516), [#63573](https://github.com/PaddlePaddle/Paddle/pull/63573), [#60181](https://github.com/PaddlePaddle/Paddle/pull/60181), [#59792](https://github.com/PaddlePaddle/Paddle/pull/59792), [#62025](https://github.com/PaddlePaddle/Paddle/pull/62025), [#61160](https://github.com/PaddlePaddle/Paddle/pull/61160), [#61188](https://github.com/PaddlePaddle/Paddle/pull/61188), [#61277](https://github.com/PaddlePaddle/Paddle/pull/61277), [#61669](https://github.com/PaddlePaddle/Paddle/pull/61669), [#60823](https://github.com/PaddlePaddle/Paddle/pull/60823), [#61310](https://github.com/PaddlePaddle/Paddle/pull/61310), [#60892](https://github.com/PaddlePaddle/Paddle/pull/60892), [#60578](https://github.com/PaddlePaddle/Paddle/pull/60578), [#61657](https://github.com/PaddlePaddle/Paddle/pull/61657), [#62638](https://github.com/PaddlePaddle/Paddle/pull/62638), [#63960](https://github.com/PaddlePaddle/Paddle/pull/63960), [#64234](https://github.com/PaddlePaddle/Paddle/pull/64234)
-- Adopted Python Eval Frame and VM simulation execution technology to innovatively implement an adaptive Graph Break mechanism. This mechanism is especially designed for control flow scenarios. By introducing the CallLayer mechanism, it makes full use of the advantage of PaddlePaddle dynamic-static unification motion. Support hybrid mode of Abstract Syntax Tree (AST) and bytecode simulation. Efficiently captures control flow operators, thus dramatically improving ability of computational graph to be static. At cache optimization level, fuse advanced optimization technologies such as common sub-expression elimination, to significantly improve execution efficiency of Guard. These optimizations not only reduce redundant computations, but also improve overall system operation speed. To enhance robustness of the system, a simple and efficient data intermediate layer structure is designed. Structure supports correctness recovery of SideEffects, ensuring stability and reliability of system in complex environments. In addition, it is widely compatible with mainstream interpreter versions from Python 3.8 to 3.11, providing users with a wide range of applicability. 
[#57824](https://github.com/PaddlePaddle/Paddle/pull/57824),[#55887](https://github.com/PaddlePaddle/Paddle/pull/55887),[#58155](https://github.com/PaddlePaddle/Paddle/pull/58155),[#56107](https://github.com/PaddlePaddle/Paddle/pull/56107),[#57490](https://github.com/PaddlePaddle/Paddle/pull/57490),[#58829](https://github.com/PaddlePaddle/Paddle/pull/58829),[#57240](https://github.com/PaddlePaddle/Paddle/pull/57240),[#57588](https://github.com/PaddlePaddle/Paddle/pull/57588),[#58117](https://github.com/PaddlePaddle/Paddle/pull/58117),[#59823](https://github.com/PaddlePaddle/Paddle/pull/59823),[#56077](https://github.com/PaddlePaddle/Paddle/pull/56077),[#58956](https://github.com/PaddlePaddle/Paddle/pull/58956),[#57653](https://github.com/PaddlePaddle/Paddle/pull/57653),[#59855](https://github.com/PaddlePaddle/Paddle/pull/59855),[#59017](https://github.com/PaddlePaddle/Paddle/pull/59017),[#58424](https://github.com/PaddlePaddle/Paddle/pull/58424),[#58187](https://github.com/PaddlePaddle/Paddle/pull/58187),[#57793](https://github.com/PaddlePaddle/Paddle/pull/57793),[#59698](https://github.com/PaddlePaddle/Paddle/pull/59698),[#59747](https://github.com/PaddlePaddle/Paddle/pull/59747),[#59710](https://github.com/PaddlePaddle/Paddle/pull/59710),[#59297](https://github.com/PaddlePaddle/Paddle/pull/59297),[#58423](https://github.com/PaddlePaddle/Paddle/pull/58423),[#56262](https://github.com/PaddlePaddle/Paddle/pull/56262),[#58103](https://github.com/PaddlePaddle/Paddle/pull/58103),[#58538](https://github.com/PaddlePaddle/Paddle/pull/58538),[#58771](https://github.com/PaddlePaddle/Paddle/pull/58771),[#59191](https://github.com/PaddlePaddle/Paddle/pull/59191),[#57754](https://github.com/PaddlePaddle/Paddle/pull/57754),[#59439](https://github.com/PaddlePaddle/Paddle/pull/59439),[#59816](https://github.com/PaddlePaddle/Paddle/pull/59816),[#59035](https://github.com/PaddlePaddle/Paddle/pull/59035)
-- Added dynamic to static syntax transcription parsing for PyLayer functions, making PyLayer's conversion between dynamic and static graphs smoother. Users can now seamlessly carry out dynamic to static training on PyLayer, to easily export inference models. [#56108](https://github.com/PaddlePaddle/Paddle/pull/56108),[#56531](https://github.com/PaddlePaddle/Paddle/pull/56531),[#57066](https://github.com/PaddlePaddle/Paddle/pull/57066),[#57633](https://github.com/PaddlePaddle/Paddle/pull/57633)
+### PIR New Features
-#### Bug Fix
+- Implement backward logic based on PIR, generating the backward computation graph directly while also supporting higher-order differentiation. [#60174](https://github.com/PaddlePaddle/Paddle/pull/60174), [#60328](https://github.com/PaddlePaddle/Paddle/pull/60328), [#60818](https://github.com/PaddlePaddle/Paddle/pull/60818), [#61352](https://github.com/PaddlePaddle/Paddle/pull/61352), [#61661](https://github.com/PaddlePaddle/Paddle/pull/61661), [#61927](https://github.com/PaddlePaddle/Paddle/pull/61927), [#62772](https://github.com/PaddlePaddle/Paddle/pull/62772), [#60360](https://github.com/PaddlePaddle/Paddle/pull/60360), [#60866](https://github.com/PaddlePaddle/Paddle/pull/60866), [#60970](https://github.com/PaddlePaddle/Paddle/pull/60970), [#60810](https://github.com/PaddlePaddle/Paddle/pull/60810), [#64696](https://github.com/PaddlePaddle/Paddle/pull/64696), [#59844](https://github.com/PaddlePaddle/Paddle/pull/59844), [#59999](https://github.com/PaddlePaddle/Paddle/pull/59999), [#60262](https://github.com/PaddlePaddle/Paddle/pull/60262), [#60338](https://github.com/PaddlePaddle/Paddle/pull/60338), [#59935](https://github.com/PaddlePaddle/Paddle/pull/59935), [#59982](https://github.com/PaddlePaddle/Paddle/pull/59982), [#60221](https://github.com/PaddlePaddle/Paddle/pull/60221), [#62621](https://github.com/PaddlePaddle/Paddle/pull/62621), [#60044](https://github.com/PaddlePaddle/Paddle/pull/60044), [#59790](https://github.com/PaddlePaddle/Paddle/pull/59790), [#60529](https://github.com/PaddlePaddle/Paddle/pull/60529), [#61378](https://github.com/PaddlePaddle/Paddle/pull/61378), [#61584](https://github.com/PaddlePaddle/Paddle/pull/61584)
+- Implement control flow logic based on PIR to improve the expressive ability of PIR and better support multi-scenario services such as training and inference. [#61396](https://github.com/PaddlePaddle/Paddle/pull/61396), [#64045](https://github.com/PaddlePaddle/Paddle/pull/64045), [#60953](https://github.com/PaddlePaddle/Paddle/pull/60953), [#61091](https://github.com/PaddlePaddle/Paddle/pull/61091), [#61304](https://github.com/PaddlePaddle/Paddle/pull/61304), [#62093](https://github.com/PaddlePaddle/Paddle/pull/62093), [#64710](https://github.com/PaddlePaddle/Paddle/pull/64710), [#60668](https://github.com/PaddlePaddle/Paddle/pull/60668), [#60433](https://github.com/PaddlePaddle/Paddle/pull/60433), [#60963](https://github.com/PaddlePaddle/Paddle/pull/60963), [#61192](https://github.com/PaddlePaddle/Paddle/pull/61192), [#60895](https://github.com/PaddlePaddle/Paddle/pull/60895), [#60017](https://github.com/PaddlePaddle/Paddle/pull/60017), [#60369](https://github.com/PaddlePaddle/Paddle/pull/60369), [#60330](https://github.com/PaddlePaddle/Paddle/pull/60330), [#60364](https://github.com/PaddlePaddle/Paddle/pull/60364), [#61416](https://github.com/PaddlePaddle/Paddle/pull/61416), [#60460](https://github.com/PaddlePaddle/Paddle/pull/60460), [#60703](https://github.com/PaddlePaddle/Paddle/pull/60703), [#61027](https://github.com/PaddlePaddle/Paddle/pull/61027)
+- Implement save/load logic based on PIR, connecting PIR with upstream and downstream training and inference workflows. [#63438](https://github.com/PaddlePaddle/Paddle/pull/63438), [#63574](https://github.com/PaddlePaddle/Paddle/pull/63574), [#64281](https://github.com/PaddlePaddle/Paddle/pull/64281), [#64327](https://github.com/PaddlePaddle/Paddle/pull/64327), [#63622](https://github.com/PaddlePaddle/Paddle/pull/63622), [#64507](https://github.com/PaddlePaddle/Paddle/pull/64507), [#63389](https://github.com/PaddlePaddle/Paddle/pull/63389), [#63539](https://github.com/PaddlePaddle/Paddle/pull/63539), [#63749](https://github.com/PaddlePaddle/Paddle/pull/63749), [#63957](https://github.com/PaddlePaddle/Paddle/pull/63957), [#64044](https://github.com/PaddlePaddle/Paddle/pull/64044), [#64121](https://github.com/PaddlePaddle/Paddle/pull/64121), [#64239](https://github.com/PaddlePaddle/Paddle/pull/64239), [#63818](https://github.com/PaddlePaddle/Paddle/pull/63818), [#63910](https://github.com/PaddlePaddle/Paddle/pull/63910), [#63380](https://github.com/PaddlePaddle/Paddle/pull/63380), [#63275](https://github.com/PaddlePaddle/Paddle/pull/63275), [#63663](https://github.com/PaddlePaddle/Paddle/pull/63663), [#64692](https://github.com/PaddlePaddle/Paddle/pull/64692), [#63958](https://github.com/PaddlePaddle/Paddle/pull/63958)
+- Completed the development and validation of OneDNN related basic functions to prepare for the full-scale switch of OneDNN. [#60680](https://github.com/PaddlePaddle/Paddle/pull/60680), [#60665](https://github.com/PaddlePaddle/Paddle/pull/60665), [#63162](https://github.com/PaddlePaddle/Paddle/pull/63162), [#59917](https://github.com/PaddlePaddle/Paddle/pull/59917), [#62901](https://github.com/PaddlePaddle/Paddle/pull/62901), [#59918](https://github.com/PaddlePaddle/Paddle/pull/59918), [#60257](https://github.com/PaddlePaddle/Paddle/pull/60257), [#60502](https://github.com/PaddlePaddle/Paddle/pull/60502), [#61062](https://github.com/PaddlePaddle/Paddle/pull/61062), [#61170](https://github.com/PaddlePaddle/Paddle/pull/61170), [#61474](https://github.com/PaddlePaddle/Paddle/pull/61474), [#60874](https://github.com/PaddlePaddle/Paddle/pull/60874), [#61495](https://github.com/PaddlePaddle/Paddle/pull/61495), [#61664](https://github.com/PaddlePaddle/Paddle/pull/61664), [#61649](https://github.com/PaddlePaddle/Paddle/pull/61649), [#61592](https://github.com/PaddlePaddle/Paddle/pull/61592), [#61667](https://github.com/PaddlePaddle/Paddle/pull/61667), [#61137](https://github.com/PaddlePaddle/Paddle/pull/61137), [#60952](https://github.com/PaddlePaddle/Paddle/pull/60952), [#61651](https://github.com/PaddlePaddle/Paddle/pull/61651), [#62126](https://github.com/PaddlePaddle/Paddle/pull/62126), [#62187](https://github.com/PaddlePaddle/Paddle/pull/62187), [#61307](https://github.com/PaddlePaddle/Paddle/pull/61307), [#62734](https://github.com/PaddlePaddle/Paddle/pull/62734), [#60974](https://github.com/PaddlePaddle/Paddle/pull/60974), [#61451](https://github.com/PaddlePaddle/Paddle/pull/61451), [#61011](https://github.com/PaddlePaddle/Paddle/pull/61011), [#61218](https://github.com/PaddlePaddle/Paddle/pull/61218), [#61623](https://github.com/PaddlePaddle/Paddle/pull/61623), [#61893](https://github.com/PaddlePaddle/Paddle/pull/61893), [#61876](https://github.com/PaddlePaddle/Paddle/pull/61876), [#61892](https://github.com/PaddlePaddle/Paddle/pull/61892), [#62085](https://github.com/PaddlePaddle/Paddle/pull/62085), [#62220](https://github.com/PaddlePaddle/Paddle/pull/62220), [#62244](https://github.com/PaddlePaddle/Paddle/pull/62244), [#62265](https://github.com/PaddlePaddle/Paddle/pull/62265), [#60754](https://github.com/PaddlePaddle/Paddle/pull/60754), [#60896](https://github.com/PaddlePaddle/Paddle/pull/60896), [#61868](https://github.com/PaddlePaddle/Paddle/pull/61868), [#61659](https://github.com/PaddlePaddle/Paddle/pull/61659), [#62241](https://github.com/PaddlePaddle/Paddle/pull/62241), [#62471](https://github.com/PaddlePaddle/Paddle/pull/62471), [#61165](https://github.com/PaddlePaddle/Paddle/pull/61165), [#64441](https://github.com/PaddlePaddle/Paddle/pull/64441), [#63141](https://github.com/PaddlePaddle/Paddle/pull/63141), [#63145](https://github.com/PaddlePaddle/Paddle/pull/63145), [#63592](https://github.com/PaddlePaddle/Paddle/pull/63592), [#63617](https://github.com/PaddlePaddle/Paddle/pull/63617), [#63518](https://github.com/PaddlePaddle/Paddle/pull/63518), [#63726](https://github.com/PaddlePaddle/Paddle/pull/63726), [#63853](https://github.com/PaddlePaddle/Paddle/pull/63853), [#63812](https://github.com/PaddlePaddle/Paddle/pull/63812), [#63811](https://github.com/PaddlePaddle/Paddle/pull/63811), [#64524](https://github.com/PaddlePaddle/Paddle/pull/64524), [#62993](https://github.com/PaddlePaddle/Paddle/pull/62993), [#63516](https://github.com/PaddlePaddle/Paddle/pull/63516), [#62998](https://github.com/PaddlePaddle/Paddle/pull/62998), [#63151](https://github.com/PaddlePaddle/Paddle/pull/63151), [#64661](https://github.com/PaddlePaddle/Paddle/pull/64661), [#64433](https://github.com/PaddlePaddle/Paddle/pull/64433), [#64448](https://github.com/PaddlePaddle/Paddle/pull/64448), [#63201](https://github.com/PaddlePaddle/Paddle/pull/63201), [#63230](https://github.com/PaddlePaddle/Paddle/pull/63230), [#63233](https://github.com/PaddlePaddle/Paddle/pull/63233), [#63281](https://github.com/PaddlePaddle/Paddle/pull/63281), [#64671](https://github.com/PaddlePaddle/Paddle/pull/64671), [#63274](https://github.com/PaddlePaddle/Paddle/pull/63274)
+- Implement Sparse related logic based on PIR, including basic Type and operator expression, and complete the verification of Sparse key functions. [#62868](https://github.com/PaddlePaddle/Paddle/pull/62868), [#63015](https://github.com/PaddlePaddle/Paddle/pull/63015), [#62894](https://github.com/PaddlePaddle/Paddle/pull/62894)
-- Fixed the issue that video memory is abnormal in some scenarios of dynamic to static in is_test=True mode. [#58350](https://github.com/PaddlePaddle/Paddle/pull/58350)
-- Fixed the issue that function decorated by @to_static is exported to jit.save model in scenarios like foo(x,x,y). [#55963](https://github.com/PaddlePaddle/Paddle/pull/55963)
-- Fixed the issue that dynamic and static logic of some API behaviors is not uniform. This improves success rate and user experience of dynamic to static graph conversion. [#56092](https://github.com/PaddlePaddle/Paddle/pull/56092)
+### Dynamic-to-static Function Optimization
-#### Fixed vulnerability
+Optimize the basic dynamic-to-static capabilities, adapt SOT to dynamic dimensions in training scenarios, and support Python 3.12.
-- Fixed a potential security vulnerability in use of eval() in dynamic to static syntax transcription module. [#60100](https://github.com/PaddlePaddle/Paddle/pull/60100)
+- Complete the PIR adaptation in dynamic-to-static scenarios. [#60988](https://github.com/PaddlePaddle/Paddle/pull/60988), [#61936](https://github.com/PaddlePaddle/Paddle/pull/61936), [#59929](https://github.com/PaddlePaddle/Paddle/pull/59929), [#61790](https://github.com/PaddlePaddle/Paddle/pull/61790), [#64323](https://github.com/PaddlePaddle/Paddle/pull/64323), [#62030](https://github.com/PaddlePaddle/Paddle/pull/62030), [#61143](https://github.com/PaddlePaddle/Paddle/pull/61143), [#62680](https://github.com/PaddlePaddle/Paddle/pull/62680), [#63309](https://github.com/PaddlePaddle/Paddle/pull/63309), [#63311](https://github.com/PaddlePaddle/Paddle/pull/63311), [#62199](https://github.com/PaddlePaddle/Paddle/pull/62199)
+- SOT adapts to Python 3.12 bytecode, and the dynamic-to-static SOT function can be used in Python 3.12. [#61414](https://github.com/PaddlePaddle/Paddle/pull/61414), [#59562](https://github.com/PaddlePaddle/Paddle/pull/59562), [#61031](https://github.com/PaddlePaddle/Paddle/pull/61031), [#61272](https://github.com/PaddlePaddle/Paddle/pull/61272), [#61412](https://github.com/PaddlePaddle/Paddle/pull/61412), [#61305](https://github.com/PaddlePaddle/Paddle/pull/61305), [#61964](https://github.com/PaddlePaddle/Paddle/pull/61964), [#62008](https://github.com/PaddlePaddle/Paddle/pull/62008), [#62028](https://github.com/PaddlePaddle/Paddle/pull/62028), [#61995](https://github.com/PaddlePaddle/Paddle/pull/61995), [#62073](https://github.com/PaddlePaddle/Paddle/pull/62073), [#62120](https://github.com/PaddlePaddle/Paddle/pull/62120), [#62218](https://github.com/PaddlePaddle/Paddle/pull/62218), [#62155](https://github.com/PaddlePaddle/Paddle/pull/62155)
+- SOT now handles dynamic dimensions in training scenarios, avoiding repeated graph construction when dimensions change and improving execution efficiency. [#64278](https://github.com/PaddlePaddle/Paddle/pull/64278), [#64435](https://github.com/PaddlePaddle/Paddle/pull/64435), [#64499](https://github.com/PaddlePaddle/Paddle/pull/64499), [#64500](https://github.com/PaddlePaddle/Paddle/pull/64500), [#62080](https://github.com/PaddlePaddle/Paddle/pull/62080)
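
The effect of dynamic-dimension support on graph reuse can be illustrated with a toy cache. This is only a conceptual sketch in plain Python, not SOT's actual implementation: the `GraphCache` class, its `dynamic_axes` parameter, and the "compiled graph" strings are all hypothetical names made up for illustration. It assumes a captured graph is guarded by the input shape, and that marking an axis as dynamic lets one graph serve every size along that axis.

```python
from typing import Callable, Dict, Optional, Tuple

Shape = Tuple[int, ...]
GuardKey = Tuple[Optional[int], ...]


class GraphCache:
    """Toy cache of "compiled graphs" keyed by an input-shape guard (hypothetical)."""

    def __init__(self, dynamic_axes: Tuple[int, ...] = ()) -> None:
        self.dynamic_axes = set(dynamic_axes)
        self.compiled: Dict[GuardKey, Callable[[Shape], str]] = {}
        self.compile_count = 0

    def _guard_key(self, shape: Shape) -> GuardKey:
        # A dynamic axis is stored as None, so any size along it matches the guard.
        return tuple(
            None if axis in self.dynamic_axes else dim
            for axis, dim in enumerate(shape)
        )

    def run(self, shape: Shape) -> str:
        key = self._guard_key(shape)
        if key not in self.compiled:
            # Cache miss: stand-in for building a new static graph.
            self.compile_count += 1
            self.compiled[key] = lambda s, k=key: f"graph{k} ran on {s}"
        return self.compiled[key](shape)


# The batch size (axis 0) changes between steps, as with a variable-size batch.
batch_shapes = [(8, 128), (16, 128), (32, 128), (8, 128)]

static_cache = GraphCache()                    # every distinct shape is a new guard
dynamic_cache = GraphCache(dynamic_axes=(0,))  # axis 0 is treated as symbolic

for shape in batch_shapes:
    static_cache.run(shape)
    dynamic_cache.run(shape)

print(static_cache.compile_count, dynamic_cache.compile_count)  # prints "3 1"
```

With fully static guards, each new batch size triggers another build (three in total here), while the guard with a symbolic batch axis builds once and is reused for every later step.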
-### Enhanced distributed dynamic graph capability
+### Operator Mechanisms
-In order to meet the needs of large models, this version focuses on improving the distributed computing capability of the dynamic graph of the PaddlePaddle. Various improvements have been made in communication library, graph analysis, distributed policies and task enable/disable, to provide comprehensive support for large model training. In terms of performance, we further improved training performance by reducing streaming parallel GPU memory occupation, adopting TensorFusion technology, implementing communication computation overlap, and reducing non-essential data synchronization copies. Meanwhile, flexibility of hybrid-parallel debugging is improved through environment variable control Optimizer. In addition, stability of system is further improved by fixing related Bugs.
+To address incomplete kernel implementations and inefficient computation logic, we improved and optimized some operator implementations and internal framework mechanisms, fixed known issues, and added support for some new features.
-#### New features
+- For XPU kernels, we have improved data type support for `numel`, `concat`, and `slice`, and mixed-precision training support for the `AdamW` optimizer. [#63715](https://github.com/PaddlePaddle/Paddle/pull/63715), [#61617](https://github.com/PaddlePaddle/Paddle/pull/61617), [#61694](https://github.com/PaddlePaddle/Paddle/pull/61694), [#64542](https://github.com/PaddlePaddle/Paddle/pull/64542), [#63644](https://github.com/PaddlePaddle/Paddle/pull/63644), [#61340](https://github.com/PaddlePaddle/Paddle/pull/61340), [#63108](https://github.com/PaddlePaddle/Paddle/pull/63108)
+- Improve the functionality and performance of some operators. [#59413](https://github.com/PaddlePaddle/Paddle/pull/59413), [#60295](https://github.com/PaddlePaddle/Paddle/pull/60295), [#64304](https://github.com/PaddlePaddle/Paddle/pull/64304), [#60979](https://github.com/PaddlePaddle/Paddle/pull/60979), [#63556](https://github.com/PaddlePaddle/Paddle/pull/63556), [#63061](https://github.com/PaddlePaddle/Paddle/pull/63061), [#62533](https://github.com/PaddlePaddle/Paddle/pull/62533)
+- Improve the mechanism of composite operators, and optimize composite logic for some operators. [#59448](https://github.com/PaddlePaddle/Paddle/pull/59448), [#60505](https://github.com/PaddlePaddle/Paddle/pull/60505), [#59891](https://github.com/PaddlePaddle/Paddle/pull/59891), [#63161](https://github.com/PaddlePaddle/Paddle/pull/63161), [#63245](https://github.com/PaddlePaddle/Paddle/pull/63245), [#63782](https://github.com/PaddlePaddle/Paddle/pull/63782), [#64346](https://github.com/PaddlePaddle/Paddle/pull/64346), [#63156](https://github.com/PaddlePaddle/Paddle/pull/63156), [#63171](https://github.com/PaddlePaddle/Paddle/pull/63171), [#61315](https://github.com/PaddlePaddle/Paddle/pull/61315), [#61701](https://github.com/PaddlePaddle/Paddle/pull/61701), [#61874](https://github.com/PaddlePaddle/Paddle/pull/61874), [#61873](https://github.com/PaddlePaddle/Paddle/pull/61873), [#62059](https://github.com/PaddlePaddle/Paddle/pull/62059), [#61912](https://github.com/PaddlePaddle/Paddle/pull/61912), [#62112](https://github.com/PaddlePaddle/Paddle/pull/62112), [#63011](https://github.com/PaddlePaddle/Paddle/pull/63011), [#63009](https://github.com/PaddlePaddle/Paddle/pull/63009), [#64714](https://github.com/PaddlePaddle/Paddle/pull/64714)
-- Added TraceHang function in communication library, to quickly locate the faulty node when cluster training has Hang problem. [#59217](https://github.com/PaddlePaddle/Paddle/pull/59217)
-- In order to improve training efficiency and reduce memory, dynamic graph supports stride mechanism. [#55156](https://github.com/PaddlePaddle/Paddle/pull/55156),[#54762](https://github.com/PaddlePaddle/Paddle/pull/54762),[#55850](https://github.com/PaddlePaddle/Paddle/pull/55850),[#59190](https://github.com/PaddlePaddle/Paddle/pull/59190),[#57005](https://github.com/PaddlePaddle/Paddle/pull/57005),[#57005](https://github.com/PaddlePaddle/Paddle/pull/57005),[#57331](https://github.com/PaddlePaddle/Paddle/pull/57331),[#58033](https://github.com/PaddlePaddle/Paddle/pull/58033),[#58033](https://github.com/PaddlePaddle/Paddle/pull/58033),[#58303](https://github.com/PaddlePaddle/Paddle/pull/58303),[#57835](https://github.com/PaddlePaddle/Paddle/pull/57835),[#57189](https://github.com/PaddlePaddle/Paddle/pull/57189)
-- Enhanced paddleviz function to facilitate analysis of computational graphs. [#56837](https://github.com/PaddlePaddle/Paddle/pull/56837),[#57626](https://github.com/PaddlePaddle/Paddle/pull/57626)
-- In distributed Sharding strategies (Stage1,2,3), added main_grad function to support higher precision gradient accumulation, and reduce precision loss caused by low precision accumulation. [#57972](https://github.com/PaddlePaddle/Paddle/pull/57972),[#57934](https://github.com/PaddlePaddle/Paddle/pull/57934),[#57473](https://github.com/PaddlePaddle/Paddle/pull/57473),[#57537](https://github.com/PaddlePaddle/Paddle/pull/57537),[#59611](https://github.com/PaddlePaddle/Paddle/pull/59611),[#57960](https://github.com/PaddlePaddle/Paddle/pull/57960)
-- In Sharding Stage1 strategy, added a switch variable to control whether to perform fusion calculation on Optimizer. [#58790](https://github.com/PaddlePaddle/Paddle/pull/58790)
-- In Recompute function, added support for Tuple input parameters, enhancing calling ability of Recompute interface. [#56793](https://github.com/PaddlePaddle/Paddle/pull/56793)
-- Enhanced Launch function, allowing distributed training without specifying endpoints in dynamic graphs. [#54636](https://github.com/PaddlePaddle/Paddle/pull/54636)
+### Bug Fixing
-#### Function optimization
+- Fix bugs related to PIR, the executor, and dynamic-to-static. [#64442](https://github.com/PaddlePaddle/Paddle/pull/64442), [#60443](https://github.com/PaddlePaddle/Paddle/pull/60443), [#60122](https://github.com/PaddlePaddle/Paddle/pull/60122), [#60625](https://github.com/PaddlePaddle/Paddle/pull/60625), [#60607](https://github.com/PaddlePaddle/Paddle/pull/60607), [#60705](https://github.com/PaddlePaddle/Paddle/pull/60705), [#61110](https://github.com/PaddlePaddle/Paddle/pull/61110), [#61278](https://github.com/PaddlePaddle/Paddle/pull/61278), [#61448](https://github.com/PaddlePaddle/Paddle/pull/61448), [#61491](https://github.com/PaddlePaddle/Paddle/pull/61491), [#61692](https://github.com/PaddlePaddle/Paddle/pull/61692), [#62100](https://github.com/PaddlePaddle/Paddle/pull/62100), [#62239](https://github.com/PaddlePaddle/Paddle/pull/62239), [#62365](https://github.com/PaddlePaddle/Paddle/pull/62365), [#62758](https://github.com/PaddlePaddle/Paddle/pull/62758), [#63395](https://github.com/PaddlePaddle/Paddle/pull/63395), [#64272](https://github.com/PaddlePaddle/Paddle/pull/64272), [#62165](https://github.com/PaddlePaddle/Paddle/pull/62165), [#64151](https://github.com/PaddlePaddle/Paddle/pull/64151), [#64204](https://github.com/PaddlePaddle/Paddle/pull/64204), [#64815](https://github.com/PaddlePaddle/Paddle/pull/64815), [#63757](https://github.com/PaddlePaddle/Paddle/pull/63757), [#61972](https://github.com/PaddlePaddle/Paddle/pull/61972), [#64806](https://github.com/PaddlePaddle/Paddle/pull/64806), [#60010](https://github.com/PaddlePaddle/Paddle/pull/60010), [#60461](https://github.com/PaddlePaddle/Paddle/pull/60461), [#60310](https://github.com/PaddlePaddle/Paddle/pull/60310), [#62006](https://github.com/PaddlePaddle/Paddle/pull/62006), [#61591](https://github.com/PaddlePaddle/Paddle/pull/61591), [#60327](https://github.com/PaddlePaddle/Paddle/pull/60327), [#60720](https://github.com/PaddlePaddle/Paddle/pull/60720), 
[#64656](https://github.com/PaddlePaddle/Paddle/pull/64656), [#60236](https://github.com/PaddlePaddle/Paddle/pull/60236), [#60684](https://github.com/PaddlePaddle/Paddle/pull/60684), [#60790](https://github.com/PaddlePaddle/Paddle/pull/60790), [#60944](https://github.com/PaddlePaddle/Paddle/pull/60944), [#62056](https://github.com/PaddlePaddle/Paddle/pull/62056), [#62891](https://github.com/PaddlePaddle/Paddle/pull/62891), [#64676](https://github.com/PaddlePaddle/Paddle/pull/64676), [#60271](https://github.com/PaddlePaddle/Paddle/pull/60271), [#60634](https://github.com/PaddlePaddle/Paddle/pull/60634), [#60663](https://github.com/PaddlePaddle/Paddle/pull/60663), [#60827](https://github.com/PaddlePaddle/Paddle/pull/60827), [#60845](https://github.com/PaddlePaddle/Paddle/pull/60845), [#60905](https://github.com/PaddlePaddle/Paddle/pull/60905), [#60945](https://github.com/PaddlePaddle/Paddle/pull/60945), [#60949](https://github.com/PaddlePaddle/Paddle/pull/60949), [#61107](https://github.com/PaddlePaddle/Paddle/pull/61107), [#61111](https://github.com/PaddlePaddle/Paddle/pull/61111), [#61117](https://github.com/PaddlePaddle/Paddle/pull/61117), [#61158](https://github.com/PaddlePaddle/Paddle/pull/61158), [#61177](https://github.com/PaddlePaddle/Paddle/pull/61177), [#61355](https://github.com/PaddlePaddle/Paddle/pull/61355), [#61593](https://github.com/PaddlePaddle/Paddle/pull/61593), [#61666](https://github.com/PaddlePaddle/Paddle/pull/61666), [#61934](https://github.com/PaddlePaddle/Paddle/pull/61934), [#62216](https://github.com/PaddlePaddle/Paddle/pull/62216), [#62491](https://github.com/PaddlePaddle/Paddle/pull/62491), [#62515](https://github.com/PaddlePaddle/Paddle/pull/62515), [#62594](https://github.com/PaddlePaddle/Paddle/pull/62594), [#62605](https://github.com/PaddlePaddle/Paddle/pull/62605), [#62895](https://github.com/PaddlePaddle/Paddle/pull/62895), [#62913](https://github.com/PaddlePaddle/Paddle/pull/62913), 
[#64413](https://github.com/PaddlePaddle/Paddle/pull/64413), [#59947](https://github.com/PaddlePaddle/Paddle/pull/59947), [#60264](https://github.com/PaddlePaddle/Paddle/pull/60264), [#60721](https://github.com/PaddlePaddle/Paddle/pull/60721), [#63113](https://github.com/PaddlePaddle/Paddle/pull/63113), [#63629](https://github.com/PaddlePaddle/Paddle/pull/63629), [#64300](https://github.com/PaddlePaddle/Paddle/pull/64300), [#64450](https://github.com/PaddlePaddle/Paddle/pull/64450), [#64532](https://github.com/PaddlePaddle/Paddle/pull/64532), [#64561](https://github.com/PaddlePaddle/Paddle/pull/64561), [#64625](https://github.com/PaddlePaddle/Paddle/pull/64625), [#64731](https://github.com/PaddlePaddle/Paddle/pull/64731), [#60059](https://github.com/PaddlePaddle/Paddle/pull/60059), [#60487](https://github.com/PaddlePaddle/Paddle/pull/60487), [#60423](https://github.com/PaddlePaddle/Paddle/pull/60423), [#61599](https://github.com/PaddlePaddle/Paddle/pull/61599), [#62032](https://github.com/PaddlePaddle/Paddle/pull/62032), [#62686](https://github.com/PaddlePaddle/Paddle/pull/62686), [#64055](https://github.com/PaddlePaddle/Paddle/pull/64055), [#60751](https://github.com/PaddlePaddle/Paddle/pull/60751), [#61646](https://github.com/PaddlePaddle/Paddle/pull/61646), [#60454](https://github.com/PaddlePaddle/Paddle/pull/60454), [#62530](https://github.com/PaddlePaddle/Paddle/pull/62530), [#62821](https://github.com/PaddlePaddle/Paddle/pull/62821), [#64454](https://github.com/PaddlePaddle/Paddle/pull/64454), [#64754](https://github.com/PaddlePaddle/Paddle/pull/64754), [#59860](https://github.com/PaddlePaddle/Paddle/pull/59860), [#60280](https://github.com/PaddlePaddle/Paddle/pull/60280), [#60357](https://github.com/PaddlePaddle/Paddle/pull/60357), [#60363](https://github.com/PaddlePaddle/Paddle/pull/60363), [#60900](https://github.com/PaddlePaddle/Paddle/pull/60900), [#61185](https://github.com/PaddlePaddle/Paddle/pull/61185), 
[#61505](https://github.com/PaddlePaddle/Paddle/pull/61505), [#61644](https://github.com/PaddlePaddle/Paddle/pull/61644), [#62256](https://github.com/PaddlePaddle/Paddle/pull/62256), [#62396](https://github.com/PaddlePaddle/Paddle/pull/62396), [#63040](https://github.com/PaddlePaddle/Paddle/pull/63040), [#63409](https://github.com/PaddlePaddle/Paddle/pull/63409), [#63764](https://github.com/PaddlePaddle/Paddle/pull/63764), [#59571](https://github.com/PaddlePaddle/Paddle/pull/59571), [#59894](https://github.com/PaddlePaddle/Paddle/pull/59894), [#59569](https://github.com/PaddlePaddle/Paddle/pull/59569), [#59896](https://github.com/PaddlePaddle/Paddle/pull/59896), [#60015](https://github.com/PaddlePaddle/Paddle/pull/60015), [#60081](https://github.com/PaddlePaddle/Paddle/pull/60081), [#60164](https://github.com/PaddlePaddle/Paddle/pull/60164), [#60200](https://github.com/PaddlePaddle/Paddle/pull/60200), [#60211](https://github.com/PaddlePaddle/Paddle/pull/60211), [#60267](https://github.com/PaddlePaddle/Paddle/pull/60267), [#60458](https://github.com/PaddlePaddle/Paddle/pull/60458), [#60395](https://github.com/PaddlePaddle/Paddle/pull/60395), [#60907](https://github.com/PaddlePaddle/Paddle/pull/60907), [#60707](https://github.com/PaddlePaddle/Paddle/pull/60707), [#60993](https://github.com/PaddlePaddle/Paddle/pull/60993), [#61401](https://github.com/PaddlePaddle/Paddle/pull/61401), [#61433](https://github.com/PaddlePaddle/Paddle/pull/61433), [#61450](https://github.com/PaddlePaddle/Paddle/pull/61450), [#61577](https://github.com/PaddlePaddle/Paddle/pull/61577), [#61575](https://github.com/PaddlePaddle/Paddle/pull/61575), [#61703](https://github.com/PaddlePaddle/Paddle/pull/61703), [#61711](https://github.com/PaddlePaddle/Paddle/pull/61711), [#61883](https://github.com/PaddlePaddle/Paddle/pull/61883), [#61822](https://github.com/PaddlePaddle/Paddle/pull/61822), [#62012](https://github.com/PaddlePaddle/Paddle/pull/62012), 
[#61858](https://github.com/PaddlePaddle/Paddle/pull/61858), [#62176](https://github.com/PaddlePaddle/Paddle/pull/62176), [#62257](https://github.com/PaddlePaddle/Paddle/pull/62257), [#62470](https://github.com/PaddlePaddle/Paddle/pull/62470), [#62536](https://github.com/PaddlePaddle/Paddle/pull/62536), [#62606](https://github.com/PaddlePaddle/Paddle/pull/62606), [#62808](https://github.com/PaddlePaddle/Paddle/pull/62808), [#62854](https://github.com/PaddlePaddle/Paddle/pull/62854), [#62879](https://github.com/PaddlePaddle/Paddle/pull/62879), [#62864](https://github.com/PaddlePaddle/Paddle/pull/62864), [#63063](https://github.com/PaddlePaddle/Paddle/pull/63063), [#62958](https://github.com/PaddlePaddle/Paddle/pull/62958), [#63397](https://github.com/PaddlePaddle/Paddle/pull/63397), [#63805](https://github.com/PaddlePaddle/Paddle/pull/63805), [#63694](https://github.com/PaddlePaddle/Paddle/pull/63694), [#64168](https://github.com/PaddlePaddle/Paddle/pull/64168), [#64184](https://github.com/PaddlePaddle/Paddle/pull/64184), [#64174](https://github.com/PaddlePaddle/Paddle/pull/64174), [#64315](https://github.com/PaddlePaddle/Paddle/pull/64315), [#64362](https://github.com/PaddlePaddle/Paddle/pull/64362), [#64400](https://github.com/PaddlePaddle/Paddle/pull/64400), [#64475](https://github.com/PaddlePaddle/Paddle/pull/64475), [#64458](https://github.com/PaddlePaddle/Paddle/pull/64458), [#64548](https://github.com/PaddlePaddle/Paddle/pull/64548), [#59858](https://github.com/PaddlePaddle/Paddle/pull/59858), [#61132](https://github.com/PaddlePaddle/Paddle/pull/61132), [#62010](https://github.com/PaddlePaddle/Paddle/pull/62010), [#62069](https://github.com/PaddlePaddle/Paddle/pull/62069), [#62707](https://github.com/PaddlePaddle/Paddle/pull/62707), [#62921](https://github.com/PaddlePaddle/Paddle/pull/62921), [#63085](https://github.com/PaddlePaddle/Paddle/pull/63085), [#63321](https://github.com/PaddlePaddle/Paddle/pull/63321), 
[#63351](https://github.com/PaddlePaddle/Paddle/pull/63351), [#63549](https://github.com/PaddlePaddle/Paddle/pull/63549), [#64567](https://github.com/PaddlePaddle/Paddle/pull/64567), [#59936](https://github.com/PaddlePaddle/Paddle/pull/59936), [#60269](https://github.com/PaddlePaddle/Paddle/pull/60269), [#60879](https://github.com/PaddlePaddle/Paddle/pull/60879), [#61314](https://github.com/PaddlePaddle/Paddle/pull/61314), [#61391](https://github.com/PaddlePaddle/Paddle/pull/61391), [#61479](https://github.com/PaddlePaddle/Paddle/pull/61479), [#61789](https://github.com/PaddlePaddle/Paddle/pull/61789), [#61832](https://github.com/PaddlePaddle/Paddle/pull/61832), [#61864](https://github.com/PaddlePaddle/Paddle/pull/61864), [#61917](https://github.com/PaddlePaddle/Paddle/pull/61917), [#62052](https://github.com/PaddlePaddle/Paddle/pull/62052), [#62068](https://github.com/PaddlePaddle/Paddle/pull/62068), [#62293](https://github.com/PaddlePaddle/Paddle/pull/62293), [#62479](https://github.com/PaddlePaddle/Paddle/pull/62479), [#62506](https://github.com/PaddlePaddle/Paddle/pull/62506), [#59948](https://github.com/PaddlePaddle/Paddle/pull/59948), [#64118](https://github.com/PaddlePaddle/Paddle/pull/64118), [#64126](https://github.com/PaddlePaddle/Paddle/pull/64126), [#64195](https://github.com/PaddlePaddle/Paddle/pull/64195), [#64307](https://github.com/PaddlePaddle/Paddle/pull/64307), [#64314](https://github.com/PaddlePaddle/Paddle/pull/64314), [#64276](https://github.com/PaddlePaddle/Paddle/pull/64276), [#64312](https://github.com/PaddlePaddle/Paddle/pull/64312), [#64350](https://github.com/PaddlePaddle/Paddle/pull/64350), [#64319](https://github.com/PaddlePaddle/Paddle/pull/64319), [#64463](https://github.com/PaddlePaddle/Paddle/pull/64463), [#64457](https://github.com/PaddlePaddle/Paddle/pull/64457), [#64455](https://github.com/PaddlePaddle/Paddle/pull/64455), [#64487](https://github.com/PaddlePaddle/Paddle/pull/64487), 
[#64645](https://github.com/PaddlePaddle/Paddle/pull/64645), [#63155](https://github.com/PaddlePaddle/Paddle/pull/63155), [#59893](https://github.com/PaddlePaddle/Paddle/pull/59893), [#63332](https://github.com/PaddlePaddle/Paddle/pull/63332), [#64786](https://github.com/PaddlePaddle/Paddle/pull/64786), [#60515](https://github.com/PaddlePaddle/Paddle/pull/60515), [#60627](https://github.com/PaddlePaddle/Paddle/pull/60627), [#60863](https://github.com/PaddlePaddle/Paddle/pull/60863), [#60854](https://github.com/PaddlePaddle/Paddle/pull/60854), [#61447](https://github.com/PaddlePaddle/Paddle/pull/61447), [#61440](https://github.com/PaddlePaddle/Paddle/pull/61440), [#61932](https://github.com/PaddlePaddle/Paddle/pull/61932), [#62131](https://github.com/PaddlePaddle/Paddle/pull/62131), [#62252](https://github.com/PaddlePaddle/Paddle/pull/62252), [#62283](https://github.com/PaddlePaddle/Paddle/pull/62283), [#62358](https://github.com/PaddlePaddle/Paddle/pull/62358), [#62411](https://github.com/PaddlePaddle/Paddle/pull/62411), [#62424](https://github.com/PaddlePaddle/Paddle/pull/62424), [#62810](https://github.com/PaddlePaddle/Paddle/pull/62810), [#62811](https://github.com/PaddlePaddle/Paddle/pull/62811), [#62896](https://github.com/PaddlePaddle/Paddle/pull/62896), [#62947](https://github.com/PaddlePaddle/Paddle/pull/62947), [#63182](https://github.com/PaddlePaddle/Paddle/pull/63182), [#63190](https://github.com/PaddlePaddle/Paddle/pull/63190), [#63294](https://github.com/PaddlePaddle/Paddle/pull/63294), [#63306](https://github.com/PaddlePaddle/Paddle/pull/63306), [#63352](https://github.com/PaddlePaddle/Paddle/pull/63352), [#63404](https://github.com/PaddlePaddle/Paddle/pull/63404), [#63474](https://github.com/PaddlePaddle/Paddle/pull/63474), [#64013](https://github.com/PaddlePaddle/Paddle/pull/64013), 
[#64674](https://github.com/PaddlePaddle/Paddle/pull/64674), [#60055](https://github.com/PaddlePaddle/Paddle/pull/60055), [#62050](https://github.com/PaddlePaddle/Paddle/pull/62050), [#62770](https://github.com/PaddlePaddle/Paddle/pull/62770), [#63234](https://github.com/PaddlePaddle/Paddle/pull/63234), [#63374](https://github.com/PaddlePaddle/Paddle/pull/63374), [#64277](https://github.com/PaddlePaddle/Paddle/pull/64277), [#63420](https://github.com/PaddlePaddle/Paddle/pull/63420), [#60312](https://github.com/PaddlePaddle/Paddle/pull/60312), [#63810](https://github.com/PaddlePaddle/Paddle/pull/63810), [#64631](https://github.com/PaddlePaddle/Paddle/pull/64631), [#63970](https://github.com/PaddlePaddle/Paddle/pull/63970), [#63708](https://github.com/PaddlePaddle/Paddle/pull/63708), [#62062](https://github.com/PaddlePaddle/Paddle/pull/62062), [#60898](https://github.com/PaddlePaddle/Paddle/pull/60898), [#62373](https://github.com/PaddlePaddle/Paddle/pull/62373), [#59878](https://github.com/PaddlePaddle/Paddle/pull/59878)
+- Fix some bugs in operator mechanism, operator implementation logic and related unit tests. [#63792](https://github.com/PaddlePaddle/Paddle/pull/63792), [#60570](https://github.com/PaddlePaddle/Paddle/pull/60570), [#61572](https://github.com/PaddlePaddle/Paddle/pull/61572), [#59971](https://github.com/PaddlePaddle/Paddle/pull/59971), [#61336](https://github.com/PaddlePaddle/Paddle/pull/61336), [#63276](https://github.com/PaddlePaddle/Paddle/pull/63276), [#63251](https://github.com/PaddlePaddle/Paddle/pull/63251), [#63697](https://github.com/PaddlePaddle/Paddle/pull/63697), [#63706](https://github.com/PaddlePaddle/Paddle/pull/63706), [#64685](https://github.com/PaddlePaddle/Paddle/pull/64685), [#64009](https://github.com/PaddlePaddle/Paddle/pull/64009), [#62461](https://github.com/PaddlePaddle/Paddle/pull/62461), [#61568](https://github.com/PaddlePaddle/Paddle/pull/61568), [#63912](https://github.com/PaddlePaddle/Paddle/pull/63912), [#60475](https://github.com/PaddlePaddle/Paddle/pull/60475), [#60222](https://github.com/PaddlePaddle/Paddle/pull/60222), [#63961](https://github.com/PaddlePaddle/Paddle/pull/63961), [#63593](https://github.com/PaddlePaddle/Paddle/pull/63593)
-- Implemented new communication library with dynamic-static unification. Communication operators are fully adapted to PHI operator system, reducing development and maintenance costs to better support dynamic graphs and auto parallel architecture upgrade. [#54417](https://github.com/PaddlePaddle/Paddle/pull/54417),[#57768](https://github.com/PaddlePaddle/Paddle/pull/57768),[#57897](https://github.com/PaddlePaddle/Paddle/pull/57897),[#55537](https://github.com/PaddlePaddle/Paddle/pull/55537),[#56604](https://github.com/PaddlePaddle/Paddle/pull/56604),[#57519](https://github.com/PaddlePaddle/Paddle/pull/57519),[#56088](https://github.com/PaddlePaddle/Paddle/pull/56088),[#57153](https://github.com/PaddlePaddle/Paddle/pull/57153),[#57161](https://github.com/PaddlePaddle/Paddle/pull/57161),[#57252](https://github.com/PaddlePaddle/Paddle/pull/57252),[#57251](https://github.com/PaddlePaddle/Paddle/pull/57251),[#57208](https://github.com/PaddlePaddle/Paddle/pull/57208),[#57305](https://github.com/PaddlePaddle/Paddle/pull/57305),[#57424](https://github.com/PaddlePaddle/Paddle/pull/57424),[#57548](https://github.com/PaddlePaddle/Paddle/pull/57548),[#57560](https://github.com/PaddlePaddle/Paddle/pull/57560),[#57564](https://github.com/PaddlePaddle/Paddle/pull/57564),[#57233](https://github.com/PaddlePaddle/Paddle/pull/57233),[#55726](https://github.com/PaddlePaddle/Paddle/pull/55726),[#58073](https://github.com/PaddlePaddle/Paddle/pull/58073)
-- TCPStore is changed to a single instance to support dynamic graphs and auto parallel more flexibly. [#55956](https://github.com/PaddlePaddle/Paddle/pull/55956)
-- Improved maintainability and flexibility of distributed policies such as MP/PP/SP, including addition of printing warning and error messages, structural cleanup of code files, and optimization of PP restrictions on inputs. [#54448](https://github.com/PaddlePaddle/Paddle/pull/54448),[#59762](https://github.com/PaddlePaddle/Paddle/pull/59762),[#55462](https://github.com/PaddlePaddle/Paddle/pull/55462),[#54788](https://github.com/PaddlePaddle/Paddle/pull/54788),[#54664](https://github.com/PaddlePaddle/Paddle/pull/54664),[#56456](https://github.com/PaddlePaddle/Paddle/pull/56456),[#55540](https://github.com/PaddlePaddle/Paddle/pull/55540)
-- In PP strategy, added support for P2P communication in computation flow, making communication mode more flexible. [#54747](https://github.com/PaddlePaddle/Paddle/pull/54747)
-- Sharding strategy supports reduce Operation on gradient. [#58842](https://github.com/PaddlePaddle/Paddle/pull/58842),[#57967](https://github.com/PaddlePaddle/Paddle/pull/57967),[#55495](https://github.com/PaddlePaddle/Paddle/pull/55495)
+### Developer Content
-#### Performance optimization
+- Developer-related changes, including PRs for switching to PIR, enabling unit tests, and verifying functionality. [#60621](https://github.com/PaddlePaddle/Paddle/pull/60621), [#59703](https://github.com/PaddlePaddle/Paddle/pull/59703), [#59694](https://github.com/PaddlePaddle/Paddle/pull/59694), [#59717](https://github.com/PaddlePaddle/Paddle/pull/59717), [#59729](https://github.com/PaddlePaddle/Paddle/pull/59729), [#59730](https://github.com/PaddlePaddle/Paddle/pull/59730), [#60216](https://github.com/PaddlePaddle/Paddle/pull/60216), [#60238](https://github.com/PaddlePaddle/Paddle/pull/60238), [#60246](https://github.com/PaddlePaddle/Paddle/pull/60246), [#60343](https://github.com/PaddlePaddle/Paddle/pull/60343), [#60302](https://github.com/PaddlePaddle/Paddle/pull/60302), [#60870](https://github.com/PaddlePaddle/Paddle/pull/60870), [#59956](https://github.com/PaddlePaddle/Paddle/pull/59956), [#60795](https://github.com/PaddlePaddle/Paddle/pull/60795), [#62528](https://github.com/PaddlePaddle/Paddle/pull/62528), [#59932](https://github.com/PaddlePaddle/Paddle/pull/59932), [#59636](https://github.com/PaddlePaddle/Paddle/pull/59636), [#59959](https://github.com/PaddlePaddle/Paddle/pull/59959), [#59734](https://github.com/PaddlePaddle/Paddle/pull/59734), [#60287](https://github.com/PaddlePaddle/Paddle/pull/60287), [#60347](https://github.com/PaddlePaddle/Paddle/pull/60347), [#60335](https://github.com/PaddlePaddle/Paddle/pull/60335), [#60332](https://github.com/PaddlePaddle/Paddle/pull/60332), [#59631](https://github.com/PaddlePaddle/Paddle/pull/59631), [#60255](https://github.com/PaddlePaddle/Paddle/pull/60255), [#60329](https://github.com/PaddlePaddle/Paddle/pull/60329), [#60401](https://github.com/PaddlePaddle/Paddle/pull/60401), [#60522](https://github.com/PaddlePaddle/Paddle/pull/60522), [#60792](https://github.com/PaddlePaddle/Paddle/pull/60792), [#59617](https://github.com/PaddlePaddle/Paddle/pull/59617), [#60277](https://github.com/PaddlePaddle/Paddle/pull/60277), 
[#60584](https://github.com/PaddlePaddle/Paddle/pull/60584), [#60911](https://github.com/PaddlePaddle/Paddle/pull/60911), [#61322](https://github.com/PaddlePaddle/Paddle/pull/61322), [#60838](https://github.com/PaddlePaddle/Paddle/pull/60838), [#60602](https://github.com/PaddlePaddle/Paddle/pull/60602), [#61458](https://github.com/PaddlePaddle/Paddle/pull/61458), [#61607](https://github.com/PaddlePaddle/Paddle/pull/61607), [#61960](https://github.com/PaddlePaddle/Paddle/pull/61960), [#60484](https://github.com/PaddlePaddle/Paddle/pull/60484), [#61662](https://github.com/PaddlePaddle/Paddle/pull/61662), [#62263](https://github.com/PaddlePaddle/Paddle/pull/62263), [#62270](https://github.com/PaddlePaddle/Paddle/pull/62270), [#62469](https://github.com/PaddlePaddle/Paddle/pull/62469), [#62416](https://github.com/PaddlePaddle/Paddle/pull/62416), [#62443](https://github.com/PaddlePaddle/Paddle/pull/62443), [#62412](https://github.com/PaddlePaddle/Paddle/pull/62412), [#62541](https://github.com/PaddlePaddle/Paddle/pull/62541), [#62634](https://github.com/PaddlePaddle/Paddle/pull/62634), [#62369](https://github.com/PaddlePaddle/Paddle/pull/62369), [#60805](https://github.com/PaddlePaddle/Paddle/pull/60805), [#62644](https://github.com/PaddlePaddle/Paddle/pull/62644), [#62494](https://github.com/PaddlePaddle/Paddle/pull/62494), [#62767](https://github.com/PaddlePaddle/Paddle/pull/62767), [#62735](https://github.com/PaddlePaddle/Paddle/pull/62735), [#62802](https://github.com/PaddlePaddle/Paddle/pull/62802), [#62801](https://github.com/PaddlePaddle/Paddle/pull/62801), [#62783](https://github.com/PaddlePaddle/Paddle/pull/62783), [#62579](https://github.com/PaddlePaddle/Paddle/pull/62579), [#62833](https://github.com/PaddlePaddle/Paddle/pull/62833), [#62668](https://github.com/PaddlePaddle/Paddle/pull/62668), [#62972](https://github.com/PaddlePaddle/Paddle/pull/62972), [#62505](https://github.com/PaddlePaddle/Paddle/pull/62505), 
[#63005](https://github.com/PaddlePaddle/Paddle/pull/63005), [#62900](https://github.com/PaddlePaddle/Paddle/pull/62900), [#60577](https://github.com/PaddlePaddle/Paddle/pull/60577), [#60877](https://github.com/PaddlePaddle/Paddle/pull/60877), [#61076](https://github.com/PaddlePaddle/Paddle/pull/61076), [#61038](https://github.com/PaddlePaddle/Paddle/pull/61038), [#61112](https://github.com/PaddlePaddle/Paddle/pull/61112), [#61120](https://github.com/PaddlePaddle/Paddle/pull/61120), [#61582](https://github.com/PaddlePaddle/Paddle/pull/61582), [#61119](https://github.com/PaddlePaddle/Paddle/pull/61119), [#61036](https://github.com/PaddlePaddle/Paddle/pull/61036), [#61289](https://github.com/PaddlePaddle/Paddle/pull/61289), [#60695](https://github.com/PaddlePaddle/Paddle/pull/60695), [#61039](https://github.com/PaddlePaddle/Paddle/pull/61039), [#61963](https://github.com/PaddlePaddle/Paddle/pull/61963), [#62118](https://github.com/PaddlePaddle/Paddle/pull/62118), [#62797](https://github.com/PaddlePaddle/Paddle/pull/62797), [#62807](https://github.com/PaddlePaddle/Paddle/pull/62807), [#62887](https://github.com/PaddlePaddle/Paddle/pull/62887), [#62830](https://github.com/PaddlePaddle/Paddle/pull/62830), [#62849](https://github.com/PaddlePaddle/Paddle/pull/62849), [#62750](https://github.com/PaddlePaddle/Paddle/pull/62750), [#62965](https://github.com/PaddlePaddle/Paddle/pull/62965), [#59742](https://github.com/PaddlePaddle/Paddle/pull/59742), [#59867](https://github.com/PaddlePaddle/Paddle/pull/59867), [#60836](https://github.com/PaddlePaddle/Paddle/pull/60836), [#60902](https://github.com/PaddlePaddle/Paddle/pull/60902), [#61228](https://github.com/PaddlePaddle/Paddle/pull/61228), [#60037](https://github.com/PaddlePaddle/Paddle/pull/60037), [#60079](https://github.com/PaddlePaddle/Paddle/pull/60079), [#60173](https://github.com/PaddlePaddle/Paddle/pull/60173), [#60373](https://github.com/PaddlePaddle/Paddle/pull/60373), 
[#60380](https://github.com/PaddlePaddle/Paddle/pull/60380), [#60381](https://github.com/PaddlePaddle/Paddle/pull/60381), [#60750](https://github.com/PaddlePaddle/Paddle/pull/60750), [#61065](https://github.com/PaddlePaddle/Paddle/pull/61065), [#61122](https://github.com/PaddlePaddle/Paddle/pull/61122), [#61074](https://github.com/PaddlePaddle/Paddle/pull/61074), [#61204](https://github.com/PaddlePaddle/Paddle/pull/61204), [#61191](https://github.com/PaddlePaddle/Paddle/pull/61191), [#61182](https://github.com/PaddlePaddle/Paddle/pull/61182), [#61219](https://github.com/PaddlePaddle/Paddle/pull/61219), [#61296](https://github.com/PaddlePaddle/Paddle/pull/61296), [#61503](https://github.com/PaddlePaddle/Paddle/pull/61503), [#61484](https://github.com/PaddlePaddle/Paddle/pull/61484), [#61513](https://github.com/PaddlePaddle/Paddle/pull/61513), [#61476](https://github.com/PaddlePaddle/Paddle/pull/61476), [#61510](https://github.com/PaddlePaddle/Paddle/pull/61510), [#61511](https://github.com/PaddlePaddle/Paddle/pull/61511), [#61526](https://github.com/PaddlePaddle/Paddle/pull/61526), [#61524](https://github.com/PaddlePaddle/Paddle/pull/61524), [#61525](https://github.com/PaddlePaddle/Paddle/pull/61525), [#61466](https://github.com/PaddlePaddle/Paddle/pull/61466), [#61497](https://github.com/PaddlePaddle/Paddle/pull/61497), [#61538](https://github.com/PaddlePaddle/Paddle/pull/61538), [#61533](https://github.com/PaddlePaddle/Paddle/pull/61533), [#61530](https://github.com/PaddlePaddle/Paddle/pull/61530), [#61468](https://github.com/PaddlePaddle/Paddle/pull/61468), [#61527](https://github.com/PaddlePaddle/Paddle/pull/61527), [#61535](https://github.com/PaddlePaddle/Paddle/pull/61535), [#61512](https://github.com/PaddlePaddle/Paddle/pull/61512), [#61531](https://github.com/PaddlePaddle/Paddle/pull/61531), [#61539](https://github.com/PaddlePaddle/Paddle/pull/61539), [#61532](https://github.com/PaddlePaddle/Paddle/pull/61532), 
[#61521](https://github.com/PaddlePaddle/Paddle/pull/61521), [#61517](https://github.com/PaddlePaddle/Paddle/pull/61517), [#61518](https://github.com/PaddlePaddle/Paddle/pull/61518), [#61550](https://github.com/PaddlePaddle/Paddle/pull/61550), [#61545](https://github.com/PaddlePaddle/Paddle/pull/61545), [#61548](https://github.com/PaddlePaddle/Paddle/pull/61548), [#61519](https://github.com/PaddlePaddle/Paddle/pull/61519), [#61549](https://github.com/PaddlePaddle/Paddle/pull/61549), [#61574](https://github.com/PaddlePaddle/Paddle/pull/61574), [#61585](https://github.com/PaddlePaddle/Paddle/pull/61585), [#61581](https://github.com/PaddlePaddle/Paddle/pull/61581), [#61553](https://github.com/PaddlePaddle/Paddle/pull/61553), [#61504](https://github.com/PaddlePaddle/Paddle/pull/61504), [#61603](https://github.com/PaddlePaddle/Paddle/pull/61603), [#61534](https://github.com/PaddlePaddle/Paddle/pull/61534), [#61567](https://github.com/PaddlePaddle/Paddle/pull/61567), [#61523](https://github.com/PaddlePaddle/Paddle/pull/61523), [#61565](https://github.com/PaddlePaddle/Paddle/pull/61565), [#61564](https://github.com/PaddlePaddle/Paddle/pull/61564), [#61707](https://github.com/PaddlePaddle/Paddle/pull/61707), [#61560](https://github.com/PaddlePaddle/Paddle/pull/61560), [#61684](https://github.com/PaddlePaddle/Paddle/pull/61684), [#61706](https://github.com/PaddlePaddle/Paddle/pull/61706), [#61724](https://github.com/PaddlePaddle/Paddle/pull/61724), [#61719](https://github.com/PaddlePaddle/Paddle/pull/61719), [#61729](https://github.com/PaddlePaddle/Paddle/pull/61729), [#61763](https://github.com/PaddlePaddle/Paddle/pull/61763), [#61755](https://github.com/PaddlePaddle/Paddle/pull/61755), [#61737](https://github.com/PaddlePaddle/Paddle/pull/61737), [#61750](https://github.com/PaddlePaddle/Paddle/pull/61750), [#61753](https://github.com/PaddlePaddle/Paddle/pull/61753), [#61756](https://github.com/PaddlePaddle/Paddle/pull/61756), 
[#61777](https://github.com/PaddlePaddle/Paddle/pull/61777), [#61758](https://github.com/PaddlePaddle/Paddle/pull/61758), [#61731](https://github.com/PaddlePaddle/Paddle/pull/61731), [#61771](https://github.com/PaddlePaddle/Paddle/pull/61771), [#61739](https://github.com/PaddlePaddle/Paddle/pull/61739), [#61559](https://github.com/PaddlePaddle/Paddle/pull/61559), [#61717](https://github.com/PaddlePaddle/Paddle/pull/61717), [#61733](https://github.com/PaddlePaddle/Paddle/pull/61733), [#61563](https://github.com/PaddlePaddle/Paddle/pull/61563), [#61546](https://github.com/PaddlePaddle/Paddle/pull/61546), [#61566](https://github.com/PaddlePaddle/Paddle/pull/61566), [#61562](https://github.com/PaddlePaddle/Paddle/pull/61562), [#61793](https://github.com/PaddlePaddle/Paddle/pull/61793), [#61902](https://github.com/PaddlePaddle/Paddle/pull/61902), [#61905](https://github.com/PaddlePaddle/Paddle/pull/61905), [#61904](https://github.com/PaddlePaddle/Paddle/pull/61904), [#62227](https://github.com/PaddlePaddle/Paddle/pull/62227), [#62332](https://github.com/PaddlePaddle/Paddle/pull/62332), [#62653](https://github.com/PaddlePaddle/Paddle/pull/62653), [#62681](https://github.com/PaddlePaddle/Paddle/pull/62681), [#62709](https://github.com/PaddlePaddle/Paddle/pull/62709), [#62794](https://github.com/PaddlePaddle/Paddle/pull/62794), [#62938](https://github.com/PaddlePaddle/Paddle/pull/62938), [#63185](https://github.com/PaddlePaddle/Paddle/pull/63185), [#63754](https://github.com/PaddlePaddle/Paddle/pull/63754), [#63769](https://github.com/PaddlePaddle/Paddle/pull/63769), [#63793](https://github.com/PaddlePaddle/Paddle/pull/63793), [#63830](https://github.com/PaddlePaddle/Paddle/pull/63830), [#63939](https://github.com/PaddlePaddle/Paddle/pull/63939), [#64340](https://github.com/PaddlePaddle/Paddle/pull/64340), [#64657](https://github.com/PaddlePaddle/Paddle/pull/64657), [#62527](https://github.com/PaddlePaddle/Paddle/pull/62527), 
[#64088](https://github.com/PaddlePaddle/Paddle/pull/64088), [#60203](https://github.com/PaddlePaddle/Paddle/pull/60203), [#60372](https://github.com/PaddlePaddle/Paddle/pull/60372), [#60685](https://github.com/PaddlePaddle/Paddle/pull/60685), [#60815](https://github.com/PaddlePaddle/Paddle/pull/60815), [#60791](https://github.com/PaddlePaddle/Paddle/pull/60791), [#60864](https://github.com/PaddlePaddle/Paddle/pull/60864), [#60851](https://github.com/PaddlePaddle/Paddle/pull/60851), [#60844](https://github.com/PaddlePaddle/Paddle/pull/60844), [#60694](https://github.com/PaddlePaddle/Paddle/pull/60694), [#60855](https://github.com/PaddlePaddle/Paddle/pull/60855), [#60869](https://github.com/PaddlePaddle/Paddle/pull/60869), [#60948](https://github.com/PaddlePaddle/Paddle/pull/60948), [#61042](https://github.com/PaddlePaddle/Paddle/pull/61042), [#61455](https://github.com/PaddlePaddle/Paddle/pull/61455), [#61580](https://github.com/PaddlePaddle/Paddle/pull/61580), [#61589](https://github.com/PaddlePaddle/Paddle/pull/61589), [#61609](https://github.com/PaddlePaddle/Paddle/pull/61609), [#61616](https://github.com/PaddlePaddle/Paddle/pull/61616), [#61715](https://github.com/PaddlePaddle/Paddle/pull/61715), [#61716](https://github.com/PaddlePaddle/Paddle/pull/61716), [#61759](https://github.com/PaddlePaddle/Paddle/pull/61759), [#61555](https://github.com/PaddlePaddle/Paddle/pull/61555), [#61492](https://github.com/PaddlePaddle/Paddle/pull/61492), [#61805](https://github.com/PaddlePaddle/Paddle/pull/61805), [#61712](https://github.com/PaddlePaddle/Paddle/pull/61712), [#61615](https://github.com/PaddlePaddle/Paddle/pull/61615), [#61713](https://github.com/PaddlePaddle/Paddle/pull/61713), [#62129](https://github.com/PaddlePaddle/Paddle/pull/62129), [#59294](https://github.com/PaddlePaddle/Paddle/pull/59294), [#59865](https://github.com/PaddlePaddle/Paddle/pull/59865), [#60270](https://github.com/PaddlePaddle/Paddle/pull/60270), 
[#60547](https://github.com/PaddlePaddle/Paddle/pull/60547), [#60698](https://github.com/PaddlePaddle/Paddle/pull/60698), [#60762](https://github.com/PaddlePaddle/Paddle/pull/60762), [#60753](https://github.com/PaddlePaddle/Paddle/pull/60753), [#60966](https://github.com/PaddlePaddle/Paddle/pull/60966), [#60976](https://github.com/PaddlePaddle/Paddle/pull/60976), [#61100](https://github.com/PaddlePaddle/Paddle/pull/61100), [#61203](https://github.com/PaddlePaddle/Paddle/pull/61203), [#61210](https://github.com/PaddlePaddle/Paddle/pull/61210), [#61424](https://github.com/PaddlePaddle/Paddle/pull/61424), [#61213](https://github.com/PaddlePaddle/Paddle/pull/61213), [#61275](https://github.com/PaddlePaddle/Paddle/pull/61275), [#61276](https://github.com/PaddlePaddle/Paddle/pull/61276), [#61279](https://github.com/PaddlePaddle/Paddle/pull/61279), [#61292](https://github.com/PaddlePaddle/Paddle/pull/61292), [#61295](https://github.com/PaddlePaddle/Paddle/pull/61295), [#61298](https://github.com/PaddlePaddle/Paddle/pull/61298), [#61299](https://github.com/PaddlePaddle/Paddle/pull/61299), [#61301](https://github.com/PaddlePaddle/Paddle/pull/61301), [#61302](https://github.com/PaddlePaddle/Paddle/pull/61302), [#61329](https://github.com/PaddlePaddle/Paddle/pull/61329), [#61804](https://github.com/PaddlePaddle/Paddle/pull/61804), [#62745](https://github.com/PaddlePaddle/Paddle/pull/62745), [#62909](https://github.com/PaddlePaddle/Paddle/pull/62909), [#64247](https://github.com/PaddlePaddle/Paddle/pull/64247), [#64308](https://github.com/PaddlePaddle/Paddle/pull/64308), [#60690](https://github.com/PaddlePaddle/Paddle/pull/60690), [#61149](https://github.com/PaddlePaddle/Paddle/pull/61149), [#61145](https://github.com/PaddlePaddle/Paddle/pull/61145), [#61193](https://github.com/PaddlePaddle/Paddle/pull/61193), [#61207](https://github.com/PaddlePaddle/Paddle/pull/61207), [#61229](https://github.com/PaddlePaddle/Paddle/pull/61229), 
[#61236](https://github.com/PaddlePaddle/Paddle/pull/61236), [#61244](https://github.com/PaddlePaddle/Paddle/pull/61244), [#61242](https://github.com/PaddlePaddle/Paddle/pull/61242), [#61263](https://github.com/PaddlePaddle/Paddle/pull/61263), [#61370](https://github.com/PaddlePaddle/Paddle/pull/61370), [#61410](https://github.com/PaddlePaddle/Paddle/pull/61410), [#61480](https://github.com/PaddlePaddle/Paddle/pull/61480), [#61522](https://github.com/PaddlePaddle/Paddle/pull/61522), [#61540](https://github.com/PaddlePaddle/Paddle/pull/61540), [#61520](https://github.com/PaddlePaddle/Paddle/pull/61520), [#61625](https://github.com/PaddlePaddle/Paddle/pull/61625), [#61700](https://github.com/PaddlePaddle/Paddle/pull/61700), [#61708](https://github.com/PaddlePaddle/Paddle/pull/61708), [#61736](https://github.com/PaddlePaddle/Paddle/pull/61736), [#61889](https://github.com/PaddlePaddle/Paddle/pull/61889), [#61952](https://github.com/PaddlePaddle/Paddle/pull/61952), [#62033](https://github.com/PaddlePaddle/Paddle/pull/62033), [#62637](https://github.com/PaddlePaddle/Paddle/pull/62637), [#62777](https://github.com/PaddlePaddle/Paddle/pull/62777), [#62779](https://github.com/PaddlePaddle/Paddle/pull/62779), [#63226](https://github.com/PaddlePaddle/Paddle/pull/63226), [#63287](https://github.com/PaddlePaddle/Paddle/pull/63287), [#63398](https://github.com/PaddlePaddle/Paddle/pull/63398), [#63431](https://github.com/PaddlePaddle/Paddle/pull/63431), [#64000](https://github.com/PaddlePaddle/Paddle/pull/64000), [#64058](https://github.com/PaddlePaddle/Paddle/pull/64058), [#64059](https://github.com/PaddlePaddle/Paddle/pull/64059), [#64063](https://github.com/PaddlePaddle/Paddle/pull/64063), [#64066](https://github.com/PaddlePaddle/Paddle/pull/64066), [#64089](https://github.com/PaddlePaddle/Paddle/pull/64089), [#64170](https://github.com/PaddlePaddle/Paddle/pull/64170), [#64235](https://github.com/PaddlePaddle/Paddle/pull/64235), 
[#64237](https://github.com/PaddlePaddle/Paddle/pull/64237), [#64243](https://github.com/PaddlePaddle/Paddle/pull/64243), [#64242](https://github.com/PaddlePaddle/Paddle/pull/64242), [#64286](https://github.com/PaddlePaddle/Paddle/pull/64286), [#64322](https://github.com/PaddlePaddle/Paddle/pull/64322), [#64317](https://github.com/PaddlePaddle/Paddle/pull/64317), [#64490](https://github.com/PaddlePaddle/Paddle/pull/64490), [#60138](https://github.com/PaddlePaddle/Paddle/pull/60138), [#62384](https://github.com/PaddlePaddle/Paddle/pull/62384), [#59702](https://github.com/PaddlePaddle/Paddle/pull/59702), [#60341](https://github.com/PaddlePaddle/Paddle/pull/60341), [#60636](https://github.com/PaddlePaddle/Paddle/pull/60636), [#60714](https://github.com/PaddlePaddle/Paddle/pull/60714), [#60716](https://github.com/PaddlePaddle/Paddle/pull/60716), [#60700](https://github.com/PaddlePaddle/Paddle/pull/60700), [#60702](https://github.com/PaddlePaddle/Paddle/pull/60702), [#60704](https://github.com/PaddlePaddle/Paddle/pull/60704), [#60715](https://github.com/PaddlePaddle/Paddle/pull/60715), [#60713](https://github.com/PaddlePaddle/Paddle/pull/60713), [#60711](https://github.com/PaddlePaddle/Paddle/pull/60711), [#60724](https://github.com/PaddlePaddle/Paddle/pull/60724), [#60803](https://github.com/PaddlePaddle/Paddle/pull/60803), [#61331](https://github.com/PaddlePaddle/Paddle/pull/61331), [#63286](https://github.com/PaddlePaddle/Paddle/pull/63286), [#60473](https://github.com/PaddlePaddle/Paddle/pull/60473), [#61046](https://github.com/PaddlePaddle/Paddle/pull/61046), [#61859](https://github.com/PaddlePaddle/Paddle/pull/61859), [#60675](https://github.com/PaddlePaddle/Paddle/pull/60675), [#60719](https://github.com/PaddlePaddle/Paddle/pull/60719), [#62863](https://github.com/PaddlePaddle/Paddle/pull/62863), [#63013](https://github.com/PaddlePaddle/Paddle/pull/63013), [#61293](https://github.com/PaddlePaddle/Paddle/pull/61293), 
[#62781](https://github.com/PaddlePaddle/Paddle/pull/62781), [#62935](https://github.com/PaddlePaddle/Paddle/pull/62935), [#63014](https://github.com/PaddlePaddle/Paddle/pull/63014), [#64203](https://github.com/PaddlePaddle/Paddle/pull/64203), [#63349](https://github.com/PaddlePaddle/Paddle/pull/63349), [#59572](https://github.com/PaddlePaddle/Paddle/pull/59572), [#59911](https://github.com/PaddlePaddle/Paddle/pull/59911), [#59861](https://github.com/PaddlePaddle/Paddle/pull/59861), [#60014](https://github.com/PaddlePaddle/Paddle/pull/60014), [#59913](https://github.com/PaddlePaddle/Paddle/pull/59913), [#58889](https://github.com/PaddlePaddle/Paddle/pull/58889), [#60114](https://github.com/PaddlePaddle/Paddle/pull/60114), [#59928](https://github.com/PaddlePaddle/Paddle/pull/59928), [#60180](https://github.com/PaddlePaddle/Paddle/pull/60180), [#60168](https://github.com/PaddlePaddle/Paddle/pull/60168), [#60166](https://github.com/PaddlePaddle/Paddle/pull/60166), [#60250](https://github.com/PaddlePaddle/Paddle/pull/60250), [#60247](https://github.com/PaddlePaddle/Paddle/pull/60247), [#60172](https://github.com/PaddlePaddle/Paddle/pull/60172), [#59661](https://github.com/PaddlePaddle/Paddle/pull/59661), [#58880](https://github.com/PaddlePaddle/Paddle/pull/58880), [#60291](https://github.com/PaddlePaddle/Paddle/pull/60291), [#58881](https://github.com/PaddlePaddle/Paddle/pull/58881), [#58955](https://github.com/PaddlePaddle/Paddle/pull/58955), [#58684](https://github.com/PaddlePaddle/Paddle/pull/58684), [#58708](https://github.com/PaddlePaddle/Paddle/pull/58708), [#60323](https://github.com/PaddlePaddle/Paddle/pull/60323), [#58762](https://github.com/PaddlePaddle/Paddle/pull/58762), [#60048](https://github.com/PaddlePaddle/Paddle/pull/60048), [#60345](https://github.com/PaddlePaddle/Paddle/pull/60345), [#60325](https://github.com/PaddlePaddle/Paddle/pull/60325), [#59627](https://github.com/PaddlePaddle/Paddle/pull/59627), 
[#60416](https://github.com/PaddlePaddle/Paddle/pull/60416), [#60434](https://github.com/PaddlePaddle/Paddle/pull/60434), [#59801](https://github.com/PaddlePaddle/Paddle/pull/59801), [#60619](https://github.com/PaddlePaddle/Paddle/pull/60619), [#60445](https://github.com/PaddlePaddle/Paddle/pull/60445), [#60666](https://github.com/PaddlePaddle/Paddle/pull/60666), [#60353](https://github.com/PaddlePaddle/Paddle/pull/60353), [#60733](https://github.com/PaddlePaddle/Paddle/pull/60733), [#60693](https://github.com/PaddlePaddle/Paddle/pull/60693), [#60350](https://github.com/PaddlePaddle/Paddle/pull/60350), [#61096](https://github.com/PaddlePaddle/Paddle/pull/61096), [#61121](https://github.com/PaddlePaddle/Paddle/pull/61121), [#61164](https://github.com/PaddlePaddle/Paddle/pull/61164), [#62054](https://github.com/PaddlePaddle/Paddle/pull/62054), [#62136](https://github.com/PaddlePaddle/Paddle/pull/62136), [#62508](https://github.com/PaddlePaddle/Paddle/pull/62508), [#62988](https://github.com/PaddlePaddle/Paddle/pull/62988), [#63472](https://github.com/PaddlePaddle/Paddle/pull/63472), [#60193](https://github.com/PaddlePaddle/Paddle/pull/60193), [#60197](https://github.com/PaddlePaddle/Paddle/pull/60197), [#60198](https://github.com/PaddlePaddle/Paddle/pull/60198), [#60346](https://github.com/PaddlePaddle/Paddle/pull/60346), [#60318](https://github.com/PaddlePaddle/Paddle/pull/60318), [#60645](https://github.com/PaddlePaddle/Paddle/pull/60645), [#60650](https://github.com/PaddlePaddle/Paddle/pull/60650), [#60660](https://github.com/PaddlePaddle/Paddle/pull/60660), [#60706](https://github.com/PaddlePaddle/Paddle/pull/60706), [#60799](https://github.com/PaddlePaddle/Paddle/pull/60799), [#60837](https://github.com/PaddlePaddle/Paddle/pull/60837), [#60817](https://github.com/PaddlePaddle/Paddle/pull/60817), [#60820](https://github.com/PaddlePaddle/Paddle/pull/60820), [#60894](https://github.com/PaddlePaddle/Paddle/pull/60894), 
[#61079](https://github.com/PaddlePaddle/Paddle/pull/61079), [#61087](https://github.com/PaddlePaddle/Paddle/pull/61087), [#61073](https://github.com/PaddlePaddle/Paddle/pull/61073), [#61072](https://github.com/PaddlePaddle/Paddle/pull/61072), [#61127](https://github.com/PaddlePaddle/Paddle/pull/61127), [#61097](https://github.com/PaddlePaddle/Paddle/pull/61097), [#61365](https://github.com/PaddlePaddle/Paddle/pull/61365), [#61456](https://github.com/PaddlePaddle/Paddle/pull/61456), [#61846](https://github.com/PaddlePaddle/Paddle/pull/61846), [#62217](https://github.com/PaddlePaddle/Paddle/pull/62217), [#62519](https://github.com/PaddlePaddle/Paddle/pull/62519), [#62881](https://github.com/PaddlePaddle/Paddle/pull/62881), [#62880](https://github.com/PaddlePaddle/Paddle/pull/62880), [#59723](https://github.com/PaddlePaddle/Paddle/pull/59723), [#59722](https://github.com/PaddlePaddle/Paddle/pull/59722), [#59797](https://github.com/PaddlePaddle/Paddle/pull/59797), [#59960](https://github.com/PaddlePaddle/Paddle/pull/59960), [#59761](https://github.com/PaddlePaddle/Paddle/pull/59761), [#59996](https://github.com/PaddlePaddle/Paddle/pull/59996), [#60009](https://github.com/PaddlePaddle/Paddle/pull/60009), [#58896](https://github.com/PaddlePaddle/Paddle/pull/58896), [#60051](https://github.com/PaddlePaddle/Paddle/pull/60051), [#60410](https://github.com/PaddlePaddle/Paddle/pull/60410), [#60420](https://github.com/PaddlePaddle/Paddle/pull/60420), [#60548](https://github.com/PaddlePaddle/Paddle/pull/60548), [#60575](https://github.com/PaddlePaddle/Paddle/pull/60575), [#60726](https://github.com/PaddlePaddle/Paddle/pull/60726), [#60809](https://github.com/PaddlePaddle/Paddle/pull/60809), [#61346](https://github.com/PaddlePaddle/Paddle/pull/61346), [#61222](https://github.com/PaddlePaddle/Paddle/pull/61222), [#61099](https://github.com/PaddlePaddle/Paddle/pull/61099), [#62254](https://github.com/PaddlePaddle/Paddle/pull/62254), 
[#62269](https://github.com/PaddlePaddle/Paddle/pull/62269), [#62362](https://github.com/PaddlePaddle/Paddle/pull/62362)
+- Improve PaddlePaddle's underlying error-checking mechanism to make debugging easier for developers. [#62571](https://github.com/PaddlePaddle/Paddle/pull/62571), [#62602](https://github.com/PaddlePaddle/Paddle/pull/62602), [#60903](https://github.com/PaddlePaddle/Paddle/pull/60903), [#64695](https://github.com/PaddlePaddle/Paddle/pull/64695), [#59907](https://github.com/PaddlePaddle/Paddle/pull/59907), [#62018](https://github.com/PaddlePaddle/Paddle/pull/62018), [#62839](https://github.com/PaddlePaddle/Paddle/pull/62839), [#60651](https://github.com/PaddlePaddle/Paddle/pull/60651), [#61488](https://github.com/PaddlePaddle/Paddle/pull/61488), [#64064](https://github.com/PaddlePaddle/Paddle/pull/64064), [#63192](https://github.com/PaddlePaddle/Paddle/pull/63192), [#63525](https://github.com/PaddlePaddle/Paddle/pull/63525)
-- Implemented timely release of last layer of PP strategy, to save video memory. [#54505](https://github.com/PaddlePaddle/Paddle/pull/54505)
-- In MP strategy Tensor fusion, supported incoming params group to enhance Tensor fusion function. Improved allreduce asynchronous communication performance, and enhanced training performance through overlap of computation and communication. [#57690](https://github.com/PaddlePaddle/Paddle/pull/57690),[#55662](https://github.com/PaddlePaddle/Paddle/pull/55662)
-- In Sharding strategy, carried out overlap for reverse computation and gradient communication, to improve training performance. For Sharding stage1, added Tensor fusion and fuse grad clip, and optimizer, to improve computational efficiency. Supported overlap between VPP and DP/Sharding Stage1, to improve communication and computation parallelism. Optimized performance of Sharding Stage1 under FP16. Check only gradient responsible for this sharding rank in the check finite stage, to reduce computation overhead; added environment variables to control whether Optimize is performed to save video memory, to achieve use of fewer resources for model training debugging. [#55598](https://github.com/PaddlePaddle/Paddle/pull/55598),[#55427](https://github.com/PaddlePaddle/Paddle/pull/55427),[#56063](https://github.com/PaddlePaddle/Paddle/pull/56063),[#55766](https://github.com/PaddlePaddle/Paddle/pull/55766),[#59848](https://github.com/PaddlePaddle/Paddle/pull/59848)
-- In Hybrid Parallel strategy, arranged Tensor fusion under PP/VPP to pre-run, to solve the problem of extra overhead of runtime fuse on video memory. Improved model training performance by reducing non-essential synchronous memcpy. [#54403](https://github.com/PaddlePaddle/Paddle/pull/54403),[#57215](https://github.com/PaddlePaddle/Paddle/pull/57215)
+### Vulnerability Fixes
-#### Bug Fix
+- Fix potential security vulnerabilities. [#59957](https://github.com/PaddlePaddle/Paddle/pull/59957), [#61032](https://github.com/PaddlePaddle/Paddle/pull/61032), [#61356](https://github.com/PaddlePaddle/Paddle/pull/61356), [#61573](https://github.com/PaddlePaddle/Paddle/pull/61573), [#61671](https://github.com/PaddlePaddle/Paddle/pull/61671), [#62345](https://github.com/PaddlePaddle/Paddle/pull/62345), [#60097](https://github.com/PaddlePaddle/Paddle/pull/60097), [#61161](https://github.com/PaddlePaddle/Paddle/pull/61161), [#61294](https://github.com/PaddlePaddle/Paddle/pull/61294), [#61349](https://github.com/PaddlePaddle/Paddle/pull/61349), [#61344](https://github.com/PaddlePaddle/Paddle/pull/61344), [#61162](https://github.com/PaddlePaddle/Paddle/pull/61162), [#61285](https://github.com/PaddlePaddle/Paddle/pull/61285), [#61826](https://github.com/PaddlePaddle/Paddle/pull/61826), [#59967](https://github.com/PaddlePaddle/Paddle/pull/59967), [#59976](https://github.com/PaddlePaddle/Paddle/pull/59976), [#59979](https://github.com/PaddlePaddle/Paddle/pull/59979), [#60527](https://github.com/PaddlePaddle/Paddle/pull/60527), [#60646](https://github.com/PaddlePaddle/Paddle/pull/60646), [#61827](https://github.com/PaddlePaddle/Paddle/pull/61827)
-- Fixed 13 bugs in PP, Launch function, MP strategy, and fuse_rope, to enhance stability of distributed strategies. At mechanism level, fixed errors of inplace and tensor reference to improve stability. [#55116](https://github.com/PaddlePaddle/Paddle/pull/55116),[#55782](https://github.com/PaddlePaddle/Paddle/pull/55782),[#59609](https://github.com/PaddlePaddle/Paddle/pull/59609),[#57394](https://github.com/PaddlePaddle/Paddle/pull/57394),[#55864](https://github.com/PaddlePaddle/Paddle/pull/55864),[#58482](https://github.com/PaddlePaddle/Paddle/pull/58482),[#54571](https://github.com/PaddlePaddle/Paddle/pull/54571),[#55896](https://github.com/PaddlePaddle/Paddle/pull/55896),[#54648](https://github.com/PaddlePaddle/Paddle/pull/54648),[#58307](https://github.com/PaddlePaddle/Paddle/pull/58307),[#55679](https://github.com/PaddlePaddle/Paddle/pull/55679),[#58133](https://github.com/PaddlePaddle/Paddle/pull/58133),[#58408](https://github.com/PaddlePaddle/Paddle/pull/58408),[#59707](https://github.com/PaddlePaddle/Paddle/pull/59707),[#55342](https://github.com/PaddlePaddle/Paddle/pull/55342),[#54703](https://github.com/PaddlePaddle/Paddle/pull/54703),[#54869](https://github.com/PaddlePaddle/Paddle/pull/54869),[#55568](https://github.com/PaddlePaddle/Paddle/pull/55568),[#55233](https://github.com/PaddlePaddle/Paddle/pull/55233),[#56418](https://github.com/PaddlePaddle/Paddle/pull/56418),[#56428](https://github.com/PaddlePaddle/Paddle/pull/56428),[#56892](https://github.com/PaddlePaddle/Paddle/pull/56892),[#57192](https://github.com/PaddlePaddle/Paddle/pull/57192),[#59161](https://github.com/PaddlePaddle/Paddle/pull/59161),[#59340](https://github.com/PaddlePaddle/Paddle/pull/59340),[#57006](https://github.com/PaddlePaddle/Paddle/pull/57006),[#57353](https://github.com/PaddlePaddle/Paddle/pull/57353),[#57352](https://github.com/PaddlePaddle/Paddle/pull/57352),[#59088](https://github.com/PaddlePaddle/Paddle/pull/59088)
-- Fixed bug that PP strategy can't release single-layer output in time. Fixed the bug that initialization process may Hang. [#54624](https://github.com/PaddlePaddle/Paddle/pull/54624),[#58844](https://github.com/PaddlePaddle/Paddle/pull/58844),[#54673](https://github.com/PaddlePaddle/Paddle/pull/54673),[#58376](https://github.com/PaddlePaddle/Paddle/pull/58376)
-- Fixed the bug calculation is wrong when input data type is not uniform under MP strategy. Fixed the bug of parameter synchronization under MP strategy. Fixed the bug user input config is not used correctly. [#58858](https://github.com/PaddlePaddle/Paddle/pull/58858),[#57918](https://github.com/PaddlePaddle/Paddle/pull/57918),[#58037](https://github.com/PaddlePaddle/Paddle/pull/58037)
-- Unified judgment method of dygraph and dynamic mode. [#54633](https://github.com/PaddlePaddle/Paddle/pull/54633)
-- Fixed the bug shape of sin and cos in fuse_rope is not correct. [#56132](https://github.com/PaddlePaddle/Paddle/pull/56132)
-- Fixed the bug task fails to due to long endpoints in Luanch distributed scenarios. Fixed the bug endpoints may be out of order. [#55011](https://github.com/PaddlePaddle/Paddle/pull/55011),[#55478](https://github.com/PaddlePaddle/Paddle/pull/55478)
-- Fixed the bug MEA function may cause segmentation fault error. [#55408](https://github.com/PaddlePaddle/Paddle/pull/55408)
+### Deprecated Features
-### Auto parallel
+- Clean up deprecated executors and related logic to reduce redundant code. [#64822](https://github.com/PaddlePaddle/Paddle/pull/64822), [#60941](https://github.com/PaddlePaddle/Paddle/pull/60941)
-This release fully optimizes Auto Parallel programming paradigm with dynamic-static unification to simplify programming complexity for developers. Developers do not need to understand complex concepts and APIs in manual parallel programming paradigm, such as row-parallel, column-parallel, and so on. A small amount of tensor distribution annotations is required to build a hybrid parallel model. Framework will handle the derivation of distribution states of all tensors and operators, and adding appropriate communication operators. Meanwhile, it supports the dynamic to static distributed training by just one extra code changed, enabling developers to efficiently and easily implement any hybrid parallel strategy. This can significantly reduce development costs of hybrid parallel training codes.
+## Compiler Infrastructure for Neural Networks (CINN)
-#### Improved auto parallel core functions
+In version 3.0, the compiler architecture has been significantly upgraded. A symbolic automatic derivation and simplification system has been built on Shape Dialect; it supports symbolic expressions and constraint construction, and enables end-to-end compiler execution under dynamic shapes. CINN has also upgraded its automatic subgraph fusion and Pass Pipeline mechanisms, and merged the core modules and iteration paths for dynamic and static shapes, making the architecture clearer and more unified. In this version, important back-end modules such as AST Compute, Schedule strategy, and Tiling have been refactored to improve the compiler's general optimization capability. The correctness and speedup of training and inference under dynamic shapes have been verified on subgraphs of PaddlePaddle Industry Suite models and on typical large models such as Llama2-7B and Stable Diffusion.
-- Implemented auto parallel core APIs such as process_mesh, placement, shard_tensor, reshard, dtensor_from_fn, unshard_dtensor, shard_layer, to_static, and so on. [#55494](https://github.com/PaddlePaddle/Paddle/pull/55494),[#59059](https://github.com/PaddlePaddle/Paddle/pull/59059),[#56561](https://github.com/PaddlePaddle/Paddle/pull/56561),[#54425](https://github.com/PaddlePaddle/Paddle/pull/54425),[#59557](https://github.com/PaddlePaddle/Paddle/pull/59557),[#59682](https://github.com/PaddlePaddle/Paddle/pull/59682),[#56565](https://github.com/PaddlePaddle/Paddle/pull/56565),[#59862](https://github.com/PaddlePaddle/Paddle/pull/59862),[#59856](https://github.com/PaddlePaddle/Paddle/pull/59856),[#59342](https://github.com/PaddlePaddle/Paddle/pull/59342),[#59575](https://github.com/PaddlePaddle/Paddle/pull/59575),[#57604](https://github.com/PaddlePaddle/Paddle/pull/57604),[#57293](https://github.com/PaddlePaddle/Paddle/pull/57293),[#57278](https://github.com/PaddlePaddle/Paddle/pull/57278)
-- Implemented Sharding derivation rules based on Enisum expressions, and completed 20+ classes of operator Sharding derivation rules, which covers LLaMA, GPT and other transformer-like large language models. [#55196](https://github.com/PaddlePaddle/Paddle/pull/55196),[#53863](https://github.com/PaddlePaddle/Paddle/pull/53863),[#56257](https://github.com/PaddlePaddle/Paddle/pull/56257),[#55394](https://github.com/PaddlePaddle/Paddle/pull/55394),[#54810](https://github.com/PaddlePaddle/Paddle/pull/54810),[#55508](https://github.com/PaddlePaddle/Paddle/pull/55508),[#56257](https://github.com/PaddlePaddle/Paddle/pull/56257),[#57813](https://github.com/PaddlePaddle/Paddle/pull/57813),[#58149](https://github.com/PaddlePaddle/Paddle/pull/58149),[#58506](https://github.com/PaddlePaddle/Paddle/pull/58506),[#58563](https://github.com/PaddlePaddle/Paddle/pull/58563),[#58360](https://github.com/PaddlePaddle/Paddle/pull/58360),[#58920](https://github.com/PaddlePaddle/Paddle/pull/58920),[#59050](https://github.com/PaddlePaddle/Paddle/pull/59050),[#58760](https://github.com/PaddlePaddle/Paddle/pull/58760),[#59083](https://github.com/PaddlePaddle/Paddle/pull/59083),[#59236](https://github.com/PaddlePaddle/Paddle/pull/59236),[#59350](https://github.com/PaddlePaddle/Paddle/pull/59350),[#59411](https://github.com/PaddlePaddle/Paddle/pull/59411),[#59260](https://github.com/PaddlePaddle/Paddle/pull/59260),[#54373](https://github.com/PaddlePaddle/Paddle/pull/54373),[#54991](https://github.com/PaddlePaddle/Paddle/pull/54991),[#55397](https://github.com/PaddlePaddle/Paddle/pull/55397),[#55350](https://github.com/PaddlePaddle/Paddle/pull/55350),[#55177](https://github.com/PaddlePaddle/Paddle/pull/55177),[#56443](https://github.com/PaddlePaddle/Paddle/pull/56443),[#58097](https://github.com/PaddlePaddle/Paddle/pull/58097),[#56509](https://github.com/PaddlePaddle/Paddle/pull/56509),[#56502](https://github.com/PaddlePaddle/Paddle/pull/56502),[#56504](https://github.com/PaddlePaddle/Paddle/pul
l/56504),[#56506](https://github.com/PaddlePaddle/Paddle/pull/56506),[#56507](https://github.com/PaddlePaddle/Paddle/pull/56507),[#56505](https://github.com/PaddlePaddle/Paddle/pull/56505),[#57176](https://github.com/PaddlePaddle/Paddle/pull/57176),[#57374](https://github.com/PaddlePaddle/Paddle/pull/57374),[#57573](https://github.com/PaddlePaddle/Paddle/pull/57573),[#57545](https://github.com/PaddlePaddle/Paddle/pull/57545),[#57875](https://github.com/PaddlePaddle/Paddle/pull/57875),[#57866](https://github.com/PaddlePaddle/Paddle/pull/57866),[#58854](https://github.com/PaddlePaddle/Paddle/pull/58854),[#59109](https://github.com/PaddlePaddle/Paddle/pull/59109),[#59185](https://github.com/PaddlePaddle/Paddle/pull/59185),[#58913](https://github.com/PaddlePaddle/Paddle/pull/58913),[#59547](https://github.com/PaddlePaddle/Paddle/pull/59547),[#58296](https://github.com/PaddlePaddle/Paddle/pull/58296),[#59545](https://github.com/PaddlePaddle/Paddle/pull/59545),[#59039](https://github.com/PaddlePaddle/Paddle/pull/59039),[#59002](https://github.com/PaddlePaddle/Paddle/pull/59002),[#58087](https://github.com/PaddlePaddle/Paddle/pull/58087),[#56367](https://github.com/PaddlePaddle/Paddle/pull/56367),[#57877](https://github.com/PaddlePaddle/Paddle/pull/57877),[#56839](https://github.com/PaddlePaddle/Paddle/pull/56839),[#59003](https://github.com/PaddlePaddle/Paddle/pull/59003),[#57269](https://github.com/PaddlePaddle/Paddle/pull/57269),[#55130](https://github.com/PaddlePaddle/Paddle/pull/55130),[#58474](https://github.com/PaddlePaddle/Paddle/pull/58474),[#57197](https://github.com/PaddlePaddle/Paddle/pull/57197),[#57467](https://github.com/PaddlePaddle/Paddle/pull/57467),[#57259](https://github.com/PaddlePaddle/Paddle/pull/57259),[#57280](https://github.com/PaddlePaddle/Paddle/pull/57280),[#56508](https://github.com/PaddlePaddle/Paddle/pull/56508)
-- Implemented distributed checkpoint storage and loading with dynamic-static unification. Supports ReShard upon arbitrary Sharding of storage and loading in a Sharding state. [#59659](https://github.com/PaddlePaddle/Paddle/pull/59659),[#59843](https://github.com/PaddlePaddle/Paddle/pull/59843),[#60033](https://github.com/PaddlePaddle/Paddle/pull/60033),[#60034](https://github.com/PaddlePaddle/Paddle/pull/60034)
+### New Features
-#### Enhanced semi-auto parallel capability of dynamic graph
+1. Upgrade the automatic subgraph fusion mechanism and propose a TrivialOp and ReduceOp fusion theory that supports a wider range of vertical and horizontal fusion, ensures the correctness and robustness of subgraph fusion, and makes fuller use of the fusion potential of the neural network compiler. ([#63340](https://github.com/PaddlePaddle/Paddle/pull/63340), [#63913](https://github.com/PaddlePaddle/Paddle/pull/63913), [#63579](https://github.com/PaddlePaddle/Paddle/pull/63579), [#63605](https://github.com/PaddlePaddle/Paddle/pull/63605), [#60769](https://github.com/PaddlePaddle/Paddle/pull/60769), [#62088](https://github.com/PaddlePaddle/Paddle/pull/62088), [#63124](https://github.com/PaddlePaddle/Paddle/pull/63124), [#63658](https://github.com/PaddlePaddle/Paddle/pull/63658), [#64557](https://github.com/PaddlePaddle/Paddle/pull/64557), [#63318](https://github.com/PaddlePaddle/Paddle/pull/63318), [#62545](https://github.com/PaddlePaddle/Paddle/pull/62545))
+2. Add symbolic derivation for dynamic shapes. Based on Shape Dialect, implement dynamic symbol construction, automatic derivation, constraint expression, symbol simplification, and other mechanisms; introduce the DimExpr concept; upgrade InferSymbolicShape support for 150+ typical primitive operators in the PaddlePaddle framework; and provide more information for compiler-supported training and inference under dynamic shapes. ([#60843](https://github.com/PaddlePaddle/Paddle/pull/60843), [#62662](https://github.com/PaddlePaddle/Paddle/pull/62662), [#63790](https://github.com/PaddlePaddle/Paddle/pull/63790), [#60098](https://github.com/PaddlePaddle/Paddle/pull/60098), [#60511](https://github.com/PaddlePaddle/Paddle/pull/60511), [#61232](https://github.com/PaddlePaddle/Paddle/pull/61232), [#61939](https://github.com/PaddlePaddle/Paddle/pull/61939), [#62798](https://github.com/PaddlePaddle/Paddle/pull/62798), [#62955](https://github.com/PaddlePaddle/Paddle/pull/62955), [#63029](https://github.com/PaddlePaddle/Paddle/pull/63029), [#60572](https://github.com/PaddlePaddle/Paddle/pull/60572), [#61035](https://github.com/PaddlePaddle/Paddle/pull/61035), [#61224](https://github.com/PaddlePaddle/Paddle/pull/61224), [#61587](https://github.com/PaddlePaddle/Paddle/pull/61587), [#61937](https://github.com/PaddlePaddle/Paddle/pull/61937), [#62314](https://github.com/PaddlePaddle/Paddle/pull/62314), [#62394](https://github.com/PaddlePaddle/Paddle/pull/62394), [#62569](https://github.com/PaddlePaddle/Paddle/pull/62569), [#62495](https://github.com/PaddlePaddle/Paddle/pull/62495), [#62844](https://github.com/PaddlePaddle/Paddle/pull/62844), [#63000](https://github.com/PaddlePaddle/Paddle/pull/63000), [#63016](https://github.com/PaddlePaddle/Paddle/pull/63016), [#64222](https://github.com/PaddlePaddle/Paddle/pull/64222), [#60129](https://github.com/PaddlePaddle/Paddle/pull/60129), [#60899](https://github.com/PaddlePaddle/Paddle/pull/60899), [#61342](https://github.com/PaddlePaddle/Paddle/pull/61342), [#61439](https://github.com/PaddlePaddle/Paddle/pull/61439), [#62766](https://github.com/PaddlePaddle/Paddle/pull/62766), [#61133](https://github.com/PaddlePaddle/Paddle/pull/61133), [#61430](https://github.com/PaddlePaddle/Paddle/pull/61430), [#61498](https://github.com/PaddlePaddle/Paddle/pull/61498), [#61680](https://github.com/PaddlePaddle/Paddle/pull/61680), [#63367](https://github.com/PaddlePaddle/Paddle/pull/63367), [#62151](https://github.com/PaddlePaddle/Paddle/pull/62151), [#62665](https://github.com/PaddlePaddle/Paddle/pull/62665), [#61407](https://github.com/PaddlePaddle/Paddle/pull/61407), [#61502](https://github.com/PaddlePaddle/Paddle/pull/61502), [#61655](https://github.com/PaddlePaddle/Paddle/pull/61655), [#64115](https://github.com/PaddlePaddle/Paddle/pull/64115), [#61791](https://github.com/PaddlePaddle/Paddle/pull/61791), [#62141](https://github.com/PaddlePaddle/Paddle/pull/62141), [#63422](https://github.com/PaddlePaddle/Paddle/pull/63422), [#63577](https://github.com/PaddlePaddle/Paddle/pull/63577), [#63978](https://github.com/PaddlePaddle/Paddle/pull/63978), [#63576](https://github.com/PaddlePaddle/Paddle/pull/63576), [#63947](https://github.com/PaddlePaddle/Paddle/pull/63947), [#64332](https://github.com/PaddlePaddle/Paddle/pull/64332), [#63990](https://github.com/PaddlePaddle/Paddle/pull/63990))
+3. Add the Pass Pipeline function, including PdToCinn, CinnPreprocess, BuildGroupOp, GroupClusterOp, CinnLowering, Accuracy Check and other Pass strategies, to support the Lowering and execution of subgraphs in dynamic and static shapes, with a clear architecture.([#61611](https://github.com/PaddlePaddle/Paddle/pull/61611)、[#62612](https://github.com/PaddlePaddle/Paddle/pull/62612)、[#64354](https://github.com/PaddlePaddle/Paddle/pull/64354)、[#61848](https://github.com/PaddlePaddle/Paddle/pull/61848)、[#62316](https://github.com/PaddlePaddle/Paddle/pull/62316)、[#64152](https://github.com/PaddlePaddle/Paddle/pull/64152)、[#61619](https://github.com/PaddlePaddle/Paddle/pull/61619)、[#62318](https://github.com/PaddlePaddle/Paddle/pull/62318)、[#61977](https://github.com/PaddlePaddle/Paddle/pull/61977)、[#62211](https://github.com/PaddlePaddle/Paddle/pull/62211)、[#63972](https://github.com/PaddlePaddle/Paddle/pull/63972)、[#63686](https://github.com/PaddlePaddle/Paddle/pull/63686)、[#64505](https://github.com/PaddlePaddle/Paddle/pull/64505))
+4. Add support for the BucketLower and DyShapeSchedule functions, to realize automatic bucketed compilation and optimization according to the range of dynamic shapes; adapt and upgrade the CodeGen module to support generation of the InferShape function and conditional-branch dispatch in the Host function, so as to accelerate training and inference of large models under dynamic shapes.([#62730](https://github.com/PaddlePaddle/Paddle/pull/62730)、[#61115](https://github.com/PaddlePaddle/Paddle/pull/61115)、[#59941](https://github.com/PaddlePaddle/Paddle/pull/59941)、[#62207](https://github.com/PaddlePaddle/Paddle/pull/62207)、[#64318](https://github.com/PaddlePaddle/Paddle/pull/64318)、[#64345](https://github.com/PaddlePaddle/Paddle/pull/64345)、[#60519](https://github.com/PaddlePaddle/Paddle/pull/60519)、[#62584](https://github.com/PaddlePaddle/Paddle/pull/62584)、[#60828](https://github.com/PaddlePaddle/Paddle/pull/60828)、[#60533](https://github.com/PaddlePaddle/Paddle/pull/60533)、[#61436](https://github.com/PaddlePaddle/Paddle/pull/61436)、[#62071](https://github.com/PaddlePaddle/Paddle/pull/62071)、[#63971](https://github.com/PaddlePaddle/Paddle/pull/63971)、[#61656](https://github.com/PaddlePaddle/Paddle/pull/61656)、[#63083](https://github.com/PaddlePaddle/Paddle/pull/63083)、[#64405](https://github.com/PaddlePaddle/Paddle/pull/64405)、[#63047](https://github.com/PaddlePaddle/Paddle/pull/63047)、[#64655](https://github.com/PaddlePaddle/Paddle/pull/64655)、[#63095](https://github.com/PaddlePaddle/Paddle/pull/63095)、[#63829](https://github.com/PaddlePaddle/Paddle/pull/63829)、[#63572](https://github.com/PaddlePaddle/Paddle/pull/63572))
+5. Add support for compilation caching strategy, to automatically recognize, merge and reuse compilation results of the same subgraph structure, improve compilation efficiency by using multi-threading, so as to enhance the user experience.([#62952](https://github.com/PaddlePaddle/Paddle/pull/62952)、[#63269](https://github.com/PaddlePaddle/Paddle/pull/63269)、[#64718](https://github.com/PaddlePaddle/Paddle/pull/64718)、[#61367](https://github.com/PaddlePaddle/Paddle/pull/61367)、[#63305](https://github.com/PaddlePaddle/Paddle/pull/63305)、[#63750](https://github.com/PaddlePaddle/Paddle/pull/63750)、[#63871](https://github.com/PaddlePaddle/Paddle/pull/63871)、[#64893](https://github.com/PaddlePaddle/Paddle/pull/64893))
+6. Add support for GenerateShape mechanism, add corresponding AST Compute operator definitions, support automatic resolution of dynamic symbols, and automatic generation of ShapeOp in the Lowering stage.([#64167](https://github.com/PaddlePaddle/Paddle/pull/64167)、[#64636](https://github.com/PaddlePaddle/Paddle/pull/64636)、[#61993](https://github.com/PaddlePaddle/Paddle/pull/61993)、[#64843](https://github.com/PaddlePaddle/Paddle/pull/64843)、[#62587](https://github.com/PaddlePaddle/Paddle/pull/62587))
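The symbolic shape derivation described in item 2 can be pictured with a small plain-Python sketch. This is not Paddle's DimExpr or InferSymbolicShape code; the names `Sym`, `Add`, `infer_matmul_shape` and `infer_concat_shape` are hypothetical and only illustrate the idea of carrying named unknown dimensions through shape inference, simplifying where possible and recording equality constraints where two symbols must match.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Sym:
    """A named unknown dimension, e.g. a dynamic batch size (hypothetical type)."""
    name: str


@dataclass(frozen=True)
class Add:
    """Symbolic sum of two dimensions (hypothetical type)."""
    lhs: "Dim"
    rhs: "Dim"


Dim = Union[int, Sym, Add]
Shape = Tuple[Dim, ...]
Constraint = Tuple[Dim, Dim]  # (a, b) means "a must equal b at run time"


def simplify_add(a: Dim, b: Dim) -> Dim:
    """Fold constants and drop zero terms; otherwise keep a symbolic sum."""
    if isinstance(a, int) and isinstance(b, int):
        return a + b
    if a == 0:
        return b
    if b == 0:
        return a
    return Add(a, b)


def infer_matmul_shape(x: Shape, y: Shape, constraints: List[Constraint]) -> Shape:
    """2-D matmul only: [M, K] x [K, N] -> [M, N]."""
    if len(x) != 2 or len(y) != 2:
        raise ValueError("this sketch only handles 2-D operands")
    k_x, k_y = x[1], y[0]
    if isinstance(k_x, int) and isinstance(k_y, int):
        if k_x != k_y:
            raise ValueError(f"contraction dims differ: {k_x} vs {k_y}")
    elif k_x != k_y:
        # Cannot be decided at compile time, so record a constraint instead.
        constraints.append((k_x, k_y))
    return (x[0], y[1])


def infer_concat_shape(x: Shape, y: Shape, axis: int) -> Shape:
    """Concatenate two shapes along `axis`; all other dims must already agree."""
    if len(x) != len(y):
        raise ValueError("rank mismatch")
    out = []
    for i, (dx, dy) in enumerate(zip(x, y)):
        if i == axis:
            out.append(simplify_add(dx, dy))
        elif dx != dy:
            raise ValueError(f"dim {i} differs: {dx} vs {dy}")
        else:
            out.append(dx)
    return tuple(out)


# Usage: a dynamic batch dimension S0 flows through matmul and concat.
S0, S1 = Sym("S0"), Sym("S1")
pending: List[Constraint] = []
hidden = infer_matmul_shape((S0, 128), (128, 64), pending)
merged = infer_concat_shape(hidden, (S1, 64), axis=0)
print(hidden, merged, pending)
```

A real implementation would also need multiplication, division, broadcasting and a constraint solver; the sketch only shows why symbolic dimensions let a compiler reason about output shapes before concrete sizes are known.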
-- Basic data structure supplementation: Added DistTensor, Placements and other distributed specific basic data structures on C++ end, and exposed to Python end. Supports debugging and printing of related attributes and values. [#58930](https://github.com/PaddlePaddle/Paddle/pull/58930),[#59068](https://github.com/PaddlePaddle/Paddle/pull/59068),[#55436](https://github.com/PaddlePaddle/Paddle/pull/55436),[#56449](https://github.com/PaddlePaddle/Paddle/pull/56449),[#59683](https://github.com/PaddlePaddle/Paddle/pull/59683),[#55593](https://github.com/PaddlePaddle/Paddle/pull/55593),[#58032](https://github.com/PaddlePaddle/Paddle/pull/58032),[#56368](https://github.com/PaddlePaddle/Paddle/pull/56368),[#59086](https://github.com/PaddlePaddle/Paddle/pull/59086)
-- Added SPMD derivation and Reshard generation logic in execution flow for all operators, and adapted to multiple types of inputs and outputs such as vector and optional, as well as special mechanisms such as cpu fallback and multi-kernel selection. [#56602](https://github.com/PaddlePaddle/Paddle/pull/56602),[#57321](https://github.com/PaddlePaddle/Paddle/pull/57321),[#57092](https://github.com/PaddlePaddle/Paddle/pull/57092),[#56831](https://github.com/PaddlePaddle/Paddle/pull/56831),[#57119](https://github.com/PaddlePaddle/Paddle/pull/57119),[#58819](https://github.com/PaddlePaddle/Paddle/pull/58819),[#58254](https://github.com/PaddlePaddle/Paddle/pull/58254),[#55698](https://github.com/PaddlePaddle/Paddle/pull/55698),[#59241](https://github.com/PaddlePaddle/Paddle/pull/59241),[#59328](https://github.com/PaddlePaddle/Paddle/pull/59328),[#58644](https://github.com/PaddlePaddle/Paddle/pull/58644),[#56202](https://github.com/PaddlePaddle/Paddle/pull/56202),[#59159](https://github.com/PaddlePaddle/Paddle/pull/59159),[#58573](https://github.com/PaddlePaddle/Paddle/pull/58573),[#59246](https://github.com/PaddlePaddle/Paddle/pull/59246),[#59133](https://github.com/PaddlePaddle/Paddle/pull/59133),[#59186](https://github.com/PaddlePaddle/Paddle/pull/59186),[#57505](https://github.com/PaddlePaddle/Paddle/pull/57505),[#57241](https://github.com/PaddlePaddle/Paddle/pull/57241),[#58928](https://github.com/PaddlePaddle/Paddle/pull/58928)
+### Function Optimization
-- Adapted auto parallel execution logic for special types of operators, such as custom operators. Supports automatic conversion of DistTensor and DenseTensor as mixed inputs. [#57774](https://github.com/PaddlePaddle/Paddle/pull/57774),[#59108](https://github.com/PaddlePaddle/Paddle/pull/59108),[#58436](https://github.com/PaddlePaddle/Paddle/pull/58436),[#59523](https://github.com/PaddlePaddle/Paddle/pull/59523),[#59136](https://github.com/PaddlePaddle/Paddle/pull/59136),[#59352](https://github.com/PaddlePaddle/Paddle/pull/59352),[#59062](https://github.com/PaddlePaddle/Paddle/pull/59062),[#58434](https://github.com/PaddlePaddle/Paddle/pull/58434),[#59148](https://github.com/PaddlePaddle/Paddle/pull/59148),[#58553](https://github.com/PaddlePaddle/Paddle/pull/58553),[#58716](https://github.com/PaddlePaddle/Paddle/pull/58716),[#58369](https://github.com/PaddlePaddle/Paddle/pull/58369),[#59061](https://github.com/PaddlePaddle/Paddle/pull/59061),[#58841](https://github.com/PaddlePaddle/Paddle/pull/58841),[#59139](https://github.com/PaddlePaddle/Paddle/pull/59139),[#59141](https://github.com/PaddlePaddle/Paddle/pull/59141),[#58837](https://github.com/PaddlePaddle/Paddle/pull/58837),[#59137](https://github.com/PaddlePaddle/Paddle/pull/59137),[#59143](https://github.com/PaddlePaddle/Paddle/pull/59143)
+1. Optimize BuildCinnPass logic, upgrade the compiler's perception strategy for black and white list operators, and improve the robustness of Pass logic.([#62372](https://github.com/PaddlePaddle/Paddle/pull/62372)、[#61081](https://github.com/PaddlePaddle/Paddle/pull/61081)、[#61225](https://github.com/PaddlePaddle/Paddle/pull/61225)、[#58863](https://github.com/PaddlePaddle/Paddle/pull/58863))
+2. Optimize the OpLoweringGroup data structure, remove unnecessary interfaces and members, and reduce the coupling between upstream and downstream modules.([#62339](https://github.com/PaddlePaddle/Paddle/pull/62339))
+3. Optimize the design of the compiler's Arch (architecture) component, to abstract the concept of hardware and reduce the cost of adapting to domestic hardware.([#63530](https://github.com/PaddlePaddle/Paddle/pull/63530)、[#64347](https://github.com/PaddlePaddle/Paddle/pull/64347)、[#64506](https://github.com/PaddlePaddle/Paddle/pull/64506)、[#64587](https://github.com/PaddlePaddle/Paddle/pull/64587))
+4. Upgrade the AST Compute module of the compiler's back-end operators, to support the computing logic of dynamic shapes.([#62488](https://github.com/PaddlePaddle/Paddle/pull/62488)、[#63581](https://github.com/PaddlePaddle/Paddle/pull/63581)、[#63687](https://github.com/PaddlePaddle/Paddle/pull/63687)、[#63654](https://github.com/PaddlePaddle/Paddle/pull/63654)、[#64217](https://github.com/PaddlePaddle/Paddle/pull/64217))
-- Optimized dynamic graph execution system: Adapted Autograd execution process. Supports dynamic graph's inverse gradient aggregation, AMP, Hook, PyLayer, View, custom operators, and other surrounding mechanisms. [#58437](https://github.com/PaddlePaddle/Paddle/pull/58437),[#58769](https://github.com/PaddlePaddle/Paddle/pull/58769),[#58796](https://github.com/PaddlePaddle/Paddle/pull/58796),[#58339](https://github.com/PaddlePaddle/Paddle/pull/58339),[#58409](https://github.com/PaddlePaddle/Paddle/pull/58409),[#58772](https://github.com/PaddlePaddle/Paddle/pull/58772),[#58380](https://github.com/PaddlePaddle/Paddle/pull/58380),[#58447](https://github.com/PaddlePaddle/Paddle/pull/58447),[#58706](https://github.com/PaddlePaddle/Paddle/pull/58706),[#58656](https://github.com/PaddlePaddle/Paddle/pull/58656),[#58172](https://github.com/PaddlePaddle/Paddle/pull/58172),[#59401](https://github.com/PaddlePaddle/Paddle/pull/59401),[#58727](https://github.com/PaddlePaddle/Paddle/pull/58727),[#58238](https://github.com/PaddlePaddle/Paddle/pull/58238),[#59243](https://github.com/PaddlePaddle/Paddle/pull/59243),[#58469](https://github.com/PaddlePaddle/Paddle/pull/58469),[#58442](https://github.com/PaddlePaddle/Paddle/pull/58442),[#58487](https://github.com/PaddlePaddle/Paddle/pull/58487),[#58476](https://github.com/PaddlePaddle/Paddle/pull/58476),[#59706](https://github.com/PaddlePaddle/Paddle/pull/59706)
+### Performance Optimization
-- Added support for Pipeline Parallelism, Sequence Parallelism and other distributed parallelism. [#58126](https://github.com/PaddlePaddle/Paddle/pull/58126),[#59766](https://github.com/PaddlePaddle/Paddle/pull/59766),[#59060](https://github.com/PaddlePaddle/Paddle/pull/59060),[#59841](https://github.com/PaddlePaddle/Paddle/pull/59841),[#58609](https://github.com/PaddlePaddle/Paddle/pull/58609),[#59688](https://github.com/PaddlePaddle/Paddle/pull/59688),[#58449](https://github.com/PaddlePaddle/Paddle/pull/58449)、[#59598](https://github.com/PaddlePaddle/Paddle/pull/59598)
-- Added various Reshard strategies and support tensor conversions between different distributed states. [#58592](https://github.com/PaddlePaddle/Paddle/pull/58592),[#59138](https://github.com/PaddlePaddle/Paddle/pull/59138),[#59367](https://github.com/PaddlePaddle/Paddle/pull/59367),[#59621](https://github.com/PaddlePaddle/Paddle/pull/59621),[#59758](https://github.com/PaddlePaddle/Paddle/pull/59758),[#59777](https://github.com/PaddlePaddle/Paddle/pull/59777),[#56975](https://github.com/PaddlePaddle/Paddle/pull/56975),[#58550](https://github.com/PaddlePaddle/Paddle/pull/58550),[#58703](https://github.com/PaddlePaddle/Paddle/pull/58703),[#57210](https://github.com/PaddlePaddle/Paddle/pull/57210),[#58734](https://github.com/PaddlePaddle/Paddle/pull/58734),[#56833](https://github.com/PaddlePaddle/Paddle/pull/56833),[#59292](https://github.com/PaddlePaddle/Paddle/pull/59292),[#57432](https://github.com/PaddlePaddle/Paddle/pull/57432),[#57568](https://github.com/PaddlePaddle/Paddle/pull/57568),[#56553](https://github.com/PaddlePaddle/Paddle/pull/56553),[#58284](https://github.com/PaddlePaddle/Paddle/pull/58284),[#56039](https://github.com/PaddlePaddle/Paddle/pull/56039),[#55552](https://github.com/PaddlePaddle/Paddle/pull/55552),[#56149](https://github.com/PaddlePaddle/Paddle/pull/56149)
+1. Optimize the Schedule logic of AST IR, restructure core modules such as Vectorize, Unroll, AxisBind, and ComputeAt, and merge the iterative paths of dynamic and static shapes, so as to reduce development and maintenance costs.([#60449](https://github.com/PaddlePaddle/Paddle/pull/60449)、[#60155](https://github.com/PaddlePaddle/Paddle/pull/60155)、[#60342](https://github.com/PaddlePaddle/Paddle/pull/60342)、[#60498](https://github.com/PaddlePaddle/Paddle/pull/60498)、[#60538](https://github.com/PaddlePaddle/Paddle/pull/60538)、[#60190](https://github.com/PaddlePaddle/Paddle/pull/60190)、[#61197](https://github.com/PaddlePaddle/Paddle/pull/61197)、[#63140](https://github.com/PaddlePaddle/Paddle/pull/63140)、[#61156](https://github.com/PaddlePaddle/Paddle/pull/61156))
+2. Optimize the Tiling strategy and temp Buffer function, support warp-level contiguous memory reads and the cache_read and cache_write functions, and improve subgraph execution performance.([#64240](https://github.com/PaddlePaddle/Paddle/pull/64240)、[#60562](https://github.com/PaddlePaddle/Paddle/pull/60562)、[#64711](https://github.com/PaddlePaddle/Paddle/pull/64711)、[#62856](https://github.com/PaddlePaddle/Paddle/pull/62856)、[#61576](https://github.com/PaddlePaddle/Paddle/pull/61576)、[#61901](https://github.com/PaddlePaddle/Paddle/pull/61901)、[#62581](https://github.com/PaddlePaddle/Paddle/pull/62581)、[#61987](https://github.com/PaddlePaddle/Paddle/pull/61987)、[#60190](https://github.com/PaddlePaddle/Paddle/pull/60190)、[#63138](https://github.com/PaddlePaddle/Paddle/pull/63138)、[#62517](https://github.com/PaddlePaddle/Paddle/pull/62517))
+3. Support automatic search function of Schedule configuration and AOT offline saving mechanism to accelerate the performance of subgraph Kernel.([#64271](https://github.com/PaddlePaddle/Paddle/pull/64271)、[#64588](https://github.com/PaddlePaddle/Paddle/pull/64588)、[#64694](https://github.com/PaddlePaddle/Paddle/pull/64694)、[#64620](https://github.com/PaddlePaddle/Paddle/pull/64620)、[#64702](https://github.com/PaddlePaddle/Paddle/pull/64702)、[#63086](https://github.com/PaddlePaddle/Paddle/pull/63086))
+4. Support the OptimizeReductionTactic optimization strategy to improve kernel performance in Reduce scenarios.([#60661](https://github.com/PaddlePaddle/Paddle/pull/60661)、[#61363](https://github.com/PaddlePaddle/Paddle/pull/61363)、[#60881](https://github.com/PaddlePaddle/Paddle/pull/60881)、[#63859](https://github.com/PaddlePaddle/Paddle/pull/63859))
+5. Enhance DCE Pass function, remove redundant If/For branch codes and improve execution efficiency.([#61682](https://github.com/PaddlePaddle/Paddle/pull/61682))
+6. Add support for FuseParallelMatmulPass, which fuses multiple Matmul operators to achieve acceleration.([#63623](https://github.com/PaddlePaddle/Paddle/pull/63623))
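As a rough illustration of why combining parallel Matmul operators (item 6) can help, the sketch below uses the standard identity that several matmuls sharing the same left operand equal one matmul against the column-wise concatenation of their weights, followed by a split. Whether FuseParallelMatmulPass applies exactly this rewrite is an assumption; the helper names (`matmul`, `concat_columns`, `split_columns`, `fused_parallel_matmul`) are hypothetical and written in plain Python for clarity, not speed.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive dense matrix product of a (m x k) and b (k x n)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def concat_columns(mats: List[Matrix]) -> Matrix:
    """Concatenate matrices with the same row count side by side."""
    rows = len(mats[0])
    return [sum((m[i] for m in mats), []) for i in range(rows)]


def split_columns(m: Matrix, widths: List[int]) -> List[Matrix]:
    """Split a matrix back into column blocks of the given widths."""
    out, start = [], 0
    for w in widths:
        out.append([row[start:start + w] for row in m])
        start += w
    return out


def fused_parallel_matmul(x: Matrix, weights: List[Matrix]) -> List[Matrix]:
    """Compute [x @ w for w in weights] with a single larger matmul."""
    widths = [len(w[0]) for w in weights]
    fused = matmul(x, concat_columns(weights))
    return split_columns(fused, widths)


# Usage: two projections of the same input computed with one matmul.
x = [[1, 2], [3, 4]]
w_a = [[1, 0], [0, 1]]
w_b = [[2], [3]]
separate = [matmul(x, w_a), matmul(x, w_b)]
fused = fused_parallel_matmul(x, [w_a, w_b])
print(separate == fused)
```

On an accelerator the benefit comes from launching one larger kernel instead of several small ones; the arithmetic result is unchanged.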
-#### Enhanced semi-auto parallel for static graphs
+### Bug Fixing
-- Added Sequence Parallel Parallelism; added FThenB, Interleaved 1F1B, Eager 1F1B, VPP and other scheduling modes for Pipeline Parallel, and supported the hybrid parallel between the above new parallelism and original parallelism. Supported visualization of pipeline scheduling. Upgraded gradient synchronization mechanism which supports gradient synchronization when data is sharded on any broadcast dimension. [#57605](https://github.com/PaddlePaddle/Paddle/pull/57605),[#54727](https://github.com/PaddlePaddle/Paddle/pull/54727),[#54409](https://github.com/PaddlePaddle/Paddle/pull/54409),[#54787](https://github.com/PaddlePaddle/Paddle/pull/54787),[#58313](https://github.com/PaddlePaddle/Paddle/pull/58313),[#59179](https://github.com/PaddlePaddle/Paddle/pull/59179),[#59416](https://github.com/PaddlePaddle/Paddle/pull/59416),[#59719](https://github.com/PaddlePaddle/Paddle/pull/59719),[#59822](https://github.com/PaddlePaddle/Paddle/pull/59822),[#59057](https://github.com/PaddlePaddle/Paddle/pull/59057),[#59522](https://github.com/PaddlePaddle/Paddle/pull/59522),[#57061](https://github.com/PaddlePaddle/Paddle/pull/57061)
-- Adapted the executor to PIR, and supported PIR optimization Pass. In distributed scenarios, supports fuse_linear fuse, and etc., to improve performance. [#58459](https://github.com/PaddlePaddle/Paddle/pull/58459),[#58528](https://github.com/PaddlePaddle/Paddle/pull/58528),[#55555](https://github.com/PaddlePaddle/Paddle/pull/55555),[#59757](https://github.com/PaddlePaddle/Paddle/pull/59757),[#59102](https://github.com/PaddlePaddle/Paddle/pull/59102),[#57917](https://github.com/PaddlePaddle/Paddle/pull/57917)
-- Upgraded underlying architecture: upgraded the executor to reuse the results of data-flow dependency analysis and static kernel selection; upgraded entire graph based sharding completion mechanism, to switch to new sharding derivation rules and support some long-tailed cases; optimized the support of control flow under distributed static graph to adapt to more scenarios; reduced the graph compilation time and refined error message format to improve user experience. [#55389](https://github.com/PaddlePaddle/Paddle/pull/55389),[#55650](https://github.com/PaddlePaddle/Paddle/pull/55650),[#54938](https://github.com/PaddlePaddle/Paddle/pull/54938),[#57447](https://github.com/PaddlePaddle/Paddle/pull/57447),[#57751](https://github.com/PaddlePaddle/Paddle/pull/57751),[#57742](https://github.com/PaddlePaddle/Paddle/pull/57742),[#59524](https://github.com/PaddlePaddle/Paddle/pull/59524),[#59526](https://github.com/PaddlePaddle/Paddle/pull/59526),[#58669](https://github.com/PaddlePaddle/Paddle/pull/58669),[#57616](https://github.com/PaddlePaddle/Paddle/pull/57616),[#56511](https://github.com/PaddlePaddle/Paddle/pull/56511),[#55727](https://github.com/PaddlePaddle/Paddle/pull/55727),[#58906](https://github.com/PaddlePaddle/Paddle/pull/58906),[#56016](https://github.com/PaddlePaddle/Paddle/pull/56016),[#54897](https://github.com/PaddlePaddle/Paddle/pull/54897)
-- Optimized the gpu memory usage in static graph mode, and added refined recomputing strategy; optimized auto mixed precision pass, and allows users to manually specify auto-cast region and fixed some bugs; supports parallel computation of cross-entropy; supports fusion operators such as scaled_dot_product_attention, fuse_rope, etc.; performs scheduling optimization to support better overlap between communication and computation in tensor parallelism and pipeline parallelsim. [#58421](https://github.com/PaddlePaddle/Paddle/pull/58421),[#58533](https://github.com/PaddlePaddle/Paddle/pull/58533),[#59498](https://github.com/PaddlePaddle/Paddle/pull/59498),[#59498](https://github.com/PaddlePaddle/Paddle/pull/59498),[#59187](https://github.com/PaddlePaddle/Paddle/pull/59187),[#59188](https://github.com/PaddlePaddle/Paddle/pull/59188),[#58172](https://github.com/PaddlePaddle/Paddle/pull/58172),[#58628](https://github.com/PaddlePaddle/Paddle/pull/58628),[#56185](https://github.com/PaddlePaddle/Paddle/pull/56185),[#56696](https://github.com/PaddlePaddle/Paddle/pull/56696),[#59497](https://github.com/PaddlePaddle/Paddle/pull/59497),[#58304](https://github.com/PaddlePaddle/Paddle/pull/58304),[#58977](https://github.com/PaddlePaddle/Paddle/pull/58977)
+1. Fix the bug when Lowering some special operators to the compiler, to improve the end-to-end user experience.([#60800](https://github.com/PaddlePaddle/Paddle/pull/60800)、[#64720](https://github.com/PaddlePaddle/Paddle/pull/64720)、[#62593](https://github.com/PaddlePaddle/Paddle/pull/62593)、[#62661](https://github.com/PaddlePaddle/Paddle/pull/62661)、[#64626](https://github.com/PaddlePaddle/Paddle/pull/64626)、[#63320](https://github.com/PaddlePaddle/Paddle/pull/63320)、[#64581](https://github.com/PaddlePaddle/Paddle/pull/64581)、[#61608](https://github.com/PaddlePaddle/Paddle/pull/61608)、[#64135](https://github.com/PaddlePaddle/Paddle/pull/64135)、[#64659](https://github.com/PaddlePaddle/Paddle/pull/64659)、[#62391](https://github.com/PaddlePaddle/Paddle/pull/62391)、[#62490](https://github.com/PaddlePaddle/Paddle/pull/62490)、[#63891](https://github.com/PaddlePaddle/Paddle/pull/63891)、[#64529](https://github.com/PaddlePaddle/Paddle/pull/64529))
+2. Fix a bug in the symbolic derivation logic of some operators.([#62141](https://github.com/PaddlePaddle/Paddle/pull/62141)、[#62376](https://github.com/PaddlePaddle/Paddle/pull/62376)、[#62941](https://github.com/PaddlePaddle/Paddle/pull/62941)、[#63322](https://github.com/PaddlePaddle/Paddle/pull/63322)、[#64672](https://github.com/PaddlePaddle/Paddle/pull/64672)、[#64407](https://github.com/PaddlePaddle/Paddle/pull/64407)、[#60241](https://github.com/PaddlePaddle/Paddle/pull/60241)、[#60440](https://github.com/PaddlePaddle/Paddle/pull/60440)、[#62503](https://github.com/PaddlePaddle/Paddle/pull/62503)、[#62997](https://github.com/PaddlePaddle/Paddle/pull/62997)、[#63169](https://github.com/PaddlePaddle/Paddle/pull/63169)、[#61098](https://github.com/PaddlePaddle/Paddle/pull/61098)、[#63973](https://github.com/PaddlePaddle/Paddle/pull/63973)、[#62248](https://github.com/PaddlePaddle/Paddle/pull/62248)、[#62321](https://github.com/PaddlePaddle/Paddle/pull/62321)、[#63755](https://github.com/PaddlePaddle/Paddle/pull/63755)、[#63917](https://github.com/PaddlePaddle/Paddle/pull/63917)、[#63903](https://github.com/PaddlePaddle/Paddle/pull/63903)、[#64173](https://github.com/PaddlePaddle/Paddle/pull/64173)、[#64525](https://github.com/PaddlePaddle/Paddle/pull/64525)、[#64615](https://github.com/PaddlePaddle/Paddle/pull/64615)、[#62247](https://github.com/PaddlePaddle/Paddle/pull/62247)、[#62455](https://github.com/PaddlePaddle/Paddle/pull/62455)、[#62898](https://github.com/PaddlePaddle/Paddle/pull/62898)、[#62867](https://github.com/PaddlePaddle/Paddle/pull/62867)、[#63608](https://github.com/PaddlePaddle/Paddle/pull/63608)、[#63789](https://github.com/PaddlePaddle/Paddle/pull/63789)、[#64085](https://github.com/PaddlePaddle/Paddle/pull/64085)、[#64136](https://github.com/PaddlePaddle/Paddle/pull/64136)、[#64181](https://github.com/PaddlePaddle/Paddle/pull/64181))
+3. Fix the problems of compiler execution errors under dynamic and static shapes, to improve the robustness of the framework mechanism.([#60813](https://github.com/PaddlePaddle/Paddle/pull/60813)、[#61877](https://github.com/PaddlePaddle/Paddle/pull/61877)、[#61909](https://github.com/PaddlePaddle/Paddle/pull/61909)、[#62954](https://github.com/PaddlePaddle/Paddle/pull/62954)、[#63614](https://github.com/PaddlePaddle/Paddle/pull/63614)、[#60339](https://github.com/PaddlePaddle/Paddle/pull/60339)、[#60623](https://github.com/PaddlePaddle/Paddle/pull/60623)、[#60658](https://github.com/PaddlePaddle/Paddle/pull/60658)、[#60669](https://github.com/PaddlePaddle/Paddle/pull/60669)、[#58823](https://github.com/PaddlePaddle/Paddle/pull/58823)、[#62483](https://github.com/PaddlePaddle/Paddle/pull/62483)、[#62742](https://github.com/PaddlePaddle/Paddle/pull/62742)、[#61797](https://github.com/PaddlePaddle/Paddle/pull/61797)、[#63411](https://github.com/PaddlePaddle/Paddle/pull/63411)、[#64077](https://github.com/PaddlePaddle/Paddle/pull/64077)、[#62736](https://github.com/PaddlePaddle/Paddle/pull/62736)、[#62390](https://github.com/PaddlePaddle/Paddle/pull/62390)、[#63689](https://github.com/PaddlePaddle/Paddle/pull/63689))
-#### AutoTuner
+### Deprecated Features
-This release implements a profiling based automatic search and tuning tool named AutoTuner for parallel strategies, to automatically combine parallel and optimization strategies. Users can select effective combination configurations for experiments, and AutoTuner will search for the optimal configuration for large model training and inference given the model and hardware specification. In addition, AutoTuner implements a variety of pruning methods, including gpu memory modelling based pruning, so the search space and search time can be significantly reduced. [#54460](https://github.com/PaddlePaddle/Paddle/pull/54460),[#54668](https://github.com/PaddlePaddle/Paddle/pull/54668),[#59794](https://github.com/PaddlePaddle/Paddle/pull/59794),[#59727](https://github.com/PaddlePaddle/Paddle/pull/59727),[#59782](https://github.com/PaddlePaddle/Paddle/pull/59782),[#54834](https://github.com/PaddlePaddle/Paddle/pull/54834),[#58127](https://github.com/PaddlePaddle/Paddle/pull/58127),[#56968](https://github.com/PaddlePaddle/Paddle/pull/56968),[#55466](https://github.com/PaddlePaddle/Paddle/pull/55466),[#56939](https://github.com/PaddlePaddle/Paddle/pull/56939),[#58183](https://github.com/PaddlePaddle/Paddle/pull/58183),[#58314](https://github.com/PaddlePaddle/Paddle/pull/58314),[#55499](https://github.com/PaddlePaddle/Paddle/pull/55499),[#59748](https://github.com/PaddlePaddle/Paddle/pull/59748)
+1. Remove useless symbol-related components such as adt DimExpr, SymbolicDimExpr and ShapedTypeInterface.([#60901](https://github.com/PaddlePaddle/Paddle/pull/60901)、[#60933](https://github.com/PaddlePaddle/Paddle/pull/60933)、[#60744](https://github.com/PaddlePaddle/Paddle/pull/60744)、[#64176](https://github.com/PaddlePaddle/Paddle/pull/64176)、[#64140](https://github.com/PaddlePaddle/Paddle/pull/64140))
+2. Remove the old Group Cluster, and the front-end representation under the old IR, to improve the simplicity of the architecture.([#63683](https://github.com/PaddlePaddle/Paddle/pull/63683)、[#64630](https://github.com/PaddlePaddle/Paddle/pull/64630)、[#61380](https://github.com/PaddlePaddle/Paddle/pull/61380))
-### Operator library
+## Auto-Parallel Architecture
-#### Incompatible upgrade
+To further improve the usability of the Auto Parallel architecture in large model training scenarios, PaddlePaddle has improved Auto Parallel for both dynamic and static graphs: it adds parallel strategies such as sharding parallelism and interleaved pipeline parallelism, supports lazy parameter initialization, and adds or enhances the SPMD derivation rules for some operators. The Auto Parallel architecture has been comprehensively verified on a number of mainstream large language models. Meanwhile, to build the new PaddlePaddle 3.0 architecture, the static graph Auto Parallel architecture has been comprehensively upgraded based on PIR, the new-generation intermediate representation of PaddlePaddle. The upgrade introduces DistDialect for distributed components, natively supports DistAttr and DistTensor in the computation graph representation, and smooths the conversion between static and dynamic graphs, further unifying Auto Parallel usage across dynamic and static graph modes. Finally, a number of performance optimization techniques have been added and improved, including the zero bubble pipeline scheduling strategy, achieving end-to-end training performance equal to or better than manual parallelism on typical large models such as Llama-2 13B/70B.
-In order to improve maintainability of PaddlePaddle framework, some deprecated operators in the framework (e.g. diag_v1, isfinite_v1, pad2d_v1, etc.) have been removed, and models using these operators saved through the PaddlePaddle 1.x training will not be able to infer on new version of PaddlePaddle. [#57895](https://github.com/PaddlePaddle/Paddle/pull/57895),[#57892](https://github.com/PaddlePaddle/Paddle/pull/57892),[#57898](https://github.com/PaddlePaddle/Paddle/pull/57898),[#57730](https://github.com/PaddlePaddle/Paddle/pull/57730),[#57732](https://github.com/PaddlePaddle/Paddle/pull/57732),[#57810](https://github.com/PaddlePaddle/Paddle/pull/57810),[#57884](https://github.com/PaddlePaddle/Paddle/pull/57884),[#57794](https://github.com/PaddlePaddle/Paddle/pull/57794),[#57926](https://github.com/PaddlePaddle/Paddle/pull/57926),[#57925](https://github.com/PaddlePaddle/Paddle/pull/57925),[#57807](https://github.com/PaddlePaddle/Paddle/pull/57807),[#57808](https://github.com/PaddlePaddle/Paddle/pull/57808)
+### Function Improvements
-#### Operator library enhancements
+- Add the dtensor_from_local interface for creating a DistTensor from the local tensor after sharding (by contrast, shard_tensor creates a DistTensor from the global tensor before sharding). [#60206](https://github.com/PaddlePaddle/Paddle/pull/60206)
+- Add the unshard_tensor interface to convert a DistTensor to a global tensor; it is the inverse operation of shard_tensor. [#60272](https://github.com/PaddlePaddle/Paddle/pull/60272)
+- To reduce the GPU memory usage during training, add Sharding parallelism, and support stage1, stage2 and stage3 modes. [#61926](https://github.com/PaddlePaddle/Paddle/pull/61926), [#62711](https://github.com/PaddlePaddle/Paddle/pull/62711), [#62486](https://github.com/PaddlePaddle/Paddle/pull/62486), [#62230](https://github.com/PaddlePaddle/Paddle/pull/62230)
+- To solve the problem of insufficient GPU memory when initializing parameters first and then sharding them, add the LazyInit function, to support slicing parameters first and then initializing them. [#60316](https://github.com/PaddlePaddle/Paddle/pull/60316), [#60441](https://github.com/PaddlePaddle/Paddle/pull/60441), [#60563](https://github.com/PaddlePaddle/Paddle/pull/60563), [#61792](https://github.com/PaddlePaddle/Paddle/pull/61792)
+- To reduce the bubbles of pipeline parallelism, add interleaved pipeline parallelism, and support automatically converting the pipeline parallelism of the user's network to interleaved pipeline parallelism through configuration, so that the user does not need to perform complicated marking in the network. [#59751](https://github.com/PaddlePaddle/Paddle/pull/59751), [#60050](https://github.com/PaddlePaddle/Paddle/pull/60050), [#60467](https://github.com/PaddlePaddle/Paddle/pull/60467), [#60868](https://github.com/PaddlePaddle/Paddle/pull/60868), [#60187](https://github.com/PaddlePaddle/Paddle/pull/60187), [#62884](https://github.com/PaddlePaddle/Paddle/pull/62884), [#60560](https://github.com/PaddlePaddle/Paddle/pull/60560), [#61541](https://github.com/PaddlePaddle/Paddle/pull/61541)
+- Add the SPMD derivation rules for stack, gather, scatter_grad, cumsum, unbind, swiglu, and fused_linear_param_grad. Improve and optimize the implementation of fused_rope, reshape, flatten, fused_rms_norm, slice, tile, flash_attn, cross_entropy and other operator slice derivation rules, to solve the problem of incompatibility in some of the model networking scenarios. [#62720](https://github.com/PaddlePaddle/Paddle/pull/62720), [#64202](https://github.com/PaddlePaddle/Paddle/pull/64202), [#63361](https://github.com/PaddlePaddle/Paddle/pull/63361), [#63290](https://github.com/PaddlePaddle/Paddle/pull/63290), [#61460](https://github.com/PaddlePaddle/Paddle/pull/61460), [#59986](https://github.com/PaddlePaddle/Paddle/pull/59986), [#61184](https://github.com/PaddlePaddle/Paddle/pull/61184), [#60144](https://github.com/PaddlePaddle/Paddle/pull/60144), [#62525](https://github.com/PaddlePaddle/Paddle/pull/62525), [#62053](https://github.com/PaddlePaddle/Paddle/pull/62053), [#60709](https://github.com/PaddlePaddle/Paddle/pull/60709), [#60111](https://github.com/PaddlePaddle/Paddle/pull/60111), [#63681](https://github.com/PaddlePaddle/Paddle/pull/63681), [#62180](https://github.com/PaddlePaddle/Paddle/pull/62180), [#60794](https://github.com/PaddlePaddle/Paddle/pull/60794), [#60632](https://github.com/PaddlePaddle/Paddle/pull/60632), [#62439](https://github.com/PaddlePaddle/Paddle/pull/62439)
+- Improve the distributed checkpoint storage and loading function, support master_weights strategy, and fix the random hanging problem. [#60027](https://github.com/PaddlePaddle/Paddle/pull/60027), [#59872](https://github.com/PaddlePaddle/Paddle/pull/59872)
+- In order to support the auto parallel of arbitrary shape tensor, add the non-uniform tensor sharding feature. [#62611](https://github.com/PaddlePaddle/Paddle/pull/62611), [#61432](https://github.com/PaddlePaddle/Paddle/pull/61432)
+- To let users use customized operators in auto parallel networking, support registering custom SPMD derivation rules for such operators from outside the framework. [#60509](https://github.com/PaddlePaddle/Paddle/pull/60509)
+- Improve the slice SPMD rule, and support the transition from any state to replicate and from replicate state to any state. [#60281](https://github.com/PaddlePaddle/Paddle/pull/60281), [#59869](https://github.com/PaddlePaddle/Paddle/pull/59869)
+- Add MoE expert parallelism (experimental). Currently, only dynamic graph auto parallel is supported. [#63904](https://github.com/PaddlePaddle/Paddle/pull/63904)
+- Fix some process adaptation problems between auto parallel and both dynamic graph execution and dynamic-to-static conversion. [#60214](https://github.com/PaddlePaddle/Paddle/pull/60214), [#60546](https://github.com/PaddlePaddle/Paddle/pull/60546), [#62082](https://github.com/PaddlePaddle/Paddle/pull/62082), [#61313](https://github.com/PaddlePaddle/Paddle/pull/61313), [#61840](https://github.com/PaddlePaddle/Paddle/pull/61840), [#60614](https://github.com/PaddlePaddle/Paddle/pull/60614), [#60234](https://github.com/PaddlePaddle/Paddle/pull/60234), [#64813](https://github.com/PaddlePaddle/Paddle/pull/64813), [#61606](https://github.com/PaddlePaddle/Paddle/pull/61606), [#63405](https://github.com/PaddlePaddle/Paddle/pull/63405), [#64334](https://github.com/PaddlePaddle/Paddle/pull/64334), [#60504](https://github.com/PaddlePaddle/Paddle/pull/60504)
-- The complex kernels of PaddlePaddle PHI operator library have been further enhanced, and a total of 40+ complex kernels have been added. [#55380](https://github.com/PaddlePaddle/Paddle/pull/55380), [#56349](https://github.com/PaddlePaddle/Paddle/pull/56349), [#56412](https://github.com/PaddlePaddle/Paddle/pull/56412), [#56323](https://github.com/PaddlePaddle/Paddle/pull/56323), [#56723](https://github.com/PaddlePaddle/Paddle/pull/56723), [#56457](https://github.com/PaddlePaddle/Paddle/pull/56457), [#56903](https://github.com/PaddlePaddle/Paddle/pull/56903)[#56914](https://github.com/PaddlePaddle/Paddle/pull/56914), [#57116](https://github.com/PaddlePaddle/Paddle/pull/57116), [#56048](https://github.com/PaddlePaddle/Paddle/pull/56048), [#57244](https://github.com/PaddlePaddle/Paddle/pull/57244), [#57639](https://github.com/PaddlePaddle/Paddle/pull/57639), [#57638](https://github.com/PaddlePaddle/Paddle/pull/57638), [#57540](https://github.com/PaddlePaddle/Paddle/pull/57540), [#58545](https://github.com/PaddlePaddle/Paddle/pull/58545), [#58336](https://github.com/PaddlePaddle/Paddle/pull/58336), [#58532](https://github.com/PaddlePaddle/Paddle/pull/58532), [#58839](https://github.com/PaddlePaddle/Paddle/pull/58839), [#59079](https://github.com/PaddlePaddle/Paddle/pull/59079), [#59277](https://github.com/PaddlePaddle/Paddle/pull/59277), [#59122](https://github.com/PaddlePaddle/Paddle/pull/59122), [#57058](https://github.com/PaddlePaddle/Paddle/pull/57058)
+### Performance Optimization
-- Optimized and added XPU kernels for some operators, and enhanced the support for data types such as bfloat16 on XPU kernel. [#54478](https://github.com/PaddlePaddle/Paddle/pull/54478), [#57740](https://github.com/PaddlePaddle/Paddle/pull/57740), [#58346](https://github.com/PaddlePaddle/Paddle/pull/58346), [#58456](https://github.com/PaddlePaddle/Paddle/pull/58456), [#58662](https://github.com/PaddlePaddle/Paddle/pull/58662), [#59066](https://github.com/PaddlePaddle/Paddle/pull/59066), [#59263](https://github.com/PaddlePaddle/Paddle/pull/59263)), [#59375](https://github.com/PaddlePaddle/Paddle/pull/59375), [#59505](https://github.com/PaddlePaddle/Paddle/pull/59505), [#59653](https://github.com/PaddlePaddle/Paddle/pull/59653), [#55001](https://github.com/PaddlePaddle/Paddle/pull/55001), [#57272](https://github.com/PaddlePaddle/Paddle/pull/57272), [#56169](https://github.com/PaddlePaddle/Paddle/pull/56169), [#59454](https://github.com/PaddlePaddle/Paddle/pull/59454), [#59480](https://github.com/PaddlePaddle/Paddle/pull/59480), [#55914](https://github.com/PaddlePaddle/Paddle/pull/55914), [#54758](https://github.com/PaddlePaddle/Paddle/pull/54758), [#54827](https://github.com/PaddlePaddle/Paddle/pull/54827), [#58364](https://github.com/PaddlePaddle/Paddle/pull/58364), [#58419](https://github.com/PaddlePaddle/Paddle/pull/58419), [#58982](https://github.com/PaddlePaddle/Paddle/pull/58982), [#57216](https://github.com/PaddlePaddle/Paddle/pull/57216), [#59166](https://github.com/PaddlePaddle/Paddle/pull/59166), [#55033](https://github.com/PaddlePaddle/Paddle/pull/55033), [#55375](https://github.com/PaddlePaddle/Paddle/pull/55375), [#58805](https://github.com/PaddlePaddle/Paddle/pull/58805), [#59389](https://github.com/PaddlePaddle/Paddle/pull/59389), [#57077](https://github.com/PaddlePaddle/Paddle/pull/57077), [#55166](https://github.com/PaddlePaddle/Paddle/pull/55166), [#56773](https://github.com/PaddlePaddle/Paddle/pull/56773)
+- To reduce bubbles in pipeline parallel, support computing parameter gradients and activation gradients separately in the backward pass, and add the zero bubble pipeline scheduling strategy to improve training performance. [#62865](https://github.com/PaddlePaddle/Paddle/pull/62865), [#62737](https://github.com/PaddlePaddle/Paddle/pull/62737), [#64534](https://github.com/PaddlePaddle/Paddle/pull/64534)
+- To improve the performance of sequence parallel, fuse the related communication and computation operations, and optimize away redundant transpose operations. [#64807](https://github.com/PaddlePaddle/Paddle/pull/64807), [#63948](https://github.com/PaddlePaddle/Paddle/pull/63948), [#64316](https://github.com/PaddlePaddle/Paddle/pull/64316), [#64119](https://github.com/PaddlePaddle/Paddle/pull/64119)
+- Optimize the time consumption of auto parallel graph optimization for static graphs, to reduce the delay from the start of training to the completion of the first step. [#59912](https://github.com/PaddlePaddle/Paddle/pull/59912), [#61817](https://github.com/PaddlePaddle/Paddle/pull/61817), [#60022](https://github.com/PaddlePaddle/Paddle/pull/60022), [#60125](https://github.com/PaddlePaddle/Paddle/pull/60125)
+- Optimize the time consumption of related communication operations in hybrid parallel scenarios. [#62157](https://github.com/PaddlePaddle/Paddle/pull/62157), [#61622](https://github.com/PaddlePaddle/Paddle/pull/61622)
+- Reduce the redundant GPU memory consumption of parameters in auto parallel dynamic-to-static scenarios. [#62746](https://github.com/PaddlePaddle/Paddle/pull/62746)
+- Improve the mixed precision training function of auto parallel: support local auto_cast and black/white list settings, support the master grad function, and adapt to different parallel strategies. [#60158](https://github.com/PaddlePaddle/Paddle/pull/60158), [#59987](https://github.com/PaddlePaddle/Paddle/pull/59987), [#62629](https://github.com/PaddlePaddle/Paddle/pull/62629), [#60385](https://github.com/PaddlePaddle/Paddle/pull/60385), [#62015](https://github.com/PaddlePaddle/Paddle/pull/62015), [#60514](https://github.com/PaddlePaddle/Paddle/pull/60514), [#61221](https://github.com/PaddlePaddle/Paddle/pull/61221), [#60779](https://github.com/PaddlePaddle/Paddle/pull/60779), [#63228](https://github.com/PaddlePaddle/Paddle/pull/63228)
+- Optimize non-essential casts caused by type promotion and amp to improve performance. [#63293](https://github.com/PaddlePaddle/Paddle/pull/63293), [#63228](https://github.com/PaddlePaddle/Paddle/pull/63228)
-- Added some operators for optimizing large model training and inference performance. [#55758](https://github.com/PaddlePaddle/Paddle/pull/55758), [#54998](https://github.com/PaddlePaddle/Paddle/pull/54998), [#55400](https://github.com/PaddlePaddle/Paddle/pull/55400), [#54630](https://github.com/PaddlePaddle/Paddle/pull/54630), [#55969](https://github.com/PaddlePaddle/Paddle/pull/55969), [#55026](https://github.com/PaddlePaddle/Paddle/pull/55026), [#58986](https://github.com/PaddlePaddle/Paddle/pull/58986)
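The bubble reduction targeted by the pipeline-parallel items above can be estimated with the standard analysis from the pipeline-parallelism literature, which is not part of these release notes: with `p` stages and `m` micro-batches, the idle time relative to the ideal compute time is roughly `(p - 1) / m`, and splitting each device's layers into `v` interleaved chunks divides that by `v`. A minimal sketch, assuming equal-cost stages and ignoring communication; `bubble_ratio` is a hypothetical helper, not a Paddle API:

```python
def bubble_ratio(num_stages: int, num_micro_batches: int, num_chunks: int = 1) -> float:
    """Approximate pipeline bubble time as a fraction of ideal compute time.

    Assumptions (not taken from the release notes): every stage costs the
    same, communication is free, and interleaving with `num_chunks` model
    chunks per device shrinks the bubble by that factor.
    """
    if num_stages < 1 or num_micro_batches < 1 or num_chunks < 1:
        raise ValueError("all arguments must be positive integers")
    return (num_stages - 1) / (num_chunks * num_micro_batches)


if __name__ == "__main__":
    # 4 stages, 8 micro-batches: plain schedule vs. 2 interleaved chunks.
    print(bubble_ratio(4, 8))
    print(bubble_ratio(4, 8, num_chunks=2))
```

Under these assumptions, doubling the number of interleaved chunks halves the estimated bubble, at the cost of more frequent communication between stages.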
+### Upgrade Static Graph Auto Parallel Architecture
-- Improved mechanism of Tensor Strided in the operator library. [#59422](https://github.com/PaddlePaddle/Paddle/pull/59422), [#59325](https://github.com/PaddlePaddle/Paddle/pull/59325), [#56863](https://github.com/PaddlePaddle/Paddle/pull/56863), [#56882](https://github.com/PaddlePaddle/Paddle/pull/56882), [#56947](https://github.com/PaddlePaddle/Paddle/pull/56947)
+- Based on the new-generation Intermediate Representation (PIR), add the new DistDialect, which natively supports DistAttr and DistTensor in the computation graph representation and binds distributed attributes directly to tensors and operators, making the auto parallel architecture simpler and more unified. [#63828](https://github.com/PaddlePaddle/Paddle/pull/63828), [#64299](https://github.com/PaddlePaddle/Paddle/pull/64299), [#63870](https://github.com/PaddlePaddle/Paddle/pull/63870), [#64144](https://github.com/PaddlePaddle/Paddle/pull/64144), [#62524](https://github.com/PaddlePaddle/Paddle/pull/62524), [#62630](https://github.com/PaddlePaddle/Paddle/pull/62630), [#62897](https://github.com/PaddlePaddle/Paddle/pull/62897), [#60478](https://github.com/PaddlePaddle/Paddle/pull/60478), [#60574](https://github.com/PaddlePaddle/Paddle/pull/60574), [#63876](https://github.com/PaddlePaddle/Paddle/pull/63876), [#63798](https://github.com/PaddlePaddle/Paddle/pull/63798), [#62560](https://github.com/PaddlePaddle/Paddle/pull/62560), [#63676](https://github.com/PaddlePaddle/Paddle/pull/63676)
+- Improve APIs such as shard_tensor, reshard, and to_static, so that users can convert dynamic graph model networking directly into a PIR static computation graph for better performance. [#62945](https://github.com/PaddlePaddle/Paddle/pull/62945), [#62356](https://github.com/PaddlePaddle/Paddle/pull/62356), [#60175](https://github.com/PaddlePaddle/Paddle/pull/60175), [#62654](https://github.com/PaddlePaddle/Paddle/pull/62654), [#63347](https://github.com/PaddlePaddle/Paddle/pull/63347)
+- Optimize the auto-parallel graph optimization compilation process, and reduce the compilation and optimization time of static graphs by refactoring and optimizing the procedure of computation graph parallelization and communication resolution. [#64137](https://github.com/PaddlePaddle/Paddle/pull/64137), [#62201](https://github.com/PaddlePaddle/Paddle/pull/62201), [#64143](https://github.com/PaddlePaddle/Paddle/pull/64143), [#62560](https://github.com/PaddlePaddle/Paddle/pull/62560)
+- Optimize the SPMD derivation procedure in static graphs so that it produces results consistent with dynamic graphs, which improves the unity and stability of the architecture. [#62659](https://github.com/PaddlePaddle/Paddle/pull/62659), [#62547](https://github.com/PaddlePaddle/Paddle/pull/62547), [#63117](https://github.com/PaddlePaddle/Paddle/pull/63117), [#63434](https://github.com/PaddlePaddle/Paddle/pull/63434), [#63770](https://github.com/PaddlePaddle/Paddle/pull/63770), [#64361](https://github.com/PaddlePaddle/Paddle/pull/64361), [#63073](https://github.com/PaddlePaddle/Paddle/pull/63073)
+- Upgrade the implementation of Reshard conversion in static graphs, and use consistent conversion rules under dynamic-static graphs to ensure the consistency of the execution logic and results of tensor reshard conversion in dynamic-static graphs, so as to improve user experience. [#62718](https://github.com/PaddlePaddle/Paddle/pull/62718), [#62694](https://github.com/PaddlePaddle/Paddle/pull/62694), [#60215](https://github.com/PaddlePaddle/Paddle/pull/60215), [#63362](https://github.com/PaddlePaddle/Paddle/pull/63362), [#63072](https://github.com/PaddlePaddle/Paddle/pull/63072), [#63962](https://github.com/PaddlePaddle/Paddle/pull/63962), [#64223](https://github.com/PaddlePaddle/Paddle/pull/64223), [#61796](https://github.com/PaddlePaddle/Paddle/pull/61796), [#64465](https://github.com/PaddlePaddle/Paddle/pull/64465), [#64623](https://github.com/PaddlePaddle/Paddle/pull/64623), [#64418](https://github.com/PaddlePaddle/Paddle/pull/64418)
-- Optimized function implementation and template function in some kernels to reduce size of complied library package. [#57083](https://github.com/PaddlePaddle/Paddle/pull/57083), [#57299](https://github.com/PaddlePaddle/Paddle/pull/57299), [#57261](https://github.com/PaddlePaddle/Paddle/pull/57261), [#57290](https://github.com/PaddlePaddle/Paddle/pull/57290), [#57118](https://github.com/PaddlePaddle/Paddle/pull/57118), [#57551](https://github.com/PaddlePaddle/Paddle/pull/57551), [#57509](https://github.com/PaddlePaddle/Paddle/pull/57509), [#57558](https://github.com/PaddlePaddle/Paddle/pull/57558), [#57064](https://github.com/PaddlePaddle/Paddle/pull/57064), [#57365](https://github.com/PaddlePaddle/Paddle/pull/57365), [#57327](https://github.com/PaddlePaddle/Paddle/pull/57327), [#57603](https://github.com/PaddlePaddle/Paddle/pull/57603), [#57671](https://github.com/PaddlePaddle/Paddle/pull/57671), [#57672](https://github.com/PaddlePaddle/Paddle/pull/57672), [#57631](https://github.com/PaddlePaddle/Paddle/pull/57631), [#57082](https://github.com/PaddlePaddle/Paddle/pull/57082), [#57721](https://github.com/PaddlePaddle/Paddle/pull/57721), [#57823](https://github.com/PaddlePaddle/Paddle/pull/57823), [#57821](https://github.com/PaddlePaddle/Paddle/pull/57821), [#57815](https://github.com/PaddlePaddle/Paddle/pull/57815), [#57822](https://github.com/PaddlePaddle/Paddle/pull/57822), [#57541](https://github.com/PaddlePaddle/Paddle/pull/57541), [#57817](https://github.com/PaddlePaddle/Paddle/pull/57817), [#57838](https://github.com/PaddlePaddle/Paddle/pull/57838)
+### Automatic Search and Tuning of Training Strategies
-#### Fixed bug
+To make the automatic search and tuning tool for training strategies (AutoTuner) easier to use, support user-defined search items, priorities for search items, and user-configured illegal strategy combinations; comprehensively enhance the error information in the runtime and post-run logs; and support AutoTuner on NPU devices. [#60101](https://github.com/PaddlePaddle/Paddle/pull/60101), [#60294](https://github.com/PaddlePaddle/Paddle/pull/60294), [#61898](https://github.com/PaddlePaddle/Paddle/pull/61898), [#60248](https://github.com/PaddlePaddle/Paddle/pull/60248), [#60417](https://github.com/PaddlePaddle/Paddle/pull/60417), [#60954](https://github.com/PaddlePaddle/Paddle/pull/60954), [#61499](https://github.com/PaddlePaddle/Paddle/pull/61499), [#62724](https://github.com/PaddlePaddle/Paddle/pull/62724), [#63693](https://github.com/PaddlePaddle/Paddle/pull/63693), [#62853](https://github.com/PaddlePaddle/Paddle/pull/62853), [#62984](https://github.com/PaddlePaddle/Paddle/pull/62984)
-- Fixed some bugs with CUDA 12 adaptation of the PaddlePaddle framework. [#54640](https://github.com/PaddlePaddle/Paddle/pull/54640), [#57820](https://github.com/PaddlePaddle/Paddle/pull/57820), [#58958](https://github.com/PaddlePaddle/Paddle/pull/58958), [#58179](https://github.com/PaddlePaddle/Paddle/pull/58179), [#55594](https://github.com/PaddlePaddle/Paddle/pull/55594)
+## CUDA Training Performance Optimization
-### CUDA
+This upgrade improves large model training efficiency from multiple perspectives, such as operator computation efficiency, distributed communication optimization, and GPU memory optimization.
-#### New features
+### Function Improvements
-- Added debugging class API paddle.amp.debugging.check_check_numerics. Calculated and returned number of outliers (NaN, Inf) and zero elements in this Tensor value. [#54301](https://github.com/PaddlePaddle/Paddle/pull/54301)
-- Added fused_rope fusion operator to accelerate LLaMA class large model training.[#54351](https://github.com/PaddlePaddle/Paddle/pull/54351)
-- Updated CUDNN Frontend API version to v0.9.1 and added fused_scale_bias_add_relu fusion operator to accelerate ResNet networks. Note this feature is in experimental period and is disabled by default. [#58367](https://github.com/PaddlePaddle/Paddle/pull/58367), [#54949](https://github.com/PaddlePaddle/Paddle/pull/54949), [#58504](https://github.com/PaddlePaddle/Paddle/pull/58504)
-- Based on Flash-Attention v2, added Tensor-like Mask function support. Inverse operator supports deterministic computation for debugging. [#57276](https://github.com/PaddlePaddle/Paddle/pull/57276), [#56363](https://github.com/PaddlePaddle/Paddle/pull/56363)
-- Modified sparse conv3d backend implementation to support 2d shapes, avoiding front-end reshape overhead. [#54707](https://github.com/PaddlePaddle/Paddle/pull/54707)
-- Added matmul_int8 operator. ([#55228](https://github.com/PaddlePaddle/Paddle/pull/55228))
+- Enhance the FlashAttention operator function, including support for NVIDIA SM90 GPU compilation, support for Group Query Attention, support for cuDNN access, support for QKV-packed form inputs, and so on. [#59820](https://github.com/PaddlePaddle/Paddle/pull/59820),[#60776](https://github.com/PaddlePaddle/Paddle/pull/60776),[#58680](https://github.com/PaddlePaddle/Paddle/pull/58680),[#63289](https://github.com/PaddlePaddle/Paddle/pull/63289)
+- Add BFloat16 data type support to the repeat_interleave operator. [#61854](https://github.com/PaddlePaddle/Paddle/pull/61854)
+- ResNet-like fusion operators such as fused_scale_bias_add_relu, fused_scale_bias_relu_conv_bn, and fused_dconv_drelu_dbn take many interface parameters and are hard to use directly. Add the fuse_resunit pass to fuse these operators automatically and achieve generic performance optimization. ([#59771](https://github.com/PaddlePaddle/Paddle/pull/59771))
-#### Function optimization
+### Performance Improvement
-- Optimized CUDA Graph’s support for random number operators.[#58310](https://github.com/PaddlePaddle/Paddle/pull/58310)
-- Enhanced automatic mixed-precision training default functionality, including:
- - Optimizing the experience of using automatic mixed precision training interface.[#58152](https://github.com/PaddlePaddle/Paddle/pull/58152),[#55364](https://github.com/PaddlePaddle/Paddle/pull/55364),[#57903](https://github.com/PaddlePaddle/Paddle/pull/57903)
- - Added matrix computation class operators such as fused_attention, fused_feedforward, and fused_gemm_epilogue to framework's default whitelist, and unified default black and white list settings for dynamic and static graphs. [#55373](https://github.com/PaddlePaddle/Paddle/pull/55373), [#55713](https://github.com/PaddlePaddle/Paddle/pull/55713)
- - The argsort, dist, erfinv, nanmedian, poisson operators and lamb optimizer operators support FP16 and BF16 low precision computing. [#51662](https://github.com/PaddlePaddle/Paddle/pull/51662), [#55105](https://github.com/PaddlePaddle/Paddle/pull/55105), [#55287](https://github.com/PaddlePaddle/Paddle/pull/55287), [#55824](https://github.com/PaddlePaddle/Paddle/pull/55824), [#56056](https://github.com/PaddlePaddle/Paddle/pull/56056), [#56184](https://github.com/PaddlePaddle/Paddle/pull/56184), [#55641](https://github.com/PaddlePaddle/Paddle/pull/55641)
- - Fixed elementwise_max operator low-precision implementation. Changed to use FP32 type for numerical computing, and reduce precision loss. [#54799](https://github.com/PaddlePaddle/Paddle/pull/54799)
- - Changed temporary result Tensor needed for Reduce class operator to FP32 type, to avoid precision loss caused by converting intermediate result to low precision. [#55709](https://github.com/PaddlePaddle/Paddle/pull/55709))
-- Optimized GPU codes for flip, roll & roll_grad, index_put & index_put_grad, etc. Removed unnecessary C++ templates to optimize compilation time and reduce compiled binary size without performance degradation. [#57309](https://github.com/PaddlePaddle/Paddle/pull/57309), [#57525](https://github.com/PaddlePaddle/Paddle/pull/57525)
-- For the bernoulli operator, added a check on legitimacy of input probabilities. [#59174](https://github.com/PaddlePaddle/Paddle/pull/59174)
+- To address the problem of large GPU memory consumption during the computation of SwiGLU activation module of the Llama models, add the SwiGLU fusion operator to save the memory consumption of intermediate variables, thus reducing the memory overhead during the training process of the large model, and reducing the recomputation to improve the performance. The performance of the Llama-70B model is improved by 9%. [#61508](https://github.com/PaddlePaddle/Paddle/pull/61508)
+- To address the high proportion of communication time in Sequence Parallel, overlap the Sequence Parallel backward communication with Matmul computation, which reduces end-to-end time and improves the end-to-end performance of large model training scenarios by 1%~2%. [#62284](https://github.com/PaddlePaddle/Paddle/pull/62284),[#63531](https://github.com/PaddlePaddle/Paddle/pull/63531)
+- Sharding backward communication had to be followed by a division by nranks, which slowed training. Support fusing the backward communication with the division by nranks, and support the ReduceScatter Average mode, to improve the performance of large model training. [#62623](https://github.com/PaddlePaddle/Paddle/pull/62623)
+- To address training speed jitter caused by input data broadcasting in tensor model parallel, remove the unnecessary synchronization between CPU and GPU during data broadcasting, to keep the training speed stable. [#60816](https://github.com/PaddlePaddle/Paddle/pull/60816)
+- To address slow training caused by long P2P communication time in pipeline parallel, overlap P2P communication with forward and backward computation. The end-to-end training performance of large models is improved by 2%~3%. [#61935](https://github.com/PaddlePaddle/Paddle/pull/61935), [#62051](https://github.com/PaddlePaddle/Paddle/pull/62051)
+- To address the low efficiency of bias gradient computation in the fused_linear_param_grad_add operator, optimize the bias gradient computation, which improves the end-to-end training performance of large models by 0.2%. [#63114](https://github.com/PaddlePaddle/Paddle/pull/63114)
+- To address the long parameter broadcasting time after sharding backward computation ends, overlap parameter broadcasting with the next step's computation. As a result, the end-to-end training performance of large models is improved by more than 2%. [#63945](https://github.com/PaddlePaddle/Paddle/pull/63945)
+- To address gradients occupying too much GPU memory during pipeline parallel training, which slowed training by introducing extra recomputation, implement the dynamic gradient release technique, improving the end-to-end training performance of large models by 3.4%. [#59739](https://github.com/PaddlePaddle/Paddle/pull/59739)
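The fusion of sharding backward communication with the division by nranks can be illustrated with a small single-process simulation. This is a sketch under stated assumptions rather than Paddle's implementation: `reduce_scatter` and `sum_then_divide` are hypothetical helpers that model each rank's gradient as a plain Python list, and the `average` flag stands in for a ReduceScatter Average mode.

```python
from typing import List


def reduce_scatter(grads_per_rank: List[List[float]], average: bool = False) -> List[List[float]]:
    """Simulate reduce-scatter: reduce element-wise across ranks, then give
    rank r the r-th contiguous shard of the reduced vector."""
    nranks = len(grads_per_rank)
    length = len(grads_per_rank[0])
    if length % nranks != 0:
        raise ValueError("gradient length must be divisible by nranks")
    shard = length // nranks
    reduced = [sum(rank[i] for rank in grads_per_rank) for i in range(length)]
    if average:
        # Fused mode: the division by nranks happens inside the collective.
        reduced = [value / nranks for value in reduced]
    return [reduced[r * shard:(r + 1) * shard] for r in range(nranks)]


def sum_then_divide(grads_per_rank: List[List[float]]) -> List[List[float]]:
    """Unfused baseline: a sum reduce-scatter followed by a separate division."""
    nranks = len(grads_per_rank)
    shards = reduce_scatter(grads_per_rank, average=False)
    return [[value / nranks for value in shard] for shard in shards]


if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 6.0, 5.0, 8.0]]
    print(reduce_scatter(grads, average=True))
    print(sum_then_divide(grads))
```

Both paths produce the same shards; the fused mode simply avoids launching a separate division after the collective.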
-#### Performance optimization
+### Bug Fixing
-- Optimized BroadcastKernel's support for large Tensor. Change to call INT32 version implementation for multiple times for large Tensor Sharding, improving operator performance by 7.27x. [#57313](https://github.com/PaddlePaddle/Paddle/pull/57313), [#57996](https://github.com/PaddlePaddle/Paddle/pull/57996)
-- Optimized performance of Tensor save interface by copying the Tensor to CPU and then converting to numpy, to avoid overhead of automatically converting the Tensor to a continuous Tensor when Tensor is not continuous. [#57040](https://github.com/PaddlePaddle/Paddle/pull/57040)
+- Fix a CUDA Event resource leak in StreamSafeCUDAAllocator that slowed down large model training. [#64621](https://github.com/PaddlePaddle/Paddle/pull/64621)
+- Fix a backward computation error in the fused_rotary_position_embedding operator. [#60217](https://github.com/PaddlePaddle/Paddle/pull/60217)
+- Fix the bug that customized operators cannot control the computation precision through black and white lists in AMP scenarios. [#60052](https://github.com/PaddlePaddle/Paddle/pull/60052)
+- Fix the bug that operators such as add_ and divide_, which natively support operands of different data types, undergo unexpected type promotion when type promotion occurs. [#64302](https://github.com/PaddlePaddle/Paddle/pull/64302)
-#### Bug Fix
+## Distributed Strategy Enhancements
-- Fixed bug of memmory_efficient_attention operator supporting the sm_90. [#58070](https://github.com/PaddlePaddle/Paddle/pull/58070)
-- Fixed the NaN problem of softmax operator when axis=-1 and length is greater than 100000. [#57851](https://github.com/PaddlePaddle/Paddle/pull/57851)
-- Fixed bug of GPU access error in some cases for set_constant operator. [#59905](https://github.com/PaddlePaddle/Paddle/pull/59905)
-- Fixed GPU storage read/write contention issue in fast implementation version of layer_norm operator. [#56435](https://github.com/PaddlePaddle/Paddle/pull/56435)
+This release focuses on strengthening the functional experience of PaddlePaddle dynamic graph distributed computing. It makes various functional improvements to parallel strategies such as AutoTuner, pipeline parallel, and sharding, and enhances the flexibility of large model training. It adds features such as Flash Attention Mask, which significantly reduce the GPU memory usage of large model training, especially long-sequence training, and improve training performance. In addition, several bugs and potential security risks have been fixed, which significantly improves the overall stability of the system.
-### Expanded Compiler Infrastructure for Neural Networks (CINN)
+### Function Optimization
-In this update, PaddlePaddle CINN focuses on optimization of architecture and comprehensive expansion of its capabilities. In view of increasing demand for dynamic shapes for large models, effective operation and optimization strategies of compiler under dynamic shapes are initially explored and implemented.
-At the architectural level, Python DSL is introduced, significantly improving CINN's development convenience and Debug capability and enabling developers to write and debug codes more efficiently. Meanwhile, logic of Schedule has been refactored to be dominated by GroupSchedule, enabling more general and stable optimization strategies at operator Group level. In order to enhance stability of CINN, a strong constraint component is explored and introduced. This can effectively reduce uncertainties and potential errors in the system. In addition, historical tool classes and software structure of CINN are systematically organized, optimized and improved, to further enhance readability and maintainability of codes. In terms of integration with other PaddlePaddle components, tight integration of CINN with PIR and Paddle has been further strengthened, making compiler more coherent with overall PaddlePaddle framework. This improvement not only enhances performance of the compiler, but also provides developers with a smoother and more unified development experience.
+- Optimize the search space of AutoTuner, which significantly improves search performance. [#62608](https://github.com/PaddlePaddle/Paddle/pull/62608)
+- In pipeline parallel, checking the sending type during eval could cause training errors. Add a training configuration option to skip the redundant receiving check for pipeline sends, which improves flexibility and performance. [#63001](https://github.com/PaddlePaddle/Paddle/pull/63001)
+- In dynamic graph pipeline parallel, add checks on the size and type of sent and received data, and add error messages, to improve robustness and debuggability. [#59405](https://github.com/PaddlePaddle/Paddle/pull/59405)
+- Support multiple loss functions that return multiple losses in the dynamic graph pipeline, which improves its flexibility. [#63167](https://github.com/PaddlePaddle/Paddle/pull/63167)
+- In the dynamic graph pipeline, add a configuration option to clear the pipeline's send and receive caches in time, to better support training with dynamic batch sizes. [#62277](https://github.com/PaddlePaddle/Paddle/pull/62277)
+- The sharding stage3 strategy could not be aligned bit by bit. Replace the unordered set with OrderedSet to avoid errors caused by the accumulation order, so that results now align bit by bit. [#60085](https://github.com/PaddlePaddle/Paddle/pull/60085)
+- To further reduce GPU memory usage in sequence parallel, add a method that recomputes allgather, reducing the GPU memory occupied by allgather activations. [#64244](https://github.com/PaddlePaddle/Paddle/pull/64244)
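The sharding stage3 item above relies on the fact that floating-point addition is not associative, so iterating parameters in a different order can change the accumulated result in the last bits. The following sketch illustrates the effect in plain Python; it is not the Paddle code, and `ordered_unique`, `accumulate`, and the sample values are hypothetical.

```python
from typing import Dict, Iterable, List


def ordered_unique(names: Iterable[str]) -> List[str]:
    """De-duplicate while keeping first-seen order (an ordered-set stand-in).

    dict preserves insertion order in Python 3.7+, unlike a plain set,
    whose iteration order for strings can differ between processes.
    """
    return list(dict.fromkeys(names))


def accumulate(order: Iterable[str], values: Dict[str, float]) -> float:
    """Sum values in exactly the given order."""
    total = 0.0
    for name in order:
        total += values[name]
    return total


if __name__ == "__main__":
    values = {"a": 1e17, "b": 1.0, "c": -1e17}
    # Same three numbers, different accumulation order, different result:
    print(accumulate(["a", "b", "c"], values))  # the 1.0 is absorbed by 1e17
    print(accumulate(["a", "c", "b"], values))  # the large terms cancel first
    print(ordered_unique(["a", "c", "a", "b"]))
```

Fixing the iteration order on every rank and every run removes this source of nondeterminism, which is what makes bit-by-bit comparison possible.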
-#### Compatibility upgrade
+### New Features for Dynamic Graphs
-- Updated storage read interface to be compatible with Paddle 2.0. [#55836](https://github.com/PaddlePaddle/Paddle/pull/55836)
-- Updated relu6 Op Mapper compatibility. [#55611](https://github.com/PaddlePaddle/Paddle/pull/55611)
+- For the AutoTuner search space, add a new search dimension for refined recompute, which makes the search result more accurate and lowers the barrier to model tuning. [#62430](https://github.com/PaddlePaddle/Paddle/pull/62430)
+- Virtual pipeline parallel restricted the training batch size. Modify the pipeline scheduling method so that the batch size can be set more flexibly. [#61561](https://github.com/PaddlePaddle/Paddle/pull/61561),[#60314](https://github.com/PaddlePaddle/Paddle/pull/60134)
+- When flash attention is used with a mask, the GPU memory occupied by the mask grows quadratically with the sequence length and performance is low. Use a sparse mask to reduce the memory complexity of the mask from quadratic to linear in the sequence length, which also reduces the number of memory accesses. In addition, use shared memory to accelerate memory access, greatly improving performance. [#62029](https://github.com/PaddlePaddle/Paddle/pull/62029)
+- Add the dynamic graph sharding parallel strategy with improved overlap of communication and computation, to improve training performance. [#60455](https://github.com/PaddlePaddle/Paddle/pull/60455)
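The quadratic-versus-linear memory claim in the flash attention mask item above can be made concrete with a rough estimate. The layout here is an assumption for illustration only: a dense mask stores one element per (query, key) pair, while the sparse form stores a fixed number of row indices per key column. `dense_mask_bytes` and `sparse_mask_bytes` are hypothetical helpers, and the sizes are per batch item and attention head.

```python
def dense_mask_bytes(seq_len: int, bytes_per_element: int = 1) -> int:
    """Memory of a full seq_len x seq_len mask (quadratic in seq_len)."""
    return seq_len * seq_len * bytes_per_element


def sparse_mask_bytes(seq_len: int, indices_per_column: int = 1, bytes_per_index: int = 4) -> int:
    """Memory of a column-wise sparse mask (linear in seq_len).

    Assumes each key column is described by a few int32 row indices,
    for example the row at which masking starts.
    """
    return seq_len * indices_per_column * bytes_per_index


if __name__ == "__main__":
    seq_len = 32768
    dense = dense_mask_bytes(seq_len)
    sparse = sparse_mask_bytes(seq_len)
    print(f"dense:  {dense / 2**20:.0f} MiB")
    print(f"sparse: {sparse / 2**10:.0f} KiB")
    print(f"ratio:  {dense // sparse}x")
```

The gap widens as sequences get longer, since doubling the sequence length quadruples the dense mask but only doubles the sparse one.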
-#### Modification deprecation
+### Communication Library Function Optimization
-- Removed old Schedule form. [#55566](https://github.com/PaddlePaddle/Paddle/pull/55566),[#55391](https://github.com/PaddlePaddle/Paddle/pull/55391)
-- Removed some obsolete tests. [#56245](https://github.com/PaddlePaddle/Paddle/pull/56245),[#57987](https://github.com/PaddlePaddle/Paddle/pull/57987)
-- Removed the remove_nested_block Visitor tool that no longer works. [#56972](https://github.com/PaddlePaddle/Paddle/pull/56972)
-- Removed other useless codes. [#55413](https://github.com/PaddlePaddle/Paddle/pull/55413)
+- Enhance the functionality of the NCCL communication library to support the initialization of customized NCCL libraries by passing additional initialization parameters during initialization. [#62193](https://github.com/PaddlePaddle/Paddle/pull/62193)
+- Add the NCCL library path search function to support more flexible NCCL library search methods. [#62492](https://github.com/PaddlePaddle/Paddle/pull/62492)
-#### New features
+### Bug Fixing
-- Added CINN paddle.framework.core.is_run_with_cinn() API on the PaddlePaddle side. [#54355](https://github.com/PaddlePaddle/Paddle/pull/54355)
-- Added CINN related operator logics, including various combinatorial operator’s disassembly logic. [#56072](https://github.com/PaddlePaddle/Paddle/pull/56072),[#58210](https://github.com/PaddlePaddle/Paddle/pull/58210),[#58502](https://github.com/PaddlePaddle/Paddle/pull/58502), [#58591](https://github.com/PaddlePaddle/Paddle/pull/58591), [#58981](https://github.com/PaddlePaddle/Paddle/pull/58981), [#59135](https://github.com/PaddlePaddle/Paddle/pull/59135), [#59274](https://github.com/PaddlePaddle/Paddle/pull/59274), [#59306](https://github.com/PaddlePaddle/Paddle/pull/59306), [#59202](https://github.com/PaddlePaddle/Paddle/pull/59202), [#59176](https://github.com/PaddlePaddle/Paddle/pull/59176), [#59534](https://github.com/PaddlePaddle/Paddle/pull/59534), [#59713](https://github.com/PaddlePaddle/Paddle/pull/59713), [#59798](https://github.com/PaddlePaddle/Paddle/pull/59798); Supports bf16, amp and other forms [#54399](https://github.com/PaddlePaddle/Paddle/pull/54399), [#54368](https://github.com/PaddlePaddle/Paddle/pull/54368), [#54608](https://github.com/PaddlePaddle/Paddle/pull/54608); Supports operator zero-dimensional capability [#54892](https://github.com/PaddlePaddle/Paddle/pull/54892), [#54919](https://github.com/PaddlePaddle/Paddle/pull/54919), [#54907](https://github.com/PaddlePaddle/Paddle/pull/54907), [#54966](https://github.com/PaddlePaddle/Paddle/pull/54966)
-- Supports CINN and PaddlePaddle PIR, and combinator operator junction operation mode, so new PIR and CINN operation is integrated. [#54732](https://github.com/PaddlePaddle/Paddle/pull/54732), [#56074](https://github.com/PaddlePaddle/Paddle/pull/56074), [#58216](https://github.com/PaddlePaddle/Paddle/pull/58216), [#55680](https://github.com/PaddlePaddle/Paddle/pull/55680), [#56302](https://github.com/PaddlePaddle/Paddle/pull/56302), [#59037](https://github.com/PaddlePaddle/Paddle/pull/59037), [#55186](https://github.com/PaddlePaddle/Paddle/pull/55186), [#58641](https://github.com/PaddlePaddle/Paddle/pull/58641)
-- There are strongly constrained components to stabilize CINN changes. [#58719](https://github.com/PaddlePaddle/Paddle/pull/58719), [#59309](https://github.com/PaddlePaddle/Paddle/pull/59309), [#58993](https://github.com/PaddlePaddle/Paddle/pull/58993)
-- Added Group Schedule related CINN architecture process. [#58399](https://github.com/PaddlePaddle/Paddle/pull/58399), [#56444](https://github.com/PaddlePaddle/Paddle/pull/56444)
-- Added CUTLASS, error handling, and NVRTC Cubin Fmad options to CINN architecture functions preliminarily. [#58079](https://github.com/PaddlePaddle/Paddle/pull/58079), [#57198](https://github.com/PaddlePaddle/Paddle/pull/57198), [#58794](https://github.com/PaddlePaddle/Paddle/pull/58794)
-- Added Python interface language for CINN. [#57731](https://github.com/PaddlePaddle/Paddle/pull/57731), [#57515](https://github.com/PaddlePaddle/Paddle/pull/57515), [#57644](https://github.com/PaddlePaddle/Paddle/pull/57644), [#57981](https://github.com/PaddlePaddle/Paddle/pull/57981), [#58009](https://github.com/PaddlePaddle/Paddle/pull/58009)
-- Added dynamic shape functionality for CINN to cover ASTGen to generate dynamic shape symbols, to replace the ISL to generate dynamic shape signals [#56360](https://github.com/PaddlePaddle/Paddle/pull/56360), [#57207](https://github.com/PaddlePaddle/Paddle/pull/57207), [#57454](https://github.com/PaddlePaddle/Paddle/pull/57454); Added Bucket Conditional Compilation functionality [#59165](https://github.com/PaddlePaddle/Paddle/pull/59165); Added Schedule, Device, and IR level support for dynamic shape [#58988](https://github.com/PaddlePaddle/Paddle/pull/58988), [#59493](https://github.com/PaddlePaddle/Paddle/pull/59493), [#58717](https://github.com/PaddlePaddle/Paddle/pull/58717), [#58602](https://github.com/PaddlePaddle/Paddle/pull/58602), [#59196](https://github.com/PaddlePaddle/Paddle/pull/59196)
-- Supports CINN Group Schedule operator – at Group level, perform more general and stable Schedule optimization. [#56122](https://github.com/PaddlePaddle/Paddle/pull/56122), [#57777](https://github.com/PaddlePaddle/Paddle/pull/57777), [#57569](https://github.com/PaddlePaddle/Paddle/pull/57569)
+- Fix the memory allocation for dbias_out in the fused_linear_param_grad_add_kernel operator, and add gradient address checking logic so that errors are easier to debug. [#63433](https://github.com/PaddlePaddle/Paddle/pull/63433),[#64460](https://github.com/PaddlePaddle/Paddle/pull/64460)
+- Fix the problem that, with the reduce_avg operation supported, the sharding strategy does not scale the gradient when comm_overlap is turned off. [#62702](https://github.com/PaddlePaddle/Paddle/pull/62702)
+- Fix a fusion-related bug in the calculation order of main grad in Stage2. [#59142](https://github.com/PaddlePaddle/Paddle/pull/59142)
+- Fix the bug that the switch attribute cannot be found when reduce_avg communication operation is turned on under the sharding strategy. [#62502](https://github.com/PaddlePaddle/Paddle/pull/62502)
+- Fix Sharding stage1 training so that it supports non-trainable parameters, that is, parameters set to stop_gradient=True. [#62616](https://github.com/PaddlePaddle/Paddle/pull/62616)
+- Fix the bug of message printing when TCP is turned off, to prevent misleading users. [#62631](https://github.com/PaddlePaddle/Paddle/pull/62631)
+- Fix DataParallel training: resolve the multi-card training error and segmentation fault that occur when some gradients are not initialized. [#62299](https://github.com/PaddlePaddle/Paddle/pull/62299)
+- Fix the bug caused by weight freezing in some models when sequence parallel is turned on. [#63596](https://github.com/PaddlePaddle/Paddle/pull/63596)
+- Fix some bugs for autotuner scenarios with single dp. [#60757](https://github.com/PaddlePaddle/Paddle/pull/60757)
+- Fix the aadiff bug of the pipeline parallel strategy. ([#64716](https://github.com/PaddlePaddle/Paddle/pull/64716))
+- Remove some distributed unit tests. ([#62762](https://github.com/PaddlePaddle/Paddle/pull/62762))
-#### Function optimization
+### Security Risk Fixing
-- Enriched or improved operator functionality, including improvements to various operator processes such as Repair Reverse, FP16, Infershape, Operator Single Test, etc. [#56320](https://github.com/PaddlePaddle/Paddle/pull/56320), [#56845](https://github.com/PaddlePaddle/Paddle/pull/56845), [#54939](https://github.com/PaddlePaddle/Paddle/pull/54939),[#54378](https://github.com/PaddlePaddle/Paddle/pull/54378),[#55321](https://github.com/PaddlePaddle/Paddle/pull/55321),[#55336](https://github.com/PaddlePaddle/Paddle/pull/55336),[#55337](https://github.com/PaddlePaddle/Paddle/pull/55337),[#55442](https://github.com/PaddlePaddle/Paddle/pull/55442),[#55470](https://github.com/PaddlePaddle/Paddle/pull/55470),[#55489](https://github.com/PaddlePaddle/Paddle/pull/55489),[#55510](https://github.com/PaddlePaddle/Paddle/pull/55510),[#55547](https://github.com/PaddlePaddle/Paddle/pull/55547),[#55505](https://github.com/PaddlePaddle/Paddle/pull/55505),[#55563](https://github.com/PaddlePaddle/Paddle/pull/55563),[#54280](https://github.com/PaddlePaddle/Paddle/pull/54280),[#59650](https://github.com/PaddlePaddle/Paddle/pull/59650),[#54862](https://github.com/PaddlePaddle/Paddle/pull/54862),[#55135](https://github.com/PaddlePaddle/Paddle/pull/55135),[#55292](https://github.com/PaddlePaddle/Paddle/pull/55292),[#55333](https://github.com/PaddlePaddle/Paddle/pull/55333),[#55316](https://github.com/PaddlePaddle/Paddle/pull/55316),[#55379](https://github.com/PaddlePaddle/Paddle/pull/55379),[#55326](https://github.com/PaddlePaddle/Paddle/pull/55326)
-- Improved CINN, PaddlePaddle, PIR, combinator operator junction operation, including various and PIR and its actuator interface and CINN mutual support. [#59170](https://github.com/PaddlePaddle/Paddle/pull/59170),[#58766](https://github.com/PaddlePaddle/Paddle/pull/58766),[#59255](https://github.com/PaddlePaddle/Paddle/pull/59255),[#59203](https://github.com/PaddlePaddle/Paddle/pull/59203),[#59024](https://github.com/PaddlePaddle/Paddle/pull/59024),[#57829](https://github.com/PaddlePaddle/Paddle/pull/57829),[#58135](https://github.com/PaddlePaddle/Paddle/pull/58135),[#58193](https://github.com/PaddlePaddle/Paddle/pull/58193),[#58207](https://github.com/PaddlePaddle/Paddle/pull/58207),[#58606](https://github.com/PaddlePaddle/Paddle/pull/58606),[#59437](https://github.com/PaddlePaddle/Paddle/pull/59437),[#59759](https://github.com/PaddlePaddle/Paddle/pull/59759),[#55075](https://github.com/PaddlePaddle/Paddle/pull/55075),[#56805](https://github.com/PaddlePaddle/Paddle/pull/56805),[#57764](https://github.com/PaddlePaddle/Paddle/pull/57764),[#58620](https://github.com/PaddlePaddle/Paddle/pull/58620),[#59769](https://github.com/PaddlePaddle/Paddle/pull/59769),[#58702](https://github.com/PaddlePaddle/Paddle/pull/58702),[#58749](https://github.com/PaddlePaddle/Paddle/pull/58749),[#59025](https://github.com/PaddlePaddle/Paddle/pull/59025),[#58820](https://github.com/PaddlePaddle/Paddle/pull/58820),[#58908](https://github.com/PaddlePaddle/Paddle/pull/58908),[#58169](https://github.com/PaddlePaddle/Paddle/pull/58169)
-- There are strongly constrained components to stabilize CINN changes. [#55090](https://github.com/PaddlePaddle/Paddle/pull/55090),[#55705](https://github.com/PaddlePaddle/Paddle/pull/55705),[#57587](https://github.com/PaddlePaddle/Paddle/pull/57587),[#59501](https://github.com/PaddlePaddle/Paddle/pull/59501)
-- Improved CINN IR and related tool codes. [#55145](https://github.com/PaddlePaddle/Paddle/pull/55145),[#55955](https://github.com/PaddlePaddle/Paddle/pull/55955),[#56307](https://github.com/PaddlePaddle/Paddle/pull/56307),[#55519](https://github.com/PaddlePaddle/Paddle/pull/55519),[#56958](https://github.com/PaddlePaddle/Paddle/pull/56958),[#57019](https://github.com/PaddlePaddle/Paddle/pull/57019),[#57230](https://github.com/PaddlePaddle/Paddle/pull/57230),[#57531](https://github.com/PaddlePaddle/Paddle/pull/57531),[#57532](https://github.com/PaddlePaddle/Paddle/pull/57532),[#57524](https://github.com/PaddlePaddle/Paddle/pull/57524),[#58770](https://github.com/PaddlePaddle/Paddle/pull/58770),[#59337](https://github.com/PaddlePaddle/Paddle/pull/59337),[#59096](https://github.com/PaddlePaddle/Paddle/pull/59096),[#56274](https://github.com/PaddlePaddle/Paddle/pull/56274),[#56350](https://github.com/PaddlePaddle/Paddle/pull/56350),[#57312](https://github.com/PaddlePaddle/Paddle/pull/57312),[#55171](https://github.com/PaddlePaddle/Paddle/pull/55171)
-- Supports CINN Group Schedule operator – at Group level, perform more general and stable Schedule optimization. [#54982](https://github.com/PaddlePaddle/Paddle/pull/54982),[#57963](https://github.com/PaddlePaddle/Paddle/pull/57963),[#58220](https://github.com/PaddlePaddle/Paddle/pull/58220),[#55484](https://github.com/PaddlePaddle/Paddle/pull/55484),[#55935](https://github.com/PaddlePaddle/Paddle/pull/55935),[#55590](https://github.com/PaddlePaddle/Paddle/pull/55590),[#56530](https://github.com/PaddlePaddle/Paddle/pull/56530),[#58344](https://github.com/PaddlePaddle/Paddle/pull/58344),[#59810](https://github.com/PaddlePaddle/Paddle/pull/59810)
-- CINN architectural improvements, including parallel compilation, low-level storage allocation method, print information, Group structure, Pass structure, etc. [#56282](https://github.com/PaddlePaddle/Paddle/pull/56282), [#59014](https://github.com/PaddlePaddle/Paddle/pull/59014),[#59209](https://github.com/PaddlePaddle/Paddle/pull/59209),[#52660](https://github.com/PaddlePaddle/Paddle/pull/52660),[#54749](https://github.com/PaddlePaddle/Paddle/pull/54749),[#58694](https://github.com/PaddlePaddle/Paddle/pull/58694),[#58940](https://github.com/PaddlePaddle/Paddle/pull/58940),[#59504](https://github.com/PaddlePaddle/Paddle/pull/59504),[#56123](https://github.com/PaddlePaddle/Paddle/pull/56123)
-- Improved CINN codegen, jit instruction, dim args, and host kernel to support dynamic shape. [#58825](https://github.com/PaddlePaddle/Paddle/pull/58825),[#59395](https://github.com/PaddlePaddle/Paddle/pull/59395),[#59398](https://github.com/PaddlePaddle/Paddle/pull/59398),[#59540](https://github.com/PaddlePaddle/Paddle/pull/59540),[#59470](https://github.com/PaddlePaddle/Paddle/pull/59470),[#59640](https://github.com/PaddlePaddle/Paddle/pull/59640)
-- CINN error reporting optimization. [#54983](https://github.com/PaddlePaddle/Paddle/pull/54983),[#55544](https://github.com/PaddlePaddle/Paddle/pull/55544)
-- Improved cleanup of CINN codes, including CI, file paths, C++17, Flags, third-party libraries, Docker, etc. [#55018](https://github.com/PaddlePaddle/Paddle/pull/55018),[#55121](https://github.com/PaddlePaddle/Paddle/pull/55121),[#55009](https://github.com/PaddlePaddle/Paddle/pull/55009),[#55888](https://github.com/PaddlePaddle/Paddle/pull/55888),[#56168](https://github.com/PaddlePaddle/Paddle/pull/56168),[#56192](https://github.com/PaddlePaddle/Paddle/pull/56192),[#56896](https://github.com/PaddlePaddle/Paddle/pull/56896),[#53861](https://github.com/PaddlePaddle/Paddle/pull/53861),[#55208](https://github.com/PaddlePaddle/Paddle/pull/55208)
+- Fix a security vulnerability (leakage risk) in the prune_by_memory_estimation operator. [#61320](https://github.com/PaddlePaddle/Paddle/pull/61320)
-#### Performance optimization
+## Parameter Server
-- Fusion of vit attention. [#54139](https://github.com/PaddlePaddle/Paddle/pull/54139)
-- Optimized block reduce. [#58196](https://github.com/PaddlePaddle/Paddle/pull/58196)
+This update mainly fixes several bugs in the process of using the parameter server as well as compilation and installation issues.
-#### Fixed bug
+### Bug Fixing
-- Fixed operator-related bugs. [#56280](https://github.com/PaddlePaddle/Paddle/pull/56280),[#57767](https://github.com/PaddlePaddle/Paddle/pull/57767),[#58406](https://github.com/PaddlePaddle/Paddle/pull/58406),[#54406](https://github.com/PaddlePaddle/Paddle/pull/54406),[#54494](https://github.com/PaddlePaddle/Paddle/pull/54494),[#54751](https://github.com/PaddlePaddle/Paddle/pull/54751),[#55674](https://github.com/PaddlePaddle/Paddle/pull/55674),[#55684](https://github.com/PaddlePaddle/Paddle/pull/55684),[#55683](https://github.com/PaddlePaddle/Paddle/pull/55683),[#57798](https://github.com/PaddlePaddle/Paddle/pull/57798),[#57816](https://github.com/PaddlePaddle/Paddle/pull/57816),[#57687](https://github.com/PaddlePaddle/Paddle/pull/57687),[#56719](https://github.com/PaddlePaddle/Paddle/pull/56719),[#59756](https://github.com/PaddlePaddle/Paddle/pull/59756),[#59770](https://github.com/PaddlePaddle/Paddle/pull/59770),[#58811](https://github.com/PaddlePaddle/Paddle/pull/58811)
-- Fixed process architecture-related bugs. [#54899](https://github.com/PaddlePaddle/Paddle/pull/54899),[#59737](https://github.com/PaddlePaddle/Paddle/pull/59737),[#59356](https://github.com/PaddlePaddle/Paddle/pull/59356),[#56105](https://github.com/PaddlePaddle/Paddle/pull/56105),[#56662](https://github.com/PaddlePaddle/Paddle/pull/56662),[#58146](https://github.com/PaddlePaddle/Paddle/pull/58146),[#58910](https://github.com/PaddlePaddle/Paddle/pull/58910),[#58121](https://github.com/PaddlePaddle/Paddle/pull/58121),[#58943](https://github.com/PaddlePaddle/Paddle/pull/58943),[#58886](https://github.com/PaddlePaddle/Paddle/pull/58886),[#59642](https://github.com/PaddlePaddle/Paddle/pull/59642),[#56164](https://github.com/PaddlePaddle/Paddle/pull/56164),[#56338](https://github.com/PaddlePaddle/Paddle/pull/56338),[#56966](https://github.com/PaddlePaddle/Paddle/pull/56966),[#59112](https://github.com/PaddlePaddle/Paddle/pull/59112),[#55820](https://github.com/PaddlePaddle/Paddle/pull/55820),[#56660](https://github.com/PaddlePaddle/Paddle/pull/56660),[#57307](https://github.com/PaddlePaddle/Paddle/pull/57307),[#57530](https://github.com/PaddlePaddle/Paddle/pull/57530),[#58236](https://github.com/PaddlePaddle/Paddle/pull/58236),[#55190](https://github.com/PaddlePaddle/Paddle/pull/55190),[#55043](https://github.com/PaddlePaddle/Paddle/pull/55043),[#55667](https://github.com/PaddlePaddle/Paddle/pull/55667)
-- Other bugs. [#57239](https://github.com/PaddlePaddle/Paddle/pull/57239),[#55530](https://github.com/PaddlePaddle/Paddle/pull/55530),[#56605](https://github.com/PaddlePaddle/Paddle/pull/56605),[#58243](https://github.com/PaddlePaddle/Paddle/pull/58243),[#58197](https://github.com/PaddlePaddle/Paddle/pull/58197),[#58197](https://github.com/PaddlePaddle/Paddle/pull/58197),[#56086](https://github.com/PaddlePaddle/Paddle/pull/56086),[#56065](https://github.com/PaddlePaddle/Paddle/pull/56065),[#58775](https://github.com/PaddlePaddle/Paddle/pull/58775),[#54750](https://github.com/PaddlePaddle/Paddle/pull/54750),[#58595](https://github.com/PaddlePaddle/Paddle/pull/58595),[#58873](https://github.com/PaddlePaddle/Paddle/pull/58873)
+- Fix an incorrect length setting in the calculation process of the unique operator that caused out-of-bounds reads and writes, to ensure that the operator runs correctly. [#60840](https://github.com/PaddlePaddle/Paddle/pull/60840)
+- Fix bugs in PGLBox save/load and compilation, addressing the missing save/load function and the compilation errors in the PGLBox training process. [#63905](https://github.com/PaddlePaddle/Paddle/pull/63905)
+- Fix the value of use_ps_gpu in CPUPS, so that the CPUPS training process no longer triggers the GPUPS logic and crashes. [#61406](https://github.com/PaddlePaddle/Paddle/pull/61406)
+- Add a device id switching mechanism to fix the cudaErrorInvalidResourceHandle error in GPUPS training under CUDA 12.3, ensuring that resource operations are carried out on the correct device. [#63391](https://github.com/PaddlePaddle/Paddle/pull/63391)
+- Fix improper use of C++ std::string that produced garbled output in the PGLBox Embedding Dump process, to ensure the correctness of Embedding Dump results. [#65179](https://github.com/PaddlePaddle/Paddle/pull/65179)
-#### Documentation
+### Documentation Improvement
-- Added README file. [#58349](https://github.com/PaddlePaddle/Paddle/pull/58349)
+- Add security warnings to the RPC interface documentation, to remind users to use this interface only under secure network conditions. [#64100](https://github.com/PaddlePaddle/Paddle/pull/64100)
-## 4. Deployment Direction (Paddle Inference)
+### Security Enhancement
-### General inference optimization
+- Fix several code security issues to prevent malicious code injection. [#60023](https://github.com/PaddlePaddle/Paddle/pull/60023),[#60544](https://github.com/PaddlePaddle/Paddle/pull/60544),[#60615](https://github.com/PaddlePaddle/Paddle/pull/60615)
+
+## Inference Deployment
+
+Based on PIR, the inference framework upgrades its PASSes on GPU, XPU, and CPU hardware, significantly reducing the number of lines of code compared with the previous version and improving development efficiency. The underlying executor is upgraded to a new asynchronous executor, which improves inference performance on most models. Integration with the CINN compiler for inference acceleration is complete. Switches are added for these features, so users can turn them on through settings. In addition, Paddle Inference natively supports loading optimized, serialized models directly when running mixed inference with TensorRT subgraphs, which reduces startup time. For Paddle-TensorRT, interfaces are added to flexibly control node computation precision and whether a subgraph enters TensorRT, which is convenient for debugging. For performance optimization, more Transformer and LLM acceleration fusion operators are added on GPU, XPU, and CPU, such as the group attention mechanism fusion operator, the GQA structure, and WINT4, together with support for automatic matching by PASS.
-This version of the upgrade improves performance and ease-of-use of the inference engine on GPU and CPU, reducing user cost and application cost of online inference. On GPU: A high-performance multi-threaded asynchronous executor is supported, and inference performance of each model is improved by 5%~10%. The new version of TensorRT and BF16 inference capabilities are also supported, and TensorRT inference performance and ease of use are further improved. On CPU: The latest version of OneDNN high-performance inference is supported. SwinTransformer, FastRCNN and other series of models have greatly improved performance.
+### New Features
-- matmul supports transpose and broadcast operations. [#56827](https://github.com/PaddlePaddle/Paddle/pull/56827)
-- TruncatedNormal and Assign supports FP64 data types. [#57507](https://github.com/PaddlePaddle/Paddle/pull/57507)
-- Supports conv2d explicit quantized inference. [#57160](https://github.com/PaddlePaddle/Paddle/pull/57160),[#58015](https://github.com/PaddlePaddle/Paddle/pull/58015)
-- Added conv_fuse_pass. Support conv + bn fusion. The conv2d_fusion is renamed fused_conv2d_add_act. [#58724](https://github.com/PaddlePaddle/Paddle/pull/58724),[#55374](https://github.com/PaddlePaddle/Paddle/pull/55374),[#54477](https://github.com/PaddlePaddle/Paddle/pull/54477),[#59431](https://github.com/PaddlePaddle/Paddle/pull/59431)
-- Mixed precision inference supports OP whitelisting. [#56535](https://github.com/PaddlePaddle/Paddle/pull/56535)
-- OneDNN optimization is enabled by default. Supports SwinTransformer, FastRCNNd and other inference optimizations. [#58560](https://github.com/PaddlePaddle/Paddle/pull/58560),[#59394](https://github.com/PaddlePaddle/Paddle/pull/59394),[#59421](https://github.com/PaddlePaddle/Paddle/pull/59421),[#58435](https://github.com/PaddlePaddle/Paddle/pull/58435),[#58488](https://github.com/PaddlePaddle/Paddle/pull/58488),[#59259](https://github.com/PaddlePaddle/Paddle/pull/59259),[#56303](https://github.com/PaddlePaddle/Paddle/pull/56303),[#56782](https://github.com/PaddlePaddle/Paddle/pull/56782),[#57598](https://github.com/PaddlePaddle/Paddle/pull/57598),[#58361](https://github.com/PaddlePaddle/Paddle/pull/58361),[#59641](https://github.com/PaddlePaddle/Paddle/pull/59641),[#59527](https://github.com/PaddlePaddle/Paddle/pull/59527),[#59663](https://github.com/PaddlePaddle/Paddle/pull/59663),[#59744](https://github.com/PaddlePaddle/Paddle/pull/59744)
-- Added share_data and support for pass in specified data. [#57933](https://github.com/PaddlePaddle/Paddle/pull/57933)
-
-### Large model inference optimized
-
-The fine-grained fusion inference optimization of generative large models is realized. Optimization solution ensures high-performance inference capability and excellent expandability. Users can flexibly utilize various fine-grained fusion operators and PaddlePaddle native operators to build a network structure of generative large models in free combinations as required, thus achieving efficient and low-cost inference. In addition, our solution also supports mainstream generative large model structure, significantly reducing deployment cost of inference for such models and strongly supports efficient and low-cost implementation of generative large models.
+- Paddle-TensorRT
+ - Upgrade the underlying API called by Paddle-TensorRT: when the TensorRT version is later than 8.5, the EnqueueV2 API (which will be deprecated in the future) is replaced with the EnqueueV3 API. [#60807](https://github.com/PaddlePaddle/Paddle/pull/60807)
+ - Add config.exp_disable_tensorrt_subgraph() to keep specified subgraphs from entering TensorRT. [#61967](https://github.com/PaddlePaddle/Paddle/pull/61967)
+ - Add config.exp_disable_tensorrt_dynamic_shape_ops() to keep operators with dynamic shape inputs from entering TensorRT. The default value is False. [#62352](https://github.com/PaddlePaddle/Paddle/pull/62352)
+ - Add config.exp_specify_tensorrt_subgraph_precision() to run specified nodes at different precision types. [#62402](https://github.com/PaddlePaddle/Paddle/pull/62402)
+- Add a switch in Inference to turn on the CINN compiler: call config.enable_cinn() when configuring the inference config. [#61949](https://github.com/PaddlePaddle/Paddle/pull/61949)
+- Upgrade of the PIR usage mechanism in Inference
+ - In the config, add the enable_new_ir() interface to enable PIR. [#61968](https://github.com/PaddlePaddle/Paddle/pull/61968)
+ - In the config, add the set_optimization_level() interface to set different optimization levels. [#61968](https://github.com/PaddlePaddle/Paddle/pull/61968)
+ - In the PIR mechanism, the PASS function supports custom C++ PASS. [#62468](https://github.com/PaddlePaddle/Paddle/pull/62468)
+ - The inference library exposes PIR-related implementation header files, supporting secondary development based on PIR, such as custom Pass development. [#61863](https://github.com/PaddlePaddle/Paddle/pull/61863),[#62293](https://github.com/PaddlePaddle/Paddle/pull/62293)
+ - The PIR mechanism supports registering Hooks on the Predictor for the inputs and outputs of operators. [#63101](https://github.com/PaddlePaddle/Paddle/pull/63101)
+- The multi-layer Transformer fusion operator fused_multi_transformer_op supports GQA calculation. [#64125](https://github.com/PaddlePaddle/Paddle/pull/64125)
+
+### Function Improvements
+
+- The inference supports loading optimized models directly, making it possible to skip IR optimization altogether. The deployment in this way can minimize framework overhead. [#61598](https://github.com/PaddlePaddle/Paddle/pull/61598)
+- Support re-specifying the shape range information file when loading a saved, IR PASS optimized model for inference. [#60457](https://github.com/PaddlePaddle/Paddle/pull/60457)
+- Collect the shape information within the subgraphs of control flow operators, so that they can use Paddle-TensorRT inference acceleration. [#60451](https://github.com/PaddlePaddle/Paddle/pull/60451),[#59588](https://github.com/PaddlePaddle/Paddle/pull/59588)
+- The mixed-precision PASS (auto_mixed_precision_pass) for GPU-native inference supports the handling of sparse Tensor. [#62656](https://github.com/PaddlePaddle/Paddle/pull/62656)
+- XPU hardware related function
+ - XPU's fused PASS for Conv and FC supports conversion from Float to INT31 type. [#59981](https://github.com/PaddlePaddle/Paddle/pull/59981)
+ - XPU's strided slice operator supports the setting of strides non-negative. [#62268](https://github.com/PaddlePaddle/Paddle/pull/62268)
+ - XPU's multi-layer Encoder fusion PASS is adaptive to sequence length and supports variable length. [#63825](https://github.com/PaddlePaddle/Paddle/pull/63825)
+- In Paddle-TensorRT INT8 computation mode, the tile operator can now enter TensorRT computation, which improves the INT8 performance of some models. [#60189](https://github.com/PaddlePaddle/Paddle/pull/60189)
+
+### Model Compression
+
+Fix bugs and optimize functions mainly for Post Training Quantization (PTQ) and Quantization Aware Training (QAT).
+
+- Support the simulation quantization grouped by channel. [#61828](https://github.com/PaddlePaddle/Paddle/pull/61828)
+- Support automatic saving of quantization scale to model parameter file under dynamic graphs. [#59441](https://github.com/PaddlePaddle/Paddle/pull/59441)
+- Remove the restriction that the dataloader must be a DataLoader instance. [#61798](https://github.com/PaddlePaddle/Paddle/pull/61798)
+
+### Performance Optimization
+
+- Upgrade the inference executor to reduce the video memory usage at runtime while keeping the performance unchanged. This can be used through config.enable_use_executor(True). [#57920](https://github.com/PaddlePaddle/Paddle/pull/57920),[#58452](https://github.com/PaddlePaddle/Paddle/pull/58452),[#63350](https://github.com/PaddlePaddle/Paddle/pull/63350),[#64466](https://github.com/PaddlePaddle/Paddle/pull/64466)
+- Upgrade oneDNN version of paddle inference to v3.4. Its overall performance has been improved compared with v3.3. [#64661](https://github.com/PaddlePaddle/Paddle/pull/64661)
+- Upgrade the CUTLASS-based support for matrix multiplication and activation fusion calculation. ([#61925](https://github.com/PaddlePaddle/Paddle/pull/61925))
+
+#### Add generic PASS in PIR mechanism
+
+- Add identity_op_clean_pass and matmul_scale_fuse_pass. [#59840](https://github.com/PaddlePaddle/Paddle/pull/59840)
+- Add fused_flash_attn_pass. The pass can call flash_attention to replace the original attention computation. [#64213](https://github.com/PaddlePaddle/Paddle/pull/64213),[#64707](https://github.com/PaddlePaddle/Paddle/pull/64707),[#63304](https://github.com/PaddlePaddle/Paddle/pull/63304)
+- In the new PIR inference architecture, upgrade the layout adjustment algorithm to support NHWC inference for conv-class and norm-class operators. Performance tested on SD models is significantly improved. [#63628](https://github.com/PaddlePaddle/Paddle/pull/63628),[#64634](https://github.com/PaddlePaddle/Paddle/pull/64634),[#64658](https://github.com/PaddlePaddle/Paddle/pull/64658),[#64708](https://github.com/PaddlePaddle/Paddle/pull/64708),[#64830](https://github.com/PaddlePaddle/Paddle/pull/64830),[#64896](https://github.com/PaddlePaddle/Paddle/pull/64896)
+- Add remove_redundant_transpose PASS. [#63357](https://github.com/PaddlePaddle/Paddle/pull/63357)
+- Enable CSE PASS in inference to improve inference performance. [#64523](https://github.com/PaddlePaddle/Paddle/pull/64523)
+
+#### GPU Performance Optimizations
+
+Include new fusion operators and new PASS under PIR mechanism.
+
+- Optimize the performance of sparse convolution operator (sparse conv) to improve the inference performance of BEV and other models. [#63067](https://github.com/PaddlePaddle/Paddle/pull/63067)
+- Add the fusion PASS based on flash attention. [#63220](https://github.com/PaddlePaddle/Paddle/pull/63220)
+- The inference supports elementwise_add+group_norm+silu activated operator fusion pattern and its corresponding fusion kernel. [#64199](https://github.com/PaddlePaddle/Paddle/pull/64199)
+- Matrix multiplication supports groupwise weight-only INT4 computation. [#60422](https://github.com/PaddlePaddle/Paddle/pull/60422),[#63212](https://github.com/PaddlePaddle/Paddle/pull/63212),[#60204](https://github.com/PaddlePaddle/Paddle/pull/60204)
+- The implementation of the group attention mechanism fusion operator block_multi_head_attention supports KV Cache quantization. [#59951](https://github.com/PaddlePaddle/Paddle/pull/59951)
+- Upgrade the implementation of the conv fusion operator in Inference to CUTLASS, with support for automatic fusion by PASS as well as bias and activation. Compared with the original cuDNN implementation, the new operator is significantly faster. It is enabled through config.exp_enable_use_cutlass(True). [#64201](https://github.com/PaddlePaddle/Paddle/pull/64201),[#64641](https://github.com/PaddlePaddle/Paddle/pull/64641)
+- Add the blha_get_max_len operator and remove every call to get_max_len in block_multihead_attention. This is used to accelerate dynamic inference of large models. [#64246](https://github.com/PaddlePaddle/Paddle/pull/64246)
+- Data layout optimization: the PASS no longer uses NHWC mode for the conv fusion operator at FP32 precision, because cuDNN performance degrades under this condition. [#63400](https://github.com/PaddlePaddle/Paddle/pull/63400)
+- GPU peak video memory optimization: upgrade the underlying interface TryShrinkMemory to support releasing idle video memory in the pool under GPU place. In certain scenarios, peak video memory can be significantly cut. [#61319](https://github.com/PaddlePaddle/Paddle/pull/61319)
+
+#### CPU Performance Optimization
+
+This includes new fusion operators, new PASS under the PIR mechanism, and optimizations to some kernels.
+
+- Add scale_matmul_fuse_pass. [#63313](https://github.com/PaddlePaddle/Paddle/pull/63313)
+- Add CPU implementations of fused_bias_residual_layernorm and fused_rms_norm to improve inference speed. [#63196](https://github.com/PaddlePaddle/Paddle/pull/63196), [#63165](https://github.com/PaddlePaddle/Paddle/pull/63165)
+- Add cache optimization for the Deconvolution kernel to greatly improve the execution speed of this operator. [#60922](https://github.com/PaddlePaddle/Paddle/pull/60922)
+- In PIR, add the depthwise_conv fusion PASS, which converts the depthwise_conv operator to conv2d so that the oneDNN conv2d kernel optimization can improve the inference speed of this operator. [#63051](https://github.com/PaddlePaddle/Paddle/pull/63051)
+- In PIR, add Conv and Activation Fusion PASS (conv_activation_mkldnn_fuse_pass), to support the fusion of conv and 13 kinds of activation functions, thus greatly improving the inference speed of conv-related operators. [#63145](https://github.com/PaddlePaddle/Paddle/pull/63145)
+- In PIR, add the fusion PASS (operator_unsqueeze_onednn_fuse_pass) between multiple operators and unsqueeze, to improve inference speed. [#63592](https://github.com/PaddlePaddle/Paddle/pull/63592)
+- In PIR, add PASS (operator_reshape_onednn_fuse_pass) to fuse reshape into multiple operators. [#63812](https://github.com/PaddlePaddle/Paddle/pull/63812)
+- In PIR, add scale fusion PASS (operator_scale_onednn_fuse_pass). [#63811](https://github.com/PaddlePaddle/Paddle/pull/63811)
+- In PIR, add PASS (conv2d_transpose_bias operator) that fuses conv and bias. [#62241](https://github.com/PaddlePaddle/Paddle/pull/62241)
+- In PIR, add onednn_placement_pass, which supports converting 151 operators from Phi operators to oneDNN operators, so that the oneDNN high-performance library can be used to improve inference speed. [#63982](https://github.com/PaddlePaddle/Paddle/pull/63982)
+- In PIR, add the fusion between Elementwise type operators and 13 activation functions, to greatly improve inference speed when oneDNN is enabled on the CPU. [#63516](https://github.com/PaddlePaddle/Paddle/pull/63516)
+- In PIR, add the fusion of multiple conv + concat + activation functions and fused_conv + concat + activation functions, to greatly improve the inference speed when conv is followed by concat and activation functions. [#62993](https://github.com/PaddlePaddle/Paddle/pull/62993), [#62713](https://github.com/PaddlePaddle/Paddle/pull/62713)
+- In PIR, add matmul+add operator fusion PASS (matmul_elementwise_add_fuse_pass). [#62715](https://github.com/PaddlePaddle/Paddle/pull/62715)
+- In PIR, add the scale parameter folding PASS (scale_matmul_fuse_pass). [#63313](https://github.com/PaddlePaddle/Paddle/pull/63313)
+- In PIR, add the fusion PASS (softplus_activation_fuse_pass) between softplus and 12 activation functions. [#63617](https://github.com/PaddlePaddle/Paddle/pull/63617)
+- In PIR, add fc operator conversion PASS (fc_onednn_enable_pass). [#63518](https://github.com/PaddlePaddle/Paddle/pull/63518)
+- In PIR, add self-attention operator fusion PASS (self_attention_fuse_pass). [#63726](https://github.com/PaddlePaddle/Paddle/pull/63726)
+- In PIR, add fusion PASS (fc_activation_fuse_pass) between fc and 12 activation functions. [#63853](https://github.com/PaddlePaddle/Paddle/pull/63853)
+- In PIR, add the BatchNorm folding PASS (conv2d_bn_onednn_fuse_pass) to increase the likelihood that subsequent PASS can be fused. [#64524](https://github.com/PaddlePaddle/Paddle/pull/64524)
+- In PIR, add the fusion PASS (matmul_activation_fuse_pass) between matmul and 12 activation functions. [#62901](https://github.com/PaddlePaddle/Paddle/pull/62901)
+- In PIR, add reshape + transpose + reshape fusion PASS (shuffle_channel_detect_pass), which is fused into a shuffle_channel operator under specific conditions. [#64053](https://github.com/PaddlePaddle/Paddle/pull/64053)
+- In PIR, add reshape + transpose + matmul fusion PASS (reshape_transpose_matmul_fuse_pass). [#62998](https://github.com/PaddlePaddle/Paddle/pull/62998)
+- In PIR, add matmul + transpose + reshape fusion PASS (matmul_transpose_reshape_fuse_pass) to significantly improve performance in some scenarios. [#63151](https://github.com/PaddlePaddle/Paddle/pull/63151)
+- New fusion PASS optimizations for XPU hardware:
+ - Add qk_qkv_attention_xpu_fuse_pass and qkv_attention_xpu_kernel in XPU hardware. [#60089](https://github.com/PaddlePaddle/Paddle/pull/60089)
+ - Add the rotary position encoding fusion operator, which fuses elementwise_mul + strided_slice + sin/cos + stack into one operator on XPU hardware. [#60025](https://github.com/PaddlePaddle/Paddle/pull/60025)
+ - Add group_norm_silu_xpu_fuse_pass. [#62689](https://github.com/PaddlePaddle/Paddle/pull/62689)
+ - Add weight_only_linear_xpu_pass. [#64185](https://github.com/PaddlePaddle/Paddle/pull/64185)
+ - Add block_multihead_attention operator and PASS, to support large model inference for LLaMA2 models in XPU devices. [#65036](https://github.com/PaddlePaddle/Paddle/pull/65036)
+ - Support float16 type for squeeze_excitation_block_xpu_kernel. [#61023](https://github.com/PaddlePaddle/Paddle/pull/61023)
+
+### Bug Fixing
+
+- Fix mixed-precision conversions in models such as faster_rcnn_swin_tiny_fpn_1x_coco, and solve the mixed_precision_pass error. [#64673](https://github.com/PaddlePaddle/Paddle/pull/64673)
+- Prevent the fused_conv2d_add_act pass from taking effect when the activation function is sigmoid (fusing conv2d and sigmoid causes performance degradation with cuDNN versions 8.0 through 8.7). [#64717](https://github.com/PaddlePaddle/Paddle/pull/64717)
+- Fix compilation issues with self_dp_attention and fused_layer_norm_avx_kernel in Clang12. [#63414](https://github.com/PaddlePaddle/Paddle/pull/63414)
+- Fix the issue that scale and zeroPoints in the qdq operator of some models are deleted prematurely in the IR/Pass stage. [#62225](https://github.com/PaddlePaddle/Paddle/pull/62225)
+- Fix the issue that causes an error to be reported when both Config.UseOptimizedModel() and config.EnableMemoryOptim() are turned on. [#62501](https://github.com/PaddlePaddle/Paddle/pull/62501)
+- Add constraint on matmul_scale_fuse_pass, where input w must be a weight or the pass will not be matched. [#62850](https://github.com/PaddlePaddle/Paddle/pull/62850)
+- Guarantee that the output key order of the inference model is the same as when the dynamic graph model was exported. [#63791](https://github.com/PaddlePaddle/Paddle/pull/63791)
+- Fix the subgraph error in the constant folding PASS that occurs when the folded op and its inputs and outputs are not in the same subgraph. [#62148](https://github.com/PaddlePaddle/Paddle/pull/62148)
+- Fix several runtime problems in PaddleTRT mode. These include the failure to generate the quantization calibration table caused by the yolo_box operator in int8 mode, and the error caused by incorrect handling of the dim attribute data type in the reduce operator. [#61596](https://github.com/PaddlePaddle/Paddle/pull/61596)
+- Fix some runtime errors in mixed-precision inference mode. These include errors caused by sharing weights among fused conv2d operators without correctly converting the weight layout, the fused conv2d operator backend not being properly selected as cuDNN, the fused conv2d operator incorrectly handling the bias dimension under NHWC, and incorrect handling of the input data type of norm class operators. [#60955](https://github.com/PaddlePaddle/Paddle/pull/60955), [#60076](https://github.com/PaddlePaddle/Paddle/pull/60076), [#63007](https://github.com/PaddlePaddle/Paddle/pull/63007), [#63988](https://github.com/PaddlePaddle/Paddle/pull/63988)
+- Fix the problem that config.delete_pass function does not take effect. [#61056](https://github.com/PaddlePaddle/Paddle/pull/61056)
+- Fix the GC mechanism of While control flow in PIR to recycle unwanted inputs in advance and reduce the peak memory, for example, 2GB memory reduction in LLaMA 7B model. [#63062](https://github.com/PaddlePaddle/Paddle/pull/63062)
+- Fix the oneDNN mean kernel rollback error. [#64676](https://github.com/PaddlePaddle/Paddle/pull/64676)
+- Add strong constraints to conv_bias_fuse_pass (e.g., the shape of the bias cannot be 1) to keep the inference results of the pass stable. [#64412](https://github.com/PaddlePaddle/Paddle/pull/64412)
+- Add strong constraints to conv_elementwise_add_onednn_fuse_pass (e.g., conv2d_out and residual_param must have the same size) to keep the inference results of the pass stable. [#64448](https://github.com/PaddlePaddle/Paddle/pull/64448)
+- Fix the problem of repeatedly inserting quantize and dequantize operators under certain circumstances. [#63082](https://github.com/PaddlePaddle/Paddle/pull/63082)
+
+## Hardware Adaptation
+
+### Adaptation Scheme (Custom Device)
+
+For PaddlePaddle hardware access, this release adds daily release support for four hardware platforms: Kunlun XPU, Ascend NPU, Hygon DCU, and Cambricon MLU. Meanwhile, problems in distributed communication have been fixed through large model training and inference deployment, and performance has been optimized through video memory optimization and the overlap of computation and communication. Each hardware platform also adds BFloat16 support for a large number of operators, as well as many operator fusion PASS and fusion operators. Through the combination of hardware and software, the hardware's large Transformer operator libraries are accessed to fully improve the performance of large models.
+
+#### New Features
+
+- Add support for the distributed strategy sharding stage1 v2. [#61500](https://github.com/PaddlePaddle/Paddle/pull/61500)
+- Support the BF16 data type in the distributed communication module. Add BF16 support for some operators such as empty and shape. [#60768](https://github.com/PaddlePaddle/Paddle/pull/60768), [#62140](https://github.com/PaddlePaddle/Paddle/pull/62140), [#62604](https://github.com/PaddlePaddle/Paddle/pull/62604)
+- Add support for the get_comm_name interface, the memory stat function, and recording memory time in Profiler. [#62556](https://github.com/PaddlePaddle/Paddle/pull/62556), [#61030](https://github.com/PaddlePaddle/Paddle/pull/61030), [#62292](https://github.com/PaddlePaddle/Paddle/pull/62292)
+- Add support for some fusion strategies and operators, including silu_fuse_pass, conv_elementwise_add_act_fuse_pass, and generator offset. [#60595](https://github.com/PaddlePaddle/Paddle/pull/60595), [#60708](https://github.com/PaddlePaddle/Paddle/pull/60708), [#60616](https://github.com/PaddlePaddle/Paddle/pull/60616)
+
+#### Performance Optimization
+
+- The distributed communication strategy Sharding broadcasts parameters asynchronously, to improve the overlap between computation and communication. [#59745](https://github.com/PaddlePaddle/Paddle/pull/59745)
+- Add support for STRIDED Layout operators to improve operator performance. [#62532](https://github.com/PaddlePaddle/Paddle/pull/62532), [#62697](https://github.com/PaddlePaddle/Paddle/pull/62697), [#62649](https://github.com/PaddlePaddle/Paddle/pull/62649)
+- Optimize the memory usage of the elementwise_mul operator. [#62377](https://github.com/PaddlePaddle/Paddle/pull/62377)
+
+#### Bug Fixing
+
+- Fix bugs in the distributed strategy Sharding. [#61942](https://github.com/PaddlePaddle/Paddle/pull/61942), [#62236](https://github.com/PaddlePaddle/Paddle/pull/62236), [#62305](https://github.com/PaddlePaddle/Paddle/pull/62305), [#62535](https://github.com/PaddlePaddle/Paddle/pull/62535), [#62572](https://github.com/PaddlePaddle/Paddle/pull/62572), [#61601](https://github.com/PaddlePaddle/Paddle/pull/61601)
+- Fix the problem that the c_embedding operator cannot be registered because it is not under the PHI namespace. [#60774](https://github.com/PaddlePaddle/Paddle/pull/60774)
+- Fix the xccl_comm release issue. [#60465](https://github.com/PaddlePaddle/Paddle/pull/60465)
+- Fix the data address error caused by the index_put operator falling back to CPU. [#61842](https://github.com/PaddlePaddle/Paddle/pull/61842)
+- Fix stream_safe_custom_device_allocator issue. [#63369](https://github.com/PaddlePaddle/Paddle/pull/63369)
+- Fix the distributed worker port conflict issue. [#61409](https://github.com/PaddlePaddle/Paddle/pull/61409)
+- Fix comm data type to improve device compatibility. [#62306](https://github.com/PaddlePaddle/Paddle/pull/62306)
+- Unify the use of comm data type to phi::DataType. [#62464](https://github.com/PaddlePaddle/Paddle/pull/62464), [#62562](https://github.com/PaddlePaddle/Paddle/pull/62562)
+- Fix the problem of missing precision parameter in PD_ConfigEnableCustomDevice. [#63702](https://github.com/PaddlePaddle/Paddle/pull/63702)
+
+### Kunlun XPU
+
+#### New Features
+
+- Add the support for BF16 data types for some operators, including compare_kernel and add reduce_all_kernel ([#63602](https://github.com/PaddlePaddle/Paddle/pull/63602)), empty([#60212](https://github.com/PaddlePaddle/Paddle/pull/60212)), hybrid_parallel_optimizer([#60213](https://github.com/PaddlePaddle/Paddle/pull/60213)), reduce_max/reduce_min([#60453](https://github.com/PaddlePaddle/Paddle/pull/60453)), all_reduce/concat/split([#62364](https://github.com/PaddlePaddle/Paddle/pull/62364)), tile/tile_grad([#63075](https://github.com/PaddlePaddle/Paddle/pull/63075)), accuracy([#63863](https://github.com/PaddlePaddle/Paddle/pull/63863)), swiglu/set_value([#64070](https://github.com/PaddlePaddle/Paddle/pull/64070)), amp_master_grad([#63865](https://github.com/PaddlePaddle/Paddle/pull/63865)), c_concat ([#63403](https://github.com/PaddlePaddle/Paddle/pull/63403)), flatten ([#63997](https://github.com/PaddlePaddle/Paddle/pull/63997)), compare_op ([#64473](https://github.com/PaddlePaddle/Paddle/pull/64473)), moment1/moment2 ([#62688](https://github.com/PaddlePaddle/Paddle/pull/62688)), fused_rope ([#60064](https://github.com/PaddlePaddle/Paddle/pull/60064)), c_softmax_with_cross_entropy ([#60472](https://github.com/PaddlePaddle/Paddle/pull/60472)), elementwise_pow/square/sin/cos ([#60402](https://github.com/PaddlePaddle/Paddle/pull/60402)), strided_slice ([#60382](https://github.com/PaddlePaddle/Paddle/pull/60382)), tile/sigmoid_grad ([#60119](https://github.com/PaddlePaddle/Paddle/pull/60119)), elementwise_sub/elementwise_div ([#60386](https://github.com/PaddlePaddle/Paddle/pull/60386)), softmax_with_cross_entropy ([#63759](https://github.com/PaddlePaddle/Paddle/pull/63759))
+- Add the support for INT8 data types for some operators, including multi_encoder_xpu ([#61212](https://github.com/PaddlePaddle/Paddle/pull/61212)), qkv_attention ([#63105](https://github.com/PaddlePaddle/Paddle/pull/63105))
+- Update Kunlun SDK versions including BKCL, XHPC, XCCL, etc. [#59895](https://github.com/PaddlePaddle/Paddle/pull/59895), [#59888](https://github.com/PaddlePaddle/Paddle/pull/59888), [#63624](https://github.com/PaddlePaddle/Paddle/pull/63624), [#60305](https://github.com/PaddlePaddle/Paddle/pull/60305), [#62076](https://github.com/PaddlePaddle/Paddle/pull/62076), [#62646](https://github.com/PaddlePaddle/Paddle/pull/62646), [#63520](https://github.com/PaddlePaddle/Paddle/pull/63520), [#64163](https://github.com/PaddlePaddle/Paddle/pull/64163), [#64326](https://github.com/PaddlePaddle/Paddle/pull/64326), [#60617](https://github.com/PaddlePaddle/Paddle/pull/60617), [#60377](https://github.com/PaddlePaddle/Paddle/pull/60377), [#60421](https://github.com/PaddlePaddle/Paddle/pull/60421), [#60598](https://github.com/PaddlePaddle/Paddle/pull/60598), [#61199](https://github.com/PaddlePaddle/Paddle/pull/61199)
+- Add the support for memory stat function. [#61116](https://github.com/PaddlePaddle/Paddle/pull/61116)
+- Add multi-stream support, to assign default l3/gm buffer size to each stream. [#62729](https://github.com/PaddlePaddle/Paddle/pull/62729)
+- Add nonzero operator, to support simulator XPUSIM_SKIP_RUN mode. [#60224](https://github.com/PaddlePaddle/Paddle/pull/60224), [#60388](https://github.com/PaddlePaddle/Paddle/pull/60388)
+- Add stride_slice and stride_slice_grad operators, to support strides < 0. [#62749](https://github.com/PaddlePaddle/Paddle/pull/62749)
+- Add rotary_embedding, to support use_neox_rotary_style == True. [#64090](https://github.com/PaddlePaddle/Paddle/pull/64090)
+- Add fusion Pass and fusion operators including cross_attention ([#63203](https://github.com/PaddlePaddle/Paddle/pull/63203)), fused_bias_act ([#62232](https://github.com/PaddlePaddle/Paddle/pull/62232)), fused_layernorm ([#62228](https://github.com/PaddlePaddle/Paddle/pull/62228)), group_norm_silu_xpu_fuse_pass ([#63342](https://github.com/PaddlePaddle/Paddle/pull/63342))
+- Add the support for distributed policy sharding stage3. [#57457](https://github.com/PaddlePaddle/Paddle/pull/57457)
+- Add the support for tf32 fc quantization mode. [#62273](https://github.com/PaddlePaddle/Paddle/pull/62273)
+- Add the flash attention operator. [#60065](https://github.com/PaddlePaddle/Paddle/pull/60065)
+- Add the roformer relative embedding pass & kernel and support multi_encoder_xpu. [#62089](https://github.com/PaddlePaddle/Paddle/pull/62089)
+- Add the support for pp + sharding strategy. [#63640](https://github.com/PaddlePaddle/Paddle/pull/63640)
+- Upgrade the XPU communication library architecture to support dynamic-static unified communication library function. [#63817](https://github.com/PaddlePaddle/Paddle/pull/63817)
+
+#### Performance Optimization
+
+- Add XHPC buffer manager to improve the performance of Paddle and XHPC memory collaboration. [#63924](https://github.com/PaddlePaddle/Paddle/pull/63924)
+- Enhance TensorSetConstantXPU performance and support BF16 data type. [#63920](https://github.com/PaddlePaddle/Paddle/pull/63920), [#61818](https://github.com/PaddlePaddle/Paddle/pull/61818)
+- Fuse multiple group norm + silu + conv modules and reduce video memory usage. [#62892](https://github.com/PaddlePaddle/Paddle/pull/62892)
+- Optimize XPU memory allocation in comm manager. [#64139](https://github.com/PaddlePaddle/Paddle/pull/64139)
+- Optimize operator performance, including mean_all_grad ([#61148](https://github.com/PaddlePaddle/Paddle/pull/61148)), dropout_v2 ([#61029](https://github.com/PaddlePaddle/Paddle/pull/61029)), fused_rotary_position_embedding ([#62846](https://github.com/PaddlePaddle/Paddle/pull/62846)), cross_entropy ([#63159](https://github.com/PaddlePaddle/Paddle/pull/63159)), elementwise_add ([#64289](https://github.com/PaddlePaddle/Paddle/pull/64289)), fused_gemm_epilogue ([#61350](https://github.com/PaddlePaddle/Paddle/pull/61350)), check_nan_or_inf ([#60853](https://github.com/PaddlePaddle/Paddle/pull/60853))
+
+#### Bug Fixing
+
+- Fix the tile operator support for 0-dimensional Tensor. [#64279](https://github.com/PaddlePaddle/Paddle/pull/64279)
+- Fix the group_norm_silu_fuse_pass. [#63449](https://github.com/PaddlePaddle/Paddle/pull/63449)
+- Fix the XPU API GM memory issue. [#60260](https://github.com/PaddlePaddle/Paddle/pull/60260), [#60387](https://github.com/PaddlePaddle/Paddle/pull/60387), [#62940](https://github.com/PaddlePaddle/Paddle/pull/62940)
+- Fix the distributed strategy Sharding stage1 v2 bug. [#64209](https://github.com/PaddlePaddle/Paddle/pull/64209)
+- Fix the XPU constant issue. [#60763](https://github.com/PaddlePaddle/Paddle/pull/60763)
+- Fix some operator issues, including AdamW ([#62251](https://github.com/PaddlePaddle/Paddle/pull/62251)), dropout_v3 ([#62726](https://github.com/PaddlePaddle/Paddle/pull/62726)), softmax ([#63780](https://github.com/PaddlePaddle/Paddle/pull/63780)), fused rope embedding ([#62143](https://github.com/PaddlePaddle/Paddle/pull/62143)), elementwise_add ([#60252](https://github.com/PaddlePaddle/Paddle/pull/60252)), resnet_basic_block ([#62914](https://github.com/PaddlePaddle/Paddle/pull/62914))
+- Fix XPU runtime and installation related issues. [#60028](https://github.com/PaddlePaddle/Paddle/pull/60028), [#61970](https://github.com/PaddlePaddle/Paddle/pull/61970)
+- Fix XPU compilation bugs. [#63307](https://github.com/PaddlePaddle/Paddle/pull/63307)
+- Fix end-side memory related bugs when initializing XPU communication library. [#64396](https://github.com/PaddlePaddle/Paddle/pull/64396)
-- Supports the FMHA/MMHA for CacheKV division block scheduling. [#59462](https://github.com/PaddlePaddle/Paddle/pull/59462)
-- RoPE encoding fusion operator supports input sin/cos values. [#55415](https://github.com/PaddlePaddle/Paddle/pull/55415)
-- Added fine-grained fusion operators. Supports high-performance inference optimization of generative large models. Added operators such as quant_linear, weight_quantize, and linear_compress for support of large model quantitative inference. [#57852](https://github.com/PaddlePaddle/Paddle/pull/57852),[#55128](https://github.com/PaddlePaddle/Paddle/pull/55128),[#59090](https://github.com/PaddlePaddle/Paddle/pull/59090),[#56706](https://github.com/PaddlePaddle/Paddle/pull/56706),[#59951](https://github.com/PaddlePaddle/Paddle/pull/59951),[#55490](https://github.com/PaddlePaddle/Paddle/pull/55490),[#59291](https://github.com/PaddlePaddle/Paddle/pull/59291),[#59441](https://github.com/PaddlePaddle/Paddle/pull/59441),[#59778](https://github.com/PaddlePaddle/Paddle/pull/59778),[#59651](https://github.com/PaddlePaddle/Paddle/pull/59651)[#55301](https://github.com/PaddlePaddle/Paddle/pull/55301),[#58637](https://github.com/PaddlePaddle/Paddle/pull/58637),[#56673](https://github.com/PaddlePaddle/Paddle/pull/56673),[#56401](https://github.com/PaddlePaddle/Paddle/pull/56401)
-- Supports variable length inference series API. [#57948](https://github.com/PaddlePaddle/Paddle/pull/57948)
-- Supports the GQA inference. [#58472](https://github.com/PaddlePaddle/Paddle/pull/58472),[#58836](https://github.com/PaddlePaddle/Paddle/pull/58836)
-- Added masked multihead attention. Supports high performance MMHA inference. [#55344](https://github.com/PaddlePaddle/Paddle/pull/55344),[#56411](https://github.com/PaddlePaddle/Paddle/pull/56411),[#58134](https://github.com/PaddlePaddle/Paddle/pull/58134),[#57936](https://github.com/PaddlePaddle/Paddle/pull/57936)
-- weight_quantize/weight_only_linear supports the Volta architecture. [#58082](https://github.com/PaddlePaddle/Paddle/pull/58082)
-- Added weight_only_linear_grad for support of large model weight only quantization gradient transfer-back. [#57685](https://github.com/PaddlePaddle/Paddle/pull/57685)
-- Fixed large model dynamic to static bug. Optimized communication initialization logic between static graph cards. [#56390](https://github.com/PaddlePaddle/Paddle/pull/56390),[#57169](https://github.com/PaddlePaddle/Paddle/pull/57169),[#56688](https://github.com/PaddlePaddle/Paddle/pull/56688),[#56592](https://github.com/PaddlePaddle/Paddle/pull/56592),[#58868](https://github.com/PaddlePaddle/Paddle/pull/58868)
-- Optimized top_p_sampling random number generation logic. [#59494](https://github.com/PaddlePaddle/Paddle/pull/59494)
-
-### Paddle-TensorRT Inference Optimization
-
-- elementwise_add fusion supports NHWC format. [#56795](https://github.com/PaddlePaddle/Paddle/pull/56795)
-- conv2d supports filter as input. [#55246](https://github.com/PaddlePaddle/Paddle/pull/55246)。
-- Supports BF16 and FP64 inference. [#59765](https://github.com/PaddlePaddle/Paddle/pull/59765),[#55520](https://github.com/PaddlePaddle/Paddle/pull/55520)
-- Added MarkTrtEngineOutputs API. Users can specify TensorRT Engine outputs. [#56858](https://github.com/PaddlePaddle/Paddle/pull/56858),[#56188](https://github.com/PaddlePaddle/Paddle/pull/56188),[#57407](https://github.com/PaddlePaddle/Paddle/pull/57407)
-- Customized OP can generate TensorRT Plugin automatically. [#58976](https://github.com/PaddlePaddle/Paddle/pull/58976),[#56037](https://github.com/PaddlePaddle/Paddle/pull/56037)
-- TensorRT inference allows users to specify input hook to optimize shape collection process. [#59466](https://github.com/PaddlePaddle/Paddle/pull/59466),[#54841](https://github.com/PaddlePaddle/Paddle/pull/54841),[#57498](https://github.com/PaddlePaddle/Paddle/pull/57498),[#54861](https://github.com/PaddlePaddle/Paddle/pull/54861),[#54432](https://github.com/PaddlePaddle/Paddle/pull/54432),[#55503](https://github.com/PaddlePaddle/Paddle/pull/55503)
-- TensorRT Inference supports inference model after saving Tuning. [#55893](https://github.com/PaddlePaddle/Paddle/pull/55893),[#56952](https://github.com/PaddlePaddle/Paddle/pull/56952),[#57031](https://github.com/PaddlePaddle/Paddle/pull/57031)
-- Supports variable length Transformer model PromptTuning. [#57034](https://github.com/PaddlePaddle/Paddle/pull/57034)
-- Added operators such as bitwise_and, bitwise_or, bitwise_not, cumsum, einsum, lookup_table, assign, flip, size, scatter, solve, unbind, reduce, and argsort. Optimized support of existing operators. [#59214](https://github.com/PaddlePaddle/Paddle/pull/59214),[#59293](https://github.com/PaddlePaddle/Paddle/pull/59293),[#54882](https://github.com/PaddlePaddle/Paddle/pull/54882),[#54097](https://github.com/PaddlePaddle/Paddle/pull/54097),[#54860](https://github.com/PaddlePaddle/Paddle/pull/54860),[#55426](https://github.com/PaddlePaddle/Paddle/pull/55426),[#54372](https://github.com/PaddlePaddle/Paddle/pull/54372),[#55688](https://github.com/PaddlePaddle/Paddle/pull/55688),[#56069](https://github.com/PaddlePaddle/Paddle/pull/56069),[#59563](https://github.com/PaddlePaddle/Paddle/pull/59563),[#59317](https://github.com/PaddlePaddle/Paddle/pull/59317),[#59424](https://github.com/PaddlePaddle/Paddle/pull/59424),[#55476](https://github.com/PaddlePaddle/Paddle/pull/55476),[#56043](https://github.com/PaddlePaddle/Paddle/pull/56043),[#58549](https://github.com/PaddlePaddle/Paddle/pull/58549),[#57326](https://github.com/PaddlePaddle/Paddle/pull/57326),[#59409](https://github.com/PaddlePaddle/Paddle/pull/59409))
-- TensorRT enables video memory sharing by default. [#59495](https://github.com/PaddlePaddle/Paddle/pull/59495),[#58251](https://github.com/PaddlePaddle/Paddle/pull/58251)
-- PrelnResidualBiasPluginDynamic supports 4D input. [#56304](https://github.com/PaddlePaddle/Paddle/pull/56304)
-- Added support for FlashAttention for Paddle-TRT inference for architectures below SM80.[#56492](https://github.com/PaddlePaddle/Paddle/pull/56492)
-
-### Modification deprecation
-
-- Removed fc_elementwise_add fusion from OneDNN. [#55504](https://github.com/PaddlePaddle/Paddle/pull/55504)
-- Removed redunant op. [#54442](https://github.com/PaddlePaddle/Paddle/pull/54442)
-
-### Bug Fix
-
-- Fixed “Inference so” link flags conflict issue. [#59755](https://github.com/PaddlePaddle/Paddle/pull/59755)
-- Fixed constant_folding pass execution error. [#55556](https://github.com/PaddlePaddle/Paddle/pull/55556)
-- Fixed softmax forward speed bug and reverse accuracy bug. [#56036](https://github.com/PaddlePaddle/Paddle/pull/56036),[#57858](https://github.com/PaddlePaddle/Paddle/pull/57858)[#57538](https://github.com/PaddlePaddle/Paddle/pull/57538)
-- Fixed customized OP while error and export bug. [#58898](https://github.com/PaddlePaddle/Paddle/pull/58898),[#59318](https://github.com/PaddlePaddle/Paddle/pull/59318)
-- Fixed CUDA 12.0 compilation problem on Windows platform. [#59852](https://github.com/PaddlePaddle/Paddle/pull/59852)
-- Fixed bug of inference partial operator error when TensorRT version is later than 8.6. [#54379](https://github.com/PaddlePaddle/Paddle/pull/54379),[#54679](https://github.com/PaddlePaddle/Paddle/pull/54679),[#54251](https://github.com/PaddlePaddle/Paddle/pull/54251)
-- Fixed and removed inference fusion Pass. [#54846](https://github.com/PaddlePaddle/Paddle/pull/54846),[#54887](https://github.com/PaddlePaddle/Paddle/pull/54887),[#55573](https://github.com/PaddlePaddle/Paddle/pull/55573),[#56434](https://github.com/PaddlePaddle/Paddle/pull/56434),[#56326](https://github.com/PaddlePaddle/Paddle/pull/56326),[#56753](https://github.com/PaddlePaddle/Paddle/pull/56753),[#57491](https://github.com/PaddlePaddle/Paddle/pull/57491),[#56909](https://github.com/PaddlePaddle/Paddle/pull/56909),[#54536](https://github.com/PaddlePaddle/Paddle/pull/54536),[#55073](https://github.com/PaddlePaddle/Paddle/pull/55073),[#55081](https://github.com/PaddlePaddle/Paddle/pull/55081),[#55240](https://github.com/PaddlePaddle/Paddle/pull/55240),[#56439](https://github.com/PaddlePaddle/Paddle/pull/56439),[#59009](https://github.com/PaddlePaddle/Paddle/pull/59009)
-- Fixed error of multi-stream inference context switching. [#57629](https://github.com/PaddlePaddle/Paddle/pull/57629),[#58048](https://github.com/PaddlePaddle/Paddle/pull/58048),[#54994](https://github.com/PaddlePaddle/Paddle/pull/54994)
-
-## 5. Hardware Support
-
-### Hardware Integration Solution (Custom Device)
-
-In this update, added support for distributed advanced strategy, custom operator and custom fusion strategy. By upgrading distributed communication library, supports MP, GroupShared, PP, SP, MOE and other advanced distributed strategies. Meanwhile, enables vendors to flexibly access Transformer operator libraries of different granularities, and modify computation graph through Fusion Pass for performance acceleration.
-
-#### New features
-
-- Upgraded CustomDevice to support for Paddle's latest distributed communication library CommContext. Added a variety of advanced distributed strategies such as GroupShared and MOE. [#56301](https://github.com/PaddlePaddle/Paddle/pull/56301),[#54671](https://github.com/PaddlePaddle/Paddle/pull/54671),[#57957](https://github.com/PaddlePaddle/Paddle/pull/57957),[#56669](https://github.com/PaddlePaddle/Paddle/pull/56669),[#54384](https://github.com/PaddlePaddle/Paddle/pull/54384),[#54572](https://github.com/PaddlePaddle/Paddle/pull/54572),[#54573](https://github.com/PaddlePaddle/Paddle/pull/54573),[#54676](https://github.com/PaddlePaddle/Paddle/pull/54676)
-- Upgraded CustomDevice to support CustomOP. Users can register undefined operators in Paddle PHI operator library. CustomDevice can support CustomOP via CAPI. [#57038](https://github.com/PaddlePaddle/Paddle/pull/57038),[#55532](https://github.com/PaddlePaddle/Paddle/pull/55532),[#56755](https://github.com/PaddlePaddle/Paddle/pull/56755),[#55532](https://github.com/PaddlePaddle/Paddle/pull/55532),[#55533](https://github.com/PaddlePaddle/Paddle/pull/55533),[#55659](https://github.com/PaddlePaddle/Paddle/pull/55659)
-- Added CustomDevice's support for CustomPass function. Modified the computation graph IR through Python API. [#55511](https://github.com/PaddlePaddle/Paddle/pull/55511),[#55728](https://github.com/PaddlePaddle/Paddle/pull/55728)
-- Added CustomDevice’s support for Paddle run_check. [#56318](https://github.com/PaddlePaddle/Paddle/pull/56318)
-- Added CustomDevice’s support for StreamSafeAllocator. [#55393](https://github.com/PaddlePaddle/Paddle/pull/55393),[#56380](https://github.com/PaddlePaddle/Paddle/pull/56380),[#56536](https://github.com/PaddlePaddle/Paddle/pull/56536),[#58035](https://github.com/PaddlePaddle/Paddle/pull/58035)
-- Added CustomDevice’s support for DataTransform. [#56627](https://github.com/PaddlePaddle/Paddle/pull/56627)
-
-#### Function optimization
-
-- Added CustomDevice’s support for more PaddlePaddle APIs such as Variable.set_value, adamw, share_external_data, mp_allreduce_sum, tensor.numpy, get_paddle_place, and GeneratorState. [#55272](https://github.com/PaddlePaddle/Paddle/pull/55272), [#56386](https://github.com/PaddlePaddle/Paddle/pull/56386), [#57253](https://github.com/PaddlePaddle/Paddle/pull/57253), [#56927](https://github.com/PaddlePaddle/Paddle/pull/56927),[#56189](https://github.com/PaddlePaddle/Paddle/pull/56189),[#55225](https://github.com/PaddlePaddle/Paddle/pull/55225),[#55247](https://github.com/PaddlePaddle/Paddle/pull/55247)
-- Changed the CustomDevice dynamic library loading mode from RTLD_NOW to RTLD_LAZY, which makes it easier to check the version compatibility of the CustomDevice-related software stack afterwards. [#57544](https://github.com/PaddlePaddle/Paddle/pull/57544)
-- Added CustomDevice's detection function for FP16 operator under mixed precision training. [#56053](https://github.com/PaddlePaddle/Paddle/pull/56053),[#56176](https://github.com/PaddlePaddle/Paddle/pull/56176)
-
-#### Bug Fix
-
-- Fixed some problems in CustomDevice's support for distributed communication libraries. [#55293](https://github.com/PaddlePaddle/Paddle/pull/55293),[#58038](https://github.com/PaddlePaddle/Paddle/pull/58038),[#59800](https://github.com/PaddlePaddle/Paddle/pull/59800)
-- Fixed some problems in CustomDevice on some operators, including c_softmax_with_cross_entropy, data loader, SplitDenseTensor, grad accumulation, and atan2 grad. [#56486](https://github.com/PaddlePaddle/Paddle/pull/56486),[#55541](https://github.com/PaddlePaddle/Paddle/pull/55541),[#55615](https://github.com/PaddlePaddle/Paddle/pull/55615),[#56052](https://github.com/PaddlePaddle/Paddle/pull/56052),[#56067](https://github.com/PaddlePaddle/Paddle/pull/56067)
-- Fixed some problems of device management in CustomDevice, including device exceptions ([#56556](https://github.com/PaddlePaddle/Paddle/pull/56556),[#58639](https://github.com/PaddlePaddle/Paddle/pull/58639),[#55173](https://github.com/PaddlePaddle/Paddle/pull/55173)), exception events ([#56745](https://github.com/PaddlePaddle/Paddle/pull/56745),[#58059](https://github.com/PaddlePaddle/Paddle/pull/58059)), video memory exception ([#56977](https://github.com/PaddlePaddle/Paddle/pull/56977),[#59247](https://github.com/PaddlePaddle/Paddle/pull/59247),[#54606](https://github.com/PaddlePaddle/Paddle/pull/54606)), device initialization ([#57099](https://github.com/PaddlePaddle/Paddle/pull/57099),[#57994](https://github.com/PaddlePaddle/Paddle/pull/57994)), device release ([#54932](https://github.com/PaddlePaddle/Paddle/pull/54932),[#55351](https://github.com/PaddlePaddle/Paddle/pull/55351),[#55783](https://github.com/PaddlePaddle/Paddle/pull/55783)), and device resource pooling, etc.([#55229](https://github.com/PaddlePaddle/Paddle/pull/55229),[#56580](https://github.com/PaddlePaddle/Paddle/pull/56580))
-- Fixed CustomDevice compilation-related issues. [#56760](https://github.com/PaddlePaddle/Paddle/pull/56760),[#56766](https://github.com/PaddlePaddle/Paddle/pull/56766)
-
-### Kunlunxin XPU
-
-#### New features
-
-- Added XPTI (XPU Profiling Tool Interface) to support collecting and analyzing runtime performance data. [#54685](https://github.com/PaddlePaddle/Paddle/pull/54685),[#54690](https://github.com/PaddlePaddle/Paddle/pull/54690),[#54800](https://github.com/PaddlePaddle/Paddle/pull/54800)
-- Supports Paddle's latest distributed communication library CommContext. [#59418](https://github.com/PaddlePaddle/Paddle/pull/59418)
-- Added XPU fusion operators, for example, fast_where. [#55628](https://github.com/PaddlePaddle/Paddle/pull/55628)
-- Added support for the XPU Plugin function, making it easier for users to develop customized XPU operators through XTDK programming. [#55101](https://github.com/PaddlePaddle/Paddle/pull/55101),[#59326](https://github.com/PaddlePaddle/Paddle/pull/59326)
-- Added XPU’s support for AutoGrowthAllocator. [#54121](https://github.com/PaddlePaddle/Paddle/pull/54121)
-- Added operator support list of Kunlun3. [#57683](https://github.com/PaddlePaddle/Paddle/pull/57683)
-
-#### Function optimization
-
-- Upgraded XPU Inference API. [#54342](https://github.com/PaddlePaddle/Paddle/pull/54342)
-- Optimized performance of some XPU operators. Added support for bf16 in some XPU operators, including unique/index_put, squeeze/unsqueeze kernels, swish/swish_grad, scatter_nd_add_grad/slice, rsqrt/bitwise_or/arange_tensor, where, and collective. [#56582](https://github.com/PaddlePaddle/Paddle/pull/56582),[#58161](https://github.com/PaddlePaddle/Paddle/pull/58161),[#58440](https://github.com/PaddlePaddle/Paddle/pull/58440),[#58580](https://github.com/PaddlePaddle/Paddle/pull/58580),[#58950](https://github.com/PaddlePaddle/Paddle/pull/58950),[#58616](https://github.com/PaddlePaddle/Paddle/pull/58616),[#59273](https://github.com/PaddlePaddle/Paddle/pull/59273)
-- Optimized XPU memory management to avoid memory leakage. [#59334](https://github.com/PaddlePaddle/Paddle/pull/59334),[#54847](https://github.com/PaddlePaddle/Paddle/pull/54847)
-- Supports INT8 inference. [#57258](https://github.com/PaddlePaddle/Paddle/pull/57258)
-- Added support for FP16 series inference operators. [#55642](https://github.com/PaddlePaddle/Paddle/pull/55642),[#54410](https://github.com/PaddlePaddle/Paddle/pull/54410)
-- Supports share_external_memory interface to pass input and output. [#55170](https://github.com/PaddlePaddle/Paddle/pull/55170)
-- Supports open source quantization model XPU inference. [#58568](https://github.com/PaddlePaddle/Paddle/pull/58568)
-- Added context_gm_size configuration, instead of allocating global memory in Pass. [#54674](https://github.com/PaddlePaddle/Paddle/pull/54674)
-- Added embedding and fast_gather_nd plugin. [#56488](https://github.com/PaddlePaddle/Paddle/pull/56488),[#56103](https://github.com/PaddlePaddle/Paddle/pull/56103)
-- Supports fusion of fast_layernorm + leaky_relu. [#57113](https://github.com/PaddlePaddle/Paddle/pull/57113)
-- Supports elementwise_min/max/floordiv/where inference in KL1 and KL2 precision. [#58422](https://github.com/PaddlePaddle/Paddle/pull/58422)
-- Supports autotune configuration of fc and conv2d operator. [#58801](https://github.com/PaddlePaddle/Paddle/pull/58801)
-- Supports conv and fc dynamic quantization. [#59307](https://github.com/PaddlePaddle/Paddle/pull/59307)
-- Added fc + act fusion support for sigmoid, swish, and relu6. [#54486](https://github.com/PaddlePaddle/Paddle/pull/54486)
-- elementwise_sub/elementwise_div now support the int data type. [#55920](https://github.com/PaddlePaddle/Paddle/pull/55920)
-
-#### Bug Fix
-
-- Fixed XPU communication library issues and some operator issues including rnn, layer_norm_grad, and yolo_box. ([#55475](https://github.com/PaddlePaddle/Paddle/pull/55475),[#55515](https://github.com/PaddlePaddle/Paddle/pull/55515)) ([#55656](https://github.com/PaddlePaddle/Paddle/pull/55656),[#54669](https://github.com/PaddlePaddle/Paddle/pull/54669),[#55310](https://github.com/PaddlePaddle/Paddle/pull/55310))
-
-### Hygon DCU
-
-#### Bug Fix
-
-- Fixed some operator bugs of Hygon DCU, including rnn, concat/split, fft, and so on. [#59402](https://github.com/PaddlePaddle/Paddle/pull/59402),[#55821](https://github.com/PaddlePaddle/Paddle/pull/55821),[#56340](https://github.com/PaddlePaddle/Paddle/pull/56340)
-- Fixed issues related to communication library of Hygon DCU. [#57110](https://github.com/PaddlePaddle/Paddle/pull/57110)
-- Fixed compilation-related problems of Hygon DCU. [#59775](https://github.com/PaddlePaddle/Paddle/pull/59775),[#55507](https://github.com/PaddlePaddle/Paddle/pull/55507),[#55612](https://github.com/PaddlePaddle/Paddle/pull/55612),[#54952](https://github.com/PaddlePaddle/Paddle/pull/54952),[#55076](https://github.com/PaddlePaddle/Paddle/pull/55076),[#56079](https://github.com/PaddlePaddle/Paddle/pull/56079),[#54874](https://github.com/PaddlePaddle/Paddle/pull/54874)
-- Fixed Hygon DCU support for the BF16 data type. [#56517](https://github.com/PaddlePaddle/Paddle/pull/56517)
-
-## 6. Environment Adaptation
-
-Adopted modular compilation to optimize the CMake code and improve PaddlePaddle compilation efficiency, which speeds up local development for developers. PaddlePaddle now also supports compilation with Python 3.12, CUDA 12, and the Hopper architecture, and uses Clang tools to comprehensively optimize code formatting. In addition, C++ unit tests now link dynamic libraries instead of static libraries to reduce compilation size. These improvements provide users with a smoother and more efficient installation and development experience.
-
-- CMake code optimization: split directories into independent static libraries to improve incremental compilation efficiency. [#59095](https://github.com/PaddlePaddle/Paddle/pull/59095), [#58960](https://github.com/PaddlePaddle/Paddle/pull/58960),[#56591](https://github.com/PaddlePaddle/Paddle/pull/56591),[#58484](https://github.com/PaddlePaddle/Paddle/pull/58484)
-- CMake compilation stratification: layer the compilation of the PaddlePaddle architecture from the bottom up to improve compilation efficiency. [#56442](https://github.com/PaddlePaddle/Paddle/pull/56442),[#54729](https://github.com/PaddlePaddle/Paddle/pull/54729),[#55733](https://github.com/PaddlePaddle/Paddle/pull/55733),[#56352](https://github.com/PaddlePaddle/Paddle/pull/56352),[#55109](https://github.com/PaddlePaddle/Paddle/pull/55109),[#54992](https://github.com/PaddlePaddle/Paddle/pull/54992),[#57698](https://github.com/PaddlePaddle/Paddle/pull/57698),[#55147](https://github.com/PaddlePaddle/Paddle/pull/55147),[#55113](https://github.com/PaddlePaddle/Paddle/pull/55113),[#56691](https://github.com/PaddlePaddle/Paddle/pull/56691),[#58618](https://github.com/PaddlePaddle/Paddle/pull/58618),[#58899](https://github.com/PaddlePaddle/Paddle/pull/58899),[#59140](https://github.com/PaddlePaddle/Paddle/pull/59140),[#59129](https://github.com/PaddlePaddle/Paddle/pull/59129),[#59222](https://github.com/PaddlePaddle/Paddle/pull/59222),[#59105](https://github.com/PaddlePaddle/Paddle/pull/59105),[#59711](https://github.com/PaddlePaddle/Paddle/pull/59711)
-- Offline compilation of third-party libraries: Third-party dependent libraries are compiled offline, so CI/CE system does not need to download third-party libraries repeatedly in every compilation, improving operation efficiency of the CI/CE system. [#54344](https://github.com/PaddlePaddle/Paddle/pull/54344),[#54370](https://github.com/PaddlePaddle/Paddle/pull/54370),[#54466](https://github.com/PaddlePaddle/Paddle/pull/54466),[#54438](https://github.com/PaddlePaddle/Paddle/pull/54438),[#54388](https://github.com/PaddlePaddle/Paddle/pull/54388),[#54436](https://github.com/PaddlePaddle/Paddle/pull/54436),[#54392](https://github.com/PaddlePaddle/Paddle/pull/54392),[#54646](https://github.com/PaddlePaddle/Paddle/pull/54646),[#54380](https://github.com/PaddlePaddle/Paddle/pull/54380),[#55501](https://github.com/PaddlePaddle/Paddle/pull/55501),[#55136](https://github.com/PaddlePaddle/Paddle/pull/55136),[#54451](https://github.com/PaddlePaddle/Paddle/pull/54451),[#55631](https://github.com/PaddlePaddle/Paddle/pull/55631),[#55549](https://github.com/PaddlePaddle/Paddle/pull/55549),[#56165](https://github.com/PaddlePaddle/Paddle/pull/56165),[#54391](https://github.com/PaddlePaddle/Paddle/pull/54391),[#54614](https://github.com/PaddlePaddle/Paddle/pull/54614),[#54522](https://github.com/PaddlePaddle/Paddle/pull/54522),[#54764](https://github.com/PaddlePaddle/Paddle/pull/54764),[#54400](https://github.com/PaddlePaddle/Paddle/pull/54400),[#54322](https://github.com/PaddlePaddle/Paddle/pull/54322)
-- PaddlePaddle supports Python 3.12. [#59396](https://github.com/PaddlePaddle/Paddle/pull/59396),[#58069](https://github.com/PaddlePaddle/Paddle/pull/58069)
-- Using Clang tool to optimize source codes and improve code quality. [#59626](https://github.com/PaddlePaddle/Paddle/pull/59626),[#55895](https://github.com/PaddlePaddle/Paddle/pull/55895),[#56632](https://github.com/PaddlePaddle/Paddle/pull/56632),[#54449](https://github.com/PaddlePaddle/Paddle/pull/54449),[#54523](https://github.com/PaddlePaddle/Paddle/pull/54523),[#54796](https://github.com/PaddlePaddle/Paddle/pull/54796),[#55847](https://github.com/PaddlePaddle/Paddle/pull/55847),[#55807](https://github.com/PaddlePaddle/Paddle/pull/55807),[#56261](https://github.com/PaddlePaddle/Paddle/pull/56261),[#57522](https://github.com/PaddlePaddle/Paddle/pull/57522),[#57868](https://github.com/PaddlePaddle/Paddle/pull/57868),[#57809](https://github.com/PaddlePaddle/Paddle/pull/57809),[#55658](https://github.com/PaddlePaddle/Paddle/pull/55658),[#58285](https://github.com/PaddlePaddle/Paddle/pull/58285),[#55491](https://github.com/PaddlePaddle/Paddle/pull/55491),[#55506](https://github.com/PaddlePaddle/Paddle/pull/55506),[#55279](https://github.com/PaddlePaddle/Paddle/pull/55279),[#55741](https://github.com/PaddlePaddle/Paddle/pull/55741),[#55894](https://github.com/PaddlePaddle/Paddle/pull/55894),[#55704](https://github.com/PaddlePaddle/Paddle/pull/55704),[#55800](https://github.com/PaddlePaddle/Paddle/pull/55800),[#55799](https://github.com/PaddlePaddle/Paddle/pull/55799),[#55983](https://github.com/PaddlePaddle/Paddle/pull/55983),[#55954](https://github.com/PaddlePaddle/Paddle/pull/55954),[#55764](https://github.com/PaddlePaddle/Paddle/pull/55764),[#56246](https://github.com/PaddlePaddle/Paddle/pull/56246),[#56219](https://github.com/PaddlePaddle/Paddle/pull/56219),[#56217](https://github.com/PaddlePaddle/Paddle/pull/56217),[#56216](https://github.com/PaddlePaddle/Paddle/pull/56216),[#56208](https://github.com/PaddlePaddle/Paddle/pull/56208),[#56134](https://github.com/PaddlePaddle/Paddle/pull/56134),[#56253](https://github.com/PaddlePaddle/Paddle/pull/56253),[#56255](
https://github.com/PaddlePaddle/Paddle/pull/56255),[#56693](https://github.com/PaddlePaddle/Paddle/pull/56693),[#56692](https://github.com/PaddlePaddle/Paddle/pull/56692),[#56637](https://github.com/PaddlePaddle/Paddle/pull/56637),[#56636](https://github.com/PaddlePaddle/Paddle/pull/56636),[#56647](https://github.com/PaddlePaddle/Paddle/pull/56647),[#56218](https://github.com/PaddlePaddle/Paddle/pull/56218),[#56640](https://github.com/PaddlePaddle/Paddle/pull/56640),[#56635](https://github.com/PaddlePaddle/Paddle/pull/56635),[#55675](https://github.com/PaddlePaddle/Paddle/pull/55675),[#56601](https://github.com/PaddlePaddle/Paddle/pull/56601),[#56485](https://github.com/PaddlePaddle/Paddle/pull/56485),[#56648](https://github.com/PaddlePaddle/Paddle/pull/56648),[#56747](https://github.com/PaddlePaddle/Paddle/pull/56747),[#56676](https://github.com/PaddlePaddle/Paddle/pull/56676),[#56649](https://github.com/PaddlePaddle/Paddle/pull/56649),[#56895](https://github.com/PaddlePaddle/Paddle/pull/56895),[#56994](https://github.com/PaddlePaddle/Paddle/pull/56994),[#56904](https://github.com/PaddlePaddle/Paddle/pull/56904),[#56744](https://github.com/PaddlePaddle/Paddle/pull/56744),[#56954](https://github.com/PaddlePaddle/Paddle/pull/56954),[#57114](https://github.com/PaddlePaddle/Paddle/pull/57114),[#57343](https://github.com/PaddlePaddle/Paddle/pull/57343),[#57483](https://github.com/PaddlePaddle/Paddle/pull/57483),[#57871](https://github.com/PaddlePaddle/Paddle/pull/57871),[#57861](https://github.com/PaddlePaddle/Paddle/pull/57861),[#58028](https://github.com/PaddlePaddle/Paddle/pull/58028),[#57627](https://github.com/PaddlePaddle/Paddle/pull/57627),[#59072](https://github.com/PaddlePaddle/Paddle/pull/59072)
-- C++ unit tests now link dynamic libraries instead of static libraries, reducing compilation size and improving compilation efficiency. [#59477](https://github.com/PaddlePaddle/Paddle/pull/59477),[#56630](https://github.com/PaddlePaddle/Paddle/pull/56630),[#57789](https://github.com/PaddlePaddle/Paddle/pull/57789),[#54257](https://github.com/PaddlePaddle/Paddle/pull/54257),[#59620](https://github.com/PaddlePaddle/Paddle/pull/59620),[#59384](https://github.com/PaddlePaddle/Paddle/pull/59384),[#59619](https://github.com/PaddlePaddle/Paddle/pull/59619),[#58583](https://github.com/PaddlePaddle/Paddle/pull/58583),[#58821](https://github.com/PaddlePaddle/Paddle/pull/58821),[#58710](https://github.com/PaddlePaddle/Paddle/pull/58710),[#58619](https://github.com/PaddlePaddle/Paddle/pull/58619)
-- Fixed bug related to source code compilation, improving compilation efficiency. [#56617](https://github.com/PaddlePaddle/Paddle/pull/56617),[#58195](https://github.com/PaddlePaddle/Paddle/pull/58195),[#56136](https://github.com/PaddlePaddle/Paddle/pull/56136),[#54540](https://github.com/PaddlePaddle/Paddle/pull/54540),[#57172](https://github.com/PaddlePaddle/Paddle/pull/57172),[#54429](https://github.com/PaddlePaddle/Paddle/pull/54429),[#55603](https://github.com/PaddlePaddle/Paddle/pull/55603),[#54807](https://github.com/PaddlePaddle/Paddle/pull/54807),[#56102](https://github.com/PaddlePaddle/Paddle/pull/56102),[#56829](https://github.com/PaddlePaddle/Paddle/pull/56829),[#56951](https://github.com/PaddlePaddle/Paddle/pull/56951),[#56555](https://github.com/PaddlePaddle/Paddle/pull/56555),[#57781](https://github.com/PaddlePaddle/Paddle/pull/57781),[#57836](https://github.com/PaddlePaddle/Paddle/pull/57836),[#58807](https://github.com/PaddlePaddle/Paddle/pull/58807),[#54535](https://github.com/PaddlePaddle/Paddle/pull/54535),[#54946](https://github.com/PaddlePaddle/Paddle/pull/54946),[#54437](https://github.com/PaddlePaddle/Paddle/pull/54437),[#54411](https://github.com/PaddlePaddle/Paddle/pull/54411),[#54411](https://github.com/PaddlePaddle/Paddle/pull/54411),[#54391](https://github.com/PaddlePaddle/Paddle/pull/54391),[#54466](https://github.com/PaddlePaddle/Paddle/pull/54466),[#54480](https://github.com/PaddlePaddle/Paddle/pull/54480),[#54480](https://github.com/PaddlePaddle/Paddle/pull/54480),[#54724](https://github.com/PaddlePaddle/Paddle/pull/54724),[#59193](https://github.com/PaddlePaddle/Paddle/pull/59193),[#54735](https://github.com/PaddlePaddle/Paddle/pull/54735),[#54812](https://github.com/PaddlePaddle/Paddle/pull/54812),[#56430](https://github.com/PaddlePaddle/Paddle/pull/56430),[#56655](https://github.com/PaddlePaddle/Paddle/pull/56655),[#56684](https://github.com/PaddlePaddle/Paddle/pull/56684),[#56774](https://github.com/PaddlePaddle/Paddle/pull/5677
4),[#56936](https://github.com/PaddlePaddle/Paddle/pull/56936),[#56949](https://github.com/PaddlePaddle/Paddle/pull/56949),[#56974](https://github.com/PaddlePaddle/Paddle/pull/56974),[#57171](https://github.com/PaddlePaddle/Paddle/pull/57171),[#57712](https://github.com/PaddlePaddle/Paddle/pull/57712),[#56617](https://github.com/PaddlePaddle/Paddle/pull/56617),[#58181](https://github.com/PaddlePaddle/Paddle/pull/58181),[#58253](https://github.com/PaddlePaddle/Paddle/pull/58253),[#58268](https://github.com/PaddlePaddle/Paddle/pull/58268),[#59051](https://github.com/PaddlePaddle/Paddle/pull/59051),[#59048](https://github.com/PaddlePaddle/Paddle/pull/59048),[#59081](https://github.com/PaddlePaddle/Paddle/pull/59081),[#59076](https://github.com/PaddlePaddle/Paddle/pull/59076),[#59155](https://github.com/PaddlePaddle/Paddle/pull/59155),[#59253](https://github.com/PaddlePaddle/Paddle/pull/59253),[#59347](https://github.com/PaddlePaddle/Paddle/pull/59347),[#58957](https://github.com/PaddlePaddle/Paddle/pull/58957),[#59443](https://github.com/PaddlePaddle/Paddle/pull/59443),[#58998](https://github.com/PaddlePaddle/Paddle/pull/58998),[#57574](https://github.com/PaddlePaddle/Paddle/pull/57574),[#55889](https://github.com/PaddlePaddle/Paddle/pull/55889),[#59078](https://github.com/PaddlePaddle/Paddle/pull/59078),[#55762](https://github.com/PaddlePaddle/Paddle/pull/55762),[#56252](https://github.com/PaddlePaddle/Paddle/pull/56252),[#56715](https://github.com/PaddlePaddle/Paddle/pull/56715),[#54905](https://github.com/PaddlePaddle/Paddle/pull/54905),[#56978](https://github.com/PaddlePaddle/Paddle/pull/56978),[#57032](https://github.com/PaddlePaddle/Paddle/pull/57032),[#57179](https://github.com/PaddlePaddle/Paddle/pull/57179),[#57179](https://github.com/PaddlePaddle/Paddle/pull/57179),[#58996](https://github.com/PaddlePaddle/Paddle/pull/58996),[#59915](https://github.com/PaddlePaddle/Paddle/pull/59915),[#54883](https://github.com/PaddlePaddle/Paddle/pull/54883),[#56746](https://
github.com/PaddlePaddle/Paddle/pull/56746),[#57674](https://github.com/PaddlePaddle/Paddle/pull/57674),[#60117](https://github.com/PaddlePaddle/Paddle/pull/60117),[#55627](https://github.com/PaddlePaddle/Paddle/pull/55627),[#54568](https://github.com/PaddlePaddle/Paddle/pull/54568),[#54450](https://github.com/PaddlePaddle/Paddle/pull/54450),[#54513](https://github.com/PaddlePaddle/Paddle/pull/54513),[#54615](https://github.com/PaddlePaddle/Paddle/pull/54615),[#54913](https://github.com/PaddlePaddle/Paddle/pull/54913),[#54916](https://github.com/PaddlePaddle/Paddle/pull/54916),[#55148](https://github.com/PaddlePaddle/Paddle/pull/55148),[#55125](https://github.com/PaddlePaddle/Paddle/pull/55125),[#55479](https://github.com/PaddlePaddle/Paddle/pull/55479),[#55723](https://github.com/PaddlePaddle/Paddle/pull/55723),[#55831](https://github.com/PaddlePaddle/Paddle/pull/55831),[#55904](https://github.com/PaddlePaddle/Paddle/pull/55904),[#56085](https://github.com/PaddlePaddle/Paddle/pull/56085),[#56259](https://github.com/PaddlePaddle/Paddle/pull/56259),[#56366](https://github.com/PaddlePaddle/Paddle/pull/56366),[#56366](https://github.com/PaddlePaddle/Paddle/pull/56366),[#56546](https://github.com/PaddlePaddle/Paddle/pull/56546),[#56679](https://github.com/PaddlePaddle/Paddle/pull/56679),[#57222](https://github.com/PaddlePaddle/Paddle/pull/57222),[#57387](https://github.com/PaddlePaddle/Paddle/pull/57387),[#57993](https://github.com/PaddlePaddle/Paddle/pull/57993),[#59556](https://github.com/PaddlePaddle/Paddle/pull/59556),[#57931](https://github.com/PaddlePaddle/Paddle/pull/57931),[#58112](https://github.com/PaddlePaddle/Paddle/pull/58112),[#54228](https://github.com/PaddlePaddle/Paddle/pull/54228),[#56913](https://github.com/PaddlePaddle/Paddle/pull/56913),[#56993](https://github.com/PaddlePaddle/Paddle/pull/56993),[#55042](https://github.com/PaddlePaddle/Paddle/pull/55042),[#55305](https://github.com/PaddlePaddle/Paddle/pull/55305),[#55286](https://github.com/PaddlePad
dle/Paddle/pull/55286),[#56634](https://github.com/PaddlePaddle/Paddle/pull/56634),[#57778](https://github.com/PaddlePaddle/Paddle/pull/57778),[#58374](https://github.com/PaddlePaddle/Paddle/pull/58374),[#58640](https://github.com/PaddlePaddle/Paddle/pull/58640),[#58822](https://github.com/PaddlePaddle/Paddle/pull/58822),[#59055](https://github.com/PaddlePaddle/Paddle/pull/59055),[#59303](https://github.com/PaddlePaddle/Paddle/pull/59303),[#59487](https://github.com/PaddlePaddle/Paddle/pull/59487),[#58400](https://github.com/PaddlePaddle/Paddle/pull/58400),[#59283](https://github.com/PaddlePaddle/Paddle/pull/59283),[#54791](https://github.com/PaddlePaddle/Paddle/pull/54791),[#59134](https://github.com/PaddlePaddle/Paddle/pull/59134),[#56206](https://github.com/PaddlePaddle/Paddle/pull/56206),[#56199](https://github.com/PaddlePaddle/Paddle/pull/56199),[#56670](https://github.com/PaddlePaddle/Paddle/pull/56670),[#58923](https://github.com/PaddlePaddle/Paddle/pull/58923)
-- Fixed bugs related to Paddle ARM compilation. [#55416](https://github.com/PaddlePaddle/Paddle/pull/55416),[#55548](https://github.com/PaddlePaddle/Paddle/pull/55548)
-
-## Thanks to Our Contributors
-
-Azure-Tang, zhaoyinglia, From00, JZ-LIANG, xysheng-baidu, SylarTiaNII, kuizhiqing, zhiqiu, FeixLiu, liuzhenhai93, GhostScreaming, pangengzheng, xiaoyewww, wanghuancoder, ForFishes, hitywt, danleifeng, tianshuo78520a, ykkk2333, houj04, lj970926, XiaociZhang, HarperCy, cqulilujia, runzhech, RuohengMa, Caozhou1995, kangguangli, heavyrain-lzy, zyfncg, SigureMo, YuanRisheng, lchdl, LiYuRio, AndSonder, Wennie396, zhangbo9674, liudongxue01, risemeup1, phlrain, winter-wang, yuanlehome, NALLEIN, Liujie0926, yuguo-Jack, gitliuyf, zh794390558, Aurelius84, 6clc, GGBond8488, xiaoguoguo626807, Wong4j, iosmers, xiaoxiaohehe001, LielinJiang, carryyu, Difers, yangxiaoyu14, xuxinyi389, cxxly, gongshaotian, jjyaoao, lijialin03, lxd-cumt, cyber-pioneer, HydrogenSulfate, MayYouBeProsperous, Charles-hit, Patrick-Star125, ScottWong98, huangjiyi, DrRyanHuang, jinyouzhi, BeingGod, Wanglongzhi2001, yangguohao, zyt1024, longranger2, 2742195759, megemini, thisjiang, kevincheng2, zhoutianzi666, Wangzheee, ming1753, tianhaodongbd, freeliuzc, zhenyun-li, MARD1NO, RichardWooSJTU, eee4017, leo0519, csy0225, wwbitejotunn, bukejiyu, jiweibo, iamsonderr, ckl117, ronny1996, zhanglirong1999, LLee233, ZHUI, wangxn12138, zhwesky2010, Courtesy-Xs, zoooo0820, llyyxx0413, Asthestarsfalll, zxcd, pkuzyc, idontkonwher, sneaxiy, hong19860320, ZibinGuo, leolishaohao, MuShangCC, zhupengyang, shentanyue, Travis-Lee, wz1qqx, frank-oops, newway, QingshuChen, zhangyk0314, HandSomeLEEw, Shixiaowei02, zhangyuqin1998, Xing-lil, zhhsplendid, jiahy0825, xinyu-intel, MarioLulab, 0x45f, Tom-Zheng, xingmingyyj, zhangbopd, gouzil, zeroRains, BiynXu, WintersMontagne10335, wuhuachaocoding, GreatV, chenwhql, deepllz, parap1uie-s, ozogxyz, FisherWY, changeyoung98, zhiboniu, YangQun1 dynamicheart, Xreki, liugddx, Lylinnnnn, YSF-A, zzjjay, YanhuiDua, lishicheng1996, USTCKAY, abenmao, cocoshe, HermitSun, ccsuzzh, sanbuphy, enkilee, RedContritio, Liyulingyue, zrr1999, chen2016013, Galaxy1458, chalsliu, mrcangye, XieYunshen, 
zhiheng-liu, haohongxiang, ZzSean, JamesLim-sy, yuehuayingxueluo, niuliling123, umiswing, sijunhe, littsk, SecretXV, zhurou603, zhangjun, caizejun, yangjianfengo1, vivienfanghuagood, Xinyu302, lizexu123, yghstill, Li-fAngyU, VigiZhang, co63oc, dhanush-2501, ooooo-create, PommesPeter, zeus2x7, akshatvishu, jzhang533, Sekiro-x, gumblex, BernieHuang2008, YibinLiu666, qiuwenbogdut, XavierZXY, MqLeet, zhangting2020, mingxu1067, Ainavo, SSKlearns, yuchen202, silverling, zade23, wenxiaohahaha, NKNaN, Tsaiyue, fsczz, Tomoko-hjf, rhmaaa, zbt78, Hhankyangg, wangzhen38, zhengqiwen1997, engineer1109, onepick, qili93, Rane2021, nemonameless, DesmonDay, RachelXu7, ceci3, lyuwenyu, liuruyan, LokeZhou, shiyutang, lanxianghit, feifei-111, Sahala08, sunzhongkai588, Kaedeharai, Candy2Tang, liyongchao911, whisky-12, InsaneOnion, yoyoIcy, KongAKun, linzeyang, MuhammadNizamani, eltociear, Ligoml, LUZY0726, Windfarer, FlyingQianMM, jeng1220, junelotus, zlsh80826, Vvsmile, Frida-a, TonibMw, guoshengCS, zhink, ZhangYulongg, AlbertVan, fengxin-hello, mjp9527, entired, DanGuge.
-
-# 2.5.0 Release Note
-
-## 1. Highlights
-- **New dynamic-static unification architecture**: Implemented a new execution model that combines dynamic-to-static conversion with the compiler, built on basic operators, and completed the full dynamic-to-static, combinator, and neural network compiler optimization and acceleration pipeline on the ResNet50 and Bert models. For dynamic-to-static, completed development of the core whole-graph fallback function, which falls back to dynamic graph training when dynamic-to-static conversion fails. For the combinator, designed a basic operator system containing more than 150 basic operators, implemented the forward operator splitting mechanism at the Python layer and the reverse operator splitting mechanism for static graphs, and split more than 70 commonly used forward and reverse operators. For the CINN compiler, fixed correctness bugs, developed key Passes, added manual schedule rules, enabled automatic generation of kernel code, and improved performance of the ResNet50 model by 12% and the Bert model by 10%.
-- **Operator architecture unification of PHI operator library**: Unified all remaining 350+ operator kernels from the original operator system into the PHI operator library. Converted the operator definitions of the original operator system to the definition form used by the PHI operator library (operator definitions configured in YAML), which makes the architecture more uniform and lowers the comprehension cost of framework development. Decoupled all the Fluid header files that the PHI operator library depends on and compiled them independently as dynamic link libraries, providing lighter-weight reuse of the operator library for secondary development of the framework. Continued to standardize and adjust non-standard operators and operator kernels in the PaddlePaddle framework, making them easier for developers to understand and reducing the cost of hardware adaptation.
-- **Full launch of the new executor for static graph**: The new static graph executor implements a number of functional and performance optimizations, and has unified and replaced the original multiple sets of old executors. The new executor is now the default back-end execution engine for the Python-side entry points of static graph single-card and distributed training, as well as for dynamic-to-static, control flow, CINN, etc. This significantly improves the scheduling performance of the framework, makes the functional architecture clearer, and significantly enhances secondary development capability.
-- **Python API supporting 0-dimensional tensor**: Defined clear semantics that distinguish a tensor of shape [1,] from a tensor of shape [], and fixed the behavior of many APIs, such as `paddle.sum`, to support tensors of shape [].
-- **New environment adaptation**: Adapt to CUDA 12. Compilation with gcc12 is supported.
-
-## 2. Incompatibility Upgrade
-- PaddlePaddle API supports 0-dimensional tensor. PaddlePaddle previously used a 1-dimensional tensor with a shape of [1] instead of a 0-dimensional tensor, which differs from current mainstream practice. This increased the development and debugging cost of models and sometimes led to unintended errors. This release fixes 376 APIs that need to support 0-dimensional tensor, and implements tools widely used by the community such as EinOps. For example, the output loss in model training was previously a 1-dimensional tensor, so taking out or printing the loss often required code like `loss.numpy()[0]`. After this modification, the output loss in model training is a 0-dimensional tensor, and `loss.numpy()` alone is enough to take out or print the loss. The code is shorter, easier to understand, and in line with industry practice.
-- `paddle.fluid` API is fully decommissioned. According to the plan previewed in the last version, 1116 `paddle.fluid` APIs and related internal interfaces have been decommissioned, and the few remaining related internal interfaces will be cleaned up in the next version. The fluid API is among the historical APIs that PaddlePaddle 2.0 had planned to remove, but the cleanup was delayed for compatibility and other reasons. This decommissioning will not affect programs developed based on PaddlePaddle 2.0, and the PaddlePaddle API system will be more concise and easier to understand.
-- Completed code cleanup of the old version of the dynamic graph on the Python side. The Python side now uses only the new version of dynamic graph to call the C++ core logic.
-- To unify the data parallel training method for static graph models, the original single-process multi-card training method is abandoned, including the `paddle.static.ParallelExecutor` and `paddle.static.CompiledProgram().with_data_parallel()` APIs, because this set of APIs supports only single-machine multi-card training, does not support multi-machine multi-card training, and has poor underlying execution performance. It is recommended to use the multi-process multi-card training method uniformly, i.e., the `paddle.distributed.launch` API for distributed training with data parallel. This upgrade affects only static graphs, and does not affect dynamic graphs or dynamic-to-static training. If you use the decommissioned APIs, please refer to the documentation on [data parallel](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/06_distributed_training/cluster_quick_start_collective_cn.html) to modify your model code. [#50351](https://github.com/PaddlePaddle/Paddle/pull/50351),[#50501](https://github.com/PaddlePaddle/Paddle/pull/50501),[#51240](https://github.com/PaddlePaddle/Paddle/pull/51240),[#51701](https://github.com/PaddlePaddle/Paddle/pull/51701),[#51616](https://github.com/PaddlePaddle/Paddle/pull/51616),[#51369](https://github.com/PaddlePaddle/Paddle/pull/51369),[#52671](https://github.com/PaddlePaddle/Paddle/pull/52671)
-- Removed the original adaptation code for Ascend NPU and Cambricon MLU from the framework, upgraded both to CustomDevice plug-in adaptation, and migrated their adaptation code to the PaddleCustomDevice repository.
-
-## 3. Training Framework (Including Distributed)
-### Python API
-#### APIs supporting 0-dimensional tensors
-- API input supports 0-dimensional tensor, involving `paddle.reshape `, `paddle.trace `, `paddle.linalg.norm ` and other 286 APIs. [#53208](https://github.com/PaddlePaddle/Paddle/pull/53208), [#53592](https://github.com/PaddlePaddle/Paddle/pull/53592), [#47074](https://github.com/PaddlePaddle/Paddle/pull/47074), [#53186](https://github.com/PaddlePaddle/Paddle/pull/53186), [#47677](https://github.com/PaddlePaddle/Paddle/pull/47677), [#49357](https://github.com/PaddlePaddle/Paddle/pull/49357), [#50237](https://github.com/PaddlePaddle/Paddle/pull/50237), [#46555](https://github.com/PaddlePaddle/Paddle/pull/46555), [#47219](https://github.com/PaddlePaddle/Paddle/pull/47219), [#47501](https://github.com/PaddlePaddle/Paddle/pull/47501), [#47858](https://github.com/PaddlePaddle/Paddle/pull/47858), [#47961](https://github.com/PaddlePaddle/Paddle/pull/47961), [#48058](https://github.com/PaddlePaddle/Paddle/pull/48058), [#48007](https://github.com/PaddlePaddle/Paddle/pull/48007), [#49755](https://github.com/PaddlePaddle/Paddle/pull/49755), [#51024](https://github.com/PaddlePaddle/Paddle/pull/51024), [#51566](https://github.com/PaddlePaddle/Paddle/pull/51566), [#51899](https://github.com/PaddlePaddle/Paddle/pull/51899), [#49813](https://github.com/PaddlePaddle/Paddle/pull/49813), [#47812](https://github.com/PaddlePaddle/Paddle/pull/47812), [#47849](https://github.com/PaddlePaddle/Paddle/pull/47849), [#47251](https://github.com/PaddlePaddle/Paddle/pull/47251), [#53125](https://github.com/PaddlePaddle/Paddle/pull/53125), [#53828](https://github.com/PaddlePaddle/Paddle/pull/53828), [#51265](https://github.com/PaddlePaddle/Paddle/pull/51265), [#47689](https://github.com/PaddlePaddle/Paddle/pull/47689), [#48452](https://github.com/PaddlePaddle/Paddle/pull/48452), [#49072](https://github.com/PaddlePaddle/Paddle/pull/49072), [#48638](https://github.com/PaddlePaddle/Paddle/pull/48638), [#49175](https://github.com/PaddlePaddle/Paddle/pull/49175), 
[#49279](https://github.com/PaddlePaddle/Paddle/pull/49279), [#50857](https://github.com/PaddlePaddle/Paddle/pull/50857), [#49805](https://github.com/PaddlePaddle/Paddle/pull/49805), [#47734](https://github.com/PaddlePaddle/Paddle/pull/47734), [#45992](https://github.com/PaddlePaddle/Paddle/pull/45992), [#49616](https://github.com/PaddlePaddle/Paddle/pull/49616), [#49959](https://github.com/PaddlePaddle/Paddle/pull/49959), [#50536](https://github.com/PaddlePaddle/Paddle/pull/50536), [#49544](https://github.com/PaddlePaddle/Paddle/pull/49544), [#49842](https://github.com/PaddlePaddle/Paddle/pull/49842), [#46909](https://github.com/PaddlePaddle/Paddle/pull/46909), [#49361](https://github.com/PaddlePaddle/Paddle/pull/49361), [#50169](https://github.com/PaddlePaddle/Paddle/pull/50169), [#48314](https://github.com/PaddlePaddle/Paddle/pull/48314), [#48735](https://github.com/PaddlePaddle/Paddle/pull/48735), [#49122](https://github.com/PaddlePaddle/Paddle/pull/49122), [#49122](https://github.com/PaddlePaddle/Paddle/pull/49122), [#49177](https://github.com/PaddlePaddle/Paddle/pull/49177), [#49501](https://github.com/PaddlePaddle/Paddle/pull/49501), [#49562](https://github.com/PaddlePaddle/Paddle/pull/49562), [#49340](https://github.com/PaddlePaddle/Paddle/pull/49340), [#49550](https://github.com/PaddlePaddle/Paddle/pull/49550), [#49596](https://github.com/PaddlePaddle/Paddle/pull/49596), [#49730](https://github.com/PaddlePaddle/Paddle/pull/49730), [#49667](https://github.com/PaddlePaddle/Paddle/pull/49667), [#49692](https://github.com/PaddlePaddle/Paddle/pull/49692), [#49854](https://github.com/PaddlePaddle/Paddle/pull/49854), [#49845](https://github.com/PaddlePaddle/Paddle/pull/49845), [#49803](https://github.com/PaddlePaddle/Paddle/pull/49803), [#49889](https://github.com/PaddlePaddle/Paddle/pull/49889), [#49904](https://github.com/PaddlePaddle/Paddle/pull/49904), [#49518](https://github.com/PaddlePaddle/Paddle/pull/49518), 
[#49884](https://github.com/PaddlePaddle/Paddle/pull/49884), [#49880](https://github.com/PaddlePaddle/Paddle/pull/49880), [#49862](https://github.com/PaddlePaddle/Paddle/pull/49862), [#49921](https://github.com/PaddlePaddle/Paddle/pull/49921), [#49260](https://github.com/PaddlePaddle/Paddle/pull/49260), [#49929](https://github.com/PaddlePaddle/Paddle/pull/49929), [#49570](https://github.com/PaddlePaddle/Paddle/pull/49570), [#49882](https://github.com/PaddlePaddle/Paddle/pull/49882), [#50213](https://github.com/PaddlePaddle/Paddle/pull/50213), [#49780](https://github.com/PaddlePaddle/Paddle/pull/49780), [#50271](https://github.com/PaddlePaddle/Paddle/pull/50271), [#50289](https://github.com/PaddlePaddle/Paddle/pull/50289), [#50293](https://github.com/PaddlePaddle/Paddle/pull/50293), [#49735](https://github.com/PaddlePaddle/Paddle/pull/49735), [#50433](https://github.com/PaddlePaddle/Paddle/pull/50433), [#49847](https://github.com/PaddlePaddle/Paddle/pull/49847), [#50635](https://github.com/PaddlePaddle/Paddle/pull/50635), [#50950](https://github.com/PaddlePaddle/Paddle/pull/50950), [#50947](https://github.com/PaddlePaddle/Paddle/pull/50947), [#49460](https://github.com/PaddlePaddle/Paddle/pull/49460), [#53087](https://github.com/PaddlePaddle/Paddle/pull/53087), [#51687](https://github.com/PaddlePaddle/Paddle/pull/51687), [#52185](https://github.com/PaddlePaddle/Paddle/pull/52185), [#54649](https://github.com/PaddlePaddle/Paddle/pull/54649)
-- API output supports 0-dimensional tensors, involving `paddle.sum`, `paddle.min/max`, `paddle.any/all`, and 90 other APIs. [#52891](https://github.com/PaddlePaddle/Paddle/pull/52891), [#52861](https://github.com/PaddlePaddle/Paddle/pull/52861), [#52775](https://github.com/PaddlePaddle/Paddle/pull/52775), [#52850](https://github.com/PaddlePaddle/Paddle/pull/52850), [#52843](https://github.com/PaddlePaddle/Paddle/pull/52843), [#52857](https://github.com/PaddlePaddle/Paddle/pull/52857), [#51721](https://github.com/PaddlePaddle/Paddle/pull/51721), [#53051](https://github.com/PaddlePaddle/Paddle/pull/53051), [#53192](https://github.com/PaddlePaddle/Paddle/pull/53192), [#52739](https://github.com/PaddlePaddle/Paddle/pull/52739), [#52741](https://github.com/PaddlePaddle/Paddle/pull/52741), [#53175](https://github.com/PaddlePaddle/Paddle/pull/53175), [#51889](https://github.com/PaddlePaddle/Paddle/pull/51889), [#53199](https://github.com/PaddlePaddle/Paddle/pull/53199), [#53242](https://github.com/PaddlePaddle/Paddle/pull/53242), [#53421](https://github.com/PaddlePaddle/Paddle/pull/53421)
-- Along with 0-dimensional tensor support, fix existing non-standard code, and provide hints and compatibility for non-standard usage in model code. [#51562](https://github.com/PaddlePaddle/Paddle/pull/51562), [#51586](https://github.com/PaddlePaddle/Paddle/pull/51586), [#51757](https://github.com/PaddlePaddle/Paddle/pull/51757), [#52197](https://github.com/PaddlePaddle/Paddle/pull/52197), [#54117](https://github.com/PaddlePaddle/Paddle/pull/54117)
-
-#### New APIs
-- Add `paddle.autograd.jacobian` and `paddle.autograd.hessian` APIs for scientific computing. [#53331](https://github.com/PaddlePaddle/Paddle/pull/53331)
-- Add sparse computing APIs, for example `paddle.sparse.reshape`, `paddle.sparse.sum`, and `paddle.sparse.slice`. [#46694](https://github.com/PaddlePaddle/Paddle/pull/46694), [#51513](https://github.com/PaddlePaddle/Paddle/pull/51513), [#53794](https://github.com/PaddlePaddle/Paddle/pull/53794), [#51406](https://github.com/PaddlePaddle/Paddle/pull/51406)
-- Add other APIs, for example `paddle.optimizer.LBFGS`, `paddle.index_put`, and `paddle.logaddexp`. [#53314](https://github.com/PaddlePaddle/Paddle/pull/53314), [#51912](https://github.com/PaddlePaddle/Paddle/pull/51912), [#52886](https://github.com/PaddlePaddle/Paddle/pull/52886), [#50843](https://github.com/PaddlePaddle/Paddle/pull/50843), [#47282](https://github.com/PaddlePaddle/Paddle/pull/47282), [#52284](https://github.com/PaddlePaddle/Paddle/pull/52284)
-
-### Dynamic graphs
-#### New features
-- Add `paddle.nn.utils.clip_grad_norm_` for gradient clipping and `paddle.Tensor.data_ptr` for getting the memory/GPU memory address of a Tensor's data. [PR49935](https://github.com/PaddlePaddle/Paddle/pull/49935), [PR48235](https://github.com/PaddlePaddle/Paddle/pull/48235), [PR49173](https://github.com/PaddlePaddle/Paddle/pull/49173)
-- Add the saved_tensors_hooks mechanism, for temporary storage and retrieval of forward Tensor used in backward computation. [PR45763](https://github.com/PaddlePaddle/Paddle/pull/45763), [PR46215](https://github.com/PaddlePaddle/Paddle/pull/46215), [PR48124](https://github.com/PaddlePaddle/Paddle/pull/48124)
-- Tensor supports pickling, enabling serialization of Tensors. [PR47025](https://github.com/PaddlePaddle/Paddle/pull/47025), [PR48179](https://github.com/PaddlePaddle/Paddle/pull/48179)
-- Add debug logs that print the forward Python stack when nan/inf appears in the backward pass. [PR53217](https://github.com/PaddlePaddle/Paddle/pull/53217), [PR52639](https://github.com/PaddlePaddle/Paddle/pull/52639), [PR52729](https://github.com/PaddlePaddle/Paddle/pull/52729)
-- Add higher-order differentiation support for expand_v2, tile, concat, assign, and slice. [PR45941](https://github.com/PaddlePaddle/Paddle/pull/45941), [PR45942](https://github.com/PaddlePaddle/Paddle/pull/45942), [PR45940](https://github.com/PaddlePaddle/Paddle/pull/45940), [PR45879](https://github.com/PaddlePaddle/Paddle/pull/45879), [PR45960](https://github.com/PaddlePaddle/Paddle/pull/45960)
-
-#### Improvements
-- Optimize log printing for dynamic graphs, including log content, VLog level, and error reporting content. [PR45783](https://github.com/PaddlePaddle/Paddle/pull/45783), [PR46349](https://github.com/PaddlePaddle/Paddle/pull/46349), [PR46934](https://github.com/PaddlePaddle/Paddle/pull/46934), [PR47724](https://github.com/PaddlePaddle/Paddle/pull/47724)
-- Add FLAGS_auto_growth_chunk_size_in_mb for minimum chunk size settings of auto_growth_allocator. [PR52204](https://github.com/PaddlePaddle/Paddle/pull/52204)
-
-#### Bug fixes
-- Fix bugs in some operators, including batch_norm, slice, set_value, scale, multinomial, adam, conv, transpose2_grad, conv2d_transpose_double_grad. [PR47802](https://github.com/PaddlePaddle/Paddle/pull/47802), [PR47634](https://github.com/PaddlePaddle/Paddle/pull/47634), [PR47349](https://github.com/PaddlePaddle/Paddle/pull/47349), [PR46124](https://github.com/PaddlePaddle/Paddle/pull/46124), [PR46147](https://github.com/PaddlePaddle/Paddle/pull/46147), [PR50388](https://github.com/PaddlePaddle/Paddle/pull/50388), [PR48626](https://github.com/PaddlePaddle/Paddle/pull/48626), [PR48519](https://github.com/PaddlePaddle/Paddle/pull/48519), [PR50386](https://github.com/PaddlePaddle/Paddle/pull/50386), [PR48432](https://github.com/PaddlePaddle/Paddle/pull/48432), [PR51851](https://github.com/PaddlePaddle/Paddle/pull/51851)
-- Fix some PyLayer bugs. [PR51740](https://github.com/PaddlePaddle/Paddle/pull/51740), [PR47154](https://github.com/PaddlePaddle/Paddle/pull/47154), [PR47323](https://github.com/PaddlePaddle/Paddle/pull/47323), [PR54041](https://github.com/PaddlePaddle/Paddle/pull/54041), [PR48533](https://github.com/PaddlePaddle/Paddle/pull/48533)
-- Ensure sync_batch_norm executes in order in the backward pass, to avoid hangs or precision errors caused by misordering. [PR52268](https://github.com/PaddlePaddle/Paddle/pull/52268), [PR52860](https://github.com/PaddlePaddle/Paddle/pull/52860), [PR52779](https://github.com/PaddlePaddle/Paddle/pull/52779)
-- Fix a bug of linspace under AMP. [PR46088](https://github.com/PaddlePaddle/Paddle/pull/46088)
-- Fix Python C API’s incorrect call that causes Windows to crash. [PR46833](https://github.com/PaddlePaddle/Paddle/pull/46833)
-- Fix the bug that DataLoader may fail to clean up /dev/shm. [PR48511](https://github.com/PaddlePaddle/Paddle/pull/48511)
-- Fix some bugs of paddle.grad. [PR47151](https://github.com/PaddlePaddle/Paddle/pull/47151)
-- Add error message for operators that do not support higher order differentiation. [PR47231](https://github.com/PaddlePaddle/Paddle/pull/47231)
-- Add NumPy array support for Python operators. [PR48229](https://github.com/PaddlePaddle/Paddle/pull/48229)
-- Remove one of the two duplicate element_size APIs. [PR49631](https://github.com/PaddlePaddle/Paddle/pull/49631)
-- Fix a crash when enabling VLOG for the old dynamic graph. [PR47115](https://github.com/PaddlePaddle/Paddle/pull/47115)
-- For XPU, replace d2d copies with d2h+h2d copies to solve a multi-threading problem. [PR48373](https://github.com/PaddlePaddle/Paddle/pull/48373)
-
-#### Performance optimization
-- Sink Python operators to C++ implementations to improve API performance. APIs of this class run 3x to 6x faster after sinking. [PR45811](https://github.com/PaddlePaddle/Paddle/pull/45811), [PR46326](https://github.com/PaddlePaddle/Paddle/pull/46326), [PR46329](https://github.com/PaddlePaddle/Paddle/pull/46329), [PR46520](https://github.com/PaddlePaddle/Paddle/pull/46520), [PR46542](https://github.com/PaddlePaddle/Paddle/pull/46542), [PR46565](https://github.com/PaddlePaddle/Paddle/pull/46565), [PR47060](https://github.com/PaddlePaddle/Paddle/pull/47060), [PR47077](https://github.com/PaddlePaddle/Paddle/pull/47077), [PR47174](https://github.com/PaddlePaddle/Paddle/pull/47174), [PR47315](https://github.com/PaddlePaddle/Paddle/pull/47315)
-- Optimize the CPU scheduling performance of the Optimizer to reduce GPU idle gaps caused by the Optimizer phase. [PR49787](https://github.com/PaddlePaddle/Paddle/pull/49787), [PR50188](https://github.com/PaddlePaddle/Paddle/pull/50188), [PR51340](https://github.com/PaddlePaddle/Paddle/pull/51340), [PR49864](https://github.com/PaddlePaddle/Paddle/pull/49864), [PR50158](https://github.com/PaddlePaddle/Paddle/pull/50158), [PR50335](https://github.com/PaddlePaddle/Paddle/pull/50335)
-- Sink APIs whose logic can be moved to C++ into C++, to improve API performance. [PR46412](https://github.com/PaddlePaddle/Paddle/pull/46412), [PR46190](https://github.com/PaddlePaddle/Paddle/pull/46190)
-- Remove unnecessary call logic on the Python side under dynamic graph, to improve API performance. [PR46221](https://github.com/PaddlePaddle/Paddle/pull/46221), [PR49473](https://github.com/PaddlePaddle/Paddle/pull/49473), [PR49574](https://github.com/PaddlePaddle/Paddle/pull/49574), [PR49589](https://github.com/PaddlePaddle/Paddle/pull/49589), [PR49612](https://github.com/PaddlePaddle/Paddle/pull/49612), [PR49717](https://github.com/PaddlePaddle/Paddle/pull/49717), [PR49733](https://github.com/PaddlePaddle/Paddle/pull/49733), [PR49823](https://github.com/PaddlePaddle/Paddle/pull/49823), [PR49508](https://github.com/PaddlePaddle/Paddle/pull/49508), [PR46840](https://github.com/PaddlePaddle/Paddle/pull/46840)
-- Optimize use of Allocator to improve dynamic graph API scheduling performance. [PR47125](https://github.com/PaddlePaddle/Paddle/pull/47125), [PR48548](https://github.com/PaddlePaddle/Paddle/pull/48548), [PR50995](https://github.com/PaddlePaddle/Paddle/pull/50995), [PR47731](https://github.com/PaddlePaddle/Paddle/pull/47731)
-- Optimize fused_attention operator performance. [PR48902](https://github.com/PaddlePaddle/Paddle/pull/48902)
-- For optimizer's _add_accumulator, if device is CPU and under dynamic graphs, use full to initialize var directly. [PR48189](https://github.com/PaddlePaddle/Paddle/pull/48189)
-- Prune subgraphs of the backward graph that are executed unnecessarily, to improve performance. [PR47827](https://github.com/PaddlePaddle/Paddle/pull/47827)
-- Optimize performance of initializers. [PR46033](https://github.com/PaddlePaddle/Paddle/pull/46033)
-- Add fused dropout add operator to improve computation performance when dropout and add are used together. [#52903](https://github.com/PaddlePaddle/Paddle/pull/52903)
-
-### Static graphs
-#### The new static graph executor is now fully live
-The new static graph executor implements a number of features and performance optimizations, and it unifies and replaces the several older executors. It is now the default back-end execution engine for the Python-side entry points of static graph single-card and distributed training, as well as for dynamic-to-static, control flow, CINN, and so on. This significantly improves the framework's scheduling performance, makes the functional architecture clearer, and significantly enhances secondary development capability. [#45913](https://github.com/PaddlePaddle/Paddle/pull/45913), [#46025](https://github.com/PaddlePaddle/Paddle/pull/46025), [#48911](https://github.com/PaddlePaddle/Paddle/pull/48911), [#50239](https://github.com/PaddlePaddle/Paddle/pull/50239), [#45696](https://github.com/PaddlePaddle/Paddle/pull/45696), [#46092](https://github.com/PaddlePaddle/Paddle/pull/46092), [#48158](https://github.com/PaddlePaddle/Paddle/pull/48158), [#51389](https://github.com/PaddlePaddle/Paddle/pull/51389), [#49708](https://github.com/PaddlePaddle/Paddle/pull/49708), [#49275](https://github.com/PaddlePaddle/Paddle/pull/49275), [#48789](https://github.com/PaddlePaddle/Paddle/pull/48789), [#49939](https://github.com/PaddlePaddle/Paddle/pull/49939), [#51149](https://github.com/PaddlePaddle/Paddle/pull/51149), [#52652](https://github.com/PaddlePaddle/Paddle/pull/52652)
-
-### Operator library
-#### Enhance functions of customized operators
-Extend the custom extension mechanism so that C++ extension functions can be bound to the Python side, further enhancing the framework's secondary development capabilities. Custom hardware can now use the custom operator mechanism, meeting hardware vendors' need to implement operations that Paddle does not already provide. Custom operators now support advanced mechanisms such as `inplace`, `vector<Tensor>` output, and `optional<Tensor>` input. Scheduling performance of custom operators in dynamic graph mode is optimized, with a 25.4% performance improvement for operators with multiple input parameters. Add commonly used operators and APIs to the custom operator Tensor extension, support chained calls, and simplify code writing. Optimize the operator kernel selection mechanism. Improve the logic of some operator kernels, extend supported data types, and optimize performance. Add and improve 100+ XPU kernels. Fix 170+ bugs.
-[#49222](https://github.com/PaddlePaddle/Paddle/pull/49222), [#51773](https://github.com/PaddlePaddle/Paddle/pull/51773), [#51923](https://github.com/PaddlePaddle/Paddle/pull/51923), [#53080](https://github.com/PaddlePaddle/Paddle/pull/53080), [#50731](https://github.com/PaddlePaddle/Paddle/pull/50731), [#50563](https://github.com/PaddlePaddle/Paddle/pull/50563), [#50840](https://github.com/PaddlePaddle/Paddle/pull/50840), [#50983](https://github.com/PaddlePaddle/Paddle/pull/50983), [#51713](https://github.com/PaddlePaddle/Paddle/pull/51713), [#48733](https://github.com/PaddlePaddle/Paddle/pull/48733), [#50558](https://github.com/PaddlePaddle/Paddle/pull/50558), [#50764](https://github.com/PaddlePaddle/Paddle/pull/50764), [#51973](https://github.com/PaddlePaddle/Paddle/pull/51973), [#52216](https://github.com/PaddlePaddle/Paddle/pull/52216), [#51027](https://github.com/PaddlePaddle/Paddle/pull/51027), [#50745](https://github.com/PaddlePaddle/Paddle/pull/50745), [#50756](https://github.com/PaddlePaddle/Paddle/pull/50756), [#50886](https://github.com/PaddlePaddle/Paddle/pull/50886), [#50813](https://github.com/PaddlePaddle/Paddle/pull/50813), [#50869](https://github.com/PaddlePaddle/Paddle/pull/50869), [#51085](https://github.com/PaddlePaddle/Paddle/pull/51085), [#51646](https://github.com/PaddlePaddle/Paddle/pull/51646), [#51620](https://github.com/PaddlePaddle/Paddle/pull/51620), [#51844](https://github.com/PaddlePaddle/Paddle/pull/51844), [#52421](https://github.com/PaddlePaddle/Paddle/pull/52421), [#52872](https://github.com/PaddlePaddle/Paddle/pull/52872), [#52597](https://github.com/PaddlePaddle/Paddle/pull/52597), [#50582](https://github.com/PaddlePaddle/Paddle/pull/50582), [#52114](https://github.com/PaddlePaddle/Paddle/pull/52114), [#52915](https://github.com/PaddlePaddle/Paddle/pull/52915), [#50928](https://github.com/PaddlePaddle/Paddle/pull/50928), [#48272](https://github.com/PaddlePaddle/Paddle/pull/48272), 
[#48702](https://github.com/PaddlePaddle/Paddle/pull/48702), [#52191](https://github.com/PaddlePaddle/Paddle/pull/52191), [#52191](https://github.com/PaddlePaddle/Paddle/pull/52191), [#47374](https://github.com/PaddlePaddle/Paddle/pull/47374), [#47375](https://github.com/PaddlePaddle/Paddle/pull/47375), [#47378](https://github.com/PaddlePaddle/Paddle/pull/47378), [#54126](https://github.com/PaddlePaddle/Paddle/pull/54126), [#47638](https://github.com/PaddlePaddle/Paddle/pull/47638), [#47661](https://github.com/PaddlePaddle/Paddle/pull/47661), [#50606](https://github.com/PaddlePaddle/Paddle/pull/50606), [#53528](https://github.com/PaddlePaddle/Paddle/pull/53528), [#50599](https://github.com/PaddlePaddle/Paddle/pull/50599), [#51727](https://github.com/PaddlePaddle/Paddle/pull/51727), [#50825](https://github.com/PaddlePaddle/Paddle/pull/50825), [#50773](https://github.com/PaddlePaddle/Paddle/pull/50773), [#50979](https://github.com/PaddlePaddle/Paddle/pull/50979), [#53336](https://github.com/PaddlePaddle/Paddle/pull/53336), [#53555](https://github.com/PaddlePaddle/Paddle/pull/53555), [#53716](https://github.com/PaddlePaddle/Paddle/pull/53716), [#53753](https://github.com/PaddlePaddle/Paddle/pull/53753), [#53981](https://github.com/PaddlePaddle/Paddle/pull/53981), [#53977](https://github.com/PaddlePaddle/Paddle/pull/53977), [#53980](https://github.com/PaddlePaddle/Paddle/pull/53980), [#54043](https://github.com/PaddlePaddle/Paddle/pull/54043), [#54066](https://github.com/PaddlePaddle/Paddle/pull/54066), [#52866](https://github.com/PaddlePaddle/Paddle/pull/52866), [#53043](https://github.com/PaddlePaddle/Paddle/pull/53043), [#53325](https://github.com/PaddlePaddle/Paddle/pull/53325), [#54323](https://github.com/PaddlePaddle/Paddle/pull/54323), [#54367](https://github.com/PaddlePaddle/Paddle/pull/54367), [#51353](https://github.com/PaddlePaddle/Paddle/pull/51353), [#53749](https://github.com/PaddlePaddle/Paddle/pull/53749), 
[#50013](https://github.com/PaddlePaddle/Paddle/pull/50013), [#47570](https://github.com/PaddlePaddle/Paddle/pull/47570), [#50997](https://github.com/PaddlePaddle/Paddle/pull/50997), [#51241](https://github.com/PaddlePaddle/Paddle/pull/51241), [#49537](https://github.com/PaddlePaddle/Paddle/pull/49537)
-
-#### Unification of operator architecture
-Unify all remaining 350+ operator kernels from the original operator system into the PHI operator library, and convert the original operator definitions to the PHI operator definition form (YAML-based operator configuration), making the architecture more uniform and reducing the cost of understanding framework development. Decouple the PHI operator library from all the Fluid header files it depended on and compile it independently as a dynamic link library, providing a lighter-weight way to reuse the operator library for secondary development of the framework. Continue to standardize non-standard operators and operator kernels in the PaddlePaddle framework, making them easier for developers to understand and reducing the cost of hardware integration.
-[#47856](https://github.com/PaddlePaddle/Paddle/pull/47856), [#49328](https://github.com/PaddlePaddle/Paddle/pull/49328), [#49138](https://github.com/PaddlePaddle/Paddle/pull/49138), [#52014](https://github.com/PaddlePaddle/Paddle/pull/52014), [#52044](https://github.com/PaddlePaddle/Paddle/pull/52044), [#52116](https://github.com/PaddlePaddle/Paddle/pull/52116), [#52486](https://github.com/PaddlePaddle/Paddle/pull/52486), [#52101](https://github.com/PaddlePaddle/Paddle/pull/52101), [#52882](https://github.com/PaddlePaddle/Paddle/pull/52882), [#53003](https://github.com/PaddlePaddle/Paddle/pull/53003), [#53034](https://github.com/PaddlePaddle/Paddle/pull/53034), [#51914](https://github.com/PaddlePaddle/Paddle/pull/51914), [#49116](https://github.com/PaddlePaddle/Paddle/pull/49116), [#52626](https://github.com/PaddlePaddle/Paddle/pull/52626), [#52878](https://github.com/PaddlePaddle/Paddle/pull/52878), [#52879](https://github.com/PaddlePaddle/Paddle/pull/52879), [#52880](https://github.com/PaddlePaddle/Paddle/pull/52880), [#52875](https://github.com/PaddlePaddle/Paddle/pull/52875), [#51600](https://github.com/PaddlePaddle/Paddle/pull/51600), [#51601](https://github.com/PaddlePaddle/Paddle/pull/51601), [#51590](https://github.com/PaddlePaddle/Paddle/pull/51590), [#51887](https://github.com/PaddlePaddle/Paddle/pull/51887), [#51891](https://github.com/PaddlePaddle/Paddle/pull/51891), [#52036](https://github.com/PaddlePaddle/Paddle/pull/52036), [#52130](https://github.com/PaddlePaddle/Paddle/pull/52130), [#52134](https://github.com/PaddlePaddle/Paddle/pull/52134), [#51951](https://github.com/PaddlePaddle/Paddle/pull/51951), [#51886](https://github.com/PaddlePaddle/Paddle/pull/51886), [#52274](https://github.com/PaddlePaddle/Paddle/pull/52274), [#52263](https://github.com/PaddlePaddle/Paddle/pull/52263), [#51913](https://github.com/PaddlePaddle/Paddle/pull/51913), [#52145](https://github.com/PaddlePaddle/Paddle/pull/52145), 
[#52347](https://github.com/PaddlePaddle/Paddle/pull/52347), [#52370](https://github.com/PaddlePaddle/Paddle/pull/52370), [#52437](https://github.com/PaddlePaddle/Paddle/pull/52437), [#52424](https://github.com/PaddlePaddle/Paddle/pull/52424), [#52231](https://github.com/PaddlePaddle/Paddle/pull/52231), [#52522](https://github.com/PaddlePaddle/Paddle/pull/52522), [#52529](https://github.com/PaddlePaddle/Paddle/pull/52529), [#52802](https://github.com/PaddlePaddle/Paddle/pull/52802), [#52799](https://github.com/PaddlePaddle/Paddle/pull/52799), [#52855](https://github.com/PaddlePaddle/Paddle/pull/52855), [#52711](https://github.com/PaddlePaddle/Paddle/pull/52711), [#52940](https://github.com/PaddlePaddle/Paddle/pull/52940), [#53309](https://github.com/PaddlePaddle/Paddle/pull/53309), [#47817](https://github.com/PaddlePaddle/Paddle/pull/47817), [#48001](https://github.com/PaddlePaddle/Paddle/pull/48001), [#48063](https://github.com/PaddlePaddle/Paddle/pull/48063), [#48049](https://github.com/PaddlePaddle/Paddle/pull/48049), [#48168](https://github.com/PaddlePaddle/Paddle/pull/48168), [#48415](https://github.com/PaddlePaddle/Paddle/pull/48415), [#48696](https://github.com/PaddlePaddle/Paddle/pull/48696), [#48970](https://github.com/PaddlePaddle/Paddle/pull/48970), [#50183](https://github.com/PaddlePaddle/Paddle/pull/50183), [#50407](https://github.com/PaddlePaddle/Paddle/pull/50407), [#50498](https://github.com/PaddlePaddle/Paddle/pull/50498), [#50419](https://github.com/PaddlePaddle/Paddle/pull/50419), [#50282](https://github.com/PaddlePaddle/Paddle/pull/50282), [#50870](https://github.com/PaddlePaddle/Paddle/pull/50870), [#50911](https://github.com/PaddlePaddle/Paddle/pull/50911), [#50865](https://github.com/PaddlePaddle/Paddle/pull/50865), [#51288](https://github.com/PaddlePaddle/Paddle/pull/51288), [#53735](https://github.com/PaddlePaddle/Paddle/pull/53735), [#47248](https://github.com/PaddlePaddle/Paddle/pull/47248), 
[#47787](https://github.com/PaddlePaddle/Paddle/pull/47787), [#52202](https://github.com/PaddlePaddle/Paddle/pull/52202),
-[#47579](https://github.com/PaddlePaddle/Paddle/pull/47579), [#49444](https://github.com/PaddlePaddle/Paddle/pull/49444), [#45772](https://github.com/PaddlePaddle/Paddle/pull/45772), [#51264](https://github.com/PaddlePaddle/Paddle/pull/51264), [#51634](https://github.com/PaddlePaddle/Paddle/pull/51634), [#51631](https://github.com/PaddlePaddle/Paddle/pull/51631), [#47385](https://github.com/PaddlePaddle/Paddle/pull/47385), [#46342](https://github.com/PaddlePaddle/Paddle/pull/46342), [#47510](https://github.com/PaddlePaddle/Paddle/pull/47510), [#47532](https://github.com/PaddlePaddle/Paddle/pull/47532), [#47702](https://github.com/PaddlePaddle/Paddle/pull/47702), [#47860](https://github.com/PaddlePaddle/Paddle/pull/47860), [#49470](https://github.com/PaddlePaddle/Paddle/pull/49470), [#50358](https://github.com/PaddlePaddle/Paddle/pull/50358), [#49121](https://github.com/PaddlePaddle/Paddle/pull/49121), [#50190](https://github.com/PaddlePaddle/Paddle/pull/50190), [#52374](https://github.com/PaddlePaddle/Paddle/pull/52374), [#52372](https://github.com/PaddlePaddle/Paddle/pull/52372), [#52375](https://github.com/PaddlePaddle/Paddle/pull/52375), [#52371](https://github.com/PaddlePaddle/Paddle/pull/52371)
-
-### Dynamic-to-static plus combinator
-#### New features
-- Add the combination rules for combinators such as dropout, silu, stack, relu, expand, unsqueeze, pow, squeeze, meshgrid, batch_norm, layer_norm, group_norm, instance_norm, full_like, split, split_with_num, gelu, mean, flatten, rsqrt, and hardswish. [#50497](https://github.com/PaddlePaddle/Paddle/pull/50497), [#50838](https://github.com/PaddlePaddle/Paddle/pull/50838), [#50861](https://github.com/PaddlePaddle/Paddle/pull/50861), [#50819](https://github.com/PaddlePaddle/Paddle/pull/50819), [#50810](https://github.com/PaddlePaddle/Paddle/pull/50810), [#51527](https://github.com/PaddlePaddle/Paddle/pull/51527), [#51070](https://github.com/PaddlePaddle/Paddle/pull/51070), [#51539](https://github.com/PaddlePaddle/Paddle/pull/51539), [#51061](https://github.com/PaddlePaddle/Paddle/pull/51061), [#49894](https://github.com/PaddlePaddle/Paddle/pull/49894), [#50422](https://github.com/PaddlePaddle/Paddle/pull/50422), [#51874](https://github.com/PaddlePaddle/Paddle/pull/51874), [#51341](https://github.com/PaddlePaddle/Paddle/pull/51341), [#50295](https://github.com/PaddlePaddle/Paddle/pull/50295), [#50298](https://github.com/PaddlePaddle/Paddle/pull/50298), [#50672](https://github.com/PaddlePaddle/Paddle/pull/50672), [#51432](https://github.com/PaddlePaddle/Paddle/pull/51432), [#51003](https://github.com/PaddlePaddle/Paddle/pull/51003)
-- Add the vjp rule for combinators such as gather_nd, reduce_max, group_norm, relu, reduce_max, gather, topk, sqrt, elementwise_pow, softmax, batch_norm, prod, multiply, expand, div, relu, slice, cumsum, sigmoid, layer_norm, sin, cos, roll, instance_norm, abs, assign, tile, scatter_nd_add, erf, floor, log, silu, leaky_relu, pad [#50966](https://github.com/PaddlePaddle/Paddle/pull/50966), [#51653](https://github.com/PaddlePaddle/Paddle/pull/51653), [#52663](https://github.com/PaddlePaddle/Paddle/pull/52663), [#51742](https://github.com/PaddlePaddle/Paddle/pull/51742), [#52203](https://github.com/PaddlePaddle/Paddle/pull/52203), [#50794](https://github.com/PaddlePaddle/Paddle/pull/50794), [#50305](https://github.com/PaddlePaddle/Paddle/pull/50305), [#50786](https://github.com/PaddlePaddle/Paddle/pull/50786), [#50679](https://github.com/PaddlePaddle/Paddle/pull/50679), [#51045](https://github.com/PaddlePaddle/Paddle/pull/51045), [#51230](https://github.com/PaddlePaddle/Paddle/pull/51230), [#51474](https://github.com/PaddlePaddle/Paddle/pull/51474), [#51283](https://github.com/PaddlePaddle/Paddle/pull/51283), [#51238](https://github.com/PaddlePaddle/Paddle/pull/51238), [#49831](https://github.com/PaddlePaddle/Paddle/pull/49831), [#51838](https://github.com/PaddlePaddle/Paddle/pull/51838), [#50771](https://github.com/PaddlePaddle/Paddle/pull/50771), [#50565](https://github.com/PaddlePaddle/Paddle/pull/50565), [#51768](https://github.com/PaddlePaddle/Paddle/pull/51768), [#51750](https://github.com/PaddlePaddle/Paddle/pull/51750), [#51748](https://github.com/PaddlePaddle/Paddle/pull/51748), [#52532](https://github.com/PaddlePaddle/Paddle/pull/52532), [#52935](https://github.com/PaddlePaddle/Paddle/pull/52935), [#50963](https://github.com/PaddlePaddle/Paddle/pull/50963), [#51430](https://github.com/PaddlePaddle/Paddle/pull/51430), [#53141](https://github.com/PaddlePaddle/Paddle/pull/53141), [#52469](https://github.com/PaddlePaddle/Paddle/pull/52469), 
[#50436](https://github.com/PaddlePaddle/Paddle/pull/50436), [#51059](https://github.com/PaddlePaddle/Paddle/pull/51059), [#51296](https://github.com/PaddlePaddle/Paddle/pull/51296), [#52533](https://github.com/PaddlePaddle/Paddle/pull/52533), [#53374](https://github.com/PaddlePaddle/Paddle/pull/53374)
-- Add the second-order differentiation rule for combinators such as matmul, tanh, and elementwise [#50452](https://github.com/PaddlePaddle/Paddle/pull/50452), [#52192](https://github.com/PaddlePaddle/Paddle/pull/52192), [#53014](https://github.com/PaddlePaddle/Paddle/pull/53014)
-- Add the bf16 datatype support for combinators such as exp, reduce_mean, softmax, divide, cast, layer_norm, prod, meshgrid, expand_as, dropout, concat, gather_nd, elementwise_max, elementwise_pow, reduce_max [#54263](https://github.com/PaddlePaddle/Paddle/pull/54263), [#54236](https://github.com/PaddlePaddle/Paddle/pull/54236), [#53865](https://github.com/PaddlePaddle/Paddle/pull/53865), [#54175](https://github.com/PaddlePaddle/Paddle/pull/54175), [#54399](https://github.com/PaddlePaddle/Paddle/pull/54399)
-- Add support for assigning semantics to containers in control flow in dynamic-to-static. [#51248](https://github.com/PaddlePaddle/Paddle/pull/51248)
-- For to_static, add full graph fallback function. When dynamic-to-static conversion fails, the whole graph can fall back to the dynamic graph mode of execution. For the fallback mechanism, add the set_eval_frame API. [#50111](https://github.com/PaddlePaddle/Paddle/pull/50111), [#52006](https://github.com/PaddlePaddle/Paddle/pull/52006)
-- For to_static, support the combinator mechanism. Support the scenario of using register_hook under to_static decoration; [#49836](https://github.com/PaddlePaddle/Paddle/pull/49836), [#52948](https://github.com/PaddlePaddle/Paddle/pull/52948), [#53572](https://github.com/PaddlePaddle/Paddle/pull/53572)
-- Add a backend parameter to the to_static API. It can be specified as `CINN` or None. When the parameter is specified as CINN, the CINN compiler will be used to accelerate training and inference. [#52596](https://github.com/PaddlePaddle/Paddle/pull/52596)
-- Add the code automatic generation function for the primitive API. Based on operator definitions in ops.yaml and legacy_ops.yaml, automatically generate code for the primitive API. Automatically generate the Tensor computation API. [#50315](https://github.com/PaddlePaddle/Paddle/pull/50315), [#49654](https://github.com/PaddlePaddle/Paddle/pull/49654), [#50642](https://github.com/PaddlePaddle/Paddle/pull/50642)
-- Add the function of forward combination of operators. By registering the combination rules of forward operators, it can split forward operators into base operators. [#49605](https://github.com/PaddlePaddle/Paddle/pull/49605)
-- Add the combinator switch. You can set environmental variables in shell to split operators in different ways. [#50309](https://github.com/PaddlePaddle/Paddle/pull/50309)
-- Add the `OpTest` combination test function to guarantee the accuracy of operators. Add elementwise class base operator unit test. Add batch_norm CINN unit test. [#50509](https://github.com/PaddlePaddle/Paddle/pull/50509), [#50807](https://github.com/PaddlePaddle/Paddle/pull/50807), [#52815](https://github.com/PaddlePaddle/Paddle/pull/52815)
-
-#### Improvements
-- Add combinator to support FP16 operation and AMP O1 operation. Add AMP logic for softmax and layer_norm operators. [#52397](https://github.com/PaddlePaddle/Paddle/pull/52397), [#52598](https://github.com/PaddlePaddle/Paddle/pull/52598), [#51473](https://github.com/PaddlePaddle/Paddle/pull/51473)
-- Simplify combination rules and vjp rules of the combinator batch_norm. [#54012](https://github.com/PaddlePaddle/Paddle/pull/54012), [#51827](https://github.com/PaddlePaddle/Paddle/pull/51827), [#51933](https://github.com/PaddlePaddle/Paddle/pull/51933)
-- Optimize combination rules for combinators, and improve performance of combination rules with containing scalar. Optimize log printing for combinators. [#51960](https://github.com/PaddlePaddle/Paddle/pull/51960), [#50160](https://github.com/PaddlePaddle/Paddle/pull/50160)
-- Combinator supports the jit.save API. Add custom VJP rule API. [#52344](https://github.com/PaddlePaddle/Paddle/pull/52344), [#50885](https://github.com/PaddlePaddle/Paddle/pull/50885)
-- Remove the overwrite parameter from combinator gather_grad. [#52707](https://github.com/PaddlePaddle/Paddle/pull/52707)
-- Clean up dynamic-to-static code style, optimize error messages, and standardize logs. [#48637](https://github.com/PaddlePaddle/Paddle/pull/48637), [#46128](https://github.com/PaddlePaddle/Paddle/pull/46128), [#52527](https://github.com/PaddlePaddle/Paddle/pull/52527), [#46800](https://github.com/PaddlePaddle/Paddle/pull/46800), [#46415](https://github.com/PaddlePaddle/Paddle/pull/46415)
-- For dynamic-to-static, call the append backward to get `grad var name` to fix the error in high-order gradient computation. [#53250](https://github.com/PaddlePaddle/Paddle/pull/53250)
-- Upgrade the dynamic-to-static function, and clean up the temporary directory of to_static to speed up code conversion. Enhance to_static to automatically skip internal API. Support use of to_static decorator in the program. [#47102](https://github.com/PaddlePaddle/Paddle/pull/47102), [#50596](https://github.com/PaddlePaddle/Paddle/pull/50596), [#45768](https://github.com/PaddlePaddle/Paddle/pull/45768)
-- For dynamic-to-static, optimize `print` function conversion to support printing Tensor parameters at the networking stage. Upgrade the parameter collection mechanism. [#48672](https://github.com/PaddlePaddle/Paddle/pull/48672), [#50336](https://github.com/PaddlePaddle/Paddle/pull/50336)
-
-#### Bug fixes
-- For the combinator, fix cmake compilation errors. Fix cuda 12 test errors. Fix bugs of operators such as meshgrid, expand_as, concat, conv, and arange. [#49643](https://github.com/PaddlePaddle/Paddle/pull/49643), [#54622](https://github.com/PaddlePaddle/Paddle/pull/54622), [#53951](https://github.com/PaddlePaddle/Paddle/pull/53951), [#53350](https://github.com/PaddlePaddle/Paddle/pull/53350), [#51486](https://github.com/PaddlePaddle/Paddle/pull/51486), [#52764](https://github.com/PaddlePaddle/Paddle/pull/52764)
-- For the combinator, fix bugs in scenarios such as rank=1, shape=-1, amp, and multi-process. [#51413](https://github.com/PaddlePaddle/Paddle/pull/51413), [#51435](https://github.com/PaddlePaddle/Paddle/pull/51435), [#50518](https://github.com/PaddlePaddle/Paddle/pull/50518), [#47301](https://github.com/PaddlePaddle/Paddle/pull/47301)
-- For the combinator, fix bugs in automatic code generation of composite grad maker and static prim api. Fix bugs that op creation attributes are missing, and some combination rules do not take effect. [#50854](https://github.com/PaddlePaddle/Paddle/pull/50854), [#51445](https://github.com/PaddlePaddle/Paddle/pull/51445), [#50780](https://github.com/PaddlePaddle/Paddle/pull/50780), [#52120](https://github.com/PaddlePaddle/Paddle/pull/52120)
-- Fix some other bugs for combinators [#50086](https://github.com/PaddlePaddle/Paddle/pull/50086), [#51208](https://github.com/PaddlePaddle/Paddle/pull/51208), [#51577](https://github.com/PaddlePaddle/Paddle/pull/51577), [#53598](https://github.com/PaddlePaddle/Paddle/pull/53598), [#47500](https://github.com/PaddlePaddle/Paddle/pull/47500), [#52119](https://github.com/PaddlePaddle/Paddle/pull/52119), [#50397](https://github.com/PaddlePaddle/Paddle/pull/50397), [#50527](https://github.com/PaddlePaddle/Paddle/pull/50527), [#50788](https://github.com/PaddlePaddle/Paddle/pull/50788), [#51014](https://github.com/PaddlePaddle/Paddle/pull/51014), [#52154](https://github.com/PaddlePaddle/Paddle/pull/52154), [#52752](https://github.com/PaddlePaddle/Paddle/pull/52752)
-- For dynamic-to-static, fix the bugs of dataloader, cond input dict, transformer import, T5 model memory leak, and grad var name parsing error. [#49821](https://github.com/PaddlePaddle/Paddle/pull/49821), [#47299](https://github.com/PaddlePaddle/Paddle/pull/47299), [#50776](https://github.com/PaddlePaddle/Paddle/pull/50776), [#50883](https://github.com/PaddlePaddle/Paddle/pull/50883), [#51100](https://github.com/PaddlePaddle/Paddle/pull/51100), [#51464](https://github.com/PaddlePaddle/Paddle/pull/51464), [#51966](https://github.com/PaddlePaddle/Paddle/pull/51966), [#52110](https://github.com/PaddlePaddle/Paddle/pull/52110), [#52821](https://github.com/PaddlePaddle/Paddle/pull/52821)
-- For dynamic-to-static, fix the bugs of Lazy initialization, Windows training, is_paddle_func failure, and recurrent op failure to delete pass. [#50785](https://github.com/PaddlePaddle/Paddle/pull/50785), [#52580](https://github.com/PaddlePaddle/Paddle/pull/52580), [#51585](https://github.com/PaddlePaddle/Paddle/pull/51585), [#51763](https://github.com/PaddlePaddle/Paddle/pull/51763)
-
-#### Performance optimization
-- Add scope caching and reuse mechanism during execution of run_program_op in dynamic-to-static, to avoid passing new scope for each step. [#45813](https://github.com/PaddlePaddle/Paddle/pull/45813)
-
-### Distributed training
-#### Dynamic graph distributed training
-- Remove the distributed sharding API in the old dynamic graphs. [#49334](https://github.com/PaddlePaddle/Paddle/pull/49334)
-- Upgrade fleet to distributed directory. [#50834](https://github.com/PaddlePaddle/Paddle/pull/50834)
-- Optimize log printing for distributed strategies. [#47761](https://github.com/PaddlePaddle/Paddle/pull/47761)
-- For re-computation, support hook mode, inplace function, and stop_gradient mode. Support more flexible use. [#48471](https://github.com/PaddlePaddle/Paddle/pull/48471), [#47985](https://github.com/PaddlePaddle/Paddle/pull/47985)
-- Data parallel
- - For data parallel, support no_sync API for blocking parameter gradient communications. Support the parameter synchronization function. Add scale API to scale parameters. [#47536](https://github.com/PaddlePaddle/Paddle/pull/47536),[#51895](https://github.com/PaddlePaddle/Paddle/pull/51895),[#47519](https://github.com/PaddlePaddle/Paddle/pull/47519)
- - Fix the problem of video memory leakage under data parallel. [#47369](https://github.com/PaddlePaddle/Paddle/pull/47369),[#47444](https://github.com/PaddlePaddle/Paddle/pull/47444),[#48668](https://github.com/PaddlePaddle/Paddle/pull/48668)
- - Support sparse parameter gradient synchronization. [#52785](https://github.com/PaddlePaddle/Paddle/pull/52785)
-- Pipeline parallel
- - Optimize pipeline performance, and remove communication wait. Optimize scheduling and communication overlap. [#46209](https://github.com/PaddlePaddle/Paddle/pull/46209),[#54003](https://github.com/PaddlePaddle/Paddle/pull/54003),[#54312](https://github.com/PaddlePaddle/Paddle/pull/54312),[#53384](https://github.com/PaddlePaddle/Paddle/pull/53384),[#54310](https://github.com/PaddlePaddle/Paddle/pull/54310),[#46399](https://github.com/PaddlePaddle/Paddle/pull/46399),[#46483](https://github.com/PaddlePaddle/Paddle/pull/46483),[#46780](https://github.com/PaddlePaddle/Paddle/pull/46780),[#46116](https://github.com/PaddlePaddle/Paddle/pull/46116)
- - Support custom sharding, log printing, random seed setting, and timer elapsed time printing. [#53344](https://github.com/PaddlePaddle/Paddle/pull/53344), [#47670](https://github.com/PaddlePaddle/Paddle/pull/47670),[#47336](https://github.com/PaddlePaddle/Paddle/pull/47336),[#52656](https://github.com/PaddlePaddle/Paddle/pull/52656),[#53831](https://github.com/PaddlePaddle/Paddle/pull/53831)
- - Optimize video memory release logic in pipeline scheduling, and release intermediate variables and data in advance. [#54557](https://github.com/PaddlePaddle/Paddle/pull/54557), [#47199](https://github.com/PaddlePaddle/Paddle/pull/47199),[#47497](https://github.com/PaddlePaddle/Paddle/pull/47497),[#48045](https://github.com/PaddlePaddle/Paddle/pull/48045),[#54672](https://github.com/PaddlePaddle/Paddle/pull/54672)
- - Support VPP mode and model saving for pipeline parallel. [#54196](https://github.com/PaddlePaddle/Paddle/pull/54196), [#52927](https://github.com/PaddlePaddle/Paddle/pull/52927),[#47801](https://github.com/PaddlePaddle/Paddle/pull/47801),[#45922](https://github.com/PaddlePaddle/Paddle/pull/45922),[#47242](https://github.com/PaddlePaddle/Paddle/pull/47242)
-- Grouping sharding parallel
- - sharding stage2 parallel supports the quantization function, hybrid parallel training, gradient accumulation, XPU hardware, BF16 low precision computation, optimizer learning rate setting, offload function, and data parallel. [#47169](https://github.com/PaddlePaddle/Paddle/pull/47169),[#47535](https://github.com/PaddlePaddle/Paddle/pull/47535), [#46795](https://github.com/PaddlePaddle/Paddle/pull/46795),[#47711](https://github.com/PaddlePaddle/Paddle/pull/47711),[#48310](https://github.com/PaddlePaddle/Paddle/pull/48310),[#46846](https://github.com/PaddlePaddle/Paddle/pull/46846),[#48857](https://github.com/PaddlePaddle/Paddle/pull/48857),[#49196](https://github.com/PaddlePaddle/Paddle/pull/49196),[#49931](https://github.com/PaddlePaddle/Paddle/pull/49931),[#47114](https://github.com/PaddlePaddle/Paddle/pull/47114),[#49767](https://github.com/PaddlePaddle/Paddle/pull/49767)
- - Optimize sharding stage2 performance. Support the communication computation overlap. [#46495](https://github.com/PaddlePaddle/Paddle/pull/46495),[#46894](https://github.com/PaddlePaddle/Paddle/pull/46894)
- - sharding stage3 supports shared parameters and untrainable parameters. [#48695](https://github.com/PaddlePaddle/Paddle/pull/48695),[#48577](https://github.com/PaddlePaddle/Paddle/pull/48577)
-- Tensor model parallel
- - Optimize tensor model parallel performance to reduce performance impact of stream sharding. [#47715](https://github.com/PaddlePaddle/Paddle/pull/47715),[#51617](https://github.com/PaddlePaddle/Paddle/pull/51617)
- - Support parameter, optimizer shapes, gradient synchronization. [#51428](https://github.com/PaddlePaddle/Paddle/pull/51428),[#53254](https://github.com/PaddlePaddle/Paddle/pull/53254), [#53335](https://github.com/PaddlePaddle/Paddle/pull/53335),[#45803](https://github.com/PaddlePaddle/Paddle/pull/45803),[#46303](https://github.com/PaddlePaddle/Paddle/pull/46303),[#52293](https://github.com/PaddlePaddle/Paddle/pull/52293)
- - Optimize tensor model parallel operators such as c_embedding, softmax_with_cross_entropy. [#53197](https://github.com/PaddlePaddle/Paddle/pull/53197),[#53547](https://github.com/PaddlePaddle/Paddle/pull/53547),[#53541](https://github.com/PaddlePaddle/Paddle/pull/53541),[#52789](https://github.com/PaddlePaddle/Paddle/pull/52789),[#46491](https://github.com/PaddlePaddle/Paddle/pull/46491),[#52742](https://github.com/PaddlePaddle/Paddle/pull/52742),[#53419](https://github.com/PaddlePaddle/Paddle/pull/53419)
-- Launch
- - Support the distributed Launch function while keeping independent logs. [#53207](https://github.com/PaddlePaddle/Paddle/pull/53207),[#50405](https://github.com/PaddlePaddle/Paddle/pull/50405)
- - Add framework print environment variable function, log overwrite function, log return, and environment check. It is easy to change the debug environment variable. [#53243](https://github.com/PaddlePaddle/Paddle/pull/53243), [#51803](https://github.com/PaddlePaddle/Paddle/pull/51803), [#53990](https://github.com/PaddlePaddle/Paddle/pull/53990)
-- Communication library
- - Add custom mixed parallel communication groups, topology information printing, and custom communication topology order. [#47021](https://github.com/PaddlePaddle/Paddle/pull/47021),[#54000](https://github.com/PaddlePaddle/Paddle/pull/54000),[#51781](https://github.com/PaddlePaddle/Paddle/pull/51781)
- - Remove communication library dependency on Place information [#47857](https://github.com/PaddlePaddle/Paddle/pull/47857)
- - Add communications library to support GLOO operator. Support send/recv/gather. [#52221](https://github.com/PaddlePaddle/Paddle/pull/52221), [#52334](https://github.com/PaddlePaddle/Paddle/pull/52334),[#49084](https://github.com/PaddlePaddle/Paddle/pull/49084)
- - Disable reverse computation of communication operator. [#47636](https://github.com/PaddlePaddle/Paddle/pull/47636)
- - Add communication library static shape check, to help determine whether communication volume is matched. [#48256](https://github.com/PaddlePaddle/Paddle/pull/48256),[#48915](https://github.com/PaddlePaddle/Paddle/pull/48915),[#48646](https://github.com/PaddlePaddle/Paddle/pull/48646)
- - Support communication python object type, BF16 type, alltoall, reduce, allgather, group call, global gather, broadcast, and scatter communication methods. Support XPU device communications. [#51765](https://github.com/PaddlePaddle/Paddle/pull/51765),[#45844](https://github.com/PaddlePaddle/Paddle/pull/45844),[#48059](https://github.com/PaddlePaddle/Paddle/pull/48059),[#48115](https://github.com/PaddlePaddle/Paddle/pull/48115), [#48339](https://github.com/PaddlePaddle/Paddle/pull/48339),[#49252](https://github.com/PaddlePaddle/Paddle/pull/49252),[#49451](https://github.com/PaddlePaddle/Paddle/pull/49451),[#50085](https://github.com/PaddlePaddle/Paddle/pull/50085),[#50701](https://github.com/PaddlePaddle/Paddle/pull/50701),[#48208](https://github.com/PaddlePaddle/Paddle/pull/48208),[#48736](https://github.com/PaddlePaddle/Paddle/pull/48736),[#51762](https://github.com/PaddlePaddle/Paddle/pull/51762),[#52495](https://github.com/PaddlePaddle/Paddle/pull/52495),[#53514](https://github.com/PaddlePaddle/Paddle/pull/53514),[#48232](https://github.com/PaddlePaddle/Paddle/pull/48232),[#49896](https://github.com/PaddlePaddle/Paddle/pull/49896),[#49941](https://github.com/PaddlePaddle/Paddle/pull/49941),[#45584](https://github.com/PaddlePaddle/Paddle/pull/45584)
- - Add support for communications between computational streams. [#46182](https://github.com/PaddlePaddle/Paddle/pull/46182),[#46023](https://github.com/PaddlePaddle/Paddle/pull/46023),[#46295](https://github.com/PaddlePaddle/Paddle/pull/46295),[#46761](https://github.com/PaddlePaddle/Paddle/pull/46761),[#47481](https://github.com/PaddlePaddle/Paddle/pull/47481),[#47740](https://github.com/PaddlePaddle/Paddle/pull/47740),[#47976](https://github.com/PaddlePaddle/Paddle/pull/47976),[#48163](https://github.com/PaddlePaddle/Paddle/pull/48163),[#48396](https://github.com/PaddlePaddle/Paddle/pull/48396),[#48308](https://github.com/PaddlePaddle/Paddle/pull/48308),[#47110](https://github.com/PaddlePaddle/Paddle/pull/47110),[#53089](https://github.com/PaddlePaddle/Paddle/pull/53089)
- - Optimize communication library TCP linking time. [#49810](https://github.com/PaddlePaddle/Paddle/pull/49810),[#47184](https://github.com/PaddlePaddle/Paddle/pull/47184)
-
-#### Automatic parallel
-- Improve semi-automatic parallel for static graphs:
- - Add FLOPs computation function for multiple operators, and add computation Cost modelling based on FLOPs. [#48083](https://github.com/PaddlePaddle/Paddle/pull/48083),[#47978](https://github.com/PaddlePaddle/Paddle/pull/47978),[#47595](https://github.com/PaddlePaddle/Paddle/pull/47595),[#48084](https://github.com/PaddlePaddle/Paddle/pull/48084),[#47816](https://github.com/PaddlePaddle/Paddle/pull/47816)
- - Improve API ease-of-use. Perfect the DistAttr, Process Mesh, Engine API, information printing, input and output modules. Implement the Engine new cost API. It can be used to theoretically analyze model running time and video memory overhead. [#47503](https://github.com/PaddlePaddle/Paddle/pull/47503),[#46416](https://github.com/PaddlePaddle/Paddle/pull/46416),[#46554](https://github.com/PaddlePaddle/Paddle/pull/46554), [#46633](https://github.com/PaddlePaddle/Paddle/pull/46633),[#49214](https://github.com/PaddlePaddle/Paddle/pull/49214),[#53848](https://github.com/PaddlePaddle/Paddle/pull/53848),[#46552](https://github.com/PaddlePaddle/Paddle/pull/46552), [#47043](https://github.com/PaddlePaddle/Paddle/pull/47043), [#49665](https://github.com/PaddlePaddle/Paddle/pull/49665), [#52912](https://github.com/PaddlePaddle/Paddle/pull/52912), [#45776](https://github.com/PaddlePaddle/Paddle/pull/45776), [#47263](https://github.com/PaddlePaddle/Paddle/pull/47263)
- - Optimize the generality and ease of use of Pass. Support more scenarios, and reduce time spent on Pass pre-analysis. [#46519](https://github.com/PaddlePaddle/Paddle/pull/46519),[#47358](https://github.com/PaddlePaddle/Paddle/pull/47358),[#46391](https://github.com/PaddlePaddle/Paddle/pull/46391), [#51035](https://github.com/PaddlePaddle/Paddle/pull/51035)
- - Enhance debugging capabilities with distributed randomness control mechanisms and hybrid parallel precision alignment tools. [#52903](https://github.com/PaddlePaddle/Paddle/pull/52903),[#49865](https://github.com/PaddlePaddle/Paddle/pull/49865)
- - Support automatic sharding of inference generation task networking. Adapt special usage of control flow and conditional block in the generation model. [#46771](https://github.com/PaddlePaddle/Paddle/pull/46771), [#54067](https://github.com/PaddlePaddle/Paddle/pull/54067)
- - Improve grad_clip to support load balancing in data parallel scenarios. [#49510](https://github.com/PaddlePaddle/Paddle/pull/49510), [#49249](https://github.com/PaddlePaddle/Paddle/pull/49249)
-- Semi-automatic parallel performance improvement for static graphs:
- - Add the Sharding Pass automated communication Fuse and multi-streams communication functions, with throughput performance improved by 26% on two machines for GPT 6.7B model. [#48604](https://github.com/PaddlePaddle/Paddle/pull/48604), [#47180](https://github.com/PaddlePaddle/Paddle/pull/47180),[#46180](https://github.com/PaddlePaddle/Paddle/pull/46180)
- - Add Recompute optimization strategy tuning function. Select optimal recompute checkpoint settings based on video memory and model size. [#48608](https://github.com/PaddlePaddle/Paddle/pull/48608),[#47846](https://github.com/PaddlePaddle/Paddle/pull/47846),[#49010](https://github.com/PaddlePaddle/Paddle/pull/49010)
- - For the pipeline parallel, add 1F1B scheduling optimization Pass [#54260](https://github.com/PaddlePaddle/Paddle/pull/54260), [#45915](https://github.com/PaddlePaddle/Paddle/pull/45915)
- - Optimize data parallel. Support optimizations such as converged communication and communication computation Overlap, with performance improved by 5% in GPT 1.3B model. [#48092](https://github.com/PaddlePaddle/Paddle/pull/48092),[#45643](https://github.com/PaddlePaddle/Paddle/pull/45643),[#49744](https://github.com/PaddlePaddle/Paddle/pull/49744), [#47578](https://github.com/PaddlePaddle/Paddle/pull/47578)
- - Optimize Reshard module concat performance. Reduce the number of concat operations in some scenarios. [#47809](https://github.com/PaddlePaddle/Paddle/pull/47809)
- - Optimize mixed precision, upgrade Pass performance, support BF16 low precision, and adapt the auto mixing parallel of the while loop control flow. [#51285](https://github.com/PaddlePaddle/Paddle/pull/51285),[#51147](https://github.com/PaddlePaddle/Paddle/pull/51147), [#49219](https://github.com/PaddlePaddle/Paddle/pull/49219), [#49079](https://github.com/PaddlePaddle/Paddle/pull/49079)
-- Improve function of fully automatic parallel for static graphs:
- - Add new rule-based fully automated search strategy. [#51859](https://github.com/PaddlePaddle/Paddle/pull/51859),[#51908](https://github.com/PaddlePaddle/Paddle/pull/51908),[#52053](https://github.com/PaddlePaddle/Paddle/pull/52053),[#48316](https://github.com/PaddlePaddle/Paddle/pull/48316),[#48464](https://github.com/PaddlePaddle/Paddle/pull/48464), [#52041](https://github.com/PaddlePaddle/Paddle/pull/52041)
- - Improve automatic parallel modelling capability, enriching single-node topology modelling and communication volume modelling. [#52723](https://github.com/PaddlePaddle/Paddle/pull/52723),[#46387](https://github.com/PaddlePaddle/Paddle/pull/46387),[#47043](https://github.com/PaddlePaddle/Paddle/pull/47043)
-
-#### Parameter server
-- Clean up the `__all__` list in the ps directory, whose APIs are not exposed. [#51289](https://github.com/PaddlePaddle/Paddle/pull/51289)
-- Clean up cvm operator [#48989](https://github.com/PaddlePaddle/Paddle/pull/48989)
-- For GPUPS, add support for AFS. [#46611](https://github.com/PaddlePaddle/Paddle/pull/46611)
-- Degrade PGLBOX2.0 log, fix stuck issue of dense parameter, fix the bug that barrier does not take effect, and add get_epoch_finish python side interface [#49946](https://github.com/PaddlePaddle/Paddle/pull/49946),[#50166](https://github.com/PaddlePaddle/Paddle/pull/50166),[#50349](https://github.com/PaddlePaddle/Paddle/pull/50349)
-- GPUPS run supports switching to a specified mode. [#51115](https://github.com/PaddlePaddle/Paddle/pull/51115)
-- GPUPS is added to benchmark. [#49587](https://github.com/PaddlePaddle/Paddle/pull/49587),[#49649](https://github.com/PaddlePaddle/Paddle/pull/49649)
-- Fix the GPUPS optimizer selection bug, fix reader reading problem, and fix RPC compilation problem. [#47026](https://github.com/PaddlePaddle/Paddle/pull/47026),[#47192](https://github.com/PaddlePaddle/Paddle/pull/47192),[#49878](https://github.com/PaddlePaddle/Paddle/pull/49878), [#46356](https://github.com/PaddlePaddle/Paddle/pull/46356),[#46575](https://github.com/PaddlePaddle/Paddle/pull/46575),[#49389](https://github.com/PaddlePaddle/Paddle/pull/49389),[#46258](https://github.com/PaddlePaddle/Paddle/pull/46258),[#50136](https://github.com/PaddlePaddle/Paddle/pull/50136)
-- Add rocksdb compilation method. [#46074](https://github.com/PaddlePaddle/Paddle/pull/46074)
-
-### CUDA
-#### New features
-- Add compilation support for CUDA 12.0. Fix related unit test. ([#49539](https://github.com/PaddlePaddle/Paddle/pull/49539), [#54542](https://github.com/PaddlePaddle/Paddle/pull/54542))
-- Add CUDNN Frontend API compilation support and related unit test. You can enable it with the `WITH_CUDNN_FRONTEND=ON` compilation option. ([#47524](https://github.com/PaddlePaddle/Paddle/pull/47524), [#47612](https://github.com/PaddlePaddle/Paddle/pull/47612))
-
-#### Improvements
-- Add mixed precision strategy and optimize precision:
- - Add and optimize FP16 and BF16 data type support for more than 200 operators in the framework, including logsumexp, reduce_max, cumprod, sync_batch_norm, compare class OP, etc. Carry out precision optimization and unit test for all FP16 and BF16 operators. Improve the unit test framework function for low-precision operators, to ensure there is no loss of accuracy in the process of large-model training. ([#51193](https://github.com/PaddlePaddle/Paddle/pull/51193), [#51114](https://github.com/PaddlePaddle/Paddle/pull/51114), [#45817](https://github.com/PaddlePaddle/Paddle/pull/45817), [#52862](https://github.com/PaddlePaddle/Paddle/pull/52862), [#52919](https://github.com/PaddlePaddle/Paddle/pull/52919), [#52921](https://github.com/PaddlePaddle/Paddle/pull/52921), [#46413](https://github.com/PaddlePaddle/Paddle/pull/46413), [#48205](https://github.com/PaddlePaddle/Paddle/pull/48205), [#54193](https://github.com/PaddlePaddle/Paddle/pull/54193), [#48041](https://github.com/PaddlePaddle/Paddle/pull/48041), [#48121](https://github.com/PaddlePaddle/Paddle/pull/48121), [#46364](https://github.com/PaddlePaddle/Paddle/pull/46364), [#51153](https://github.com/PaddlePaddle/Paddle/pull/51153), [#53023](https://github.com/PaddlePaddle/Paddle/pull/53023), [#53079](https://github.com/PaddlePaddle/Paddle/pull/53079), [#53137](https://github.com/PaddlePaddle/Paddle/pull/53137), [#46212](https://github.com/PaddlePaddle/Paddle/pull/46212), [#50908](https://github.com/PaddlePaddle/Paddle/pull/50908), [#52555](https://github.com/PaddlePaddle/Paddle/pull/52555), [#51582](https://github.com/PaddlePaddle/Paddle/pull/51582), [#47897](https://github.com/PaddlePaddle/Paddle/pull/47897), [#45601](https://github.com/PaddlePaddle/Paddle/pull/45601), [#53522](https://github.com/PaddlePaddle/Paddle/pull/53522), [#52666](https://github.com/PaddlePaddle/Paddle/pull/52666), [#50101](https://github.com/PaddlePaddle/Paddle/pull/50101), [#48315](https://github.com/PaddlePaddle/Paddle/pull/48315), 
[#50847](https://github.com/PaddlePaddle/Paddle/pull/50847), [#50905](https://github.com/PaddlePaddle/Paddle/pull/50905), [#50906](https://github.com/PaddlePaddle/Paddle/pull/50906), [#50909](https://github.com/PaddlePaddle/Paddle/pull/50909), [#50916](https://github.com/PaddlePaddle/Paddle/pull/50916), [#50917](https://github.com/PaddlePaddle/Paddle/pull/50917), [#50920](https://github.com/PaddlePaddle/Paddle/pull/50920), [#50919](https://github.com/PaddlePaddle/Paddle/pull/50919), [#50904](https://github.com/PaddlePaddle/Paddle/pull/50904), [#50918](https://github.com/PaddlePaddle/Paddle/pull/50918), [#50938](https://github.com/PaddlePaddle/Paddle/pull/50938), [#50858](https://github.com/PaddlePaddle/Paddle/pull/50858), [#50933](https://github.com/PaddlePaddle/Paddle/pull/50933), [#50945](https://github.com/PaddlePaddle/Paddle/pull/50945), [#50936](https://github.com/PaddlePaddle/Paddle/pull/50936), [#51168](https://github.com/PaddlePaddle/Paddle/pull/51168), [#51493](https://github.com/PaddlePaddle/Paddle/pull/51493), [#50924](https://github.com/PaddlePaddle/Paddle/pull/50924), [#50923](https://github.com/PaddlePaddle/Paddle/pull/50923), [#50926](https://github.com/PaddlePaddle/Paddle/pull/50926), [#50925](https://github.com/PaddlePaddle/Paddle/pull/50925), [#50930](https://github.com/PaddlePaddle/Paddle/pull/50930), [#53284](https://github.com/PaddlePaddle/Paddle/pull/53284), [#53286](https://github.com/PaddlePaddle/Paddle/pull/53286), [#53285](https://github.com/PaddlePaddle/Paddle/pull/53285), [#50976](https://github.com/PaddlePaddle/Paddle/pull/50976), [#50915](https://github.com/PaddlePaddle/Paddle/pull/50915), [#50915](https://github.com/PaddlePaddle/Paddle/pull/50915), [#48192](https://github.com/PaddlePaddle/Paddle/pull/48192), [#50993](https://github.com/PaddlePaddle/Paddle/pull/50993), [#50998](https://github.com/PaddlePaddle/Paddle/pull/50998), [#51380](https://github.com/PaddlePaddle/Paddle/pull/51380), 
[#51137](https://github.com/PaddlePaddle/Paddle/pull/51137), [#51106](https://github.com/PaddlePaddle/Paddle/pull/51106), [#51197](https://github.com/PaddlePaddle/Paddle/pull/51197), [#51159](https://github.com/PaddlePaddle/Paddle/pull/51159), [#51552](https://github.com/PaddlePaddle/Paddle/pull/51552), [#51151](https://github.com/PaddlePaddle/Paddle/pull/51151), [#51005](https://github.com/PaddlePaddle/Paddle/pull/51005), [#51565](https://github.com/PaddlePaddle/Paddle/pull/51565), [#51036](https://github.com/PaddlePaddle/Paddle/pull/51036), [#51185](https://github.com/PaddlePaddle/Paddle/pull/51185), [#51791](https://github.com/PaddlePaddle/Paddle/pull/51791), [#51083](https://github.com/PaddlePaddle/Paddle/pull/51083), [#51694](https://github.com/PaddlePaddle/Paddle/pull/51694), [#51689](https://github.com/PaddlePaddle/Paddle/pull/51689), [#51009](https://github.com/PaddlePaddle/Paddle/pull/51009), [#51051](https://github.com/PaddlePaddle/Paddle/pull/51051), [#51532](https://github.com/PaddlePaddle/Paddle/pull/51532), [#51978](https://github.com/PaddlePaddle/Paddle/pull/51978), [#51903](https://github.com/PaddlePaddle/Paddle/pull/51903), [#51888](https://github.com/PaddlePaddle/Paddle/pull/51888), [#52016](https://github.com/PaddlePaddle/Paddle/pull/52016), [#52035](https://github.com/PaddlePaddle/Paddle/pull/52035), [#52184](https://github.com/PaddlePaddle/Paddle/pull/52184), [#52018](https://github.com/PaddlePaddle/Paddle/pull/52018), [#51787](https://github.com/PaddlePaddle/Paddle/pull/51787), [#51640](https://github.com/PaddlePaddle/Paddle/pull/51640), [#52172](https://github.com/PaddlePaddle/Paddle/pull/52172), [#52193](https://github.com/PaddlePaddle/Paddle/pull/52193), [#51160](https://github.com/PaddlePaddle/Paddle/pull/51160), [#51809](https://github.com/PaddlePaddle/Paddle/pull/51809), [#51678](https://github.com/PaddlePaddle/Paddle/pull/51678), [#52158](https://github.com/PaddlePaddle/Paddle/pull/52158), 
[#51015](https://github.com/PaddlePaddle/Paddle/pull/51015), [#52240](https://github.com/PaddlePaddle/Paddle/pull/52240), [#52276](https://github.com/PaddlePaddle/Paddle/pull/52276), [#52233](https://github.com/PaddlePaddle/Paddle/pull/52233), [#52220](https://github.com/PaddlePaddle/Paddle/pull/52220), [#52107](https://github.com/PaddlePaddle/Paddle/pull/52107), [#52282](https://github.com/PaddlePaddle/Paddle/pull/52282), [#52311](https://github.com/PaddlePaddle/Paddle/pull/52311), [#52315](https://github.com/PaddlePaddle/Paddle/pull/52315), [#52357](https://github.com/PaddlePaddle/Paddle/pull/52357), [#52256](https://github.com/PaddlePaddle/Paddle/pull/52256), [#51649](https://github.com/PaddlePaddle/Paddle/pull/51649), [#52413](https://github.com/PaddlePaddle/Paddle/pull/52413), [#52369](https://github.com/PaddlePaddle/Paddle/pull/52369), [#51837](https://github.com/PaddlePaddle/Paddle/pull/51837), [#52112](https://github.com/PaddlePaddle/Paddle/pull/52112), [#51819](https://github.com/PaddlePaddle/Paddle/pull/51819), [#52388](https://github.com/PaddlePaddle/Paddle/pull/52388), [#52411](https://github.com/PaddlePaddle/Paddle/pull/52411), [#52521](https://github.com/PaddlePaddle/Paddle/pull/52521), [#51300](https://github.com/PaddlePaddle/Paddle/pull/51300), [#51117](https://github.com/PaddlePaddle/Paddle/pull/51117), [#52380](https://github.com/PaddlePaddle/Paddle/pull/52380), [#52317](https://github.com/PaddlePaddle/Paddle/pull/52317), [#51263](https://github.com/PaddlePaddle/Paddle/pull/51263), [#52668](https://github.com/PaddlePaddle/Paddle/pull/52668), [#52259](https://github.com/PaddlePaddle/Paddle/pull/52259), [#50999](https://github.com/PaddlePaddle/Paddle/pull/50999), [#52407](https://github.com/PaddlePaddle/Paddle/pull/52407), [#52288](https://github.com/PaddlePaddle/Paddle/pull/52288), [#52845](https://github.com/PaddlePaddle/Paddle/pull/52845), [#50953](https://github.com/PaddlePaddle/Paddle/pull/50953), 
[#52667](https://github.com/PaddlePaddle/Paddle/pull/52667), [#52582](https://github.com/PaddlePaddle/Paddle/pull/52582), [#52426](https://github.com/PaddlePaddle/Paddle/pull/52426), [#51884](https://github.com/PaddlePaddle/Paddle/pull/51884), [#52630](https://github.com/PaddlePaddle/Paddle/pull/52630), [#52136](https://github.com/PaddlePaddle/Paddle/pull/52136), [#52604](https://github.com/PaddlePaddle/Paddle/pull/52604), [#51615](https://github.com/PaddlePaddle/Paddle/pull/51615), [#51275](https://github.com/PaddlePaddle/Paddle/pull/51275), [#52898](https://github.com/PaddlePaddle/Paddle/pull/52898), [#52918](https://github.com/PaddlePaddle/Paddle/pull/52918), [#52572](https://github.com/PaddlePaddle/Paddle/pull/52572), [#52683](https://github.com/PaddlePaddle/Paddle/pull/52683), [#52956](https://github.com/PaddlePaddle/Paddle/pull/52956), [#52963](https://github.com/PaddlePaddle/Paddle/pull/52963), [#52954](https://github.com/PaddlePaddle/Paddle/pull/52954), [#52444](https://github.com/PaddlePaddle/Paddle/pull/52444), [#52314](https://github.com/PaddlePaddle/Paddle/pull/52314), [#52887](https://github.com/PaddlePaddle/Paddle/pull/52887), [#52195](https://github.com/PaddlePaddle/Paddle/pull/52195), [#53100](https://github.com/PaddlePaddle/Paddle/pull/53100), [#52961](https://github.com/PaddlePaddle/Paddle/pull/52961), [#52953](https://github.com/PaddlePaddle/Paddle/pull/52953), [#53111](https://github.com/PaddlePaddle/Paddle/pull/53111), [#53549](https://github.com/PaddlePaddle/Paddle/pull/53549), [#53736](https://github.com/PaddlePaddle/Paddle/pull/53736), [#52920](https://github.com/PaddlePaddle/Paddle/pull/52920), [#53195](https://github.com/PaddlePaddle/Paddle/pull/53195), [#53535](https://github.com/PaddlePaddle/Paddle/pull/53535), [#53876](https://github.com/PaddlePaddle/Paddle/pull/53876), [#53785](https://github.com/PaddlePaddle/Paddle/pull/53785), [#53722](https://github.com/PaddlePaddle/Paddle/pull/53722), 
[#54285](https://github.com/PaddlePaddle/Paddle/pull/54285), [#54232](https://github.com/PaddlePaddle/Paddle/pull/54232), [#53922](https://github.com/PaddlePaddle/Paddle/pull/53922), [#47277](https://github.com/PaddlePaddle/Paddle/pull/47277), [#50811](https://github.com/PaddlePaddle/Paddle/pull/50811), [#54571](https://github.com/PaddlePaddle/Paddle/pull/54571), [#50129](https://github.com/PaddlePaddle/Paddle/pull/50129), [#50340](https://github.com/PaddlePaddle/Paddle/pull/50340), [#50848](https://github.com/PaddlePaddle/Paddle/pull/50848), [#50849](https://github.com/PaddlePaddle/Paddle/pull/50849), [#50868](https://github.com/PaddlePaddle/Paddle/pull/50868), [#50878](https://github.com/PaddlePaddle/Paddle/pull/50878), [#50929](https://github.com/PaddlePaddle/Paddle/pull/50929), [#50939](https://github.com/PaddlePaddle/Paddle/pull/50939), [#50973](https://github.com/PaddlePaddle/Paddle/pull/50973), [#50913](https://github.com/PaddlePaddle/Paddle/pull/50913), [#51145](https://github.com/PaddlePaddle/Paddle/pull/51145), [#51090](https://github.com/PaddlePaddle/Paddle/pull/51090), [#51098](https://github.com/PaddlePaddle/Paddle/pull/51098), [#51094](https://github.com/PaddlePaddle/Paddle/pull/51094), [#51216](https://github.com/PaddlePaddle/Paddle/pull/51216), [#51736](https://github.com/PaddlePaddle/Paddle/pull/51736), [#51684](https://github.com/PaddlePaddle/Paddle/pull/51684), [#51925](https://github.com/PaddlePaddle/Paddle/pull/51925), [#54030](https://github.com/PaddlePaddle/Paddle/pull/54030), [#50700](https://github.com/PaddlePaddle/Paddle/pull/50700), [#52264](https://github.com/PaddlePaddle/Paddle/pull/52264), [#51069](https://github.com/PaddlePaddle/Paddle/pull/51069), [#51101](https://github.com/PaddlePaddle/Paddle/pull/51101), [#51286](https://github.com/PaddlePaddle/Paddle/pull/51286), [#53582](https://github.com/PaddlePaddle/Paddle/pull/53582),[#49869](https://github.com/PaddlePaddle/Paddle/pull/49869)))
-- AMP optimization: Comprehensively upgrade and optimize ease of use, accuracy stability and debuggability of AMP training, to better support acceleration of large model training. In terms of ease of use, unify the API for dynamic and static graphs. Add new conversion interfaces such as model.float(), model.float16() and model.bfloat16(). In terms of accuracy stability, enhance automatic adjustment of the strategy for BF16 type. Optimize blacklist settings. Enhance support of the multi_precision function by optimizer operators Adagrad, Adamax, Adadelta, and RMSProp. In the O2 mode, improve master grad mechanism, add type promotion mechanism and a new parameter for the specific module to use float32 computation to guarantee accuracy. In terms of debuggability, add the paddle.amp.debugging module to provide operator statistics, outlier detection, and accuracy comparison. ( [#50132](https://github.com/PaddlePaddle/Paddle/pull/50132), [#50078](https://github.com/PaddlePaddle/Paddle/pull/50078), [#50131](https://github.com/PaddlePaddle/Paddle/pull/50131), [#49705](https://github.com/PaddlePaddle/Paddle/pull/49705), [#52936](https://github.com/PaddlePaddle/Paddle/pull/52936), [#52871](https://github.com/PaddlePaddle/Paddle/pull/52871), [#53289](https://github.com/PaddlePaddle/Paddle/pull/53289), [#53362](https://github.com/PaddlePaddle/Paddle/pull/53362), [#54240](https://github.com/PaddlePaddle/Paddle/pull/54240), [#53768](https://github.com/PaddlePaddle/Paddle/pull/53768), [#48041](https://github.com/PaddlePaddle/Paddle/pull/48041), [#47672](https://github.com/PaddlePaddle/Paddle/pull/47672), [#48843](https://github.com/PaddlePaddle/Paddle/pull/48843), [#49391](https://github.com/PaddlePaddle/Paddle/pull/49391), [#51635](https://github.com/PaddlePaddle/Paddle/pull/51635), [#45541](https://github.com/PaddlePaddle/Paddle/pull/45541), [#53742](https://github.com/PaddlePaddle/Paddle/pull/53742), [#51020](https://github.com/PaddlePaddle/Paddle/pull/51020), 
[#51063](https://github.com/PaddlePaddle/Paddle/pull/51063), [#52514](https://github.com/PaddlePaddle/Paddle/pull/52514), [#50940](https://github.com/PaddlePaddle/Paddle/pull/50940), [#52936](https://github.com/PaddlePaddle/Paddle/pull/52936), [#53439](https://github.com/PaddlePaddle/Paddle/pull/53439), [#53712](https://github.com/PaddlePaddle/Paddle/pull/53712), [#48238](https://github.com/PaddlePaddle/Paddle/pull/48238), [#52215](https://github.com/PaddlePaddle/Paddle/pull/52215), [#53012](https://github.com/PaddlePaddle/Paddle/pull/53012), [#52918](https://github.com/PaddlePaddle/Paddle/pull/52918), [#54571](https://github.com/PaddlePaddle/Paddle/pull/54571))
-- For GroupNorm operator, add support for NHWC data format. ([#47533](https://github.com/PaddlePaddle/Paddle/pull/47533))
-- For index_put operator, add support for mixed data types of bool and int. ([#54195](https://github.com/PaddlePaddle/Paddle/pull/54195))
-- Add sparse.is_nan API for determining whether a sparse tensor contains a NaN element. ([#51513](https://github.com/PaddlePaddle/Paddle/pull/51513))
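The AMP item above mentions `multi_precision` support in more optimizers and an improved master grad mechanism in O2 mode. The sketch below is a framework-independent illustration of why a float32 master copy matters; it is not Paddle's implementation. The helper names (`to_fp16`, `to_fp32`, `train`) and the update size are assumptions chosen for the example, and float16/float32 rounding is simulated with the standard `struct` module.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def train(steps: int, update: float) -> tuple:
    """Apply the same small update `steps` times, with and without a master copy."""
    fp16_only = to_fp16(1.0)  # weight stored and updated purely in float16
    master = to_fp32(1.0)     # float32 master copy of the same weight
    for _ in range(steps):
        # Pure float16: each sum is rounded back to float16 immediately.
        fp16_only = to_fp16(fp16_only + to_fp16(update))
        # Master weight: the sum is accumulated in float32.
        master = to_fp32(master + to_fp32(update))
    # The low-precision weight used for compute is cast from the master copy.
    return fp16_only, to_fp16(master)


fp16_only, from_master = train(steps=1000, update=1e-4)
print("float16 only:", fp16_only)
print("cast from float32 master:", from_master)
```

Near 1.0 the spacing between adjacent float16 values is about 0.001, so an update of 1e-4 is rounded away on every step when the weight lives only in float16, while the float32 master copy accumulates it.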
-
-#### Bug fix
-- Fix computation errors in several operators such as trace, roll, dropout_nd, and log_softmax, as well as a stack overflow and some unit test errors. ([#50243](https://github.com/PaddlePaddle/Paddle/pull/50243), [#52012](https://github.com/PaddlePaddle/Paddle/pull/52012), [#53795](https://github.com/PaddlePaddle/Paddle/pull/53795), [#53149](https://github.com/PaddlePaddle/Paddle/pull/53149), [#53654](https://github.com/PaddlePaddle/Paddle/pull/53654), [#51054](https://github.com/PaddlePaddle/Paddle/pull/51054), [#49373](https://github.com/PaddlePaddle/Paddle/pull/49373), [#53038](https://github.com/PaddlePaddle/Paddle/pull/53038))
-- Fix the problem that conv operator exhaustive search does not work in some scenarios. ([#47065](https://github.com/PaddlePaddle/Paddle/pull/47065))
-- Fix timeout problem of collective_reduce_scatter and other operators on A100. ([#54513](https://github.com/PaddlePaddle/Paddle/pull/54513))
-- Fix the problem of attribute error in FusedLinear unit test. ([#50359](https://github.com/PaddlePaddle/Paddle/pull/50359))
-- Fix the OOM problem that may occur when using Profiler. ([#46089](https://github.com/PaddlePaddle/Paddle/pull/46089))
-
-#### Performance optimization
-- Further optimize GPU Kernel and eigen implementations of the framework's large number of operators, including max_pool3d, dropout, adaptive_pooling, depthwise_conv2d, transpose, eigh, broadcast class computations, reduce class computations, prelu, logsumexp, and sparse, to achieve better performance in more configuration scenarios. ([#45820](https://github.com/PaddlePaddle/Paddle/pull/45820), [#45959](https://github.com/PaddlePaddle/Paddle/pull/45959), [#45934](https://github.com/PaddlePaddle/Paddle/pull/45934), [#46332](https://github.com/PaddlePaddle/Paddle/pull/46332), [#46287](https://github.com/PaddlePaddle/Paddle/pull/46287), [#47233](https://github.com/PaddlePaddle/Paddle/pull/47233), [#48855](https://github.com/PaddlePaddle/Paddle/pull/48855), [#48560](https://github.com/PaddlePaddle/Paddle/pull/48560), [#49419](https://github.com/PaddlePaddle/Paddle/pull/49419), [#49748](https://github.com/PaddlePaddle/Paddle/pull/49748), [#50348](https://github.com/PaddlePaddle/Paddle/pull/50348), [#52401](https://github.com/PaddlePaddle/Paddle/pull/52401), [#51131](https://github.com/PaddlePaddle/Paddle/pull/51131), [#51141](https://github.com/PaddlePaddle/Paddle/pull/51141), [#51479](https://github.com/PaddlePaddle/Paddle/pull/51479), [#51835](https://github.com/PaddlePaddle/Paddle/pull/51835), [#52509](https://github.com/PaddlePaddle/Paddle/pull/52509), [#52482](https://github.com/PaddlePaddle/Paddle/pull/52482), [#52700](https://github.com/PaddlePaddle/Paddle/pull/52700), [#53112](https://github.com/PaddlePaddle/Paddle/pull/53112), [#53659](https://github.com/PaddlePaddle/Paddle/pull/53659), [#53658](https://github.com/PaddlePaddle/Paddle/pull/53658), [#53154](https://github.com/PaddlePaddle/Paddle/pull/53154), [#54071](https://github.com/PaddlePaddle/Paddle/pull/54071), [#53622](https://github.com/PaddlePaddle/Paddle/pull/53622), [#52952](https://github.com/PaddlePaddle/Paddle/pull/52952), [#46046](https://github.com/PaddlePaddle/Paddle/pull/46046), 
[#46119](https://github.com/PaddlePaddle/Paddle/pull/46119), [#45946](https://github.com/PaddlePaddle/Paddle/pull/45946), [#47212](https://github.com/PaddlePaddle/Paddle/pull/47212), [#47791](https://github.com/PaddlePaddle/Paddle/pull/47791), [#47454](https://github.com/PaddlePaddle/Paddle/pull/47454), [#45230](https://github.com/PaddlePaddle/Paddle/pull/45230), [#48899](https://github.com/PaddlePaddle/Paddle/pull/48899), [#33051](https://github.com/PaddlePaddle/Paddle/pull/33051), [#49040](https://github.com/PaddlePaddle/Paddle/pull/49040), [#48992](https://github.com/PaddlePaddle/Paddle/pull/48992), [#49086](https://github.com/PaddlePaddle/Paddle/pull/49086), [#50808](https://github.com/PaddlePaddle/Paddle/pull/50808), [#46431](https://github.com/PaddlePaddle/Paddle/pull/46431), [#50931](https://github.com/PaddlePaddle/Paddle/pull/50931), [#48056](https://github.com/PaddlePaddle/Paddle/pull/48056), [#46071](https://github.com/PaddlePaddle/Paddle/pull/46071), [#49231](https://github.com/PaddlePaddle/Paddle/pull/49231), [#38660](https://github.com/PaddlePaddle/Paddle/pull/38660), [#50287](https://github.com/PaddlePaddle/Paddle/pull/50287), [#46111](https://github.com/PaddlePaddle/Paddle/pull/46111), [#46997](https://github.com/PaddlePaddle/Paddle/pull/46997), [#45854](https://github.com/PaddlePaddle/Paddle/pull/45854), [#47738](https://github.com/PaddlePaddle/Paddle/pull/47738), [#48635](https://github.com/PaddlePaddle/Paddle/pull/48635), [#50353](https://github.com/PaddlePaddle/Paddle/pull/50353), [#50362](https://github.com/PaddlePaddle/Paddle/pull/50362), [#51934](https://github.com/PaddlePaddle/Paddle/pull/51934), [#54045](https://github.com/PaddlePaddle/Paddle/pull/54045), [#46679](https://github.com/PaddlePaddle/Paddle/pull/46679), [#52093](https://github.com/PaddlePaddle/Paddle/pull/52093), [#52969](https://github.com/PaddlePaddle/Paddle/pull/52969))
-- Provide more fusion implementations and related fusion pass, such as fused_feed_forward, gather-gemm-scatter, matmul + bias, layernorm_shift_partition + element_add, and elementwise class fusion, to further improve performance of models that use the mode. ( [#50423](https://github.com/PaddlePaddle/Paddle/pull/50423), [#50091](https://github.com/PaddlePaddle/Paddle/pull/50091), [#50364](https://github.com/PaddlePaddle/Paddle/pull/50364), [#53017](https://github.com/PaddlePaddle/Paddle/pull/53017), [#50755](https://github.com/PaddlePaddle/Paddle/pull/50755), [#50050](https://github.com/PaddlePaddle/Paddle/pull/50050), [#47099](https://github.com/PaddlePaddle/Paddle/pull/47099), [#48848](https://github.com/PaddlePaddle/Paddle/pull/48848), [#49383](https://github.com/PaddlePaddle/Paddle/pull/49383), [#50809](https://github.com/PaddlePaddle/Paddle/pull/50809), [#52361](https://github.com/PaddlePaddle/Paddle/pull/52361), [#52028](https://github.com/PaddlePaddle/Paddle/pull/52028), [#48439](https://github.com/PaddlePaddle/Paddle/pull/48439), [#49009](https://github.com/PaddlePaddle/Paddle/pull/49009), [#51427](https://github.com/PaddlePaddle/Paddle/pull/51427), [#52731](https://github.com/PaddlePaddle/Paddle/pull/52731), [#51805](https://github.com/PaddlePaddle/Paddle/pull/51805))
-
-### Intermediate Representation
-To guarantee stability and reduce the R&D cost of the IR system, we have developed a new IR system for PaddlePaddle. This release completes the basic data structure definitions, operator definition generation, and execution system adaptation. To better support the higher-order requirements of scientific computing scenarios, it also completes higher-order adaptation of operators such as silu and cast.
-- Complete the definition of IR data structure, including type system and operator definition. Implement execution adaptation with phi kernel. [#51112](https://github.com/PaddlePaddle/Paddle/pull/51112), [#51992](https://github.com/PaddlePaddle/Paddle/pull/51992), [#50412](https://github.com/PaddlePaddle/Paddle/pull/50412), [#53557](https://github.com/PaddlePaddle/Paddle/pull/53557), [#53953](https://github.com/PaddlePaddle/Paddle/pull/53953), [#50959](https://github.com/PaddlePaddle/Paddle/pull/50959), [#54250](https://github.com/PaddlePaddle/Paddle/pull/54250), [#54197](https://github.com/PaddlePaddle/Paddle/pull/54197), [#54289](https://github.com/PaddlePaddle/Paddle/pull/54289), [#51636](https://github.com/PaddlePaddle/Paddle/pull/51636), [#52846](https://github.com/PaddlePaddle/Paddle/pull/52846), [#53988](https://github.com/PaddlePaddle/Paddle/pull/53988), [#54143](https://github.com/PaddlePaddle/Paddle/pull/54143), [#54035](https://github.com/PaddlePaddle/Paddle/pull/54035), [#54052](https://github.com/PaddlePaddle/Paddle/pull/54052), [#54340](https://github.com/PaddlePaddle/Paddle/pull/54340), [#54356](https://github.com/PaddlePaddle/Paddle/pull/54356), [#54068](https://github.com/PaddlePaddle/Paddle/pull/54068), [#53894](https://github.com/PaddlePaddle/Paddle/pull/53894), [#53707](https://github.com/PaddlePaddle/Paddle/pull/53707), [#54185](https://github.com/PaddlePaddle/Paddle/pull/54185), [#54031](https://github.com/PaddlePaddle/Paddle/pull/54031), [#54220](https://github.com/PaddlePaddle/Paddle/pull/54220), [#54275](https://github.com/PaddlePaddle/Paddle/pull/54275), [#54281](https://github.com/PaddlePaddle/Paddle/pull/54281), [#54186](https://github.com/PaddlePaddle/Paddle/pull/54186), [#54259](https://github.com/PaddlePaddle/Paddle/pull/54259), [#54124](https://github.com/PaddlePaddle/Paddle/pull/54124), [#54292](https://github.com/PaddlePaddle/Paddle/pull/54292), [#48068](https://github.com/PaddlePaddle/Paddle/pull/48068), 
[#53978](https://github.com/PaddlePaddle/Paddle/pull/53978)
-- Improve the basic pass infrastructure, including basic pass definition and pass registration management. [#54023](https://github.com/PaddlePaddle/Paddle/pull/54023), [#54170](https://github.com/PaddlePaddle/Paddle/pull/54170), [#54308](https://github.com/PaddlePaddle/Paddle/pull/54308), [#54348](https://github.com/PaddlePaddle/Paddle/pull/54348), [#54385](https://github.com/PaddlePaddle/Paddle/pull/54385)
-- Improve adaptation of higher-order operators, including modification of the basic module and adaptation of the silu and cast operators. [#52005](https://github.com/PaddlePaddle/Paddle/pull/52005), [#53425](https://github.com/PaddlePaddle/Paddle/pull/53425), [#53417](https://github.com/PaddlePaddle/Paddle/pull/53417), [#53498](https://github.com/PaddlePaddle/Paddle/pull/53498), [#53171](https://github.com/PaddlePaddle/Paddle/pull/53171), [#53632](https://github.com/PaddlePaddle/Paddle/pull/53632), [#53605](https://github.com/PaddlePaddle/Paddle/pull/53605), [#53746](https://github.com/PaddlePaddle/Paddle/pull/53746), [#53874](https://github.com/PaddlePaddle/Paddle/pull/53874), [#54164](https://github.com/PaddlePaddle/Paddle/pull/54164), [#45888](https://github.com/PaddlePaddle/Paddle/pull/45888), [#46024](https://github.com/PaddlePaddle/Paddle/pull/46024), [#46446](https://github.com/PaddlePaddle/Paddle/pull/46446), [#46960](https://github.com/PaddlePaddle/Paddle/pull/46960)
-
-### CINN compiler
-#### New features
-- Add CINN support for 0D-Tensor. To keep pace with the upgrade of the main framework, this is currently implemented through a temporary pass; the solution will be replaced and upgraded later. ([#53382](https://github.com/PaddlePaddle/Paddle/pull/53382), [#53955](https://github.com/PaddlePaddle/Paddle/pull/53955), [#54064](https://github.com/PaddlePaddle/Paddle/pull/54064), [#54118](https://github.com/PaddlePaddle/Paddle/pull/54118), [#54216](https://github.com/PaddlePaddle/Paddle/pull/54216), [#53454](https://github.com/PaddlePaddle/Paddle/pull/53454))
-- Add CINN support for int8/uint8/int16/uint16/bf16 data types. ([#50566](https://github.com/PaddlePaddle/Paddle/pull/50566), [#53637](https://github.com/PaddlePaddle/Paddle/pull/53637))
-- Add support for the CINN expand operator. ([#46776](https://github.com/PaddlePaddle/Paddle/pull/46776))
-- Add CINN support for PaddleInference. ([#45009](https://github.com/PaddlePaddle/Paddle/pull/45009))
-
-#### Improvements
-- For CINN compiler, pass skip_gc_vars attribute to CINN subgraph. CINN adds fetch operator for skip_gc_vars. [#49471](https://github.com/PaddlePaddle/Paddle/pull/49471), [#49553](https://github.com/PaddlePaddle/Paddle/pull/49553)
-- For CINN compiler, conv2d and conv2d_grad do not use cinn operator by default. [#51645](https://github.com/PaddlePaddle/Paddle/pull/51645)
-- Add build_cinn_pass to BuildStrategy for use in dynamic-to-static. ([#49496](https://github.com/PaddlePaddle/Paddle/pull/49496))
-- Add reshape operator to perform unit test under combinator mechanism. ([#51276](https://github.com/PaddlePaddle/Paddle/pull/51276))
-- Change version of the main framework binding CINN from fixed commit to develop. ([#49775](https://github.com/PaddlePaddle/Paddle/pull/49775))
-- Set default Target parameter for CINN. ([#50182](https://github.com/PaddlePaddle/Paddle/pull/50182))
-
-#### Bug fix
-- Fix the problem of inconsistent operator order after topology sorting during CINN symbolization. ([#52556](https://github.com/PaddlePaddle/Paddle/pull/52556))
-- Fix some operator computation errors, accuracy degradation, and unit test related problems. ([#53859](https://github.com/PaddlePaddle/Paddle/pull/53859), [#54261](https://github.com/PaddlePaddle/Paddle/pull/54261), [#46801](https://github.com/PaddlePaddle/Paddle/pull/46801), [#53676](https://github.com/PaddlePaddle/Paddle/pull/53676), [#53772](https://github.com/PaddlePaddle/Paddle/pull/53772))
-- Fix the problem of CINN support for float16 type. ([#48249](https://github.com/PaddlePaddle/Paddle/pull/48249))
-- Fix the problem in build_cinn_pass. ([#46843](https://github.com/PaddlePaddle/Paddle/pull/46843))
-- Fix the problem of no data area due to incorrect GC when CINN is turned on during combinator + dynamic-to-static. ([#50116](https://github.com/PaddlePaddle/Paddle/pull/50116))
-- Fix the problems of compiler dropout amp error, combinator resnet error, and inplace variable not found [#51688](https://github.com/PaddlePaddle/Paddle/pull/51688), [#52813](https://github.com/PaddlePaddle/Paddle/pull/52813), [#51769](https://github.com/PaddlePaddle/Paddle/pull/51769)
-
-#### Performance optimization
-- Optimize reshape-related fusion strategy. ([#53066](https://github.com/PaddlePaddle/Paddle/pull/53066))
-- Optimize performance of BuildCINNPass. ([#49696](https://github.com/PaddlePaddle/Paddle/pull/49696))
-- Optimize performance of subgraph detection module. ([#45040](https://github.com/PaddlePaddle/Paddle/pull/45040), [#46937](https://github.com/PaddlePaddle/Paddle/pull/46937))
-
-### Hardware support
-#### CustomDevice
-- Add support for the distributed strategies MP/Sharding/PP/MoE and for recompute on the training side. Add support for the distributed strategy MP on the inference side. Hardware accessed through CustomDevice, such as Ascend NPU and Cambricon MLU, automatically inherits all distributed strategies newly added to CustomDevice without any code changes. [#52872](https://github.com/PaddlePaddle/Paddle/pull/52872), [#54384](https://github.com/PaddlePaddle/Paddle/pull/54384), [#53220](https://github.com/PaddlePaddle/Paddle/pull/53220), [#54572](https://github.com/PaddlePaddle/Paddle/pull/54572), [#54573](https://github.com/PaddlePaddle/Paddle/pull/54573), [#54676](https://github.com/PaddlePaddle/Paddle/pull/54676), [#53044](https://github.com/PaddlePaddle/Paddle/pull/53044), [#53719](https://github.com/PaddlePaddle/Paddle/pull/53719), [#53701](https://github.com/PaddlePaddle/Paddle/pull/53701), [#53702](https://github.com/PaddlePaddle/Paddle/pull/53702), [#53703](https://github.com/PaddlePaddle/Paddle/pull/53703)
-- Add the paddle.device.is_compiled_with_custom_device API, which lets users check whether the current environment supports the plug-in device backend of a given hardware. [#49271](https://github.com/PaddlePaddle/Paddle/pull/49721)
-- Add the CUSTOM_DEVICE_BLACK_LIST environment variable, so that blacklisted operators automatically fall back to running on the CPU. [#50409](https://github.com/PaddlePaddle/Paddle/pull/50409), [#50666](https://github.com/PaddlePaddle/Paddle/pull/50666)
-- Optimize CustomDevice performance by reducing number of calls to get_device_count interface in runtime. [#46963](https://github.com/PaddlePaddle/Paddle/pull/46963)
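The CUSTOM_DEVICE_BLACK_LIST item above describes routing blacklisted operators to the CPU. The sketch below only illustrates that dispatch idea in plain Python and does not reflect Paddle's internal code. The comma-separated value format, the function names `parse_blacklist` and `choose_backend`, and the device name `custom_npu` are all assumptions made for the example.

```python
import os
from typing import Mapping, Optional, Set


def parse_blacklist(env: Optional[Mapping[str, str]] = None) -> Set[str]:
    """Read operator names from CUSTOM_DEVICE_BLACK_LIST.

    Assumes a comma-separated list; surrounding whitespace and empty
    entries are ignored.
    """
    source = os.environ if env is None else env
    raw = source.get("CUSTOM_DEVICE_BLACK_LIST", "")
    return {name.strip() for name in raw.split(",") if name.strip()}


def choose_backend(op_name: str, blacklist: Set[str],
                   custom_device: str = "custom_npu") -> str:
    """Return "cpu" for blacklisted operators, else the custom device."""
    return "cpu" if op_name in blacklist else custom_device


# Hypothetical setting: two operators are forced onto the CPU.
blacklist = parse_blacklist({"CUSTOM_DEVICE_BLACK_LIST": "softmax, layer_norm"})
for op in ("matmul", "softmax", "layer_norm"):
    print(op, "->", choose_backend(op, blacklist))
```

Passing the environment as a mapping keeps the helper easy to test; calling `parse_blacklist()` with no argument reads the real process environment instead.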
-
-#### KUNLUNXIN XPU
-- On the training side, adopt the new dynamic graph and add support for the distributed strategies MP/Sharding/PP, the recompute function, and the communication library. On the inference side, add support for the distributed strategy MP and for the XPU FasterTransformer operator acceleration library. [#49531](https://github.com/PaddlePaddle/Paddle/pull/49531), [#49815](https://github.com/PaddlePaddle/Paddle/pull/49815), [#48897](https://github.com/PaddlePaddle/Paddle/pull/48897), [#50717](https://github.com/PaddlePaddle/Paddle/pull/50717), [#51082](https://github.com/PaddlePaddle/Paddle/pull/51082), [#49757](https://github.com/PaddlePaddle/Paddle/pull/49757), [#51399](https://github.com/PaddlePaddle/Paddle/pull/51399), [#50329](https://github.com/PaddlePaddle/Paddle/pull/50329), [#48369](https://github.com/PaddlePaddle/Paddle/pull/48369), [#47838](https://github.com/PaddlePaddle/Paddle/pull/47838), [#48076](https://github.com/PaddlePaddle/Paddle/pull/48076), [#47882](https://github.com/PaddlePaddle/Paddle/pull/47882), [#48961](https://github.com/PaddlePaddle/Paddle/pull/48961), [#49043](https://github.com/PaddlePaddle/Paddle/pull/49043), [#49749](https://github.com/PaddlePaddle/Paddle/pull/49749), [#49806](https://github.com/PaddlePaddle/Paddle/pull/49806), [#53427](https://github.com/PaddlePaddle/Paddle/pull/53427), [#48470](https://github.com/PaddlePaddle/Paddle/pull/48470), [#49207](https://github.com/PaddlePaddle/Paddle/pull/49207), [#52296](https://github.com/PaddlePaddle/Paddle/pull/52296), [#51785](https://github.com/PaddlePaddle/Paddle/pull/51785), [#47168](https://github.com/PaddlePaddle/Paddle/pull/47168), [#47445](https://github.com/PaddlePaddle/Paddle/pull/47445), [#50200](https://github.com/PaddlePaddle/Paddle/pull/50200), [#49934](https://github.com/PaddlePaddle/Paddle/pull/49934), [#50792](https://github.com/PaddlePaddle/Paddle/pull/50792), [#52228](https://github.com/PaddlePaddle/Paddle/pull/52228), [#53337](https://github.com/PaddlePaddle/Paddle/pull/53337), [#53389](https://github.com/PaddlePaddle/Paddle/pull/53389), [#53496](https://github.com/PaddlePaddle/Paddle/pull/53496), [#53609](https://github.com/PaddlePaddle/Paddle/pull/53609), [#53697](https://github.com/PaddlePaddle/Paddle/pull/53697), [#53720](https://github.com/PaddlePaddle/Paddle/pull/53720), [#53734](https://github.com/PaddlePaddle/Paddle/pull/53734), [#54172](https://github.com/PaddlePaddle/Paddle/pull/54172), [#46227](https://github.com/PaddlePaddle/Paddle/pull/46227)
-
-## 4. Deployment Direction (Paddle Inference)
-### New features
-- In Paddle-TensorRT, support sharing GPU memory among the TensorRT engines of multiple subgraphs, or among the TensorRT engines of different Predictors, to reduce GPU memory usage. [#45842](https://github.com/PaddlePaddle/Paddle/pull/45842), [#47631](https://github.com/PaddlePaddle/Paddle/pull/47631)
-- For the C++ API, add APIs to obtain the shape and data type of input Tensors and of output Tensors. For the C API, add SetExecStream, EnableMkldnnInt8, and other APIs that already exist in C++, for service deployment. [#49758](https://github.com/PaddlePaddle/Paddle/pull/49758)
-- Add the paddle.inference.Predictor.register_output_hook() API. It supports printing the output of each layer during GPU inference for debugging, and it can be used in control flow models such as While. Note that this API does not support Paddle-TensorRT. [#54433](https://github.com/PaddlePaddle/Paddle/pull/54433), [#47050](https://github.com/PaddlePaddle/Paddle/pull/47050), [#54254](https://github.com/PaddlePaddle/Paddle/pull/54254)
-- Paddle Inference Predictor API supports paddle::Tensor as input and output, so users can directly reuse the PaddlePaddle dynamic graph for pre-inference and post-inference processing. ([#50445](https://github.com/PaddlePaddle/Paddle/pull/50445))
-- Enhance the dynamic shape capability of Paddle-TensorRT: calling the config.enable_tuned_tensorrt_dynamic_shape() API without any parameters builds the TensorRT engine at runtime, so shape information no longer needs to be collected before running. To avoid rebuilding at runtime, the first several runs should cover the minimum and maximum shapes. [#52162](https://github.com/PaddlePaddle/Paddle/pull/52162)
-- Paddle-TensorRT supports model input in NHWC format. [#49633](https://github.com/PaddlePaddle/Paddle/pull/49633)
-- Extend the config.Exp_DisableTensorRtOPs API to disable access to TensorRT by specifying the name of the Tensor variable. [#49497](https://github.com/PaddlePaddle/Paddle/pull/49497)
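The dynamic shape item above recommends covering the minimum and maximum shapes in the first few runs to avoid rebuilding the engine later. The following plain-Python sketch shows the idea of tracking a per-input shape range and detecting when it widens. It is an illustration under stated assumptions, not Paddle-TensorRT code: the `ShapeRangeTracker` class, its `observe` method, and the input name `image` are hypothetical.

```python
from typing import Dict, List, Tuple


class ShapeRangeTracker:
    """Tracks the element-wise minimum and maximum shape seen per input."""

    def __init__(self) -> None:
        self.ranges: Dict[str, Tuple[List[int], List[int]]] = {}

    def observe(self, name: str, shape: List[int]) -> bool:
        """Record a shape; return True if the known range had to widen.

        A widened range stands in for "the engine would need a rebuild".
        """
        if name not in self.ranges:
            self.ranges[name] = (list(shape), list(shape))
            return True
        low, high = self.ranges[name]
        if len(shape) != len(low):
            raise ValueError("rank changed for input %r" % name)
        new_low = [min(a, b) for a, b in zip(low, shape)]
        new_high = [max(a, b) for a, b in zip(high, shape)]
        widened = new_low != low or new_high != high
        self.ranges[name] = (new_low, new_high)
        return widened


tracker = ShapeRangeTracker()
# Warm-up runs that cover the smallest and largest batch sizes first.
warmup = [tracker.observe("image", s) for s in ([1, 3, 224, 224], [8, 3, 224, 224])]
# A later shape inside the covered range does not widen it.
later = tracker.observe("image", [4, 3, 224, 224])
print(warmup, later, tracker.ranges["image"])
```

Feeding the extreme shapes during warm-up means every later shape falls inside the recorded range, which is the situation the release note asks users to create.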
-
-### Improvements
-- Enhance GPU mixed-precision inference (non-Paddle-TensorRT scenarios): Config.enable_use_gpu now allows the precision type to be set. [#47993](https://github.com/PaddlePaddle/Paddle/pull/47993)
-- Support double type input for inference. [#51786](https://github.com/PaddlePaddle/Paddle/pull/51786)
-- Because TensorRT operators do not support the INT64 type, models containing INT64 data previously failed to run. Paddle-TensorRT now automatically converts such models to run with the INT32 type. [#45547](https://github.com/PaddlePaddle/Paddle/pull/45547)
-- Paddle-TensorRT supports converting more operators to TensorRT inference, including:
- - expand_v2, gather_nd, rsqrt, sign, not, onehot, arg_min, temporal_shift, expand_as_v2, setvalue, index_select, round, acosh, square, reduce_max, not_equal, reduce_min, reduce_prod, grid_sampler, elementwise_mod, pad3d, greater_equal, bitwise, cumsum, matmul_v2, reciprocal, where, bmm, take_along_axis, less_than, greater_than, logical_or, logical_xor, logical_and, less_equal, range, reduce_all, reduce_any, fill_any_like, pow
- - [#47002](https://github.com/PaddlePaddle/Paddle/pull/47002) , [#47589](https://github.com/PaddlePaddle/Paddle/pull/47589) ,[#48223](https://github.com/PaddlePaddle/Paddle/pull/48223) ,[#48557](https://github.com/PaddlePaddle/Paddle/pull/48557) , [#48655](https://github.com/PaddlePaddle/Paddle/pull/48655) , [#49113](https://github.com/PaddlePaddle/Paddle/pull/49113) , [#51207](https://github.com/PaddlePaddle/Paddle/pull/51207) ,[#51028](https://github.com/PaddlePaddle/Paddle/pull/51028) ,[#50341](https://github.com/PaddlePaddle/Paddle/pull/50341) ,[#51498](https://github.com/PaddlePaddle/Paddle/pull/51498) ,[#48534](https://github.com/PaddlePaddle/Paddle/pull/48534) ,[#48684](https://github.com/PaddlePaddle/Paddle/pull/48684) , [#49393](https://github.com/PaddlePaddle/Paddle/pull/49393) , [#49615](https://github.com/PaddlePaddle/Paddle/pull/49615) ,[#50934](https://github.com/PaddlePaddle/Paddle/pull/50934) ,[#50974](https://github.com/PaddlePaddle/Paddle/pull/50974),[#50986](https://github.com/PaddlePaddle/Paddle/pull/50986) , [#52000](https://github.com/PaddlePaddle/Paddle/pull/52000) ,[#51971](https://github.com/PaddlePaddle/Paddle/pull/51971) , [#52518](https://github.com/PaddlePaddle/Paddle/pull/52518) ,[#44918](https://github.com/PaddlePaddle/Paddle/pull/44918) ,[#48230](https://github.com/PaddlePaddle/Paddle/pull/48230) ,[#47820](https://github.com/PaddlePaddle/Paddle/pull/47820) , [#46877](https://github.com/PaddlePaddle/Paddle/pull/46877) , [#48358](https://github.com/PaddlePaddle/Paddle/pull/48358) , [#48592](https://github.com/PaddlePaddle/Paddle/pull/48592) ,[#48697](https://github.com/PaddlePaddle/Paddle/pull/48697) , [#53088](https://github.com/PaddlePaddle/Paddle/pull/53088) , [#47974](https://github.com/PaddlePaddle/Paddle/pull/47974) , [#53462](https://github.com/PaddlePaddle/Paddle/pull/53462)
-- Enhance Paddle-TensorRT mapping operators strided_slice, instance_norm, prelu, argmax, cast, nearest_interp_v2, elementwise, bilinear. [#46819](https://github.com/PaddlePaddle/Paddle/pull/46819) ,[#47998](https://github.com/PaddlePaddle/Paddle/pull/47998) ,[#48043](https://github.com/PaddlePaddle/Paddle/pull/48043) ,[#48998](https://github.com/PaddlePaddle/Paddle/pull/48998) , [#49675](https://github.com/PaddlePaddle/Paddle/pull/49675) , [#47495](https://github.com/PaddlePaddle/Paddle/pull/47495)
-- Some Paddle-TensorRT operators (scale, square, sum, swish, expand_as_v2, prelu, gelu, hard_swish, hard_sigmoid, leaky_relu, softmax, stack, clip, cast, flatten_contiguous_range, unary, equal, elementwise_op) support 0-dimensional Tensor. [#53660](https://github.com/PaddlePaddle/Paddle/pull/53660), [#53627](https://github.com/PaddlePaddle/Paddle/pull/53627), [#53634](https://github.com/PaddlePaddle/Paddle/pull/53634), [#53714](https://github.com/PaddlePaddle/Paddle/pull/53714), [#53729](https://github.com/PaddlePaddle/Paddle/pull/53729), [#53769](https://github.com/PaddlePaddle/Paddle/pull/53769), [#53506](https://github.com/PaddlePaddle/Paddle/pull/53506), [#53704](https://github.com/PaddlePaddle/Paddle/pull/53704)
-- Support compilation for versions earlier than GCC12 + CUDA 12.0. [#50106](https://github.com/PaddlePaddle/Paddle/pull/50106)
-- Paddle-TensorRT's DeformableConv plugin supports dynamic Shape input. [#50698](https://github.com/PaddlePaddle/Paddle/pull/50698)
-- For Paddle-TensorRT, add plugin support for lookup_table operator. [#46613](https://github.com/PaddlePaddle/Paddle/pull/46613)
-- Add config.enable_low_precision_io() API to support low-precision type input in Paddle-TensorRT scenario. [#52485](https://github.com/PaddlePaddle/Paddle/pull/52485)
-- Paddle-TensorRT's LayerNorm plugin supports FP16 computation. [#45043](https://github.com/PaddlePaddle/Paddle/pull/45043)
-- Predictor's input data paddle_infer::Tensor supports bool type. [#49388](https://github.com/PaddlePaddle/Paddle/pull/49388)
-- Paddle-TensorRT enhanced Convolution implementation uses ConvolutionNd. [#47653](https://github.com/PaddlePaddle/Paddle/pull/47653)
-- conv2d_fusion operator supports NHWC format. [#49047](https://github.com/PaddlePaddle/Paddle/pull/49047)
-- Adjust the directory structure related to Phi operators under C++ inference library. [#53091](https://github.com/PaddlePaddle/Paddle/pull/53091)
-- Support rebuilding TensorRT Engine instead of reporting errors when TensorRT serialization and loading versions do not match. [#50775](https://github.com/PaddlePaddle/Paddle/pull/50775)
-- Optimize the log messages printed by the Paddle-TensorRT runtime. [#50181](https://github.com/PaddlePaddle/Paddle/pull/50181)
-- Support elementwise 0-dimensional Tensor inputs for oneDNN-based CPU inference. [#51656](https://github.com/PaddlePaddle/Paddle/pull/51656)
-- Clean up and normalize support for Paddle-TensorRT's FC, matmul, matmul_v2 operators, and unify and upgrade to use TensorRT's IMatrixMultiplyLayer for support. [#52222](https://github.com/PaddlePaddle/Paddle/pull/52222)
-
-### Performance optimization
-- Support fusing multiple lookup_tables into Paddle-TensorRT's Embedding+Eltwise+LayerNorm fusion. [#46243](https://github.com/PaddlePaddle/Paddle/pull/46243), [#46230](https://github.com/PaddlePaddle/Paddle/pull/46230)
-- Add MoE fusion Phi operator to improve inference performance of MoE model. [#48703](https://github.com/PaddlePaddle/Paddle/pull/48703)
-- In the scenario of INT8 quantized inference, Paddle-TensorRT plugin can fall back to FP16 computation, instead of FP32 computation. [#50554](https://github.com/PaddlePaddle/Paddle/pull/50554)
-- Optimize memory and video memory in case of inference. [#49051](https://github.com/PaddlePaddle/Paddle/pull/49051) , [#49046](https://github.com/PaddlePaddle/Paddle/pull/49046) ,[#53930](https://github.com/PaddlePaddle/Paddle/pull/53930)
-- Optimize Layout and enhance Pass. [#52997](https://github.com/PaddlePaddle/Paddle/pull/52997)
-- Support caching of operator Shape inferences to improve model inference performance. [#48312](https://github.com/PaddlePaddle/Paddle/pull/48312)
-- Optimize bias+add+relu fusion using half2 instructions. [#49048](https://github.com/PaddlePaddle/Paddle/pull/49048)
-- Optimize Concat Kernel for multiple inputs using vectorization operations. [#49540](https://github.com/PaddlePaddle/Paddle/pull/49540)
-- Implement Convolution, Depthwise Convolution and related fusion operators based on CUTLASS to improve inference speed. [#47989](https://github.com/PaddlePaddle/Paddle/pull/47989), [#50603](https://github.com/PaddlePaddle/Paddle/pull/50603), [#51792](https://github.com/PaddlePaddle/Paddle/pull/51792)
-- Paddle-TensorRT supports a FlashAttention plugin, to improve inference speed of models such as StableDiffusion. [#49438](https://github.com/PaddlePaddle/Paddle/pull/49438)
-- Add Transpose+LayerNorm fusion PASS, to improve inference speed of models such as StableDiffusion. [#50082](https://github.com/PaddlePaddle/Paddle/pull/50082)
-- Add Elementwise+Transpose fusion. [#50081](https://github.com/PaddlePaddle/Paddle/pull/50081)
-- Optimize Paddle-TensorRT Group Norm plugin implementation. [#49160](https://github.com/PaddlePaddle/Paddle/pull/49160)
-- Add the use_cuda_graph parameter to the Config.EnableTensorRtEngine() API, which enables CUDA Graph to reduce runtime overhead. Note that the model input shape must remain unchanged while CUDA Graph is in use. [#53406](https://github.com/PaddlePaddle/Paddle/pull/53406)
-- Support inplace operation of Reshape, to reduce copying time of the model at runtime. [#49146](https://github.com/PaddlePaddle/Paddle/pull/49146)
-- Optimize LayerNorm kernel implementation based on oneDNN. [#47782](https://github.com/PaddlePaddle/Paddle/pull/47782)
-- Support fusion of quantize+transpose and transpose+dequantize based on oneDNN. [#49509](https://github.com/PaddlePaddle/Paddle/pull/49509)
-- When MKLDNN is turned on in CPU inference, FC-related fusion pass is enabled by default, to improve performance. [#45704](https://github.com/PaddlePaddle/Paddle/pull/45704)
-- CPU oneDNN inference supports squeeze2 + transpose2 fusion. [#47592](https://github.com/PaddlePaddle/Paddle/pull/47592)
-
-### XPU inference enhancement and performance optimization
-- Add ExpRunWithRuntimeConfig API and XpuRuntimeConfig, to allow settings of parameters such as external streams, and L3 cache during inference. GetExecStream API supports obtaining Kunlun external stream objects. Input and output support Kunlun device memory, to reduce D2H and H2D overheads. [#53334](https://github.com/PaddlePaddle/Paddle/pull/53334), [#52466](https://github.com/PaddlePaddle/Paddle/pull/52466), [#53240](https://github.com/PaddlePaddle/Paddle/pull/53240)
-- Add multi-encoder, fused_multi_transformer and fusion pass, to improve performance of ERNIE and Transformer class models. [#50570](https://github.com/PaddlePaddle/Paddle/pull/50570), [#51346](https://github.com/PaddlePaddle/Paddle/pull/51346), [#50499](https://github.com/PaddlePaddle/Paddle/pull/50499), [#53982](https://github.com/PaddlePaddle/Paddle/pull/53982), [#50759](https://github.com/PaddlePaddle/Paddle/pull/50759), [#51571](https://github.com/PaddlePaddle/Paddle/pull/51571), [#53144](https://github.com/PaddlePaddle/Paddle/pull/53144), [#53306](https://github.com/PaddlePaddle/Paddle/pull/53306)
-- Optimize BeamSearch performance. Transform, remove and fuse fine-grained operators such as write_read_array and gather, to improve model performance when beam_size=1. [#53130](https://github.com/PaddlePaddle/Paddle/pull/53130)
-- Transform multiple stack operators with the same input into unsqueeze operators that support broadcast. Unsqueeze/squeeze supports inplace computation. [#52099](https://github.com/PaddlePaddle/Paddle/pull/52099)
-- Add support for exporting multi-card inference models for Kunlunxin. [#50490](https://github.com/PaddlePaddle/Paddle/pull/50490)
-- Add embedding_with_eltwise_add fusion pass and operator phi kernel, to reduce video memory usage and improve inference performance. [#50590](https://github.com/PaddlePaddle/Paddle/pull/50590)
-- interpolate class operator phi kernel supports FP16. [#52358](https://github.com/PaddlePaddle/Paddle/pull/52358)
-- argmax operator supports INT32 type output. [#51303](https://github.com/PaddlePaddle/Paddle/pull/51303)
-- Fix the error that only the model file is saved when saving a serialized model after turning on mixed-precision inference mode. [#52994](https://github.com/PaddlePaddle/Paddle/pull/52994)
-- Fix the segmentation fault of instance_norm when scale and bias are empty. [#52627](https://github.com/PaddlePaddle/Paddle/pull/52627)
-- conv_transpose operator supports FP16. [#53626](https://github.com/PaddlePaddle/Paddle/pull/53626)
-- Add yolo_box_xpu fusion pass and operator phi kernel, to optimize YOLO model generic substructure. [#54163](https://github.com/PaddlePaddle/Paddle/pull/54163)
-- Add conv2d_xpu fusion pass and operator phi kernel, and support FP16 inference, to reduce convolution inference time. [#52247](https://github.com/PaddlePaddle/Paddle/pull/52247), [#53626](https://github.com/PaddlePaddle/Paddle/pull/53626)
-- Add sigmoid_elementmul generic fusion pass, which fuses the pattern into a swish operator so that the conv2d_fusion pass can match it, to improve YOLO model inference performance. [#53580](https://github.com/PaddlePaddle/Paddle/pull/53580)
-- Add act_add fusion pass and operator phi kernel to improve inference performance. [#53965](https://github.com/PaddlePaddle/Paddle/pull/53965)
-- Add fold_interp_outsize fusion pass, to improve inference performance. [#54245](https://github.com/PaddlePaddle/Paddle/pull/54245)
-- Solve the problem of incorrect results due to duplicate fusion when there is shared weight in FC. [#51108](https://github.com/PaddlePaddle/Paddle/pull/51108), [#51039](https://github.com/PaddlePaddle/Paddle/pull/51039)
-- Remove op_device attribute where operator is only used for training, to prevent wrong choice of place for training during inference. [#51029](https://github.com/PaddlePaddle/Paddle/pull/51029)
-- Support saving of optimized models, allowing PASS optimization to be skipped in case of re-inference, to reduce first time inference time. [#53696](https://github.com/PaddlePaddle/Paddle/pull/53696)
-- Solve the problem of computation error caused by the CPUPlace input of operator Kernel being forced to copy to XPU. [#51306](https://github.com/PaddlePaddle/Paddle/pull/51306)
-- subblock supports early copying of H2D parameters to improve inference performance. [#51876](https://github.com/PaddlePaddle/Paddle/pull/51876)
-- Fix scale memory size of the output activation of Kunlunxin 2nd generation chip. [#53505](https://github.com/PaddlePaddle/Paddle/pull/53505)
-- In new executor Kunlunxin D2D copy, support asynchronous execution. [#51876](https://github.com/PaddlePaddle/Paddle/pull/51876)
-- Remove concat operator with only one input. [#52304](https://github.com/PaddlePaddle/Paddle/pull/52304)
-- lookup_table_v2 supports FP16 to remove redundant cast operator. [#52888](https://github.com/PaddlePaddle/Paddle/pull/52888)
-- Control flow While operator supports caching scope, to reduce overhead of creating new scope every time. [#52628](https://github.com/PaddlePaddle/Paddle/pull/52628)
-- Scatter newly supports FP16, to remove redundant cast operators and elementwise_mul operators with an input of 1. [#52831](https://github.com/PaddlePaddle/Paddle/pull/52831)
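
Several items above are graph rewrites that delete an operator and rewire its consumers, such as removing a concat operator that has only one input. A minimal sketch of that kind of rewrite is shown here on a toy graph. The list-of-dicts representation and the function name `remove_single_input_concat` are assumptions for illustration and do not reflect Paddle's actual graph IR or pass API.

```python
def remove_single_input_concat(ops):
    """Drop concat ops that have exactly one input and rewire their consumers.

    `ops` is a topologically ordered list of dicts with "type", "inputs" and
    "outputs" keys. This toy representation is an assumption for
    illustration, not Paddle's graph IR.
    """
    alias = {}  # output name of a removed concat -> name of its only input
    kept = []
    for op in ops:
        # Redirect inputs that pointed at an already removed concat.
        inputs = [alias.get(name, name) for name in op["inputs"]]
        if op["type"] == "concat" and len(inputs) == 1:
            alias[op["outputs"][0]] = inputs[0]
            continue
        kept.append({**op, "inputs": inputs})
    return kept


graph = [
    {"type": "conv2d", "inputs": ["x"], "outputs": ["a"]},
    {"type": "concat", "inputs": ["a"], "outputs": ["b"]},
    {"type": "relu", "inputs": ["b"], "outputs": ["y"]},
]
optimized = remove_single_input_concat(graph)
```

In this sketch the `relu` op ends up reading `a` directly. A real pass would also have to handle the case where the removed concat's output is a graph output, which is ignored here.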
-
-### Model quantization
-- Upgrade of dynamic graph quantization function.
- - Add a new API for quantization training of dynamic graph models: `paddle.quantization.QAT`. Support passing quantization-related parameters through configuration, simplifying quantization training process and difficulty of secondary development. ([#49398](https://github.com/PaddlePaddle/Paddle/pull/49398))
- - Add a new offline quantization API: `paddle.quantization.PTQ`. Support exporting quantization model to model format supported by inference. ([#50107](https://github.com/PaddlePaddle/Paddle/pull/50107))
- - Add STUB operator to simulate actual quantization operation during training process. ([#50510](https://github.com/PaddlePaddle/Paddle/pull/50510))
-- Support quantization training model to load parameters of offline quantization model. Support more operators for quantization, including matmul, scale, and conv1d. [#47892](https://github.com/PaddlePaddle/Paddle/pull/47892), [#45911](https://github.com/PaddlePaddle/Paddle/pull/45911),[#48912](https://github.com/PaddlePaddle/Paddle/pull/48912)
-- Support hybrid parallel training of static graph quantization training. [#52219](https://github.com/PaddlePaddle/Paddle/pull/52219)
-- Fix the problem in the process of dynamic graph quantization:
- - Repeat insertion of quantization nodes when exporting quantization training models. [#48751](https://github.com/PaddlePaddle/Paddle/pull/48751)
- - Fix the problem of inserting quantization nodes into model input. [#49926](https://github.com/PaddlePaddle/Paddle/pull/49926)
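
To make the idea of simulating quantization during training concrete, here is a minimal pure-Python sketch of a quantize-then-dequantize ("fake quantization") step with a symmetric abs-max scale. It does not use or reproduce `paddle.quantization`. The function name `fake_quantize_abs_max` and the 8-bit default are assumptions for illustration only.

```python
def fake_quantize_abs_max(values, bits=8):
    """Quantize floats to signed integers and map them back to floats.

    Hypothetical helper for illustration; not part of paddle.quantization.
    Returns the simulated (dequantized) values and the scale that was used.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    abs_max = max(abs(v) for v in values)
    if abs_max == 0.0:
        return list(values), 1.0  # all zeros: nothing to quantize
    scale = abs_max / qmax
    # Round to the nearest integer level and clip to the symmetric range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized], scale


weights = [0.02, -0.5, 0.731, 1.27]
simulated, scale = fake_quantize_abs_max(weights)
max_error = max(abs(a - b) for a, b in zip(weights, simulated))
```

With this scheme the rounding error of each value is bounded by half of `scale`, which is the error a model sees during quantization-aware training in such a setup.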
-
-## 5. Environment Adaptation
-Improve the efficiency of source code compilation, and promote the setuptools + ninja compilation method to increase development efficiency. In CPU scenarios, full compilation time is reduced by 20 min, and compilation speed is increased by 24.52%. In GPU scenarios, full compilation time is reduced by 22 min, and compilation speed is increased by 29.31%. To adapt to mainstream development environments, PaddlePaddle supports gcc12 compilation and C++17 in the source code, and adapts to the latest CUDA12. In terms of code quality, compilation warnings are completely cleaned up to improve the compilation experience. At the third-party dependency level, the underlying protobuf version is upgraded to reduce dependencies, deprecated attributes of some earlier dependency library versions and old code formats are cleaned up, and support for Python 2.x is removed.
-- ninja compilation adaptation to improve compilation speed. [#52433](https://github.com/PaddlePaddle/Paddle/pull/52433),[#48932](https://github.com/PaddlePaddle/Paddle/pull/48932),[#49420](https://github.com/PaddlePaddle/Paddle/pull/49420),[#48435](https://github.com/PaddlePaddle/Paddle/pull/48435),[#49303](https://github.com/PaddlePaddle/Paddle/pull/49303),[#49448](https://github.com/PaddlePaddle/Paddle/pull/49448),[#49838](https://github.com/PaddlePaddle/Paddle/pull/49838),[#50067](https://github.com/PaddlePaddle/Paddle/pull/50067),[#52796](https://github.com/PaddlePaddle/Paddle/pull/52796),[#50431](https://github.com/PaddlePaddle/Paddle/pull/50431),[#49181](https://github.com/PaddlePaddle/Paddle/pull/49181),[#48867](https://github.com/PaddlePaddle/Paddle/pull/48867),[#48490](https://github.com/PaddlePaddle/Paddle/pull/48490),[#48211](https://github.com/PaddlePaddle/Paddle/pull/48211),[#49499](https://github.com/PaddlePaddle/Paddle/pull/49499),[#53076](https://github.com/PaddlePaddle/Paddle/pull/53076)
-- setuptools compilation and package all-in-one adaptation. [#48770](https://github.com/PaddlePaddle/Paddle/pull/48770),[#46957](https://github.com/PaddlePaddle/Paddle/pull/46957),[#49583](https://github.com/PaddlePaddle/Paddle/pull/49583),[#47602](https://github.com/PaddlePaddle/Paddle/pull/47602),[#48301](https://github.com/PaddlePaddle/Paddle/pull/48301),[#50800](https://github.com/PaddlePaddle/Paddle/pull/50800),[#42575](https://github.com/PaddlePaddle/Paddle/pull/42575),[#49826](https://github.com/PaddlePaddle/Paddle/pull/49826),[#49002](https://github.com/PaddlePaddle/Paddle/pull/49002),[#51443](https://github.com/PaddlePaddle/Paddle/pull/51443),[#51528](https://github.com/PaddlePaddle/Paddle/pull/51528),[#52621](https://github.com/PaddlePaddle/Paddle/pull/52621),[#52465](https://github.com/PaddlePaddle/Paddle/pull/52465)
-- gcc12 support. [#52960](https://github.com/PaddlePaddle/Paddle/pull/52960),[#52265](https://github.com/PaddlePaddle/Paddle/pull/52265),[#46546](https://github.com/PaddlePaddle/Paddle/pull/46546),[#52318](https://github.com/PaddlePaddle/Paddle/pull/52318),[#46808](https://github.com/PaddlePaddle/Paddle/pull/46808),[#47466](https://github.com/PaddlePaddle/Paddle/pull/47466),[#52083](https://github.com/PaddlePaddle/Paddle/pull/52083),[#48176](https://github.com/PaddlePaddle/Paddle/pull/48176),[#49423](https://github.com/PaddlePaddle/Paddle/pull/49423),[#49452](https://github.com/PaddlePaddle/Paddle/pull/49452),[#51037](https://github.com/PaddlePaddle/Paddle/pull/51037),[#52007](https://github.com/PaddlePaddle/Paddle/pull/52007),[#52441](https://github.com/PaddlePaddle/Paddle/pull/52441),[#52085](https://github.com/PaddlePaddle/Paddle/pull/52085),[#50817](https://github.com/PaddlePaddle/Paddle/pull/50817),[#52646](https://github.com/PaddlePaddle/Paddle/pull/52646),[#50777](https://github.com/PaddlePaddle/Paddle/pull/50777),[#53288](https://github.com/PaddlePaddle/Paddle/pull/53288),[#54009](https://github.com/PaddlePaddle/Paddle/pull/54009)
-- c++17 standard support. [#53345](https://github.com/PaddlePaddle/Paddle/pull/53345),[#53892](https://github.com/PaddlePaddle/Paddle/pull/53892),[#54282](https://github.com/PaddlePaddle/Paddle/pull/54282),[#49017](https://github.com/PaddlePaddle/Paddle/pull/49017),[#47635](https://github.com/PaddlePaddle/Paddle/pull/47635),[#54258](https://github.com/PaddlePaddle/Paddle/pull/54258)
-- cuda12 support. [#52285](https://github.com/PaddlePaddle/Paddle/pull/52285),[#49592](https://github.com/PaddlePaddle/Paddle/pull/49592),[#52232](https://github.com/PaddlePaddle/Paddle/pull/52232),[#52654](https://github.com/PaddlePaddle/Paddle/pull/52654),[#54641](https://github.com/PaddlePaddle/Paddle/pull/54641)
-- CodeStyle. [#45909](https://github.com/PaddlePaddle/Paddle/pull/45909),[#47772](https://github.com/PaddlePaddle/Paddle/pull/47772),[#48538](https://github.com/PaddlePaddle/Paddle/pull/48538),[#49522](https://github.com/PaddlePaddle/Paddle/pull/49522),[#47264](https://github.com/PaddlePaddle/Paddle/pull/47264),[#49558](https://github.com/PaddlePaddle/Paddle/pull/49558)
-- Compilation Warning is removed. [#47163](https://github.com/PaddlePaddle/Paddle/pull/47163),[#47216](https://github.com/PaddlePaddle/Paddle/pull/47216),[#47309](https://github.com/PaddlePaddle/Paddle/pull/47309),[#47252](https://github.com/PaddlePaddle/Paddle/pull/47252),[#47341](https://github.com/PaddlePaddle/Paddle/pull/47341),[#47399](https://github.com/PaddlePaddle/Paddle/pull/47399),[#47513](https://github.com/PaddlePaddle/Paddle/pull/47513),[#47558](https://github.com/PaddlePaddle/Paddle/pull/47558),[#47706](https://github.com/PaddlePaddle/Paddle/pull/47706),[#52717](https://github.com/PaddlePaddle/Paddle/pull/52717),[#51203](https://github.com/PaddlePaddle/Paddle/pull/51203),[#51336](https://github.com/PaddlePaddle/Paddle/pull/51336),[#51608](https://github.com/PaddlePaddle/Paddle/pull/51608),[#51633](https://github.com/PaddlePaddle/Paddle/pull/51633),[#46644](https://github.com/PaddlePaddle/Paddle/pull/46644),[#53092](https://github.com/PaddlePaddle/Paddle/pull/53092),[#53185](https://github.com/PaddlePaddle/Paddle/pull/53185),[#53246](https://github.com/PaddlePaddle/Paddle/pull/53246),[#53650](https://github.com/PaddlePaddle/Paddle/pull/53650),[#53683](https://github.com/PaddlePaddle/Paddle/pull/53683),[#53687](https://github.com/PaddlePaddle/Paddle/pull/53687),[#53886](https://github.com/PaddlePaddle/Paddle/pull/53886),[#53689](https://github.com/PaddlePaddle/Paddle/pull/53689),[#53679](https://github.com/PaddlePaddle/Paddle/pull/53679),[#53681](https://github.com/PaddlePaddle/Paddle/pull/53681),[#53532](https://github.com/PaddlePaddle/Paddle/pull/53532),[#47137](https://github.com/PaddlePaddle/Paddle/pull/47137),[#47045](https://github.com/PaddlePaddle/Paddle/pull/47045),[#52186](https://github.com/PaddlePaddle/Paddle/pull/52186),[#52490](https://github.com/PaddlePaddle/Paddle/pull/52490),[#53924](https://github.com/PaddlePaddle/Paddle/pull/53924),[#53938](https://github.com/PaddlePaddle/Paddle/pull/53938),[#53945](https://github.com/PaddlePaddle/Paddle/pull/53945),[#53851](https://github.com/PaddlePaddle/Paddle/pull/53851),[#53847](https://github.com/PaddlePaddle/Paddle/pull/53847),[#53818](https://github.com/PaddlePaddle/Paddle/pull/53818),[#53931](https://github.com/PaddlePaddle/Paddle/pull/53931)
-- Support protobuf upgrade. [#49875](https://github.com/PaddlePaddle/Paddle/pull/49875),[#48495](https://github.com/PaddlePaddle/Paddle/pull/48495),[#49673](https://github.com/PaddlePaddle/Paddle/pull/49673),[#52499](https://github.com/PaddlePaddle/Paddle/pull/52499),[#51161](https://github.com/PaddlePaddle/Paddle/pull/51161),[#49168](https://github.com/PaddlePaddle/Paddle/pull/49168)
-- Support offline compilation of third-party libraries. [#54326](https://github.com/PaddlePaddle/Paddle/pull/54326),[#54370](https://github.com/PaddlePaddle/Paddle/pull/54370),[#54335](https://github.com/PaddlePaddle/Paddle/pull/54335),[#54346](https://github.com/PaddlePaddle/Paddle/pull/54346),[#53744](https://github.com/PaddlePaddle/Paddle/pull/53744),[#54319](https://github.com/PaddlePaddle/Paddle/pull/54319),[#53915](https://github.com/PaddlePaddle/Paddle/pull/53915)
-- Phi independent compilation header file dependency decoupling. [#50456](https://github.com/PaddlePaddle/Paddle/pull/50456),[#47088](https://github.com/PaddlePaddle/Paddle/pull/47088),[#52573](https://github.com/PaddlePaddle/Paddle/pull/52573),[#52651](https://github.com/PaddlePaddle/Paddle/pull/52651)
-- Python2.x decommissioning. [#48685](https://github.com/PaddlePaddle/Paddle/pull/48685)
-
-## 6. Security
-- Fix bugs such as null pointer usage, illegal address access, memory out of bounds, divide by 0, and Python IndexError. [PR49976](https://github.com/PaddlePaddle/Paddle/pull/49976), [PR49993](https://github.com/PaddlePaddle/Paddle/pull/49993), [PR49942](https://github.com/PaddlePaddle/Paddle/pull/49942), [PR49965](https://github.com/PaddlePaddle/Paddle/pull/49965), [PR50000](https://github.com/PaddlePaddle/Paddle/pull/50000), [PR50005](https://github.com/PaddlePaddle/Paddle/pull/50005), [PR49953](https://github.com/PaddlePaddle/Paddle/pull/49953), [PR49995](https://github.com/PaddlePaddle/Paddle/pull/49995), [PR49974](https://github.com/PaddlePaddle/Paddle/pull/49974), [PR50015](https://github.com/PaddlePaddle/Paddle/pull/50015), [PR50010](https://github.com/PaddlePaddle/Paddle/pull/50010), [PR49979](https://github.com/PaddlePaddle/Paddle/pull/49979), [PR49994](https://github.com/PaddlePaddle/Paddle/pull/49994), [PR49977](https://github.com/PaddlePaddle/Paddle/pull/49977), [PR49968](https://github.com/PaddlePaddle/Paddle/pull/49968), [PR49984](https://github.com/PaddlePaddle/Paddle/pull/49984), [PR49958](https://github.com/PaddlePaddle/Paddle/pull/49958), [PR50008](https://github.com/PaddlePaddle/Paddle/pull/50008), [PR51714](https://github.com/PaddlePaddle/Paddle/pull/51714), [PR51847](https://github.com/PaddlePaddle/Paddle/pull/51847), [PR51034](https://github.com/PaddlePaddle/Paddle/pull/51034), [PR51088](https://github.com/PaddlePaddle/Paddle/pull/51088), [PR51091](https://github.com/PaddlePaddle/Paddle/pull/51091), [PR51092](https://github.com/PaddlePaddle/Paddle/pull/51092), [PR49966](https://github.com/PaddlePaddle/Paddle/pull/49966), [PR49656](https://github.com/PaddlePaddle/Paddle/pull/49656), [PR52161](https://github.com/PaddlePaddle/Paddle/pull/52161), [PR49548](https://github.com/PaddlePaddle/Paddle/pull/49548), [PR49546](https://github.com/PaddlePaddle/Paddle/pull/49546), [PR49547](https://github.com/PaddlePaddle/Paddle/pull/49547), [PR49549](https://github.com/PaddlePaddle/Paddle/pull/49549), [PR51850](https://github.com/PaddlePaddle/Paddle/pull/51850)
-
-## Thanks to our Contributors
-This release contains contributions from:
-1want2sleep, 201716010711, 404988613, 5u13, 6clc, Ackeraa, Aganlengzi, ahahahahahaha, Ainavo, Allen Guo, andyj, Asthestarsfalll, Aurelius84, Ayuan, BellaZYL, Bjmw3, Bo Zhang, bukejiyu, caozhou, carryyu, Ccc, ccrrong, ceci3, chalsliu, Chang Xu, CHANGer, Charles-hit, Chen Weihang, chenjian, Chenxiao Niu, chenxiao120660, chenxujun, Chitsing KUI, cifar10, co63oc, CollaborativeFiltering, csy0225, cxxly, cyber-pioneer, cyberslack_lee, czr-gc, Dandelight, danleifeng, Danyang Zhang, dasen, denglianbin, Difer, dongfangshenzhu, DrowFish19, duanboqiang, duanyanhui, engineer, engineer1109, Epsilon Luoo, feifei-111, Feiyu Chan, Feng Ni, feng_shuai, Fisher, FlyingQianMM, Frank Lin, Galaxy1458, GaoYuYang, gaoziyuan, gem5, GGBond8488, Ghost Screaming, gongenlei, gouzil, Guanghua Yu, Guo Sheng, Guoxia Wang, Hamid Zare, Hanchiao, handiz, Haohongxiang, haosicheng, haozi, Happyd99, heliqi, hellockx, hellolllw, heyanru, hg-1099255210, hh-qiao, hjyp, hong, HongyuJia, houj04, hua-zi, Huang Jiyi, Huang Zhengjie, huangjiyi, huangjun12, Hui Zhang, Huihuang Zheng, Hulek, hwa, HydrogenSulfate, Ikko Eltociear Ashimine, iLeGend, Infinity_lee, Infrared1029, Jacek Czaja, jakpiase, james, jameszhang, Jiabin Yang, jiahongyu, jiangcheng, jiangfan06, Jianghai, jiaqianjing, jingsongliu, JingZhuangzhuang, jjyaoao, joanna.wozna.intel, junxiu777, Jx-qi, JYChen, JZ-LIANG, jzhang533, Kai Song, Kai Xing, Kaipeng Deng, Kang Zhao, kangguangli, Kevin Wu Jiawen , Kim, Kim Yann, knamg, kuizhiqing, lanxianghit, Leding Li, Leo Chen, Leo Guo, levi131, Li Min, Li-fAngyU, Ligoml, lijialin03, lijin23, limingshu, Lin Manhui, LinearTemporalLogic, Linjie Chen, lishicheng1996, Little-chick, littleforest, liu zhengxi, liulinduo, liuruyan, liuzhenhai93, LiYuRio, lj970926, LokeZhou, LoneRanger, lubiu, Lucas, lugimzzz, Lux et Veritas, lxsbupt, LyndonKong, lzy, lzydev, Mahmoud Ashraf, Manan Goel, Maple Xie, Matsumoto Ruko, mayang002, MayYouBeProsperous, megemini, mengziheng, Meteor Liu, mhy, mhy-666, Ming-Xu Huang, ming1753, 
minghaoBD, mjxs, Moqim, Mountagha, Mr.Juice, mrcangye, NetPunk, Netpunk, nihao, niuliling123, Nyakku Shigure, OccupyMars2025, Ouyang Chao, pangengzheng, pangyoki, parap1uie-s, Paulina Gacek, Piotr Paturej, PommesPeter, PPGitub, PPPPzhang, PuQing, Qi Li, Qi Shao, QingshuChen, qipengh, qizhaoaoe, Rayman, RedContritio, RichardWooSJTU, risemeup1, Roc, ronnywang, Ruibiao Chen, Ruibin Cheung, RuohengMa, Ryan, SaltFish11, Sanbu, Scotty, scotty, seemingwang, Shaojie WANG, ShenLiang, shentanyue, Shijie, Shuangchi He, Siming Dai, Sing_chan, sneaxiy, Sonder, sprouteer, Sqhttwl, sunli, superwinner1, supplyout, SylarTiaNII, Sylwester Fraczek, Sławomir Siwek, taixiurong, Tao Luo, Taylor-Layrose, TeFeng Chen, Thomas Young, thunder95, Thunderbrook, Tian, Tian Zheng, tiancaishaonvjituizi, tianshuo78520a, tifa, Tinson Lai, Tomasz Socha, Tony Cao, ucsk, umiswing, ustiniankw, Vegetable dog, Vigi Zhang, Vvsmile, Wang Bojun, Wang Xin, Wang Xinyu, wangfengsheng1999, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, wangshengxiang, wangxiaoning, wangxinxin08, Wangzheee, WangZhen, wangzhen38, wasupandceacar, wawltor, Wei Shengyu, Weilong Wu, weishengying, Wen Sun, wenbin, wentao yu, wenzhe.wang, westfish, whisky-12, whs, Wilber, will-jl944, winter-wang, Winters Montagne, WJJ1995, wuhuachaocoding, wuyefeilin, wz1qqx, XiangGao, xiaoguoguo626807, xiaohemaikoo, xiaoluomi, xiaoting, xiaoxiaohehe001, Xiaoxu Chen, xiaoyuanzi914, Xinger, Xinyu Chen, xiongkun, xjmxyt, xu98bin, xysheng-baidu, yangguohao, yangjianfengo1, YangQun, YangZhou, yeliang2258, YepKong, Yichen Zhang, yikaikkk, Yiqun Liu, yjphhw, ykkk2333, Young-Flash, yu wentao, Yuang Liu, Yuanle Liu, YuanRisheng, yuchen202, yuehuayingxueluo, YuhangLi, Yulong Ao, YUNSHEN XIE, yunyaoXYY, YuRonan, zachary sun, ZeKai Zhou, Zenghui Yuan, zengshao0622, Zero Rains, Zhan Rongrui, Zhang Jun, Zhang Na, Zhang Ting, Zhang Zheng, zhangbo9674, ZhangDY-6483, zhangkaihuo, zhangxin81, zhangyikun02, zhangyingying520, zhangyuqin1998, zhaocaibei123, 
zhaoyingli, Zhen Wang, Zheng-Bicheng, Zhenghai Zhang, Zheng_Bicheng, zhenyun, Zhibao Li, zhiboniu, Zhong Hui, Zhou Wei, ZhouMengLei1999, zhoutianzi666, zhouzj, zhupengyang, zhurou603, zhuyipin, zhwesky2010, ziyoujiyi, zlsh80826, Zman, zmxdream, zqw_1997, Zuza Gawrysiak, zxcd, zyfncg, ZZK, zzk0, Ding Yi, Fu Jianhan, Liu Ge Gu Tou, Lu Lin, Zhou Zhouzhou, Jiang Yongyong, Xue Zhawu, Zhang Chunqiao, Zhang Zhenghai, Ning Meng Wei, Wang Mingdong, Shi Xiaowei, Chao Ji Ma Niu, Chen Cangye, Qi Ma Xiao Mao
-
-# 2.4.2 Release Note
-
- V2.4.2 fixes known bugs and adds a small number of features.
-
-## Training Framework (distributed included)
-
- - Fix the error raised when the paddle.utils.dlpack.to_dlpack API creates dlpack objects multiple times in a for loop, and fix the reference counting error that caused the memory actually pointed to by dlpack to be destructed unexpectedly. [#50138](https://github.com/PaddlePaddle/Paddle/pull/50138)
- - Fixed the issue of out-of-bounds memory access when the input tensor is multi-dimensional in paddle.multiplex API. [#49368](https://github.com/PaddlePaddle/Paddle/pull/49368)
- - Fix the occasional compilation error caused by incorrect referencing of the Eigen header file. [#48157](https://github.com/PaddlePaddle/Paddle/pull/48157)
- - Fixed the bug that the output value of the backward operator may be None when the output gradient parameter order of the custom operator is not continuous. [#48656](https://github.com/PaddlePaddle/Paddle/pull/48656)
- - Add cutlass and implement the fusion kernel of gather+gemm+scatter; Optimize training and inference performance of sparse convolution; Optimize inference performance of batch_norm under 1D input data. [#50118](https://github.com/PaddlePaddle/Paddle/pull/50118)
- - Fix compilation failure in gcc54 environment caused by using constexpr. [#50421](https://github.com/PaddlePaddle/Paddle/pull/50421)
- - Move the sum op kernel to PHI and fix the bug that the correct dims of SelectedRows cannot be obtained when running infermeta. [#49342](https://github.com/PaddlePaddle/Paddle/pull/49342)
- - Fix the issue that the fold operator accesses memory out of bounds with large batch size input. [#49491](https://github.com/PaddlePaddle/Paddle/pull/49491)
- - Fix the problem that a Layer without parameters cannot call backward in dynamic-to-static mode. [#49812](https://github.com/PaddlePaddle/Paddle/pull/49812)
- - Fix the compilation problem with CUDA 11.8 on the Windows platform. [#50205](https://github.com/PaddlePaddle/Paddle/pull/50205)
- - Fix the unsupported error for `FusedDropoutActBiasGrad` on H100. [#47285](https://github.com/PaddlePaddle/Paddle/pull/47285)
- - Add the `debug_graphviz_path` option to `build_strategy`. [#46531](https://github.com/PaddlePaddle/Paddle/pull/46531)
- - Fix a `popen` object that was not closed. [#47053](https://github.com/PaddlePaddle/Paddle/pull/47053)
-
-## Deployment Direction (Paddle Inference)
-
- - Improve the functionality and stability of mixed-precision inference. Refactor the implementation of the convert_to_mixed_precision interface and add the precision parameter to the enable_use_gpu interface. [#49077](https://github.com/PaddlePaddle/Paddle/pull/49077), [#49239](https://github.com/PaddlePaddle/Paddle/pull/49239), [#49477](https://github.com/PaddlePaddle/Paddle/pull/49477)
- - Support compilation under the Jetson Ampere architecture. [#49364](https://github.com/PaddlePaddle/Paddle/pull/49364)
- - Fixed fc kernel diff. [#49781](https://github.com/PaddlePaddle/Paddle/pull/49781)
- - Fixed the error of trt workspace parameter type under CAPI. [#48350](https://github.com/PaddlePaddle/Paddle/pull/48350)
- - Fixed the error caused by arg_max/arg_min without flatten dtype parameter in Paddle 1.x version. [#49771](https://github.com/PaddlePaddle/Paddle/pull/49771)
- - Fixed the bug of missing information about lod logic after split infermeta's refactoring. [#49745](https://github.com/PaddlePaddle/Paddle/pull/49745)
- - Fixed the bug of the constant-folding pass, which causes the conv2d weight to be non-persistent after folding and not enter the TensorRT engine. [#50105](https://github.com/PaddlePaddle/Paddle/pull/50105)
-
-# 2.4.1 Release Note
-
-
-Remove Paddle's dependency on python.so, and fix the failure to run when python.so cannot be found in certain environments, including conda.
-
-
-# 2.4.0 Release Note
-
-## 1. Important Updates
-
-- **New dynamic graph architecture is officially in effect**: The new dynamic graph framework significantly improves scheduling performance. The scheduling performance of more than 90% of APIs is improved by over 50%, and the model performance of more than 50% of kits is improved by over 5%. The functional architecture is clearer, and secondary development capability and experience are significantly enhanced.
-
-- **Comprehensive improvement of PaddlePaddle's dynamic-static unification**: Dynamic-to-static conversion supports richer Python syntax, and PaddlePaddle's Python syntax coverage reaches 90%. The syntax transcription logic is optimized to fully support control flow syntax, providing a smooth one-click dynamic-to-static experience. With the newly upgraded static graph executor, dynamic-to-static training has better acceleration, and tests on key models show performance close to the best level of static graphs. Dynamic-to-static scalability is improved with new support for multi-function merged export and inference. Users can use the PHI operator library for secondary development and flexible deployment, which effectively supports custom decoding for U2++ featured models in the speech domain.
-
-- **Add sparse computing APIs**: Add 55 sparse APIs under `paddle.sparse.*` covering mainstream sparse computing scenarios. The APIs have been applied to sparse training and inference deployment for 3D point cloud object detection, Sparse Transformers, and other tasks, with a speedup of 105.75% compared to DenseTensor in high-sparsity scenarios. Compared with similar products, sparse computing is 4.01%-58.55% faster. Computation on multiple sparse Tensor formats (SparseCoo and SparseCsr) is supported to minimize GPU memory usage, while the usage stays consistent with the dense Tensor API.
-
-- **Large-scale graph neural network GPU training engine**: Using heterogeneous hierarchical storage across SSD, host memory, and GPU memory, it breaks through the GPU memory bottleneck and supports all-GPU storage and training of super-large-scale graphs. It provides an all-GPU integrated solution for walking, sampling, and training, which can increase training speed by more than 10x at the same cost compared to the traditional distributed CPU solution.
-
-- **Environment adaptation**: Add a pre-compiled installation package for CUDA 11.7, and add support for running on Ubuntu 22.04 and later.
-
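To make the two sparse formats named above concrete, here is a minimal pure-Python sketch of how a dense matrix maps to COO (coordinate) and CSR (compressed sparse row) storage. It does not use or reproduce the `paddle.sparse` API; the helper names `dense_to_coo` and `dense_to_csr` are hypothetical and exist only for illustration.

```python
def dense_to_coo(dense):
    """Return (rows, cols, values) for the non-zero entries, in row-major order."""
    rows, cols, values = [], [], []
    for r, row in enumerate(dense):
        for c, v in enumerate(row):
            if v != 0:
                rows.append(r)
                cols.append(c)
                values.append(v)
    return rows, cols, values


def dense_to_csr(dense):
    """Return (crows, cols, values); row i owns values[crows[i]:crows[i + 1]]."""
    crows, cols, values = [0], [], []
    for row in dense:
        for c, v in enumerate(row):
            if v != 0:
                cols.append(c)
                values.append(v)
        # After each row, record how many non-zeros have been stored so far.
        crows.append(len(values))
    return crows, cols, values


dense = [
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 3.0],
]
coo = dense_to_coo(dense)
csr = dense_to_csr(dense)
```

Both formats store only the three non-zero values. COO keeps one (row, column) pair per value, while CSR replaces the per-value row index with one offset per row, which is usually more compact when rows hold many non-zeros.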
-### Forward-looking forecast
-
-- The PaddlePaddle framework will deprecate support for Python 3.6 in version 2.5.
-- The PaddlePaddle framework will gradually deprecate the APIs under the `paddle.fluid` namespace on the Python side, and some of the APIs under this namespace will be removed directly in version 2.5.
-
-## 2. Incompatible upgrades
-
-- The pre-compiled installation package for CUDA 10.1 is no longer provided.
-- The `Tensor.clear_gradient(bool set_to_zero)` interface no longer accepts `set_to_zero` as a keyword argument; the bool value must be passed as a positional argument.
-- To improve GPU memory utilization, the dynamic graph by default retains only the gradients of forward leaf node variables, such as the gradients of network parameters in training, and no longer retains the gradients of non-leaf nodes. If you need to keep the gradient of a specific Tensor, call the `Tensor.retain_grads()` interface before running the backward pass.
-- `paddle.autograd.PyLayer` no longer supports tuple inputs; pass a list of Tensors if you need to pass a group of them.
-
-## 3. Training framework (including the distributed feature)
-
-### (1) New APIs and enhanced API functions
-- **Add the sparse computing API:** paddle.sparse
- - Add 55 sparse APIs covering mainstream sparse computing scenarios. The APIs have been applied to sparse training and inference deployment for 3D point cloud object detection, Sparse Transformers, and other tasks, with a speedup of 105.75% compared to DenseTensor in high-sparsity scenarios. Compared with similar products, sparse computing is 4.01%-58.55% faster. Computation on multiple sparse Tensor formats (SparseCoo and SparseCsr) is supported to minimize GPU memory usage, while the usage stays consistent with the dense Tensor API. [#45849](https://github.com/PaddlePaddle/Paddle/pull/45849), [#46694](https://github.com/PaddlePaddle/Paddle/pull/46694), [#45086](https://github.com/PaddlePaddle/Paddle/pull/45086), [#41857](https://github.com/PaddlePaddle/Paddle/pull/41857), [#42935](https://github.com/PaddlePaddle/Paddle/pull/42935), [#43475](https://github.com/PaddlePaddle/Paddle/pull/43475), [#43668](https://github.com/PaddlePaddle/Paddle/pull/43668), [#43966](https://github.com/PaddlePaddle/Paddle/pull/43966), [#44022](https://github.com/PaddlePaddle/Paddle/pull/44022), [#44346](https://github.com/PaddlePaddle/Paddle/pull/44346), [#44432](https://github.com/PaddlePaddle/Paddle/pull/44432), [#44451](https://github.com/PaddlePaddle/Paddle/pull/44451), [#44743](https://github.com/PaddlePaddle/Paddle/pull/44743), [#42013](https://github.com/PaddlePaddle/Paddle/pull/42013), [#43520](https://github.com/PaddlePaddle/Paddle/pull/43520), [#41434](https://github.com/PaddlePaddle/Paddle/pull/41434), [#42130](https://github.com/PaddlePaddle/Paddle/pull/42130), [#41276](https://github.com/PaddlePaddle/Paddle/pull/41276), [#41857](https://github.com/PaddlePaddle/Paddle/pull/41857), [#41356](https://github.com/PaddlePaddle/Paddle/pull/41356)
-- **Add the audio field API:** paddle.audio
- - Add the feature extraction APIs such as MFCC, Spectrogram, and LogMelSpectrogram. Support the GPU computing. The performance increases by more than 15x compared to the CPU. This can significantly improve the GPU utilization in speech model training.[#45424](https://github.com/PaddlePaddle/Paddle/pull/45424)
- - Add basic feature extraction APIs such as Window Function and Discrete Cosine Transform, making it easier for users to customize speech feature extraction. [#45424](https://github.com/PaddlePaddle/Paddle/pull/45424)
- - Add the speech I/O module. It provides 2 types of audio I/O backend and supports 6 types of codecs for convenient loading of speech data. [#45939](https://github.com/PaddlePaddle/Paddle/pull/45939)
- - Add the TESS and ESC50 speech classification datasets, making it convenient for users to build classical speech classification models. [#45939](https://github.com/PaddlePaddle/Paddle/pull/45939)
-- **Add the graph learning domain API:** paddle.geometric
- - Graph learning is gradually becoming a key technology in the field of machine learning. The new paddle.geometric module of PaddlePaddle provides a better modeling and training development experience of graph learning.
- - Message passing: The message passing mechanism of the graph learning is the basis of graph modeling. We add 7 graph learning message passing APIs to make it more convenient to complete the modeling of the graph learning. Among them, 3 newly added message passing fusion operators can significantly reduce the GPU memory consumption in the GNN model training. In the dense graph scenarios, more than 50% of GPU memory can be saved in the models of GCN series, and the training speed can increase by more than 20%.[#44848](https://github.com/PaddlePaddle/Paddle/pull/44848), [#44580](https://github.com/PaddlePaddle/Paddle/pull/44580), [#43174](https://github.com/PaddlePaddle/Paddle/pull/43174), [#44970](https://github.com/PaddlePaddle/Paddle/pull/44970)
- - Graph sampling: Graph sampling is the performance bottleneck of GNN model training. This newly added high-performance graph sampling operator supports high concurrent graph sampling. It can increase the sampling speed of GraphSage by more than 32 times and the model training speed by more than 12 times.[#44970](https://github.com/PaddlePaddle/Paddle/pull/44970)
-- **Add the vision domain API**
- - Add object detection operators to paddle.vision ([#43736](https://github.com/PaddlePaddle/Paddle/pull/43736)): paddle.vision.generate_proposals([#43611](https://github.com/PaddlePaddle/Paddle/pull/43611)), paddle.vision.matrix_nms([#44357](https://github.com/PaddlePaddle/Paddle/pull/44357)), paddle.vision.prior_box and paddle.vision.box_coder([#47282](https://github.com/PaddlePaddle/Paddle/pull/47282)).
-
-- **Add other APIs**
- - Add the iinfo([#45321](https://github.com/PaddlePaddle/Paddle/pull/45321)), count_nonzero([#44169](https://github.com/PaddlePaddle/Paddle/pull/44169)), nanmedian([#42385](https://github.com/PaddlePaddle/Paddle/pull/42385)), remainder\_ ([#45266](https://github.com/PaddlePaddle/Paddle/pull/45266)), take([#44741](https://github.com/PaddlePaddle/Paddle/pull/44741)), triu_indices([#45168](https://github.com/PaddlePaddle/Paddle/pull/45168)), sgn([#44568](https://github.com/PaddlePaddle/Paddle/pull/44568)), bucketize([#44195](https://github.com/PaddlePaddle/Paddle/pull/44195)), nanquantile([#41343](https://github.com/PaddlePaddle/Paddle/pull/41343)), frac([#41226](https://github.com/PaddlePaddle/Paddle/pull/41226)), logcumsumexp([#42267](https://github.com/PaddlePaddle/Paddle/pull/42267)), pairwise_distance([#44161](https://github.com/PaddlePaddle/Paddle/pull/44161)), heaviside([#41872](https://github.com/PaddlePaddle/Paddle/pull/41872)), logspace([#41261](https://github.com/PaddlePaddle/Paddle/pull/41261)), corrcoef([#40690](https://github.com/PaddlePaddle/Paddle/pull/40690))
- - Add the RReLU([#41823](https://github.com/PaddlePaddle/Paddle/pull/41823)), CyclicLR([#40698](https://github.com/PaddlePaddle/Paddle/pull/40698)), OneCycleLR([#41825](https://github.com/PaddlePaddle/Paddle/pull/41825)), Softmax2D([#40910](https://github.com/PaddlePaddle/Paddle/pull/40910)), SoftMarginLoss([#42364](https://github.com/PaddlePaddle/Paddle/pull/42364)), MultiLabelSoftMarginLoss([#41183](https://github.com/PaddlePaddle/Paddle/pull/41183)), TripletMarginLoss([#40487](https://github.com/PaddlePaddle/Paddle/pull/40487)), TripletMarginWithDistanceLoss([#40545](https://github.com/PaddlePaddle/Paddle/pull/40545)), CosineEmbeddingLoss and cosine_embedding_loss([#41680](https://github.com/PaddlePaddle/Paddle/pull/41680)), PixelUnshuffle([#40728](https://github.com/PaddlePaddle/Paddle/pull/40728)), ChannelShuffle([#40743](https://github.com/PaddlePaddle/Paddle/pull/40743))
-- **Enhanced API functions**
- - Add support for large batch_size computation in BatchNorm1D [#43072](https://github.com/PaddlePaddle/Paddle/pull/43072)
-- **Optimize the collective communications distributed training API**
- - Optimize the `fleet.init` function, and add the `log_level` parameter so that users can view logs during operation [#45909](https://github.com/PaddlePaddle/Paddle/pull/45909)
- - Add the `paddle.distributed.fleet.recompute_sequential` and `paddle.distributed.fleet.recompute_hybrid` interfaces, making it convenient for users to use the recompute function [#45348](https://github.com/PaddlePaddle/Paddle/pull/45348)
- - Add the `paddle.distributed.fleet.layers.mpu` package, making it convenient for users to use the tensor parallel function [#45803](https://github.com/PaddlePaddle/Paddle/pull/45803)
- - Add the communication APIs `paddle.distributed.destroy_process_group`, `paddle.distributed.isend`, `paddle.distributed.irecv`, and `paddle.distributed.all_to_all_single`, improving the completeness and ease of use of communication [#43918](https://github.com/PaddlePaddle/Paddle/pull/43918)
- - Add the `paddle.distributed.stream` package. The performance is increased by 5% to 10% compared to the base version[#46023](https://github.com/PaddlePaddle/Paddle/pull/46023) [#45282](https://github.com/PaddlePaddle/Paddle/pull/45282)
- - The communication API is added with the support of multiple data types such as `Char/Byte/Bool`. It improves the completeness and ease of use of communication [#45574](https://github.com/PaddlePaddle/Paddle/pull/45574) [#45440](https://github.com/PaddlePaddle/Paddle/pull/45440)
- - The asynchronous parameter of the communication API is changed from `use_calc_stream` to `sync_op`, which makes the semantics of the interface more readable [#46493](https://github.com/PaddlePaddle/Paddle/pull/46493)
-- **Enhanced high-level API**
- - The visual model ResNeXt in the high-level API implements the reuse of the ResNet code for refactoring. [#40588](https://github.com/PaddlePaddle/Paddle/pull/40588)
- - The visual models Inceptionv3, MobileNetv1, MobileNetv2, and ShuffleNetv2 in the high level API are improved.[#40431](https://github.com/PaddlePaddle/Paddle/pull/40431)
-
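The message passing fusion operators mentioned for `paddle.geometric` combine a gather over source nodes with a reduction onto destination nodes. The following pure-Python sketch shows that gather-and-sum pattern on a tiny graph. It is an illustration only, not the Paddle implementation; the function name `gather_scatter_sum` and the sample graph are assumptions made for this example.

```python
def gather_scatter_sum(x, src_index, dst_index):
    """For every edge (src -> dst), add the feature row x[src] into out[dst]."""
    num_nodes, feat_dim = len(x), len(x[0])
    out = [[0.0] * feat_dim for _ in range(num_nodes)]
    for s, d in zip(src_index, dst_index):
        for k in range(feat_dim):
            out[d][k] += x[s][k]
    return out


# Three nodes with 3-dimensional features and four directed edges:
# 0 -> 1, 1 -> 2, 2 -> 1, 0 -> 0
x = [
    [0.0, 2.0, 3.0],
    [1.0, 4.0, 5.0],
    [2.0, 6.0, 7.0],
]
src_index = [0, 1, 2, 0]
dst_index = [1, 2, 1, 0]
out = gather_scatter_sum(x, src_index, dst_index)
```

Node 1 receives messages from nodes 0 and 2, so its output row is the sum of their features. A fused operator avoids materializing the per-edge gathered features, which is where the memory savings described above would plausibly come from.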
-### (2) New functions and important upgrades
-
-- **The new dynamic graph architecture is officially launched**:The scheduling performance of the new dynamic graph framework is greatly improved. Compared with the original architecture, the scheduling performance is significantly enhanced. The scheduling performance of more than 90% APIs is improved by over 50%, and the model performance of more than 50% of kits is improved by over 5%. The new dynamic graph architecture is clear, and the coupling is low. The learning and development costs of extension modules such as Hook and PyLayer are significantly reduced based on the new architecture. [#37550](https://github.com/PaddlePaddle/Paddle/pull/37550) , [#37574](https://github.com/PaddlePaddle/Paddle/pull/37574) , [#37813](https://github.com/PaddlePaddle/Paddle/pull/37813) , [#37926](https://github.com/PaddlePaddle/Paddle/pull/37926) , [#39192](https://github.com/PaddlePaddle/Paddle/pull/39192) , [#37599](https://github.com/PaddlePaddle/Paddle/pull/37599) , [#37406](https://github.com/PaddlePaddle/Paddle/pull/37406) , [#37466](https://github.com/PaddlePaddle/Paddle/pull/37466) , [#37599](https://github.com/PaddlePaddle/Paddle/pull/37599) , [#40945](https://github.com/PaddlePaddle/Paddle/pull/40945) , [#39989](https://github.com/PaddlePaddle/Paddle/pull/39989)
-
-- **Higher-order auto-differentiation mechanism**: To better support scientific computing and other scenarios, the PaddlePaddle framework has further improved and optimized its higher-order auto-differentiation capabilities. The `paddle.incubate.autograd` directory now provides trial functions and APIs for forward/reverse higher-order auto-differentiation (they are currently in incubation, and related functions and API signatures may change). If you intend to implement related models and explore the auto-differentiation mechanism yourself, please read the [usage and limitations of higher-order auto-differentiation](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/incubate/autograd/Overview_cn.html) carefully. Specific upgrades include:
- 1. Static graph higher-order differentiation mechanism upgrade. Through the base operator system and program transformation, it supports higher-order forward and reverse differentiation, with the availability of the compiler and distributed functions.[#41919](https://github.com/PaddlePaddle/Paddle/pull/41919), [#41201](https://github.com/PaddlePaddle/Paddle/pull/41201)
- 2. Add the forward and reverse higher-order auto-differentiation API, `paddle.incubate.autograd.forward_grad`, `paddle.incubate.autograd.grad`. [#43354](https://github.com/PaddlePaddle/Paddle/pull/43354)
- 3. Add 18 higher-order auto-differentiation operators: `sin`, `cos`, `exp`, `erf`, `abs`, `log`, `cast`, `where`, `equal`, `not_equal`, `greater_than`, `greater_equal`, `elementwise_pow`, `square`, `elementwise_max`, `gelu`, `reduce_mean`, `size`. [#46184](https://github.com/PaddlePaddle/Paddle/pull/46184), [#46024](https://github.com/PaddlePaddle/Paddle/pull/46024), [#45888](https://github.com/PaddlePaddle/Paddle/pull/45888), [#45338](https://github.com/PaddlePaddle/Paddle/pull/45338), [#44345](https://github.com/PaddlePaddle/Paddle/pull/44345)
- 4. Fix existing bugs in operators such as `elementwise_div`, `reduce_sum`, `p_norm`. [#46514](https://github.com/PaddlePaddle/Paddle/pull/46514), [#46184](https://github.com/PaddlePaddle/Paddle/pull/46184)
-- **Generic heterogeneous parameter server architecture**:
- - Parameter server GPUGraph infrastructure upgraded to meet the needs of large-scale applications: Storage and training of large-scale graph neural networks on traditional CPUs suffer from high cost, low stability, and poor performance. To overcome these problems, we have built a pure GPU graph training engine (PGLBox). Using heterogeneous hierarchical storage across SSD, host memory, and GPU memory, it supports the training of ultra-large-scale graph models. Training performance is improved by more than 10x compared with the CPU graph training engine at equal cost, and the task failure rate is extremely low. [#44594](https://github.com/PaddlePaddle/Paddle/pull/44594)
- - Large-scale federated parameter server architecture: For large-scale personalized recommendation scenarios, large-scale federated parameter server training is developed on top of the heterogeneous PS infrastructure to support horizontal and vertical federation with hundreds of billions of parameters. It includes two features: user-private parameters are updated locally, and public parameters are updated remotely. Users can flexibly configure the slicing policy for private and public parameters. A new central scheduling node, Coordinator, is added. Users can perform secondary development from the base class to customize the Client selection policy. [#42682](https://github.com/PaddlePaddle/Paddle/pull/42682) , [#44864](https://github.com/PaddlePaddle/Paddle/pull/44864) , [#44327](https://github.com/PaddlePaddle/Paddle/pull/44327)
-- **Adaptive parallel**
- - Design and launch a complete automatic parallelism interface system: Support automatic dynamic-to-static distributed training, automatic distributed data loading, automatic distributed saving and loading, automatic parameter conversion, custom slice marker and custom execution process. Users can easily obtain the automatic distributed training capability based on a single machine networking. It supports data parallel, model parallel, pipeline parallel, and hybrid parallel. [#45776](https://github.com/PaddlePaddle/Paddle/pull/45776) ,[#46552](https://github.com/PaddlePaddle/Paddle/pull/46552) , [#44202](https://github.com/PaddlePaddle/Paddle/pull/44202) , [#45840](https://github.com/PaddlePaddle/Paddle/pull/45840) , [#45518](https://github.com/PaddlePaddle/Paddle/pull/45518) , [#40528](https://github.com/PaddlePaddle/Paddle/pull/40528), [#42838](https://github.com/PaddlePaddle/Paddle/pull/42838), [#43093](https://github.com/PaddlePaddle/Paddle/pull/43093), [#43312](https://github.com/PaddlePaddle/Paddle/pull/43312), [#45053](https://github.com/PaddlePaddle/Paddle/pull/45053).
- - Improve the underlying adaptive parallel mechanism, including the upgrade of the distributed costmodel design and implementation, to provide better evaluation of the slice policy. Add the native distributed properties to ProgramIR and enrich the Cluster functions. [#40457](https://github.com/PaddlePaddle/Paddle/pull/40457) , [#42601](https://github.com/PaddlePaddle/Paddle/pull/42601) , [#42727](https://github.com/PaddlePaddle/Paddle/pull/42727) , [#42874](https://github.com/PaddlePaddle/Paddle/pull/42784) , [#43114](https://github.com/PaddlePaddle/Paddle/pull/43114) , [#44095](https://github.com/PaddlePaddle/Paddle/pull/44095) , [#44146](https://github.com/PaddlePaddle/Paddle/pull/44146) , [#44701](https://github.com/PaddlePaddle/Paddle/pull/44701) , [#44973](https://github.com/PaddlePaddle/Paddle/pull/44973) , [#45002](https://github.com/PaddlePaddle/Paddle/pull/45002) , [#45118](https://github.com/PaddlePaddle/Paddle/pull/45118) , [#45237](https://github.com/PaddlePaddle/Paddle/pull/45237) , [#42576](https://github.com/PaddlePaddle/Paddle/pull/42576) , [#41722](https://github.com/PaddlePaddle/Paddle/pull/41722) , [#44150](https://github.com/PaddlePaddle/Paddle/pull/44150) , [#44989](https://github.com/PaddlePaddle/Paddle/pull/44989), [#44951](https://github.com/PaddlePaddle/Paddle/pull/44951), [#44963](https://github.com/PaddlePaddle/Paddle/pull/44963) .
- - Add the Sharding stage 1/2/3 AutoTuning feature under data parallel. It automatically selects the Sharding stage policy with the highest throughput while ensuring that GPU memory constraints are met. [#43782](https://github.com/PaddlePaddle/Paddle/pull/43782) .
-
-- **Training hardware access - Plug-in solutions**:Add custom Runtime/Kernel/CCL/Graph/Pass solutions. The hardware vendors can choose which modules to implement on-demand based on hardware characteristics.
-
-- **ONNX format export**
- - Support quantized model export. The exported ONNX model can be loaded for inference with TensorRT or ONNXRuntime, giving roughly 1.5x to 4x inference acceleration [#856](https://github.com/PaddlePaddle/Paddle2ONNX/pull/856), [#782](https://github.com/PaddlePaddle/Paddle2ONNX/pull/782)
- - Add export support for large models over 2GB [#942](https://github.com/PaddlePaddle/Paddle2ONNX/pull/942)
-
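As background for the forward-mode differentiation mentioned under the higher-order auto-differentiation item above, the sketch below shows the general idea with dual numbers in plain Python. It computes only a first-order derivative and is not how `paddle.incubate.autograd` is implemented; the `Dual` class and the helpers `sin`, `exp`, and `f` are hypothetical names chosen for this illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Dual:
    """A value paired with its derivative (tangent) with respect to the input."""
    value: float
    tangent: float

    def __add__(self, other):
        return Dual(self.value + other.value, self.tangent + other.tangent)

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(
            self.value * other.value,
            self.tangent * other.value + self.value * other.tangent,
        )


def sin(x):
    # Chain rule: d/dx sin(u) = cos(u) * u'
    return Dual(math.sin(x.value), math.cos(x.value) * x.tangent)


def exp(x):
    # Chain rule: d/dx exp(u) = exp(u) * u'
    e = math.exp(x.value)
    return Dual(e, e * x.tangent)


def f(x):
    return sin(x) * exp(x) + x * x


x0 = 0.5
# Seeding the tangent with 1.0 asks for df/dx at x0.
result = f(Dual(x0, 1.0))
# Analytic derivative of sin(x) * exp(x) + x^2 for comparison.
expected = math.cos(x0) * math.exp(x0) + math.sin(x0) * math.exp(x0) + 2 * x0
```

The derivative is carried forward alongside the value in a single pass, so no backward graph is needed. Higher orders can be obtained by nesting the same construction, which this sketch does not attempt.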
-### (3) Function optimization
-- **Comprehensive increase of dynamic-to-static analysis conversion & extension capabilities**
- - In order to improve the success rate and experience of model dynamic-to-static conversion, the transcription logic of control flow syntax is reconstructed. The core syntax has been upgraded to JIT (just-in-time) paradigm to achieve equivalent transcription with Python codes. The syntax functions such as break, return and continue are improved.[#43666](https://github.com/PaddlePaddle/Paddle/pull/43666) , [#43846](https://github.com/PaddlePaddle/Paddle/pull/43846) , [#43848](https://github.com/PaddlePaddle/Paddle/pull/43848) , [#43880](https://github.com/PaddlePaddle/Paddle/pull/43880) , [#43957](https://github.com/PaddlePaddle/Paddle/pull/43957) , [#43328](https://github.com/PaddlePaddle/Paddle/pull/43328) , [#43348](https://github.com/PaddlePaddle/Paddle/pull/43348) , [#43998](https://github.com/PaddlePaddle/Paddle/pull/43998) , [#44465](https://github.com/PaddlePaddle/Paddle/pull/44465) , [#44504](https://github.com/PaddlePaddle/Paddle/pull/44504) , [#43713](https://github.com/PaddlePaddle/Paddle/pull/43713) , [#43864](https://github.com/PaddlePaddle/Paddle/pull/43864) , [#43967](https://github.com/PaddlePaddle/Paddle/pull/43967) , [#44155](https://github.com/PaddlePaddle/Paddle/pull/44155) , [#44487](https://github.com/PaddlePaddle/Paddle/pull/44487) , [#44527](https://github.com/PaddlePaddle/Paddle/pull/44527) , [#45105](https://github.com/PaddlePaddle/Paddle/pull/45105) , [#45900](https://github.com/PaddlePaddle/Paddle/pull/45900)
- - In order to support the voice custom decoding flexible deployment scenarios, the jit.save/load interface function is extended to support user multi-function merge and export. A new JITLayer component is added to support the invocation of class functions. Meanwhile, the custom inference deployment function is implemented with the PHI operator library C++ API. [#44283](https://github.com/PaddlePaddle/Paddle/pull/44283), [#41783](https://github.com/PaddlePaddle/Paddle/pull/41783), [#43607](https://github.com/PaddlePaddle/Paddle/pull/43607), [#43754](https://github.com/PaddlePaddle/Paddle/pull/43754), [#43758](https://github.com/PaddlePaddle/Paddle/pull/43758), [#43798](https://github.com/PaddlePaddle/Paddle/pull/43798), [#44010](https://github.com/PaddlePaddle/Paddle/pull/44010), [#44351](https://github.com/PaddlePaddle/Paddle/pull/44351), [#44465](https://github.com/PaddlePaddle/Paddle/pull/44465), [#44504](https://github.com/PaddlePaddle/Paddle/pull/44504), [#44597](https://github.com/PaddlePaddle/Paddle/pull/44597), [#44738](https://github.com/PaddlePaddle/Paddle/pull/44738), [#44984](https://github.com/PaddlePaddle/Paddle/pull/44984), [#46249](https://github.com/PaddlePaddle/Paddle/pull/46249)
- - In order to unify API dynamic and static behaviors, 20 operators are upgraded to support variable attribute information of Op in static graphs, to ensure consistent dynamic and static behaviors and improve the success rate of dynamic-to-static conversion of models. Include `pad2d`,`depthwise_conv2d_transpose`,`conv2d_transpose`,`adaptive_avg_pool2d`,`reverse`,`bincount`,`multinomial`,`reduce_sum`,`reduce_mean`,`reduce_prod`,`reduce_min`,`reduce_max`,`uniform`,`squeeze`,`max_unpool2d`,`dropout`,`cumsum`,`eye`,`argmin`,`argmax`. [#44737](https://github.com/PaddlePaddle/Paddle/pull/44737), [#45084](https://github.com/PaddlePaddle/Paddle/pull/45084), [#45189](https://github.com/PaddlePaddle/Paddle/pull/45189), [#45391](https://github.com/PaddlePaddle/Paddle/pull/45391), [#45417](https://github.com/PaddlePaddle/Paddle/pull/45417), [#45427](https://github.com/PaddlePaddle/Paddle/pull/45427), [#45514](https://github.com/PaddlePaddle/Paddle/pull/45514), [#45525](https://github.com/PaddlePaddle/Paddle/pull/45525), [#45543](https://github.com/PaddlePaddle/Paddle/pull/45543), [#45660](https://github.com/PaddlePaddle/Paddle/pull/45660), [#46352](https://github.com/PaddlePaddle/Paddle/pull/46352/), [#46433](https://github.com/PaddlePaddle/Paddle/pull/46433), [#45078](https://github.com/PaddlePaddle/Paddle/pull/45078), [#45342](https://github.com/PaddlePaddle/Paddle/pull/45342), [#45372](https://github.com/PaddlePaddle/Paddle/pull/45372), [#45453](https://github.com/PaddlePaddle/Paddle/pull/45453), [#45522](https://github.com/PaddlePaddle/Paddle/pull/45522), [#45620](https://github.com/PaddlePaddle/Paddle/pull/45620)
- - To fix the occasional loss of the error stack during dynamic-to-static conversion, the logic of the error reporting module is optimized to improve the readability of the error stack and the debugging experience. [#44054](https://github.com/PaddlePaddle/Paddle/pull/44054), [#44083](https://github.com/PaddlePaddle/Paddle/pull/44083), [#44781](https://github.com/PaddlePaddle/Paddle/pull/44781), [#44996](https://github.com/PaddlePaddle/Paddle/pull/44996)
- - Add the TypeHint syntax recognition and transcription module to fully support Python Type Hint syntax. [#47121](https://github.com/PaddlePaddle/Paddle/pull/47121)
-
-- **PHI operator library covers the full amount of arithmetic class operators**:Continuously build the highly reusable operator library PHI. The remaining PaddlePaddle 2.x arithmetic class PythonAPI-associated operators and related kernels are migrated to the PHI operators library and rewritten as functional expression. Add about 180 forward/reverse operator CPU&GPU kernels, and 170 Kunlun-specific arithmetic kernels. This further enhances the kernel function sets that can be reused when new operators are added. In addition, add more than 100 C++ arithmetic class APIs. These APIs can be used in the custom operators, further enhancing the ease of use for external extension development based on the PaddlePaddle. [#44577](https://github.com/PaddlePaddle/Paddle/pull/44577), [#44631](https://github.com/PaddlePaddle/Paddle/pull/44631), [#44434](https://github.com/PaddlePaddle/Paddle/pull/44434), [#44605](https://github.com/PaddlePaddle/Paddle/pull/44605), [#44676](https://github.com/PaddlePaddle/Paddle/pull/44676), [#44742](https://github.com/PaddlePaddle/Paddle/pull/44742), [#44436](https://github.com/PaddlePaddle/Paddle/pull/44436) , [#45887](https://github.com/PaddlePaddle/Paddle/pull/45887), [#45851](https://github.com/PaddlePaddle/Paddle/pull/45851), [#45623](https://github.com/PaddlePaddle/Paddle/pull/45623), [#45397](https://github.com/PaddlePaddle/Paddle/pull/45397), [#45863](https://github.com/PaddlePaddle/Paddle/pull/45863)
-
-- **Normalized operator definitions that significantly simplify models**: The historical operator definitions of PaddlePaddle 1.x contain many redundant parameters that are costly to understand and adapt to. The redundant parameters of about 150 high-frequency operators are cleaned up centrally, and the mathematically irrelevant parameters are largely removed. After this cleanup, the amount of information in inference models stored by PaddlePaddle is significantly reduced. Generally, about 40% of the attribute variables are removed, which makes the PaddlePaddle operator definitions clearer and improves the experience of model analysis and debugging. Meanwhile, the size of inference models stored by PaddlePaddle is reduced by more than 70%, making PaddlePaddle models significantly more lightweight. [#44310](https://github.com/PaddlePaddle/Paddle/pull/44310) , [#45613](https://github.com/PaddlePaddle/Paddle/pull/45613) , [#45684](https://github.com/PaddlePaddle/Paddle/pull/45684) , [#45708](https://github.com/PaddlePaddle/Paddle/pull/45708) , [#45758](https://github.com/PaddlePaddle/Paddle/pull/45758) , [#45786](https://github.com/PaddlePaddle/Paddle/pull/45786) , [#45772](https://github.com/PaddlePaddle/Paddle/pull/45772) , [#45845](https://github.com/PaddlePaddle/Paddle/pull/45845) , [#45984](https://github.com/PaddlePaddle/Paddle/pull/45984) , [#46218](https://github.com/PaddlePaddle/Paddle/pull/46218) , [#46553](https://github.com/PaddlePaddle/Paddle/pull/46553)
-
-### (4) Performance optimization
-
-- AMP performance and accuracy optimization
- - Add FP16 data type support to more operators, including elementwise series operators, compare series operators, strided_slice, set_value, uniform_random, etc. ([#45504](https://github.com/PaddlePaddle/Paddle/pull/45504) [#44405](https://github.com/PaddlePaddle/Paddle/pull/44405) [#45496](https://github.com/PaddlePaddle/Paddle/pull/45496) [#46641](https://github.com/PaddlePaddle/Paddle/pull/46641), [#46906](https://github.com/PaddlePaddle/Paddle/pull/46906) )
- - Optimize the implementation of the hard_swish operator FP16 Kernel to avoid accuracy loss. ( [#35386](https://github.com/PaddlePaddle/Paddle/pull/35386) )
- - More operators are added with the support of BF16 data types, including fused_linear, empty, selu, pow, adam, clip, embedding, gelu, pad3d, pixel_shuffle, tile, where, etc. [#46364](https://github.com/PaddlePaddle/Paddle/pull/46364), [#47177](https://github.com/PaddlePaddle/Paddle/pull/47177)
-- AutoTuning of single machine training performance
- - The Transpose OP supports an automatic Kernel selection mechanism (AutoTuning), which automatically searches for the best Kernel implementation for different model configurations and improves model performance. [#43310](https://github.com/PaddlePaddle/Paddle/pull/43310)
- - AMP Layout auto-switching supports the new dynamic graph mode. For the ResNet50, TSM, and DeepLabV3 models, the performance increases by 9%-21% by Layout AutoTuning in the new dynamic graph. ([#45409](https://github.com/PaddlePaddle/Paddle/pull/45409), [#45751](https://github.com/PaddlePaddle/Paddle/pull/45751), [#45826](https://github.com/PaddlePaddle/Paddle/pull/45826), [#46880](https://github.com/PaddlePaddle/Paddle/pull/46880))
-- Generic performance optimization of GPU single machine training
- - Optimize the Cache scheme of the Conv operator cuDNN algorithm and Cache the results in all algorithm acquisition methods. This can significantly reduce the CPU overhead of the operator.([#41891](https://github.com/PaddlePaddle/Paddle/pull/41891) [#47197](https://github.com/PaddlePaddle/Paddle/pull/47197) )
- - Further optimize the GPU Kernel and Python side performance of multiple operators, including dist, poisson, depthwise_conv2d, transpose, eigh, broadcast computation, reduce computation, layer_norm, cross_entropy, etc. This can achieve better performance in more configuration scenarios. ([#44946](https://github.com/PaddlePaddle/Paddle/pull/44946), [#45057](https://github.com/PaddlePaddle/Paddle/pull/45057), [#45160](https://github.com/PaddlePaddle/Paddle/pull/45160), [#42491](https://github.com/PaddlePaddle/Paddle/pull/42491), [#42704](https://github.com/PaddlePaddle/Paddle/pull/42704), [#42853](https://github.com/PaddlePaddle/Paddle/pull/42853), [#46287](https://github.com/PaddlePaddle/Paddle/pull/46287), [#46362](https://github.com/PaddlePaddle/Paddle/pull/46362), [#46490](https://github.com/PaddlePaddle/Paddle/pull/46490), [#46412](https://github.com/PaddlePaddle/Paddle/pull/46412), [#46623](https://github.com/PaddlePaddle/Paddle/pull/46623), [#40051](https://github.com/PaddlePaddle/Paddle/pull/40051) )
-- Performance optimization of distributed training for collective communications
- - To improve pipeline parallel scheduling efficiency, support the dynamic graph Interleaving1F1B scheduling policy. In the GPT-3 model, the performance is improved by 3%-4%. [#45797](https://github.com/PaddlePaddle/Paddle/pull/45797), [#45869](https://github.com/PaddlePaddle/Paddle/pull/45869), [#45922](https://github.com/PaddlePaddle/Paddle/pull/45922), [#46209](https://github.com/PaddlePaddle/Paddle/pull/46209), [#45402](https://github.com/PaddlePaddle/Paddle/pull/45402), [#45444](https://github.com/PaddlePaddle/Paddle/pull/45444), [#45497](https://github.com/PaddlePaddle/Paddle/pull/45497), [#46399](https://github.com/PaddlePaddle/Paddle/pull/46399), [#46483](https://github.com/PaddlePaddle/Paddle/pull/46483), [#46876](https://github.com/PaddlePaddle/Paddle/pull/46876), [#47242](https://github.com/PaddlePaddle/Paddle/pull/47242), [#47249](https://github.com/PaddlePaddle/Paddle/pull/47249), [#47497](https://github.com/PaddlePaddle/Paddle/pull/47497), [#47517](https://github.com/PaddlePaddle/Paddle/pull/47517)
- - To improve the distributed training performance of the MLPerfBERT model, the DistributedFusedLamb distributed optimizer supports hierarchical AllReduce. It improves MLPerfBERT performance by 17% on 1024 DCU cards. [#44821](https://github.com/PaddlePaddle/Paddle/pull/44821), [#44843](https://github.com/PaddlePaddle/Paddle/pull/44843)
- - To optimize the video memory footprint when using DataParallel, the Buffer Lazy initialization policy for Tensor Fusion is supported, thus reducing the video memory footprint by an amount equal to the number of model parameters. [#45631](https://github.com/PaddlePaddle/Paddle/pull/45631).
- - Distributed parallel policies DataParallel and Sharding support BF16 training. [#46846](https://github.com/PaddlePaddle/Paddle/pull/46846) , [#47246](https://github.com/PaddlePaddle/Paddle/pull/47246)
- - To support the Sequence Parallel policy, Distributed Pipeline Parallel supports the enable_partial_send_recv policy and can transmit tensors that have been sliced by sequence parallelism. [#46992](https://github.com/PaddlePaddle/Paddle/pull/46992), [#47083](https://github.com/PaddlePaddle/Paddle/pull/47083)
- - To improve the performance of sharding stage 2 policy, implement the overlap of sharding stage 2 optimizer broadcast parameters with next step forward and use multi-CUDA Stream for communication. In the GPT 6.7B model, the 16-card training performance is improved by 11%. [#46495](https://github.com/PaddlePaddle/Paddle/pull/46495) , [#46656](https://github.com/PaddlePaddle/Paddle/pull/46656) , [#47061](https://github.com/PaddlePaddle/Paddle/pull/47061)
-
-### (5)Bug fix
-
-- Dynamic-to-static
- - Fix the bug of reporting an error in dynamic-to-static of the model in a Parameter no-gradient scenario during multi-card training. [#44485](https://github.com/PaddlePaddle/Paddle/pull/44485)
- - Fix the bug where redundant frame logs are mistakenly output to the terminal during dynamic-to-static. [#45754](https://github.com/PaddlePaddle/Paddle/pull/45754), [#46800](https://github.com/PaddlePaddle/Paddle/pull/46800)
- - Fix the bug of reporting an error in the dynamic-to-static training when the control flow in the model contains a Tensor that does not require a gradient. [#43034](https://github.com/PaddlePaddle/Paddle/pull/43034)
- - Fix the bug of incorrect computation value during gradient aggregation in the dynamic-to-static training. [#44893](https://github.com/PaddlePaddle/Paddle/pull/44893)
- - Fix the bug of reporting an error in the dynamic-to-static when the function is decorated with @staticmethod. [#44983](https://github.com/PaddlePaddle/Paddle/pull/44983), [#45268](https://github.com/PaddlePaddle/Paddle/pull/45268), [#45277](https://github.com/PaddlePaddle/Paddle/pull/45277)
- - Fix the bug of too much video memory footprint in some scenarios where the model contains the dynamic-to-static training. [#45380](https://github.com/PaddlePaddle/Paddle/pull/45380)
- - Fix the bug of reporting an error of dynamic-to-static shape derivation in the networking phase when the model contains a complex control flow. [#45916](https://github.com/PaddlePaddle/Paddle/pull/45916), [#46020](https://github.com/PaddlePaddle/Paddle/pull/46020)
-- Fix the error report mechanism
- - Replace self.assertTrue(np.allclose(...)) with np.testing.assert_allclose to get fuller error reporting information ( [#44947](https://github.com/PaddlePaddle/Paddle/pull/44947), [#44988](https://github.com/PaddlePaddle/Paddle/pull/44988), [#45213](https://github.com/PaddlePaddle/Paddle/pull/45213))
-- Distributed training in collective communications
- - Fix several bugs in communication library initialization and communication process, and enhance the system operation stability. [#44964](https://github.com/PaddlePaddle/Paddle/pull/44964) [#45100](https://github.com/PaddlePaddle/Paddle/pull/45100) [#44758](https://github.com/PaddlePaddle/Paddle/pull/44758)
- - Fix the bug of frequent occurrences of hang in pipeline parallel, and enhance the ease of use of the policy [#47201](https://github.com/PaddlePaddle/Paddle/pull/47201); enhance the pipeline function to support unbalanced input. [#47199](https://github.com/PaddlePaddle/Paddle/pull/47199)
- - Fix the bug that the performance of the new dynamic graph MP/PP policy is lower than the old dynamic graph. [#47071](https://github.com/PaddlePaddle/Paddle/pull/47071)
- - Fix the bug that the sharding stage 2 policy incorrectly maintains the parameter trainable property. [#47240](https://github.com/PaddlePaddle/Paddle/pull/47240)
- - Fix the bug that occurs in a series of OPs when the tensor numel is greater than INT32_MAX. [#45711](https://github.com/PaddlePaddle/Paddle/pull/45711), [#45741](https://github.com/PaddlePaddle/Paddle/pull/45741), [#45897](https://github.com/PaddlePaddle/Paddle/pull/45897), [#46158](https://github.com/PaddlePaddle/Paddle/pull/46158), [#46767](https://github.com/PaddlePaddle/Paddle/pull/46767), [#47191](https://github.com/PaddlePaddle/Paddle/pull/47191), [#46045](https://github.com/PaddlePaddle/Paddle/pull/46045), [#46160](https://github.com/PaddlePaddle/Paddle/pull/46160)
- - Fix the bug of too much video memory footprint in the FusedAttention and FusedFeedForward OPs. [#47236](https://github.com/PaddlePaddle/Paddle/pull/47236), [#47235](https://github.com/PaddlePaddle/Paddle/pull/47235)
- - Fix the bug of incorrect parameter update in the multi_tensor_adam and multi_tensor_momentum OPs when the parameters passed in are a list of dict. [#47352](https://github.com/PaddlePaddle/Paddle/pull/47352), [#47372](https://github.com/PaddlePaddle/Paddle/pull/47372)
-
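The INT32_MAX fix listed above concerns tensors whose element count no longer fits in a 32-bit signed integer. The following is a minimal, framework-independent sketch of that failure mode, not Paddle's actual kernel code; the helper names `numel` and `as_int32` are hypothetical and exist only for this illustration.

```python
import ctypes
from math import prod

INT32_MAX = 2**31 - 1


def numel(shape):
    """Element count of a tensor with the given shape, as an exact Python int."""
    return prod(shape)


def as_int32(value):
    """Emulate storing `value` in a 32-bit signed integer (wraps on overflow)."""
    return ctypes.c_int32(value).value


shape = (65536, 32768)   # 2**16 * 2**15 = 2**31 elements
n = numel(shape)

print(n > INT32_MAX)     # True: the count exceeds the int32 range
print(as_int32(n))       # -2147483648: a 32-bit index would wrap to a negative value
```

Any index or size arithmetic done in 32 bits on such a tensor silently produces wrong offsets, which is why kernels handling large tensors generally need 64-bit indexing.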
-## 4. Deployment direction (Paddle Inference)
-
-### (1)New features
-
-- Optimize the back-end graph engine integration scheme
- - To reduce Paddle-TensorRT plugin development effort and the number of Paddle-TensorRT subgraphs, and thereby reduce resource usage, a generic plugin mechanism has been developed that automatically provides a unified TensorRT plugin interface for the rich set of Phi operators in the framework. As a result, the video memory footprint can be effectively reduced in most scenarios. [#46970](https://github.com/PaddlePaddle/Paddle/pull/46070), [#46179](https://github.com/PaddlePaddle/Paddle/pull/46179), [#46580](https://github.com/PaddlePaddle/Paddle/pull/46580)
- - To make it easier for users to customize operators in the framework while still letting Paddle-TensorRT perform efficient inference, the function is upgraded to support framework custom Paddle-TensorRT plugins. [#46970](https://github.com/PaddlePaddle/Paddle/pull/46070)
-- Optimize the Inference library build system. The size can be pruned on demand
- - Pre-compiled installer supports TensorRT by default: The pre-compiled installer for training and the pre-compiled installer for deployment (Paddle Inference) are unified into one pre-compiled installer. The build system is optimized so that the pre-compiled installer supports TensorRT by default, reducing the switching cost for users of Paddle-TensorRT. [#46008](https://github.com/PaddlePaddle/Paddle/pull/46008), [#45824](https://github.com/PaddlePaddle/Paddle/pull/45824), [#46058](https://github.com/PaddlePaddle/Paddle/pull/46058)
- - The size can be pruned on demand: Pruned according to the model operator. [#47033](https://github.com/PaddlePaddle/Paddle/pull/47033) , [#47049](https://github.com/PaddlePaddle/Paddle/pull/47049) , [#47047](https://github.com/PaddlePaddle/Paddle/pull/47047)
-- Inference supports native AMP
- - To make full use of GPU Tensor Core computation capability and improve model inference performance, a model precision conversion tool has been developed, and Paddle Inference natively supports inference of mixed-precision models on GPU. For usage, refer to the [documentation](https://github.com/PaddlePaddle/Paddle-Inference-Demo/blob/release/v2.4/docs-official/guides/nv_gpu_infer/gpu_mixed_precision.md). [#43814](https://github.com/PaddlePaddle/Paddle/pull/43814), [#43881](https://github.com/PaddlePaddle/Paddle/pull/43881), [#44057](https://github.com/PaddlePaddle/Paddle/pull/44057), [#44307](https://github.com/PaddlePaddle/Paddle/pull/44307), [#44457](https://github.com/PaddlePaddle/Paddle/pull/44457), [#44866](https://github.com/PaddlePaddle/Paddle/pull/44866), [#45050](https://github.com/PaddlePaddle/Paddle/pull/45050), [#45346](https://github.com/PaddlePaddle/Paddle/pull/45346), [#45379](https://github.com/PaddlePaddle/Paddle/pull/45379), [#45406](https://github.com/PaddlePaddle/Paddle/pull/45406), [#45882](https://github.com/PaddlePaddle/Paddle/pull/45882)
- - To improve the inference performance of mixed-precision models, FP16 kernels are added for high-frequency operators that previously did not support FP16 computation. This reduces the chance that a cast operator must be inserted because of an input precision mismatch, and so improves inference performance. [#44642](https://github.com/PaddlePaddle/Paddle/pull/44642), [#45061](https://github.com/PaddlePaddle/Paddle/pull/45061), [#44653](https://github.com/PaddlePaddle/Paddle/pull/44653), [#45504](https://github.com/PaddlePaddle/Paddle/pull/45504), [#44969](https://github.com/PaddlePaddle/Paddle/pull/44969), [#44558](https://github.com/PaddlePaddle/Paddle/pull/44558), [#44710](https://github.com/PaddlePaddle/Paddle/pull/44710), [#43871](https://github.com/PaddlePaddle/Paddle/pull/43871), [#44792](https://github.com/PaddlePaddle/Paddle/pull/44792)
-- Upgrade the compression and inference engine
- - Upgrade the quantization model storage format. The new format supports three deployment methods: Paddle Inference, Paddle Lite, and Paddle2ONNX. The supported chips include X86 CPU, NVIDIA GPU, and Arm CPU. ([#46305](https://github.com/PaddlePaddle/Paddle/pull/46305), [#46283](https://github.com/PaddlePaddle/Paddle/pull/46283), [#46022](https://github.com/PaddlePaddle/Paddle/pull/46022))
- - Add the INT8 full quantization function compatible with SoC/NPU chips. This can ensure the output INT8 quantization model has the best inference acceleration and precision on SoC/NPU chips.
- - Upgrade the interface module between the PaddlePaddle framework and compiler, to support inference models to access the compiler for optimization via Paddle Inference. ([#44499](https://github.com/PaddlePaddle/Paddle/pull/44499) [#44708](https://github.com/PaddlePaddle/Paddle/pull/44708) )
-
-### (2)Underlying optimization
-
-- **GPU performance optimization**
- - Add the TensorRT mapping for operators such as matmul_v2, LSTM, reshape, fill_constant, swish, multiclass_nms3, bilinear_interp_v2, split, silu, and shuffle_channel. Optimize the support for the dynamic shape. Performance improved by 7% to 90% for multi-class focused models. ([#46177](https://github.com/PaddlePaddle/Paddle/pull/46177), [#44678](https://github.com/PaddlePaddle/Paddle/pull/44678), [#44314](https://github.com/PaddlePaddle/Paddle/pull/44314), [#44561](https://github.com/PaddlePaddle/Paddle/pull/44561), [#45166](https://github.com/PaddlePaddle/Paddle/pull/45166), [#44411](https://github.com/PaddlePaddle/Paddle/pull/44411), [#43424](https://github.com/PaddlePaddle/Paddle/pull/43424), [#44516](https://github.com/PaddlePaddle/Paddle/pull/44516))
- - Add constant folding PASS for inference performance optimization, to improve the performance of SwinTransformer, HifiGAN, FastSpeech2, and other models.([#45494](https://github.com/PaddlePaddle/Paddle/pull/45494))
- - Add a cache of the conv_fusion workspace size, to improve the computation performance of conv_fusion. ([#45902](https://github.com/PaddlePaddle/Paddle/pull/45902))
-- **Vision ViT model optimization**
- - Add the ViT model Attention structure fusion PASS, and support OSS Plugin and auto padding. The ViT inference speed increases by 30%-40%. [#45019](https://github.com/PaddlePaddle/Paddle/pull/45019), [#45506](https://github.com/PaddlePaddle/Paddle/pull/45506)
-- **Inference performance optimization of large model**
- - To improve the inference speed of very large generative models and save video memory, add an INT8 implementation (fused_multi_transformer_int8_op) of the multi-layer Transformer fusion operator (fused_multi_transformer_op), and support quantized inference of generative models. Performance is optimized through matrix multiplication algorithm selection and fusion of the quantize/de-quantize kernels. [#46169](https://github.com/PaddlePaddle/Paddle/pull/46169)
- - Add Pass for automatic matching fusion in order to improve the ease of use of fused_multi_transformer fusion for large model inference.
-- **CPU performance optimization**
- - Optimize the speech U2++ model. The FP32 model inference speed is improved by 35%. The INT8 model inference speed is improved by 69%. ([#47592](https://github.com/PaddlePaddle/Paddle/pull/47592), [#47127](https://github.com/PaddlePaddle/Paddle/pull/47127), [#47391](https://github.com/PaddlePaddle/Paddle/pull/47391), [#47234](https://github.com/PaddlePaddle/Paddle/pull/47234), [#47009](https://github.com/PaddlePaddle/Paddle/pull/47009), [#47080](https://github.com/PaddlePaddle/Paddle/pull/47080))
-
-
-### (3)Bug fix
-
-- TensorRT workspace size supports int64. ([#44469](https://github.com/PaddlePaddle/Paddle/pull/44469) )
-- In Paddle-TRT, fully support Op's input as weight.([#45545](https://github.com/PaddlePaddle/Paddle/pull/45545) )
-- In Paddle-TRT, support conv2d_transpose/conv3d_transpose to have the output_padding attribute.([#45004](https://github.com/PaddlePaddle/Paddle/pull/45004) )
-- In Paddle-TRT, enhance the strided_slice support for dynamic shape. ([#46819](https://github.com/PaddlePaddle/Paddle/pull/46819) )
-- In Paddle-TRT, optimize the video memory footprint of context when running in multi-thread scenarios.([#45468](https://github.com/PaddlePaddle/Paddle/pull/45468) )
-- In Paddle-TRT, fix the bug of repeatedly generating serialization files in case of change of initialization sequences when multiple models run in the same process.([#43942](https://github.com/PaddlePaddle/Paddle/pull/43942) )
-- Fix the bug of occasional crash when Predictor is initialized to run for multiple times in the same process.([#45203](https://github.com/PaddlePaddle/Paddle/pull/45203) )
-- Fix the bug of abnormal inference accuracy of quantization models such as MobileNetV3_large, ERNIE 3.0-Medium and bert ([#45416](https://github.com/PaddlePaddle/Paddle/pull/45416), [#46283](https://github.com/PaddlePaddle/Paddle/pull/46283), [#45920](https://github.com/PaddlePaddle/Paddle/pull/45920) [#47573](https://github.com/PaddlePaddle/Paddle/pull/47574))
-
-## 5. Environment adaptation
-
-- The pre-compiled installer for training and the pre-compiled installer for deployment (Paddle Inference) are unified into one pre-compiled installer. The build system is optimized so that the pre-compiled installer supports TensorRT by default.
-- The pre-compiled installer for CUDA 10.1 is no longer provided.
-- Add the pre-compiled installer for CUDA 11.7.
-- Reduce source code compilation time: Reduce inter-module dependencies, improve build parallelism, and optimize the compilation speed of some modules. The full compilation time is reduced by about 20 minutes in total.
-- Support running PaddlePaddle on Windows 11, CentOS 8, Ubuntu 22.04, and Jetson 5.02 system environments. Support running the PaddlePaddle Linux installer on Windows by using the WSL 2 tool.
-- Fix the runtime error of PaddlePaddle in glibc 2.34+ environments.
-- Optimize the code style of C++, Python, CMake in the whole code repository. Introduce or upgrade the following code style checking tools.
- - pre-commit is upgraded from 1.10.4 to 2.17.0: [#43103](https://github.com/PaddlePaddle/Paddle/pull/43103)
- - pylint is changed from the default version to a pinned version: [#43103](https://github.com/PaddlePaddle/Paddle/pull/43103)
- - remove-crlf is upgraded from 1.0.1 to 1.1.14: [#43103](https://github.com/PaddlePaddle/Paddle/pull/43103)
- - cpplint is changed from the default version to a pinned version, 1.6.0: [#43175](https://github.com/PaddlePaddle/Paddle/pull/43175), [#43978](https://github.com/PaddlePaddle/Paddle/pull/43978), [#43673](https://github.com/PaddlePaddle/Paddle/pull/43673), [#43679](https://github.com/PaddlePaddle/Paddle/pull/43679), [#43695](https://github.com/PaddlePaddle/Paddle/pull/43695), [#43733](https://github.com/PaddlePaddle/Paddle/pull/43733), [#43740](https://github.com/PaddlePaddle/Paddle/pull/43740)
- - clang-format is upgraded from 3.8 to 13.0: [#42840](https://github.com/PaddlePaddle/Paddle/pull/42840), [#43248](https://github.com/PaddlePaddle/Paddle/pull/43248), [#43329](https://github.com/PaddlePaddle/Paddle/pull/43329), [#43333](https://github.com/PaddlePaddle/Paddle/pull/43333), [#43633](https://github.com/PaddlePaddle/Paddle/pull/43633), [#43678](https://github.com/PaddlePaddle/Paddle/pull/43678)
- - Introduce the black tool for python code style checking :[#46014](https://github.com/PaddlePaddle/Paddle/pull/46014)
- - Introduce the cmakelint tool for cmake file code checking. Version is 1.4.2 : [#43222](https://github.com/PaddlePaddle/Paddle/pull/43222), [#43406](https://github.com/PaddlePaddle/Paddle/pull/43406), [#43414](https://github.com/PaddlePaddle/Paddle/pull/43414), [#43428](https://github.com/PaddlePaddle/Paddle/pull/43428)
- - Introduce cmake-format for automatic formatting of cmake files. Version is 0.6.13 : [#43057](https://github.com/PaddlePaddle/Paddle/pull/43057)
-
-## 6. Hardware adaptation
### Hygon DCU
-- Add the Profiler function on DCU, to collect, count and display performance data of model running process on DCU, and support DCU occupancy display at kernel level.
-### Kunlunxin Chip
-- Add the Profiler function on second-generation Kunlunxin chips, to collect, count and display performance data of the model running process, and support occupancy display at kernel level.
-- Support training/inference on second-generation Kunlunxin chips (Kunlunxin AI accelerator cards R200, R300, R200-8F, R200-8FS, RG800). A total of 51 models, such as PPYOLOE, PP-OCR, ERNIE3.0, PP-TSM, PP-TTS, DLRM, and PPO, have been verified. Support static graph + dynamic graph training, mixed precision training, and single machine single card and single machine multi-card training, covering five fields: intelligent vision, natural language processing, intelligent speech, intelligent recommendation, and reinforcement learning.
-### Cambricon
-- Support the training/inference of Cambricon MLU chip (MLU370 series of boards): The ResNet50, BERT, YoloV3, OCR-DB, Deeplabv3 and many other models are verified. Support the static graph + dynamic graph training. Support mixed precision training. Support the single machine single card and single machine multi-card training.
-### Graphcore
-- Support the training/inference of Graphcore IPU chip (including IPU Mk2 GC200 and Bow IPU). Support ResNet50, BERT and other models. Support the static graph and dynamic-to-static graph mode training. Support the single chip, single machine, and multi-machine distributed training.
-- Add the support of more operators
-- Upgrade to Poplar SDK v3.0.0 [#46892](https://github.com/PaddlePaddle/Paddle/pull/46892)
-- Support training models in the dynamic-to-static graph mode. Add a new paddle.incubate.identity_loss op to assist with composition [#43770](https://github.com/PaddlePaddle/Paddle/pull/43770)
-- Support the Paddle native distributed training API: paddle.distributed.launch [#43311](https://github.com/PaddlePaddle/Paddle/pull/43311)
-- Support training models with mixed precision [#41733](https://github.com/PaddlePaddle/Paddle/pull/41733)
-- Paddle Inference supports custom operators by using PopART [#45235](https://github.com/PaddlePaddle/Paddle/pull/45235)
-
-### Intel
-- Migrate oneDNN operators: transpose2_grad ([#46139](https://github.com/PaddlePaddle/Paddle/pull/46139)), relu6_grad ([#46501](https://github.com/PaddlePaddle/Paddle/pull/46501)), gaussian_random ([#46747](https://github.com/PaddlePaddle/Paddle/pull/46747), [#45481](https://github.com/PaddlePaddle/Paddle/pull/45481)), sgd and stack ([#46374](https://github.com/PaddlePaddle/Paddle/pull/46374)), concat+grad, expand+grad, fill_constant ([#45863](https://github.com/PaddlePaddle/Paddle/pull/45863)), slice, slice_grad, split, pad and pad3d ([#46101](https://github.com/PaddlePaddle/Paddle/pull/46101)), softmax_grad ([#46257](https://github.com/PaddlePaddle/Paddle/pull/46257)), Shape ([#46051](https://github.com/PaddlePaddle/Paddle/pull/46051)), Sum ([#46239](https://github.com/PaddlePaddle/Paddle/pull/46239)), Cast, clip+grad and pool+grad ([#45775](https://github.com/PaddlePaddle/Paddle/pull/45775)), Reduce sum+grad, mean+grad, min and max ([#45536](https://github.com/PaddlePaddle/Paddle/pull/45536)), Relu and abs ([#45397](https://github.com/PaddlePaddle/Paddle/pull/45397)), Gelu ([#45596](https://github.com/PaddlePaddle/Paddle/pull/45596)), Scale ([#45537](https://github.com/PaddlePaddle/Paddle/pull/45537))
-- Optimize kernels of fill_constant, fc, conv, and a number of operators
-- Add several Pass fusion optimizations
-- Optimize the Adam-W CPU FP32 optimizer ([#42522](https://github.com/PaddlePaddle/Paddle/pull/42522))
-- Optimize pad3d fp32 onednn operator kernel implementation ([#43990](https://github.com/PaddlePaddle/Paddle/pull/43990))
-- Optimize the concurrent execution of matmul, FC and lookup_v2 kernels ([#44023](https://github.com/PaddlePaddle/Paddle/pull/44023), [#44078](https://github.com/PaddlePaddle/Paddle/pull/44078), [#44640](https://github.com/PaddlePaddle/Paddle/pull/44640), [#44744](https://github.com/PaddlePaddle/Paddle/pull/44744), [#45249](https://github.com/PaddlePaddle/Paddle/pull/45249))
-- FC onednn operator kernel supports bf16 ( [#42758](https://github.com/PaddlePaddle/Paddle/pull/42758), [#43154](https://github.com/PaddlePaddle/Paddle/pull/43154), [#43109](https://github.com/PaddlePaddle/Paddle/pull/43109))
-- Add the fusion of matrix multiplication and activation functions ([#43519](https://github.com/PaddlePaddle/Paddle/pull/43519), [#43198](https://github.com/PaddlePaddle/Paddle/pull/43198))
-- Support convolution operator int8 parameter production IR passes ( [#44680](https://github.com/PaddlePaddle/Paddle/pull/44680), [#42625](https://github.com/PaddlePaddle/Paddle/pull/42625))
-- Add pool/avg quantization and scales correction ([#44186](https://github.com/PaddlePaddle/Paddle/pull/44186))
-- Add the matmul and elementwise onednn operator kernel fusion ([#45077](https://github.com/PaddlePaddle/Paddle/pull/45077))
-- Fix the QAT precision bug ([#43693](https://github.com/PaddlePaddle/Paddle/pull/43693), [#45936](https://github.com/PaddlePaddle/Paddle/pull/45936), [#46378](https://github.com/PaddlePaddle/Paddle/pull/46378))
-- Migrate 42 oneDNN operator kernels to PHI operator library ([#46374](https://github.com/PaddlePaddle/Paddle/pull/46374), [#46101](https://github.com/PaddlePaddle/Paddle/pull/46101), [#45989](https://github.com/PaddlePaddle/Paddle/pull/45989), [#45863](https://github.com/PaddlePaddle/Paddle/pull/45863), [#45775](https://github.com/PaddlePaddle/Paddle/pull/45775), [#45626](https://github.com/PaddlePaddle/Paddle/pull/45626), [#45536](https://github.com/PaddlePaddle/Paddle/pull/45536), [#46501](https://github.com/PaddlePaddle/Paddle/pull/46501), [#46257](https://github.com/PaddlePaddle/Paddle/pull/46257), [#45596](https://github.com/PaddlePaddle/Paddle/pull/45596), [#45537](https://github.com/PaddlePaddle/Paddle/pull/45537), [#45481](https://github.com/PaddlePaddle/Paddle/pull/45481), [#45397](https://github.com/PaddlePaddle/Paddle/pull/45397), [#46239](https://github.com/PaddlePaddle/Paddle/pull/46239), [#46139](https://github.com/PaddlePaddle/Paddle/pull/46139), [#46051](https://github.com/PaddlePaddle/Paddle/pull/46051))
-- Quantize the elementwise_sub and shape operator kernels ([#42854](https://github.com/PaddlePaddle/Paddle/pull/42854), [#44124](https://github.com/PaddlePaddle/Paddle/pull/44124))
-
-## Thanks to our Contributors
-
-This release contains contributions from:
-
-0x45f, Aganlengzi, Ainavo, Allen Guo, Asthestarsfalll, Aurelius84, Baibaifan, baoachun, BiynXu, Bo Zhang, BrilliantYuKaimin, cambriconhsq, caozhou, carryyu, ccrrong, ceci3, chalsliu, Chang Xu, Charles-hit, Chen Long, Chen Weihang, chenjian, chentianyu03, Chenxiao Niu, cifar10, crystal, csy0225, danleifeng, David Nicolas, dc-cheny, denglin-github, dongfangshenzhu, duanboqiang, duanyanhui, engineer, enzodechine, Fan Zhang, feifei-111, Feiyu Chan, Feng Ni, feng_shuai, FlyingQianMM, freeliuzc, furnace, fuyou765, fwenguang, Ghost Screaming, gongweibao, Guanghua Yu, guguguzi, Guoxia Wang, Haipeng Wang, handiz, Haohongxiang, haosicheng, helen88, heliqi, hong, HongyuJia, houj04, huangxu96, Hui Zhang, Huihuang Zheng, huzhiqiang, Jacek Czaja, Jack Zhou, jack603047588, Jackwaterveg, jakpiase, james, Jiabin Yang, jiangcheng, Jiaqi Liu, JingZhuangzhuang, joanna.wozna.intel, JYChen, JZ-LIANG, Kaipeng Deng, kangguangli, kuizhiqing, Leo Chen, Leo Guo, levi131, Li Min, Li-fAngyU, lidanqing, LielinJiang, Ligoml, Lijunhui, lilong12, limingshu, Lin Manhui, Linjie Chen, liqitong-a, littletomatodonkey, liu zhengxi, Liu-xiandong, liutiexing, Liyulingyue, LiYuRio, Lux et Veritas, lyq, Matsumoto Ruko, MayYouBeProsperous, mengqingchun02, Ming-Xu Huang, ming1753, minghaoBD, moyan, mrcangye, Netpunk, niuliling123, Nyakku Shigure, OccupyMars2025, onecatcn, pangyoki, parap1uie-s, peachlcy, piotrekobi, Qi Li, QingshuChen, qipengh, Rayman, Regan Yue, RichardWooSJTU, risemeup1, Roc, ronnywang, Rui Li, Ruibiao Chen, seemingwang, Shang Zhizhou, shangliang Xu, ShenLiang, shentanyue, Shijie, ShiningZhang, shixingbo, shiyutang, Shuangchi He, Siming Dai, Sing_chan, Skr Bang, SmirnovKol, sneaxiy, sprouteer, Sylwester Fraczek, Sławomir Siwek, taixiurong, Tao CHANG, TeFeng Chen, Thomas Young, thunder95, Thunderbrook, tiancaishaonvjituizi, tianshuo78520a, Tomasz Socha, TTerror, USTCKAY, Vigi Zhang, Walter, Wang Bojun, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, WangXi, wangxinxin08, Wangzheee, 
WangZhen, wangzhen38, wawltor, wbn, Wei Shengyu, Weilong Wu, weishengying, Wen Sun, wenbin, whs, Wilber, WJJ1995, wuhuachaocoding, wuhuanzhou, wuyefeilin, XiaoguangHu, xiaoguoguo626807, xiaohemaikoo, xiaoting, xiaoxiaohehe001, Xiaoxu Chen, xiayanming, Xingyuan Zhang, xiongkun, yang131313, yangguohao, YangZhou, Yanxing Shi, Yao Zihang, yaoxuefeng, yaozhixin, yeliang2258, Yilingyelu, Yiqun Liu, ykkk2333, Yuang Liu, Yuanle Liu, YuanRisheng, yuguo, Yulong Ao, Yulv-git, YUNSHEN XIE, Zhang Jun, Zhang Ting, Zhang Zheng, zhangbo9674, zhangbopd, zhangchunle, Zhangjingyu06, zhangkaihuo, zhangxiaoci, zhangyikun02, zhangzhenguo, Zhanlue Yang, zhaocaibei123, zhaoying9105, zhaoyingli, Zhen Wang, Zhengyang Song, zhiboniu, Zhong Hui, Zhou Wei, zhoutianzi666, zhupengyang, ziyoujiyi, zlsh80826, zmxdream, zn, Zuza Gawrysiak, zyfncg, 傅剑寒, 六个骨头, 津, 熊峻峰, 王明冬, 石晓伟
-
-# 2.3.1 Release Note
-
-## **1. Important Updates**
-
-- V2.3.1 builds on V2.3 by fixing known issues and adding precompiled binaries that support CUDA 11.6.
-
-## **2. Training Framework (distributed included)**
-
-### **(1) Function Optimization**
-
-#### API
-
-- Modify two initialization modes of `paddle.nn.initializer.KaimingUniform` and `paddle.nn.initializer.KaimingNormal`, to support multiple types of activation functions. ([#43721](https://github.com/PaddlePaddle/Paddle/pull/43721), [#43827](https://github.com/PaddlePaddle/Paddle/pull/43827))
-- Optimize the data pre-fetching function of `paddle.io.DataLoader`: the new `prefetch_factor` setting controls the cache size of pre-fetched data. This can avoid IO blocking when reading large blocks of data. ([#43674](https://github.com/PaddlePaddle/Paddle/pull/43674))
-
-#### **New dynamic graph execution mechanism**
-
-- Modify the initialization method of optional type Tensor in the new dynamic graph API logic to prevent data exceptions caused by early destruction. ([#42561](https://github.com/PaddlePaddle/Paddle/pull/42561))
-
-#### **New static graph executor**
-
-- Defer initialization of the thread pools in the executor, to avoid creating thread pools for `programs` that execute only once (e.g.,`save, load, startup_program`, etc.). ([#43768](https://github.com/PaddlePaddle/Paddle/pull/43768))
-
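The deferred thread-pool initialization described above can be pictured with a small standalone sketch. This is not Paddle's executor: the class `LazyPoolExecutor`, its run counter, and the choice to run the first invocation inline are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyPoolExecutor:
    """Hypothetical executor that only creates its thread pool on the second run."""

    def __init__(self, max_workers=4):
        self._max_workers = max_workers
        self._pool = None      # not created until it is actually needed
        self._runs = 0

    def run(self, tasks):
        self._runs += 1
        if self._runs == 1:
            # First run: execute inline, so run-once programs never pay for a pool.
            return [task() for task in tasks]
        if self._pool is None:
            # Second run onwards: the program is evidently reused, so build the pool.
            self._pool = ThreadPoolExecutor(max_workers=self._max_workers)
        return list(self._pool.map(lambda task: task(), tasks))

    def close(self):
        if self._pool is not None:
            self._pool.shutdown()


executor = LazyPoolExecutor(max_workers=2)
print(executor.run([lambda: 1, lambda: 2]))   # runs inline, no pool yet
print(executor.run([lambda: 3, lambda: 4]))   # pool is created here
executor.close()
```

The trade-off is that the first execution gets no parallelism, which is acceptable for programs such as save, load, or startup that run only once.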
-#### **Mixed precision training**
-
-- Disable the `state_dict` hook in `set_state_dict` in `paddle.nn.Layer`. ([#43407](https://github.com/PaddlePaddle/Paddle/pull/43407))
-
-#### **Distributed training**
-
-- Enable tensor parallelism in `paddle.incubate.nn.functional.fused_attention` and `paddle.incubate.nn.functional.fused_feedforward`. ([#43505](https://github.com/PaddlePaddle/Paddle/pull/43505))
-
-#### **Others**
-
-- Adjust print format of the framework operator kernels to facilitate automated splitting and parsing. ([#42931](https://github.com/PaddlePaddle/Paddle/pull/42931))
-- Update the model quantization API to support the `round to nearest, ties to even` rounding mode, and support quantization in the range [-128, 127]. ([#43829](https://github.com/PaddlePaddle/Paddle/pull/43829))
-- Support AMP mixed precision training in quantization-aware training. ([#43689](https://github.com/PaddlePaddle/Paddle/pull/43689))
-- Add the `progress bar` at the beginning of quantization-aware training, so that it is easy to check the progress of quantization initialization. Skip the scale op when counting out_threshold to speed up the initialization process. ([#43454](https://github.com/PaddlePaddle/Paddle/pull/43454))
-- Support `conv` and `bn` fusion in the dynamic graph quantization training. Support the settings of skip_tensor_list in the static graph offline quantization, to skip some layers without quantization. ([#43301](https://github.com/PaddlePaddle/Paddle/pull/43301))
-
-### **(2) Performance Optimization**
-
-- Optimize the `paddle.incubate.nn.functional.fused_attention` and `paddle.incubate.nn.functional.fused_feedforward` operators. Add the `add_residual` property to control whether to perform the add-`residual` operation in the last step. The performance of the CAE model is improved by 7.7%. ([#43719](https://github.com/PaddlePaddle/Paddle/pull/43719))
-- Optimize the `linspace` operator. Initialize the three input Tensors `start`, `stop`, and `num` on CPU, to avoid GPU->CPU copy in the operator. This speeds up the SOLOv2 model by 6%. ([#43746](https://github.com/PaddlePaddle/Paddle/pull/43746))
-
-### **(3) Bug Fix**
-
-#### API
-
-- Fix the error reported by `paddle.io.DataLoader` when `return_list=True` due to multi-thread conflict. ([#43691](https://github.com/PaddlePaddle/Paddle/pull/43691))
-- Fix the error where the `to` method reports that NoneType has no device attribute when a `paddle.nn.Layer` contains a parameter of `None` type. ([#43597](https://github.com/PaddlePaddle/Paddle/pull/43597))
-- Fix the bug that the calculation result of cumsum op is wrong in some `shape` settings. ([#42500](https://github.com/PaddlePaddle/Paddle/pull/42500), [#43777](https://github.com/PaddlePaddle/Paddle/pull/43777))
-- Fix the bug that the output result dimension of `Tensor.__getitem__` is 0 in the networking stage when using `bool` index in the static graph. ([#43246](https://github.com/PaddlePaddle/Paddle/pull/43246))
-- Fix the bug that occurs when `paddle.slice` and `paddle.strided_slice` handle negative parameters. ([#43432](https://github.com/PaddlePaddle/Paddle/pull/43432))
-- Fix the bug that the assignment result of set_value op is abnormal when the slice `step` is negative. ([#43694](https://github.com/PaddlePaddle/Paddle/pull/43694))
-- Fix the bug that the `copy` interface in C++ cannot copy between multiple cards. ([#43728](https://github.com/PaddlePaddle/Paddle/pull/43728))
-- Fix the bug in the inference stage caused by attribute naming in `paddle.incubate.nn.functional.fused_attention` and `paddle.incubate.nn.functional.fused_feedforward`. ([#43505](https://github.com/PaddlePaddle/Paddle/pull/43505))
-- Fix an exception in ConditionalBlockGrad op when processing Tensor that does not require `grad`. ([#43034](https://github.com/PaddlePaddle/Paddle/pull/43034))
-- Fix the bug of device memory increase caused by einsum op in the speed optimization of backward computation. By default, this optimization is enabled. ([#43397](https://github.com/PaddlePaddle/Paddle/pull/43397))
-- Fix the bug that data is not reproducible when `paddle.io.DataLoader` reads data with multiple processes under a fixed random seed on a single card. ([#43702](https://github.com/PaddlePaddle/Paddle/pull/43702))
-- Fix the bug that softmax op triggers CUDNN_STATUS_NOT_SUPPORTED when the Tensor exceeds 2G. ([#43719](https://github.com/PaddlePaddle/Paddle/pull/43719))
-- Fix the bug that the trace op `Event` string is indistinguishable among different operators, which makes performance analysis inconvenient. ([#42789](https://github.com/PaddlePaddle/Paddle/pull/42789))
-
-#### **Others**
-
-- Fix the bug of overflowing device memory caused by multiple deepcopy and saving in case of dynamic-to-static. ([#43141](https://github.com/PaddlePaddle/Paddle/pull/43141))
-- Fix the bug that the device id introduced by the upgrade of PlaceType used in the custom operator is wrong in the multi-card scenario. ([#43830](https://github.com/PaddlePaddle/Paddle/pull/43830))
-- Optimize the `paddle.profiler.Profiler` timeline visualization logic, move events customized in python scripts from C++ folding display to python folding display. ([#42790](https://github.com/PaddlePaddle/Paddle/pull/42790))
-
-## **3. Deployment Direction (Paddle Inference)**
-
-### **(1) New Features**
-
-#### **New functions**
-
-- Add the support of the PaddleSlim quantization model for ONNX Runtime backends on CPUs. ([#43774](https://github.com/PaddlePaddle/Paddle/pull/43774), [#43796](https://github.com/PaddlePaddle/Paddle/pull/43796))
-
-### **(2) Underlying Optimization**
-
-#### **CPU performance optimization**
-
-- Remove `gpu_cpu_reshape2_matmul_fuse_pass` from EnableMkldnn configuration to fix the bug of ResNet50 performance degradation. ([#43750](https://github.com/PaddlePaddle/Paddle/pull/43750))
-
-#### **GPU performance optimization**
-
-- Add the support of `bilinear_interp_v2` TensorRT convert. ([#43618](https://github.com/PaddlePaddle/Paddle/pull/43618))
-- Add `matmul_scale_fuse_pass` and `multihead_matmul_fuse_pass_v3` to GPU pass. ([#43765](https://github.com/PaddlePaddle/Paddle/pull/43765))
-- Add support for deferred initialization of GPU handles. ([#43661](https://github.com/PaddlePaddle/Paddle/pull/43661))
-
-### **(3) Bug Fixing**
-
-#### **Framework and API fixing**
-
-- Fix the compile error when binding Paddle-Lite XPU. ([#43178](https://github.com/PaddlePaddle/Paddle/pull/43178))
-- Fix the bug that the ERNIE 3.0 pass is falsely triggered. ([#43948](https://github.com/PaddlePaddle/Paddle/pull/43948))
-- Fix the bug that int8 quantization attribute in multihead op cannot be read. ([#43020](https://github.com/PaddlePaddle/Paddle/pull/43020))
-
-#### **Backend capability fixing**
-
-- Fix the bug that the elementwise_mul and matmul ops in MKLDNN crash during quantized inference. ([#43725](https://github.com/PaddlePaddle/Paddle/pull/43725))
-- Fix a bug where TensorRT subgraph serialization files are repeatedly generated for the same model during inference. ([#42945](https://github.com/PaddlePaddle/Paddle/pull/43945), [#42633](https://github.com/PaddlePaddle/Paddle/pull/42633))
-- Fix a conflict between the ONNX Runtime backend and external use of protobuf. ([#43159](https://github.com/PaddlePaddle/Paddle/pull/43159), [#43742](https://github.com/PaddlePaddle/Paddle/pull/43742))
-- Fix an error reported by the Python prediction library when using the ONNX Runtime backend with multiple inputs. ([#43621](https://github.com/PaddlePaddle/Paddle/pull/43621))
-
-## **4. Environment Adaptation**
-
-### **Compile and install**
-
-- Complete verification and adaptation of CUDA 11.6, and release CUDA 11.6 precompiled binary. ([#43935](https://github.com/PaddlePaddle/Paddle/pull/43935), [#44005](https://github.com/PaddlePaddle/Paddle/pull/44005))
-- Fix a cub error when compiling with CUDA 11.6 on Windows. ([#43935](https://github.com/PaddlePaddle/Paddle/pull/43935), [#44005](https://github.com/PaddlePaddle/Paddle/pull/44005))
-- Fix the bug of long compilation time for elementwise and reduce op. ([#43202](https://github.com/PaddlePaddle/Paddle/pull/43202), [#42779](https://github.com/PaddlePaddle/Paddle/pull/42779), [#43205](https://github.com/PaddlePaddle/Paddle/pull/43205))
-
-### **New hardware adaptation**
-
-- Cambricon MLU supports PaddlePaddle Profiler. ([#42115](https://github.com/PaddlePaddle/Paddle/pull/42115))
-- GraphCore IPU supports visualization of compilation progress. ([#42078](https://github.com/PaddlePaddle/Paddle/pull/42078))
-
-# 2.3.0 Release Note
-
-## **1. Important Updates**
-
-We are excited to release the PaddlePaddle Framework V2.3.0. This version contains the following highlights.
-
-### API
-
-- Added more than 100 new APIs, covering automatic differentiation, linear algebra, probability distribution, sparse tensor, framework performance analysis, hardware device management, vision domain, etc.
-
-- Added 4 new automatic differentiation APIs, 11 new linear algebra APIs, and 21 new probability distribution APIs to better support use cases in scientific computing, reinforcement learning, and other application areas.
-
-- Added 11 new Sparse Tensor APIs including basic functions of sparse tensor construction and conversion. The COO and CSR formats are supported.
-
-- Added 9 new framework performance analysis APIs. The new performance profiling APIs, centered around `paddle.profiler.Profiler`, help users collect and analyze performance statistics during training and inference.
-
-- Added 7 APIs for device management, facilitating hardware information acquisition.
-
-- Added several vision and text domain APIs to facilitate reuse of MobileNetV3, ResNeXt, and other backbone networks, enabling fast network building.
-
-
-### **Paddle HIgh reusability operator library**
-
-- We announce PHI as the new Paddle HIgh reusability operator library. PHI provides Primitive API, enabling kernel reuse for operator development. As a refactored functional operator library, PHI aims to solve legacy problems that harm the framework's performance and reusability, in particular on the operator development. Such problems include inefficient ways of cross using operators, unclear operator interfaces and lacking direct calls to the operator library in C++. With PHI, new operators can be easily implemented by composing functions available in the functional library. The library provides over 200 C++ operator class APIs and nearly 500 kernels. Composing new operators through these built-in functions can greatly reduce the user's development effort. PHI supports different types of hardware (e.g., GPU and XPU). In addition, PHI is extensible with plugins for accommodating third party accelerators (such as NPU) in a low cost and reusable fashion. In short, PHI supports low level operator composability, the reuse of kernels through Primitives, and accelerators through plugins.
-
-### **Distributed Training**
-
-- Fully upgrade the adaptive distributed training architecture, including multiple modules such as elastic resource management, asynchronous pipelined executor, heterogeneous communication, and automatic parallelism, and support hardware-aware distributed training and inference on a variety of heterogeneous hardware.
-
-- Add MoE parallel strategy, GroupSharded parallel strategy, and Pure FP16 under dynamic graph hybrid Parallelism, which further supports the efficient distributed training of large models under the dynamic graph.
-
-- Comprehensively upgrade and optimize the architecture of general heterogeneous parameter server, and simplify each module, such as communication and storage, to improve the secondary development experience of parameter server. The performance of GPU parameter server is improved by 2.38 times under 100 billion parameters and 10 billion data.
-
-
-### **Compile and Install**
-
-- From version 2.3.0, PaddlePaddle upgrades the set of supported GPU architectures.
-
-
-### **Inference Deployment**
-
-- Add the Java API and ONNX Runtime CPU backend.
-
-- Support TensorRT 8.0 / 8.2 and structured sparsity, with deep performance optimization for ERNIE-like structural models.
-
-
-### **Hardware Backend Extension**
-
-- Add custom device support: provide a plug-in way to extend PaddlePaddle hardware backend.
-
-- Add training/inference support for multiple heterogeneous chips such as HUAWEI Ascend 910 / GraphCore IPU / Cambricon MLU / KUNLUNXIN 2.
-
-
-### **Framework Architecture**
-
-- In this version, we did a lot of work on the framework executor. For details, please see [New Dynamic Graph Execution Mechanism](#new-dynamic-graph-execution-mechanism) and [New Static Graph Executor](#new-static-graph-executor).
-
-## **2. Incompatibility Upgrade**
-
-- Due to limitation of the binary size, sm35 CUDA ARCH is dropped in pre-compiled binaries. ([#41754](https://github.com/PaddlePaddle/Paddle/pull/41754))
-
-- When `paddle.to_tensor` converts a python int scalar to a Tensor, the default data type on Windows changes from int32 to int64, to align with Linux/Mac. ([#39662](https://github.com/PaddlePaddle/Paddle/pull/39662))
-
-- To keep consistency with division behavior under python3, the division symbol `/` has been changed from “rounding divide” to “true divide”, and the data type of the computed output has been switched from int to float. ([#40890](https://github.com/PaddlePaddle/Paddle/pull/40890))
-
-Behavior in 2.2:
-
-```python
->>> import paddle
->>> a = paddle.to_tensor([327])
->>> b = paddle.to_tensor([80])
->>> a / b
-Tensor(shape=[1], dtype=int64, place=CUDAPlace(0), stop_gradient=True,
-       [4])
-```
-
-Behavior in 2.3.0:
-
-```python
->>> import paddle
->>> a = paddle.to_tensor([327])
->>> b = paddle.to_tensor([80])
->>> a / b
-Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=True,
-       [4.08750010])
-```
-
-- Revise the ELU formula. The computation for alpha < 0 now aligns with the original paper, fixing a small number of cases where results were calculated incorrectly. Meanwhile, `elu_` reports an error when alpha < 0, because the backward gradient cannot be computed mathematically from the output alone when alpha < 0. ([#37316](https://github.com/PaddlePaddle/Paddle/pull/37316))
-
-Behavior in 2.2:
-
-```python
-# elu(x) = max(0, x) + min(0, α ∗ (e^x − 1))
->>> import paddle
->>> x = paddle.to_tensor([-1., 6.])
->>> m = paddle.nn.ELU(-0.2)
->>> out = m(x)
->>> out
-Tensor(shape=[2], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
-       [ 0.        , -74.48576355])
->>> out = paddle.nn.functional.elu_(x, alpha=-0.2, name=None)
->>> out
-Tensor(shape=[2], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
-       [ 0.        , -74.48576355])
-```
-
-Behavior in 2.3.0:
-
-```python
-# elu(x) = x, if x > 0
-# elu(x) = α ∗ (e^x − 1), if x <= 0
->>> import paddle
->>> x = paddle.to_tensor([-1., 6.])
->>> m = paddle.nn.ELU(-0.2)
->>> out = m(x)
->>> out
-Tensor(shape=[2], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
-       [0.12642412, 6.        ])
->>> out = paddle.nn.functional.elu_(x, alpha=-0.2, name=None)
-Traceback (most recent call last):
-  File "<stdin>", line 1, in <module>
-  File "/usr/local/lib/python3.7/dist-packages/decorator.py", line 232, in fun
-    return caller(func, *(extras + args), **kw)
-  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
-    return wrapped_func(*args, **kwargs)
-  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/inplace_utils.py", line 34, in __impl__
-    return func(*args, **kwargs)
-  File "/usr/local/lib/python3.7/dist-packages/paddle/nn/functional/activation.py", line 89, in elu_
-    assert alpha >= 0., "elu_ only support alpha >= 0, please use elu instead."
-AssertionError: elu_ only support alpha >= 0, please use elu instead.
-```
-
-## **3. Training Framework (with the distributed function)**
-
-### **(1) New functions**
-
-#### API
-
-- Add 4 new automatic differentiation APIs to support scientific computing, as listed below: ([#40692](https://github.com/PaddlePaddle/Paddle/pull/40692))
-
- - `paddle.incubate.autograd.vjp`, to compute the vector-Jacobian product.
-
- - `paddle.incubate.autograd.jvp`, to compute the Jacobian-vector product.
-
- - `paddle.incubate.autograd.Jacobian`, to compute the Jacobian matrix.
-
- - `paddle.incubate.autograd.Hessian`, to compute the Hessian matrix.
-
-- Add linear algebra class API
-
- - Add `paddle.linalg.triangular_solve`, to solve a system of linear equations that has a unique solution and a triangular coefficient matrix. ([#36714](https://github.com/PaddlePaddle/Paddle/pull/36714))
-
- - Add `paddle.linalg.eig`, to compute the eigendecomposition of a general square matrix. ([#35764](https://github.com/PaddlePaddle/Paddle/pull/35764))
-
- - Add `paddle.linalg.solve`, to compute solutions to systems of linear equations. ([#35715](https://github.com/PaddlePaddle/Paddle/pull/35715))
-
- - Add `paddle.linalg.lstsq`, to compute least-squares solutions to systems of linear equations. ([#38585](https://github.com/PaddlePaddle/Paddle/pull/38585), [#38621](https://github.com/PaddlePaddle/Paddle/pull/38621))
-
- - Add `paddle.linalg.qr`, to compute the QR decomposition of a matrix. ([#35742](https://github.com/PaddlePaddle/Paddle/pull/35742), [#38824](https://github.com/PaddlePaddle/Paddle/pull/38824))
-
- - Add `paddle.inner`, to compute inner product of a matrix. ([#37706](https://github.com/PaddlePaddle/Paddle/pull/37706))
-
- - Add `paddle.outer`, to compute outer product of a matrix. ([#37706](https://github.com/PaddlePaddle/Paddle/pull/37706))
-
- - Add `paddle.linalg.cov`, to compute covariance between vectors. ([#38392](https://github.com/PaddlePaddle/Paddle/pull/38392))
-
- - Add `paddle.linalg.cholesky_solve`, to solve a system of linear equations using the Cholesky factorization. ([#38167](https://github.com/PaddlePaddle/Paddle/pull/38167))
-
- - Add `paddle.linalg.lu` and `paddle.linalg.lu_unpack`, to compute the LU decomposition of a matrix and to unpack the LU matrix. ([#38617](https://github.com/PaddlePaddle/Paddle/pull/38617), [#38559](https://github.com/PaddlePaddle/Paddle/pull/38559), [#38616](https://github.com/PaddlePaddle/Paddle/pull/38616))
-
-- Add 21 new probability distribution class APIs for reinforcement learning, variational inference, scientific computing, and other scenarios, including 6 random variable distributions, 13 random variable transformations, and 2 KL divergence computing APIs, as listed below: ([#40536](https://github.com/PaddlePaddle/Paddle/pull/40536), [#38820](https://github.com/PaddlePaddle/Paddle/pull/38820), [#38558](https://github.com/PaddlePaddle/Paddle/pull/38558/files), [#38445](https://github.com/PaddlePaddle/Paddle/pull/38445), [#38244](https://github.com/PaddlePaddle/Paddle/pull/38244), [#38047](https://github.com/PaddlePaddle/Paddle/pull/38047))
-
- - `paddle.distribution.ExponentialFamily`, exponential distribution family base class.
-
- - `paddle.distribution.Beta`, `Beta` distribution.
-
- - `paddle.distribution.Dirichlet`, `Dirichlet` distribution.
-
- - `paddle.distribution.Independent`, independent distribution, used to create higher-order distributions.
-
- - `paddle.distribution.TransformedDistribution`, transformed distribution, used to generate higher-order distributions from a base distribution and a series of transformations.
-
- - `paddle.distribution.Multinomial`, a multinomial distribution.
-
- - `paddle.distribution.Transform`, base class for transforming random variables.
-
- - `paddle.distribution.AbsTransform`, take absolute value transform.
-
- - `paddle.distribution.AffineTransform`, affine transform.
-
- - `paddle.distribution.ChainTransform`, chain combination of the transform.
-
- - `paddle.distribution.ExpTransform`, exponential transform.
-
- - `paddle.distribution.IndependentTransform`, independent transform, used to extend the `event_dim` of the transform's domain.
-
- - `paddle.distribution.PowerTransform`, power transform.
-
- - `paddle.distribution.ReshapeTransform`, `reshape` transform.
-
- - `paddle.distribution.SigmoidTransform`, `sigmoid` transform.
-
- - `paddle.distribution.SoftmaxTransform`, `softmax` transform.
-
- - `paddle.distribution.StackTransform`, `stack` transform, used to combine multiple transforms in a `stack` method.
-
- - `paddle.distribution.StickBreakingTransform`, `stickbreaking` transform.
-
- - `paddle.distribution.TanhTransform`, `tanh` transform.
-
- - `paddle.distribution.kl_divergence`, compute KL divergence.
-
- - `paddle.distribution.register_kl`, register user-defined KL divergence calculation function.
-
-- Add high-level API
-
- - Add `paddle.vision.models.AlexNet` and `paddle.vision.models.alexnet`, to use AlexNet models directly. ([#36058](https://github.com/PaddlePaddle/Paddle/pull/36058))
-
- - Add `paddle.vision.models.DenseNet`, `paddle.vision.models.densenet121`, `paddle.vision.models.densenet161`, `paddle.vision.models.densenet169`, `paddle.vision.models.densenet201`, and `paddle.vision.models.densenet264`, to use DenseNet models directly. ([#36069](https://github.com/PaddlePaddle/Paddle/pull/36069))
-
- - Add `paddle.vision.models.GoogLeNet` and `paddle.vision.models.googlenet`, to use GoogLeNet models directly. ([#36034](https://github.com/PaddlePaddle/Paddle/pull/36034))
-
- - Add `paddle.vision.models.InceptionV3`, `paddle.vision.models.inception_v3`, to use InceptionV3 models directly. ([#36064](https://github.com/PaddlePaddle/Paddle/pull/36064))
-
- - Add `paddle.vision.models.MobileNetV3Small`, `paddle.vision.models.MobileNetV3Large`, `paddle.vision.models.mobilenet_v3_small`, and `paddle.vision.models.mobilenet_v3_large`, to use MobileNetV3 models directly. ([#38653](https://github.com/PaddlePaddle/Paddle/pull/38653))
-
- - Add `paddle.vision.models.resnext50_32x4d`, `paddle.vision.models.resnext50_64x4d`, `paddle.vision.models.resnext101_32x4d`, `paddle.vision.models.resnext101_64x4d`, `paddle.vision.models.resnext152_32x4d`, and `paddle.vision.models.resnext152_64x4d`, to use ResNeXt models directly. ([#36070](https://github.com/PaddlePaddle/Paddle/pull/36070))
-
- - Add `paddle.vision.models.ShuffleNetV2`, `paddle.vision.models.shufflenet_v2_x0_25`, `paddle.vision.models.shufflenet_v2_x0_33`, `paddle.vision.models.shufflenet_v2_x0_5`, `paddle.vision.models.shufflenet_v2_x1_0`, `paddle.vision.models.shufflenet_v2_x1_5`, `paddle.vision.models.shufflenet_v2_x2_0`, and `paddle.vision.models.shufflenet_v2_swish`, to use ShuffleNetV2 models directly. ([#36067](https://github.com/PaddlePaddle/Paddle/pull/36067))
-
- - Add `paddle.vision.models.SqueezeNet`, `paddle.vision.models.squeezenet1_0`, and `paddle.vision.models.squeezenet1_1`, to use SqueezeNet models directly. ([#36066](https://github.com/PaddlePaddle/Paddle/pull/36066))
-
- - Add `paddle.vision.models.wide_resnet50_2`, and `paddle.vision.models.wide_resnet101_2`, to use WideResNet models directly. ([#36952](https://github.com/PaddlePaddle/Paddle/pull/36952))
-
- - Add `paddle.vision.ops.nms` API, to support single-category and multi-category non-maximum suppression (NMS) algorithms, which accelerates object detection and prediction tasks. ([#40962](https://github.com/PaddlePaddle/Paddle/pull/40962))
-
- - Add `paddle.vision.ops.roi_pool` and `paddle.vision.ops.RoIPool`, to support RoI region pooling operations in detection tasks. ([#36154](https://github.com/PaddlePaddle/Paddle/pull/36154))
-
- - Add `paddle.vision.ops.roi_align` and `paddle.vision.ops.RoIAlign`, to support RoI Align operations in detection tasks. ([#35102](https://github.com/PaddlePaddle/Paddle/pull/36154))
-
- - Add `paddle.text.ViterbiDecoder`, and `paddle.text.viterbi_decode` Viterbi decoding API, mainly for sequence tagging model prediction. ([#35778](https://github.com/PaddlePaddle/Paddle/pull/35778))
-
-- Add 11 Sparse class APIs, to support basic functions, such as creating Sparse Tensor in COO and CSR formats, and add C++ inter-conversion with Tensor.
-
- - `paddle.sparse.sparse_coo_tensor`, create Sparse Tensor in COO format. ([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780))
-
- - `paddle.sparse.sparse_csr_tensor`, create Sparse Tensor in CSR format. ([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780))
-
- - `paddle.sparse.ReLU`, support ReLU activation layer for SparseCooTensor. ([#40959](https://github.com/PaddlePaddle/Paddle/pull/40959))
-
- - `paddle.sparse.functional.relu`, support ReLU function of SparseCooTensor. ([#40959](https://github.com/PaddlePaddle/Paddle/pull/40959))
-
- - `Tensor.values()`, C++ method to get non-zero elements of a SparseCooTensor or SparseCsrTensor. ([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608))
-
- - `Tensor.indices()`, C++ method to get the coordinate information of a SparseCooTensor. ([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608))
-
- - `Tensor.crows()`, C++ method to get the compressed row information of the SparseCsrTensor. ([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608))
-
- - `Tensor.cols()`, C++ method to get the column information of the SparseCsrTensor. ([#40608](https://github.com/PaddlePaddle/Paddle/pull/40608))
-
- - `Tensor.to_sparse_coo()`, C++ method to convert a DenseTensor or SparseCsrTensor to a SparseCooTensor. ([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780))
-
- - `Tensor.to_sparse_csr()`, C++ method to convert a DenseTensor or SparseCooTensor to a SparseCsrTensor. ([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780))
-
- - `Tensor.to_dense()`, C++ method to convert a SparseCooTensor or SparseCsrTensor to a DenseTensor. ([#40780](https://github.com/PaddlePaddle/Paddle/pull/40780))
-
-- Add hardware related APIs
-
- - Add four GPU memory monitoring related APIs: `paddle.device.cuda.max_memory_allocated`, `paddle.device.cuda.max_memory_reserved`, `paddle.device.cuda.memory_allocated`, and `paddle.device.cuda.memory_reserved`, to view and analyze the GPU memory usage in real-time. ([#38657](https://github.com/PaddlePaddle/Paddle/pull/38657))
-
- - Add `paddle.device.cuda.get_device_properties`, to return the properties of the GPU device. ([#35661](https://github.com/PaddlePaddle/Paddle/pull/35661))
-
- - Add `paddle.device.cuda.get_device_name` and `paddle.device.cuda.get_device_capability`, to return the name and compute capability of the GPU device. ([#35672](https://github.com/PaddlePaddle/Paddle/pull/35672))
-
-- Add Tensor operation API
-
- - Add `paddle.nansum`, to sum the input Tensor along `axis`, ignoring `NaN` values. ([#38137](https://github.com/PaddlePaddle/Paddle/pull/38137))
-
- - Add `paddle.nanmean`, to average the input Tensor along `axis`, ignoring `NaN` values. ([#40472](https://github.com/PaddlePaddle/Paddle/pull/40472))
-
- - Add `paddle.clone`, to return a copy of the input Tensor and provide gradient calculation. ([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020))
-
- - Add `paddle.Tensor.element_size`, to return the number of bytes allocated for a single element in a Tensor. ([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020))
-
- - Add `paddle.Tensor.to_uva_tensor`, to convert numpy objects into Tensors that CUDA can access through virtual addresses while the data is physically stored in CPU memory. ([#39146](https://github.com/PaddlePaddle/Paddle/pull/39146), [#38950](https://github.com/PaddlePaddle/Paddle/pull/38950))
-
- - Add `paddle.rot90`, to rotate the n-dimensional Tensor by 90 degrees along the plane specified by `axes`. ([#37634](https://github.com/PaddlePaddle/Paddle/pull/37634))
-
- - Add `paddle.logit` and `paddle.Tensor.logit`, to compute the logit function values for input Tensor. ([#37844](https://github.com/PaddlePaddle/Paddle/pull/37844))
-
- - Add `paddle.repeat_interleave`, to copy the input along the specified axis, and return a new Tensor. ([#37981](https://github.com/PaddlePaddle/Paddle/pull/37981))
-
- - Add `paddle.renorm`, to split the Tensor into multiple pieces at the specified `axis` and then perform p norm operations separately. ([#38130](https://github.com/PaddlePaddle/Paddle/pull/38130), [#38459](https://github.com/PaddlePaddle/Paddle/pull/38459))
-
- - Add `paddle.mode` and `paddle.Tensor.mode`, to find the mode (most frequent value) of the input Tensor and its index along the specified axis. ([#38446](https://github.com/PaddlePaddle/Paddle/pull/38446))
-
- - Add `paddle.quantile` and `paddle.Tensor.quantile`, to compute the q-quantile of a Tensor along the specified axis. ([#38567](https://github.com/PaddlePaddle/Paddle/pull/38567))
-
- - Add `paddle.kthvalue` and `paddle.Tensor.kthvalue`, to find the value and index of the kth smallest element along the specified axis. ([#38386](https://github.com/PaddlePaddle/Paddle/pull/38386))
-
- - Add `paddle.is_floating_point` and `paddle.Tensor.is_floating_point`, to determine if the input Tensor is the floating point type. ([#37885](https://github.com/PaddlePaddle/Paddle/pull/37885))
-
- - Add `paddle.erfinv` and `paddle.Tensor.erfinv`, to compute the inverse error function of the input Tensor. ([#38295](https://github.com/PaddlePaddle/Paddle/pull/38295))
-
- - Add `paddle.lerp` and `paddle.Tensor.lerp`, to compute linear interpolation among the input Tensors based on the given weights. ([#37253](https://github.com/PaddlePaddle/Paddle/pull/37253))
-
- - Add `paddle.angle`, to compute the phase angle of a complex Tensor. ([#37689](https://github.com/PaddlePaddle/Paddle/pull/37689))
-
- - Add `paddle.rad2deg` and `paddle.Tensor.rad2deg`, to convert each element of the input from radians to degrees. ([#37598](https://github.com/PaddlePaddle/Paddle/pull/37598))
-
- - Add `paddle.deg2rad` and `paddle.Tensor.deg2rad`, to convert each element of the input from degrees to radians. ([#37598](https://github.com/PaddlePaddle/Paddle/pull/37598))
-
- - Add `paddle.gcd` and `paddle.Tensor.gcd`, to compute the greatest common divisors of the absolute values of two inputs by element. ([#37819](https://github.com/PaddlePaddle/Paddle/pull/37819))
-
- - Add `paddle.lcm` and `paddle.Tensor.lcm`, to compute the least common multiple of the absolute value of two inputs by element. ([#37819](https://github.com/PaddlePaddle/Paddle/pull/37819))
-
- - Add `paddle.amax` and `paddle.Tensor.amax`, to get the maximum value of Tensor elements along the specified dimension. ([#38417](https://github.com/PaddlePaddle/Paddle/pull/38417))
-
- - Add `paddle.amin` and `paddle.Tensor.amin`, to get the minimum value of Tensor elements along the specified dimension. ([#38417](https://github.com/PaddlePaddle/Paddle/pull/38417))
-
- - Add `paddle.isclose`, to determine if each element of two Tensors is close to each other. ([#37135](https://github.com/PaddlePaddle/Paddle/pull/37135))
-
- - Add `paddle.put_along_axis` and `paddle.take_along_axis`, for extracting or placing elements with specified index subscripts. ([#38608](https://github.com/PaddlePaddle/Paddle/pull/38608))
-
- - Add `paddle.bincount` and `paddle.Tensor.bincount`, for counting the number of occurrences of each element in a Tensor. ([#36317](https://github.com/PaddlePaddle/Paddle/pull/36317))
-
- - Add `paddle.fmax` and `paddle.fmin`, to extend the max/min function to support the case of NaN values in the two Tensors. If exactly one of the two values at a position is NaN, return the non-NaN value; if both values are NaN, return NaN. ([#37826](https://github.com/PaddlePaddle/Paddle/pull/37826))
-
- - Add `paddle.diff`, for computing the nth forward difference along a given dimension. It currently supports n=1. ([#37441](https://github.com/PaddlePaddle/Paddle/pull/37441))
-
- - Add inverse hyperbolic functions: `paddle.asinh`, `paddle.acosh`, and `paddle.atanh`. ([#37076](https://github.com/PaddlePaddle/Paddle/pull/37076))
-
- - Add `paddle.as_real` and `paddle.as_complex` for conversion between real Tensor and complex Tensor. ([#37784](https://github.com/PaddlePaddle/Paddle/pull/37784))
-
- - Add `paddle.complex`, for constructing a complex Tensor with the given real and imaginary parts. ([#37918](https://github.com/PaddlePaddle/Paddle/pull/37918), [#38272](https://github.com/PaddlePaddle/Paddle/pull/38272))
-
- - Add `paddle.det` and `paddle.slogdet`, to compute the determinant of a matrix and the natural logarithm of the determinant. ([#34992](https://github.com/PaddlePaddle/Paddle/pull/34992))
-
- - Add `paddle.nn.utils.parameters_to_vector`, to flatten parameters to a 1-D Tensor. ([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020))
-
- - Add `paddle.nn.utils.vector_to_parameters`, to transform a Tensor with 1-D shape to the parameters. ([#38020](https://github.com/PaddlePaddle/Paddle/pull/38020))
-
-- Add networking class APIs
-
- - Add `paddle.nn.Fold` and `paddle.nn.functional.fold`, to combine sliding local area blocks of a batch of Tensors into a larger Tensor. ([#38613](https://github.com/PaddlePaddle/Paddle/pull/38613))
-
- - Add `paddle.nn.CELU` and `paddle.nn.functional.celu`, to support the CELU activation layer. ([#36088](https://github.com/PaddlePaddle/Paddle/pull/36088))
-
- - Add `paddle.nn.HingeEmbeddingLoss`, to compute the hinge embedding loss. It is usually used for nonlinear embedding or semi-supervised learning. ([#37540](https://github.com/PaddlePaddle/Paddle/pull/37540))
-
- - Add `paddle.nn.ZeroPad2D` API, for zero-padding according to the padding property. ([#37151](https://github.com/PaddlePaddle/Paddle/pull/37151))
-
- - Add `paddle.nn.MaxUnPool3D` and `paddle.nn.MaxUnPool1D`, for computing 3D and 1D max unpooling (the inverse of max pooling). ([#38716](https://github.com/PaddlePaddle/Paddle/pull/38716))
-
- - Add `paddle.incubate.graph_khop_sampler`, `paddle.incubate.graph_sample_neighbors`, and `paddle.incubate.graph_reindex` APIs, to support graph multi-order neighbor sampling and graph reindexing operations. They are mainly used for graph neural network model training. ([#39146](https://github.com/PaddlePaddle/Paddle/pull/39146), [#40809](https://github.com/PaddlePaddle/Paddle/pull/40809))
-
-- Add random number class APIs
-
- - Add `paddle.poisson`, to generate a Tensor whose elements follow a Poisson distribution with the given lambda parameter. ([#38117](https://github.com/PaddlePaddle/Paddle/pull/38117))
-
- - Add `paddle.randint_like` API, to generate a new Tensor whose elements follow a uniform distribution in the range [low, high), with the shape of the output matching the shape of the input. ([#36169](https://github.com/PaddlePaddle/Paddle/pull/36169))
-
- - Add `paddle.Tensor.exponential_`. It is an inplace style API that populates the input Tensor with exponentially distributed random numbers. ([#38256](https://github.com/PaddlePaddle/Paddle/pull/38256))
-
-- Add parameter initialization class APIs
-
- - Add `paddle.nn.initializer.Dirac`, to initialize 3D/4D/5D parameters with Dirac delta functions. It is commonly used for initialization of Conv1D/Conv2D/Conv3D parameters in the convolution layer. ([#37389](https://github.com/PaddlePaddle/Paddle/pull/37389))
-
- - Add `paddle.nn.initializer.Orthogonal` for orthogonal matrix initialization. The initialized parameter is a (semi-)orthogonal matrix. ([#37163](https://github.com/PaddlePaddle/Paddle/pull/37163))
-
- - Add `paddle.nn.initializer.calculate_gain`, to get the recommended gain value for the activation function. The gain value can be used to set certain initialization APIs to adjust the initialization range. ([#37163](https://github.com/PaddlePaddle/Paddle/pull/37163))
-
-- Add learning rate class API
-
- - Add `paddle.optimizer.lr.MultiplicativeDecay`, to set the learning rate through a user-provided `lambda` function. ([#38250](https://github.com/PaddlePaddle/Paddle/pull/38250))
-
-- Add distributed-related APIs
-
- - Add `paddle.incubate.optimizer.DistributedFusedLamb`, to allow the Lamb optimizer to update parameters in a distributed manner. ([#40011](https://github.com/PaddlePaddle/Paddle/pull/40011), [#39972](https://github.com/PaddlePaddle/Paddle/pull/39972), [#39900](https://github.com/PaddlePaddle/Paddle/pull/39900), [#39747](https://github.com/PaddlePaddle/Paddle/pull/39747), [#39148](https://github.com/PaddlePaddle/Paddle/pull/39148), [#39416](https://github.com/PaddlePaddle/Paddle/pull/39416))
-
-- Add new optimizer-related APIs ([#40710](https://github.com/PaddlePaddle/Paddle/pull/40710))
-
- - `paddle.incubate.optimizer.functional.minimize_bfgs`: add the second-order optimizer BFGS.
-
- - `paddle.incubate.optimizer.functional.minimize_lbfgs`: add the second-order optimizer L-BFGS.
-
-- Add `paddle.incubate.multiprocessing` module, to provide Tensor (CPU/GPU) data transfer between Python processes. ([#37302](https://github.com/PaddlePaddle/Paddle/pull/37302), [#41339](https://github.com/PaddlePaddle/Paddle/pull/41339))
-
-- Add `paddle.incubate.autotune.set_config` API, to support multi-version Kernel auto-selection, mixed precision data layout auto-conversion, and num_workers auto-selection for DataLoader to automatically improve model performance. ([#42301](https://github.com/PaddlePaddle/Paddle/pull/42301))
-
-- Add `paddle.incubate.nn.FusedMultiTransformer` and `paddle.incubate.nn.functional.fused_multi_transformer` API, to fuse multiple layers of transformers into a single op to improve model inference performance. It should be noted that only forward is supported. ([#42311](https://github.com/PaddlePaddle/Paddle/pull/42311))
-
-- Add einsum_v2 operators for a consistent interface between dynamic graph mode and static graph mode. They are compatible with the original Python-side `paddle.einsum` implementation, while supporting dynamic-to-static export and more complete InferShape inference. ([#42495](https://github.com/PaddlePaddle/Paddle/pull/42495), [#42327](https://github.com/PaddlePaddle/Paddle/pull/42327), [#42397](https://github.com/PaddlePaddle/Paddle/pull/42397), [#42105](https://github.com/PaddlePaddle/Paddle/pull/42105))
-
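The NaN handling described for `paddle.fmax` and `paddle.fmin` in the list above can be illustrated with a small plain-Python sketch. This is not Paddle's implementation and does not call Paddle; the helper names `fmax_scalar`, `fmin_scalar`, and `fmax_elementwise` are hypothetical and exist only to show the rule "prefer the non-NaN value, return NaN only when both inputs are NaN".

```python
import math


def fmax_scalar(a: float, b: float) -> float:
    """NaN-aware maximum of two scalars (hypothetical helper, not a Paddle API)."""
    if math.isnan(a):
        return b  # if b is also NaN, NaN is returned
    if math.isnan(b):
        return a
    return max(a, b)


def fmin_scalar(a: float, b: float) -> float:
    """NaN-aware minimum of two scalars (hypothetical helper, not a Paddle API)."""
    if math.isnan(a):
        return b  # if b is also NaN, NaN is returned
    if math.isnan(b):
        return a
    return min(a, b)


def fmax_elementwise(xs, ys):
    """Apply fmax_scalar position by position to two equal-length sequences."""
    return [fmax_scalar(x, y) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    nan = float("nan")
    # Positions: (1, NaN) -> 1, (NaN, 2) -> 2, (NaN, NaN) -> NaN
    print(fmax_elementwise([1.0, nan, nan], [nan, 2.0, nan]))
```

An ordinary maximum would propagate NaN whenever either input is NaN; the sketch differs only in the two early returns.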
-
-#### IR (Intermediate Representation)
-
-- Dynamic graph to static graph
-
- - In the StaticAnalysis module for variable types, add support for type tagging of statements such as `a, b = paddle.shape(x)`. ([#39245](https://github.com/PaddlePaddle/Paddle/pull/39245))
-
- - Add a computed field, supporting `InputSpec.name` as the Program cache hash key. ([#38273](https://github.com/PaddlePaddle/Paddle/pull/38273))
-
- - Add syntax for supporting `dict['key'] = x.shape`. ([#40611](https://github.com/PaddlePaddle/Paddle/pull/40611))
-
- - Add the support for Pure FP16 training. ([#36944](https://github.com/PaddlePaddle/Paddle/pull/36944))
-
- - Add support for the `for i in [x,y,z]` syntax. ([#37259](https://github.com/PaddlePaddle/Paddle/pull/37259))
-
- - Add support for the type hint syntax of Python 3. ([#36544](https://github.com/PaddlePaddle/Paddle/pull/36544))
-
-- Pass development
-
- - Add forward and backward fusion for FC + [relu|gelu] based on NVIDIA cuBLASLt Epilogue. ([#39437](https://github.com/PaddlePaddle/Paddle/pull/39437))
-
-- Kernel Primitive API
-
- - Add KP operators on GPU platform, including cast, scale, clip, bce_loss, abs_grad, reduce_sum_grad, reduce_mean_grad, full, full_like, distribution, random, masked_select_kernel, where_index, masked_select_grad, dropout, sigmoid, and where. ([#36203](https://github.com/PaddlePaddle/Paddle/pull/36203), [#36423](https://github.com/PaddlePaddle/Paddle/pull/36423), [#39390](https://github.com/PaddlePaddle/Paddle/pull/39390), [#39734](https://github.com/PaddlePaddle/Paddle/pull/39734), [#38500](https://github.com/PaddlePaddle/Paddle/pull/38500), [#38959](https://github.com/PaddlePaddle/Paddle/pull/38959), [#39197](https://github.com/PaddlePaddle/Paddle/pull/39197/), [#39563](https://github.com/PaddlePaddle/Paddle/pull/39563), [#39666](https://github.com/PaddlePaddle/Paddle/pull/39666), [#40517](https://github.com/PaddlePaddle/Paddle/pull/40517), [#40617](https://github.com/PaddlePaddle/Paddle/pull/40617), [#40766](https://github.com/PaddlePaddle/Paddle/pull/40766), [#39898](https://github.com/PaddlePaddle/Paddle/pull/39898), [#39609](https://github.com/PaddlePaddle/Paddle/pull/39609))
-
- - Add the support for XPU2 source code compilation mode. ([#37254](https://github.com/PaddlePaddle/Paddle/pull/37254), [#40397](https://github.com/PaddlePaddle/Paddle/pull/40397), [#38455](https://github.com/PaddlePaddle/Paddle/pull/38455))
-
- - Add the support for KP operator reuse on XPU2 and GPU, including reduce, broadcast, elementwise_add, exp, log, relu, sigmoid, leaky_relu, softplus, hard_swish, and reciprocal. ([#36904](https://github.com/PaddlePaddle/Paddle/pull/36904), [#37226](https://github.com/PaddlePaddle/Paddle/pull/37226), [#38918](https://github.com/PaddlePaddle/Paddle/pull/38918), [#40560](https://github.com/PaddlePaddle/Paddle/pull/40560/), [#39787](https://github.com/PaddlePaddle/Paddle/pull/39787), [#39917](https://github.com/PaddlePaddle/Paddle/pull/39917), [#40002](https://github.com/PaddlePaddle/Paddle/pull/40002), [#40364](https://github.com/PaddlePaddle/Paddle/pull/40364))
-
- - Add unit tests of KP operators on the XPU2 platform, including brelu, ceil, celu, elu, floor, hard_shrink, hard_sigmoid, log1p, logsigmoid, relu6, silu, soft_relu, softsign, sqrt, square, swish, thresholded_relu, and softshrink. ([#40448](https://github.com/PaddlePaddle/Paddle/pull/40448), [#40524](https://github.com/PaddlePaddle/Paddle/pull/40524))
-
- - Add the support for XPU2 KP models, including resnet50, deepfm, wide_deep, yolov3-darknet53, det_mv3_db, bert, transformer, mobilenet_v3, and GPT2.
-
-
-#### **Mixed Precision Training**
-
-- Split the `paddle.amp.GradScaler.unscale_` method from the `minimize` of the mixed precision training `paddle.amp.GradScaler`, to provide a separate interface for recovering the loss. ([#35825](https://github.com/PaddlePaddle/Paddle/pull/35825))
-
-- Add the FP16 support for `paddle.nn.ClipByGlobalNorm` dynamic graph mode. Add FP16 Kernel for clip op to enable clip-related operations to support FP16 compute. ([#36198](https://github.com/PaddlePaddle/Paddle/pull/36198), [#36577](https://github.com/PaddlePaddle/Paddle/pull/36577))
-
-- Support passing `None` as the `optimizer` parameter of `paddle.amp.decorate`. ([#37541](https://github.com/PaddlePaddle/Paddle/pull/37541))
-
-- For the merged_momentum op, add support for multiple input learning rates, the use_nesterov policy, and regularization computation. ([#37527](https://github.com/PaddlePaddle/Paddle/pull/37527))
-
-- Add multi_tensor policy to `paddle.optimizer.Momentum` optimizer. Add `set_to_zero` branch to `clear_grad` of `Optimizer` class. ([#37564](https://github.com/PaddlePaddle/Paddle/pull/37564))
-
-- Add multi_tensor policy to `paddle.optimizer.Adam`. ([#38010](https://github.com/PaddlePaddle/Paddle/pull/38010))
-
-- Add multi_precision policy to `paddle.optimizer.SGD` optimizer. ([#38231](https://github.com/PaddlePaddle/Paddle/pull/38231))
-
-- Store the `master weight` parameters in the optimizer `state_dict`. ([#39121](https://github.com/PaddlePaddle/Paddle/pull/39121))
-
-- Add support for bfloat16 mixed precision training with CUDA ops, in both O1 and O2 modes. Enable these training modes via `paddle.amp.auto_cast`. ([#39029](https://github.com/PaddlePaddle/Paddle/pull/39029), [#39815](https://github.com/PaddlePaddle/Paddle/pull/39815))
-
-- Add bfloat16 CUDA Kernel for the following ops: matmul, concat, split, dropout, reshape, slice, squeeze, stack, transpose, unbind, elementwise_max, elementwise_add, elementwise_mul, elementwise_sub, scale, sum, layer_norm, p_norm, reduce_sum, softmax, log_softmax, sigmoid, sqrt, softplus, square, gaussian_random, fill_constant, and fill_any_like. ([#39485](https://github.com/PaddlePaddle/Paddle/pull/39485), [#39380](https://github.com/PaddlePaddle/Paddle/pull/39380), [#39395](https://github.com/PaddlePaddle/Paddle/pull/39395), [#39402](https://github.com/PaddlePaddle/Paddle/pull/39402), [#39457](https://github.com/PaddlePaddle/Paddle/pull/39457), [#39461](https://github.com/PaddlePaddle/Paddle/pull/39461), [#39602](https://github.com/PaddlePaddle/Paddle/pull/39602), [#39716](https://github.com/PaddlePaddle/Paddle/pull/39716), [#39683](https://github.com/PaddlePaddle/Paddle/pull/39683), [#39843](https://github.com/PaddlePaddle/Paddle/pull/39843), [#39999](https://github.com/PaddlePaddle/Paddle/pull/39999), [#40004](https://github.com/PaddlePaddle/Paddle/pull/40004), [#40027](https://github.com/PaddlePaddle/Paddle/pull/40027))
-
-- Add bfloat16 CPU Kernel for the following ops: dropout, reshape, slice, squeeze, unsqueeze, stack, transpose, unbind, elementwise_max, elementwise_mul, elementwise_sub, and gather. ([#39380](https://github.com/PaddlePaddle/Paddle/pull/39380), [#39395](https://github.com/PaddlePaddle/Paddle/pull/39395), [#39402](https://github.com/PaddlePaddle/Paddle/pull/39402), [#39457](https://github.com/PaddlePaddle/Paddle/pull/39457), [#39461](https://github.com/PaddlePaddle/Paddle/pull/39461), [#39602](https://github.com/PaddlePaddle/Paddle/pull/39602), [#39716](https://github.com/PaddlePaddle/Paddle/pull/39716), [#39683](https://github.com/PaddlePaddle/Paddle/pull/39683))
-
-- Support printing of Tensor with data of bfloat16. ([#39375](https://github.com/PaddlePaddle/Paddle/pull/39375), [#39370](https://github.com/PaddlePaddle/Paddle/pull/39370))
-
-- Add support for FP16 computation for `p_norm`, `elementwise_max`, `fill_constant_batch_size_like`, and `scatter`. ([#35888](https://github.com/PaddlePaddle/Paddle/pull/35888), [#39907](https://github.com/PaddlePaddle/Paddle/pull/39907), [#38136](https://github.com/PaddlePaddle/Paddle/pull/38136), [#38499](https://github.com/PaddlePaddle/Paddle/pull/38499))
-
-- Add support for int16_t for the following ops: cumsum, less_than, less_equal, greater_than, greater_equal, equal, not_equal, fill_any_like, gather_nd, reduce_sum, where_index, reshape, and unsqueeze. ([#39636](https://github.com/PaddlePaddle/Paddle/pull/39636))
-
-- Add support for int16_t label type for cross_entropy op. ([#39409](https://github.com/PaddlePaddle/Paddle/pull/39409))
-
-- Add support for int16_t id type for embedding op. ([#39381](https://github.com/PaddlePaddle/Paddle/pull/39381))
-
-- Add support for FP16 type for reduce_mean op. ([#38289](https://github.com/PaddlePaddle/Paddle/pull/38289))
-
-- Add support for FP16 type for elementwise_min op. ([#38123](https://github.com/PaddlePaddle/Paddle/pull/38123))
-
-- Update bfloat16 AMP oneDNN default support list. ([#39304](https://github.com/PaddlePaddle/Paddle/pull/39304))
-
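Splitting `unscale_` out of `minimize`, as noted at the top of this list, matters mostly when gradients have to be inspected or clipped before the optimizer step. The plain-Python sketch below illustrates that ordering under stated assumptions: it does not use Paddle, the helpers `unscale` and `clip_by_global_norm` are hypothetical, and gradients are modelled as a flat list of floats rather than Tensors.

```python
import math


def unscale(scaled_grads, loss_scale):
    """Divide gradients by the loss scale and report non-finite values.

    Hypothetical helper: stands in for the separate unscale step.
    """
    inv = 1.0 / loss_scale
    grads = [g * inv for g in scaled_grads]
    found_inf = any(math.isinf(g) or math.isnan(g) for g in grads)
    return grads, found_inf


def clip_by_global_norm(grads, clip_norm):
    """Rescale gradients so that their global L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= clip_norm:
        return list(grads)
    factor = clip_norm / global_norm
    return [g * factor for g in grads]


# Gradients as they would look after backpropagating a scaled loss.
loss_scale = 1024.0
scaled_grads = [3.0 * loss_scale, 4.0 * loss_scale]

# 1) Recover the true gradients first ...
grads, found_inf = unscale(scaled_grads, loss_scale)

# 2) ... so that clipping sees real magnitudes (global norm 5.0 here),
#    and skip the update entirely if overflow was detected.
if not found_inf:
    grads = clip_by_global_norm(grads, clip_norm=1.0)
```

Clipping the still-scaled gradients with the same `clip_norm` would shrink them by an extra factor of `loss_scale`, which is why the unscale step has to be callable on its own.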
-
-#### **Paddle HIgh reusability operator library**
-
-We announce PHI as the new Paddle HIgh reusability operator library. PHI provides a Primitive API, enabling kernel reuse in operator development. As a refactored functional operator library, PHI aims to solve legacy problems that harm the framework's performance and reusability, particularly in operator development. Such problems include inefficient reuse between operators, unclear operator interfaces, and the lack of direct calls to the operator library in C++. With PHI, new operators can be easily implemented by composing functions available in the functional library. The library provides over 200 C++ operator class APIs and nearly 500 kernels. Composing new operators from these built-in functions can greatly reduce development effort. PHI supports different types of hardware (e.g., GPU and XPU). In addition, PHI is extensible with plugins that accommodate third-party accelerators (such as NPU) in a low-cost and reusable fashion. In short, PHI supports low-level operator composability, kernel reuse through Primitives, and accelerators through plugins. The main contents include six parts:
-
-- **The implementation of the operator library infrastructure, core components and mechanisms**: Plan the directory structure of the new operator library; design and implement its common base data structures, the new functional InferMeta and Kernel development paradigm, and the corresponding registration and management components. Support automatic generation of compilation objects and compilation dependencies for Kernel files, so that developers only need to focus on the functional Kernel implementation and the development paradigm stays clear and concise. ([#34425](https://github.com/PaddlePaddle/Paddle/pull/34425), [#37107](https://github.com/PaddlePaddle/Paddle/pull/37107), [#36946](https://github.com/PaddlePaddle/Paddle/pull/36946), [#36948](https://github.com/PaddlePaddle/Paddle/pull/36948), [#37876](https://github.com/PaddlePaddle/Paddle/pull/37876), [#37916](https://github.com/PaddlePaddle/Paddle/pull/37916), [#37977](https://github.com/PaddlePaddle/Paddle/pull/37977), [#38078](https://github.com/PaddlePaddle/Paddle/pull/38078), [#38861](https://github.com/PaddlePaddle/Paddle/pull/38861), [#39123](https://github.com/PaddlePaddle/Paddle/pull/39123), [#39131](https://github.com/PaddlePaddle/Paddle/pull/39131), [#39748](https://github.com/PaddlePaddle/Paddle/pull/39748), [#39790](https://github.com/PaddlePaddle/Paddle/pull/39790), [#39941](https://github.com/PaddlePaddle/Paddle/pull/39941), [#40239](https://github.com/PaddlePaddle/Paddle/pull/40239), [#40635](https://github.com/PaddlePaddle/Paddle/pull/40635), [#41091](https://github.com/PaddlePaddle/Paddle/pull/41091), [#37409](https://github.com/PaddlePaddle/Paddle/pull/37409), [#37942](https://github.com/PaddlePaddle/Paddle/pull/37942), [#39002](https://github.com/PaddlePaddle/Paddle/pull/39002), [#38109](https://github.com/PaddlePaddle/Paddle/pull/38109), [#37881](https://github.com/PaddlePaddle/Paddle/pull/37881), [#37517](https://github.com/PaddlePaddle/Paddle/pull/37517), [#39870](https://github.com/PaddlePaddle/Paddle/pull/39870), [#40975](https://github.com/PaddlePaddle/Paddle/pull/40975), [#39475](https://github.com/PaddlePaddle/Paddle/pull/39475), [#37304](https://github.com/PaddlePaddle/Paddle/pull/37304), #36910, #37120, #37146, #37215, #37255, #37369, #38258, #38257, #38355, #38853, #38937, #38977, #38946, #39085, #39153, #39228, #38301, #38275, #38506, #38607, #38473, #38632, #38811, #38880, #38996, #38914, #39101)
-
-- **Operator library C++ API system construction**: design and implement yaml configuration file-based operator definition paradigm, to automatically generate more than 200 C++ operator class APIs for internal and external developers to reuse. This reduces the cost of repeated development of basic operators. ([#37668](https://github.com/PaddlePaddle/Paddle/pull/37668), [#36938](https://github.com/PaddlePaddle/Paddle/pull/36938), [#38172](https://github.com/PaddlePaddle/Paddle/pull/38172), [#38182](https://github.com/PaddlePaddle/Paddle/pull/38182), [#38311](https://github.com/PaddlePaddle/Paddle/pull/38311), [#38438](https://github.com/PaddlePaddle/Paddle/pull/38438), [#39057](https://github.com/PaddlePaddle/Paddle/pull/39057), [#39229](https://github.com/PaddlePaddle/Paddle/pull/39229), [#39281](https://github.com/PaddlePaddle/Paddle/pull/39281), [#39263](https://github.com/PaddlePaddle/Paddle/pull/39263), [#39408](https://github.com/PaddlePaddle/Paddle/pull/39408), [#39436](https://github.com/PaddlePaddle/Paddle/pull/39436), [#39482](https://github.com/PaddlePaddle/Paddle/pull/39482), [#39497](https://github.com/PaddlePaddle/Paddle/pull/39497), [#39651](https://github.com/PaddlePaddle/Paddle/pull/39651), [#39521](https://github.com/PaddlePaddle/Paddle/pull/39521), [#39760](https://github.com/PaddlePaddle/Paddle/pull/39760), [#40060](https://github.com/PaddlePaddle/Paddle/pull/40060), [#40196](https://github.com/PaddlePaddle/Paddle/pull/40196), [#40218](https://github.com/PaddlePaddle/Paddle/pull/40218), [#40640](https://github.com/PaddlePaddle/Paddle/pull/40640), [#40732](https://github.com/PaddlePaddle/Paddle/pull/40732), [#40729](https://github.com/PaddlePaddle/Paddle/pull/40729), [#40840](https://github.com/PaddlePaddle/Paddle/pull/40840), [#40867](https://github.com/PaddlePaddle/Paddle/pull/40867), [#41025](https://github.com/PaddlePaddle/Paddle/pull/41025), [#41368](https://github.com/PaddlePaddle/Paddle/pull/41368))
-
-- **Operator library compatible with various execution systems**: Implement new InferMeta and Kernel to access the original dynamic and static graph execution system. Support the safe removal of the original OpKernel registration and migration to the new Kernel form. ([#34425](https://github.com/PaddlePaddle/Paddle/pull/34425), [#38825](https://github.com/PaddlePaddle/Paddle/pull/38825), [#38837](https://github.com/PaddlePaddle/Paddle/pull/38837), [#38842](https://github.com/PaddlePaddle/Paddle/pull/38842), [#38976](https://github.com/PaddlePaddle/Paddle/pull/38976), [#39134](https://github.com/PaddlePaddle/Paddle/pull/39134), [#39140](https://github.com/PaddlePaddle/Paddle/pull/39140), [#39135](https://github.com/PaddlePaddle/Paddle/pull/39135), [#39252](https://github.com/PaddlePaddle/Paddle/pull/39252), [#39222](https://github.com/PaddlePaddle/Paddle/pull/39222), [#39351](https://github.com/PaddlePaddle/Paddle/pull/39351))
-
-- **Decouple the underlying data structures and tool functions of the operator library from the framework**: Remove PHI's dependence on the framework for core data structures, lay the foundation for subsequent independent compilation of PHI, and support infrt, custom Kernel, and a series of PHI-based construction work. ([#38583](https://github.com/PaddlePaddle/Paddle/pull/38583), [#39188](https://github.com/PaddlePaddle/Paddle/pull/39188), [#39560](https://github.com/PaddlePaddle/Paddle/pull/39560), [#39931](https://github.com/PaddlePaddle/Paddle/pull/39931), [#39169](https://github.com/PaddlePaddle/Paddle/pull/39169), [#38951](https://github.com/PaddlePaddle/Paddle/pull/38951), [#38898](https://github.com/PaddlePaddle/Paddle/pull/38898), [#38873](https://github.com/PaddlePaddle/Paddle/pull/38873), [#38696](https://github.com/PaddlePaddle/Paddle/pull/38696), [#38651](https://github.com/PaddlePaddle/Paddle/pull/38651), [#39359](https://github.com/PaddlePaddle/Paddle/pull/39359), [#39305](https://github.com/PaddlePaddle/Paddle/pull/39305), [#39234](https://github.com/PaddlePaddle/Paddle/pull/39234), [#39098](https://github.com/PaddlePaddle/Paddle/pull/39098), [#39120](https://github.com/PaddlePaddle/Paddle/pull/39120), [#38979](https://github.com/PaddlePaddle/Paddle/pull/38979), [#38899](https://github.com/PaddlePaddle/Paddle/pull/38899), [#38844](https://github.com/PaddlePaddle/Paddle/pull/38844), [#39714](https://github.com/PaddlePaddle/Paddle/pull/39714), [#39729](https://github.com/PaddlePaddle/Paddle/pull/39729), [#39889](https://github.com/PaddlePaddle/Paddle/pull/39889), [#39587](https://github.com/PaddlePaddle/Paddle/pull/39587), [#39558](https://github.com/PaddlePaddle/Paddle/pull/39558), [#39514](https://github.com/PaddlePaddle/Paddle/pull/39514), [#39502](https://github.com/PaddlePaddle/Paddle/pull/39502), [#39300](https://github.com/PaddlePaddle/Paddle/pull/39300), [#39246](https://github.com/PaddlePaddle/Paddle/pull/39246), [#39124](https://github.com/PaddlePaddle/Paddle/pull/39124))
-
-- **Integration and improvement of the custom operator mechanism with PHI**: Support calling the more than 200 C++ operator class APIs automatically generated by PHI when writing custom operators, which reduces custom operator development costs. A series of bugs are also fixed. ([#37122](https://github.com/PaddlePaddle/Paddle/pull/37122), [#37276](https://github.com/PaddlePaddle/Paddle/pull/37276), [#37281](https://github.com/PaddlePaddle/Paddle/pull/37281), [#37262](https://github.com/PaddlePaddle/Paddle/pull/37262), [#37415](https://github.com/PaddlePaddle/Paddle/pull/37415), [#37423](https://github.com/PaddlePaddle/Paddle/pull/37423), [#37583](https://github.com/PaddlePaddle/Paddle/pull/37683), [#38776](https://github.com/PaddlePaddle/Paddle/pull/38776), [#39353](https://github.com/PaddlePaddle/Paddle/pull/39353), [#41072](https://github.com/PaddlePaddle/Paddle/pull/41072))
-
-- **Large-scale operator migration and refactoring**: Migrate about 250 high-frequency forward and backward operator Kernels to the new operator library and refactor each of them as a single function. High-performance operators can then be built on the C++ side by quickly combining multiple encapsulated base Kernel functions. Meanwhile, add the corresponding yaml operator definitions and connect them to the new dynamic graph execution system to improve Python API scheduling performance. The migrated and refactored operators include:
-
- - sqrt ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - square ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - sin ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - sinh ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - elementwise_fmax ([#40140](https://github.com/PaddlePaddle/Paddle/pull/40140))
-
- - elementwise_fmin ([#40140](https://github.com/PaddlePaddle/Paddle/pull/40140))
-
- - pool2d ([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - max_pool2d_with_index ([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - pool3d ([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - max_pool3d_with_index ([#40208](https://github.com/PaddlePaddle/Paddle/pull/40208), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - fill_constant ([#36930](https://github.com/PaddlePaddle/Paddle/pull/36930), [#39465](https://github.com/PaddlePaddle/Paddle/pull/39465))
-
- - p_norm ([#40819](https://github.com/PaddlePaddle/Paddle/pull/40819))
-
- - fill_constant_batch_size_like ([#40784](https://github.com/PaddlePaddle/Paddle/pull/40784))
-
- - conv2d ([#39354](https://github.com/PaddlePaddle/Paddle/pull/39354))
-
- - conv2d_transpose ([#40675](https://github.com/PaddlePaddle/Paddle/pull/40675), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - conv3d ([#39354](https://github.com/PaddlePaddle/Paddle/pull/39354))
-
- - conv3d_transpose ([#40675](https://github.com/PaddlePaddle/Paddle/pull/40675), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - mish ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - gather_nd ([#40090](https://github.com/PaddlePaddle/Paddle/pull/40090), [#40043](https://github.com/PaddlePaddle/Paddle/pull/40043))
-
- - gather ([#40500](https://github.com/PaddlePaddle/Paddle/pull/40500))
-
- - scatter ([#40090](https://github.com/PaddlePaddle/Paddle/pull/40090), [#40043](https://github.com/PaddlePaddle/Paddle/pull/40043))
-
- - scatter_nd_add ([#40090](https://github.com/PaddlePaddle/Paddle/pull/40090), [#40043](https://github.com/PaddlePaddle/Paddle/pull/40043))
-
- - sgd ([#40045](https://github.com/PaddlePaddle/Paddle/pull/40045))
-
- - momentum ([#41319](https://github.com/PaddlePaddle/Paddle/pull/41319))
-
- - rmsprop ([#40994](https://github.com/PaddlePaddle/Paddle/pull/40994))
-
- - index_sample ([#38130](https://github.com/PaddlePaddle/Paddle/pull/38130), [#38459](https://github.com/PaddlePaddle/Paddle/pull/38459), [#39905](https://github.com/PaddlePaddle/Paddle/pull/39905))
-
- - adam ([#40351](https://github.com/PaddlePaddle/Paddle/pull/40351))
-
- - layer_norm ([#40193](https://github.com/PaddlePaddle/Paddle/pull/40193))
-
- - adagrad ([#40994](https://github.com/PaddlePaddle/Paddle/pull/40994/))
-
- - adamax ([#40173](https://github.com/PaddlePaddle/Paddle/pull/40173))
-
- - adadelta ([#40173](https://github.com/PaddlePaddle/Paddle/pull/40173))
-
- - clip ([#40602](https://github.com/PaddlePaddle/Paddle/pull/40602), [#41661](https://github.com/PaddlePaddle/Paddle/pull/41661), [#41675](https://github.com/PaddlePaddle/Paddle/pull/41675))
-
- - ceil ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913))
-
- - cos ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - atan ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - cosh ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - erf ([#40388](https://github.com/PaddlePaddle/Paddle/pull/40388))
-
- - asin ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - acos ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - scale ([#39278](https://github.com/PaddlePaddle/Paddle/pull/39278))
-
- - elementwise_pow ([#40993](https://github.com/PaddlePaddle/Paddle/pull/40993))
-
- - elementwise_sub ([#39225](https://github.com/PaddlePaddle/Paddle/pull/39225), [#37260](https://github.com/PaddlePaddle/Paddle/pull/37260))
-
- - round ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913))
-
- - floor ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913))
-
- - pow ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913))
-
- - elementwise_floordiv ([#40993](https://github.com/PaddlePaddle/Paddle/pull/40993))
-
- - reciprocal ([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - log1p ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785))
-
- - allclose ([#40469](https://github.com/PaddlePaddle/Paddle/pull/40469))
-
- - mul ([#40833](https://github.com/PaddlePaddle/Paddle/pull/40833))
-
- - elementwise_max ([#40590](https://github.com/PaddlePaddle/Paddle/pull/40590))
-
- - elementwise_min ([#40590](https://github.com/PaddlePaddle/Paddle/pull/40590))
-
- - elementwise_mod ([#40590](https://github.com/PaddlePaddle/Paddle/pull/40590))
-
- - elementwise_add ([#39048](https://github.com/PaddlePaddle/Paddle/pull/39048), [#37043](https://github.com/PaddlePaddle/Paddle/pull/37043))
-
- - matmul_v2 ([#36844](https://github.com/PaddlePaddle/Paddle/pull/36844), [#38713](https://github.com/PaddlePaddle/Paddle/pull/38713))
-
- - elementwise_mul ([#41042](https://github.com/PaddlePaddle/Paddle/pull/41042), [#40252](https://github.com/PaddlePaddle/Paddle/pull/40252), [#37471](https://github.com/PaddlePaddle/Paddle/pull/37471))
-
- - elementwise_div ([#40172](https://github.com/PaddlePaddle/Paddle/pull/40172), [#40039](https://github.com/PaddlePaddle/Paddle/pull/40039), [#37418](https://github.com/PaddlePaddle/Paddle/pull/37418))
-
- - SelectedRows ([#39037](https://github.com/PaddlePaddle/Paddle/pull/39037), [#39087](https://github.com/PaddlePaddle/Paddle/pull/39087), [#39128](https://github.com/PaddlePaddle/Paddle/pull/39128), [#39162](https://github.com/PaddlePaddle/Paddle/pull/39162), [#39236](https://github.com/PaddlePaddle/Paddle/pull/39236))
-
- - fill_any_like ([#39807](https://github.com/PaddlePaddle/Paddle/pull/39807))
-
- - dot ([#38359](https://github.com/PaddlePaddle/Paddle/pull/38359))
-
- - sum ([#40873](https://github.com/PaddlePaddle/Paddle/pull/40873))
-
- - cumsum ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200))
-
- - diag_v2 ([#39914](https://github.com/PaddlePaddle/Paddle/pull/39914))
-
- - auc ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200))
-
- - log_loss ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200))
-
- - one_hot_v2 ([#39876](https://github.com/PaddlePaddle/Paddle/pull/39876))
-
- - sigmoid_cross_entropy_with_logits ([#39976](https://github.com/PaddlePaddle/Paddle/pull/39976), [#40200](https://github.com/PaddlePaddle/Paddle/pull/40200))
-
- - bce_loss ([#39868](https://github.com/PaddlePaddle/Paddle/pull/39868))
-
- - argsort ([#40151](https://github.com/PaddlePaddle/Paddle/pull/40151))
-
- - arg_max ([#40222](https://github.com/PaddlePaddle/Paddle/pull/40222))
-
- - arg_min ([#40222](https://github.com/PaddlePaddle/Paddle/pull/40222))
-
- - segment_pool ([#40099](https://github.com/PaddlePaddle/Paddle/pull/40099))
-
- - frobenius_norm ([#40707](https://github.com/PaddlePaddle/Paddle/pull/40707), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - dist ([#40178](https://github.com/PaddlePaddle/Paddle/pull/40178))
-
- - isnan_v2 ([#40076](https://github.com/PaddlePaddle/Paddle/pull/40076))
-
- - logical_and ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942))
-
- - logical_not ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942))
-
- - isfinite_v2 ([#40076](https://github.com/PaddlePaddle/Paddle/pull/40076))
-
- - logical_or ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942))
-
- - isinf_v2 ([#40076](https://github.com/PaddlePaddle/Paddle/pull/40076))
-
- - is_empty ([#39919](https://github.com/PaddlePaddle/Paddle/pull/39919))
-
- - logical_xor ([#39942](https://github.com/PaddlePaddle/Paddle/pull/39942))
-
- - less_than ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970))
-
- - not_equal ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970))
-
- - equal ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970))
-
- - less_equal ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970))
-
- - equal_all ([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970))
-
- - uniform_random ([#39937](https://github.com/PaddlePaddle/Paddle/pull/39937))
-
- - randint ([#39876](https://github.com/PaddlePaddle/Paddle/pull/39876), [#41375](https://github.com/PaddlePaddle/Paddle/pull/41375))
-
- - randperm ([#41265](https://github.com/PaddlePaddle/Paddle/pull/41265))
-
- - unbind ([#39789](https://github.com/PaddlePaddle/Paddle/pull/39789))
-
- - bernoulli ([#39590](https://github.com/PaddlePaddle/Paddle/pull/39590))
-
- - increment ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913))
-
- - multinomial ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913))
-
- - addmm ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913))
-
- - cholesky ([#39858](https://github.com/PaddlePaddle/Paddle/pull/39858), [#39913](https://github.com/PaddlePaddle/Paddle/pull/39913))
-
- - where ([#39811](https://github.com/PaddlePaddle/Paddle/pull/39811))
-
- - log10 ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785))
-
- - log2 ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785))
-
- - expm1([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - atan2 ([#39806](https://github.com/PaddlePaddle/Paddle/pull/39806))
-
- - gaussian_random ([#39932](https://github.com/PaddlePaddle/Paddle/pull/39932), [#40122](https://github.com/PaddlePaddle/Paddle/pull/40122), [#40191](https://github.com/PaddlePaddle/Paddle/pull/40191))
-
- - empty ([#38334](https://github.com/PaddlePaddle/Paddle/pull/38334))
-
- - truncated_gaussian_random ([#39971](https://github.com/PaddlePaddle/Paddle/pull/39971), [#40191](https://github.com/PaddlePaddle/Paddle/pull/40191))
-
- - mv ([#39861](https://github.com/PaddlePaddle/Paddle/pull/39861), [#39954](https://github.com/PaddlePaddle/Paddle/pull/39954))
-
- - tan ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - set_value ([#40195](https://github.com/PaddlePaddle/Paddle/pull/40195), [#40478](https://github.com/PaddlePaddle/Paddle/pull/40478), [#40636](https://github.com/PaddlePaddle/Paddle/pull/40636))
-
- - bitwise_and ([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031))
-
- - bitwise_not([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031))
-
- - bitwise_or([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031))
-
- - poisson([#39814](https://github.com/PaddlePaddle/Paddle/pull/39814))
-
- - cholesky_solve([#40387](https://github.com/PaddlePaddle/Paddle/pull/40387))
-
- - bitwise_xor([#40031](https://github.com/PaddlePaddle/Paddle/pull/40031))
-
- - triangular_solve([#40417](https://github.com/PaddlePaddle/Paddle/pull/40417))
-
- - sigmoid ([#40626](https://github.com/PaddlePaddle/Paddle/pull/40626))
-
- - atanh ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - softsign([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - thresholded_relu ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385))
-
- - tanh_shrink ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565))
-
- - stanh([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - reduce_mean ([#37559](https://github.com/PaddlePaddle/Paddle/pull/37559))
-
- - reduce_max([#40225](https://github.com/PaddlePaddle/Paddle/pull/40225))
-
- - reduce_min ([#40374](https://github.com/PaddlePaddle/Paddle/pull/40374))
-
- - mean ([#40872](https://github.com/PaddlePaddle/Paddle/pull/40872), [#41319](https://github.com/PaddlePaddle/Paddle/pull/41319))
-
- - reduce_all ([#40374](https://github.com/PaddlePaddle/Paddle/pull/40374))
-
- - reduce_any ([#40374](https://github.com/PaddlePaddle/Paddle/pull/40374))
-
- - logsumexp ([#40790](https://github.com/PaddlePaddle/Paddle/pull/40790))
-
- - softshrink([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565))
-
- - range ([#41265](https://github.com/PaddlePaddle/Paddle/pull/41265), [#40581](https://github.com/PaddlePaddle/Paddle/pull/40851))
-
- - stack([#40581](https://github.com/PaddlePaddle/Paddle/pull/40851))
-
- - tile ([#40371](https://github.com/PaddlePaddle/Paddle/pull/40371))
-
- - unique([#40581](https://github.com/PaddlePaddle/Paddle/pull/40851))
-
- - unstack([#40581](https://github.com/PaddlePaddle/Paddle/pull/40851))
-
- - slice([#40736](https://github.com/PaddlePaddle/Paddle/pull/40736))
-
- - transpose2([#39327](https://github.com/PaddlePaddle/Paddle/pull/39327))
-
- - unsqueeze2( [#40596](https://github.com/PaddlePaddle/Paddle/pull/40596))
-
- - squeeze2( [#40596](https://github.com/PaddlePaddle/Paddle/pull/40596))
-
- - strided_slice ([#40708](https://github.com/PaddlePaddle/Paddle/pull/40708))
-
- - softmax ([#39547](https://github.com/PaddlePaddle/Paddle/pull/39547))
-
- - leaky_relu ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385))
-
- - gelu ([#40393](https://github.com/PaddlePaddle/Paddle/pull/40393))
-
- - prelu ([#40393](https://github.com/PaddlePaddle/Paddle/pull/40393))
-
- - log_softmax ([#40393](https://github.com/PaddlePaddle/Paddle/pull/40393))
-
- - elu ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565))
-
- - logsigmoid ([#40626](https://github.com/PaddlePaddle/Paddle/pull/40626))
-
- - psroi_pool ([#40353](https://github.com/PaddlePaddle/Paddle/pull/40353), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173))
-
- - kthvalue([#40575](https://github.com/PaddlePaddle/Paddle/pull/40575))
-
- - mode ([#40571](https://github.com/PaddlePaddle/Paddle/pull/40571))
-
- - yolo_box([#40112](https://github.com/PaddlePaddle/Paddle/pull/40112))
-
- - yolov3_loss ([#40944](https://github.com/PaddlePaddle/Paddle/pull/40944))
-
- - temporal_shift([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - depthwise_conv2d([#39354](https://github.com/PaddlePaddle/Paddle/pull/39354))
-
- - pad3d ([#40701](https://github.com/PaddlePaddle/Paddle/pull/40701))
-
- - pad( [#40012](https://github.com/PaddlePaddle/Paddle/pull/40012))
-
- - greater_equal([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970))
-
- - kldiv_loss ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770))
-
- - isclose ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770))
-
- - silu ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565))
-
- - unfold ([#39778](https://github.com/PaddlePaddle/Paddle/pull/39778))
-
- - batch_norm ([#39347](https://github.com/PaddlePaddle/Paddle/pull/39347))
-
- - norm([#39324](https://github.com/PaddlePaddle/Paddle/pull/39324))
-
- - roi_pool ([#40574](https://github.com/PaddlePaddle/Paddle/pull/40574), [#40682](https://github.com/PaddlePaddle/Paddle/pull/40682), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173))
-
- - roi_align ([#40382](https://github.com/PaddlePaddle/Paddle/pull/40382), [#40556](https://github.com/PaddlePaddle/Paddle/pull/40556), [#41402](https://github.com/PaddlePaddle/Paddle/pull/41402))
-
- - deformable_conv ([#40700](https://github.com/PaddlePaddle/Paddle/pull/40700), [#40794](https://github.com/PaddlePaddle/Paddle/pull/40794), [#41644](https://github.com/PaddlePaddle/Paddle/pull/41644))
-
- - deformable_conv_v1 ([#40794](https://github.com/PaddlePaddle/Paddle/pull/40794), [#41644](https://github.com/PaddlePaddle/Paddle/pull/41644))
-
- - label_smooth ([#39796](https://github.com/PaddlePaddle/Paddle/pull/39796))
-
- - grid_sampler ([#40585](https://github.com/PaddlePaddle/Paddle/pull/40585))
-
- - greater_than([#39970](https://github.com/PaddlePaddle/Paddle/pull/39970))
-
- - pixel_shuffle ([#39949](https://github.com/PaddlePaddle/Paddle/pull/39949), [#39712](https://github.com/PaddlePaddle/Paddle/pull/39712))
-
- - nearest_interp_v2 ([#40855](https://github.com/PaddlePaddle/Paddle/pull/40855))
-
- - bilinear_interp_v2 ([#40855](https://github.com/PaddlePaddle/Paddle/pull/40855))
-
- - softmax_with_cross_entropy ([#40832](https://github.com/PaddlePaddle/Paddle/pull/40832))
-
- - rnn ([#41007](https://github.com/PaddlePaddle/Paddle/pull/41007))
-
- - reverse ([#40791](https://github.com/PaddlePaddle/Paddle/pull/40791))
-
- - trace ([#39510](https://github.com/PaddlePaddle/Paddle/pull/39510))
-
- - kron([#40427](https://github.com/PaddlePaddle/Paddle/pull/40427))
-
- - accuracy([#39982](https://github.com/PaddlePaddle/Paddle/pull/39982))
-
- - gather_tree ([#40082](https://github.com/PaddlePaddle/Paddle/pull/40082), [#39844](https://github.com/PaddlePaddle/Paddle/pull/39844))
-
- - dropout([#40148](https://github.com/PaddlePaddle/Paddle/pull/40148))
-
- - bincount ([#39947](https://github.com/PaddlePaddle/Paddle/pull/39947))
-
- - warpctc ([#41389](https://github.com/PaddlePaddle/Paddle/pull/41389), [#40023](https://github.com/PaddlePaddle/Paddle/pull/40023))
-
- - multiplex([#40007](https://github.com/PaddlePaddle/Paddle/pull/40007), [#40102](https://github.com/PaddlePaddle/Paddle/pull/40102))
-
- - qr ([#40007](https://github.com/PaddlePaddle/Paddle/pull/40007))
-
- - assign_value ([#40967](https://github.com/PaddlePaddle/Paddle/pull/40967))
-
- - assign ([#40022](https://github.com/PaddlePaddle/Paddle/pull/40022))
-
- - cast ([#37610](https://github.com/PaddlePaddle/Paddle/pull/37610))
-
- - tril_triu([#40007](https://github.com/PaddlePaddle/Paddle/pull/40007), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - where_index ([#40255](https://github.com/PaddlePaddle/Paddle/pull/40255))
-
- - index_select ([#40260](https://github.com/PaddlePaddle/Paddle/pull/40260), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - roll ([#40257](https://github.com/PaddlePaddle/Paddle/pull/40257), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - cumprod ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770))
-
- - shard_index ([#40254](https://github.com/PaddlePaddle/Paddle/pull/40254))
-
- - reshape2 ([#40914](https://github.com/PaddlePaddle/Paddle/pull/40914), [#39631](https://github.com/PaddlePaddle/Paddle/pull/39631), [#38833](https://github.com/PaddlePaddle/Paddle/pull/38833), [#37164](https://github.com/PaddlePaddle/Paddle/pull/37164))
-
- - flip ([#39822](https://github.com/PaddlePaddle/Paddle/pull/39822), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974))
-
- - eye ([#39712](https://github.com/PaddlePaddle/Paddle/pull/39712), [#40105](https://github.com/PaddlePaddle/Paddle/pull/40105), [#41476](https://github.com/PaddlePaddle/Paddle/pull/41476))
-
- - lookup_table_v2([#39901](https://github.com/PaddlePaddle/Paddle/pull/39901))
-
- - searchsorted([#40520](https://github.com/PaddlePaddle/Paddle/pull/40520), [#41053](https://github.com/PaddlePaddle/Paddle/pull/41053))
-
- - adamw ([#40351](https://github.com/PaddlePaddle/Paddle/pull/40351))
-
- - tanh ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385))
-
- - cross ([#39829](https://github.com/PaddlePaddle/Paddle/pull/39829))
-
- - concat ([#38955](https://github.com/PaddlePaddle/Paddle/pull/38955), [#41112](https://github.com/PaddlePaddle/Paddle/pull/41112))
-
- - split ([#39060](https://github.com/PaddlePaddle/Paddle/pull/39060))
-
- - linspace ([#40124](https://github.com/PaddlePaddle/Paddle/pull/40124))
-
- - huber_loss ([#39761](https://github.com/PaddlePaddle/Paddle/pull/39761))
-
- - hierarchical_sigmoid([#40553](https://github.com/PaddlePaddle/Paddle/pull/40553))
-
- - nll_loss ([#39936](https://github.com/PaddlePaddle/Paddle/pull/39936))
-
- - graph_send_recv ([#40092](https://github.com/PaddlePaddle/Paddle/pull/40092), [#40320](https://github.com/PaddlePaddle/Paddle/pull/40320))
-
- - abs([#39492](https://github.com/PaddlePaddle/Paddle/pull/39492), [#39762](https://github.com/PaddlePaddle/Paddle/pull/39762))
-
- - exp([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - rsqrt([#40727](https://github.com/PaddlePaddle/Paddle/pull/40727))
-
- - viterbi_decode ([#40186](https://github.com/PaddlePaddle/Paddle/pull/40186))
-
- - conj ([#38247](https://github.com/PaddlePaddle/Paddle/pull/38247))
-
- - real ([#39777](https://github.com/PaddlePaddle/Paddle/pull/39777), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173))
-
- - imag ([#39777](https://github.com/PaddlePaddle/Paddle/pull/39777), [#41173](https://github.com/PaddlePaddle/Paddle/pull/41173))
-
- - take_along_axis ([#39959](https://github.com/PaddlePaddle/Paddle/pull/39959), [#40270](https://github.com/PaddlePaddle/Paddle/pull/40270), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974))
-
- - put_along_axis ([#39959](https://github.com/PaddlePaddle/Paddle/pull/39959), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974))
-
- - lgamma ([#39770](https://github.com/PaddlePaddle/Paddle/pull/39770))
-
- - relu ([#40175](https://github.com/PaddlePaddle/Paddle/pull/40175))
-
- - maxout ([#39959](https://github.com/PaddlePaddle/Paddle/pull/39959), [#40974](https://github.com/PaddlePaddle/Paddle/pull/40974))
-
- - log ([#40785](https://github.com/PaddlePaddle/Paddle/pull/40785))
-
- - bilinear_tensor_product([#39903](https://github.com/PaddlePaddle/Paddle/pull/39903))
-
- - flatten_contiguous_range ([#38712](https://github.com/PaddlePaddle/Paddle/pull/38712), [#36957](https://github.com/PaddlePaddle/Paddle/pull/36957), [#41345](https://github.com/PaddlePaddle/Paddle/pull/41345))
-
- - matrix_rank ([#40074](https://github.com/PaddlePaddle/Paddle/pull/40074), [#40519](https://github.com/PaddlePaddle/Paddle/pull/40519), [#41466](https://github.com/PaddlePaddle/Paddle/pull/41466))
-
- - logit ([#37844](https://github.com/PaddlePaddle/Paddle/pull/37844))
-
- - lerp ([#40105](https://github.com/PaddlePaddle/Paddle/pull/40105), [#39524](https://github.com/PaddlePaddle/Paddle/pull/39524))
-
- - erfinv ([#39949](https://github.com/PaddlePaddle/Paddle/pull/39949), [#39712](https://github.com/PaddlePaddle/Paddle/pull/39712))
-
- - broadcast_tensors([#40047](https://github.com/PaddlePaddle/Paddle/pull/40047))
-
- - gumbel_softmax([#39873](https://github.com/PaddlePaddle/Paddle/pull/39873))
-
- - diagonal ([#39575](https://github.com/PaddlePaddle/Paddle/pull/39575))
-
- - trunc ([#39543](https://github.com/PaddlePaddle/Paddle/pull/39543), [#39772](https://github.com/PaddlePaddle/Paddle/pull/39772))
-
- - multi_dot ([#40038](https://github.com/PaddlePaddle/Paddle/pull/40038))
-
- - matrix_power ([#40231](https://github.com/PaddlePaddle/Paddle/pull/40231))
-
- - digamma([#39240](https://github.com/PaddlePaddle/Paddle/pull/39240))
-
- - masked_select([#39193](https://github.com/PaddlePaddle/Paddle/pull/39193))
-
- - determinant ([#40539](https://github.com/PaddlePaddle/Paddle/pull/40539))
-
- - eigh ([#40213](https://github.com/PaddlePaddle/Paddle/pull/40213))
-
- - size ([#39949](https://github.com/PaddlePaddle/Paddle/pull/39949), [#39712](https://github.com/PaddlePaddle/Paddle/pull/39712))
-
- - shape ([#40248](https://github.com/PaddlePaddle/Paddle/pull/40248))
-
- - reduce_sum([#37559](https://github.com/PaddlePaddle/Paddle/pull/37559), [#41295](https://github.com/PaddlePaddle/Paddle/pull/41295))
-
- - reduce_prod ([#39844](https://github.com/PaddlePaddle/Paddle/pull/39844))
-
- - histogram([#39496](https://github.com/PaddlePaddle/Paddle/pull/39496))
-
- - meshgrid ([#41411](https://github.com/PaddlePaddle/Paddle/pull/41411))
-
- - brelu ([#40385](https://github.com/PaddlePaddle/Paddle/pull/40385))
-
- - hard_swish ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913))
-
- - hard_shrink ([#40565](https://github.com/PaddlePaddle/Paddle/pull/40565))
-
- - selu ([#39819](https://github.com/PaddlePaddle/Paddle/pull/39819))
-
- - expand_v2 ([#39471](https://github.com/PaddlePaddle/Paddle/pull/39471))
-
- - top_k_v2([#40064](https://github.com/PaddlePaddle/Paddle/pull/40064))
-
- - expand_as_v2([#40373](https://github.com/PaddlePaddle/Paddle/pull/40373))
-
- - swish ([#40913](https://github.com/PaddlePaddle/Paddle/pull/40913))
-
- - hard_sigmoid ([#40626](https://github.com/PaddlePaddle/Paddle/pull/40626))
-
- - exp, det, assign, gaussian_random, matrix_rank, eye, and deformable_conv. ([#41755](https://github.com/PaddlePaddle/Paddle/pull/41755), [#41737](https://github.com/PaddlePaddle/Paddle/pull/41737))
-
-#### **New Dynamic Graph Execution Mechanism**
-
-To improve the scheduling performance and custom development capability of PaddlePaddle's dynamic graph execution mechanism, we have rebuilt the underlying execution mechanism of the dynamic graph. The new execution method uses the PHI operator library for efficient runtime execution, so operators supported by the PHI operator library see a significant improvement in scheduling performance after switching to the new dynamic graph mode. However, upgrading the execution mechanism of the whole framework is a large amount of work and is tightly coupled with the PHI operator library, so this execution method is still not enabled by default in this version. To try it, set the environment variable `FLAGS_enable_eager_mode=1`. The details are as follows:
-
-- **Implementation of dynamic graph execution infrastructure, core components and mechanism**: By making the dynamic graph execution code static, the original uniform operator construction is converted into specific calls to different PHI APIs, which greatly reduces scheduling overhead. ([#36059](https://github.com/PaddlePaddle/Paddle/pull/36059), [#37323](https://github.com/PaddlePaddle/Paddle/pull/37323), [#37556](https://github.com/PaddlePaddle/Paddle/pull/37556), [#37555](https://github.com/PaddlePaddle/Paddle/pull/37555), [#37478](https://github.com/PaddlePaddle/Paddle/pull/37478), [#37458](https://github.com/PaddlePaddle/Paddle/pull/37458), [#37479](https://github.com/PaddlePaddle/Paddle/pull/37479), [#37599](https://github.com/PaddlePaddle/Paddle/pull/37599), [#37659](https://github.com/PaddlePaddle/Paddle/pull/37659), [#37654](https://github.com/PaddlePaddle/Paddle/pull/37654), [#39200](https://github.com/PaddlePaddle/Paddle/pull/39200), [#39309](https://github.com/PaddlePaddle/Paddle/pull/39309), [#39319](https://github.com/PaddlePaddle/Paddle/pull/39319), [#39414](https://github.com/PaddlePaddle/Paddle/pull/39414), [#39504](https://github.com/PaddlePaddle/Paddle/pull/39504), [#39526](https://github.com/PaddlePaddle/Paddle/pull/39526), [#39878](https://github.com/PaddlePaddle/Paddle/pull/39878), [#39963](https://github.com/PaddlePaddle/Paddle/pull/39963))
-
-- **New dynamic graph execution mechanism sub-function development and adaptation**: support more flexible and complete dynamic graph sub-functions such as hook, pylayer, double_grad, inplace, amp, etc. ([#41396](https://github.com/PaddlePaddle/Paddle/pull/41396), [#40400](https://github.com/PaddlePaddle/Paddle/pull/40400), [#40695](https://github.com/PaddlePaddle/Paddle/pull/40695), [#41043](https://github.com/PaddlePaddle/Paddle/pull/41043), [#40915](https://github.com/PaddlePaddle/Paddle/pull/40915), [#41104](https://github.com/PaddlePaddle/Paddle/pull/41104), [#41350](https://github.com/PaddlePaddle/Paddle/pull/41350), [#41209](https://github.com/PaddlePaddle/Paddle/pull/41209), [#40830](https://github.com/PaddlePaddle/Paddle/pull/40830), [#40891](https://github.com/PaddlePaddle/Paddle/pull/40891), [#36814](https://github.com/PaddlePaddle/Paddle/pull/36814), [#37377](https://github.com/PaddlePaddle/Paddle/pull/37377), [#37193](https://github.com/PaddlePaddle/Paddle/pull/37193), [#36965](https://github.com/PaddlePaddle/Paddle/pull/36965), [#37810](https://github.com/PaddlePaddle/Paddle/pull/37810), [#36837](https://github.com/PaddlePaddle/Paddle/pull/36837), [#38488](https://github.com/PaddlePaddle/Paddle/pull/38488), [#39282](https://github.com/PaddlePaddle/Paddle/pull/39282), [#39449](https://github.com/PaddlePaddle/Paddle/pull/39449), [#39531](https://github.com/PaddlePaddle/Paddle/pull/39531), [#39638](https://github.com/PaddlePaddle/Paddle/pull/39638), [#39674](https://github.com/PaddlePaddle/Paddle/pull/39674), [#39893](https://github.com/PaddlePaddle/Paddle/pull/39893), [#40170](https://github.com/PaddlePaddle/Paddle/pull/40170), [#40693](https://github.com/PaddlePaddle/Paddle/pull/40693), [#40937](https://github.com/PaddlePaddle/Paddle/pull/40937), [#41016](https://github.com/PaddlePaddle/Paddle/pull/41016), [#41051](https://github.com/PaddlePaddle/Paddle/pull/41051), [#41121](https://github.com/PaddlePaddle/Paddle/pull/41121), 
[#41198](https://github.com/PaddlePaddle/Paddle/pull/41198), [#41287](https://github.com/PaddlePaddle/Paddle/pull/41287), [#41380](https://github.com/PaddlePaddle/Paddle/pull/41380), [#41306](https://github.com/PaddlePaddle/Paddle/pull/41306), [#41387](https://github.com/PaddlePaddle/Paddle/pull/41387), [#40623](https://github.com/PaddlePaddle/Paddle/pull/40623), [#40945](https://github.com/PaddlePaddle/Paddle/pull/40945), [#39282](https://github.com/PaddlePaddle/Paddle/pull/39282), [#39449](https://github.com/PaddlePaddle/Paddle/pull/39449), [#38488](https://github.com/PaddlePaddle/Paddle/pull/38488))
-
-- **Automatic code generation mechanism for new dynamic graph execution**: When we are trying to split the computation and scheduling logic of a large number of homogeneous operators into different specific scheduling logics, we find that it is a huge workload. So we introduce a new automatic code generation logic to generate code and thus simplify the runtime logic of dynamic graphs. Meanwhile, in order to adapt to the various types of runtime logic in the previous framework, we also use some complicated compilation techniques to obtain information at runtime to generate more accurate scheduling code. ([#37574](https://github.com/PaddlePaddle/Paddle/pull/37574), [#37575](https://github.com/PaddlePaddle/Paddle/pull/37575), [#37639](https://github.com/PaddlePaddle/Paddle/pull/37639), [#37723](https://github.com/PaddlePaddle/Paddle/pull/37723), [#37753](https://github.com/PaddlePaddle/Paddle/pull/37753), [#37812](https://github.com/PaddlePaddle/Paddle/pull/37812), [#37837](https://github.com/PaddlePaddle/Paddle/pull/37837), [#37910](https://github.com/PaddlePaddle/Paddle/pull/37910), [#37943](https://github.com/PaddlePaddle/Paddle/pull/37943), [#37992](https://github.com/PaddlePaddle/Paddle/pull/37992), [#37959](https://github.com/PaddlePaddle/Paddle/pull/37959), [#38017](https://github.com/PaddlePaddle/Paddle/pull/38017), [#37969](https://github.com/PaddlePaddle/Paddle/pull/37969), [#38160](https://github.com/PaddlePaddle/Paddle/pull/38160), [#38085](https://github.com/PaddlePaddle/Paddle/pull/38085), [#38562](https://github.com/PaddlePaddle/Paddle/pull/38562), [#38573](https://github.com/PaddlePaddle/Paddle/pull/38573), [#39192](https://github.com/PaddlePaddle/Paddle/pull/39192), [#39215](https://github.com/PaddlePaddle/Paddle/pull/39215), [#39355](https://github.com/PaddlePaddle/Paddle/pull/39355), [#39358](https://github.com/PaddlePaddle/Paddle/pull/39358), [#39328](https://github.com/PaddlePaddle/Paddle/pull/39328), 
[#39233](https://github.com/PaddlePaddle/Paddle/pull/39233), [#39628](https://github.com/PaddlePaddle/Paddle/pull/39628), [#39767](https://github.com/PaddlePaddle/Paddle/pull/39767), [#39743](https://github.com/PaddlePaddle/Paddle/pull/39743), [#39897](https://github.com/PaddlePaddle/Paddle/pull/39897), [#39797](https://github.com/PaddlePaddle/Paddle/pull/39797), [#39997](https://github.com/PaddlePaddle/Paddle/pull/39997), [#40058](https://github.com/PaddlePaddle/Paddle/pull/40058), [#40080](https://github.com/PaddlePaddle/Paddle/pull/40080), [#40107](https://github.com/PaddlePaddle/Paddle/pull/40107), [#39962](https://github.com/PaddlePaddle/Paddle/pull/39962), [#40132](https://github.com/PaddlePaddle/Paddle/pull/40132), [#40276](https://github.com/PaddlePaddle/Paddle/pull/40276), [#40266](https://github.com/PaddlePaddle/Paddle/pull/40266), [#40480](https://github.com/PaddlePaddle/Paddle/pull/40480), [#40482](https://github.com/PaddlePaddle/Paddle/pull/40482), [#40368](https://github.com/PaddlePaddle/Paddle/pull/40368), [#40650](https://github.com/PaddlePaddle/Paddle/pull/40650), [#40815](https://github.com/PaddlePaddle/Paddle/pull/40815), [#40907](https://github.com/PaddlePaddle/Paddle/pull/40907), [#40935](https://github.com/PaddlePaddle/Paddle/pull/40935), [#41089](https://github.com/PaddlePaddle/Paddle/pull/41089))
-
-- **Integration of the new dynamic graph execution mechanism into the main framework, and integration testing**: We currently use environment variables to distinguish between static graph mode and dynamic graph mode (including the new and the old dynamic graph modes). We have adapted most of the dynamic graph logic to these modes. However, many issues are still being fixed. ([#37638](https://github.com/PaddlePaddle/Paddle/pull/37638), [#37643](https://github.com/PaddlePaddle/Paddle/pull/37643), [#37653](https://github.com/PaddlePaddle/Paddle/pull/37653), [#38314](https://github.com/PaddlePaddle/Paddle/pull/38314), [#38337](https://github.com/PaddlePaddle/Paddle/pull/38337), [#38338](https://github.com/PaddlePaddle/Paddle/pull/38338), [#39164](https://github.com/PaddlePaddle/Paddle/pull/39164), [#39326](https://github.com/PaddlePaddle/Paddle/pull/39326), [#40391](https://github.com/PaddlePaddle/Paddle/pull/40391), [#40201](https://github.com/PaddlePaddle/Paddle/pull/40201), [#40854](https://github.com/PaddlePaddle/Paddle/pull/40854), [#40887](https://github.com/PaddlePaddle/Paddle/pull/40887))
-
-- **Update some of the mode-checking logic for dynamic graphs, to support fast execution paths for dynamic graphs in a compatible form**: ([#40786](https://github.com/PaddlePaddle/Paddle/pull/40786))
-
- - Non-static graph mode (current transition scheme): `_non_static_mode()`.
-
- - Check for the new dynamic graph in dynamic graph mode (recommended check): `_in_dygrah_mode()`.
-
- - Check for the old dynamic graph in dynamic graph mode (not recommended; it will be deprecated in future versions): `_in_legacy_dygraph()`.
-
- - Enable the old dynamic graph and disable the new dynamic graph in dynamic graph mode: `_enable_legacy_dygraph()`, or exit `_test_eager_guard()`.
-
- - Enable the new dynamic graph and disable the old dynamic graph in dynamic graph mode: `_disable_legacy_dygraph()`, or enter `with _test_eager_guard()`.
-
- - Check for the new dynamic graph in either static or dynamic graph mode: `_in_eager_without_dygraph_check()`.
-
-- **Support inplace after dynamic graph reconstruction**: input and output are the same Tensor.
-
- - Adapt the inplace strategy for dynamic graph reconstruction intermediate states. ([#40400](https://github.com/PaddlePaddle/Paddle/pull/40400))
-
- - Adapt the inplace strategy to the final state of the dynamic graph reconstruction. ([#40695](https://github.com/PaddlePaddle/Paddle/pull/40695))
-
- - Add inplace strategy to PyLayer function after dynamic graph reconstruction. ([#41043](https://github.com/PaddlePaddle/Paddle/pull/41043))
-
- - Add inplace strategy for Tensor's setitem function after dynamic graph reconstruction. ([#40915](https://github.com/PaddlePaddle/Paddle/pull/40915))
-
- - Add `_reset_grad_inplace_version` interface after dynamic graph reconstruction, to set the inplace version of the Tensor's gradient to 0. ([#41101](https://github.com/PaddlePaddle/Paddle/pull/41101))
-
- - If the value of a forward Tensor is not needed during backward computation (the no_need_buffer property), skip the inplace version check for that Tensor. ([#41350](https://github.com/PaddlePaddle/Paddle/pull/41350))
-
- - Unify error messages for inplace version checks after and before reconstruction of dynamic graphs. ([#41209](https://github.com/PaddlePaddle/Paddle/pull/41209))
-
-- **Support view strategy after dynamic graph reconstruction**: input and output Tensor share underlying data.
-
- - Adapt the view strategy for dynamic graph reconstruction intermediate states. Include `reshape`, `squeeze`, `unsqueeze`, and `flatten` APIs. ([#40830](https://github.com/PaddlePaddle/Paddle/pull/40830))
-
- - Adapt the view strategy for dynamic graph reconstruction final state. Include `reshape` API. ([#40891](https://github.com/PaddlePaddle/Paddle/pull/40891))
-
-- **Add support for weakref on the python side of the new dynamic graph eager Tensor.** ([#41797](https://github.com/PaddlePaddle/Paddle/pull/41797))
-
-- **Enhance the new dynamic graph DoubleGrad function** to support the basic DoubleGrad feature. ([#41893](https://github.com/PaddlePaddle/Paddle/pull/41893), [#41894](https://github.com/PaddlePaddle/Paddle/pull/41894), [#41895](https://github.com/PaddlePaddle/Paddle/pull/41895))
-
-- **Add `core.eager.StringTensor` interface**, to support the construction of StringTensor on python side and the use of the StringTensor related APIs. ([#41039](https://github.com/PaddlePaddle/Paddle/pull/41039))
-
-- **Add `_grad_name` and `_grad_value`** to `core.eager.Tensor` to return the name and value of a gradient. ([#41990](https://github.com/PaddlePaddle/Paddle/pull/41990))
-
-- **Add the processing of the no_need_buffer attribute for dynamic graph intermediate state.** The Tensor with the no_need_buffer attribute is skipped in the inplace backward check operation. ([#41720](https://github.com/PaddlePaddle/Paddle/pull/41720))
-
-
-#### **New Static Graph Executor**
-
-To address the problems that the original static graph executor of PaddlePaddle schedules poorly in some scenarios and makes it hard to use multiple streams, we have implemented a new, higher-performance static graph executor that can easily take advantage of the asynchronous scheduling capabilities of multiple streams and multiple threads. The new executor is a compatible upgrade of the original executor and is currently used by default in single-card scenarios. Users do not need to change their training code; it is used automatically. We also provide a way to switch back to the original executor, by setting the environment variable `FLAGS_USE_STANDALONE_EXECUTOR=false`. ([#41179](https://github.com/PaddlePaddle/Paddle/pull/41179)) The main contents are as follows.
-
-- Basic components: a high-performance thread pool for multi-threaded scheduling in the executor ([#35470](https://github.com/PaddlePaddle/Paddle/pull/35470), [#35930](https://github.com/PaddlePaddle/Paddle/pull/35930), [#36030](https://github.com/PaddlePaddle/Paddle/pull/36030), [#36480](https://github.com/PaddlePaddle/Paddle/pull/36480), [#36688](https://github.com/PaddlePaddle/Paddle/pull/36688), [#36740](https://github.com/PaddlePaddle/Paddle/pull/36740), [#38335](https://github.com/PaddlePaddle/Paddle/pull/38335), [#40770](https://github.com/PaddlePaddle/Paddle/pull/40770)) and a thread cooperation component ([#38779](https://github.com/PaddlePaddle/Paddle/pull/38779), [#40876](https://github.com/PaddlePaddle/Paddle/pull/40876), [#40912](https://github.com/PaddlePaddle/Paddle/pull/40912)); timely memory reclamation after operator execution ([#37642](https://github.com/PaddlePaddle/Paddle/pull/37642), [#39617](https://github.com/PaddlePaddle/Paddle/pull/39617), [#40859](https://github.com/PaddlePaddle/Paddle/pull/40859)); a new dependency analysis algorithm for the parallel executor ([#37231](https://github.com/PaddlePaddle/Paddle/pull/37231)), etc.
-
-- Scheduling logic: Optimize how operators are scheduled in the executor. Support a multi-stream, multi-threaded asynchronous scheduling mechanism. Turn data type, device, and layout transforms into scheduled operators to improve performance. Support caching the selected operator Kernel. Support selecting the new PHI operators. ([#35024](https://github.com/PaddlePaddle/Paddle/pull/35024), [#34922](https://github.com/PaddlePaddle/Paddle/pull/34922), [#35711](https://github.com/PaddlePaddle/Paddle/pull/35711), [#35928](https://github.com/PaddlePaddle/Paddle/pull/35928), [#39458](https://github.com/PaddlePaddle/Paddle/pull/39458), [#36899](https://github.com/PaddlePaddle/Paddle/pull/36899))
-
-- Interface compatibility: Stay compatible with the user interface and functionality of the original executor, such as alignment with the Python interface `Executor.run()` and support for managing Tensors in Scope. This lets users switch to the new executor without noticing any difference. ([#37278](https://github.com/PaddlePaddle/Paddle/pull/37278), [#37379](https://github.com/PaddlePaddle/Paddle/pull/37379), [#37445](https://github.com/PaddlePaddle/Paddle/pull/37445), [#37510](https://github.com/PaddlePaddle/Paddle/pull/37510), [#40955](https://github.com/PaddlePaddle/Paddle/pull/40955), [#41778](https://github.com/PaddlePaddle/Paddle/pull/41178), [#41058](https://github.com/PaddlePaddle/Paddle/pull/41058), [#38584](https://github.com/PaddlePaddle/Paddle/pull/38584), [#37957](https://github.com/PaddlePaddle/Paddle/pull/37957), [#37672](https://github.com/PaddlePaddle/Paddle/pull/37672), [#37474](https://github.com/PaddlePaddle/Paddle/pull/37474), [#37085](https://github.com/PaddlePaddle/Paddle/pull/37085), [#37061](https://github.com/PaddlePaddle/Paddle/pull/37061), [#36945](https://github.com/PaddlePaddle/Paddle/pull/36945))
-
-- Enhance debugging and error reporting in multi-threaded scenarios by capturing error reports from sub-threads and throwing them uniformly in the main thread. This can improve user experience. ([#36692](https://github.com/PaddlePaddle/Paddle/pull/36692),[#36802](https://github.com/PaddlePaddle/Paddle/pull/36802))
-
-- Fix the bug where the new executor's communication stream reset the stream cache information in the allocator, reducing RecordStream overhead in cross-stream scenarios. This improves the performance of DeepFM models by about 8%. ([#42046](https://github.com/PaddlePaddle/Paddle/pull/42046))
-
-- Optimize the dependency analysis between operators in the new executor to improve runtime performance. Establish correct dependencies for send/recv communication operators to support pipeline parallelism. ([#42009](https://github.com/PaddlePaddle/Paddle/pull/42009))
-
-
-
-#### **Distributed Training**
-
-- Basic functions of multi-machine multi-card parallel training based on collective communication
-
- - Add support for elastic training, which allows the number of workers to be scaled up and down and lets training resume after a node failure, improving the fault tolerance of distributed training. ([#36684](https://github.com/PaddlePaddle/Paddle/pull/36684), [#37177](https://github.com/PaddlePaddle/Paddle/pull/37177), [#37781](https://github.com/PaddlePaddle/Paddle/pull/37781))
-
- - Refactor the launch module, adding `master` coordination and the `nnodes` node-count definition, to make distributed startup easier to use. ([#40086](https://github.com/PaddlePaddle/Paddle/pull/40086), [#40568](https://github.com/PaddlePaddle/Paddle/pull/40568), [#40782](https://github.com/PaddlePaddle/Paddle/pull/40782), [#40844](https://github.com/PaddlePaddle/Paddle/pull/40844), [#40936](https://github.com/PaddlePaddle/Paddle/pull/40936), [#41190](https://github.com/PaddlePaddle/Paddle/pull/41190), [#41314](https://github.com/PaddlePaddle/Paddle/pull/41314))
-
- - Add support for GPU/NPU/XPU multi-hardware heterogeneous training. ([#37613](https://github.com/PaddlePaddle/Paddle/pull/37613), [#37998](https://github.com/PaddlePaddle/Paddle/pull/37998))
-
- - Add fleet_executor asynchronous pipeline executor. ([#36966](https://github.com/PaddlePaddle/Paddle/pull/36966), [#37049](https://github.com/PaddlePaddle/Paddle/pull/37049), [#37087](https://github.com/PaddlePaddle/Paddle/pull/37087), [#37126](https://github.com/PaddlePaddle/Paddle/pull/37126), [#37150](https://github.com/PaddlePaddle/Paddle/pull/37150), [#37203](https://github.com/PaddlePaddle/Paddle/pull/37203), [#37167](https://github.com/PaddlePaddle/Paddle/pull/37167), [#37282](https://github.com/PaddlePaddle/Paddle/pull/37282), [#37319](https://github.com/PaddlePaddle/Paddle/pull/37319), [#37462](https://github.com/PaddlePaddle/Paddle/pull/37462), [#37507](https://github.com/PaddlePaddle/Paddle/pull/37507), [#37533](https://github.com/PaddlePaddle/Paddle/pull/37533), [#37576](https://github.com/PaddlePaddle/Paddle/pull/37576), [#37605](https://github.com/PaddlePaddle/Paddle/pull/37605), [#37691](https://github.com/PaddlePaddle/Paddle/pull/37691), [#37742](https://github.com/PaddlePaddle/Paddle/pull/37742), [#37783](https://github.com/PaddlePaddle/Paddle/pull/37783), [#37809](https://github.com/PaddlePaddle/Paddle/pull/37809), [#37862](https://github.com/PaddlePaddle/Paddle/pull/37862), [#37882](https://github.com/PaddlePaddle/Paddle/pull/37882), [#37934](https://github.com/PaddlePaddle/Paddle/pull/37934), [#38024](https://github.com/PaddlePaddle/Paddle/pull/38024), [#38083](https://github.com/PaddlePaddle/Paddle/pull/38083), [#38164](https://github.com/PaddlePaddle/Paddle/pull/38164), [#38261](https://github.com/PaddlePaddle/Paddle/pull/38261), [#38290](https://github.com/PaddlePaddle/Paddle/pull/38290), [#40607](https://github.com/PaddlePaddle/Paddle/pull/40607), [#37093](https://github.com/PaddlePaddle/Paddle/pull/37093), [#37106](https://github.com/PaddlePaddle/Paddle/pull/37106), [#37143](https://github.com/PaddlePaddle/Paddle/pull/37143), [#37338](https://github.com/PaddlePaddle/Paddle/pull/37338), 
[#37376](https://github.com/PaddlePaddle/Paddle/pull/37376), [#37485](https://github.com/PaddlePaddle/Paddle/pull/37485), [#37531](https://github.com/PaddlePaddle/Paddle/pull/37531), [#37623](https://github.com/PaddlePaddle/Paddle/pull/37623), [#37693](https://github.com/PaddlePaddle/Paddle/pull/37693), [#37755](https://github.com/PaddlePaddle/Paddle/pull/37755), [#37807](https://github.com/PaddlePaddle/Paddle/pull/37807), [#37889](https://github.com/PaddlePaddle/Paddle/pull/37889), [#38420](https://github.com/PaddlePaddle/Paddle/pull/38420), [#38539](https://github.com/PaddlePaddle/Paddle/pull/38539), [#36892](https://github.com/PaddlePaddle/Paddle/pull/36892), [#37084](https://github.com/PaddlePaddle/Paddle/pull/37084), [#37158](https://github.com/PaddlePaddle/Paddle/pull/37158), [#37361](https://github.com/PaddlePaddle/Paddle/pull/37361), [#37509](https://github.com/PaddlePaddle/Paddle/pull/37509), [#37603](https://github.com/PaddlePaddle/Paddle/pull/37603), [#37703](https://github.com/PaddlePaddle/Paddle/pull/37703), [#37824](https://github.com/PaddlePaddle/Paddle/pull/37824), [#38114](https://github.com/PaddlePaddle/Paddle/pull/38114), [#38322](https://github.com/PaddlePaddle/Paddle/pull/38322), [#38535](https://github.com/PaddlePaddle/Paddle/pull/38535), [#38650](https://github.com/PaddlePaddle/Paddle/pull/38650), [#38709](https://github.com/PaddlePaddle/Paddle/pull/38709), [#38799](https://github.com/PaddlePaddle/Paddle/pull/38799), [#38839](https://github.com/PaddlePaddle/Paddle/pull/38839), [#38904](https://github.com/PaddlePaddle/Paddle/pull/38904))
-
- - Add distributed inference function for large-scale model. ([#38795](https://github.com/PaddlePaddle/Paddle/pull/38795), [#39012](https://github.com/PaddlePaddle/Paddle/pull/39012), [#39032](https://github.com/PaddlePaddle/Paddle/pull/39032), [#39076](https://github.com/PaddlePaddle/Paddle/pull/39076), [#39194](https://github.com/PaddlePaddle/Paddle/pull/39194), [#39207](https://github.com/PaddlePaddle/Paddle/pull/39207), [#39241](https://github.com/PaddlePaddle/Paddle/pull/39241), [#39603](https://github.com/PaddlePaddle/Paddle/pull/39603), [#39758](https://github.com/PaddlePaddle/Paddle/pull/39758), [#39992](https://github.com/PaddlePaddle/Paddle/pull/39992)).
-
-- Dynamic graph hybrid parallelism
-
- - Refactor `paddle.distributed.fleet.utils.recompute` to support the new dynamic graph. ([#41396](https://github.com/PaddlePaddle/Paddle/pull/41396))
-
- - Add pure FP16 training to support data parallelism. ([#36420](https://github.com/PaddlePaddle/Paddle/pull/36420))
-
- - Add MoE (Mixture of Experts) parallel strategy, to support large-scale MoE model training. ([#41092](https://github.com/PaddlePaddle/Paddle/pull/41092), [#40895](https://github.com/PaddlePaddle/Paddle/pull/40895), [#40850](https://github.com/PaddlePaddle/Paddle/pull/40580), [#39224](https://github.com/PaddlePaddle/Paddle/pull/39224))
-
- - Add the GroupSharded parallel strategy. It supports stage1, stage2, and stage3 with both synchronous and asynchronous communication, and can be combined with basic functions such as Recompute, AMP O1/O2, Offload, GroupShardedClipGrad, and GroupShardedScaler. ([#37489](https://github.com/PaddlePaddle/Paddle/pull/37489), [#37568](https://github.com/PaddlePaddle/Paddle/pull/37568), [#37707](https://github.com/PaddlePaddle/Paddle/pull/37707), [#37836](https://github.com/PaddlePaddle/Paddle/pull/37836), [#37947](https://github.com/PaddlePaddle/Paddle/pull/37947), [#38151](https://github.com/PaddlePaddle/Paddle/pull/38151), [#38407](https://github.com/PaddlePaddle/Paddle/pull/38407), [#38052](https://github.com/PaddlePaddle/Paddle/pull/38052), [#39112](https://github.com/PaddlePaddle/Paddle/pull/39112), [#38989](https://github.com/PaddlePaddle/Paddle/pull/38989), [#39171](https://github.com/PaddlePaddle/Paddle/pull/39171), [#39285](https://github.com/PaddlePaddle/Paddle/pull/39285), [#39334](https://github.com/PaddlePaddle/Paddle/pull/39334), [#39397](https://github.com/PaddlePaddle/Paddle/pull/39397), [#39581](https://github.com/PaddlePaddle/Paddle/pull/39581), [#39668](https://github.com/PaddlePaddle/Paddle/pull/39668), [#40129](https://github.com/PaddlePaddle/Paddle/pull/40129), [#40396](https://github.com/PaddlePaddle/Paddle/pull/40396), [#40488](https://github.com/PaddlePaddle/Paddle/pull/40488), [#40601](https://github.com/PaddlePaddle/Paddle/pull/40601), [#37725](https://github.com/PaddlePaddle/Paddle/pull/37725), [#37904](https://github.com/PaddlePaddle/Paddle/pull/37904), [#38064](https://github.com/PaddlePaddle/Paddle/pull/38064))
-
-- Static graph hybrid parallelism
-
- - Add `scale_gradient` flag bit to `gradient_scale_configs` to control the position where the gradient aggregation operation averages the gradients under pipeline parallelism. ([#36384](https://github.com/PaddlePaddle/Paddle/pull/36384))
-
- - Under tensor parallelism, the dropout op supports the settings of deterministic random seed generators, to ensure random consistency for non-distributed variables and randomness of distributed variables. ([#36228](https://github.com/PaddlePaddle/Paddle/pull/36228))
-
- - NPU hybrid parallelism supports Offload, saving 40% of NPU memory. ([#37224](https://github.com/PaddlePaddle/Paddle/pull/37224))
-
- - Add `force_cpu` optional parameter to the seed op, to allow dropout to read seed values directly from CPU. ([#35820](https://github.com/PaddlePaddle/Paddle/pull/35820))
-
- - Improve the Automatic Sparsity (ASP) sharding strategy and support the selection of sharding strategy according to the program. ([#40028](https://github.com/PaddlePaddle/Paddle/pull/40028))
-
-- Automatic parallel
-
- - Add the process restart (relaunch) after automatic mapping between logical processes and physical devices. ([#37523](https://github.com/PaddlePaddle/Paddle/pull/37523), [#37326](https://github.com/PaddlePaddle/Paddle/pull/37326))
-
- - Improve the underlying mechanism and interface for automatic parallel to facilitate the unification of modules and add the optimized pass. ([#36617](https://github.com/PaddlePaddle/Paddle/pull/36617), [#38132](https://github.com/PaddlePaddle/Paddle/pull/38132))
-
- - Add a unified resource representation to support automatic mapping between logical processes and physical devices. ([#37091](https://github.com/PaddlePaddle/Paddle/pull/37091), [#37482](https://github.com/PaddlePaddle/Paddle/pull/37482), [#37094](https://github.com/PaddlePaddle/Paddle/pull/37094))
-
- - Improve distributed attribute completion for the backward and update parts of the computation graph. ([#36744](https://github.com/PaddlePaddle/Paddle/pull/36744))
-
- - Add data slicing function. ([#36055](https://github.com/PaddlePaddle/Paddle/pull/36055))
-
- - Add tensor resharding function to reshard the tensor according to the distributed properties of the tensor and operator. ([#40865](https://github.com/PaddlePaddle/Paddle/pull/40865), [#41106](https://github.com/PaddlePaddle/Paddle/pull/41106))
-
- - Add the automatic conversion pass of distributed parameters when the number of resources or parallel policy changes. ([#40434](https://github.com/PaddlePaddle/Paddle/pull/40434))
-
- - Add GradientMerge pass to reduce the number of communications and improve training efficiency. ([#38259](https://github.com/PaddlePaddle/Paddle/pull/38259), [#40737](https://github.com/PaddlePaddle/Paddle/pull/40737))
-
- - Add Recompute pass to reduce the activation memory storage. ([#38920](https://github.com/PaddlePaddle/Paddle/pull/38920))
-
- - Add Sharding optimization pass, to support p-g-os 3 stage optimization. ([#38502](https://github.com/PaddlePaddle/Paddle/pull/38502))
-
- - Add AMP + FP16 optimization pass. ([#38764](https://github.com/PaddlePaddle/Paddle/pull/38764), [#40615](https://github.com/PaddlePaddle/Paddle/pull/40615))
-
- - Add fused QKV parallelization for Transformer-style models. ([#39080](https://github.com/PaddlePaddle/Paddle/pull/39080))
-
- - Improve the sharding propagation for while op to ensure convergence of the fix-point algorithm. ([#39939](https://github.com/PaddlePaddle/Paddle/pull/39939), [#39086](https://github.com/PaddlePaddle/Paddle/pull/39086), [#39014](https://github.com/PaddlePaddle/Paddle/pull/39014))
-
- - Support training and inference for sub-block and while op control flow. ([#39612](https://github.com/PaddlePaddle/Paddle/pull/39612), [#39895](https://github.com/PaddlePaddle/Paddle/pull/39895), [#40077](https://github.com/PaddlePaddle/Paddle/pull/40077))
-
-- Parameter Server
-
- - Add NaN/Inf value checking tool under GPUPS. ([#38131](https://github.com/PaddlePaddle/Paddle/pull/38131))
-
- - Under GPUPS, add set_date interface to adapt incremental training. ([#36194](https://github.com/PaddlePaddle/Paddle/pull/36194))
-
- - Under GPUPS, add asynchronous release dataset function. ([#37790](https://github.com/PaddlePaddle/Paddle/pull/37790))
-
- - Under GPUPS, support dumping parameters and intermediate layers. ([#36157](https://github.com/PaddlePaddle/Paddle/pull/36157))
-
- - Under GPUPS, support the optimizer parameter configuration. ([#39783](https://github.com/PaddlePaddle/Paddle/pull/39783), [#39849](https://github.com/PaddlePaddle/Paddle/pull/39849))
-
- - Under the Unified Parameter Server, refactor the base classes of modules such as communication and storage, to make each module easier to extend. ([#41207](https://github.com/PaddlePaddle/Paddle/pull/41207), [#41022](https://github.com/PaddlePaddle/Paddle/pull/41022), [#40702](https://github.com/PaddlePaddle/Paddle/pull/40702), [#39341](https://github.com/PaddlePaddle/Paddle/pull/39341), [#39377](https://github.com/PaddlePaddle/Paddle/pull/39377), [#39191](https://github.com/PaddlePaddle/Paddle/pull/39191), [#39064](https://github.com/PaddlePaddle/Paddle/pull/39064))
-
- - Add evaluation metrics module under the Unified Parameter Server, to support AUC/WuAUC/MaskAUC and other evaluation metrics calculation and customizable extensions. ([#38789](https://github.com/PaddlePaddle/Paddle/pull/38789))
-
- - Support XPU parameter server training on KUNLUNXIN 2. ([#41917](https://github.com/PaddlePaddle/Paddle/pull/41917), [#42266](https://github.com/PaddlePaddle/Paddle/pull/42266), [#41916](https://github.com/PaddlePaddle/Paddle/pull/41916))
-
-#### Profiler
-
-- Add the performance analysis module `paddle.profiler` in the Python layer. It provides the ability to collect, export, and summarize performance data during training and inference. ([#40065](https://github.com/PaddlePaddle/Paddle/pull/40065), [#40357](https://github.com/PaddlePaddle/Paddle/pull/40357), [#40888](https://github.com/PaddlePaddle/Paddle/pull/40888))
-
- - `paddle.profiler.Profiler`: performance analyzer, interface for user interaction. ([#41029](https://github.com/PaddlePaddle/Paddle/pull/41029), [#41524](https://github.com/PaddlePaddle/Paddle/pull/41524), [#41157](https://github.com/PaddlePaddle/Paddle/pull/41157), [#40249](https://github.com/PaddlePaddle/Paddle/pull/40249), [#40111](https://github.com/PaddlePaddle/Paddle/pull/40111), [#39964](https://github.com/PaddlePaddle/Paddle/pull/39964), [#40133](https://github.com/PaddlePaddle/Paddle/pull/40133))
-
- - `paddle.profiler.RecordEvent`: provide custom instrumentation points for recording time. ([#39693](https://github.com/PaddlePaddle/Paddle/pull/39693), [#39694](https://github.com/PaddlePaddle/Paddle/pull/39694), [#39695](https://github.com/PaddlePaddle/Paddle/pull/39695), [#39675](https://github.com/PaddlePaddle/Paddle/pull/39675), [#41445](https://github.com/PaddlePaddle/Paddle/pull/41445), [#41132](https://github.com/PaddlePaddle/Paddle/pull/41132))
-
- - `paddle.profiler.ProfilerTarget`: specify the target device for performance analysis.
-
- - `paddle.profiler.ProfilerState`: indicate the state of the performance analyzer.
-
- - `paddle.profiler.SortedKeys`: specify the sorting method of the data within the statistics form.
-
- - `paddle.profiler.make_scheduler`: generate a scheduler that controls the state of the performance analyzer, enabling periodic control of the collection range.
-
- - `paddle.profiler.export_chrome_tracing`: save performance data to a google chrome tracing file viewable by the chrome://tracing plugin. ([#39316](https://github.com/PaddlePaddle/Paddle/pull/39316), [#39984](https://github.com/PaddlePaddle/Paddle/pull/39984), [#41029](https://github.com/PaddlePaddle/Paddle/pull/41029))
-
- - `paddle.profiler.export_protobuf`: save performance data to a protobuf file represented by internal structure. ([#39519](https://github.com/PaddlePaddle/Paddle/pull/39519), [#39109](https://github.com/PaddlePaddle/Paddle/pull/39109), [#39474](https://github.com/PaddlePaddle/Paddle/pull/39474))
-
- - `paddle.profiler.load_profiler_result`: load the performance data saved to a protobuf file.
-
- - `paddle.profiler.Profiler` generates statistics on data reading, step overhead, and throughput for model training when the `timer_only` parameter is specified. ([#40386](https://github.com/PaddlePaddle/Paddle/pull/40386))
-
-- Refactor the Profiler's underlying infrastructure in the C++ layer
-
- - Refactor the Profiler's controller architecture. ([#38826](https://github.com/PaddlePaddle/Paddle/pull/38826), [#39230](https://github.com/PaddlePaddle/Paddle/pull/39230), [#39779](https://github.com/PaddlePaddle/Paddle/pull/39779))
-
- - Add Host Tracer to collect host-side performance metrics. ([#37629](https://github.com/PaddlePaddle/Paddle/pull/39629), [#37766](https://github.com/PaddlePaddle/Paddle/pull/37766), [#37944](https://github.com/PaddlePaddle/Paddle/pull/37944), [#38280](https://github.com/PaddlePaddle/Paddle/pull/38280), [#39975](https://github.com/PaddlePaddle/Paddle/pull/39975), [#40460](https://github.com/PaddlePaddle/Paddle/pull/40460))
-
- - Add CUDA Tracer to collect device-side performance metrics. ([#39488](https://github.com/PaddlePaddle/Paddle/pull/39488))
-
- - Add support for leveled (graded) profiling in the Profiler. ([#39926](https://github.com/PaddlePaddle/Paddle/pull/39926))
-
-- Modify the name and type of logging for ops under the new dynamic graph. ([#41771](https://github.com/PaddlePaddle/Paddle/pull/41771/))
-
-- Add kernel running statistics to the Profiler's summary and optimize the summary. ([#41989](https://github.com/PaddlePaddle/Paddle/pull/41989))
-
-- Remove the performance side effect on forward computation when the Profiler is off. ([#42142](https://github.com/PaddlePaddle/Paddle/pull/42142))
-
-#### **CINN compiler adoption**
-
-With the recent development of PaddlePaddle's compiler, CINN ([GitHub - PaddlePaddle/CINN: Compiler Infrastructure for Neural Networks](https://github.com/PaddlePaddle/CINN)), the Paddle framework has been adapted to the compiler's features. The changes include subgraph management functions for the Paddle-CINN runtime, memory and speed optimizations, and bug fixes made during development.
-
-- Functions developed:
-
- - Subgraph op related functions:
-
- - Add the function to find and generate CINN subgraphs from computational graphs. ([#36345](https://github.com/PaddlePaddle/Paddle/pull/36345))
-
- - Add cinn_launch op as a runtime entry point to CINN. It is responsible for scheduling CINN to compile the subgraph, to initialize the data, and to execute the generated kernels. ([#36600](https://github.com/PaddlePaddle/Paddle/pull/36600))
-
- - Add a helper class `CinnLaunchContext` to the kernel implementation of cinn_launch op to manage the intermediate data for compiling and running subgraphs, to improve scalability and code readability. ([#37938](https://github.com/PaddlePaddle/Paddle/pull/37938))
-
- - Add additional fetch nodes to CINN subgraphs, thus ensuring that CINN external nodes can fetch the values of variables. ([#37172](https://github.com/PaddlePaddle/Paddle/pull/37172), [#37190](https://github.com/PaddlePaddle/Paddle/pull/37190))
-
- - Add the function to symbolize a CINN subgraph, which is used to topologically sort the subgraphs and return the CINN execution sequence. ([#36417](https://github.com/PaddlePaddle/Paddle/pull/36417))
-
- - Add the `CinnCompiler` class for invoking subgraphs in the CINN compiled graph that can be replaced with CINN operators. ([#36562](https://github.com/PaddlePaddle/Paddle/pull/36562), [#36975](https://github.com/PaddlePaddle/Paddle/pull/36975))
-
- - Add the interface to CINN symbolization class to get the names of subgraph fetched variables to prevent fetched variables from being eliminated in compilation optimizations. ([#37218](https://github.com/PaddlePaddle/Paddle/pull/37218))
-
- - Checking, debugging, and API changes:
-
- - Synchronize the update of NetBuilder API name changes in CINN. ([#40392](https://github.com/PaddlePaddle/Paddle/pull/40392))
-
- - Add necessary log information to Paddle-CINN for better debugging. ([#36867](https://github.com/PaddlePaddle/Paddle/pull/36867))
-
- - Add the bidirectional conversion function between Paddle desc and CINN desc. ([#36100](https://github.com/PaddlePaddle/Paddle/pull/36100))
-
- - The operator implemented in CINN may not use some input variables compared to Paddle. Therefore, remove the check that the input variables must be used in the cinn_launch op. ([#37119](https://github.com/PaddlePaddle/Paddle/pull/37119))
-
- - Add the cinn_instruction_run op for invoking CINN to execute a single generated instruction, making it easier to build and schedule subgraph execution on the Paddle side. ([#39435](https://github.com/PaddlePaddle/Paddle/pull/39435), [#39576](https://github.com/PaddlePaddle/Paddle/pull/39576))
-
- - Add control macros to Paddle for CUDA/CUBLAS/MKL/CINN pass application required to compile CINN. ([#37066](https://github.com/PaddlePaddle/Paddle/pull/37066), [#36660](https://github.com/PaddlePaddle/Paddle/pull/36660))
-
- - Add two control flags FLAGS_allow_cinn_ops and FLAGS_deny_cinn_ops to control the categories of CINN operators used to replace native operators during Paddle training. ([#36842](https://github.com/PaddlePaddle/Paddle/pull/36842))
-
-- Performance optimization:
-
- - Speed optimization
-
- - Optimize the computational time consumed by CinnCacheKey. ([#37786](https://github.com/PaddlePaddle/Paddle/pull/37786), [#37317](https://github.com/PaddlePaddle/Paddle/pull/37317))
-
- - Cache variable scope for CINN compiled subgraphs to reduce runtime parameter construction overhead. ([#37983](https://github.com/PaddlePaddle/Paddle/pull/37983))
-
- - Use CINN's auto-tuning during subgraph compilation (enabled by a flag) to further tune training performance. ([#41795](https://github.com/PaddlePaddle/Paddle/pull/41795))
-
- - Refactor the correctness check of compilation results during subgraph compilation to avoid repeated checks at runtime and reduce scheduling overhead. ([#41777](https://github.com/PaddlePaddle/Paddle/pull/41777))
-
- - Enable TransposeFolding and GemmRewriter optimization passes by default in Paddle-CINN training. ([#41084](https://github.com/PaddlePaddle/Paddle/pull/41084))
-
- - Pass the CUDA stream created in Paddle into CINN so that Paddle and CINN use the same CUDA stream for CUDA computation. ([#37337](https://github.com/PaddlePaddle/Paddle/pull/37337))
-
- - Move CINN optimization pass application logic from Paddle to CINN. ([#42047](https://github.com/PaddlePaddle/Paddle/pull/42047), [#42070](https://github.com/PaddlePaddle/Paddle/pull/42070))
-
- - Device memory optimization
-
- - Add NoNeedBufferVars to cinn_launch op to declare a list of input variables that do not require a buffer, so that the memory can be freed in advance. ([#38367](https://github.com/PaddlePaddle/Paddle/pull/38367))
-
- - Pass in reference count information for external variables to the subgraph, so that subgraphs within cinn_launch can reuse memory optimization passes and reduce the memory overhead in using CINN. ([#39209](https://github.com/PaddlePaddle/Paddle/pull/39209), [#39622](https://github.com/PaddlePaddle/Paddle/pull/39622))
-
- - Add the function to convert a collection of executable instructions generated by CINN compilation to a Paddle Graph, supporting reuse of the Paddle scheduler and memory optimization pass, further reducing the memory overhead in using CINN. ([#39724](https://github.com/PaddlePaddle/Paddle/pull/39724), [#39911](https://github.com/PaddlePaddle/Paddle/pull/39911))
-
- - Add Kernel of cinn_instruction_run op, to support dynamic device memory requests based on data types inferred from compilation results. ([#40920](https://github.com/PaddlePaddle/Paddle/pull/40920))
-
-- Bug fixing:
-
- - Fix and optimize the generation logic of CINN subgraphs. ([#36503](https://github.com/PaddlePaddle/Paddle/pull/36503))
-
- - Fix the bug that Paddle-CINN does not support no-input subgraphs. ([#40814](https://github.com/PaddlePaddle/Paddle/pull/40814))
-
- - Fix an error reported due to CINN not being able to handle useless outputs in operators such as batch_norm. ([#36996](https://github.com/PaddlePaddle/Paddle/pull/36996))
-
- - Fix several bugs in CINN subgraph partitioning and symbolization, and solve problems with Paddle training accessing the CINN. ([#36739](https://github.com/PaddlePaddle/Paddle/pull/36739), [#36698](https://github.com/PaddlePaddle/Paddle/pull/36698) )
-
- - CINN does not yet support control flow. Add logic to skip control flow when it is encountered. ([#40812](https://github.com/PaddlePaddle/Paddle/pull/40812))
-
-#### **Other**
-
-- Model quantization
-
- - Upgrade quantization storage format to unify quantization formats for dynamic and static graphs. ([#41041](https://github.com/PaddlePaddle/Paddle/pull/41041))
-
- - Add new post training quantization (PTQ): EMD and Adaround. ([#40421](https://github.com/PaddlePaddle/Paddle/pull/40421), [#38460](https://github.com/PaddlePaddle/Paddle/pull/38460))
-
- - Support to quantize more operations in PTQ and QAT, such as crop, split, ab, unsqueeze etc. ([#40083](https://github.com/PaddlePaddle/Paddle/pull/40083))
-
- - Support to quantize operators in control flow. ([#37498](https://github.com/PaddlePaddle/Paddle/pull/37498))
-
- - Support quantization of matmul_v2 operator. ([#36469](https://github.com/PaddlePaddle/Paddle/pull/36469))
-
- - Add support for quantized matmul_v2 inference on TensorRT. ([#36594](https://github.com/PaddlePaddle/Paddle/pull/36594))
-
-- CUDA memory optimization
-
- - Implement multi-stream safe Allocator to support safe and efficient use of CUDA memory in asynchronous computing scenarios. ([#37290](https://github.com/PaddlePaddle/Paddle/pull/37290))
-
- - Add new APIs (paddle.device.cuda.max_memory_allocated, paddle.device.cuda.max_memory_reserved, paddle.device.cuda.memory_allocated and paddle.device.cuda.memory_reserved) for GPU memory monitoring in runtime. ([#38657](https://github.com/PaddlePaddle/Paddle/pull/38657))
-
- - Support allocating CUDA Managed Memory to train very large models in memory-constrained scenarios. ([#39075](https://github.com/PaddlePaddle/Paddle/pull/39075))
-
- - Add GetBasePtr interface in C++ to get device address created with *cudaMalloc*. ([#37978](https://github.com/PaddlePaddle/Paddle/pull/37978))
-
- - Reduce the number of free blocks in AutoGrowth Allocator to improve memory allocation performance. ([#35732](https://github.com/PaddlePaddle/Paddle/pull/35732))
-
- - Remove redundant Float32 temporary tensor and cast operation for tensor with data type FP16 in `initializer.Normal` and `initializer.Constant` to save 2x memory. ([#38818](https://github.com/PaddlePaddle/Paddle/pull/38818))
-
-- High-order derivative testing for models in dynamic graphs.
-
- - Add third-order derivative testing for network in dynamic graphs. ([#36814](https://github.com/PaddlePaddle/Paddle/pull/36814), [#37377](https://github.com/PaddlePaddle/Paddle/pull/37377))
-- Custom op: Support custom ops on the ROCm (HIP) platform. ([#36771](https://github.com/PaddlePaddle/Paddle/pull/36771))
-
-- Cost Model: Add a basic Cost Model based on profiling information. ([#35774](https://github.com/PaddlePaddle/Paddle/pull/35774))
-
-- Add a function that lets users register their own layers and the corresponding pruning methods with ASP. ([#40253](https://github.com/PaddlePaddle/Paddle/pull/40253))
-
-- Add a string tensor data structure, allowing the framework to represent and process strings. ([#39830](https://github.com/PaddlePaddle/Paddle/pull/39830), [#40992](https://github.com/PaddlePaddle/Paddle/pull/40992))
-
-- Add or upgrade oneDNN FP32/int8/bfloat16 Kernel, including:
-
- - ELU ([#37149](https://github.com/PaddlePaddle/Paddle/pull/37149))
-
- - exp ([#38624](https://github.com/PaddlePaddle/Paddle/pull/38624))
-
- - stack ([#37002](https://github.com/PaddlePaddle/Paddle/pull/37002))
-
- - softplus ([#36382](https://github.com/PaddlePaddle/Paddle/pull/36382))
-
- - round ([#39653](https://github.com/PaddlePaddle/Paddle/pull/39653))
-
- - shape ([#36033](https://github.com/PaddlePaddle/Paddle/pull/36033))
-
- - flatten and flatten2 ([#35892](https://github.com/PaddlePaddle/Paddle/pull/35892))
-
- - slice ([#37630](https://github.com/PaddlePaddle/Paddle/pull/37630))
-
- - elementwise_mul ([#40546](https://github.com/PaddlePaddle/Paddle/pull/40546))
-
- - elementwise_add ([#38176](https://github.com/PaddlePaddle/Paddle/pull/38176))
-
- - elementwise_div ([#36158](https://github.com/PaddlePaddle/Paddle/pull/36158))
-
- - elementwise_sub ([#35662](https://github.com/PaddlePaddle/Paddle/pull/35662))
-
- - roi_align ([#37848](https://github.com/PaddlePaddle/Paddle/pull/37848))
-
- - nearest_interp and nearest_interp_v2 ([#37985](https://github.com/PaddlePaddle/Paddle/pull/37985),[#38622](https://github.com/PaddlePaddle/Paddle/pull/38622),[#39490](https://github.com/PaddlePaddle/Paddle/pull/39490))
-
- - assembly optimized Adam ([#39158](https://github.com/PaddlePaddle/Paddle/pull/39158))
-
- - logsoftmax ([#39793](https://github.com/PaddlePaddle/Paddle/pull/39793))
-
- - activation ([#40721](https://github.com/PaddlePaddle/Paddle/pull/40721))
-
- - mul ([#38552](https://github.com/PaddlePaddle/Paddle/pull/38552))
-
- - mean ([#37104](https://github.com/PaddlePaddle/Paddle/pull/37104))
-
- - relu ([#36265](https://github.com/PaddlePaddle/Paddle/pull/36265))
-
- - pool2d ([#37081](https://github.com/PaddlePaddle/Paddle/pull/37081))
-
- - concat ([#35889](https://github.com/PaddlePaddle/Paddle/pull/35889))
-
- - conv2d ([#38507](https://github.com/PaddlePaddle/Paddle/pull/38507),[#38938](https://github.com/PaddlePaddle/Paddle/pull/38938),[#36284](https://github.com/PaddlePaddle/Paddle/pull/36284))
-
- - LayerNorm ([#40418](https://github.com/PaddlePaddle/Paddle/pull/40418))
-
-- Add the 3-stage storage graph retrieval engine based on SSD - host memory - GPU device memory, to support large-scale graph neural network training. ([#42472](https://github.com/PaddlePaddle/Paddle/pull/42472), [#42321](https://github.com/PaddlePaddle/Paddle/pull/42321), [#42027](https://github.com/PaddlePaddle/Paddle/pull/42027))
-
-- Add a switch for the heterogeneous multi-cloud training communication module, implement the Send/Recv interface, and support communication across multiple heterogeneous clouds. ([#40965](https://github.com/PaddlePaddle/Paddle/pull/40965), [#40911](https://github.com/PaddlePaddle/Paddle/pull/40911))
-
-### **(2) Function optimization**
-
-#### API
-
-- Add backward implementation of `paddle.linalg.det`. ([#36013](https://github.com/PaddlePaddle/Paddle/pull/36013))
-
-- Add support for mixed precision training O2 mode for `paddle.Model`, i.e., support for the Pure FP16 training mode of the original dynamic/static graphs. ([#36441](https://github.com/PaddlePaddle/Paddle/pull/36441))
-
-- Support for self chain calls for `paddle.nn.Layer`. ([#36609](https://github.com/PaddlePaddle/Paddle/pull/36609))
-
-- Add settings of `is_distributed` property for the `to` method of `paddle.nn.Layer` to ensure that the distributed properties remain consistent before and after network parameter transform. ([#36221](https://github.com/PaddlePaddle/Paddle/pull/36221))
-
-- Improve the parameter conversion logic of the `to` method of `paddle.nn.Layer`, to reduce the peak memory consumption of the conversion process and improve the conversion success rate. ([#36862](https://github.com/PaddlePaddle/Paddle/pull/36862))
-
-- Support settings of the shape of the output Tensor for `paddle.incubate.graph_send_recv` to reduce the memory usage during the actual computation. ([#40509](https://github.com/PaddlePaddle/Paddle/pull/40509))
-
-- Add the support of int32 and int64 data types for `paddle.incubate.segment_sum`, `segment_mean`, `segment_max`, and `segment_min`. ([#40577](https://github.com/PaddlePaddle/Paddle/pull/40577))
-
-- Add the support of the bool type for transpose op. ([#35886](https://github.com/PaddlePaddle/Paddle/pull/35886))
-
-- Switch the `paddle.mm` underlying operator from matmul to matmul_v2. ([#35770](https://github.com/PaddlePaddle/Paddle/pull/35770))
-
-- Support static graph mode and support the unknown shape for `paddle.einsum`. ([#40360](https://github.com/PaddlePaddle/Paddle/pull/40360))
-
-- Support data parallelism for `paddle.nn.functional.margin_cross_entropy` and `paddle.nn.functional.class_center_sample`. ([#39852](https://github.com/PaddlePaddle/Paddle/pull/39852))
-
-- Support input of shape [1] for `paddle.nn.functional.grid_sample`. ([#36183](https://github.com/PaddlePaddle/Paddle/pull/36183))
-
-- Support NHWC data format for `paddle.nn.PRelu`. ([#37019](https://github.com/PaddlePaddle/Paddle/pull/37019))
-
-- Support the fixed random state using `paddle.seed` for `paddle.nn.functional.class_center_sample`. ([#38248](https://github.com/PaddlePaddle/Paddle/pull/38248))
-
-- Add ROCM backend support for all APIs under `paddle.fft`, and optimize CUFFT backend error messages. ([#36415](https://github.com/PaddlePaddle/Paddle/pull/36415), [#36114](https://github.com/PaddlePaddle/Paddle/pull/36114/files))
-
-- Support slicing that produces a dimension of size 0, that is, allow slicing index results to be empty. ([#37313](https://github.com/PaddlePaddle/Paddle/pull/37313))
-
-- Support int and bool type Tensors when using a bool index for `Tensor.setitem`. ([#37761](https://github.com/PaddlePaddle/Paddle/pull/37761))
-
-- Support nearest mode for `paddle.nn.functional.interpolate` when the input shape is 5D. ([#38868](https://github.com/PaddlePaddle/Paddle/pull/38868))
-
-- Add the support of int16 for `paddle.nn.Embedding` and `paddle.gather`. ([#40964](https://github.com/PaddlePaddle/Paddle/pull/40964), [#40052](https://github.com/PaddlePaddle/Paddle/pull/40052))
-
-- Support data parallelism on a single machine on the CPU platform in `paddle.distributed.spawn`. ([#35745](https://github.com/PaddlePaddle/Paddle/pull/35745), [#36758](https://github.com/PaddlePaddle/Paddle/pull/36758), [#36637](https://github.com/PaddlePaddle/Paddle/pull/36637))
-
-- Add `depthwise_conv2d` MKLDNN operator. ([#38484](https://github.com/PaddlePaddle/Paddle/pull/38484))
-
-- Add a complex type check in static graph mode for the APIs `paddle.abs`, `paddle.transpose`, `paddle.squeeze`, `paddle.unsqueeze`, `paddle.matmul`, and `paddle.full`. ([#40113](https://github.com/PaddlePaddle/Paddle/pull/40113))
-
-- Support tuple and list type arguments for `paddle.autograd.PyLayer`. ([#38146](https://github.com/PaddlePaddle/Paddle/pull/38146))
-
-- Add a check for whether a tensor is inplace and a leaf when calculating gradients. ([#37931](https://github.com/PaddlePaddle/Paddle/pull/37931))
-
-- Support HIP library for `paddle.autograd.PyLayer`. ([#38184](https://github.com/PaddlePaddle/Paddle/pull/38184))
-
-- Support more size inputs for `paddle.take_along_axis` and `paddle.put_along_axis`, and allow index matrix shape size to be larger than array matrix shape size. ([#39072](https://github.com/PaddlePaddle/Paddle/pull/39072))
-
-- Optimize the error report message of API `paddle.nn.Pad2D` when replicate is 0. ([#36510](https://github.com/PaddlePaddle/Paddle/pull/36510/files))
-
-- Support pad input in tuple format for API `paddle.nn.Pad2D`. ([#35985](https://github.com/PaddlePaddle/Paddle/pull/35985/files))
-
-- Add tdm_sample API in `paddle.distributed.InMemoryDataset` to support sampling operations in TDM algorithms. ([#37044](https://github.com/PaddlePaddle/Paddle/pull/37044))
-
-- Add Pre-saving Hooks mechanism for `paddle.jit.save`. ([#38186](https://github.com/PaddlePaddle/Paddle/pull/38186))
-
-- Add new higher-order differentiation-related APIs.
-
- - `elementwise_add`: add third-order Kernel, to support computation of third-order differentiation. ([#36508](https://github.com/PaddlePaddle/Paddle/pull/36508), [#36618](https://github.com/PaddlePaddle/Paddle/pull/36618))
-
- - `matmul_v2`: add third-order Kernel, to support computation of third-order differentiation. ([#36459](https://github.com/PaddlePaddle/Paddle/pull/36459))
-
- - `elementwise_mul`: Add third-order Kernel, to support computation of third-order differentiation. ([#37152](https://github.com/PaddlePaddle/Paddle/pull/37547))
-
-- Improve the logic of the `paddle.amp.GradScaler` to call check_finite_and_unscale op, to eliminate the cudaMemcpy introduced by the creation of the bool variable. ([#37770](https://github.com/PaddlePaddle/Paddle/pull/37770))
-
-- Add check for unstack and unique op in case of input Tensor with 0 elements. ([#36021](https://github.com/PaddlePaddle/Paddle/pull/36021))
-
-- Add multi-layer, bi-directional LSTM support for KUNLUNXIN 2, improve the RNN forward/backward ops, and support training of time-series models. ([#42076](https://github.com/PaddlePaddle/Paddle/pull/42076))
-
-- Add bce_loss forward/backward ops for KUNLUNXIN 2. ([#41610](https://github.com/PaddlePaddle/Paddle/pull/41610))
-
-#### IR (Intermediate Representation)
-
-- Dynamic Graphs to Static Graphs
-
- - Optimize the behavior of the `ProgramCache.last` interface for dynamic graph to static graph so that it returns the most recently used Program instead of the final generated Program. ([#39541](https://github.com/PaddlePaddle/Paddle/pull/39541))
-
- - Optimize the error report message for the `paddle.reshape` API for dynamic graph to static graph, and add a new recommended usage hint. ([#40599](https://github.com/PaddlePaddle/Paddle/pull/40599))
-
- - Optimize the type of exception catch in the `is_api_in_module` function when transcribing dynamic code to static code. ([#40243](https://github.com/PaddlePaddle/Paddle/pull/40243))
-
- - Optimize the error message hints for dynamic graph to static graph, and hide warning information by default. ([#39730](https://github.com/PaddlePaddle/Paddle/pull/39730))
-
- - Add the support of type hint syntax for dynamic graph to static graph to improve the accuracy of variable type analysis. ([#39572](https://github.com/PaddlePaddle/Paddle/pull/39572))
-
- - Optimize the `paddle.cond` function to allow return values of basic types such as bool and int, provided the values are equal. ([#37888](https://github.com/PaddlePaddle/Paddle/pull/37888))
-
- - Optimize the `@to_static` decorator to allow switching between train and eval modes. ([#37383](https://github.com/PaddlePaddle/Paddle/pull/37383))
-
- - Optimize the error stack for dynamic graph to static graph, to highlight user-related code and reduce redundant framework stack frames. ([#36741](https://github.com/PaddlePaddle/Paddle/pull/36741))
-
- - Remove the `no_value` placeholder from the return value of `paddle.cond`. ([#36513](https://github.com/PaddlePaddle/Paddle/pull/36513), [#36826](https://github.com/PaddlePaddle/Paddle/pull/36826))
-
- - Adapt the run_program op to the new dynamic graph mode. ([#40198](https://github.com/PaddlePaddle/Paddle/pull/40198), [#40355](https://github.com/PaddlePaddle/Paddle/pull/40355))
-
- - Add check for zip syntax. ([#37846](https://github.com/PaddlePaddle/Paddle/pull/37846))
-
- - Fix the dynamic graph to static graph failure due to the error of dimension and type judgment in the `paddle.signal.frame`, `paddle.signal.stft` and `paddle.signal.istft`. ([#40113](https://github.com/PaddlePaddle/Paddle/pull/40113))
-
- - Add registration of complex type Kernels for the mean and pad3d ops. ([#40113](https://github.com/PaddlePaddle/Paddle/pull/40113))
-
-
-#### **Mixed Precision Training**
-
-- Add a GPU Compute Capability environment check for amp. Add a usage warning for GPU environments that cannot accelerate training. ([#38086](https://github.com/PaddlePaddle/Paddle/pull/38086))
-
-- Add check of calling order when using `paddle.amp.decorate` and `paddle.DataParallel` at the same time. ([#38785](https://github.com/PaddlePaddle/Paddle/pull/38785))
-
-
-#### **Distributed Training**
-
-- Basic functions of the distributed training
-
- - Optimize Fleet API and DistributedStrategy configuration to use dynamic graph parallel function conveniently. ([#40408](https://github.com/PaddlePaddle/Paddle/pull/40408))
-
- - Optimize Dynamic Graph mixed parallel HybridParallelClipGrad strategy, support 4D hybrid parallel and Pure FP16 training. ([#36237](https://github.com/PaddlePaddle/Paddle/pull/36237), [#36555](https://github.com/PaddlePaddle/Paddle/pull/36555))
-
- - Restructure dynamic graph data parallel strategy, to support new dynamic graph and communication. ([#40389](https://github.com/PaddlePaddle/Paddle/pull/40389), [#40593](https://github.com/PaddlePaddle/Paddle/pull/40593), [#40836](https://github.com/PaddlePaddle/Paddle/pull/40836), [#41119](https://github.com/PaddlePaddle/Paddle/pull/41119), [#41413](https://github.com/PaddlePaddle/Paddle/pull/41413), [#39987](https://github.com/PaddlePaddle/Paddle/pull/39987))
-
- - Support distributed tensor model parallel for fused_attention op. ([#40101](https://github.com/PaddlePaddle/Paddle/pull/40101))
-
- - Support the distributed tensor model parallel for fused_feedforward op. ([#40160](https://github.com/PaddlePaddle/Paddle/pull/40160))
-
-- Graph retrieval engine
-
- - Optimize the data format returned by the graph sampling interface of the graph engine, with a 3x improvement of the sampling speed. ([#37315](https://github.com/PaddlePaddle/Paddle/pull/37315))
-
- - Reduce the amount of graph engine threads to improve performance. ([#37098](https://github.com/PaddlePaddle/Paddle/pull/37098))
-
- - Optimize graph engine data transfer to improve performance. ([#37341](https://github.com/PaddlePaddle/Paddle/pull/37341))
-
- - Optimize the merge logic of embedding op to improve performance by exploiting the topological relationship of embedding op in the model. ([#35942](https://github.com/PaddlePaddle/Paddle/pull/35942))
-
-- Communication library: restructure the communication library to improve the scalability and development of the communication library, and support heterogeneous communication. ([#41398](https://github.com/PaddlePaddle/Paddle/pull/41398), [#39720](https://github.com/PaddlePaddle/Paddle/pull/39720), [#40911](https://github.com/PaddlePaddle/Paddle/pull/40911), [#40579](https://github.com/PaddlePaddle/Paddle/pull/40579), [#40629](https://github.com/PaddlePaddle/Paddle/pull/40629), [#40437](https://github.com/PaddlePaddle/Paddle/pull/40437), [#40430](https://github.com/PaddlePaddle/Paddle/pull/40430), [#40228](https://github.com/PaddlePaddle/Paddle/pull/40228), [#40181](https://github.com/PaddlePaddle/Paddle/pull/40181), [#40100](https://github.com/PaddlePaddle/Paddle/pull/40100), [#40097](https://github.com/PaddlePaddle/Paddle/pull/40097), [#39892](https://github.com/PaddlePaddle/Paddle/pull/39892), [#39384](https://github.com/PaddlePaddle/Paddle/pull/39384), [#39737](https://github.com/PaddlePaddle/Paddle/pull/39737), [#40040](https://github.com/PaddlePaddle/Paddle/pull/40040))
-
-- Support the publication of MoE-related interfaces in `paddle.incubate.distributed.models.moe` (`moe.GShardGate`, `moe.BaseGate`, `moe.SwitchGate`, `moe.MoELayer`, and `moe.ClipGradForMOEByGlobalNorm`). ([#42300](https://github.com/PaddlePaddle/Paddle/pull/42300))
-
-- Fix the error reported when using recomputation in `paddle.incubate.distributed.models.moe.MoELayer`. ([#42128](https://github.com/PaddlePaddle/Paddle/pull/42128))
-
-- Fix the error reported in the new dynamic graph pipeline parallel caused by different data types. ([#41937](https://github.com/PaddlePaddle/Paddle/pull/41937), [#42053](https://github.com/PaddlePaddle/Paddle/pull/42053))
-
-- Fix the error reported in the new dynamic graph tensor model parallel caused by different data types. ([#41960](https://github.com/PaddlePaddle/Paddle/pull/41960))
-
-#### **Custom operator**
-
-- Enhance the C++ custom operator mechanism for writing second-order gradient operators, to support adding suffixes to the gradient input variables of second-order gradient operators for use as outputs. ([#41781](https://github.com/PaddlePaddle/Paddle/pull/41781))
-
-- Remove the use of the deprecated enumeration type `PlaceType` from the Tensor API member methods, make it compatible, and add a deprecation warning. ([#41882](https://github.com/PaddlePaddle/Paddle/pull/41882))
-
-- Add deprecation warnings for a number of deprecated interfaces of the original Tensor API, including the incomplete constructor, reshape, mutable_data, and copy_to methods. ([#41882](https://github.com/PaddlePaddle/Paddle/pull/41882))
-
-#### **Other**
-
-- Error report and debugging optimization
-
- - Optimize the error message of the label boundary check for the cross_entropy op. ([#40001](https://github.com/PaddlePaddle/Paddle/pull/40001))
-
- - Add profile records for the `infer_shape` and `compute` methods of op execution in dynamic graphs, and show their cost in the timeline. ([#39023](https://github.com/PaddlePaddle/Paddle/pull/39023))
-
- - Replace `pybind::index_error` error hint on Windows for unknown exceptions. ([#40538](https://github.com/PaddlePaddle/Paddle/pull/40538))
-
- - Add an error message to the out-of-bounds check for the scatter op. ([#37429](https://github.com/PaddlePaddle/Paddle/pull/37429))
-
-- Download tool: fix the slow decompression of directories containing multiple files in `paddle.utils.download.get_path_from_url`. The original approach, which traversed the directory in a loop and decompressed files one by one, is replaced by a single extractall call on the directory, which greatly improves the decompression speed. ([#37311](https://github.com/PaddlePaddle/Paddle/pull/37311))
-
-- Speed up the quantization training for `fake_quantize_range_abs_max`, `fake_quantize_abs_max`, `fake_quantize_dequantize_abs_max`, `fake_quantize_moving_average_abs_max`, etc. ([#40491](https://github.com/PaddlePaddle/Paddle/pull/40491))
-
-
-### **(3) Performance optimization**
-
-#### **Distributed Training**
-
-- Hybrid parallel optimizer `sharding_optimizer` supports `optimize_cast` optimization, which moves the parameter casts from the forward and backward stages to the optimizer stage. This improves performance by 7%. ([#35878](https://github.com/PaddlePaddle/Paddle/pull/35878))
-
-- GPUPS optimization: support for gradient fuse allreduce training. This improves training performance by 20%. ([#35131](https://github.com/PaddlePaddle/Paddle/pull/35131))
-
-- GPUPS optimization: dump CPU optimization speed improves by 3.21x. ([#40068](https://github.com/PaddlePaddle/Paddle/pull/40068))
-
-- CPU parameter server streaming training optimization: support for automatic collection of sparse parameter statistics, incremental saving of sparse parameters, etc. The training performance improves by 20%. ([#36465](https://github.com/PaddlePaddle/Paddle/pull/36465), [#36601](https://github.com/PaddlePaddle/Paddle/pull/36601), [#36734](https://github.com/PaddlePaddle/Paddle/pull/36734), [#36909](https://github.com/PaddlePaddle/Paddle/pull/36909), [#36943](https://github.com/PaddlePaddle/Paddle/pull/36943), [#37181](https://github.com/PaddlePaddle/Paddle/pull/37181), [#37194](https://github.com/PaddlePaddle/Paddle/pull/37194), [#37515](https://github.com/PaddlePaddle/Paddle/pull/37515), [#37626](https://github.com/PaddlePaddle/Paddle/pull/37626), [#37995](https://github.com/PaddlePaddle/Paddle/pull/37995), [#38582](https://github.com/PaddlePaddle/Paddle/pull/38582), [#39250](https://github.com/PaddlePaddle/Paddle/pull/39250), [#40762](https://github.com/PaddlePaddle/Paddle/pull/40762), [#41234](https://github.com/PaddlePaddle/Paddle/pull/41234), [#41320](https://github.com/PaddlePaddle/Paddle/pull/41320), [#41400](https://github.com/PaddlePaddle/Paddle/pull/41400))
-
-#### **Auto-tuning**
-
-Add hardware-aware automatic performance tuning for the full training process, with performance improvements of about 3% to 50% or more on image classification, segmentation, detection, and image generation tasks compared to the model's default configuration. The auto-tuning status is set via the `paddle.incubate.autotune.set_config` API. By default, it is currently disabled. Auto-tuning has three specific levels:
-
-- Add the auto-tuning function to `paddle.io.DataLoader`, to select the best num_workers based on training data and device resources. ([#42004](https://github.com/PaddlePaddle/Paddle/pull/42004))
-
-- Add mixed-precision training data layout auto-tuning feature, to select the best data layout based on device type and data type, and automatically convert it at runtime. ([#41964](https://github.com/PaddlePaddle/Paddle/pull/41964))
-
-- Add the automatic tuning of the required workspace size threshold for Conv, which is automatically set based on the device memory currently available for request on the GPU. Add the automatic selection of Conv cuDNN algorithms based on the generic AlgorithmCache design and Kernel timing component, which supports models with variable-length data. ([#41833](https://github.com/PaddlePaddle/Paddle/pull/41833))
-
-#### **Operator Optimization**
-
-- Optimize `FasterTokenizer` performance, with a 10% performance improvement compared to pre-optimization. ([#36701](https://github.com/PaddlePaddle/Paddle/pull/36701))
-
-- Optimize `index_select` backward computation, with 3.7~25.2x performance improvement over pre-optimization. ([#37055](https://github.com/PaddlePaddle/Paddle/pull/37055))
-
-- Optimize the performance of `paddle.nn.ClipGradByGlobalNorm`. Taking a 10*10 `paddle.nn.Linear` as an example, the performance improves by about 30% over pre-optimization. ([#38209](https://github.com/PaddlePaddle/Paddle/pull/38209))
-
-- Optimize the performance of `pnorm` with very large or very small `axis` dimensions, with 31-96x improvement in forward speed and 1.1-19x improvement in backward speed. ([#37685](https://github.com/PaddlePaddle/Paddle/pull/37685), [#38215](https://github.com/PaddlePaddle/Paddle/pull/38215), [#39011](https://github.com/PaddlePaddle/Paddle/pull/39011))
-
-- Optimize `softmax` forward and backward performance, with a speedup ratio of about 2x for the `axis!=-1` configuration. ([#38602](https://github.com/PaddlePaddle/Paddle/pull/38602), [#38609](https://github.com/PaddlePaddle/Paddle/pull/38609), [#32387](https://github.com/PaddlePaddle/Paddle/pull/32387), [#37927](https://github.com/PaddlePaddle/Paddle/pull/37927/files))
-
-- Optimize `log_softmax` forward and backward performance, with a speedup ratio of about 6x to 20x for `axis!=-1` configurations. ([#38992](https://github.com/PaddlePaddle/Paddle/pull/38992), [#40612](https://github.com/PaddlePaddle/Paddle/pull/40612))
-
-- Optimize `softmax_with_cross_entropy` forward and backward performance, with a speedup ratio of about 1.3x for the `hard_label` configuration. ([#39553](https://github.com/PaddlePaddle/Paddle/pull/39553), [#40424](https://github.com/PaddlePaddle/Paddle/pull/40424), [#40643](https://github.com/PaddlePaddle/Paddle/pull/40643))
-
-- Optimize `top_k` performance, with a speedup ratio of more than 22x for one-dimension and larger `k` (k=5000) configuration. ([#40941](https://github.com/PaddlePaddle/Paddle/pull/40941))
-
-- Optimize `elementwise_mul` backward computation, with 1.85~12.16x performance improvement over pre-optimization. ([#37728](https://github.com/PaddlePaddle/Paddle/pull/37728))
-
-- Optimize `elementwise_min` and `elementwise_max` backward computation, to equalize or improve performance by 1.05x to 18.75x over pre-optimization. ([#38236](https://github.com/PaddlePaddle/Paddle/pull/38236), [#37906](https://github.com/PaddlePaddle/Paddle/pull/37906))
-
-- Optimize `nearest_interp` forward and backward computation, with forward performance improvement by 1.5x to 2.3x over pre-optimization, and backward performance improvement by 60% to 1.8x over pre-optimization. ([#38528](https://github.com/PaddlePaddle/Paddle/pull/38528), [#39067](https://github.com/PaddlePaddle/Paddle/pull/39067))
-
-- Optimize `bilinear_interp` forward and backward computation, with forward performance improvement by 0.4x to 2.3x over pre-optimization, and backward performance improvement by 10%-30% over pre-optimization. ([#39243](https://github.com/PaddlePaddle/Paddle/pull/39243), [#39423](https://github.com/PaddlePaddle/Paddle/pull/39423))
-
-- Optimize `dropout` forward and backward computation, with performance improvement by about 20%. ([#39795](https://github.com/PaddlePaddle/Paddle/pull/39795), [#38859](https://github.com/PaddlePaddle/Paddle/pull/38859), [#38279](https://github.com/PaddlePaddle/Paddle/pull/38279), [#40053](https://github.com/PaddlePaddle/Paddle/pull/40053))
-
-- Optimize `grid_sampler` forward and backward computation, with forward performance improvement by 10% to 30% over pre-optimization, and backward performance improvement by 10% to 60% over pre-optimization. ([#39751](https://github.com/PaddlePaddle/Paddle/pull/39751))
-
-- Optimize `group_norm` forward and backward computation, with the forward performance improvement by 1.04x to 2.35x, and backward performance improvement by 1.12x to 1.18x. ([#39944](https://github.com/PaddlePaddle/Paddle/pull/39944), [#40657](https://github.com/PaddlePaddle/Paddle/pull/40657), [#39596](https://github.com/PaddlePaddle/Paddle/pull/39596))
-
-- Optimize `conv1d` forward and backward computation, with the forward performance improvement by 1.00x to 2.01x, and backward performance improvement by 1.01x to 474.56x. ([#38425](https://github.com/PaddlePaddle/Paddle/pull/38425))
-
-- Optimize `elementwise_div` backward computation, with the backward performance improvement by 1.02x to 29.25x. ([#38044](https://github.com/PaddlePaddle/Paddle/pull/38044))
-
-- Optimize `gelu` forward and backward computation, with the forward performance improvement by 1.13x to 1.43x, and backward performance improvement by 1.10x to 1.55x. ([#38188](https://github.com/PaddlePaddle/Paddle/pull/38188), [#38263](https://github.com/PaddlePaddle/Paddle/pull/38263))
-
-- Optimize `elementwise_sub` backward computation, with the backward performance improvement by 1.04x to 15.64x. ([#37754](https://github.com/PaddlePaddle/Paddle/pull/37754))
-
-- Optimize the forward performance of `flip` on one-dimensional data input, with the performance improvement by 100%. ([#37825](https://github.com/PaddlePaddle/Paddle/pull/37825))
-
-- Optimize `layer_norm` forward and backward computation, with the forward performance improvement by 2x to 5x over pre-optimization, and backward performance improvement by 20% to 50% over pre-optimization. ([#39167](https://github.com/PaddlePaddle/Paddle/pull/39167), [#39247](https://github.com/PaddlePaddle/Paddle/pull/39247))
-
-- Optimize `embedding` forward and backward computation, with a maximum improvement of 1.51x in forward performance and 1.03x to 7.79x in backward performance. ([#39856](https://github.com/PaddlePaddle/Paddle/pull/39856), [#39886](https://github.com/PaddlePaddle/Paddle/pull/39886))
-
-- Optimize `gelu` FP16 forward and backward calculations, with forward performance improvement by 9% to 12% over pre-optimization, and backward performance improvement by 2% to 9% over pre-optimization. ([#38980](https://github.com/PaddlePaddle/Paddle/pull/38980))
-
-- Remove CPU -> GPU explicit data transfer operation in `gather_nd` forward and backward operators, and remove the explicit synchronous operation in `index_select` forward and backward operators. Change GPU -> GPU data transfer in `scatter_nd` from synchronous operation to asynchronous operation. ([#40933](https://github.com/PaddlePaddle/Paddle/pull/40933))
-
-- Optimize `Lars optimizer` computation, with the training performance improvement of the Resnet50 FP16 model by 5.1% over pre-optimization. ([#35652](https://github.com/PaddlePaddle/Paddle/pull/35652), [#35476](https://github.com/PaddlePaddle/Paddle/pull/35476))
-
-- Optimize `AvgPool2dGrad` computation, with the performance improvement by 2.6x over pre-optimization. ([#35389](https://github.com/PaddlePaddle/Paddle/pull/35389))
-
-- Optimize `Elementwise` computation for multivariate output, improving performance by up to 15% over pre-optimization. ([#38329](https://github.com/PaddlePaddle/Paddle/pull/38329), [#38410](https://github.com/PaddlePaddle/Paddle/pull/38410))
-
-- Optimize the probs computation of `Categorical`, simplifying the computation logic and improving the performance by 4x to 5x. ([#42178](https://github.com/PaddlePaddle/Paddle/pull/42178))
-
-- Optimize the `paddle.sum` performance, with performance improvement by about 20%. ([#42309](https://github.com/PaddlePaddle/Paddle/pull/42309))
-
-- Remove CudaStreamSync operation from `paddle.nn.ClipGradByGlobalNorm` to reduce scheduling overhead during execution, with 5% performance improvement on ptb models. ([#42170](https://github.com/PaddlePaddle/Paddle/pull/42170))
-
-- Optimize a series of underlying data structures and detailed implementations in the original dynamic graph execution system to improve the scheduling performance of the original dynamic graph. ([#42010](https://github.com/PaddlePaddle/Paddle/pull/42010), [#42171](https://github.com/PaddlePaddle/Paddle/pull/42171), [#42224](https://github.com/PaddlePaddle/Paddle/pull/42224), [#42256](https://github.com/PaddlePaddle/Paddle/pull/42256), [#42306](https://github.com/PaddlePaddle/Paddle/pull/42306), [#42329](https://github.com/PaddlePaddle/Paddle/pull/42329), [#42340](https://github.com/PaddlePaddle/Paddle/pull/42340), [#42368](https://github.com/PaddlePaddle/Paddle/pull/42368), [#42425](https://github.com/PaddlePaddle/Paddle/pull/42425))
-
-- Simplify the probs calculation logic of `paddle.distribution.Categorical`, to improve performance by 4x to 5x. ([#42178](https://github.com/PaddlePaddle/Paddle/pull/42178))
-
-### **(4) Bug fixing**
-
-#### API
-
-- Fix the output type error with `paddle.sum` when the input parameter type and output parameter type do not match and the number of reduce elements on the `axis` is 1. ([#36123](https://github.com/PaddlePaddle/Paddle/pull/36123))
-
-- Fix an `AttributeError` in `paddle.flops` when the layer output type is tuple. ([#38850](https://github.com/PaddlePaddle/Paddle/pull/38850))
-
-- Fix `paddle.diag` failing to propagate gradients because there is no backward kernel. ([#40447](https://github.com/PaddlePaddle/Paddle/pull/40447))
-
-- Fix an error in sorting `paddle.sort` input with NaN values. ([#41070](https://github.com/PaddlePaddle/Paddle/pull/41070))
-
-- Fix the error when the input of `paddle.full_like` contains INF values. ([#40232](https://github.com/PaddlePaddle/Paddle/pull/40232))
-
-- Fix the bug in `paddle.strided_slice`: the strided_slice result is not consistent with slice when the data in the input of starts is less than -rank. ([#39066](https://github.com/PaddlePaddle/Paddle/pull/39066))
-
-- Fix the bug in the `max_pool` family of operators where infer_shape is calculated incorrectly when index is returned. This affects the APIs: `paddle.nn.functional.max_pool1d/2d/3d`, `paddle.nn.functional.adaptive_max_pool1d/2d/3d`, `paddle.nn.MaxPool1D/2D/3D`, `paddle.nn.AdaptiveMaxPool1D/2D/3D`. ([#40139](https://github.com/PaddlePaddle/Paddle/pull/40139))
-
-- Fix an issue where the dtype of pooling_mask returned by the `max_pool` family of operators is incorrect. Now the dtype of pooling_mask is int32. The affected APIs are `paddle.nn.functional.max_pool1d/2d/3d`, `paddle.nn.functional.adaptive_max_pool1d/2d/3d`, `paddle.nn.MaxPool1D/2D/3D`, `paddle.nn.AdaptiveMaxPool1D/2D/3D`. ([#39314](https://github.com/PaddlePaddle/Paddle/pull/39314))
-
-- Fix the bug with `paddle.shape` where the backward gradient by default causes a computation error. ([#37340](https://github.com/PaddlePaddle/Paddle/pull/37340))
-
-- Fix the bug in the `to` method of `paddle.nn.Layer` when converting both dtype and place at the same time. ([#37007](https://github.com/PaddlePaddle/Paddle/pull/38007))
-
-- Fix the bug that `paddle.amp.decorate` fails to rewrite the parameters of non-leaf network layers to FP16. ([#38402](https://github.com/PaddlePaddle/Paddle/pull/38402))
-
-- Fix the bug that the `paddle.amp.decorate` rewrites the non-input parameter in `paddle.nn.BatchNorm1D`, `paddle.nn.BatchNorm2D`, and `paddle.nn.BatchNorm3D` to FP16. ([#38541](https://github.com/PaddlePaddle/Paddle/pull/38541))
-
-- Fix the bug that the `paddle.amp.decorate` rewrites the non-input parameter in `paddle.nn.SyncBatchNorm` to FP16. ([#40943](https://github.com/PaddlePaddle/Paddle/pull/40943))
-
-- Fix redundant warnings in `paddle.nn.Layer.to`. ([#36700](https://github.com/PaddlePaddle/Paddle/pull/36700))
-
-- Fix the bug in `paddle.nn.RNN` when being used inside control flow. ([#41162](https://github.com/PaddlePaddle/Paddle/pull/41162))
-
-- Fix the bug that the `paddle.to_tensor` fails to specify the CUDAPlace of the Tensor. ([#39662](https://github.com/PaddlePaddle/Paddle/pull/39662))
-
-- Fix the issue that `paddle.nn.Identity` is not exposed. ([#39615](https://github.com/PaddlePaddle/Paddle/pull/39615))
-
-- Fix the bug where the output values of the `fill_` and `zero_` inplace APIs are incorrect when the input is on a CUDAPinned Place after dynamic graph reconstruction. ([#41229](https://github.com/PaddlePaddle/Paddle/pull/41229))
-
-- After refactoring the dynamic graph, fix the bug of incorrect inplace version value of the output Tensor when calling assign op using the append op. Change it to call assign op using the `_C_ops`. ([#41118](https://github.com/PaddlePaddle/Paddle/pull/41118))
-
-- Remove unreasonable code in the third-order kernel of `elementwise_add`, and fix an uninitialized issue in the network creation process. ([#36618](https://github.com/PaddlePaddle/Paddle/pull/36618))
-
-- Fix the missing attribute bug in `conv2d` execution of cuDNN Kernel. ([#38827](https://github.com/PaddlePaddle/Paddle/pull/38827))
-
-- Fix an issue where `multiclass_nms3` output shape is incorrect. ([#40059](https://github.com/PaddlePaddle/Paddle/pull/40059))
-
-- Fix an issue with `yolo_box` outputting incorrect shape. ([#40056](https://github.com/PaddlePaddle/Paddle/pull/40056))
-
-- Fix an issue where the higher-order differentiation `gradients` interface does not take effect as expected when target_grad is specified. ([#40940](https://github.com/PaddlePaddle/Paddle/pull/40940/))
-
-- Fix an issue where the network parameter type is incorrect when the default_dtype is modified in the `_BatchNormBase` base class in dynamic graph mode. The affected APIs are `paddle.nn.BatchNorm1D`, `paddle.nn.BatchNorm2D`, `paddle.nn.BatchNorm3D`, and `paddle.nn.SyncBatchNorm`. Specific reason: when `get_default_dtype() == 'float16'`, the default parameter data type is changed by `set_default_dtype('float32')`. Parameters in dynamic graph mode are created with default_dtype, so changing the default parameter type causes type errors in parameters created later during networking. ([#36376](https://github.com/PaddlePaddle/Paddle/pull/36376))
-
-- Fix the bug of the undefined intermediate variable in the backward op of batchnorm op when the data type is FP32, `dims = 2`, and `data_layout = NHWC`. ([#37020](https://github.com/PaddlePaddle/Paddle/pull/37020))
-
-- Fix the bug that the shape of weights is incorrect when using `paddle.static.nn.prelu` in static graph mode with input format `NHWC` and `mode==channel`. ([#38310](https://github.com/PaddlePaddle/Paddle/pull/38310))
-
-- Fix the bug of `paddle.nn.functional.class_center_sample`: CUDA seed setting issue in multi-machine case. ([#38815](https://github.com/PaddlePaddle/Paddle/pull/38815))
-
-- Fix the bug of failing to report an error when the input of `paddle.nn.functional.one_hot` is incorrect. ([#41335](https://github.com/PaddlePaddle/Paddle/pull/41335))
-
-- Fix an issue where a callback to reclaim device memory on a DCU device is not triggered in time, resulting in an OOM of the device memory. ([#40445](https://github.com/PaddlePaddle/Paddle/pull/40445))
-
-- Fix abnormal backward gradients and abnormal inplace logic handling of `setitem` in some dynamic graph scenarios. ([#37023](https://github.com/PaddlePaddle/Paddle/pull/37023), [#38298](https://github.com/PaddlePaddle/Paddle/pull/38298))
-
-- Fix abnormal indexing when a Tensor array is indexed with Slice in dynamic-to-static scenarios. ([#39251](https://github.com/PaddlePaddle/Paddle/pull/39251))
-
-- Fix the bug of memory or device memory leaks caused by some temporary variables not being correctly destructed when `paddle.Tensor.register_hook` interface is used. ([#40716](https://github.com/PaddlePaddle/Paddle/pull/40716))
-
-- Fix the bug that `Tensor.getitem` cannot get the value when the index is a bool Tensor with all False. ([#41297](https://github.com/PaddlePaddle/Paddle/pull/41297))
-
-- Fix the bug that `Tensor.getitem` cannot get the value when the index is a bool scalar Tensor. ([#40829](https://github.com/PaddlePaddle/Paddle/pull/40829))
-
-- Fix the bug in `paddle.index_select` when index is a 0-shape Tensor. ([#41383](https://github.com/PaddlePaddle/Paddle/pull/41383))
-
-- Fix the bug when the number of GPU threads requested by `paddle.index_select` and `paddle.index_sample` exceeds the limited machine resources. ([#41127](https://github.com/PaddlePaddle/Paddle/pull/41127), [#37816](https://github.com/PaddlePaddle/Paddle/pull/37816), [#39736](https://github.com/PaddlePaddle/Paddle/pull/39736), [#41563](https://github.com/PaddlePaddle/Paddle/pull/41563))
-
-- Fix the bug when ReduceConfig, elemwise_grad, gather, gather_nd, and scatter ops request more GPU threads than the limited machine resources. ([#40813](https://github.com/PaddlePaddle/Paddle/pull/40813), [#41127](https://github.com/PaddlePaddle/Paddle/pull/41127))
-
-- Fix the bug that the memory access is out of bounds when NX != 1 in ReadData, ReadDataBc, and ReadDataReduce in Kernel Primitive API. ([#36373](https://github.com/PaddlePaddle/Paddle/pull/36373))
-
-- Fix abnormal computation results due to data overflow caused by an incorrect IndexRandom data type. ([#39867](https://github.com/PaddlePaddle/Paddle/pull/39867), [#39891](https://github.com/PaddlePaddle/Paddle/pull/39891))
-
-- Fix the incorrect result returned by reduce op when reduce_num = 1. ([#38771](https://github.com/PaddlePaddle/Paddle/pull/38771))
-
-- Fix the out-of-bounds memory access of reduce op when reducing over a middle dimension in HIP environments. ([#41273](https://github.com/PaddlePaddle/Paddle/pull/41273))
-
-- Fix the bug where the Kernel fails to release properly when matmul op computes two FP16 one-dimensional vectors.
-
-- Fix the bug caused by CUDA integer computation overflow for some operators, including: bernoulli, gaussian_random, gumbel_softmax, multinomial, truncated_gaussian_random, uniform_random_inplace, and uniform_random ops. ([#37670](https://github.com/PaddlePaddle/Paddle/pull/37670))
-
-- Fix the bug where `paddle.nn.Sequential` reports a KeyError error when traversing sublayers in a for loop. ([#39372](https://github.com/PaddlePaddle/Paddle/pull/39372))
-
-- Fix the bug of the check shape error in `paddle.nn.functional.unfold` when compiling in static graphs. ([#38907](https://github.com/PaddlePaddle/Paddle/pull/38907), [#38819](https://github.com/PaddlePaddle/Paddle/pull/38819))
-
-- Fix the bug of reporting an error if `axis` is specified when using dropout for static graphs. ([#37223](https://github.com/PaddlePaddle/Paddle/pull/37223))
-
-- Migrate the matmul operator in the `paddle.nn.MultiHeadAttention` to the matmul_v2 operator. ([#36222](https://github.com/PaddlePaddle/Paddle/pull/36222))
-
-- Fix the FPE thrown when an empty Tensor is used in `paddle.nn.functional.label_smooth`. ([#35861](https://github.com/PaddlePaddle/Paddle/pull/35861))
-
-- Fix the deformation bug of reshape op when the input is an empty Tensor. Support reshaping an empty Tensor to [-1]. ([#36087](https://github.com/PaddlePaddle/Paddle/pull/36087))
-
-- Fix the bug where modified values incorrectly override other rows when the input parameter offset of `fill_diagonal` is non-zero. ([#36212](https://github.com/PaddlePaddle/Paddle/pull/36212))
-
-- Modify the range op so that the stop_gradient of its returned Tensor is set to True in dynamic graph mode. ([#37486](https://github.com/PaddlePaddle/Paddle/pull/37486))
-
-- Fix the bug where Lamb optimizer is updated incorrectly when Beta1Pow and Beta2Pow are on the GPU. ([#38518](https://github.com/PaddlePaddle/Paddle/pull/38518))
-
-- Fix the bug where the conv2d operator does not respect FLAGS_cudnn_deterministic. ([#37173](https://github.com/PaddlePaddle/Paddle/pull/37173))
-
-- Fix the bug caused by an earlier version of cufft that does not define CUFFT_VERSION. ([#37312](https://github.com/PaddlePaddle/Paddle/pull/37312))
-
-- Fix the computing error of `paddle.ifftshift` and `paddle.fftshift`. ([#36834](https://github.com/PaddlePaddle/Paddle/pull/36834), [#36748](https://github.com/PaddlePaddle/Paddle/pull/36748))
-
-- Fix the `axis` computation error in `paddle.fft` series of APIs. ([#36321](https://github.com/PaddlePaddle/Paddle/pull/36321))
-
-- Fix an output data type registration bug of batch_norm_grad op in case of FP16 data type. This bug causes the compilation failure in some scenarios. There is also the impact on FP16 computational precision. ([#42461](https://github.com/PaddlePaddle/Paddle/pull/42461))
-
-- Fix the incorrect InferShape information in the `paddle.nn.functional.pad` API when the padding is a Tensor in dynamic-to-static conversion. ([#42414](https://github.com/PaddlePaddle/Paddle/pull/42414))
-
-- Fix an exception in `paddle.distribution.StickBreakingTransform` when the input dimension exceeds 2. ([#41762](https://github.com/PaddlePaddle/Paddle/pull/41672))
-
-- Fix a nan/inf bug calculated with QK^T in fused_attention op. ([#42032](https://github.com/PaddlePaddle/Paddle/pull/42032))
-
-- Fix a nan/inf bug calculated in fused_attention op with FusedResidualDropoutBias on V100. ([#42398](https://github.com/PaddlePaddle/Paddle/pull/42398))
-
-- Fix a redundant data transform bug introduced by the full_like op during execution. ([#41973](https://github.com/PaddlePaddle/Paddle/pull/41973))
-
-- Fix a problem with p_norm op calculating nan on GPU environments. ([#41804](https://github.com/PaddlePaddle/Paddle/pull/41804))
-
-- Fix a section error of split op when the sections parameter has a size of 0. ([#41755](https://github.com/PaddlePaddle/Paddle/pull/41755))
-
-- Fix the bug where Place (gpu:0) is reported as not supported in multi-card training when broadcast is required in 6 elementwise ops (pow, complex, divide_double, multiply_double, fmax, and fmin). ([#42332](https://github.com/PaddlePaddle/Paddle/pull/42332))
-
-- Fix the warning about a deprecated interface that is reported on `import paddle` due to a PIL version update. ([#42307](https://github.com/PaddlePaddle/Paddle/pull/42307))
-
-- Fix the bug that `paddle.linalg.matrix_rank` does not support tol as an FP64 Tensor in static graph mode. ([#42085](https://github.com/PaddlePaddle/Paddle/pull/42085))
-
-#### IR (Intermediate Representation)
-
-- Dynamic to static graphs
-
- - Fix a type derivation error in reverse gradient accumulation when the `tensor_array` is used with the control flow. ([#39585](https://github.com/PaddlePaddle/Paddle/pull/39585), [#39689](https://github.com/PaddlePaddle/Paddle/pull/39689))
-
- - Fix an issue where the parameter gradient type is not set correctly during dynamic to static AMP training. ([#40938](https://github.com/PaddlePaddle/Paddle/pull/40938))
-
- - Fix an error reported in dynamic-to-static transcription when there are misplaced comments in the code. ([#39035](https://github.com/PaddlePaddle/Paddle/pull/39035), [#38003](https://github.com/PaddlePaddle/Paddle/pull/38003))
-
- - Fix an issue where Tensor is not properly converted to Variable when calling a non-forward function in dynamic-to-static code. ([#37296](https://github.com/PaddlePaddle/Paddle/pull/37296), [#38540](https://github.com/PaddlePaddle/Paddle/pull/38540))
-
- - Fix an issue where `paddle` is incorrectly passed as a variable during dynamic-to-static transcription. ([#37999](https://github.com/PaddlePaddle/Paddle/pull/37999))
-
- - Fix an issue where model parameters are incorrectly counted when calling `paddle.flops` after model dynamic to static conversion. ([#36852](https://github.com/PaddlePaddle/Paddle/pull/36852))
-
- - Fix an issue where GPU memory will keep growing in train mode and no_grad contexts after loading models using the `paddle.jit.save/load` interface. ([#36434](https://github.com/PaddlePaddle/Paddle/pull/36434))
-
- - Add a warning in the convert_call function when converting a generator function. ([#35369](https://github.com/PaddlePaddle/Paddle/pull/35369))
-
- - Fix the run_program op dependency analysis bug. ([#38470](https://github.com/PaddlePaddle/Paddle/pull/38470))
-
- - Fix the code conversion bug when returning a single value in control flow For. ([#40683](https://github.com/PaddlePaddle/Paddle/pull/40683))
-
- - Fix the bug when generating a reverse op when the input to conditional_block op contains LoDTensorArray. ([#39585](https://github.com/PaddlePaddle/Paddle/pull/39585))
-
- - Fix the bug that `paddle.jit.save` loses the forward_pre_hook and forward_post_hook of the top Layer when exporting a dynamic-to-static graph model. ([#42273](https://github.com/PaddlePaddle/Paddle/pull/42273))
-
- - Fix the error reported in dynamic-to-static conversion when the shape parameter in `paddle.expand` contains a Tensor. ([#41973](https://github.com/PaddlePaddle/Paddle/pull/41973))
-
-
-#### **Distributed Training**
-
-- Distributed training basic functions
-
- - Fix the bug of a port reporting error in the distributed multi-machine training. ([#37274](https://github.com/PaddlePaddle/Paddle/pull/37274))
-
- - Fix the brpc compilation dependency bug. ([#37064](https://github.com/PaddlePaddle/Paddle/pull/37064))
-
- - Fix an occupied port issue due to tcp self-connections when Fleet starts. ([#38174](https://github.com/PaddlePaddle/Paddle/pull/38174))
-
- - Fix the precision degradation bug under data parallel due to inconsistent initialization of FP16 parameters under multiple cards. ([#38838](https://github.com/PaddlePaddle/Paddle/pull/38838), [#38563](https://github.com/PaddlePaddle/Paddle/pull/38563), [#38405](https://github.com/PaddlePaddle/Paddle/pull/38405))
-
- - Fix the precision degradation under data parallel due to FP16 gradient synchronization without dividing by the number of cards. ([#38378](https://github.com/PaddlePaddle/Paddle/pull/38378))
-
-- Dynamic graph hybrid parallel
-
- - Fix the bug where parameters are not updated in FP16 mode under hybrid parallel by using the new update interface. ([#36017](https://github.com/PaddlePaddle/Paddle/pull/36017))
-
-- Static graph hybrid parallel
-
- - Fix an issue where grad merge is not compatible with ClipGradientByGlobalNorm in distributed dp mode. ([#36334](https://github.com/PaddlePaddle/Paddle/pull/36334))
-
- - Fix an issue under hybrid parallelism where the non-distributed parameters of tensor model parallelism are not broadcast during the initialization phase, resulting in inconsistent non-distributed parameters across cards. ([#36186](https://github.com/PaddlePaddle/Paddle/pull/36186))
-
- - Fix the issue that sharding's save_persistables interface does not save FP16 parameters and offload persistent variables when sharding is enabled with offload. ([#40477](https://github.com/PaddlePaddle/Paddle/pull/40477))
-
- - Fix the bug where ema parameters are not saved on non-0 cards when sharding is enabled for training. ([#39860](https://github.com/PaddlePaddle/Paddle/pull/39860))
-
- - Fix an issue where FC incorrectly calculates gradients according to column cuts. ([#38724](https://github.com/PaddlePaddle/Paddle/pull/38724))
-
- - Fix the error reported when DistributedStrategy is set to without_graph_optimizer and used with rnn. ([#36176](https://github.com/PaddlePaddle/Paddle/pull/36176))
-
-- GPUPS Parameter Server Training
-
- - Fix the CPU branch compilation bug triggered by the GPUPS macro definition. ([#37248](https://github.com/PaddlePaddle/Paddle/pull/37248))
-
- - Fix an occasional error raised when saving delta and pullsparse concurrency during GPUPS streamline training. ([#37233](https://github.com/PaddlePaddle/Paddle/pull/37233))
-
- - Fix a download error issue caused by HDFSClient querying a directory without returning the full path. ([#36590](https://github.com/PaddlePaddle/Paddle/pull/36590))
-
- - Fix the bug with pulling old parameters in GPUPS streamline training. ([#36512](https://github.com/PaddlePaddle/Paddle/pull/36512))
-
- - Fix a GPUPS multi-stream allocation issue. ([#37476](https://github.com/PaddlePaddle/Paddle/pull/37476))
-
- - Fix the bug of the GPUPS pybind out of core. ([#37287](https://github.com/PaddlePaddle/Paddle/pull/37287))
-
-
-#### **Other**
-
-- Fix the clip_extra issue when saving models for dynamic graph quantization training. ([#38323](https://github.com/PaddlePaddle/Paddle/pull/38323))
-
-- Fix an issue with abs_max scale initialization for dynamic graph quantization training. ([#39307](https://github.com/PaddlePaddle/Paddle/pull/39307))
-
-- Fix an issue of exceptions in saving model in dynamic graph quantization training. ([#38102](https://github.com/PaddlePaddle/Paddle/pull/38102), [#38012](https://github.com/PaddlePaddle/Paddle/pull/38012))
-
-- Fix the offline quantization flatten op output error. ([#37722](https://github.com/PaddlePaddle/Paddle/pull/37722))
-
-- Fix the non-matching dimension bug in case of inverse quantization matmul op. ([#36982](https://github.com/PaddlePaddle/Paddle/pull/36982))
-
-- Fix the bug of adding quantization op when quantizing matmul_v2 without weights. ([#36593](https://github.com/PaddlePaddle/Paddle/pull/36593))
-
-- Fix the error of saving the quant_axis attribute in the conv op channel-wise quantization when saving the models. ([#39054](https://github.com/PaddlePaddle/Paddle/pull/39054))
-
-- Fix the slow training of channel-wise quantization. ([#40772](https://github.com/PaddlePaddle/Paddle/pull/40772))
-
-- Fix the bug in quantization training where dividing by a tensor (initialized as 0) leads to nan. ([#36762](https://github.com/PaddlePaddle/Paddle/pull/36762))
-
-- Fix incorrect settings of amp_level for mixed precision in multi-threaded scenarios. ([#39198](https://github.com/PaddlePaddle/Paddle/pull/39198))
-
-- Fix an issue where mixed precision is not set correctly for PyLayer and Recompute when mixed precision training is used with them. ([#39950](https://github.com/PaddlePaddle/Paddle/pull/39950), [#40042](https://github.com/PaddlePaddle/Paddle/pull/40042))
-
-- Fix an issue where `D_GLIBCXX_USE_CXX11_ABI` does not take effect when compiling custom operators under Mac. ([#37878](https://github.com/PaddlePaddle/Paddle/pull/37878))
-
-- Fix the inconsistent dynamic and static behaviors of the initializer-related APIs when block=None. ([#37827](https://github.com/PaddlePaddle/Paddle/pull/37827))
-
-- Fix the bug in Python 3.6 where there is no fluid module. ([#35862](https://github.com/PaddlePaddle/Paddle/pull/35862))
-
-- Fix the bug where optimizer `paddle.optimizer.AdamW` incorrectly calls adam op. ([#36028](https://github.com/PaddlePaddle/Paddle/pull/36028))
-
-- Fix a logic error when the `paddle.optimizer.Momentum` optimizer parameter `regularizer` property is None under the multi tensor policy. ([#38344](https://github.com/PaddlePaddle/Paddle/pull/38344))
-
-- Fix the bug that the `paddle.optimizer.Momentum` and `paddle.optimizer.Adam` optimizers modify the `multi_precision` property under the multi tensor policy. ([#38991](https://github.com/PaddlePaddle/Paddle/pull/38991))
-
-- Fix the code compilation error when using final-state API amp in combination with optional Tensor. ([#40980](https://github.com/PaddlePaddle/Paddle/pull/40980))
-
-- Fix the bug where paddle+lite+xpu prediction library would report an error when calling lite CPU prediction, and fix the bug where paddle+lite(without NNAdapter) would report an error when compiling. ([#37449](https://github.com/PaddlePaddle/Paddle/pull/37449))
-
-- Fix the bug in Debug compile mode where LoDTensorArray crashes due to inconsistent Pybind11 bindings. ([#37954](https://github.com/PaddlePaddle/Paddle/pull/37954))
-
-- Fix the bug that prevents correct construction of a Tensor in the extreme case where the shape parameter is a list mixing Tensors and ints. ([#38284](https://github.com/PaddlePaddle/Paddle/pull/38284))
-
-- Fix a compatibility issue with the `paddle.optimizer.AdamW` API. ([#37905](https://github.com/PaddlePaddle/Paddle/pull/37905))
-
-- Fix the bug in _InstanceNormBase where the return value of extra_repr is incorrect. ([#38537](https://github.com/PaddlePaddle/Paddle/pull/38537))
-
-- Fix the bug that Paddle Inference lacks the symbol `paddle::distributed::TensorTable` when -DWITH_DISTRIBUTED is used. ([#41128](https://github.com/PaddlePaddle/Paddle/pull/41128))
-
-- Fix the bug where matmul_v2 op reports an error when there is a 0 value in the shape. ([#35791](https://github.com/PaddlePaddle/Paddle/pull/35791))
-
-- Fix the repeated printing of the hint message about no-gradient inputs in recompute under dynamic graphs. The message is now printed only once as a warning. ([#38293](https://github.com/PaddlePaddle/Paddle/pull/38293))
-
-- Fix the bug in the gelu op that causes low accuracy on the validation set in later training epochs of vision models. ([#38450](https://github.com/PaddlePaddle/Paddle/pull/38450))
-
-- Fix adamw op error in numerical computation. ([#37746](https://github.com/PaddlePaddle/Paddle/pull/37746))
-
-- Add the parameters in the sparse_momentum `_C_ops` interface. ([#39969](https://github.com/PaddlePaddle/Paddle/pull/39969))
-
-- Fix the bug where there is no `distributed` module in Python 3.6. ([#35848](https://github.com/PaddlePaddle/Paddle/pull/35848))
-
-- Fix the eigh unit test data initialization problem. ([#39568](https://github.com/PaddlePaddle/Paddle/pull/39568))
-
-- Fix the eigvalsh unit test data initialization problem. ([#39841](https://github.com/PaddlePaddle/Paddle/pull/39841))
-
-- Fix the bug where segment op does not work properly on V100 due to excessive register usage. ([#38113](https://github.com/PaddlePaddle/Paddle/pull/38113))
-
-- Fix the incorrectly set dimension in conv-related op sparsification. ([#36054](https://github.com/PaddlePaddle/Paddle/pull/36054))
-
-- Provide the alias `paddle.static.sparsity` for the static graph-related functions of Automatic SParsity training. ([#36525](https://github.com/PaddlePaddle/Paddle/pull/36525))
-
-- Fix the bug where the result of integer division in divide op is still an integer. ([#40890](https://github.com/PaddlePaddle/Paddle/pull/40890))
-
-- Fix the crash bug of `paddle.multiplex` when the input Tensor value is 0. ([#34972](https://github.com/PaddlePaddle/Paddle/pull/34972))
-
-- Fix the abnormal speed when the `reduction` parameter is set in `paddle.nn.functional.kl_div`. ([#37283](https://github.com/PaddlePaddle/Paddle/pull/37283))
-
-- Fix the data source unsorted bug in loading the Cifar dataset. ([#37272](https://github.com/PaddlePaddle/Paddle/pull/37272))
-
-- Fix the conversion of loss from uint16 to float in the ProgressBar class. ([#39231](https://github.com/PaddlePaddle/Paddle/pull/39231))
-
-- Fix the ShareBufferWith shared data type problem. ([#37464](https://github.com/PaddlePaddle/Paddle/pull/37464), [#37247](https://github.com/PaddlePaddle/Paddle/pull/37247))
-
-- Fix the performance issue when `paddle.io.DataLoader` uses IterableDataset and num_workers>0. ([#40541](https://github.com/PaddlePaddle/Paddle/pull/40541))
-
-- Fix the bug where `paddle.vision.ops.yolo_loss` returns incomplete values in dynamic graph mode. ([#40185](https://github.com/PaddlePaddle/Paddle/pull/40185))
-
-- Remove the restriction that the input parameter dataset of `paddle.io.BatchSampler` needs to be the `paddle.io.Dataset` type, to expand the support for user-defined datasets. ([#40184](https://github.com/PaddlePaddle/Paddle/pull/40184))
-
-- Fix the bug of `paddle.summary` reporting that op_flops does not exist. ([#36489](https://github.com/PaddlePaddle/Paddle/pull/36489))
-
-- Fix the formula error of lars_momentum op when lars_weight_decay=0. ([#40892](https://github.com/PaddlePaddle/Paddle/pull/40892))
-
-- Fix the bug that optimizer-offload cannot save persistable vars. ([#36433](https://github.com/PaddlePaddle/Paddle/pull/36433))
-
-- Fix an issue where optimizer-offload does not support adamw op type. ([#36432](https://github.com/PaddlePaddle/Paddle/pull/36432))
-
-- Fix an issue where enable_program_desc_tracing_data in Tracer is not safe in multi-threaded scenarios. ([#39776](https://github.com/PaddlePaddle/Paddle/pull/39776))
-
-- Fix an issue where the model file size is not initialized when the model is read. ([#40518](https://github.com/PaddlePaddle/Paddle/pull/40518))
-
-- Fix the logic bug of the Expand op. When the dimension of the input Tensor X is smaller than the shape to be expanded, it may result in the incorrect Out.Shape. ([#38677](https://github.com/PaddlePaddle/Paddle/pull/38677))
-
-- Fix the dynamic to static transcription error when the Expand_As op takes only y.shape without Y variable entered. ([#38677](https://github.com/PaddlePaddle/Paddle/pull/38677))
-
-- Fix the logic error when Expand_As op computes the output shape. ([#38677](https://github.com/PaddlePaddle/Paddle/pull/38677))
-
-- Fix the bug that the variables of the `core.VarDesc.VarType.STRINGS` type report error when getting the `lod_level` property and setting its `lod_level` to None. ([#39077](https://github.com/PaddlePaddle/Paddle/pull/39077))
-
-- Fix an issue where the framework function `PyLayer` does not support different dtypes. ([#37974](https://github.com/PaddlePaddle/Paddle/pull/37974))
-
-- Fix the bug of division by zero of the learning rate decay API `paddle.optimizer.lr.PolynomialDecay`. ([#38782](https://github.com/PaddlePaddle/Paddle/pull/38782))
-
-- Fix the issue where some logs remained after calling the DisableGlogInfo() interface. ([#36356](https://github.com/PaddlePaddle/Paddle/pull/36356))
-
-- Fix an error in the backward pass of multi-layer RNN (when dropout is set to 0) in CPU training with the SimpleRNN, GRU, and LSTM APIs. ([#37080](https://github.com/PaddlePaddle/Paddle/pull/37080))
-
-- Add cache for fft on the backend of cufft and hipfft. ([#36646](https://github.com/PaddlePaddle/Paddle/pull/36646))
-
-- Enable the shifts parameter of `paddle.roll` to accept a Tensor. ([#36727](https://github.com/PaddlePaddle/Paddle/pull/36727))
-
-- Add onemkl to fft as an optional computation backend. ([#36414](https://github.com/PaddlePaddle/Paddle/pull/36414))
-
-- Fix the precision bug in the bfloat16 type in the matmul_v2 and elementwise_div ops. ([#42479](https://github.com/PaddlePaddle/Paddle/pull/42479))
-
-- Fix a possible error in the next step caused by LoDTensorArray clearing only the internal Tensor and not clearing the Array during device memory recycling. ([#42398](https://github.com/PaddlePaddle/Paddle/pull/42398))
-
-
-## **4. Deployment Direction (Paddle Inference)**
-
-### **(1) New features**
-
-#### **New APIs**
-
-- Add the Java API so that Java developers can implement high performance inference on the server and in the cloud through a simple and flexible interface. ([#37162](https://github.com/PaddlePaddle/Paddle/pull/37162))
-
-- Add `GetTrtCompileVersion` and `GetTrtRuntimeVersion` interfaces for getting TensorRT version information. ([#36429](https://github.com/PaddlePaddle/Paddle/pull/36429))
-
-- Add the `ShareExternalData` interface to avoid memory copy of input data during inference. ([#39809](https://github.com/PaddlePaddle/Paddle/pull/39809))
-
-
-#### **New functions**
-
-- Add ONNX Runtime backend support. Currently it supports only CPU in the integrated version. ([#39988](https://github.com/PaddlePaddle/Paddle/pull/39988), [#40561](https://github.com/PaddlePaddle/Paddle/pull/40561))
-
-- Add support for Ascend 310 inference based on the Paddle Lite subgraph approach. ([#35226](https://github.com/PaddlePaddle/Paddle/pull/35226))
-
-- Add the native GPU FP16 inference. ([#40531](https://github.com/PaddlePaddle/Paddle/pull/40531))
-
-- For the switch_ir_debug interface, add the dump model function. ([#36581](https://github.com/PaddlePaddle/Paddle/pull/36581))
-
-- Add the configuration interface for TensorRT config: `void UpdateConfigInterleaved(paddle_infer::Config* c, bool with_interleaved)` for special data layout in int8 quantization inference. ([#38884](https://github.com/PaddlePaddle/Paddle/pull/38884))
-
-- Add TensorRT inspector output information to the log. It is valid only for TensorRT 8.2 or later. ([#38362](https://github.com/PaddlePaddle/Paddle/pull/38362), [#38200](https://github.com/PaddlePaddle/Paddle/pull/38200))
-
-- Add the support of the TensorRT ASP sparse inference. ([#36413](https://github.com/PaddlePaddle/Paddle/pull/36413))
-
-
-### **(2) Underlying optimization**
-
-#### **CPU performance optimization**
-
-- Optimize the caching mechanism of MKLDNN. ([#38336](https://github.com/PaddlePaddle/Paddle/pull/38336), [#36980](https://github.com/PaddlePaddle/Paddle/pull/36980), [#36695](https://github.com/PaddlePaddle/Paddle/pull/36695))
-
-- Add matmul_scale_fuse pass. ([#37962](https://github.com/PaddlePaddle/Paddle/pull/37962))
-
-- Add MKLDNN reshape_transpose_matmul_v2_mkldnn_fuse_pass. ([#37847](https://github.com/PaddlePaddle/Paddle/pull/37847), [#40948](https://github.com/PaddlePaddle/Paddle/pull/40948))
-
-- Add MKLDNN conv_hard_sigmoid_mkldnn_fuse_pass. ([#36869](https://github.com/PaddlePaddle/Paddle/pull/36869))
-
-- Add MKLDNN matmul_v2_transpose_reshape_fuse_pass. ([#36481](https://github.com/PaddlePaddle/Paddle/pull/36481))
-
-- Add MKLDNN softplus_activation_mkldnn_fuse_pass. ([#36657](https://github.com/PaddlePaddle/Paddle/pull/36657))
-
-- Add MKLDNN elt_act_mkldnn_fuse_pass. ([#36541](https://github.com/PaddlePaddle/Paddle/pull/36541))
-
-- Add MKLDNN mish operator and conv_mish_mkldnn_fuse_pass. ([#38623](https://github.com/PaddlePaddle/Paddle/pull/38623))
-
-
-#### **GPU performance optimization**
-
-- Change the default video memory allocation policy for inference from `naive_best_fit` to `auto_growth`, to solve the problem of some models filling up the GPU video memory. ([#41491](https://github.com/PaddlePaddle/Paddle/pull/41491))
-
-- Support gelu and FC+gelu ops using TensorRT inference. ([#38399](https://github.com/PaddlePaddle/Paddle/pull/38399))
-
-- Support `deformable_conv` inference using TensorRT under static shape. ([#36612](https://github.com/PaddlePaddle/Paddle/pull/36612) [#36850](https://github.com/PaddlePaddle/Paddle/pull/36850) [#37345](https://github.com/PaddlePaddle/Paddle/pull/37345))
-
-- Support nearest_interp_v2 op using TensorRT inference. ([#34126](https://github.com/PaddlePaddle/Paddle/pull/34126))
-
-- Add `yolo_box` TensorRT plugin to support input parameters `iou_aware` and `iou_aware_factor` so that the IoU computed by inference is used as a factor for confidence. ([#34128](https://github.com/PaddlePaddle/Paddle/pull/34128))
-
-- Support `elementwise_sub` and `elementwise_div` calling for TensorRT inference. ([#40806](https://github.com/PaddlePaddle/Paddle/pull/40806) [#41253](https://github.com/PaddlePaddle/Paddle/pull/41253))
-
-- Support `multiclass_nms3` using TensorRT inference. ([#41181](https://github.com/PaddlePaddle/Paddle/pull/41181) [#41344](https://github.com/PaddlePaddle/Paddle/pull/41344))
-
-- Support flatten_contiguous_range op using TensorRT inference. ([#38922](https://github.com/PaddlePaddle/Paddle/pull/38922))
-
-- Support for `pool2d` attribute `padding` using TensorRT inference when dimension is 4, and `global_pooling` and `ceil_mode` are True. ([#39545](https://github.com/PaddlePaddle/Paddle/pull/39545))
-
-- Support batch_norm and elementwise_add using TensorRT inference when dimension is 5. ([#36446](https://github.com/PaddlePaddle/Paddle/pull/36446))
-
-- Add pool3d to use TensorRT inference. ([#36545](https://github.com/PaddlePaddle/Paddle/pull/36545), [#36783](https://github.com/PaddlePaddle/Paddle/pull/36783))
-
-- Add the `reduce` int32 and float types to use TensorRT inference. Add `reduce_mean` GPU operator int32 and int64 registration. ([#39088](https://github.com/PaddlePaddle/Paddle/pull/39088))
-
-- Modify MatmulV2ToMul pass. Modify the qualifier (broadcast is not supported) and the op_teller mapping condition. ([#36652](https://github.com/PaddlePaddle/Paddle/pull/36652))
-
-- Add the support for TensorRT plugin interface AddPluginV2IOExt. ([#36493](https://github.com/PaddlePaddle/Paddle/pull/36493))
-
-- Add the aligned attribute in roi_align op and support for TensorRT inference. ([#38905](https://github.com/PaddlePaddle/Paddle/pull/38905))
-
-- Add the support for TensorRT inference with concat attribute `axis = -1`. ([#39096](https://github.com/PaddlePaddle/Paddle/pull/39096))
-
-- Add TensorRT plugin: preln_emb_eltwise_layernorm and preln_skip_layernorm ops, for ERNIE-like model performance optimization. ([#39570](https://github.com/PaddlePaddle/Paddle/pull/39570))
-
-- Add TensorRT fuse pass: preln_embedding_eltwise_layernorm_fuse_pass, preln_skip_layernorm_fuse_pass, for ERNIE-like model performance optimization. ([#39508](https://github.com/PaddlePaddle/Paddle/pull/39508))
-
-- Split matmul fusion-related passes based on different backends (GPU, CPU, TensorRT), to support transpose function for FC weights. ([#39369](https://github.com/PaddlePaddle/Paddle/pull/39369))
-
-- Add TensorRT support for the roll, strided_slice, and slice ops under dynamic shapes. ([#41913](https://github.com/PaddlePaddle/Paddle/pull/41913), [#41573](https://github.com/PaddlePaddle/Paddle/pull/41573), [#41467](https://github.com/PaddlePaddle/Paddle/pull/41467))
-
-- Add div op support for TensorRT. ([#41243](https://github.com/PaddlePaddle/Paddle/pull/41243))
-
-- Quantization support
-
- - For the `PostTrainingQuantization` API, add the support for `paddle.io.DataLoader` object or `Python Generator` input. ([#38686](https://github.com/PaddlePaddle/Paddle/pull/38686))
-
- - Support the interleaved data layout in ERNIE fully quantized model inference. ([#39424](https://github.com/PaddlePaddle/Paddle/pull/39424))
-
- - Support for PaddleSlim new quantile model format inference. ([#41049](https://github.com/PaddlePaddle/Paddle/pull/41049))
-
- - Add matmul int8 quantization inference op converter and plugin. ([#37285](https://github.com/PaddlePaddle/Paddle/pull/37285))
-
- - Add pass to determine if all ops in the model can support int8 quantization. ([#36042](https://github.com/PaddlePaddle/Paddle/pull/36042))
-
- - Support quantization inference for the FC part of the multihead attention of the non-variable-length branch. ([#39660](https://github.com/PaddlePaddle/Paddle/pull/39660))
-
-
-#### **Ascend NPU Related Features**
-
- - Refactor shape operator forward computation logic to support execution on NPU. ([#39613](https://github.com/PaddlePaddle/Paddle/pull/39613))
-
- - Refactor reshape operator forward computation logic to support ShapeTensor input. ([#38748](https://github.com/PaddlePaddle/Paddle/pull/38748))
-
- - Unify the precision type when loading model weights. ([#39160](https://github.com/PaddlePaddle/Paddle/pull/39160))
-
-
-### **(3) Bug fixing**
-
-#### **Framework and API fixing**
-
-- Fix the bug of model clipping when saving static graphs. ([#37579](https://github.com/PaddlePaddle/Paddle/pull/37579))
-
-- For the C API, add the wrapper PD_Cstr for strings, and provide construction and destruction methods so that users do not need to call the C runtime library directly to destruct strings. ([#38667](https://github.com/PaddlePaddle/Paddle/pull/38667))
-
-- Fix the logic bug with memory reuse at prediction time. ([#37324](https://github.com/PaddlePaddle/Paddle/pull/37324))
-
-- Fix memory reuse error reporting in multi-threading. ([#37894](https://github.com/PaddlePaddle/Paddle/pull/37894))
-
-- Allow passing empty strings for inference when no weight file is available. ([#38579](https://github.com/PaddlePaddle/Paddle/pull/38579))
-
-- Fix an issue of clone not being supported when TensorRT dynamic shape is enabled. ([#38520](https://github.com/PaddlePaddle/Paddle/pull/38520))
-
-- Fix multi-threaded clone error after TensorRT dynamic shape is enabled. ([#40067](https://github.com/PaddlePaddle/Paddle/pull/40067))
-
-- Fix a TensorRT engine destructing issue. ([#35842](https://github.com/PaddlePaddle/Paddle/pull/35842), [#35938](https://github.com/PaddlePaddle/Paddle/pull/35938))
-
-- For the lite xpu interface, fix an issue where the xpu card cannot be selected. ([#36610](https://github.com/PaddlePaddle/Paddle/pull/36610))
-
-- Add a file existence check to the interface that automatically generates TensorRT dynamic shape parameters. ([#36628](https://github.com/PaddlePaddle/Paddle/pull/36628))
-
-- Fix the bug that the MKLDNN does not support conv3d. ([#42055](https://github.com/PaddlePaddle/Paddle/pull/42055))
-
-#### **Backend Capability Fixing**
-
-- Fix the default cuDNN algorithm selection configuration for inference so that it uses a non-deterministic policy. ([#41491](https://github.com/PaddlePaddle/Paddle/pull/41491))
-
-- Fix a resource release handling error in the TensorRT plugin for the deformable_conv op. ([#38374](https://github.com/PaddlePaddle/Paddle/pull/38374))
-
-- Fix a serialization error in the TensorRT plugin for deformable_conv op. ([#38057](https://github.com/PaddlePaddle/Paddle/pull/38057))
-
-- Adapt the new refactor engine and serialization API of TensorRT 8.0. ([#36769](https://github.com/PaddlePaddle/Paddle/pull/36769))
-
-- Fix the bug that the Flatten2MatmulFusePass, Squeeze2MatmulFusePass, and Reshape2MatmulFusePass do not take effect. ([#37644](https://github.com/PaddlePaddle/Paddle/pull/37644))
-
-- Fix the bug with TensorRT input data reporting errors. ([#37427](https://github.com/PaddlePaddle/Paddle/pull/37427))
-
-- Add error message when input dimension is wrong. ([#38962](https://github.com/PaddlePaddle/Paddle/pull/38962))
-
-- Fix the bug with EmbEltwiseLayernorm output type error. ([#40015](https://github.com/PaddlePaddle/Paddle/pull/40015))
-
-- Remove conv_affine_channel_fuse_pass and the corresponding unit test. ([#39817](https://github.com/PaddlePaddle/Paddle/pull/39817))
-
-- Fix an issue where the adaptive_pool2d pass incorrectly replaces the pool attribute. ([#39600](https://github.com/PaddlePaddle/Paddle/pull/39600))
-
-- Fix the bug that shuffle_channel_detect_pass incorrectly generates shuffle_channel op. ([#39242](https://github.com/PaddlePaddle/Paddle/pull/39242))
-
-- Fix transpose parameter error. ([#39006](https://github.com/PaddlePaddle/Paddle/pull/39006))
-
-- Fix the crash bug when nearest_interp_v2 input scale dimension is less than 1. ([#38725](https://github.com/PaddlePaddle/Paddle/pull/38725))
-
-- Fix the bug that the prelu does not support one-dimensional input in dynamic shape. ([#39389](https://github.com/PaddlePaddle/Paddle/pull/39389))
-
-- Fix the bug in the kernel function of slice's special_slice_plugin. ([#39875](https://github.com/PaddlePaddle/Paddle/pull/39875))
-
-- Temporarily disable int8 branch under skip_layernorm variable length to prevent accuracy degradation. ([#39991](https://github.com/PaddlePaddle/Paddle/pull/39991))
-
-- Fix some bugs regarding support for preln_ernie models. ([#39733](https://github.com/PaddlePaddle/Paddle/pull/39733))
-
-- Fix the bug that slice may exceed the thread limit in ERNIE. Fix the bug that special_slice is incorrectly triggered. ([#39096](https://github.com/PaddlePaddle/Paddle/pull/39096))
-
-- Fix the bug that the elementwise does not support broadcast when the dimension is the same. ([#37908](https://github.com/PaddlePaddle/Paddle/pull/37908))
-
-- Fix the result difference between the TensorRT layer and the native op for nearest_interp when align_corners is True, which was caused by different underlying implementations. ([#37525](https://github.com/PaddlePaddle/Paddle/pull/37525))
-
-- Fix qkv_plugin: Kernel function computation error. ([#37096](https://github.com/PaddlePaddle/Paddle/pull/37096))
-
-- Fix the bug with inference pass for dynamic quantization. ([#35879](https://github.com/PaddlePaddle/Paddle/pull/35879))
-
-- Reuse directly when Tensor requests less memory than the allocated size. ([#37880](https://github.com/PaddlePaddle/Paddle/pull/37880))
-
-- Fix the hang bug when ERNIE fixed-length model is enabled with TensorRT. ([#37839](https://github.com/PaddlePaddle/Paddle/pull/37839))
-
-- Fix the crash bug when TensorRT int8 lacks dynamic range information. ([#36900](https://github.com/PaddlePaddle/Paddle/pull/36900))
-
-- Fix the bug with slice deserialization code. ([#36588](https://github.com/PaddlePaddle/Paddle/pull/36588))
-
-- Fix yolo box calculation formula error. ([#36240](https://github.com/PaddlePaddle/Paddle/pull/36240))
-
-- Fix the crash bug when the earlier version model uses a later version of roi_align. ([#38788](https://github.com/PaddlePaddle/Paddle/pull/38788)) External Developers
-
-- Fix the bug of a large performance difference of softmax between python and C++. ([#37130](https://github.com/PaddlePaddle/Paddle/pull/37130))
-
-- Fix matmul inference failure on static shape 2-dimensional input and dynamic shape 3-dimensional input. ([#36849](https://github.com/PaddlePaddle/Paddle/pull/36849))
-
-- Fix reshape_transpose_matmul_mkldnn_fuse_pass mishandling of shapes. ([#36731](https://github.com/PaddlePaddle/Paddle/pull/36731))
-
-- Fix an issue where TensorRT gets 4 dimensions when the input is 2 dimensions. ([#36614](https://github.com/PaddlePaddle/Paddle/pull/36614))
-
-- Fix the error reported by the interpolate_v2 MKLDNN operator when the scale attribute is null. ([#36623](https://github.com/PaddlePaddle/Paddle/pull/36623))
-
-- Fix poor performance of the recurrent operator in multi-threaded scenarios. ([#36052](https://github.com/PaddlePaddle/Paddle/pull/36052))
-
-- Remove restrictions of relu, sigmoid, tanh, relu6, batch_norm, clip, concat, gelu, hard_sigmoid, prelu, softmax, split, and swish on TensorRT 2-dimensional inputs. ([#37097](https://github.com/PaddlePaddle/Paddle/pull/37097))
-
-- Fix reshape op to use TensorRT inference. ([#41090](https://github.com/PaddlePaddle/Paddle/pull/41090))
-
-- Fix matmul-related passes so that they are compatible with matmul_v2. ([#36424](https://github.com/PaddlePaddle/Paddle/pull/36424))
-
-- Support VALID and SAME attributes in the padding method of the conv2d operator when TensorRT is enabled. ([#38999](https://github.com/PaddlePaddle/Paddle/pull/38999))
-
-- Fix MKLDNN multi-input operator quantization problem. ([#39593](https://github.com/PaddlePaddle/Paddle/pull/39593), [#39346](https://github.com/PaddlePaddle/Paddle/pull/39346), [#40717](https://github.com/PaddlePaddle/Paddle/pull/40717))
-
-- Fix scale error of conv+activation in MKLDNN quantization scenarios. ([#38331](https://github.com/PaddlePaddle/Paddle/pull/38331))
-
-- Fix the bug in MKLDNN quantization without parameters where the quantization of subsequent operators is handled differently. ([#39342](https://github.com/PaddlePaddle/Paddle/pull/39342))
-
-- Fix a data type related issue in MKLDNN cpu_bfloat16_placement_pass. ([#38702](https://github.com/PaddlePaddle/Paddle/pull/38702))
-
-- Fix a split operator execution issue in MKLDNN bfloat16 inference. ([#39548](https://github.com/PaddlePaddle/Paddle/pull/39548))
-
-- Fix the bug with MKLDNN matmul_v2 operator not supporting 6 dimensions. ([#36342](https://github.com/PaddlePaddle/Paddle/pull/36342), [#38665](https://github.com/PaddlePaddle/Paddle/pull/38665))
-
-- Fix MKLDNN DeviceContext error in MKLDNN matmul_v2_transpose_reshape. ([#38554](https://github.com/PaddlePaddle/Paddle/pull/38554))
-
-- Fix incorrectly calculated results for segmentation models in MKLDNN inference scenarios. ([#37310](https://github.com/PaddlePaddle/Paddle/pull/37310))
-
-- Fix MKLDNN bfloat16 placement operator list and add the missing operator. ([#36291](https://github.com/PaddlePaddle/Paddle/pull/36291))
-
-- Fix the format bug of MKLDNN operators, including: FC, conv_transpose, 6-dimensional Tensor error reporting, and wrong output format of conv to NHWC input. ([#38890](https://github.com/PaddlePaddle/Paddle/pull/38890), [#37344](https://github.com/PaddlePaddle/Paddle/pull/37344), [#37175](https://github.com/PaddlePaddle/Paddle/pull/37175), [#38553](https://github.com/PaddlePaddle/Paddle/pull/38553), [#40049](https://github.com/PaddlePaddle/Paddle/pull/40049), [#39097](https://github.com/PaddlePaddle/Paddle/pull/39097))
-
-- Fix errors in MKLDNN multi-threaded inference scenarios caused by the cache mechanism. ([#36290](https://github.com/PaddlePaddle/Paddle/pull/36290), [#35884](https://github.com/PaddlePaddle/Paddle/pull/35884))
-
-- Fix MKLDNN quantization model accuracy anomaly caused by matmul and FC. ([#38023](https://github.com/PaddlePaddle/Paddle/pull/38023), [#37618](https://github.com/PaddlePaddle/Paddle/pull/37618))
-
-- Fix the abnormal quantization model accuracy issue in MKLDNN quantization conversion scripts caused by missing passes. ([#37619](https://github.com/PaddlePaddle/Paddle/pull/37619), [#40542](https://github.com/PaddlePaddle/Paddle/pull/40542),[#38912](https://github.com/PaddlePaddle/Paddle/pull/38912))
-
-- Fix the crash bug in MKLDNN enabling volume op due to data type mismatch. ([#38133](https://github.com/PaddlePaddle/Paddle/pull/38133))
-
-- Fix an issue where some MKLDNN ops need to change back to the original layout after modifying the layout. ([#39422](https://github.com/PaddlePaddle/Paddle/pull/39422))
-
-- Fix Python API errors caused by a conflict with the Ascend software stack, which occurred because the GIL was not released in the Ascend 910 inference scenario. ([#38605](https://github.com/PaddlePaddle/Paddle/pull/38605))
-
-
-## **5. Environment Adaptation**
-
-### **Compile and Install**
-
-- From version 2.3.0, PaddlePaddle has adjusted and upgraded the types of GPU architectures supported by the framework. (For more information, please refer to: [GPU architectures supported by PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/zh/2.3rc/install/Tables.html#gpu))
-
-
-Notes:
-
-- PIP source installation means downloading the installation package and its dependency libraries from the official PIP website by running `pip install paddlepaddle` or `pip install paddlepaddle-gpu`. Compared with the BOS source, it supports fewer architecture types, has a lighter installation package, and provides an installation package for only one CUDA version.
-
- - Prior to version 2.3, the PIP source installer (CUDA10.2) supports the following GPU architectures: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, and 7.5.
-
- - From version 2.3 onward, the PIP source installer (CUDA11.0) supports the following GPU architectures: 6.0, 6.1, 7.0, 7.5, and 8.0.
-
-- BOS source installation means downloading the installation package and its dependency libraries from the official PaddlePaddle website. The download source is hosted in China and is much faster. Compared with the PIP source, it supports more GPU architectures and provides installation packages for multiple CUDA versions.
-
- - Prior to version 2.3, the BOS source installer on the PaddlePaddle website supports the following GPU architectures:
- - CUDA10: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5;
+#### New Features
- - CUDA11: 5.2, 6.0, 6.1, 7.0, 7.5, 8.0.
+- Add the support for Hygon DCU K100. [#63535](https://github.com/PaddlePaddle/Paddle/pull/63535)
+- Support the complex64/128 data type and fusion operators such as fused_bias_residual_layernorm, fused_bias_dropout_residual_layer_norm, and rms_norm. [#63217](https://github.com/PaddlePaddle/Paddle/pull/63217)
- - From version 2.3 onward, the BOS source installer on the PaddlePaddle website supports the following GPU architectures:
+#### Bug Fixing
- - CUDA10: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5;
+- Fix compilation error issues in DTK and ROCM version upgrades. [#62832](https://github.com/PaddlePaddle/Paddle/pull/62832),[#62931](https://github.com/PaddlePaddle/Paddle/pull/62931),[#61872](https://github.com/PaddlePaddle/Paddle/pull/61872),[#63738](https://github.com/PaddlePaddle/Paddle/pull/63738)
- - CUDA11: 3.5, 5.0, 6.0, 6.1, 7.0, 7.5, 8.0.
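
The PIP source architecture lists in the notes above can be expressed as a small lookup table. The following is only a sketch: the names `PIP_ARCHS`, `pip_package_supports`, and `dropped_archs` are hypothetical helpers written for illustration and are not part of PaddlePaddle; the architecture values are copied from the notes.

```python
# Compute capabilities supported by the PIP source installer, as listed in the notes.
# The dictionary keys are illustrative labels, not official release identifiers.
PIP_ARCHS = {
    # Prior to version 2.3 (CUDA 10.2 package)
    "pre_2.3": frozenset({"3.5", "5.0", "5.2", "6.0", "6.1", "7.0", "7.5"}),
    # Version 2.3 and later (CUDA 11.0 package)
    "2.3_plus": frozenset({"6.0", "6.1", "7.0", "7.5", "8.0"}),
}


def pip_package_supports(compute_capability, release):
    """Return True if the PIP package for `release` lists this compute capability."""
    return compute_capability in PIP_ARCHS[release]


def dropped_archs():
    """Architectures listed before 2.3 but absent from the 2.3+ PIP package."""
    return sorted(PIP_ARCHS["pre_2.3"] - PIP_ARCHS["2.3_plus"])


if __name__ == "__main__":
    print(pip_package_supports("8.0", "2.3_plus"))
    print(dropped_archs())
```

A user with a GPU whose architecture is missing from the PIP list would, according to the notes, look at the BOS source instead, which covers more architectures.
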
+## Environment Updates
-- Support Python 3.10. Fix compilation bugs caused by some PythonC API changes on Windows. ([#41180](https://github.com/PaddlePaddle/Paddle/pull/42180))
+In this PaddlePaddle version, we completed the release and synchronized update of the basic dependency libraries and removed old dependency libraries that are no longer updated. We completed a number of optimizations to improve compilation efficiency and compatibility, and improved the CI pipeline monitoring function to enhance the user installation experience. We also fixed several known compilation problems, improved Paddle's compilation system, and added some new features. Together, these optimizations further improve the compilation and installation experience of the PaddlePaddle framework for developers.
-- The Windows platform supports the compilation through Visual Studio 2019. ([#38719](https://github.com/PaddlePaddle/Paddle/pull/38719))
+### New Support
-- Eliminate various warnings when compiling on the Windows platform. ([#38034](https://github.com/PaddlePaddle/Paddle/pull/38034), [#37890](https://github.com/PaddlePaddle/Paddle/pull/37890), [#37442](https://github.com/PaddlePaddle/Paddle/pull/37442), [#37439](https://github.com/PaddlePaddle/Paddle/pull/37439), [#36857](https://github.com/PaddlePaddle/Paddle/pull/36857))
+- Support installing Paddle without relying on a local CUDA and cuDNN installation, thus improving the user installation experience. [#60841](https://github.com/PaddlePaddle/Paddle/pull/60841),[#61973](https://github.com/PaddlePaddle/Paddle/pull/61973),[#61862](https://github.com/PaddlePaddle/Paddle/pull/61862),[#61235](https://github.com/PaddlePaddle/Paddle/pull/61235),[#61209](https://github.com/PaddlePaddle/Paddle/pull/61209),[#61653](https://github.com/PaddlePaddle/Paddle/pull/61653),[#64083](https://github.com/PaddlePaddle/Paddle/pull/64083)
+- Fully support CUDA 12.3. Retire CUDA 10.2. [#63356](https://github.com/PaddlePaddle/Paddle/pull/63356),[#60299](https://github.com/PaddlePaddle/Paddle/pull/60299),[#64171](https://github.com/PaddlePaddle/Paddle/pull/64171),[#62189](https://github.com/PaddlePaddle/Paddle/pull/62189),[#63392](https://github.com/PaddlePaddle/Paddle/pull/63392),[#64228](https://github.com/PaddlePaddle/Paddle/pull/64228),[#62498](https://github.com/PaddlePaddle/Paddle/pull/62498),[#64298](https://github.com/PaddlePaddle/Paddle/pull/64298)
+- Fully support Python 3.12, bringing more powerful language features and performance optimizations. Retire Python 3.7. [#59875](https://github.com/PaddlePaddle/Paddle/pull/59875),[#59877](https://github.com/PaddlePaddle/Paddle/pull/59877),[#59876](https://github.com/PaddlePaddle/Paddle/pull/59876)
+- Upgrade of other paddle-dependent third-party libraries: [#63741](https://github.com/PaddlePaddle/Paddle/pull/63741),[#64447](https://github.com/PaddlePaddle/Paddle/pull/64447),[#60195](https://github.com/PaddlePaddle/Paddle/pull/60195),[#60110](https://github.com/PaddlePaddle/Paddle/pull/60110),[#61509](https://github.com/PaddlePaddle/Paddle/pull/61509)
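
As a rough illustration of what the Python changes above mean for an environment check, the sketch below classifies an interpreter version. It assumes that retiring Python 3.7 makes 3.8 the lowest supported minor version, which the notes do not state explicitly; `python_status` and both constants are hypothetical names, not a PaddlePaddle API.

```python
import sys

# Assumption: with Python 3.7 retired, 3.8 is treated as the lowest supported version.
MIN_SUPPORTED = (3, 8)
# Python 3.12 is the newest version these notes name as supported.
NEWEST_NAMED = (3, 12)


def python_status(version_info=None):
    """Classify a (major, minor, ...) version tuple against the notes above."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    if major_minor < MIN_SUPPORTED:
        return "retired"
    if major_minor <= NEWEST_NAMED:
        return "supported"
    # Newer interpreters are not mentioned in these notes.
    return "unknown"


if __name__ == "__main__":
    print(python_status())
```
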
-- Fix jetson compilation issues introduced by the underlying data structure upgrade. ([#39669](https://github.com/PaddlePaddle/Paddle/pull/39669), [#39441](https://github.com/PaddlePaddle/Paddle/pull/39441))
+### Compilation Optimizations
+- Optimize Paddle's CMake code, significantly improving compilation efficiency and experience. [#59995](https://github.com/PaddlePaddle/Paddle/pull/59995),[#60167](https://github.com/PaddlePaddle/Paddle/pull/60167),[#61052](https://github.com/PaddlePaddle/Paddle/pull/61052),[#59995](https://github.com/PaddlePaddle/Paddle/pull/59995),[#59607](https://github.com/PaddlePaddle/Paddle/pull/59607),[#63093](https://github.com/PaddlePaddle/Paddle/pull/63093),[#63887](https://github.com/PaddlePaddle/Paddle/pull/63887),[#62969](https://github.com/PaddlePaddle/Paddle/pull/62969),[#64007](https://github.com/PaddlePaddle/Paddle/pull/64007),[#59811](https://github.com/PaddlePaddle/Paddle/pull/59811),[#63045](https://github.com/PaddlePaddle/Paddle/pull/63045),[#60235](https://github.com/PaddlePaddle/Paddle/pull/60235),[#60240](https://github.com/PaddlePaddle/Paddle/pull/60240),[#60235](https://github.com/PaddlePaddle/Paddle/pull/60235),[#61411](https://github.com/PaddlePaddle/Paddle/pull/61411),[#61944](https://github.com/PaddlePaddle/Paddle/pull/61944),[#61961](https://github.com/PaddlePaddle/Paddle/pull/61961),[#59990](https://github.com/PaddlePaddle/Paddle/pull/59990),[#59478](https://github.com/PaddlePaddle/Paddle/pull/59478),[#61501](https://github.com/PaddlePaddle/Paddle/pull/61501),[#60066](https://github.com/PaddlePaddle/Paddle/pull/60066),[#64133](https://github.com/PaddlePaddle/Paddle/pull/64133),[#64231](https://github.com/PaddlePaddle/Paddle/pull/64231),[#60087](https://github.com/PaddlePaddle/Paddle/pull/60087),[#60348](https://github.com/PaddlePaddle/Paddle/pull/60348),[#60737](https://github.com/PaddlePaddle/Paddle/pull/60737),[#61364](https://github.com/PaddlePaddle/Paddle/pull/61364),[#63214](https://github.com/PaddlePaddle/Paddle/pull/63214),[#62454](https://github.com/PaddlePaddle/Paddle/pull/62454),[#62473](https://github.com/PaddlePaddle/Paddle/pull/62473),[#63692](https://github.com/PaddlePaddle/Paddle/pull/63692),[#63950](https://github.com/PaddlePaddle/Paddle/pull/63950)
+- Support linking C++ unit tests against dynamic libraries on Linux and Windows, greatly reducing the size of the C++ unit tests and of the entire build directory. [#60008](https://github.com/PaddlePaddle/Paddle/pull/60008),[#60960](https://github.com/PaddlePaddle/Paddle/pull/60960),[#60960](https://github.com/PaddlePaddle/Paddle/pull/60960),[#60961](https://github.com/PaddlePaddle/Paddle/pull/60961),[#60831](https://github.com/PaddlePaddle/Paddle/pull/60831),[#60832](https://github.com/PaddlePaddle/Paddle/pull/60832),[#60833](https://github.com/PaddlePaddle/Paddle/pull/60833),[#61372](https://github.com/PaddlePaddle/Paddle/pull/61372),[#60834](https://github.com/PaddlePaddle/Paddle/pull/60834),[#61374](https://github.com/PaddlePaddle/Paddle/pull/61374),[#61463](https://github.com/PaddlePaddle/Paddle/pull/61463),[#61376](https://github.com/PaddlePaddle/Paddle/pull/61376),[#60830](https://github.com/PaddlePaddle/Paddle/pull/60830),[#61373](https://github.com/PaddlePaddle/Paddle/pull/61373),[#61672](https://github.com/PaddlePaddle/Paddle/pull/61672),[#61375](https://github.com/PaddlePaddle/Paddle/pull/61375),[#61676](https://github.com/PaddlePaddle/Paddle/pull/61676),[#62036](https://github.com/PaddlePaddle/Paddle/pull/62036),[#61945](https://github.com/PaddlePaddle/Paddle/pull/61945),[#61675](https://github.com/PaddlePaddle/Paddle/pull/61675),[#61674](https://github.com/PaddlePaddle/Paddle/pull/61674),[#62773](https://github.com/PaddlePaddle/Paddle/pull/62773),[#61238](https://github.com/PaddlePaddle/Paddle/pull/61238),[#59988](https://github.com/PaddlePaddle/Paddle/pull/59988),[#60307](https://github.com/PaddlePaddle/Paddle/pull/60307),[#59612](https://github.com/PaddlePaddle/Paddle/pull/59612),[#59942](https://github.com/PaddlePaddle/Paddle/pull/59942),[#59968](https://github.com/PaddlePaddle/Paddle/pull/59968),[#59978](https://github.com/PaddlePaddle/Paddle/pull/59978),[#60121](https://github.com/PaddlePaddle/Paddle/pull/60121),[#60149](https://github.com/PaddlePaddle/Paddle/pull/60149),[#60161](https://github.com/PaddlePaddle/Paddle/pull/60161),[#60160](https://github.com/PaddlePaddle/Paddle/pull/60160),[#60230](https://github.com/PaddlePaddle/Paddle/pull/60230),[#60154](https://github.com/PaddlePaddle/Paddle/pull/60154),[#60356](https://github.com/PaddlePaddle/Paddle/pull/60356),[#60392](https://github.com/PaddlePaddle/Paddle/pull/60392),[#60517](https://github.com/PaddlePaddle/Paddle/pull/60517),[#61131](https://github.com/PaddlePaddle/Paddle/pull/61131),[#60959](https://github.com/PaddlePaddle/Paddle/pull/60959)
+- Add the support for Clang compiler. Users can now use Clang to compile, enjoying faster compilation speed and better error message prompts. [#63382](https://github.com/PaddlePaddle/Paddle/pull/63382),[#63133](https://github.com/PaddlePaddle/Paddle/pull/63133),[#61705](https://github.com/PaddlePaddle/Paddle/pull/61705),[#63152](https://github.com/PaddlePaddle/Paddle/pull/63152),[#63373](https://github.com/PaddlePaddle/Paddle/pull/63373)
-### **New Hardware Backend Extension**
+### CI Pipeline Improvements
-- Custom device support: provide a plug-in mechanism for extending the PaddlePaddle hardware backend. With this function, developers do not need to modify PaddlePaddle code for specific hardware. They only need to implement the standard interface and compile it into a dynamic link library that PaddlePaddle can call as a plug-in. This reduces the development effort of adding a new hardware backend to PaddlePaddle. Currently it supports custom Runtime and custom Kernel.
+- Improve the merge-in code monitoring mechanism in the CI pipeline, to ensure higher code quality and stability. Add a function monitoring module, to monitor various indicators of the CI pipeline in real time, ensuring smooth execution of each stage, to identify and resolve issues in a timely manner. [#61384](https://github.com/PaddlePaddle/Paddle/pull/61384),[#62190](https://github.com/PaddlePaddle/Paddle/pull/62190),[#60758](https://github.com/PaddlePaddle/Paddle/pull/60758),[#60399](https://github.com/PaddlePaddle/Paddle/pull/60399),[#58623](https://github.com/PaddlePaddle/Paddle/pull/58623),[#62177](https://github.com/PaddlePaddle/Paddle/pull/62177),[#62361](https://github.com/PaddlePaddle/Paddle/pull/62361),[#62893](https://github.com/PaddlePaddle/Paddle/pull/62893),[#63705](https://github.com/PaddlePaddle/Paddle/pull/63705),[#64476](https://github.com/PaddlePaddle/Paddle/pull/64476),[#64752](https://github.com/PaddlePaddle/Paddle/pull/64752),[#64733](https://github.com/PaddlePaddle/Paddle/pull/64733),[#61914](https://github.com/PaddlePaddle/Paddle/pull/61914)
-- Support Huawei NPU chip (Ascend910) training/inference. Support ResNet50, YoloV3, BERT, Transformer and many other models. Support static graph + dynamic graph training and auto-mixed precision training. Support single-card training and distributed training across multiple cards and multiple machines.
+### Code Cleanup
-- Support Graphcore IPU chip (including IPU Mk2 GC200 and Bow IPU) training/inference. Support ResNet50, BERT and other models. Support static graph training. Support single-card training and distributed training across multiple cards and multiple machines.
+- Remove some old codes. [#63580](https://github.com/PaddlePaddle/Paddle/pull/63580),[#62840](https://github.com/PaddlePaddle/Paddle/pull/62840),[#62886](https://github.com/PaddlePaddle/Paddle/pull/62886),[#63046](https://github.com/PaddlePaddle/Paddle/pull/63046),[#63004](https://github.com/PaddlePaddle/Paddle/pull/63004),[#63039](https://github.com/PaddlePaddle/Paddle/pull/63039),[#62733](https://github.com/PaddlePaddle/Paddle/pull/62733),[#62773](https://github.com/PaddlePaddle/Paddle/pull/62773),[#62768](https://github.com/PaddlePaddle/Paddle/pull/62768),[#62744](https://github.com/PaddlePaddle/Paddle/pull/62744),[#62861](https://github.com/PaddlePaddle/Paddle/pull/62861),[#62774](https://github.com/PaddlePaddle/Paddle/pull/62774),[#62851](https://github.com/PaddlePaddle/Paddle/pull/62851),[#62973](https://github.com/PaddlePaddle/Paddle/pull/62973),[#63273](https://github.com/PaddlePaddle/Paddle/pull/63273),[#62445](https://github.com/PaddlePaddle/Paddle/pull/62445),[#64382](https://github.com/PaddlePaddle/Paddle/pull/64382),[#64409](https://github.com/PaddlePaddle/Paddle/pull/64409),[#64391](https://github.com/PaddlePaddle/Paddle/pull/64391),[#64310](https://github.com/PaddlePaddle/Paddle/pull/64310),[#64348](https://github.com/PaddlePaddle/Paddle/pull/64348),[#64651](https://github.com/PaddlePaddle/Paddle/pull/64651),[#64709](https://github.com/PaddlePaddle/Paddle/pull/64709),[#61714](https://github.com/PaddlePaddle/Paddle/pull/61714),[#62109](https://github.com/PaddlePaddle/Paddle/pull/62109),[#61751](https://github.com/PaddlePaddle/Paddle/pull/61751),[#61691](https://github.com/PaddlePaddle/Paddle/pull/61691),[#61735](https://github.com/PaddlePaddle/Paddle/pull/61735)
-- Support Cambricon MLU chip (MLU370x4) training/inference. Support models such as ResNet50. Support static graph + dynamic graph training. Support auto-mixed precision training. Support single-card training and distributed training across multiple cards and multiple machines.
+### Bug Fixing
-- Support KUNLUNXIN 2 chips (KUNLUNXIN AI acceleration cards R200, R300) training/inference. Support ResNet50, YoloV3, OCR-DB, SSD, MobileNetV3, UNet, BERT, Transformer, GPT-2, Wide&Deep, and DeepFM. Support static graph + dynamic graph training. Support auto-mixed precision training. Support single-card training and distributed training across multiple cards and multiple machines.
+- Fix several compilation issues of the Paddle framework. [#63297](https://github.com/PaddlePaddle/Paddle/pull/63297),[#62994](https://github.com/PaddlePaddle/Paddle/pull/62994),[#62651](https://github.com/PaddlePaddle/Paddle/pull/62651),[#64408](https://github.com/PaddlePaddle/Paddle/pull/64408),[#60934](https://github.com/PaddlePaddle/Paddle/pull/60934),[#62899](https://github.com/PaddlePaddle/Paddle/pull/62899),[#60528](https://github.com/PaddlePaddle/Paddle/pull/60528),[#63158](https://github.com/PaddlePaddle/Paddle/pull/63158),[#64549](https://github.com/PaddlePaddle/Paddle/pull/64549),[#62351](https://github.com/PaddlePaddle/Paddle/pull/62351),[#61259](https://github.com/PaddlePaddle/Paddle/pull/61259),[#61281](https://github.com/PaddlePaddle/Paddle/pull/61281),[#62304](https://github.com/PaddlePaddle/Paddle/pull/62304),[#60736](https://github.com/PaddlePaddle/Paddle/pull/60736),[#60811](https://github.com/PaddlePaddle/Paddle/pull/60811),[#63949](https://github.com/PaddlePaddle/Paddle/pull/63949),[#59892](https://github.com/PaddlePaddle/Paddle/pull/59892),[#60767](https://github.com/PaddlePaddle/Paddle/pull/60767),[#60856](https://github.com/PaddlePaddle/Paddle/pull/60856),[#61286](https://github.com/PaddlePaddle/Paddle/pull/61286),[#61638](https://github.com/PaddlePaddle/Paddle/pull/61638),[#62079](https://github.com/PaddlePaddle/Paddle/pull/62079),[#62142](https://github.com/PaddlePaddle/Paddle/pull/62142),[#62823](https://github.com/PaddlePaddle/Paddle/pull/62823),[#62814](https://github.com/PaddlePaddle/Paddle/pull/62814),[#62425](https://github.com/PaddlePaddle/Paddle/pull/62425),[#62619](https://github.com/PaddlePaddle/Paddle/pull/62619),[#60207](https://github.com/PaddlePaddle/Paddle/pull/60207),[#60765](https://github.com/PaddlePaddle/Paddle/pull/60765),[#61870](https://github.com/PaddlePaddle/Paddle/pull/61870),[#61923](https://github.com/PaddlePaddle/Paddle/pull/61923),[#62144](https://github.com/PaddlePaddle/Paddle/pull/62144),[#62426](https://github.com/PaddlePaddle/Paddle/pull/62426),[#63848](https://github.com/PaddlePaddle/Paddle/pull/63848),[#60682](https://github.com/PaddlePaddle/Paddle/pull/60682),[#61369](https://github.com/PaddlePaddle/Paddle/pull/61369),[#62882](https://github.com/PaddlePaddle/Paddle/pull/62882),[#63944](https://github.com/PaddlePaddle/Paddle/pull/63944),[#64812](https://github.com/PaddlePaddle/Paddle/pull/64812),[#60654](https://github.com/PaddlePaddle/Paddle/pull/60654),[#60887](https://github.com/PaddlePaddle/Paddle/pull/60887),[#62058](https://github.com/PaddlePaddle/Paddle/pull/62058),[#64639](https://github.com/PaddlePaddle/Paddle/pull/64639),[#60115](https://github.com/PaddlePaddle/Paddle/pull/60115),[#61940](https://github.com/PaddlePaddle/Paddle/pull/61940),[#62614](https://github.com/PaddlePaddle/Paddle/pull/62614),[#59914](https://github.com/PaddlePaddle/Paddle/pull/59914),[#63762](https://github.com/PaddlePaddle/Paddle/pull/63762),[#60145](https://github.com/PaddlePaddle/Paddle/pull/60145),[#60285](https://github.com/PaddlePaddle/Paddle/pull/60285),[#60378](https://github.com/PaddlePaddle/Paddle/pull/60378),[#60393](https://github.com/PaddlePaddle/Paddle/pull/60393),[#61057](https://github.com/PaddlePaddle/Paddle/pull/61057),[#61058](https://github.com/PaddlePaddle/Paddle/pull/61058),[#61151](https://github.com/PaddlePaddle/Paddle/pull/61151),[#61347](https://github.com/PaddlePaddle/Paddle/pull/61347),[#61554](https://github.com/PaddlePaddle/Paddle/pull/61554),[#61844](https://github.com/PaddlePaddle/Paddle/pull/61844),[#62915](https://github.com/PaddlePaddle/Paddle/pull/62915),[#61852](https://github.com/PaddlePaddle/Paddle/pull/61852),[#61704](https://github.com/PaddlePaddle/Paddle/pull/61704),[#61991](https://github.com/PaddlePaddle/Paddle/pull/61991),[#62264](https://github.com/PaddlePaddle/Paddle/pull/62264),[#62762](https://github.com/PaddlePaddle/Paddle/pull/62762),[#63820](https://github.com/PaddlePaddle/Paddle/pull/63820),[#63864](https://github.com/PaddlePaddle/Paddle/pull/63864),[#65017](https://github.com/PaddlePaddle/Paddle/pull/65017),[#61183](https://github.com/PaddlePaddle/Paddle/pull/61183),[#59866](https://github.com/PaddlePaddle/Paddle/pull/59866),[#61171](https://github.com/PaddlePaddle/Paddle/pull/61171),[#61290](https://github.com/PaddlePaddle/Paddle/pull/61290),[#61725](https://github.com/PaddlePaddle/Paddle/pull/61725),[#61614](https://github.com/PaddlePaddle/Paddle/pull/61614),[#61721](https://github.com/PaddlePaddle/Paddle/pull/61721),[#61494](https://github.com/PaddlePaddle/Paddle/pull/61494),[#61556](https://github.com/PaddlePaddle/Paddle/pull/61556),[#61689](https://github.com/PaddlePaddle/Paddle/pull/61689)
+## Documentation-related Bug Fixing
-## Thanks to our Contributors
+- Fix and improve several API documents alongside the API feature enhancements. [#62875](https://github.com/PaddlePaddle/Paddle/pull/62875), [#59793](https://github.com/PaddlePaddle/Paddle/pull/59793), [#60002](https://github.com/PaddlePaddle/Paddle/pull/60002), [#59985](https://github.com/PaddlePaddle/Paddle/pull/59985), [#63365](https://github.com/PaddlePaddle/Paddle/pull/63365), [#60962](https://github.com/PaddlePaddle/Paddle/pull/60962), [#60942](https://github.com/PaddlePaddle/Paddle/pull/60942), [#64232](https://github.com/PaddlePaddle/Paddle/pull/64232), [#63255](https://github.com/PaddlePaddle/Paddle/pull/63255)
+- Update/supplement API documentation: bernoulli_ ([#64504](https://github.com/PaddlePaddle/Paddle/pull/64504)), paddle.static.ctr_metric_bundle ([#60912](https://github.com/PaddlePaddle/Paddle/pull/60912)), LayerNorm ([#62928](https://github.com/PaddlePaddle/Paddle/pull/62928)), Sequential ([#63128](https://github.com/PaddlePaddle/Paddle/pull/63128)), paddle.summary ([#63121](https://github.com/PaddlePaddle/Paddle/pull/63121)), ShardOptimizer in AutoParallel ([#62933](https://github.com/PaddlePaddle/Paddle/pull/62933)), paddle.nccl.version ([#62480](https://github.com/PaddlePaddle/Paddle/pull/62480))
+- Update the README file. [#59883](https://github.com/PaddlePaddle/Paddle/pull/59883),[#60691](https://github.com/PaddlePaddle/Paddle/pull/60691),[#60749](https://github.com/PaddlePaddle/Paddle/pull/60749)
+- Update mkldnn to onednn. [#63199](https://github.com/PaddlePaddle/Paddle/pull/63199),[#63202](https://github.com/PaddlePaddle/Paddle/pull/63202),[#63215](https://github.com/PaddlePaddle/Paddle/pull/63215),[#63209](https://github.com/PaddlePaddle/Paddle/pull/63209)
+- Fix document rendering bugs. [#59725](https://github.com/PaddlePaddle/Paddle/pull/59725),[#60306](https://github.com/PaddlePaddle/Paddle/pull/60306)
+- Fix numerous typos in the code to improve source readability. [#60093](https://github.com/PaddlePaddle/Paddle/pull/60093),[#60603](https://github.com/PaddlePaddle/Paddle/pull/60603),[#60631](https://github.com/PaddlePaddle/Paddle/pull/60631),[#60679](https://github.com/PaddlePaddle/Paddle/pull/60679),[#60741](https://github.com/PaddlePaddle/Paddle/pull/60741),[#60770](https://github.com/PaddlePaddle/Paddle/pull/60770),[#60784](https://github.com/PaddlePaddle/Paddle/pull/60784),[#60825](https://github.com/PaddlePaddle/Paddle/pull/60825),[#60857](https://github.com/PaddlePaddle/Paddle/pull/60857),[#60891](https://github.com/PaddlePaddle/Paddle/pull/60891),[#60921](https://github.com/PaddlePaddle/Paddle/pull/60921),[#60920](https://github.com/PaddlePaddle/Paddle/pull/60920),[#60923](https://github.com/PaddlePaddle/Paddle/pull/60923),[#60928](https://github.com/PaddlePaddle/Paddle/pull/60928),[#60940](https://github.com/PaddlePaddle/Paddle/pull/60940),[#60936](https://github.com/PaddlePaddle/Paddle/pull/60936),[#60932](https://github.com/PaddlePaddle/Paddle/pull/60932),[#60935](https://github.com/PaddlePaddle/Paddle/pull/60935),[#60931](https://github.com/PaddlePaddle/Paddle/pull/60931),[#60951](https://github.com/PaddlePaddle/Paddle/pull/60951),[#60964](https://github.com/PaddlePaddle/Paddle/pull/60964),[#60965](https://github.com/PaddlePaddle/Paddle/pull/60965),[#60967](https://github.com/PaddlePaddle/Paddle/pull/60967),[#60972](https://github.com/PaddlePaddle/Paddle/pull/60972),[#60971](https://github.com/PaddlePaddle/Paddle/pull/60971),[#60980](https://github.com/PaddlePaddle/Paddle/pull/60980),[#60984](https://github.com/PaddlePaddle/Paddle/pull/60984),[#60985](https://github.com/PaddlePaddle/Paddle/pull/60985),[#60989](https://github.com/PaddlePaddle/Paddle/pull/60989),[#60990](https://github.com/PaddlePaddle/Paddle/pull/60990),[#60991](https://github.com/PaddlePaddle/Paddle/pull/60991),[#60992](https://github.com/PaddlePaddle/Paddle/pull/60992),[#60994](https
://github.com/PaddlePaddle/Paddle/pull/60994),[#60995](https://github.com/PaddlePaddle/Paddle/pull/60995),[#60996](https://github.com/PaddlePaddle/Paddle/pull/60996),[#61001](https://github.com/PaddlePaddle/Paddle/pull/61001),[#61000](https://github.com/PaddlePaddle/Paddle/pull/61000),[#60999](https://github.com/PaddlePaddle/Paddle/pull/60999),[#60998](https://github.com/PaddlePaddle/Paddle/pull/60998),[#61026](https://github.com/PaddlePaddle/Paddle/pull/61026),[#61009](https://github.com/PaddlePaddle/Paddle/pull/61009),[#61034](https://github.com/PaddlePaddle/Paddle/pull/61034),[#61033](https://github.com/PaddlePaddle/Paddle/pull/61033),[#61020](https://github.com/PaddlePaddle/Paddle/pull/61020),[#61092](https://github.com/PaddlePaddle/Paddle/pull/61092),[#61066](https://github.com/PaddlePaddle/Paddle/pull/61066),[#61063](https://github.com/PaddlePaddle/Paddle/pull/61063),[#61089](https://github.com/PaddlePaddle/Paddle/pull/61089),[#61071](https://github.com/PaddlePaddle/Paddle/pull/61071),[#61129](https://github.com/PaddlePaddle/Paddle/pull/61129),[#61128](https://github.com/PaddlePaddle/Paddle/pull/61128),[#61126](https://github.com/PaddlePaddle/Paddle/pull/61126),[#61123](https://github.com/PaddlePaddle/Paddle/pull/61123),[#61113](https://github.com/PaddlePaddle/Paddle/pull/61113),[#61189](https://github.com/PaddlePaddle/Paddle/pull/61189),[#61175](https://github.com/PaddlePaddle/Paddle/pull/61175),[#61153](https://github.com/PaddlePaddle/Paddle/pull/61153),[#61198](https://github.com/PaddlePaddle/Paddle/pull/61198),[#61206](https://github.com/PaddlePaddle/Paddle/pull/61206),[#61256](https://github.com/PaddlePaddle/Paddle/pull/61256),[#61255](https://github.com/PaddlePaddle/Paddle/pull/61255),[#61251](https://github.com/PaddlePaddle/Paddle/pull/61251),[#61246](https://github.com/PaddlePaddle/Paddle/pull/61246),[#61245](https://github.com/PaddlePaddle/Paddle/pull/61245),[#61231](https://github.com/PaddlePaddle/Paddle/pull/61231),[#61247](https://github.com/Paddle
Paddle/Paddle/pull/61247),[#61265](https://github.com/PaddlePaddle/Paddle/pull/61265),[#61264](https://github.com/PaddlePaddle/Paddle/pull/61264),[#61266](https://github.com/PaddlePaddle/Paddle/pull/61266),[#61267](https://github.com/PaddlePaddle/Paddle/pull/61267),[#61268](https://github.com/PaddlePaddle/Paddle/pull/61268),[#61270](https://github.com/PaddlePaddle/Paddle/pull/61270),[#61334](https://github.com/PaddlePaddle/Paddle/pull/61334),[#61392](https://github.com/PaddlePaddle/Paddle/pull/61392),[#61404](https://github.com/PaddlePaddle/Paddle/pull/61404),[#61318](https://github.com/PaddlePaddle/Paddle/pull/61318),[#61383](https://github.com/PaddlePaddle/Paddle/pull/61383),[#61306](https://github.com/PaddlePaddle/Paddle/pull/61306),[#61324](https://github.com/PaddlePaddle/Paddle/pull/61324),[#61426](https://github.com/PaddlePaddle/Paddle/pull/61426),[#61390](https://github.com/PaddlePaddle/Paddle/pull/61390),[#61419](https://github.com/PaddlePaddle/Paddle/pull/61419),[#61420](https://github.com/PaddlePaddle/Paddle/pull/61420),[#61408](https://github.com/PaddlePaddle/Paddle/pull/61408),[#61425](https://github.com/PaddlePaddle/Paddle/pull/61425),[#61557](https://github.com/PaddlePaddle/Paddle/pull/61557),[#61628](https://github.com/PaddlePaddle/Paddle/pull/61628),[#61652](https://github.com/PaddlePaddle/Paddle/pull/61652),[#61602](https://github.com/PaddlePaddle/Paddle/pull/61602),[#61558](https://github.com/PaddlePaddle/Paddle/pull/61558),[#61660](https://github.com/PaddlePaddle/Paddle/pull/61660),[#61423](https://github.com/PaddlePaddle/Paddle/pull/61423),[#61627](https://github.com/PaddlePaddle/Paddle/pull/61627),[#61685](https://github.com/PaddlePaddle/Paddle/pull/61685),[#61690](https://github.com/PaddlePaddle/Paddle/pull/61690),[#61727](https://github.com/PaddlePaddle/Paddle/pull/61727),[#61738](https://github.com/PaddlePaddle/Paddle/pull/61738),[#61740](https://github.com/PaddlePaddle/Paddle/pull/61740),[#61741](https://github.com/PaddlePaddle/Paddle/pull/6
1741),[#61743](https://github.com/PaddlePaddle/Paddle/pull/61743),[#61744](https://github.com/PaddlePaddle/Paddle/pull/61744),[#61745](https://github.com/PaddlePaddle/Paddle/pull/61745),[#61761](https://github.com/PaddlePaddle/Paddle/pull/61761),[#61762](https://github.com/PaddlePaddle/Paddle/pull/61762),[#61764](https://github.com/PaddlePaddle/Paddle/pull/61764),[#61767](https://github.com/PaddlePaddle/Paddle/pull/61767),[#61768](https://github.com/PaddlePaddle/Paddle/pull/61768),[#61774](https://github.com/PaddlePaddle/Paddle/pull/61774),[#61781](https://github.com/PaddlePaddle/Paddle/pull/61781),[#61783](https://github.com/PaddlePaddle/Paddle/pull/61783),[#61757](https://github.com/PaddlePaddle/Paddle/pull/61757),[#61732](https://github.com/PaddlePaddle/Paddle/pull/61732),[#61776](https://github.com/PaddlePaddle/Paddle/pull/61776),[#61780](https://github.com/PaddlePaddle/Paddle/pull/61780),[#61730](https://github.com/PaddlePaddle/Paddle/pull/61730),[#61728](https://github.com/PaddlePaddle/Paddle/pull/61728),[#61633](https://github.com/PaddlePaddle/Paddle/pull/61633),[#61720](https://github.com/PaddlePaddle/Paddle/pull/61720),[#61734](https://github.com/PaddlePaddle/Paddle/pull/61734),[#61779](https://github.com/PaddlePaddle/Paddle/pull/61779),[#61775](https://github.com/PaddlePaddle/Paddle/pull/61775),[#61773](https://github.com/PaddlePaddle/Paddle/pull/61773),[#61787](https://github.com/PaddlePaddle/Paddle/pull/61787),[#61687](https://github.com/PaddlePaddle/Paddle/pull/61687),[#61747](https://github.com/PaddlePaddle/Paddle/pull/61747),[#61760](https://github.com/PaddlePaddle/Paddle/pull/61760),[#61782](https://github.com/PaddlePaddle/Paddle/pull/61782),[#61800](https://github.com/PaddlePaddle/Paddle/pull/61800),[#61748](https://github.com/PaddlePaddle/Paddle/pull/61748),[#61772](https://github.com/PaddlePaddle/Paddle/pull/61772),[#61786](https://github.com/PaddlePaddle/Paddle/pull/61786),[#61880](https://github.com/PaddlePaddle/Paddle/pull/61880),[#61718](https
://github.com/PaddlePaddle/Paddle/pull/61718),[#61742](https://github.com/PaddlePaddle/Paddle/pull/61742),[#61766](https://github.com/PaddlePaddle/Paddle/pull/61766),[#61835](https://github.com/PaddlePaddle/Paddle/pull/61835),[#61838](https://github.com/PaddlePaddle/Paddle/pull/61838),[#61754](https://github.com/PaddlePaddle/Paddle/pull/61754),[#61833](https://github.com/PaddlePaddle/Paddle/pull/61833),[#61749](https://github.com/PaddlePaddle/Paddle/pull/61749),[#61938](https://github.com/PaddlePaddle/Paddle/pull/61938),[#61919](https://github.com/PaddlePaddle/Paddle/pull/61919),[#61924](https://github.com/PaddlePaddle/Paddle/pull/61924),[#61778](https://github.com/PaddlePaddle/Paddle/pull/61778),[#61839](https://github.com/PaddlePaddle/Paddle/pull/61839),[#61879](https://github.com/PaddlePaddle/Paddle/pull/61879),[#61929](https://github.com/PaddlePaddle/Paddle/pull/61929),[#61801](https://github.com/PaddlePaddle/Paddle/pull/61801),[#61788](https://github.com/PaddlePaddle/Paddle/pull/61788),[#61999](https://github.com/PaddlePaddle/Paddle/pull/61999),[#61928](https://github.com/PaddlePaddle/Paddle/pull/61928),[#61958](https://github.com/PaddlePaddle/Paddle/pull/61958),[#61982](https://github.com/PaddlePaddle/Paddle/pull/61982),[#61996](https://github.com/PaddlePaddle/Paddle/pull/61996),[#61953](https://github.com/PaddlePaddle/Paddle/pull/61953),[#61998](https://github.com/PaddlePaddle/Paddle/pull/61998),[#62003](https://github.com/PaddlePaddle/Paddle/pull/62003),[#61921](https://github.com/PaddlePaddle/Paddle/pull/61921),[#61881](https://github.com/PaddlePaddle/Paddle/pull/61881),[#61746](https://github.com/PaddlePaddle/Paddle/pull/61746),[#61955](https://github.com/PaddlePaddle/Paddle/pull/61955),[#62002](https://github.com/PaddlePaddle/Paddle/pull/62002),[#62001](https://github.com/PaddlePaddle/Paddle/pull/62001),[#61997](https://github.com/PaddlePaddle/Paddle/pull/61997),[#61765](https://github.com/PaddlePaddle/Paddle/pull/61765),[#61956](https://github.com/Paddle
Paddle/Paddle/pull/61956),[#62004](https://github.com/PaddlePaddle/Paddle/pull/62004),[#62044](https://github.com/PaddlePaddle/Paddle/pull/62044),[#62040](https://github.com/PaddlePaddle/Paddle/pull/62040),[#62043](https://github.com/PaddlePaddle/Paddle/pull/62043),[#62042](https://github.com/PaddlePaddle/Paddle/pull/62042),[#62041](https://github.com/PaddlePaddle/Paddle/pull/62041),[#62039](https://github.com/PaddlePaddle/Paddle/pull/62039),[#62019](https://github.com/PaddlePaddle/Paddle/pull/62019),[#61910](https://github.com/PaddlePaddle/Paddle/pull/61910),[#61882](https://github.com/PaddlePaddle/Paddle/pull/61882),[#61836](https://github.com/PaddlePaddle/Paddle/pull/61836),[#62013](https://github.com/PaddlePaddle/Paddle/pull/62013),[#62055](https://github.com/PaddlePaddle/Paddle/pull/62055),[#62047](https://github.com/PaddlePaddle/Paddle/pull/62047),[#62000](https://github.com/PaddlePaddle/Paddle/pull/62000),[#62048](https://github.com/PaddlePaddle/Paddle/pull/62048),[#62075](https://github.com/PaddlePaddle/Paddle/pull/62075),[#62038](https://github.com/PaddlePaddle/Paddle/pull/62038),[#62045](https://github.com/PaddlePaddle/Paddle/pull/62045),[#62105](https://github.com/PaddlePaddle/Paddle/pull/62105),[#62214](https://github.com/PaddlePaddle/Paddle/pull/62214),[#62212](https://github.com/PaddlePaddle/Paddle/pull/62212),[#62183](https://github.com/PaddlePaddle/Paddle/pull/62183),[#62182](https://github.com/PaddlePaddle/Paddle/pull/62182),[#62181](https://github.com/PaddlePaddle/Paddle/pull/62181),[#62179](https://github.com/PaddlePaddle/Paddle/pull/62179),[#62178](https://github.com/PaddlePaddle/Paddle/pull/62178),[#62172](https://github.com/PaddlePaddle/Paddle/pull/62172),[#62168](https://github.com/PaddlePaddle/Paddle/pull/62168),[#62163](https://github.com/PaddlePaddle/Paddle/pull/62163),[#62162](https://github.com/PaddlePaddle/Paddle/pull/62162),[#62161](https://github.com/PaddlePaddle/Paddle/pull/62161),[#62160](https://github.com/PaddlePaddle/Paddle/pull/6
2160),[#62046](https://github.com/PaddlePaddle/Paddle/pull/62046),[#62175](https://github.com/PaddlePaddle/Paddle/pull/62175),[#62259](https://github.com/PaddlePaddle/Paddle/pull/62259),[#62258](https://github.com/PaddlePaddle/Paddle/pull/62258),[#62213](https://github.com/PaddlePaddle/Paddle/pull/62213),[#62260](https://github.com/PaddlePaddle/Paddle/pull/62260),[#62290](https://github.com/PaddlePaddle/Paddle/pull/62290),[#62288](https://github.com/PaddlePaddle/Paddle/pull/62288),[#62323](https://github.com/PaddlePaddle/Paddle/pull/62323),[#62319](https://github.com/PaddlePaddle/Paddle/pull/62319),[#62331](https://github.com/PaddlePaddle/Paddle/pull/62331),[#62330](https://github.com/PaddlePaddle/Paddle/pull/62330),[#62329](https://github.com/PaddlePaddle/Paddle/pull/62329),[#62324](https://github.com/PaddlePaddle/Paddle/pull/62324),[#62317](https://github.com/PaddlePaddle/Paddle/pull/62317),[#62311](https://github.com/PaddlePaddle/Paddle/pull/62311),[#62310](https://github.com/PaddlePaddle/Paddle/pull/62310),[#62308](https://github.com/PaddlePaddle/Paddle/pull/62308),[#62289](https://github.com/PaddlePaddle/Paddle/pull/62289),[#62307](https://github.com/PaddlePaddle/Paddle/pull/62307),[#62315](https://github.com/PaddlePaddle/Paddle/pull/62315),[#62406](https://github.com/PaddlePaddle/Paddle/pull/62406),[#62458](https://github.com/PaddlePaddle/Paddle/pull/62458),[#62459](https://github.com/PaddlePaddle/Paddle/pull/62459),[#62481](https://github.com/PaddlePaddle/Paddle/pull/62481),[#62465](https://github.com/PaddlePaddle/Paddle/pull/62465),[#62462](https://github.com/PaddlePaddle/Paddle/pull/62462),[#62453](https://github.com/PaddlePaddle/Paddle/pull/62453),[#62496](https://github.com/PaddlePaddle/Paddle/pull/62496),[#62457](https://github.com/PaddlePaddle/Paddle/pull/62457),[#62537](https://github.com/PaddlePaddle/Paddle/pull/62537),[#62514](https://github.com/PaddlePaddle/Paddle/pull/62514),[#62548](https://github.com/PaddlePaddle/Paddle/pull/62548),[#62544](https
://github.com/PaddlePaddle/Paddle/pull/62544),[#62575](https://github.com/PaddlePaddle/Paddle/pull/62575),[#62463](https://github.com/PaddlePaddle/Paddle/pull/62463),[#62643](https://github.com/PaddlePaddle/Paddle/pull/62643),[#62803](https://github.com/PaddlePaddle/Paddle/pull/62803),[#62924](https://github.com/PaddlePaddle/Paddle/pull/62924),[#63037](https://github.com/PaddlePaddle/Paddle/pull/63037),[#63102](https://github.com/PaddlePaddle/Paddle/pull/63102),[#63139](https://github.com/PaddlePaddle/Paddle/pull/63139),[#63092](https://github.com/PaddlePaddle/Paddle/pull/63092),[#63147](https://github.com/PaddlePaddle/Paddle/pull/63147),[#60518](https://github.com/PaddlePaddle/Paddle/pull/60518),[#60485](https://github.com/PaddlePaddle/Paddle/pull/60485),[#61273](https://github.com/PaddlePaddle/Paddle/pull/61273),[#63429](https://github.com/PaddlePaddle/Paddle/pull/63429),[#61954](https://github.com/PaddlePaddle/Paddle/pull/61954)
-This release contains contributions from the project core team as well as:
+## Others
-Adam Osewski, Allen Guo, arlesniak, chenenquan, chenyanlann, fengkuangxiaxia, fuqianya, fwenguang, guguguzi, helen88, houj04, Jacek Czaja, jakpiase, jianghaicheng, joanna.wozna.intel, joeqiao12, Leo Chen, Leo Guo, Li-fAngyU, lidanqing, Liyulingyue, Matsumoto GAO, maxhuiy, Ming-Xu Huang, Nyakku Shigure, piotrekobi, piotrekobiIntel, QingshuChen, qipengh, Skr Bang, Sylwester Fraczek, Sławomir Siwek, taixiurong, tanzhipeng, Tomasz Socha, TTerror, Webbley, yaozhixin, ykkk2333, yujun, Zhangjingyu06, zhangxiaoci, zhangyikun02, zhangyk0314, zlsh80826, zn, Zuza.
+Changes that do not affect users, including cleanup of deprecated code and unnecessary unit tests, debugging, and upgrades to the monitoring mechanism. [#63377](https://github.com/PaddlePaddle/Paddle/pull/63377),[#64106](https://github.com/PaddlePaddle/Paddle/pull/64106),[#64220](https://github.com/PaddlePaddle/Paddle/pull/64220),[#64293](https://github.com/PaddlePaddle/Paddle/pull/64293),[#64464](https://github.com/PaddlePaddle/Paddle/pull/64464),[#64944](https://github.com/PaddlePaddle/Paddle/pull/64944),[#63638](https://github.com/PaddlePaddle/Paddle/pull/63638),[#63732](https://github.com/PaddlePaddle/Paddle/pull/63732),[#63735](https://github.com/PaddlePaddle/Paddle/pull/63735),[#63826](https://github.com/PaddlePaddle/Paddle/pull/63826),[#63982](https://github.com/PaddlePaddle/Paddle/pull/63982),[#63737](https://github.com/PaddlePaddle/Paddle/pull/63737),[#64471](https://github.com/PaddlePaddle/Paddle/pull/64471),[#64574](https://github.com/PaddlePaddle/Paddle/pull/64574),[#64494](https://github.com/PaddlePaddle/Paddle/pull/64494),[#62775](https://github.com/PaddlePaddle/Paddle/pull/62775),[#63601](https://github.com/PaddlePaddle/Paddle/pull/63601),[#62564](https://github.com/PaddlePaddle/Paddle/pull/62564),[#63772](https://github.com/PaddlePaddle/Paddle/pull/63772),[#64719](https://github.com/PaddlePaddle/Paddle/pull/64719),[#61640](https://github.com/PaddlePaddle/Paddle/pull/61640),[#63459](https://github.com/PaddlePaddle/Paddle/pull/63459),[#64062](https://github.com/PaddlePaddle/Paddle/pull/64062),[#63480](https://github.com/PaddlePaddle/Paddle/pull/63480),[#63833](https://github.com/PaddlePaddle/Paddle/pull/63833),[#63673](https://github.com/PaddlePaddle/Paddle/pull/63673),[#63672](https://github.com/PaddlePaddle/Paddle/pull/63672),[#64131](https://github.com/PaddlePaddle/Paddle/pull/64131),[#64156](https://github.com/PaddlePaddle/Paddle/pull/64156),[#64155](https://github.com/PaddlePaddle/Paddle/pull/64155),[#64159](https://github.com/PaddlePaddle/Paddle/pull/64159),[#6390
2](https://github.com/PaddlePaddle/Paddle/pull/63902),[#64230](https://github.com/PaddlePaddle/Paddle/pull/64230),[#64229](https://github.com/PaddlePaddle/Paddle/pull/64229),[#64236](https://github.com/PaddlePaddle/Paddle/pull/64236),[#64260](https://github.com/PaddlePaddle/Paddle/pull/64260),[#64175](https://github.com/PaddlePaddle/Paddle/pull/64175),[#64250](https://github.com/PaddlePaddle/Paddle/pull/64250),[#64269](https://github.com/PaddlePaddle/Paddle/pull/64269),[#64238](https://github.com/PaddlePaddle/Paddle/pull/64238),[#64349](https://github.com/PaddlePaddle/Paddle/pull/64349),[#64394](https://github.com/PaddlePaddle/Paddle/pull/64394),[#64402](https://github.com/PaddlePaddle/Paddle/pull/64402),[#64401](https://github.com/PaddlePaddle/Paddle/pull/64401),[#64388](https://github.com/PaddlePaddle/Paddle/pull/64388),[#64329](https://github.com/PaddlePaddle/Paddle/pull/64329),[#64502](https://github.com/PaddlePaddle/Paddle/pull/64502),[#64501](https://github.com/PaddlePaddle/Paddle/pull/64501),[#64515](https://github.com/PaddlePaddle/Paddle/pull/64515),[#64503](https://github.com/PaddlePaddle/Paddle/pull/64503),[#64514](https://github.com/PaddlePaddle/Paddle/pull/64514),[#64601](https://github.com/PaddlePaddle/Paddle/pull/64601),[#64564](https://github.com/PaddlePaddle/Paddle/pull/64564),[#64012](https://github.com/PaddlePaddle/Paddle/pull/64012),[#64697](https://github.com/PaddlePaddle/Paddle/pull/64697),[#64682](https://github.com/PaddlePaddle/Paddle/pull/64682),[#64051](https://github.com/PaddlePaddle/Paddle/pull/64051),[#63267](https://github.com/PaddlePaddle/Paddle/pull/63267),[#63426](https://github.com/PaddlePaddle/Paddle/pull/63426),[#63626](https://github.com/PaddlePaddle/Paddle/pull/63626),[#63257](https://github.com/PaddlePaddle/Paddle/pull/63257),[#63266](https://github.com/PaddlePaddle/Paddle/pull/63266),[#63468](https://github.com/PaddlePaddle/Paddle/pull/63468),[#63262](https://github.com/PaddlePaddle/Paddle/pull/63262),[#63248](https://github.co
m/PaddlePaddle/Paddle/pull/63248),[#63241](https://github.com/PaddlePaddle/Paddle/pull/63241),[#63252](https://github.com/PaddlePaddle/Paddle/pull/63252),[#63258](https://github.com/PaddlePaddle/Paddle/pull/63258),[#63235](https://github.com/PaddlePaddle/Paddle/pull/63235),[#63399](https://github.com/PaddlePaddle/Paddle/pull/63399),[#63488](https://github.com/PaddlePaddle/Paddle/pull/63488),[#63487](https://github.com/PaddlePaddle/Paddle/pull/63487),[#63466](https://github.com/PaddlePaddle/Paddle/pull/63466),[#63464](https://github.com/PaddlePaddle/Paddle/pull/63464),[#63483](https://github.com/PaddlePaddle/Paddle/pull/63483),[#63486](https://github.com/PaddlePaddle/Paddle/pull/63486),[#63475](https://github.com/PaddlePaddle/Paddle/pull/63475),[#63489](https://github.com/PaddlePaddle/Paddle/pull/63489),[#63470](https://github.com/PaddlePaddle/Paddle/pull/63470),[#63457](https://github.com/PaddlePaddle/Paddle/pull/63457),[#63493](https://github.com/PaddlePaddle/Paddle/pull/63493),[#63561](https://github.com/PaddlePaddle/Paddle/pull/63561),[#63584](https://github.com/PaddlePaddle/Paddle/pull/63584),[#63587](https://github.com/PaddlePaddle/Paddle/pull/63587),[#63586](https://github.com/PaddlePaddle/Paddle/pull/63586),[#63569](https://github.com/PaddlePaddle/Paddle/pull/63569),[#63559](https://github.com/PaddlePaddle/Paddle/pull/63559),[#63558](https://github.com/PaddlePaddle/Paddle/pull/63558),[#63555](https://github.com/PaddlePaddle/Paddle/pull/63555),[#63543](https://github.com/PaddlePaddle/Paddle/pull/63543),[#63589](https://github.com/PaddlePaddle/Paddle/pull/63589),[#63583](https://github.com/PaddlePaddle/Paddle/pull/63583),[#63565](https://github.com/PaddlePaddle/Paddle/pull/63565),[#63564](https://github.com/PaddlePaddle/Paddle/pull/63564),[#63265](https://github.com/PaddlePaddle/Paddle/pull/63265),[#63562](https://github.com/PaddlePaddle/Paddle/pull/63562),[#63591](https://github.com/PaddlePaddle/Paddle/pull/63591),[#63460](https://github.com/PaddlePaddle/Paddl
e/pull/63460),[#63238](https://github.com/PaddlePaddle/Paddle/pull/63238),[#63631](https://github.com/PaddlePaddle/Paddle/pull/63631),[#63707](https://github.com/PaddlePaddle/Paddle/pull/63707),[#63714](https://github.com/PaddlePaddle/Paddle/pull/63714),[#63854](https://github.com/PaddlePaddle/Paddle/pull/63854),[#63929](https://github.com/PaddlePaddle/Paddle/pull/63929),[#63532](https://github.com/PaddlePaddle/Paddle/pull/63532),[#59628](https://github.com/PaddlePaddle/Paddle/pull/59628),[#62209](https://github.com/PaddlePaddle/Paddle/pull/62209),[#63742](https://github.com/PaddlePaddle/Paddle/pull/63742),[#60518](https://github.com/PaddlePaddle/Paddle/pull/60518),[#62078](https://github.com/PaddlePaddle/Paddle/pull/62078),[#62684](https://github.com/PaddlePaddle/Paddle/pull/62684),[#62723](https://github.com/PaddlePaddle/Paddle/pull/62723),[#64141](https://github.com/PaddlePaddle/Paddle/pull/64141),[#60404](https://github.com/PaddlePaddle/Paddle/pull/60404),[#64212](https://github.com/PaddlePaddle/Paddle/pull/64212),[#60652](https://github.com/PaddlePaddle/Paddle/pull/60652),[#64545](https://github.com/PaddlePaddle/Paddle/pull/64545),[#64477](https://github.com/PaddlePaddle/Paddle/pull/64477),[#64556](https://github.com/PaddlePaddle/Paddle/pull/64556),[#63160](https://github.com/PaddlePaddle/Paddle/pull/63160),[#63796](https://github.com/PaddlePaddle/Paddle/pull/63796),[#64693](https://github.com/PaddlePaddle/Paddle/pull/64693),[#64484](https://github.com/PaddlePaddle/Paddle/pull/64484),[#64677](https://github.com/PaddlePaddle/Paddle/pull/64677),[#64461](https://github.com/PaddlePaddle/Paddle/pull/64461),[#63189](https://github.com/PaddlePaddle/Paddle/pull/63189),[#63855](https://github.com/PaddlePaddle/Paddle/pull/63855),[#63896](https://github.com/PaddlePaddle/Paddle/pull/63896),[#63193](https://github.com/PaddlePaddle/Paddle/pull/63193),[#63200](https://github.com/PaddlePaddle/Paddle/pull/63200),[#63406](https://github.com/PaddlePaddle/Paddle/pull/63406),[#6128
3](https://github.com/PaddlePaddle/Paddle/pull/61283),[#63607](https://github.com/PaddlePaddle/Paddle/pull/63607),[#64486](https://github.com/PaddlePaddle/Paddle/pull/64486),[#64004](https://github.com/PaddlePaddle/Paddle/pull/64004),[#63132](https://github.com/PaddlePaddle/Paddle/pull/63132),[#63553](https://github.com/PaddlePaddle/Paddle/pull/63553),[#63572](https://github.com/PaddlePaddle/Paddle/pull/63572),[#63794](https://github.com/PaddlePaddle/Paddle/pull/63794),[#63919](https://github.com/PaddlePaddle/Paddle/pull/63919),[#63980](https://github.com/PaddlePaddle/Paddle/pull/63980),[#62917](https://github.com/PaddlePaddle/Paddle/pull/62917),[#64451](https://github.com/PaddlePaddle/Paddle/pull/64451),[#63541](https://github.com/PaddlePaddle/Paddle/pull/63541),[#63703](https://github.com/PaddlePaddle/Paddle/pull/63703),[#64536](https://github.com/PaddlePaddle/Paddle/pull/64536),[#63264](https://github.com/PaddlePaddle/Paddle/pull/63264),[#63335](https://github.com/PaddlePaddle/Paddle/pull/63335),[#63841](https://github.com/PaddlePaddle/Paddle/pull/63841),[#64628](https://github.com/PaddlePaddle/Paddle/pull/64628),[#63419](https://github.com/PaddlePaddle/Paddle/pull/63419),[#62210](https://github.com/PaddlePaddle/Paddle/pull/62210),[#63557](https://github.com/PaddlePaddle/Paddle/pull/63557),[#63064](https://github.com/PaddlePaddle/Paddle/pull/63064),[#61442](https://github.com/PaddlePaddle/Paddle/pull/61442),[#63537](https://github.com/PaddlePaddle/Paddle/pull/63537),[#63839](https://github.com/PaddlePaddle/Paddle/pull/63839),[#60927](https://github.com/PaddlePaddle/Paddle/pull/60927),[#60566](https://github.com/PaddlePaddle/Paddle/pull/60566),[#60842](https://github.com/PaddlePaddle/Paddle/pull/60842),[#64612](https://github.com/PaddlePaddle/Paddle/pull/64612),[#60047](https://github.com/PaddlePaddle/Paddle/pull/60047),[#63898](https://github.com/PaddlePaddle/Paddle/pull/63898),[#60415](https://github.com/PaddlePaddle/Paddle/pull/60415),[#60474](https://github.co
m/PaddlePaddle/Paddle/pull/60474),[#60439](https://github.com/PaddlePaddle/Paddle/pull/60439),[#60565](https://github.com/PaddlePaddle/Paddle/pull/60565),[#64414](https://github.com/PaddlePaddle/Paddle/pull/64414),[#62526](https://github.com/PaddlePaddle/Paddle/pull/62526),[#54183](https://github.com/PaddlePaddle/Paddle/pull/54183),[#64096](https://github.com/PaddlePaddle/Paddle/pull/64096),[#61325](https://github.com/PaddlePaddle/Paddle/pull/61325),[#60629](https://github.com/PaddlePaddle/Paddle/pull/60629),[#61051](https://github.com/PaddlePaddle/Paddle/pull/61051),[#62103](https://github.com/PaddlePaddle/Paddle/pull/62103),[#63594](https://github.com/PaddlePaddle/Paddle/pull/63594),[#60968](https://github.com/PaddlePaddle/Paddle/pull/60968),[#64613](https://github.com/PaddlePaddle/Paddle/pull/64613),[#64073](https://github.com/PaddlePaddle/Paddle/pull/64073),[#63816](https://github.com/PaddlePaddle/Paddle/pull/63816),[#64416](https://github.com/PaddlePaddle/Paddle/pull/64416),[#62499](https://github.com/PaddlePaddle/Paddle/pull/62499),[#64531](https://github.com/PaddlePaddle/Paddle/pull/64531),[#63827](https://github.com/PaddlePaddle/Paddle/pull/63827),[#59885](https://github.com/PaddlePaddle/Paddle/pull/59885),[#59949](https://github.com/PaddlePaddle/Paddle/pull/59949),[#63428](https://github.com/PaddlePaddle/Paddle/pull/63428),[#63218](https://github.com/PaddlePaddle/Paddle/pull/63218),[#63538](https://github.com/PaddlePaddle/Paddle/pull/63538),[#64497](https://github.com/PaddlePaddle/Paddle/pull/64497),[#63082](https://github.com/PaddlePaddle/Paddle/pull/63082),[#64395](https://github.com/PaddlePaddle/Paddle/pull/64395),[#60183](https://github.com/PaddlePaddle/Paddle/pull/60183),[#63691](https://github.com/PaddlePaddle/Paddle/pull/63691),[#64428](https://github.com/PaddlePaddle/Paddle/pull/64428),[#64648](https://github.com/PaddlePaddle/Paddle/pull/64648),[#64650](https://github.com/PaddlePaddle/Paddle/pull/64650),[#59926](https://github.com/PaddlePaddle/Paddl
e/pull/59926),[#59750](https://github.com/PaddlePaddle/Paddle/pull/59750),[#60080](https://github.com/PaddlePaddle/Paddle/pull/60080),[#60208](https://github.com/PaddlePaddle/Paddle/pull/60208),[#64124](https://github.com/PaddlePaddle/Paddle/pull/64124),[#64187](https://github.com/PaddlePaddle/Paddle/pull/64187),[#64166](https://github.com/PaddlePaddle/Paddle/pull/64166),[#64284](https://github.com/PaddlePaddle/Paddle/pull/64284),[#64253](https://github.com/PaddlePaddle/Paddle/pull/64253),[#64555](https://github.com/PaddlePaddle/Paddle/pull/64555),[#59878](https://github.com/PaddlePaddle/Paddle/pull/59878),[#64081](https://github.com/PaddlePaddle/Paddle/pull/64081)