八. 实战：CUDA-BEVFusion部署分析-forward

最新推荐文章于 2026-06-26 16:34:51 发布

原创最新推荐文章于 2026-06-26 16:34:51 发布 · 2.4k 阅读

36 ·

本内容遵循CC 4.0 BY-SA版权协议

本博客上原创文章未经本人许可，不得用于商业用途。转载请注明出处，否则保留追究法律责任的权利

GEO检测

收录于

CUDA与TensorRT部署实战课程

AudioSeal 音频水印系统

语音合成

PyTorch

Cuda

**AudioSeal** 是 Meta 开源的语音水印系统，用于 AI 生成音频的检测和溯源。

前言

自动驾驶之心推出的《CUDA与TensorRT部署实战课程》，链接。记录下个人学习笔记，仅供自己参考

本次课程我们来学习下课程第八章—实战：CUDA-BEVFusion部署分析，一起来学习 CUDA-BEVFusion 的 forward 过程

课程大纲可以看下面的思维导图

在这里插入图片描述

0. 简述

这节课跟大家讲 CUDA-BEVFusion 代码分析的第三部分，也就是 forward 部分，那 forward 部分内容比较多，这边分两个部分去讲，先给大家讲 camera 的 forward 部分，然后讲剩下的 forward 部分

下面我们开始本次课程的学习🤗

1. 案例运行

在八. 实战：CUDA-BEVFusion部署分析-环境搭建文章中博主有详细介绍 CUDA-BEVFusion 的环境配置，这里就不再赘述了，不过九个月过去了 CUDA-BEVFusion 有了一些更新（截止到 2024/9/22），主要有以下几个点需要大家注意：

libprotobuf-dev==3.6.1 条件去除，目前要求 protobuf >= 3.12.0 即可，在 CUDA-BEVFusion/src/onnx/onnx-ml.pb.h 头文件有提到
layernorm plugin 支持，在 head model 替换为 head.bbox.layernormplugin.plan 后 bbox 回归分类准确率大大提升，对比图如下
TensorRT-10.x 支持
spconv-1.1.10 版本发布
CUDA-11.x 编译选项设置为 -std=c++14，CUDA-12.x 编译选项设置为 -std=c++17

Note：这里韩军老师分析的代码是 2023 年 12 月中旬的代码，调试工具使用的是 vscode

head.bbox.plan 的推理结果如下：

在这里插入图片描述

head.bbox.layernormplugin.plan 的推理结果如下：

在这里插入图片描述

2. forward

前面我们分析了 CUDA-BEVFusion 的初始化、坐标转换以及 BEVPool 的预计算部分，这里我们在 main 函数中接着往下看：

在这里插入图片描述

在 main 函数中我们会调用 core->forward 来执行 BEVFusion TRT 的前向推理，依次调度模型中各个部分的 forward，最终得到 3D bbox

在这里插入图片描述

可以看到这里有两个入口函数，一个是 forward_timer 有记录推理时间，另一个是单纯推理的 forward_only，那实现上两个没有什么区别，我们以 forward_timer 为例来讲解

在这里插入图片描述

forward_timer 函数中主要实现了以下几个 forward：

1. normalizer_
- 将环视图像进行预处理，包括 bilinear + normalization + NHWC2NCHW，并将多个 camera 的数据合并在一起
- 输入的环视图像 shape 是 6x3x1600x900，分别代表 number x channel x width x height
- output 是 FP32，shape 是 1x6x3x256x704，分别代表 batch x number x channel x height x width
- normalizer_ 完成后的 tensor 可以作为 camera backbone 的输入
- CUDA 预处理
2. lidar_scn_.voxelization_
- 将点云转为 voxel feature
- 输入的点云 shape 是 242180x5，分别代表 number x feature
- 输出是 FP16，shape 是 1440x1440x40，分别代表 width x height x channel
- voxelization 部分建议先看 CUDA-CenterPoint/src/preprocess_kernel.cu 的代码
- CUDA 预处理
3. lidar_scn_.nvative_scn_
- 将 voxelization 的 feature 通过 spconv 进行特征提取
- 输入的 shape 是 1440x1440x40，分别代表 width x height x channel
- 输出是 FP16，shape 是 1x256x180x180，其中 180x180 是 BEV Grid 的大小，每一个 Grid 上有 256 维的特征
- CUDA 加速
4. camera_depth_
- 将 LiDAR 点云的信息投影到各个 camera 的坐标系上，camera 上每一个点的坐标表示的是 distance
- 输出是 FP16，shape 是 1x6x1x256x704，分别代表 batch x number x channel x height x width？
- CUDA 预处理
5. camera_backbone
- 提取环视图像特征，直接调用 engine 的 context 里的 enqueueV2
- TensorRT DNN 推理
6. camera_bevpool_
- 将 6 个 camera 的 feature 和 depth feature 融合得到 camera bev feature
- 输出是 FP16
- CUDA 加速
7. camera_vtransform_
- 将 BEVPool 后的 camera bev feature map 通过几个 convolution 进行特征提取，也是直接调用的 enqueueV2
- 输出是 FP16
- TensorRT DNN 推理
8. transfusion_
- 将 camera bev feature 和 lidar bev feature 融合，也是直接调用的 enqueueV2
- 输出是 FP16
- TensorRT DNN 推理
9. transbbox_
- 将 Fusion 后的 bev feature 通过 head 模块得到输出结果，并 decode 拿到 3Dbbox
- head 部分的输出是 6 个
  - height：[dim, 1, 200]，3D 目标框的高度即 z 方向上的大小
  - dim：[dim, 3, 200]，3D 目标框的中心点坐标即 center_x, center_y, center_z
  - rot：[dim, 2, 200]，3D 目标框的 rotation 即 sin, cos
  - reg：[dim, 2, 200]，3D 目标框的长宽即 x, y 方向上的大小
  - vel：[dim, 2, 200]，3D 目标框的速度即 vx, vy
  - score：[dim, 10, 200], 3D 目标框的类别置信度
- TensorRT DNN 推理 + CUDA 后处理

我们对照着 BEVFusion 的网络结构图来看会更加的清晰：

在这里插入图片描述

2.1 camera-normalization.cu

我们先看 normalizer_ 的 forward 部分：

在这里插入图片描述

normalizer_ 的 forward 部分看起来会比较麻烦，因为它里面涉及到一些函数模板、函数指针，大家如果对 c++ 中函数模板不是很熟悉的话看起来可能会比较难理解，大家可以自己结合诸如 ChatGPT 的工具慢慢分析

forward 的主要工作流程如下：(from ChatGPT)

通过 param_.interpolation 和 param_.method.type 计算函数指针数据 func_list 中的索引 index，从而获取对应的 normalize_to_planar_kernel_function
调用 cudaMemcpyAsync 将图像从 Host 传输到 Device 上，每个相机的图像大小为 bytes_imaes，将所有相机的图像依次传输到 raw_images_ 中
使用 cuda_2d_launch 进行 CUDA kernel 的并行调度，核心的预处理工作就是在这个 kernel 中完成的，这个函数会根据输入的图像和参数调用前面选择的 normalize_to_planar_kernel_function 进行归一化处理
最终，函数返回处理后的图像数据，数据类型为 nvtype::half*

其中 normalize_to_planar_kernel_function 是一个指向特定 CUDA kernel 函数的指针，用来处理不同的归一化方法，其定义如下：

typedef void (*normalize_to_planar_kernel_fn)(
    int nx, int ny, int nz, 
    float sx, float sy, 
    int crop_x_, int crop_y_, 
    uchar3* imgs, 
    int image_width, int image_height, 
    void* output, 
    NormMethod method);

此外还有一些宏定义使用到，如下所示：

#define DefineNormType(...)                                                                                               \
  normalize_to_planar_kernel<NormType::Nothing, __VA_ARGS__>, normalize_to_planar_kernel<NormType::MeanStd, __VA_ARGS__>, \
      normalize_to_planar_kernel<NormType::AlphaBeta, __VA_ARGS__>,

#define DefineInterpolation(...) \
  DefineNormType(Interpolation::Nearest, __VA_ARGS__) DefineNormType(Interpolation::Bilinear, __VA_ARGS__)

#define DefineDataType DefineInterpolation(half)

#define DefineAllFunction DefineDataType

这些宏用来生成函数模板的组合，目的是在编译时根据不同的归一化方法和插值方法创建对应的 CUDA kernel 实现

DefineNormType：接收可变参数 __VA_ARGS__，根据不同的 NormType 生成函数模板，__VA_ARGS__ 指的是插值方式
DefineInterpolation：嵌套使用 DefineNormType，它根据 Interpolation 的不同值（Nearest 和 Bilinear）生成不同的函数模板
DefineDataType：进一步嵌套，指定数据类型为 half
DefineAllFunction：最终调用前面所有的宏定义，创建所有可能组合的函数

最后，这些函数模板组合会形成一个函数指针数组 func_list，用于在运行时选择合适的 CUDA kernel

static const normalize_to_planar_kernel_fn func_list[] = {DefineAllFunction nullptr};

这里的 func_list 是一个 函数指针数组，包含了所有可能的 normalize_to_planar_kernel 函数的实例，基于不同的 NormType 和 Interpolation 组合，nullptr 是数组的最后一项，表示结束。

forward 函数中的 cuda_2d_launch 是一个宏，和上节课分析的 cuda_linear_launch 类似，它是 gridDim 和 blockDim 都是二维的 kernel 调度，blockDim 固定为 (32, 32)，其定义如下：

#define cuda_2d_launch(kernel, stream, nx, ny, ...)                                \
  do {                                                                             \
    dim3 __threads__(32, 32);                                                      \
    dim3 __blocks__(divup(nx, 32), divup(ny, 32));                                 \
    kernel<<<__blocks__, __threads__, 0, stream>>>(nx, ny, __VA_ARGS__);           \
    nv::check_runtime(cudaPeekAtLastError(), #kernel, __LINE__, __FILE__);         \
    nv::check_runtime(cudaStreamSynchronize(stream), #kernel, __LINE__, __FILE__); \
  } while (false)
#else  // CUDA_DEBUG
#define cuda_linear_launch(kernel, stream, num, ...)                                \
  do {                                                                              \
    size_t __num__ = (size_t)(num);                                                 \
    size_t __blocks__ = divup(__num__, LINEAR_LAUNCH_THREADS);                      \
    kernel<<<__blocks__, LINEAR_LAUNCH_THREADS, 0, stream>>>(__num__, __VA_ARGS__); \
    nv::check_runtime(cudaPeekAtLastError(), #kernel, __LINE__, __FILE__);          \
  } while (false)

#define cuda_2d_launch(kernel, stream, nx, ny, nz, ...)                      \
  do {                                                                       \
    dim3 __threads__(32, 32);                                                \
    dim3 __blocks__(divup(nx, 32), divup(ny, 32), nz);                       \
    kernel<<<__blocks__, __threads__, 0, stream>>>(nx, ny, nz, __VA_ARGS__); \
    nv::check_runtime(cudaPeekAtLastError(), #kernel, __LINE__, __FILE__);   \
  } while (false)
#endif  // CUDA_DEBUG

通过函数指针我们实际调用的是 normalize_to_planar_kernel，其实现如下：

template <NormType norm_type, Interpolation interpolation, typename OutputType>
static __global__ void normalize_to_planar_kernel(int nx, int ny, int nz, float sx, float sy, int crop_x, int crop_y,
                                                  uchar3* imgs, int image_width, int image_height, void* output,
                                                  NormMethod method) {
  int ix = cuda_2d_x;
  int iy = cuda_2d_y;
  if (ix >= nx || iy >= ny) return;

  int icamera = blockIdx.z;
  uchar3* img = imgs + image_width * image_height * icamera;

  // 这个kernel会对不同的interpolation进行不同方式的resize
  uchar3 pixel = load_pixel<interpolation>(img, ix, iy, crop_x, crop_y, sx, sy, image_width, image_height);

  if (method.channel_type == ChannelType::Invert) {
    unsigned char t = pixel.z;
    pixel.z = pixel.x;
    pixel.x = t;
  }

  // 这个kernel会对不同的normalization方式进行不同方式的normalization
  half3 normed = normalize_value<norm_type>(pixel, method);

  // 这个kernel将不同的camera的数据汇总在一起，并实现NHWC2NCHW
  store_output<OutputType>(normed, output, icamera, ix, iy, nx, ny);
}

这个 device 函数主要调用三个 kernel 核函数完成图像的预处理工作，分别是 load_pixel、normalize_value 以及 store_output，下面我们一个个来分析

template <Interpolation interpolation>
static __device__ uchar3 load_pixel(const uchar3* image, int x, int y, int tix, int tiy, float sx, float sy, int width,
                                    int height);

load_pixel 是一个 kernel 函数模板，用于根据不同的插值方式获取对应像素值，支持最近邻和双线性两种方式：

template <>
__device__ uchar3 load_pixel<Interpolation::Nearest>(const uchar3* image, int x, int y, int tox, int toy, float sx, float sy,
                                                     int width, int height) {
  // In some cases, the floating point precision will lead to miscalculation of the value,
  // making the result not exactly match with opencv,
  // so here you need to add eps as precision compensation
  //
  // A special case is when the input is 3840 and the output is 446, x = 223:
  // const int src_x_double = 223.0  * (3840.0  / 446.0);            // -> 1920
  // const int src_x_float  = 223.0f * (3840.0f / 446.0f);           // -> 1919
  // const int src_x_float  = 223.0f * (3840.0f / 446.0f) + 1e-5;    // -> 1920
  //
  // !!! If you want to use the double for sx/sy, you'll get a 2x speed drop
  const float eps = 1e-5;
  int ix = (x + tox) * sx + eps;
  int iy = (y + toy) * sy + eps;
  return image[iy * width + ix];
}

最近邻插值的实现较为简单，直接将目标位置 (x, y) 按照缩放系数 sx 和 sy 计算原始图像中的对应像素位置 (ix, iy)，并返回对应的像素值

template <>
__device__ uchar3 load_pixel<Interpolation::Bilinear>(const uchar3* image, int x, int y, int tox, int toy, float sx, float sy,
                                                      int width, int height) {
  uchar3 rgb[4];
  float src_x = (x + tox + 0.5f) * sx - 0.5f;
  float src_y = (y + toy + 0.5f) * sy - 0.5f;
  int y_low = floorf(src_y);
  int x_low = floorf(src_x);
  int y_high = limit(y_low + 1, 0, height - 1);
  int x_high = limit(x_low + 1, 0, width - 1);
  y_low = limit(y_low, 0, height - 1);
  x_low = limit(x_low, 0, width - 1);

  // 计算插值权重
  int ly = rint((src_y - y_low) * INTER_RESIZE_COEF_SCALE);
  int lx = rint((src_x - x_low) * INTER_RESIZE_COEF_SCALE);
  int hy = INTER_RESIZE_COEF_SCALE - ly;
  int hx = INTER_RESIZE_COEF_SCALE - lx;

  // 获取四个邻近像素
  rgb[0] = image[y_low * width + x_low];
  rgb[1] = image[y_low * width + x_high];
  rgb[2] = image[y_high * width + x_low];
  rgb[3] = image[y_high * width + x_high];

  // 按照双线性插值的权重计算最终像素值
  uchar3 output;
  output.x = (((hy * ((hx * rgb[0].x + lx * rgb[1].x) >> 4)) >> 16) + ((ly * ((hx * rgb[2].x + lx * rgb[3].x) >> 4)) >> 16) + 2) >> 2;
  output.y = (((hy * ((hx * rgb[0].y + lx * rgb[1].y) >> 4)) >> 16) + ((ly * ((hx * rgb[2].y + lx * rgb[3].y) >> 4)) >> 16) + 2) >> 2;
  output.z = (((hy * ((hx * rgb[0].z + lx * rgb[1].z) >> 4)) >> 16) + ((ly * ((hx * rgb[2].z + lx * rgb[3].z) >> 4)) >> 16) + 2) >> 2;
  return output;
}

双线性插值方法会根据周围四个像素的值以及它们与目标位置的距离进行加权计算，得到插值后的像素值。

normalize_value 也是一个 kernel 函数模板，用于根据 NormType 对像素值进行不同的归一化操作，支持以下三种归一化方法：

template <>
__device__ half3 normalize_value<NormType::Nothing>(const uchar3& pixel, const NormMethod& method) {
  return half3(pixel.x, pixel.y, pixel.z);
}

NormType::Nothing 这种方法不做任何归一化处理，直接返回原始像素值

template <>
__device__ half3 normalize_value<NormType::AlphaBeta>(const uchar3& pixel, const NormMethod& method) {
  return half3(pixel.x * method.alpha + method.beta, pixel.y * method.alpha + method.beta, pixel.z * method.alpha + method.beta);
}

NormType::AlphaBeta 使用 Alpha 和 Beta 进行线性归一化，常用于简单的缩放和平移操作

template <>
__device__ half3 normalize_value<NormType::MeanStd>(const uchar3& pixel, const NormMethod& method) {
  return half3((pixel.x * method.alpha - method.mean[0]) / method.std[0] + method.beta,
               (pixel.y * method.alpha - method.mean[1]) / method.std[1] + method.beta,
               (pixel.z * method.alpha - method.mean[2]) / method.std[2] + method.beta);
}

NormType::MeanStd 使用均值和标准差进行归一化

store_output 同样是函数模板，它将处理后的像素值存储到输出缓存区中，并将 NHWC 格式转换为 NCHW 格式：

template <>
__device__ void store_output<half>(const half3& normed, void* output, int icamera, int ix, int iy, int nx, int ny) {
  half* planar_pointer = (half*)output + icamera * ny * nx * 3;
  planar_pointer[(0 * ny + iy) * nx + ix] = normed.x;
  planar_pointer[(1 * ny + iy) * nx + ix] = normed.y;
  planar_pointer[(2 * ny + iy) * nx + ix] = normed.z;
}

总的来说，整个 normalize_to_planar_kernel 是一个高度优化的 CUDA kernel，主要功能是对输入的相机图像数据进行插值、归一化以及格式转换。整个流程包含以下步骤：

1. 插值处理：根据指定的插值方法（最近邻或双线性插值）对图像进行尺寸缩放
2. 通道处理：对像素通道进行可能的转换（例如 BGR 转换为 RGB）
3. 归一化处理：根据指定的归一化方法（无归一化、Alpha-Beta 或均值标准差归一化）对像素值进行归一化操作
4. 格式转换和存储：将归一化后的像素数据按 NCHW 格式存储到输出缓冲区中，以便后续深度学习模型使用

这种高度模块化的设计，通过模板和函数特化，可以在编译时根据不同的插值和归一化方法生成不同的代码，从而在不同场景下提供最优的性能。

OK，以上就是 normalizer_ 的 forward 过程

2.2 lidar-scn.cpp

下面我们来看 lidar_scn_ 的 forward：

在这里插入图片描述

SCN 的 forward 有两部分，一个是 voxelization 的 forward，一个是 native_scn 的 forward，由于 native_scn 的 forward 没有开源，我们这里只看 voxelization 的 forward

在这里插入图片描述

我们在看 voxelization 体素化的 forward 之前可以先看看 CenterPoint 中的体素化处理代码，这部分代码展开来讲可能太长了，后面我们有时间再跟大家去讲，这里我们先跳过

Note：voxelization 体素化处理代码的分析后续韩军老师也没有填坑

2.3 camera_depth.cu

下面我们接着来看 camera_depth_ 的 forward 部分：

在这里插入图片描述

depth 的 forward 部分实现的是 lidar 点到 camera 的投影，这里面有一些加速的小技巧，我们一起来看下。这里最终调用的是 compute_depth_kernel 这个核函数，我们一起来看下：

static __global__ void compute_depth_kernel(unsigned int num_points, const half* points, const float4* img_aug_matrix,
                                            const float4* lidar2image, unsigned int points_channel, int num_camera,
                                            unsigned int image_width, unsigned int image_height, half* depth_out) {
  int tid = cuda_linear_index;
  if (tid >= num_points) return;

  half3 point_half = *(const half3*)(&points[points_channel * tid]);
  float3 point = make_float3(point_half.x, point_half.y, point_half.z);

  for (int icamera = 0; icamera < num_camera; ++icamera) {
    float dist = clampf(project(lidar2image[4 * icamera + 2], point), 1e-5, 1e5);

    float3 projed = make_float3(project(lidar2image[4 * icamera + 0], point) / dist,
                                project(lidar2image[4 * icamera + 1], point) / dist, dist);

    float x = project(img_aug_matrix[4 * icamera + 0], projed);
    float y = project(img_aug_matrix[4 * icamera + 1], projed);

    // Here you must use the float type to determine the range. For int(-0.5), its value is 0.
    if (x >= 0 && x < image_width && y >= 0 && y < image_height) {
      int ix = static_cast<int>(x);
      int iy = static_cast<int>(y);
      depth_out[(icamera * image_height + iy) * image_width + ix] = __float2half(dist);
    }
  }
}

这个 kerenl 负责 lidar->camera 的坐标转换，其中涉及到了 lidar->camera 的变换矩阵（4x4）以及 camera 畸变内参（4x4）。我们上节课程有讲过 nuScenes 数据集中有各个传感器的 calibrator 信息（包含 rotation 和 translation）以及传感器坐标系到 global 坐标系的映射，由于每个传感器接收数据的频率不同，所以 lidar->camera 的坐标系转换其实按照步骤分的话应该是：

lidar 坐标系 -> ego 坐标系
ego 坐标系 -> global 坐标系
global 坐标系 -> ego 坐标系
ego 坐标系 -> camera 坐标系
camera 坐标系 -> image 坐标系

每一个坐标系的转换其实都是可以用包含 rotation 和 translation 的 4x4 矩阵表示，如果把上述的所有转换合并为一个矩阵的话，其实就可以直接得到以下变换矩阵：

lidar 坐标系 -> image 坐标系

这样一来，对每一个 lidar 点的 (x, y, z) 坐标乘以这个变换矩阵，就可以得到 lidar 点云在各个 camera 的投影了，具体的计算并没有什么特殊的，单纯的利用 cuda 多线程加速，输出保存为 half

下面我们简单分析下代码的各个部分：(from ChatGPT)

__global__ void compute_depth_kernel(unsigned int num_points, const half* points, const float4* img_aug_matrix,
                                     const float4* lidar2image, unsigned int points_channel, int num_camera,
                                     unsigned int image_width, unsigned int image_height, half* depth_out)

参数解释：

num_points：LiDAR 点云中的点数量
points：点云数据，存储在 half 类型的数组中，每个点包含三维坐标（X, Y, Z）
img_aug_matrix：相机的畸变内参矩阵
lidar2image：LiDAR 坐标系到相机图像坐标系的变换矩阵（4x4矩阵），每个相机有四个 float4 类型的向量（矩阵的每一行），用于将 LiDAR 坐标转换为相机图像坐标
points_channel：每个点的通道数（一般为 3，即 x, y, z 三个坐标）
num_camera：相机数量，多个相机可以用于多视角图像
image_width，image_height：相机图像的宽度和高度，用于检测投影后的点是否在图像的有效范围内
depth_out：输出的深度图，存储每个点投影到图像平面后的深度值

1. 线程索引计算和点数据读取

int tid = cuda_linear_index;
if (tid >= num_points) return;

half3 point_half = *(const half3*)(&points[points_channel * tid]);
float3 point = make_float3(point_half.x, point_half.y, point_half.z);

tid 是当前 CUDA 线程的线性索引，用来处理 LiDAR 点云中的第 tid 个点。cuda_linear_index 用于获取线程的线性索引
如果 tid 超过了点云的总数 num_points，则直接返回，防止越界
从 points 数组中读取第 tid 个点的坐标（x, y, z）。这里的 half3 表示三维点的半精度浮点数，而后通过 make_float3 转换为 float3 类型，方便后续的计算

2. 投影计算（LiDAR 坐标到相机图像坐标）

for (int icamera = 0; icamera < num_camera; ++icamera) {
  float dist = clampf(project(lidar2image[4 * icamera + 2], point), 1e-5, 1e5);

  float3 projed = make_float3(project(lidar2image[4 * icamera + 0], point) / dist,
                              project(lidar2image[4 * icamera + 1], point) / dist, dist);

这个循环对每个相机进行操作，遍历所有相机 icamera
project(lidar2image[4 * icamera + 2], point)：通过变换矩阵的第三行将 3D 点 point 投影到相机图像的深度平面上，计算出深度值 dist
clampf(dist, 1e-5, 1e5)：将深度限制在一定的范围内，避免数值过小（接近零）或过大导致的数值误差
project(lidar2image[4 * icamera + 0], point)：使用变换矩阵的第一行计算点在图像平面的 x 坐标，除以深度 dist 进行透视投影（标准的齐次坐标变换）
project(lidar2image[4 * icamera + 1], point)：使用变换矩阵的第二行计算点在图像平面的 y 坐标，同样除以深度 dist 进行透视投影
make_float3(...)：构建一个新的三维向量 projed，其中包含了投影后的 x, y 图像坐标和深度 `dist

其中的 project 函数具体实现如下：

static __forceinline__ __device__ float project(float4 T, float3 p) { 
    return T.x * p.x + T.y * p.y + T.z * p.z + T.w; 
}

project 函数的作用是将 3D 点 p（即 float3 类型的点）通过 4x4 齐次变换矩阵的一行（float4 类型）进行投影计算。该函数实现了一个标准的齐次坐标转换，通过与 float4 向量的点积（包括一个偏移量 T.w）来计算投影结果

Note：对于这种调用次数多的函数，可以直接使用 inline 内联展开，在编译的时候就展开到 caller端，避免了 call overhead。被调用的函数方叫做 callee，调用方叫做 caller，inline explansion 就是把 callee 的部分展开到 caller 的地方，从而取消了 call function 的这个 overhead

3. 图像畸变矩阵处理

float x = project(img_aug_matrix[4 * icamera + 0], projed);
float y = project(img_aug_matrix[4 * icamera + 1], projed);

使用相机的畸变矩阵 img_aug_matrix 对已经投影的坐标 projed 进一步处理
project(img_aug_matrix[4 * icamera + 0], projed)：通过畸变矩阵的第一行对 projed 进行变换，得到图像中的 x 坐标
project(img_aug_matrix[4 * icamera + 1], projed)：通过畸变矩阵的第二行对 projed 进行变换，得到图像中的 y 坐标。

4. 投影后的深度存储

if (x >= 0 && x < image_width && y >= 0 && y < image_height) {
  int ix = static_cast<int>(x);
  int iy = static_cast<int>(y);
  depth_out[(icamera * image_height + iy) * image_width + ix] = __float2half(dist);
}

如果投影后的图像坐标 x, y 落在图像的有效范围内（即 0 <= x < image_width 和 0 <= y < image_height），则将该点的深度值存储到对应的深度图中
depth_out 是一个二维数组，用来存储每个像素位置上的深度值。(icamera * image_height + iy) * image_width + ix 计算出该像素在深度图中的线性索引
__float2half(dist)：将深度值 dist 转换为半精度浮点数 half 类型，然后存储到深度图中

OK，以上就是 camera depth 的 forward 过程

2.4 camera_backbone.cu

下面我们接着来看 camera_backbone 的 forward 部分：

在这里插入图片描述

camera backbone 就是一个 DNN，因此这个 forward 直接调用的是 TensorRT engine 的 forward，内部就是一个 enqueueV2，需要注意的是这个 DNN 输入和输出都是两个，输入一个是 camera，一个是 LiDAR 到 camera 的 depth map，输出一个是 camera feature（32 channel），另一个是 depth feature（118 channel）

2.5 camera_bevpool.cu

下面我们接着来看 camera_bevpool_ 的 forward 部分：

在这里插入图片描述

BEVPool 的 forward 部分负责将 camera_feature 和 depth_weights 进行汇总得到的 (N,C,D,H,W) 的点的信息投影到 BEV 空间上 (N,C,BEVGrid_X_size,BEVGrid_Y_size)，对应的具体维度是：

camera_lidar_features：(N,C,D,H,W) = (6,80,118,32,88) = 1,993,728 * 80
BEVGrid：(1,C,BEVGrid_X_size,BEVGrid_Y_size) = (1,80,360,360) = 129,600 * 80

这部分的计算量相当大，所以 CUDA-BEVFusion 中采用了一些技巧来加速，包括：

Precomputation
Interval Reduction

其中 Precomputation 预计算是体现在 camera_geometry 的 update 部分，在前面我们有简单分析过，大家感兴趣的可以看看：八. 实战：CUDA-BEVFusion部署分析-coordTrans Precomputation

而 Interval Reduction 体现在 bevpool_half_pack10_kernel 这个核函数里面，在进行规约计算时我们需要将同一个 grid 中的 camera feature 乘上对应的 depth weights 最后累加得到 output feature，其实就是 bev 的 feature map

这里的 pack10 表示的是一个 thread 会处理 80 个 camera_feature_map 中的 10 个，同时也是 tile size，充分体现 CUDA 编程中的并行性

下面我们来简单分析下这个核函数：(from ChatGPT)

__global__ void bevpool_half_pack10_kernel(const half* camera_feature, const half* depth_weights, unsigned int nchannel,
                                           const int3* intervals, unsigned int n_intervals, const unsigned int* indices,
                                           unsigned int out_h, unsigned int out_w, unsigned int ndepth, unsigned int farea,
                                           half* output_bevfeat)

camera_feature：输入的相机特征图，包含了多相机的特征信息，形状为 (N,C,D,H,W)
depth_weights：每个点的深度权重，表示该点在不同深度平面上的重要性（概率分布）
nchannel：通道数，即每个特征图上的通道数
intervals：用于表示投影到 BEV 空间的区间，形状为 (num_intervals, 3)，每个区间的 x 和 y 表示点云投影到的范围，z 表示其对应的输出位置
indices：表示每个点在特征图中的索引，索引用于在相机特征图中寻找具体点
out_h，out_w：BEV 空间的高度和宽度
ndepth：深度维度
farea：特征图的面积，H * W
output_bevfeat：输出的 BEV 特征图，形状为 (N, C, BEVGrid_X_size, BEVGrid_Y_size)

1. 线程索引和区间查找：

int interval_index = blockIdx.y * blockDim.y + threadIdx.y;
int feature_block = threadIdx.x * tile_size;

if (interval_index >= n_intervals) return;
int3 interval = intervals[interval_index];
half accumulate[tile_size] = {0.};

计算当前线程要处理的 interval 索引，如果超出范围则直接返回
每个线程会处理特定数量的 tile_size 个特征通道（这里是 10 个通道），因此需要初始化一个 accumulate[tile_size] 数组用于存储累加结果

2. 遍历当前 interval 的所有点：

for (int i = interval.x; i < interval.y; i++) {
    int indice = indices[i];
    int camera_index = indice / (ndepth * farea);
    int fm_inner_index = indice % farea;
    half depth_weight = depth_weights[indice];
    unsigned int camera_feature_offset = (camera_index * farea + fm_inner_index) * nchannel + feature_block;
    combined_half feature = *(combined_half*)(camera_feature + camera_feature_offset);

interval.x 和 interval.y 分别表示该区间的起始和结束索引
使用 indices 查找当前点对应的 indice，并计算该点在相机特征图中的位置（camera_index 和 fm_inner_index）
通过 depth_weights[indice] 获取当前点的深度权重，表示该点在不同深度层的重要性
计算出相机特征图中该点的起始地址 camera_feature_offset，用于访问该点的 10 个通道的特征

3. 深度加权累加：

for (int j = 0; j < tile_size; j++) {
    accumulate[j] = __hfma(((half*)&feature)[j], depth_weight, accumulate[j]);
}

对每个特征通道进行加权累加，使用 __hfma 指令，它表示 half 精度的乘法累加操作，即 accumulate[j] += feature[j] * depth_weight
这个累加过程可以看作是对同一区间内的所有点的特征加权求和，depth_weight 作为特征的重要性权重，用于调整每个点对最终特征图的贡献

4. 将累加结果存入 BEV 特征图：

for (int j = 0; j < tile_size; j++) {
    unsigned int output_offset = interval.z + (feature_block + j) * out_h * out_w;
    output_bevfeat[output_offset] = accumulate[j];
}

累加完成后，将结果存入 BEV 特征图 output_bevfeat 的对应位置
interval.z 表示当前区间在 BEV 空间中的位置，feature_block + j 计算当前通道的偏移量
output_offset 是 BEV 特征图中的存储地址，最终将累加后的结果存入该地址

关键技术点包括：

1. Interval Reduction：

该技术通过预先计算出的 intervals 和 indices 来加速特征累加，避免了大量的重复计算。每个 interval 表示一个空间区域的点集合，核函数只需要对这些点进行加权求和，而不需要重新计算几何变换
此外，每个线程块处理一个 interval，并利用 tile_size 分块处理通道上的特征，使得每个线程处理的工作量较小，增加了并行计算的效率

2. 深度加权：

通过 depth_weights 对特征进行加权累加，可以更好地利用深度信息，表示每个特征在不同深度层的重要性

3. Half 精度优化：

使用 half 可以显著减少内存使用和数据传输带宽，适合用于 GPU 中进行大量特征地累加操作
利用 __hfma（half-precision fused multiply-add）进一步加速计算，减少浮点数乘法和加法操作的误差

4. #pragma unroll：

for 循环展开加速

OK，以上就是 BEVPool 的 forward 过程

2.6 camera-vtrasform.cu

下面我们接着来看 camera_vtransform_ 的 forward 部分：

在这里插入图片描述

vtransform 就是三个 conv 的 DNN，它的 forward 就是一个 enqueueV2

2.7 transfusion.cu

下面我们接着看 transfusion_ 的 forward 部分：

在这里插入图片描述

fusion 本质上就是一个略显复杂的 DNN，因此它的 forward 也是调用的 enqueueV2，值得注意的是这个 DNN 输入是两个：

一个是通过 Vtransform 特征提取后得到的 camera bev feature
一个是通过 SCN 特征提取后得到的 lidar bev feature

fusion 模块通过几个卷积将两个 bev feature 合并输出得到融合后的 bev feature

2.8 head_transbbox.cu

最后我们来看下 transbbox_ 的 forward 部分：

在这里插入图片描述

transbbox_ 就是一个 head 模块，输入融合的 bev feature 输出 bbox 的信息，是一个单输入多输出的 DNN，forward 过程也是一个 enqueueV2

在 forward 之后会调用一个 decode_kernel 核函数进行 bbox 解码，它和 YOLO 这种 2D 的 decode 不太一样，它处理的是 3D 信息，其实现如下：

static __global__ void decode_kernel(unsigned int num, const half* reg, const half* height, const half* dim, const half* rot,
                                     const half* vel, const half* score, int num_class, TransBBoxParameter param,
                                     float confidence_threshold, BoundingBox* output, unsigned int* output_size,
                                     unsigned int max_output_size) {
  int ibox = cuda_linear_index;
  if (ibox >= num) return;

  int label = 0;
  float confidence = score[0 * num + ibox];

  for (int i = 1; i < num_class; ++i) {
    float local_score = score[i * num + ibox];
    if (local_score > confidence) {
      label = i;
      confidence = local_score;
    }
  }

  if (confidence < confidence_threshold) return;

  auto xs = __half2float(reg[0 * num + ibox]);
  auto ys = __half2float(reg[1 * num + ibox]);
  xs = xs * param.out_size_factor * param.voxel_size.x + param.pc_range.x;
  ys = ys * param.out_size_factor * param.voxel_size.y + param.pc_range.y;

  auto zs = __half2float(height[ibox]);
  if (xs < param.post_center_range_start.x || xs > param.post_center_range_end.x) return;
  if (ys < param.post_center_range_start.y || ys > param.post_center_range_end.y) return;

  float3 dim_;
  dim_.x = exp(__half2float(dim[0 * num + ibox]));
  dim_.y = exp(__half2float(dim[1 * num + ibox]));
  dim_.z = exp(__half2float(dim[2 * num + ibox]));
  zs = zs - dim_.z * 0.5f;

  if (zs < param.post_center_range_start.z || zs > param.post_center_range_end.z) return;

  unsigned int iout = atomicAdd(output_size, 1);
  if (iout >= max_output_size) return;

  auto& obox = output[iout];
  auto vx = __half2float(vel[0 * num + ibox]);
  auto vy = __half2float(vel[1 * num + ibox]);
  auto rs = atan2(__half2float(rot[0 * num + ibox]), __half2float(rot[1 * num + ibox]));

  *(float3*)&obox.position = make_float3(xs, ys, zs);
  *(float3*)&obox.size = dim_;
  obox.velocity.vx = vx;
  obox.velocity.vy = vy;
  obox.z_rotation = rs;
  obox.score = confidence;
  obox.id = label;
}

可以看到我们最终得到的是一个 obox，它包括 velocity、rotation、score 等信息

OK，以上就是 CUDA-BEVFusion 的 forward 过程

总结

本次课程韩君老师带我们简单过了一遍 CUDA-BEVFusion 中的 forward 过程，除了几个 DNN 的 forward 过程外，一些模块 forward 过程中的加速处理值得我们学习，例如 Voxelization 的加速，BEVPool 的加速等等，那当然具体的代码细节可能还是需要大家自己多分析分析，这边只是有个总览，给大家一个基本印象

OK，以上就是第 11 小节有关 CUDA-BEVFusion 中 forward 的全部内容了，同时也是第八章的最后一个内容，也是本课程的最后一个内容。接下来博主会把前面没看的几个视频补充完最后写一个课程总结整个系列课程就结束了😄