[Paper Notes 7] CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification (Reading Notes)

Latest recommended article published on 2026-03-16 08:15:00

Original


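This post looks at how to learn multi-scale image features in Transformer models. The paper proposes a dual-branch architecture that processes small-patch and large-patch tokens separately and fuses their information through a cross-attention module. Four fusion strategies are studied, with Cross-Attention Fusion performing best, and comparisons with DeiT, other Transformers, and CNN architectures show its advantages on image classification.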

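Series table of contents

Paper Reading Notes (1): DeepLabv3
Paper Reading Notes (2): STA hand gesture recognition
Paper Reading Notes (3): ST-GCN
Paper Reading Notes (5): Spectral Networks and Deep Locally Connected Networks on Graphs
Paper Reading Notes (6): GNN, Fast Localized Spectral Filtering
Paper Reading Notes (8): Semi-Supervised Classification with Graph Convolutional Networks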

Abstract

Goal

Inspired by this, in this paper, we study how to learn multi-scale feature representations in transformer models for image classification. To this end, we propose a dual-branch transformer to combine image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features.
In short, the paper addresses how to introduce multi-scale features into transformers.

Method

Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity and these tokens are then fused purely by attention multiple times to complement each other. In other words, small-patch and large-patch tokens go through two separate branches with different computational costs, and the tokens are then fused several times purely through attention so that the two branches complement each other.

Furthermore, to reduce computation, we develop a simple yet effective token fusion module based on cross attention. Our proposed cross-attention only requires linear time for both computational and memory complexity instead of quadratic time otherwise.
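The linear-versus-quadratic claim can be made concrete by counting entries in the attention score matrix. The sketch below is only an illustration: the image and patch sizes (240/12 and 224/16) are assumed example values, and the helper names are hypothetical. Full self-attention over n tokens needs an n × n score matrix, while a single query token attending to n keys needs only n scores.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def self_attention_scores(n_tokens: int) -> int:
    """Score-matrix entries when every token attends to every token."""
    return n_tokens * n_tokens


def single_query_scores(n_keys: int) -> int:
    """Score-matrix entries when one query token attends to n_keys keys."""
    return n_keys


# Illustrative sizes only (assumptions, not taken from the notes above).
small_branch = num_patch_tokens(240, 12)  # fine-grained patches -> 400 tokens
large_branch = num_patch_tokens(224, 16)  # coarse patches -> 196 tokens

print(small_branch, self_attention_scores(small_branch))  # 400 160000
print(small_branch, single_query_scores(small_branch))    # 400 400
```

Doubling the number of patch tokens quadruples the self-attention count but only doubles the single-query count, which is the scaling behaviour the quoted passage refers to.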

GitHub: link

I. Introduction

1. Related papers:

Octave convolutions
BigLittle Net
Some other transformer architectures.

The first two mainly deal with multi-scale information fusion, which is also the motivation for this paper.

2. Key features of this paper

Our approach processes small and large patch tokens with two separate branches of different computational complexities and these tokens are fused together multiple times to complement each other.
Two branches handle patch tokens of different sizes. The branches have different computational costs, and they exchange information multiple times.
We do so by an efficient cross-attention module, in which each transformer branch creates a non-patch token as an agent to exchange information with the other branch by attention.
The exchange is implemented with a cross-attention module.
This reduces computation and improves performance; the implementation details are covered later.
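As a rough illustration of the "non-patch token as an agent" idea, the NumPy sketch below lets the CLS token of one branch act as the only query over the patch tokens of the other branch. It is a simplified sketch under stated assumptions, not the paper's implementation: both branches are assumed to share the same embedding width `d` (so no projection between branches is shown), a single head is used, layer normalization is omitted, and the function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cls_cross_attention(cls_token, other_patches, w_q, w_k, w_v):
    """Fuse one branch's CLS token with the other branch's patch tokens.

    cls_token:     (1, d) CLS token of the current branch
    other_patches: (n, d) patch tokens of the other branch
    w_q, w_k, w_v: (d, d) projection matrices (assumed shapes)
    """
    # Keys and values come from the CLS token plus the other branch's patches.
    context = np.concatenate([cls_token, other_patches], axis=0)  # (n + 1, d)
    q = cls_token @ w_q                                           # (1, d)
    k = context @ w_k                                             # (n + 1, d)
    v = context @ w_v                                             # (n + 1, d)
    # One query row only, so the score matrix is (1, n + 1): linear in n.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    # Residual connection back onto the CLS token.
    return cls_token + weights @ v, weights


rng = np.random.default_rng(0)
d, n = 16, 9
cls = rng.normal(size=(1, d))
patches = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
fused_cls, attn = cls_cross_attention(cls, patches, w_q, w_k, w_v)
print(fused_cls.shape, attn.shape)  # (1, 16) (1, 10)
```

Because only the CLS token is updated, the patch tokens of both branches pass through the fusion step unchanged, and the cost grows with the number of patch tokens rather than with its square.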

II. Related Works

Three categories of related work

Convolutional neural networks with attention.
- SENet [18] uses channel-attention, CBAM [41] adds the spatial attention and ECANet [37] proposes an efficient channel attention to further improve SENet. There has also been a lot of interest in combining CNNs with different forms of self-attention.
- SASA [31] and SAN [48] deploy a local-attention layer to replace convolutional layer.
- LambdaNetwork introduces an efficient global attention to model both content and position-based interactions that considerably improves the speed-accuracy tradeoff of image classification models. BoTNet [32] replaced the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet resulting in models that achieve a strong performance for image classification.
- Unlike the methods above, the approach in this paper is not a hybrid of CNN and attention; it is a pure transformer.
Vision Transformers
Multi-Scale CNNs

III. Method

1. Vision Transformer


Multi-Head Attention:

The Multi-Head Attention structure is:
$\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$
The meaning of Q, K, and V is explained in the paper "Attention Is All You Need".
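The attention formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration without batching or masking; the function name is chosen here for readability and is not from the paper.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D inputs.

    q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v)
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                        # (n_q, n_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                                     # (n_q, d_v)


# With all-zero queries every key gets the same weight,
# so each output row is the mean of the value rows.
q = np.zeros((2, 4))
k = np.ones((3, 4))
v = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(scaled_dot_product_attention(q, k, v))  # both rows equal [3. 4.]
```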

$\begin{aligned} \operatorname{MultiHead}(Q, K, V) &=\operatorname{Concat}\left(\operatorname{head}_{1}, \ldots, \operatorname{head}_{h}\right) W^{O} \\ \text{where } \operatorname{head}_{i} &=\operatorname{Attention}\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \end{aligned}$
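A minimal NumPy sketch of the multi-head formula follows. It assumes 2-D inputs without batching, per-head projection matrices passed as Python lists, and the output projection $W^{O}$ of the original Transformer formulation; all function and argument names are illustrative.

```python
import numpy as np


def attention(q, k, v):
    """Scaled dot-product attention for 2-D inputs."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def multi_head_attention(x_q, x_k, x_v, w_q, w_k, w_v, w_o):
    """Concat(head_1, ..., head_h) W^O.

    w_q, w_k, w_v: lists of h matrices, each (d_model, d_head)
    w_o:           (h * d_head, d_model)
    """
    heads = [
        attention(x_q @ wq_i, x_k @ wk_i, x_v @ wv_i)  # head_i
        for wq_i, wk_i, wv_i in zip(w_q, w_k, w_v)
    ]
    return np.concatenate(heads, axis=-1) @ w_o


rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2
d_head = d_model // h
x = rng.normal(size=(n, d_model))
w_q = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
w_k = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
w_v = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
w_o = rng.normal(size=(h * d_head, d_model))
print(multi_head_attention(x, x, x, w_q, w_k, w_v, w_o).shape)  # (5, 8)
```

Passing the same `x` for queries, keys, and values gives self-attention; passing different inputs for the query and the key/value sides gives cross-attention.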