BEVCooper: Accurate and Communication-Efficient Bird’s-Eye-View Perception in Vehicular Networks
Abstract
Bird’s-Eye-View (BEV) perception is critical to connected and automated vehicles (CAVs) because it provides a unified and precise representation of vehicular surroundings. However, the quality of raw sensing data may degrade in occluded or distant regions, undermining the fidelity of the constructed BEV map. In this paper, we propose BEVCooper, a novel collaborative perception framework that can guarantee accurate and low-latency BEV map construction. We first define an effective metric to evaluate the utility of BEV features from neighboring CAVs. Based on this metric, we develop an online learning-based collaborative CAV selection strategy that captures the ever-changing BEV feature utility of neighboring vehicles, enabling the ego CAV to prioritize the most valuable sources under bandwidth-constrained vehicle-to-vehicle (V2V) links. Furthermore, we design an adaptive fusion mechanism that optimizes BEV feature compression based on the environment dynamics and real-time V2V channel quality, effectively balancing feature transmission latency against the accuracy of the constructed BEV map. Theoretical analysis demonstrates that BEVCooper achieves asymptotically optimal CAV selection and adaptive feature fusion under dynamic vehicular topology and V2V channel conditions. Extensive experiments on a real-world testbed show that, compared with state-of-the-art benchmarks, the proposed BEVCooper enhances BEV perception accuracy by up to and reduces end-to-end latency by , with only additional computational overhead.
I Introduction
The market penetration rate of connected and automated vehicles (CAVs) equipped with exterior high-end cameras is experiencing rapid growth [10787093, vanet3, survey1]. These multi-perspective cameras enable CAVs to construct bird’s-eye-view (BEV) maps, thereby generating unified, accurate representations of their surroundings [bevfusion, bevsurvey1, bevsurvey2]. However, camera sensing performance degrades significantly under obstruction or at long distances, compromising the fidelity of the constructed BEV map. Stand-alone perception, which synthesizes a BEV map based on sensing data exclusively from a single CAV, is therefore insufficient for obtaining an accurate BEV representation [10689455, pacp, luo2025improving]. As a result, collaborative BEV perception, which leverages sensing data from multiple CAVs, has garnered significant attention [huang2023v2x, pradhan2024copilot, 10228934]. As illustrated in Figure 1, by allowing an ego CAV to request perception messages from its neighboring collaborative CAVs, collaborative BEV perception enables more accurate BEV map construction and supports safe driving decisions [cobevt].
According to the stage at which transmitted perception messages are incorporated into the BEV map construction process, collaborative BEV perception can be classified into three levels: raw-data level [emp, cp2, 10621158], intermediate feature level [v2vnet, cmass], and result level [chan2025energy, liu2021livemap]. Among these, intermediate feature level collaboration, which involves sharing locally extracted compact features, offers a promising trade-off between communication efficiency and preservation of BEV-relevant information. Consequently, it has received considerable attention in recent studies [cp1]. Although transmitting intermediate features incurs at least one order of magnitude less communication overhead than transmitting raw sensing data (e.g., reducing transmitted data from MBs to KBs [chen2019f]), the spectral resources available for inter-CAV data exchange remain limited. For instance, vehicle-to-vehicle (V2V) links are only allocated a MHz frequency band at GHz for C-V2X communications in China [5gaa2021deployment]. This scarce bandwidth limits the data transmission rate and makes it impractical for the ego CAV to request BEV features from all surrounding CAVs, thereby calling for solutions to the following research problems.
How can the ego CAV select an optimal set of collaborative CAVs to construct an accurate BEV map? Owing to vehicular mobility, collaborative CAVs provide continuously-evolving and varying levels of contribution to the ego CAV’s BEV map construction. Under a limited selection budget, the ego CAV must identify the most valuable collaborators by evaluating their BEV feature utility in real time.
How can the ego CAV ensure timely BEV map construction in the presence of the straggler effect induced by heterogeneous V2V link quality? In collaborative BEV perception, the ego CAV cannot initiate data fusion and BEV map construction until it has received the requested features from all collaborative CAVs. However, CAVs with poor V2V link quality can inflate the overall feature transmission latency to the order of seconds [harbor], severely impeding real-time BEV map updates. Such stragglers are inevitable in practice due to frequent signal blockages, dynamic inter-CAV distance variations, etc. [boban2010impact].
Unfortunately, existing approaches face fundamental challenges in tackling the above problems. First, although some studies have explored collaborative CAV selection, they typically rely on static sensor metadata, such as camera coverage, to assess utility [wang2024edge, jiawei, mass]. Such metrics are agnostic to the semantic content of the extracted features and cannot reflect their actual contribution to BEV map quality. Furthermore, as the available collaborative CAVs are constantly moving, their utility estimates quickly become outdated. This necessitates a proper balance between exploring new collaborators and exploiting previously inferred utilities. Second, while prior works [pacp, harbor] have explored mitigating the straggler effect through adaptive data compression, they overlook the varying urgency of BEV map construction across different driving scenarios. For instance, an ego CAV navigating through high-mobility urban intersections requires rapid BEV updates, whereas one cruising in a stable platoon can tolerate higher latency. Compression schemes that ignore such driving volatility may apply overly aggressive compression in low-urgency settings, degrading accuracy, or insufficient compression in time-sensitive contexts, resulting in excessive delay.
To address these challenges, our preliminary studies identify three critical requirements for accurate and communication-efficient collaborative BEV perception: effective BEV feature utility evaluation, proper exploration-exploitation in collaborative CAV selection, and driving volatility-aware straggler effect mitigation. Based on these insights, we propose BEVCooper, a collaborative perception framework that enables ego CAV to adapt to varying environments while maintaining accurate and timely BEV map construction, through the following designs and contributions.
First, we propose a novel BEV feature evaluation metric, termed marginal BEV contribution, which assesses the incremental improvement in both map accuracy and additional Field-of-View (FoV) provided by a collaborative CAV. This metric enables BEVCooper to precisely identify the most beneficial collaborative CAVs for the ego CAV’s BEV construction.
Second, we develop an online learning-based CAV selection strategy with an alternating exploration-exploitation architecture. This enables BEVCooper to optimally leverage known high-performance collaborators while systematically evaluating promising but underutilized CAVs. By dynamically adjusting the exploration-exploitation balance under a constrained selection budget, BEVCooper maintains superior BEV perception quality amid continuous vehicular mobility.
Third, we design a driving volatility-aware BEV feature fusion mechanism that dynamically optimizes compression ratios based on both environmental volatility and real-time V2V link quality. Unlike existing approaches that treat straggler mitigation statically, our design enables BEVCooper to adaptively balance feature quality and transmission latency, ensuring timely BEV map construction while maintaining perception accuracy across diverse driving scenarios.
Theoretical analysis shows that BEVCooper achieves asymptotically optimal CAV selection and adaptive feature fusion in vehicular networks with continuously changing feature utilities and V2V channel quality. Furthermore, BEVCooper is implemented on real-world platforms, i.e., NVIDIA Jetson Orin and RTX 3080Ti. Extensive experimental results demonstrate BEVCooper’s superiority over state-of-the-art methods, achieving improvements of in BEV perception accuracy and in transmission latency reduction across diverse driving scenarios.
II Observations and Motivations
This section presents the motivation for the design of BEVCooper, supported by preliminary experimental analysis.
II-A A Unified Metric for BEV Feature Utility Assessment
Observation: We first visualize the BEV map constructed by the ego CAV in the example shown in Figure 1. As illustrated in Figure 2, collaborative perception enhances the ego CAV’s perception accuracy within its own FoV and extends its perceptual coverage by incorporating complementary viewpoints, resulting in a more accurate and holistic BEV map.
Motivation: Therefore, when quantifying the extent to which a collaborative CAV’s BEV feature enhances the ego CAV’s perception capability, both the improvement in the ego CAV’s perception accuracy and the expansion of its FoV should be simultaneously taken into consideration. Focusing exclusively on one dimension, such as camera coverage [wang2024edge] or accuracy improvement [jiawei, mass] alone, may overlook valuable data from CAVs with complementary sensing geometries or higher-quality features in specific regions, ultimately compromising the overall quality of the constructed BEV map.
II-B Dynamic Perception Contribution of Collaborative CAVs
Observation: For the ego CAV, the most intuitive strategy to maximize resource efficiency and enhance its perception accuracy would be to pre-identify and consistently select the CAVs with the highest marginal BEV contribution. However, identifying such high-contributing CAVs is non-trivial. To illustrate this, we randomly assign one ego CAV and record the marginal BEV contributions of the remaining CAVs on a simulated dataset [opv2v]. As shown in Figure 3(a), these contributions exhibit significant temporal fluctuations and unpredictability. This variability arises from the rapidly changing driving environment, as depicted in Figure 3(b), where collaborative CAVs frequently relocate to positions with varying perceptual value.
Motivation: Based on the above observation, relying on a predetermined set of collaborative CAVs [emp, ruiqi1, robust] proves insufficient. This motivates our online learning-based CAV selection strategy, which dynamically evaluates and selects collaborative CAVs based on their real-time and historical marginal contributions to BEV map construction.
II-C Straggler Effect in Collaborative BEV Perception
Observation: Another critical challenge in collaborative perception is the straggler effect, where excessive feature transmission latency causes the constructed BEV map to deviate from the actual driving environment. To investigate its impact, we evaluate collaborative perception performance under real-world network conditions. Specifically, we adopt CoBEVT [cobevt] as the segmentation model for the BEV map segmentation task. Following the 5G NR V2X sidelink standard [sidelink], a total data rate of 40–50 Mbps is allocated to collaborative CAVs based on their distances to the ego CAV. The accuracy of a BEV map constructed with latency ms is quantified by the mean Intersection over Union (mIoU) gap, computed against the ground-truth label from the frame steps ahead. As shown in Figure 4, the straggler effect significantly prolongs feature transmission time, reducing perception accuracy by up to . Furthermore, Figure 4 illustrates that higher environmental volatility leads to a larger mIoU gap, thereby exacerbating the challenge of mitigating straggler effects under dynamic conditions.
Motivation: Existing studies have either overlooked the straggler effect [v2vnet, spatio, shao], or addressed it without accounting for the ego CAV’s dynamic driving environment [pacp, v2vnet, harbor], leading to compromised collaborative BEV perception performance. This inspires us to integrate driving volatility awareness with straggler mitigation to ensure timely and accurate BEV map construction.
III System Overview and Problem Formulation
As shown in Figure 5, we consider an ego CAV driving across a busy urban intersection, where the driving environment may include dynamic objects such as pedestrians and surrounding vehicles. The ego CAV establishes V2V connections with collaborative CAVs within a stable vehicle cluster that are willing to participate in collaborative perception [cluster]. Each CAV is equipped with four cameras and generates images at FPS. We consider homogeneous computing resources, i.e., all CAVs have identical computational capabilities [10643366], and focus on the straggler effect arising from heterogeneous V2V channel quality. Without loss of generality, we assume that all sensors are well synchronized and share a common clock [opv2v]. Time is divided into discrete slots of duration ms, aligned with the camera’s sensing interval.
Figure 5 illustrates the workflow of BEVCooper in both control plane and data plane. At the beginning of each time slot, the control plane on the ego CAV determines the CAV selection strategy based on historical data and computes the fusion deadline by considering both driving volatility and V2V channel conditions. On the data plane, the selected collaborative CAVs, represented by the set with , transmit their locally extracted BEV features to the ego CAV. To meet the fusion deadline, each selected CAV adaptively adjusts its compression rate to ensure timely delivery. Upon receiving the features, the ego CAV performs feature fusion and BEV map construction, and subsequently updates the estimated utility of each collaborator based on its contribution to the current BEV map. This updated utility informs the CAV selection strategy in the subsequent time slots. To optimize this process, the ego CAV must accurately evaluate the marginal perception utility of each collaborator’s BEV feature.
III-A Marginal BEV Contribution
As stated in Section II-A, collaborative BEV perception provides two principal benefits to ego CAV: enhanced perception accuracy within its existing sensor FoV, and extended perception coverage beyond its FoV. To quantify the utility of each selected CAV’s BEV feature, we introduce the marginal BEV contribution metric. This metric reflects the incremental value contributed by a collaborative CAV in terms of both segmentation accuracy and spatial awareness.
Marginal Segmentation Accuracy: We use the marginal segmentation accuracy to quantify the extent to which the collaborative CAV enhances the perception quality within the ego CAV’s FoV. Specifically, the marginal segmentation accuracy of CAV index is computed as:
| (1) |
where and denote the segmentation output with perception data from CAV set and , respectively. Function calculates the mean IoU between two segmented BEV maps. Notably, this metric can be calculated without requiring ground truth segmentation results.
Normalized Extended FoV: In addition to improving BEV segmentation accuracy within the ego CAV’s FoV, collaborative perception also offers the advantage of extending the ego CAV’s FoV, allowing it to observe more distant or occluded objects. To measure this extended coverage, we define a normalized metric , representing the additional FoV area contributed by selected CAV :
| (2) |
where and denote the FoVs of the ego CAV and CAV , respectively, and function denotes the area operator. As shown in Figure 6, each CAV’s perception FoV is modeled as a rectangle on the BEV map, which can be computed from metadata such as GPS coordinates and vehicle orientation.
Equation (2) quantifies the incremental FoV of a collaborative CAV by subtracting the overlapping polygonal area shared with the ego CAV’s FoV. A higher indicates a greater portion of newly observable area beyond the ego’s original FoV enabled by CAV . Although selecting collaborative CAVs with highly overlapping perception regions may yield higher marginal segmentation accuracy, it limits the benefit of extended FoV. To balance this trade-off, we define a unified contribution score that integrates both metrics:
Definition 1.
Marginal BEV Contribution. If a collaborative CAV is selected, its marginal BEV contribution to the ego CAV in current collaborative perception round is:
| (3) |
where is a weighting factor that balances the segmentation accuracy and coverage contributions of individual CAVs.
Remark 1. The value of can be adaptively adjusted. When the ego CAV already achieves high BEV segmentation accuracy within its own FoV, increasing shifts the selection toward collaborators that provide broader spatial coverage.
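As an illustration only, the two ingredients of the marginal BEV contribution can be sketched in Python. Several choices below are assumptions rather than the paper's exact definitions: the segmentation term is taken as one minus the mean IoU between the outputs with and without the candidate CAV, FoVs are axis-aligned rectangles `(x0, y0, x1, y1)`, the extended FoV is normalized by the collaborator's own area, and the names `mean_iou`, `extended_fov`, `marginal_bev_contribution`, and `weight` are hypothetical.

```python
import numpy as np

def mean_iou(seg_a, seg_b, num_classes):
    """Mean IoU between two integer-labelled BEV segmentation maps."""
    ious = []
    for c in range(num_classes):
        a, b = seg_a == c, seg_b == c
        union = np.logical_or(a, b).sum()
        if union == 0:  # class absent from both maps: skip it
            continue
        ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious)) if ious else 1.0

def rect_area(r):
    """Area of an axis-aligned rectangle; degenerate rectangles have area 0."""
    x0, y0, x1, y1 = r
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def extended_fov(ego, cav):
    """Assumed normalization: share of the CAV's FoV lying outside the ego's FoV."""
    inter = (max(ego[0], cav[0]), max(ego[1], cav[1]),
             min(ego[2], cav[2]), min(ego[3], cav[3]))
    return (rect_area(cav) - rect_area(inter)) / rect_area(cav)

def marginal_bev_contribution(seg_with, seg_without, ego_fov, cav_fov,
                              num_classes, weight=0.5):
    """Weighted sum of the (assumed) segmentation term and the FoV term."""
    seg_gain = 1.0 - mean_iou(seg_with, seg_without, num_classes)
    return (1.0 - weight) * seg_gain + weight * extended_fov(ego_fov, cav_fov)
```

Under these assumptions, raising `weight` favors collaborators with little FoV overlap, which matches the behavior described in Remark 1. Note that no ground-truth labels are needed, since only the two fused outputs are compared.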
III-B Problem Formulation
Let denote the action variable indicating whether collaborative CAV is selected at time , and define the action vector as:
| (4) |
where the summation of is subject to . As the driving environment is constantly changing, we assume that each CAV is associated with a finite, hidden, and dynamic reward , which captures its marginal BEV contribution. Then the reward received from selecting CAV is denoted by: . If CAV is selected at time , i.e., , its reward is observed and updated. We assume the reward of a selected collaborative CAV evolves according to a Markov transition model [xiong2022learning, restless4]:
| (5) |
If CAV is not selected, its reward, i.e., marginal BEV contribution, evolves according to an independent and unknown stochastic process due to restless dynamics [dai2024quantifying]. The ego CAV’s objective is to maximize the expected cumulative reward over a finite horizon :
| (6) | ||||
| (7) | ||||
| (8) |
where (7) and (8) are constraints enforcing binary decision variables and a limited selection budget. Although the simplest solution to (6) would be to always select the top- CAVs with the highest values, such a strategy requires full knowledge of each CAV’s underlying Markovian transition model, which is inherently stochastic and typically unknown in practice. Moreover, the value of is highly dynamic and rapidly becomes outdated. Combined with the limited selection budget, the ego CAV must balance learning the reward distributions of different CAVs against selecting the CAVs that are currently believed to provide the highest reward.
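For readers who prefer a compact statement, one form of the selection problem that is consistent with the description above is sketched below. The notation is assumed for illustration and may differ from the paper's: $a_i(t)$ is the selection indicator of CAV $i$ at slot $t$, $r_i(t)$ its reward, $N$ the number of candidate CAVs, $M$ the selection budget, and $T$ the horizon. Whether the budget holds with equality or as an upper bound is also an assumption.

```latex
\begin{align}
\max_{\{a_i(t)\}} \quad & \mathbb{E}\!\left[\sum_{t=1}^{T}\sum_{i=1}^{N} a_i(t)\, r_i(t)\right] \\
\text{s.t.} \quad & a_i(t) \in \{0, 1\}, \quad \forall i,\ \forall t, \\
& \sum_{i=1}^{N} a_i(t) \le M, \quad \forall t.
\end{align}
```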
IV Online Collaborative CAV Selection
Given the complexity of the formulated problem in dynamic environments, BEVCooper incorporates an online collaborative CAV selection algorithm that combines deterministic exploration and exploitation in a cyclic structure.
IV-A Algorithm Design
As shown in Algorithm 1, the proposed online collaborative CAV selection comprises two main components.
Initialization (lines 1-5). The ego CAV first initializes the sample mean reward for all collaborative CAVs. It then sets phase counters , which respectively track the number of exploration and exploitation phases for each CAV . After that, the ego CAV performs an initial exploration phase over time slots, during which CAVs are selected sequentially at each time step, ensuring that a minimal amount of information is obtained about all CAVs at the start.
Alternate Exploration and Exploitation (lines 6-18). To address the dynamic and unknown rewards, the algorithm proceeds by alternating between exploration and exploitation phases. A threshold-driven trigger mechanism controls the switch between the two phases: if the number of time slots spent on exploration phases is below the current threshold , e.g., , an exploration phase, in which each CAV is selected at least times, is triggered. Here, is a predefined function that grows logarithmically with time slot . If the exploration phase is not triggered, the algorithm enters an exploitation phase, in which the ego CAV continuously selects the top- collaborative CAVs for times. The exponential terms and ensure sufficient sampling while adapting to each vehicle’s historical performance.
The alternating exploration and exploitation structure of Algorithm 1 is illustrated in Figure 6. This design strikes a balance between maintaining accurate CAV perception quality estimates and maximizing the cumulative perception reward in restless, changing urban driving environments. In addition, the exponential growth of exploration and exploitation phases, based on counters and , ensures a balance between sampling frequency and computational efficiency. While the duration of a CAV cluster is inherently limited in real-world driving scenarios, substantial changes in cluster structure, e.g., collaborative CAVs joining or leaving, can be addressed using existing methods [cluster, huang2018path], which re-cluster the CAVs and reset the corresponding counters and accordingly.
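The alternating structure described above can be sketched as follows. This is a simplified illustration, not Algorithm 1 itself: it uses global phase counters instead of per-CAV counters, a base-2 phase length, and a threshold of `d_const * log(t + 1)`; the function `select_cavs` and its parameters are hypothetical names.

```python
import math

def select_cavs(reward_fn, num_cavs, budget, horizon, d_const=2.0):
    """Alternate exploration and exploitation; reward_fn(i, t) is CAV i's reward at slot t."""
    sums = [0.0] * num_cavs
    counts = [0] * num_cavs
    history = []  # tuple of selected CAV indices per slot
    # Round-robin groups of at most `budget` CAVs, used for exploration.
    groups = [list(range(s, min(s + budget, num_cavs)))
              for s in range(0, num_cavs, budget)]

    def play(chosen):
        t = len(history)
        for i in chosen:
            sums[i] += reward_fn(i, t)
            counts[i] += 1
        history.append(tuple(chosen))

    for g in groups:  # initial exploration: sample every CAV once
        if len(history) < horizon:
            play(g)
    explore_slots = len(history)
    n_explore = n_exploit = 0

    while len(history) < horizon:
        t = len(history)
        if explore_slots < d_const * math.log(t + 1):
            # Exploration phase: each group is sampled 2**n_explore times.
            for g in groups:
                for _ in range(2 ** n_explore):
                    if len(history) >= horizon:
                        break
                    play(g)
                    explore_slots += 1
            n_explore += 1
        else:
            # Exploitation phase: keep the current top-`budget` CAVs for 2**n_exploit slots.
            means = [sums[i] / counts[i] for i in range(num_cavs)]
            top = sorted(range(num_cavs), key=lambda i: means[i], reverse=True)[:budget]
            for _ in range(2 ** n_exploit):
                if len(history) >= horizon:
                    break
                play(top)
            n_exploit += 1
    return history
```

Because the exploration threshold grows only logarithmically while phase lengths grow geometrically, the share of slots spent exploring shrinks over time, which is the intuition behind a logarithmic regret order.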
Table I: Inference latency on the two testbeds.
| | Jetson Orin | RTX 3080Ti |
| BEV Feature Extraction | 425.7 ± 3 ms (2.35 FPS) | 8.5 ± 0.3 ms (118 FPS) |
| Segmentation Head | 3.84 ± 0.15 ms (260 FPS) | 2.03 ± 0.04 ms (492 FPS) |
IV-B Algorithm Analysis
Complexity Analysis. Compared to vanilla collaborative BEV perception, the additional computational overhead introduced by Algorithm 1 primarily stems from processing the BEV segmentation head times to update . However, as shown in Table I, in practical deployment the segmentation head [cobevt] executed on the ego CAV incurs only millisecond-level latency, enabling Algorithm 1 to operate in real time.
Performance Analysis. The performance of Algorithm 1 is measured by its ability to approach the cumulative perception contribution that could be achieved with full knowledge of the underlying vehicular dynamics. To quantify the performance loss due to uncertainty and learning of the dynamic environment, we introduce the definition of regret. Let be the stationary marginal collaborative perception contribution of CAV , and be a descending permutation of the collaborative CAV set:
| (9) |
where denotes the -th vehicle in the descending order of . The optimal policy in (9) presumes full prior knowledge of the system dynamics, such as the transition probabilities governing the Markovian evolution of CAV selection rewards. However, in practice, Algorithm 1 must learn these dynamics online through interactions with the environment. Consequently, we define the cumulative CAV selection regret as the performance loss incurred due to this lack of prior knowledge:
| (10) |
Next, we prove that Algorithm 1 approaches the optimal policy by bounding the cumulative regret of CAV selection.
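As a reading aid, the regret notion described above is commonly written in the following form in the restless bandit literature. The notation is assumed for illustration: $\mu_i$ is the stationary marginal contribution of CAV $i$, $\sigma(j)$ is the CAV ranked $j$-th in descending order of $\mu_i$, $M$ is the selection budget, and $a_i(t)$, $r_i(t)$ are the selection indicator and reward at slot $t$. The paper's equations (9) and (10) may differ in detail.

```latex
\begin{equation}
R(T) \;=\; T \sum_{j=1}^{M} \mu_{\sigma(j)}
\;-\; \mathbb{E}\!\left[\sum_{t=1}^{T}\sum_{i=1}^{N} a_i(t)\, r_i(t)\right]
\end{equation}
```

A policy whose regret grows only logarithmically in $T$ has a time-averaged reward that converges to that of the best fixed top-$M$ set.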
Theorem 1.
Let be a constant and define . Then, for any time horizon , the cumulative regret of Algorithm 1 satisfies: .
Proof.
Please refer to Appendix A. ∎
Remark 2. The logarithmic-order regret bound in Theorem 1 demonstrates that, in restless driving environments, Algorithm 1 can reliably learn and strike an effective balance between exploration and exploitation during collaborative CAV selection. However, prolonged transmission delays caused by poor V2V channel conditions can impede the ego CAV’s BEV map construction and substantially reduce its accuracy.
V Volatility-aware BEV Feature Fusion
To alleviate the straggler effect in collaborative BEV perception, an intuitive approach is to reduce the time required for stragglers to transmit BEV features by feature compression [compress1, cobevt]. However, this raises two key questions. First, how can a straggler be identified? As illustrated in Figure 7, whether a collaborative CAV is considered a straggler depends not only on its channel quality, but also on the latency requirements of the ego CAV. Second, how should the compression ratio be determined? This involves a trade-off: while higher compression reduces transmission delay, it also leads to greater information loss, lowering the utility of the received BEV features. Conversely, lower compression preserves data quality but fails to address straggler-induced latency.
V-A Volatility-aware Straggler Identification
Motivated by the dynamic nature of urban driving environments, as discussed in Section II-C, we identify straggler CAVs through a quantitative assessment of driving volatility. To capture the extent to which a vehicle’s driving state diverges from that of its surrounding environment [volatility1, volatility2], the definition of driving volatility is presented below.
Definition 2.
Driving Volatility. Given objects within the ego CAV’s FoV, driving volatility is quantified as the root mean square of the relative longitudinal velocity deviations (footnote 1: this section focuses on the dynamics within a single time slot; for simplicity, the current time slot is omitted from the notation):
| (11) |
where and denote the longitudinal velocity of the ego CAV and the -th surrounding object, respectively.
Remark 3. A higher driving volatility indicates more drastic changes in the road environment, thereby exacerbating the negative impact of the straggler effect on collaborative perception accuracy. Consequently, the ego CAV has a more urgent demand for fresh BEV features, necessitating a higher compression rate from collaborative CAVs. With , stragglers can be identified by setting a fusion deadline.
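A minimal sketch of the volatility computation in Definition 2 is given below. It assumes that "relative deviation" means the plain difference between each object's longitudinal velocity and the ego CAV's, and that volatility is zero when no objects are in view; the function name `driving_volatility` is hypothetical.

```python
import math

def driving_volatility(ego_velocity, object_velocities):
    """Root mean square of longitudinal velocity deviations from the ego CAV."""
    if not object_velocities:
        return 0.0  # assumption: an empty scene is treated as non-volatile
    squared = [(v - ego_velocity) ** 2 for v in object_velocities]
    return math.sqrt(sum(squared) / len(squared))
```

For example, an ego CAV at 10 m/s surrounded by objects at 13 m/s and 7 m/s has deviations of +3 and -3 m/s, so the sketch returns a volatility of 3 m/s.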
Definition 3.
Fusion Deadline. The fusion deadline, denoted , specifies the latest time by which collaborative CAVs must deliver their BEV features for fusion. is calculated by:
| (12) |
where is a decay constant that balances sensitivity to with deadline flexibility. and denote the earliest and latest times at which the ego CAV can initiate fusion, respectively; their values are set based on V2V channel quality.
Remark 4. With respect to driving volatility, from (12) satisfies the following properties: 1) Boundedness: with , , when , we have . 2) Monotonicity: since , the value of decreases monotonically with driving volatility. This indicates that the deadline tightens as driving volatility increases. 3) Convexity: calculating the second derivative of with respect to , we have , which shows that the function is convex and that the marginal increase in the demand for fresh data diminishes as driving volatility grows. These properties guarantee that the fusion deadline can adaptively reflect the requirements of the ego CAV. Vehicles that fail to transmit BEV features by are identified as stragglers, enabling the ego CAV to adjust compression ratios dynamically.
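One functional form that satisfies the three properties in Remark 4 is an exponential decay between the two bounds. The sketch below uses that form purely as an assumption, since it matches the description of a decay constant; the paper's equation (12) may differ. The names `fusion_deadline`, `t_min`, `t_max`, and `decay` are hypothetical.

```python
import math

def fusion_deadline(volatility, t_min, t_max, decay=0.5):
    """Assumed form: deadline decays exponentially from t_max toward t_min."""
    return t_min + (t_max - t_min) * math.exp(-decay * volatility)

# Properties checked numerically for t_min = 20 ms and t_max = 100 ms.
d0 = fusion_deadline(0.0, 20.0, 100.0)   # calm scene: latest allowed deadline
d1 = fusion_deadline(1.0, 20.0, 100.0)
d2 = fusion_deadline(2.0, 20.0, 100.0)
bounded = 20.0 < d2 < d1 < d0 <= 100.0   # boundedness and monotonic decrease
convex = d1 <= (d0 + d2) / 2.0           # midpoint convexity
```

With this form, a collaborator whose estimated transmission time exceeds the returned deadline would be flagged as a straggler.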
V-B Adaptive BEV Feature Compression
To address the second question, we leverage deep neural networks (DNNs) for adaptive BEV feature compression. Let denote the BEV feature compression ratio. Since the variation in compression and encoding latency resulting from different compression ratios is negligible compared to the transmission latency induced by fluctuations in V2V channel quality, stragglers primarily control the arrival time of BEV features at the ego CAV by adjusting .
The workflow for mitigating the straggler effect proceeds as follows. Each identified straggler first selects the minimum compression ratio that ensures its transmission completes before the fusion deadline . It then compresses its extracted BEV features using a DNN-based encoder trained to minimize task-relevant information loss. Upon reception, the ego CAV reconstructs the features with the goal of maximizing perceptual fidelity. Finally, acknowledging that compression introduces inevitable information loss, the marginal BEV contribution of straggler CAVs is compensated based on the applied compression ratio. Specifically, let denote the degradation in marginal BEV contribution due to compression, which is computed as:
| (13) |
where means no compression, and are scenario-dependent constants and can be obtained through offline fitting. Using (13), is updated according to .
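The first step of the workflow above, choosing the minimum compression ratio that meets the fusion deadline, can be sketched as follows. The sketch assumes that a ratio `r` shrinks the payload by a factor of `r`, that a ratio of 1 means no compression, and that a straggler falls back to the largest available ratio when no ratio meets the deadline; none of these conventions is confirmed by the text, and `pick_compression_ratio` is a hypothetical name.

```python
def pick_compression_ratio(feature_bits, throughput_bps, deadline_s, ratios):
    """Return the smallest ratio whose transmission time fits within the deadline."""
    for r in sorted(ratios):
        transmit_time = feature_bits / r / throughput_bps
        if transmit_time <= deadline_s:
            return r
    return max(ratios)  # assumption: compress as much as possible if nothing fits
```

For instance, an 800,000-bit feature over a 20 Mbps link takes 40 ms uncompressed, so a 25 ms deadline leads the sketch to select a ratio of 2 from the candidates `[1, 2, 4, 8]`.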
VI Performance Evaluation
VI-A Experiment Setup
Dataset and backbones. Our experiments are conducted on OPV2V [opv2v], a large-scale simulated dataset tailored for V2V collaborative perception. We select three representative urban intersection scenarios with collaborative CAVs and unconnected vehicles (objects). When computing the marginal BEV contribution, each CAV’s FoV is defined as a rectangular area. We consider two V2V sidelink throughput scenarios [sidelink] in which the ego CAV is allocated bandwidth resources of 15–25 Mbps (low throughput) and 40–50 Mbps (high throughput), respectively. Beyond the backbones introduced in Section II, we use a squeeze-and-excitation network [compress] to perform channel-wise BEV feature compression. The compression ratio is selected from where a value of denotes no compression. The performance of the BEV segmentation task is evaluated using the mIoU metric. The average data sizes of a camera image and a BEV feature are MB and KB per frame, respectively.
Real-world testbeds. To simulate different onboard computational capabilities, our method is deployed on both NVIDIA RTX 3080Ti and Jetson Orin hardware platforms. The default values of and are experimentally set to and , respectively. Performance analysis under varying parameter settings is also presented below. Through offline fitting, the values of and in (13) are set to and . In each time slot, the parameters and in (12) are set to the time required by the CAV with the poorest bandwidth to transmit BEV features under the highest and lowest compression ratios, respectively.
Benchmarks. Our method is compared with the following benchmarks. (1) ECOP [jiawei]: An upper-confidence-bound-based method that selects CAVs according to , where is the selection counter of CAV up to time . (2) MASS [mass]: It selects CAVs sequentially according to the sum , where is the last time the ego CAV selected CAV . (3) Random: A baseline that randomly selects collaborative CAVs without considering utility. (4) Harbor [harbor]: Any CAV whose BEV data transmission latency exceeds ms is classified as a straggler by the ego CAV, and its perception data is discarded. (5) Max and Min: These two baselines for BEV feature transmission consistently apply the maximum and minimum compression ratios, respectively. (6) Early Fusion and No Fusion: The ego CAV requests raw images from surrounding CAVs or constructs the BEV map in a stand-alone manner, respectively.
VI-B Performance Analysis
Superiority of CAV selection. We begin by evaluating the performance gap between various collaborative CAV selection methods and the Optimal approach, which has hindsight information and always selects the top- CAVs. As shown in Figure 8(a) and 8(b), our method achieves the smallest average gap across three driving scenarios. This improvement stems from the alternating exploration–exploitation mechanism in Algorithm 1, which enables the ego CAV to effectively balance learning and exploitation. Figure 8(c) further shows that when the selection budget is limited to , our method yields the most significant gains, outperforming ECOP and MASS by and , respectively. This is because, with only a single selection opportunity, the ego CAV must quickly identify the most valuable collaborator. Our method addresses this by efficiently collecting marginal BEV contributions from all candidates. Furthermore, as shown in Figure 8(d), increasing the value of extends the exploration phase, highlighting the algorithm’s adaptability to varying driving environments through appropriate tuning of .
End-to-end performance comparison. Next, we evaluate the BEV segmentation mIoU and average BEV feature transmission latency across different methods. As shown in Figures 9(a) and 9(b), our method outperforms the benchmarks under both V2V networking conditions. Specifically, compared with Harbor, our method achieves up to and end-to-end latency reduction and BEV segmentation accuracy improvement. This gain stems from dynamically evaluating driving volatility to set fusion deadlines, as shown in Figure 9(d), which in turn enables the identification of a greater number of stragglers, as shown in Figure 9(c). Unlike Harbor, which uses a fixed deadline and discards delayed data, our approach employs adaptive feature compression, allowing the ego CAV to incorporate features from more collaborators. Additionally, compared to the Min baseline, our method constructs more accurate BEV maps, demonstrating the effectiveness of feature compression in mitigating the straggler effect.
Decomposed time overhead on devices. To further demonstrate the low computational overhead of our method, Figure 10 presents the time overhead of the BEV‐feature‐based collaborative perception system across two distinct testbeds. As shown in Figure 10(a), the average system inference latency on an RTX 3080Ti is ms. Communication latency dominates the total overhead, accounting for up to , while Algorithm 1 contributes only . According to Figure 10(b), when the CAV’s onboard computing capability is limited, the time overhead associated with BEV feature extraction, compression, and fusion on Jetson Orin constitutes the primary bottleneck to the system’s real-time performance, accounting for , while our method adds only overhead.
Advantage of the volatility-aware feature fusion. Finally, we visualize BEV maps at an intersection constructed by the ego CAV using our method and Harbor. As shown in Figure 11, when V2V throughput to a collaborative CAV is low, both methods identify it as a straggler. Nonetheless, our method ensures timely BEV data delivery, yielding a more accurate and comprehensive map. This underscores its vital role in enhancing the accuracy of autonomous driving systems.
VII Related Works
CAV Selection in Collaborative Perception. Prior studies have investigated strategies for selecting collaboration partners in V2V-based perception. Who2com [liu2020who2com] introduced a three-stage handshaking mechanism, which enables targeted collaboration but incurs substantial round-trip latency. To reduce communication overhead, methods such as V2VNet [v2vnet], CPIM [zhang2025], and Where2comm [hu2022where2comm] allow CAVs to broadcast perception data without explicit coordination. However, this broadcast paradigm becomes inefficient in dense traffic scenarios due to the significant communication load. Recent works employ online learning-based CAV selection, enabling the ego CAV to gradually infer the utility of its neighbors. For example, ECOP [jiawei] and MASS [mass] leverage multi-armed bandit (MAB) frameworks to select LiDAR-equipped CAVs based on historical confidence scores or detection accuracy. However, these methods are not directly applicable to collaborative BEV perception, as they can neither effectively evaluate the utility of BEV features nor balance the trade-off between exploration and exploitation of utilities in dynamic environments.
Communication Optimizations in Collaborative Perception. Communication latency significantly impacts collaborative perception gains. Consequently, several studies [luo2023edgecooper, pacp, cp1, cp2] treat the total V2V transmission latency as a constraint in their optimization problems. In addition, Harbor [harbor] mitigates the straggler effect by discarding the perception data of CAVs whose transmission latency exceeds a predefined deadline. While effective, these methods account only for a fixed latency constraint or fusion deadline, overlooking the varying urgency of the ego CAV’s data requirements under different driving conditions.
VIII Conclusion
In this work, we have proposed BEVCooper. Through preliminary studies, we have developed a novel BEV feature evaluation metric and identified key challenges and potential directions for improving perception accuracy in dynamic driving environments. Building upon these insights, BEVCooper incorporates an online learning-based CAV selection strategy that balances exploration and exploitation to maximize efficiency and accommodate promising but underutilized vehicles. To mitigate the straggler effect, BEVCooper employs a volatility-aware fusion mechanism that adapts to environmental dynamics and V2V link quality. The effectiveness of BEVCooper has been validated through implementation on two real-world testbeds and experiments across diverse scenarios. We believe that the design of BEVCooper plays a critical role in enhancing the safety and robustness of autonomous driving systems.
Appendix A Proof of Theorem 1
The cumulative regret in (10) is first decomposed into two distinct terms, each corresponding to a different source of suboptimal collaborative CAV selection:
| (14) |
where is the expected number of times CAV is selected up to time . The term calculates the stationary rewards according to Algorithm 1. Equation (14) shows that the cumulative regret of Algorithm 1 can be decomposed into two parts: (a) the regret incurred by selecting CAVs whose expected reward is suboptimal; (b) the mismatch between the expected stationary rewards and the actual rewards obtained. The discrepancy in (b) results from the intermittent selection of CAVs in the restless environment, which causes their state distributions at selection times to deviate from the stationary distribution. In what follows, we bound the two parts separately.
A-A Bounding part (b) in (14)
According to Lemma 1 from [restless3] and Lemma 2.1 from [restless2], we use the following result on Markov chains.
Lemma 1.
Assume the hidden Markov transition model of each CAV’s reward (marginal BEV contribution) is irreducible and aperiodic. If CAV is selected for consecutive time steps, then its cumulative expected contribution satisfies: , where is a constant that depends only on the transition model and is independent of .
Lemma 1 establishes that in a Markovian reward setting, frequent switching between CAVs interrupts the convergence of their internal state processes to the stationary distribution. As a result, when a CAV is selected after a period of inactivity, its reward is drawn from a transient distribution, incurring a regret bounded by a constant . Therefore, bounding part (b) in (14) reduces to analyzing the number of CAV switches under the alternating exploration–exploitation structure.
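The intuition behind Lemma 1 can be checked numerically on a toy example. The sketch below is an illustration only: it uses a hypothetical two-state reward chain (transition probabilities `p01`, `p10` and the reward values are arbitrary choices, not parameters from the paper) and computes, exactly rather than by sampling, how far the expected cumulative reward of a freshly selected arm drifts from the stationary prediction. For such a chain the drift decays geometrically, so its sum stays below a constant that does not depend on the number of steps.

```python
from typing import Tuple


def cumulative_deviation(p01: float, p10: float,
                         rewards: Tuple[float, float],
                         start_state: int, steps: int) -> float:
    """Sum over `steps` slots of (expected reward in slot n) - (stationary mean)."""
    pi0 = p10 / (p01 + p10)                   # stationary P(state 0)
    mu = pi0 * rewards[0] + (1 - pi0) * rewards[1]
    q0 = 1.0 if start_state == 0 else 0.0     # P(state 0) in the current slot
    total = 0.0
    for _ in range(steps):
        total += q0 * rewards[0] + (1 - q0) * rewards[1] - mu
        q0 = q0 * (1 - p01) + (1 - q0) * p10  # one step of the chain
    return total


def deviation_bound(p01: float, p10: float,
                    rewards: Tuple[float, float],
                    start_state: int) -> float:
    """Step-independent bound: geometric series with ratio |1 - p01 - p10|."""
    pi0 = p10 / (p01 + p10)
    q0 = 1.0 if start_state == 0 else 0.0
    first_gap = abs((q0 - pi0) * (rewards[0] - rewards[1]))
    return first_gap / (1 - abs(1 - p01 - p10))


toy_bound = deviation_bound(0.2, 0.3, (1.0, 0.0), start_state=0)
toy_devs = [cumulative_deviation(0.2, 0.3, (1.0, 0.0), 0, n)
            for n in (1, 10, 1000)]
```

Whether the arm is held for 10 or 1000 steps, the deviation never exceeds `toy_bound`, which is why only the number of switches, and not the time spent on each CAV, matters for part (b).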
We begin by analyzing the exploration phase. Since the duration of each exploration epoch grows geometrically with base 2, starting from , the total time spent on CAV during exploration up to time is upper bounded by . Consequently, the number of exploration phases can be bounded by:
| (15) |
Consequently, the total number of switches involving the selected CAV set during exploration is at most .
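The logarithmic growth of the number of phases follows from the geometric epoch lengths alone, as the sketch below illustrates. It assumes, for illustration, that the n-th epoch lasts `first_len * 2**(n-1)` slots; `first_len` and the function names are hypothetical stand-ins, and the exact constants of Algorithm 1 are not reproduced. Since the first N epochs occupy `first_len * (2**N - 1)` slots, at most `log2(t / first_len + 1)` epochs fit into `t` slots.

```python
import math


def epochs_completed(t: int, first_len: int = 1) -> int:
    """Count whole epochs that fit in t slots when epoch lengths double."""
    count, used, length = 0, 0, first_len
    while used + length <= t:
        used += length
        count += 1
        length *= 2  # geometric growth with base 2
    return count


def epoch_upper_bound(t: int, first_len: int = 1) -> float:
    """Closed-form bound obtained from the geometric sum."""
    return math.log2(t / first_len + 1)


toy_counts = {t: epochs_completed(t) for t in (7, 8, 14, 15, 1000)}
```

Doubling the horizon therefore adds at most one more epoch, so any per-epoch cost, such as a bounded number of CAV switches, accumulates only logarithmically in time.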
By time , the total duration allocated to the exploitation phase is upper bounded by , due to the initial exploration phase consuming at least time slots. Analogously, we have , which implies:
| (16) |
Since each phase may incur up to CAV switches, the total number of switches involving the selected CAV set during exploitation is upper bounded by . Letting , the regret corresponding to part (b) in (14) is bounded by:
| (17) |
where the right side of the inequality is a logarithmic function of time .
A-B Bounding part (a) in (14)
In this subsection, we further decompose the total expected number of suboptimal CAV selections into contributions from the exploration and exploitation phases. According to (15), the cumulative number of time slots used to explore CAV by time , denoted by , is bounded by:
| (18) |
In each exploration round, up to CAVs, including both optimal and suboptimal ones, are selected. The regret incurred from selecting suboptimal CAVs in each round is:
| (19) |
Since, as stated in Theorem 1, is a non-negative constant, (19) grows at the same order as . We now analyze the selection of suboptimal CAVs during exploitation phases. Although the exploitation phase is intended to utilize the best-performing CAVs based on empirical rewards, estimation noise, particularly under Markovian reward processes, may occasionally result in the selection of suboptimal CAVs. Specifically, such errors occur when the empirical mean of a suboptimal CAV, e.g., , exceeds that of an -optimal one, e.g., , at the beginning of an exploitation epoch. To quantify the probability of overestimating CAV or underestimating CAV , we invoke the following lemma [restless2].
Lemma 2.
Let arms and be governed by irreducible, aperiodic Markov chains on finite state spaces and , with stationary distributions , and stationary means . Denote by the start time of the -th exploitation phase, and let be the event that arm ’s sample mean exceeds that of at exploitation phase , then the probability of event satisfies:
where depends on the initial reward distribution of each CAV’s marginal BEV contribution and denotes the minimum value of stationary distribution probabilities across all collaborative CAVs and states.
The cumulative regret resulting from the selection of suboptimal CAVs during the exploitation phases is:
| (20) |
where the term bounds the regret from selecting one group of collaborative CAVs in an exploitation phase. Then, according to Lemma 2, (20) can be bounded by:
| (21) |
| (22) |
Inequality (22) holds since when . Combining (17), (19) and (22), it can be observed that both regret parts in (14) grow logarithmically with time. This concludes the proof of Theorem 1.