ChatPaper.ai


GroupEnsemble: Efficient Uncertainty Estimation for DETR-based Object Detection

March 2, 2026
Authors: Yutong Yang, Katarina Popović, Julian Wiederer, Markus Braun, Vasileios Belagiannis, Bin Yang
cs.AI

Abstract

Detection Transformer (DETR) and its variants show strong performance on object detection, a key task for autonomous systems. However, a critical limitation of these models is that their confidence scores reflect only semantic uncertainty, failing to capture the equally important spatial uncertainty. This results in an incomplete assessment of detection reliability. Deep Ensembles can address this by providing high-quality spatial uncertainty estimates, but their immense memory consumption makes them impractical for real-world applications. A cheaper alternative, Monte Carlo (MC) Dropout, suffers from high latency due to the need for multiple forward passes during inference to estimate uncertainty. To address these limitations, we introduce GroupEnsemble, an efficient and effective uncertainty estimation method for DETR-like models. GroupEnsemble simultaneously predicts multiple individual detection sets by feeding additional diverse groups of object queries to the transformer decoder during inference. Each query group is transformed by the shared decoder in isolation and predicts a complete detection set for the same input. An attention mask is applied in the decoder to prevent inter-group query interactions, ensuring that each group detects independently, which enables reliable ensemble-based uncertainty estimation. By leveraging the decoder's inherent parallelism, GroupEnsemble efficiently estimates uncertainty in a single forward pass without sequential repetition. We validated our method in autonomous driving scenes and common daily scenes using the Cityscapes and COCO datasets, respectively. The results show that a hybrid approach combining MC-Dropout and GroupEnsemble outperforms Deep Ensembles on several metrics at a fraction of the cost. The code is available at https://github.com/yutongy98/GroupEnsemble.
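To make the core mechanism concrete, here is a minimal numpy sketch of the two ingredients the abstract describes: a block-diagonal self-attention mask that confines each query group to itself, and an ensemble-style spatial-uncertainty readout taken as the variance of the per-group box predictions. The function name, mask convention (`True` = attention blocked, as in PyTorch boolean `attn_mask` semantics), and the toy numbers are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def group_attention_mask(num_groups: int, queries_per_group: int) -> np.ndarray:
    """Block-diagonal decoder self-attention mask (hypothetical helper).

    Entry [i, j] is True when query i must NOT attend to query j,
    i.e. all cross-group interactions are blocked and only
    within-group attention is allowed.
    """
    n = num_groups * queries_per_group
    mask = np.ones((n, n), dtype=bool)          # block everything by default
    for g in range(num_groups):
        s = g * queries_per_group
        mask[s:s + queries_per_group, s:s + queries_per_group] = False
    return mask

# Toy example: 3 query groups of 4 queries each.
mask = group_attention_mask(3, 4)
assert not mask[0, 3]   # same group: attention allowed
assert mask[0, 5]       # different group: attention blocked

# Each group predicts a full detection set for the same image. For one
# matched object, the spread of the per-group boxes gives a spatial
# uncertainty estimate (toy xyxy boxes, one row per group):
boxes = np.array([[10.0, 20.0, 50.0, 80.0],
                  [11.0, 19.0, 52.0, 79.0],
                  [ 9.0, 21.0, 49.0, 81.0]])
mean_box = boxes.mean(axis=0)       # ensemble box prediction
spatial_var = boxes.var(axis=0)     # per-coordinate spatial uncertainty
```

Because the mask is applied inside one batched decoder pass, all groups are transformed in parallel, which is why the method needs only a single forward pass rather than the sequential repetition that MC-Dropout requires.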