GroupEnsemble: Efficient Uncertainty Estimation for DETR-based Object Detection

Abstract

Detection Transformer (DETR) and its variants show strong performance on object detection, a key task for autonomous systems. However, a critical limitation of these models is that their confidence scores reflect only semantic uncertainty and fail to capture the equally important spatial uncertainty, which leaves the assessment of detection reliability incomplete. Deep Ensembles can address this by providing high-quality spatial uncertainty estimates, but their immense memory consumption makes them impractical for real-world applications. A cheaper alternative, Monte Carlo (MC) Dropout, suffers from high latency because it requires multiple forward passes during inference to estimate uncertainty. To address these limitations, we introduce GroupEnsemble, an efficient and effective uncertainty estimation method for DETR-like models. GroupEnsemble simultaneously predicts multiple individual detection sets by feeding additional, diverse groups of object queries to the transformer decoder during inference. Each query group is transformed by the shared decoder in isolation and predicts a complete detection set for the same input. An attention mask applied in the decoder prevents inter-group query interactions, so each group detects independently, which is what makes the ensemble-based uncertainty estimate reliable. By leveraging the decoder's inherent parallelism, GroupEnsemble estimates uncertainty in a single forward pass without sequential repetition. We validated our method on autonomous driving scenes and common daily scenes using the Cityscapes and COCO datasets, respectively. The results show that a hybrid approach combining MC Dropout and GroupEnsemble outperforms Deep Ensembles on several metrics at a fraction of the cost. The code is available at https://github.com/yutongy98/GroupEnsemble.
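
The group isolation described in the abstract can be illustrated with a block-diagonal attention mask. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names (`group_attention_mask`, `masked_self_attention`, `box_spread`) are hypothetical, the attention is a single head without learned projections, and the boxes passed to `box_spread` are assumed to have already been matched to the same object across groups (the abstract does not say how that matching is done).

```python
import numpy as np


def group_attention_mask(num_groups: int, queries_per_group: int) -> np.ndarray:
    """Boolean mask of shape (G*Q, G*Q); True marks a blocked query pair.

    Queries may attend only to queries of their own group, so the allowed
    region is block-diagonal.
    """
    group_id = np.repeat(np.arange(num_groups), queries_per_group)
    return group_id[:, None] != group_id[None, :]


def masked_self_attention(x: np.ndarray, blocked: np.ndarray) -> np.ndarray:
    """Toy single-head self-attention without learned projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(blocked, -np.inf, scores)   # blocked pairs get zero weight
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


def box_spread(matched_boxes: np.ndarray) -> np.ndarray:
    """Per-coordinate variance over groups, for boxes of shape (G, 4)
    that are assumed to describe the same object."""
    return matched_boxes.var(axis=0)


# Hypothetical sizes: 3 query groups, 4 queries each, 8-dim embeddings.
G, Q, D = 3, 4, 8
rng = np.random.default_rng(0)
queries = rng.normal(size=(G * Q, D))
mask = group_attention_mask(G, Q)

# One masked pass over all groups at once.
joint = masked_self_attention(queries, mask)

# The same groups processed one at a time, with nothing blocked.
no_block = np.zeros((Q, Q), dtype=bool)
alone = np.concatenate(
    [masked_self_attention(queries[g * Q:(g + 1) * Q], no_block) for g in range(G)]
)
```

In this toy setting `joint` and `alone` agree up to floating-point error, which is the property the abstract relies on: one masked pass over all groups behaves like running each group through the shared decoder separately. The variance returned by `box_spread` is only one possible way to summarize spatial disagreement between groups.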
