GroupEnsemble: Efficient Uncertainty Estimation for DETR-based Object Detection

Abstract

Detection Transformer (DETR) and its variants show strong performance on object detection, a key task for autonomous systems. However, these models have a critical limitation: their confidence scores reflect only semantic uncertainty and fail to capture the equally important spatial uncertainty, which leaves the assessment of detection reliability incomplete. Deep Ensembles address this by providing high-quality spatial uncertainty estimates, but their immense memory consumption makes them impractical for real-world applications. A cheaper alternative, Monte Carlo (MC) Dropout, suffers from high latency because it requires multiple forward passes at inference time to estimate uncertainty. To address these limitations, we introduce GroupEnsemble, an efficient and effective uncertainty estimation method for DETR-like models. GroupEnsemble simultaneously predicts multiple individual detection sets by feeding additional diverse groups of object queries to the transformer decoder during inference. Each query group is transformed by the shared decoder in isolation and predicts a complete detection set for the same input. An attention mask is applied in the decoder to prevent inter-group query interactions, so that each group detects independently and the resulting ensemble yields reliable uncertainty estimates. By leveraging the decoder's inherent parallelism, GroupEnsemble estimates uncertainty in a single forward pass without sequential repetition. We validated our method on autonomous driving scenes and common daily scenes using the Cityscapes and COCO datasets, respectively. The results show that a hybrid approach combining MC-Dropout and GroupEnsemble outperforms Deep Ensembles on several metrics at a fraction of the cost. The code is available at https://github.com/yutongy98/GroupEnsemble.
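The group-isolation idea described in the abstract can be illustrated with a minimal sketch. It is not taken from the released code. The function names, the convention that `True` marks a blocked query pair, and the use of per-coordinate variance as a spatial uncertainty measure are all assumptions made for illustration. The abstract does not state how detections from different groups are matched or aggregated, so the second function assumes the boxes passed in already describe the same object.

```python
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def group_attention_mask(num_groups: int, queries_per_group: int) -> List[List[bool]]:
    """Build a block-diagonal self-attention mask over all object queries.

    Assumed convention: True means query i may NOT attend to query j.
    Queries attend only to queries in their own group, so each group
    behaves like an independent ensemble member.
    """
    total = num_groups * queries_per_group
    return [
        [(i // queries_per_group) != (j // queries_per_group) for j in range(total)]
        for i in range(total)
    ]


def box_mean_and_variance(
    boxes: Sequence[Tuple[float, float, float, float]],
) -> Tuple[List[float], List[float]]:
    """Aggregate one box per group into a mean box and per-coordinate variance.

    Assumes the boxes were already matched to the same object; the variance
    is a hypothetical stand-in for a spatial uncertainty estimate.
    """
    coords = list(zip(*boxes))  # regroup values by coordinate
    return [mean(c) for c in coords], [pvariance(c) for c in coords]


if __name__ == "__main__":
    # Two groups of two queries: blocked pairs form the off-diagonal blocks.
    for row in group_attention_mask(num_groups=2, queries_per_group=2):
        print(["x" if blocked else "." for blocked in row])

    # Two groups disagree slightly on the box center x and the height.
    center, spread = box_mean_and_variance(
        [(10.0, 20.0, 4.0, 6.0), (12.0, 20.0, 4.0, 8.0)]
    )
    print(center, spread)
```

Because the mask is block-diagonal, all groups can be concatenated and passed through the decoder together, which is consistent with the single-forward-pass property claimed above. A real implementation would build the mask as a tensor in the model's framework and follow that framework's own mask convention.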
