EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

Abstract

Most existing benchmarks for egocentric vision understanding focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refinement through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types, with more than 300 hours of human work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and egocentric depth estimation at night, that further explore the boundaries of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. All the data and code will be made available upon acceptance.
