EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark
October 7, 2025
Authors: Deheng Zhang, Yuqian Fu, Runyi Yang, Yang Miao, Tianwen Qian, Xu Zheng, Guolei Sun, Ajad Chhatkuli, Xuanjing Huang, Yu-Gang Jiang, Luc Van Gool, Danda Pani Paudel
cs.AI
Abstract
Most existing benchmarks for egocentric vision understanding focus primarily
on daytime scenarios, overlooking the low-light conditions that are inevitable
in real-world applications. To investigate this gap, we present EgoNight, the
first comprehensive benchmark for nighttime egocentric vision, with visual
question answering (VQA) as the core task. A key feature of EgoNight is the
introduction of day-night aligned videos, which enhance night annotation
quality using the daytime data and reveal clear performance gaps between
lighting conditions. To achieve this, we collect both synthetic videos rendered
by Blender and real-world recordings, ensuring that scenes and actions are
visually and temporally aligned. Leveraging these paired videos, we construct
EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and
refined through extensive human verification. Each QA pair is double-checked
by annotators for reliability. In total, EgoNight-VQA contains 3658 QA pairs
across 90 videos, spanning 12 diverse QA types, with more than 300 hours of
human work. Evaluations of state-of-the-art multimodal large language models
(MLLMs) reveal substantial performance drops when transferring from day to
night, underscoring the challenges of reasoning under low-light conditions.
Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night
correspondence retrieval and egocentric depth estimation at night, which further
explore the boundaries of existing models. We believe EgoNight-VQA provides a
strong foundation for advancing application-driven egocentric vision research
and for developing models that generalize across illumination domains. All the
data and code will be made available upon acceptance.