

EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

October 7, 2025
Authors: Deheng Zhang, Yuqian Fu, Runyi Yang, Yang Miao, Tianwen Qian, Xu Zheng, Guolei Sun, Ajad Chhatkuli, Xuanjing Huang, Yu-Gang Jiang, Luc Van Gool, Danda Pani Paudel
cs.AI

Abstract

Most existing benchmarks for egocentric vision understanding focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refined through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3,658 QA pairs across 90 videos, spanning 12 diverse QA types and representing more than 300 hours of human annotation work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and nighttime egocentric depth estimation, which further explore the limits of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. All data and code will be made available upon acceptance.