EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

Abstract

Most existing benchmarks for egocentric vision understanding focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refinement through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types, with more than 300 hours of human work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and egocentric depth estimation at night, that further explore the boundaries of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. All the data and code will be made available upon acceptance.
