Learning Situated Awareness in the Real World

Abstract

A core aspect of human perception is situated awareness: the ability to relate ourselves to the surrounding physical environment and to reason about possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relations that require reasoning relative to the agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses, spanning diverse indoor and outdoor environments, and 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding through six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even for the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation toward understanding physically grounded, observer-centric dynamics.