EgoNormia: Benchmarking Physical Social Norm Understanding

Abstract

Human activity is moderated by norms. When performing actions in the real world, humans not only follow norms but also weigh the trade-offs between different norms. However, machines are often trained without explicit supervision on norm understanding and reasoning, especially when the norms are grounded in a physical and social context. To improve and evaluate the normative reasoning capability of vision-language models (VLMs), we present EgoNormia |ε|, a dataset of 1,853 egocentric videos of human interactions, each paired with two related questions that evaluate both the prediction and the justification of normative actions. The normative actions span seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline that combines video sampling, automatic answer generation, filtering, and human validation. Our work demonstrates that current state-of-the-art VLMs lack robust norm understanding, scoring at most 45% on EgoNormia, versus a human baseline of 92%. Our per-dimension analysis highlights significant safety and privacy risks, as well as a lack of collaboration and communication capability, when these models are applied to real-world agents. We additionally show that a retrieval-based generation method can use EgoNormia to enhance normative reasoning in VLMs.
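
As an illustration of the per-dimension analysis mentioned in the abstract, the following is a minimal sketch of how accuracy on the two questions (action prediction and justification) might be tallied per norm category. Everything beyond the seven category names is an assumption made for illustration: the `NormItem` record, its field names, the multiple-choice answer indices, the one-category-per-video simplification, and the `per_category_accuracy` helper are hypothetical and do not describe the authors' data schema or scoring protocol.

```python
from collections import defaultdict
from typing import NamedTuple

# The seven norm categories named in the abstract.
CATEGORIES = (
    "safety", "privacy", "proxemics", "politeness",
    "cooperation", "coordination/proactivity", "communication/legibility",
)


class NormItem(NamedTuple):
    """Hypothetical record for one video; field names are illustrative."""
    video_id: str
    category: str              # assumed: a single category label per video
    action_answer: int         # assumed: index of the correct normative action
    justification_answer: int  # assumed: index of the correct justification


def per_category_accuracy(items, predictions):
    """Return {category: (action_accuracy, justification_accuracy)}.

    `predictions` maps video_id -> (action_choice, justification_choice).
    """
    # Per category: [total items, action hits, justification hits].
    counts = defaultdict(lambda: [0, 0, 0])
    for item in items:
        if item.category not in CATEGORIES:
            raise ValueError(f"unknown category: {item.category}")
        action_choice, justification_choice = predictions[item.video_id]
        tally = counts[item.category]
        tally[0] += 1
        tally[1] += int(action_choice == item.action_answer)
        tally[2] += int(justification_choice == item.justification_answer)
    return {cat: (a / n, j / n) for cat, (n, a, j) in counts.items()}
```

Reporting the two accuracies separately, rather than as a single combined score, keeps it visible whether a model picks the right action for the wrong reason. How the reported 45% and 92% figures are aggregated is not stated in the abstract, so this sketch does not attempt to reproduce them.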
