The N-Body Problem: Parallel Execution from Single-Person Egocentric Video
December 12, 2025
Authors: Zhifan Zhu, Yifei Huang, Yoichi Sato, Dima Damen
cs.AI
Abstract
Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how N individuals can hypothetically perform the same set of tasks observed in this video. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios such as two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts and causal constraints). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 boosts action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while reducing collision rates, object conflicts and causal conflicts by 55%, 45% and 55%, respectively.
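The performance and feasibility metrics described above can be illustrated with a toy sketch. This is a minimal illustration under assumed definitions, not the paper's exact formulation: the `Segment` fields, the pairwise object-conflict count, and the speed-up ratio (serial duration over parallel makespan) are all hypothetical simplifications.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    actor: int            # which of the N individuals performs this segment
    start: float          # scheduled start time (seconds)
    end: float            # scheduled end time (seconds)
    objects: frozenset    # objects in use during this segment

def _overlaps(a: Segment, b: Segment) -> bool:
    # Two time intervals overlap if each starts before the other ends.
    return a.start < b.end and b.start < a.end

def object_conflicts(schedule: list[Segment]) -> int:
    """Count pairs of segments where different actors use the same
    object during overlapping time intervals (infeasible in reality)."""
    conflicts = 0
    for i, a in enumerate(schedule):
        for b in schedule[i + 1:]:
            if a.actor != b.actor and _overlaps(a, b) and a.objects & b.objects:
                conflicts += 1
    return conflicts

def speed_up(schedule: list[Segment]) -> float:
    """Total serial duration divided by the parallel makespan."""
    serial = sum(s.end - s.start for s in schedule)
    makespan = max(s.end for s in schedule) - min(s.start for s in schedule)
    return serial / makespan
```

For example, a schedule where actor 1 picks up a pan while actor 0 is still using it registers one object conflict, while two non-overlapping, object-disjoint assignments yield zero conflicts and a speed-up approaching N.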