The N-Body Problem: Parallel Execution from Single-Person Egocentric Video

Abstract

Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how N individuals can hypothetically perform the same set of tasks observed in this video. The goal is to maximise speed-up, but naively assigning video segments to individuals often violates real-world constraints, leading to physically impossible scenarios such as two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts and causal constraints). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 improves action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while reducing collision rates, object conflicts and causal conflicts by 55%, 45% and 55% respectively.
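To make the trade-off between speed-up and feasibility concrete, the following is a minimal sketch of how a parallel plan might be scored. The abstract does not give the metric definitions, so everything here is an assumption for illustration: the `Segment` schema, the definition of speed-up as original duration divided by the plan's makespan, and the rule that an object conflict occurs when two different people use the same object during overlapping time intervals.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Segment:
    """One video segment as scheduled in the parallel plan (hypothetical schema)."""
    person: int         # which of the N individuals performs the segment
    start: float        # start time in the parallel plan, in seconds
    end: float          # end time in the parallel plan, in seconds
    objects: frozenset  # objects handled during the segment


def speed_up(original_duration, plan):
    """Assumed definition: original video length divided by the plan's makespan."""
    makespan = max(s.end for s in plan) - min(s.start for s in plan)
    return original_duration / makespan


def object_conflicts(plan):
    """Assumed definition: pairs of segments in which different people
    use at least one common object during overlapping time intervals."""
    conflicts = []
    for a, b in combinations(plan, 2):
        overlaps = a.start < b.end and b.start < a.end
        if a.person != b.person and overlaps and a.objects & b.objects:
            conflicts.append((a, b))
    return conflicts


# Toy plan for N = 2: a 100-second video split between two people.
plan = [
    Segment(0, 0.0, 40.0, frozenset({"knife", "board"})),
    Segment(0, 40.0, 60.0, frozenset({"pan"})),
    Segment(1, 0.0, 30.0, frozenset({"kettle"})),
    Segment(1, 30.0, 50.0, frozenset({"knife"})),
]

print("speed-up:", speed_up(100.0, plan))
print("object conflicts:", len(object_conflicts(plan)))
```

In this toy plan the schedule is shorter than the original video, but both people hold the knife between 30 s and 40 s, which is the kind of physically impossible assignment the feasibility metrics are meant to expose. Spatial collisions and causal constraints would need additional inputs (3D positions, dependency ordering) and are not modelled in this sketch.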
