Video Models Reason Early: Exploiting Plan Commitment for Maze Solving

Abstract

Video diffusion models exhibit emergent reasoning capabilities such as solving mazes and puzzles, yet little is understood about how they reason during generation. We take a first step towards understanding this by studying the internal planning dynamics of video models, using 2D maze solving as a controlled testbed. Our investigations reveal two findings. The first is **early plan commitment**: video diffusion models commit to a high-level motion plan within the first few denoising steps, after which further denoising alters visual details but not the underlying trajectory. The second is that **path length**, not obstacle density, is the dominant predictor of maze difficulty, with a sharp failure threshold at 12 steps. This means video models can only reason over long mazes by chaining together multiple sequential generations. To demonstrate the practical benefits of our findings, we introduce **Chaining with Early Planning (ChEaP)**, which spends compute only on seeds with promising early plans and chains them together to tackle complex mazes. This improves accuracy from 7% to 67% on long-horizon mazes and by 2.5x overall on hard tasks in Frozen Lake and VR-Bench across Wan2.2-14B and HunyuanVideo-1.5. Our analysis reveals that current video models possess deeper reasoning capabilities than previously recognized, which can be elicited more reliably with better inference-time scaling.
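The screen-then-chain idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: there is no video model here, and every name (`early_plan`, `valid_prefix`, `chain_solve`, `DEMO_GRID`) is hypothetical. A seeded random walk stands in for the coarse trajectory a model would expose after its first few denoising steps, and scoring a preview by how much it reduces the shortest-path distance to the goal is an assumption made for illustration. Only the 12-step segment cap is taken from the abstract.

```python
import random
from collections import deque

MAX_SEGMENT = 12  # per-generation horizon; mirrors the 12-step threshold in the abstract
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

# Toy maze: "." is free, "#" is a wall.
DEMO_GRID = [
    "....#",
    ".##.#",
    "....#",
    ".#...",
    "...#.",
]


def is_free(grid, cell):
    r, c = cell
    return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == "."


def bfs_distances(grid, goal):
    """Shortest-path distance to `goal` for every free cell that can reach it."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES:
            nxt = (r + dr, c + dc)
            if is_free(grid, nxt) and nxt not in dist:
                dist[nxt] = dist[(r, c)] + 1
                queue.append(nxt)
    return dist


def early_plan(start, seed):
    """Hypothetical stand-in for a cheap early-step preview: a seeded random
    walk of MAX_SEGMENT moves that may run into walls."""
    rng = random.Random(seed)
    plan = [start]
    for _ in range(MAX_SEGMENT):
        dr, dc = rng.choice(MOVES)
        r, c = plan[-1]
        plan.append((r + dr, c + dc))
    return plan


def valid_prefix(grid, plan, goal):
    """Keep the plan up to the first illegal cell, or up to the goal if reached."""
    prefix = [plan[0]]
    for cell in plan[1:]:
        if prefix[-1] == goal or not is_free(grid, cell):
            break
        prefix.append(cell)
    return prefix


def chain_solve(grid, start, goal, seeds_per_round=32, max_rounds=20):
    """Preview many seeds per round, commit only to the most promising one,
    then chain the next round from where that segment ended."""
    dist = bfs_distances(grid, goal)
    path = [start]
    if start not in dist:
        return path, False
    for round_idx in range(max_rounds):
        here = path[-1]
        if here == goal:
            break
        previews = [
            valid_prefix(grid, early_plan(here, round_idx * seeds_per_round + k), goal)
            for k in range(seeds_per_round)
        ]
        # Assumed scoring rule: progress toward the goal made by the preview.
        best = max(previews, key=lambda p: dist[here] - dist[p[-1]])
        if dist[best[-1]] >= dist[here]:
            break  # no seed made progress; stop rather than commit to a bad plan
        # In a real system, only this seed would be denoised to completion.
        path.extend(best[1:])
    return path, path[-1] == goal


path, solved = chain_solve(DEMO_GRID, (0, 0), (4, 4))
print("solved:", solved, "| path length:", len(path) - 1)
```

The design point the sketch tries to capture is the cost asymmetry: previews are cheap, so many seeds can be screened per round, while the expensive full generation is reserved for one committed segment at a time.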
