LooseControlVideo: Directorial Video Control Using Spatial Blocking

Abstract

Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes, where semantic layout and temporal dynamics are often entangled. Existing depth-conditioned models achieve good structural fidelity, but they require dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a "blocking" proxy. Users author the high-level layout and trajectories, and a video generative model produces realistic occlusions, dynamics, and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding of 3D size, orientation, and depth-ordered occlusion. Our method also supports localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks show that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Compared with current state-of-the-art layout-conditioned models, we observe a 1.2x to 3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5x to 2x increase in Occlusion Accuracy, indicating that oriented 3D primitives provide a strong geometric prior for complex, multi-agent video authoring.
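
As a rough illustration of what authoring with sparse, oriented 3D boxes might look like, the sketch below stores a box at a few keyframes and linearly interpolates its pose for the frames in between. The `OrientedBox` class, the `interpolate` helper, and the yaw-only orientation are hypothetical choices made for this example; they are not the DNOCS encoding or any interface described by the authors.

```python
from dataclasses import dataclass
import math


@dataclass
class OrientedBox:
    """Hypothetical oriented 3D box (illustrative only, not DNOCS)."""
    center: tuple[float, float, float]  # box center in scene coordinates
    size: tuple[float, float, float]    # width, height, depth
    yaw: float                          # rotation about the vertical axis, radians


def _lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation between two scalars."""
    return a + (b - a) * t


def interpolate(keyframes: dict[int, OrientedBox], frame: int) -> OrientedBox:
    """Return the box pose at `frame`, given sparse user-authored keyframes.

    Frames outside the keyframe range are clamped to the nearest keyframe.
    """
    frames = sorted(keyframes)
    if frame <= frames[0]:
        return keyframes[frames[0]]
    if frame >= frames[-1]:
        return keyframes[frames[-1]]
    for lo, hi in zip(frames, frames[1:]):
        if lo <= frame <= hi:
            t = (frame - lo) / (hi - lo)
            a, b = keyframes[lo], keyframes[hi]
            # Interpolate yaw along the shortest angular path.
            dyaw = (b.yaw - a.yaw + math.pi) % (2 * math.pi) - math.pi
            return OrientedBox(
                center=tuple(_lerp(x, y, t) for x, y in zip(a.center, b.center)),
                size=tuple(_lerp(x, y, t) for x, y in zip(a.size, b.size)),
                yaw=a.yaw + dyaw * t,
            )
    raise ValueError("unreachable: frame lies within the keyframe range")
```

With two keyframes, for example a box at the origin on frame 0 and the same box moved and turned on frame 10, `interpolate(keyframes, 5)` returns the halfway pose. Linear interpolation is only the simplest possible choice here; a real authoring tool could just as well use splines or physics-aware paths.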