Boximator: Generating Rich and Controllable Motions for Video Synthesis

Abstract

Generating rich and controllable motion is a pivotal challenge in video synthesis. We propose Boximator, a new approach for fine-grained motion control. Boximator introduces two constraint types: hard boxes and soft boxes. Users select objects in the conditional frame using hard boxes, then use either type of box to roughly or rigorously define an object's position, shape, or motion path in future frames. Boximator functions as a plug-in for existing video diffusion models. Its training process preserves the base model's knowledge by freezing the original weights and training only the control module. To address training challenges, we introduce a novel self-tracking technique that greatly simplifies the learning of box-object correlations. Empirically, Boximator achieves state-of-the-art video quality (FVD) scores, improving on two base models, and the scores improve further once box constraints are incorporated. Its robust motion controllability is validated by drastic increases in the bounding box alignment metric. Human evaluation also shows that users prefer Boximator's generation results over those of the base model.
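The abstract does not specify how box constraints are encoded or how alignment is scored, so the following is only a minimal sketch of the hard/soft distinction under stated assumptions. It assumes a hard box is satisfied when the generated object's bounding box overlaps it closely (IoU at or above a threshold), and a soft box is satisfied when the object's box lies anywhere inside the given region. The names `BoxConstraint`, `satisfied`, and `iou_threshold` are hypothetical and are not taken from Boximator.

```python
from dataclasses import dataclass
from typing import Tuple

# Axis-aligned box as (x_min, y_min, x_max, y_max); this layout is an assumption.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class BoxConstraint:
    """Hypothetical container for one box constraint on one frame."""
    frame: int   # index of the frame the constraint applies to
    box: Box     # the user-drawn box
    hard: bool   # True: rigorous (tight) box; False: rough (soft) region


def area(b: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter_area = area(inter)
    union = area(a) + area(b) - inter_area
    return inter_area / union if union > 0 else 0.0


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def satisfied(c: BoxConstraint, observed: Box, iou_threshold: float = 0.5) -> bool:
    """Assumed check: hard boxes need tight overlap, soft boxes only containment."""
    if c.hard:
        return iou(c.box, observed) >= iou_threshold
    return contains(c.box, observed)
```

For example, a hard constraint `BoxConstraint(0, (0, 0, 10, 10), hard=True)` accepts an observed box `(0, 0, 10, 9)` because the two overlap closely, while a soft constraint over `(0, 0, 20, 20)` accepts any observed box that stays inside that region, whatever its size.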
