TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets

Abstract

We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce strong holistic videos, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without drift in identity, background, or temporal coherence. TokenDial is built on the observation that additive offsets in the intermediate spatiotemporal visual patch-token space form a semantic control direction: adjusting the offset magnitude yields coherent, predictable edits to both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial's effectiveness on diverse attributes and prompts, achieving stronger controllability and higher-quality edits than state-of-the-art baselines, supported by extensive quantitative evaluation and human studies.
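The core idea of an additive offset scaled by a slider value can be illustrated with a minimal sketch. The abstract does not specify the layer at which offsets are applied, their shape, or how they are learned, so everything below is an assumption made for illustration: the function name `apply_token_offset`, the flattened token layout, the scalar `alpha`, and the toy numbers are all hypothetical and are not the authors' implementation.

```python
from typing import List

Token = List[float]  # one patch token: a D-dimensional feature vector (assumed layout)


def apply_token_offset(tokens: List[Token], offset: List[Token], alpha: float) -> List[Token]:
    """Return tokens + alpha * offset, element-wise, without mutating the inputs.

    `alpha` plays the role of the slider: 0.0 leaves the tokens unchanged,
    and larger magnitudes move them further along the offset direction.
    """
    if len(tokens) != len(offset):
        raise ValueError("tokens and offset must cover the same number of patches")
    shifted = []
    for tok, off in zip(tokens, offset):
        if len(tok) != len(off):
            raise ValueError("token and offset dimensions must match")
        shifted.append([t + alpha * o for t, o in zip(tok, off)])
    return shifted


# Toy data: 2 frames x 2 patches per frame, flattened to 4 tokens of dimension 3.
tokens = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0], [2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
# Hypothetical offset for a single attribute (values invented for illustration).
offset = [[1.0, 0.0, -1.0], [0.0, 2.0, 0.0], [-1.0, 1.0, 0.0], [0.0, 0.0, 4.0]]

# Sweep the slider and print the first edited token at each setting.
for alpha in (0.0, 0.5, 1.0):
    edited = apply_token_offset(tokens, offset, alpha)
    print(alpha, edited[0])
```

In this sketch the edit is linear in `alpha`, so `alpha = 0.0` reproduces the original tokens exactly. Whether the actual method applies a single scalar uniformly across all tokens is not stated in the abstract.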
