TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets

Abstract

We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce high-quality videos overall, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without causing drift in identity, background, or temporal coherence. TokenDial is built on the observation that additive offsets in the intermediate spatiotemporal visual patch-token space form a semantic control direction: adjusting the offset magnitude yields coherent, predictable edits for both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial's effectiveness on diverse attributes and prompts. Extensive quantitative evaluation and human studies show that it achieves stronger controllability and higher-quality edits than state-of-the-art baselines.
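The core idea of an additive offset whose magnitude acts as a slider can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `apply_token_offset`, the flat list-of-rows token layout, and the choice to share one offset vector across all tokens are assumptions made purely for illustration, and the numeric values are toy data.

```python
from typing import List

# Hypothetical layout: one row per spatiotemporal patch token, one column per channel.
Tokens = List[List[float]]


def apply_token_offset(tokens: Tokens, offset: List[float], strength: float) -> Tokens:
    """Return a copy of tokens shifted by strength * offset.

    Assumption: the same offset vector is added to every token. The abstract
    does not say whether offsets are shared or position-specific.
    """
    if any(len(row) != len(offset) for row in tokens):
        raise ValueError("offset dimension must match token dimension")
    return [[h + strength * d for h, d in zip(row, offset)] for row in tokens]


# Toy data: 3 patch tokens with 2 channels each (illustrative values only).
tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
offset = [0.5, -0.25]  # stands in for a learned attribute-specific offset

# Sweeping the strength plays the role of moving the slider.
for strength in (0.0, 1.0, 2.0):
    print(strength, apply_token_offset(tokens, offset, strength))
```

With a strength of zero the tokens are returned unchanged, and larger strengths move every token further along the same direction, which is the sense in which the offset magnitude behaves like a continuous dial.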
