TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets
March 29, 2026
Authors: Zhixuan Liu, Peter Schaldenbrand, Yijun Li, Long Mai, Aniruddha Mahapatra, Cusuh Ham, Jean Oh, Jui-Hsien Wang
cs.AI
Abstract
We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce videos of strong overall quality, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without compromising subject identity, background consistency, or temporal coherence. TokenDial builds on the observation that additive offsets in the intermediate spatiotemporal visual patch-token space form semantic control directions: adjusting the offset magnitude yields coherent, predictable edits to both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial's effectiveness on diverse attributes and prompts, achieving stronger controllability and higher-quality edits than state-of-the-art baselines, supported by extensive quantitative evaluation and human studies.
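
To make the core idea concrete, the sketch below shows one way a learned additive token offset could be applied at inference time, assuming a frozen transformer backbone whose intermediate spatiotemporal patch tokens are accessible. This is a minimal illustration under those assumptions, not the authors' implementation; the names `OffsetControlledBlock`, `token_offset`, and `slider` are hypothetical.

```python
# Minimal sketch of slider-style attribute control via an additive
# offset in an intermediate patch-token space. Hypothetical names;
# not TokenDial's actual API.
import torch
import torch.nn as nn


class OffsetControlledBlock(nn.Module):
    """Wraps one frozen backbone block; adds a learned attribute offset,
    scaled by a user-chosen slider value, to every patch token."""

    def __init__(self, block: nn.Module, token_dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad_(False)  # backbone stays frozen
        # One learned direction in token space per attribute;
        # this is the only trainable parameter here.
        self.token_offset = nn.Parameter(torch.zeros(token_dim))

    def forward(self, tokens: torch.Tensor, slider: float = 0.0) -> torch.Tensor:
        # tokens: (batch, num_spatiotemporal_tokens, token_dim)
        # slider = 0 reproduces the unmodified backbone; increasing
        # |slider| moves the output further along the semantic direction.
        return self.block(tokens + slider * self.token_offset)


# Usage with a stand-in block (a real backbone block would go here).
block = OffsetControlledBlock(nn.Linear(64, 64), token_dim=64)
tokens = torch.randn(2, 16, 64)  # (batch, tokens, dim)
mild, strong = block(tokens, slider=0.5), block(tokens, slider=2.0)
```

Because only `token_offset` is trainable, this setup mirrors the abstract's claim of learning attribute-specific offsets without retraining the backbone; the offset itself would be fit with attribute-specific supervision such as the semantic direction matching or motion-magnitude scaling signals described above.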