TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets
March 29, 2026
Authors: Zhixuan Liu, Peter Schaldenbrand, Yijun Li, Long Mai, Aniruddha Mahapatra, Cusuh Ham, Jean Oh, Jui-Hsien Wang
cs.AI
Abstract
We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce videos of strong overall quality, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without drift in subject identity, background, or temporal coherence. TokenDial is built on the observation that additive offsets in the intermediate spatiotemporal visual patch-token space form semantic control directions, and that adjusting the offset magnitude yields coherent, predictable edits to both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial's effectiveness on diverse attributes and prompts, achieving stronger controllability and higher-quality edits than state-of-the-art baselines, as supported by extensive quantitative evaluation and human studies.
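
To make the core mechanism concrete, the minimal sketch below shows how an additive, attribute-specific token offset with a user-controlled scale could be applied to intermediate spatiotemporal patch tokens of a frozen backbone. All names here (TokenOffset, hidden_dim, the token shape, the injection point) are illustrative assumptions, not the paper's actual architecture or training procedure.

import torch
import torch.nn as nn

class TokenOffset(nn.Module):
    """Hypothetical sketch of a learned, attribute-specific token offset.

    Assumes intermediate visual patch tokens of shape
    (batch, num_frames * num_patches, hidden_dim). The real TokenDial
    layer choice, offset parameterization, and training losses
    (semantic direction matching, motion-magnitude scaling) are not
    reproduced here.
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        # One learned semantic direction per attribute; the pretrained
        # backbone itself stays frozen.
        self.delta = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, tokens: torch.Tensor, alpha: float) -> torch.Tensor:
        # Slider-style control: scale the learned direction by alpha.
        # alpha = 0 reproduces the base generation; larger alpha
        # strengthens the attribute edit along the same direction.
        return tokens + alpha * self.delta


# Usage: dial the attribute strength on a batch of intermediate tokens.
offset = TokenOffset(hidden_dim=1024)
tokens = torch.randn(2, 16 * 256, 1024)  # (batch, frames * patches, dim)
edited = offset(tokens, alpha=0.7)

Because the offset is purely additive in token space, varying alpha at inference time yields the continuous control the abstract describes without any change to the backbone weights.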