Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
January 5, 2026
Authors: Jing Tan, Zhaoyang Zhang, Yantao Shen, Jiarui Cai, Shuo Yang, Jiajun Wu, Wei Xia, Zhuowen Tu, Stefano Soatto
cs.AI
Abstract
We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations (such as translating, rotating, or resizing objects) due to scarce paired supervision and the limits of pixel-level optimization. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. Spatial rewards guide the model to align geometric transformations with the linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that directly evaluate displacement, rotation, and scaling behaviors, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.
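GRPO, as referenced in the abstract, scores each rollout in a group relative to the other rollouts rather than against a learned value baseline. The abstract does not give Talk2Move's exact formulation, so the following is a minimal sketch of the standard group-relative advantage computation; the reward values are purely illustrative.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standard GRPO-style advantage: standardize each rollout's reward
    against the mean and std of its own group, so no learned critic
    (value baseline) is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Illustrative group of 4 rollouts (e.g., generated from one input image
# under lightweight textual variations of the same instruction), each
# scored by a spatial reward. Rollouts above the group mean receive
# positive advantages and are reinforced.
rewards = np.array([0.82, 0.31, 0.64, 0.12])
print(group_relative_advantages(rewards))
```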
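To make the idea of an object-centric spatial reward concrete, here is a hypothetical sketch that scores how well an observed object transformation (translation, rotation, scale) matches the transformation implied by the instruction. The Gaussian kernels, bandwidths, equal weighting, and dictionary interface are all assumptions for illustration, not the paper's actual reward design.

```python
import numpy as np

def spatial_reward(pred: dict, target: dict,
                   sigma_t: float = 0.1,   # translation bandwidth (normalized coords)
                   sigma_r: float = 15.0,  # rotation bandwidth (degrees)
                   sigma_s: float = 0.1    # log-scale bandwidth
                   ) -> float:
    """Hypothetical object-centric reward: compare the object's observed
    displacement, rotation, and scale change against target values, each
    term decaying with a Gaussian kernel, then average the three terms."""
    r_t = np.exp(-np.sum((pred["t"] - target["t"]) ** 2) / (2 * sigma_t ** 2))
    r_r = np.exp(-((pred["rot"] - target["rot"]) ** 2) / (2 * sigma_r ** 2))
    r_s = np.exp(-((np.log(pred["scale"]) - np.log(target["scale"])) ** 2)
                 / (2 * sigma_s ** 2))
    return float((r_t + r_r + r_s) / 3.0)

# Example: an edit that moved the object slightly right, rotated it 5
# degrees, and enlarged it 10%, scored against an instruction asking for
# a pure rightward translation. The reward approaches 1 as they agree.
pred = {"t": np.array([0.18, 0.02]), "rot": 5.0, "scale": 1.1}
target = {"t": np.array([0.20, 0.00]), "rot": 0.0, "scale": 1.0}
print(spatial_reward(pred, target))
```

Evaluating each geometric component separately in this way is what would make the reward interpretable: a low score can be attributed to displacement, rotation, or scaling individually.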