Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
January 5, 2026
Authors: Jing Tan, Zhaoyang Zhang, Yantao Shen, Jiarui Cai, Shuo Yang, Jiajun Wu, Wei Xia, Zhuowen Tu, Stefano Soatto
cs.AI
Abstract
We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations (such as translating, rotating, or resizing objects) because paired supervision is scarce and pixel-level optimization is limited. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial-reward-guided model aligns geometric transformations with the linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that directly evaluate displacement, rotation, and scaling behaviors, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.
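To make the GRPO idea concrete, the following is a minimal sketch of how a group of rollouts for one instruction might be scored and turned into group-relative advantages. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the reward formulas, so `translation_reward` (displacement of the object center projected onto the instructed direction) is a hypothetical stand-in for an object-centric spatial reward, and the object centers are assumed to come from some external detector. Only the advantage normalization, reward minus group mean divided by group standard deviation, follows the standard GRPO formulation.

```python
import math
from statistics import mean, pstdev


def translation_reward(src_center, edited_center, target_direction):
    """Hypothetical object-centric reward (an assumption, not from the paper):
    the object's displacement projected onto the instructed direction.
    Positive when the object moved the requested way, negative otherwise."""
    dx = edited_center[0] - src_center[0]
    dy = edited_center[1] - src_center[1]
    norm = math.hypot(target_direction[0], target_direction[1])
    if norm == 0:
        return 0.0
    return (dx * target_direction[0] + dy * target_direction[1]) / norm


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization: each rollout's reward is compared
    with the mean and standard deviation of its own group, so no learned
    value function is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative data: one instruction ("move the cup to the right") and four
# rollouts. Centers are (x, y) in pixels; the values are made up.
src = (100.0, 100.0)
direction = (1.0, 0.0)  # "right" in image coordinates
rollout_centers = [(140.0, 100.0), (100.0, 100.0), (80.0, 105.0), (120.0, 98.0)]

rewards = [translation_reward(src, c, direction) for c in rollout_centers]
advantages = group_relative_advantages(rewards)
```

Rollouts that moved the object further in the instructed direction than the group average receive positive advantages, and the rest receive negative ones. Because the signal is relative within the group, no paired ground-truth edited image is required.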