
RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

January 8, 2026
Authors: Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, Jiangmiao Pang
cs.AI

Abstract

The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt-conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook state-of-the-art policy models' practical need for multi-view, temporally coherent observations, and text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide generation toward the desired scene setup. To support this, we also build a scalable pipeline that curates a visual identity pool from large robotics datasets. Training downstream vision-language-action and visuomotor policy models on our augmented manipulation data yields consistent performance gains in both simulation and real-robot settings.
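To make the conditioning idea concrete, the sketch below shows one way exemplar images could be injected as extra cross-attention context alongside text tokens in a diffusion denoiser. This is a minimal illustration under stated assumptions: the abstract does not specify the architecture, so `IdentityEncoder`, `CrossAttnDenoiser`, the token dimensions, and the cross-attention wiring are all hypothetical, not the paper's implementation.

```python
# Minimal sketch of visual identity prompting: exemplar images are encoded
# into "identity tokens" and concatenated with text tokens as cross-attention
# context for a denoiser. All module names and shapes are illustrative.
import torch
import torch.nn as nn


class IdentityEncoder(nn.Module):
    """Encodes exemplar images into identity tokens (hypothetical module)."""

    def __init__(self, d_model: int = 256, patch: int = 16):
        super().__init__()
        # Patchify each exemplar image into d_model-dim tokens.
        self.proj = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

    def forward(self, exemplars: torch.Tensor) -> torch.Tensor:
        # exemplars: (B, N, 3, H, W) -- N exemplar images per sample
        b, n, c, h, w = exemplars.shape
        x = self.proj(exemplars.flatten(0, 1))   # (B*N, d, h', w')
        x = x.flatten(2).transpose(1, 2)          # (B*N, tokens, d)
        return x.reshape(b, -1, x.shape[-1])      # (B, N*tokens, d)


class CrossAttnDenoiser(nn.Module):
    """One denoiser block attending over text + identity tokens."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, latents: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # latents: (B, T, d) noisy video latents; context: (B, S, d)
        out, _ = self.attn(latents, context, context)
        return self.norm(latents + out)


# Toy usage: exemplar images of the desired background/objects steer generation.
b, n_views, n_frames, d = 2, 3, 8, 256
exemplars = torch.randn(b, 2, 3, 64, 64)          # visual identity prompts
text_tokens = torch.randn(b, 16, d)               # text prompt embedding
latents = torch.randn(b, n_views * n_frames, d)   # multi-view video latents

id_tokens = IdentityEncoder(d)(exemplars)
context = torch.cat([text_tokens, id_tokens], dim=1)
denoised = CrossAttnDenoiser(d)(latents, context)
print(denoised.shape)  # torch.Size([2, 24, 256])
```

Concatenating identity tokens with text tokens lets a single cross-attention pathway consume both prompts, which is one common way to add image conditioning without changing the denoiser's interface.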