
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

March 25, 2025
Authors: Zhi Hou, Tianyi Zhang, Yuwen Xiong, Haonan Duan, Hengjun Pu, Ronglei Tong, Chengyang Zhao, Xizhou Zhu, Yu Qiao, Jifeng Dai, Yuntao Chen
cs.AI

Abstract

While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces. We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. Departing from prior methods that condition denoising on fused embeddings via shallow networks, Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. This design explicitly models action deltas and environmental nuances. By scaling the diffusion action denoiser alongside the Transformer backbone, Dita effectively integrates cross-embodiment datasets spanning diverse camera perspectives, observation scenes, tasks, and action spaces. This synergy enhances robustness to such variations and facilitates the successful execution of long-horizon tasks. Evaluations across extensive benchmarks demonstrate state-of-the-art or comparable performance in simulation. Notably, Dita achieves robust real-world adaptation to environmental variations and complex long-horizon tasks through 10-shot finetuning, using only third-person camera inputs. The architecture establishes a versatile, lightweight, and open-source baseline for generalist robot policy learning. Project Page: https://robodita.github.io.
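To make the in-context conditioning idea concrete, below is a minimal PyTorch-style sketch: noised action tokens are concatenated with raw visual and language tokens in a single Transformer sequence, and the network predicts the diffusion noise for each action step. Everything here (the module name InContextActionDenoiser, the dimensions, the embedding choices, and the noise-prediction head) is an illustrative assumption under a standard DDPM-style setup, not Dita's released implementation.

```python
# Hypothetical sketch of in-context conditioning for diffusion-based action
# denoising. Names, sizes, and details are illustrative assumptions only.
import torch
import torch.nn as nn

class InContextActionDenoiser(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=6,
                 action_dim=7, horizon=16, n_timesteps=1000):
        super().__init__()
        self.horizon = horizon
        self.action_proj = nn.Linear(action_dim, d_model)     # lift actions into token space
        self.time_embed = nn.Embedding(n_timesteps, d_model)  # diffusion-step embedding
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.noise_head = nn.Linear(d_model, action_dim)      # per-token noise prediction

    def forward(self, lang_tokens, vis_tokens, noised_actions, t):
        # lang_tokens:    (B, N_l, d_model) instruction tokens
        # vis_tokens:     (B, N_v, d_model) raw visual tokens from past observations
        # noised_actions: (B, horizon, action_dim) action chunk with noise added
        # t:              (B,) integer diffusion timesteps
        act = self.action_proj(noised_actions) + self.time_embed(t)[:, None, :]
        # In-context conditioning: rather than squeezing observations into one
        # fused embedding for a shallow denoising head, every action token can
        # attend directly to every raw visual/language token in one sequence.
        seq = torch.cat([lang_tokens, vis_tokens, act], dim=1)
        out = self.backbone(seq)
        return self.noise_head(out[:, -self.horizon:])        # predicted noise per action step

# Toy forward pass with random tensors standing in for real tokens.
model = InContextActionDenoiser()
eps_pred = model(
    lang_tokens=torch.randn(2, 8, 512),
    vis_tokens=torch.randn(2, 64, 512),
    noised_actions=torch.randn(2, 16, 7),
    t=torch.randint(0, 1000, (2,)),
)
print(eps_pred.shape)  # torch.Size([2, 16, 7])
```

Training such a denoiser would typically minimize the MSE between eps_pred and the injected noise. The sketch uses full bidirectional self-attention for simplicity; the design contrast with fused-embedding conditioning is that action tokens interact with the raw observation tokens through attention rather than through a single pooled vector.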
