Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
March 25, 2025
作者: Zhi Hou, Tianyi Zhang, Yuwen Xiong, Haonan Duan, Hengjun Pu, Ronglei Tong, Chengyang Zhao, Xizhou Zhu, Yu Qiao, Jifeng Dai, Yuntao Chen
cs.AI
Abstract
While recent vision-language-action models trained on diverse robot datasets
exhibit promising generalization capabilities with limited in-domain data,
their reliance on compact action heads to predict discretized or continuous
actions constrains adaptability to heterogeneous action spaces. We present
Dita, a scalable framework that leverages Transformer architectures to directly
denoise continuous action sequences through a unified multimodal diffusion
process. Departing from prior methods that condition denoising on fused
embeddings via shallow networks, Dita employs in-context conditioning --
enabling fine-grained alignment between denoised actions and raw visual tokens
from historical observations. This design explicitly models action deltas and
environmental nuances. By scaling the diffusion action denoiser alongside the
Transformer's scalability, Dita effectively integrates cross-embodiment
datasets across diverse camera perspectives, observation scenes, tasks, and
action spaces. Such synergy enhances robustness to diverse variations and
facilitates the successful execution of long-horizon tasks. Evaluations across
extensive benchmarks demonstrate state-of-the-art or comparable performance in
simulation. Notably, Dita achieves robust real-world adaptation to
environmental variances and complex long-horizon tasks through 10-shot
finetuning, using only third-person camera inputs. The architecture establishes
a versatile, lightweight and open-source baseline for generalist robot policy
learning. Project Page: https://robodita.github.io.
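To make the in-context conditioning idea concrete, below is a minimal, illustrative PyTorch-style sketch (not the authors' released code): noised continuous action tokens are concatenated with raw visual and language tokens and denoised by a plain Transformer, instead of conditioning a shallow action head on a single fused embedding. All module names, dimensions, and the DDPM-style cosine noise schedule are assumptions made for illustration.

```python
# Minimal sketch of a diffusion Transformer policy with in-context conditioning.
# Everything here (class name, sizes, schedule) is assumed for illustration only.
import torch
import torch.nn as nn


class DiffusionTransformerPolicy(nn.Module):
    def __init__(self, dim=512, action_dim=7, horizon=16, layers=6, heads=8):
        super().__init__()
        self.action_in = nn.Linear(action_dim, dim)   # embed noised action chunk
        self.action_out = nn.Linear(dim, action_dim)  # predict the injected noise
        self.time_mlp = nn.Sequential(                # diffusion timestep embedding
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True
        )
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.horizon = horizon

    def forward(self, lang_tokens, visual_tokens, noisy_actions, t):
        # lang_tokens:   (B, L_txt, dim)  encoded instruction tokens
        # visual_tokens: (B, L_img, dim)  raw visual tokens from historical observations
        # noisy_actions: (B, horizon, action_dim)  noised continuous action sequence
        # t:             (B,)             diffusion timestep
        a = self.action_in(noisy_actions)
        time_tok = self.time_mlp(t.float().unsqueeze(-1)).unsqueeze(1)  # (B, 1, dim)
        # In-context conditioning: conditioning tokens and action tokens share one
        # sequence, so attention can align each denoised action with individual
        # visual tokens rather than with a single fused embedding.
        seq = torch.cat([lang_tokens, visual_tokens, time_tok, a], dim=1)
        out = self.backbone(seq)
        return self.action_out(out[:, -self.horizon:])  # outputs at the action positions


# Assumed DDPM-style training step: noise a ground-truth action chunk and
# regress the noise, conditioned on observation and language tokens.
policy = DiffusionTransformerPolicy()
B, T = 2, 1000
actions = torch.randn(B, 16, 7)          # ground-truth continuous action chunk
lang = torch.randn(B, 12, 512)           # placeholder language tokens
vis = torch.randn(B, 64, 512)            # placeholder historical visual tokens
t = torch.randint(0, T, (B,))
alpha_bar = torch.cos(t.float() / T * torch.pi / 2) ** 2   # assumed cosine schedule
noise = torch.randn_like(actions)
noisy = (alpha_bar.sqrt().view(B, 1, 1) * actions
         + (1 - alpha_bar).sqrt().view(B, 1, 1) * noise)
loss = nn.functional.mse_loss(policy(lang, vis, noisy, t), noise)
```

At inference time, the same network would be applied iteratively, starting from pure noise and progressively denoising the action sequence conditioned on the current observation tokens; the schedule and sampler above are placeholders rather than the paper's exact configuration.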