SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation

Abstract

Imitation Learning (IL) enables robots to acquire manipulation skills from expert demonstrations. Diffusion Policy (DP) models multi-modal expert behaviors, but its performance degrades as the observation horizon grows, which limits long-horizon manipulation. We propose Self-Evolving Gated Attention (SEGA), a temporal module that maintains a time-evolving latent state via gated attention. Its recurrent updates compress long-horizon observations into a fixed-size representation while filtering out irrelevant temporal information. Integrating SEGA into DP yields Self-Evolving Diffusion Policy (SeedPolicy), which resolves the temporal modeling bottleneck and allows the horizon to be extended with moderate overhead. On the RoboTwin 2.0 benchmark with 50 manipulation tasks, SeedPolicy outperforms DP and other IL baselines. Averaged across CNN and Transformer backbones, SeedPolicy achieves a 36.8% relative improvement over DP in clean settings and a 169% relative improvement in randomized challenging settings. Compared to vision-language-action models such as RDT with 1.2B parameters, SeedPolicy achieves competitive performance with one to two orders of magnitude fewer parameters, demonstrating strong efficiency and scalability. These results establish SeedPolicy as a state-of-the-art imitation learning method for long-horizon robotic manipulation. Code is available at: https://github.com/Youqiang-Gui/SeedPolicy.
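The abstract does not give SEGA's equations, so the following is only an illustrative sketch of the general idea it names: a fixed-size latent state that attends to each new chunk of observation tokens and is then blended with the result through a gate. Every name here (`gated_attention_update`, `compress_history`, `gate_bias`), the scalar sigmoid gate, and the absence of learned projections are assumptions made for illustration. None of it is taken from the SeedPolicy code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention_update(state, tokens, gate_bias=0.0):
    """One recurrent step (hypothetical form, not the paper's).

    state  : M latent slots, each a list of D floats
    tokens : T observation tokens for the current step, each D floats
    Returns a new state with the same M x D shape.
    """
    dim = len(state[0])
    scale = 1.0 / math.sqrt(dim)
    new_state = []
    for slot in state:
        # Cross-attention: the slot is the query, the tokens are keys and values.
        weights = softmax([dot(slot, tok) * scale for tok in tokens])
        candidate = [
            sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(dim)
        ]
        # Assumed scalar gate: how much of the candidate enters the slot.
        gate = sigmoid(dot(slot, candidate) * scale + gate_bias)
        new_state.append(
            [(1.0 - gate) * s + gate * c for s, c in zip(slot, candidate)]
        )
    return new_state


def compress_history(observations, init_state):
    """Fold a sequence of observation steps into one fixed-size state."""
    state = init_state
    for tokens in observations:
        state = gated_attention_update(state, tokens)
    return state
```

In this sketch the state keeps its M x D shape however many steps are folded in, which is the property that lets the horizon grow without enlarging the policy's conditioning input. A strongly negative `gate_bias` leaves the state almost unchanged, which is one simple way a gate can filter out uninformative steps.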