π-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs
March 2, 2026
作者: Siting Wang, Xiaofeng Wang, Zheng Zhu, Minnan Pei, Xinyu Cui, Cheng Deng, Jian Zhao, Guan Huang, Haifeng Zhang, Jun Wang
cs.AI
Abstract
Flow-based vision-language-action (VLA) models excel in embodied control but suffer from intractable likelihoods during multi-step sampling, hindering online reinforcement learning. We propose π-StepNFT (Step-wise Negative-aware Fine-Tuning), a critic- and likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We identify that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, π-StepNFT unlocks latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in out-of-distribution (OOD) scenarios by preventing overfitting to multimodal features. This property offers a scalable solution for complex real-world applications.