π-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs

Abstract

Flow-based vision-language-action (VLA) models excel at embodied control, but their likelihoods become intractable under multi-step sampling, which hinders online reinforcement learning. We propose π-StepNFT (Step-wise Negative-aware Fine-Tuning), a critic-free and likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We identify that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, π-StepNFT unlocks latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in out-of-distribution (OOD) scenarios by preventing overfitting to multimodal features. These properties make it a scalable and promising approach for complex real-world applications.
