PolicyTrim: Improving the Intrinsic Policy Efficiency of Vision-Language-Action Models

Abstract

Vision-Language-Action (VLA) models provide a unified paradigm for robotic manipulation, yet their real-world deployment is often bottlenecked by execution efficiency. While existing efforts predominantly focus on compute-centric efficiency to reduce per-step inference latency, the intrinsic policy efficiency of these models remains largely unexplored. Policy efficiency is fundamentally affected by two factors, namely the effective executable length of predicted action chunks and the total physical steps required to complete a task. These two factors jointly determine the total number of forward inference calls during execution. We observe that current VLA policies struggle with planning unreliability and action redundancy, suffering from severe prediction degradation at the tail of action chunks and tending to generate unnecessarily redundant physical steps. To address this, we propose PolicyTrim, a reinforcement learning-based post-training framework that extends the reliable action chunk length and reduces redundant physical steps. For reliable chunk extension, we employ a dynamic exploration strategy that explicitly rewards the successful completion of longer executable lengths, progressively pushing the trustworthy prediction horizon to its empirical limit. For step efficiency, we design a redundancy-aware reward that directly favors successful task completions with fewer steps while penalizing unreproducible shortcuts, effectively eliminating redundant physical actions. Extensive experiments across three benchmarks and three VLA models demonstrate that PolicyTrim improves action chunk utilization by 3× and reduces physical execution steps by 51.4%. Ultimately, our framework delivers up to a 5.83× end-to-end deployment speedup without compromising task success rates.
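
To make concrete how the two factors named above determine the number of forward inference calls, here is a minimal sketch. It assumes the simplest possible accounting: every call yields the same number of executed steps, so the call count is the total physical steps divided by the executable chunk length, rounded up. The function name `inference_calls` and all numbers are hypothetical illustrations, not the paper's code or results.

```python
def inference_calls(total_steps: int, executable_len: int) -> int:
    """Forward passes needed to execute `total_steps` physical steps.

    Assumption (not from the paper): each forward pass contributes exactly
    `executable_len` executed steps before the policy is queried again.
    """
    if total_steps < 0 or executable_len <= 0:
        raise ValueError("total_steps must be >= 0 and executable_len > 0")
    # Integer ceiling division: a partially used final chunk still costs a call.
    return -(-total_steps // executable_len)


# Hypothetical episode: 400 physical steps, 4 steps executed per call.
baseline_calls = inference_calls(total_steps=400, executable_len=4)

# Same hypothetical task after shortening the trajectory and
# executing more of each predicted chunk.
trimmed_calls = inference_calls(total_steps=200, executable_len=12)

print(baseline_calls, trimmed_calls)
```

Under this simplified model the two levers multiply: halving the steps and tripling the executable length cuts the call count roughly sixfold. Measured end-to-end speedups also depend on per-call latency and on the time spent physically executing the actions.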