Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition

Abstract

Diffusion-based models for robotic control, including vision-language-action (VLA) and vision-action (VA) policies, have demonstrated significant capabilities. Yet their advancement is constrained by the high cost of acquiring large-scale interaction datasets. This work introduces an alternative paradigm for enhancing policy performance without additional model training. Perhaps surprisingly, we demonstrate that the composed policies can exceed the performance of either parent policy. Our contributions are threefold. First, we establish a theoretical foundation showing that the convex composition of distributional scores from multiple diffusion models can yield a superior one-step functional objective compared to any individual score. A Grönwall-type bound is then used to show that this single-step improvement propagates through entire generation trajectories, leading to systemic performance gains. Second, motivated by these results, we propose General Policy Composition (GPC), a training-free method that enhances performance by combining the distributional scores of multiple pre-trained policies via a convex combination and test-time search. GPC is versatile, allowing for the plug-and-play composition of heterogeneous policies, including VA and VLA models, as well as those based on diffusion or flow-matching, irrespective of their input visual modalities. Third, we provide extensive empirical validation. Experiments on the Robomimic, PushT, and RoboTwin benchmarks, alongside real-world robotic evaluations, confirm that GPC consistently improves performance and adaptability across a diverse set of tasks. Further analysis of alternative composition operators and weighting strategies offers insights into the mechanisms underlying the success of GPC. These results establish GPC as a simple yet effective method for improving control performance by leveraging existing policies.
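
To make the idea of a convex combination of scores with test-time search concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes two 1-D Gaussian "policies" whose scores are known in closed form, uses unadjusted Langevin dynamics as a stand-in for a diffusion or flow sampler, and scores candidate weights with a hypothetical task metric. All names (`gaussian_score`, `compose_scores`, `langevin_sample`, `search_weight`, `evaluate`) are illustrative assumptions.

```python
import numpy as np

def gaussian_score(mean, std):
    """Score (gradient of the log-density) of a 1-D Gaussian.
    Assumption: stands in for a pre-trained policy's score network."""
    return lambda x: -(x - mean) / std**2

def compose_scores(score_a, score_b, w):
    """Convex combination of two score functions, with w in [0, 1]."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return lambda x: w * score_a(x) + (1.0 - w) * score_b(x)

def langevin_sample(score, n_samples, n_steps=500, step=0.01, seed=0):
    """Unadjusted Langevin dynamics: a simple stand-in for a real sampler."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    for _ in range(n_steps):
        noise = rng.standard_normal(n_samples)
        x = x + step * score(x) + np.sqrt(2.0 * step) * noise
    return x

def search_weight(score_a, score_b, evaluate, weights):
    """Test-time search: sample with each candidate weight, keep the best one."""
    results = {}
    for w in weights:
        composed = compose_scores(score_a, score_b, float(w))
        samples = langevin_sample(composed, n_samples=2000)
        results[float(w)] = float(evaluate(samples))
    best = max(results, key=results.get)
    return best, results

# Toy setup: two "policies" biased in opposite directions around a
# hypothetical target action (all values here are made up for illustration).
score_a = gaussian_score(mean=-1.0, std=1.0)
score_b = gaussian_score(mean=2.0, std=1.0)
target = 0.5

def evaluate(actions):
    """Hypothetical task metric: negative mean squared distance to the target."""
    return -np.mean((actions - target) ** 2)

best_w, results = search_weight(score_a, score_b, evaluate, np.linspace(0.0, 1.0, 11))
print(f"best weight: {best_w:.1f}")
```

In this toy setting the endpoints `w = 0` and `w = 1` recover the individual policies, so comparing `results[best_w]` against `results[0.0]` and `results[1.0]` shows whether an intermediate weight helps. A real system would replace the closed-form scores with the outputs of pre-trained policy networks and the metric with a task-level success measure.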
