PIPer: On-Device Environment Setup via Online Reinforcement Learning

Abstract

Environment setup, the process of configuring a system to work with a specific software project, is a persistent challenge in Software Engineering (SE). Automated environment setup methods could assist developers by providing fully configured environments for arbitrary repositories without manual effort. They would also help SE researchers scale execution-based benchmarks. However, recent studies reveal that even state-of-the-art Large Language Models (LLMs) achieve limited success in automating this task. To address this limitation, we tune a specialized model for environment setup. We combine supervised fine-tuning for generating correct Bash scripts with Reinforcement Learning with Verifiable Rewards (RLVR) to adapt the model to the task. On EnvBench-Python, our method enables Qwen3-8B (a model runnable on consumer hardware) to perform on par with two larger models, Qwen3-32B and GPT-4o. The training code and model checkpoints are available online: https://github.com/JetBrains-Research/PIPer.
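To make the idea of a verifiable reward concrete, the following is a minimal sketch of one possible reward signal for a generated Bash setup script. It is an illustration only: the abstract does not specify the reward the authors use, and the function name `script_reward`, the file name `setup.sh`, the binary exit-code criterion, and the timeout value are all assumptions made here.

```python
import subprocess
import tempfile
from pathlib import Path


def script_reward(script: str, timeout_s: float = 60.0) -> float:
    """Hypothetical binary reward: 1.0 if the script exits with code 0, else 0.0.

    Assumption: success is judged only by the exit code. This is not taken
    from the paper; it is one simple example of a machine-checkable signal.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Write the generated script into a throwaway working directory.
        script_path = Path(workdir) / "setup.sh"
        script_path.write_text(script)
        try:
            result = subprocess.run(
                ["bash", str(script_path)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A script that does not finish in time earns no reward.
            return 0.0
    return 1.0 if result.returncode == 0 else 0.0
```

In a real training loop, model-generated scripts would need to run inside an isolated sandbox such as a container rather than directly on the host, and a richer check than the exit code would likely be needed to tell whether the environment is actually usable.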

