PIPer: On-Device Environment Setup via Online Reinforcement Learning

Abstract

Environment setup, the process of configuring a system to work with a specific software project, remains a persistent challenge in Software Engineering (SE). Automated environment setup methods could assist developers by providing fully configured environments for arbitrary repositories without manual effort. Such methods also help SE researchers scale execution-based benchmarks. However, recent studies reveal that even state-of-the-art Large Language Models (LLMs) achieve limited success in automating this task. To address this limitation, we tune a specialized model for environment setup. We combine supervised fine-tuning, which teaches the model to generate correct Bash scripts, with Reinforcement Learning with Verifiable Rewards (RLVR), which adapts it to the environment setup task. On EnvBench-Python, our method enables Qwen3-8B, a model runnable on consumer hardware, to perform on par with the larger models Qwen3-32B and GPT-4o. The training code and model checkpoints are available online: https://github.com/JetBrains-Research/PIPer.
