Language-based Trial and Error Falls Behind in the Era of Experience

Abstract

While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, nonlinguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to a mismatch between the pretraining distribution and the testing distribution. In this work, we demonstrate that the primary bottleneck is the prohibitive cost of exploration: mastering these tasks requires extensive trial and error, which is computationally unsustainable for parameter-heavy LLMs operating in a high-dimensional semantic space. To address this, we propose SCOUT (Sub-Scale Collaboration On Unseen Tasks), a novel framework that decouples exploration from exploitation. We employ lightweight "scouts" (e.g., small MLPs) to probe environmental dynamics at a speed and scale far exceeding those of LLMs. The collected trajectories are used to bootstrap the LLM via Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) to activate its latent world knowledge. Empirically, SCOUT enables a Qwen2.5-3B-Instruct model to achieve an average score of 0.86, significantly outperforming proprietary models, including Gemini-2.5-Pro (0.60), while reducing GPU-hour consumption by about 60%.
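
The exploration stage described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Corridor` environment is a hypothetical toy task, a tabular Q-learning agent stands in for the small MLP scout, the prompt wording is invented, and keeping only successful trajectories is one plausible filtering choice. The sketch shows only how a cheap scout might explore and how its trajectories might be serialized into text records for SFT; the SFT and multi-turn RL stages are not shown.

```python
import random

ACTIONS = ["left", "right"]


class Corridor:
    """Hypothetical toy task: walk from cell 0 to the last cell."""

    def __init__(self, length=6, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self):
        self.pos, self.steps = 0, 0
        return self.pos

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the move.
        delta = 1 if action == 1 else -1
        self.pos = min(max(self.pos + delta, 0), self.length - 1)
        self.steps += 1
        solved = self.pos == self.length - 1
        reward = 1.0 if solved else 0.0
        return self.pos, reward, solved or self.steps >= self.max_steps


def train_scout(env, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning scout (a stand-in for a small MLP scout)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(env.length)]
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Explore with probability epsilon, or when both actions tie.
            explore = rng.random() < epsilon or q[state][0] == q[state][1]
            action = rng.randrange(2) if explore else q[state].index(max(q[state]))
            nxt, reward, done = env.step(action)
            # Reaching the goal is terminal; otherwise bootstrap from the next state.
            target = reward if reward > 0 else gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


def collect_sft_records(env, q, episodes=3):
    """Roll out the greedy scout and serialize each step as a prompt/response pair."""
    records = []
    for _ in range(episodes):
        state, done, reward, episode = env.reset(), False, 0.0, []
        while not done:
            action = q[state].index(max(q[state]))
            prompt = (
                f"You are in cell {state} of a {env.length}-cell corridor. "
                f"Reach cell {env.length - 1}. Actions: left, right."
            )
            episode.append({"prompt": prompt, "response": ACTIONS[action]})
            state, reward, done = env.step(action)
        if reward > 0:  # Assumption: keep only successful trajectories.
            records.extend(episode)
    return records


env = Corridor()
q_table = train_scout(env)
records = collect_sft_records(env, q_table)
print(len(records), records[0])
```

The scout here costs only a few thousand table updates, which is the point of the decoupling: exploration runs on a cheap learner, and the LLM sees only the resulting text records.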
