Language-based Trial and Error Falls Behind in the Era of Experience

January 29, 2026
作者: Haoyu Wang, Guozheng Ma, Shugang Cui, Yilun Kong, Haotian Luo, Li Shen, Mengya Gao, Yichao Wu, Xiaogang Wang, Dacheng Tao
cs.AI

Abstract

While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, nonlinguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to the mismatch between the pretraining distribution and the testing distribution. In this work, we demonstrate that the primary bottleneck is the prohibitive cost of exploration: mastering these tasks requires extensive trial and error, which is computationally unsustainable for parameter-heavy LLMs operating in a high-dimensional semantic space. To address this, we propose SCOUT (Sub-Scale Collaboration On Unseen Tasks), a novel framework that decouples exploration from exploitation. We employ lightweight "scouts" (e.g., small MLPs) to probe environmental dynamics at a speed and scale far exceeding those of LLMs. The collected trajectories are used to bootstrap the LLM via Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) to activate its latent world knowledge. Empirically, SCOUT enables a Qwen2.5-3B-Instruct model to achieve an average score of 0.86, significantly outperforming proprietary models, including Gemini-2.5-Pro (0.60), while reducing GPU-hour consumption by about 60%.
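
The sketch below illustrates the exploration-then-bootstrapping idea described in the abstract: a small MLP scout runs cheap trial-and-error in the environment, and its successful trajectories are converted into text examples for SFT of the LLM. It is a minimal illustration, not the paper's implementation; the environment interface (`reset`/`step`), the epsilon-greedy exploration rule, the return threshold, and the prompt/response template are all assumptions introduced here, and the subsequent multi-turn RL stage is omitted.

```python
# Minimal sketch of the SCOUT idea (decoupling exploration from exploitation).
# Assumptions: a discrete-action environment exposing reset()/step(action) that
# returns (next_obs, reward, done); the prompt format is an illustrative placeholder.
import random
import torch
import torch.nn as nn


class ScoutMLP(nn.Module):
    """Lightweight 'scout' policy: a small MLP, cheap enough for massive trial-and-error."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
        self.n_actions = n_actions

    def act(self, obs: torch.Tensor, epsilon: float = 0.1) -> int:
        # Epsilon-greedy action selection: random probes plus the current best guess.
        if random.random() < epsilon:
            return random.randrange(self.n_actions)
        return int(self.net(obs).argmax().item())


def collect_trajectories(env, scout: ScoutMLP, episodes: int = 1000):
    """Let the scout probe environment dynamics and record (obs, action, reward) steps."""
    trajectories = []
    for _ in range(episodes):
        obs, done, episode = env.reset(), False, []
        while not done:
            action = scout.act(torch.as_tensor(obs, dtype=torch.float32))
            next_obs, reward, done = env.step(action)
            episode.append((obs, action, reward))
            obs = next_obs
        trajectories.append(episode)
    return trajectories


def to_sft_examples(trajectories, min_return: float = 0.0):
    """Turn high-return scout trajectories into prompt/response pairs for LLM SFT."""
    examples = []
    for episode in trajectories:
        if sum(r for _, _, r in episode) <= min_return:
            continue  # keep only trajectories worth imitating
        for obs, action, _ in episode:
            examples.append({
                "prompt": f"Observation: {list(obs)}\nChoose the next action.",
                "response": f"Action: {action}",
            })
    return examples
```

In this reading, the expensive trial-and-error loop never touches the LLM: only the distilled, filtered trajectories do, which is what makes the exploration budget affordable.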