Expanding the Action Space of LLMs to Reason Beyond Language

Abstract

Large Language Models (LLMs) are powerful reasoners in natural language, but their actions are typically confined to outputting vocabulary tokens. As a result, interactions with external environments (such as symbolic operators or simulators) must be expressed through text in predefined formats, parsed, and routed to external interfaces. This overloads the model's language with both reasoning and control duties, and requires a hand-crafted parser external to the LLM. To address this, we decouple environment interactions from language by internalizing them in an Expanded Action space (ExpA), beyond the vocabulary. The model starts reasoning in the default language environment, but may trigger routing actions and switch to an external environment at any time. From there, the model can only invoke environment-specific actions, receive feedback from the environment, and potentially route back to language as a result. To promote effective exploration of the expanded action space and new environments, we introduce ExpA Reinforcement Learning (EARL) with counterfactual policy optimization. On tasks requiring multi-turn interactions and contingent planning, EARL outperforms strong baselines with vocabulary-constrained actions. It performs robustly across calculator-based multi-task learning and, in the partially observed sorting problem, achieves perfect Sort-4 accuracy while self-discovering an efficient algorithm competitive with classical designs.
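
The routing behaviour described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the names `ExpandedActionController`, `CalculatorEnv`, `evaluate`, and the `<route:...>` action format are hypothetical, and the rule that the calculator hands control back to language after `=` is an assumption made for illustration. The sketch only shows the core idea that the set of allowed actions depends on the current environment, so no text parser is needed to detect tool calls.

```python
LANGUAGE = "language"  # name of the default environment (assumed label)


def evaluate(tokens):
    """Evaluate button presses left to right, like a pocket calculator (no precedence)."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
    total, pending_op, digits = 0, "+", ""
    for tok in tokens + ["+"]:  # trailing sentinel flushes the last number
        if tok in ops:
            total = ops[pending_op](total, int(digits or "0"))
            pending_op, digits = tok, ""
        else:
            digits += tok
    return total


class CalculatorEnv:
    """Hypothetical external environment with its own, non-vocabulary actions."""

    actions = [str(d) for d in range(10)] + ["+", "-", "*", "="]

    def __init__(self):
        self.buffer = []

    def step(self, action):
        """Return (observation, done); done=True hands control back to language."""
        if action != "=":
            self.buffer.append(action)
            return None, False
        result = evaluate(self.buffer)
        self.buffer = []
        return str(result), True


class ExpandedActionController:
    """Tracks the active environment and restricts actions accordingly."""

    def __init__(self, vocab, envs):
        self.vocab = list(vocab)
        self.envs = dict(envs)
        self.current = LANGUAGE
        # One routing action per external environment, kept outside the vocabulary.
        self.routes = {f"<route:{name}>": name for name in self.envs}

    def allowed_actions(self):
        if self.current == LANGUAGE:
            return self.vocab + list(self.routes)
        return list(self.envs[self.current].actions)

    def step(self, action):
        if action not in self.allowed_actions():
            raise ValueError(f"{action!r} is not allowed in {self.current!r}")
        if self.current == LANGUAGE:
            if action in self.routes:
                self.current = self.routes[action]  # switch environments
            return None  # plain vocabulary tokens produce no feedback here
        observation, done = self.envs[self.current].step(action)
        if done:
            self.current = LANGUAGE  # route back to language
        return observation
```

In this sketch, a policy would sample only from `allowed_actions()` at each step, which is one simple way to realise the constraint that the model "can only invoke environment-specific actions" while inside an external environment. How EARL trains such a policy, including its counterfactual policy optimization, is not covered here.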