Expanding the Action Space of LLMs to Reason Beyond Language

Abstract

Large Language Models (LLMs) are powerful reasoners in natural language, but their actions are typically confined to outputting vocabulary tokens. As a result, interactions with external environments -- such as symbolic operators or simulators -- must be expressed through text in predefined formats, parsed, and routed to external interfaces. This overloads the model's language with both reasoning and control duties, and requires a hand-crafted parser, external to the LLM. To address this, we decouple environment interactions from language by internalizing them in an Expanded Action space (ExpA), beyond the vocabulary. The model starts reasoning in the default language environment, but may trigger routing actions and switch to an external environment at any time. From there, the model can only invoke environment-specific actions, receive feedback from the environment, and potentially route back to language as a result. To promote effective exploration of the expanded action space and new environments, we introduce ExpA Reinforcement Learning (EARL) with counterfactual policy optimization. On tasks requiring multi-turn interactions and contingent planning, EARL outperforms strong baselines with vocabulary-constrained actions. It performs robustly across calculator-based multi-task learning and, in the partially observed sorting problem, achieves perfect Sort-4 accuracy while self-discovering an efficient algorithm competitive with classical designs.
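The routing mechanism described above can be illustrated with a toy sketch. This is not the authors' implementation: the class `ExpandedActionSpace`, the `<route:...>` naming, the calculator actions, and the scripted action sequence are all hypothetical, chosen only to show how the set of permitted actions depends on the current environment and how environment feedback can route control back to language without a text parser.

```python
from dataclasses import dataclass

LANGUAGE = "language"  # label for the default environment (assumed name)


@dataclass
class ExpandedActionSpace:
    """Toy action space: vocabulary tokens plus routing and environment actions."""
    vocab: list[str]
    env_actions: dict[str, list[str]]  # environment name -> its own actions

    def routes(self) -> dict[str, str]:
        # One routing action per external environment (hypothetical naming).
        return {f"<route:{env}>": env for env in self.env_actions}

    def allowed(self, env: str) -> list[str]:
        # In language: vocabulary tokens and routing actions.
        # Elsewhere: only that environment's actions.
        if env == LANGUAGE:
            return self.vocab + list(self.routes())
        return list(self.env_actions[env])


def evaluate(tokens: list[str]) -> int:
    # Left-to-right evaluation of digit, "+" and "*" tokens (no precedence).
    # Assumes a well-formed, non-empty expression.
    total, op, number = 0, "+", ""
    for tok in tokens + ["+"]:
        if tok.isdigit():
            number += tok
        else:
            value = int(number)
            total = total + value if op == "+" else total * value
            op, number = tok, ""
    return total


def calculator_step(buffer: list[str], action: str):
    # "=" emits the result as an observation and routes back to language.
    if action == "=":
        return [], str(evaluate(buffer)), LANGUAGE
    return buffer + [action], None, "calculator"


def run(space: ExpandedActionSpace, actions: list[str]) -> list[str]:
    # Replays a scripted action sequence; a trained policy would choose
    # each action from space.allowed(env) instead. Only the calculator
    # environment is implemented in this sketch.
    env, buffer, transcript = LANGUAGE, [], []
    for action in actions:
        if action not in space.allowed(env):
            raise ValueError(f"{action!r} not allowed in environment {env!r}")
        if env == LANGUAGE:
            if action in space.routes():
                env = space.routes()[action]
            else:
                transcript.append(action)
        else:
            buffer, observation, env = calculator_step(buffer, action)
            if observation is not None:
                transcript.append(observation)
    return transcript


space = ExpandedActionSpace(
    vocab=["twelve", "plus", "three", "is"],
    env_actions={"calculator": list("0123456789") + ["+", "*", "="]},
)
print(run(space, ["twelve", "plus", "three", "is",
                  "<route:calculator>", "1", "2", "+", "3", "="]))
# ['twelve', 'plus', 'three', 'is', '15']
```

In this sketch a vocabulary token issued inside the calculator raises an error, which mirrors the constraint that only environment-specific actions are available after routing. A learned policy would typically enforce the same constraint by masking disallowed actions rather than raising.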