Beyond Test-Time Training: Learning to Reason via Hardware-Efficient Optimal Control
March 10, 2026
Authors: Peihao Wang, Shan Yang, Xijun Wang, Tesi Xiao, Xin Liu, Changlong Yu, Yu Lou, Pan Li, Zhangyang Wang, Ming Lin, René Vidal
cs.AI
Abstract
Associative memory has long underpinned the design of sequential models. Beyond recall, humans reason by projecting future states and selecting goal-directed actions, a capability that modern language models increasingly require but do not natively encode. While prior work uses reinforcement learning or test-time training, planning remains external to the model architecture. We formulate reasoning as an optimal control problem and introduce the Test-Time Control (TTC) layer, which performs finite-horizon LQR planning over latent states at inference time, represents a value function within the neural architecture, and leverages it as a nested objective to enable planning before prediction. To ensure scalability, we derive a hardware-efficient LQR solver based on a symplectic formulation and implement it as a fused CUDA kernel, enabling parallel execution with minimal overhead. Integrated as adapters into pretrained LLMs, TTC layers improve mathematical reasoning accuracy by up to 27.8% on MATH-500 and yield 2-3x Pass@8 improvements on AMC and AIME, demonstrating that embedding optimal control as an architectural component provides an effective and scalable mechanism for reasoning beyond test-time training.
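To make the planning mechanism concrete, the sketch below shows a standard finite-horizon discrete-time LQR solver via backward Riccati recursion. This is not the paper's symplectic, CUDA-fused solver; it is a minimal reference implementation of the classical algorithm the TTC layer builds on, with `A`, `B`, `Q`, `R` standing in as hypothetical (here, arbitrary small) dynamics and cost matrices rather than learned latent-state dynamics.

```python
import numpy as np

def lqr_plan(A, B, Q, R, x0, T):
    """Finite-horizon discrete LQR via backward Riccati recursion.

    Illustrative sketch only: the paper plans over latent states with
    learned dynamics and a hardware-efficient symplectic solver; here
    A, B, Q, R are arbitrary small matrices.
    """
    P = Q.copy()                      # terminal cost-to-go P_T = Q
    gains = []
    for _ in range(T):                # backward pass: K_t from P_{t+1}
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    gains.reverse()                   # gains[t] now corresponds to step t
    xs, us = [x0], []
    x = x0
    for K in gains:                   # forward rollout with u_t = -K_t x_t
        u = -K @ x
        x = A @ x + B @ u
        us.append(u)
        xs.append(x)
    return xs, us
```

For example, on a double-integrator system the planned controls drive the state toward the origin over the horizon; the backward pass is the sequential recurrence that the paper's symplectic formulation reworks for parallel execution.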