BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning
March 4, 2026
Authors: Tarjei Paule Hage, Markus J. Buehler
cs.AI
Abstract
Can reinforcement learning with hard, verifiable rewards teach a compact language model to reason about physics, or does it primarily learn to pattern-match toward correct answers? We study this question by training a 1.5B-parameter reasoning model on beam statics, a classic engineering problem, using parameter-efficient RLVR with binary correctness rewards from symbolic solvers and without teacher-generated reasoning traces. The best BeamPERL checkpoint achieves a 66.7% improvement in Pass@1 over the base model. However, the learned competence is anisotropic: the model generalizes compositionally (more loads) but fails under topological shifts (moved supports) that require the same equilibrium equations. Intermediate checkpoints yield the strongest reasoning; continued optimization degrades robustness while maintaining reward. These findings reveal a key limitation of outcome-level alignment: reinforcement learning with exact physics rewards induces procedural solution templates rather than an internalized understanding of the governing equations. A precise reward signal, even one that is analytically exact, does not by itself guarantee transferable physical reasoning. Our results suggest that verifiable rewards may need to be paired with structured reasoning scaffolding to move beyond template matching toward robust scientific reasoning.
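To make the reward setup concrete, here is a minimal Python sketch (not the authors' released code) of how a binary verifiable reward from a symbolic solver might look in the simplest beam-statics case: a simply supported beam under a single point load, whose support reactions follow from the equilibrium equations sum(F_y) = 0 and sum(M_A) = 0. The function names (beam_reactions, binary_reward), the answer-extraction heuristic, and the tolerance are all illustrative assumptions.

# Minimal sketch of an RLVR-style binary reward for beam statics.
# A symbolic solver computes the exact support reactions, and the model's
# final numeric answer is scored 1.0 if it matches within tolerance, else 0.0.
# All names and the parsing heuristic below are illustrative assumptions.
import re
import sympy as sp

def beam_reactions(L, P, a):
    """Exact reactions (R_A, R_B) for a simply supported beam of span L
    with a downward point load P applied at distance a from support A,
    obtained from static equilibrium: sum(F_y) = 0 and sum(M_A) = 0."""
    R_A, R_B = sp.symbols("R_A R_B")
    eq_force = sp.Eq(R_A + R_B - P, 0)     # vertical force balance
    eq_moment = sp.Eq(R_B * L - P * a, 0)  # moment balance about support A
    sol = sp.solve([eq_force, eq_moment], [R_A, R_B])
    return sol[R_A], sol[R_B]

def binary_reward(model_output, L, P, a, tol=1e-6):
    """Return 1.0 iff the last number in the model's output equals the exact
    reaction R_A within tolerance; otherwise 0.0 (no partial credit).
    Taking the last number as the final answer is a simple heuristic."""
    exact_RA, _ = beam_reactions(L, P, a)
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - float(exact_RA)) < tol else 0.0

# Example: P = 10 kN at a = 2 m on a 5 m span gives R_A = 6 kN, R_B = 4 kN.
print(binary_reward("... therefore R_A = 6.0 kN", L=5, P=10, a=2))  # -> 1.0

Because the solver is analytically exact and the reward is all-or-nothing on the final answer, nothing in this signal rewards the intermediate reasoning itself, which is consistent with the abstract's observation that outcome-level rewards can induce procedural solution templates rather than equation-level understanding.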