BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning

Abstract

Can reinforcement learning with hard, verifiable rewards teach a compact language model to reason about physics, or does it primarily learn to pattern-match toward correct answers? We study this question by training a 1.5B-parameter reasoning model on beam statics, a classic engineering problem, using parameter-efficient reinforcement learning with verifiable rewards (RLVR). The only training signal is a binary correctness reward from symbolic solvers; no teacher-generated reasoning traces are used. The best BeamPERL checkpoint achieves a 66.7% improvement in Pass@1 over the base model. However, the learned competence is anisotropic: the model generalizes compositionally (more loads) but fails under topological shifts (moved supports) that require the same equilibrium equations. Intermediate checkpoints yield the strongest reasoning, whereas continued optimization degrades robustness even as reward is maintained. These findings reveal a key limitation of outcome-level alignment: reinforcement learning with exact physics rewards induces procedural solution templates rather than internalization of the governing equations. The precision of the reward signal, even when analytically exact, does not by itself guarantee transferable physical reasoning. Our results suggest that verifiable rewards may need to be paired with structured reasoning scaffolding to move beyond template matching toward robust scientific reasoning.
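To make the idea of a binary, verifiable reward concrete, the following is a minimal sketch for one simple case: a beam on two vertical supports carrying downward point loads. It is an illustration only, not the BeamPERL implementation. The abstract states that rewards come from symbolic solvers, whereas this sketch uses a plain numeric check. The function names, the `(position, magnitude)` load format, and the tolerance are all assumptions made for this example.

```python
import math

def support_reactions(loads, x_a, x_b):
    """Return vertical reactions (R_a, R_b) at supports located at x_a and x_b.

    `loads` is a list of (position, magnitude) pairs for downward point loads.
    This hypothetical helper covers only a statically determinate beam on
    two vertical supports.
    """
    if x_a == x_b:
        raise ValueError("supports must be at distinct positions")
    total_load = sum(magnitude for _, magnitude in loads)
    # Moment equilibrium about support A:
    #   R_b * (x_b - x_a) = sum_i P_i * (x_i - x_a)
    moment_about_a = sum(magnitude * (position - x_a) for position, magnitude in loads)
    r_b = moment_about_a / (x_b - x_a)
    # Vertical force equilibrium: R_a + R_b = sum_i P_i
    r_a = total_load - r_b
    return r_a, r_b

def binary_reward(predicted, loads, x_a, x_b, rel_tol=1e-3):
    """Return 1.0 if the predicted (R_a, R_b) match the equilibrium solution, else 0.0."""
    expected = support_reactions(loads, x_a, x_b)
    if len(predicted) != len(expected):
        return 0.0
    matches = all(
        math.isclose(p, e, rel_tol=rel_tol, abs_tol=1e-9)
        for p, e in zip(predicted, expected)
    )
    return 1.0 if matches else 0.0

# One 10-unit load at midspan of a beam supported at x = 0 and x = 4.
single_load = [(2.0, 10.0)]
reward_original = binary_reward((5.0, 5.0), single_load, 0.0, 4.0)

# Same load and same equations, but the left support is moved to x = 1.
# Reusing the memorized answer (5, 5) no longer satisfies equilibrium.
reward_moved_support = binary_reward((5.0, 5.0), single_load, 1.0, 4.0)
```

In this sketch, adding loads and moving a support both pass through the same two equilibrium equations. The reward scores only the final reactions, so it cannot tell a memorized answer template from a solution derived from those equations unless the evaluation set varies the support positions.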

