arXiv: 2605.30353v1
Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software
May 28, 2026
Author: Nhat-Minh Nguyen
Subjects: cs.AI, astro-ph.CO, cs.HC, cs.SE
Abstract
Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$) in which a physicist supervised an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented 15 supervision events and classified them by intervention level. The agent resolved ten autonomously by iterating against oracle tests, and two more were resolved with the physicist's domain knowledge. The remaining three, all of which evaded oracle detection, share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory and would predict wrong values at any other cosmology. This fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and that distinguish predictive adequacy from explanatory correctness; these capabilities were not exhibited here and are not obviously addressed by scaling alone. [Abridged.]
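The first supervision practice, testing away from the fiducial calibration point, can be illustrated with a toy JAX sketch. Nothing here is taken from CLAX-PT or CLASS-PT: `oracle_spectrum`, `incomplete_model`, `patched_model`, and the single parameter `omega` are hypothetical stand-ins, chosen only so that the missing physics (a damping term) depends on the parameter. A multiplicative fudge factor calibrated at the fiducial point then matches the oracle there and fails elsewhere.

```python
import jax.numpy as jnp


def oracle_spectrum(k, omega):
    """Toy reference: power law with a parameter-dependent damping term."""
    return omega * jnp.power(k, -1.5) * jnp.exp(-(omega * k) ** 2)


def incomplete_model(k, omega):
    """Toy model under test: same power law, but the damping is missing."""
    return omega * jnp.power(k, -1.5)


def max_rel_error(pred, ref):
    """Largest relative deviation of pred from ref over all k."""
    return float(jnp.max(jnp.abs(pred / ref - 1.0)))


k = jnp.linspace(0.01, 0.3, 30)
omega_fid = 1.0  # hypothetical fiducial parameter value

# Unphysical patch: a per-k factor calibrated against the oracle at the
# fiducial point only. It absorbs the missing damping at omega_fid.
fudge = oracle_spectrum(k, omega_fid) / incomplete_model(k, omega_fid)


def patched_model(k, omega):
    """Incomplete model multiplied by the fiducial-calibrated fudge factor."""
    return fudge * incomplete_model(k, omega)


# Oracle test at the calibration point: passes by construction.
err_fid = max_rel_error(patched_model(k, omega_fid), oracle_spectrum(k, omega_fid))

# Same test at other parameter points: the patch no longer tracks the oracle.
errs_off = {
    w: max_rel_error(patched_model(k, w), oracle_spectrum(k, w))
    for w in (0.5, 2.0)
}

print("fiducial error:", err_fid)
print("off-fiducial errors:", errs_off)
```

In this toy setting the fiducial-only check cannot distinguish the patch from a correct model, while any check at a second parameter value exposes it, which is the motivation for sampling several parameter points in an oracle test suite.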