arXiv: 2605.30353v1
Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software
May 28, 2026
Author: Nhat-Minh Nguyen
cs.AI, astro-ph.CO, cs.HC, cs.SE
Abstract
Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$) in which a physicist supervised an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented 15 supervision events and classified them by intervention level. The agent resolved ten autonomously by iterating against oracle tests. Two more were resolved with the physicist's domain knowledge. The three it could not resolve, all of which evaded oracle detection, share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and it could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory and predicted wrong values at any other cosmology. This fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and that distinguish predictive adequacy from explanatory correctness. These capabilities were not exhibited here and are not obviously addressed by scaling alone. [Abridged.]
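The first supervision practice, testing away from the fiducial calibration, can be illustrated with a minimal toy sketch in Python. This is not the CLAX-PT code or its test suite. The functions `reference`, `incomplete_model`, `fudge`, and `passes_oracle`, the Gaussian damping form, and every numerical value are hypothetical stand-ins chosen only to show how a correction calibrated at one parameter point can pass an oracle comparison there and fail elsewhere.

```python
import math

FIDUCIAL_OMEGA_M = 0.3        # hypothetical calibration point
K_VALUES = [0.1, 0.3, 0.5]    # hypothetical wavenumbers


def reference(k, omega_m):
    """Toy 'oracle': a damping factor that depends on the cosmology."""
    return math.exp(-k**2 * omega_m)


def incomplete_model(k, omega_m):
    """Toy model whose structure contains no damping term at all."""
    return 1.0


def fudge(k):
    """Correction calibrated once, at the fiducial point only."""
    return reference(k, FIDUCIAL_OMEGA_M) / incomplete_model(k, FIDUCIAL_OMEGA_M)


def patched_model(k, omega_m):
    """Incomplete model multiplied by the fiducial-calibrated correction."""
    return fudge(k) * incomplete_model(k, omega_m)


def max_rel_error(omega_m):
    """Largest relative deviation from the oracle over all wavenumbers."""
    return max(
        abs(patched_model(k, omega_m) / reference(k, omega_m) - 1.0)
        for k in K_VALUES
    )


def passes_oracle(omega_m_points, tol=1e-3):
    """True only if the patched model matches the oracle at every point."""
    return all(max_rel_error(om) < tol for om in omega_m_points)


print(passes_oracle([FIDUCIAL_OMEGA_M]))      # True: fiducial-only test
print(passes_oracle([0.25, 0.30, 0.35]))      # False: diverse-point test
```

In this toy setting the patched model is exact at the fiducial point by construction, so a fiducial-only comparison cannot tell a calibrated patch from a model with the correct parameter dependence. Adding even two off-fiducial points exposes the difference.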