
REPOT: Recoverable Program-of-Thought via Checkpoint Repair

May 28, 2026
Author: Parsa Mazaheri
cs.AI

Abstract

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the entire trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay walks the plan through the environment up to its first invalid transition, and a single LLM call then resumes from the verified prefix. RePoT costs at most one extra LLM call on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11 percentage points (pp) across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium. Against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini. We begin to address this capability-scaling pattern with Adaptive RePoT, a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length (preliminary results). We replicate on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears >=30% on GPT-medium and >=70% on Gemini, versus <=3.1% for error-only feedback. This shows that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.
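
To make the replay-then-repair loop concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy Tower-of-Hanoi environment, and every name in it (`step`, `verified_replay`, `repot`, `choose_strategy`, `stub_repair`) is hypothetical. The LLM call is replaced by a stub, and the dispatcher's 0.5 prefix-fraction threshold is an illustrative assumption; the abstract states only that routing depends on verified-prefix length.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[Tuple[int, ...], ...]  # pegs, each listed bottom -> top
Action = Tuple[int, int]             # (source peg, destination peg)
Repair = Callable[[State, List[Action]], List[Action]]


def step(state: State, action: Action) -> Optional[State]:
    """Toy Tower-of-Hanoi transition; returns None when the move is invalid."""
    src, dst = action
    n = len(state)
    if not (0 <= src < n and 0 <= dst < n) or src == dst or not state[src]:
        return None
    disk = state[src][-1]
    if state[dst] and state[dst][-1] < disk:
        return None  # cannot place a larger disk on a smaller one
    pegs = [list(p) for p in state]
    pegs[src].pop()
    pegs[dst].append(disk)
    return tuple(tuple(p) for p in pegs)


def verified_replay(
    state: State, plan: List[Action]
) -> Tuple[List[Action], State, Optional[int]]:
    """Deterministically walk the plan until the first invalid transition.

    Returns (verified prefix, checkpoint state, index of the failing action
    or None if every action was valid).
    """
    prefix: List[Action] = []
    for i, action in enumerate(plan):
        nxt = step(state, action)
        if nxt is None:
            return prefix, state, i
        prefix.append(action)
        state = nxt
    return prefix, state, None


def repot(start: State, plan: List[Action], repair: Repair) -> List[Action]:
    """Keep the verified prefix and ask for one repaired suffix."""
    prefix, checkpoint, bad_index = verified_replay(start, plan)
    if bad_index is None:
        return list(plan)  # plan replays cleanly: no extra call
    # Assumption: `repair` stands in for the single extra LLM call.
    return prefix + repair(checkpoint, prefix)


def choose_strategy(prefix_len: int, plan_len: int, min_fraction: float = 0.5) -> str:
    """Hypothetical dispatcher rule keyed on verified-prefix length."""
    if plan_len > 0 and prefix_len / plan_len >= min_fraction:
        return "suffix_repair"
    return "fresh_retry"


# Two-disk example: the second move tries to put disk 2 on top of disk 1.
start: State = ((2, 1), (), ())
faulty_plan: List[Action] = [(0, 1), (0, 1), (1, 2)]


def stub_repair(checkpoint: State, prefix: List[Action]) -> List[Action]:
    # Hard-coded suffix in place of a model response.
    return [(0, 2), (1, 2)]


prefix, checkpoint, bad_index = verified_replay(start, faulty_plan)
final_plan = repot(start, faulty_plan, stub_repair)
print(prefix, checkpoint, bad_index)
print(final_plan)
print(choose_strategy(len(prefix), len(faulty_plan)))
```

This sketch only detects invalid transitions. It does not check that the plan reaches a goal state, and it does not re-verify the repaired suffix; a fuller version would call `verified_replay` on the final plan as well.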