RePoT: Recoverable Program-of-Thought via Checkpoint Repair
May 28, 2026
Author: Parsa Mazaheri
cs.AI
Abstract
One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the whole trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its first invalid transition, followed by a single LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call, and only on the ~14% of problems where PoT fails. On PuzzleZoo-775, RePoT beats PoT by +3 to +11pp across four closed-model configurations and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium. Against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini. We begin to address this capability-scaling pattern with Adaptive RePoT (preliminary), a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length. We replicate the results on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information reaches >=30% on GPT-medium and >=70% on Gemini, vs <=3.1% for error-only feedback. This shows that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.
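The replay-then-repair loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the names `verified_replay`, `repot`, `toy_step`, and `toy_repair` are hypothetical, the environment is a toy counter that may not go below zero, and the LLM repair call is replaced by a deterministic stub. The sketch also assumes that a plan which is valid but misses the goal is sent to repair as well, which the abstract does not specify.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces, assumed for illustration only.
StepFn = Callable[[int, str], Optional[int]]        # returns None on an invalid action
RepairFn = Callable[[int, List[str], Optional[str]], List[str]]


def verified_replay(
    start: int, plan: List[str], step: StepFn
) -> Tuple[List[str], int, Optional[str]]:
    """Walk the plan until the first invalid transition.

    Returns (verified_prefix, checkpoint_state, failed_action);
    failed_action is None when every action in the plan was valid.
    """
    state = start
    prefix: List[str] = []
    for action in plan:
        nxt = step(state, action)
        if nxt is None:
            return prefix, state, action
        prefix.append(action)
        state = nxt
    return prefix, state, None


def repot(
    start: int,
    plan: List[str],
    step: StepFn,
    is_goal: Callable[[int], bool],
    repair: RepairFn,
) -> List[str]:
    """Keep the verified prefix and ask for a new suffix at most once."""
    prefix, checkpoint, failed = verified_replay(start, plan, step)
    if failed is None and is_goal(checkpoint):
        return plan  # the one-shot plan already works: no extra call
    # Single repair call, resuming from the checkpoint reached by the prefix.
    suffix = repair(checkpoint, prefix, failed)
    return prefix + suffix


def toy_step(state: int, action: str) -> Optional[int]:
    """Toy environment: a counter that may not drop below zero."""
    delta = {"inc": 1, "dec": -1}.get(action)
    if delta is None or state + delta < 0:
        return None
    return state + delta


def toy_repair(checkpoint: int, prefix: List[str], failed: Optional[str]) -> List[str]:
    """Stand-in for the LLM call: count up from the checkpoint to the goal 3."""
    return ["inc"] * (3 - checkpoint)


# The second "dec" is invalid (the counter is at 0), so replay stops there.
repaired = repot(0, ["inc", "dec", "dec", "inc"], toy_step, lambda s: s == 3, toy_repair)
```

In this toy run the verified prefix is `["inc", "dec"]`, the checkpoint state is `0`, and the stub supplies three `"inc"` actions as the new suffix. A real repair function would instead prompt a model with the checkpoint state and the failed action.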