RePoT: Recoverable Program-of-Thought via Checkpoint Repair

Abstract

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the entire trajectory. We introduce RePoT (Recoverable PoT), which combines a deterministic verified replay that walks the plan through the environment up to its first invalid transition with a single LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call, incurred only on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775, peaking at 96.9% (vs. 86.3% for PoT) on gpt-5.4-mini-medium. Against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2, +5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini. We begin to address this capability-scaling pattern with Adaptive RePoT, a preliminary rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length. We replicate the results on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of the four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information reaches a success rate of at least 30% on GPT-medium and at least 70% on Gemini, versus at most 3.1% for error-only feedback. This shows that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.
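
The replay-then-repair loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `step`, `is_goal`, and `repair` callables, the convention that an invalid action returns `None`, and the choice to also trigger a repair when a fully valid plan stops short of the goal are all assumptions made for illustration. The toy counter environment at the end is likewise hypothetical, and `repair` stands in for the single extra LLM call.

```python
from typing import Callable, List, Optional, Sequence, Tuple

State = int   # toy state type; a real environment would define its own
Action = str

# Assumed interface: returns the next state, or None if the action is invalid.
Step = Callable[[State, Action], Optional[State]]


def verified_replay(
    step: Step, start: State, plan: Sequence[Action]
) -> Tuple[List[Action], State, bool]:
    """Replay `plan` deterministically up to its first invalid transition.

    Returns (verified_prefix, checkpoint_state, whole_plan_was_valid).
    """
    state = start
    prefix: List[Action] = []
    for action in plan:
        nxt = step(state, action)
        if nxt is None:  # first invalid transition: stop at the checkpoint
            return prefix, state, False
        prefix.append(action)
        state = nxt
    return prefix, state, True


def repot(
    step: Step,
    is_goal: Callable[[State], bool],
    start: State,
    plan: Sequence[Action],
    repair: Callable[[State, List[Action]], Sequence[Action]],
) -> Optional[List[Action]]:
    """Return a verified plan, using at most one call to `repair`."""
    prefix, checkpoint, valid = verified_replay(step, start, plan)
    if valid and is_goal(checkpoint):
        return prefix  # the original plan succeeded; no extra call needed
    # Assumption: `repair` wraps one LLM call that continues from the checkpoint.
    candidate = prefix + list(repair(checkpoint, prefix))
    _, final, ok = verified_replay(step, start, candidate)
    return candidate if ok and is_goal(final) else None


# Hypothetical toy environment: a counter that must stay within 0..3.
def toy_step(state: State, action: Action) -> Optional[State]:
    nxt = state + (1 if action == "inc" else -1)
    return nxt if 0 <= nxt <= 3 else None


def toy_repair(checkpoint: State, prefix: List[Action]) -> List[Action]:
    # Stand-in for the LLM: climb from the checkpoint up to the goal state 3.
    return ["inc"] * (3 - checkpoint)


if __name__ == "__main__":
    broken_plan = ["inc", "dec", "dec", "inc"]  # third action drops below 0
    print(repot(toy_step, lambda s: s == 3, 0, broken_plan, toy_repair))
```

In this sketch the repaired plan is replayed from the start state before being accepted, so the returned plan is always fully verified, and a failed repair yields `None` rather than an unchecked plan.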