RePoT: Recoverable Program-of-Thought via Checkpoint Recovery

Abstract

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan, so a single invalid action silently invalidates the entire trajectory. We introduce RePoT (Recoverable PoT), which combines two steps: a deterministic verified replay that walks the plan through the environment up to its first invalid transition, followed by one LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call, and only on the ~14% of problems where PoT fails. On PuzzleZoo-775, RePoT beats PoT by +3 to +11pp across four closed-model configurations and peaks at 96.9% vs. 86.3% on gpt-5.4-mini-medium. Against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2, +5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini. We begin to address this capability-scaling pattern with Adaptive RePoT, a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length (preliminary). The results replicate on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information reaches >=30% on GPT-medium and >=70% on Gemini, vs. <=3.1% for error-only feedback. This shows that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.
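
The replay-then-resume procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the simulator interface (`step` returning `None` for an invalid action), the `repair` callback standing in for the single extra LLM call, and the toy `corridor_step` environment are all hypothetical names introduced here. The sketch also does not re-verify the repaired suffix, and the abstract does not say whether the method does.

```python
from typing import Callable, Hashable, List, Optional, Sequence, Tuple

State = Hashable
Action = str
# Hypothetical simulator interface: next state, or None if the action is invalid.
Step = Callable[[State, Action], Optional[State]]
# Hypothetical stand-in for the single LLM repair call.
Repair = Callable[[State, List[Action]], Sequence[Action]]


def verified_replay(
    step: Step, start: State, plan: Sequence[Action]
) -> Tuple[List[Action], State, Optional[int]]:
    """Walk the plan from `start` until the first invalid transition.

    Returns (verified prefix, checkpoint state, index of the failing action).
    The index is None when every action in the plan is valid.
    """
    state = start
    for i, action in enumerate(plan):
        nxt = step(state, action)
        if nxt is None:  # first invalid transition: stop and checkpoint here
            return list(plan[:i]), state, i
        state = nxt
    return list(plan), state, None


def repot_once(
    step: Step, start: State, plan: Sequence[Action], repair: Repair
) -> List[Action]:
    """Keep the verified prefix; ask `repair` for a new suffix at most once."""
    prefix, checkpoint, failed_at = verified_replay(step, start, plan)
    if failed_at is None:
        return prefix  # plan replayed cleanly, so no extra call is made
    suffix = repair(checkpoint, prefix)  # the one extra call
    return prefix + list(suffix)


def corridor_step(pos: int, action: Action) -> Optional[int]:
    """Toy environment (assumption): positions 0..3, actions 'L' and 'R'."""
    nxt = pos + (1 if action == "R" else -1)
    return nxt if 0 <= nxt <= 3 else None
```

In the toy corridor, replaying `["R", "R", "R", "R", "L"]` from position 0 stops at the fourth action, leaving a three-action verified prefix and a checkpoint at position 3. Only the suffix from that checkpoint onward is then regenerated.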