
Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines

April 1, 2026
Authors: Jingjie Ning, Xueqi Li, Chengyu Yu
cs.AI

Abstract

Multi-LLM revision pipelines, in which a second model reviews and improves a draft produced by a first, are widely assumed to derive their gains from genuine error correction. We question this assumption with a controlled decomposition experiment that uses four matched conditions to separate second-pass gains into three additive components: re-solving, scaffold, and content. We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive MCQ and competitive programming. Our results show that the gains of multi-LLM revision are not monolithic, but depend on task structure, draft quality, and the type of draft information. On MCQ tasks, where the answer space is constrained and drafts provide little structural guidance, most gains are consistent with stronger-model re-solving, and directly routing queries to the stronger model can be more effective than revising a weak draft. On code generation tasks, however, two-stage prompting remains useful because even semantically null drafts can provide substantial structural scaffolding, while weak draft content can be harmful. Finally, role-reversed experiments show that strong drafts clearly benefit weak reviewers. Ultimately, our findings demonstrate that the utility of multi-LLM revision is dynamically bottlenecked by task structure and draft quality, necessitating more targeted pipeline designs rather than blanket revision strategies.
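The additive decomposition described above can be sketched in code. This is a hypothetical illustration, not the paper's actual protocol: the condition names (weak model alone, strong model alone, strong model revising a semantically null draft, strong model revising the real weak draft) and the numbers in the usage example are assumptions chosen to show how four matched conditions yield three additive gain components.

```python
def decompose_second_pass_gain(weak_solo, strong_solo,
                               strong_on_null_draft, strong_on_weak_draft):
    """Split the total second-pass gain into three additive parts.

    Each argument is an accuracy in [0, 1] under one matched condition
    (hypothetical condition names, assumed for illustration):
      weak_solo            -- weak model answers alone (baseline)
      strong_solo          -- strong model answers alone
      strong_on_null_draft -- strong model revises a semantically null
                              draft (structural scaffold only)
      strong_on_weak_draft -- strong model revises the weak model's
                              real draft (full two-stage pipeline)
    """
    re_solving = strong_solo - weak_solo            # stronger-model effect
    scaffold = strong_on_null_draft - strong_solo   # structure-only effect
    content = strong_on_weak_draft - strong_on_null_draft  # draft-content effect
    total = strong_on_weak_draft - weak_solo
    # The three components telescope, so they sum to the total gain.
    assert abs((re_solving + scaffold + content) - total) < 1e-9
    return {"re_solving": re_solving, "scaffold": scaffold,
            "content": content, "total": total}


# Illustrative numbers only: a case where re-solving dominates, the
# scaffold helps slightly, and the weak draft's content is harmful.
gains = decompose_second_pass_gain(0.60, 0.72, 0.75, 0.73)
print({k: round(v, 4) for k, v in gains.items()})
```

On MCQ-like numbers such as these, most of the total gain would be attributed to re-solving, matching the abstract's claim that routing directly to the stronger model can beat revising a weak draft.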