Revision or Re-Solving? A Decomposition Analysis of Second-Pass Gains in Multi-LLM Pipelines

Abstract

Multi-LLM revision pipelines, in which a second model reviews and improves a draft produced by a first, are widely assumed to derive their gains from genuine error correction. We question this assumption with a controlled decomposition experiment that uses four matched conditions to separate second-pass gains into three additive components: re-solving, scaffold, and content. We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive multiple-choice questions (MCQ) and competitive programming. Our results show that the gains of multi-LLM revision do not come from a single mechanism; they depend on task structure, draft quality, and the type of information the draft carries. On MCQ tasks, where the answer space is constrained and drafts provide little structural guidance, most gains are consistent with the stronger model simply re-solving the problem, and routing queries directly to the stronger model can be more effective than revising a weak draft. On code generation tasks, however, two-stage prompting remains useful because even semantically null drafts can provide substantial structural scaffolding, while weak draft content can be harmful. Finally, role-reversed experiments, in which the stronger model drafts and the weaker model reviews, show that strong drafts clearly benefit weak reviewers. Overall, our findings indicate that the utility of multi-LLM revision is constrained by task structure and draft quality, which calls for targeted pipeline designs rather than blanket revision strategies.
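
The additive decomposition described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the four matched conditions, so the ones below are assumptions made for illustration: the weak model alone, the strong model re-solving without a draft, the strong model given a semantically null draft, and the strong model given the real weak draft. The names `ConditionAccuracy` and `decompose` are hypothetical, and the accuracy values are placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionAccuracy:
    """Accuracy under four ASSUMED matched conditions (not taken from the paper)."""
    weak_alone: float          # weak model answers on its own
    strong_resolve: float      # strong model answers with no draft
    strong_null_draft: float   # strong model revises a semantically null draft
    strong_real_draft: float   # strong model revises the real weak draft


def decompose(acc: ConditionAccuracy) -> dict:
    """Split the total second-pass gain into three additive components.

    The differences telescope, so the components sum to the total gain
    by construction.
    """
    return {
        # Gain from simply using the stronger model.
        "re_solving": acc.strong_resolve - acc.weak_alone,
        # Gain from the two-stage format, independent of draft content.
        "scaffold": acc.strong_null_draft - acc.strong_resolve,
        # Gain (or loss) attributable to what the draft actually says.
        "content": acc.strong_real_draft - acc.strong_null_draft,
        "total": acc.strong_real_draft - acc.weak_alone,
    }


# Placeholder numbers chosen only to show a negative content term.
example = ConditionAccuracy(
    weak_alone=0.50,
    strong_resolve=0.70,
    strong_null_draft=0.78,
    strong_real_draft=0.74,
)
parts = decompose(example)
for name, value in parts.items():
    print(f"{name}: {value:+.2f}")
```

Under these assumptions, a negative `content` term alongside a positive `scaffold` term corresponds to the code-generation pattern the abstract describes, where the two-stage structure helps while weak draft content hurts.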