行为差异如何塑造智能体准确度：一致性的放大效应

摘要

随着基于大语言模型的智能体被部署到生产系统中，理解其行为一致性（即在执行相同任务时是否产生相似的动作序列）对确保可靠性至关重要。本研究以SWE-bench这一需要复杂多步推理的软件工程基准为背景，对行为一致性展开探讨。通过对比Claude 4.5 Sonnet、GPT-5和Llama-3.1-70B模型各50次运行结果（10项任务×5次运行），我们发现模型间存在一致性越高则准确率越高的趋势：Claude变异系数最低（CV: 15.2%）且准确率最高（58%），GPT-5处于中间水平（CV: 32.2%，准确率: 32%），而Llama变异系数最高（CV: 47.0%）且准确率最低（4%）。然而在模型内部，一致性可能同时放大正确与错误的解读。分析揭示关键细微差别：一致性强化结果而非保证正确性。Claude的失败案例中71%源于"持续性错误解读"，即在所有运行中重复相同错误假设。值得注意的是，GPT-5虽与Claude达成相似的早期策略共识（分别于第3.4步与第3.2步开始分化），但其变异系数高出2.1倍，表明分化时机并非决定一致性的唯一因素。这些发现说明，对于生产环境部署，解读准确性比执行一致性更为重要，这对智能体评估与训练具有重要启示。

English

As LLM-based agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude~4.5~Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks times 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest variance (CV: 15.2\%) and highest accuracy (58\%), GPT-5 is intermediate (CV: 32.2\%, accuracy: 32\%), and Llama shows the highest variance (CV: 47.0\%) with lowest accuracy (4\%). However, within a model, consistency can amplify both correct and incorrect interpretations. Our analysis reveals a critical nuance: consistency amplifies outcomes rather than guaranteeing correctness. 71\% of Claude's failures stem from "consistent wrong interpretation": making the same incorrect assumption across all runs. Interestingly, GPT-5 achieves similar early strategic agreement as Claude (diverging at step 3.4 vs.\ 3.2) but exhibits 2.1times higher variance, suggesting that divergence timing alone does not determine consistency. These findings suggest that for production deployment, interpretation accuracy matters more than execution consistency, with implications for agent evaluation and training.

行为差异如何塑造智能体准确度：一致性的放大效应

Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy

摘要

Support