Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy

Abstract

As LLM-based agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude 4.5 Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks × 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest variance (CV: 15.2%) and the highest accuracy (58%), GPT-5 is intermediate (CV: 32.2%, accuracy: 32%), and Llama shows the highest variance (CV: 47.0%) with the lowest accuracy (4%). Within a single model, however, our analysis reveals a critical nuance: consistency amplifies outcomes rather than guaranteeing correctness, so it can reinforce correct and incorrect interpretations alike. 71% of Claude's failures stem from "consistent wrong interpretation": making the same incorrect assumption across all runs. Interestingly, GPT-5 reaches early strategic agreement similar to Claude's (runs diverge at step 3.4 vs. 3.2) but exhibits 2.1× higher variance, suggesting that divergence timing alone does not determine consistency. These findings suggest that for production deployment, interpretation accuracy matters more than execution consistency, with implications for agent evaluation and training.
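
The coefficient of variation (CV) quoted above is the standard deviation divided by the mean, expressed as a percentage. The abstract does not say which quantity the CV is computed over or how it is aggregated, so the following is only a minimal sketch under stated assumptions: that the measured quantity is the number of action steps per run, that a CV is computed per task over its repeated runs and then averaged across tasks, and that the population standard deviation is used. The function names and the step counts are hypothetical and serve only as an illustration; they are not data from the study.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return the CV (population std / mean) of `values` as a percentage."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * pstdev(values) / mu


def mean_task_cv(runs_by_task):
    """Average the per-task CV across all tasks.

    `runs_by_task` maps a task id to the list of per-run measurements
    (assumed here to be step counts) for that task.
    """
    return mean(coefficient_of_variation(runs) for runs in runs_by_task.values())


# Hypothetical step counts (NOT from the paper): 2 tasks x 5 runs each.
example_runs = {
    "task_a": [10, 10, 10, 10, 10],  # identical runs, so the CV is 0
    "task_b": [8, 12, 8, 12, 10],    # runs vary around a mean of 10
}

if __name__ == "__main__":
    for task, runs in example_runs.items():
        print(task, round(coefficient_of_variation(runs), 1))
    print("mean CV:", round(mean_task_cv(example_runs), 1))
```

Because the CV is normalized by the mean, it allows runs of different typical lengths to be compared on one scale, which is presumably why it suits a cross-model comparison. Swapping `pstdev` for `stdev` gives the sample-based variant, which is noticeably larger with only five runs per task.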