TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents

February 2, 2026
Authors: Hang Yan, Xinyu Che, Fangzhi Xu, Qiushi Sun, Zichen Ding, Kanzhi Cheng, Jian Zhang, Tao Qin, Jun Liu, Qika Lin
cs.AI

Abstract

Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. We define this paradigm as Test-Time Improvement (TTI). However, the mechanisms underlying how and why TTI succeeds or fails remain poorly understood, and existing evaluation metrics fail to capture its task optimization efficiency, behavior adaptation after erroneous actions, and the specific utility of working memory for task completion. To address these gaps, we propose Test-time Improvement Diagnostic Evaluation (TIDE), an agent-agnostic and environment-agnostic framework that decomposes TTI into three comprehensive and interconnected dimensions. The framework quantifies (1) the overall temporal dynamics of task completion, and identifies whether performance is primarily constrained by (2) recursive looping behaviors or (3) burdensome accumulated memory. Through extensive experiments across diverse agents and environments, TIDE highlights that improving agent performance requires more than scaling internal reasoning: it also calls for explicitly optimizing the interaction dynamics between the agent and the environment.
February 6, 2026