LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

Abstract

Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Each question is paired with a history of up to 500 trajectories and 115M tokens. We use a context gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite these gains, coding-agent-based methods incur high latency costs. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.
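To make the context gathering formulation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical two-method interface (`ingest`, `gather`) and a toy keyword-overlap retriever named `KeywordMemory`. All names, the string-based trajectory format, and the `budget` parameter are illustrative assumptions and are not taken from LME-V2 or AgentRunbook.

```python
from dataclasses import dataclass, field
from typing import Protocol


class MemorySystem(Protocol):
    """Hypothetical interface: consume history trajectories, return evidence."""

    def ingest(self, trajectories: list[list[str]]) -> None: ...

    def gather(self, question: str, budget: int = 3) -> list[str]: ...


def _tokens(text: str) -> set[str]:
    # Lowercased whitespace tokens; a deliberately crude stand-in for retrieval.
    return set(text.lower().split())


@dataclass
class KeywordMemory:
    """Toy memory (an assumption): stores every step, ranks by word overlap."""

    steps: list[str] = field(default_factory=list)

    def ingest(self, trajectories: list[list[str]]) -> None:
        # Flatten all trajectories into one pool of step descriptions.
        for trajectory in trajectories:
            self.steps.extend(trajectory)

    def gather(self, question: str, budget: int = 3) -> list[str]:
        # Score each stored step by the number of tokens shared with the question.
        query = _tokens(question)
        scored = [(len(query & _tokens(s)), i, s) for i, s in enumerate(self.steps)]
        scored = [item for item in scored if item[0] > 0]
        # Highest overlap first; ties keep the original step order.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [step for _, _, step in scored[:budget]]


def build_prompt(question: str, evidence: list[str]) -> str:
    # The downstream answerer sees only the compact evidence, not the full history.
    lines = ["Evidence:"] + [f"- {e}" for e in evidence] + [f"Question: {question}"]
    return "\n".join(lines)


memory: MemorySystem = KeywordMemory()
memory.ingest([
    ["clicked Save button; form submitted", "export menu is under the Reports tab"],
    ["login page requires SSO"],
])
question = "where is the export menu"
print(build_prompt(question, memory.gather(question, budget=1)))
```

The point of the split is that the memory system is judged on the evidence it returns, while the answering model stays fixed. A RAG-style or coding-agent-style method would replace `KeywordMemory` behind the same interface.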
