Evaluating Very Long-Term Conversational Memory of LLM Agents

Abstract

Existing work on long-term open-domain dialogue focuses on evaluating model responses within contexts spanning no more than five chat sessions. Despite advances in long-context large language models (LLMs) and retrieval-augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline that generates high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding their dialogues in personas and temporal event graphs. Moreover, we equip each agent with the capability to share and react to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding in the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing an average of 300 turns and 9K tokens over up to 35 sessions. Based on LoCoMo, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event summarization, and multi-modal dialogue generation tasks. Our experimental results indicate that LLMs struggle to understand lengthy conversations and to comprehend long-range temporal and causal dynamics within dialogues. Strategies such as long-context LLMs or RAG can offer improvements, but these models still lag substantially behind human performance.
