REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation

Abstract

Long-term, open-domain dialogue capabilities are essential for chatbots that aim to recall past interactions and demonstrate emotional intelligence (EI). Yet most existing research relies on synthetic, LLM-generated data, leaving open questions about real-world conversational patterns. To address this gap, we introduce REALTALK, a 21-day corpus of authentic messaging app dialogues that provides a direct benchmark against genuine human interactions. We first conduct a dataset analysis, focusing on EI attributes and persona consistency, to understand the unique challenges posed by real-world dialogues. By comparing with LLM-generated conversations, we highlight key differences, including diverse emotional expressions and variations in persona stability that synthetic dialogues often fail to capture. Building on these insights, we introduce two benchmark tasks: (1) persona simulation, where a model continues a conversation on behalf of a specific user given prior dialogue context; and (2) memory probing, where a model answers targeted questions that require long-term memory of past interactions. Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on a specific user's chats improves persona emulation. Additionally, existing models face significant challenges in recalling and leveraging long-term context within real-world conversations.
