LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction
September 9, 2025
Authors: Weichu Liu, Jing Xiong, Yuxuan Hu, Zixuan Li, Minghuan Tan, Ningning Mao, Chenyang Zhao, Zhongwei Wan, Chaofan Tao, Wendong Xu, Hui Shen, Chengming Li, Lingpeng Kong, Ngai Wong
cs.AI
Abstract
Large language models (LLMs) have made significant progress in Emotional
Intelligence (EI) and long-context understanding. However, existing benchmarks
tend to overlook certain aspects of EI in long-context scenarios, especially
under realistic, practical settings where interactions are lengthy, diverse,
and often noisy. To move towards such realistic settings, we present
LongEmotion, a benchmark specifically designed for long-context EI tasks. It
covers a diverse set of tasks, including Emotion Classification, Emotion
Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion
Expression. On average, the input length for these tasks reaches 8,777 tokens,
with long-form generation required for Emotion Expression. To enhance
performance under realistic constraints, we incorporate Retrieval-Augmented
Generation (RAG) and Collaborative Emotional Modeling (CoEM), and compare them
with standard prompt-based methods. Unlike conventional approaches, our RAG
method leverages both the conversation context and the large language model
itself as retrieval sources, avoiding reliance on external knowledge bases. The
CoEM method further improves performance by decomposing the task into five
stages, integrating both retrieval augmentation and limited knowledge
injection. Experimental results show that both RAG and CoEM consistently
enhance EI-related performance across most long-context tasks, advancing LLMs
toward more practical and real-world EI applications. Furthermore, we conduct
a comparative case study on the GPT series to illustrate the differences
among models in terms of EI. Code is available on GitHub at
https://github.com/LongEmotion/LongEmotion, and the project page can be found
at https://longemotion.github.io/.
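To make the retrieval design concrete, below is a minimal sketch of the self-retrieval idea described in the abstract: relevant segments are retrieved from the long conversation itself, and the model's own parametric knowledge is elicited as an additional pseudo-document, so no external knowledge base is needed. All helper names here (chunk_dialogue, embed, llm, answer_with_self_rag) are hypothetical placeholders, not the paper's actual implementation.

from typing import List
import numpy as np

def chunk_dialogue(dialogue: str, size: int = 512) -> List[str]:
    """Split a long conversation into fixed-size character chunks."""
    return [dialogue[i:i + size] for i in range(0, len(dialogue), size)]

def embed(texts: List[str]) -> np.ndarray:
    """Placeholder encoder; swap in any real sentence-embedding model."""
    rng = np.random.default_rng(0)  # deterministic stand-in, not a real encoder
    return rng.normal(size=(len(texts), 384))

def llm(prompt: str) -> str:
    """Stand-in for a chat-completion call; replace with a real model API."""
    return f"[model output for: {prompt[:40]}...]"

def answer_with_self_rag(dialogue: str, question: str, k: int = 4) -> str:
    # Retrieval source 1: the long conversation context itself.
    chunks = chunk_dialogue(dialogue)
    chunk_vecs, query_vec = embed(chunks), embed([question])[0]
    scores = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    retrieved = [chunks[i] for i in np.argsort(-scores)[:k]]

    # Retrieval source 2: the model's own parametric knowledge, elicited as a
    # pseudo-document instead of querying an external knowledge base.
    pseudo_doc = llm(f"Briefly state background knowledge useful for: {question}")

    context = "\n---\n".join(retrieved + [pseudo_doc])
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

Treating the model itself as a second retrieval source keeps the pipeline self-contained, consistent with the abstract's stated goal of avoiding reliance on external knowledge bases.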
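The abstract states only that CoEM decomposes the task into five stages combining retrieval augmentation with limited knowledge injection; it does not name the stages. The skeleton below shows one plausible wiring of such a staged pipeline, reusing the chunk_dialogue and llm stubs from the sketch above. The stage names (perceive, retrieve, inject, reason, respond) are invented for illustration and should not be read as the paper's actual decomposition.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class CoEMState:
    dialogue: str
    question: str
    retrieved: List[str] = field(default_factory=list)  # retrieval augmentation
    knowledge: str = ""                                 # limited knowledge injection
    draft: str = ""
    answer: str = ""

def perceive(s: CoEMState) -> None:
    """Hypothetical stage 1: normalize the incoming query."""
    s.question = s.question.strip()

def retrieve(s: CoEMState) -> None:
    """Hypothetical stage 2: pull relevant spans from the conversation itself."""
    keywords = s.question.split()[:5]
    s.retrieved = [c for c in chunk_dialogue(s.dialogue)
                   if any(w in c for w in keywords)]

def inject(s: CoEMState) -> None:
    """Hypothetical stage 3: inject a small amount of elicited knowledge."""
    s.knowledge = llm(f"State one relevant fact about: {s.question}")

def reason(s: CoEMState) -> None:
    """Hypothetical stage 4: draft an answer from retrieved context + knowledge."""
    s.draft = llm("Context:\n" + "\n".join(s.retrieved + [s.knowledge])
                  + f"\nQuestion: {s.question}\nDraft answer:")

def respond(s: CoEMState) -> None:
    """Hypothetical stage 5: refine the draft into the final response."""
    s.answer = llm(f"Polish this answer for empathy and clarity:\n{s.draft}")

def run_coem(dialogue: str, question: str) -> str:
    stages: List[Callable[[CoEMState], None]] = [
        perceive, retrieve, inject, reason, respond]
    state = CoEMState(dialogue, question)
    for stage in stages:
        stage(state)
    return state.answer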