LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction
September 9, 2025
Authors: Weichu Liu, Jing Xiong, Yuxuan Hu, Zixuan Li, Minghuan Tan, Ningning Mao, Chenyang Zhao, Zhongwei Wan, Chaofan Tao, Wendong Xu, Hui Shen, Chengming Li, Lingpeng Kong, Ngai Wong
cs.AI
Abstract
Large language models (LLMs) have made significant progress in Emotional
Intelligence (EI) and long-context understanding. However, existing benchmarks
tend to overlook certain aspects of EI in long-context scenarios, especially
under realistic, practical settings where interactions are lengthy, diverse,
and often noisy. To move towards such realistic settings, we present
LongEmotion, a benchmark specifically designed for long-context EI tasks. It
covers a diverse set of tasks, including Emotion Classification, Emotion
Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion
Expression. On average, the input length for these tasks reaches 8,777 tokens,
with long-form generation required for Emotion Expression. To enhance
performance under realistic constraints, we incorporate Retrieval-Augmented
Generation (RAG) and Collaborative Emotional Modeling (CoEM), and compare them
with standard prompt-based methods. Unlike conventional approaches, our RAG
method leverages both the conversation context and the large language model
itself as retrieval sources, avoiding reliance on external knowledge bases. The
CoEM method further improves performance by decomposing the task into five
stages, integrating both retrieval augmentation and limited knowledge
injection. Experimental results show that both RAG and CoEM consistently
enhance EI-related performance across most long-context tasks, advancing LLMs
toward more practical and real-world EI applications. Furthermore, we conduct
a comparative case study on the GPT series to demonstrate how the models
differ in EI. Code is available on GitHub at
https://github.com/LongEmotion/LongEmotion, and the project page can be found
at https://longemotion.github.io/.
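
Below is a minimal Python sketch of the retrieval idea the abstract describes: ranking passages drawn from the long conversation itself, and eliciting the model's own background knowledge as a second retrieval source, with no external knowledge base. Every helper name here (chunk, score, query_llm) and the prompt wording are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List


def chunk(text: str, size: int = 400) -> List[str]:
    """Split a long interaction into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def score(query: str, passage: str) -> float:
    """Lexical-overlap relevance score (a stand-in for a real retriever)."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / (len(q) or 1)


def answer_with_self_rag(
    question: str,
    conversation: str,
    query_llm: Callable[[str], str],  # any chat-completion wrapper
    top_k: int = 3,
) -> str:
    # Retrieval source 1: the conversation context itself.
    ranked = sorted(chunk(conversation),
                    key=lambda c: score(question, c), reverse=True)
    passages = ranked[:top_k]
    # Retrieval source 2: the LLM itself, prompted for relevant background
    # knowledge instead of querying an external knowledge base.
    self_knowledge = query_llm(
        "Briefly state psychological knowledge relevant to: " + question
    )
    context = "\n---\n".join(passages + [self_knowledge])
    return query_llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```

A real pipeline would likely swap the lexical score for an embedding retriever and split by tokens rather than characters; the structure, two retrieval sources and no external corpus, follows the abstract's description.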