LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction

Abstract

Large language models (LLMs) have made significant progress in emotional intelligence (EI) and long-context understanding. However, existing benchmarks tend to overlook certain aspects of EI in long-context scenarios, especially in realistic, practical settings where interactions are lengthy, diverse, and often noisy. To move toward such settings, we present LongEmotion, a benchmark designed specifically for long-context EI tasks. It covers six tasks: Emotion Classification, Emotion Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion Expression. The average input length across these tasks reaches 8,777 tokens, and Emotion Expression additionally requires long-form generation. To improve performance under realistic constraints, we incorporate Retrieval-Augmented Generation (RAG) and Collaborative Emotional Modeling (CoEM) and compare them with standard prompt-based methods. Unlike conventional approaches, our RAG method uses both the conversation context and the large language model itself as retrieval sources, avoiding reliance on external knowledge bases. The CoEM method further improves performance by decomposing the task into five stages that integrate retrieval augmentation with limited knowledge injection. Experimental results show that both RAG and CoEM consistently improve EI-related performance across most long-context tasks, moving LLMs closer to practical, real-world EI applications. In addition, we conduct a comparative case study on the GPT series to show how models differ in EI. Code is available on GitHub at https://github.com/LongEmotion/LongEmotion, and the project page is at https://longemotion.github.io/.
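
The idea of using the conversation context itself as a retrieval source can be illustrated with a minimal sketch. The abstract does not specify how the context is chunked or scored, so everything here is an assumption made for illustration: the fixed-size word chunks, the bag-of-words cosine similarity, and the helper names (chunk_text, bag_of_words, cosine, retrieve_from_context) are hypothetical and are not taken from the LongEmotion code.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 50) -> list[str]:
    """Split a long context into chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def bag_of_words(text: str) -> Counter:
    """Lowercased term counts; a stand-in for whatever encoder a real system would use."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_from_context(context: str, query: str, top_k: int = 2, chunk_size: int = 50) -> list[str]:
    """Rank chunks of the conversation context against the query; no external knowledge base."""
    chunks = chunk_text(context, chunk_size)
    query_vec = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(bag_of_words(c), query_vec), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    context = (
        "The weather was sunny all week. "
        "I felt anxious about the exam. "
        "We cooked pasta for dinner yesterday."
    )
    query = "Why was the speaker anxious about the exam?"
    for chunk in retrieve_from_context(context, query, top_k=1, chunk_size=6):
        print(chunk)
```

The retrieved chunks would then be placed in the prompt ahead of the question. The second retrieval source named in the abstract, the model itself, is not modeled in this sketch.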
