

AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization

February 4, 2026
作者: Ashutosh Chaubey, Jiacheng Pang, Maksim Siniukov, Mohammad Soleymani
cs.AI

Abstract

Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models have shown strong performance on this task, two key challenges remain: spurious associations between emotions and irrelevant audiovisual cues, and hallucinations of audiovisual cues driven by text priors in the language model backbone. To quantify and understand these issues, we introduce EmoReAlM, a benchmark designed to evaluate MLLMs for cue-emotion associations, hallucinations, and modality agreement. We then propose AVEm-DPO, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over responses exhibiting spurious associations or hallucinations, and over audiovisual input pairs guided by textual prompts. We also include a regularization term that penalizes reliance on text priors, thereby mitigating modality-specific cue hallucinations. Experimental results on DFEW, RAVDESS, and EMER demonstrate that our method significantly improves the performance of the reference baseline models, with 6-19% relative performance gains in zero-shot settings. By providing both a rigorous benchmark and a robust optimization framework, this work enables principled evaluation and improvement of MLLMs for emotion understanding and social AI. Code, models, and the benchmark will be released at https://avere-iclr.github.io.
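For context, the standard direct preference optimization (DPO) objective that a method like AVEm-DPO would build on is sketched below. The abstract does not specify the exact form of the text-prior regularizer or its weight, so the symbols R_text and lambda are illustrative assumptions, not the paper's formulation:

\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta\left(\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right)\right] + \lambda\, R_{\mathrm{text}}(\theta)

Here x denotes the audiovisual input together with the emotion-centric query, y_w a preferred response grounded in the audiovisual cues, and y_l a dispreferred response exhibiting a spurious association or hallucination; pi_ref is the frozen reference model and sigma the logistic function. The added term is a hedged placeholder for the paper's penalty on reliance on text priors.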