SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations

June 4, 2026
Authors: Taewon Yun, Hyeonseong Park, Jeonghwan Choi, Hayoon Park, Yeeun Choi, Hwanjun Song
cs.AI

Abstract

Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactive LLM mediators in realistic, multi-domain testbeds. It constructs scenarios from real conflicts through an agentic pipeline across eight domains, probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and scores each topic only on the turns that advance it via a topic-localized evaluator. The evaluator reaches 0.82 alignment with human experts, more than doubling a per-turn baseline. Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediated consensus gap under diverse and realistic testbeds, with performance varying sharply by socio-cognitive axis, highlighting that progress lies in social adaptation to diverse conditions.
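To make the contrast between per-turn and topic-localized scoring concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the SoCRATES implementation: the function names, the data layout (one dictionary of per-topic scores per turn), the `advancing` map of turn indices per topic, and the gap-closure formula are all hypothetical and do not come from the paper.

```python
from statistics import mean

# Hypothetical layout: one dict per dialogue turn, mapping topic -> score in [0, 1].
# A turn that does not address a topic scores 0.0 on it.

def per_turn_score(scores, topics):
    """Baseline: average every turn's score for every topic."""
    return {t: mean(turn[t] for turn in scores) for t in topics}

def topic_localized_score(scores, advancing, topics):
    """Average only the turns flagged as advancing each topic.

    `advancing` maps topic -> list of turn indices (an assumed input;
    how such turns would be identified is not specified here).
    """
    result = {}
    for t in topics:
        relevant = [scores[i][t] for i in advancing.get(t, [])]
        result[t] = mean(relevant) if relevant else None
    return result

def gap_closed(unmediated, mediated, full=1.0):
    """Assumed definition: share of the gap between the unmediated
    consensus level and full consensus that mediation closes."""
    gap = full - unmediated
    return (mediated - unmediated) / gap if gap > 0 else 0.0

topics = ["rent", "noise"]
scores = [
    {"rent": 0.8, "noise": 0.0},  # turn 0 advances "rent"
    {"rent": 0.0, "noise": 0.6},  # turn 1 advances "noise"
    {"rent": 0.6, "noise": 0.0},  # turn 2 advances "rent"
]
advancing = {"rent": [0, 2], "noise": [1]}

baseline = per_turn_score(scores, topics)
localized = topic_localized_score(scores, advancing, topics)
closed = gap_closed(unmediated=0.4, mediated=0.6)
```

In this toy dialogue the per-turn baseline pulls each topic's score down with turns that never addressed it, while the localized variant averages only the relevant turns. With the illustrative consensus levels above, `gap_closed` comes out at roughly one third.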