SoCRATES: Toward Reliable Automatic Evaluation of Proactive LLM Mediation under Cross-Domain and Socio-Cognitive Variation

Abstract

Evaluating LLM mediators remains challenging because mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, which introduces off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactive LLM mediators in realistic, multi-domain testbeds. It constructs scenarios from real conflicts through an agentic pipeline spanning eight domains, probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and uses a topic-localized evaluator to score each topic only on the turns that advance it. The evaluator reaches 0.82 alignment with human experts, more than double that of a per-turn baseline. Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediated consensus gap in diverse, realistic testbeds, and that performance varies sharply by socio-cognitive axis, which suggests that further progress depends on social adaptation to diverse conditions.
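The contrast between per-turn scoring and topic-localized scoring can be illustrated with a minimal sketch. This is not the SoCRATES evaluator: the function names, the data layout (one score per topic per turn), the set of "advancing" turns, and the use of a plain mean are all assumptions made for illustration, and the numbers are invented toy values.

```python
from statistics import mean


def per_turn_scores(turn_scores):
    """Baseline (assumed form): average each topic's score over every turn,
    including turns that never touched the topic."""
    return {topic: mean(scores) for topic, scores in turn_scores.items()}


def topic_localized_scores(turn_scores, advancing_turns):
    """Hypothetical localized variant: average each topic's score only over
    the turns marked as advancing that topic. Topics with no advancing
    turn get None instead of a score."""
    localized = {}
    for topic, scores in turn_scores.items():
        turns = sorted(advancing_turns.get(topic, ()))
        localized[topic] = mean(scores[i] for i in turns) if turns else None
    return localized


# Toy dialogue with four turns and two topics (illustrative values only).
turn_scores = {
    "rent": [0.75, 0.0, 0.25, 0.0],
    "repairs": [0.0, 0.5, 0.0, 1.0],
}
# Assumed annotation of which turns advance which topic.
advancing_turns = {"rent": {0, 2}, "repairs": {1, 3}}

baseline = per_turn_scores(turn_scores)
localized = topic_localized_scores(turn_scores, advancing_turns)
print(baseline)
print(localized)
```

In this toy case the baseline dilutes each topic's score with zeros from turns about the other topic, while the localized variant averages only over the turns that moved the topic forward. How the benchmark itself identifies advancing turns and aggregates scores is not specified in the abstract.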