CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives
April 15, 2025
Authors: Ayoung Lee, Ryan Sungmo Kwon, Peter Railton, Lu Wang
cs.AI
Abstract
Navigating high-stakes dilemmas involving conflicting values is challenging
even for humans, let alone for AI. Yet prior work in evaluating the reasoning
capabilities of large language models (LLMs) in such situations has been
limited to everyday scenarios. To close this gap, this work first introduces
CLASH (Character perspective-based LLM Assessments in Situations with
High-stakes), a meticulously curated dataset consisting of 345 high-impact
dilemmas along with 3,795 individual perspectives (eleven per dilemma) reflecting
diverse values. In particular, we design CLASH to support the study of critical
aspects of value-based decision-making processes which are missing from prior work,
including understanding decision ambivalence and psychological discomfort as
well as capturing the temporal shifts of values in characters' perspectives. By
benchmarking 10 open and closed frontier models, we uncover several key
findings. (1) Even the strongest models, such as GPT-4o and Claude-Sonnet,
achieve less than 50% accuracy in identifying situations where the decision
should be ambivalent, while they perform significantly better in clear-cut
scenarios. (2) While LLMs reasonably predict psychological discomfort as marked
by humans, they inadequately comprehend perspectives involving value shifts,
indicating that LLMs need stronger reasoning over complex values. (3) Our experiments
also reveal a significant correlation between LLMs' value preferences and their
steerability towards a given value. (4) Finally, LLMs exhibit greater
steerability when engaged in value reasoning from a third-party perspective,
compared to a first-person setup, though certain value pairs benefit uniquely
from the first-person framing.