Supernova Event Dataset: Interpreting Large Language Model's Personality through Critical Event Analysis
June 13, 2025
Authors: Pranav Agarwal, Ioana Ciucă
cs.AI
Abstract
Large Language Models (LLMs) are increasingly integrated into everyday
applications. As their influence grows, understanding their decision-making and
underlying personality becomes essential. In this work, we interpret model
personality using our proposed Supernova Event Dataset, a novel dataset with
diverse articles spanning biographies, historical events, news, and scientific
discoveries. We use this dataset to benchmark LLMs on extracting and ranking
key events from text, a subjective and complex challenge that requires
reasoning over long-range context and modeling causal chains. We evaluate small
models like Phi-4, Orca 2, and Qwen 2.5, and larger, stronger models such as
Claude 3.7, Gemini 2.5, and OpenAI o3, and propose a framework where another
LLM acts as a judge to infer each model's personality based on its selection
and classification of events. Our analysis shows distinct personality traits:
for instance, Orca 2 demonstrates emotional reasoning focusing on interpersonal
dynamics, while Qwen 2.5 displays a more strategic, analytical style. When
analyzing scientific discovery events, Claude Sonnet 3.7 emphasizes conceptual
framing, Gemini 2.5 Pro prioritizes empirical validation, and o3 favors
step-by-step causal reasoning. This analysis improves model interpretability,
making these models more user-friendly across a wide range of applications.