Supernova Event Dataset: Interpreting Large Language Model's Personality through Critical Event Analysis
June 13, 2025
Authors: Pranav Agarwal, Ioana Ciucă
cs.AI
Abstract
Large Language Models (LLMs) are increasingly integrated into everyday
applications. As their influence grows, understanding their decision-making and
underlying personality becomes essential. In this work, we interpret model
personality using our proposed Supernova Event Dataset, a novel dataset with
diverse articles spanning biographies, historical events, news, and scientific
discoveries. We use this dataset to benchmark LLMs on extracting and ranking
key events from text, a subjective and complex challenge that requires
reasoning over long-range context and modeling causal chains. We evaluate small
models such as Phi-4, Orca 2, and Qwen 2.5, and larger, stronger models such as
Claude 3.7, Gemini 2.5, and OpenAI o3, and propose a framework where another
LLM acts as a judge to infer each model's personality based on its selection
and classification of events. Our analysis shows distinct personality traits:
for instance, Orca 2 demonstrates emotional reasoning focusing on interpersonal
dynamics, while Qwen 2.5 displays a more strategic, analytical style. When
analyzing scientific discovery events, Claude Sonnet 3.7 emphasizes conceptual
framing, Gemini 2.5 Pro prioritizes empirical validation, and o3 favors
step-by-step causal reasoning. This analysis improves model interpretability,
making the models easier to use across a wide range of applications.