Supernova Event Dataset: Interpreting the Personality of Large Language Models through Critical Event Analysis

Abstract

Large Language Models (LLMs) are increasingly integrated into everyday applications. As their influence grows, understanding their decision-making and underlying personality becomes essential. In this work, we interpret model personality using our proposed Supernova Event Dataset, a novel dataset of diverse articles spanning biographies, historical events, news, and scientific discoveries. We use this dataset to benchmark LLMs on extracting and ranking key events from text, a subjective and complex task that requires reasoning over long-range context and modeling causal chains. We evaluate small models such as Phi-4, Orca 2, and Qwen 2.5, and larger, stronger models such as Claude Sonnet 3.7, Gemini 2.5 Pro, and OpenAI o3. We also propose a framework in which another LLM acts as a judge and infers each model's personality from its selection and classification of events. Our analysis reveals distinct personality traits: for instance, Orca 2 demonstrates emotional reasoning focused on interpersonal dynamics, while Qwen 2.5 displays a more strategic, analytical style. When analyzing scientific discovery events, Claude Sonnet 3.7 emphasizes conceptual framing, Gemini 2.5 Pro prioritizes empirical validation, and o3 favors step-by-step causal reasoning. This analysis improves model interpretability, making the models more user-friendly across a wide range of applications.