Behavioral Fingerprinting of Large Language Models

Abstract

Current benchmarks for Large Language Models (LLMs) focus primarily on performance metrics and often fail to capture the nuanced behavioral characteristics that differentiate models. This paper introduces a "Behavioral Fingerprinting" framework that moves beyond traditional evaluation by building a multi-faceted profile of a model's intrinsic cognitive and interactive styles. Using a curated Diagnostic Prompt Suite and an automated evaluation pipeline in which a powerful LLM acts as an impartial judge, we analyze eighteen models across capability tiers. Our results reveal a critical divergence in the LLM landscape: while core capabilities such as abstract and causal reasoning are converging among top models, alignment-related behaviors such as sycophancy and semantic robustness vary dramatically. We further document a cross-model clustering of default personas (ISTJ/ESTJ) that likely reflects common alignment incentives. Taken together, these findings suggest that a model's interactive nature is not an emergent property of its scale or reasoning power, but a direct consequence of specific, and highly variable, developer alignment strategies. Our framework provides a reproducible and scalable methodology for uncovering these deep behavioral differences. Project: https://github.com/JarvisPei/Behavioral-Fingerprinting