

Behavioral Fingerprinting of Large Language Models

September 2, 2025
作者: Zehua Pei, Hui-Ling Zhen, Ying Zhang, Zhiyuan Yang, Xing Li, Xianzhi Yu, Mingxuan Yuan, Bei Yu
cs.AI

Abstract

Current benchmarks for Large Language Models (LLMs) primarily focus on performance metrics, often failing to capture the nuanced behavioral characteristics that differentiate them. This paper introduces a novel "Behavioral Fingerprinting" framework designed to move beyond traditional evaluation by creating a multi-faceted profile of a model's intrinsic cognitive and interactive styles. Using a curated Diagnostic Prompt Suite and an innovative, automated evaluation pipeline where a powerful LLM acts as an impartial judge, we analyze eighteen models across capability tiers. Our results reveal a critical divergence in the LLM landscape: while core capabilities like abstract and causal reasoning are converging among top models, alignment-related behaviors such as sycophancy and semantic robustness vary dramatically. We further document a cross-model default persona clustering (ISTJ/ESTJ) that likely reflects common alignment incentives. Taken together, this suggests that a model's interactive nature is not an emergent property of its scale or reasoning power, but a direct consequence of specific, and highly variable, developer alignment strategies. Our framework provides a reproducible and scalable methodology for uncovering these deep behavioral differences. Project: https://github.com/JarvisPei/Behavioral-Fingerprinting
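To make the described pipeline concrete, below is a minimal sketch of an LLM-as-judge evaluation loop of the kind the abstract outlines. It is not the authors' implementation: the `query_model` callable, the `JUDGE_TEMPLATE` wording, the `judge-llm` identifier, and the 1-5 scoring scale are all illustrative assumptions standing in for whatever prompt suite, judge model, and rubric the paper actually uses.

```python
# Hypothetical sketch of an LLM-as-judge behavioral-fingerprinting loop.
# `query_model` is an assumed callable (model_name, prompt) -> response text;
# plug in your own API client. Prompt wording and the 1-5 scale are illustrative.
from typing import Callable, Dict, List

JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the following response to the prompt "
    "on a 1-5 scale for {dimension}. Reply with only the number.\n\n"
    "Prompt: {prompt}\nResponse: {response}"
)

def fingerprint(
    model_name: str,
    prompts: Dict[str, List[str]],            # behavioral dimension -> diagnostic prompts
    query_model: Callable[[str, str], str],   # (model_name, prompt) -> response text
    judge_name: str = "judge-llm",            # hypothetical judge model identifier
) -> Dict[str, float]:
    """Return the average judge score per behavioral dimension for one model."""
    profile: Dict[str, float] = {}
    for dimension, prompt_list in prompts.items():
        scores: List[float] = []
        for prompt in prompt_list:
            # 1) Collect the subject model's response to the diagnostic prompt.
            response = query_model(model_name, prompt)
            # 2) Ask the judge model to score that response on this dimension.
            verdict = query_model(
                judge_name,
                JUDGE_TEMPLATE.format(
                    dimension=dimension, prompt=prompt, response=response
                ),
            )
            try:
                scores.append(float(verdict.strip()))
            except ValueError:
                continue  # skip judge outputs that are not a bare number
        if scores:
            profile[dimension] = sum(scores) / len(scores)
    return profile
```

Running this over a set of models and dimensions (e.g., sycophancy, semantic robustness, causal reasoning) yields one score vector per model, which can then be compared across the capability tiers discussed in the paper.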