PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits

September 14, 2025
Authors: Loka Li, Wong Yu Kang, Minghao Fu, Guangyi Chen, Zhenhao Chen, Gongxu Luo, Yuewen Sun, Salman Khan, Peter Spirtes, Kun Zhang
cs.AI

Abstract

Understanding human behavior traits is central to applications in human-computer interaction, computational social science, and personalized AI systems. Such understanding often requires integrating multiple modalities to capture nuanced patterns and relationships. However, existing resources rarely provide datasets that combine behavioral descriptors with complementary modalities such as facial attributes and biographical information. To address this gap, we present PersonaX, a curated collection of multimodal datasets designed to enable comprehensive analysis of public traits across modalities. PersonaX consists of (1) CelebPersona, featuring 9444 public figures from diverse occupations, and (2) AthlePersona, covering 4181 professional athletes across 7 major sports leagues. Each dataset includes behavioral trait assessments inferred by three high-performing large language models, alongside facial imagery and structured biographical features. We analyze PersonaX at two complementary levels. First, we abstract high-level trait scores from text descriptions and apply five statistical independence tests to examine their relationships with other modalities. Second, we introduce a novel causal representation learning (CRL) framework tailored to multimodal and multi-measurement data, providing theoretical identifiability guarantees. Experiments on both synthetic and real-world data demonstrate the effectiveness of our approach. By unifying structured and unstructured analysis, PersonaX establishes a foundation for studying LLM-inferred behavioral traits in conjunction with visual and biographical attributes, advancing multimodal trait analysis and causal reasoning.
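The abstract mentions applying five statistical independence tests between the LLM-inferred trait scores and the other modalities, but does not name them here. As a minimal sketch of how one generic test of this kind can be run, the snippet below implements a kernel-based independence test (HSIC with a permutation p-value) from scratch in NumPy; the function names, the synthetic "trait score" and "age" arrays, and the choice of HSIC are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def rbf_kernel(x, sigma=None):
    """Pairwise RBF kernel matrix; bandwidth via the median heuristic if unset."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(x.shape[0], -1)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    if sigma is None:
        med = np.median(sq_dists[sq_dists > 0])
        sigma = np.sqrt(0.5 * med)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def hsic_statistic(K, L):
    """Biased HSIC estimate tr(KHLH)/(n-1)^2 from two kernel matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def hsic_permutation_test(x, y, n_perm=1000, seed=0):
    """Permutation p-value for the null hypothesis that x and y are independent."""
    rng = np.random.default_rng(seed)
    K, L = rbf_kernel(x), rbf_kernel(y)
    observed = hsic_statistic(K, L)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(y))
        if hsic_statistic(K, L[np.ix_(idx, idx)]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

# Hypothetical example: LLM-inferred trait scores vs. a biographical feature (age).
rng = np.random.default_rng(1)
trait_scores = rng.normal(size=200)
ages = 0.3 * trait_scores + rng.normal(size=200)
stat, p_value = hsic_permutation_test(trait_scores, ages)
print(f"HSIC = {stat:.4f}, permutation p-value = {p_value:.3f}")
```

A small p-value would indicate dependence between the trait score and the biographical attribute; in practice one would run such tests per trait and per modality and correct for multiple comparisons.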