ADAM: A Diverse Archive of Mankind for Evaluating and Enhancing LLMs in Biographical Reasoning

Abstract

We introduce ADAM (A Diverse Archive of Mankind), a framework for evaluating and improving multimodal large language models (MLLMs) in biographical reasoning. To the best of our knowledge, this is the first work to systematically examine LLM capabilities in biography, a critical yet underexplored dimension of factual knowledge. At the core of ADAM is AdamDB, a multilingual and multimodal dataset covering over 4 million individuals across geography, time, and profession. AdamBench complements it with cognitively structured evaluations based on Bloom's taxonomy, spanning six reasoning levels in both English and native languages. To address hallucinations, particularly for lesser-known individuals, we propose AdamRAG, a retrieval-augmented generation system tailored to biographical contexts. Experiments show that AdamRAG substantially improves open-source models and modestly benefits closed-source ones, with the largest gains on lower-order reasoning. An individual's popularity strongly mediates accuracy, and multimodal input via face images offers smaller, less consistent improvements than retrieval. ADAM establishes the first benchmark and framework for cognitively, culturally, and multimodally grounded biographical evaluation, advancing the development of multilingual, accurate, and hallucination-resistant MLLMs.