ADAM: A Diverse Archive of Mankind for Evaluating and Enhancing LLMs in Biographical Reasoning

Abstract

We introduce ADAM (A Diverse Archive of Mankind), a framework for evaluating and improving multimodal large language models (MLLMs) in biographical reasoning. To the best of our knowledge, this is the first work to systematically examine LLM capabilities in biography, a critical yet underexplored dimension of factual knowledge. ADAM is built around two components: AdamDB, a multilingual and multimodal dataset covering over 4 million individuals across geography, time, and profession, and AdamBench, a cognitively structured evaluation based on Bloom's taxonomy that spans six reasoning levels in both English and native languages. To address hallucinations, particularly for lesser-known individuals, we propose AdamRAG, a retrieval-augmented generation system tailored to biographical contexts. Experiments show that AdamRAG substantially improves open-source models and modestly benefits closed-source ones, with the largest gains on lower-order reasoning. Popularity strongly mediates accuracy, and multimodal input via face images offers smaller, less consistent improvements than retrieval. ADAM establishes the first benchmark and framework for cognitively, culturally, and multimodally grounded biographical evaluation, advancing the development of multilingual, accurate, and hallucination-resistant MLLMs.