VaseVQA: Multimodal Agent and Benchmark for Ancient Greek Pottery
September 21, 2025
Authors: Jinchao Ge, Tengfei Cheng, Biao Wu, Zeyu Zhang, Shiya Huang, Judith Bishop, Gillian Shepherd, Meng Fang, Ling Chen, Yang Zhao
cs.AI
Abstract
Analyzing cultural-heritage artifacts remains challenging for multimodal large
language models (MLLMs): general models lack domain expertise, and supervised
fine-tuning (SFT) often overfits superficial patterns, yielding brittle
reasoning for authentication and historical attribution. This raises the
question of how to equip MLLMs with robust, expert-level reasoning for ancient
Greek pottery. We present VaseVL, an SFT-then-reinforcement-learning (RL)
system that turns evaluation into supervision: we construct a taxonomy of
question types, probe the SFT model to localize type-specific performance
gaps, and optimize with type-conditioned, compositionality-oriented rewards
targeting those gaps. We also release VaseVQA, a comprehensive benchmark of
31,773 images designed to probe deep understanding. Experiments show
state-of-the-art results on style classification and historical attribution,
with marked gains in compositional robustness over SFT-only baselines,
validating diagnosis-guided, taxonomy-conditioned reward engineering and
providing a reusable resource for future research. Code and dataset will be
available at https://github.com/AIGeeksGroup/VaseVQA.
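
To make the reward design described in the abstract concrete, the following is a minimal sketch of what a type-conditioned, compositionality-oriented reward could look like. The question-type names, gap weights, and scoring helpers are hypothetical illustrations chosen for this sketch, not the paper's actual implementation.

```python
# Hypothetical sketch of a type-conditioned reward for the RL phase.
# Question types, gap weights, and scorers are illustrative assumptions,
# not VaseVL's released code.
from typing import Callable, Dict

def exact_match(pred: str, ref: str) -> float:
    """1.0 if the prediction matches the reference exactly (case-insensitive)."""
    return float(pred.strip().lower() == ref.strip().lower())

def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 for free-form answers."""
    p, r = pred.lower().split(), ref.lower().split()
    overlap = len(set(p) & set(r))
    if not p or not r or overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

# Per-type scorers: each maps (prediction, reference) -> score in [0, 1].
SCORERS: Dict[str, Callable[[str, str], float]] = {
    "attribution": exact_match,  # e.g. painter or workshop
    "dating": exact_match,       # e.g. period or century
    "style": token_f1,           # free-form style description
}

# Gap weights derived from probing the SFT model: types where the SFT model
# is weaker receive larger weights, so RL optimization targets those gaps.
GAP_WEIGHTS = {"attribution": 1.5, "dating": 1.2, "style": 1.0}

def type_conditioned_reward(question_type: str, pred: str, ref: str) -> float:
    """Gap-weighted, type-specific reward for a single model response."""
    score = SCORERS[question_type](pred, ref)
    return GAP_WEIGHTS[question_type] * score

# Example usage:
# type_conditioned_reward("attribution", "the Berlin Painter", "The Berlin Painter")
# -> 1.5
```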