VaseVQA: Multimodal Agent and Benchmark for Ancient Greek Pottery
September 21, 2025
Authors: Jinchao Ge, Tengfei Cheng, Biao Wu, Zeyu Zhang, Shiya Huang, Judith Bishop, Gillian Shepherd, Meng Fang, Ling Chen, Yang Zhao
cs.AI
Abstract
Analyzing cultural-heritage artifacts remains challenging for multimodal large
language models (MLLMs): general models lack domain expertise, and supervised
fine-tuning (SFT) often overfits to superficial patterns, yielding brittle
reasoning for authentication and historical attribution. This
raises the question of how to equip MLLMs with robust, expert-level reasoning
for ancient Greek pottery. We present VaseVL, an SFT-then-RL system that turns
evaluation into supervision: we construct a taxonomy of question types, probe
the SFT model to localize type-specific performance gaps, and optimize with
type-conditioned, compositionality-oriented rewards targeting those gaps. We
also release VaseVQA, a comprehensive benchmark of 31,773 images designed to
probe deep understanding. Experiments show state-of-the-art results on style
classification and historical attribution, with marked gains in compositional
robustness over SFT-only baselines, validating diagnosis-guided,
taxonomy-conditioned reward engineering and providing a reusable resource for
future research. The code and dataset will be available at
https://github.com/AIGeeksGroup/VaseVQA.
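
To make the idea of "type-conditioned, compositionality-oriented rewards" concrete, the following is a minimal illustrative sketch, not the paper's actual implementation: the function name, the gap dictionary, and the weighting scheme are all assumptions. It shows one plausible way a base correctness reward could be upweighted for question types where the diagnostic probing of the SFT model revealed larger performance gaps.

```python
# Hypothetical sketch: scale a base correctness reward by a per-type
# performance gap measured when probing the SFT model. Question types
# with larger diagnosed gaps receive proportionally larger rewards
# during RL, steering optimization toward the weak spots.

def type_conditioned_reward(question_type: str,
                            is_correct: bool,
                            gap_by_type: dict[str, float]) -> float:
    """Return a reward weighted by the diagnosed gap for this question type."""
    base = 1.0 if is_correct else 0.0
    # Types absent from the diagnosis default to no extra weighting.
    weight = 1.0 + gap_by_type.get(question_type, 0.0)
    return base * weight

# Illustrative (invented) gap estimates from probing the SFT model.
gaps = {"attribution": 0.4, "style": 0.1}
reward = type_conditioned_reward("attribution", True, gaps)  # 1.0 * 1.4 = 1.4
```

In this sketch, a correct answer to a weak question type (here, "attribution") earns a larger reward than a correct answer to a strong one, which is one simple way to target optimization at diagnosed gaps.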