VaseVQA: Multimodal Agent and Benchmark for Ancient Greek Pottery

Abstract

Analyzing cultural-heritage artifacts remains challenging for multimodal large language models (MLLMs): general-purpose models lack domain expertise, and supervised fine-tuning (SFT) often overfits to superficial patterns, yielding brittle reasoning for authentication and historical attribution. This raises the question of how to equip MLLMs with robust, expert-level reasoning about ancient Greek pottery. We present VaseVL, an SFT-then-RL system that turns evaluation into supervision: we construct a taxonomy of question types, probe the SFT model to localize type-specific performance gaps, and optimize with type-conditioned, compositionality-oriented rewards that target those gaps. We also release VaseVQA, a comprehensive benchmark of 31,773 images designed to probe deep understanding. Experiments show state-of-the-art results on style classification and historical attribution, with marked gains in compositional robustness over SFT-only baselines. These results validate diagnosis-guided, taxonomy-conditioned reward engineering and provide a reusable resource for future research. Code and dataset will be available at https://github.com/AIGeeksGroup/VaseVQA.
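The diagnosis step described in the abstract (probing a model per question type, then weighting rewards toward the weak types) can be illustrated with a minimal sketch. This is not the VaseVL implementation: the function names, the question-type labels, the `floor` parameter, and the "one minus accuracy" weighting rule are all assumptions made for illustration only.

```python
from collections import defaultdict


def per_type_accuracy(records):
    """Compute accuracy per question type.

    `records` is an iterable of (question_type, is_correct) pairs,
    e.g. the outcome of evaluating an SFT model on a held-out set.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for qtype, ok in records:
        totals[qtype] += 1
        correct[qtype] += int(ok)
    return {t: correct[t] / totals[t] for t in totals}


def gap_weights(accuracy, floor=0.1):
    """Turn per-type accuracy into normalized reward weights.

    Hypothetical rule: weight is proportional to (1 - accuracy), clipped
    below at `floor` so that well-solved types still get some reward.
    """
    raw = {t: max(1.0 - acc, floor) for t, acc in accuracy.items()}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}


def type_conditioned_reward(qtype, is_correct, weights):
    """Reward a correct answer by the weight of its question type."""
    return weights[qtype] if is_correct else 0.0


# Hypothetical evaluation log: "style" is solved, "attribution" is weak.
log = [
    ("style", True),
    ("style", True),
    ("attribution", True),
    ("attribution", False),
]
acc = per_type_accuracy(log)
weights = gap_weights(acc)
```

Under this assumed rule, a correct answer on the weaker type earns a larger reward than a correct answer on a type the model already handles well, which is one simple way to read "rewards targeting those gaps".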
