VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding
January 8, 2026
Authors: Ignacio de Rodrigo, Alvaro J. Lopez-Lopez, Jaime Boal
cs.AI
Abstract
This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.
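The clustering-guided workflow the abstract describes (reduce visual embeddings, cluster the reduced space, and locate error-prone regions) can be approximated with off-the-shelf tools. Below is a minimal sketch, assuming UMAP for dimensionality reduction, k-means for clustering, and precomputed per-sample F1 scores; the function name, parameters, and threshold are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np
import umap                          # pip install umap-learn
from sklearn.cluster import KMeans   # pip install scikit-learn

def find_error_prone_clusters(embeddings, f1_per_sample,
                              n_clusters=10, f1_threshold=0.5):
    """Reduce visual embeddings to 2-D, cluster the reduced space, and
    flag clusters whose mean per-sample F1 falls below a threshold.
    (Hypothetical helper; names and defaults are assumptions.)"""
    # Dimensionality reduction for visualization and space exploration
    reduced = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)

    # Partition the reduced space into candidate regions
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=42).fit_predict(reduced)

    # Flag clusters with low average F1 as error-prone
    weak = [(c, f1_per_sample[labels == c].mean())
            for c in range(n_clusters)
            if f1_per_sample[labels == c].mean() < f1_threshold]
    return reduced, labels, weak

# Toy usage: 1000 samples with 768-dim visual embeddings and per-sample F1
emb = np.random.randn(1000, 768).astype(np.float32)
f1 = np.random.rand(1000)
_, _, weak_clusters = find_error_prone_clusters(emb, f1)
print("Error-prone clusters (id, mean F1):", weak_clusters)
```

Samples drawn from the flagged clusters would then guide the targeted synthetic data generation and retraining step described above.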