VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding
January 8, 2026
Authors: Ignacio de Rodrigo, Alvaro J. Lopez-Lopez, Jaime Boal
cs.AI
Abstract
This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.
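The abstract describes a pipeline of embedding-space reduction, clustering, and per-cluster error analysis. The sketch below illustrates that general workflow on synthetic data; the reducer (PCA), clusterer (KMeans), and all parameters are assumptions for illustration, not the paper's actual choices.

```python
# Illustrative sketch of a VERSE-style analysis: reduce visual embeddings,
# cluster the reduced space, and rank clusters by error rate to locate
# problematic regions worth targeting with synthetic training data.
# PCA, KMeans, and every parameter here are stand-in assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 768))      # stand-in for document visual embeddings
correct = rng.random(500) > 0.2        # stand-in per-sample correctness flags

reduced = PCA(n_components=2, random_state=0).fit_transform(emb)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(reduced)

# High-error clusters are the candidates for targeted data generation
# and retraining, as the abstract describes.
for c in range(5):
    mask = labels == c
    err = 1.0 - correct[mask].mean()
    print(f"cluster {c}: n={mask.sum():3d} error={err:.2f}")
```

In the real methodology, `correct` would come from comparing model predictions against ground truth on the evaluation set, and the clusters would be inspected visually to identify the document features they share.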