Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models
March 12, 2025
Authors: Julian Spravil, Sebastian Houben, Sven Behnke
cs.AI
Abstract
Cross-lingual transfer enables vision-language models (VLMs) to perform
vision tasks in various languages with training data only in one language.
Current approaches rely on large pre-trained multilingual language models.
However, they face the curse of multilinguality, sacrificing downstream task
performance for multilingual capabilities, struggling with lexical ambiguities,
and falling behind recent advances. In this work, we study the scaling laws of
systematic generalization with monolingual VLMs for multilingual tasks,
focusing on the impact of model size and seen training samples. We propose
Florenz, a monolingual encoder-decoder VLM with 0.4B to 11.2B parameters
combining the pre-trained VLM Florence-2 and the large language model Gemma-2.
Florenz is trained with varying compute budgets on a synthetic dataset that
features intentionally incomplete language coverage for image captioning, thus
testing generalization from the fully covered translation task. We show that
not only does indirectly learning unseen task-language pairs adhere to a
scaling law, but also that with our data generation pipeline and the proposed
Florenz model family, image captioning abilities can emerge in a specific
language even when only data for the translation task is available. Fine-tuning
on a mix of downstream datasets yields competitive performance and demonstrates
promising scaling trends in multimodal machine translation (Multi30K, CoMMuTE),
lexical disambiguation (CoMMuTE), and image captioning (Multi30K, XM3600, COCO
Karpathy).
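The abstract describes Florenz as a coupling of the pre-trained Florence-2 VLM with the Gemma-2 language model. A minimal PyTorch sketch of one common way to bridge a vision encoder and a language-model decoder is given below; the projection design, token handling, and dimensions (e.g. 2304 for a Gemma-2-2B-sized decoder) are illustrative assumptions, not the paper's actual architecture.

```python
# Hedged sketch: bridging a pre-trained vision encoder and a pre-trained
# LM decoder with a learned linear projection. All dimensions and module
# wiring here are illustrative assumptions, not Florenz's actual design.
import torch
import torch.nn as nn

class VisionToTextBridge(nn.Module):
    def __init__(self, vision_encoder: nn.Module, text_decoder: nn.Module,
                 vision_dim: int = 1024, text_dim: int = 2304):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a Florence-2-style encoder
        self.text_decoder = text_decoder      # e.g. a Gemma-2-style decoder
        # Projection mapping visual tokens into the decoder's embedding space,
        # a standard bridge in encoder-decoder VLMs.
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        vis_tokens = self.vision_encoder(images)   # (B, T_v, vision_dim)
        vis_embeds = self.proj(vis_tokens)         # (B, T_v, text_dim)
        # Prepend projected visual tokens so the decoder attends over
        # both modalities in one sequence.
        fused = torch.cat([vis_embeds, text_embeds], dim=1)
        return self.text_decoder(fused)

if __name__ == "__main__":
    # Shape-only smoke test with a stand-in encoder and identity decoder.
    class DummyEncoder(nn.Module):
        def forward(self, x):
            return torch.randn(x.shape[0], 8, 1024)  # 8 fake visual tokens

    model = VisionToTextBridge(DummyEncoder(), nn.Identity())
    out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 5, 2304))
    print(out.shape)  # torch.Size([2, 13, 2304])
```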
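The paper's central claim is that performance on unseen task-language pairs itself follows a scaling law in model size and seen training samples. The sketch below only illustrates the standard workflow of fitting a saturating power law to (model size, loss) points; the functional form, initial guesses, and all data values are placeholder assumptions, not results reported in the paper.

```python
# Hedged sketch: fitting a saturating power law L(N) = a * N^(-alpha) + c
# to validation loss across model sizes. The data points are placeholders,
# not the paper's measurements.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, c):
    """Saturating power law, the usual form for scaling-law fits."""
    return a * n ** (-alpha) + c

# Hypothetical (parameters in billions, validation loss) pairs spanning
# the 0.4B-11.2B range mentioned in the abstract.
n_params = np.array([0.4, 1.0, 3.5, 11.2])
val_loss = np.array([2.10, 1.85, 1.62, 1.48])

(a, alpha, c), _ = curve_fit(power_law, n_params, val_loss, p0=(1.0, 0.5, 1.0))
print(f"fitted exponent alpha = {alpha:.3f}")

# Extrapolate the fitted curve to a larger, hypothetical budget.
print(f"predicted loss at 27B params: {power_law(27.0, a, alpha, c):.3f}")
```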