Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models
March 12, 2025
Authors: Julian Spravil, Sebastian Houben, Sven Behnke
cs.AI
Abstract
Cross-lingual transfer enables vision-language models (VLMs) to perform
vision tasks in various languages with training data only in one language.
Current approaches rely on large pre-trained multilingual language models.
However, they face the curse of multilinguality, sacrificing downstream task
performance for multilingual capabilities, struggling with lexical ambiguities,
and falling behind recent advances. In this work, we study the scaling laws of
systematic generalization with monolingual VLMs for multilingual tasks,
focusing on the impact of model size and seen training samples. We propose
Florenz, a monolingual encoder-decoder VLM with 0.4B to 11.2B parameters
combining the pre-trained VLM Florence-2 and the large language model Gemma-2.
Florenz is trained with varying compute budgets on a synthetic dataset that
features intentionally incomplete language coverage for image captioning, thus
testing generalization from the fully covered translation task. We show that
not only does indirectly learning unseen task-language pairs adhere to a
scaling law, but also that with our data generation pipeline and the proposed
Florenz model family, image captioning abilities can emerge in a specific
language even when only data for the translation task is available. Fine-tuning
on a mix of downstream datasets yields competitive performance and demonstrates
promising scaling trends in multimodal machine translation (Multi30K, CoMMuTE),
lexical disambiguation (CoMMuTE), and image captioning (Multi30K, XM3600, COCO
Karpathy).
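The abstract does not state the functional form of the scaling law being fit. As a minimal sketch, assuming the standard power-law parameterization used in neural scaling-law studies (all symbols below are illustrative, not taken from the paper), the loss on an unseen task-language pair would be modeled as

    L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

where N is the number of model parameters, D the number of seen training samples, and E, A, B, \alpha, \beta are fitted constants; systematic generalization then appears as a predictable decrease of L with scale.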
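To make the encoder-decoder combination concrete, below is a minimal structural sketch in PyTorch of how a pre-trained vision encoder (Florence-2's role) can be coupled to a decoder-style language model (Gemma-2's role) by projecting visual tokens into the decoder's embedding space and prefixing them to the text sequence. Every module, dimension, and name in this sketch is a stand-in assumption, not the paper's implementation.

    import torch
    import torch.nn as nn

    class PatchEmbed(nn.Module):
        # Toy stand-in for the role of the Florence-2 vision encoder:
        # a 16x16 patch embedding producing a sequence of visual tokens.
        def __init__(self, dim=256):
            super().__init__()
            self.conv = nn.Conv2d(3, dim, kernel_size=16, stride=16)

        def forward(self, x):
            return self.conv(x).flatten(2).transpose(1, 2)  # (B, T_v, dim)

    class EncoderDecoderVLM(nn.Module):
        # Visual tokens are projected into the decoder's embedding space and
        # prefixed to the text embeddings; the fused sequence is then
        # processed by the language-model backbone.
        def __init__(self, vision_encoder, decoder, vision_dim, decoder_dim):
            super().__init__()
            self.vision_encoder = vision_encoder
            self.proj = nn.Linear(vision_dim, decoder_dim)
            self.decoder = decoder  # stand-in for a Gemma-2-style backbone

        def forward(self, pixel_values, text_embeds):
            vis = self.proj(self.vision_encoder(pixel_values))  # (B, T_v, D)
            fused = torch.cat([vis, text_embeds], dim=1)        # visual prefix
            return self.decoder(fused)

    # Toy usage with arbitrary dimensions (toy decoder omits the causal mask
    # a real language model would use):
    decoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=2,
    )
    model = EncoderDecoderVLM(PatchEmbed(256), decoder,
                              vision_dim=256, decoder_dim=512)
    images = torch.randn(2, 3, 224, 224)
    text = torch.randn(2, 10, 512)  # stand-in for token embeddings
    out = model(images, text)       # (2, 196 + 10, 512)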