
Aya Vision: Advancing the Frontier of Multilingual Multimodality

May 13, 2025
Authors: Saurabh Dash, Yiyang Nan, John Dang, Arash Ahmadian, Shivalika Singh, Madeline Smith, Bharat Venkitesh, Vlad Shmyhlo, Viraat Aryabumi, Walter Beller-Morales, Jeremy Pekmez, Jason Ozuzu, Pierre Richemond, Acyr Locatelli, Nick Frosst, Phil Blunsom, Aidan Gomez, Ivan Zhang, Marzieh Fadaee, Manoj Govindassamy, Sudip Roy, Matthias Gallé, Beyza Ermis, Ahmet Üstün, Sara Hooker
cs.AI

Abstract

Building multimodal language models is fundamentally challenging: it requires aligning vision and language modalities, curating high-quality instruction data, and avoiding the degradation of existing text-only capabilities once vision is introduced. These difficulties are further magnified in the multilingual setting, where the need for multimodal data in different languages exacerbates existing data scarcity, machine translation often distorts meaning, and catastrophic forgetting is more pronounced. To address the aforementioned challenges, we introduce novel techniques spanning both data and modeling. First, we develop a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data, enabling Aya Vision models to produce natural, human-preferred responses to multimodal inputs across many languages. Complementing this, we propose a cross-modal model merging technique that mitigates catastrophic forgetting, effectively preserving text-only capabilities while simultaneously enhancing multimodal generative performance. Aya-Vision-8B achieves best-in-class performance compared to strong multimodal models such as Qwen-2.5-VL-7B, Pixtral-12B, and even much larger Llama-3.2-90B-Vision. We further scale this approach with Aya-Vision-32B, which outperforms models more than twice its size, such as Molmo-72B and LLaMA-3.2-90B-Vision. Our work advances multilingual progress on the multi-modal frontier, and provides insights into techniques that effectively bend the need for compute while delivering extremely high performance.
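The cross-modal model merging mentioned in the abstract can be pictured as combining the weights of a text-only language model with those of its vision-finetuned counterpart. The sketch below is an illustration only, not the paper's published recipe: it assumes generic model merging via linear weight interpolation, and the checkpoint names ("org/text-only-llm", "org/vision-finetuned-llm") and the mixing coefficient alpha are hypothetical placeholders.

```python
# Illustrative sketch of cross-modal model merging via linear weight
# interpolation. This is NOT the paper's exact procedure; checkpoint names
# and the alpha value are hypothetical stand-ins.
import torch
from transformers import AutoModelForCausalLM


def merge_state_dicts(text_only_sd, multimodal_sd, alpha=0.5):
    """Interpolate parameters shared by both checkpoints.

    alpha = 0.0 keeps the text-only weights, alpha = 1.0 keeps the
    multimodal weights; intermediate values trade off text-only skill
    retention against multimodal performance.
    """
    merged = {}
    for name, mm_param in multimodal_sd.items():
        if name in text_only_sd and text_only_sd[name].shape == mm_param.shape:
            merged[name] = (1 - alpha) * text_only_sd[name] + alpha * mm_param
        else:
            # Vision-specific modules (e.g. an image encoder or connector)
            # exist only in the multimodal model and are copied unchanged.
            merged[name] = mm_param
    return merged


# Hypothetical checkpoint names for illustration.
text_model = AutoModelForCausalLM.from_pretrained("org/text-only-llm")
mm_model = AutoModelForCausalLM.from_pretrained("org/vision-finetuned-llm")

merged_sd = merge_state_dicts(
    text_model.state_dict(), mm_model.state_dict(), alpha=0.6
)
mm_model.load_state_dict(merged_sd)
mm_model.save_pretrained("merged-model")
```

In practice the interpolation coefficient would be tuned on held-out text-only and multimodal benchmarks so that neither capability degrades; the abstract reports that this kind of merging preserves text-only performance while improving multimodal generation.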
