Aya Vision: Advancing the Frontier of Multilingual Multimodality

Abstract

Building multimodal language models is fundamentally challenging: it requires aligning vision and language modalities, curating high-quality instruction data, and avoiding the degradation of existing text-only capabilities once vision is introduced. These difficulties are magnified in the multilingual setting, where the need for multimodal data in many languages exacerbates existing data scarcity, machine translation often distorts meaning, and catastrophic forgetting is more pronounced. To address these challenges, we introduce novel techniques spanning both data and modeling. First, we develop a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data, enabling Aya Vision models to produce natural, human-preferred responses to multimodal inputs across many languages. Complementing this, we propose a cross-modal model merging technique that mitigates catastrophic forgetting, preserving text-only capabilities while also improving multimodal generative performance. Aya-Vision-8B achieves best-in-class performance compared to strong multimodal models such as Qwen-2.5-VL-7B, Pixtral-12B, and even the much larger Llama-3.2-90B-Vision. We further scale this approach with Aya-Vision-32B, which outperforms models more than twice its size, such as Molmo-72B and Llama-3.2-90B-Vision. Our work advances multilingual progress on the multimodal frontier and provides insights into techniques that reduce the compute required while still delivering extremely high performance.
