NeoBabel: A Multilingual Open Tower for Visual Generation

Abstract

Advances in text-to-image generation have been predominantly English-centric, creating barriers for non-English speakers and perpetuating digital inequities. Existing systems rely on translation pipelines, which introduce semantic drift, computational overhead, and cultural misalignment. We introduce NeoBabel, a novel multilingual image generation framework that sets a new Pareto frontier in performance, efficiency, and inclusivity, supporting six languages: English, Chinese, Dutch, French, Hindi, and Persian. The model is trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. To evaluate its capabilities, we expand two English-only benchmarks to multilingual equivalents: m-GenEval and m-DPG. NeoBabel achieves state-of-the-art multilingual performance while retaining strong English capability, scoring 0.75 on m-GenEval and 0.68 on m-DPG. Notably, it performs on par with leading models on English tasks while outperforming them by +0.11 and +0.09 on multilingual benchmarks, even though these models are built on multilingual base LLMs. This demonstrates the effectiveness of our targeted alignment training for preserving and extending crosslingual generalization. We further introduce two new metrics to rigorously assess multilingual alignment and robustness to code-mixed prompts. NeoBabel also matches or exceeds English-only models while being 2-4x smaller. We release an open toolkit, including all code, model checkpoints, a curated dataset of 124M multilingual text-image pairs, and standardized multilingual evaluation protocols, to advance inclusive AI research. Our work demonstrates that multilingual capability is not a trade-off but a catalyst for improved robustness, efficiency, and cultural fidelity in generative AI.
