NeoBabel: A Multilingual Open Tower for Visual Generation

July 8, 2025
Authors: Mohammad Mahdi Derakhshani, Dheeraj Varghese, Marzieh Fadaee, Cees G. M. Snoek
cs.AI

Abstract

Text-to-image generation advancements have been predominantly English-centric, creating barriers for non-English speakers and perpetuating digital inequities. While existing systems rely on translation pipelines, these introduce semantic drift, computational overhead, and cultural misalignment. We introduce NeoBabel, a novel multilingual image generation framework that sets a new Pareto frontier in performance, efficiency and inclusivity, supporting six languages: English, Chinese, Dutch, French, Hindi, and Persian. The model is trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. To evaluate its capabilities, we expand two English-only benchmarks to multilingual equivalents: m-GenEval and m-DPG. NeoBabel achieves state-of-the-art multilingual performance while retaining strong English capability, scoring 0.75 on m-GenEval and 0.68 on m-DPG. Notably, it performs on par with leading models on English tasks while outperforming them by +0.11 and +0.09 on multilingual benchmarks, even though these models are built on multilingual base LLMs. This demonstrates the effectiveness of our targeted alignment training for preserving and extending crosslingual generalization. We further introduce two new metrics to rigorously assess multilingual alignment and robustness to code-mixed prompts. Notably, NeoBabel matches or exceeds English-only models while being 2-4x smaller. We release an open toolkit, including all code, model checkpoints, a curated dataset of 124M multilingual text-image pairs, and standardized multilingual evaluation protocols, to advance inclusive AI research. Our work demonstrates that multilingual capability is not a trade-off but a catalyst for improved robustness, efficiency, and cultural fidelity in generative AI.