LongCat-Image Technical Report
December 8, 2025
Authors: Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, Xunliang Cai, Yayong Guan, Jie Hu
cs.AI
Abstract
We introduce LongCat-Image, a pioneering open-source, bilingual (Chinese-English) foundation model for image generation, designed to address core challenges that current leading models face in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility. 1) We achieve this through rigorous data curation across the pre-training, mid-training, and supervised fine-tuning (SFT) stages, complemented by the coordinated use of carefully designed reward models during the reinforcement learning (RL) phase. This strategy establishes the model as a new state of the art (SOTA), delivering superior text rendering and remarkable photorealism while significantly enhancing aesthetic quality. 2) Notably, it sets a new industry standard for Chinese character rendering. It supports even complex and rare characters, outperforming both major open-source and commercial solutions in character coverage while also achieving higher accuracy. 3) The model achieves remarkable efficiency through its compact design. Its core diffusion model has only 6B parameters, far fewer than the Mixture-of-Experts (MoE) architectures of nearly 20B parameters or more that are common in the field. This ensures minimal VRAM usage and rapid inference, significantly reducing deployment costs. Beyond generation, LongCat-Image also excels at image editing, achieving SOTA results on standard benchmarks with better editing consistency than other open-source works. 4) To fully empower the community, we have established the most comprehensive open-source ecosystem to date. We are releasing not only multiple model versions for text-to-image generation and image editing, including checkpoints from the mid-training and post-training stages, but also the complete training toolchain. We believe that the openness of LongCat-Image will provide robust support for developers and researchers and push the frontiers of visual content creation.
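The efficiency claim in point 3 can be made concrete with a rough weight-memory estimate. The sketch below is a back-of-envelope calculation under stated assumptions, not a measurement from this report. It assumes 16-bit (bf16) weights at 2 bytes per parameter, counts only the weights of the model being sized, and ignores activations, any separate components such as a text encoder or VAE, and framework overhead. The function name `weight_memory_gib` is hypothetical. Because an MoE model generally has to keep all of its expert weights resident even though only some are active at a time, its total parameter count is what matters for this estimate.

```python
# Hypothetical back-of-envelope estimate of the memory needed just to hold weights.
# Assumption: weights stored in bf16 (2 bytes per parameter) unless stated otherwise.
# Ignores activations, separate text encoder / VAE weights, and framework overhead.

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "bf16") -> float:
    """Return the weight footprint in GiB for a model with num_params parameters."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # Compare a 6B dense core with a model of roughly 20B total parameters.
    for name, params in [("6B diffusion core", 6e9), ("~20B model", 20e9)]:
        print(f"{name}: {weight_memory_gib(params):.1f} GiB of weights in bf16")
```

Under these assumptions, the 6B core needs about 11 GiB for its weights, versus about 37 GiB for a 20B model, before any activation memory is counted.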