LLM-I: LLMs are Naturally Interleaved Multimodal Creators
September 17, 2025
Authors: Zirun Guo, Feng Zhang, Kai Jia, Tao Jin
cs.AI
Abstract
We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that
reframes interleaved image-text generation as a tool-use problem. LLM-I is
designed to overcome the "one-tool" bottleneck of current unified models, which
are limited to synthetic imagery and struggle with tasks requiring factual
grounding or programmatic precision. Our framework empowers a central LLM or
MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual
tools, including online image search, diffusion-based generation, code
execution, and image editing. The agent is trained to select and apply these
tools proficiently via a Reinforcement Learning (RL) framework that features a
hybrid reward system combining rule-based logic with judgments from LLM and
MLLM evaluators. Trained on a diverse new dataset using four different model
backbones, LLM-I demonstrates state-of-the-art performance, outperforming
existing methods by a large margin across four benchmarks. We also introduce a
novel test-time scaling strategy that provides further performance gains.
Project Page: https://github.com/ByteDance-BandAI/LLM-I.
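To make the tool-use framing concrete, here is a minimal sketch of how a central agent's plan might be dispatched across the four visual tools the abstract names. All identifiers below (ToolCall, search_image, generate_image, run_code, edit_image, render_interleaved) are hypothetical stand-ins for illustration, not LLM-I's actual interfaces.

```python
# Hypothetical sketch of the tool-dispatch loop; names are illustrative.
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

@dataclass
class ToolCall:
    tool: str      # which visual tool the agent selected
    argument: str  # e.g. a search query, a diffusion prompt, or a code snippet

def search_image(query: str) -> str:
    """Placeholder: retrieve a real photo when factual grounding is needed."""
    return f"<web image: {query}>"

def generate_image(prompt: str) -> str:
    """Placeholder: synthesize an image with a diffusion model."""
    return f"<diffusion image: {prompt}>"

def run_code(snippet: str) -> str:
    """Placeholder: execute plotting code when programmatic precision is needed."""
    return f"<figure from code: {snippet}>"

def edit_image(instruction: str) -> str:
    """Placeholder: modify an existing image according to the instruction."""
    return f"<edited image: {instruction}>"

TOOLS: Dict[str, Callable[[str], str]] = {
    "search": search_image,
    "diffusion": generate_image,
    "code": run_code,
    "edit": edit_image,
}

def render_interleaved(plan: List[Union[str, ToolCall]]) -> List[str]:
    """Resolve a mixed text/tool-call plan into the final interleaved output."""
    output = []
    for segment in plan:
        if isinstance(segment, ToolCall):
            output.append(TOOLS[segment.tool](segment.argument))
        else:
            output.append(segment)  # plain text passes through unchanged
    return output

# Example: factual content routes to search; a chart would route to "code".
print(render_interleaved(
    ["The Eiffel Tower at night:", ToolCall("search", "Eiffel Tower at night")]
))
```

The point of routing through multiple tools rather than a single synthesizer is that content needing factual grounding can go to search while content needing programmatic precision can go to code execution.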
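The hybrid reward the abstract describes combines rule-based logic with judgments from LLM and MLLM evaluators. A minimal sketch of how such components might be combined follows; the field names, weights, and judge interfaces are assumptions for illustration, not LLM-I's actual reward definition.

```python
# Hedged sketch of a hybrid RL reward: rule-based checks plus judge scores.
# Weights, field names, and judge signatures are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rollout:
    text: str                    # the interleaved article's text
    images: List[str]            # resolved images (paths or handles)
    tool_calls_parse_ok: bool    # every tool call was well-formed
    all_images_resolved: bool    # every image slot produced an image

def hybrid_reward(
    rollout: Rollout,
    llm_judge: Callable[[str], float],              # text quality in [0, 1]
    mllm_judge: Callable[[str, List[str]], float],  # text-image alignment in [0, 1]
    w_rule: float = 0.3,
    w_text: float = 0.35,
    w_vision: float = 0.35,
) -> float:
    # Rule-based component: cheap, deterministic well-formedness checks.
    rule_score = float(rollout.tool_calls_parse_ok and rollout.all_images_resolved)
    # Model-based components: judges grade coherence and visual relevance.
    text_score = llm_judge(rollout.text)
    vision_score = mllm_judge(rollout.text, rollout.images)
    return w_rule * rule_score + w_text * text_score + w_vision * vision_score
```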
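The abstract mentions a test-time scaling strategy without detailing it. Purely as a generic illustration of the idea, and not LLM-I's actual method, one common pattern is best-of-N sampling reranked by the same scorer used during training:

```python
# Generic best-of-N sketch, shown only as one common test-time scaling
# pattern; the abstract does not specify LLM-I's actual strategy.
from typing import Callable, TypeVar

T = TypeVar("T")

def best_of_n(generate: Callable[[], T], score: Callable[[T], float], n: int = 4) -> T:
    """Sample n candidate outputs and keep the one the scorer rates highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```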