LLM-I: LLMs are Naturally Interleaved Multimodal Creators
September 17, 2025
Authors: Zirun Guo, Feng Zhang, Kai Jia, Tao Jin
cs.AI
Abstract
We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that
reframes interleaved image-text generation as a tool-use problem. LLM-I is
designed to overcome the "one-tool" bottleneck of current unified models, which
are limited to synthetic imagery and struggle with tasks requiring factual
grounding or programmatic precision. Our framework empowers a central LLM or
MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual
tools, including online image search, diffusion-based generation, code
execution, and image editing. The agent is trained to select and apply these
tools proficiently via a Reinforcement Learning (RL) framework that features a
hybrid reward system combining rule-based logic with judgments from LLM and
MLLM evaluators. Trained on a diverse new dataset using four different model
backbones, LLM-I demonstrates state-of-the-art performance, outperforming
existing methods by a large margin across four benchmarks. We also introduce a
novel test-time scaling strategy that provides further performance gains.
Project Page: https://github.com/ByteDance-BandAI/LLM-I.
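To make the tool-use framing concrete, here is a minimal sketch of how a central agent's plan might be dispatched across the four visual tools the abstract names. All identifiers below (ToolCall, search_image, generate_image, run_code, edit_image, render_interleaved) are hypothetical stand-ins for illustration, not LLM-I's actual interfaces.

```python
# Hypothetical sketch of the tool-dispatch loop; names are illustrative.
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

@dataclass
class ToolCall:
    tool: str      # which visual tool the agent selected
    argument: str  # e.g. a search query, a diffusion prompt, or a code snippet

def search_image(query: str) -> str:
    """Placeholder: retrieve a real photo when factual grounding is needed."""
    return f"<web image: {query}>"

def generate_image(prompt: str) -> str:
    """Placeholder: synthesize an image with a diffusion model."""
    return f"<diffusion image: {prompt}>"

def run_code(snippet: str) -> str:
    """Placeholder: execute plotting code when programmatic precision is needed."""
    return f"<figure from code: {snippet}>"

def edit_image(instruction: str) -> str:
    """Placeholder: modify an existing image according to the instruction."""
    return f"<edited image: {instruction}>"

TOOLS: Dict[str, Callable[[str], str]] = {
    "search": search_image,
    "diffusion": generate_image,
    "code": run_code,
    "edit": edit_image,
}

def render_interleaved(plan: List[Union[str, ToolCall]]) -> List[str]:
    """Resolve a mixed text/tool-call plan into the final interleaved output."""
    output = []
    for segment in plan:
        if isinstance(segment, ToolCall):
            output.append(TOOLS[segment.tool](segment.argument))
        else:
            output.append(segment)  # plain text passes through unchanged
    return output

# Example: factual content routes to search; a chart would route to "code".
print(render_interleaved(
    ["The Eiffel Tower at night:", ToolCall("search", "Eiffel Tower at night")]
))
```

The point of routing through multiple tools rather than a single synthesizer is that content needing factual grounding can go to search while content needing programmatic precision can go to code execution.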
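The hybrid reward the abstract describes combines rule-based logic with judgments from LLM and MLLM evaluators. A minimal sketch of how such components might be combined follows; the field names, weights, and judge interfaces are assumptions for illustration, not LLM-I's actual reward definition.

```python
# Hedged sketch of a hybrid RL reward: rule-based checks plus judge scores.
# Weights, field names, and judge signatures are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rollout:
    text: str                    # the interleaved article's text
    images: List[str]            # resolved images (paths or handles)
    tool_calls_parse_ok: bool    # every tool call was well-formed
    all_images_resolved: bool    # every image slot produced an image

def hybrid_reward(
    rollout: Rollout,
    llm_judge: Callable[[str], float],              # text quality in [0, 1]
    mllm_judge: Callable[[str, List[str]], float],  # text-image alignment in [0, 1]
    w_rule: float = 0.3,
    w_text: float = 0.35,
    w_vision: float = 0.35,
) -> float:
    # Rule-based component: cheap, deterministic well-formedness checks.
    rule_score = float(rollout.tool_calls_parse_ok and rollout.all_images_resolved)
    # Model-based components: judges grade coherence and visual relevance.
    text_score = llm_judge(rollout.text)
    vision_score = mllm_judge(rollout.text, rollout.images)
    return w_rule * rule_score + w_text * text_score + w_vision * vision_score
```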
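The abstract mentions a test-time scaling strategy without detailing it. Purely as a generic illustration of the idea, and not LLM-I's actual method, one common pattern is best-of-N sampling reranked by the same scorer used during training:

```python
# Generic best-of-N sketch, shown only as one common test-time scaling
# pattern; the abstract does not specify LLM-I's actual strategy.
from typing import Callable, TypeVar

T = TypeVar("T")

def best_of_n(generate: Callable[[], T], score: Callable[[T], float], n: int = 4) -> T:
    """Sample n candidate outputs and keep the one the scorer rates highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```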