LLM-I: LLMs are Naturally Interleaved Multimodal Creators

September 17, 2025
Authors: Zirun Guo, Feng Zhang, Kai Jia, Tao Jin
cs.AI

Abstract

We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the "one-tool" bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project Page: https://github.com/ByteDance-BandAI/LLM-I.
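To make the tool-use framing concrete, here is a minimal sketch of how an agent's interleaved output might be parsed into calls to the four visual tools, and how a hybrid reward could blend rule-based checks with judge scores. The tag format, tool names, and reward weights are illustrative assumptions, not the paper's actual protocol; see the project page for the real implementation.

```python
import re
from dataclasses import dataclass
from typing import List

# The four visual tools named in the abstract.
TOOLS = ("search", "diffusion", "code", "edit")

@dataclass
class ToolCall:
    tool: str      # which visual tool to invoke
    argument: str  # search query, diffusion prompt, program, or edit instruction

# Hypothetical tag format the agent might emit inside its interleaved text.
TAG = re.compile(r"<tool name=(\w+)>(.*?)</tool>", re.DOTALL)

def parse_tool_calls(generation: str) -> List[ToolCall]:
    """Extract tagged tool calls from the agent's interleaved output."""
    return [ToolCall(t, a.strip()) for t, a in TAG.findall(generation) if t in TOOLS]

def hybrid_reward(rule_score: float, llm_judge: float, mllm_judge: float) -> float:
    """Blend rule-based logic (e.g. well-formed tags, valid tool choice) with
    scores from LLM and MLLM evaluators; the weights here are assumed."""
    return 0.2 * rule_score + 0.4 * llm_judge + 0.4 * mllm_judge

# Usage: parse a sample generation and inspect the planned tool calls.
sample = (
    "The Eiffel Tower at night. <tool name=search>Eiffel Tower night photo</tool> "
    "A chart of visitor counts: <tool name=code>plt.bar(years, visits)</tool>"
)
for call in parse_tool_calls(sample):
    print(call.tool, "->", call.argument)
```

The point of the dispatch step is that factual content routes to search, plots and diagrams route to code execution, and only open-ended imagery falls back to diffusion, which is what lets the framework escape the "one-tool" bottleneck.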