LLM-I: LLMs are Naturally Interleaved Multimodal Creators

September 17, 2025
Authors: Zirun Guo, Feng Zhang, Kai Jia, Tao Jin
cs.AI

Abstract

We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the "one-tool" bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project Page: https://github.com/ByteDance-BandAI/LLM-I.
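As a rough illustration of the abstract's tool-use framing (a minimal sketch under stated assumptions, not the authors' implementation), the Python below shows how a central agent could route interleaved image slots to specialized tools such as online search, diffusion generation, code execution, and image editing. All function names, the tool registry, and the hard-coded plan are hypothetical stand-ins; in LLM-I the plan would come from the RL-trained LLM/MLLM agent itself.

```python
# Hypothetical tool stubs; a real system would call an image-search API,
# a diffusion model, a code sandbox, and an image editor.

def search_image(query: str) -> str:
    # Stand-in for online image search (factual grounding).
    return f"<retrieved photo: {query}>"

def generate_image(prompt: str) -> str:
    # Stand-in for diffusion-based synthesis.
    return f"<generated image: {prompt}>"

def run_code(task: str) -> str:
    # Stand-in for code execution that renders a figure (programmatic precision).
    return f"<rendered chart: {task}>"

def edit_image(instruction: str) -> str:
    # Stand-in for instruction-based image editing.
    return f"<edited image: {instruction}>"

TOOLS = {
    "search": search_image,
    "generate": generate_image,
    "code": run_code,
    "edit": edit_image,
}

def plan(user_prompt: str) -> list[tuple[str, str]]:
    # In LLM-I, the agent decides when to emit text and which tool each
    # image slot needs; here the decision is hard-coded for illustration.
    return [
        ("text", f"An illustrated answer to: {user_prompt}"),
        ("search", "Eiffel Tower at night"),
        ("text", "A stylized rendering for contrast:"),
        ("generate", "watercolor painting of the Eiffel Tower"),
        ("text", "Annual visitor counts:"),
        ("code", "bar chart of visitors per year"),
    ]

def render(user_prompt: str) -> list[str]:
    # Interleave text segments with the outputs of the dispatched tools.
    return [
        arg if kind == "text" else TOOLS[kind](arg)
        for kind, arg in plan(user_prompt)
    ]

if __name__ == "__main__":
    for segment in render("Tell me about the Eiffel Tower"):
        print(segment)
```

The design point this sketch captures is the one the abstract argues for: instead of a single unified model that can only synthesize images, the agent picks the tool whose failure mode matters least for the slot at hand (search when factual grounding is needed, code execution when precision is needed).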