LLM-I: LLMs are Naturally Interleaved Multimodal Creators

Abstract

We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the "one-tool" bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project Page: https://github.com/ByteDance-BandAI/LLM-I.
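To make the tool-use framing concrete, the following is a minimal sketch of how an agent's text output with embedded tool calls might be turned into an interleaved image-text result. It is not the authors' implementation: the tag syntax, the tool names, and the stub functions are all hypothetical, and each stub returns a placeholder string where a real tool would return an image.

```python
import re
from typing import Callable, Dict

# Hypothetical tag format (an assumption, not from the paper):
#   <tool name="search">query text</tool>
TOOL_TAG = re.compile(r'<tool name="(\w+)">(.*?)</tool>', re.DOTALL)


# Stub tools: each returns a placeholder string instead of a real image.
def search_image(arg: str) -> str:
    return f"[image via search: {arg}]"


def generate_image(arg: str) -> str:
    return f"[image via diffusion: {arg}]"


def run_code(arg: str) -> str:
    return f"[image via code: {arg}]"


def edit_image(arg: str) -> str:
    return f"[image via edit: {arg}]"


# Hypothetical registry mirroring the four tool types named in the abstract.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": search_image,
    "diffusion": generate_image,
    "code": run_code,
    "edit": edit_image,
}


def render_interleaved(agent_output: str) -> str:
    """Replace each tool-call tag with the output of the matching tool."""

    def dispatch(match: re.Match) -> str:
        name, arg = match.group(1), match.group(2).strip()
        tool = TOOLS.get(name)
        if tool is None:
            # Unknown tool names are kept visible rather than silently dropped.
            return f"[unknown tool: {name}]"
        return tool(arg)

    return TOOL_TAG.sub(dispatch, agent_output)
```

In this sketch the agent only writes text. Choosing between retrieval, generation, code, and editing reduces to choosing a tag name, which is the kind of decision the abstract says is trained with reinforcement learning.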
