LLM-I: LLMs are Naturally Interleaved Multimodal Creators

Abstract

We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the "one-tool" bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a reinforcement learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project page: https://github.com/ByteDance-BandAI/LLM-I.
