STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

Abstract

Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators and thereby inherit a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are themselves autoregressive Transformers: they share the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs, which makes them the most natural paradigm for truly unified multimodal generation. We present STARFlow2, built on the Pretzel architecture, which vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, with both streams operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation in which both text and visual outputs enter the KV-cache directly, without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.
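
To make the structural claim concrete, the following is a minimal sketch of a generic affine autoregressive flow, not the STARFlow2 or TarFlow implementation. It assumes a toy `conditioner` function (a hypothetical stand-in for a causal Transformer) that maps the prefix x_<i to a shift and a log-scale. The point it illustrates is that each latent depends only on the current element and the elements before it, and that inversion proceeds left to right, one position at a time, in the same way an LLM decodes tokens.

```python
import math


def conditioner(prefix):
    """Toy stand-in (assumption) for a causal Transformer.

    Maps the prefix x_<i to a shift `mu` and a log-scale `alpha`.
    Any function of the prefix alone keeps the flow invertible.
    """
    s = sum(prefix)
    mu = 0.5 * s
    alpha = math.tanh(0.1 * s)  # bounded log-scale for numerical stability
    return mu, alpha


def forward(x):
    """Data -> latent, returning the latents and log|det Jacobian|.

    z_i depends only on x_i and x_<i, so a real model with a causal
    mask could compute every position in a single parallel pass.
    """
    z, log_det = [], 0.0
    for i, xi in enumerate(x):
        mu, alpha = conditioner(x[:i])
        z.append((xi - mu) * math.exp(-alpha))
        log_det -= alpha  # triangular Jacobian: dz_i/dx_i = exp(-alpha_i)
    return z, log_det


def inverse(z):
    """Latent -> data, sequentially from left to right."""
    x = []
    for zi in z:
        mu, alpha = conditioner(x)  # uses only already-generated elements
        x.append(zi * math.exp(alpha) + mu)
    return x
```

Calling `forward` on a short list and passing the resulting latents to `inverse` recovers the input up to floating-point error. In a Transformer-based flow, the sequential loop in `inverse` is the step where a KV-cache would avoid recomputing the prefix at every position.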
