FlowTok: Flowing Seamlessly Across Text and Image Tokens

Abstract

Bridging different modalities lies at the heart of cross-modality generation. Conventional approaches treat the text modality as a conditioning signal that gradually guides the denoising process from Gaussian noise to the target image modality. We explore a much simpler paradigm: directly evolving between the text and image modalities through flow matching. This requires projecting both modalities into a shared latent space, which poses a significant challenge because their representations are inherently different: text is highly semantic and encoded as 1D tokens, whereas images are spatially redundant and represented as 2D latent embeddings. To address this, we introduce FlowTok, a minimal framework that flows seamlessly across text and images by encoding images into a compact 1D token representation. Compared to prior methods, this design reduces the latent space size by 3.3x at an image resolution of 256 and eliminates the need for complex conditioning mechanisms or noise scheduling. Moreover, FlowTok naturally extends to image-to-text generation under the same formulation. With its streamlined architecture centered around compact 1D tokens, FlowTok is highly memory-efficient, requires significantly fewer training resources, and achieves much faster sampling speeds, all while delivering performance comparable to state-of-the-art models. Code will be available at https://github.com/bytedance/1d-tokenizer.
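The abstract does not spell out the flow matching formulation, so the following is only a minimal sketch of the general idea, assuming the common straight-path (linear interpolation) variant of flow matching. The function names, the token shapes, and the oracle velocity field are hypothetical stand-ins chosen for illustration and are not taken from the FlowTok code; in practice the velocity field would be a trained network and the latents would come from text and image encoders.

```python
import numpy as np


def interpolate(z_text, z_image, t):
    """Point on the straight path from text latents (t=0) to image latents (t=1)."""
    return (1.0 - t) * z_text + t * z_image


def target_velocity(z_text, z_image):
    """Velocity of the straight path; constant in t, used as the regression target."""
    return z_image - z_text


def flow_matching_loss(velocity_fn, z_text, z_image, t):
    """Mean squared error between predicted and target velocity at time t."""
    z_t = interpolate(z_text, z_image, t)
    pred = velocity_fn(z_t, t)
    return float(np.mean((pred - target_velocity(z_text, z_image)) ** 2))


def euler_sample(velocity_fn, z_text, num_steps=10):
    """Integrate dz/dt = v(z, t) from t=0 (text) to t=1 (image) with Euler steps."""
    z = z_text.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        z = z + dt * velocity_fn(z, i * dt)
    return z


# Hypothetical 1D token latents: 8 tokens with 4 channels each (arbitrary sizes).
rng = np.random.default_rng(0)
z_text = rng.normal(size=(8, 4))
z_image = rng.normal(size=(8, 4))


def oracle(z, t):
    """Stand-in for a trained network: returns the exact straight-path velocity."""
    return z_image - z_text


z_out = euler_sample(oracle, z_text, num_steps=10)
loss = flow_matching_loss(oracle, z_text, z_image, t=0.3)
```

Because both endpoints live in the same latent shape, the source of the flow is the text latent itself rather than Gaussian noise, which is the point the abstract makes about dropping separate conditioning mechanisms. Swapping the roles of `z_text` and `z_image` gives the image-to-text direction under the same sketch.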
