FlowTok: Flowing Seamlessly Across Text and Image Tokens
March 13, 2025
Authors: Ju He, Qihang Yu, Qihao Liu, Liang-Chieh Chen
cs.AI
Abstract
Bridging different modalities lies at the heart of cross-modality generation.
While conventional approaches treat the text modality as a conditioning signal
that gradually guides the denoising process from Gaussian noise to the target
image modality, we explore a much simpler paradigm: directly evolving between
text and image modalities through flow matching. This requires projecting both
modalities into a shared latent space, which poses a significant challenge due
to their inherently different representations: text is highly semantic and
encoded as 1D tokens, whereas images are spatially redundant and represented as
2D latent embeddings. To address this, we introduce FlowTok, a minimal
framework that seamlessly flows across text and images by encoding images into
a compact 1D token representation. Compared to prior methods, this design
reduces the latent space size by 3.3x at an image resolution of 256,
eliminating the need for complex conditioning mechanisms or noise scheduling.
Moreover, FlowTok naturally extends to image-to-text generation under the same
formulation. With its streamlined architecture centered around compact 1D
tokens, FlowTok is highly memory-efficient, requires significantly fewer
training resources, and achieves much faster sampling speeds, all while
delivering performance comparable to state-of-the-art models. Code will be
available at https://github.com/bytedance/1d-tokenizer.
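
The idea of flowing directly between text and image tokens can be illustrated with a minimal flow matching sketch. It is not the FlowTok implementation: it assumes both modalities have already been encoded into token arrays of the same shape, uses a straight-line path between them, and replaces the learned network with a hypothetical oracle_velocity function so the example runs on its own. The token count and dimension are arbitrary toy values.

```python
import numpy as np


def interpolate(z_text, z_image, t):
    """Point on the straight path from text tokens (t=0) to image tokens (t=1)."""
    return (1.0 - t) * z_text + t * z_image


def target_velocity(z_text, z_image):
    """Velocity of the straight path; the regression target in this sketch."""
    return z_image - z_text


def flow_matching_loss(predict_velocity, z_text, z_image, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(z_text, z_image, t)
    v_pred = predict_velocity(x_t, t)
    return float(np.mean((v_pred - target_velocity(z_text, z_image)) ** 2))


def euler_sample(predict_velocity, z_start, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = z_start.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * predict_velocity(x, i * dt)
    return x


# Toy data: hypothetical shapes, chosen only for illustration.
rng = np.random.default_rng(0)
num_tokens, dim = 8, 4
z_text = rng.normal(size=(num_tokens, dim))   # stand-in for encoded text tokens
z_image = rng.normal(size=(num_tokens, dim))  # stand-in for 1D image tokens


def oracle_velocity(x_t, t):
    """Stand-in for a trained network: returns the exact straight-path velocity."""
    return z_image - z_text


loss = flow_matching_loss(oracle_velocity, z_text, z_image, t=0.3)
z_generated = euler_sample(oracle_velocity, z_text, num_steps=10)
```

In this sketch, sampling starts from the text tokens rather than from Gaussian noise, so no separate conditioning input is passed to the velocity function. Swapping the two endpoints gives the image-to-text direction under the same formulation.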