FlowTok: Flowing Seamlessly Across Text and Image Tokens

Abstract

Bridging different modalities lies at the heart of cross-modality generation. Conventional approaches treat the text modality as a conditioning signal that gradually guides the denoising process from Gaussian noise to the target image modality. We explore a much simpler paradigm: directly evolving between the text and image modalities through flow matching. This requires projecting both modalities into a shared latent space, which is a significant challenge because their representations are inherently different: text is highly semantic and encoded as 1D tokens, whereas images are spatially redundant and represented as 2D latent embeddings. To address this, we introduce FlowTok, a minimal framework that flows seamlessly between text and images by encoding images into a compact 1D token representation. Compared to prior methods, this design reduces the latent space size by 3.3× at an image resolution of 256 and eliminates the need for complex conditioning mechanisms or noise scheduling. Moreover, FlowTok naturally extends to image-to-text generation under the same formulation. With its streamlined architecture centered on compact 1D tokens, FlowTok is highly memory-efficient, requires significantly fewer training resources, and achieves much faster sampling, all while delivering performance comparable to state-of-the-art models. Code will be available at https://github.com/bytedance/1d-tokenizer.
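The idea of flowing directly from text tokens to image tokens can be illustrated with a small sketch. It is not the FlowTok implementation: it assumes the common straight-line (linear interpolation) form of flow matching, uses random arrays in place of real text and image latents, and replaces the trained network with a hypothetical oracle velocity field. The token count, channel size, and all function names are illustrative assumptions.

```python
import numpy as np

def interpolate(z_text, z_image, t):
    """Point at time t on the straight path from the text latent (t=0) to the image latent (t=1)."""
    return (1.0 - t) * z_text + t * z_image

def target_velocity(z_text, z_image):
    """Velocity of the straight path; constant in t under the linear-path assumption."""
    return z_image - z_text

def flow_matching_loss(predict_velocity, z_text, z_image, t):
    """Mean squared error between the predicted and target velocity at time t."""
    z_t = interpolate(z_text, z_image, t)
    v_pred = predict_velocity(z_t, t)
    return float(np.mean((v_pred - target_velocity(z_text, z_image)) ** 2))

def euler_sample(predict_velocity, z_text, num_steps=10):
    """Integrate the velocity field from t=0 to t=1 with fixed-step Euler, starting at the text latent."""
    z = z_text.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        z = z + dt * predict_velocity(z, i * dt)
    return z

# Hypothetical shapes: 8 tokens with 4 channels each, shared by both modalities.
rng = np.random.default_rng(0)
z_text = rng.normal(size=(8, 4))   # stand-in for projected text tokens
z_image = rng.normal(size=(8, 4))  # stand-in for 1D image tokens

# Oracle velocity field standing in for a trained network (assumption, for illustration only).
def oracle(z_t, t):
    return z_image - z_text

loss = flow_matching_loss(oracle, z_text, z_image, t=0.3)
z_generated = euler_sample(oracle, z_text, num_steps=10)
```

Because both endpoints live in the same token space, the source of the flow is the text latent itself rather than Gaussian noise, so no separate conditioning input appears in the sketch. Swapping the roles of `z_text` and `z_image` gives the image-to-text direction under the same assumptions.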
