CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation
January 15, 2026
Authors: Chengzhuo Tong, Mingkun Chang, Shenglong Zhang, Yuran Wang, Cheng Liang, Zhizheng Zhao, Ruichuan An, Bohan Zeng, Yang Shi, Yifan Dai, Ziming Zhao, Guanbin Li, Pengfei Wan, Yuanxing Zhang, Wentao Zhang
cs.AI
Abstract
Recent video generation models have revealed the emergence of Chain-of-Frame (CoF) reasoning, enabling frame-by-frame visual inference. With this capability, video models have been successfully applied to various visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored, because the T2I generation process lacks a clearly defined visual reasoning starting point and interpretable intermediate states. To bridge this gap, we propose CoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps and the final frame is taken as the output. To establish such an explicit generation process, we curate CoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to aesthetics. To further improve quality and avoid motion artifacts, we encode each frame independently. Experiments show that CoF-T2I significantly outperforms the base video model and achieves competitive performance on challenging benchmarks, reaching 0.86 on GenEval and 7.468 on Imagine-Bench. These results indicate the substantial promise of video models for advancing high-quality text-to-image generation.
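The two structural ideas in the abstract, per-frame independent encoding and taking the final frame as the output, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `Frame` type, and `toy_encoder` are all hypothetical stand-ins, and the "frames" are tiny lists of numbers rather than real images.

```python
from typing import Callable, List, Sequence, Tuple

# Toy grayscale image: rows of pixel values (an assumption for illustration).
Frame = List[List[float]]


def encode_frames_independently(
    frames: Sequence[Frame],
    encode_image: Callable[[Frame], List[float]],
) -> List[List[float]]:
    """Encode every frame on its own, so no latent mixes information across time."""
    return [encode_image(frame) for frame in frames]


def split_trajectory(frames: Sequence[Frame]) -> Tuple[List[Frame], Frame]:
    """Return (intermediate reasoning frames, final output frame)."""
    if not frames:
        raise ValueError("a CoF trajectory needs at least one frame")
    return list(frames[:-1]), frames[-1]


def toy_encoder(frame: Frame) -> List[float]:
    """Hypothetical stand-in for an image encoder: the mean of each row."""
    return [sum(row) / len(row) for row in frame]


# Toy trajectory of three 2x2 frames, standing in for draft -> refined -> final.
trajectory = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.5, 0.5], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
latents = encode_frames_independently(trajectory, toy_encoder)
steps, output = split_trajectory(trajectory)
```

Here `steps` holds the intermediate frames that play the role of explicit reasoning steps, and `output` is the last frame. Because each latent depends on a single frame only, the sketch mirrors the stated goal of keeping temporal mixing out of the encoding, under the assumptions above.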