CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation

Abstract

Recent video generation models exhibit emergent Chain-of-Frame (CoF) reasoning, which enables frame-by-frame visual inference. With this capability, video models have been applied successfully to a range of visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored, because the T2I generation process lacks both a clearly defined starting point for visual reasoning and interpretable intermediate states. To bridge this gap, we propose CoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement: intermediate frames act as explicit reasoning steps, and the final frame is taken as the output. To establish such an explicit generation process, we curate CoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to aesthetics. To further improve quality and avoid motion artifacts, we encode each frame independently. Experiments show that CoF-T2I significantly outperforms the base video model and achieves competitive performance on challenging benchmarks, reaching 0.86 on GenEval and 7.468 on Imagine-Bench. These results indicate the substantial promise of video models for advancing high-quality text-to-image generation.
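As a rough illustration of the two mechanisms named in the abstract (per-frame independent encoding, and taking the final frame as the output), the following minimal sketch may help. It is not the paper's implementation: `toy_encoder`, `encode_frames_independently`, and `select_output` are hypothetical names, and 2x2 average pooling stands in for whatever image encoder the model actually uses.

```python
import numpy as np

def toy_encoder(frame):
    # Hypothetical stand-in for an image encoder: 2x2 average pooling.
    h, w, c = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def encode_frames_independently(frames, encode_frame):
    # Encode every frame on its own, so no information is mixed across time.
    return np.stack([encode_frame(f) for f in frames], axis=0)

def select_output(frames):
    # Intermediate frames are reasoning steps; the last frame is the image.
    return frames[-1]

# Toy CoF trajectory: 3 frames of size 4x4 with 3 channels (assumed shapes).
frames = np.stack([np.full((4, 4, 3), i, dtype=np.float32) for i in range(3)])
latents = encode_frames_independently(frames, toy_encoder)
output = select_output(frames)
```

Because each frame is encoded separately in this sketch, changing one frame leaves the latents of all other frames untouched, which is the property that per-frame encoding is meant to provide.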