Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Abstract

Despite rapid progress, current text-to-image (T2I) models still rely predominantly on a single-step generation paradigm, which struggles with complex semantics and shows diminishing returns from parameter scaling. Recent multi-step reasoning approaches show promise, but they are hindered by four problems: ungrounded planning hallucinations that go unverified, monolithic post-hoc reflection, unstable long-context optimization, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that tightly couples vision-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories. It also proposes Proxy Prompt Reinforcement Learning (PPRL), which resolves long-context optimization instability by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose Δ-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 neural function evaluations (NFEs) without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling for complex visual generation.
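
The abstract does not give the DSWM update rule, so the following is only a minimal sketch of one plausible reading of "delta-space" merging. It assumes that the aligned model and the distilled few-step model were both derived from the same base checkpoint, and that the alignment delta (aligned minus base) can be added onto the distilled weights. The function name `delta_space_merge`, the `scale` parameter, and the toy checkpoints are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical checkpoint format: parameter name -> flattened list of values.
Weights = Dict[str, List[float]]


def delta_space_merge(
    base: Weights,
    aligned: Weights,
    distilled: Weights,
    scale: float = 1.0,
) -> Weights:
    """Add the alignment delta (aligned - base) onto the distilled weights.

    Assumption: all three checkpoints share the same parameter names and
    shapes. This is a generic task-arithmetic style merge, not necessarily
    the exact rule used by DSWM.
    """
    merged: Weights = {}
    for name, base_vals in base.items():
        aligned_vals = aligned[name]
        distilled_vals = distilled[name]
        if not (len(base_vals) == len(aligned_vals) == len(distilled_vals)):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            d + scale * (a - b)
            for b, a, d in zip(base_vals, aligned_vals, distilled_vals)
        ]
    return merged


# Toy checkpoints (illustrative values only).
base = {"w": [1.0, 2.0]}
aligned = {"w": [1.5, 2.0]}    # alignment changed only the first entry
distilled = {"w": [0.0, 3.0]}  # few-step distilled weights

merged = delta_space_merge(base, aligned, distilled)
print(merged)
```

Under these assumptions the merge needs no training: it is a single pass over the parameters, which is consistent with the abstract's claim that no re-distillation is required. Setting `scale` to 0 returns the distilled weights unchanged.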