Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
May 14, 2026
Authors: Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du
cs.AI
Abstract
Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling.
While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose Δ-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.
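The abstract describes DSWM only at a high level, so the following is a minimal sketch of what merging in "Δ-space" could look like, under stated assumptions rather than the authors' actual formulation. It assumes the aligned and distilled checkpoints share a common base model, treats each checkpoint's difference from that base as a delta, and adds the scaled deltas back onto the base. The function names, the `alpha`/`beta` scaling factors, and the toy weight values are all hypothetical.

```python
from typing import Dict, List

# A checkpoint is modeled as a mapping from parameter name to a flat list of floats.
Weights = Dict[str, List[float]]


def delta(weights: Weights, base: Weights) -> Weights:
    """Return the per-parameter difference (weights - base)."""
    return {
        name: [w - b for w, b in zip(weights[name], base[name])]
        for name in base
    }


def merge_in_delta_space(
    base: Weights,
    aligned: Weights,
    distilled: Weights,
    alpha: float = 1.0,  # hypothetical scale for the alignment delta
    beta: float = 1.0,   # hypothetical scale for the distillation delta
) -> Weights:
    """Assumed merge rule: base + alpha * (aligned - base) + beta * (distilled - base)."""
    d_align = delta(aligned, base)
    d_distill = delta(distilled, base)
    return {
        name: [
            b + alpha * da + beta * dd
            for b, da, dd in zip(base[name], d_align[name], d_distill[name])
        ]
        for name in base
    }


# Toy example with a single three-element parameter (values are illustrative only).
base = {"layer.w": [1.0, 2.0, 3.0]}
aligned = {"layer.w": [1.5, 2.0, 2.0]}    # stands in for RL-aligned weights
distilled = {"layer.w": [1.0, 3.0, 3.5]}  # stands in for a few-step distilled checkpoint

merged = merge_in_delta_space(base, aligned, distilled)
distilled_only = merge_in_delta_space(base, aligned, distilled, alpha=0.0, beta=1.0)
```

With `alpha=0.0` the sketch reduces to the distilled checkpoint, and with `beta=0.0` to the aligned one. That is the sense in which such a merge would reuse an off-the-shelf distillation prior without retraining; whether DSWM uses this exact rule is not stated in the abstract.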