Enabling Complex Visual Generation through Closed-Loop Verified Reasoning

Abstract

Despite rapid progress, current text-to-image (T2I) models rely predominantly on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. Recent multi-step reasoning approaches show promise, but they are hindered by four problems: ungrounded planning hallucinations that go unverified, monolithic post-hoc reflection, unstable long-context optimization, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that tightly couples vision-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories. To stabilize long-context optimization, it proposes Proxy Prompt Reinforcement Learning (PPRL), which distills interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose Δ-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs (function evaluations) without expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.
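The abstract does not give the DSWM update rule, so the following is only an illustrative sketch of one plausible reading of merging "in Δ-space": each fine-tuned checkpoint is represented by its difference from a shared base model, and the two differences are added back onto the base. The function names, the scaling coefficients `alpha` and `beta`, and the toy weight dictionaries are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def delta(tuned: Weights, base: Weights) -> Weights:
    """Per-parameter difference tuned - base (assumed meaning of a 'Δ')."""
    return {k: [t - b for t, b in zip(tuned[k], base[k])] for k in base}


def merge_in_delta_space(
    base: Weights,
    aligned: Weights,
    distilled: Weights,
    alpha: float = 1.0,  # hypothetical weight on the alignment delta
    beta: float = 1.0,   # hypothetical weight on the distillation delta
) -> Weights:
    """Return base + alpha * (aligned - base) + beta * (distilled - base).

    Assumes all three checkpoints share the same parameter names and shapes.
    """
    d_align = delta(aligned, base)
    d_distill = delta(distilled, base)
    return {
        k: [
            b + alpha * da + beta * dd
            for b, da, dd in zip(base[k], d_align[k], d_distill[k])
        ]
        for k in base
    }


# Toy usage: alignment changes the first weight, distillation the second.
base = {"w": [1.0, 2.0]}
aligned = {"w": [1.5, 2.0]}
distilled = {"w": [1.0, 3.0]}
merged = merge_in_delta_space(base, aligned, distilled)
print(merged)
```

Under this reading, the merged checkpoint inherits both changes without any further training, which is consistent with the abstract's claim that re-distillation is avoided. The actual method may weight or combine the deltas differently.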