Enabling Complex Visual Generation via Closed-Loop Verified Reasoning

Abstract

Despite rapid progress, current text-to-image (T2I) models still rely predominantly on a single-step generation paradigm, which struggles with complex semantic relations and faces diminishing returns from parameter scaling. Recent multi-step reasoning approaches show promise, but they are limited by four problems: planning hallucinations that are ungrounded because no verification is performed, monolithic post-hoc reflection, unstable long-context optimization, and inference latency that makes practical deployment difficult. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that tightly couples vision-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories. It also proposes Proxy Prompt Reinforcement Learning (PPRL), which resolves the long-context optimization instability by distilling interleaved multimodal histories into explicit reward signals that support accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose Δ-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to only 4 function evaluations (NFEs) without expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling for complex visual generation.
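To make the idea of merging in Δ-space concrete, the following is a minimal sketch and not the DSWM procedure itself, which the abstract does not specify. It assumes that the aligned model and the off-the-shelf distilled model were both derived from one shared base checkpoint, so that each can be written as an offset (Δ) from that base and the two offsets can be added. The function names, the scaling factors `alpha` and `beta`, and the plain-list weight format are all hypothetical choices made for illustration.

```python
from typing import Dict, List

# Hypothetical weight format: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def delta(tuned: Weights, base: Weights) -> Weights:
    """Return tuned - base per parameter, i.e. the offset in delta space."""
    if tuned.keys() != base.keys():
        raise ValueError("tuned and base must share the same parameter names")
    return {
        name: [t - b for t, b in zip(tuned[name], base[name])]
        for name in base
    }


def merge_in_delta_space(
    base: Weights,
    aligned: Weights,
    distilled: Weights,
    alpha: float = 1.0,  # assumed scale for the alignment offset
    beta: float = 1.0,   # assumed scale for the distillation offset
) -> Weights:
    """Add both offsets back onto the shared base checkpoint.

    This is plain additive merging of weight offsets, used here only as an
    illustrative stand-in for the method named in the abstract.
    """
    d_align = delta(aligned, base)
    d_distill = delta(distilled, base)
    return {
        name: [
            b + alpha * da + beta * dd
            for b, da, dd in zip(base[name], d_align[name], d_distill[name])
        ]
        for name in base
    }


# Toy example with a single two-element parameter.
base = {"w": [1.0, 2.0]}
aligned = {"w": [1.5, 2.0]}    # alignment training changed the first entry
distilled = {"w": [1.0, 1.0]}  # few-step distillation changed the second entry
merged = merge_in_delta_space(base, aligned, distilled)
```

In this toy case the two offsets touch different entries, so the merged weights carry both changes at once. When the offsets overlap, the choice of `alpha` and `beta` would matter, and the abstract gives no values for such factors.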