Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation

Abstract

Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.
