Stable-Layers: Fine-Tuning Image Layer Decomposition Models via Reinforcement Learning with VLM Scores

Abstract

We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision by fine-tuning a pretrained layer decomposition model using only feedback from a vision-language model (VLM). Starting from Qwen-Image-Layered, we apply Flow-GRPO with LoRA adaptation, sampling multiple candidate decompositions per image, scoring them with a VLM, and optimising the policy from group-relative advantages. The key challenge lies in designing a reliable reward signal: VLMs scoring samples in isolation tend to compress their judgements into a narrow band, leaving GRPO with little within-group variance to learn from. We address this with a two-stage evaluation pipeline that pairs structured per-sample scoring across five edit-centric criteria with a grid-based calibration step in which the VLM re-scores all candidates side-by-side. Stable-Layers produces decompositions with stronger layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error on the Crello dataset compared to the base model.
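
To illustrate why compressed VLM scores leave GRPO with little to learn from, the following minimal sketch computes group-relative advantages for one image's candidate set. It assumes the common GRPO normalisation (subtract the group mean, divide by the group standard deviation); whether Stable-Layers uses exactly this form is an assumption, and the function name, the `eps` constant, and the score values are hypothetical and chosen only for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against the mean and std of its own group.

    Assumed form: A_i = (r_i - mean(r)) / (std(r) + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical VLM scores for four candidate decompositions of one image.
isolated = [7.0, 7.0, 7.0, 7.0]    # scored one at a time: compressed into a narrow band
calibrated = [4.0, 6.0, 7.0, 9.0]  # re-scored side by side: spread across the range

adv_isolated = group_relative_advantages(isolated)
adv_calibrated = group_relative_advantages(calibrated)

print(adv_isolated)
print(adv_calibrated)
```

When every candidate receives the same score, all advantages are zero and the policy gradient for that group vanishes. Once side-by-side re-scoring separates the candidates, the advantages become non-zero and signed, which is the within-group variance the abstract refers to.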