Stable-Layers: Fine-Tuning Image Layer Decomposition Models via Reinforcement Learning with VLM Scoring

Abstract

We present Stable-Layers, a reinforcement learning framework that fine-tunes a pretrained layer decomposition model using only feedback from a vision-language model (VLM), eliminating the need for paired supervision. Starting from Qwen-Image-Layered, we apply Flow-GRPO with LoRA adaptation: for each image we sample multiple candidate decompositions, score them with a VLM, and optimise the policy using group-relative advantages. The key challenge is designing a reliable reward signal. A VLM that scores samples in isolation tends to compress its judgements into a narrow band, leaving GRPO with too little within-group variance to learn from. We address this with a two-stage evaluation pipeline that combines structured per-sample scoring on five edit-centric criteria with a grid-based calibration step, in which the VLM re-scores all candidates side by side. On the Crello dataset, Stable-Layers produces decompositions with stronger layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error than the base model.
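
To illustrate why within-group variance matters, the following minimal sketch computes group-relative advantages by standardising rewards within one group of candidates, as is common in GRPO-style methods. It is not the authors' implementation. The function names, the example scores, and in particular the linear blend of per-sample and grid scores (`calibrated_rewards` with its `weight` parameter) are assumptions made for illustration only; the abstract does not say how the two stages are combined.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group of candidates for the same image."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def calibrated_rewards(per_sample, grid, weight=0.5):
    """Hypothetical blend of stage-one and stage-two scores (an assumption)."""
    return [(1 - weight) * s + weight * g for s, g in zip(per_sample, grid)]


# Made-up scores: the isolated VLM pass rates every candidate the same,
# while the side-by-side grid pass separates them.
isolated = [7.0, 7.0, 7.0, 7.0]
grid = [4.0, 6.0, 8.0, 9.0]

flat = group_relative_advantages(isolated)
spread = group_relative_advantages(calibrated_rewards(isolated, grid))

print(flat)    # every advantage is zero, so there is no learning signal
print(spread)  # advantages now rank the candidates
```

When every candidate receives the same score, all advantages vanish and the policy gradient carries no information about which decomposition is better. Any step that spreads scores within the group, such as the side-by-side re-scoring described above, restores a usable ranking.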