Stable-Layers: Fine-Tuning Image Layer Decomposition Models via Reinforcement Learning with VLM Scoring

Abstract

We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision by fine-tuning a pretrained layer decomposition model using only feedback from a vision-language model (VLM). Starting from Qwen-Image-Layered, we apply Flow-GRPO with LoRA adaptation: for each image we sample multiple candidate decompositions, score them with a VLM, and optimise the policy from group-relative advantages. The key challenge lies in designing a reliable reward signal: VLMs that score samples in isolation tend to compress their judgements into a narrow band, leaving GRPO with little within-group variance to learn from. We address this with a two-stage evaluation pipeline that pairs structured per-sample scoring across five edit-centric criteria with a grid-based calibration step in which the VLM re-scores all candidates side by side. On the Crello dataset, Stable-Layers produces decompositions with stronger layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error than the base model.
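
To illustrate why a narrow score band starves GRPO of signal, the following is a minimal sketch of group-relative advantage computation. It assumes the common GRPO-style normalisation (reward minus group mean, divided by group standard deviation); the abstract does not state the exact formula, and the function names, the `eps` constant, and the example scores are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group of candidates for the same image.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    `eps` is a hypothetical stabiliser that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def score_spread(rewards):
    """Range of scores inside a group; zero means the scorer saw only ties."""
    return max(rewards) - min(rewards)


# Hypothetical VLM scores for four candidate decompositions of one image.
isolated = [7.0, 7.0, 7.0, 7.0]    # scored one at a time: judgements collapse
calibrated = [4.0, 6.0, 7.0, 9.0]  # re-scored side by side: candidates separate

tied_advantages = group_relative_advantages(isolated)        # every entry is 0.0
ranked_advantages = group_relative_advantages(calibrated)    # negative to positive
```

When every candidate in a group receives the same score, all advantages are zero and the group contributes no update direction. Spreading the scores, as the side-by-side calibration step is meant to do, restores a ranking the policy can learn from.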