Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

May 28, 2026
Authors: Ciara Rowles, Reshinth Adithyan, Nikhil Pinnaparaju, Vikram Voleti, Mark Boss
cs.AI

Abstract

We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision by fine-tuning a pretrained layer decomposition model using only feedback from a vision-language model (VLM). Starting from Qwen-Image-Layered, we apply Flow-GRPO with LoRA adaptation, sampling multiple candidate decompositions per image, scoring them with a VLM, and optimising the policy from group-relative advantages. The key challenge lies in designing a reliable reward signal: VLMs scoring samples in isolation tend to compress their judgements into a narrow band, leaving GRPO with little within-group variance to learn from. We address this with a two-stage evaluation pipeline that pairs structured per-sample scoring across five edit-centric criteria with a grid-based calibration step in which the VLM re-scores all candidates side-by-side. Stable-Layers produces decompositions with stronger layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error on the Crello dataset compared to the base model.
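
To illustrate why compressed scores starve GRPO of signal, the following minimal sketch computes group-relative advantages for one image's candidate group. It assumes the common GRPO formulation, in which each reward is standardised by the group mean and standard deviation; the abstract does not state the exact normalisation Stable-Layers uses, and the function name, the `eps` constant, and all score values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group of candidates for the same image.

    Assumed form (not confirmed by the abstract): A_i = (r_i - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical VLM scores for four candidate decompositions of one image.
isolated = [7.0, 7.0, 7.0, 7.0]    # per-sample scoring collapsed to a single value
calibrated = [4.0, 6.0, 7.0, 9.0]  # side-by-side re-scoring spreads the candidates

adv_isolated = group_relative_advantages(isolated)
adv_calibrated = group_relative_advantages(calibrated)

print(adv_isolated)
print(adv_calibrated)
```

When every candidate receives the same score, all advantages are zero and the group contributes no gradient. Once the scores are spread out, the candidates are ranked around a zero mean, which is the within-group variance the policy update relies on.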