Stable-Layers: Fine-Tuning an Image Layer Decomposition Model via VLM-Scored Reinforcement Learning

Abstract

We present Stable-Layers, a reinforcement learning framework that fine-tunes a pretrained layer decomposition model using only feedback from a vision-language model (VLM), with no paired supervision. Starting from Qwen-Image-Layered, we apply Flow-GRPO with LoRA adaptation: for each image we sample multiple candidate decompositions, score them with a VLM, and optimise the policy from group-relative advantages. The key challenge lies in designing a reliable reward signal. A VLM that scores samples in isolation tends to compress its judgements into a narrow band, leaving GRPO with little within-group variance to learn from. We address this with a two-stage evaluation pipeline that pairs structured per-sample scoring across five edit-centric criteria with a grid-based calibration step in which the VLM re-scores all candidates side by side. On the Crello dataset, Stable-Layers produces decompositions with stronger layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error than the base model.
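
To illustrate why compressed scores starve GRPO of signal, the following is a minimal sketch of group-relative advantage computation. It assumes the common formulation in which each reward is standardised by the mean and standard deviation of its group; the abstract does not specify the exact normalisation used. The function name, the `eps` constant, and the score values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group of candidates for the same image.

    Assumed formulation: (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical VLM scores for four candidate decompositions of one image.
isolated = [7.0, 7.0, 7.0, 7.0]    # fully compressed: every candidate scores the same
calibrated = [4.0, 6.0, 7.0, 9.0]  # side-by-side re-scoring spreads the candidates out

adv_isolated = group_relative_advantages(isolated)
adv_calibrated = group_relative_advantages(calibrated)

print(adv_isolated)    # all zeros: no candidate is preferred over another
print(adv_calibrated)  # ordered, zero-mean advantages that can drive the update
```

In the fully compressed case every advantage is zero, so the group contributes nothing to the policy update. Once the scores are spread out, the same group yields an ordered, zero-mean set of advantages.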