NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining

Abstract

Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets: original image, instruction, edited image. Yet mining pixel-accurate examples is hard. Each edit must affect only the regions specified by the prompt, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approximately 2.2x, yielding high-fidelity training data at large scale. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit: an open dataset of 358k high-quality triplets. In the largest cross-dataset evaluation, it surpasses all public alternatives. We also release Bagel-NHR-Edit, an open-source fine-tuned Bagel model, which achieves state-of-the-art metrics in our experiments.
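The abstract names three mechanisms (validator-based filtering, inversion, and compositional bootstrapping) without specifying them. The following is a minimal Python sketch of how such steps could operate on triplets, under stated assumptions. It is not the authors' implementation. Every name here (`Triplet`, `filter_by_validator`, `invert`, `compose`, the thresholds, and the two callables) is hypothetical. The validator, which the paper describes as a task-tuned Gemini model, is represented only by a caller-supplied scoring function, and the rule for producing an inverse instruction is likewise left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Triplet:
    source: str       # identifier or path of the original image
    instruction: str  # natural-language edit instruction
    edited: str       # identifier or path of the edited image


def filter_by_validator(
    candidates: Iterable[Triplet],
    score_fn: Callable[[Triplet], Tuple[float, float]],
    adherence_min: float,
    aesthetics_min: float,
) -> List[Triplet]:
    """Keep triplets whose (adherence, aesthetics) scores pass both thresholds.

    score_fn stands in for the validator; the thresholds are assumptions.
    """
    kept = []
    for triplet in candidates:
        adherence, aesthetics = score_fn(triplet)
        if adherence >= adherence_min and aesthetics >= aesthetics_min:
            kept.append(triplet)
    return kept


def invert(triplet: Triplet, invert_instruction: Callable[[str], str]) -> Triplet:
    """Swap source and edited images and attach the inverse instruction."""
    return Triplet(
        source=triplet.edited,
        instruction=invert_instruction(triplet.instruction),
        edited=triplet.source,
    )


def compose(first: Triplet, second: Triplet) -> Triplet:
    """Chain two edits into one triplet; the second must start where the first ends."""
    if first.edited != second.source:
        raise ValueError("second edit must start from the first edit's output")
    return Triplet(
        source=first.source,
        instruction=f"{first.instruction}, then {second.instruction}",
        edited=second.edited,
    )
```

In this sketch, each accepted triplet can yield an inverted triplet, and each chain of two accepted edits can yield a composed one, which is one way a mined set could grow beyond its original size. The approximately 2.2x figure reported in the abstract comes from the authors' pipeline, not from this sketch.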
