NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining

Abstract

Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets: original image, instruction, edited image. Yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approximately 2.2x, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit: an open dataset of 358k high-quality triplets. In the largest cross-dataset evaluation, it surpasses all public alternatives. We also release Bagel-NHR-Edit, an open-source fine-tuned Bagel model, which achieves state-of-the-art metrics in our experiments.
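
The abstract names inversion and compositional bootstrapping but does not spell out how they work. The sketch below is one plausible reading, not the authors' implementation: inversion is assumed to swap the source and edited images under a reverse instruction, and composition is assumed to chain two edits that share an intermediate image. The `Triplet` class, the `invert` and `compose` helpers, the image identifiers, and the way instructions are joined are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triplet:
    """Hypothetical container for one training example."""
    source: str       # identifier of the original image
    instruction: str  # natural-language edit instruction
    edited: str       # identifier of the edited image


def invert(t: Triplet, inverse_instruction: str) -> Triplet:
    """Assumed inversion: swap the two images.

    The reverse instruction is supplied by the caller; how it would be
    produced in practice is not stated in the abstract.
    """
    return Triplet(source=t.edited, instruction=inverse_instruction, edited=t.source)


def compose(first: Triplet, second: Triplet) -> Optional[Triplet]:
    """Assumed composition: chain two edits sharing an intermediate image.

    Returns None when the output of `first` is not the input of `second`.
    """
    if first.edited != second.source:
        return None
    joined = f"{first.instruction}, then {second.instruction}"
    return Triplet(source=first.source, instruction=joined, edited=second.edited)


# Two mined edits that form a chain img0 -> img1 -> img2.
a = Triplet("img0", "add a red hat", "img1")
b = Triplet("img1", "make the sky cloudy", "img2")

# Two mined triplets yield two additional ones in this toy case.
augmented = [a, b, invert(a, "remove the red hat"), compose(a, b)]
```

In this toy case, every mined triplet can contribute an inverted counterpart, and every pair of chained edits can contribute a composed one. The actual growth factor depends on how many such chains exist and on which derived triplets survive validation.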
