MAOAM: Unified Object and Material Selection with Vision-Language Models

Abstract

Selection is a core operation in interactive image editing. To be practical, a system should let the user specify and disambiguate the desired selection region through either text or click-based interactions, and it should support selection not only by object but also by other criteria, such as material. Material-based selection is valuable for tasks like re-texturing surfaces or editing instances of a specific material. However, existing vision-language-model (VLM) based selection methods are object-centric and typically support a single interaction modality, which limits their applicability. We therefore present Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object-level and material-level selection across both text- and click-based interactions. MAOAM uses a VLM with a segmentation head to produce pixel-accurate masks from user prompts: the VLM interprets the user's selection intent (object-level or material-level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. To address it, we propose a scalable data generation pipeline: we collect real and synthetic images with material masks and use VLMs to generate material descriptions with rich visual semantics. We train MAOAM with a multi-task objective over click- and text-based selection, along with an auxiliary visual question answering (VQA) task derived from the material descriptions to facilitate deeper material understanding. Despite being trained only with unimodal prompts, our model exhibits an emergent improvement in selection quality when text and clicks are combined at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting the model's robustness in practice.
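
The abstract does not say how the segmentation head turns the output token into a mask. As an illustration only, the sketch below assumes one simple possibility: each pixel is scored by the dot product between the output-token embedding and that pixel's feature vector, the score is passed through a sigmoid, and the result is thresholded. The names `decode_mask`, `token_embedding`, `pixel_features`, and `threshold` are hypothetical and do not come from the paper, and the actual MAOAM head may work quite differently.

```python
import math


def decode_mask(token_embedding, pixel_features, threshold=0.5):
    """Hypothetical segmentation head (an assumption, not the paper's design).

    Scores each pixel by the dot product between one output-token embedding
    and the pixel's feature vector, applies a sigmoid, and thresholds the
    probability to obtain a binary mask.
    """
    mask = []
    for row in pixel_features:
        mask_row = []
        for feature in row:
            # Similarity between the token and this pixel's features.
            logit = sum(t * f for t, f in zip(token_embedding, feature))
            probability = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(1 if probability > threshold else 0)
        mask.append(mask_row)
    return mask


# Toy 2x2 feature map with 2-dimensional features per pixel.
token = [1.0, -1.0]
features = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.0, 1.0], [3.0, 1.0]],
]
mask = decode_mask(token, features)
print(mask)
```

In this toy setup the same decoding step would apply whether the token was produced from a text prompt or a click prompt, which is one way to read the "unified" framing above; that reading is likewise an assumption.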