MAOAM: Unified Object and Material Selection with Vision-Language Models

Abstract

Selection is a core operation in interactive image editing. To be practical, a selection tool should let the user specify and disambiguate the desired region through either text or click-based interactions, and it should support selecting not only objects but also regions defined by other criteria, such as materials. Material-based selection is valuable for tasks like re-texturing surfaces or editing instances of a specific material. However, existing vision-language model (VLM)-based selection methods are primarily object-centric and typically support a single interaction modality, limiting their applicability. We therefore present Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object- and material-level selection across both text- and click-based interactions. MAOAM leverages a VLM with a segmentation head to produce pixel-accurate masks from user prompts: the VLM interprets the user's selection intent (object- or material-level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. We propose a scalable data generation pipeline: we collect real and synthetic images with material masks, and leverage VLMs to generate material descriptions with rich visual semantics. We train MAOAM with a multi-task objective over click- and text-based selection, along with an auxiliary visual question answering (VQA) task derived from the material descriptions to facilitate deeper material understanding. Despite being trained with uni-modal prompts only, our model exhibits an emergent improvement in selection when text and clicks are combined at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting the method's robustness in practice.
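
To make the two mechanisms named in the abstract more concrete, the following is a minimal sketch, not the MAOAM implementation. It assumes that the segmentation head scores each pixel by a dot product between the VLM's output-token embedding and a per-pixel feature vector, and that the multi-task objective is a weighted sum of a click-selection loss, a text-selection loss, and the auxiliary VQA loss. The abstract specifies neither detail; the names `decode_mask` and `multitask_loss`, the threshold, and the loss weights are all hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_mask(token_embedding, pixel_features, threshold=0.5):
    """Hypothetical segmentation head (an assumption, not the paper's design).

    token_embedding: list of floats, the embedding of the VLM's output token.
    pixel_features:  H x W grid (nested lists) of feature vectors with the
                     same length as token_embedding.
    Returns an H x W binary mask: 1 where sigmoid(dot product) > threshold.
    """
    mask = []
    for row in pixel_features:
        mask_row = []
        for feature in row:
            logit = sum(t * f for t, f in zip(token_embedding, feature))
            mask_row.append(1 if sigmoid(logit) > threshold else 0)
        mask.append(mask_row)
    return mask


def multitask_loss(click_loss, text_loss, vqa_loss, weights=(1.0, 1.0, 1.0)):
    """Assumed form of the multi-task objective: a weighted sum of the
    click-selection, text-selection, and auxiliary VQA losses.
    The weights are illustrative placeholders."""
    w_click, w_text, w_vqa = weights
    return w_click * click_loss + w_text * text_loss + w_vqa * vqa_loss


if __name__ == "__main__":
    # Toy 2 x 2 image with 2-dimensional per-pixel features.
    token = [1.0, -1.0]
    features = [
        [[2.0, 0.0], [0.0, 2.0]],
        [[1.0, 1.0], [3.0, 1.0]],
    ]
    print(decode_mask(token, features))
    print(multitask_loss(0.2, 0.4, 0.6, weights=(1.0, 1.0, 0.5)))
```

In this sketch a single token embedding is compared against every pixel feature, so one mechanism can serve both object- and material-level selection: only the embedding produced from the user's prompt changes. The actual head, features, and loss terms used by MAOAM may differ.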