MAOAM: Unified Selection of Objects and Materials with Vision-Language Models

Abstract

Selection is a core operation in interactive image editing. To be practical, a selection tool should let the user specify and disambiguate the desired region through either text or clicks, and it should support selecting by criteria other than objects, such as materials. Material-based selection is valuable for tasks like re-texturing surfaces or editing every instance of a specific material. However, existing selection methods based on vision-language models (VLMs) are object-centric and typically support a single interaction modality, which limits their applicability. We therefore present Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object- and material-level selection across both text- and click-based interactions. MAOAM uses a VLM with a segmentation head to produce pixel-accurate masks from user prompts: the VLM interprets the user's selection intent (object- or material-level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. To address it, we propose a scalable data generation pipeline: we collect real and synthetic images with material masks and use VLMs to generate material descriptions with rich visual semantics. We train MAOAM with a multi-task objective over click- and text-based selection, along with an auxiliary VQA task derived from the material descriptions to encourage deeper material understanding. Although it is trained only with uni-modal prompts, our model shows an emergent improvement in selection quality when text and clicks are combined at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting robustness in practice.
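
The abstract does not say how the segmentation head turns the VLM's output token into a mask. As a minimal sketch of one common design, assumed here rather than taken from the paper, the token embedding can be compared with a per-pixel feature map by dot product, passed through a sigmoid, and thresholded. The function name `decode_mask`, the 2x2 feature map, and the threshold are all hypothetical and only illustrate the idea.

```python
import math


def decode_mask(token_embedding, pixel_features, threshold=0.5):
    """Toy mask decoder (illustrative assumption, not MAOAM's actual head).

    token_embedding: list of floats, the embedding of the VLM output token.
    pixel_features:  H x W grid of feature vectors (lists of floats).
    Returns an H x W binary mask: 1 where sigmoid(token . feature) > threshold.
    """
    mask = []
    for row in pixel_features:
        mask_row = []
        for feature in row:
            # Similarity between the prompt-conditioned token and this pixel.
            logit = sum(t * f for t, f in zip(token_embedding, feature))
            probability = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(1 if probability > threshold else 0)
        mask.append(mask_row)
    return mask


# Hypothetical 2x2 feature map with 2-dimensional features.
token = [1.0, -1.0]
features = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.0, 2.0], [3.0, 1.0]],
]
mask = decode_mask(token, features)
print(mask)
```

In this toy setup the same image features yield a different mask whenever the token embedding changes, which is how a single head could in principle serve both object-level and material-level prompts.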