MAOAM: Unified Object and Material Selection with Vision-Language Models

Abstract

Selection is a core operation in interactive image editing. To be practical, a selection tool should let the user specify and disambiguate the desired region through either text or click-based interactions, and it should support selection by criteria other than objects, such as materials. Material-based selection is valuable for tasks like re-texturing surfaces or editing instances of a specific material. However, existing selection methods based on vision-language models (VLMs) are object-centric and typically support a single interaction modality, which limits their applicability. We therefore present Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object- and material-level selection across both text- and click-based interactions. MAOAM leverages a VLM with a segmentation head to produce pixel-accurate masks from user prompts: the VLM interprets the user's selection intent (object- or material-level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. To address it, we propose a scalable data generation pipeline: we collect real and synthetic images with material masks, and leverage VLMs to generate material descriptions with rich visual semantics. We train MAOAM with a multi-task objective over click- and text-based selection, along with an auxiliary VQA task derived from the material descriptions to facilitate deeper material understanding. Despite being trained only with uni-modal prompts, our model exhibits an emergent improvement in selection quality when text and clicks are combined at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting robustness in practice.
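To make the phrase "decodes the output token into a mask" concrete, the following is a minimal sketch of one common design for such a head, not a description of MAOAM's actual architecture. It assumes the VLM emits a single embedding vector for a segmentation token and that a separate encoder supplies one feature vector per pixel; the function `decode_mask`, its parameters, and the toy values are all hypothetical.

```python
import math

def decode_mask(seg_token, pixel_features, threshold=0.5):
    """Hypothetical mask decoder (an assumption, not the paper's head).

    Scores each pixel by the dot product between the segmentation-token
    embedding and that pixel's feature vector, applies a sigmoid, and
    thresholds the result into a binary mask.
    """
    mask = []
    for row in pixel_features:
        mask_row = []
        for feat in row:
            logit = sum(t * f for t, f in zip(seg_token, feat))
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask

# Toy 2x2 "image" with 2-dimensional per-pixel features (made-up values).
seg_token = [1.0, -1.0]
pixel_features = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.0, 1.0], [3.0, 1.0]],
]
mask = decode_mask(seg_token, pixel_features)
print(mask)
```

In this toy setup, pixels whose features align with the token embedding are selected and the rest are left out. A real head would typically operate on learned, upsampled feature maps rather than hand-written lists.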