PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

October 27, 2025
作者: Yuqian Yuan, Wenqiao Zhang, Xin Li, Shihao Wang, Kehan Li, Wentong Li, Jun Xiao, Lei Zhang, Beng Chin Ooi
cs.AI

Abstract

Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.
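The abstract outlines two mechanisms: a Scale-Adaptive Object Tokenizer (SAOT) that turns a free-form region into a few compact object tokens, and an Object-Centric Infusion module that pre-fuses global visual context into those tokens so that only the object tokens (not the full grid of visual tokens) are fed to the LLM. The paper's actual implementation is not given here; the following is a minimal PyTorch sketch of that idea under stated assumptions. The mask-pooling function is a toy stand-in for SAOT, and all module names, dimensions, and the single cross-attention layer are illustrative assumptions, not the paper's design.

```python
# Hypothetical sketch of an object-only pipeline (assumes PyTorch).
# Not the PixelRefer implementation; names and shapes are illustrative.
import torch
import torch.nn as nn


def mask_pool_object_tokens(feature_map: torch.Tensor,
                            region_mask: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for the Scale-Adaptive Object Tokenizer: average-pool
    the visual features inside a free-form binary region mask into one token."""
    # feature_map: (B, N, D) flattened patch features; region_mask: (B, N) in {0, 1}
    weights = region_mask / region_mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    return torch.einsum("bn,bnd->bd", weights, feature_map).unsqueeze(1)  # (B, 1, D)


class ObjectCentricInfusion(nn.Module):
    """Pre-fuse global visual context into object tokens via cross-attention,
    so only the few object tokens (K << N) need to enter the LLM."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, object_tokens: torch.Tensor,
                global_tokens: torch.Tensor) -> torch.Tensor:
        # object_tokens: (B, K, D) queries; global_tokens: (B, N, D) keys/values
        fused, _ = self.cross_attn(query=object_tokens,
                                   key=global_tokens,
                                   value=global_tokens)
        return self.norm(object_tokens + fused)  # residual + norm


if __name__ == "__main__":
    B, N, D = 2, 576, 1024
    feats = torch.randn(B, N, D)                      # global visual tokens
    mask = (torch.rand(B, N) > 0.9).float()           # arbitrary free-form region
    obj = mask_pool_object_tokens(feats, mask)        # (B, 1, D) object token
    fused = ObjectCentricInfusion(D)(obj, feats)      # (B, 1, D) tokens for the LLM
    print(fused.shape)
```

Under this scheme the LLM sequence length scales with the number of referred objects rather than the number of image patches, which is the source of the efficiency gains the abstract claims for PixelRefer-Lite.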