

Towards Pixel-Level VLM Perception via Simple Points Prediction

January 27, 2026
Authors: Tianhui Song, Haoyu Lu, Hao Yang, Lin Sui, Haoning Wu, Zaida Zhou, Zhiqi Huang, Yiping Bao, Y. Charles, Xinyu Zhou, Limin Wang
cs.AI

Abstract

We present SimpleSeg, a strikingly simple yet highly effective approach to endow Multimodal Large Language Models (MLLMs) with native pixel-level perception. Our method reframes segmentation as a simple sequence generation problem: the model directly predicts sequences of points (textual coordinates) delineating object boundaries, entirely within its language space. To achieve high fidelity, we introduce a two-stage SFtoRL training pipeline, where Reinforcement Learning with an IoU-based reward refines the point sequences to accurately match ground-truth contours. We find that the standard MLLM architecture possesses a strong, inherent capacity for low-level perception that can be unlocked without any specialized architecture. On segmentation benchmarks, SimpleSeg achieves performance that is comparable to, and often surpasses, that of methods relying on complex, task-specific designs. This work shows that precise spatial understanding can emerge from simple point prediction, challenging the prevailing need for auxiliary components and paving the way for more unified and capable VLMs. Homepage: https://simpleseg.github.io/
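To make the IoU-based reward concrete, below is a minimal sketch (not the authors' code) of how such a reward could be computed for a point sequence emitted as text: the predicted boundary points are parsed from the generated string, rasterized into a binary mask, and scored against the ground-truth mask. Helper names such as `parse_points` and `polygon_iou_reward`, and the assumed "(x,y) (x,y) ..." output format, are hypothetical illustrations.

```python
# Hypothetical sketch of an IoU reward for text-formatted boundary points.
import re
import numpy as np
from PIL import Image, ImageDraw


def parse_points(text):
    """Extract (x, y) coordinate pairs from a generated text sequence."""
    pairs = re.findall(r"\(?\s*(\d+)\s*,\s*(\d+)\s*\)?", text)
    return [(int(x), int(y)) for x, y in pairs]


def rasterize_polygon(points, height, width):
    """Fill the polygon defined by the point sequence into a binary mask."""
    mask = Image.new("L", (width, height), 0)
    if len(points) >= 3:
        ImageDraw.Draw(mask).polygon(points, outline=1, fill=1)
    return np.array(mask, dtype=bool)


def polygon_iou_reward(generated_text, gt_mask):
    """Reward = IoU between the rasterized predicted polygon and the GT mask."""
    h, w = gt_mask.shape
    pred_mask = rasterize_polygon(parse_points(generated_text), h, w)
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(intersection) / union if union > 0 else 0.0


# Toy usage: a square ground-truth mask and a matching predicted contour.
gt = np.zeros((64, 64), dtype=bool)
gt[16:48, 16:48] = True
print(polygon_iou_reward("(16,16) (47,16) (47,47) (16,47)", gt))  # close to 1.0
```

In an RL fine-tuning loop, a scalar reward like this would be attached to each sampled point sequence; how SimpleSeg's pipeline actually formats coordinates and shapes the reward is specified in the paper, not here.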