YOLOE-26: Integrating YOLO26 with YOLOE for Real-Time Open-Vocabulary Instance Segmentation
January 29, 2026
Authors: Ranjan Sapkota, Manoj Karkee
cs.AI
Abstract
This paper presents YOLOE-26, a unified framework that integrates the deployment-optimized YOLO26 (also written YOLOv26) architecture with the open-vocabulary learning paradigm of YOLOE for real-time open-vocabulary instance segmentation. Building on the end-to-end design of YOLO26, which requires no non-maximum suppression (NMS), the proposed approach preserves the hallmark efficiency and determinism of the YOLO family while extending its capabilities beyond closed-set recognition. YOLOE-26 employs a convolutional backbone with PAN/FPN-style multi-scale feature aggregation, followed by end-to-end regression and instance segmentation heads. A key architectural contribution is the replacement of fixed class logits with an object embedding head, which formulates classification as similarity matching against prompt embeddings derived from text descriptions, visual examples, or a built-in vocabulary. To enable efficient open-vocabulary inference, the framework incorporates three components: Re-Parameterizable Region-Text Alignment (RepRTA) for zero-overhead text prompting, a Semantic-Activated Visual Prompt Encoder (SAVPE) for example-guided segmentation, and Lazy Region Prompt Contrast for prompt-free inference. All prompting modalities operate within a unified object embedding space, allowing seamless switching between text-prompted, visual-prompted, and fully autonomous segmentation. Experiments show consistent scaling behavior and favorable accuracy-efficiency trade-offs across model sizes in both prompted and prompt-free settings. The training strategy uses large-scale detection and grounding datasets with multi-task optimization and remains fully compatible with the Ultralytics ecosystem for training, validation, and deployment. Overall, YOLOE-26 provides a practical and scalable solution for real-time open-vocabulary instance segmentation in dynamic, real-world environments.
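The idea of classifying by similarity matching against prompt embeddings, rather than by fixed class logits, can be illustrated with a small sketch. This is not the YOLOE-26 implementation: the abstract does not specify the similarity function or any decision threshold, so cosine similarity, the `threshold` value, the three-dimensional toy embeddings, and the function names `l2_normalize`, `similarity_scores`, and `classify` are all assumptions made for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def similarity_scores(object_embedding, prompt_embeddings):
    """Return the cosine similarity between one object embedding and each prompt.

    prompt_embeddings maps a prompt name to its embedding. In an open-vocabulary
    model these would come from a text or visual prompt encoder; here they are
    hand-written toy vectors (an assumption for illustration).
    """
    obj = l2_normalize(object_embedding)
    scores = {}
    for name, prompt in prompt_embeddings.items():
        unit_prompt = l2_normalize(prompt)
        scores[name] = sum(a * b for a, b in zip(obj, unit_prompt))
    return scores


def classify(object_embedding, prompt_embeddings, threshold=0.5):
    """Pick the best-matching prompt, or None if no prompt is similar enough.

    The threshold is a hypothetical parameter, not a value from the paper.
    """
    scores = similarity_scores(object_embedding, prompt_embeddings)
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]


if __name__ == "__main__":
    vocabulary = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}

    # An object embedding close to the "cat" prompt.
    print(classify([0.9, 0.1, 0.0], vocabulary))

    # An object unlike any current prompt is left unlabeled.
    print(classify([0.0, 0.0, 1.0], vocabulary))

    # Extending the vocabulary needs only a new prompt embedding, no retraining.
    vocabulary["bird"] = [0.0, 0.0, 1.0]
    print(classify([0.0, 0.0, 1.0], vocabulary))
```

The point of the design is visible in the last call: the set of recognizable categories is defined by the prompt embeddings supplied at inference time, so the same object embedding can be matched against a different vocabulary without changing the model.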