Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection

Abstract

Open-vocabulary detection (OVD) aims to detect objects beyond a predefined set of categories. As a pioneering model incorporating the YOLO series into OVD, YOLO-World is well suited to scenarios that prioritize speed and efficiency. However, its performance is hindered by the feature fusion mechanism in its neck, which incurs quadratic complexity and provides only limited guided receptive fields. To address these limitations, we present Mamba-YOLO-World, a novel YOLO-based OVD model that employs the proposed MambaFusion Path Aggregation Network (MambaFusion-PAN) as its neck architecture. Specifically, we introduce a State Space Model-based feature fusion mechanism consisting of a Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan algorithm, both with linear complexity and globally guided receptive fields. The mechanism leverages multi-modal input sequences and Mamba hidden states to guide the selective scanning process. Experiments demonstrate that our model outperforms the original YOLO-World on the COCO and LVIS benchmarks in both zero-shot and fine-tuning settings while maintaining comparable parameters and FLOPs. Additionally, it surpasses existing state-of-the-art OVD methods with fewer parameters and FLOPs.
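For intuition about the linear-complexity claim, the following is a minimal sketch of a generic selective state-space scan in plain Python. It is not the paper's Parallel-Guided or Serial-Guided Selective Scan algorithm: the function name `selective_scan`, the parameter names, and the simplified discretization are illustrative assumptions, and the per-step parameters are passed in directly rather than predicted from image or text features.

```python
import math


def selective_scan(x, delta, A, B, C):
    """Generic selective SSM recurrence (illustrative, not the paper's algorithm).

    x:     sequence of L scalar inputs
    delta: L positive step sizes (input-dependent in a selective SSM)
    A:     N state-decay rates, usually negative
    B, C:  L lists of N input/output projection weights, one list per step
    """
    n_state = len(A)
    h = [0.0] * n_state  # hidden state carried across the whole sequence
    ys = []
    for t, x_t in enumerate(x):  # one pass over the sequence: O(L * N)
        for n in range(n_state):
            decay = math.exp(delta[t] * A[n])  # discretized state transition
            h[n] = decay * h[n] + delta[t] * B[t][n] * x_t
        ys.append(sum(C[t][n] * h[n] for n in range(n_state)))
    return ys


# An impulse at step 0 still influences later outputs through the hidden state.
ys = selective_scan(
    x=[1.0, 0.0, 0.0],
    delta=[1.0, 1.0, 1.0],
    A=[-1.0],
    B=[[1.0], [1.0], [1.0]],
    C=[[1.0], [1.0], [1.0]],
)
print(ys)
```

Each step updates a fixed-size hidden state once, so the cost grows linearly with sequence length, whereas pairwise attention-style fusion compares every position with every other. Because the state is carried forward, each output can depend on all earlier inputs, which is the sense in which such a scan has a global receptive field.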
