
Falcon Perception

March 28, 2026
作者: Aviraj Bevli, Sofian Chaybouti, Yasser Dahou, Hakim Hacid, Ngoc Dung Huynh, Phuc H. Le Khac, Sanath Narayan, Wamiq Reyaz Para, Ankit Singh
cs.AI

Abstract

Perception-centric systems are typically implemented as a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential, or can a single early-fusion stack do both perception and task modeling at scale? We introduce Falcon Perception, a unified dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer, using a hybrid attention pattern (bidirectional among image tokens, causal for prediction tokens) to combine global visual context with autoregressive, variable-length instance generation. To keep dense outputs practical, Falcon Perception retains a lightweight token interface and decodes continuous spatial outputs with specialized heads, enabling parallel high-resolution mask prediction. Our design promotes simplicity: we keep a single scalable backbone and shift complexity toward data and training signals, adding only small heads where outputs are continuous and dense. On SA-Co, Falcon Perception improves mask quality to 68.0 Macro-F_1, compared to SAM3's 62.3. We also introduce PBench, a benchmark targeting compositional prompts (OCR, spatial constraints, relations) and dense long-context regimes, where the model shows larger gains. Finally, we extend the same early-fusion recipe to Falcon OCR: a compact 300M-parameter model that attains 80.3% on olmOCR and 88.64 on OmniDocBench.
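The hybrid attention pattern described above can be illustrated with a small sketch. This is not the paper's implementation; it is a minimal NumPy construction, under the assumption that the sequence lays out image tokens first and prediction tokens after, with prediction tokens also attending to all image tokens for global visual context:

```python
import numpy as np

def hybrid_attention_mask(n_image: int, n_text: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) for a sequence of
    n_image image tokens followed by n_text prediction tokens:
    bidirectional attention among image tokens, causal attention
    among prediction tokens."""
    n = n_image + n_text
    mask = np.zeros((n, n), dtype=bool)
    # Image tokens attend bidirectionally to all image tokens.
    mask[:n_image, :n_image] = True
    for i in range(n_image, n):
        # Prediction tokens see all image tokens (global visual context)...
        mask[i, :n_image] = True
        # ...and attend causally to themselves and earlier prediction tokens.
        mask[i, n_image:i + 1] = True
    return mask

mask = hybrid_attention_mask(n_image=4, n_text=3)
```

In this sketch, `mask[5, 4]` is True (a later prediction token sees an earlier one) while `mask[4, 5]` is False (no attention to future prediction tokens), and image tokens never attend to prediction tokens.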