
Falcon Perception

March 28, 2026
Authors: Aviraj Bevli, Sofian Chaybouti, Yasser Dahou, Hakim Hacid, Ngoc Dung Huynh, Phuc H. Le Khac, Sanath Narayan, Wamiq Reyaz Para, Ankit Singh
cs.AI

Abstract

Perception-centric systems are typically implemented with a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential, or can a single early-fusion stack do both perception and task modeling at scale? We introduce Falcon Perception, a unified dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer, using a hybrid attention pattern (bidirectional among image tokens, causal for prediction tokens) to combine global visual context with autoregressive, variable-length instance generation. To keep dense outputs practical, Falcon Perception retains a lightweight token interface and decodes continuous spatial outputs with specialized heads, enabling parallel high-resolution mask prediction. Our design promotes simplicity: we keep a single scalable backbone and shift complexity toward data and training signals, adding small heads only where outputs are continuous and dense. On SA-Co, Falcon Perception improves mask quality to 68.0 Macro-F_1, compared to SAM3's 62.3. We also introduce PBench, a benchmark targeting compositional prompts (OCR, spatial constraints, relations) and dense long-context regimes, where the model shows larger gains. Finally, we extend the same early-fusion recipe to Falcon OCR: a compact 300M-parameter model that attains 80.3% on olmOCR and 88.64 on OmniDocBench.
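The hybrid attention pattern the abstract describes (bidirectional among image tokens, causal for prediction tokens) amounts to prefix-LM-style masking. A minimal sketch follows, assuming image-patch tokens form a contiguous prefix of the sequence; the function name and layout are illustrative, not taken from the paper.

```python
import numpy as np

def hybrid_attention_mask(n_image: int, n_text: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) for a sequence of
    n_image image-patch tokens followed by n_text prediction tokens.

    Image tokens attend bidirectionally within the image block;
    prediction tokens attend causally to everything before them
    and to themselves (prefix-LM-style masking).
    """
    n = n_image + n_text
    # Start from a causal (lower-triangular) mask over the full sequence.
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Lift the causal restriction within the image-token block only.
    mask[:n_image, :n_image] = True
    return mask

# Row i of the mask lists the positions token i may attend to.
mask = hybrid_attention_mask(n_image=3, n_text=2)
```

In a real model, this mask would be passed to each attention layer so that a single shared stack serves both the bidirectional visual context and the autoregressive instance generation.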