Falcon Perception

Abstract

Perception-centric systems are typically implemented as a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential, or can a single early-fusion stack perform both perception and task modeling at scale? We introduce Falcon Perception, a unified dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer. It uses a hybrid attention pattern (bidirectional among image tokens, causal for prediction tokens) to combine global visual context with autoregressive, variable-length instance generation. To keep dense outputs practical, Falcon Perception retains a lightweight token interface and decodes continuous spatial outputs with specialized heads, enabling parallel high-resolution mask prediction. Our design favors simplicity: we keep a single scalable backbone and shift complexity toward data and training signals, adding small heads only where outputs are continuous and dense. On SA-Co, Falcon Perception improves mask quality to 68.0 Macro-F_1, compared with 62.3 for SAM3. We also introduce PBench, a benchmark targeting compositional prompts (OCR, spatial constraints, relations) and dense long-context regimes, where the model shows larger gains. Finally, we extend the same early-fusion recipe to Falcon OCR, a compact 300M-parameter model that attains 80.3% on olmOCR and 88.64 on OmniDocBench.
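
The hybrid attention pattern mentioned in the abstract (bidirectional among image tokens, causal for prediction tokens) can be illustrated with a small mask-construction sketch. This is not the authors' implementation. It assumes a sequence layout in which all image tokens come first and all text or prediction tokens follow, and the function name `hybrid_attention_mask` and its parameters are hypothetical names chosen for illustration.

```python
import numpy as np


def hybrid_attention_mask(num_image: int, num_text: int) -> np.ndarray:
    """Return a boolean mask where mask[q, k] is True if query q may attend to key k.

    Assumed layout (not taken from the paper): image tokens occupy positions
    [0, num_image), followed by num_text text/prediction tokens.
    """
    n = num_image + num_text
    # Start from a causal (lower-triangular) mask over the whole sequence,
    # so every token sees itself and all earlier positions.
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Open up the image block: image tokens attend to all image tokens,
    # including later ones (bidirectional attention).
    mask[:num_image, :num_image] = True
    return mask


if __name__ == "__main__":
    # 3 image tokens followed by 2 prediction tokens.
    print(hybrid_attention_mask(num_image=3, num_text=2).astype(int))
```

Under these assumptions, each prediction token sees every image token plus the prediction tokens before it, while image tokens never attend to prediction tokens. A mask of this shape could be passed to a standard attention implementation that accepts a boolean "allowed" mask.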