Falcon Perception

Abstract

Perception-centric systems are typically implemented with a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential, or can a single early-fusion stack handle both perception and task modeling at scale? We introduce Falcon Perception, a unified dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer. It uses a hybrid attention pattern (bidirectional among image tokens, causal for prediction tokens) to combine global visual context with autoregressive, variable-length instance generation. To keep dense outputs practical, Falcon Perception retains a lightweight token interface and decodes continuous spatial outputs with specialized heads, enabling parallel high-resolution mask prediction. Our design favors simplicity: we keep a single scalable backbone and shift complexity toward data and training signals, adding small heads only where outputs are continuous and dense. On SA-Co, Falcon Perception improves mask quality to 68.0 Macro-F_1, compared with 62.3 for SAM3. We also introduce PBench, a benchmark targeting compositional prompts (OCR, spatial constraints, relations) and dense long-context regimes, where the model shows larger gains. Finally, we extend the same early-fusion recipe to Falcon OCR, a compact 300M-parameter model that attains 80.3% on olmOCR and 88.64 on OmniDocBench.
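The hybrid attention pattern mentioned above (bidirectional among image tokens, causal for prediction tokens) can be illustrated with a small mask-building sketch. This is not the authors' implementation: the function name `hybrid_attention_mask`, the sequence layout (all image tokens first, then prediction tokens), and the choice to let each prediction token attend to itself are assumptions made for illustration. The abstract does not say how prompt text tokens are masked, so they are left out here.

```python
from typing import List


def hybrid_attention_mask(num_image: int, num_pred: int) -> List[List[bool]]:
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout (hypothetical): positions [0, num_image) are image tokens,
    positions [num_image, num_image + num_pred) are prediction tokens.
    """
    total = num_image + num_pred
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_image:
                # Image queries see every image token (bidirectional),
                # but never look ahead at prediction tokens.
                mask[q][k] = k < num_image
            else:
                # Prediction queries are causal: they see all image tokens
                # (which come earlier) plus predictions up to and including q.
                mask[q][k] = k <= q
    return mask


if __name__ == "__main__":
    # Three image tokens followed by two prediction tokens.
    for row in hybrid_attention_mask(3, 2):
        print(" ".join("1" if allowed else "0" for allowed in row))
    # Prints:
    # 1 1 1 0 0
    # 1 1 1 0 0
    # 1 1 1 0 0
    # 1 1 1 1 0
    # 1 1 1 1 1
```

The upper-left block is fully visible, which gives every image token global visual context, while the lower rows form the usual causal triangle that allows a variable number of instances to be generated one token at a time.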