Falcon Perception

Abstract

Perception-centric systems are typically implemented as a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential, or can a single early-fusion stack handle both perception and task modeling at scale? We introduce Falcon Perception, a unified dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer. It uses a hybrid attention pattern (bidirectional among image tokens, causal for prediction tokens) to combine global visual context with autoregressive, variable-length instance generation. To keep dense outputs practical, Falcon Perception retains a lightweight token interface and decodes continuous spatial outputs with specialized heads, enabling parallel high-resolution mask prediction. Our design favors simplicity: we keep a single scalable backbone and shift complexity toward data and training signals, adding small heads only where outputs are continuous and dense. On SA-Co, Falcon Perception improves mask quality to 68.0 Macro-F_1, compared with 62.3 for SAM3. We also introduce PBench, a benchmark targeting compositional prompts (OCR, spatial constraints, relations) and dense long-context regimes, where the model shows larger gains. Finally, we extend the same early-fusion recipe to Falcon OCR, a compact 300M-parameter model that attains 80.3% on olmOCR and 88.64 on OmniDocBench.
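The hybrid attention pattern mentioned in the abstract (bidirectional among image tokens, causal for prediction tokens) can be illustrated with a small mask-building sketch. This is not the authors' implementation. It assumes, purely for illustration, that image tokens occupy the first positions of the sequence and that every later token is handled causally; the function name `hybrid_attention_mask` and its parameters are hypothetical.

```python
def hybrid_attention_mask(num_image_tokens, num_prediction_tokens):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumption (not from the source): image tokens come first in the sequence,
    followed by the prediction tokens.
    """
    n = num_image_tokens + num_prediction_tokens
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            # Image tokens see every other image token (bidirectional block).
            both_image = q < num_image_tokens and k < num_image_tokens
            # All tokens see themselves and earlier positions (causal rule).
            causal = k <= q
            row.append(both_image or causal)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Three image tokens followed by two prediction tokens.
    for row in hybrid_attention_mask(3, 2):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

With three image tokens and two prediction tokens, the first three rows allow attention to all three image positions and nothing later, while each prediction row sees all image tokens plus the prediction tokens up to and including itself. In a real model such a mask would typically be passed to the attention operation so that disallowed positions are excluded before the softmax.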