ABACUS: Adapting a Unified Foundation Model to Bridge Image Counting Understanding and Generation

Abstract

ABACUS is a unified vision-language model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without benchmark-specific training. The model builds on an existing 3B-parameter unified foundation model and adapts it to object localization tasks through three key innovations: density-aware adaptive zooming, which uses objectness maps for spatial grounding; a boundary-aware counting policy, trained with GRPO, that eliminates crop-boundary errors; and a cycle-consistent GRPO strategy in which the understanding branch critiques the model's own generated outputs, closing the gap between understanding and generation without external annotations. ABACUS achieves state-of-the-art results on seven benchmarks, outperforming both task-specific specialist models and larger generalist models.
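To make the cycle-consistent GRPO idea more concrete, the following is a minimal sketch of how group-relative advantages could be computed from a count-based reward. It is an illustration under stated assumptions, not the ABACUS implementation: the reward shape in `count_reward`, the function names, and the example numbers are all hypothetical, and only the group normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
from statistics import mean, pstdev


def count_reward(predicted: int, target: int) -> float:
    """Hypothetical reward: 1.0 for an exact count, decaying with relative error.

    This shape is an assumption made for illustration only.
    """
    if predicted < 0 or target < 0:
        raise ValueError("counts must be non-negative")
    return 1.0 / (1.0 + abs(predicted - target) / max(target, 1))


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standard GRPO-style normalization within one group of samples.

    Each reward is centered on the group mean and scaled by the group
    standard deviation (population std here; an assumption of this sketch).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical cycle: the prompt asks for 5 objects, the generation branch
# samples four images, and the understanding branch re-counts each one.
target = 5
recounts = [5, 4, 7, 5]  # made-up counts returned by the understanding branch
rewards = [count_reward(c, target) for c in recounts]
advantages = group_relative_advantages(rewards)

for c, r, a in zip(recounts, rewards, advantages):
    print(f"recount={c} reward={r:.3f} advantage={a:+.3f}")
```

In this sketch, samples whose re-counted number matches the requested count receive positive advantages and the others negative ones, so no external annotation is needed beyond the count stated in the prompt.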