Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution

Abstract

We introduce Orion, a visual agent framework that accepts input in any modality and generates output in any modality. Built on an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models, which produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, optical character recognition (OCR), and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.
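The idea of orchestrating specialized vision tools in a multi-step workflow can be illustrated with a minimal sketch. This is not Orion's implementation, which the abstract does not describe. Every name here (`ToolRegistry`, `run_workflow`, and the three stub tools) is hypothetical, the tools return placeholder values instead of running real models, and the plan is a fixed list, whereas an agent would presumably have a model choose each step.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical tool signature: a tool reads the shared workflow state
# and returns its result. Nothing here reflects Orion's actual API.
Tool = Callable[[Dict[str, Any]], Any]


@dataclass
class ToolRegistry:
    """Maps tool names to callables so a planner can invoke them by name."""
    tools: Dict[str, Tool] = field(default_factory=dict)

    def register(self, name: str, fn: Tool) -> None:
        self.tools[name] = fn

    def call(self, name: str, state: Dict[str, Any]) -> Any:
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        return self.tools[name](state)


def run_workflow(registry: ToolRegistry, plan: List[str], image: Any) -> Dict[str, Any]:
    """Run each named tool in order, storing its result under its own name."""
    state: Dict[str, Any] = {"image": image}
    for step in plan:
        state[step] = registry.call(step, state)
    return state


# Stub tools with placeholder outputs (assumptions, not real model results).
def detect_objects(state: Dict[str, Any]) -> List[Dict[str, Any]]:
    return [{"label": "sign", "box": (10, 20, 110, 70)}]


def read_text(state: Dict[str, Any]) -> List[Dict[str, Any]]:
    # OCR stub that consumes the boxes produced by the detection step.
    return [{"box": d["box"], "text": "<stub text>"} for d in state["detect_objects"]]


def measure_boxes(state: Dict[str, Any]) -> List[int]:
    # Geometric analysis stub: area of each detected box in pixels.
    return [(x1 - x0) * (y1 - y0) for (x0, y0, x1, y1) in
            (d["box"] for d in state["detect_objects"])]


registry = ToolRegistry()
registry.register("detect_objects", detect_objects)
registry.register("read_text", read_text)
registry.register("measure_boxes", measure_boxes)

# Fixed plan for illustration; later steps depend on earlier results.
result = run_workflow(registry, ["detect_objects", "read_text", "measure_boxes"], image=None)
```

The shared `state` dictionary is what lets later steps build on earlier ones: the OCR and geometry stubs both read the boxes written by the detection step, which is the sense in which the workflow is multi-step and not a set of independent calls.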
