Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution
November 18, 2025
Authors: N Dinesh Reddy, Sudeep Pillai
cs.AI
Abstract
We introduce Orion, a visual agent framework that can take in any modality and generate any modality. Using an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, optical character recognition (OCR), and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.
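To make the idea of orchestrating specialized tools in a multi-step workflow more concrete, the following is a minimal Python sketch. It is not Orion's implementation or API: the `ToolAgent` class, the tool names, and the stub tools that return fixed values are all hypothetical and stand in for real detectors and OCR engines. The plan is hard-coded here, whereas an agent of the kind the abstract describes would presumably have a model choose the steps.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical tool signature: reads the shared state, returns new entries for it.
Tool = Callable[[Dict[str, Any]], Dict[str, Any]]


def detect_objects(state: Dict[str, Any]) -> Dict[str, Any]:
    # Stub standing in for an object detector; boxes are (x0, y0, x1, y1).
    return {"boxes": [(10, 10, 50, 30), (60, 20, 100, 80)]}


def run_ocr(state: Dict[str, Any]) -> Dict[str, Any]:
    # Stub standing in for OCR; produces one string per detected box.
    return {"texts": [f"text_{i}" for i, _ in enumerate(state["boxes"])]}


def box_areas(state: Dict[str, Any]) -> Dict[str, Any]:
    # Symbolic (non-neural) geometry step: exact areas from box coordinates.
    return {"areas": [(x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in state["boxes"]]}


@dataclass
class ToolAgent:
    """Runs a sequence of named tools, threading a shared state through them."""
    tools: Dict[str, Tool]
    trace: List[str] = field(default_factory=list)

    def run(self, plan: List[str], image: Any) -> Dict[str, Any]:
        state: Dict[str, Any] = {"image": image}
        for name in plan:
            if name not in self.tools:
                raise KeyError(f"unknown tool: {name}")
            # Each tool sees everything produced by earlier steps.
            state.update(self.tools[name](state))
            self.trace.append(name)
        return state


agent = ToolAgent(tools={"detect": detect_objects, "ocr": run_ocr, "area": box_areas})
result = agent.run(["detect", "ocr", "area"], image=None)
print(result["texts"], result["areas"], agent.trace)
```

The point of the sketch is the division of labor: perception steps (`detect`, `ocr`) write structured results into a shared state, and a deterministic step (`area`) computes over those results instead of asking a model to estimate them.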