Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution
November 18, 2025
Authors: N Dinesh Reddy, Sudeep Pillai
cs.AI
Abstract
We introduce Orion, a visual agent framework that accepts input in any modality and generates output in any modality. Built on an agentic architecture with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models, which produce only descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, optical character recognition (OCR), and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.
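To make the idea of tool orchestration concrete, the following is a minimal sketch of a multi-step visual workflow, not Orion's actual implementation. Every name here (`detect_objects`, `read_text`, `measure_boxes`, `TOOLS`, `run_workflow`) is hypothetical, the tools are stubs that return fixed data, and the plan is hard-coded where a real agent would presumably have a model choose the next tool.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tool signature: each tool reads the shared state and
# returns new fields to merge into it. This is an assumption for
# illustration, not an interface described in the abstract.
Tool = Callable[[Dict[str, Any]], Dict[str, Any]]


def detect_objects(state: Dict[str, Any]) -> Dict[str, Any]:
    # Stub standing in for an object detector; returns one fixed box.
    return {"boxes": [{"label": "sign", "xyxy": (10, 20, 110, 60)}]}


def read_text(state: Dict[str, Any]) -> Dict[str, Any]:
    # Stub standing in for OCR; emits one string per detected box.
    return {"texts": ["STOP" for _ in state.get("boxes", [])]}


def measure_boxes(state: Dict[str, Any]) -> Dict[str, Any]:
    # Symbolic (geometric) step: compute box areas from coordinates.
    areas = []
    for box in state.get("boxes", []):
        x1, y1, x2, y2 = box["xyxy"]
        areas.append((x2 - x1) * (y2 - y1))
    return {"areas": areas}


# Hypothetical registry mapping tool names to callables.
TOOLS: Dict[str, Tool] = {
    "detect_objects": detect_objects,
    "read_text": read_text,
    "measure_boxes": measure_boxes,
}


def run_workflow(plan: List[str], image: Any) -> Dict[str, Any]:
    """Run the named tools in order, merging each result into shared state."""
    state: Dict[str, Any] = {"image": image}
    for step in plan:
        if step not in TOOLS:
            raise KeyError(f"unknown tool: {step}")
        state.update(TOOLS[step](state))
    return state


# Fixed plan for illustration only.
result = run_workflow(["detect_objects", "read_text", "measure_boxes"], image=None)
print(result["texts"], result["areas"])
```

The shared-state dictionary is one simple way to let later tools consume the outputs of earlier ones; the abstract does not say how Orion passes intermediate results between tools.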