Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution

Abstract

We introduce Orion, a visual agent framework that accepts inputs in any modality and produces outputs in any modality. Built on an agentic architecture with multi-tool calling, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models, which only produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, optical character recognition (OCR), and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models into production-grade visual intelligence systems. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.
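
The idea of orchestrating specialized tools in a multi-step visual workflow can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not Orion's actual interface: the tool functions (`detect_objects`, `run_ocr`, `measure_area`), the `TOOLS` registry, and `run_workflow` are hypothetical names, and the tools return canned values instead of calling real models. The plan is fixed here, whereas in an agentic system a model would presumably choose it.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]

# Hypothetical stand-ins for real vision tools. Each takes the shared
# state and returns a new state with its own results added.
def detect_objects(state: State) -> State:
    # Canned detection result; a real tool would run a detector on the image.
    return {**state, "boxes": [{"label": "invoice", "bbox": (10, 20, 200, 300)}]}

def run_ocr(state: State) -> State:
    # Depends on the detection step: reads text only from detected regions.
    texts = [f"text in {box['label']}" for box in state.get("boxes", [])]
    return {**state, "texts": texts}

def measure_area(state: State) -> State:
    # Simple geometric analysis: area of each (x0, y0, x1, y1) box in pixels.
    areas = []
    for box in state.get("boxes", []):
        x0, y0, x1, y1 = box["bbox"]
        areas.append((x1 - x0) * (y1 - y0))
    return {**state, "areas": areas}

# Hypothetical registry mapping tool names to callables.
TOOLS: Dict[str, Callable[[State], State]] = {
    "detect": detect_objects,
    "ocr": run_ocr,
    "area": measure_area,
}

def run_workflow(plan: List[str], image_id: str) -> State:
    """Execute the named tools in order, threading state between steps."""
    state: State = {"image": image_id, "trace": []}
    for name in plan:
        if name not in TOOLS:
            raise KeyError(f"unknown tool: {name}")
        result = TOOLS[name](state)
        # Record which tool ran so the workflow can be inspected afterwards.
        state = {**result, "trace": state["trace"] + [name]}
    return state

result = run_workflow(["detect", "ocr", "area"], "img-001")
```

Each step consumes what earlier steps produced, so `result` ends up holding the detected boxes, the text read from them, the computed areas, and a trace of the tools that ran.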