Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning, and Execution

Abstract

We introduce Orion, a visual agent framework that accepts inputs in any modality and produces outputs in any modality. Built on an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models, which produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, optical character recognition (OCR), and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.