IndusAgent: Enhancing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

Abstract

Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose IndusAgent, a tool-augmented agentic framework for open-vocabulary IAD. We first construct Indus-CoT, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, and that provides supervision for fine-tuning the model on rigorous industrial inspection trajectories. Building on this, IndusAgent adaptively orchestrates a set of external tools, including dynamic region cropping, high-frequency feature enhancement, and prior retrieval, so that the agent can actively resolve visual ambiguities and isolate subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage, so that tools are invoked only when they are beneficial. Extensive evaluations on five industrial anomaly benchmarks (MVTec-AD, VisA, MPDD, DTD, and SDD) demonstrate that IndusAgent achieves state-of-the-art zero-shot performance, validating its robustness and generalization capacity.
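
To make the idea of a gated objective concrete, the following is a minimal sketch of one way such a reward could be shaped. It is not the formulation used by IndusAgent, which the abstract does not specify. The `Rollout` fields, the weights `w_loc` and `w_type`, the per-call `tool_cost`, and the choice of gating on classification correctness are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical summary of one inspection trajectory (illustrative only)."""
    pred_anomalous: bool   # agent's image-level decision
    true_anomalous: bool   # ground-truth image-level label
    iou: float             # overlap of predicted and true defect region, in [0, 1]
    type_correct: bool     # whether the predicted anomaly type matches the label
    num_tool_calls: int    # number of external tool invocations


def gated_reward(r: Rollout, w_loc: float = 0.5, w_type: float = 0.5,
                 tool_cost: float = 0.1) -> float:
    """Assumed gated reward: secondary terms count only if classification is right."""
    # Every tool call pays a fixed cost, so a call is worthwhile only when it
    # raises the gated terms below by more than that cost.
    reward = -tool_cost * r.num_tool_calls

    # Gate: a wrong image-level decision earns no task credit at all.
    if r.pred_anomalous != r.true_anomalous:
        return reward

    reward += 1.0  # classification credit
    if r.true_anomalous:
        # Localization and type reasoning are only defined for anomalous images.
        reward += w_loc * r.iou + w_type * float(r.type_correct)
    return reward
```

Under these assumptions, a correct anomalous prediction with good localization outscores the same prediction reached with extra tool calls, and a misclassified image earns nothing beyond the tool penalty, which discourages tool use that does not change the outcome.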