IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

Abstract

Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose IndusAgent, a tool-augmented agentic framework for open-vocabulary IAD. Specifically, we first construct Indus-CoT, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, and use it to supervise fine-tuning of the model on rigorous industrial inspection trajectories. Building on this, IndusAgent dynamically orchestrates a set of external tools, namely dynamic region cropping, high-frequency feature enhancement, and prior retrieval, enabling the agent to actively resolve visual ambiguities and isolate subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage, so that tools are invoked only when they are beneficial. Extensive evaluations on five industrial anomaly benchmarks (MVTec-AD, VisA, MPDD, DTD, and SDD) demonstrate that IndusAgent achieves state-of-the-art zero-shot performance among existing methods, validating its robustness and generalization capacity.
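
The abstract does not give the form of the gated objective, so the following is only an illustrative sketch of how such a reward might combine the four terms it names. Everything here is an assumption made for illustration: the names RewardWeights, box_iou, and gated_reward, the weight values, the use of box IoU as the localization score, the choice to gate the localization and type terms on a correct classification, and the fixed per-call tool cost.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class RewardWeights:
    # Hypothetical weights; the abstract does not report any values.
    cls: float = 1.0
    loc: float = 1.0
    type_: float = 0.5
    tool: float = 0.1  # cost charged per tool call


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def gated_reward(
    pred_anomalous: bool,
    true_anomalous: bool,
    pred_box: Optional[Box],
    true_box: Optional[Box],
    pred_type: Optional[str],
    true_type: Optional[str],
    num_tool_calls: int,
    w: RewardWeights = RewardWeights(),
) -> float:
    """Scalar reward for one inspection trajectory (assumed form)."""
    r_cls = 1.0 if pred_anomalous == true_anomalous else 0.0

    # Gate (assumption): localization and type terms count only when the
    # classification is correct and the sample is truly anomalous.
    gate = r_cls if true_anomalous else 0.0

    r_loc = 0.0
    if pred_box is not None and true_box is not None:
        r_loc = box_iou(pred_box, true_box)

    r_type = 1.0 if pred_type is not None and pred_type == true_type else 0.0

    # Every tool call costs a fixed amount, so a call pays off only if it
    # raises the accuracy terms by more than its cost.
    tool_cost = w.tool * num_tool_calls

    return w.cls * r_cls + gate * (w.loc * r_loc + w.type_ * r_type) - tool_cost
```

Under these assumed weights, a correct prediction reached with fewer tool calls always scores higher than the same prediction reached with more, which is one simple way to encode the idea that tools should be used only when they help.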