IndusAgent: Enhancing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

Abstract

Multimodal large language models (MLLMs) have shown a remarkable ability to bridge visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose IndusAgent, a tool-augmented agentic framework for open-vocabulary IAD. Specifically, we first construct Indus-CoT, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, providing supervision for fine-tuning the model on rigorous industrial inspection trajectories. Building on this, IndusAgent dynamically orchestrates a set of external tools (dynamic region cropping, high-frequency feature enhancement, and prior retrieval), enabling the agent to actively resolve visual ambiguities and isolate subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and tool-use efficiency, ensuring that tools are invoked only when beneficial. Extensive evaluations on five industrial anomaly benchmarks (MVTec-AD, VisA, MPDD, DTD, and SDD) show that IndusAgent achieves state-of-the-art zero-shot performance compared with existing methods, validating its robustness and generalization capacity.
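
To make the idea of a gated objective more concrete, the following is a minimal sketch of one way such a reward could be shaped. It is not the formulation used by IndusAgent, which the abstract does not specify. The sketch assumes that the gate is the correctness of the anomaly/normal decision, that the localization, type, and tool-efficiency terms are credited only when that gate is open, and that tool efficiency decreases linearly with the number of tool calls. The names `Rollout` and `gated_reward`, all weights, and the `max_calls` budget are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One inspection trajectory; every field is an illustrative assumption."""
    cls_correct: bool    # anomaly/normal decision matches the label
    loc_iou: float       # IoU between predicted and reference region, in [0, 1]
    type_correct: bool   # predicted anomaly type matches the label
    num_tool_calls: int  # number of external tool invocations in the rollout


def gated_reward(r: Rollout,
                 w_cls: float = 1.0,
                 w_loc: float = 0.5,
                 w_type: float = 0.5,
                 w_tool: float = 0.2,
                 max_calls: int = 4) -> float:
    """Hypothetical gated reward: secondary terms count only if classification is right."""
    if not r.cls_correct:
        # Gate closed: no credit for localization, type, or tool efficiency.
        return 0.0

    # Tool efficiency is 1.0 with no calls and falls linearly to 0.0 at max_calls.
    used = min(r.num_tool_calls, max_calls)
    efficiency = 1.0 - used / max_calls

    return (w_cls
            + w_loc * r.loc_iou
            + w_type * float(r.type_correct)
            + w_tool * efficiency)


if __name__ == "__main__":
    frugal = Rollout(cls_correct=True, loc_iou=0.8, type_correct=True, num_tool_calls=0)
    heavy = Rollout(cls_correct=True, loc_iou=0.8, type_correct=True, num_tool_calls=3)
    wrong = Rollout(cls_correct=False, loc_iou=0.9, type_correct=True, num_tool_calls=0)
    for name, rollout in [("frugal", frugal), ("heavy", heavy), ("wrong", wrong)]:
        print(name, gated_reward(rollout))
```

Under these assumptions, a tool call pays off only if it helps turn an incorrect decision into a correct one or improves localization or type prediction by more than the efficiency it costs, which is one plausible reading of invoking tools only when beneficial.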