IndusAgent: Enhancing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

Abstract

Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose IndusAgent, a tool-augmented agentic framework for open-vocabulary IAD. Specifically, we first construct Indus-CoT, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, and that provides supervision for fine-tuning the model on rigorous industrial inspection trajectories. Building on this, IndusAgent dynamically orchestrates a set of external tools, including dynamic region cropping, high-frequency feature enhancement, and prior retrieval, which lets the agent actively resolve visual ambiguities and isolate subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage, so that tools are invoked only when they are beneficial. Extensive evaluations on five industrial anomaly benchmarks (MVTec-AD, VisA, MPDD, DTD, and SDD) show that IndusAgent achieves state-of-the-art zero-shot performance among existing methods, demonstrating its robustness and generalization capability.
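
To make the idea of a gated objective concrete, the following is a minimal illustrative sketch of how a per-rollout reward could combine the four terms named above. It is not the formulation used by IndusAgent: the abstract does not specify the reward, so the `Rollout` fields, the gating rule (localization and type terms count only when the classification is correct on an anomalous sample; the tool bonus is granted only when the final answer is correct), and all weights are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical summary of one agent trajectory (all fields are assumptions)."""
    pred_anomalous: bool   # model's final normal/anomalous decision
    true_anomalous: bool   # ground-truth label
    iou: float             # overlap between predicted and true defect region, in [0, 1]
    type_correct: bool     # whether the predicted anomaly type matches the label
    num_tool_calls: int    # number of external tool invocations in the trajectory


def gated_reward(r: Rollout,
                 w_loc: float = 0.5,
                 w_type: float = 0.5,
                 tool_bonus: float = 0.2,
                 tool_cost: float = 0.05) -> float:
    """Illustrative gated reward; weights and gating rule are assumed, not from the paper."""
    cls_correct = r.pred_anomalous == r.true_anomalous
    reward = 1.0 if cls_correct else 0.0

    # Gate 1: localization and type terms only count when the sample is
    # anomalous and was classified correctly.
    if cls_correct and r.true_anomalous:
        reward += w_loc * r.iou
        reward += w_type * (1.0 if r.type_correct else 0.0)

    # Gate 2: tool use earns a bonus only if the final answer is correct,
    # while every call pays a small cost, discouraging unhelpful invocations.
    if r.num_tool_calls > 0:
        if cls_correct:
            reward += tool_bonus
        reward -= tool_cost * r.num_tool_calls

    return reward


if __name__ == "__main__":
    helpful = Rollout(True, True, iou=0.8, type_correct=True, num_tool_calls=1)
    wasteful = Rollout(False, True, iou=0.0, type_correct=False, num_tool_calls=2)
    print(gated_reward(helpful), gated_reward(wasteful))
```

Under these assumed weights, a trajectory that calls tools but still answers incorrectly receives a negative reward, which is one simple way to discourage tool invocation that does not help.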