Small but Reliable: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

Abstract

Vision-Language Models (VLMs) have recently achieved strong performance across many tasks, yet prior studies report unsatisfactory results when large language or multimodal models are applied to detecting abnormal patterns in time-series data. Public anomaly detection benchmarks typically provide interval annotations but no natural-language rationales, which makes it difficult to fine-tune VLMs to produce grounded, interpretable decisions. To address this gap, we construct VisAnomBench, a curated benchmark built from public time-series datasets and augmented with high-quality anomaly explanations selected from multiple large VLMs using fine-grained, task-specific rewards. By fine-tuning on this benchmark, we develop VisAnomReasoner, a parameter-efficient VLM for time-series anomaly detection. Experiments on VisAnomBench show that VisAnomReasoner localizes anomalies more accurately and consistently outperforms all baselines, improving precision and F1 by at least 21.23 and 23.87 percentage points, respectively. Additional experiments on the TSB-AD-U benchmark demonstrate strong cross-benchmark generalization, with VisAnomReasoner improving precision and F1 by 9.57 and 13.39 percentage points, respectively.