Small but Reliable: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

Abstract

Recent advances in Vision-Language Models (VLMs) have achieved impressive performance across many tasks, yet prior studies report that large language and multimodal models perform poorly at detecting anomalous patterns in sequential data. Public anomaly detection benchmarks typically provide interval annotations but no natural-language rationales, which makes it difficult to fine-tune VLMs to produce grounded, interpretable decisions. To address this gap, we construct VisAnomBench, a curated benchmark built from public time-series datasets and augmented with high-quality anomaly explanations selected from the outputs of multiple large VLMs using fine-grained, task-specific rewards. By fine-tuning on this benchmark, we develop VisAnomReasoner, a parameter-efficient VLM for time-series anomaly detection. Experiments on VisAnomBench show that VisAnomReasoner localizes anomalies more accurately and consistently outperforms all baselines, with improvements of at least 21.23 and 23.87 percentage points in precision and F1, respectively. Additional experiments on the TSB-AD-U benchmark demonstrate strong cross-benchmark generalization, with VisAnomReasoner improving precision and F1 by 9.57 and 13.39 percentage points, respectively.
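As an illustration of how precision and F1 can be computed when both the labels and the predictions are anomaly intervals, the following is a minimal sketch. It assumes point-wise scoring over inclusive index intervals; the abstract does not specify the exact evaluation protocol, so the function names (`to_points`, `precision_recall_f1`) and the scoring convention are hypothetical and may differ from what VisAnomBench or TSB-AD-U actually use.

```python
from typing import Iterable, Set, Tuple

# Assumption: an interval is an inclusive (start, end) pair of time indices.
Interval = Tuple[int, int]


def to_points(intervals: Iterable[Interval]) -> Set[int]:
    """Expand inclusive intervals into the set of time indices they cover."""
    points: Set[int] = set()
    for start, end in intervals:
        points.update(range(start, end + 1))
    return points


def precision_recall_f1(
    predicted: Iterable[Interval], labeled: Iterable[Interval]
) -> Tuple[float, float, float]:
    """Point-wise precision, recall, and F1 between two interval sets."""
    pred_points = to_points(predicted)
    true_points = to_points(labeled)
    true_positives = len(pred_points & true_points)

    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(pred_points) if pred_points else 0.0
    recall = true_positives / len(true_points) if true_points else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative data only: one labeled interval and one shifted prediction.
labeled = [(10, 19)]
predicted = [(15, 24)]
p, r, f = precision_recall_f1(predicted, labeled)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Under this convention, a gain stated in percentage points is the difference between two such scores after each is multiplied by 100.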