SciLT: Long-Tailed Classification in Scientific Image Domains

Abstract

Long-tailed recognition has benefited from foundation models and fine-tuning paradigms, yet existing studies and benchmarks are mainly confined to natural image domains, where pre-training and fine-tuning data share similar distributions. In contrast, scientific images exhibit distinct visual characteristics and supervision signals, raising questions about the effectiveness of fine-tuning foundation models in such settings. In this work, we investigate scientific long-tailed recognition under a purely visual and parameter-efficient fine-tuning (PEFT) paradigm. Experiments on three scientific benchmarks show that fine-tuning foundation models yields limited gains and reveal that penultimate-layer features play an important role, particularly for tail classes. Motivated by these findings, we propose SciLT, a framework that exploits multi-level representations through adaptive feature fusion and dual-supervision learning. By jointly leveraging penultimate- and final-layer features, SciLT achieves balanced performance across head and tail classes. Extensive experiments demonstrate that SciLT consistently outperforms existing methods, establishing a strong and practical baseline for scientific long-tailed recognition and providing valuable guidance for adapting foundation models to scientific data with substantial domain shifts.
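The abstract does not specify how the fusion or the dual supervision is implemented, so the following is only an illustrative sketch of one way such a scheme could look, not the SciLT method itself. It assumes a single scalar sigmoid gate that mixes penultimate- and final-layer feature vectors of equal dimension, a main classifier on the fused features, and an auxiliary classifier on the penultimate features. All names (`fuse`, `dual_supervision_loss`, `gate_logit`, `aux_weight`) are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-likelihood of the target class index.
    return -math.log(softmax(logits)[target])

def linear(weights, features):
    # weights holds one row per class; returns one logit per class.
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def fuse(penultimate, final, gate_logit):
    # Assumed form of "adaptive fusion": a sigmoid gate g in (0, 1)
    # forms a convex combination of the two feature vectors.
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * p + (1.0 - g) * f for p, f in zip(penultimate, final)]

def dual_supervision_loss(penultimate, final, gate_logit,
                          w_fused, w_aux, target, aux_weight=0.5):
    # Assumed form of "dual supervision": a main loss on the fused
    # features plus a weighted auxiliary loss on penultimate features.
    fused = fuse(penultimate, final, gate_logit)
    main = cross_entropy(linear(w_fused, fused), target)
    aux = cross_entropy(linear(w_aux, penultimate), target)
    return main + aux_weight * aux
```

In a real setting the gate and both classifier weight matrices would be learned alongside the PEFT parameters, and the gate could be predicted per sample rather than shared; those choices are left open here because the abstract gives no detail.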
