Understanding Video Transformers via Universal Concept Discovery

Abstract

This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers in terms of high-level, spatiotemporal concepts that are discovered automatically. Prior research on concept-based interpretability has concentrated solely on image-level tasks. Video models, by comparison, must handle an added temporal dimension, which increases complexity and makes it harder to identify dynamic concepts over time. In this work, we systematically address these challenges by introducing the first Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose an efficient approach for the unsupervised identification of units of video transformer representations (concepts) and for ranking their importance to the output of a model. The resulting concepts are highly interpretable, revealing spatiotemporal reasoning mechanisms and object-centric representations in unstructured video models. Performing this analysis jointly over a diverse set of supervised and self-supervised representations, we discover that some of these mechanisms are universal in video transformers. Finally, we demonstrate that VTCD can be used to improve model performance for fine-grained tasks.
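
The two steps named in the abstract, unsupervised identification of concepts and ranking them by importance to the model output, can be illustrated with a toy sketch. This is not the VTCD algorithm itself: the abstract does not specify the clustering or the importance measure, so the sketch assumes plain k-means over token features and a masking-based score, and `toy_model`, `kmeans`, `rank_concepts`, the feature values, and the weights are all hypothetical stand-ins.

```python
import math


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation (assumed method)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every token to its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(x) / len(members) for x in zip(*members)]
    return labels


def toy_model(tokens, weights):
    """Hypothetical stand-in for a model head: mean-pool tokens, then a linear score."""
    pooled = [sum(x) / len(tokens) for x in zip(*tokens)]
    return sum(w * v for w, v in zip(weights, pooled))


def rank_concepts(tokens, labels, weights):
    """Rank concepts by how much the output changes when their tokens are zeroed."""
    base = toy_model(tokens, weights)
    drops = {}
    for concept in sorted(set(labels)):
        masked = [
            [0.0] * len(t) if lab == concept else t
            for t, lab in zip(tokens, labels)
        ]
        drops[concept] = base - toy_model(masked, weights)
    return sorted(drops.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Six made-up token features, imagined as patches taken from several frames.
tokens = [
    [1.0, 0.0], [0.9, 0.1],   # one group of similar tokens
    [0.0, 1.0], [0.1, 0.9],   # a second group
    [5.0, 5.0], [5.1, 4.9],   # a third, high-magnitude group
]
weights = [1.0, 0.5]          # made-up linear head

labels = kmeans(tokens, k=3)
ranking = rank_concepts(tokens, labels, weights)
for concept, drop in ranking:
    print(f"concept {concept}: output change {drop:.3f}")
```

In this toy setting a concept is simply a cluster of similar token features, and its importance is the change in the scalar output when those tokens are removed. A real video transformer would require features extracted from a chosen layer and a far more careful importance estimate.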
