Understanding Video Transformers via Universal Concept Discovery
January 19, 2024
Authors: Matthew Kowal, Achal Dave, Rares Ambrus, Adrien Gaidon, Konstantinos G. Derpanis, Pavel Tokmakov
cs.AI
Abstract
This paper studies the problem of concept-based interpretability of
transformer representations for videos. Concretely, we seek to explain the
decision-making process of video transformers based on high-level,
spatiotemporal concepts that are automatically discovered. Prior research on
concept-based interpretability has concentrated solely on image-level tasks.
In contrast, video models deal with the added temporal dimension, increasing
complexity and posing challenges in identifying dynamic concepts over time. In
this work, we systematically address these challenges by introducing the first
Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose
an efficient approach for the unsupervised identification of units of video
transformer representations (concepts), and for ranking their importance to the
output of a model. The resulting concepts are highly interpretable, revealing
spatiotemporal reasoning mechanisms and object-centric representations in
unstructured video models. Performing this analysis jointly over a diverse set
of supervised and self-supervised representations, we discover that some of
these mechanisms are universal in video transformers. Finally, we demonstrate
that VTCD can be used to improve model performance for fine-grained tasks.
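To make the two stages described above concrete, the sketch below illustrates one plausible reading of the pipeline: cluster per-token spatiotemporal features into concepts, then rank each concept by how much the model's output drops when its tokens are ablated. The abstract does not specify the paper's actual clustering or ranking procedure, so the use of k-means and occlusion-style masking, as well as the function names (`discover_concepts`, `rank_concept_importance`, `model_fn`), are illustrative assumptions, not the published VTCD method.

```python
# A minimal, hypothetical sketch of concept discovery and importance
# ranking in the spirit of VTCD. K-means clustering and masking-based
# importance are illustrative assumptions; the abstract does not
# specify the paper's actual procedure.
import numpy as np
from sklearn.cluster import KMeans

def discover_concepts(features, num_concepts=10):
    """Cluster per-token spatiotemporal features into concepts.

    features: (num_tokens, dim) array of video transformer activations
    from one layer, pooled over a set of videos.
    Returns one concept id per token and the concept centroids.
    """
    kmeans = KMeans(n_clusters=num_concepts, n_init=10, random_state=0)
    labels = kmeans.fit_predict(features)
    return labels, kmeans.cluster_centers_

def rank_concept_importance(model_fn, features, labels, num_concepts):
    """Rank concepts by the output drop when their tokens are ablated.

    model_fn: maps a (num_tokens, dim) feature array to a scalar score
    (e.g., the logit of the predicted class). In practice this requires
    hooking into the model so it can run on modified features.
    """
    base_score = model_fn(features)
    importances = np.zeros(num_concepts)
    for c in range(num_concepts):
        masked = features.copy()
        masked[labels == c] = 0.0  # ablate tokens assigned to concept c
        importances[c] = base_score - model_fn(masked)
    # A larger score drop indicates a more important concept.
    return np.argsort(-importances), importances
```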