Understanding Video Transformers via Universal Concept Discovery

Abstract

This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that are automatically discovered. Prior research on concept-based interpretability has concentrated solely on image-level tasks. By comparison, video models must handle an added temporal dimension, which increases complexity and makes it harder to identify dynamic concepts over time. In this work, we systematically address these challenges by introducing the first Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose an efficient approach for the unsupervised identification of units of video transformer representations, which we call concepts, and for ranking their importance to the output of a model. The resulting concepts are highly interpretable, revealing spatiotemporal reasoning mechanisms and object-centric representations in unstructured video models. Performing this analysis jointly over a diverse set of supervised and self-supervised representations, we discover that some of these mechanisms are universal in video transformers. Finally, we demonstrate that VTCD can be used to improve model performance for fine-grained tasks.
