A Multimodal Automated Interpretability Agent
April 22, 2024
Authors: Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, Antonio Torralba
cs.AI
Abstract
This paper describes MAIA, a Multimodal Automated Interpretability Agent.
MAIA is a system that uses neural models to automate neural model understanding
tasks like feature interpretation and failure mode discovery. It equips a
pre-trained vision-language model with a set of tools that support iterative
experimentation on subcomponents of other models to explain their behavior.
These include tools commonly used by human interpretability researchers: for
synthesizing and editing inputs, computing maximally activating exemplars from
real-world datasets, and summarizing and describing experimental results.
Interpretability experiments proposed by MAIA compose these tools to describe
and explain system behavior. We evaluate applications of MAIA to computer
vision models. We first characterize MAIA's ability to describe (neuron-level)
features in learned representations of images. Across several trained models
and a novel dataset of synthetic vision neurons with paired ground-truth
descriptions, MAIA produces descriptions comparable to those generated by
expert human experimenters. We then show that MAIA can aid in two additional
interpretability tasks: reducing sensitivity to spurious features, and
automatically identifying inputs likely to be mis-classified.
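To make the tool use concrete, below is a minimal sketch of one tool of the kind the abstract describes: retrieving maximally activating exemplars for a single neuron from a real-world dataset. It assumes a standard PyTorch vision model and a dataset yielding (image, label) pairs; the function name, arguments, and spatial-pooling choice are illustrative assumptions, not MAIA's actual API.

    # Minimal sketch of a "maximally activating exemplars" tool.
    # Names and pooling choice are illustrative, not MAIA's actual API.
    import torch
    from torch.utils.data import DataLoader

    def top_activating_exemplars(model, layer, unit, dataset, k=15, device="cpu"):
        """Return the k images in `dataset` that most strongly activate `unit` in `layer`."""
        captured = {}

        def hook(_module, _inputs, output):
            # Spatially average conv feature maps so each image yields one scalar per unit.
            captured["acts"] = output.flatten(2).mean(dim=2) if output.dim() == 4 else output

        handle = layer.register_forward_hook(hook)
        model.eval().to(device)

        scores, images = [], []
        with torch.no_grad():
            for batch, _labels in DataLoader(dataset, batch_size=64):  # assumes (image, label) pairs
                model(batch.to(device))
                scores.append(captured["acts"][:, unit].cpu())
                images.append(batch)
        handle.remove()

        scores, images = torch.cat(scores), torch.cat(images)
        top = torch.topk(scores, k).indices
        return images[top], scores[top]

An agent of the kind described could call such a tool, inspect the returned exemplars with its vision-language backbone, and then propose follow-up experiments (for example, editing the exemplars) to test a candidate description of the neuron.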