Evaluate Bias without Manual Test Sets: A Concept Representation Perspective for LLMs
May 21, 2025
Authors: Lang Gao, Kaiyang Wan, Wei Liu, Chenxi Wang, Zirui Song, Zixiang Xu, Yanbo Wang, Veselin Stoyanov, Xiuying Chen
cs.AI
Abstract
Bias in Large Language Models (LLMs) significantly undermines their
reliability and fairness. We focus on a common form of bias: when two reference
concepts in the model's concept space, such as sentiment polarities (e.g.,
"positive" and "negative"), are asymmetrically correlated with a third, target
concept, such as a reviewing aspect, the model exhibits unintended bias. For
instance, the understanding of "food" should not skew toward any particular
sentiment. Existing bias evaluation methods assess behavioral differences of
LLMs by constructing labeled data for different social groups and measuring
model responses across them, a process that requires substantial human effort
and captures only a limited set of social concepts. To overcome these
limitations, we propose BiasLens, a test-set-free bias analysis framework based
on the structure of the model's vector space. BiasLens combines Concept
Activation Vectors (CAVs) with Sparse Autoencoders (SAEs) to extract
interpretable concept representations, and quantifies bias by measuring the
variation in representational similarity between the target concept and each of
the reference concepts. Even without labeled data, BiasLens shows strong
agreement with traditional bias evaluation metrics (Spearman correlation r >
0.85). Moreover, BiasLens reveals forms of bias that are difficult to detect
using existing methods. For example, in simulated clinical scenarios, a
patient's insurance status can cause the LLM to produce biased diagnostic
assessments. Overall, BiasLens offers a scalable, interpretable, and efficient
paradigm for bias discovery, paving the way for improving fairness and
transparency in LLMs.
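To make the bias measure described above concrete, here is a minimal sketch of how the representational-similarity comparison could look once concept directions have been extracted (e.g., CAVs fit on hidden activations, or SAE feature directions). The function names, the cosine-based similarity, and the simple difference score are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two concept direction vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def bias_score(target: np.ndarray, ref_pos: np.ndarray, ref_neg: np.ndarray) -> float:
    """Asymmetry of the target concept's similarity to two reference concepts.

    A value near 0 means the target concept (e.g., "food") relates to the
    two reference concepts (e.g., "positive" / "negative") symmetrically;
    larger magnitudes indicate skew toward one of them.
    """
    return cosine_similarity(target, ref_pos) - cosine_similarity(target, ref_neg)

# Toy example with random stand-ins for the extracted concept directions.
rng = np.random.default_rng(0)
d = 768  # hidden size of the probed layer (illustrative)
food_dir = rng.normal(size=d)
positive_dir = rng.normal(size=d)
negative_dir = rng.normal(size=d)
print(f"bias score: {bias_score(food_dir, positive_dir, negative_dir):+.4f}")
```

In practice the concept directions would come from the model's own vector space rather than random vectors, and the score could be aggregated across layers or reference-concept pairs; this snippet only illustrates the core idea of comparing a target concept's similarity to each reference concept.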