Evaluate Bias without Manual Test Sets: A Concept Representation Perspective for LLMs

Abstract

Bias in Large Language Models (LLMs) significantly undermines their reliability and fairness. We focus on a common form of bias: when two reference concepts in the model's concept space, such as the sentiment polarities "positive" and "negative", are asymmetrically correlated with a third, target concept, such as a review aspect, the model exhibits unintended bias. For instance, the model's understanding of "food" should not skew toward either sentiment. Existing bias evaluation methods assess behavioral differences of LLMs by constructing labeled data for different social groups and measuring model responses across them, a process that requires substantial human effort and captures only a limited set of social concepts. To overcome these limitations, we propose BiasLens, a test-set-free bias analysis framework based on the structure of the model's vector space. BiasLens combines Concept Activation Vectors (CAVs) with Sparse Autoencoders (SAEs) to extract interpretable concept representations, and quantifies bias by measuring the variation in representational similarity between the target concept and each of the reference concepts. Even without labeled data, BiasLens shows strong agreement with traditional bias evaluation metrics (Spearman correlation r > 0.85). Moreover, BiasLens reveals forms of bias that are difficult to detect using existing methods. For example, in simulated clinical scenarios, a patient's insurance status can cause the LLM to produce biased diagnostic assessments. Overall, BiasLens offers a scalable, interpretable, and efficient paradigm for bias discovery, paving the way for improving fairness and transparency in LLMs.
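
The abstract does not give the exact formula, so the following is only a minimal sketch of the general idea of comparing a target concept's similarity to two reference concepts. It assumes concept vectors are already available as plain lists of floats, uses cosine similarity as the similarity measure, and takes the absolute difference as the score; the function names `cosine_similarity` and `bias_score` and the toy vectors are hypothetical and are not taken from BiasLens.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_u * norm_v)


def bias_score(
    target: Sequence[float],
    ref_a: Sequence[float],
    ref_b: Sequence[float],
) -> float:
    """Assumed score: gap between the target's similarity to each reference.

    A value near 0 means the target is about equally similar to both
    reference concepts; a larger value means it leans toward one of them.
    """
    return abs(cosine_similarity(target, ref_a) - cosine_similarity(target, ref_b))


# Toy 2-D vectors (illustrative only, not real model activations).
positive = [1.0, 0.0]
negative = [0.0, 1.0]
balanced_target = [1.0, 1.0]  # equally close to both references
skewed_target = [1.0, 0.0]    # aligned with "positive" only

balanced = bias_score(balanced_target, positive, negative)
skewed = bias_score(skewed_target, positive, negative)
```

Here `balanced` comes out at zero and `skewed` at one, which matches the intuition that a target concept such as "food" should sit equally close to both sentiment references.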
