Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models

Abstract

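As Large Language Models (LLMs) become deeply integrated into human life and increasingly influence decision-making, it is crucial to evaluate whether, and to what extent, they exhibit subjective preferences, opinions, and beliefs. These tendencies may stem from biases within the models, which can shape their behavior, influence the advice and recommendations they offer to users, and potentially reinforce certain viewpoints. This paper presents the Preference, Opinion, and Belief survey (POBs), a benchmark developed to assess LLMs' subjective inclinations across societal, cultural, ethical, and personal domains. We applied our benchmark to evaluate leading open- and closed-source LLMs, measuring desired properties such as reliability, neutrality, and consistency. In addition, we investigated how increasing test-time compute, through reasoning and self-reflection mechanisms, affects these metrics. Although these mechanisms are effective in other tasks, our results show that they offer only limited gains in our domain. Furthermore, we find that newer model versions are becoming less consistent and more biased toward specific viewpoints, highlighting a blind spot and a concerning trend. POBS: https://ibm.github.io/POBS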
