Revealing Fine-Grained Values and Opinions in Large Language Models

Abstract

Uncovering latent values and opinions in large language models (LLMs) can help identify biases and mitigate potential harm. Recently, this has been approached by presenting LLMs with survey questions and quantifying their stances towards morally and politically charged statements. However, the stances generated by LLMs can vary greatly depending on how they are prompted, and there are many ways to argue for or against a given position. In this work, we propose to address this by analysing a large and robust dataset of 156k LLM responses to the 62 propositions of the Political Compass Test (PCT), generated by 6 LLMs using 420 prompt variations. We perform a coarse-grained analysis of their generated stances and a fine-grained analysis of the plain-text justifications for those stances. For the fine-grained analysis, we propose to identify tropes in the responses: semantically similar phrases that are recurrent and consistent across different prompts, revealing patterns in the text that a given LLM is prone to produce. We find that demographic features added to prompts significantly affect outcomes on the PCT, reflecting bias. We also find disparities between test results when eliciting closed-form versus open-domain responses. Additionally, patterns in the plain-text rationales, identified via tropes, show that similar justifications are repeatedly generated across models and prompts, even with disparate stances.
