TrustLLM: Trustworthiness in Large Language Models

Abstract

Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly with respect to trustworthiness, so ensuring that LLMs are trustworthy has become an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs that covers principles for different dimensions of trustworthiness, an established benchmark, an evaluation and analysis of trustworthiness for mainstream LLMs, and a discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions: truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs on TrustLLM, which comprises over 30 datasets. First, our findings show that, in general, trustworthiness and utility (i.e., functional effectiveness) are positively related. Second, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Third, some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently refusing to respond. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing which specific trustworthiness technologies have been employed is crucial for analyzing their effectiveness.
