TrustLLM: Affidabilità nei Modelli Linguistici di Grande Dimensione

Abstract

I grandi modelli linguistici (LLM), esemplificati da ChatGPT, hanno attirato notevole attenzione per le loro eccellenti capacità di elaborazione del linguaggio naturale. Tuttavia, questi LLM presentano molteplici sfide, in particolare nel campo dell'affidabilità. Pertanto, garantire l'affidabilità degli LLM emerge come un tema di grande importanza. Questo articolo introduce TrustLLM, uno studio completo sull'affidabilità negli LLM, che include principi per diverse dimensioni dell'affidabilità, un benchmark consolidato, la valutazione e l'analisi dell'affidabilità per i principali LLM, e una discussione sulle sfide aperte e le direzioni future. Nello specifico, proponiamo innanzitutto un insieme di principi per LLM affidabili che coprono otto diverse dimensioni. Sulla base di questi principi, stabiliamo ulteriormente un benchmark che abbraccia sei dimensioni, tra cui veridicità, sicurezza, equità, robustezza, privacy ed etica delle macchine. Presentiamo poi uno studio che valuta 16 principali LLM in TrustLLM, utilizzando oltre 30 dataset. I nostri risultati mostrano innanzitutto che, in generale, l'affidabilità e l'utilità (ovvero l'efficacia funzionale) sono positivamente correlate. In secondo luogo, le nostre osservazioni rivelano che gli LLM proprietari generalmente superano la maggior parte delle controparti open-source in termini di affidabilità, sollevando preoccupazioni sui potenziali rischi degli LLM open-source ampiamente accessibili. Tuttavia, alcuni LLM open-source si avvicinano molto a quelli proprietari. In terzo luogo, è importante notare che alcuni LLM potrebbero essere eccessivamente calibrati per mostrare affidabilità, al punto da compromettere la loro utilità trattando erroneamente prompt benigni come dannosi e, di conseguenza, non rispondendo. Infine, sottolineiamo l'importanza di garantire la trasparenza non solo nei modelli stessi, ma anche nelle tecnologie che sostengono l'affidabilità. Conoscere le specifiche tecnologie affidabili impiegate è cruciale per analizzarne l'efficacia.

English

Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness.

TrustLLM: Affidabilità nei Modelli Linguistici di Grande Dimensione

TrustLLM: Trustworthiness in Large Language Models

Abstract

Support