TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

Abstract

While large language models (LLMs) have demonstrated strong performance on factoid question answering, they remain prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Truthfulness requires more than accuracy: models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. This reward incentivizes models to reduce hallucinations not only by providing correct responses, but also by abstaining when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. An in-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, the proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.
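
To make the contrast between a binary and a ternary reward concrete, the following is a minimal sketch, not the authors' implementation. The abstract only states that the reward is ternary; the specific values (+1 for correct, 0 for abstention, -1 for hallucination), the function names, and the use of the population standard deviation in the GRPO-style group normalization are all assumptions made for illustration.

```python
from statistics import mean, pstdev

# Assumed reward values: the abstract says the reward is ternary
# but does not give the numbers used.
TERNARY = {"correct": 1.0, "abstain": 0.0, "hallucination": -1.0}


def ternary_reward(outcome: str) -> float:
    """Hypothetical ternary reward: separates abstention from hallucination."""
    return TERNARY[outcome]


def binary_reward(outcome: str) -> float:
    """Accuracy-only reward: every non-correct outcome scores the same."""
    return 1.0 if outcome == "correct" else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: normalize each reward within its sampled group.

    Uses the population standard deviation (an assumption); eps avoids
    division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled responses to the same question (illustrative labels).
outcomes = ["correct", "abstain", "hallucination", "hallucination"]

ternary_adv = group_advantages([ternary_reward(o) for o in outcomes])
binary_adv = group_advantages([binary_reward(o) for o in outcomes])

for o, t, b in zip(outcomes, ternary_adv, binary_adv):
    print(f"{o:14s} ternary={t:+.2f} binary={b:+.2f}")
```

Under these assumed values, the abstaining response receives a higher advantage than the hallucinated ones with the ternary reward, whereas the binary reward assigns both the same advantage and so gives the policy no signal to prefer abstention over a wrong answer.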
