T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Abstract

Consider how humans handle complex reading tasks: they mark key points, infer the relationships among them, and structure the information to guide understanding and responses. Can a large language model likewise benefit from text structure to improve its text-processing performance? To explore this question, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures and consistently boosts performance across eight tasks and three model families. Building on this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve the text-to-structure capabilities of models. T2S-Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation on 45 mainstream models reveals substantial room for improvement: the average accuracy on the multi-hop reasoning task is only 52.1%, and even the most advanced model achieves just 58.1% node accuracy in end-to-end extraction. Furthermore, on Qwen2.5-7B-Instruct, SoT alone yields an average +5.7% improvement across eight diverse text-processing tasks, and fine-tuning on T2S-Bench further increases this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S-Bench. The dataset and evaluation code have been released at https://t2s-bench.github.io/T2S-Bench-Page/.
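To make the idea of an intermediate text structure concrete, the following is a minimal sketch of what a structure-first prompt and a node-level score could look like. It is not the authors' prompt or metric: the template wording, the function names `build_sot_prompt` and `node_accuracy`, and the choice of exact matching after whitespace and case normalization are all assumptions made for illustration.

```python
# Hypothetical sketch only: the template and the scoring rule are assumptions,
# not the prompt or the metric used in the paper.

SOT_TEMPLATE = (
    "Read the text below.\n"
    "Step 1. List the key points as Nodes, one per line.\n"
    "Step 2. List the relationships between nodes as Edges "
    "(source -> target: relation).\n"
    "Step 3. Using only the structure from Steps 1 and 2, answer the question.\n\n"
    "Text:\n{text}\n\n"
    "Question:\n{question}\n"
)


def build_sot_prompt(text: str, question: str) -> str:
    """Fill the assumed structure-first template with a passage and a question."""
    return SOT_TEMPLATE.format(text=text, question=question)


def _normalize(label: str) -> str:
    """Lowercase a node label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def node_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of reference nodes that also appear among the predicted nodes.

    Assumed definition: exact match after normalization, computed over sets.
    """
    ref = {_normalize(n) for n in reference}
    if not ref:
        return 0.0
    pred = {_normalize(n) for n in predicted}
    return len(ref & pred) / len(ref)
```

In this sketch the prompt string would be sent to a model by whatever client is in use, and the node list parsed from its reply would be compared against annotated nodes with `node_accuracy`. A real evaluation would likely need a more tolerant matching rule than exact string equality.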
