T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Abstract

Consider how humans handle complex reading tasks: they mark key points, infer the relationships among them, and structure the information to guide understanding and responses. Can a large language model likewise use text structure to improve its text-processing performance? To explore this question, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures and that consistently boosts performance across eight tasks and three model families. Building on this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve the text-to-structure capabilities of models. T2S-Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation of 45 mainstream models reveals substantial room for improvement: the average accuracy on the multi-hop reasoning task is only 52.1%, and even the most advanced model reaches only 58.1% node accuracy in end-to-end extraction. Furthermore, on Qwen2.5-7B-Instruct, SoT alone yields an average improvement of +5.7% across eight diverse text-processing tasks, and fine-tuning on T2S-Bench raises this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S-Bench. The dataset and evaluation code are available at https://t2s-bench.github.io/T2S-Bench-Page/.
