
T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

March 4, 2026
Authors: Qinsi Wang, Hancheng Ye, Jinhee Kim, Jinghan Ke, Yifei Wang, Martin Kuo, Zishan Shao, Dongting Li, Yueqian Lin, Ting Jiang, Chiyue Wei, Qi Qian, Wei Wen, Helen Li, Yiran Chen
cs.AI

Abstract

Think about how humans handle complex reading tasks: they mark key points, infer the relationships among them, and structure the information to guide understanding and responses. Likewise, can a large language model benefit from text structure to improve its text-processing performance? To explore this question, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures and consistently boosts performance across eight tasks and three model families. Building on this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve the text-to-structure capabilities of models. T2S-Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation of 45 mainstream models reveals substantial room for improvement: the average accuracy on the multi-hop reasoning task is only 52.1%, and even the most advanced model achieves only 58.1% node accuracy in end-to-end extraction. Furthermore, on Qwen2.5-7B-Instruct, SoT alone yields an average +5.7% improvement across eight diverse text-processing tasks, and fine-tuning on T2S-Bench further increases this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S-Bench. The dataset and evaluation code have been released at https://t2s-bench.github.io/T2S-Bench-Page/.
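To make the idea of an intermediate text structure concrete, the following is a minimal sketch, not the authors' actual prompt or metric. The prompt wording, the function names `build_sot_prompt` and `node_accuracy`, and the definition of node accuracy as the fraction of gold nodes recovered after simple normalization are all assumptions made for illustration; the released evaluation code may define these differently.

```python
def build_sot_prompt(passage: str, question: str) -> str:
    """Build a structure-first prompt (hypothetical wording, not the paper's)."""
    return (
        "Read the passage and work in two steps.\n"
        "Step 1: List the key points as nodes and the relationships "
        "between them as links.\n"
        "Step 2: Use that structure to answer the question.\n\n"
        f"Passage:\n{passage}\n\n"
        f"Question:\n{question}\n"
    )


def _normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivially different labels match."""
    return " ".join(label.lower().split())


def node_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Assumed metric: share of gold nodes that appear among the predictions."""
    gold_set = {_normalize(g) for g in gold}
    if not gold_set:
        return 0.0
    pred_set = {_normalize(p) for p in predicted}
    return len(gold_set & pred_set) / len(gold_set)


if __name__ == "__main__":
    prompt = build_sot_prompt("Cells divide by mitosis.", "How do cells divide?")
    print(prompt)
    # Two of the four gold nodes are recovered under this assumed definition.
    print(node_accuracy(["Encoder ", "decoder"],
                        ["encoder", "Decoder", "attention", "loss"]))
```

The string returned by `build_sot_prompt` would be sent to whatever model is under test; exact-match scoring after normalization is the simplest choice here, and a real evaluation would likely need fuzzier matching of node labels.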