T2S-Bench and Structure of Thought: Benchmarking and Prompting for Comprehensive Text-to-Structure Reasoning

Abstract

Consider how humans handle complex reading tasks: they mark key points, infer how those points relate, and organize the information into a structure that guides understanding and response. Can a large language model likewise exploit text structure to improve its text-processing performance? To explore this question, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures and that consistently boosts performance across eight tasks and three model families. Building on this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve the text-to-structure capabilities of models. T2S-Bench contains 1.8K samples spanning 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. An evaluation of 45 mainstream models reveals substantial room for improvement: average accuracy on the multi-hop reasoning task is only 52.1%, and even the most advanced model reaches only 58.1% node accuracy in end-to-end extraction. Furthermore, on Qwen2.5-7B-Instruct, SoT alone yields an average +5.7% improvement across eight diverse text-processing tasks, and fine-tuning on T2S-Bench raises this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S-Bench. The dataset and evaluation code are available at https://t2s-bench.github.io/T2S-Bench-Page/.
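
To make the SoT idea above concrete, the following is a minimal sketch of a prompt that asks a model to build an intermediate structure before answering. It is an illustration under stated assumptions, not the prompt used in the paper: the instruction wording, the `STRUCTURE:`/`ANSWER:` markers, the node/edge JSON layout, and the helper names `build_sot_prompt` and `split_reply` are all hypothetical.

```python
import json

# Hypothetical instruction text; the abstract does not give the actual SoT wording.
SOT_INSTRUCTION = (
    "First extract the key points of the passage as nodes and the "
    "relations between them as edges. Write them as JSON after the line "
    "'STRUCTURE:'. Then, using that structure, write the final answer "
    "after the line 'ANSWER:'."
)


def build_sot_prompt(passage: str, question: str) -> str:
    """Assemble a single prompt that requests a structure, then an answer."""
    return (
        f"{SOT_INSTRUCTION}\n\n"
        f"Passage:\n{passage}\n\n"
        f"Question:\n{question}\n"
    )


def split_reply(reply: str) -> tuple[dict, str]:
    """Split a model reply into (parsed structure, answer text).

    Assumes the reply follows the hypothetical marker format requested
    by SOT_INSTRUCTION; raises ValueError if a marker is missing.
    """
    head, found_answer, answer = reply.partition("ANSWER:")
    if not found_answer:
        raise ValueError("reply has no 'ANSWER:' marker")
    _, found_structure, structure_json = head.partition("STRUCTURE:")
    if not found_structure:
        raise ValueError("reply has no 'STRUCTURE:' marker")
    return json.loads(structure_json), answer.strip()
```

The prompt string can be sent to any chat model; `split_reply` then separates the intermediate structure from the final answer so that each part can be inspected or scored on its own.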