How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs
February 9, 2026
Authors: Yapei Chang, Kyle Lo, Mohit Iyyer, Luca Soldaini
cs.AI
Abstract
Generating step-by-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested in chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet, measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce How2Everything, a scalable framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics and readily scales to larger corpora. From this pool we build How2Bench, a 7K-example evaluation set balanced across topics. To reliably score model outputs, we develop How2Score, an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal. For low-cost, reproducible evaluation, we distill a frontier model into an open 8B model, achieving 80.5% agreement with human annotators. How2Bench reveals clear scaling trends across model sizes and training stages, providing signal early in pretraining. Finally, RL using How2Score as a reward improves performance on How2Bench by >10 points across three models without systematic regressions on standard benchmarks, with gains robust to superficial source-document memorization or format compliance. Taken together, How2Everything shows how pretraining web data can support a closed loop of capability evaluation and improvement at scale.
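The How2Score protocol described above reduces evaluation to a binary judgment: does the generated procedure contain any critical failure that would prevent achieving the goal? A minimal sketch of how such a judge-based scoring loop and the judge-vs-human agreement metric could be wired up is below; all names, the verdict format, and the toy data are hypothetical, not the paper's actual implementation.

```python
# Hypothetical sketch of a How2Score-style scoring loop.
# Assumes the LLM judge is prompted to end its verdict with a line
# "CRITICAL FAILURE: YES" or "CRITICAL FAILURE: NO".

def parse_verdict(judge_output: str) -> bool:
    """Map a judge's free-text verdict to pass (True) / fail (False)."""
    return judge_output.strip().upper().endswith("CRITICAL FAILURE: NO")

def how2score(verdicts: list[str]) -> float:
    """Benchmark score: fraction of generations with no critical failure."""
    passes = [parse_verdict(v) for v in verdicts]
    return sum(passes) / len(passes)

def agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Percent agreement between judge and human annotators
    (the paper reports 80.5% for its distilled 8B judge)."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Toy verdicts for two generated procedures.
verdicts = [
    "Step 3 skips sterilizing the jars, so the preserve will spoil. CRITICAL FAILURE: YES",
    "All steps are sufficient to reach the stated goal. CRITICAL FAILURE: NO",
]
print(how2score(verdicts))  # 0.5
```

Because the score is a single pass/fail bit per example, the same signal can serve directly as a sparse reward for the RL stage the abstract describes.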