VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents
March 16, 2026
Authors: Udi Barzelay, Ophir Azulai, Inbar Shapira, Idan Friedman, Foad Abo Dahood, Madison Lee, Abraham Daniels
cs.AI
Abstract
We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories, and each document is provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, and text plus image combined. Unlike existing benchmarks, which evaluate only a single input representation, VAREX provides four controlled modalities per document, enabling systematic ablation of how input format affects extraction accuracy. We evaluate 20 models, ranging from frontier proprietary systems to small open-weight models, with particular attention to models of ≤4B parameters suitable for cost-sensitive and latency-constrained deployment. Results reveal that (1) below 4B parameters, structured-output compliance, not extraction capability, is the dominant bottleneck; in particular, schema echo (a model reproducing the schema's structure instead of extracted values) depresses scores by 45-65 percentage points (pp) in affected models; (2) extraction-specific fine-tuning of a 2B model yields gains of +81 pp, demonstrating that the instruction-following deficit can be addressed without scaling up; (3) layout-preserving text provides the largest accuracy gain (+3-18 pp), exceeding pixel-level visual cues; and (4) the benchmark discriminates models most effectively in the 60-95% accuracy band. The dataset and evaluation code are publicly available.
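The "layout-preserving text" modality described above can be illustrated with a minimal sketch: given OCR-style word boxes, pad each line with spaces so a word's character column roughly tracks its horizontal position on the page. The function name, the `(text, x, y)` word format, and the `chars_per_unit` scaling are illustrative assumptions, not the VAREX implementation.

```python
# Hypothetical sketch of layout-preserving text: render word boxes as
# whitespace-aligned plain text so column positions approximate page layout.
# Word format (text, x, y) and the scaling constants are assumptions.

def layout_preserving_text(words, chars_per_unit=0.1, line_tol=5.0):
    """words: list of (text, x, y) tuples in page coordinates (top-left origin).
    Returns whitespace-aligned text approximating the page layout."""
    # Group words into lines by similar y-coordinate.
    lines = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(lines[-1][0] - y) <= line_tol:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Render each line, padding with spaces to approximate column positions.
    rendered = []
    for _, items in lines:
        row = ""
        for x, text in sorted(items):
            col = int(x * chars_per_unit)
            # Pad to the target column (at least one space between words).
            row += " " * max(col - len(row), 1 if row else 0)
            row += text
        rendered.append(row)
    return "\n".join(rendered)
```

Feeding such a rendering to a text-only model preserves the column alignment of form labels and values, which the paper reports as worth +3-18 pp over plain text.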