For-Value: An Efficient Forward-Only Data Valuation Method for Fine-Tuning Large Language Models and Vision-Language Models

Abstract

Data valuation is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing methods typically rely on gradient computations, which makes them computationally prohibitive for billion-parameter models and prevents batch parallelization. In this work, we introduce For-Value, a forward-only data valuation framework that enables efficient, batch-scalable value estimation while remaining effective. Leveraging the expressive power of pretrained LLMs/VLMs, we show theoretically that data value can be captured by the alignment between the final hidden representations and the prediction errors at the last layer. Building on this insight, For-Value computes data value with a simple closed-form expression in a single forward pass, eliminating costly backpropagation and enabling efficient batched computation at scale. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in detecting influential and mislabeled data while achieving significant efficiency improvements.
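
To make the alignment idea concrete, the following is a minimal sketch of one plausible reading of such a score, not the paper's exact closed-form expression, which the abstract does not state. It assumes that the value of a training sample with respect to a validation sample is the inner product of their final hidden representations multiplied by the inner product of their prediction errors, where the prediction error is taken to be the one-hot label minus the softmax output. This is the quantity obtained from the inner product of last-layer cross-entropy gradients, and it needs only forward-pass outputs. The function names, the single-prediction-step treatment of each sample, and the toy inputs are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_error(logits, label):
    """Assumed error definition: one-hot(label) minus softmax(logits)."""
    probs = softmax(logits)
    return [(1.0 if k == label else 0.0) - p for k, p in enumerate(probs)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def alignment_score(h_train, err_train, h_val, err_val):
    """Hypothetical score: hidden-state similarity times error similarity."""
    return dot(h_train, h_val) * dot(err_train, err_val)


def value_scores(train, val):
    """Score every training sample against one validation sample.

    Each sample is a (hidden_state, logits, label) tuple that would come
    from a single forward pass; no gradients are computed anywhere.
    """
    h_val, logits_val, y_val = val
    err_val = prediction_error(logits_val, y_val)
    return [
        alignment_score(h, prediction_error(lg, y), h_val, err_val)
        for h, lg, y in train
    ]


# Toy data (made up for illustration): a correctly labeled sample,
# a mislabeled copy of it, and a sample with an orthogonal hidden state.
val = ([1.0, 0.0], [2.0, 0.0], 0)
train = [
    ([1.0, 0.0], [2.0, 0.0], 0),
    ([1.0, 0.0], [2.0, 0.0], 1),
    ([0.0, 1.0], [2.0, 0.0], 0),
]
scores = value_scores(train, val)
```

Under these assumptions, the correctly labeled sample receives a positive score, the mislabeled copy a negative one, and the sample whose hidden state is orthogonal to the validation sample a score of zero. Because each score is a product of two inner products, scores for a whole batch could be computed with two matrix multiplications, which is consistent with the batch scalability described above.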