ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation

Abstract

Multimodal Review Helpfulness Prediction (MRHP) is an essential task in recommender systems, particularly on e-commerce platforms. Determining the helpfulness of user-generated reviews enhances user experience and improves consumer decision-making. However, existing datasets focus predominantly on English and Indonesian, resulting in a lack of linguistic diversity, especially for low-resource languages such as Vietnamese. In this paper, we introduce ViMRHP (Vietnamese Multimodal Review Helpfulness Prediction), a large-scale benchmark dataset for the MRHP task in Vietnamese. The dataset covers four domains and comprises 46K reviews of 2K products. Constructing a dataset at this scale requires considerable time and cost. To optimize the annotation process, we leverage AI to assist annotators in constructing the ViMRHP dataset. With AI assistance, annotation time falls from 90 to 120 seconds per task to 20 to 40 seconds per task, while data quality is maintained and overall costs are lowered by approximately 65%. However, AI-generated annotations still have limitations in complex annotation tasks, which we examine further through a detailed performance analysis. In our experiments on ViMRHP, we evaluate baseline models on human-verified and AI-generated annotations to assess their quality differences. The ViMRHP dataset is publicly available at https://github.com/trng28/ViMRHP.
