Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation
February 19, 2026
Authors: Yan Wang, Yi Han, Lingfei Qian, Yueru He, Xueqing Peng, Dongji Feng, Zhuohan Xie, Vincent Jim Zhang, Rosie Guo, Fengran Mo, Jimin Huang, Yankai Chen, Xue Liu, Jian-Yun Nie
cs.AI
Abstract
Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating user choices as the sole ground truth therefore conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate stock rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.
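To make the idea of "normative utility grounded in investor-specific risk preferences" concrete, here is a minimal, generic sketch of risk-adjusted ranking using a standard mean-variance utility. This is an illustration of the general concept, not the paper's actual scoring procedure; the tickers, return/volatility figures, and the `risk_aversion` parameter are all hypothetical.

```python
# Generic illustration (NOT Conv-FinRe's actual metric): rank stocks by a
# mean-variance utility U = mu - 0.5 * lambda * sigma^2, where lambda is an
# investor-specific risk-aversion coefficient. All numbers are made up.

def utility(mu: float, sigma: float, risk_aversion: float) -> float:
    """Mean-variance utility: expected return penalized by return variance."""
    return mu - 0.5 * risk_aversion * sigma ** 2

def rank_stocks(stocks: dict[str, tuple[float, float]],
                risk_aversion: float) -> list[str]:
    """Return tickers sorted by descending utility for this investor."""
    return sorted(stocks, key=lambda t: utility(*stocks[t], risk_aversion),
                  reverse=True)

# Hypothetical annualized (expected return, volatility) per ticker.
stocks = {"AAA": (0.12, 0.30), "BBB": (0.08, 0.15), "CCC": (0.15, 0.45)}

print(rank_stocks(stocks, risk_aversion=1.0))  # ['AAA', 'BBB', 'CCC']
print(rank_stocks(stocks, risk_aversion=8.0))  # ['BBB', 'AAA', 'CCC']
```

Note that the same market data yields different "correct" rankings for different investors: a risk-tolerant investor (low `risk_aversion`) prefers the high-return AAA, while a risk-averse one prefers the low-volatility BBB. This is precisely why a single behavioral ground truth cannot serve as the only reference.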