

Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

February 19, 2026
作者: Yan Wang, Yi Han, Lingfei Qian, Yueru He, Xueqing Peng, Dongji Feng, Zhuohan Xie, Vincent Jim Zhang, Rosie Guo, Fengran Mo, Jimin Huang, Yankai Chen, Xue Liu, Jian-Yun Nie
cs.AI

Abstract

Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory settings, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating what users chose as the sole ground truth therefore conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate stock rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit to short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.