PaperFlow: Profiling, Recommending, and Adapting for Daily Paper Streams

Abstract

Scientific paper recommendation is typically evaluated as static ranking over a fixed candidate set, yet real scientific reading unfolds as a daily, longitudinal process in which interests shift and feedback accumulates. We introduce PaperFlow, a framework that organizes this process into three coupled stages: Profiling, which constructs and maintains a structured, inspectable scholarly profile from heterogeneous cold-start evidence; Recommending, which ranks each date-specific paper stream through multi-signal aggregation under a fixed display budget; and Adapting, which updates user state from semantically distinct feedback signals and models interest drift across days. We further define a longitudinal user-day benchmark that fixes users, dates, candidate pools, visible inputs, and hidden simulated relevance labels under a shared temporal information boundary. The benchmark contains 24 simulated research users, 50 daily paper streams, 1,200 user-day episodes, 20,727 unique papers, and 497,448 episode-paper records. We additionally specify a blind human-evaluation protocol to validate alignment between automatic metrics and expert judgments. Experiments against five scientific paper recommendation baselines show that PaperFlow achieves the strongest oracle-based ranking, the highest behavioral alignment with simulated reading selections, and the best blind human-evaluation score.
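
The benchmark sizes quoted in the abstract can be cross-checked with simple arithmetic. The sketch below uses only the reported counts. The reading that every user is shown the full candidate pool for each date, and that each paper belongs to exactly one daily stream, is an assumption: it is suggested by the fact that 24 x 20,727 = 497,448, but the abstract does not state it. All variable names are illustrative.

```python
# Counts as reported in the abstract.
NUM_USERS = 24
NUM_DAYS = 50
NUM_EPISODES = 1_200
NUM_UNIQUE_PAPERS = 20_727
NUM_EPISODE_PAPER_RECORDS = 497_448

# One episode per (user, day) pair.
derived_episodes = NUM_USERS * NUM_DAYS

# Assumption (not stated in the abstract): each paper appears in exactly one
# daily stream, and all users share the same candidate pool on a given day.
# Under that reading, every paper contributes one record per user.
derived_records = NUM_USERS * NUM_UNIQUE_PAPERS

# Average candidate pool size, computed two independent ways.
avg_candidates_per_episode = NUM_EPISODE_PAPER_RECORDS / NUM_EPISODES
avg_papers_per_day = NUM_UNIQUE_PAPERS / NUM_DAYS

if __name__ == "__main__":
    print("episodes consistent:", derived_episodes == NUM_EPISODES)
    print("records consistent:", derived_records == NUM_EPISODE_PAPER_RECORDS)
    print("avg candidates per episode:", round(avg_candidates_per_episode, 2))
    print("avg papers per day:", round(avg_papers_per_day, 2))
```

Both averages come out to roughly 414.5 papers, so under the stated assumption each user-day episode ranks a candidate pool of about that size against the fixed display budget.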