PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams

Abstract

Scientific paper recommendation is typically evaluated as static ranking over a fixed candidate set, yet real scientific reading unfolds as a daily, longitudinal process in which interests shift and feedback accumulates. We introduce PaperFlow, a framework that organizes this process into three coupled stages: Profiling, which constructs and maintains a structured, inspectable scholarly profile from heterogeneous cold-start evidence; Recommending, which ranks each date-specific paper stream through multi-signal aggregation under a fixed display budget; and Adapting, which updates user state from semantically distinct feedback signals and models interest drift across days. We further define a longitudinal user-day benchmark that fixes users, dates, candidate pools, visible inputs, and hidden simulated relevance labels under a shared temporal information boundary. The benchmark contains 24 simulated research users, 50 daily paper streams, 1,200 user-day episodes, 20,727 unique papers, and 497,448 episode-paper records. We additionally specify a blind human-evaluation protocol to validate alignment between automatic metrics and expert judgments. Experiments against five scientific paper recommendation baselines show that PaperFlow achieves the strongest oracle-based ranking performance, the highest behavioral alignment with simulated reading selections, and the best blind human-evaluation score.
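
The benchmark counts quoted in the abstract can be cross-checked with simple arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and the reading that all users share the same daily candidate pool (with each paper appearing in exactly one daily stream) is an assumption inferred from the numbers, not something the abstract states outright.

```python
# Counts quoted in the abstract. The names below are hypothetical labels.
NUM_USERS = 24
NUM_DAYS = 50
NUM_EPISODES = 1_200
NUM_UNIQUE_PAPERS = 20_727
NUM_EPISODE_PAPER_RECORDS = 497_448


def benchmark_consistency():
    """Derive a few quantities implied by the reported benchmark sizes."""
    # One episode per (user, day) pair.
    episodes_from_grid = NUM_USERS * NUM_DAYS

    # ASSUMPTION: every user sees the same daily pool and each paper
    # belongs to exactly one daily stream, so each paper contributes
    # one record per user.
    records_if_pools_shared = NUM_USERS * NUM_UNIQUE_PAPERS

    # Average number of candidate papers per user-day episode.
    avg_candidates_per_episode = NUM_EPISODE_PAPER_RECORDS / NUM_EPISODES

    return {
        "episodes_from_grid": episodes_from_grid,
        "episodes_match": episodes_from_grid == NUM_EPISODES,
        "records_if_pools_shared": records_if_pools_shared,
        "records_match": records_if_pools_shared == NUM_EPISODE_PAPER_RECORDS,
        "avg_candidates_per_episode": avg_candidates_per_episode,
    }


if __name__ == "__main__":
    for key, value in benchmark_consistency().items():
        print(f"{key}: {value}")
```

Under that assumption both checks hold, and each user-day episode ranks roughly 415 candidate papers on average, which gives a sense of how selective the fixed display budget has to be.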