According to Me: Long-Term Personalized Referential Memory QA

Abstract

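Personalized AI assistants need to recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing long-term memory benchmarks focus mainly on dialogue history and fail to capture realistic personalized references grounded in lived experience. This paper proposes ATM-Bench, the first benchmark for multimodal, multi-source personalized referential memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs, covering queries that require resolving personal references, reasoning over multiple pieces of evidence from multiple sources, and handling conflicting evidence. We also propose Schema-Guided Memory (SGM) to represent memory items from different sources in a structured form. In experiments, we implement five state-of-the-art memory systems alongside a standard RAG baseline and evaluate variants of memory ingestion, retrieval, and answer generation techniques. The results show low performance (under 20% accuracy) on the ATM-Bench-Hard set and that SGM improves performance over the descriptive memory commonly adopted in prior work. Code is available at: https://github.com/JingbiaoMei/ATM-Bench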

English

Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing long-term memory benchmarks focus primarily on dialogue history and fail to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence, including queries that require resolving personal references, reasoning over multiple pieces of evidence from multiple sources, and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originating from different sources. In experiments, we implement five state-of-the-art memory systems along with a standard RAG baseline and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find poor performance (under 20% accuracy) on the ATM-Bench-Hard set, and we find that SGM improves performance over the Descriptive Memory commonly adopted in prior work. Code is available at: https://github.com/JingbiaoMei/ATM-Bench
