SAGA: A Sequence-Adaptive Generative Architecture for Multi-Horizon Probabilistic Forecasting with Adaptive Temporal Conformal Prediction

Abstract

Microsimulation models used by ministries of finance and central banks rely on parametric processes for lifetime earnings that capture only the first and second moments of the conditional distribution and miss long-range nonlinear structure. We propose SAGA, a decoder-only transformer designed for irregular tabular panel sequences, paired with a split conformal calibration wrapper that delivers individual-level prediction intervals with finite-sample marginal coverage guarantees. The model is trained on the Swedish LISA longitudinal register for 1990 to 2022, covering 2,143,817 individuals and 61,284,903 person-years. It forecasts annual labor earnings at horizons of 1 to 30 years and aggregates them by Monte Carlo simulation into distributions of present-discounted lifetime earnings. Compared with the canonical parametric process of Guvenen, Karahan, Ozkan, and Song (GKOS) and with tabular and recurrent baselines, SAGA reduces the continuous ranked probability score by 31.9 percent at the ten-year horizon and the mean absolute error by 37.7 percent at the twenty-year horizon. The conformal intervals come within 0.4 percentage points of nominal coverage marginally and within 2.4 percentage points on the worst-case demographic subgroup. The reconstructed Gini coefficient of lifetime earnings is 0.327, against a partially observed true value of 0.341 and a GKOS estimate of 0.378. To support replication outside the protected SCB MONA environment, we release the model weights, the calibration tables, and a synthetic equivalent dataset.
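As an illustration of the split conformal step mentioned above, the following is a minimal sketch and not the SAGA calibration wrapper itself. It assumes an absolute-residual nonconformity score and point forecasts held out in a calibration set; the abstract does not specify the score, and the function names `split_conformal_quantile` and `conformal_interval` are hypothetical.

```python
import numpy as np


def split_conformal_quantile(cal_pred, cal_true, alpha=0.1):
    """Return the interval half-width q from a held-out calibration set.

    Assumption: nonconformity score = |y - y_hat| (not stated in the source).
    """
    scores = np.abs(np.asarray(cal_true, dtype=float) - np.asarray(cal_pred, dtype=float))
    n = scores.size
    # Finite-sample rank ceil((n + 1)(1 - alpha)), which yields the
    # marginal coverage guarantee of at least 1 - alpha.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return np.inf
    return float(np.sort(scores)[k - 1])


def conformal_interval(test_pred, q):
    """Symmetric prediction interval [y_hat - q, y_hat + q]."""
    test_pred = np.asarray(test_pred, dtype=float)
    return test_pred - q, test_pred + q


if __name__ == "__main__":
    # Synthetic data only; no relation to the LISA register.
    rng = np.random.default_rng(0)
    cal_pred = rng.normal(size=2000)
    cal_true = cal_pred + rng.normal(scale=0.5, size=2000)
    test_pred = rng.normal(size=2000)
    test_true = test_pred + rng.normal(scale=0.5, size=2000)

    q = split_conformal_quantile(cal_pred, cal_true, alpha=0.1)
    lo, hi = conformal_interval(test_pred, q)
    coverage = np.mean((test_true >= lo) & (test_true <= hi))
    print(f"half-width={q:.3f}, empirical coverage={coverage:.3f}")
```

The guarantee in this sketch is marginal: it holds on average over exchangeable calibration and test points, and not separately within each subgroup. This is consistent with the abstract reporting marginal and worst-case subgroup coverage as separate figures.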