When, Where, and How: Adaptive Binning for Tabular Self-Supervised Learning

Abstract

Medical tabular data are ubiquitous in clinical research, yet deep learning for tables remains underexplored because reliable labels often require costly expert adjudication, even though structured clinical variables are routinely available in tabular form. Self-supervised learning (SSL) can leverage these unlabeled tables, and recent binning-based pretext tasks offer a promising inductive bias. However, existing objectives fix a single global quantile discretization and apply feature-agnostic supervision. We propose Adaptive Binning, a training-adaptive discretization pretext task for tabular SSL that couples discretization to learning through a feature-wise coarse-to-fine curriculum. Motivated by the spectral bias of neural networks and the principles of curriculum learning, our method progressively refines each feature's discretization when a plateau is detected and selects representation-aware splits to jointly improve value-space concentration and representation-space coherence. A heterogeneity-aware objective unifies categorical reconstruction with ordinal supervision for numerical features. Experiments on public medical tabular datasets under unified evaluation protocols show consistent gains for linear probing and fine-tuning without dataset-specific discretization tuning. We further introduce a medical tabular SSL benchmark with standardized protocols to support reproducible progress in this underexplored domain. Our code is available at https://github.com/labhai/Adaptive-Binning.
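To make the plateau-triggered, coarse-to-fine idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that refinement means doubling a feature's quantile bin count and that a plateau means the per-feature pretext loss has stopped improving over a short window. The names `quantile_bin`, `has_plateaued`, and `FeatureBinSchedule`, as well as the `window`, `tol`, `start_bins`, and `max_bins` parameters, are hypothetical. The representation-aware split selection and the heterogeneity-aware objective described in the abstract are omitted here.

```python
import numpy as np


def quantile_bin(x, n_bins):
    """Map each value in x to a quantile-bin label in 0..n_bins-1."""
    # Interior quantile edges; np.unique collapses duplicate edges on tied data,
    # so heavily tied features may end up with fewer than n_bins distinct labels.
    probs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    edges = np.unique(np.quantile(x, probs))
    return np.searchsorted(edges, x, side="right")


def has_plateaued(losses, window=3, tol=1e-3):
    """True if the best loss in the last `window` steps beats the earlier best by < tol."""
    if len(losses) <= window:
        return False
    return min(losses[:-window]) - min(losses[-window:]) < tol


class FeatureBinSchedule:
    """Tracks one bin count per feature and refines it when that feature's loss plateaus."""

    def __init__(self, n_features, start_bins=2, max_bins=16):
        self.n_bins = [start_bins] * n_features
        self.max_bins = max_bins
        self.history = [[] for _ in range(n_features)]

    def update(self, per_feature_loss):
        """Record one loss per feature; return indices of features refined at this step."""
        refined = []
        for j, loss in enumerate(per_feature_loss):
            self.history[j].append(float(loss))
            if self.n_bins[j] < self.max_bins and has_plateaued(self.history[j]):
                # Assumed refinement rule: double the bin count, capped at max_bins.
                self.n_bins[j] = min(2 * self.n_bins[j], self.max_bins)
                # Restart plateau tracking for the new, finer pretext target.
                self.history[j].clear()
                refined.append(j)
        return refined
```

In a training loop, one would call `update` once per epoch with the per-feature pretext losses and then recompute the targets of any refined feature `j` with `quantile_bin(X[:, j], schedule.n_bins[j])`. Features whose losses are still decreasing keep their coarse targets, which is what makes the curriculum feature-wise in this sketch.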