When, Where, and How: Adaptive Binning for Tabular Self-Supervised Learning

Abstract

Medical tabular data are ubiquitous in clinical research, but deep learning for tables remains underexplored: structured clinical variables are routinely available in tabular form, yet reliable labels often require costly expert adjudication. Self-supervised learning (SSL) can leverage these unlabeled tables, and recent binning-based pretext tasks offer a promising inductive bias. Existing objectives, however, fix a single global quantile discretization and apply feature-agnostic supervision. We propose Adaptive Binning, a training-adaptive discretization pretext task for tabular SSL that couples discretization to learning through a feature-wise coarse-to-fine curriculum. Motivated by the spectral bias of neural networks and the principles of curriculum learning, our method progressively refines the discretization of each feature upon plateau detection and selects representation-aware splits to jointly improve value-space concentration and representation-space coherence. A heterogeneity-aware objective unifies categorical reconstruction with ordinal supervision for numerical features. Experiments on public medical tabular datasets under unified evaluation protocols show consistent gains for linear probing and fine-tuning without dataset-specific discretization tuning. We further introduce a medical tabular SSL benchmark with standardized protocols to support reproducible progress in this underexplored domain. Our code is available at https://github.com/labhai/Adaptive-Binning.
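
As a rough illustration of the coarse-to-fine idea described above, the sketch below starts one numerical feature at a coarse quantile discretization and doubles its bin count whenever the pretext loss for that feature stops improving. It is not the authors' implementation. The class name `FeatureBinSchedule`, the parameters `patience` and `min_delta`, and the doubling rule are all assumptions made for illustration, and the representation-aware split selection is replaced here by plain quantile re-binning.

```python
import numpy as np


def quantile_edges(values: np.ndarray, n_bins: int) -> np.ndarray:
    """Interior cut points that split `values` into up to `n_bins` quantile bins."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    # np.unique drops repeated cut points that arise from heavily tied values.
    return np.unique(np.quantile(values, qs))


def assign_bins(values: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Map each value to an ordinal bin index in 0 .. len(edges)."""
    return np.searchsorted(edges, values, side="right")


class FeatureBinSchedule:
    """Hypothetical per-feature schedule: refine the bins when the loss plateaus."""

    def __init__(self, values, start_bins=2, max_bins=16, patience=3, min_delta=1e-3):
        self.values = np.asarray(values, dtype=float)
        self.n_bins = start_bins
        self.max_bins = max_bins
        self.patience = patience    # non-improving steps tolerated before refining
        self.min_delta = min_delta  # smallest loss decrease that counts as progress
        self.best = float("inf")
        self.stale = 0
        self.edges = quantile_edges(self.values, self.n_bins)

    def step(self, loss: float) -> bool:
        """Record this feature's pretext loss; return True if the bins were refined."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.stale = 0
            return False
        self.stale += 1
        if self.stale < self.patience or self.n_bins >= self.max_bins:
            return False
        # Plateau detected: move to a finer discretization (assumed doubling rule).
        self.n_bins = min(self.n_bins * 2, self.max_bins)
        self.edges = quantile_edges(self.values, self.n_bins)
        self.best = float("inf")
        self.stale = 0
        return True


# Usage: one schedule per numerical feature, stepped with that feature's loss.
feature = np.arange(100.0)
schedule = FeatureBinSchedule(feature, start_bins=2, max_bins=8, patience=2)
for loss in [1.0, 1.0, 1.0]:
    schedule.step(loss)
targets = assign_bins(feature, schedule.edges)  # ordinal targets for the pretext task
```

Keeping one schedule per feature is what makes the curriculum feature-wise in this sketch: a feature whose loss is still falling stays coarse, while a plateaued feature receives finer ordinal targets.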