The Advantages, Drawbacks, and Pitfalls of Markov Boundaries for Tabular Data Prediction

Abstract

Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other feature redundant: once the boundary is observed, the target is conditionally independent of the rest of the table. This makes it a tempting object for tabular prediction, since it names exactly the columns a model should need. Yet modern regressors are still trained on the full feature set. We ask whether the Markov boundary is genuinely useful for prediction on SCM3K, a 3,450-task synthetic SCM benchmark with feature counts from 40 to 1000 and six SCM families, evaluated with six regressors. The answer is more nuanced than the theory suggests. Restricting a regressor to the oracle boundary often improves prediction substantially, and the improvement grows as the feature space becomes larger and sparser. But the natural pipeline, recovering the boundary with causal discovery and then training on the recovered mask, does not deliver. Existing estimators exhaust the compute budget before reaching the regime where the boundary helps most, and even where they run they rarely beat the full feature set. We trace this to three causes. First, discovery optimizes structural recovery rather than prediction. Second, false negatives and false positives carry sharply asymmetric predictive costs. Third, the exact boundary is only one of many feature sets that outperform the full feature set. We then discuss what these facts imply for prediction-aligned feature selection and for tabular models that learn to use causal structure.
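To make the central object concrete, the following is a minimal sketch, not the paper's code, of how an oracle boundary mask could be read off a known DAG. It assumes the standard result that, under the causal Markov and faithfulness assumptions, the Markov boundary of a node consists of its parents, its children, and the other parents of its children (spouses). The function name `markov_boundary`, the adjacency convention, and the toy graph are all hypothetical choices made for illustration.

```python
from __future__ import annotations

import numpy as np


def markov_boundary(adj: np.ndarray, target: int) -> set[int]:
    """Return the Markov boundary of `target` in a DAG.

    Assumed convention (illustrative only): adj[i, j] != 0 means i -> j.
    Under the causal Markov and faithfulness assumptions, the boundary is
    parents + children + other parents of the children (spouses).
    """
    edges = np.asarray(adj) != 0
    parents = set(np.flatnonzero(edges[:, target]).tolist())
    children = set(np.flatnonzero(edges[target, :]).tolist())
    spouses: set[int] = set()
    for child in children:
        # Every parent of a child is a co-parent of the target.
        spouses |= set(np.flatnonzero(edges[:, child]).tolist())
    boundary = parents | children | spouses
    boundary.discard(target)  # the target is trivially a parent of its own children
    return boundary


# Hypothetical toy DAG on six nodes, with node 2 as the target:
#   1 -> 0 -> 2 -> 3 <- 4,   node 5 is isolated.
adj = np.zeros((6, 6), dtype=int)
adj[1, 0] = 1  # grandparent of the target (outside the boundary)
adj[0, 2] = 1  # parent
adj[2, 3] = 1  # child
adj[4, 3] = 1  # spouse (co-parent of the child)

mb = markov_boundary(adj, target=2)

# Boolean column mask over all six variables; a regressor restricted to the
# oracle boundary would be trained only on the columns where the mask is True.
mask = np.zeros(adj.shape[0], dtype=bool)
mask[sorted(mb)] = True
```

In this toy graph the boundary of node 2 is `{0, 3, 4}`: the grandparent (node 1) and the isolated node 5 are excluded, because conditioning on the boundary already makes them uninformative about the target. A recovered mask from a causal discovery method would take the place of `mask` in the pipeline the abstract describes.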