The Good, the Bad, and the Ugly of the Markov Boundary for Tabular Prediction

Abstract

Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other feature redundant: once the boundary is observed, the target is conditionally independent of the rest of the table. This makes it a tempting object for tabular prediction, since it names exactly the columns a model should need. Yet modern regressors are still trained on the full feature set. We ask whether the Markov boundary is genuinely useful for prediction on SCM3K, a synthetic structural causal model (SCM) benchmark with 3,450 tasks, feature counts from 40 to 1,000, and six SCM families, evaluated with six regressors. The answer is more nuanced than the theory suggests. Restricting a regressor to the oracle boundary often improves prediction substantially, and the improvement grows as the feature space becomes larger and sparser. But the natural pipeline of recovering the boundary with causal discovery and then training on the recovered mask does not deliver. Existing estimators exhaust the compute budget before reaching the regime where the boundary helps most, and even where they run, they rarely beat the full feature set. We trace this to three causes: discovery optimizes structural recovery rather than prediction; false negatives and false positives carry sharply asymmetric predictive cost; and the exact boundary is only one of many feature sets that outperform the full feature set. We then develop what these facts imply for prediction-aligned feature selection and for tabular models that learn to use causal structure.
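
To make the central object concrete, here is a minimal sketch of how an oracle boundary can be read off a known graph. It assumes the ground-truth structure is a DAG and that faithfulness holds, in which case the Markov boundary of a target is its parents, its children, and the other parents of its children. The function name `markov_boundary`, the parent-set encoding, and the toy graph are illustrative assumptions and are not taken from SCM3K.

```python
def markov_boundary(parents, target):
    """Return parents, children, and co-parents of `target` in a DAG.

    `parents` maps every node to the set of its direct parents.
    This encoding is an assumption made for the sketch.
    """
    direct_parents = set(parents.get(target, ()))
    # Children are the nodes that list the target among their parents.
    children = {node for node, ps in parents.items() if target in ps}
    # Co-parents (spouses) are the other parents of those children.
    co_parents = {p for child in children for p in parents[child]}
    return (direct_parents | children | co_parents) - {target}


# Hypothetical toy DAG: D -> A -> Y -> C <- B, with E disconnected.
toy_parents = {
    "A": {"D"},
    "Y": {"A"},
    "C": {"Y", "B"},
    "B": set(),
    "D": set(),
    "E": set(),
}

boundary = markov_boundary(toy_parents, "Y")
print(sorted(boundary))  # ['A', 'B', 'C']
```

In this toy graph, D influences Y only through A, and E is unrelated, so neither belongs to the boundary. B enters only because it shares the child C with Y. Restricting a regressor to the oracle boundary then amounts to training on the columns A, B, and C alone.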