Distribution Shift in Missing Data Imputation: A Risk-Based Perspective and Importance-Weighted Correction under MAR

arXiv:2602.06713v2
Abstract: Missing data imputation, where a model is trained on observed data to estimate unobserved values, is a fundamental problem in machine learning. In this paper, we rigorously formulate imputation model learning as a mean-squared error risk minimisation problem. We show that when the probability of missingness depends on the data, many state-of-the-art methods fail to account for the resulting distribution shift between the observed data used for training and the full data distribution used for evaluation. Consequently, these approaches do not minimise mean-squared error on the full data distribution. To address this, we propose a novel imputation algorithm that learns an imputation model from the observed data while explicitly accounting for this distribution shift. Simulation studies show consistent improvements over otherwise identical uncorrected baselines, with average reductions of 3% in RMSE and 7% in Wasserstein distance.
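The distribution shift described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' algorithm. It is a generic inverse-probability-weighting example under assumptions chosen for illustration: a single covariate `x` that is always observed, a target `y` that is missing at random given `x`, a known observation probability `p_obs` (in practice this would have to be estimated), and a deliberately misspecified linear imputation model. All names (`fit_linear`, `full_data_mse`, `p_obs`) are hypothetical.

```python
import numpy as np


def fit_linear(x, y, w=None):
    """Weighted least squares for y ~ a + b*x; w=None means uniform weights."""
    w = np.ones_like(x) if w is None else w
    design = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(w)  # scaling rows by sqrt(w) turns OLS into weighted LS
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return coef


def full_data_mse(coef, x, y):
    """Mean-squared error of the linear imputer on the full data."""
    return float(np.mean((y - (coef[0] + coef[1] * x)) ** 2))


rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)                # always observed
y = x**2 + 0.1 * rng.normal(size=n)   # nonlinear truth, so a linear model is misspecified

# MAR mechanism (assumed known here): y is observed more often when x is small.
# Bounded in [0.1, 0.9] so the importance weights stay below 10.
p_obs = 0.1 + 0.8 / (1.0 + np.exp(2.0 * x))
observed = rng.random(n) < p_obs

# Uncorrected baseline: minimise MSE on the observed rows only.
coef_plain = fit_linear(x[observed], y[observed])

# Importance-weighted fit: weight each observed row by 1 / P(observed | x),
# which reweights the observed sample towards the full data distribution.
coef_iw = fit_linear(x[observed], y[observed], 1.0 / p_obs[observed])

# The full y is available only because this is a simulation.
mse_plain = full_data_mse(coef_plain, x, y)
mse_iw = full_data_mse(coef_iw, x, y)
print(f"full-data MSE, unweighted: {mse_plain:.3f}")
print(f"full-data MSE, importance-weighted: {mse_iw:.3f}")
```

In this setup the unweighted fit minimises risk under the observed-data distribution, which over-represents small `x`, while the weighted fit targets the full-data risk. The gap between the two depends on the model being misspecified; with a correctly specified model both fits would converge to the same solution.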
