Direction-Aware Offline-to-Online Learning in Linear Contextual Bandits
arXiv:2604.24016v2 Announce Type: replace
Abstract: Many bandit systems are deployed with offline historical data, such as logs from earlier policies. Using these data can reduce early online exploration when they remain informative for the online problem; when the offline and online environments differ, however, they can be biased. For linear (contextual) bandits, this bias is directional: offline data may be informative in some feature directions and misleading in others. However, prior work typically controls this gap through a known Euclidean bound on the model parameters, which we prove is too coarse: even with the offline parameter known, bias in a single unknown direction can force dimension-dependent regret. To address this challenge, we introduce a directional bias certificate $(M_{\mathrm{bias}},\rho)$ that measures the offline-to-online gap through an $M_{\mathrm{bias}}$-induced norm and assigns different bias budgets to different directions. Building on this certificate, we propose \emph{Ellipsoidal-MINUCB}, which augments online learning with an offline-pooled branch that safely exploits historical data. When the certificate is known, we show that the algorithm matches the standard SupLinUCB rate in the worst case and improves when offline coverage aligns with low-bias directions. When the certificate is unknown, we estimate it adaptively from the offline and accumulated online data and establish a corresponding regret guarantee. Numerical experiments support the theory and show gains in aligned regimes.
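To make the certificate concrete, the following is a minimal numerical sketch (not the paper's implementation) of how a directional bias certificate $(M_{\mathrm{bias}},\rho)$ can bound the offline-to-online parameter gap in the $M_{\mathrm{bias}}$-induced norm, with large eigenvalues of $M_{\mathrm{bias}}$ imposing tight budgets in trusted directions and small eigenvalues allowing larger bias elsewhere. All variable names (`theta_on`, `theta_off`, `M_bias`, `rho`) are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def m_norm(delta, M):
    """M-induced norm: sqrt(delta^T M delta) for positive-definite M."""
    return float(np.sqrt(delta @ M @ delta))

rng = np.random.default_rng(0)
d = 4

# Build a positive-definite M_bias: large eigenvalues in directions where
# offline data are trusted (tight bias budget), small eigenvalues where
# offline data may be misleading (loose budget). Purely illustrative.
eigvals = np.array([10.0, 10.0, 0.5, 0.1])
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthonormal basis
M_bias = Q @ np.diag(eigvals) @ Q.T

# Hypothetical online parameter, with the offline parameter biased along a
# low-budget direction (eigenvalue 0.1), so a unit-length Euclidean gap
# still satisfies the directional certificate.
theta_on = rng.standard_normal(d)
gap = Q @ np.array([0.0, 0.0, 0.0, 1.0])
theta_off = theta_on + gap

rho = 1.0
gap_norm = m_norm(theta_off - theta_on, M_bias)  # = sqrt(0.1) ~ 0.316
print(f"Euclidean gap: {np.linalg.norm(gap):.3f}")
print(f"||gap||_M_bias = {gap_norm:.3f}, certificate holds: {gap_norm <= rho}")
```

The same unit Euclidean gap placed along a high-budget direction (eigenvalue 10) would have $M_{\mathrm{bias}}$-norm $\sqrt{10}\approx 3.16$ and violate the certificate, illustrating why a single Euclidean radius cannot distinguish the two cases.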