cs.IT, cs.LG, math.IT, math.OC, stat.ML

Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL

arXiv:2506.20904v2
Abstract: We study offline reinforcement learning in average-reward MDPs, which presents increased challenges from the perspectives of distribution shift and non-uniform coverage, and has been relatively under…