Inverse Reinforcement Learning with Just Classification and a Few Regressions

arXiv:2509.21172v2 Announce Type: replace

Abstract: Inverse reinforcement learning (IRL) aims to infer rewards from observed behavior, but rewards are not identified from the policy alone: many reward-value pairs can rationalize the same actions. Meaningful reward recovery therefore requires a normalization, yet existing normalized IRL methods often rely on anchor-action restrictions or specialized neural architectures. We study reward recovery in the maximum-entropy, or Gumbel-shock, model under a broad class of statewise affine normalizations, with anchor-action constraints as a special case. This yields Generalized Policy-to-$Q$-to-Reward (GenPQR), a modular procedure that estimates the behavior policy, evaluates its soft $Q$-function through the Bellman equation, and recovers the normalized reward. Both stages can be implemented with off-the-shelf classification and regression methods. We prove modular finite-sample guarantees under general function approximation, with separate policy-estimation and $Q$-estimation errors. As a concrete instantiation, we study GenPQR with fitted $Q$-evaluation, reducing IRL to policy estimation followed by regression. Experiments show that GenPQR matches or improves reward recovery relative to DeepPQR while remaining simpler and more modular. Compared with DeepPQR, our theory goes beyond anchor actions, accommodates large and continuous action spaces, makes coverage requirements explicit, and is not tied to a specific neural-network architecture or training procedure.
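To make the "classification plus a few regressions" pipeline concrete, here is a minimal sketch of the GenPQR recipe in its anchor-action special case, under the standard maximum-entropy identities $\log \pi(a\mid s) = Q(s,a) - V(s)$ and $r(s,a) = Q(s,a) - \gamma\,\mathbb{E}[V(s')\mid s,a]$. Everything below is an illustrative assumption, not the authors' implementation: the synthetic data, the choice of scikit-learn models, the anchor reward $r(s,a_0)=0$, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

# Hypothetical demonstration data: states, observed actions, next states.
rng = np.random.default_rng(0)
n, d, n_actions = 2000, 4, 3
S = rng.normal(size=(n, d))
A = rng.integers(0, n_actions, size=n)
S_next = S + rng.normal(scale=0.1, size=(n, d))
gamma = 0.9
anchor, r_anchor = 0, 0.0  # anchor-action normalization: r(s, a0) = 0

# Stage 1 (classification): estimate the behavior policy pi(a|s).
policy = LogisticRegression(max_iter=1000).fit(S, A)
log_pi = lambda X: np.log(policy.predict_proba(X) + 1e-12)

# Stage 2 (fitted Q-evaluation by regression): evaluate the anchor
# action's soft Q. Under the Gumbel-shock model, V(s) = Q(s, a0)
# - log pi(a0|s), so the soft Bellman equation at the anchor reads
# Q(s, a0) = r_anchor + gamma * E[Q(s', a0) - log pi(a0|s')].
q_anchor = Ridge()
q_next = np.zeros(n)
for _ in range(50):  # fitted-point iteration on the regression target
    target = r_anchor + gamma * (q_next - log_pi(S_next)[:, anchor])
    q_anchor.fit(S, target)
    q_next = q_anchor.predict(S_next)

# Recover the full soft Q: Q(s, a) = Q(s, a0) + log pi(a|s) - log pi(a0|s).
lp = log_pi(S)
Q = q_anchor.predict(S)[:, None] + lp - lp[:, [anchor]]

# Stage 3 (one more regression): estimate the continuation value
# gamma * E[V(s') | s, a], then read off r(s, a) = Q(s, a) - continuation.
V_next = q_anchor.predict(S_next) - log_pi(S_next)[:, anchor]
SA = np.column_stack([S, np.eye(n_actions)[A]])  # simple (s, a) features
cont = Ridge().fit(SA, gamma * V_next)
r_hat = Q[np.arange(n), A] - cont.predict(SA)
```

Each stage is an off-the-shelf estimator, which is the modularity the abstract emphasizes: swapping the logistic regression or ridge regressions for any other classifier or regressor leaves the pipeline unchanged, and the general statewise affine normalizations the paper studies would replace only the anchor step above.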
