cs.LG

Interpreting Reinforcement Learning Agents with Susceptibilities

arXiv:2605.08007v1
Abstract: Susceptibilities are a technique for neural network interpretability that studies the response of posterior expectation values of observables to perturbations of the loss. We generalize this construction…
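
As a rough illustration of the first sentence of the abstract, the sketch below computes a susceptibility on a toy one-dimensional parameter grid. It is not the paper's construction. It assumes a Gibbs-style posterior proportional to exp(-n_beta * loss), under which the linear response of an expectation value to a loss perturbation equals -n_beta times the covariance of the observable with the perturbation. The grid, the quadratic loss, the linear perturbation, the choice of observable, the value of `n_beta`, and all function names are hypothetical choices made for this example.

```python
import math

def posterior_weights(losses, n_beta):
    """Normalized Gibbs weights exp(-n_beta * loss) over a discrete grid.
    Assumption: flat prior on the grid; the paper's posterior may differ."""
    m = min(losses)  # subtract the minimum for numerical stability
    w = [math.exp(-n_beta * (l - m)) for l in losses]
    z = sum(w)
    return [x / z for x in w]

def expectation(values, probs):
    return sum(v * p for v, p in zip(values, probs))

def susceptibility_covariance(obs, loss, delta_loss, n_beta):
    """Linear-response form: -n_beta * Cov(obs, delta_loss) under the
    unperturbed posterior (sign and scaling are assumed conventions)."""
    p = posterior_weights(loss, n_beta)
    e_o = expectation(obs, p)
    e_d = expectation(delta_loss, p)
    e_od = expectation([o * d for o, d in zip(obs, delta_loss)], p)
    return -n_beta * (e_od - e_o * e_d)

def susceptibility_finite_difference(obs, loss, delta_loss, n_beta, eps=1e-4):
    """Direct definition: central difference of E[obs] with respect to the
    strength eps of the perturbation loss + eps * delta_loss."""
    def expect_at(e):
        perturbed = [l + e * d for l, d in zip(loss, delta_loss)]
        return expectation(obs, posterior_weights(perturbed, n_beta))
    return (expect_at(eps) - expect_at(-eps)) / (2 * eps)

# Toy setup (all hypothetical): scalar parameter w on a grid over [-3, 3].
grid = [-3 + 6 * i / 600 for i in range(601)]
loss = [w ** 2 for w in grid]        # unperturbed loss
delta_loss = [w for w in grid]       # perturbation direction
obs = [w for w in grid]              # observable: the parameter itself
n_beta = 4.0

chi_cov = susceptibility_covariance(obs, loss, delta_loss, n_beta)
chi_fd = susceptibility_finite_difference(obs, loss, delta_loss, n_beta)
print(chi_cov, chi_fd)
```

For this Gaussian toy case both estimates should agree closely and sit near -0.5: the perturbation shifts the posterior mean by about -eps/2. The covariance form is the one that matters in practice under these assumptions, since it only needs samples from the unperturbed posterior.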