cs.LG

Multi-User Dueling Bandits: A Fair Approach using Nash Social Welfare

arXiv:2605.01961v1 Announce Type: new
Abstract: Learning from human preference data is becoming a useful tool in applications ranging from fine-tuning large language models to training reinforcement learning agents. However, in most scenarios, the model is trained on the av…