Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
arXiv:2604.13175v1 Announce Type: cross
Abstract: Large language models can be aligned with human preferences through offline reinforcement learning (RL) on small labeled datasets. While single-objective alignment is well-studied, many real-world appl…