Near-Optimal Privacy-Preserving Learning for Max-Min Fair Multi-Agent Bandits
arXiv:2306.04498v3 Announce Type: replace
Abstract: We study fair multi-agent multi-armed bandit learning under collision-only coordination. Agents cannot communicate explicitly during learning and observe only their own rewards and whether collisions occur when several agents access the same arm. The goal is to learn a max-min fair allocation while keeping each agent's reward samples and empirical reward estimates local. We propose a fully distributed algorithm for bounded rewards with unknown support, achieving regret $O\!\left(N^3 f(\log T)\log T\right)$, where $f$ is any nondecreasing diverging function satisfying $f(k-1)/f(k)\to 1$. The algorithm combines distributed agent ordering, cumulative round-robin exploration, endpoint-revalidated warm-started bisection, and a collision-based distributed auction for threshold-feasibility tests. Unlike leader-based optimal algorithms, no agent collects the reward observations, empirical estimates, or preferences of the others. Thus, the protocol preserves reward privacy in the operational sense of avoiding reward sharing, while coordinating only through collision outcomes. Compared with previous privacy-preserving algorithms for max-min fair bandits, which have exponential dependence on the number of agents, our method achieves polynomial $N^3$ dependence while retaining near-logarithmic dependence on $T$. The analysis uses concentration of cumulative empirical estimates and stability of endpoint-revalidated bisection. Simulations confirm the predicted scaling with horizon, number of agents, and max-min gap across representative numerical settings.
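To make the "endpoint-revalidated warm-started bisection" component more concrete, the following is a minimal illustrative sketch, not the paper's actual protocol: it assumes a hypothetical monotone `feasible(t)` oracle (standing in for the collision-based threshold-feasibility test, which in the paper is run via a distributed auction) and searches for the largest feasible threshold, revalidating and widening a warm-started interval when updated estimates have moved the cutoff outside it.

```python
def warm_bisect(feasible, lo, hi, tol):
    """Find (to within tol) the largest threshold t with feasible(t) True,
    assuming feasibility is monotone: True up to some cutoff, False above.

    (lo, hi) is a warm start, e.g. the previous epoch's bracket. Endpoint
    revalidation: if the cutoff has drifted outside [lo, hi], the bracket
    is shrunk toward 0 or doubled outward before bisection begins.
    """
    while not feasible(lo):          # cutoff moved below lo: shrink toward 0
        lo, hi = lo / 2, lo
        if lo < tol:
            lo = 0.0
            break
    while feasible(hi):              # cutoff moved above hi: double outward
        lo, hi = hi, max(2 * hi, tol)
    # Standard bisection on the revalidated bracket [lo, hi].
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid):
            lo = mid                 # mid is achievable: search higher
        else:
            hi = mid                 # mid infeasible: search lower
    return lo
```

For example, with a toy oracle `feasible = lambda t: t <= 0.7`, a warm start of `(0.2, 0.9)` brackets the cutoff and converges directly, while a stale bracket such as `(0.1, 0.3)` is first widened by the revalidation step before bisection resumes.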