Balancing Multi-modal Sensor Learning via Multi-objective Optimization

arXiv:2511.06686v2

Abstract: Learning-enabled control systems increasingly rely on multiple sensing modalities (e.g., vision, audio, language) for perception and decision support. A key challenge is that multi-modal sensor training dynamics are often imbalanced: fast-to-learn sensing channels dominate optimization, while slower channels remain underutilized, degrading reliability under sensing perturbations. Existing balancing strategies are largely heuristic and can require computationally intensive subroutines. In this paper, we reformulate multi-modal sensor learning as a multi-objective optimization (MOO) problem that explicitly prioritizes the worst-performing modality while retaining the nominal multi-modal sensor fusion objective. We then propose a simple gradient-based method, MIMO (multi-modal sensor learning via MOO), for the resulting formulation. We provide convergence guarantees and evaluate the method on standard multi-modal benchmarks. Results show improved balanced performance over state-of-the-art balanced multi-modal learning and MOO baselines, together with up to ~20x reduction in subroutine computation time, highlighting the suitability of MIMO for resource-constrained control pipelines.
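The abstract describes an objective that keeps the nominal fusion loss while explicitly prioritizing the worst-performing modality. The paper's actual MIMO update is not reproduced here; the following is a minimal, dependency-light sketch of that general idea, where all function names, the toy quadratic losses, and the weighting `lam` are illustrative assumptions, and the gradient is taken by finite differences purely to keep the example self-contained.

```python
import numpy as np

# Illustrative sketch only: combine a nominal fusion objective with the
# loss of the worst-performing modality, as the abstract describes.
# The losses, names, and hyperparameters below are toy assumptions,
# not the paper's actual MIMO algorithm.

def modality_losses(theta):
    # Two toy per-modality losses with different curvatures, mimicking
    # a fast-to-learn channel and a slow-to-learn channel.
    fast = 4.0 * (theta[0] - 1.0) ** 2
    slow = 0.5 * (theta[1] - 1.0) ** 2
    return np.array([fast, slow])

def fusion_loss(theta):
    # Nominal multi-modal fusion objective (toy quadratic).
    return np.sum((theta - 1.0) ** 2)

def mimo_style_step(theta, lr=0.1, lam=1.0, eps=1e-4):
    # Objective: fusion loss + lam * max over per-modality losses,
    # so the worst modality contributes extra gradient signal.
    def obj(t):
        return fusion_loss(t) + lam * np.max(modality_losses(t))

    # Central finite-difference gradient keeps the sketch self-contained.
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (obj(theta + e) - obj(theta - e)) / (2.0 * eps)
    return theta - lr * grad

theta = np.zeros(2)
for _ in range(200):
    theta = mimo_style_step(theta)
# Both parameters are driven toward their optimum (1.0), i.e. the slow
# modality is not left behind by the fast one.
```

In this toy setup the `max` term switches which modality receives the extra gradient weight, which is the balancing behavior the abstract attributes to the worst-modality prioritization.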
