Text-Conditional JEPA for Learning Semantically Rich Visual Representations
arXiv:2605.03245v1 Announce Type: new
Abstract: Image-based Joint-Embedding Predictive Architecture (I-JEPA) offers a promising approach to visual self-supervised learning through masked feature prediction. However, with the inherent visual uncertainty…