Languages in Whisper-Style Speech Encoders Align Both Phonetically and Semantically

arXiv:2505.19606v2

Abstract: Cross-lingual alignment in pretrained language models enables knowledge transfer across languages. Similar alignment has been reported in Whisper-style speech encoders, based on spoken translation retrieval using representational similarity. However, prior work does not control for phonetic overlap between equivalent utterances, which may artificially support retrieval. We conduct pronunciation-controlled experiments to test whether cross-lingual alignment arises from semantic rather than phonetic similarity. Results show that spoken translation retrieval remains strongly above chance without phonetic cues in the final layers of encoders trained with a speech translation objective, most clearly for models additionally trained on translation. We further test early-exiting the encoder to induce representations we hypothesize to be less tied to language-specific semantics. These experiments indeed reveal performance gains in automatic speech recognition on low-resource languages unseen during training.
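The spoken translation retrieval setup mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes utterance-level representations (e.g. pooled encoder states) and uses cosine similarity as the representational-similarity measure, with random vectors standing in for real encoder outputs so the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for utterance-level encoder representations: row i of
# `src` and row i of `tgt` correspond to an utterance and its spoken
# translation. In the actual setting these would come from a
# Whisper-style encoder; here we fabricate aligned vectors plus noise.
dim, n = 16, 8
base = rng.normal(size=(n, dim))
src = base + 0.05 * rng.normal(size=(n, dim))
tgt = base + 0.05 * rng.normal(size=(n, dim))

def cosine_retrieval_accuracy(src: np.ndarray, tgt: np.ndarray) -> float:
    """Fraction of source utterances whose nearest target
    (by cosine similarity) is the correct translation."""
    a = src / np.linalg.norm(src, axis=1, keepdims=True)
    b = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = a @ b.T                       # pairwise cosine similarities
    preds = sims.argmax(axis=1)          # nearest target per source
    return float((preds == np.arange(len(src))).mean())

print(cosine_retrieval_accuracy(src, tgt))
```

Chance accuracy here is 1/n; alignment is evidenced when retrieval accuracy sits well above that, and the pronunciation-controlled experiments ask whether this still holds once phonetically similar translation pairs no longer inflate the similarity scores.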
