UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions
arXiv:2604.22209v1 Announce Type: cross
Abstract: Generative audio modeling has largely been fragmented into three specialized tasks: text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. U…