Looking for recommendations for a small TTS model (<600M params) that can be fine-tuned on a local-language dataset.
I have ~150 hours of very clean single-speaker audio with accurate transcripts and pronunciation, around 45,000 text rows in total.
I’ve tried:
• Orpheus: quality is good, but the model is too large
• Qwen3 0.6B: terrible results
• Qwen3 1.7B: too slow
Need something lightweight, easy to fine-tune locally, and good for low-resource, non-English languages.
Would love recommendations from people who’ve actually fine-tuned smaller TTS models successfully.