Looking for recommendations for a small TTS model that can be fine-tuned on a local language dataset.

Looking for recommendations for a small TTS model (<600M params) that can be fine-tuned on a local language dataset.

I have ~150 hours of very clean single-speaker audio with accurate transcripts/pronunciation (around 45,000 text rows).

I’ve tried:
• Orpheus: quality is good, but the model is too large
• Qwen3 0.6B: terrible results
• Qwen3 1.7B: too slow

Need something lightweight, easy to fine-tune locally, and good for low-resource/non-English languages.
Would love recommendations from people who’ve actually fine-tuned smaller TTS models successfully.

submitted by /u/ContentAmbassador953