Qwen3-TTS but in OpenVINO, from scratch

Hello everyone,

I finally got around to packaging my implementation of Qwen3-TTS in OpenVINO format as a codebase. The work was done in early 2026 and merged into OpenArc in March, and I kept forgetting to release the code. Here we are. https://github.com/SearchSavior/Qwen3-TTS-OpenVINO

One guy from our Discord speaks Russian and I wanted to voice clone Elmo on my A770, so I decided to reimplement Qwen3-TTS from scratch in PyTorch, ignoring transformers (except for AutoTokenizer, my beloved) to really get inside how you design a conversion to OpenVINO's model format.

The key learning: take an nn.Module with some logic, its forward method, study the data flow, then iterate until you find the combination of data flow and device placement that lets the OpenVINO compiler choose the best kernels. Interfering with this process, i.e. custom kernels, is a totally separate mission for future work. There were a ton of steps in between, and another lesson from this project was taking better notes.

AI assistance was used... but honestly I'm not sure how it could have been done without it. Even Opus 4.5 could not make good OpenVINO-flavored choices, especially around the stateful KV cache, and could not anticipate kernel fusion without extensive guidance. Intel does not put enough effort into documenting their engineering practices... which makes OpenVINO feel not so open after all. BUT, with AI tools and some effort, it is possible.

This codebase can be generalized to optimizing any PyTorch model for OpenVINO IR format. I tried to make sure the code is easy to follow, but it is quite demanding conceptually: it draws on poorly documented OpenVINO concepts that Opus implemented from targeted upstream examples I was able to conjure from memory, with hours of testing on top. Though AI-assisted, this code was in no way full-send vibe coded.

It's all live in OpenArc now, covering only the 1.7B size on CPUs and GPUs; I had issues with 0.6B that I did not investigate further. NPU support PRs are most welcome.

Unlike other implementation posts, I haven't included any benchmarks, mostly due to time constraints plus changes I made to the inference code in the OpenArc PR versus what's in this repo. If there is interest, we can bench OpenArc against PyTorch on CPU/XPU.

submitted by /u/Echo9Zulu-