QWEN3 TTS (text to speech)
Simon Scholz, 2026-02-26, 3 min read

Nowadays I'd rather gain knowledge by listening than by reading through a blog post or script. Lightweight content in particular can be consumed while walking or doing other physical activities.

Recently my nephew, who is currently studying at a university, asked me whether I could turn the scripts he got from his professor into audio.

I tried different tools such as Piper TTS and Parler-TTS, and then heard of Qwen3 TTS.

Install Dependencies

# audio processing
sudo apt update && sudo apt install sox libsox-dev

Setup Python Environment

Ensure that you have Python installed on your system.

# create a directory for the project
mkdir qwen-tts
cd qwen-tts

# create a virtual environment
python -m venv qwen_tts_env

# activate the virtual environment
source qwen_tts_env/bin/activate

# install packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install qwen-tts soundfile

Since I want to run this on my local machine, I installed the CPU version of PyTorch. If you have a compatible GPU, you can install the GPU version for better performance.
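
As a minimal sketch of the GPU option mentioned above, the device could be chosen at runtime instead of being hard-coded. The helper name `pick_device` is hypothetical, and passing `"cuda:0"` as `device_map` to the model is an assumption about how `qwen_tts` handles that argument. With the CPU-only PyTorch wheel installed above, the check reports no GPU and the helper returns `"cpu"`.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda:0" if PyTorch reports a usable CUDA GPU, otherwise "cpu"."""
    # If torch is not installed at all, fall back to "cpu" instead of crashing.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda:0" if torch.cuda.is_available() else "cpu"


print(f"Selected device: {pick_device()}")
```

The returned string could then be used in place of the literal `"cpu"` when loading the model.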

Convert Text to Speech

Inside the project directory, create a Python script named text_to_speech.py. In the following code snippet, we load the Qwen3 TTS model, generate speech from a sample text, and save the output as a WAV file.

import os
import time
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel
import transformers

# --- CONFIGURATION ---
SPEAKER = "Aiden"
NUM_THREADS = 6
MODEL_ID = "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice"
OUTPUT_FILE = "output.wav"
# ---------------------

# How many threads to use for CPU processing (adjust based on your CPU cores)
torch.set_num_threads(NUM_THREADS)

# Suppress verbose logging from the model
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
transformers.logging.set_verbosity_error()


print(f"--- Running with {torch.get_num_threads()} threads ---")

# track overall time of the process
overall_start = time.perf_counter()

model = Qwen3TTSModel.from_pretrained(
    MODEL_ID,
    device_map="cpu",
    torch_dtype=torch.float32,
)

print("--- Model Loaded. Starting Generation... ---")

wavs, sr = model.generate_custom_voice(
    text="Hallo! Danke, dass du meine Tutorials liest. Probier doch mal verschiedene Sprecher aus.",
    language="German",
    speaker=SPEAKER,
)

# Save the result
sf.write(OUTPUT_FILE, wavs[0], sr)
print(f"--- Finished! Audio saved as {OUTPUT_FILE} in {time.perf_counter() - overall_start:.2f} seconds ---")

In the configuration section, you can specify the speaker, number of threads for CPU processing, model ID, and output file name.
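
If editing the script for every run becomes tedious, the same settings could be read from the command line. The following is only a sketch: the flag names (`--speaker`, `--threads`, `--model-id`, `--output`) are hypothetical and not part of `qwen_tts`; the defaults mirror the configuration section of the script.

```python
import argparse


def parse_args(argv=None):
    """Parse the TTS settings; the defaults match the constants in the script."""
    parser = argparse.ArgumentParser(description="Generate speech with Qwen3 TTS")
    parser.add_argument("--speaker", default="Aiden", help="voice to use")
    parser.add_argument("--threads", type=int, default=6, help="CPU threads for PyTorch")
    parser.add_argument("--model-id", default="Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice")
    parser.add_argument("--output", default="output.wav", help="path of the WAV file")
    return parser.parse_args(argv)


# An explicit argument list is passed here for illustration;
# in a real script, parse_args() without arguments reads sys.argv.
args = parse_args(["--speaker", "Ryan", "--threads", "4"])
print(args.speaker, args.threads, args.model_id, args.output)
```

The parsed values would then replace `SPEAKER`, `NUM_THREADS`, `MODEL_ID` and `OUTPUT_FILE` in the script.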

Available speakers include

  • Aiden
  • Lenn
  • Ryan
  • and more.

Also see https://qwen.ai/blog?id=qwen3-tts-1128 for more details.

For this German example I consider Aiden to be the best choice. Let me know in the comments which speaker you like best.
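
To compare speakers side by side, one option is to render the same sentence once per speaker and write each result to its own file. This is a rough sketch: `speaker_filename` and `render_speakers` are hypothetical helpers, and `render_speakers` assumes an already loaded model whose `generate_custom_voice` call behaves as in the script above.

```python
def speaker_filename(speaker: str, prefix: str = "output") -> str:
    """Build a file name such as output_aiden.wav from a speaker name."""
    # Replace anything that is not a letter or digit so the name is file-system safe.
    safe = "".join(c if c.isalnum() else "_" for c in speaker.strip())
    return f"{prefix}_{safe.lower()}.wav"


def render_speakers(model, speakers, text, language="German"):
    """Generate one WAV file per speaker and return the list of written paths."""
    import soundfile as sf  # imported lazily so speaker_filename works without it

    written = []
    for speaker in speakers:
        wavs, sr = model.generate_custom_voice(
            text=text, language=language, speaker=speaker
        )
        path = speaker_filename(speaker)
        sf.write(path, wavs[0], sr)
        written.append(path)
    return written
```

Calling `render_speakers(model, ["Aiden", "Ryan"], "Hallo!")` would then produce `output_aiden.wav` and `output_ryan.wav` for listening.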

Run the script

python text_to_speech.py
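
The sample above synthesizes a single short sentence. For longer inputs such as lecture scripts, it may help to split the text into smaller pieces first and generate audio piece by piece. The sketch below shows one naive way to do that; the helper `split_into_chunks` and the 300-character default are assumptions for illustration, not a limit documented for the model.

```python
import re


def split_into_chunks(text: str, max_chars: int = 300) -> list[str]:
    """Group whole sentences into chunks of roughly max_chars characters."""
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    # Abbreviations such as "z. B." will be split incorrectly.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("Eins. Zwei! Drei?", max_chars=11))
```

Each chunk could then be passed to `model.generate_custom_voice(...)` in turn and the resulting arrays joined, for example with `numpy.concatenate`, before writing a single WAV file. That assumes every call returns audio at the same sample rate. A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence.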

Convert WAV to M4A

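To decrease the file size, you can convert the generated WAV file to M4A format using the sox command-line tool. Note that many SoX builds ship without an M4A handler; if sox reports that it has no handler for the m4a extension, use ffmpeg instead.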

sox output.wav output.m4a
# specify bitrate
sox output.wav -C 128 output.m4a

The popular ffmpeg tool can also be used for this purpose:

ffmpeg -i output.wav -c:a aac -b:a 128k output.m4a
  • -i output.wav: Specifies the input WAV file.
  • -c:a aac: Uses the AAC codec for audio encoding.
  • -b:a 128k: Sets the audio bitrate to 128 kbps.
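
If the conversion should happen right after generation, ffmpeg could also be called from Python. This is a minimal sketch using only the standard library; `build_ffmpeg_command` and `convert_to_m4a` are hypothetical helper names, and the command uses ffmpeg's `-b:a` option to request the target bitrate.

```python
import shutil
import subprocess


def build_ffmpeg_command(src: str, dst: str, bitrate: str = "128k") -> list[str]:
    """Return the ffmpeg argument list for an AAC encode at the given bitrate."""
    return ["ffmpeg", "-i", src, "-c:a", "aac", "-b:a", bitrate, dst]


def convert_to_m4a(src: str, dst: str, bitrate: str = "128k") -> None:
    """Run ffmpeg and raise an error if it is missing or the conversion fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit code.
    subprocess.run(build_ffmpeg_command(src, dst, bitrate), check=True)


print(" ".join(build_ffmpeg_command("output.wav", "output.m4a")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.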

Sources