Supertonic 3: On-Device Text-to-Speech with ONNX Runtime for CPU

Stanley Ulili

Updated on May 23, 2026

Why on-device TTS matters
Setup
Python usage
Test results
Performance summary
Limitations
SDK support
When to use Supertonic 3
Final thoughts

Supertonic 3 is a local text-to-speech engine that runs entirely on CPU via ONNX Runtime. It requires no API key and no GPU, and once the model files are cached it needs no internet connection. At 99 million parameters, it is substantially smaller than most open-source TTS systems, which typically range from 0.7B to 2B parameters. It supports 31 languages and outputs 44.1 kHz, 16-bit WAV audio directly.

Green checkmarks next to "No API key," "No cloud request," and "No GPU" summarizing Supertonic's on-device benefits

Why on-device TTS matters

Cloud TTS APIs (OpenAI, Google, ElevenLabs) bill per character or request. For small projects the cost is negligible, but for production applications with high usage the bill scales directly with user activity.

Bar chart showing a monthly speech bill escalating from $9,225 to $24,536 as an app grows

Beyond cost, cloud APIs introduce two other constraints. Network round-trip latency adds delay to every utterance, which hurts real-time conversational applications, and sending user text to a third-party server creates privacy and compliance issues for sensitive use cases.

Local models have historically traded those problems for others: multi-gigabyte model downloads, GPU requirements that exclude most consumer hardware, slow cold-start times, and poor handling of real-world text containing numbers, symbols, and non-standard formatting.

Supertonic 3 aims to avoid both sets of problems: it is a local model small enough to bundle, fast enough on CPU for real-time use, and reasonably accurate on standard text.

Setup

mkdir supertonic_test && cd supertonic_test

python -m venv venv && source venv/bin/activate

pip install supertonic

Model files are downloaded and cached on first run. No additional configuration is required.

Python usage

main.py

import os
import subprocess
import time
from supertonic import TTS

OUTPUT_DIR = "output"
os.makedirs(OUTPUT_DIR, exist_ok=True)

tts = TTS(auto_download=True)

voices = ["M1", "M3", "M5", "F2", "F5"]  # every voice used in demos below
styles = {voice: tts.get_voice_style(voice_name=voice) for voice in voices}

demos = [
    ("en", "M1", "This is Supertonic running here on my Mac."),
    ("en", "M5", "The total invoice is $12,458.75 due on June 15th, 2026. Please call (555) 123-4567."),
    ("ar", "F2", "أنت دائمًا في قلبي. سأراك قريبًا يا حبيبي."),
    ("fr", "F5", "Si tu entends ça, souris pour moi. J'adore ton sourire."),
    ("ko", "M3", "안녕하세요, Supertonic은 기기에서 완전히 실행되며,"),
]

for i, (lang, voice_name, text) in enumerate(demos):
    style = styles.get(voice_name, tts.get_voice_style(voice_name=voice_name))
    gen_start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()

    wav, duration = tts.synthesize(
        text=text,
        lang=lang,
        voice_name=voice_name,
        voice_style=style,
        total_steps=0,   # 0 = fast/lower quality; 12 = best quality (default)
        speed=1.0,
        excellent=False  # True for better quality at cost of speed
    )

    gen_time = time.perf_counter() - gen_start
    # Speedup over real time: seconds of audio produced per second of compute.
    speedup = duration / gen_time if gen_time > 0 else 0

    filename = os.path.join(OUTPUT_DIR, f"demo_{lang}_{voice_name}_{i}.wav")
    tts.save_audio(wav, filename)
    print(f"[{i+1}] {duration:.2f}s audio in {gen_time:.2f}s (~{speedup:.1f}x real-time) -> {filename}")

    try:
        subprocess.run(["afplay", filename], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except FileNotFoundError:
        pass  # afplay is macOS-specific; use aplay on Linux

total_steps=0 uses the fast inference path. Setting it to 12 (the default) improves quality at the cost of generation time. Setting excellent=True enables a separate quality-enhancement pass, again at the cost of speed.

Test results

Standard English

Input: "This is Supertonic running here on my Mac. If you like this, subscribe to the Better Stack channel."

Clear, natural-sounding output. Generation speed is well above real-time (audio is produced substantially faster than its playback duration). This baseline test passes without issues.
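The script divides audio duration by generation time, which gives a speedup over real time; the conventional real-time factor (RTF) is the inverse. A minimal sketch of both figures follows. The function name and the sample numbers are illustrative assumptions, not measurements from this test.

```python
def speed_metrics(audio_seconds: float, generation_seconds: float) -> dict:
    """Return the speedup over real time and the conventional real-time factor."""
    if audio_seconds <= 0 or generation_seconds <= 0:
        raise ValueError("both durations must be positive")
    return {
        "speedup": audio_seconds / generation_seconds,  # > 1 means faster than playback
        "rtf": generation_seconds / audio_seconds,      # < 1 means faster than playback
    }


# Hypothetical numbers, not measurements: 4 s of audio generated in 0.5 s.
metrics = speed_metrics(4.0, 0.5)
print(f"{metrics['speedup']:.1f}x real-time, RTF {metrics['rtf']:.3f}")
```

Reporting both avoids confusion when comparing against benchmarks that quote RTF, where lower is better.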

Complex numbers and formatted text

Input: "The total invoice is $12,458.75 due on June 15th, 2026. Please call (555) 123-4567."

This is where the local version shows a significant limitation. The currency amount is not normalized correctly: the model reads the digits as disconnected numbers rather than as "twelve thousand, four hundred fifty-eight dollars and seventy-five cents." The year, by contrast, is handled correctly.

This is a practical problem for any application that reads financial data, invoices, prices, or phone numbers. The underlying voice quality is fine; the issue is text normalization (converting symbols and formatted numbers into pronounceable words). The free local version does not include robust normalization for these patterns.
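One way to work around the gap is to expand the problem patterns into words before synthesis. The sketch below is a rough illustration, not part of Supertonic: it handles only US-style dollar amounts and phone numbers written as (555) 123-4567, and every name in it (int_to_words, normalize_for_tts, and so on) is made up for this example.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = [(1_000_000, "million"), (1_000, "thousand")]


def int_to_words(n: int) -> str:
    """Spell out a non-negative integer below one billion in English."""
    if not 0 <= n < 1_000_000_000:
        raise ValueError("supported range is 0 to 999,999,999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[ones]}" if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return f"{ONES[hundreds]} hundred" + (f" {int_to_words(rest)}" if rest else "")
    for value, name in SCALES:
        if n >= value:
            head, rest = divmod(n, value)
            return f"{int_to_words(head)} {name}" + (f" {int_to_words(rest)}" if rest else "")


# Dollar amounts such as $12,458.75 or $40, with optional two-digit cents.
CURRENCY = re.compile(r"\$(\d{1,3}(?:,\d{3})+|\d+)(?:\.(\d{2}))?(?!\d)")
# Phone numbers in the (555) 123-4567 layout only.
PHONE = re.compile(r"\((\d{3})\)\s*(\d{3})-(\d{4})")


def currency_to_words(match: re.Match) -> str:
    dollars = int(match.group(1).replace(",", ""))
    cents = int(match.group(2)) if match.group(2) else 0
    if dollars >= 1_000_000_000:
        return match.group(0)  # out of range for this sketch: leave untouched
    words = f"{int_to_words(dollars)} dollar{'' if dollars == 1 else 's'}"
    if cents:
        words += f" and {int_to_words(cents)} cent{'' if cents == 1 else 's'}"
    return words


def phone_to_words(match: re.Match) -> str:
    # Read each digit separately, pausing (comma) between the three groups.
    return ", ".join(" ".join(ONES[int(d)] for d in part) for part in match.groups())


def normalize_for_tts(text: str) -> str:
    """Expand dollar amounts and phone numbers into speakable words."""
    text = CURRENCY.sub(currency_to_words, text)
    return PHONE.sub(phone_to_words, text)


print(normalize_for_tts(
    "The total invoice is $12,458.75 due on June 15th, 2026. Please call (555) 123-4567."
))
```

The normalized string can then be passed as the text argument in main.py. Dates, other currencies, percentages, and non-English text would each need their own rules.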

Multilingual performance

Arabic, French, and Korean all produced clear, accurate-sounding output on standard sentences. For a model of this size, multilingual quality is a notable strength and holds up well against cloud alternatives for everyday text.

Performance summary

Word Error Rate chart for Supertonic 3 across various languages showing competitive accuracy

Speed on CPU is consistently above real-time for short to medium-length text. Cold start (first initialization) takes a few seconds while models load; subsequent calls are fast.
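To put numbers on the cold-start claim for a given machine, the first call can be timed separately from later ones. This sketch uses a stand-in engine so it runs without the model; time_cold_and_warm and FakeEngine are hypothetical names, and the sleep values are placeholders rather than Supertonic measurements.

```python
import time
from typing import Callable, List, Tuple


def time_cold_and_warm(make_engine: Callable[[], object],
                       run: Callable[[object], object],
                       warm_runs: int = 3) -> Tuple[float, List[float]]:
    """Time engine construction plus the first call, then time later calls."""
    start = time.perf_counter()
    engine = make_engine()
    run(engine)
    cold = time.perf_counter() - start

    warm = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        run(engine)
        warm.append(time.perf_counter() - start)
    return cold, warm


class FakeEngine:
    """Stand-in for a TTS engine: slow to construct, quick to call."""

    def __init__(self):
        time.sleep(0.05)   # pretend model loading

    def speak(self):
        time.sleep(0.005)  # pretend synthesis


cold, warm = time_cold_and_warm(FakeEngine, lambda engine: engine.speak())
print(f"cold start: {cold:.3f}s, fastest warm call: {min(warm):.3f}s")
```

To measure Supertonic itself, the TTS constructor and the synthesize call from main.py would take the place of FakeEngine and speak.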

Limitations

Complex number formatting. Currency, large numbers, and formatted dates need to be expanded into words before the text reaches Supertonic, either by preprocessing the text yourself or by running a separate text normalization step. Without this, output quality degrades significantly for data-heavy applications.

Emotional tags are paywalled. Tags like <laugh> and <sigh> are not available in the free local version. Accessing expressive features requires Supertone's paid cloud API, which reintroduces network dependency and per-request billing.

Supertone pricing page showing tiers for unlimited generation and API access

SDK support

Supertonic provides official examples and SDKs for Python, Node.js, C++, C#, Go, Rust, and Flutter. It also provides an OpenAI-compatible local server mode, which allows drop-in replacement of OpenAI TTS API calls in existing code.
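If the local server follows the OpenAI speech endpoint shape (a POST to /audio/speech with model, input, voice, and response_format fields), a request could be built as in the sketch below. The host, port, model name, and voice value are assumptions made for illustration; check the Supertonic documentation for the actual values before relying on them.

```python
import json
import urllib.request


def build_speech_request(base_url: str, text: str, voice: str,
                         model: str = "supertonic") -> urllib.request.Request:
    """Build an OpenAI-style text-to-speech request without sending it."""
    payload = {
        "model": model,            # assumed model name, not confirmed by the source
        "input": text,
        "voice": voice,            # assumed to accept Supertonic voice names
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local address; the real host and port depend on how the server is started.
request = build_speech_request("http://localhost:8000/v1", "Hello from a local server.", "M1")
print(request.get_method(), request.full_url)

# Sending is left commented out because it needs the local server to be running.
# with urllib.request.urlopen(request) as response, open("speech.wav", "wb") as f:
#     f.write(response.read())
```

Existing code that already uses an OpenAI client library would typically only need its base URL pointed at the local server instead.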

Logos for supported languages including Python, Go, Java, C#, C++, Flutter, and Node.js

When to use Supertonic 3

Supertonic 3 is well-suited for: privacy-first voice agents where user text must not leave the device, offline-capable applications, rapid prototyping where cost is a concern, desktop and embedded software where the hardware environment is controlled, and multilingual content readers.

For applications that need to read financial data, invoice amounts, or any structured numeric text, the normalization gap needs to be addressed first, either by preprocessing text before synthesis or by evaluating whether the paid API's normalization is acceptable. For expressive or emotional speech (games, audiobooks, virtual characters), cloud options or Supertone's paid tier will produce better results.

Final thoughts

Supertonic 3 delivers on its core promise for standard text: fast, private, CPU-only synthesis with genuine multilingual support in a small package. The pip install supertonic path and OpenAI-compatible server mode make integration straightforward.

The text normalization weakness is the main practical constraint. Applications that generate text programmatically (from APIs, databases, or templates) should audit whether their outputs will contain currency, complex numbers, or special symbols before committing to the local version. For clean prose or conversational speech, the quality is production-suitable.
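A simple audit along those lines can be run over a sample of generated strings before adopting the local version. The pattern set below is a rough starting point chosen for this example, not an exhaustive list, and audit_text is a hypothetical helper name.

```python
import re

# Patterns that commonly need normalization before synthesis (illustrative only).
RISKY_PATTERNS = {
    "currency": re.compile(r"[$€£¥]\s?\d"),
    "grouped_or_decimal_number": re.compile(r"\d[,.]\d"),
    "long_digit_run": re.compile(r"\d{5,}"),
    "phone_like": re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}"),
    "symbol": re.compile(r"[%#@&=+/\\^*<>|~]"),
}


def audit_text(text: str) -> list:
    """Return the sorted names of risky patterns found in the text."""
    return sorted(name for name, pattern in RISKY_PATTERNS.items() if pattern.search(text))


samples = [
    "This is Supertonic running here on my Mac.",
    "The total invoice is $12,458.75 due on June 15th, 2026. Please call (555) 123-4567.",
]
for sample in samples:
    print(audit_text(sample) or "clean", "<-", sample)
```

Strings that come back with no flags are the ones most likely to be read correctly without preprocessing.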

Documentation and additional examples are at supertone.ai.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
