Supertonic 3: On-Device Text-to-Speech with ONNX Runtime for CPU
Supertonic 3 is a local text-to-speech engine that runs entirely on CPU via ONNX Runtime. It requires no API key, no internet connection, and no GPU. At 99 million parameters, it is substantially smaller than most open-source TTS systems (which typically range from 0.7B to 2B parameters). It supports 31 languages and writes 44.1 kHz, 16-bit WAV output directly.
Why on-device TTS matters
Cloud TTS APIs (OpenAI, Google, ElevenLabs) bill per character or request. For small projects the cost is negligible, but for production applications with high usage the bill scales directly with user activity.
Beyond cost, cloud APIs introduce two other constraints: network round-trip latency makes them unsuitable for real-time conversational applications, and sending user text to a third-party server creates privacy and compliance issues for sensitive use cases.
Local models have historically traded those problems for others: large model sizes (multiple gigabytes in some cases), GPU requirements that exclude most consumer hardware, slow cold-start times, and poor handling of real-world text containing numbers, symbols, and non-standard formatting.
Supertonic 3 targets the gap between the two: a local model small enough to bundle with an application, fast enough on CPU for real-time use, and reasonably accurate on standard text.
Setup
Install the package with pip install supertonic. Model files are downloaded and cached on first run. No additional configuration is required.
Python usage
total_steps=0 uses the fast inference path. Setting it to 12 (the default) improves quality at the cost of generation time. The excellent=False flag controls a separate quality enhancement pass.
Test results
Standard English
Input: "This is Supertonic running here on my Mac. If you like this, subscribe to the Better Stack channel."
Clear, natural-sounding output. Generation speed is well above real-time (audio is produced substantially faster than its playback duration). This baseline test passes without issues.
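"Above real-time" can be put into a number as a real-time factor: generation time divided by audio duration, where anything below 1.0 is faster than playback. The sketch below is one way to measure it using only the Python standard library. The helper names are illustrative and not part of Supertonic, and the generation time is assumed to be measured separately around the synthesis call.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the playback duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Generation time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds
```

For example, a clip that plays for 4 seconds and took 1 second to generate has a real-time factor of 0.25.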
Complex numbers and formatted text
Input: "The total invoice is $12,458.75 due on June 15th, 2026. Please call (555) 123-4567."
This is where the local version shows a significant limitation. The currency amount is not normalized correctly: the model reads the digits as disconnected numbers rather than as "twelve thousand, four hundred fifty-eight dollars and seventy-five cents." The year in the date is handled correctly, but the monetary value is not.
This is a practical problem for any application that reads financial data, invoices, prices, or phone numbers. The underlying voice quality is fine; the issue is text normalization (converting symbols and formatted numbers into pronounceable words). The free local version does not include robust normalization for these patterns.
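One workaround is to expand currency amounts into words before the text reaches the model. The following is a minimal sketch of that idea, not Supertonic's own normalizer: it handles only US-dollar amounts written with a leading $, and every name in it (number_to_words, normalize_currency) is hypothetical.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = ["", " thousand", " million", " billion"]

# Assumption: amounts look like $1,234.56 or $1234 (US formatting only).
CURRENCY = re.compile(r"\$(\d+(?:,\d{3})*)(?:\.(\d{2}))?(?!\d)")


def _below_thousand(n: int) -> str:
    """Spell out an integer in the range 1..999."""
    parts = []
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        word = TENS[n // 10]
        if n % 10:
            word += "-" + ONES[n % 10]
        parts.append(word)
    elif n > 0:
        parts.append(ONES[n])
    return " ".join(parts)


def number_to_words(n: int) -> str:
    """Spell out a non-negative integer below one trillion."""
    if not 0 <= n < 10**12:
        raise ValueError("number out of supported range")
    if n == 0:
        return "zero"
    groups = []
    scale = 0
    while n > 0:
        n, chunk = divmod(n, 1000)
        if chunk:
            groups.append(_below_thousand(chunk) + SCALES[scale])
        scale += 1
    return ", ".join(reversed(groups))


def _speak_currency(match: re.Match) -> str:
    dollars = int(match.group(1).replace(",", ""))
    cents = int(match.group(2) or 0)
    spoken = f"{number_to_words(dollars)} dollar{'' if dollars == 1 else 's'}"
    if cents:
        spoken += f" and {number_to_words(cents)} cent{'' if cents == 1 else 's'}"
    return spoken


def normalize_currency(text: str) -> str:
    """Replace every $ amount in text with its spoken form."""
    return CURRENCY.sub(_speak_currency, text)
```

Applied to the test sentence, the amount becomes "twelve thousand, four hundred fifty-eight dollars and seventy-five cents" and the rest of the text is left untouched. Other currencies, negative amounts, and abbreviations such as "$3M" would need their own rules.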
Multilingual performance
Arabic, French, and Korean all produced clear, accurate-sounding output on standard sentences. For a model of this size, multilingual quality is a notable strength and holds up well against cloud alternatives for everyday text.
Performance summary
Speed on CPU is consistently above real-time for short to medium-length text. Cold start (first initialization) takes a few seconds while models load; subsequent calls are fast.
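To separate cold-start cost from steady-state speed on your own hardware, time several consecutive calls and compare the first against the rest. The sketch below is engine-agnostic: synthesize stands for any callable that takes text and produces audio, so it makes no assumption about Supertonic's actual API, and the function names are illustrative.

```python
import time
from typing import Callable, List


def time_calls(synthesize: Callable[[str], object], text: str, runs: int = 3) -> List[float]:
    """Call synthesize(text) `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        durations.append(time.perf_counter() - start)
    return durations


def cold_start_overhead(durations: List[float]) -> float:
    """Estimate cold-start cost as the first call minus the fastest later call."""
    if len(durations) < 2:
        raise ValueError("need at least two timed calls")
    return durations[0] - min(durations[1:])
```

The first duration only includes model loading if the engine is created or lazily initialized inside the timed callable, so construct it there when cold start is what you want to measure.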
Limitations
Complex number formatting. Currency, large numbers, and formatted dates require either preprocessing the text before passing it to Supertonic, or using a separate text normalization step. Without this, output quality degrades significantly for data-heavy applications.
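Preprocessing of this kind is usually a chain of small, pattern-specific rewrites. As one more illustration, the sketch below spells out North American phone numbers digit by digit, with commas between groups so the voice has somewhere to pause. It is an assumption-laden example rather than a complete normalizer: the pattern covers only forms like (555) 123-4567 and 555-123-4567, and normalize_phone is a hypothetical helper name.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# Assumption: numbers are written as (555) 123-4567, 555-123-4567, or 555.123.4567.
PHONE = re.compile(r"\(?\b(\d{3})\)?[ .-]?(\d{3})[ .-](\d{4})\b")


def _speak_phone(match: re.Match) -> str:
    """Read each digit group aloud, separated by commas."""
    return ", ".join(
        " ".join(DIGIT_WORDS[int(digit)] for digit in group)
        for group in match.groups()
    )


def normalize_phone(text: str) -> str:
    """Replace phone numbers in text with digit-by-digit words."""
    return PHONE.sub(_speak_phone, text)
```

With this in place, "Please call (555) 123-4567." becomes "Please call five five five, one two three, four five six seven."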
Emotional tags are paywalled. Tags like <laugh> and <sigh> are not available in the free local version. Accessing expressive features requires Supertone's paid cloud API, which reintroduces network dependency and per-request billing.
SDK support
Supertonic provides official examples and SDKs for Python, Node.js, C++, C#, Go, Rust, and Flutter. It also provides an OpenAI-compatible local server mode, which allows drop-in replacement of OpenAI TTS API calls in existing code.
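Because the server mode mirrors OpenAI's API, a client only needs to send the same JSON body to a local address. The sketch below builds such a request against the /audio/speech path that OpenAI's TTS endpoint uses, with nothing beyond the standard library. The host, port, model name, and voice name are placeholders chosen for illustration, not values confirmed by Supertonic's documentation, so check the docs before relying on them.

```python
import json
import urllib.request

# Assumption: address of the local server; the real host and port may differ.
LOCAL_BASE_URL = "http://localhost:8000/v1"


def build_speech_request(text, base_url=LOCAL_BASE_URL,
                         model="supertonic-3", voice="default"):
    """Build a POST request shaped like OpenAI's /audio/speech call.

    The model and voice defaults are hypothetical placeholders.
    """
    payload = {
        "model": model,
        "voice": voice,
        "input": text,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # A local server may ignore this; OpenAI clients always send one.
            "Authorization": "Bearer not-needed",
        },
        method="POST",
    )


def synthesize_to_file(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

Code that already uses an OpenAI client library should, in principle, need only its base URL changed to the local address.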
When to use Supertonic 3
Supertonic 3 is well-suited for: privacy-first voice agents where user text must not leave the device, offline-capable applications, rapid prototyping where cost is a concern, desktop and embedded software where the hardware environment is controlled, and multilingual content readers.
For applications that need to read financial data, invoice amounts, or any structured numeric text, the normalization gap needs to be addressed first, either by preprocessing text before synthesis or by evaluating whether the paid API's normalization is acceptable. For expressive or emotional speech (games, audiobooks, virtual characters), cloud options or Supertone's paid tier will produce better results.
Final thoughts
Supertonic 3 delivers on its core promise for standard text: fast, private, CPU-only synthesis with genuine multilingual support in a small package. The pip install supertonic path and OpenAI-compatible server mode make integration straightforward.
The text normalization weakness is the main practical constraint. Applications that generate text programmatically (from APIs, databases, or templates) should audit whether their outputs will contain currency, complex numbers, or special symbols before committing to the local version. For clean prose or conversational speech, the quality is production-suitable.
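Such an audit can start as a simple scan over representative output text. The sketch below flags a few pattern families that the tests above suggest are risky. The pattern set and names are illustrative assumptions, not an exhaustive list of what the local model mishandles.

```python
import re

# Hypothetical pattern set; extend it with whatever your own data contains.
RISKY_PATTERNS = {
    "currency": re.compile(r"[$€£]\s?\d"),
    "grouped_number": re.compile(r"\b\d{1,3}(?:,\d{3})+\b"),
    "decimal": re.compile(r"\b\d+\.\d+\b"),
    "phone": re.compile(r"\(?\b\d{3}\)?[ .-]?\d{3}[ .-]\d{4}\b"),
    "symbol": re.compile(r"[%@#&+=/\\^~|<>]"),
}


def audit_text(text: str) -> list:
    """Return the sorted names of risky patterns found in text."""
    return sorted(
        name for name, pattern in RISKY_PATTERNS.items() if pattern.search(text)
    )
```

Running this over a sample of generated strings and counting how often each name appears gives a rough estimate of how much preprocessing the local version would need.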
Documentation and additional examples are at supertone.ai.