Audio: Text-to-Speech & Speech-to-Text
Atelico ships two on-device audio models out of the box:
- TTS — Kokoro 82M (54 voices, 9 languages) and Pocket TTS (English-only, fastest synthesis on Metal/CUDA, instant voice cloning).
- STT — OpenAI Whisper (tiny / base / small / medium / large-v3 / large-v3-turbo / distil-large-v3, English-only and multilingual variants, plus quantized GGUF builds).
Both are wired into the in-memory:: backend and exposed via OpenAI-compatible HTTP routes:
| Route | Method | Body |
|---|---|---|
| /v1/audio/speech | POST | JSON — text to synthesize |
| /v1/audio/transcriptions | POST | multipart form upload — audio file |
Text-to-Speech
Quick start
curl http://localhost:11434/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "in-memory::tts",
"input": "Hello from Atelico.",
"voice": "af_heart"
}' \
--output hello.wav
Response is a WAV file (PCM, 24 kHz for both Kokoro and Pocket TTS). MP3/Opus/AAC encoding is not currently supported.
Choosing a model
The model field after the in-memory:: prefix selects the engine:
| model value | Engine | Languages | Voice cloning |
|---|---|---|---|
| tts, kokoro, kokoro-82m | Kokoro 82M | 9 (en-US, en-GB, ja, zh, es, fr, hi, it, pt-BR) | Evolutionary (slow — see below) |
| pocket, pocket-tts | Pocket TTS | English only | Instant (single reference clip) |
tts defaults to kokoro-82m. Unrecognised ids return a 5xx with a clear error string.
Streaming
Pass "stream": true to receive Server-Sent Events with one AudioSpeechChunk per sentence/clause as it's synthesized. Both Kokoro and Pocket TTS support streaming with sentence-boundary chunking; Pocket TTS additionally sub-splits long sentences at clause boundaries (,;:—–) to stay inside the model's 50-token training window. Each native SDK exposes the same per-sentence streaming through its idiomatic primitive (Rust StreamHandle, Python iterator, FFI poll loop, Unity callback, Unreal multicast delegate, Godot signals); see the SDK usage section below for examples.
curl -N http://localhost:11434/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "in-memory::pocket",
"input": "First sentence. Second one comes right after.",
"voice": "alba",
"stream": true
}'
Each SSE event payload:
{
"sequence": 0,
"audio": "<base64 WAV bytes>",
"duration_seconds": 1.42,
"text": "First sentence."
}
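Outside the SDKs, the stream can be consumed straight off the HTTP response. A minimal Rust sketch, assuming reqwest (with the stream feature), futures-util, tokio, serde_json, and base64 as dependencies; the blank-line event framing and the data: prefix are standard SSE, and the payload fields are the ones shown above.
use base64::Engine as _;
use futures_util::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::Client::new()
        .post("http://localhost:11434/v1/audio/speech")
        .json(&serde_json::json!({
            "model": "in-memory::pocket",
            "input": "First sentence. Second one comes right after.",
            "voice": "alba",
            "stream": true
        }))
        .send()
        .await?;

    let mut buf = String::new();
    let mut body = resp.bytes_stream();
    while let Some(bytes) = body.next().await {
        buf.push_str(std::str::from_utf8(&bytes?)?);
        // Events are separated by a blank line; each carries one "data: " JSON payload.
        while let Some(end) = buf.find("\n\n") {
            let event: String = buf.drain(..end + 2).collect();
            let Some(data) = event.lines().find_map(|l| l.strip_prefix("data: ")) else {
                continue;
            };
            if let Ok(chunk) = serde_json::from_str::<serde_json::Value>(data) {
                let wav = base64::engine::general_purpose::STANDARD
                    .decode(chunk["audio"].as_str().unwrap_or_default())?;
                // Hand the decoded WAV bytes to the audio backend as soon as they arrive.
                std::fs::write(format!("chunk_{}.wav", chunk["sequence"]), wav)?;
            }
        }
    }
    Ok(())
}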
Voices
Kokoro voices (54 total)
Voice IDs encode language and gender: <lang><gender>_<name>. For example af_heart is en-US female "heart", bm_lewis is en-GB male "lewis", jf_alpha is Japanese female "alpha".
Default: af_heart. List all voices via GET /v1/models (the engine emits one model id per available voice when probed) or browse the voices/ directory in the cached hexgrad/Kokoro-82M asset.
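As an illustration of the id scheme, the helper below splits a voice id into language, gender, and name. The single-letter prefix map is an assumption inferred from the examples above (a = en-US, b = en-GB, j = Japanese, and so on); the voices/ directory in the cached asset is the authoritative list.
// Hypothetical helper: the prefix-to-language map is inferred from the
// documented examples (af_heart, bm_lewis, jf_alpha), not taken from the engine.
fn parse_kokoro_voice(id: &str) -> Option<(&'static str, &'static str, &str)> {
    let (prefix, name) = id.split_once('_')?;
    let mut flags = prefix.chars();
    let lang = match flags.next()? {
        'a' => "en-US",
        'b' => "en-GB",
        'j' => "ja",
        'z' => "zh",
        'e' => "es",
        'f' => "fr",
        'h' => "hi",
        'i' => "it",
        'p' => "pt-BR",
        _ => return None,
    };
    let gender = match flags.next()? {
        'f' => "female",
        'm' => "male",
        _ => return None,
    };
    Some((lang, gender, name))
}

// parse_kokoro_voice("af_heart") == Some(("en-US", "female", "heart"))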
Pocket TTS voices (24 built-in, English)
| Voice | Voice | Voice |
|---|---|---|
| alba | anna | azelma |
| bill_boerst | caro_davy | charles |
| cosette | eponine | eve |
| fantine | george | jane |
| javert | jean | kate |
| marius | mary | michael |
| morgan | paul | peter_yearsley |
| stuart_bell | vera | ian |
ian, morgan, and kate are clones built from clean reference recordings of well-known voices and ship in the asset cache at models/kyutai/pocket-tts-without-voice-cloning/languages/english/embeddings/<voice>.safetensors.
Voice cloning
Pocket TTS — instant cloning
Pocket TTS supports instant voice cloning from a single reference clip via the pocket_tts_create_voice_pack binary. Build a new voice pack from any clean recording:
cargo run --release --features metal \
-p atelico-audio --bin pocket_tts_create_voice_pack -- \
--input my_voice.wav \
--output ~/.cache/atelico/models/kyutai/pocket-tts-without-voice-cloning/languages/english/embeddings/my_voice.safetensors \
--max-seconds 30
Once written into the embeddings directory, the new voice id is automatically picked up by the next pocket-tts model load. Use --max-seconds to trim long reference recordings — 20–40 s of clean speech produces the best results.
For programmatic use from Rust, see atelico_audio::tts::pocket_tts::PocketTts::create_voice_pack.
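A rough sketch of the programmatic path. The argument list below (reference samples, sample rate, output path) is an assumption for illustration; the rustdoc for PocketTts::create_voice_pack has the real signature, and read_wav_file is the helper used in the Rust SDK transcription example later on this page.
use atelico_audio::processing::wav_io::read_wav_file;
use std::path::Path;

// Illustrative only: argument names and order are assumptions, not the
// verified signature of PocketTts::create_voice_pack.
let (samples, sample_rate) = read_wav_file(Path::new("my_voice.wav"))?;
// Relative to the atelico cache root shown in the CLI example above.
let embeddings_dir = Path::new(
    "models/kyutai/pocket-tts-without-voice-cloning/languages/english/embeddings",
);
tts.create_voice_pack(&samples, sample_rate, &embeddings_dir.join("my_voice.safetensors"))?;
// The new voice id ("my_voice") is picked up on the next pocket-tts model load.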
Kokoro — evolutionary cloning
Kokoro supports a slower evolutionary cloning path that searches voice-pack space against a reference clip. New in 0.8.1: a programmatic API on KokoroTts:
use atelico_audio::tts::kokoro::{KokoroTts, voice_evolution_config::EvolutionConfig};
let path = tts.clone_voice_from_audio(
"my_voice", // new voice id
&samples, // f32 PCM samples
sample_rate, // resampled to 24 kHz internally if needed
&[], // seed voices — empty = [af_heart, am_adam, bf_emma]
50, // generations
None, // EvolutionConfig (default population_size=20)
)?;
The result is persisted to <atelico_cache>/models/<asset_id>/voices/<voice_id>.safetensors and the voice is immediately usable in synthesize / synthesize_streaming. The voice survives a server restart (the existing voices-dir scan picks it up at load).
:::caution Cost
Evolutionary cloning runs the full Kokoro synth pipeline per candidate: with default settings (50 generations × 20 candidates) expect 2–4 hours on Metal or RTX 3090, longer on CPU. For instant high-quality cloning, prefer Pocket TTS.
:::
Quantization (Kokoro)
Kokoro's linear layers can be quantized to Q8_0 or Q4_0 via the ATELICO_KOKORO_QUANT env var:
ATELICO_KOKORO_QUANT=q8_0 ./atelico-server # 12% memory saving (164 → 144 MB)
ATELICO_KOKORO_QUANT=q4_0 ./atelico-server # 18% memory saving (164 → 134 MB)
The convolutional decoder (~71% of params) stays at the configured float dtype — there is no quantized Conv1d/Conv2d kernel yet — so the savings are modest. Useful for iOS bundle pressure. Listening tests show no audible quality regression at either Q8 or Q4 vs F16.
Pocket TTS dtype tuning
Pocket TTS defaults to BF16 on Metal and CUDA (~8% faster than F32 on Metal end-to-end), F32 on CPU. Override with ATELICO_TTS_DTYPE:
| Value | Notes |
|---|---|
| bf16 (default on GPU) | Best speed/quality balance |
| f32 (default on CPU) | Reference precision; required for the parity test suite |
| f16 | ~12% faster than BF16 on Metal but Whisper-roundtrip WER drifts measurably (0.00 → 0.06 on the Bilbo paragraph). Available for users who want the extra speed. |
Pocket TTS chunks at 50 SentencePiece tokens (the training distribution boundary). Override with ATELICO_POCKET_TTS_MAX_TOKENS if you have a specific reason — values above 50 will produce out-of-distribution audio degradation.
Audio-to-Face (A2F)
The engine ports NVIDIA's Audio2Face-3D pipeline: 16 kHz PCM audio in, 52-channel ARKit blendshape weights @ 30 fps out. Use it to drive a talking-head facial rig directly from TTS output (or any voice clip) without an animator in the loop.
The pipeline is a HuBERT speech encoder feeding a 2-step diffusion network that emits ARKit-52 blendshape trajectories. Three built-in voice identities are bundled (claire, james, mark); each was trained on a different speaker so picking the closest match to your TTS voice gives the cleanest lip sync.
Output format
A2F emits the full ARKit Face Tracking 52-blendshape set in canonical order (eyeBlinkLeft, eyeLookDownLeft, … tongueOut). Each frame is a [52] float vector of weights in [0, 1]. The model produces one frame every 33.33 ms; ship the resulting [N, 52] matrix to your character rig at 30 fps, or interpolate to the rig's native frame rate.
The 52 channels are symmetric pairs for everything below the brow line — a downstream rig that only blends symmetric expressions can collapse left/right pairs by averaging.
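If the rig runs at a different frame rate than the 30 fps the model emits, linear interpolation between neighbouring frames is usually good enough. A self-contained sketch (plain Rust, no Atelico APIs involved):
/// Resample an [N, 52] blendshape track from 30 fps to `target_fps`
/// by linear interpolation between neighbouring frames.
fn resample_blendshapes(frames: &[[f32; 52]], target_fps: f32) -> Vec<[f32; 52]> {
    const SOURCE_FPS: f32 = 30.0;
    if frames.is_empty() {
        return Vec::new();
    }
    let duration = (frames.len() - 1) as f32 / SOURCE_FPS;
    let out_len = (duration * target_fps).floor() as usize + 1;
    (0..out_len)
        .map(|i| {
            let t = i as f32 / target_fps * SOURCE_FPS; // position in source frames
            let lo = t.floor() as usize;
            let hi = (lo + 1).min(frames.len() - 1);
            let frac = t - lo as f32;
            let mut frame = [0.0f32; 52];
            for c in 0..52 {
                frame[c] = frames[lo][c] * (1.0 - frac) + frames[hi][c] * frac;
            }
            frame
        })
        .collect()
}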
Demo
demos/audio-to-face is a desktop app that combines Pocket TTS + A2F + ARKit-52 visualization:
- Type or paste text, pick a TTS voice and an A2F character, hit play
- Live ARKit dashboard shows the 52 active weights per frame
- 3D point-cloud head reconstruction renders the blendshapes in real time
- Seekable timeline; streaming playback starts on the first decoded chunk
The demo is the recommended starting point for hooking A2F into a game engine — it isolates the audio/animation handoff before you wire it into a character rig.
SDK usage
A2F is exposed across every binding alongside TTS:
- Rust SDK — atelico_sdk::audio::AudioToFace::new(&engine, character_id)?.run(pcm_samples)?
- FFI — atelico_audio_to_face_run(handle, pcm_ptr, pcm_len, character, blendshapes_out)
- Python — engine.audio_to_face(pcm, character="claire") returns a NumPy [N, 52] array
- Godot / Unity / Unreal — same AudioToFace subsystem object as the other audio modules; methods accept a PCM byte buffer and return either an array of frames or fire a per-frame callback for streaming consumers
See the per-language audio sections below for the canonical request shape; A2F follows the same handle / dispose pattern as TTS and STT.
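Putting the two together in the Rust SDK looks roughly like the sketch below. It reuses the call shape from the list above and the synthesize_sync / read_wav_file helpers from the Rust SDK section further down; whether run returns the frame matrix directly, and whether the engine resamples the 24 kHz TTS output to the 16 kHz A2F input, are assumptions flagged in the comments.
use atelico_audio::processing::wav_io::read_wav_file;
use atelico_sdk::audio::AudioToFace;
use atelico_sdk::{AudioSpeechRequest, Engine};
use std::path::Path;

let engine = Engine::new()?;

// 1. Synthesize a line of dialogue and keep the WAV bytes.
let speech = engine.audio().synthesize_sync(AudioSpeechRequest {
    model: "in-memory::pocket".into(),
    input: "Well met, traveller.".into(),
    voice: "stuart_bell".into(),
    ..Default::default()
})?;
std::fs::write("line.wav", &speech.audio_data)?;

// 2. Decode to f32 PCM. A2F expects 16 kHz input; if the engine does not
//    resample the 24 kHz TTS output internally, resample here first.
let (pcm, _sample_rate) = read_wav_file(Path::new("line.wav"))?;

// 3. One ARKit-52 frame per 33.33 ms of audio (assuming `run` returns the
//    [N, 52] frame matrix directly).
let frames = AudioToFace::new(&engine, "claire")?.run(pcm)?;
println!("{} blendshape frames at 30 fps", frames.len());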
Speech-to-Text
Quick start
curl http://localhost:11434/v1/audio/transcriptions \
-F model=in-memory::whisper \
-F file=@speech.wav
Response:
{
"text": "Hello from Atelico.",
"language": "en",
"duration": 1.4
}
Pass response_format=verbose_json to also get segments (segment-level timing) and timestamp_granularities[]=word to get word-level timestamps.
Choosing a model
After the in-memory:: prefix:
| model value | Variant | Languages | RTF (Metal) | RTF (3090) |
|---|---|---|---|---|
| whisper (default) | base.en | English | 0.05 | 0.01 |
| whisper-tiny, whisper-tiny.en | tiny / tiny.en | multi / English | 0.03 | 0.01 |
| whisper-base, whisper-base.en | base / base.en | multi / English | 0.05 | 0.01 |
| whisper-small, whisper-small.en | small / small.en | multi / English | 0.12 | 0.03 |
| whisper-medium, whisper-medium.en | medium / medium.en | multi / English | — | — |
| whisper-large-v3 | large-v3 | multi | — | 0.09 |
| whisper-large-v3-turbo | large-v3-turbo | multi | 0.17 (Q5_0) | 0.03 |
| distil-large-v3 | distil-large-v3 | multi | — | 0.03 |
RTF (real-time factor) = wall-clock time / audio duration; lower is faster (at RTF 0.05, one minute of audio transcribes in roughly three seconds). Numbers are from the in-tree whisper_bench harness on the LibriSpeech dev-clean corpus.
Multilingual decoding
Multilingual variants (whisper-large-v3, large-v3-turbo, distil-large-v3) require the language token to be inserted into the decoder prompt. Atelico does this automatically based on WhisperSize::is_english_only. If you set language in the request, that ISO 639-1 code is used; otherwise the model auto-detects.
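In-process the same override goes on the request struct. Whether the Rust AudioTranscriptionRequest exposes a language field exactly as sketched below is an assumption; it mirrors the HTTP form field documented in the API request reference.
// Force Japanese decoding on a multilingual variant instead of auto-detection.
// The `language` field is assumed to mirror the HTTP form field of the same name.
let request = AudioTranscriptionRequest {
    model: "in-memory::whisper-large-v3".into(),
    audio_samples: samples,
    sample_rate,
    language: Some("ja".into()),
    ..Default::default()
};
let result = engine.audio().transcribe_sync(request)?;
println!("{}", result.text);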
Quantized GGUF variants
Whisper supports GGUF Q5_0 quantization (large-v3 → ~1 GB) for memory-constrained deployments. Programmatic access via WhisperStt::load_quantized(); the HTTP route auto-selects a quantized variant when the requested model id resolves to one in the asset store.
Streaming transcription with VAD
For real-time / live-mic use, the engine ships a StreamingTranscriber that accumulates audio, detects speech boundaries via energy-based VAD, and emits TranscriptionChunk { text, start_time, end_time, is_final }. It's generic over T: SpeechToTextModel (default WhisperStt) so future STT backends slot in unchanged. See the audio-stt demo for a complete macOS / iOS live-mic implementation.
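The chunk-handling side of a consumer is independent of how audio is fed in; the sketch below relies only on the TranscriptionChunk fields listed above (the import path is assumed). Interim chunks overwrite a live caption, final chunks are appended to the transcript.
use atelico_audio::stt::streaming::TranscriptionChunk; // module path assumed

fn handle_chunk(transcript: &mut String, live_caption: &mut String, chunk: &TranscriptionChunk) {
    println!("[{:.2}s..{:.2}s] {}", chunk.start_time, chunk.end_time, chunk.text);
    if chunk.is_final {
        // Speech segment closed by the VAD: commit the text.
        transcript.push_str(chunk.text.trim());
        transcript.push(' ');
        live_caption.clear();
    } else {
        // Provisional text for the segment still being spoken.
        *live_caption = chunk.text.clone();
    }
}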
API request reference
AudioSpeechRequest
| Field | Type | Default | Description |
|---|---|---|---|
| model | string | required | Model id, e.g. in-memory::tts |
| input | string | required | Text to synthesize |
| voice | string | af_heart | Voice identifier (model-specific) |
| response_format | string | wav | Currently only wav is supported |
| speed | f32 | 1.0 | Speech speed multiplier (0.25–4.0) |
| stream | bool | false | If true, return SSE chunks per sentence |
AudioTranscriptionRequest
Multipart form upload. Fields:
| Field | Type | Default | Description |
|---|---|---|---|
| model | string | required | Model id, e.g. in-memory::whisper |
| file | binary | required | Audio file (WAV, MP3, FLAC, etc.) |
| language | string | auto-detect | ISO 639-1 code (en, ja, …) |
| response_format | string | json | json or verbose_json |
| temperature | f64 | 0.0 | Decoder sampling temperature (0 = greedy) |
| timestamp_granularities[] | string[] | [] | segment and/or word |
Example: voice-acted NPC dialogue pipeline
A typical game pipeline chains chat → TTS:
# 1. Generate dialogue from a chat completion
RESPONSE=$(curl -s http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
"messages": [
{"role": "system", "content": "You are a gruff dwarven blacksmith."},
{"role": "user", "content": "Greet the player."}
]
}' | jq -r '.choices[0].message.content')
# 2. Speak it with a matching voice (blocking request; SSE streaming output
#    can't be written straight to a .wav file)
curl -s http://localhost:11434/v1/audio/speech \
-H "Content-Type: application/json" \
-d "$(jq -n --arg input "$RESPONSE" '{
model: "in-memory::pocket",
input: $input,
voice: "stuart_bell"
}')" \
--output dialogue.wav
For voice variety per NPC, pin one Pocket TTS voice per character and clone new ones from short reference clips as your cast grows.
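In-process, the same idea is a lookup from character id to voice id before building the request. The struct fields are the ones used in the Rust SDK section below; the cast list itself is just a placeholder.
use atelico_sdk::AudioSpeechRequest;
use std::collections::HashMap;

// Placeholder cast list: pin one Pocket TTS voice per character so each NPC
// always sounds the same across sessions.
let npc_voices: HashMap<&str, &str> = HashMap::from([
    ("blacksmith", "stuart_bell"),
    ("innkeeper", "fantine"),
    ("guard_captain", "javert"),
]);

let npc = "blacksmith";
let line = "Mind the forge, it is hotter than a dragon's temper.";
let request = AudioSpeechRequest {
    model: "in-memory::pocket".into(),
    input: line.into(),
    voice: npc_voices.get(npc).copied().unwrap_or("alba").into(),
    ..Default::default()
};
let wav = engine.audio().synthesize_sync(request)?;
std::fs::write(format!("{npc}.wav"), &wav.audio_data)?;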
SDK usage (in-process)
The audio subsystem is exposed in every native binding under the same name, mirroring the HTTP route surface. Audio bytes cross the FFI boundary as base64-encoded WAV files (RIFF + PCM); each binding decodes the WAV header to recover the sample rate, so callers don't need to pass a separate sample_rate field.
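For reference, in the canonical RIFF/PCM layout the engine emits, the sample rate sits at byte offset 24 of the header. A minimal sketch of that decode (plain Rust; it assumes the fmt chunk directly follows the RIFF header, which is not true of arbitrary WAV files):
/// Read the sample rate from a canonical RIFF/PCM WAV header.
/// Assumes the fmt chunk directly follows the RIFF header, which holds for the
/// engine's output but not for arbitrary WAV files.
fn wav_sample_rate(wav: &[u8]) -> Option<u32> {
    if wav.len() < 28 || &wav[0..4] != b"RIFF" || &wav[8..12] != b"WAVE" {
        return None;
    }
    // Bytes 24..28 hold the sample rate, little-endian.
    Some(u32::from_le_bytes(wav[24..28].try_into().ok()?))
}

// wav_sample_rate(&kokoro_wav) == Some(24000)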
Rust SDK
use atelico_sdk::{Engine, AudioSpeechRequest, AudioTranscriptionRequest};
use base64::Engine as _; // trait import needed for STANDARD.decode in the streaming example
let engine = Engine::new()?;
// Blocking synthesis
let speech = AudioSpeechRequest {
model: "in-memory::tts".into(),
input: "Hello from Atelico.".into(),
voice: "af_heart".into(),
..Default::default()
};
let response = engine.audio().synthesize_sync(speech)?;
std::fs::write("hello.wav", &response.audio_data)?;
// Streaming synthesis — async, one AudioSpeechChunk per sentence
let stream_request = AudioSpeechRequest {
model: "in-memory::pocket-tts".into(),
input: "First sentence. Second one comes right after.".into(),
voice: "alba".into(),
..Default::default()
};
let mut stream = engine.audio().synthesize_stream(stream_request).await?;
while let Some(chunk) = stream.next().await? {
println!("[{}] {}", chunk.sequence, chunk.text);
let wav_bytes = base64::engine::general_purpose::STANDARD.decode(&chunk.audio)?;
// play wav_bytes immediately
}
// Transcription — pass pre-decoded f32 PCM samples directly
let (samples, sample_rate) =
atelico_audio::processing::wav_io::read_wav_file(std::path::Path::new("speech.wav"))?;
let transcribe = AudioTranscriptionRequest {
model: "in-memory::whisper".into(),
audio_samples: samples,
sample_rate,
..Default::default()
};
let result = engine.audio().transcribe_sync(transcribe)?;
println!("{}", result.text);
Both blocking methods (synthesize, transcribe) also have async equivalents; synthesize_stream is async-only and returns a StreamHandle you can drive via next() (async), next_blocking() (sync), or poll() (game-loop frame integration).
Python
import base64, json
from atelico import Engine
engine = Engine()
# Blocking synthesis
resp = json.loads(engine.audio_synthesize(json.dumps({
"model": "in-memory::tts",
"input": "Hello from Atelico.",
"voice": "af_heart",
})))
open("hello.wav", "wb").write(base64.b64decode(resp["audio_b64"]))
# Streaming synthesis — yields one AudioSpeechChunk JSON per sentence
stream = engine.audio_synthesize_stream(json.dumps({
"model": "in-memory::pocket-tts",
"input": "First sentence. Second one comes right after.",
"voice": "alba",
}))
for chunk_json in stream:
chunk = json.loads(chunk_json)
wav_bytes = base64.b64decode(chunk["audio"])
# play wav_bytes through your audio backend immediately
# Transcription — base64-encode the WAV file before passing
wav_b64 = base64.b64encode(open("speech.wav", "rb").read()).decode()
result = json.loads(engine.audio_transcribe(json.dumps({
"model": "in-memory::whisper",
"audio_b64": wav_b64,
})))
print(result["text"])
C / FFI
// Blocking TTS
const char* request = "{\"model\":\"in-memory::tts\",\"input\":\"Hello.\",\"voice\":\"af_heart\"}";
const char* response_json = NULL;
atelico_audio_synthesize(engine, request, &response_json);
// response_json: {"audio_b64":"UklGRn...","duration_seconds":1.42,
// "format":"wav","sample_rate":24000}
// Streaming TTS — start chunk, then poll each frame
uint64_t stream = 0;
atelico_audio_synthesize_stream(engine,
"{\"model\":\"in-memory::pocket-tts\",\"input\":\"First. Second.\",\"voice\":\"alba\"}",
&stream);
const char* chunk_json = NULL;
while (1) {
int rc = atelico_stream_poll(engine, stream, &chunk_json);
if (rc == ATELICO_OK) {
// chunk_json: {"sequence":N, "audio":"<b64 WAV>", "duration_seconds":..,"text":".."}
} else if (rc == ATELICO_ERR_STREAM_DONE) {
break;
}
// ATELICO_ERR_STREAM_EMPTY: no chunk this frame, try again next tick.
}
atelico_stream_destroy(engine, stream);
// Blocking STT
const char* stt_req = "{\"model\":\"in-memory::whisper\",\"audio_b64\":\"UklGRn...\"}";
const char* stt_resp = NULL;
atelico_audio_transcribe(engine, stt_req, &stt_resp);
// stt_resp: {"text":"Hello.","language":"en","duration":1.0}
The output pointer is owned by the FFI layer's thread-local return buffer — copy it before the next API call on the same thread. The streaming API reuses the existing atelico_stream_poll / atelico_stream_destroy infrastructure used for chat completions; see the C FFI getting started guide for the canonical poll loop.
Unity (C#)
// Blocking TTS
string speechResp = AtelicoEngine.Instance.Audio.Synthesize(@"{
""model"": ""in-memory::tts"",
""input"": ""Hello from Atelico."",
""voice"": ""af_heart""
}");
// Parse speechResp, base64-decode "audio_b64", feed bytes through a WAV loader
// into an AudioSource.
// Streaming TTS — chunks delivered via a callback (polled centrally each frame
// by AtelicoEngine, just like ChatCompletionStream).
AtelicoEngine.Instance.Audio.SynthesizeStream(@"{
""model"": ""in-memory::pocket-tts"",
""input"": ""First sentence. Second one comes right after."",
""voice"": ""alba""
}",
onChunk: chunkJson => {
// {"sequence":N,"audio":"<b64 WAV>","duration_seconds":..,"text":".."}
// queue the decoded clip into your AudioSource
},
onComplete: () => Debug.Log("done"),
onError: err => Debug.LogError(err));
// Blocking STT
string sttResp = AtelicoEngine.Instance.Audio.Transcribe(
$"{{\"model\":\"in-memory::whisper\",\"audio_b64\":\"{wavB64}\"}}");
// {"text":"Hello from Atelico.","language":"en","duration":1.0}
Subsystem accessor: AtelicoEngine.Instance.Audio.{Synthesize, SynthesizeStream, Transcribe}.
Unreal (Blueprint / C++)
UAtelicoAISubsystem* Atelico = GetGameInstance()->GetSubsystem<UAtelicoAISubsystem>();
// Blocking TTS
FString SpeechResponse = Atelico->SynthesizeAudio(TEXT(R"({
"model": "in-memory::tts",
"input": "Hello from Atelico.",
"voice": "af_heart"
})"));
// Parse JSON, base64-decode audio_b64, feed bytes to USoundWave (e.g. via the
// Runtime Audio Importer plugin) attached to a UAudioComponent.
// Streaming TTS — bind delegates, then start the stream. The subsystem polls
// the FFI stream in Tick() and broadcasts each chunk on OnAudioChunkReceived.
Atelico->OnAudioChunkReceived.AddDynamic(this, &AMyActor::HandleAudioChunk);
Atelico->OnAudioCompleted.AddDynamic(this, &AMyActor::HandleAudioCompleted);
Atelico->OnAudioFailed.AddDynamic(this, &AMyActor::HandleAudioFailed);
Atelico->SynthesizeAudioStream(TEXT(R"({
"model": "in-memory::pocket-tts",
"input": "First sentence. Second one comes right after.",
"voice": "alba"
})"));
// HandleAudioChunk(const FString& ChunkJson):
// ChunkJson is an AudioSpeechChunk: {"sequence":N,"audio":"<b64 WAV>",...}
// Blocking STT
FString SttResponse = Atelico->TranscribeAudio(FString::Printf(
TEXT(R"({"model":"in-memory::whisper","audio_b64":"%s"})"), *WavB64));
Blueprint nodes available under Atelico AI | Audio: Synthesize Audio, Synthesize Audio Stream, Transcribe Audio. Streaming events are exposed under Atelico AI | Events: On Audio Chunk Received, On Audio Completed, On Audio Failed.
Godot (GDScript)
var engine := AtelicoCoreNode.new()
add_child(engine)
# Blocking TTS
var speech_response_json: String = engine.audio_synthesize(JSON.stringify({
"model": "in-memory::tts",
"input": "Hello from Atelico.",
"voice": "af_heart",
}))
var speech: Dictionary = JSON.parse_string(speech_response_json)
var wav_bytes: PackedByteArray = Marshalls.base64_to_raw(speech["audio_b64"])
# Load wav_bytes into AudioStreamWAV, attach to AudioStreamPlayer.
# Streaming TTS — chunks arrive on the audio_synthesis_chunk signal,
# end-of-stream on audio_synthesis_completed.
engine.audio_synthesis_chunk.connect(_on_audio_chunk)
engine.audio_synthesis_completed.connect(_on_audio_done)
var job_id: int = engine.audio_synthesize_stream(JSON.stringify({
"model": "in-memory::pocket-tts",
"input": "First sentence. Second one comes right after.",
"voice": "alba",
}))
func _on_audio_chunk(job_id: int, chunk_json: String) -> void:
var chunk: Dictionary = JSON.parse_string(chunk_json)
var bytes: PackedByteArray = Marshalls.base64_to_raw(chunk["audio"])
# queue bytes into AudioStreamPlayer; chunk["text"] is the source sentence
func _on_audio_done(job_id: int, success: bool) -> void:
print("synth done: ", success)
# Blocking STT
var stt_response_json: String = engine.audio_transcribe(JSON.stringify({
"model": "in-memory::whisper",
"audio_b64": Marshalls.raw_to_base64(load_wav_bytes_from_disk("speech.wav")),
}))
var stt: Dictionary = JSON.parse_string(stt_response_json)
print(stt["text"])
See also
- Models — full supported-model catalogue
- Server Configuration — env vars (ATELICO_KOKORO_QUANT, ATELICO_TTS_DTYPE, ATELICO_POCKET_TTS_MAX_TOKENS)
- NPC Dialogue — end-to-end character pipeline