Models
Listing Available Models
```bash
curl http://localhost:11434/v1/models
```
Returns all models the server can serve:
```json
{
  "object": "list",
  "data": [
    {"id": "in-memory::meta-llama/Llama-3.2-1B-Instruct", "object": "model", "owned_by": "atelico"},
    {"id": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M", "object": "model", "owned_by": "atelico"},
    ...
  ]
}
```
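To pull out just the model IDs, pipe the response through jq (assuming jq is installed):

```bash
# Print only the model IDs served by default
curl -s http://localhost:11434/v1/models | jq -r '.data[].id'
```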
Model Naming
Models use a prefix::model-name syntax:
| Prefix | Backend | Description |
|---|---|---|
| in-memory:: | Local GPU/CPU | On-device inference (default if no prefix) |
| openai:: | OpenAI proxy | Forwards to OpenAI API |
| image-generation:: | Local GPU/CPU | Image generation models |
| mock:: | Mock | Returns hardcoded responses (for testing) |
If you omit the prefix, in-memory:: is assumed.
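For example, these two requests hit the same local model (a sketch, assuming the server exposes the standard OpenAI /v1/chat/completions route):

```bash
# Explicit prefix
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "in-memory::meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'

# Prefix omitted -- in-memory:: is assumed
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```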
Supported Architectures
The engine supports these model architectures in both float (SafeTensors) and quantized (GGUF) formats:
| Architecture | Example Models |
|---|---|
| LLaMA / Mistral | Llama 3.x, Mistral 7B |
| Qwen 3 | Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B |
| Qwen 3.5 | Qwen3.5-0.8B (hybrid DeltaNet + attention; optional megakernel decode on CUDA) |
| Qwen 3.6 | Qwen3.6 family incl. 35B-A3B (MoE, 256 experts / 8 active + 1 shared, hybrid DeltaNet + Gated Attention, 262K native context, YaRN extrapolation to ~1M) |
| Gemma 4 | Gemma 4 |
| Nemotron-H | Nemotron-H 4B Nano (dense, hybrid Mamba2 + attention) |
| Nemotron-3 Elastic | Nemotron-3 Elastic 12B / 23B / 30B (MoE, hybrid Mamba2 + attention; GGUF Q4_K_M supported) |
| Parcae | Parcae (stable looped transformer) |
| SmolLM | SmolLM2-135M, SmolLM2-360M, SmolLM2-1.7B |
| SmolLM3 | SmolLM3-3B |
| Bonsai 1-bit | PrismML Bonsai 1-bit (Qwen3 architecture) |
Any HuggingFace model using one of these architectures can be loaded. The engine auto-detects the architecture from config.json or GGUF metadata.
Available LLM Models
Use ./atelico-asset-downloader list --namespace models to see all models available in your asset store. The models listed by GET /v1/models are pre-configured defaults:
| Model | Parameters | Format | VRAM |
|---|---|---|---|
| meta-llama/Llama-3.2-1B-Instruct | 1B | float | ~2 GB |
| meta-llama/Llama-3.2-1B-Instruct-Q4_K_M | 1B | GGUF | ~0.8 GB |
| meta-llama/Llama-3.2-3B-Instruct | 3B | float | ~6 GB |
| meta-llama/Llama-3.2-3B-Instruct-Q4_K_M | 3B | GGUF | ~2 GB |
| meta-llama/Llama-3.1-8B-Instruct | 8B | float | ~16 GB |
| meta-llama/Llama-3.1-8B-Instruct-Q4_K_M | 8B | GGUF | ~5 GB |
Any model downloaded to the cache can be used by passing its ID to the model field, even if it's not in the default list above.
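For example, once the 3B quantized model is in the cache (see Downloading Models below), it can be requested by ID (again assuming the standard chat completions route):

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M", "messages": [{"role": "user", "content": "Hello"}]}'
```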
Hybrid and MoE Models
Newer architectures with hybrid attention or sparse Mixture-of-Experts are served through the same in-memory:: prefix. Approximate memory footprints (Q4_K_M for the GGUF entries):
| Model | Active / Total Params | Format | VRAM |
|---|---|---|---|
| Qwen3.5-0.8B | 0.8B / 0.8B (dense) | float / GGUF | ~1.6 GB / ~0.5 GB |
| Qwen3.6-35B-A3B-Q4_K_M | 3B / 35B (MoE) | GGUF | ~20 GB |
| nvidia/Nemotron-3-Elastic-12B-Q4_K_M | A3B / 12B (MoE) | GGUF | ~7 GB |
| nvidia/Nemotron-3-Elastic-23B-Q4_K_M | A3B / 23B (MoE) | GGUF | ~13 GB |
| nvidia/Nemotron-3-Elastic-30B-Q4_K_M | A3B / 30B (MoE) | GGUF | ~17 GB |
Indicative GGUF Q4_K_M decode throughput for Nemotron-3 Elastic 12B: Metal (M3 Max) ~59 tok/s (1.74× the BF16 path at ~58% of the memory footprint), CUDA (RTX 3090) ~220 tok/s.
Megakernel Decode (Qwen 3.5-0.8B, CUDA)
Qwen 3.5-0.8B ships with an optional fused-megakernel decode path that collapses every layer into a single CUDA kernel launch. It runs 1.38×–1.45× faster for decode + sampling on Ampere+ vs the standard graph-replay path, at the cost of being CUDA + BF16 only.
Opt in either by appending -mk to the model id (in-memory::Qwen3.5-0.8B-mk) or by setting the environment variable ATELICO_MEGAKERNEL=1 when launching the server. Metal and CPU backends ignore the flag and use the standard path.
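Both opt-in paths as a sketch (the -mk suffix and ATELICO_MEGAKERNEL come from the description above; the chat completions route is assumed):

```bash
# Option 1: per-request opt-in via the -mk model id suffix
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "in-memory::Qwen3.5-0.8B-mk", "messages": [{"role": "user", "content": "Hello"}]}'

# Option 2: server-wide opt-in at launch
ATELICO_MEGAKERNEL=1 ./atelico-server
```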
Image Generation Models
| Model | Description |
|---|---|
| pixart-alpha | PixArt-Alpha (DMD one-step) |
| pixart-sigma | PixArt-Sigma (multi-step DPM-Solver) |
| sana-sprint | Sana Sprint (SCM 2-step, fast) |
| sana-0.6b | Sana 0.6B (flow matching, 20-step) |
| sana-1.6b | Sana 1.6B (flow matching, 20-step, highest quality) |
Use with the image-generation:: prefix: "model": "image-generation::sana-sprint".
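A minimal request sketch; the image-generation:: prefix is documented above, but the /v1/images/generations route and the prompt field are assumptions based on the OpenAI images API:

```bash
# Route and field names assume OpenAI's images API; only the
# image-generation:: model prefix is documented.
curl http://localhost:11434/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "image-generation::sana-sprint", "prompt": "a watercolor fox"}'
```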
Text-to-Speech Models
| Model ID (after in-memory::) | Engine | Description |
|---|---|---|
| tts, kokoro, kokoro-82m | Kokoro 82M | 54 voices, 9 languages, streaming output. Optional Q8/Q4 linear quantization via ATELICO_KOKORO_QUANT. |
| pocket, pocket-tts | Pocket TTS | English-only, ~15× realtime on Metal / ~34× on RTX 3090, instant voice cloning from a single reference clip, 24 built-in voices. |
TTS is accessed via the /v1/audio/speech endpoint (OpenAI-compatible). Specify a voice id in the voice field; language is detected automatically by Kokoro. Set stream: true to receive Server-Sent Events with one chunk per sentence. Default model is kokoro-82m. See the Audio guide for full details.
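A streaming request sketch (the input field name follows the OpenAI audio schema, which is an assumption; af_heart is an example voice ID, so substitute one from your Kokoro install):

```bash
# Streaming request: SSE events arrive with one audio chunk per sentence.
# "input" assumes the OpenAI audio schema; "af_heart" is an example voice id.
curl http://localhost:11434/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-82m", "voice": "af_heart", "input": "Hello there.", "stream": true}'
```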
Speech-to-Text Models
All Whisper variants are routed via in-memory::<id>:
| Model ID | Variant | Languages | Notes |
|---|---|---|---|
| whisper | base.en | English | Default |
| whisper-tiny, whisper-tiny.en | tiny / tiny.en | multi / English | Smallest, fastest |
| whisper-base, whisper-base.en | base / base.en | multi / English | |
| whisper-small, whisper-small.en | small / small.en | multi / English | |
| whisper-medium, whisper-medium.en | medium / medium.en | multi / English | |
| whisper-large-v3 | large-v3 | multi | Lowest WER on hard corpora |
| whisper-large-v3-turbo | large-v3-turbo | multi | ~33× realtime on RTX 3090 |
| distil-large-v3 | distil-large-v3 | multi | ~33× realtime on RTX 3090 |
GGUF Q5_0 quantization is supported for the larger variants (large-v3 shrinks to ~1 GB) and is applied automatically when the asset store provides the quantized file. STT is accessed via /v1/audio/transcriptions (OpenAI-compatible).
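A transcription request is a multipart upload, following the OpenAI schema (audio.wav here is a placeholder file):

```bash
# Multipart form upload, matching OpenAI's transcriptions schema;
# audio.wav stands in for your input file.
curl http://localhost:11434/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=whisper-large-v3-turbo
```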
Choosing a Model
- Dialogue, simple conversation: 1B quantized is fast enough
- NPC personalities, creative writing: 3B quantized is a good balance
- Complex reasoning, structured generation: 8B quantized for best results
- Shipping a game: Quantized models are recommended -- smaller download, less VRAM, similar quality
Downloading Models
Use the asset downloader to fetch models:
```bash
# List what's available
./atelico-asset-downloader list --namespace models

# Download a specific model
./atelico-asset-downloader download meta-llama/Llama-3.2-3B-Instruct-Q4_K_M

# Interactive mode -- browse and select
./atelico-asset-downloader interactive
```
Custom Asset Store
If your team hosts models on a private store:
```bash
./atelico-asset-downloader \
  --store-url https://your-store.example.com \
  --access-key YOUR_KEY \
  --secret-key YOUR_SECRET \
  download meta-llama/Llama-3.2-3B-Instruct-Q4_K_M
```
HuggingFace Fallback
If a model isn't in the asset store, the server can fetch it from HuggingFace as a last resort. Set the HF_TOKEN environment variable for gated models:
```bash
HF_TOKEN=hf_your_token_here ./atelico-server
```
Model Formats
| Format | Extension | Description |
|---|---|---|
| SafeTensors | .safetensors | Full-precision weights (F16/F32) |
| GGUF | .gguf | Quantized weights (Q4, Q8, etc.) |
| Bonsai 1-bit | .safetensors (special) | 1-bit quantized (experimental) |
Cache Location
Downloaded models are stored locally:
| Platform | Path |
|---|---|
| macOS | ~/Library/Caches/atelico/models/ |
| Linux | ~/.cache/atelico/models/ |
| Windows | %LOCALAPPDATA%\atelico\models\ |
Override with the ATELICO_CACHE_DIR environment variable.
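For example (a sketch; the path is arbitrary):

```bash
# Keep the model cache on a different drive
ATELICO_CACHE_DIR=/mnt/fast/atelico-models ./atelico-server
```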
Using a Proxy Backend
Forward requests to OpenAI or any OpenAI-compatible API:
```bash
OPENAI_API_KEY=sk-... ./atelico-server
```
Then use the openai:: prefix:
```json
{
  "model": "openai::gpt-4o-mini",
  "messages": [{"role": "user", "content": "Hello"}]
}
```
For other providers, use the generic proxy syntax:
```bash
PROXY_ANTHROPIC_API_KEY=sk-... \
PROXY_ANTHROPIC_BASE_URL=https://api.anthropic.com/v1 \
./atelico-server
```
Then use anthropic::model-name as the model identifier.
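End to end, a request through the generic proxy looks like this (a sketch; claude-3-5-haiku-latest is an example ID, so use whatever model the provider serves):

```bash
# "claude-3-5-haiku-latest" is an example id, not a documented default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "anthropic::claude-3-5-haiku-latest", "messages": [{"role": "user", "content": "Hello"}]}'
```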