Models
Listing Available Models
```bash
curl http://localhost:11434/v1/models
```
Returns all models the server can serve:
```json
{
  "object": "list",
  "data": [
    {"id": "in-memory::meta-llama/Llama-3.2-1B-Instruct", "object": "model", "owned_by": "atelico"},
    {"id": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M", "object": "model", "owned_by": "atelico"},
    ...
  ]
}
```
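To pull out just the model IDs, pipe the response through jq (assuming jq is installed):

```bash
# Print only the model IDs served by default
curl -s http://localhost:11434/v1/models | jq -r '.data[].id'
```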
Model Naming
Models use a prefix::model-name syntax:
| Prefix | Backend | Description |
|---|---|---|
| in-memory:: | Local GPU/CPU | On-device inference (default if no prefix) |
| openai:: | OpenAI proxy | Forwards to OpenAI API |
| image-generation:: | Local GPU/CPU | Image generation models |
| mock:: | Mock | Returns hardcoded responses (for testing) |
If you omit the prefix, in-memory:: is assumed.
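For example, these two requests hit the same local model (a sketch, assuming the server exposes the standard OpenAI /v1/chat/completions route):

```bash
# Explicit prefix
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "in-memory::meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'

# Prefix omitted -- in-memory:: is assumed
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```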
Supported Architectures
The engine supports these model architectures in both float (SafeTensors) and quantized (GGUF) formats:
| Architecture | Example Models |
|---|---|
| LLaMA / Mistral | Llama 3.x, Mistral 7B |
| Qwen 3 | Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B |
| Qwen 3.5 | Qwen3.5-0.8B (hybrid DeltaNet + attention; optional megakernel decode on CUDA) |
| Qwen 3.6 | Qwen3.6 family incl. 35B-A3B (MoE, 256 experts / 8 active + 1 shared, hybrid DeltaNet + Gated Attention, 262K native context, YaRN extrapolation to ~1M) |
| Gemma 4 | Gemma 4 |
| Nemotron-H | Nemotron-H 4B Nano (dense, hybrid Mamba2 + attention) |
| Nemotron-3 Elastic | Nemotron-3 Elastic 12B / 23B / 30B (MoE, hybrid Mamba2 + attention; GGUF Q4_K_M supported) |
| Parcae | Parcae (stable looped transformer) |
| SmolLM | SmolLM2-135M, SmolLM2-360M, SmolLM2-1.7B |
| SmolLM3 | SmolLM3-3B |
| Bonsai 1-bit | PrismML Bonsai 1-bit (Qwen3 architecture) |
Any HuggingFace model using one of these architectures can be loaded. The engine auto-detects the architecture from config.json or GGUF metadata.
Available LLM Models
Use ./atelico-asset-downloader list --namespace models to see all models available in your asset store. The models listed by GET /v1/models are pre-configured defaults:
| Model | Parameters | Format | VRAM |
|---|---|---|---|
| meta-llama/Llama-3.2-1B-Instruct | 1B | float | ~2 GB |
| meta-llama/Llama-3.2-1B-Instruct-Q4_K_M | 1B | GGUF | ~0.8 GB |
| meta-llama/Llama-3.2-3B-Instruct | 3B | float | ~6 GB |
| meta-llama/Llama-3.2-3B-Instruct-Q4_K_M | 3B | GGUF | ~2 GB |
| meta-llama/Llama-3.1-8B-Instruct | 8B | float | ~16 GB |
| meta-llama/Llama-3.1-8B-Instruct-Q4_K_M | 8B | GGUF | ~5 GB |
Any model downloaded to the cache can be used by passing its ID to the model field, even if it's not in the default list above.
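For example, once the 3B quantized model is in the cache (see Downloading Models below), it can be requested by ID (again assuming the standard chat completions route):

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M", "messages": [{"role": "user", "content": "Hello"}]}'
```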
Hybrid and MoE Models
Newer architectures with hybrid attention or sparse Mixture-of-Experts are served through the same in-memory:: prefix. Approximate memory footprints (Q4_K_M for the GGUF entries):
| Model | Active / Total Params | Format | VRAM |
|---|---|---|---|
| Qwen3.5-0.8B | 0.8B / 0.8B (dense) | float / GGUF | ~1.6 GB / ~0.5 GB |
| Qwen3.6-35B-A3B-Q4_K_M | 3B / 35B (MoE) | GGUF | ~20 GB |
| nvidia/Nemotron-3-Elastic-12B-Q4_K_M | A3B / 12B (MoE) | GGUF | ~7 GB |
| nvidia/Nemotron-3-Elastic-23B-Q4_K_M | A3B / 23B (MoE) | GGUF | ~13 GB |
| nvidia/Nemotron-3-Elastic-30B-Q4_K_M | A3B / 30B (MoE) | GGUF | ~17 GB |
Indicative GGUF Q4_K_M decode throughput for Nemotron-3 Elastic 12B: Metal (M3 Max) ~59 tok/s (1.74× the BF16 path at ~58% of the memory footprint), CUDA (RTX 3090) ~220 tok/s.
Megakernel Decode (Qwen 3.5-0.8B, CUDA)
Qwen 3.5-0.8B ships with an optional fused-megakernel decode path that collapses every layer into a single CUDA kernel launch. It runs 1.38×–1.45× faster for decode + sampling on Ampere+ vs the standard graph-replay path, at the cost of being CUDA + BF16 only.
Opt in either by appending -mk to the model id (in-memory::Qwen3.5-0.8B-mk) or by setting the environment variable ATELICO_MEGAKERNEL=1 when launching the server. Metal and CPU backends ignore the flag and use the standard path.
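Both opt-in paths as a sketch (the -mk suffix and ATELICO_MEGAKERNEL come from the description above; the chat completions route is assumed):

```bash
# Option 1: per-request opt-in via the -mk model id suffix
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "in-memory::Qwen3.5-0.8B-mk", "messages": [{"role": "user", "content": "Hello"}]}'

# Option 2: server-wide opt-in at launch
ATELICO_MEGAKERNEL=1 ./atelico-server
```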
Image Generation Models
| Model | Description |
|---|---|
| pixart-alpha | PixArt-Alpha (DMD one-step) |
| pixart-sigma | PixArt-Sigma (multi-step DPM-Solver) |
| sana-sprint | Sana Sprint (SCM 2-step, fast) |
| sana-0.6b | Sana 0.6B (flow matching, 20-step) |
| sana-1.6b | Sana 1.6B (flow matching, 20-step, highest quality) |
Use with the image-generation:: prefix: "model": "image-generation::sana-sprint".
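A minimal request sketch; the image-generation:: prefix is documented above, but the /v1/images/generations route and the prompt field are assumptions based on the OpenAI images API:

```bash
# Route and field names assume OpenAI's images API; only the
# image-generation:: model prefix is documented.
curl http://localhost:11434/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "image-generation::sana-sprint", "prompt": "a watercolor fox"}'
```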
Text-to-Speech Models
| Model ID (after in-memory::) | Engine | Description |
|---|---|---|
| tts, kokoro, kokoro-82m | Kokoro 82M | 54 voices, 9 languages, streaming output. Optional Q8/Q4 linear quantization via ATELICO_KOKORO_QUANT. |
| pocket, pocket-tts | Pocket TTS | English-only, ~15× realtime on Metal / ~34× on RTX 3090, instant voice cloning from a single reference clip, 24 built-in voices. |
TTS is accessed via the /v1/audio/speech endpoint (OpenAI-compatible). Specify a voice id in the voice field; language is detected automatically by Kokoro. Set stream: true to receive Server-Sent Events with one chunk per sentence. Default model is kokoro-82m. See the Audio guide for full details.
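A streaming request sketch (the input field name follows the OpenAI audio schema, which is an assumption; af_heart is an example voice ID, so substitute one from your Kokoro install):

```bash
# Streaming request: SSE events arrive with one audio chunk per sentence.
# "input" assumes the OpenAI audio schema; "af_heart" is an example voice id.
curl http://localhost:11434/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-82m", "voice": "af_heart", "input": "Hello there.", "stream": true}'
```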
Speech-to-Text Models
All Whisper variants are routed via in-memory::<id>:
| Model ID | Variant | Languages | Notes |
|---|---|---|---|
| whisper | base.en | English | Default |
| whisper-tiny, whisper-tiny.en | tiny / tiny.en | multi / English | Smallest, fastest |
| whisper-base, whisper-base.en | base / base.en | multi / English | |
| whisper-small, whisper-small.en | small / small.en | multi / English | |
| whisper-medium, whisper-medium.en | medium / medium.en | multi / English | |
| whisper-large-v3 | large-v3 | multi | Lowest WER on hard corpora |
| whisper-large-v3-turbo | large-v3-turbo | multi | ~33× realtime on RTX 3090 |
| distil-large-v3 | distil-large-v3 | multi | ~33× realtime on RTX 3090 |
GGUF Q5_0 quantization is supported for the larger variants (large-v3 shrinks to ~1 GB) and is applied automatically when the asset store provides the quantized file. STT is accessed via /v1/audio/transcriptions (OpenAI-compatible).
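A transcription request is a multipart upload, following the OpenAI schema (audio.wav here is a placeholder file):

```bash
# Multipart form upload, matching OpenAI's transcriptions schema;
# audio.wav stands in for your input file.
curl http://localhost:11434/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=whisper-large-v3-turbo
```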
Choosing a Model
- Dialogue, simple conversation: 1B quantized is fast enough
- NPC personalities, creative writing: 3B quantized is a good balance
- Complex reasoning, structured generation: 8B quantized for best results
- Shipping a game: Quantized models are recommended -- smaller download, less VRAM, similar quality
Downloading Models
Use the asset downloader to fetch models:
```bash
# List what's available
./atelico-asset-downloader list --namespace models

# Download a specific model
./atelico-asset-downloader download meta-llama/Llama-3.2-3B-Instruct-Q4_K_M

# Interactive mode -- browse and select
./atelico-asset-downloader interactive
```
Custom Asset Store
If your team hosts models on a private store:
```bash
./atelico-asset-downloader \
  --store-url https://your-store.example.com \
  --access-key YOUR_KEY \
  --secret-key YOUR_SECRET \
  download meta-llama/Llama-3.2-3B-Instruct-Q4_K_M
```
HuggingFace Fallback
If a model isn't in the asset store, the server can fetch it from HuggingFace as a last resort. Set the HF_TOKEN environment variable for gated models:
```bash
HF_TOKEN=hf_your_token_here ./atelico-server
```
Model Formats
| Format | Extension | Description |
|---|---|---|
| SafeTensors | .safetensors | Full-precision weights (F16/F32) |
| GGUF | .gguf | Quantized weights (Q4, Q8, etc.) |
| Bonsai 1-bit | .safetensors (special) | 1-bit quantized (experimental) |
Cache Location
Downloaded models are stored locally:
| Platform | Path |
|---|---|
| macOS | ~/Library/Caches/atelico/models/ |
| Linux | ~/.cache/atelico/models/ |
| Windows | %LOCALAPPDATA%\atelico\models\ |
Override with the ATELICO_CACHE_DIR environment variable.
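For example (a sketch; the path is arbitrary):

```bash
# Keep the model cache on a different drive
ATELICO_CACHE_DIR=/mnt/fast/atelico-models ./atelico-server
```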
Using a Proxy Backend
Forward requests to OpenAI or any OpenAI-compatible API:
```bash
OPENAI_API_KEY=sk-... ./atelico-server
```
Then use the openai:: prefix:
```json
{
  "model": "openai::gpt-4o-mini",
  "messages": [{"role": "user", "content": "Hello"}]
}
```
For other providers, use the generic proxy syntax:
```bash
PROXY_ANTHROPIC_API_KEY=sk-... \
PROXY_ANTHROPIC_BASE_URL=https://api.anthropic.com/v1 \
./atelico-server
```
Then use anthropic::model-name as the model identifier.
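End to end, a request through the generic proxy looks like this (a sketch; claude-3-5-haiku-latest is an example ID, so use whatever model the provider serves):

```bash
# "claude-3-5-haiku-latest" is an example id, not a documented default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "anthropic::claude-3-5-haiku-latest", "messages": [{"role": "user", "content": "Hello"}]}'
```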