
Models

Listing Available Models

curl http://localhost:11434/v1/models

Returns all models the server can serve:

{
  "object": "list",
  "data": [
    {"id": "in-memory::meta-llama/Llama-3.2-1B-Instruct", "object": "model", "owned_by": "atelico"},
    {"id": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M", "object": "model", "owned_by": "atelico"},
    ...
  ]
}
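
To pull out just the model ids for scripting, you can pipe the response through jq (assuming jq is installed):

curl -s http://localhost:11434/v1/models | jq -r '.data[].id'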

Model Naming

Models use a prefix::model-name syntax:

| Prefix | Backend | Description |
|---|---|---|
| in-memory:: | Local GPU/CPU | On-device inference (default if no prefix) |
| openai:: | OpenAI proxy | Forwards to OpenAI API |
| image-generation:: | Local GPU/CPU | Image generation models |
| mock:: | Mock | Returns hardcoded responses (for testing) |

If you omit the prefix, in-memory:: is assumed.
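
As a quick illustration, the following two requests resolve to the same on-device model, assuming the server's OpenAI-compatible /v1/chat/completions endpoint used elsewhere in these docs:

# Explicit backend prefix
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "in-memory::meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'

# Bare name -- the in-memory:: prefix is assumed
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'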

Supported Architectures

The engine supports these model architectures in both float (SafeTensors) and quantized (GGUF) formats:

| Architecture | Example Models |
|---|---|
| LLaMA / Mistral | Llama 3.x, Mistral 7B |
| Qwen 3 | Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B |
| Qwen 3.5 | Qwen3.5 (hybrid DeltaNet + attention) |
| Gemma 4 | Gemma 4 |
| Nemotron-H | Nemotron-H (hybrid Mamba2 + attention) |
| Parcae | Parcae (stable looped transformer) |
| SmolLM | SmolLM2-135M, SmolLM2-360M, SmolLM2-1.7B |
| SmolLM3 | SmolLM3-3B |
| Bonsai 1-bit | PrismML Bonsai 1-bit (Qwen3 architecture) |

Any HuggingFace model using one of these architectures can be loaded. The engine auto-detects the architecture from config.json or GGUF metadata.

Available LLM Models

Use ./atelico-asset-downloader list --namespace models to see all models available in your asset store. The models listed by GET /v1/models are pre-configured defaults:

| Model | Parameters | Format | VRAM |
|---|---|---|---|
| meta-llama/Llama-3.2-1B-Instruct | 1B | float | ~2 GB |
| meta-llama/Llama-3.2-1B-Instruct-Q4_K_M | 1B | GGUF | ~0.8 GB |
| meta-llama/Llama-3.2-3B-Instruct | 3B | float | ~6 GB |
| meta-llama/Llama-3.2-3B-Instruct-Q4_K_M | 3B | GGUF | ~2 GB |
| meta-llama/Llama-3.1-8B-Instruct | 8B | float | ~16 GB |
| meta-llama/Llama-3.1-8B-Instruct-Q4_K_M | 8B | GGUF | ~5 GB |

Any model downloaded to the cache can be used by passing its ID to the model field, even if it's not in the default list above.
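
For example, a request body for a cached model outside the defaults might look like this (the Qwen id is illustrative; substitute any cached model with a supported architecture):

{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role": "user", "content": "Hello"}]
}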

Image Generation Models

| Model | Description |
|---|---|
| pixart-alpha | PixArt-Alpha (DMD one-step) |
| pixart-sigma | PixArt-Sigma (multi-step DPM-Solver) |
| sana-sprint | Sana Sprint (SCM 2-step, fast) |
| sana-0.6b | Sana 0.6B (flow matching, 20-step) |
| sana-1.6b | Sana 1.6B (flow matching, 20-step, highest quality) |

Use with the image-generation:: prefix: "model": "image-generation::sana-sprint".
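
A minimal request sketch, assuming the server exposes an OpenAI-compatible /v1/images/generations endpoint (the endpoint and prompt are not documented in this section and are shown as assumptions):

curl http://localhost:11434/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "image-generation::sana-sprint", "prompt": "a watercolor fox", "n": 1}'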

Text-to-Speech Models

| Model ID (after in-memory::) | Engine | Description |
|---|---|---|
| tts, kokoro, kokoro-82m | Kokoro 82M | 54 voices, 9 languages, streaming output. Optional Q8/Q4 linear quantization via ATELICO_KOKORO_QUANT. |
| pocket, pocket-tts | Pocket TTS | English-only, ~15× realtime on Metal / ~34× on RTX 3090, instant voice cloning from a single reference clip, 24 built-in voices. |

TTS is accessed via the /v1/audio/speech endpoint (OpenAI-compatible). Specify a voice id in the voice field; language is detected automatically by Kokoro. Set stream: true to receive Server-Sent Events with one chunk per sentence. Default model is kokoro-82m. See the Audio guide for full details.
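
A minimal non-streaming request sketch (the input field name follows the OpenAI speech API; the voice id and output filename are illustrative, and the audio container depends on server defaults; see the Audio guide):

curl http://localhost:11434/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-82m", "input": "Hello there!", "voice": "af_heart"}' \
  -o speech.wav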

Speech-to-Text Models

All Whisper variants are routed via in-memory::<id>:

| Model ID | Variant | Languages | Notes |
|---|---|---|---|
| whisper | base.en | English | Default |
| whisper-tiny, whisper-tiny.en | tiny / tiny.en | multi / English | Smallest, fastest |
| whisper-base, whisper-base.en | base / base.en | multi / English | |
| whisper-small, whisper-small.en | small / small.en | multi / English | |
| whisper-medium, whisper-medium.en | medium / medium.en | multi / English | |
| whisper-large-v3 | large-v3 | multi | Lowest WER on hard corpora (most accurate) |
| whisper-large-v3-turbo | large-v3-turbo | multi | ~33× realtime on RTX 3090 |
| distil-large-v3 | distil-large-v3 | multi | ~33× realtime on RTX 3090 |

GGUF Q5_0 quantization is supported for the larger variants (large-v3 shrinks to ~1 GB); it is applied automatically when the asset store provides the quantized file. STT is accessed via /v1/audio/transcriptions (OpenAI-compatible).
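
A minimal transcription request sketch (the multipart form fields follow the OpenAI transcriptions API; the audio filename is illustrative):

curl http://localhost:11434/v1/audio/transcriptions \
  -F file=@recording.wav \
  -F model=whisper-small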

Choosing a Model

  • Dialogue, simple conversation: 1B quantized is fast enough
  • NPC personalities, creative writing: 3B quantized is a good balance
  • Complex reasoning, structured generation: 8B quantized for best results
  • Shipping a game: Quantized models are recommended -- smaller download, less VRAM, similar quality

Downloading Models

Use the asset downloader to fetch models:

# List what's available
./atelico-asset-downloader list --namespace models

# Download a specific model
./atelico-asset-downloader download meta-llama/Llama-3.2-3B-Instruct-Q4_K_M

# Interactive mode -- browse and select
./atelico-asset-downloader interactive

Custom Asset Store

If your team hosts models on a private store:

./atelico-asset-downloader \
  --store-url https://your-store.example.com \
  --access-key YOUR_KEY \
  --secret-key YOUR_SECRET \
  download meta-llama/Llama-3.2-3B-Instruct-Q4_K_M

HuggingFace Fallback

If a model isn't in the asset store, the server can fetch it from HuggingFace as a last resort. Set the HF_TOKEN environment variable for gated models:

HF_TOKEN=hf_your_token_here ./atelico-server

Model Formats

| Format | Extension | Description |
|---|---|---|
| SafeTensors | .safetensors | Full-precision weights (F16/F32) |
| GGUF | .gguf | Quantized weights (Q4, Q8, etc.) |
| Bonsai 1-bit | .safetensors (special) | 1-bit quantized (experimental) |

Cache Location

Downloaded models are stored locally:

| Platform | Path |
|---|---|
| macOS | ~/Library/Caches/atelico/models/ |
| Linux | ~/.cache/atelico/models/ |
| Windows | %LOCALAPPDATA%\atelico\models\ |

Override with the ATELICO_CACHE_DIR environment variable.
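
For example, to point the server at a different cache directory (the path is illustrative):

ATELICO_CACHE_DIR=/opt/atelico-models ./atelico-server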

Using a Proxy Backend

Forward requests to OpenAI or any OpenAI-compatible API:

OPENAI_API_KEY=sk-... ./atelico-server

Then use the openai:: prefix:

{
  "model": "openai::gpt-4o-mini",
  "messages": [{"role": "user", "content": "Hello"}]
}

For other providers, use the generic proxy syntax:

PROXY_ANTHROPIC_API_KEY=sk-... \
PROXY_ANTHROPIC_BASE_URL=https://api.anthropic.com/v1 \
./atelico-server

Then use anthropic::model-name as the model identifier.
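
With the variables above set, a request body might look like this (the model name is illustrative; use whatever id the provider expects after the prefix):

{
  "model": "anthropic::claude-3-5-haiku-latest",
  "messages": [{"role": "user", "content": "Hello"}]
}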