# Atelico AI Engine
On-device AI inference engine for games and interactive applications. LLMs, image generation, embeddings, classifiers, guardrails, and a semantic memory system -- running locally on the player's hardware, with native integration for Godot, Unity, and Unreal Engine.
## Three Things That Make Atelico Different

### Creative Control
AI left to its own devices produces unreliable output. Atelico gives you the tools to keep it on script:
- Structured generation forces output into exact JSON schemas -- guaranteed valid data your game code can parse directly, every time
- Semantic KV Store lets you author dialogue, lore, and game data, then retrieve the right piece at the right moment based on meaning, not keywords
- Guardrails filter unsafe content at multiple levels (keyword blocklists, ML classifiers, LLM judges) with customizable presets and the option to rewrite rather than just block
- LoRA adapters hot-swap model personality at runtime -- different voices for different NPCs without loading separate models
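As a sketch of how structured generation keeps output on script, the snippet below builds an OpenAI-style chat request constrained to a JSON schema and parses a reply. The `response_format` field name follows the OpenAI convention and the model id is taken from the HTTP example later in this page — check the Structured Generation docs for Atelico's exact request shape.

```python
import json

# A JSON Schema for a quest reward -- the model's output is constrained
# to match it, so the reply is guaranteed to be parseable.
reward_schema = {
    "type": "object",
    "properties": {
        "item": {"type": "string"},
        "gold": {"type": "integer"},
        "rarity": {"type": "string", "enum": ["common", "rare", "epic"]},
    },
    "required": ["item", "gold", "rarity"],
}

# OpenAI-style request body (field names here are an assumption based on
# the OpenAI convention, not Atelico's confirmed schema).
payload = {
    "model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [{"role": "user", "content": "Generate a quest reward."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"schema": reward_schema},
    },
}

# Because output is schema-constrained, the reply body is always valid JSON:
sample_reply = '{"item": "Iron Dagger", "gold": 25, "rarity": "common"}'
reward = json.loads(sample_reply)
print(reward["rarity"])  # -> common
```

Your game code can consume `reward` directly as typed data — no regex scraping or retry-on-parse-failure loops.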
### Runs in Your Game
The engine embeds directly in your game process with native SDKs for Godot, Unity, and Unreal Engine. Not a sidecar process, not an HTTP call to localhost -- in-process, sharing your GPU.
Frame-aware scheduling lets you control the priority:
- Prioritize Graphics during action sequences -- AI yields GPU time to keep FPS smooth
- Balance for normal gameplay -- even split between inference and rendering
- Prioritize Compute during dialogue scenes -- fastest AI responses
On NVIDIA GPUs with compatible drivers, hardware-level GPU sharing (Compute-in-Graphics) eliminates context-switching overhead entirely.
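To make the trade-off concrete, here is an illustrative model of the three modes as per-frame GPU-time budgets. This is not the SDK's API — the class name and the percentages are invented for illustration; the real SDKs expose their own scheduling calls.

```python
from enum import Enum

# Hypothetical sketch only: each mode maps to a fraction of the frame
# the inference engine may use (fractions invented for illustration).
class SchedulingMode(Enum):
    PRIORITIZE_GRAPHICS = 0.2   # action sequences: AI yields GPU time
    BALANCE = 0.5               # normal gameplay: even split
    PRIORITIZE_COMPUTE = 0.8    # dialogue scenes: fastest AI responses

def inference_budget_ms(mode: SchedulingMode, frame_ms: float = 16.7) -> float:
    """Milliseconds of a 60 FPS frame the inference engine may consume."""
    return frame_ms * mode.value

budget = inference_budget_ms(SchedulingMode.PRIORITIZE_GRAPHICS)
print(round(budget, 2))  # -> 3.34
```

The point of the sketch: at 60 FPS a frame is ~16.7 ms, so yielding GPU time during action sequences is what keeps rendering inside its deadline.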
### Runs on Device
No cloud, no API keys, no per-token costs, no internet required. Models ship with your game as bundled assets.
- Metal on Apple Silicon (macOS, iOS)
- CUDA on NVIDIA GPUs (Windows, Linux)
- CPU everywhere else
- Quantized models (GGUF) run with as little as 0.8 GB of VRAM for a 1B model
- 1-bit models (Bonsai) compress an 8B model to ~1.15 GB
Player data never leaves the device.
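The VRAM figures above can be sanity-checked with back-of-envelope arithmetic over weights alone (KV cache and activations add overhead on top, which is why real usage is somewhat higher). The bits-per-weight values below are assumptions: Q4_K_M quantization averages roughly 4.5 bits per weight, and the ~1.15 GB figure for an 8B Bonsai model implies ~1.15 effective bits per weight once scales and any higher-precision layers are included.

```python
# Rough weight-only size estimate: params * bits / 8 bits-per-byte / 1e9 bytes-per-GB
def weight_size_gb(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e9

q4_1b = weight_size_gb(1e9, 4.5)       # 1B model at ~4.5 bits/weight (Q4_K_M, assumed)
bonsai_8b = weight_size_gb(8e9, 1.15)  # 8B model at ~1.15 effective bits/weight
print(round(q4_1b, 2), round(bonsai_8b, 2))  # -> 0.56 1.15
```

So a quantized 1B model's weights fit in ~0.56 GB, consistent with the 0.8 GB total once runtime overhead is added.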
## Two Ways to Integrate

### HTTP Server
An OpenAI-compatible REST API on port 11434. Any code that works with OpenAI works with Atelico by changing the base URL. Use this for prototyping, tools, content pipelines, or when your game communicates over HTTP.
```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
       "messages": [{"role": "user", "content": "Hello!"}]}'
```
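The same request from Python, using only the standard library. Any OpenAI client library works too — point its base URL at `http://localhost:11434/v1`. The network call is left commented out so the snippet stands alone without a running server.

```python
import json
import urllib.request

# Identical payload to the curl example above.
payload = {
    "model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# With the server running, uncomment to send and read the reply:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```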
### Native SDKs
Embed the engine directly inside your game process:
| SDK | Integration | Streaming Pattern |
|---|---|---|
| Godot | GDExtension (zero-overhead Rust binding) | Signals |
| Unity | UPM package (C# via P/Invoke) | Callbacks |
| Unreal | Plugin (UGameInstanceSubsystem) | Delegates / Blueprints |
| Python | Native module (PyO3) | Iterator |
| C/C++ | Static/dynamic library | Poll loop |
Both the server and SDKs use the same JSON request format -- code and prompts are portable between them.
## Capabilities
| Capability | Description |
|---|---|
| LLM Chat | Multi-turn conversation with system prompts, streaming, and temperature control |
| Structured Generation | Constrain output to a JSON Schema, regex, choice, or grammar -- guaranteed parseable |
| Text-to-Speech | On-device TTS via Kokoro 82M (54 voices, 9 languages) or Pocket TTS (English, instant voice cloning, ~15× realtime on Metal). Streaming audio output. |
| Speech-to-Text | On-device transcription via Whisper (tiny → large-v3, multilingual, GGUF quantized variants). Streaming transcription with VAD for live-mic input on macOS / iOS. |
| Image Generation | Generate images from text prompts on-device (~1s on Apple Silicon) |
| Vision Embeddings | DINOv2 vision embeddings and MAETok image tokenization |
| Embeddings | Convert text to semantic vectors for similarity search |
| Hybrid Search | Combine semantic (vector) + lexical (full-text) retrieval with weighted-sum reranking and per-row score traces |
| Semantic KV Store | Store authored content and retrieve by meaning with faceted filtering |
| Text Classifiers | Categorize text (intent detection, content moderation) |
| Guardrails | Layered safety: keyword filters, ML classifiers, LLM judges, content rewriting |
| LoRA Adapters | Hot-swap model personality at runtime without reloading |
| Prefix Cache | Capture a prompt's KV state once and replay it across many requests (system-prompt reuse, dialogue branching, automatic radix-tree sharing) |
| Answer Cache | In-memory prompt-result cache with TTL and LRU eviction, isolated per adapter / namespace / temperature |
| Matcher | Select one option from many (embedding cosine, LM choice, or cascading escalation) — useful for intent routing, dialogue picking, NPC reactions |
| LM Function Programs | Declarative prompts with resolvers (random tables, files, retrieval, choose-via-matcher), output parsers (tolerant JSON, choice index, recursive field extraction), and reusable LmFunction definitions |
| Generation Policy | Composable retry/repair/fallback loop on top of any LmFunction: validators (JSON parse, schema subset, regex, choice, custom), local JSON repair, retry-with-error, fallback prompt/model/static, function-level guardrails, prompt-result cache |
| LM Primitives | Tokenize, detokenize, and inspect model capabilities (vocab size, max position) directly via the SDK |
| Multi-Backend Routing | Seamlessly mix local inference with cloud API proxies (OpenAI, etc.) |
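As one example of how these pieces fit together, here is a sketch of the Matcher's embedding-cosine strategy: pick the option whose embedding is most similar to the query's. The vectors below are toy values chosen for illustration; in practice both sides would come from the Embeddings capability, and the Matcher offers LM-choice and cascading escalation as alternatives.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (invented): a player utterance vs. candidate NPC reactions.
query = [0.9, 0.1, 0.3]
options = {
    "greet": [0.8, 0.2, 0.4],
    "attack": [-0.5, 0.9, 0.1],
    "trade": [0.1, 0.1, 0.9],
}

best = max(options, key=lambda name: cosine(query, options[name]))
print(best)  # -> greet
```

Because this is pure vector math, an embedding-based match costs far less than an LLM call — which is why cascading escalation (cheap cosine first, LM choice only on ambiguous cases) is useful for intent routing.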
## Supported Models
| Architecture | Example Models | Formats |
|---|---|---|
| LLaMA / Mistral | Llama 3.x, Mistral 7B | SafeTensors, GGUF |
| Qwen 3 | Qwen3-0.6B through Qwen3-8B | SafeTensors, GGUF |
| Qwen 3.5 | Qwen3.5 (hybrid DeltaNet + attention) | SafeTensors, GGUF |
| Gemma 4 | Gemma 4 (MoE) | SafeTensors, GGUF |
| Parcae | Parcae (stable looped transformer) | SafeTensors |
| SmolLM | SmolLM2, SmolLM3 | SafeTensors, GGUF |
| Bonsai 1-bit | PrismML Bonsai 1.7B, 8B | SafeTensors |
| PixArt / Sana | PixArt-Alpha, Sana Sprint | SafeTensors |
| Kokoro | Kokoro 82M TTS (Q8/Q4 linear quantization optional) | SafeTensors |
| Pocket TTS | Pocket TTS — English, instant voice cloning, 24 built-in voices | SafeTensors |
| Whisper | Whisper STT — tiny / base / small / medium / large-v3 / large-v3-turbo / distil-large-v3 | SafeTensors, GGUF (Q5_0) |
Any HuggingFace model using a supported architecture works. Quantized (GGUF) models are recommended for shipping games -- smaller download, less VRAM, minimal quality loss.
## Supported Platforms
| Platform | GPU Backend | Server | Godot | Unity | Unreal | Python | C FFI |
|---|---|---|---|---|---|---|---|
| macOS (Apple Silicon) | Metal | Yes | Yes | Yes | Yes | Yes | Yes |
| Windows (NVIDIA) | CUDA | Yes | Yes | Yes | Yes | Yes | Yes |
| Linux (NVIDIA) | CUDA | Yes | Yes | Yes | Yes | Yes | Yes |
| Any platform | CPU | Yes | Yes | Yes | Yes | Yes | Yes |
| iOS | Metal | -- | -- | -- | -- | -- | Yes |
## Get Started
Server path -- quickest way to try the engine:
- Getting Started -- download a model, start the server, send your first request
- Chat Completions API -- streaming, multi-turn, temperature control
- Structured Generation -- force JSON output matching a schema
SDK path -- for shipping games:
Guides:
- NPC Dialogue -- personality, streaming, multi-turn memory, emotion tags
- Structured Game Data -- quests, items, encounters as typed JSON
- Prompts & Generation Policy -- LmFunction, resolvers, parsers, validators, repair, fallbacks, prompt result cache
- Hybrid & Lexical Search -- semantic + full-text retrieval with weighted reranking
- Audio (TTS & STT) -- voices, streaming synthesis, voice cloning, live-mic transcription
- Models -- choosing, downloading, and managing models