# Atelico AI Engine
On-device AI inference engine for games and interactive applications. LLMs, image generation, embeddings, classifiers, guardrails, and a semantic memory system -- running locally on the player's hardware, with native integration for Godot, Unity, and Unreal Engine.
## Three Things That Make Atelico Different

### Creative Control
AI left to its own devices produces unreliable output. Atelico gives you the tools to keep it on script:
- Structured generation forces output into exact JSON schemas -- guaranteed valid data your game code can parse directly, every time
- Semantic KV Store lets you author dialogue, lore, and game data, then retrieve the right piece at the right moment based on meaning, not keywords
- Guardrails filter unsafe content at multiple levels (keyword blocklists, ML classifiers, LLM judges) with customizable presets and the option to rewrite rather than just block
- LoRA adapters hot-swap model personality at runtime -- different voices for different NPCs without loading separate models
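As a sketch of how structured generation keeps output on script, the snippet below builds an OpenAI-style chat request constrained to a JSON schema and parses a reply. The `response_format` field name follows the OpenAI convention and the model id is taken from the HTTP example later in this page — check the Structured Generation docs for Atelico's exact request shape.

```python
import json

# A JSON Schema for a quest reward -- the model's output is constrained
# to match it, so the reply is guaranteed to be parseable.
reward_schema = {
    "type": "object",
    "properties": {
        "item": {"type": "string"},
        "gold": {"type": "integer"},
        "rarity": {"type": "string", "enum": ["common", "rare", "epic"]},
    },
    "required": ["item", "gold", "rarity"],
}

# OpenAI-style request body (field names here are an assumption based on
# the OpenAI convention, not Atelico's confirmed schema).
payload = {
    "model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [{"role": "user", "content": "Generate a quest reward."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"schema": reward_schema},
    },
}

# Because output is schema-constrained, the reply body is always valid JSON:
sample_reply = '{"item": "Iron Dagger", "gold": 25, "rarity": "common"}'
reward = json.loads(sample_reply)
print(reward["rarity"])  # -> common
```

Your game code can consume `reward` directly as typed data — no regex scraping or retry-on-parse-failure loops.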
### Runs in Your Game
The engine embeds directly in your game process with native SDKs for Godot, Unity, and Unreal Engine. Not a sidecar process, not an HTTP call to localhost -- in-process, sharing your GPU.
Frame-aware scheduling lets you control the priority:
- Prioritize Graphics during action sequences -- AI yields GPU time to keep FPS smooth
- Balance for normal gameplay -- even split between inference and rendering
- Prioritize Compute during dialogue scenes -- fastest AI responses
On NVIDIA GPUs with compatible drivers, hardware-level GPU sharing (Compute-in-Graphics) eliminates context-switching overhead entirely.
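To make the trade-off concrete, here is an illustrative model of the three modes as per-frame GPU-time budgets. This is not the SDK's API — the class name and the percentages are invented for illustration; the real SDKs expose their own scheduling calls.

```python
from enum import Enum

# Hypothetical sketch only: each mode maps to a fraction of the frame
# the inference engine may use (fractions invented for illustration).
class SchedulingMode(Enum):
    PRIORITIZE_GRAPHICS = 0.2   # action sequences: AI yields GPU time
    BALANCE = 0.5               # normal gameplay: even split
    PRIORITIZE_COMPUTE = 0.8    # dialogue scenes: fastest AI responses

def inference_budget_ms(mode: SchedulingMode, frame_ms: float = 16.7) -> float:
    """Milliseconds of a 60 FPS frame the inference engine may consume."""
    return frame_ms * mode.value

budget = inference_budget_ms(SchedulingMode.PRIORITIZE_GRAPHICS)
print(round(budget, 2))  # -> 3.34
```

The point of the sketch: at 60 FPS a frame is ~16.7 ms, so yielding GPU time during action sequences is what keeps rendering inside its deadline.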
### Runs on Device
No cloud, no API keys, no per-token costs, no internet required. Models ship with your game as bundled assets.
- Metal on Apple Silicon (macOS, iOS)
- CUDA on NVIDIA GPUs (Windows, Linux)
- CPU everywhere else
- Quantized models (GGUF) run with as little as 0.8 GB of VRAM for a 1B model
- 1-bit models (Bonsai) compress an 8B model to ~1.15 GB
Player data never leaves the device.
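The VRAM figures above can be sanity-checked with back-of-envelope arithmetic over weights alone (KV cache and activations add overhead on top, which is why real usage is somewhat higher). The bits-per-weight values below are assumptions: Q4_K_M quantization averages roughly 4.5 bits per weight, and the ~1.15 GB figure for an 8B Bonsai model implies ~1.15 effective bits per weight once scales and any higher-precision layers are included.

```python
# Rough weight-only size estimate: params * bits / 8 bits-per-byte / 1e9 bytes-per-GB
def weight_size_gb(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e9

q4_1b = weight_size_gb(1e9, 4.5)       # 1B model at ~4.5 bits/weight (Q4_K_M, assumed)
bonsai_8b = weight_size_gb(8e9, 1.15)  # 8B model at ~1.15 effective bits/weight
print(round(q4_1b, 2), round(bonsai_8b, 2))  # -> 0.56 1.15
```

So a quantized 1B model's weights fit in ~0.56 GB, consistent with the 0.8 GB total once runtime overhead is added.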
## Two Ways to Integrate

### HTTP Server
An OpenAI-compatible REST API on port 11434. Any code that works with OpenAI works with Atelico by changing the base URL. Use this for prototyping, tools, content pipelines, or when your game communicates over HTTP.
```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
       "messages": [{"role": "user", "content": "Hello!"}]}'
```
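The same request from Python, using only the standard library. Any OpenAI client library works too — point its base URL at `http://localhost:11434/v1`. The network call is left commented out so the snippet stands alone without a running server.

```python
import json
import urllib.request

# Identical payload to the curl example above.
payload = {
    "model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# With the server running, uncomment to send and read the reply:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```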
### Native SDKs
Embed the engine directly inside your game process:
| SDK | Integration | Streaming Pattern |
|---|---|---|
| Godot | GDExtension (zero-overhead Rust binding) | Signals |
| Unity | UPM package (C# via P/Invoke) | Callbacks |
| Unreal | Plugin (UGameInstanceSubsystem) | Delegates / Blueprints |
| Python | Native module (PyO3) | Iterator |
| C/C++ | Static/dynamic library | Poll loop |
Both the server and SDKs use the same JSON request format -- code and prompts are portable between them.
## Capabilities
| Capability | Description |
|---|---|
| LLM Chat | Multi-turn conversation with system prompts, streaming, and temperature control |
| Structured Generation | Constrain output to a JSON Schema, regex, choice, or grammar -- guaranteed parseable |
| Text-to-Speech | On-device TTS via Kokoro 82M (54 voices, 9 languages) or Pocket TTS (English, instant voice cloning, ~15× realtime on Metal). Streaming audio output. |
| Speech-to-Text | On-device transcription via Whisper (tiny → large-v3, multilingual, GGUF quantized variants). Streaming transcription with VAD for live-mic input on macOS / iOS. |
| Image Generation | Generate images from text prompts on-device (~1s on Apple Silicon) |
| Vision Embeddings | DINOv2 vision embeddings and MAETok image tokenization |
| Embeddings | Convert text to semantic vectors for similarity search |
| Hybrid Search | Combine semantic (vector) + lexical (full-text) retrieval with weighted-sum reranking and per-row score traces |
| Semantic KV Store | Store authored content and retrieve by meaning with faceted filtering |
| Text Classifiers | Categorize text (intent detection, content moderation) |
| Guardrails | Layered safety: keyword filters, ML classifiers, LLM judges, content rewriting |
| LoRA Adapters | Hot-swap model personality at runtime without reloading |
| Prefix Cache | Capture a prompt's KV state once and replay it across many requests (system-prompt reuse, dialogue branching, automatic radix-tree sharing) |
| Answer Cache | In-memory prompt-result cache with TTL and LRU eviction, isolated per adapter / namespace / temperature |
| Matcher | Select one option from many (embedding cosine, LM choice, or cascading escalation) — useful for intent routing, dialogue picking, NPC reactions |
| LM Function Programs | Declarative prompts with resolvers (random tables, files, retrieval, choose-via-matcher), output parsers (tolerant JSON, choice index, recursive field extraction), and reusable LmFunction definitions |
| Generation Policy | Composable retry/repair/fallback loop on top of any LmFunction: validators (JSON parse, schema subset, regex, choice, custom), local JSON repair, retry-with-error, fallback prompt/model/static, function-level guardrails, prompt-result cache |
| LM Primitives | Tokenize, detokenize, and inspect model capabilities (vocab size, max position) directly via the SDK |
| Multi-Backend Routing | Seamlessly mix local inference with cloud API proxies (OpenAI, etc.) |
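As one example of how these pieces fit together, here is a sketch of the Matcher's embedding-cosine strategy: pick the option whose embedding is most similar to the query's. The vectors below are toy values chosen for illustration; in practice both sides would come from the Embeddings capability, and the Matcher offers LM-choice and cascading escalation as alternatives.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (invented): a player utterance vs. candidate NPC reactions.
query = [0.9, 0.1, 0.3]
options = {
    "greet": [0.8, 0.2, 0.4],
    "attack": [-0.5, 0.9, 0.1],
    "trade": [0.1, 0.1, 0.9],
}

best = max(options, key=lambda name: cosine(query, options[name]))
print(best)  # -> greet
```

Because this is pure vector math, an embedding-based match costs far less than an LLM call — which is why cascading escalation (cheap cosine first, LM choice only on ambiguous cases) is useful for intent routing.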
## Supported Models
| Architecture | Example Models | Formats |
|---|---|---|
| LLaMA / Mistral | Llama 3.x, Mistral 7B | SafeTensors, GGUF |
| Qwen 3 | Qwen3-0.6B through Qwen3-8B | SafeTensors, GGUF |
| Qwen 3.5 | Qwen3.5 (hybrid DeltaNet + attention) | SafeTensors, GGUF |
| Gemma 4 | Gemma 4 (MoE) | SafeTensors, GGUF |
| Parcae | Parcae (stable looped transformer) | SafeTensors |
| SmolLM | SmolLM2, SmolLM3 | SafeTensors, GGUF |
| Bonsai 1-bit | PrismML Bonsai 1.7B, 8B | SafeTensors |
| PixArt / Sana | PixArt-Alpha, Sana Sprint | SafeTensors |
| Kokoro | Kokoro 82M TTS (Q8/Q4 linear quantization optional) | SafeTensors |
| Pocket TTS | Pocket TTS — English, instant voice cloning, 24 built-in voices | SafeTensors |
| Whisper | Whisper STT — tiny / base / small / medium / large-v3 / large-v3-turbo / distil-large-v3 | SafeTensors, GGUF (Q5_0) |
Any HuggingFace model using a supported architecture works. Quantized (GGUF) models are recommended for shipping games -- smaller download, less VRAM, minimal quality loss.
## Supported Platforms
| Platform | GPU Backend | Server | Godot | Unity | Unreal | Python | C FFI |
|---|---|---|---|---|---|---|---|
| macOS (Apple Silicon) | Metal | Yes | Yes | Yes | Yes | Yes | Yes |
| Windows (NVIDIA) | CUDA | Yes | Yes | Yes | Yes | Yes | Yes |
| Linux (NVIDIA) | CUDA | Yes | Yes | Yes | Yes | Yes | Yes |
| Any platform | CPU | Yes | Yes | Yes | Yes | Yes | Yes |
| iOS | Metal | -- | -- | -- | -- | -- | Yes |
## Get Started
Server path -- quickest way to try the engine:
- Getting Started -- download a model, start the server, send your first request
- Chat Completions API -- streaming, multi-turn, temperature control
- Structured Generation -- force JSON output matching a schema
SDK path -- for shipping games:
Guides:
- NPC Dialogue -- personality, streaming, multi-turn memory, emotion tags
- Structured Game Data -- quests, items, encounters as typed JSON
- Prompts & Generation Policy -- LmFunction, resolvers, parsers, validators, repair, fallbacks, prompt result cache
- Hybrid & Lexical Search -- semantic + full-text retrieval with weighted reranking
- Audio (TTS & STT) -- voices, streaming synthesis, voice cloning, live-mic transcription
- Models -- choosing, downloading, and managing models