Version: 0.9

Atelico AI Engine

On-device AI inference engine for games and interactive applications. LLMs, image generation, embeddings, classifiers, guardrails, and a semantic memory system -- running locally on the player's hardware, with native integration for Godot, Unity, and Unreal Engine.

Three Things That Make Atelico Different

Creative Control

AI left to its own devices produces unreliable output. Atelico gives you the tools to keep it on script:

  • Structured generation forces output into exact JSON schemas -- guaranteed valid data your game code can parse directly, every time
  • Semantic KV Store lets you author dialogue, lore, and game data, then retrieve the right piece at the right moment based on meaning, not keywords
  • Guardrails filter unsafe content at multiple levels (keyword blocklists, ML classifiers, LLM judges) with customizable presets and the option to rewrite rather than just block
  • LoRA adapters hot-swap model personality at runtime -- different voices for different NPCs without loading separate models
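As a consumer-side sketch of what structured generation buys you (the schema and field names here are illustrative, not part of the Atelico SDK): because the engine constrains decoding to the schema, game code can parse and index the output directly, with no validation or repair step.

```python
import json

# A schema you might hand to the engine (illustrative; any JSON Schema works).
npc_reply_schema = {
    "type": "object",
    "properties": {
        "mood": {"enum": ["friendly", "wary", "hostile"]},
        "line": {"type": "string"},
    },
    "required": ["mood", "line"],
}

# With structured generation, the raw model output is guaranteed to be
# valid JSON matching the schema -- no try/except or repair needed.
raw = '{"mood": "wary", "line": "State your business, traveler."}'
reply = json.loads(raw)
print(reply["mood"])  # direct field access, every time
```

The guarantee is the point: the enum constraint means `mood` can only ever be one of the three authored values, so a `match` on it is exhaustive.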

Runs in Your Game

The engine embeds directly in your game process with native SDKs for Godot, Unity, and Unreal Engine. Not a sidecar process, not an HTTP call to localhost -- in-process, sharing your GPU.

Frame-aware scheduling lets you control the priority:

  • Prioritize Graphics during action sequences -- AI yields GPU time to keep FPS smooth
  • Balance for normal gameplay -- even split between inference and rendering
  • Prioritize Compute during dialogue scenes -- fastest AI responses

On NVIDIA GPUs with compatible drivers, hardware-level GPU sharing (Compute-in-Graphics) eliminates context-switching overhead entirely.
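One way to picture frame-aware scheduling (a sketch with invented names and numbers, not the Atelico API): each priority mode maps to a share of the frame budget that inference may consume before yielding the GPU back to rendering.

```python
# Illustrative model of frame-aware scheduling. The mode names mirror the
# three priorities above; the share values are made up for illustration.
FRAME_MS = 1000 / 60  # 60 FPS target frame budget

INFERENCE_SHARE = {
    "prioritize_graphics": 0.15,  # action sequences: AI yields GPU time
    "balance": 0.50,              # normal gameplay: even split
    "prioritize_compute": 0.85,   # dialogue scenes: fastest responses
}

def inference_budget_ms(mode: str) -> float:
    """Milliseconds of GPU time inference may use in one frame."""
    return FRAME_MS * INFERENCE_SHARE[mode]

for mode in INFERENCE_SHARE:
    print(f"{mode}: {inference_budget_ms(mode):.1f} ms / frame")
```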

Runs on Device

No cloud, no API keys, no per-token costs, no internet required. Models ship with your game as bundled assets.

  • Metal on Apple Silicon (macOS, iOS)
  • CUDA on NVIDIA GPUs (Windows, Linux)
  • CPU everywhere else
  • Quantized models (GGUF) run with as little as 0.8 GB of VRAM for a 1B model
  • 1-bit models (Bonsai) compress an 8B model to ~1.15 GB
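A back-of-envelope check of those figures (the overhead factors are rough assumptions covering embeddings, quantization scales, and KV buffers, not measured values): weight storage is roughly parameters × bits per weight.

```python
def quantized_size_gb(params: float, bits_per_weight: float, overhead: float = 1.0) -> float:
    """Rough weight-storage estimate: parameters x bits/weight, in GB."""
    return params * bits_per_weight / 8 / 1e9 * overhead

# 1B model at ~4.5 effective bits/weight (Q4_K_M-style mixed quantization),
# with an assumed ~40% overhead -- lands in the ~0.8 GB ballpark
print(quantized_size_gb(1e9, 4.5, overhead=1.4))

# 8B model at 1 effective bit/weight, plus an assumed ~15% overhead
# for embeddings and scales -- ~1.15 GB
print(quantized_size_gb(8e9, 1.0, overhead=1.15))
```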

Player data never leaves the device.

Two Ways to Integrate

HTTP Server

An OpenAI-compatible REST API on port 11434. Any code that works with OpenAI works with Atelico by changing the base URL. Use this for prototyping, tools, content pipelines, or when your game communicates over HTTP.

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
       "messages": [{"role": "user", "content": "Hello!"}]}'
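The same request from Python, using only the standard library (the payload mirrors the curl example above; the final commented-out line is the only step that needs a running server).

```python
import json
import urllib.request

# Identical payload to the curl example above.
payload = {
    "model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running:
# body = json.loads(urllib.request.urlopen(req).read())
```

Because the API is OpenAI-compatible, the official OpenAI client libraries also work once pointed at `http://localhost:11434/v1` as the base URL.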

Native SDKs

Embed the engine directly inside your game process:

| SDK | Integration | Streaming Pattern |
| --- | --- | --- |
| Godot | GDExtension (zero-overhead Rust binding) | Signals |
| Unity | UPM package (C# via P/Invoke) | Callbacks |
| Unreal | Plugin (UGameInstanceSubsystem) | Delegates / Blueprints |
| Python | Native module (PyO3) | Iterator |
| C/C++ | Static/dynamic library | Poll loop |

Both the server and SDKs use the same JSON request format -- code and prompts are portable between them.

Capabilities

| Capability | Description |
| --- | --- |
| LLM Chat | Multi-turn conversation with system prompts, streaming, and temperature control |
| Structured Generation | Constrain output to a JSON Schema, regex, choice, or grammar -- guaranteed parseable |
| Text-to-Speech | On-device TTS via Kokoro 82M (54 voices, 9 languages) or Pocket TTS (English, instant voice cloning, ~15× realtime on Metal). Streaming audio output. |
| Speech-to-Text | On-device transcription via Whisper (tiny → large-v3, multilingual, GGUF quantized variants). Streaming transcription with VAD for live-mic input on macOS / iOS. |
| Image Generation | Generate images from text prompts on-device (~1 s on Apple Silicon) |
| Vision Embeddings | DINOv2 vision embeddings and MAETok image tokenization |
| Embeddings | Convert text to semantic vectors for similarity search |
| Hybrid Search | Combine semantic (vector) + lexical (full-text) retrieval with weighted-sum reranking and per-row score traces |
| Semantic KV Store | Store authored content and retrieve by meaning with faceted filtering |
| Text Classifiers | Categorize text (intent detection, content moderation) |
| Guardrails | Layered safety: keyword filters, ML classifiers, LLM judges, content rewriting |
| LoRA Adapters | Hot-swap model personality at runtime without reloading |
| Prefix Cache | Capture a prompt's KV state once and replay it across many requests (system-prompt reuse, dialogue branching, automatic radix-tree sharing) |
| Answer Cache | In-memory prompt-result cache with TTL and LRU eviction, isolated per adapter / namespace / temperature |
| Matcher | Select one option from many (embedding cosine, LM choice, or cascading escalation) -- useful for intent routing, dialogue picking, NPC reactions |
| LM Function Programs | Declarative prompts with resolvers (random tables, files, retrieval, choose-via-matcher), output parsers (tolerant JSON, choice index, recursive field extraction), and reusable LmFunction definitions |
| Generation Policy | Composable retry/repair/fallback loop on top of any LmFunction: validators (JSON parse, schema subset, regex, choice, custom), local JSON repair, retry-with-error, fallback prompt/model/static, function-level guardrails, prompt-result cache |
| LM Primitives | Tokenize, detokenize, and inspect model capabilities (vocab size, max position) directly via the SDK |
| Multi-Backend Routing | Seamlessly mix local inference with cloud API proxies (OpenAI, etc.) |
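To make the Hybrid Search row concrete, here is an illustrative sketch of weighted-sum reranking with per-row score traces (the weights, field names, and row shape are assumptions for illustration, not the engine's internals): each candidate carries a normalized semantic score and a lexical score, and the final ranking combines them.

```python
def hybrid_rerank(rows, w_vec=0.7, w_lex=0.3):
    """Weighted-sum rerank over rows with semantic/lexical scores in [0, 1].

    Returns rows sorted by combined score; each keeps a per-row trace
    showing how the final score was computed.
    """
    scored = []
    for row in rows:
        combined = w_vec * row["vec"] + w_lex * row["lex"]
        scored.append({**row, "score": combined,
                       "trace": {"vec": row["vec"], "lex": row["lex"],
                                 "w_vec": w_vec, "w_lex": w_lex}})
    return sorted(scored, key=lambda r: r["score"], reverse=True)

rows = [
    {"id": "lore-12", "vec": 0.91, "lex": 0.10},  # strong semantic match
    {"id": "lore-07", "vec": 0.40, "lex": 0.95},  # exact keyword hit
]
ranked = hybrid_rerank(rows)
print([r["id"] for r in ranked])  # semantic-weighted: lore-12 first
```

Shifting the weights toward `w_lex` favors exact keyword hits over paraphrases, which is the knob the weighted-sum design exposes.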

Supported Models

| Architecture | Example Models | Formats |
| --- | --- | --- |
| LLaMA / Mistral | Llama 3.x, Mistral 7B | SafeTensors, GGUF |
| Qwen 3 | Qwen3-0.6B through Qwen3-8B | SafeTensors, GGUF |
| Qwen 3.5 | Qwen3.5 (hybrid DeltaNet + attention) | SafeTensors, GGUF |
| Gemma 4 | Gemma 4 (MoE) | SafeTensors, GGUF |
| Parcae | Parcae (stable looped transformer) | SafeTensors |
| SmolLM | SmolLM2, SmolLM3 | SafeTensors, GGUF |
| Bonsai 1-bit | PrismML Bonsai 1.7B, 8B | SafeTensors |
| PixArt / Sana | PixArt-Alpha, Sana Sprint | SafeTensors |
| Kokoro | Kokoro 82M TTS (Q8/Q4 linear quantization optional) | SafeTensors |
| Pocket TTS | Pocket TTS (English, instant voice cloning, 24 built-in voices) | SafeTensors |
| Whisper | Whisper STT (tiny / base / small / medium / large-v3 / large-v3-turbo / distil-large-v3) | SafeTensors, GGUF (Q5_0) |

Any HuggingFace model using a supported architecture works. Quantized (GGUF) models are recommended for shipping games -- smaller download, less VRAM, minimal quality loss.

Supported Platforms

| Platform | GPU Backend | Server | Godot | Unity | Unreal | Python | C FFI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| macOS (Apple Silicon) | Metal | Yes | Yes | Yes | Yes | Yes | Yes |
| Windows (NVIDIA) | CUDA | Yes | Yes | Yes | Yes | Yes | Yes |
| Linux (NVIDIA) | CUDA | Yes | Yes | Yes | Yes | Yes | Yes |
| Any platform | CPU | Yes | Yes | Yes | Yes | Yes | Yes |
| iOS | Metal | -- | -- | -- | -- | -- | Yes |

Get Started

Server path -- quickest way to try the engine:

  1. Getting Started -- download a model, start the server, send your first request
  2. Chat Completions API -- streaming, multi-turn, temperature control
  3. Structured Generation -- force JSON output matching a schema

SDK path -- for shipping games:

Guides: