Version: 0.8

Python: Getting Started

This guide walks you through setting up the Atelico Python SDK and using it for LLM inference, streaming, and structured generation.

What You'll Build

A Python script that loads a model, sends chat requests, streams token-by-token output, and generates structured JSON — the foundation for game tools, content pipelines, or scripted NPC behavior.

By the end, you'll understand:

  1. How to install the SDK and create an Engine
  2. How to send a blocking chat request
  3. How to stream tokens with a Python iterator
  4. How to maintain conversation history
  5. How to use structured generation for typed output

Prerequisites

  • Python 3.9 or later
  • The Atelico server bundle (atelico-asset-downloader)
  • A downloaded model:
./atelico-asset-downloader download meta-llama/Llama-3.2-3B-Instruct-Q4_K_M

Step 1: Install the SDK

From a Release Wheel

pip install atelico-0.6.2-cp39-abi3-macosx_11_0_arm64.whl # macOS Apple Silicon
pip install atelico-0.6.2-cp39-abi3-linux_x86_64.whl # Linux

From Source (Development)

Requires Rust and maturin:

pip install maturin

cd atelico-python

# macOS (Apple Silicon, Metal GPU)
maturin develop --release --features metal

# Linux/Windows (NVIDIA GPU)
maturin develop --release --features cuda

# CPU only
maturin develop --release

Verify Installation

import atelico
print("Atelico SDK loaded")

Step 2: Create an Engine

from atelico import Engine

# Auto-detect the best GPU backend (Metal on Mac, CUDA on NVIDIA, CPU fallback)
engine = Engine()

# Or specify explicitly
engine = Engine(device="metal") # macOS Apple Silicon
engine = Engine(device="cuda") # NVIDIA GPU
engine = Engine(device="cpu") # CPU only

The engine supports Python's context manager:

with Engine() as engine:
    # ... use engine ...
    pass  # Automatically cleans up
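
If you want to prefer a specific backend but degrade gracefully, you can catch DeviceError (covered under Error Handling below). This is a minimal sketch, assuming the constructor raises DeviceError when the requested backend is unavailable:

from atelico import Engine, DeviceError

def create_engine() -> Engine:
    # Prefer an explicit CUDA device; fall back to auto-detection if it fails.
    try:
        return Engine(device="cuda")
    except DeviceError:
        return Engine()  # auto-detect (Metal, CUDA, or CPU)

engine = create_engine()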

Step 3: Load a Model

# Blocking — downloads from cache/HuggingFace if needed
engine.load_model("meta-llama/Llama-3.2-3B-Instruct-Q4_K_M")

# Check if loaded
assert engine.model_is_loaded("meta-llama/Llama-3.2-3B-Instruct-Q4_K_M")

The first call downloads the model to the local cache. Subsequent calls load from cache instantly.
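
A guarded load keeps re-runs cheap and makes download failures explicit. A minimal sketch reusing model_is_loaded and the ModelLoadError exception from the Error Handling section below:

from atelico import ModelLoadError

MODEL = "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M"

if not engine.model_is_loaded(MODEL):
    try:
        engine.load_model(MODEL)  # downloads on first run, served from cache afterwards
    except ModelLoadError as e:
        raise SystemExit(f"Could not load {MODEL}: {e}")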

Step 4: Blocking Chat Request

import json

request = json.dumps({
    "model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [
        {"role": "system", "content": "You are a friendly tavern keeper named Boris. Keep responses under 2 sentences."},
        {"role": "user", "content": "What's on the menu today?"},
    ],
    "max_tokens": 100,
    "temperature": 0.7,
})

response_json = engine.llm_chat_completion(request)
response = json.loads(response_json)
print(response["choices"][0]["message"]["content"])

The call blocks until the full response is generated. The GIL is released during inference, so other Python threads can run.
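
Because the GIL is released, a worker thread can run inference while the main thread stays responsive. A minimal sketch using the standard library and the request built above:

import json
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

# Submit the blocking call to a worker thread; the GIL is released during inference.
future = executor.submit(engine.llm_chat_completion, request)

# ... do other work on the main thread (game loop, UI updates) ...

response = json.loads(future.result())  # blocks only when you actually need the result
print(response["choices"][0]["message"]["content"])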

Step 5: Streaming

The streaming API returns a TokenStream iterator — use it in a regular for loop:

import json

request = json.dumps({
    "model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [
        {"role": "system", "content": "You are a narrator for a fantasy RPG."},
        {"role": "user", "content": "Describe the entrance to the dungeon."},
    ],
    "max_tokens": 200,
    "temperature": 0.8,
})

stream = engine.llm_chat_completion_stream(request)

full_text = ""
for chunk_json in stream:
    chunk = json.loads(chunk_json)
    delta = chunk["choices"][0]["delta"]
    if "content" in delta and delta["content"] is not None:
        full_text += delta["content"]
        print(delta["content"], end="", flush=True)

print() # newline after streaming
print(f"Full response: {full_text}")

Each iteration of the for loop blocks until the next token is available, then yields a JSON string in OpenAI ChatCompletionChunk format.
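
If you only need the text, a small generator can hide the JSON plumbing. This sketch assumes chunks follow the OpenAI ChatCompletionChunk shape shown above:

import json
from typing import Iterator

def stream_text(engine, request: str) -> Iterator[str]:
    """Yield only the content deltas from a streaming chat completion."""
    for chunk_json in engine.llm_chat_completion_stream(request):
        delta = json.loads(chunk_json)["choices"][0]["delta"]
        content = delta.get("content")
        if content:
            yield content

for piece in stream_text(engine, request):
    print(piece, end="", flush=True)
print()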

Step 6: Multi-Turn Conversation

Build up the messages array across turns:

import json

conversation = [
    {"role": "system", "content": "You are Greta, a grumpy blacksmith. You secretly care about the player but never admit it. Keep responses under 3 sentences."}
]

def chat(player_message: str) -> str:
    conversation.append({"role": "user", "content": player_message})

    request = json.dumps({
        "model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
        "messages": conversation,
        "max_tokens": 150,
        "temperature": 0.8,
    })

    response = json.loads(engine.llm_chat_completion(request))
    reply = response["choices"][0]["message"]["content"]

    # Store the reply for future context
    conversation.append({"role": "assistant", "content": reply})
    return reply

# Interactive loop
while True:
    user_input = input("You: ")
    if user_input.lower() in ("quit", "exit"):
        break
    print(f"Greta: {chat(user_input)}")

Step 7: Structured Generation

Force the model to output valid JSON matching a schema:

import json

request = json.dumps({
    "model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [
        {"role": "user", "content": "Generate a random fantasy weapon for a level 5 rogue"},
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "Weapon",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "type": {"type": "string", "enum": ["sword", "axe", "bow", "staff", "dagger"]},
                    "damage": {"type": "integer", "minimum": 1, "maximum": 100},
                    "rarity": {"type": "string", "enum": ["common", "uncommon", "rare", "legendary"]},
                    "description": {"type": "string"},
                },
                "required": ["name", "type", "damage", "rarity", "description"],
            },
            "strict": True,
        },
    },
})

response = json.loads(engine.llm_chat_completion(request))
weapon = json.loads(response["choices"][0]["message"]["content"])

print(f"Found a {weapon['rarity']} {weapon['type']}: {weapon['name']}")
print(f"Damage: {weapon['damage']}")
print(f"Description: {weapon['description']}")

The output is guaranteed to be valid JSON matching the schema.
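
Because the content conforms to the schema, it maps cleanly onto a typed structure. A minimal sketch with a dataclass mirroring the schema above:

from dataclasses import dataclass

@dataclass
class Weapon:
    name: str
    type: str
    damage: int
    rarity: str
    description: str

weapon_obj = Weapon(**weapon)  # `weapon` is the dict parsed above
assert weapon_obj.rarity in ("common", "uncommon", "rare", "legendary")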

Error Handling

The SDK provides typed exceptions:

from atelico import (
    Engine,
    AtelicoError,
    ModelLoadError,
    InferenceError,
    DeviceError,
    GuardrailBlockedError,
)

try:
    engine.load_model("nonexistent/model")
except ModelLoadError as e:
    print(f"Model not found: {e}")
except DeviceError as e:
    print(f"GPU error: {e}")
except AtelicoError as e:
    print(f"Engine error: {e}")

GPU Scheduling

Control GPU resource allocation:

engine.set_scheduling_mode("balance") # Default
engine.set_scheduling_mode("prioritize_compute") # Fast inference
engine.set_scheduling_mode("prioritize_graphics") # Smooth rendering

engine.set_vram_budget_mb(4096) # Cap VRAM usage
engine.set_target_tps(15) # Limit tokens/sec
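
In a game you might switch modes by context, e.g. favor inference during a loading screen and rendering during play. A sketch using only the calls above:

# During a loading screen: let the model use more of the GPU.
engine.set_scheduling_mode("prioritize_compute")
engine.load_model("meta-llama/Llama-3.2-3B-Instruct-Q4_K_M")

# Back in gameplay: protect the frame rate and cap generation speed.
engine.set_scheduling_mode("prioritize_graphics")
engine.set_target_tps(15)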

Audio: TTS & STT

The Python SDK exposes two blocking methods (audio_synthesize, audio_transcribe) and one streaming iterator (audio_synthesize_stream). Audio bytes cross the binding as base64-encoded WAV files.

import base64, json

# Blocking TTS — Kokoro by default, Pocket TTS via "in-memory::pocket-tts"
resp = json.loads(engine.audio_synthesize(json.dumps({
    "model": "in-memory::tts",
    "input": "Hello from Atelico.",
    "voice": "af_heart",
})))
open("hello.wav", "wb").write(base64.b64decode(resp["audio_b64"]))

# Streaming TTS — yields one AudioSpeechChunk per sentence
stream = engine.audio_synthesize_stream(json.dumps({
    "model": "in-memory::pocket-tts",
    "input": "First sentence. Second one comes right after.",
    "voice": "alba",
}))
for chunk_json in stream:
    chunk = json.loads(chunk_json)
    wav = base64.b64decode(chunk["audio"])
    # play `wav` immediately while subsequent sentences synthesize

# Blocking STT — encode a WAV file as base64
wav_b64 = base64.b64encode(open("speech.wav", "rb").read()).decode()
result = json.loads(engine.audio_transcribe(json.dumps({
    "model": "in-memory::whisper",
    "audio_b64": wav_b64,
})))
print(result["text"])

Whisper variant ids: whisper (default → whisper-base.en), whisper-tiny[.en], whisper-base[.en], whisper-small[.en], whisper-medium[.en], whisper-large-v3[-turbo], distil-large-v3. TTS ids: tts (default → kokoro-82m), kokoro, kokoro-82m, pocket, pocket-tts.
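
A variant id slots into the same in-memory:: model string, for example (assuming the prefix combines with every variant id the same way as the default):

result = json.loads(engine.audio_transcribe(json.dumps({
    "model": "in-memory::whisper-small.en",  # assumed: "in-memory::" prefix + variant id
    "audio_b64": wav_b64,
})))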

For the full feature matrix (24 Pocket TTS voices, instant voice cloning, Kokoro Q8/Q4 quantization, BF16/F16 dtypes, language overrides, segment/word timestamps), see the Audio guide.

Next Steps