Python: Getting Started
This guide walks you through setting up the Atelico Python SDK and using it for LLM inference, streaming, and structured generation.
What You'll Build
A Python script that loads a model, sends chat requests, streams token-by-token output, and generates structured JSON — the foundation for game tools, content pipelines, or scripted NPC behavior.
By the end, you'll understand:
- How to install the SDK and create an Engine
- How to send a blocking chat request
- How to stream tokens with a Python iterator
- How to maintain conversation history
- How to use structured generation for typed output
Prerequisites
- Python 3.9 or later
- The Atelico server bundle (atelico-asset-downloader)
- A downloaded model:
./atelico-asset-downloader download meta-llama/Llama-3.2-3B-Instruct-Q4_K_M
Step 1: Install the SDK
From a Release Wheel
pip install atelico-0.6.2-cp39-abi3-macosx_11_0_arm64.whl # macOS Apple Silicon
pip install atelico-0.6.2-cp39-abi3-linux_x86_64.whl # Linux
From Source (Development)
Requires Rust and maturin:
pip install maturin
cd atelico-python
# macOS (Apple Silicon, Metal GPU)
maturin develop --release --features metal
# Linux/Windows (NVIDIA GPU)
maturin develop --release --features cuda
# CPU only
maturin develop --release
Verify Installation
import atelico
print("Atelico SDK loaded")
Step 2: Create an Engine
from atelico import Engine
# Auto-detect the best GPU backend (Metal on Mac, CUDA on NVIDIA, CPU fallback)
engine = Engine()
# Or specify explicitly
engine = Engine(device="metal") # macOS Apple Silicon
engine = Engine(device="cuda") # NVIDIA GPU
engine = Engine(device="cpu") # CPU only
The engine supports Python's context manager:
with Engine() as engine:
    # ... use engine ...
    pass  # Automatically cleans up
Step 3: Load a Model
# Blocking — downloads from cache/HuggingFace if needed
engine.load_model("meta-llama/Llama-3.2-3B-Instruct-Q4_K_M")
# Check if loaded
assert engine.model_is_loaded("meta-llama/Llama-3.2-3B-Instruct-Q4_K_M")
The first call downloads the model to the local cache. Subsequent calls load directly from the cache without re-downloading.
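If your script may run repeatedly, you can skip the (potentially slow) load when the model is already resident. A minimal sketch using only the calls shown above:
MODEL = "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M"
# Only trigger a download/load when the model isn't already available
if not engine.model_is_loaded(MODEL):
    engine.load_model(MODEL)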
Step 4: Blocking Chat Request
import json
request = json.dumps({
    "model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [
        {"role": "system", "content": "You are a friendly tavern keeper named Boris. Keep responses under 2 sentences."},
        {"role": "user", "content": "What's on the menu today?"},
    ],
    "max_tokens": 100,
    "temperature": 0.7,
})
response_json = engine.llm_chat_completion(request)
response = json.loads(response_json)
print(response["choices"][0]["message"]["content"])
The call blocks until the full response is generated. The GIL is released during inference, so other Python threads can run.
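Because the GIL is released, one way to keep your main thread responsive is to run the blocking call on a worker thread. A rough sketch using the standard library and the request built above:
import threading

result = {}

def run_inference():
    # Runs in native code with the GIL released, so the main thread stays free
    result["response"] = json.loads(engine.llm_chat_completion(request))

worker = threading.Thread(target=run_inference)
worker.start()
# ... do other work on the main thread here ...
worker.join()
print(result["response"]["choices"][0]["message"]["content"])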
Step 5: Streaming
The streaming API returns a TokenStream iterator — use it in a regular for loop:
import json
request = json.dumps({
    "model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [
        {"role": "system", "content": "You are a narrator for a fantasy RPG."},
        {"role": "user", "content": "Describe the entrance to the dungeon."},
    ],
    "max_tokens": 200,
    "temperature": 0.8,
})
stream = engine.llm_chat_completion_stream(request)
full_text = ""
for chunk_json in stream:
    chunk = json.loads(chunk_json)
    delta = chunk["choices"][0]["delta"]
    if "content" in delta and delta["content"] is not None:
        full_text += delta["content"]
        print(delta["content"], end="", flush=True)
print() # newline after streaming
print(f"Full response: {full_text}")
Each iteration of the for loop blocks until the next token is available, then yields a JSON string in OpenAI ChatCompletionChunk format.
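If you only need the text, you may find it convenient to wrap the stream in a small generator that yields plain string deltas. A sketch (the iter_text helper is ours, not part of the SDK):
def iter_text(stream):
    # Yield only the text deltas from a TokenStream of ChatCompletionChunk JSON strings
    for chunk_json in stream:
        delta = json.loads(chunk_json)["choices"][0]["delta"]
        content = delta.get("content")
        if content:
            yield content

for piece in iter_text(engine.llm_chat_completion_stream(request)):
    print(piece, end="", flush=True)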
Step 6: Multi-Turn Conversation
Build up the messages array across turns:
import json
conversation = [
    {"role": "system", "content": "You are Greta, a grumpy blacksmith. You secretly care about the player but never admit it. Keep responses under 3 sentences."}
]

def chat(player_message: str) -> str:
    conversation.append({"role": "user", "content": player_message})
    request = json.dumps({
        "model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
        "messages": conversation,
        "max_tokens": 150,
        "temperature": 0.8,
    })
    response = json.loads(engine.llm_chat_completion(request))
    reply = response["choices"][0]["message"]["content"]
    # Store the reply for future context
    conversation.append({"role": "assistant", "content": reply})
    return reply

# Interactive loop
while True:
    user_input = input("You: ")
    if user_input.lower() in ("quit", "exit"):
        break
    print(f"Greta: {chat(user_input)}")
Step 7: Structured Generation
Force the model to output valid JSON matching a schema:
import json
request = json.dumps({
    "model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [
        {"role": "user", "content": "Generate a random fantasy weapon for a level 5 rogue"},
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "Weapon",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "type": {"type": "string", "enum": ["sword", "axe", "bow", "staff", "dagger"]},
                    "damage": {"type": "integer", "minimum": 1, "maximum": 100},
                    "rarity": {"type": "string", "enum": ["common", "uncommon", "rare", "legendary"]},
                    "description": {"type": "string"},
                },
                "required": ["name", "type", "damage", "rarity", "description"],
            },
            "strict": True,
        },
    },
})
response = json.loads(engine.llm_chat_completion(request))
weapon = json.loads(response["choices"][0]["message"]["content"])
print(f"Found a {weapon['rarity']} {weapon['type']}: {weapon['name']}")
print(f"Damage: {weapon['damage']}")
print(f"Description: {weapon['description']}")
The output is guaranteed to be valid JSON matching the schema.
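Because the shape is known in advance, you can parse the result into a typed object instead of passing raw dicts around. A sketch using a standard dataclass (the Weapon class here is ours, not part of the SDK):
from dataclasses import dataclass

@dataclass
class Weapon:
    name: str
    type: str
    damage: int
    rarity: str
    description: str

weapon = Weapon(**json.loads(response["choices"][0]["message"]["content"]))
print(f"{weapon.name}: {weapon.rarity} {weapon.type}, {weapon.damage} damage")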
Error Handling
The SDK provides typed exceptions:
from atelico import (
    Engine,
    AtelicoError,
    ModelLoadError,
    InferenceError,
    DeviceError,
    GuardrailBlockedError,
)

try:
    engine.load_model("nonexistent/model")
except ModelLoadError as e:
    print(f"Model not found: {e}")
except DeviceError as e:
    print(f"GPU error: {e}")
except AtelicoError as e:
    print(f"Engine error: {e}")
GPU Scheduling
Control GPU resource allocation:
engine.set_scheduling_mode("balance") # Default
engine.set_scheduling_mode("prioritize_compute") # Fast inference
engine.set_scheduling_mode("prioritize_graphics") # Smooth rendering
engine.set_vram_budget_mb(4096) # Cap VRAM usage
engine.set_target_tps(15) # Limit tokens/sec
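In a game, you would typically switch modes based on what the player is doing. A sketch of one possible policy using only the calls above (the phase names are illustrative):
def on_phase_change(phase: str):
    if phase == "loading_screen":
        # No frames to render, so let inference use the GPU freely
        engine.set_scheduling_mode("prioritize_compute")
    elif phase == "combat":
        # Frame rate matters most here; throttle token generation
        engine.set_scheduling_mode("prioritize_graphics")
        engine.set_target_tps(10)
    else:
        engine.set_scheduling_mode("balance")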
Next Steps
- Structured Generation — more examples with game-oriented schemas
- Python API Reference — full list of all classes and methods
- Chat Completions API — detailed API reference (same JSON format)