Version: 0.8

Python: Getting Started

This guide walks you through setting up the Atelico Python SDK and using it for LLM inference, streaming, and structured generation.

What You'll Build

A Python script that loads a model, sends chat requests, streams token-by-token output, and generates structured JSON — the foundation for game tools, content pipelines, or scripted NPC behavior.

By the end, you'll understand:

  1. How to install the SDK and create an Engine
  2. How to send a blocking chat request
  3. How to stream tokens with a Python iterator
  4. How to maintain conversation history
  5. How to use structured generation for typed output

Prerequisites

  • Python 3.9 or later
  • The Atelico server bundle (atelico-asset-downloader)
  • A downloaded model:
./atelico-asset-downloader download meta-llama/Llama-3.2-3B-Instruct-Q4_K_M

Step 1: Install the SDK

From a Release Wheel

pip install atelico-0.6.2-cp39-abi3-macosx_11_0_arm64.whl # macOS Apple Silicon
pip install atelico-0.6.2-cp39-abi3-linux_x86_64.whl # Linux

From Source (Development)

Requires Rust and maturin:

pip install maturin

cd atelico-python

# macOS (Apple Silicon, Metal GPU)
maturin develop --release --features metal

# Linux/Windows (NVIDIA GPU)
maturin develop --release --features cuda

# CPU only
maturin develop --release

Verify Installation

import atelico
print("Atelico SDK loaded")

Step 2: Create an Engine

from atelico import Engine

# Auto-detect the best GPU backend (Metal on Mac, CUDA on NVIDIA, CPU fallback)
engine = Engine()

# Or specify explicitly
engine = Engine(device="metal") # macOS Apple Silicon
engine = Engine(device="cuda") # NVIDIA GPU
engine = Engine(device="cpu") # CPU only

The engine supports Python's context manager:

with Engine() as engine:
    # ... use engine ...
    pass  # Automatically cleans up
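
If you want to prefer a specific backend but degrade gracefully, you can catch DeviceError (covered under Error Handling below). This is a minimal sketch, assuming the constructor raises DeviceError when the requested backend is unavailable:

from atelico import Engine, DeviceError

def create_engine() -> Engine:
    # Prefer an explicit CUDA device; fall back to auto-detection if it fails.
    try:
        return Engine(device="cuda")
    except DeviceError:
        return Engine()  # auto-detect (Metal, CUDA, or CPU)

engine = create_engine()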

Step 3: Load a Model

# Blocking — downloads from cache/HuggingFace if needed
engine.load_model("meta-llama/Llama-3.2-3B-Instruct-Q4_K_M")

# Check if loaded
assert engine.model_is_loaded("meta-llama/Llama-3.2-3B-Instruct-Q4_K_M")

The first call downloads the model to the local cache. Subsequent calls load from cache instantly.
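
A guarded load keeps re-runs cheap and makes download failures explicit. A minimal sketch reusing model_is_loaded and the ModelLoadError exception from the Error Handling section below:

from atelico import ModelLoadError

MODEL = "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M"

if not engine.model_is_loaded(MODEL):
    try:
        engine.load_model(MODEL)  # downloads on first run, served from cache afterwards
    except ModelLoadError as e:
        raise SystemExit(f"Could not load {MODEL}: {e}")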

Step 4: Blocking Chat Request

import json

request = json.dumps({
    "model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [
        {"role": "system", "content": "You are a friendly tavern keeper named Boris. Keep responses under 2 sentences."},
        {"role": "user", "content": "What's on the menu today?"},
    ],
    "max_tokens": 100,
    "temperature": 0.7,
})

response_json = engine.llm_chat_completion(request)
response = json.loads(response_json)
print(response["choices"][0]["message"]["content"])

The call blocks until the full response is generated. The GIL is released during inference, so other Python threads can run.
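
Because the GIL is released, a worker thread can run inference while the main thread stays responsive. A minimal sketch using the standard library and the request built above:

import json
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

# Submit the blocking call to a worker thread; the GIL is released during inference.
future = executor.submit(engine.llm_chat_completion, request)

# ... do other work on the main thread (game loop, UI updates) ...

response = json.loads(future.result())  # blocks only when you actually need the result
print(response["choices"][0]["message"]["content"])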

Step 5: Streaming

The streaming API returns a TokenStream iterator — use it in a regular for loop:

import json

request = json.dumps({
    "model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [
        {"role": "system", "content": "You are a narrator for a fantasy RPG."},
        {"role": "user", "content": "Describe the entrance to the dungeon."},
    ],
    "max_tokens": 200,
    "temperature": 0.8,
})

stream = engine.llm_chat_completion_stream(request)

full_text = ""
for chunk_json in stream:
    chunk = json.loads(chunk_json)
    delta = chunk["choices"][0]["delta"]
    if "content" in delta and delta["content"] is not None:
        full_text += delta["content"]
        print(delta["content"], end="", flush=True)

print() # newline after streaming
print(f"Full response: {full_text}")

Each iteration of the for loop blocks until the next token is available, then yields a JSON string in OpenAI ChatCompletionChunk format.
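
If you only need the text, a small generator can hide the JSON plumbing. This sketch assumes chunks follow the OpenAI ChatCompletionChunk shape shown above:

import json
from typing import Iterator

def stream_text(engine, request: str) -> Iterator[str]:
    """Yield only the content deltas from a streaming chat completion."""
    for chunk_json in engine.llm_chat_completion_stream(request):
        delta = json.loads(chunk_json)["choices"][0]["delta"]
        content = delta.get("content")
        if content:
            yield content

for piece in stream_text(engine, request):
    print(piece, end="", flush=True)
print()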

Step 6: Multi-Turn Conversation

Build up the messages array across turns:

import json

conversation = [
    {"role": "system", "content": "You are Greta, a grumpy blacksmith. You secretly care about the player but never admit it. Keep responses under 3 sentences."}
]

def chat(player_message: str) -> str:
    conversation.append({"role": "user", "content": player_message})

    request = json.dumps({
        "model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
        "messages": conversation,
        "max_tokens": 150,
        "temperature": 0.8,
    })

    response = json.loads(engine.llm_chat_completion(request))
    reply = response["choices"][0]["message"]["content"]

    # Store the reply for future context
    conversation.append({"role": "assistant", "content": reply})
    return reply

# Interactive loop
while True:
    user_input = input("You: ")
    if user_input.lower() in ("quit", "exit"):
        break
    print(f"Greta: {chat(user_input)}")

Step 7: Structured Generation

Force the model to output valid JSON matching a schema:

import json

request = json.dumps({
    "model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [
        {"role": "user", "content": "Generate a random fantasy weapon for a level 5 rogue"},
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "Weapon",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "type": {"type": "string", "enum": ["sword", "axe", "bow", "staff", "dagger"]},
                    "damage": {"type": "integer", "minimum": 1, "maximum": 100},
                    "rarity": {"type": "string", "enum": ["common", "uncommon", "rare", "legendary"]},
                    "description": {"type": "string"},
                },
                "required": ["name", "type", "damage", "rarity", "description"],
            },
            "strict": True,
        },
    },
})

response = json.loads(engine.llm_chat_completion(request))
weapon = json.loads(response["choices"][0]["message"]["content"])

print(f"Found a {weapon['rarity']} {weapon['type']}: {weapon['name']}")
print(f"Damage: {weapon['damage']}")
print(f"Description: {weapon['description']}")

The output is guaranteed to be valid JSON matching the schema.
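
Because the content conforms to the schema, it maps cleanly onto a typed structure. A minimal sketch with a dataclass mirroring the schema above:

from dataclasses import dataclass

@dataclass
class Weapon:
    name: str
    type: str
    damage: int
    rarity: str
    description: str

weapon_obj = Weapon(**weapon)  # `weapon` is the dict parsed above
assert weapon_obj.rarity in ("common", "uncommon", "rare", "legendary")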

Error Handling

The SDK provides typed exceptions:

from atelico import (
    Engine,
    AtelicoError,
    ModelLoadError,
    InferenceError,
    DeviceError,
    GuardrailBlockedError,
)

try:
    engine.load_model("nonexistent/model")
except ModelLoadError as e:
    print(f"Model not found: {e}")
except DeviceError as e:
    print(f"GPU error: {e}")
except AtelicoError as e:
    print(f"Engine error: {e}")

GPU Scheduling

Control GPU resource allocation:

engine.set_scheduling_mode("balance") # Default
engine.set_scheduling_mode("prioritize_compute") # Fast inference
engine.set_scheduling_mode("prioritize_graphics") # Smooth rendering

engine.set_vram_budget_mb(4096) # Cap VRAM usage
engine.set_target_tps(15) # Limit tokens/sec
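
In a game you might switch modes by context, e.g. favor inference during a loading screen and rendering during play. A sketch using only the calls above:

# During a loading screen: let the model use more of the GPU.
engine.set_scheduling_mode("prioritize_compute")
engine.load_model("meta-llama/Llama-3.2-3B-Instruct-Q4_K_M")

# Back in gameplay: protect the frame rate and cap generation speed.
engine.set_scheduling_mode("prioritize_graphics")
engine.set_target_tps(15)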

Audio: TTS & STT

The Python SDK exposes two blocking methods (audio_synthesize, audio_transcribe) and one streaming iterator (audio_synthesize_stream). Audio bytes cross the binding as base64-encoded WAV files.

import base64, json

# Blocking TTS — Kokoro by default, Pocket TTS via "in-memory::pocket-tts"
resp = json.loads(engine.audio_synthesize(json.dumps({
    "model": "in-memory::tts",
    "input": "Hello from Atelico.",
    "voice": "af_heart",
})))
open("hello.wav", "wb").write(base64.b64decode(resp["audio_b64"]))

# Streaming TTS — yields one AudioSpeechChunk per sentence
stream = engine.audio_synthesize_stream(json.dumps({
    "model": "in-memory::pocket-tts",
    "input": "First sentence. Second one comes right after.",
    "voice": "alba",
}))
for chunk_json in stream:
    chunk = json.loads(chunk_json)
    wav = base64.b64decode(chunk["audio"])
    # play `wav` immediately while subsequent sentences synthesize

# Blocking STT — encode a WAV file as base64
wav_b64 = base64.b64encode(open("speech.wav", "rb").read()).decode()
result = json.loads(engine.audio_transcribe(json.dumps({
    "model": "in-memory::whisper",
    "audio_b64": wav_b64,
})))
print(result["text"])

Whisper variant ids: whisper (default → whisper-base.en), whisper-tiny[.en], whisper-base[.en], whisper-small[.en], whisper-medium[.en], whisper-large-v3[-turbo], distil-large-v3. TTS ids: tts (default → kokoro-82m), kokoro, kokoro-82m, pocket, pocket-tts.
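
A variant id slots into the same in-memory:: model string, for example (assuming the prefix combines with every variant id the same way as the default):

result = json.loads(engine.audio_transcribe(json.dumps({
    "model": "in-memory::whisper-small.en",  # assumed: "in-memory::" prefix + variant id
    "audio_b64": wav_b64,
})))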

For the full feature matrix (24 Pocket TTS voices, instant voice cloning, Kokoro Q8/Q4 quantization, BF16/F16 dtypes, language overrides, segment/word timestamps), see the Audio guide.

Next Steps