Version: 0.7

Python API Reference

Functions

set_log_level(level: str)

Set the engine log level.

Currently a no-op at runtime. Use the RUST_LOG environment variable before starting the engine (e.g. RUST_LOG=debug).

  • level: One of "error", "warn", "info", "debug", "trace".

Returns: None

import os
os.environ["RUST_LOG"] = "debug" # set before creating Engine
set_log_level("debug") # no effect at runtime

Backend

Helper for constructing backend configuration JSON strings.

Each method returns a JSON string that can be passed to Engine configuration. Two backend types are supported:

  • in_memory: Local on-device inference (Metal/CUDA/CPU).
  • proxy: Forward requests to a remote OpenAI-compatible API.
local = Backend.in_memory(name="local")
remote = Backend.proxy(name="openai", api_key="sk-...")

Methods

@staticmethod proxy(name="openai", base_url="https://api.openai.com/v1", api_key=None) -> str

Return a JSON configuration string for a remote proxy backend.

The proxy backend forwards all requests to an OpenAI-compatible HTTP API.

  • name: Logical name for this backend (default "openai").
  • base_url: Base URL of the remote API (default "https://api.openai.com/v1").
  • api_key: API key for authentication. Pass None if the remote API does not require one.

Returns: A JSON string with the backend configuration.

{
  "name": "openai",
  "backend_type": "proxy",
  "base_url": "https://api.openai.com/v1",
  "api_key": "sk-..."
}
cfg = Backend.proxy(name="openai", api_key="sk-...")

TokenStream

Iterator that yields streaming LLM chat completion chunks as JSON strings.

Implements Python's iterator protocol (__iter__ / __next__). Each call to __next__ blocks (releasing the GIL) until the next token chunk arrives from the inference engine.

Each yielded string is a JSON object following the OpenAI ChatCompletionChunk schema:

{
  "id": "chatcmpl-...",
  "object": "chat.completion.chunk",
  "model": "...",
  "choices": [{
    "index": 0,
    "delta": {"role": "assistant", "content": "token"},
    "finish_reason": null
  }]
}

The last chunk has finish_reason set to "stop" (or "length"), and after that __next__ raises StopIteration.

import json

stream = engine.llm_chat_completion_stream(request_json)
for chunk_json in stream:
    chunk = json.loads(chunk_json)
    token = chunk["choices"][0]["delta"].get("content", "")
    print(token, end="", flush=True)
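
If you need the stop reason, it can be read from the last chunk; a minimal sketch over a fresh stream, following the chunk schema above:

import json

parts = []
finish_reason = None
for chunk_json in engine.llm_chat_completion_stream(request_json):
    choice = json.loads(chunk_json)["choices"][0]
    # content may be absent in some chunks
    parts.append(choice["delta"].get("content") or "")
    finish_reason = choice["finish_reason"]

full_text = "".join(parts)
if finish_reason == "length":
    print("Output was truncated by max_tokens.")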

Engine

The main Atelico AI Engine.

Thread-safe. Supports context manager protocol (with statement).

engine = Engine(device="auto")
engine.load_model("meta-llama/Llama-3.2-1B-Instruct-GGUF")
response = engine.llm_chat_completion(request_json)
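
Because the engine supports the context manager protocol, cleanup can be left to a with block; a minimal sketch reusing the model ID from the example above:

import json

# The engine is closed automatically on exit, equivalent to calling close().
with Engine(device="auto") as engine:
    engine.load_model("meta-llama/Llama-3.2-1B-Instruct-GGUF")
    request = json.dumps({
        "model": "meta-llama/Llama-3.2-1B-Instruct-GGUF",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32,
    })
    response = json.loads(engine.llm_chat_completion(request))
    print(response["choices"][0]["message"]["content"])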

Methods

close()

Shut down the engine and release all resources.

Unloads every model, frees GPU memory, and stops background threads. After calling this, the engine instance must not be used again. Equivalent to exiting the context manager.

Returns: None

engine = Engine()
engine.close()

load_model(model_id: str)

Load a model into memory (blocking).

Downloads model weights from HuggingFace Hub if not already cached, then loads them onto the configured device. Blocks until fully loaded.

  • model_id: HuggingFace model ID (e.g. "meta-llama/Llama-3.2-1B-Instruct-GGUF").

Returns: None

Raises: ModelLoadError: If the model cannot be found or loaded.

engine.load_model("meta-llama/Llama-3.2-1B-Instruct-GGUF")

unload_model(model_id: str)

Unload a model and free its resources (GPU memory, caches).

  • model_id: The model ID that was previously passed to load_model.

Returns: None

Raises: ModelLoadError: If the model is not currently loaded.

engine.unload_model("meta-llama/Llama-3.2-1B-Instruct-GGUF")

llm_chat_completion(request_json: str) -> str

Run a chat completion request (blocking, non-streaming).

  • request_json: A JSON string following the OpenAI ChatCompletionRequest schema.

Expected request JSON structure:

{
  "model": "meta-llama/Llama-3.2-1B-Instruct-GGUF",
  "messages": [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 128,
  "temperature": 0.7,
  "top_p": 0.9,
  "response_format": null
}

Returns: A JSON string with the ChatCompletionResponse.

{
  "id": "chatcmpl-...",
  "choices": [{
    "message": {"role": "assistant", "content": "Hi!"},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 5, "completion_tokens": 3, "total_tokens": 8}
}

Raises: ValueError: If request_json is not valid JSON or is missing required fields. InferenceError: If token generation fails.

import json
request = json.dumps({
    "model": "meta-llama/Llama-3.2-1B-Instruct-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
})
response = json.loads(engine.llm_chat_completion(request))
print(response["choices"][0]["message"]["content"])

llm_text_completion(request_json: str) -> str

Run a text completion request (blocking, non-streaming).

Unlike chat completion, text completion continues a raw prompt without chat template formatting.

  • request_json: A JSON string with TextCompletionRequest fields.

Expected request JSON structure:

{
  "model": "meta-llama/Llama-3.2-1B-Instruct-GGUF",
  "prompt": "Once upon a time",
  "max_tokens": 50,
  "temperature": 0.7
}

Returns: A JSON string with the TextCompletionResponse.

{
  "id": "cmpl-...",
  "object": "text_completion",
  "model": "...",
  "choices": [{"text": " there was a...", "index": 0, "finish_reason": "stop"}],
  "usage": {"prompt_tokens": 4, "completion_tokens": 50, "total_tokens": 54}
}

Raises: ValueError: If request_json is invalid. InferenceError: If token generation fails.

import json
request = json.dumps({
    "model": "meta-llama/Llama-3.2-1B-Instruct-GGUF",
    "prompt": "Once upon a time",
    "max_tokens": 50,
})
response = json.loads(engine.llm_text_completion(request))
print(response["choices"][0]["text"])

llm_respond(request_json: str) -> str

Run a Responses API request (blocking).

The Responses API is a higher-level conversational interface that manages conversation state internally.

  • request_json: A JSON string with ResponseRequest fields.

Expected request JSON structure:

{
  "model": "meta-llama/Llama-3.2-1B-Instruct-GGUF",
  "input": "What is 2+2?",
  "instructions": "You are a math tutor.",
  "max_output_tokens": 100,
  "temperature": 0.7
}

Returns: A JSON string with the ResponseResponse.

{
  "id": "resp-...",
  "object": "response",
  "output": [{"type": "message", "content": [{"type": "output_text", "text": "4"}]}],
  "usage": {"input_tokens": 5, "output_tokens": 1}
}

Raises: ValueError: If request_json is invalid. InferenceError: If generation fails.

import json
request = json.dumps({
    "model": "meta-llama/Llama-3.2-1B-Instruct-GGUF",
    "input": "What is 2+2?",
})
response = json.loads(engine.llm_respond(request))
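
Following the documented response shape, the generated text sits inside the first output message; a short usage note:

# "output" holds message items whose "content" parts carry the generated text.
text = response["output"][0]["content"][0]["text"]
print(text)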

image_generate(request_json: str) -> str

Generate an image from a text prompt (blocking).

  • request_json: A JSON string with ImageGenerationRequest fields.

Expected request JSON structure:

{
  "model": "PixArt-alpha/PixArt-Sigma-XL-2-512-MS",
  "prompt": "A sunset over mountains",
  "n": 1,
  "size": "512x512",
  "response_format": "b64_json"
}

Returns: A JSON string with the ImageGenerationResponse.

{
  "created": 1234567890,
  "data": [{"b64_json": "iVBORw0KGgo...", "revised_prompt": "A sunset over mountains"}]
}

Raises: ValueError: If request_json is invalid. InferenceError: If image generation fails.

import json
request = json.dumps({
    "model": "PixArt-alpha/PixArt-Sigma-XL-2-512-MS",
    "prompt": "A sunset over mountains",
    "size": "512x512",
})
response = json.loads(engine.image_generate(request))
b64_image = response["data"][0]["b64_json"]
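
Since "b64_json" carries base64-encoded image bytes (the documented payload starts with a PNG header), the result can be written to disk with the standard library; the filename here is just an illustration:

import base64

# Decode the base64 payload and save it as a PNG file.
with open("sunset.png", "wb") as f:
    f.write(base64.b64decode(b64_image))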

image_remove_background(request_json: str) -> str

Remove the background from an image (blocking).

  • request_json: A JSON string with BackgroundRemovalRequest fields.

Expected request JSON structure:

{
  "model": "briaai/RMBG-1.4",
  "image": "iVBORw0KGgo..."
}

Returns: A JSON string with the processed image.

{
  "data": [{"b64_json": "iVBORw0KGgo..."}]
}

Raises: ValueError: If request_json is invalid. InferenceError: If background removal fails.

import json
request = json.dumps({"model": "briaai/RMBG-1.4", "image": b64_image_str})
response = json.loads(engine.image_remove_background(request))
result_image = response["data"][0]["b64_json"]
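
The "image" request field takes base64-encoded image bytes; a minimal sketch of preparing the b64_image_str used above from a local file (the path is a placeholder):

import base64

# Base64-encode the raw image bytes for the "image" request field.
with open("input.png", "rb") as f:
    b64_image_str = base64.b64encode(f.read()).decode("ascii")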

embed(request_json: str) -> str

Generate text embeddings (blocking).

  • request_json: A JSON string with EmbeddingRequest fields.

Expected request JSON structure:

{
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "input": ["Hello world", "Goodbye world"]
}

Returns: A JSON string with the EmbeddingResponse.

{
  "object": "list",
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [0.1, -0.2, ...]},
    {"object": "embedding", "index": 1, "embedding": [0.3, 0.4, ...]}
  ],
  "usage": {"prompt_tokens": 8, "total_tokens": 8}
}

Raises: ValueError: If request_json is invalid. InferenceError: If embedding generation fails.

import json
request = json.dumps({
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "input": ["Hello world", "Goodbye world"],
})
response = json.loads(engine.embed(request))
vectors = [d["embedding"] for d in response["data"]]
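
As a usage note, the returned vectors can be compared directly, for example with cosine similarity; a plain-Python sketch with no extra dependencies:

import math

# Cosine similarity between the two embedding vectors returned above.
a, b = vectors
dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
print(dot / norm)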

guardrail_check_input(text: str) -> str

Check user input text against safety guardrails.

  • text: The user-provided input text to validate.

Returns: A JSON string with the SafetyVerdict.

When allowed:

{"action": "Allow", "checker_name": "keyword", "score": null}

When blocked:

{"action": {"Block": {"reason": "profanity detected"}}, "checker_name": "keyword", "score": 0.95}

Fields:

  • action: "Allow", or {"Block": {"reason": "..."}}, or {"Rewrite": {"original": "...", "rewritten": "...", "reason": "..."}}.
  • checker_name: which guardrail checker produced the verdict.
  • score: confidence score (0.0-1.0), or null.

Raises: AtelicoError: If guardrails are not configured.

import json
verdict = json.loads(engine.guardrail_check_input("Tell me a joke"))
if verdict["action"] != "Allow":
    print(f"Blocked by {verdict['checker_name']}")

guardrail_check_output(text: str) -> str

Check model output text against safety guardrails.

  • text: The model-generated output to validate before displaying.

Returns: A JSON string with the SafetyVerdict (same schema as guardrail_check_input).

Raises: AtelicoError: If guardrails are not configured.

import json
verdict = json.loads(engine.guardrail_check_output("Here is a helpful answer."))
if verdict["action"] != "Allow":
    print(f"Blocked by {verdict['checker_name']}")

lora_load(model_id: str, adapter_path: str)

Load a LoRA adapter onto an already-loaded model.

  • model_id: The base model ID to attach the adapter to.
  • adapter_path: Filesystem path to the adapter directory (must contain adapter_config.json and weight files).

Returns: None

Raises: ModelLoadError: If the model is not loaded or the adapter path is invalid.

engine.lora_load("meta-llama/Llama-3.2-1B-Instruct-GGUF", "/adapters/my-lora")

lora_unload(model_id: str)

Unload a LoRA adapter from a model, reverting to base model weights.

  • model_id: The model whose adapter should be removed.

Returns: None

Raises: ModelLoadError: If the model is not loaded or has no adapter.

engine.lora_unload("meta-llama/Llama-3.2-1B-Instruct-GGUF")

lora_set_scale(model_id: str, scale: float)

Set the LoRA runtime scale factor for a model's loaded adapter.

A scale of 1.0 applies the full adapter effect; 0.0 effectively disables it without unloading.

  • model_id: The model with a loaded LoRA adapter.
  • scale: Scale factor (typically 0.0 to 1.0).

Returns: None

Raises: ModelLoadError: If the model is not loaded or has no adapter.

engine.lora_set_scale("meta-llama/Llama-3.2-1B-Instruct-GGUF", 0.5)

guardrail_check_image_prompt(text: str) -> str

Check an image generation prompt against safety guardrails.

  • text: The image generation prompt to validate.

Returns: A JSON string with the SafetyVerdict (same schema as guardrail_check_input).

Raises: AtelicoError: If guardrails are not configured.

import json
verdict = json.loads(engine.guardrail_check_image_prompt("A sunset over mountains"))
if verdict["action"] != "Allow":
    print(f"Blocked by {verdict['checker_name']}")

model_list() -> str

List all loaded models.

Returns: A JSON array of ModelInfo objects.

[
  {"id": "meta-llama/Llama-3.2-1B-Instruct-GGUF", "object": "model", "owned_by": "meta-llama"}
]

import json
models = json.loads(engine.model_list())
for m in models:
    print(m["id"])

model_is_loaded(model_id: str) -> bool

Check if a model is loaded and ready for inference.

  • model_id: The model ID to check.

Returns: True if the model is loaded and ready, False otherwise.

is_ready = engine.model_is_loaded("meta-llama/Llama-3.2-1B-Instruct-GGUF")
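
A small usage sketch combining this check with load_model to load on demand:

model_id = "meta-llama/Llama-3.2-1B-Instruct-GGUF"
# Load the model only if it is not already resident.
if not engine.model_is_loaded(model_id):
    engine.load_model(model_id)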

kvstore_insert(store_id: str, entries_json: str)

Insert entries into a KV store.

  • store_id: The store to insert into.
  • entries_json: A JSON array of KvEntry objects.

Expected entries JSON structure:

[
  {"key": "greeting", "value": "Hello!", "embedding": [0.1, 0.2, 0.3]}
]

Returns: None

Raises: ValueError: If entries_json is invalid. StoreError: If the store is not found or insertion fails.

import json
entries = json.dumps([
    {"key": "greeting", "value": "Hello!", "embedding": [0.1, 0.2, 0.3]},
])
engine.kvstore_insert("lore", entries)

kvstore_query(store_id: str, query_json: str) -> str

Query a KV store using vector similarity search.

  • store_id: The store to query.
  • query_json: A JSON string with KvQuery fields.

Expected query JSON structure:

{
  "query_embedding": [0.1, 0.2, 0.3],
  "query_text": "optional text filter",
  "limit": 5,
  "vector_search_limit": 20,
  "use_prefilter": true
}

Returns: A JSON array of result objects.

[
  {
    "id": "...",
    "key_text": "greeting",
    "similarity": 0.95,
    "priority": 1.0,
    "combined_score": 0.975
  }
]

Raises: ValueError: If query_json is invalid. StoreError: If the store is not found or the query fails.

import json
query = json.dumps({"query_embedding": embedding_vec, "limit": 5})
results = json.loads(engine.kvstore_query("lore", query))
for r in results:
    print(r["key_text"], r["similarity"])
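
The example above assumes an embedding_vec; a hedged sketch of producing it with embed(), assuming the store entries were embedded with the same model:

import json

# Embed the query text with the same model used for the stored entries.
embed_request = json.dumps({
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "input": ["friendly greeting"],
})
embedding_vec = json.loads(engine.embed(embed_request))["data"][0]["embedding"]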

kvstore_delete(store_id: str)

Delete a KV store and remove its database files.

  • store_id: The store to delete.

Returns: None

Raises: StoreError: If the store is not found.

engine.kvstore_delete("lore")

set_scheduling_mode(mode: str)

Set the GPU scheduling mode at runtime.

Controls how GPU time is shared between inference and rendering when running alongside a game engine.

  • mode: One of "balance" (default), "prioritize_compute" (maximize inference speed), or "prioritize_graphics" (minimize rendering impact).

Returns: None

engine.set_scheduling_mode("prioritize_graphics")

set_vram_budget_mb(mb: int)

Set the VRAM budget in megabytes at runtime.

  • mb: Maximum VRAM to use in megabytes. 0 means unlimited.

Returns: None

engine.set_vram_budget_mb(4096) # 4 GB budget

set_target_tps(tps: int)

Set the target tokens per second for inference throttling.

When set to a non-zero value, the engine will pace token generation to leave GPU headroom for rendering.

  • tps: Target tokens per second. 0 means unlimited (default).

Returns: None

engine.set_target_tps(30) # throttle to ~30 tok/s

@staticmethod is_cig_d3d12_supported(device_index=0) -> bool

Check whether D3D12 Compute-in-Graphics (CiG) is supported on a GPU.

CiG allows sharing GPU scheduling context with a D3D12 renderer, avoiding OS-level context switching. Requires NVIDIA R570+ driver, CUDA 12.8+, and Ada Lovelace+ GPU.

  • device_index: CUDA device index (default 0).

Returns: True if D3D12 CiG is supported, False otherwise. Always returns False when built without CUDA.

supported = Engine.is_cig_d3d12_supported(0)

@staticmethod is_cig_vulkan_supported(device_index=0) -> bool

Check whether Vulkan Compute-in-Graphics (CiG) is supported on a GPU.

CiG allows sharing GPU scheduling context with a Vulkan renderer. Requires NVIDIA R570+ driver, CUDA 12.9+, and Ada Lovelace+ GPU.

  • device_index: CUDA device index (default 0).

Returns: True if Vulkan CiG is supported, False otherwise. Always returns False when built without CUDA.

supported = Engine.is_cig_vulkan_supported(0)
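
A usage sketch for running alongside a renderer, tying the CiG checks to the runtime scheduling knobs; the throttle value is illustrative only, not a recommendation from the engine:

engine = Engine(device="auto")

# Minimize rendering impact while the game is on screen.
engine.set_scheduling_mode("prioritize_graphics")

# Without CiG the engine cannot share the renderer's GPU scheduling context,
# so throttle generation to leave rendering headroom (value chosen arbitrarily).
if not (Engine.is_cig_d3d12_supported(0) or Engine.is_cig_vulkan_supported(0)):
    engine.set_target_tps(30)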

AnnIndex

Approximate Nearest Neighbor index for vector search.

Backed by an HNSW (Hierarchical Navigable Small World) graph. This is a pure data structure -- no GPU or models required.

index = AnnIndex(dim=384, max_elements=1000)
index.insert([0.1, 0.2, ...], label_id=42)
index.build()
results = index.search([0.1, 0.2, ...], k=5)

Methods

dim() -> int

Return the vector dimensionality of this index.

Returns: The dimensionality (int) passed to the constructor.

index = AnnIndex(dim=384)
assert index.dim() == 384

insert(vector: list[float], label_id: int)

Insert a vector with an associated label ID.

Call build() after all insertions are complete before searching.

  • vector: A list of floats with length matching dim.
  • label_id: An integer label associated with this vector (used to identify results returned by search).

Returns: None

index.insert([0.1, 0.2, 0.3], label_id=42)

build()

Build the HNSW index graph. Must be called after all insertions and before search.

Returns: None

index.insert([0.1, 0.2, 0.3], label_id=1)
index.build()
# the index is now ready for search

search(query: list[float], k: int) -> list[tuple[int, float]]

Search for the k nearest neighbors of a query vector.

  • query: Query vector with length matching dim.
  • k: Number of nearest neighbors to return.

Returns: A list of (label_id, distance) tuples sorted by ascending cosine distance (lower = more similar).

results = index.search([0.1, 0.2, 0.3], k=5)
for label_id, distance in results:
    print(f"Label {label_id}: distance={distance:.4f}")

save(path: str)

Save the index to a file on disk.

  • path: Filesystem path to write the index to.

Returns: None

Raises: AtelicoError: If the file cannot be written.

index.save("/data/my_index.bin")

@staticmethod load(path: str) -> Self

Load a previously saved index from disk.

  • path: Filesystem path to read the index from.

Returns: A new AnnIndex instance with the loaded data.

Raises: AtelicoError: If the file cannot be read or is corrupted.

index = AnnIndex.load("/data/my_index.bin")
results = index.search([0.1, 0.2, 0.3], k=5)