Version: 0.9

LM Primitives

Low-level access to a loaded model's tokenizer and static metadata. Useful for budgeting prompts against the context window, pre-tokenizing prefixes, building custom samplers, or inspecting which device a model is running on.

These primitives share the same in-process model thread used for inference, so the model is loaded on demand and reused across calls — no separate load step is required.
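
As a consequence of the on-demand loading described above, a primitive call should work without an explicit load_model step; the first call loads the model and later calls reuse it. A small sketch (the model string is the one used in the Usage section below):

import atelico

engine = atelico.Engine()

# No explicit load_model call: the first primitive call loads the model
# on demand; subsequent calls reuse the same in-process model thread.
tokens = engine.tokenize(
    "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "Hello there",
)
print(len(tokens))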

note

Logits and hidden activations are intentionally not exposed in this release. They are model-specific, large, and have a high compatibility risk across model architectures and quantization formats.

API

| Method | Returns | Purpose |
| --- | --- | --- |
| tokenize(model, text) | list[int] | Encode text using the model's tokenizer |
| detokenize(model, tokens, skip_special_tokens?) | str | Decode a token sequence back to text |
| model_capabilities(model) | ModelCapabilities (as a JSON string) | Inspect vocab size, context window, dtype, device |

ModelCapabilities fields:

| Field | Type | Description |
| --- | --- | --- |
| model_id | string | The model identifier as loaded |
| vocab_size | int | Tokenizer vocabulary size |
| max_position_embeddings | int | Context window in tokens |
| dtype | string | Active weight dtype (e.g. "BF16", "F16", "Q4_K_M") |
| backend_kind | string | Active device (e.g. "metal", "cuda", "cpu") |
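
Because the Usage example below decodes the result with json.loads, the Python bindings return this struct serialized as JSON. Here is a minimal typed wrapper, assuming that JSON shape; the ModelCapabilities TypedDict and get_capabilities helper are illustrative, not part of the atelico API:

import json
from typing import TypedDict

class ModelCapabilities(TypedDict):
    model_id: str                 # the model identifier as loaded
    vocab_size: int               # tokenizer vocabulary size
    max_position_embeddings: int  # context window in tokens
    dtype: str                    # active weight dtype, e.g. "BF16"
    backend_kind: str             # active device, e.g. "metal"

def get_capabilities(engine, model: str) -> ModelCapabilities:
    # model_capabilities returns a JSON string (see Usage below).
    return json.loads(engine.model_capabilities(model))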

Usage

import json
import atelico

engine = atelico.Engine()

model = "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M"
engine.load_model(model)

# Inspect the model
caps = json.loads(engine.model_capabilities(model))
print(f"context window = {caps['max_position_embeddings']} tokens, "
      f"running as {caps['dtype']} on {caps['backend_kind']}")

# Budget a prompt against the context window
prompt = "You are a tavern keeper named Boris..."
tokens = engine.tokenize(model, prompt)
budget = caps["max_position_embeddings"] - len(tokens)
print(f"prompt = {len(tokens)} tokens, {budget} tokens remaining for the response")

# Round-trip
text = engine.detokenize(model, tokens, skip_special_tokens=True)
assert text.strip() == prompt

When to reach for these

  • Prompt budgeting — measure a prompt's token cost before sending it; surface a "context full" warning to the user instead of silent truncation (see the first sketch after this list).
  • Prefix design — pair with the Prefix Cache so the cached prefix matches your tokenization exactly.
  • Custom samplers / tooling — drive your own decoding loop or build editor-style token-by-token introspection.
  • Telemetry — log dtype + backend_kind alongside latency so a slow request can be diagnosed (CPU fallback, wrong quantization, etc.); see the second sketch below.
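
A hedged sketch of the prompt-budgeting pattern from the first bullet; check_budget and its reserve_for_response parameter are illustrative names, not part of the atelico API:

import json

def check_budget(engine, model: str, prompt: str,
                 reserve_for_response: int = 512) -> int:
    """Return the remaining token budget, raising instead of silently truncating."""
    caps = json.loads(engine.model_capabilities(model))
    used = len(engine.tokenize(model, prompt))
    remaining = caps["max_position_embeddings"] - used
    if remaining < reserve_for_response:
        # Surface a "context full" warning rather than truncating silently.
        raise ValueError(
            f"prompt uses {used} tokens; only {remaining} remain, "
            f"less than the {reserve_for_response} reserved for the response"
        )
    return remaining

And one for the telemetry bullet. Note that engine.generate here is an assumed inference entry point not documented on this page; substitute whatever call your application actually uses:

import json
import logging
import time

def timed_generate(engine, model: str, prompt: str):
    caps = json.loads(engine.model_capabilities(model))
    start = time.perf_counter()
    result = engine.generate(model, prompt)  # hypothetical inference call
    latency_ms = (time.perf_counter() - start) * 1000
    # Logging dtype + backend_kind next to latency makes a CPU fallback or
    # an unexpected quantization easy to spot when a request is slow.
    logging.info(
        "model=%s dtype=%s backend=%s latency_ms=%.1f",
        caps["model_id"], caps["dtype"], caps["backend_kind"], latency_ms,
    )
    return result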