Version: 0.9

Structured Generation

Structured generation constrains the model's output to match a format you define. The output is guaranteed to be valid -- no retries, no hoping the model formats things correctly.

Supported formats:

  • JSON Schema -- structured objects with typed fields, enums, nested objects, and arrays
  • Choice -- pick exactly one from a list of options (e.g., emotions, actions, item types)
  • Regex -- match a pattern (e.g., dates, codes, identifiers)
  • Grammar -- match a Lark context-free grammar (e.g., custom DSLs, complex formats)

This is essential for game development where AI output needs to be consumed by code.

How It Works

Add a response_format field to your chat completion request with a JSON Schema:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [
      {"role": "user", "content": "Generate a random fantasy weapon"}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "Weapon",
        "schema": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "type": {"type": "string", "enum": ["sword", "axe", "bow", "staff", "dagger"]},
            "damage": {"type": "integer", "minimum": 1, "maximum": 100},
            "rarity": {"type": "string", "enum": ["common", "uncommon", "rare", "legendary"]},
            "description": {"type": "string"}
          },
          "required": ["name", "type", "damage", "rarity", "description"]
        },
        "strict": true
      }
    }
  }'

The content in the response is always valid JSON matching your schema. Parse it directly in your game code.

Schema Format

The response_format object has this structure:

{
  "type": "json_schema",
  "json_schema": {
    "name": "SchemaName",
    "schema": { ... },
    "strict": true,
    "schema_injection": "concise"
  }
}
| Field | Type | Default | Description |
|---|---|---|---|
| `type` | string | | Must be `"json_schema"` |
| `json_schema.name` | string | | A name for the schema (used internally) |
| `json_schema.schema` | object | | Standard JSON Schema object |
| `json_schema.strict` | boolean | `false` | Enforce strict adherence (recommended: `true`) |
| `json_schema.schema_injection` | string | `"concise"` | Controls how the schema is described in the prompt (see below) |

Schema Injection

When you use structured generation, the engine automatically describes the JSON schema in the system message so the model knows what to produce. This is critical for smaller models -- without it, the model may generate degenerate output (whitespace loops) because it doesn't know it should output JSON.

The schema_injection field controls the verbosity of this description:

| Level | Description | Token cost |
|---|---|---|
| `"none"` | No injection. Use when your prompt already contains JSON instructions. | 0 |
| `"light"` | Field names and types only. Compact one-liner. | Low |
| `"concise"` | Field names, types, and descriptions from the schema. Default. | Medium |
| `"full"` | Injects the complete JSON schema verbatim. | High |

You don't need to include "respond as JSON" in your prompts -- the engine handles this automatically. If you're migrating from a setup where you manually included JSON instructions in your prompts, you can either remove them (recommended) or set schema_injection to "none" to avoid redundancy.

OpenAI clients that don't send schema_injection get the default "concise" behavior automatically. The field is fully backward-compatible.
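As a sketch, a small helper that assembles a response_format payload with an explicit schema_injection level (the helper name build_response_format and the Mood schema are illustrative, not part of the API; here the prompt is assumed to already contain JSON instructions, so injection is disabled):

```python
import json

def build_response_format(name, schema, schema_injection="concise", strict=True):
    """Assemble a response_format payload with an explicit schema_injection level."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": name,
            "schema": schema,
            "strict": strict,
            "schema_injection": schema_injection,
        },
    }

# The system prompt already spells out the JSON format, so skip injection.
fmt = build_response_format(
    "Mood",
    {
        "type": "object",
        "properties": {"mood": {"type": "string"}},
        "required": ["mood"],
    },
    schema_injection="none",
)
print(json.dumps(fmt, indent=2))
```

The resulting dict can be passed directly as the `response_format` argument in the examples below.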

Example: NPC Dialogue with Emotion Tags

Generate NPC dialogue with emotion metadata for driving animations:

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    messages=[
        {"role": "system", "content": "You are Greta, a grumpy but kind-hearted blacksmith."},
        {"role": "user", "content": "I brought you flowers!"},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "NPCDialogue",
            "schema": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "emotion": {"type": "string", "enum": ["happy", "sad", "angry", "surprised", "neutral", "embarrassed"]},
                    "gesture": {"type": "string", "enum": ["wave", "nod", "shrug", "point", "cross_arms", "none"]},
                },
                "required": ["text", "emotion", "gesture"],
            },
            "strict": True,
        },
    },
)

dialogue = json.loads(response.choices[0].message.content)
print(f"[{dialogue['emotion']}] {dialogue['text']}")
# e.g. [embarrassed] Flowers?! I... well, put them over there, I suppose.

Your game reads emotion to set the character's facial expression and gesture to trigger an animation.

Example: Quest Generation

response = client.chat.completions.create(
    model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    messages=[
        {"role": "user", "content": "Create a side quest for a coastal fishing village troubled by sea creatures"}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "Quest",
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "giver_npc": {"type": "string"},
                    "description": {"type": "string"},
                    "objectives": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "description": {"type": "string"},
                                "type": {"type": "string", "enum": ["kill", "collect", "talk", "explore", "escort"]},
                                "target": {"type": "string"},
                                "count": {"type": "integer", "minimum": 1},
                            },
                            "required": ["description", "type", "target", "count"],
                        },
                    },
                    "reward_gold": {"type": "integer", "minimum": 0},
                    "reward_xp": {"type": "integer", "minimum": 0},
                },
                "required": ["title", "giver_npc", "description", "objectives", "reward_gold", "reward_xp"],
            },
            "strict": True,
        },
    },
)

quest = json.loads(response.choices[0].message.content)
print(f"Quest: {quest['title']} (from {quest['giver_npc']})")
for obj in quest["objectives"]:
    print(f"  [{obj['type']}] {obj['description']} ({obj['count']}x {obj['target']})")
print(f"  Reward: {quest['reward_gold']}g, {quest['reward_xp']} XP")

Choice Constraint

Force the model to pick exactly one option from a list. Useful for classification, branching dialogue, and any pick-from-a-set scenario.

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [
      {"role": "system", "content": "You are an NPC reacting to the player entering a dark cave."},
      {"role": "user", "content": "How do you feel?"}
    ],
    "response_format": {
      "type": "choice",
      "choices": ["happy", "sad", "angry", "scared", "neutral"]
    }
  }'

The response content will be exactly one of the strings in choices -- no quotes, no extras. Parse it directly as a string.

This is ideal for driving game state: emotion systems, dialogue branching, action selection, difficulty ratings, etc.
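Because the reply is guaranteed to be one of the listed strings, game code can dispatch on it with a plain lookup. A minimal sketch, assuming a hypothetical animation table (the EMOTION_ANIMATIONS names are illustrative, not part of any API):

```python
# Hypothetical mapping from a choice-constrained emotion to an animation clip.
EMOTION_ANIMATIONS = {
    "happy": "smile",
    "sad": "frown",
    "angry": "scowl",
    "scared": "tremble",
    "neutral": "idle",
}

def react(emotion: str) -> str:
    # The choice constraint guarantees `emotion` is one of the keys above,
    # so a direct lookup is safe -- no fallback branch needed.
    return EMOTION_ANIMATIONS[emotion]

print(react("scared"))  # -> tremble
```

In practice `emotion` would be `response.choices[0].message.content` from a request like the one above.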

Regex Constraint

Force the model to output text matching a regular expression. Useful for codes, dates, identifiers, and structured text that isn't JSON.

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [
      {"role": "user", "content": "Generate a fantasy item code."}
    ],
    "response_format": {
      "type": "regex",
      "pattern": "ITEM-[A-Z]{3}-[0-9]{3}"
    }
  }'

Example output: ITEM-LYS-042

The pattern uses standard regex syntax. The model's output is guaranteed to match the pattern exactly.
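Since the output is shaped by the same pattern you wrote, downstream code can parse it with capture groups instead of defensive string handling. A sketch (parse_item_code is an illustrative helper, not part of the API):

```python
import re

def parse_item_code(code: str) -> tuple[str, int]:
    """Split a generated item code into its letter block and numeric id."""
    m = re.fullmatch(r"ITEM-([A-Z]{3})-([0-9]{3})", code)
    if m is None:
        raise ValueError(f"unexpected item code: {code!r}")
    return m.group(1), int(m.group(2))

print(parse_item_code("ITEM-LYS-042"))  # -> ('LYS', 42)
```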

Grammar Constraint (Lark)

For complex output formats that go beyond regex but aren't JSON, you can specify a Lark context-free grammar.

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [
      {"role": "user", "content": "Greet the player."}
    ],
    "response_format": {
      "type": "grammar",
      "lark": "start: greeting \" \" name\ngreeting: \"hello\" | \"hi\" | \"hey\"\nname: /[A-Z][a-z]+/"
    }
  }'

Example output: hello Adventurer
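Unescaped, the single-line grammar in that request reads:

```lark
start: greeting " " name
greeting: "hello" | "hi" | "hey"
name: /[A-Z][a-z]+/
```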

This is the most flexible format -- you can define any structure that a context-free grammar can express. Use it for custom DSLs, formatted commands, or structured text that doesn't fit JSON.

Structured Generation with Streaming

Structured generation works with streaming too. The JSON is built token by token, and the final assembled output is guaranteed valid. Accumulate the streamed tokens, then parse the complete JSON when the stream finishes.
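The accumulate-then-parse step can be sketched as a small helper (assemble_stream is an illustrative name; the simulated deltas below stand in for the `chunk.choices[0].delta.content` values a real `stream=True` request would yield):

```python
import json

def assemble_stream(deltas):
    """Join streamed content deltas (skipping None) and parse the final JSON."""
    return json.loads("".join(d for d in deltas if d))

# In a real loop you would append chunk.choices[0].delta.content for each
# chunk from client.chat.completions.create(..., stream=True); simulated here:
streamed = ['{"name": "Emberfang', None, '", "damage": ', "42}"]
weapon = assemble_stream(streamed)
print(weapon["name"], weapon["damage"])  # -> Emberfang 42
```

Only the fully assembled string is guaranteed valid; intermediate prefixes are generally not parseable JSON.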

Tips

  • Keep schemas focused. Smaller schemas with fewer fields produce better results. Don't ask for 20 fields when you need 5.
  • Use enums for fields that should be one of a known set of values (emotions, item types, difficulty levels). This prevents the model from inventing invalid values.
  • System prompts still matter. The schema constrains the structure, but the system prompt influences the content quality. "Generate a balanced RPG encounter for level 8 players" produces better results than just "Generate an encounter."
  • Temperature matters less with structured generation since the schema already constrains the output, but lower temperatures (0.3-0.7) produce more consistent field values.
  • Add description fields to your schema properties. These are included in the auto-injected prompt hint (at "concise" level), helping the model understand what each field should contain.
  • Use "schema_injection": "light" on very small models (1B) with long prompts to save context tokens. Use "full" on larger models (7B+) with complex nested schemas for maximum precision.