
Chat Completions API

The chat completions endpoint generates a model response for a conversation. It's OpenAI-compatible, so if you've used the OpenAI API before, this endpoint works the same way.

POST /v1/chat/completions

Basic Request

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 8,
    "total_tokens": 20
  }
}
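
If you're calling the endpoint directly rather than through an SDK, the reply is plain JSON. Below is a minimal sketch using Python's requests library (an assumption; any HTTP client works) that sends the same request and reads the generated text and token counts out of the response:

import requests

# Same request as the curl example above.
resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
    },
)
resp.raise_for_status()
data = resp.json()

# The generated text lives at choices[0].message.content.
print(data["choices"][0]["message"]["content"])
# Token accounting is reported under usage.
print(data["usage"]["total_tokens"])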

Request Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | required | Model identifier (e.g., in-memory::meta-llama/Llama-3.2-3B-Instruct) |
| messages | array | required | Conversation messages (see below) |
| stream | boolean | false | Enable token-by-token streaming |
| temperature | float | 1.0 | Sampling temperature. Lower = more deterministic, higher = more creative |
| max_tokens | integer | model default | Maximum tokens to generate |
| response_format | object | null | Constrain output format (see Structured Generation) |
| enable_thinking | boolean | null | Enable extended thinking for models that support it (e.g., Qwen 3.5) |
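
As a sketch of how these map onto the OpenAI Python SDK (client setup is the same as in the System Prompts example below), each parameter becomes a keyword argument; the values here are illustrative, not recommendations:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    messages=[{"role": "user", "content": "Summarize the rules of chess."}],
    temperature=0.3,  # lower = more deterministic
    max_tokens=200,   # cap the length of the reply
    stream=False,     # the default; see Streaming below
)

print(response.choices[0].message.content)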

Message Roles

Each message in the messages array has a role and content:

| Role | Purpose |
| --- | --- |
| system | Sets the AI's behavior, personality, or constraints. Placed first. |
| user | The human's message. |
| assistant | The AI's previous response. Used for multi-turn context. |

System Prompts

System prompts define how the model behaves. They're essential for game NPCs, assistants, or any specialized behavior:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    messages=[
        {
            "role": "system",
            "content": "You are a ship AI aboard a deep-space freighter. You speak formally, "
            "address the player as Captain, and provide status reports when asked. "
            "You are concerned about a recent anomaly in sector 7.",
        },
        {"role": "user", "content": "Status report."},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)

Multi-Turn Conversations

Include previous messages to give the model context of the conversation:

response = client.chat.completions.create(
    model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    messages=[
        {"role": "system", "content": "You are a helpful tavern keeper named Boris."},
        {"role": "user", "content": "What do you have on the menu?"},
        {"role": "assistant", "content": "We have roasted boar, mushroom stew, and fresh bread. The stew is my specialty!"},
        {"role": "user", "content": "I'll have the stew. Any rumors lately?"},
    ],
)

print(response.choices[0].message.content)
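
The endpoint is stateless, so the client owns the history: append each reply before the next turn. A minimal sketch of that loop (the history list and ask helper are illustrative names, not part of the API):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

history = [
    {"role": "system", "content": "You are a helpful tavern keeper named Boris."},
]

def ask(user_text: str) -> str:
    # Record the user's turn, call the model, then record its reply
    # so the next call sees the full conversation.
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(
        model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
        messages=history,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask("What do you have on the menu?"))
print(ask("I'll have the stew. Any rumors lately?"))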

Streaming

Set "stream": true to receive tokens as they're generated. This is ideal for typewriter-style dialogue UI.

SSE Format: Each token arrives as a Server-Sent Event. The stream ends with data: [DONE].

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Once"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}
...
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
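
In Python, the OpenAI SDK parses these events for you; iterating over the stream yields chunks whose delta.content carries each new piece of text:
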
stream = client.chat.completions.create(
    model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    messages=[
        {"role": "system", "content": "You are a narrator for a fantasy RPG."},
        {"role": "user", "content": "Describe the entrance to the dungeon."},
    ],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
print()
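
If you also need the complete reply afterwards (for example, to append it to the conversation history), buffer the pieces as they arrive. A small sketch, assuming a fresh stream created as above:

parts = []
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        parts.append(content)               # buffer for later
        print(content, end="", flush=True)  # still render live
print()

full_reply = "".join(parts)
# full_reply can now be stored as an assistant message.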

Temperature

Temperature controls randomness. For game applications:

| Use Case | Temperature | Why |
| --- | --- | --- |
| Factual responses, game rules | 0.1 - 0.3 | Consistent, predictable |
| NPC dialogue, general conversation | 0.6 - 0.8 | Natural variation |
| Creative writing, storytelling | 0.9 - 1.2 | More surprising, diverse |
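
To see the effect, run the same prompt at two settings from the table. This is an illustrative sketch (the prompt is arbitrary, and the high-temperature output will vary from run to run):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

for temp in (0.2, 1.0):
    response = client.chat.completions.create(
        model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
        messages=[{"role": "user", "content": "Name this tavern in one short phrase."}],
        temperature=temp,
    )
    print(f"temperature={temp}: {response.choices[0].message.content}")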

Finish Reasons

The finish_reason field tells you why generation stopped:

| Value | Meaning |
| --- | --- |
| stop | Model finished naturally (end of response) |
| length | Hit the max_tokens limit |
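
A common pattern is to detect truncation and react, for example by warning or retrying with a higher max_tokens. A sketch, continuing from a response obtained in any of the non-streaming examples above (the handling policy is up to you):

choice = response.choices[0]
if choice.finish_reason == "length":
    # The reply was cut off at max_tokens, so
    # choice.message.content is incomplete.
    print("Truncated; consider raising max_tokens.")
else:
    print(choice.message.content)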

Error Handling

Errors return standard HTTP status codes with an OpenAI-compatible error body:

{
  "error": {
    "message": "Model 'nonexistent-model' not found",
    "type": "invalid_request_error",
    "param": "model",
    "code": null
  }
}

| Status | Meaning |
| --- | --- |
| 400 | Invalid request (bad model name, malformed JSON) |
| 500 | Server error (model failed to load, inference error) |
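
When calling through the OpenAI Python SDK, these statuses surface as exceptions rather than raw JSON. A sketch of catching them, assuming the v1+ openai package, where non-2xx responses raise openai.APIStatusError:

import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

try:
    response = client.chat.completions.create(
        model="nonexistent-model",
        messages=[{"role": "user", "content": "Hello"}],
    )
except openai.APIStatusError as e:
    # e.status_code mirrors the HTTP status (400, 500, ...);
    # the OpenAI-compatible error body is available on e.response.
    print(f"Request failed ({e.status_code}): {e.message}")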