Chat Completions API
The chat completions endpoint generates a model response from a list of conversation messages. It's OpenAI-compatible, so if you've used the OpenAI API before, requests and responses work the same way here.
POST /v1/chat/completions
Basic Request
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}'
Response:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1700000000,
"model": "meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 12,
"completion_tokens": 8,
"total_tokens": 20
}
}
Request Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | required | Model identifier (e.g., in-memory::meta-llama/Llama-3.2-3B-Instruct) |
| messages | array | required | Conversation messages (see below) |
| stream | boolean | false | Enable token-by-token streaming |
| temperature | float | 1.0 | Sampling temperature. Lower = more deterministic, higher = more creative |
| max_tokens | integer | model default | Maximum tokens to generate |
| response_format | object | null | Constrain output format (see Structured Generation) |
| enable_thinking | boolean | null | Enable extended thinking for models that support it (e.g., Qwen 3.5) |
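For example, a request combining several of the optional parameters might look like this with the OpenAI Python client (a sketch; the prompt and values are illustrative):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    messages=[{"role": "user", "content": "Summarize the quest briefing in two sentences."}],
    temperature=0.3,   # lower temperature for consistent, factual phrasing
    max_tokens=128,    # cap the length of the generated reply
)
print(response.choices[0].message.content)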
Message Roles
Each message in the messages array has a role and content:
| Role | Purpose |
|---|---|
| system | Sets the AI's behavior, personality, or constraints. Placed first. |
| user | The human's message. |
| assistant | The AI's previous response. Used for multi-turn context. |
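Put together, a messages array places the system message first, followed by alternating user and assistant turns. An illustrative sketch (Python list, same shape as the JSON wire format):
messages = [
    {"role": "system", "content": "You are a ship AI aboard a deep-space freighter."},
    {"role": "user", "content": "Status report."},
    {"role": "assistant", "content": "All systems nominal, Captain."},
    {"role": "user", "content": "Anything unusual in sector 7?"},
]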
System Prompts
System prompts define how the model behaves. They're essential for game NPCs, assistants, or any specialized behavior:
- Python
- Godot (GDScript)
- Unity (C#)
- Unreal (C++)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
response = client.chat.completions.create(
model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
messages=[
{
"role": "system",
"content": "You are a ship AI aboard a deep-space freighter. You speak formally, "
"address the player as Captain, and provide status reports when asked. "
"You are concerned about a recent anomaly in sector 7.",
},
{"role": "user", "content": "Status report."},
],
temperature=0.7,
)
print(response.choices[0].message.content)
func get_ship_ai_response(player_input: String) -> void:
var request = {
"model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
"messages": [
{
"role": "system",
"content": "You are a ship AI aboard a deep-space freighter. You speak formally, address the player as Captain, and provide status reports when asked. You are concerned about a recent anomaly in sector 7."
},
{"role": "user", "content": player_input}
],
"temperature": 0.7
}
engine.async_chat_completions(JSON.stringify(request))
public async Task<string> GetShipAIResponse(string playerInput)
{
var request = new
{
model = "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
messages = new object[]
{
new { role = "system", content = "You are a ship AI aboard a deep-space freighter. You speak formally, address the player as Captain, and provide status reports when asked." },
new { role = "user", content = playerInput }
},
temperature = 0.7
};
var json = JsonSerializer.Serialize(request);
var content = new StringContent(json, Encoding.UTF8, "application/json");
var response = await client.PostAsync($"{BaseUrl}/chat/completions", content);
var responseJson = await response.Content.ReadAsStringAsync();
using var doc = JsonDocument.Parse(responseJson);
return doc.RootElement.GetProperty("choices")[0]
.GetProperty("message").GetProperty("content").GetString();
}
void UAtelicoClient::GetShipAIResponse(const FString& PlayerInput)
{
TSharedPtr<FJsonObject> Body = MakeShareable(new FJsonObject);
Body->SetStringField("model", "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M");
Body->SetNumberField("temperature", 0.7);
TArray<TSharedPtr<FJsonValue>> Messages;
TSharedPtr<FJsonObject> SystemMsg = MakeShareable(new FJsonObject);
SystemMsg->SetStringField("role", "system");
SystemMsg->SetStringField("content",
"You are a ship AI aboard a deep-space freighter. You speak formally, "
"address the player as Captain, and provide status reports when asked.");
Messages.Add(MakeShareable(new FJsonValueObject(SystemMsg)));
TSharedPtr<FJsonObject> UserMsg = MakeShareable(new FJsonObject);
UserMsg->SetStringField("role", "user");
UserMsg->SetStringField("content", PlayerInput);
Messages.Add(MakeShareable(new FJsonValueObject(UserMsg)));
Body->SetArrayField("messages", Messages);
SendRequest(Body); // see Getting Started for full HTTP setup
}
Multi-Turn Conversations
Include previous messages so the model has the full conversation context:
- Python
- Godot (GDScript)
- Unity (C#)
- Unreal (C++)
response = client.chat.completions.create(
model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
messages=[
{"role": "system", "content": "You are a helpful tavern keeper named Boris."},
{"role": "user", "content": "What do you have on the menu?"},
{"role": "assistant", "content": "We have roasted boar, mushroom stew, and fresh bread. The stew is my specialty!"},
{"role": "user", "content": "I'll have the stew. Any rumors lately?"},
],
)
print(response.choices[0].message.content)
# Keep conversation history in an array
var conversation: Array = [
{"role": "system", "content": "You are a helpful tavern keeper named Boris."}
]
func talk_to_npc(player_input: String) -> void:
conversation.append({"role": "user", "content": player_input})
var request = {
"model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
"messages": conversation
}
engine.async_chat_completions(JSON.stringify(request))
func _on_async_request_completed(_job_id: int, response: String) -> void:
var parsed = JSON.parse_string(response)
var reply = parsed["choices"][0]["message"]["content"]
conversation.append({"role": "assistant", "content": reply})
dialogue_label.text = reply
private List<object> conversation = new()
{
new { role = "system", content = "You are a helpful tavern keeper named Boris." }
};
public async Task<string> TalkToNPC(string playerInput)
{
conversation.Add(new { role = "user", content = playerInput });
var request = new { model = "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M", messages = conversation };
var json = JsonSerializer.Serialize(request);
var content = new StringContent(json, Encoding.UTF8, "application/json");
var response = await client.PostAsync($"{BaseUrl}/chat/completions", content);
var responseJson = await response.Content.ReadAsStringAsync();
using var doc = JsonDocument.Parse(responseJson);
var reply = doc.RootElement.GetProperty("choices")[0]
.GetProperty("message").GetProperty("content").GetString();
conversation.Add(new { role = "assistant", content = reply });
return reply;
}
// Store conversation as TArray<TSharedPtr<FJsonValue>>
TArray<TSharedPtr<FJsonValue>> Conversation;
void UAtelicoClient::InitConversation()
{
TSharedPtr<FJsonObject> SystemMsg = MakeShareable(new FJsonObject);
SystemMsg->SetStringField("role", "system");
SystemMsg->SetStringField("content", "You are a helpful tavern keeper named Boris.");
Conversation.Add(MakeShareable(new FJsonValueObject(SystemMsg)));
}
void UAtelicoClient::TalkToNPC(const FString& PlayerInput)
{
TSharedPtr<FJsonObject> UserMsg = MakeShareable(new FJsonObject);
UserMsg->SetStringField("role", "user");
UserMsg->SetStringField("content", PlayerInput);
Conversation.Add(MakeShareable(new FJsonValueObject(UserMsg)));
TSharedPtr<FJsonObject> Body = MakeShareable(new FJsonObject);
Body->SetStringField("model", "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M");
Body->SetArrayField("messages", Conversation);
// On response callback, parse assistant reply and append to Conversation
SendRequest(Body);
}
Streaming
Set "stream": true to receive tokens as they're generated. This is ideal for typewriter-style dialogue UI.
SSE Format: Each token arrives as a Server-Sent Event. The stream ends with data: [DONE].
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Once"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}
...
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
- Python
- Godot (GDScript)
- Unity (C#)
- Unreal (C++)
stream = client.chat.completions.create(
model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
messages=[
{"role": "system", "content": "You are a narrator for a fantasy RPG."},
{"role": "user", "content": "Describe the entrance to the dungeon."},
],
stream=True,
)
for chunk in stream:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)
print()
# Streaming uses the built-in signal-based API
func stream_narration(prompt: String) -> void:
var request = {
"model": "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
"messages": [
{"role": "system", "content": "You are a narrator for a fantasy RPG."},
{"role": "user", "content": prompt}
]
}
engine.stream_chat_completions(JSON.stringify(request))
# Called once per token as it arrives
func _on_inference_token_generated(_job_id: int, token: String) -> void:
dialogue_label.text += token # typewriter effect
func _on_inference_completed(_job_id: int) -> void:
print("Stream finished")
public async Task StreamToDialogue(string prompt, TMPro.TextMeshProUGUI dialogueText)
{
var request = new
{
model = "in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
messages = new[]
{
new { role = "system", content = "You are a narrator for a fantasy RPG." },
new { role = "user", content = prompt }
},
stream = true
};
var json = JsonSerializer.Serialize(request);
var httpContent = new StringContent(json, Encoding.UTF8, "application/json");
var httpRequest = new HttpRequestMessage(HttpMethod.Post, $"{BaseUrl}/chat/completions")
{
Content = httpContent
};
var response = await client.SendAsync(httpRequest, HttpCompletionOption.ResponseHeadersRead);
using var stream = await response.Content.ReadAsStreamAsync();
using var reader = new StreamReader(stream);
dialogueText.text = "";
while (await reader.ReadLineAsync() is { } line)
{
if (line.StartsWith("data: ") && line != "data: [DONE]")
{
var chunk = JsonDocument.Parse(line.Substring(6));
var delta = chunk.RootElement.GetProperty("choices")[0].GetProperty("delta");
if (delta.TryGetProperty("content", out var c))
dialogueText.text += c.GetString();
}
}
}
void UAtelicoClient::StreamNarration(const FString& Prompt)
{
auto Request = FHttpModule::Get().CreateRequest();
Request->SetURL(TEXT("http://localhost:11434/v1/chat/completions"));
Request->SetVerb(TEXT("POST"));
Request->SetHeader(TEXT("Content-Type"), TEXT("application/json"));
// Build JSON with "stream": true and messages array
// ... (see Getting Started for JSON building pattern) ...
// Handle chunked SSE responses via progress callback
Request->OnRequestProgress().BindLambda(
[this](FHttpRequestPtr Req, int32 BytesSent, int32 BytesReceived)
{
FString Content = Req->GetResponse()->GetContentAsString();
// Parse new SSE lines since last callback
// Extract delta.content tokens
// Append to dialogue UTextBlock
});
Request->ProcessRequest();
}
Temperature
Temperature controls sampling randomness. Recommended ranges for game applications:
| Use Case | Temperature | Why |
|---|---|---|
| Factual responses, game rules | 0.1 - 0.3 | Consistent, predictable |
| NPC dialogue, general conversation | 0.6 - 0.8 | Natural variation |
| Creative writing, storytelling | 0.9 - 1.2 | More surprising, diverse |
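A sketch of putting these ranges into practice (the preset names and values below are illustrative, not part of the API; client is the OpenAI client from the earlier Python examples):
# Illustrative presets based on the table above.
TEMPERATURE_PRESETS = {
    "game_rules": 0.2,
    "npc_dialogue": 0.7,
    "narration": 1.0,
}
response = client.chat.completions.create(
    model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    messages=[{"role": "user", "content": "Describe the storm rolling over the harbor."}],
    temperature=TEMPERATURE_PRESETS["narration"],
)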
Finish Reasons
The finish_reason field tells you why generation stopped:
| Value | Meaning |
|---|---|
| stop | Model finished naturally (end of response) |
| length | Hit the max_tokens limit |
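For example, checking finish_reason lets you detect a truncated reply and react to it (a sketch; client is the OpenAI client from the earlier Python examples):
response = client.chat.completions.create(
    model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    messages=[{"role": "user", "content": "Recount the legend of the sunken city."}],
    max_tokens=64,
)
choice = response.choices[0]
if choice.finish_reason == "length":
    # Reply was cut off at max_tokens: raise the limit or ask the model to continue.
    print(choice.message.content + " [truncated]")
else:
    print(choice.message.content)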
Error Handling
Errors return standard HTTP status codes with an OpenAI-compatible error body:
{
"error": {
"message": "Model 'nonexistent-model' not found",
"type": "invalid_request_error",
"param": "model",
"code": null
}
}
| Status | Meaning |
|---|---|
| 400 | Invalid request (bad model name, malformed JSON) |
| 500 | Server error (model failed to load, inference error) |
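With the OpenAI Python client, these error responses surface as raised exceptions. A minimal sketch, assuming the openai>=1.0 SDK where APIStatusError exposes the HTTP status code:
import openai
# "client" is the OpenAI client constructed in the System Prompts example.
try:
    client.chat.completions.create(
        model="nonexistent-model",
        messages=[{"role": "user", "content": "Hello"}],
    )
except openai.APIStatusError as e:
    # e.status_code is the HTTP status (e.g., 400 or 500); str(e) includes the error message.
    print(f"Request failed ({e.status_code}): {e}")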