Version: 0.9

Getting Started

This guide assumes you have an Atelico server bundle containing:

  • atelico-server -- the inference server binary
  • atelico-asset-downloader -- the model download tool

1. Download a Model

List available models:

./atelico-asset-downloader list --namespace models

Download a model. For a quick start, the 1B quantized model is small and fast:

./atelico-asset-downloader download meta-llama/Llama-3.2-1B-Instruct-Q4_K_M

For better quality responses, use the 3B model:

./atelico-asset-downloader download meta-llama/Llama-3.2-3B-Instruct-Q4_K_M

Or use interactive mode to browse and select:

./atelico-asset-downloader interactive

Models are cached locally and only need to be downloaded once:

  • macOS: ~/Library/Caches/atelico/models/
  • Linux: ~/.cache/atelico/models/
  • Windows: %LOCALAPPDATA%\atelico\models\
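
If you want to see what is already cached, the paths above can be resolved with a few lines of Python. This is only a convenience sketch based on the directory locations listed here; the layout inside models/ is not documented and may differ.

import os
import platform
from pathlib import Path

def atelico_model_cache() -> Path:
    """Return the model cache directory for the current platform (paths listed above)."""
    system = platform.system()
    if system == "Darwin":  # macOS
        return Path.home() / "Library" / "Caches" / "atelico" / "models"
    if system == "Windows":
        return Path(os.environ["LOCALAPPDATA"]) / "atelico" / "models"
    # Linux and other Unix-likes
    return Path.home() / ".cache" / "atelico" / "models"

cache = atelico_model_cache()
if cache.exists():
    for entry in sorted(cache.iterdir()):
        print(entry.name)
else:
    print(f"Nothing cached yet ({cache})")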

2. Start the Server

./atelico-server

The server starts on port 11434 and auto-detects your GPU:

  • Mac: uses Metal (Apple Silicon GPU)
  • NVIDIA: uses CUDA
  • No GPU: falls back to CPU

To use a different port:

./atelico-server --port 8080

3. Verify It's Running

curl http://localhost:11434/v1/models

You should see a list of available models:

{
  "object": "list",
  "data": [
    {"id": "in-memory::meta-llama/Llama-3.2-1B-Instruct-Q4_K_M", "object": "model", "owned_by": "atelico"},
    ...
  ]
}
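
The same check works from Python using only the standard library; the endpoint and response shape are exactly what the curl example above returns:

import json
import urllib.request

# Query the OpenAI-compatible /v1/models endpoint and print each model id.
with urllib.request.urlopen("http://localhost:11434/v1/models") as resp:
    models = json.load(resp)

for model in models["data"]:
    print(model["id"])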

4. Send Your First Request

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "in-memory::meta-llama/Llama-3.2-1B-Instruct-Q4_K_M",
    "messages": [{"role": "user", "content": "Hello! What can you do?"}]
  }'

That's it. The first request takes a few seconds while the model loads into GPU memory. Subsequent requests are fast.
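
If you would rather script this step than use curl, the same request can be sent with the Python standard library. The response is parsed using the standard OpenAI chat-completion shape (choices[0].message.content), which an OpenAI-compatible server is expected to return:

import json
import urllib.request

body = {
    "model": "in-memory::meta-llama/Llama-3.2-1B-Instruct-Q4_K_M",
    "messages": [{"role": "user", "content": "Hello! What can you do?"}],
}

req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# The first call may take a few seconds while the model is loaded.
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

print(reply["choices"][0]["message"]["content"])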

5. Use from Your Engine or Language

The server speaks the OpenAI API protocol. Any client library that works with OpenAI works with Atelico -- just point it at http://localhost:11434/v1.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed",  # required by the library but not used
)

response = client.chat.completions.create(
    model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    messages=[{"role": "user", "content": "Tell me a short joke"}],
    temperature=0.7,
)

print(response.choices[0].message.content)
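
Streaming uses the same client. The snippet below continues the example above and assumes the server implements OpenAI's standard stream parameter, which is not confirmed here:

# Assumption: the server supports OpenAI-style streamed chat completions.
stream = client.chat.completions.create(
    model="in-memory::meta-llama/Llama-3.2-3B-Instruct-Q4_K_M",
    messages=[{"role": "user", "content": "Tell me a short joke"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()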

Next Steps