## Endpoint
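The request/response schema below matches Anthropic's Messages API, so the route is presumably the same (the exact path is an assumption, inferred from that schema and the `GET /v1/models` reference below):

```
POST /v1/messages
```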
## Request Body
`model` (string)

The model to use for generation. Examples:

- `claude-opus-4-6-thinking`
- `claude-sonnet-4-5-thinking`
- `gemini-3-flash`

Use `GET /v1/models` to see all available models.

`messages` (array)

Array of message objects representing the conversation history. Each message has:

- `role` (string): Either `user` or `assistant`
- `content` (string | array): Message content as text or an array of content blocks

`max_tokens` (number)

Maximum number of tokens to generate in the response. For Gemini models, this is automatically capped at 16384 (Gemini's limit).
`stream` (boolean)

Enable streaming mode. When `true`, the response is sent as Server-Sent Events (SSE).

`system` (string)

System instruction to guide the model's behavior.

`tools` (array)

Array of tool definitions for function calling. Each tool has:

- `name` (string): Tool name
- `description` (string): What the tool does
- `input_schema` (object): JSON Schema for the tool's parameters

`tool_choice` (object)

Control which tool the model should use:

- `{"type": "auto"}` - Model decides (default)
- `{"type": "any"}` - Model must use a tool
- `{"type": "tool", "name": "tool_name"}` - Use a specific tool

`thinking` (object)

Enable extended thinking for supported models:
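A minimal sketch, assuming the proxy mirrors Anthropic's extended-thinking parameter format (the `budget_tokens` field and its value are illustrative):

```json
{
  "thinking": {
    "type": "enabled",
    "budget_tokens": 10000
  }
}
```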
`temperature` (number)

Sampling temperature. Higher values make output more random.

`top_p` (number)

Nucleus sampling threshold.

`top_k` (number)

Top-K sampling parameter (Gemini only).
## Response
### Non-Streaming Response
`id` (string)

Unique message identifier.

`type` (string)

Always `"message"`.

`role` (string)

Always `"assistant"`.

`content` (array)

Array of content blocks. Each block can be:

- Text block: `{"type": "text", "text": "..."}`
- Thinking block: `{"type": "thinking", "thinking": "...", "signature": "..."}`
- Tool use block: `{"type": "tool_use", "id": "...", "name": "...", "input": {...}}`
`model` (string)

The model that generated the response.

`stop_reason` (string)

Why the model stopped generating:

- `"end_turn"` - Natural completion
- `"max_tokens"` - Hit the token limit
- `"tool_use"` - Model called a tool
- `"stop_sequence"` - Hit a stop sequence

`usage` (object)

Token usage statistics:

- `input_tokens` (number): Tokens in the prompt
- `output_tokens` (number): Tokens generated
- `cache_creation_input_tokens` (number): Tokens cached (if prompt caching is used)
- `cache_read_input_tokens` (number): Tokens read from cache
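Putting these fields together, a non-streaming response looks roughly like this (the message ID, model name, and token counts are illustrative):

```json
{
  "id": "msg_01XYZ...",
  "type": "message",
  "role": "assistant",
  "content": [
    {"type": "text", "text": "Hello! How can I help you today?"}
  ],
  "model": "claude-sonnet-4-5-thinking",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 12,
    "output_tokens": 9
  }
}
```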
### Streaming Response
When `stream: true`, the response is sent as Server-Sent Events:
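A sketch of the event stream, assuming the proxy follows Anthropic's SSE event sequence (payloads are abbreviated for illustration):

```text
event: message_start
data: {"type": "message_start", "message": {"id": "msg_01XYZ...", "role": "assistant", ...}}

event: content_block_start
data: {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}}

event: content_block_stop
data: {"type": "content_block_stop", "index": 0}

event: message_delta
data: {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 9}}

event: message_stop
data: {"type": "message_stop"}
```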
## Examples
### Basic Request
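A minimal sketch, assuming the proxy listens on `http://localhost:8080` and accepts an Anthropic-style `x-api-key` header (host, port, and auth header are all assumptions; adjust to your deployment):

```bash
curl http://localhost:8080/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $API_KEY" \
  -d '{
    "model": "claude-sonnet-4-5-thinking",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```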
### Streaming Request
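The same sketch with `stream: true`; curl's `-N` disables output buffering so SSE events print as they arrive (host and auth remain assumptions):

```bash
curl -N http://localhost:8080/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $API_KEY" \
  -d '{
    "model": "claude-sonnet-4-5-thinking",
    "max_tokens": 1024,
    "stream": true,
    "messages": [
      {"role": "user", "content": "Write a haiku about proxies."}
    ]
  }'
```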
### With Tools
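A sketch of a tool-calling request using the `tools` and `tool_choice` parameters described above (the `get_weather` tool is hypothetical):

```bash
curl http://localhost:8080/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $API_KEY" \
  -d '{
    "model": "claude-sonnet-4-5-thinking",
    "max_tokens": 1024,
    "tools": [{
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }],
    "tool_choice": {"type": "auto"},
    "messages": [
      {"role": "user", "content": "What is the weather in Paris?"}
    ]
  }'
```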
## Prompt Caching
The proxy automatically handles prompt caching to reduce latency and token usage:

- Caching is organization-scoped (requires the same account + session ID)
- The session ID is derived from the SHA256 hash of the first user message (see the sketch after this list)
- Cached tokens are reported in `usage.cache_read_input_tokens`
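A rough illustration of the session-ID derivation in shell; whether the proxy hashes the raw message text exactly like this, or a canonicalized form of it, is an assumption:

```bash
# Hypothetical: session ID as the SHA256 hex digest of the first user message
printf '%s' 'Hello!' | sha256sum | cut -d' ' -f1
```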
### How It Works
- First request with a conversation → creates a cache
- Subsequent requests with the same account → read from the cache
- If the account switches → cache miss; a new cache is created
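For example, the second request in a session should report cache hits in `usage` (token counts are illustrative):

```json
{
  "usage": {
    "input_tokens": 4,
    "output_tokens": 120,
    "cache_read_input_tokens": 1024
  }
}
```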
## Error Responses
### 400 Bad Request - Invalid Parameters
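The error bodies below are sketches assuming the proxy mirrors Anthropic's error envelope (`{"type": "error", "error": {...}}`); exact type strings and messages will vary:

```json
{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "message": "max_tokens: must be a positive integer"
  }
}
```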
### 401 Unauthorized - Missing API Key
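Assuming the same envelope:

```json
{
  "type": "error",
  "error": {
    "type": "authentication_error",
    "message": "Missing API key"
  }
}
```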
### 503 Service Unavailable - All Accounts Exhausted
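Again a sketch; the `overloaded_error` type string is an assumption borrowed from Anthropic's error taxonomy:

```json
{
  "type": "error",
  "error": {
    "type": "overloaded_error",
    "message": "All accounts are currently unavailable"
  }
}
```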
### 400 Bad Request - Quota Exhausted
When all accounts are rate-limited for the requested model, the proxy returns 400 (not 429) to prevent clients from automatically retrying. This ensures Claude Code stops cleanly instead of entering a retry loop.
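An illustrative body under the same envelope assumption (the message text is hypothetical):

```json
{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "message": "Quota exhausted for model claude-opus-4-6-thinking on all accounts"
  }
}
```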