
OpenAI-Compatible Server

The LiteRT-LM CLI can start a local HTTP server that is compatible with the OpenAI API. This lets you use LiteRT-LM as a drop-in replacement for the OpenAI API in your existing applications and workflows.

Start the Server

Use the serve command to start the server. By default, it listens on port 9379.

The server dynamically loads and serves any models in your local registry. To manage the models available to the server (such as importing new models or listing existing ones), see Model Management.

litert-lm serve

Configuration Options

You can customize the server using the following options:

--host: The host to listen on (default: 0.0.0.0, which accepts connections on all network interfaces).
--port: The port to listen on (default: 9379).
--verbose: Enable verbose logging.

Example with custom host and port:

litert-lm serve --host 127.0.0.1 --port 8080
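
If you launch the server from a script, a small helper can assemble the command line from the options above. The following is a minimal sketch, not part of LiteRT-LM: build_serve_command is a hypothetical name, and it assumes --verbose is a plain switch that takes no value.

```python
import shlex


def build_serve_command(host=None, port=None, verbose=False):
    """Return the argv list for `litert-lm serve` using only the documented flags.

    Hypothetical helper; assumes --verbose takes no value.
    """
    argv = ["litert-lm", "serve"]
    if host is not None:
        argv += ["--host", str(host)]
    if port is not None:
        port = int(port)
        if not 0 < port < 65536:
            raise ValueError(f"port out of range: {port}")
        argv += ["--port", str(port)]
    if verbose:
        argv.append("--verbose")
    return argv


if __name__ == "__main__":
    # Prints: litert-lm serve --host 127.0.0.1 --port 8080
    print(shlex.join(build_serve_command(host="127.0.0.1", port=8080)))
```

The returned list can be passed directly to subprocess.Popen, which avoids shell quoting issues.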

Supported Endpoints

The server emulates the following OpenAI API endpoints:

List Models: GET /v1/models. Lists the models that are available to the server.
Chat Completions: POST /v1/chat/completions. Generates text completions for a given chat conversation. Supports streaming responses.
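
As an illustration of the List Models endpoint, the sketch below queries GET /v1/models with Python's standard library. It assumes the response follows the OpenAI list shape (a "data" array of objects with an "id" field), which this page does not spell out. The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9379"  # default port documented on this page


def models_request(base_url=BASE_URL):
    """Build the GET /v1/models request without sending it."""
    return urllib.request.Request(base_url.rstrip("/") + "/v1/models", method="GET")


def extract_model_ids(body):
    """Pull model IDs out of a decoded response body.

    Assumption: the body has the OpenAI list shape,
    {"object": "list", "data": [{"id": "..."}, ...]}.
    """
    return [entry["id"] for entry in body.get("data", [])]


def list_model_ids(base_url=BASE_URL, timeout=10):
    """Send the request to a running server and return the model IDs."""
    with urllib.request.urlopen(models_request(base_url), timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

With the server running, list_model_ids() should return the IDs that can be used in the model field of a chat request.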

Choosing the Backend and Configuration

When sending requests to the server, you can dynamically choose the execution backend (CPU, GPU, or NPU) and configure the maximum number of tokens (context length) by formatting the model field in your request payload.

The model field supports the following format:

model_id[,backend][,max_tokens]

Where model_id corresponds to any model ID in your local registry (see Model Management for details on how to list or import models).

Examples

gemma4-2b: default backend (CPU) with default max tokens.
gemma4-2b,gpu: GPU backend with default max tokens.
gemma4-2b,gpu,32768: GPU backend with max tokens 32768.
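
The format above can be assembled programmatically. This is a minimal sketch with a hypothetical helper name. It makes two conservative assumptions that the examples suggest but the page does not state: backend names are lowercase, and max_tokens is only given together with a backend.

```python
def format_model_field(model_id, backend=None, max_tokens=None):
    """Build a model field of the form model_id[,backend][,max_tokens].

    Hypothetical helper. Assumes lowercase backend names ("cpu", "gpu", "npu")
    and refuses max_tokens without a backend, because the documented examples
    always list the backend first.
    """
    if not model_id or "," in model_id:
        raise ValueError("model_id must be non-empty and must not contain commas")
    parts = [model_id]
    if backend is not None:
        parts.append(backend.lower())
    if max_tokens is not None:
        if backend is None:
            raise ValueError("specify a backend when setting max_tokens")
        max_tokens = int(max_tokens)
        if max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        parts.append(str(max_tokens))
    return ",".join(parts)


if __name__ == "__main__":
    print(format_model_field("gemma4-2b"))                # gemma4-2b
    print(format_model_field("gemma4-2b", "gpu"))         # gemma4-2b,gpu
    print(format_model_field("gemma4-2b", "gpu", 32768))  # gemma4-2b,gpu,32768
```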

Usage Example

Once the server is running, you can interact with it by sending HTTP requests.

Sending HTTP Requests

Linux/macOS

curl http://localhost:9379/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4-2b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

Windows (PowerShell)

Invoke-RestMethod -Uri "http://localhost:9379/v1/chat/completions" `
  -Method Post `
  -ContentType "application/json" `
  -Body '{"model": "gemma4-2b", "messages": [{"role": "user", "content": "Hello!"}]}'
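
Python

The same request can be sent from Python using only the standard library. This is a sketch under stated assumptions: the response is assumed to follow the OpenAI chat completion shape (choices[0].message.content), and streaming is assumed to use the OpenAI "stream": true parameter with "data: ..." lines ending in "data: [DONE]". None of the function names come from LiteRT-LM.

```python
import json
import urllib.request


def chat_request(prompt, model="gemma4-2b", base_url="http://localhost:9379", stream=False):
    """Build the POST /v1/chat/completions request without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if stream:
        payload["stream"] = True  # assumption: OpenAI-style streaming switch
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reply_text(body):
    """Extract the assistant text, assuming the OpenAI response shape."""
    return body["choices"][0]["message"]["content"]


def stream_deltas(lines):
    """Yield text fragments from streamed lines.

    Assumption: OpenAI-style server-sent events, one `data: {...}` JSON chunk
    per line, terminated by `data: [DONE]`.
    """
    for raw in lines:
        line = raw.decode("utf-8").strip() if isinstance(raw, bytes) else raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        if delta.get("content"):
            yield delta["content"]


def chat(prompt, model="gemma4-2b", base_url="http://localhost:9379", timeout=120):
    """Send a non-streaming request to a running server and return the reply."""
    req = chat_request(prompt, model=model, base_url=base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return reply_text(json.load(resp))
```

To stream, open chat_request(prompt, stream=True) with urllib.request.urlopen and pass the response object to stream_deltas, which iterates over it line by line.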