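LiteRT-LM cross-platform C++ API

Conversation is a high-level API that represents a single stateful conversation with an LLM, and it is the recommended entry point for most users. It manages a Session internally and handles the complex data processing tasks: maintaining the initial context, managing tool definitions, preprocessing multimodal data, and applying a Jinja prompt template with role-based message formatting.

Conversation API workflow

The typical lifecycle for using the Conversation API is:

Create an Engine: Initialize a single Engine with the model path and configuration. This is a heavyweight object that holds the model weights.
Create a Conversation: Use the Engine to create one or more lightweight Conversation objects.
Send messages: Use the Conversation object's methods to send messages to the LLM and receive responses, which gives a chat-like interaction.

The following methods are the simplest way to send a message and get the model's response. They are recommended for most use cases and mirror the Gemini Chat API.

SendMessage: A blocking call that takes the user input and returns the complete model response.
SendMessageAsync: A non-blocking call that streams the model's response back token by token through a callback.

Example code snippets:

Text-only content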

#include "runtime/engine/engine.h"

// ...

// 1. Define model assets and engine settings.
auto model_assets = ModelAssets::Create(model_path);
CHECK_OK(model_assets);

auto engine_settings = EngineSettings::CreateDefault(
    *model_assets,
    /*backend=*/litert::lm::Backend::CPU);

// 2. Create the main Engine object.
absl::StatusOr<std::unique_ptr<Engine>> engine = Engine::CreateEngine(engine_settings);
CHECK_OK(engine);

// 3. Create a Conversation
auto conversation_config = ConversationConfig::CreateDefault(**engine);
CHECK_OK(conversation_config);
absl::StatusOr<std::unique_ptr<Conversation>> conversation = Conversation::Create(**engine, *conversation_config);
CHECK_OK(conversation);

// 4. Send message to the LLM with blocking call.
absl::StatusOr<Message> model_message = (*conversation)->SendMessage(
    JsonMessage{
        {"role", "user"},
        {"content", "What is the tallest building in the world?"}
    });
CHECK_OK(model_message);

// 5. Print the model message.
std::cout << *model_message << std::endl;

// 6. Send message to the LLM with an asynchronous call.
// CreatePrintMessageCallback is a user-implemented callback that processes
// each chunk of the model output as it is received.
std::stringstream captured_output;
(*conversation)->SendMessageAsync(
    JsonMessage{
        {"role", "user"},
        {"content", "What is the tallest building in the world?"}
    },
    CreatePrintMessageCallback(captured_output)
);
// Wait until the asynchronous call finishes or the timeout expires.
(*engine)->WaitUntilDone(absl::Seconds(10));

Example CreatePrintMessageCallback

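// Forward declaration: PrintJsonMessage is defined after the callback factory.
absl::Status PrintJsonMessage(const JsonMessage& message,
                              std::stringstream& captured_output,
                              bool streaming);

absl::AnyInvocable<void(absl::StatusOr<Message>)> CreatePrintMessageCallback(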
    std::stringstream& captured_output) {
  return [&captured_output](absl::StatusOr<Message> message) {
    if (!message.ok()) {
      std::cout << message.status().message() << std::endl;
      return;
    }
    if (auto json_message = std::get_if<JsonMessage>(&(*message))) {
      if (json_message->is_null()) {
        std::cout << std::endl << std::flush;
        return;
      }
      ABSL_CHECK_OK(PrintJsonMessage(*json_message, captured_output,
                                     /*streaming=*/true));
    }
  };
}

absl::Status PrintJsonMessage(const JsonMessage& message,
                              std::stringstream& captured_output,
                              bool streaming = false) {
  if (message["content"].is_array()) {
    for (const auto& content : message["content"]) {
      if (content["type"] == "text") {
        captured_output << content["text"].get<std::string>();
        std::cout << content["text"].get<std::string>();
      }
    }
    if (!streaming) {
      captured_output << std::endl << std::flush;
      std::cout << std::endl << std::flush;
    } else {
      captured_output << std::flush;
      std::cout << std::flush;
    }
  } else if (message["content"]["text"].is_string()) {
    if (!streaming) {
      captured_output << message["content"]["text"].get<std::string>()
                      << std::endl
                      << std::flush;
      std::cout << message["content"]["text"].get<std::string>() << std::endl
                << std::flush;
    } else {
      captured_output << message["content"]["text"].get<std::string>()
                      << std::flush;
      std::cout << message["content"]["text"].get<std::string>() << std::flush;
    }
  } else {
    return absl::InvalidArgumentError("Invalid message: " + message.dump());
  }
  return absl::OkStatus();
}

Multimodal data content

// To use multimodality, the engine must be created with vision and audio
// backend depending on the multimodality to be used
auto engine_settings = EngineSettings::CreateDefault(
    *model_assets,
    /*backend=*/litert::lm::Backend::CPU,
    /*vision_backend=*/litert::lm::Backend::GPU,
    /*audio_backend=*/litert::lm::Backend::CPU);

// The same steps to create Engine and Conversation as above...

// Send message to the LLM with image data.
absl::StatusOr<Message> model_message = (*conversation)->SendMessage(
    JsonMessage{
        {"role", "user"},
        {"content", { // Now content must be an array.
          {
            {"type", "text"}, {"text", "Describe the following image: "}
          },
          {
            {"type", "image"}, {"path", "/file/path/to/image.jpg"}
          }
        }},
    });
CHECK_OK(model_message);

// Print the model message.
std::cout << *model_message << std::endl;

// Send message to the LLM with audio data.
model_message = (*conversation)->SendMessage(
    JsonMessage{
        {"role", "user"},
        {"content", { // Now content must be an array.
          {
            {"type", "text"}, {"text", "Transcribe the audio: "}
          },
          {
            {"type", "audio"}, {"path", "/file/path/to/audio.wav"}
          }
        }},
    });
CHECK_OK(model_message);

// Print the model message.
std::cout << *model_message << std::endl;

// The content can include multiple image or audio data.
model_message = (*conversation)->SendMessage(
    JsonMessage{
        {"role", "user"},
        {"content", { // Now content must be an array.
          {
            {"type", "text"}, {"text", "First briefly describe the two images "}
          },
          {
            {"type", "image"}, {"path", "/file/path/to/image1.jpg"}
          },
          {
            {"type", "text"}, {"text", "and "}
          },
          {
            {"type", "image"}, {"path", "/file/path/to/image2.jpg"}
          },
          {
            {"type", "text"}, {"text", " then transcribe the content in the audio"}
          },
          {
            {"type", "audio"}, {"path", "/file/path/to/audio.wav"}
          }
        }},
    });
CHECK_OK(model_message);

// Print the model message.
std::cout << *model_message << std::endl;

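Using Conversation with tools

For details on using tools with the Conversation API, see Advanced usage.

Components in Conversation

Conversation can be viewed as a delegate for the user: it maintains the Session and performs the complex data processing before data is sent to the Session.

I/O types

The core input and output format of the Conversation API is Message. Currently, it is implemented as JsonMessage, a type alias for ordered_json, which is a flexible nested key-value data structure.

The Conversation API operates on a message-in, message-out basis, mimicking a typical chat experience. The flexibility of Message lets users add arbitrary fields as required by a specific prompt template or LLM, which allows LiteRT-LM to support a wide range of models.

Although there is no single strict standard, most prompt templates and models expect Message to follow conventions similar to those used in Gemini API Content or the OpenAI message structure.

A Message must contain role, which indicates who sent the message. content can be as simple as a text string.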

{
  "role": "model", // Who the message is sent from.
  "content": "Hello World!" // Plain text-only content.
}

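For multimodal input, content is a list of parts. Again, a part is not a predefined data structure but an ordered key-value data type. The specific fields depend on what the prompt template and the model require.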

{
  "role": "user",
  "content": [  // Multimodal content.
    // Now the content is composed of parts
    {
      "type": "text",
      "text": "Describe the image in details: "
    },
    {
      "type": "image",
      "path": "/path/to/image.jpg"
    }
  ]
}

For multimodal parts, the following formats, handled by data_utils.h, are supported:

{
  "type": "text",
  "text": "this is a text"
}

{
  "type": "image",
  "path": "/path/to/image.jpg"
}

{
  "type": "image",
  "blob": "base64 encoded image bytes as string"
}

{
  "type": "audio",
  "path": "/path/to/audio.wav"
}

{
  "type": "audio",
  "blob": "base64 encoded audio bytes as string"
}

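Prompt template

To stay flexible across model variants, PromptTemplate is implemented as a thin wrapper around Minja. Minja is a C++ implementation of the Jinja template engine that processes JSON input to produce formatted prompts.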

The Jinja template format is widely adopted for LLM prompt templates.

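The Jinja template should exactly match the structure that the instruction-tuned model expects. Model releases typically ship a standard Jinja template to ensure the model is used correctly.

The Jinja template used by the model is provided by the model file metadata.

[!NOTE] Subtle changes to the prompt caused by incorrect formatting can significantly degrade model performance, as reported in Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting.

Preface

The Preface sets the initial context of the conversation. It can include initial messages, tool definitions, and any other background information the LLM needs to start the interaction. This provides functionality similar to the Gemini API system instruction and Gemini API Tools.

A Preface contains the following fields:

messages: The messages in the preface. They provide the initial background for the conversation, for example conversation history, prompt-engineering system instructions, or few-shot examples.
tools: The tools the model can use during the conversation. The tool format is likewise not fixed, but mostly follows the Gemini API FunctionDeclaration.
extra_context: Additional context that lets a model customize the context information it needs to start the conversation. For example:
- enable_thinking for models with a thinking mode, such as Qwen3 or SmolLM3-3B.

An example Preface that provides an initial system instruction and tools, and disables thinking mode: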

Preface preface = JsonPreface({
  .messages = {
    {  // messages is a list; this is its single system message.
      {"role", "system"},
      {"content", {"You are a model that can do function calling."}}
    }
  },
  .tools = {
    {
      {"name", "get_weather"},
      {"description", "Returns the weather for a given location."},
      {"parameters", {
        {"type", "object"},
        {"properties", {
          {"location", {
            {"type", "string"},
            {"description", "The location to get the weather for."}
          }}
        }},
        {"required", {"location"}}
      }}
    },
    {
      {"name", "get_stock_price"},
      {"description", "Returns the stock price for a given stock symbol."},
      {"parameters", {
        {"type", "object"},
        {"properties", {
          {"stock_symbol", {
            {"type", "string"},
            {"description", "The stock symbol to get the price for."}
          }}
        }},
        {"required", {"stock_symbol"}}
      }}
    }
  },
  .extra_context = {
    {"enable_thinking", false}
  }
});

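History

Conversation maintains a list of all messages exchanged in the conversation. This history is essential for rendering the prompt template, because a Jinja prompt template typically needs the entire conversation history to generate the correct prompt for the LLM.

However, a LiteRT-LM Session is stateful, which means it processes input incrementally. To bridge this gap, Conversation generates the required incremental prompt by rendering the prompt template twice: once with the history up to the previous turn, and once including the current message. By comparing the two rendered prompts, it extracts the new part to send to the Session.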

ConversationConfig

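ConversationConfig is used to initialize a Conversation instance. You can create this config in the following ways:

From an Engine: This method uses the default SessionConfig associated with the Engine.
From a specific SessionConfig: This allows finer control over the session settings.

Beyond the session settings, you can further customize Conversation behavior in ConversationConfig. This includes:

Providing a Preface.
Overriding the default PromptTemplate.
Overriding the default DataProcessorConfig.

These overrides are especially useful for fine-tuned models, which may need a different configuration or prompt template than the base model they derive from.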

MessageCallback

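MessageCallback is the callback function that users implement when using the asynchronous SendMessageAsync method.

The callback signature is absl::AnyInvocable<void(absl::StatusOr<Message>)>. The callback is invoked in the following situations:

When a new Message is received from the model.
When an error occurs during LiteRT-LM's message processing.
When the LLM finishes inference: the callback is invoked with an empty Message (for example, JsonMessage()) to signal the end of the response.

For an example implementation, see step 6 (the asynchronous call) in the text-only example.

[!IMPORTANT] The Message received by the callback contains only the latest chunk of the model output, not the entire message history.

For example, suppose the complete model response from a blocking SendMessage call would be: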

{
  "role": "model",
  "content": [
    {
      "type": "text",
      "text": "Hello World!"
    }
  ]
}

The callback in SendMessageAsync may then be invoked multiple times, each time with the next piece of text:

// 1st Message
{
  "role": "model",
  "content": [
    { "type": "text", "text": "He" }
  ]
}

// 2nd Message
{
  "role": "model",
  "content": [
    { "type": "text", "text": "llo" }
  ]
}

// 3rd Message
{
  "role": "model",
  "content": [
    { "type": "text", "text": " Wo" }
  ]
}

// 4th Message
{
  "role": "model",
  "content": [
    { "type": "text", "text": "rl" }
  ]
}

// 5th Message
{
  "role": "model",
  "content": [
    { "type": "text", "text": "d!" }
  ]
}

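If the complete response is needed during asynchronous streaming, the implementer is responsible for accumulating these chunks. Alternatively, once the asynchronous call completes, the complete response is available as the last entry in the History.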

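Advanced usage

Constrained decoding

LiteRT-LM supports constrained decoding, which lets you enforce a specific structure on the model's output, such as a JSON schema, a regular expression pattern, or grammar rules.

To enable it, set EnableConstrainedDecoding(true) in ConversationConfig and provide a ConstraintProviderConfig (for example, LlGuidanceConfig for regex, JSON, and grammar support). Then pass the constraint through OptionalArgs in SendMessage.

Example: regex constraint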

LlGuidanceConstraintArg constraint_arg;
constraint_arg.constraint_type = LlgConstraintType::kRegex;
constraint_arg.constraint_string = "a+b+"; // Force output to match this regex

auto response = (*conversation)->SendMessage(
    user_message,
    {.decoding_constraint = constraint_arg}
);

For full details, including JSON schema and Lark grammar support, see the constrained decoding documentation.

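Tool use

Tool calling lets the LLM request the execution of client-side functions. You define tools in the conversation's Preface, keyed by name. When the model outputs a tool call, you capture it, execute the corresponding function in your application, and return the result to the model.

Flow overview:
1. Declare tools: Define the tools (name, description, parameters) in the Preface JSON.
2. Detect calls: Check the response for model_message["tool_calls"].
3. Execute: Run your application logic for the requested tool.
4. Respond: Send the model a message with role: "tool" that contains the tool output.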

For full details and a complete chat-loop example, see the tool use documentation.