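LiteRT-LM cross-platform C++ API

Conversation is a high-level API that represents a single stateful conversation with the LLM. It is the recommended entry point for most users. It manages a Session internally and handles complex data processing tasks: maintaining the initial context, managing tool definitions, preprocessing multimodal data, and applying the Jinja prompt template to role-based messages.

Conversation API workflow

The typical lifecycle when using the Conversation API is as follows:

Create an Engine: Initialize a single Engine with the model path and configuration. This is a heavyweight object that holds the model weights.
Create a Conversation: Use the Engine to create one or more lightweight Conversation objects.
Send messages: Use the Conversation object's methods to send messages to the LLM and receive responses, which enables chat-like interaction.

The methods below are the simplest way to send a message and get the model's response. They are recommended for most use cases and mirror the Gemini Chat API.

SendMessage: A blocking call that takes user input and returns the complete model response.
SendMessageAsync: A non-blocking call that streams the model's response token by token through a callback.

Example code snippets follow.

Text-only content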

#include "runtime/engine/engine.h"

// ...

// 1. Define model assets and engine settings.
auto model_assets = ModelAssets::Create(model_path);
CHECK_OK(model_assets);

// Both values are StatusOr wrappers, so check and dereference them before use.
auto engine_settings = EngineSettings::CreateDefault(
    *model_assets,
    /*backend=*/litert::lm::Backend::CPU);
CHECK_OK(engine_settings);

// 2. Create the main Engine object.
absl::StatusOr<std::unique_ptr<Engine>> engine = Engine::CreateEngine(*engine_settings);
CHECK_OK(engine);

// 3. Create a Conversation
auto conversation_config = ConversationConfig::CreateDefault(**engine);
CHECK_OK(conversation_config);
absl::StatusOr<std::unique_ptr<Conversation>> conversation = Conversation::Create(**engine, *conversation_config);
CHECK_OK(conversation);

// 4. Send message to the LLM with blocking call.
absl::StatusOr<Message> model_message = (*conversation)->SendMessage(
    JsonMessage{
        {"role", "user"},
        {"content", "What is the tallest building in the world?"}
    });
CHECK_OK(model_message);

// 5. Print the model message.
std::cout << *model_message << std::endl;

// 6. Send message to the LLM with an asynchronous call.
// CreatePrintMessageCallback is a user-implemented callback factory; the
// callback processes each chunk of message output as it is received.
std::stringstream captured_output;
(*conversation)->SendMessageAsync(
    JsonMessage{
        {"role", "user"},
        {"content", "What is the tallest building in the world?"}
    },
    CreatePrintMessageCallback(captured_output)
);
// Wait until the asynchronous call finishes or times out.
(*engine)->WaitUntilDone(absl::Seconds(10));

Example CreatePrintMessageCallback

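// Forward declaration: PrintJsonMessage is defined after the callback factory.
absl::Status PrintJsonMessage(const JsonMessage& message,
                              std::stringstream& captured_output,
                              bool streaming);

absl::AnyInvocable<void(absl::StatusOr<Message>)> CreatePrintMessageCallback(
    std::stringstream& captured_output) {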
  return [&captured_output](absl::StatusOr<Message> message) {
    if (!message.ok()) {
      std::cout << message.status().message() << std::endl;
      return;
    }
    if (auto json_message = std::get_if<JsonMessage>(&(*message))) {
      if (json_message->is_null()) {
        std::cout << std::endl << std::flush;
        return;
      }
      ABSL_CHECK_OK(PrintJsonMessage(*json_message, captured_output,
                                     /*streaming=*/true));
    }
  };
}

absl::Status PrintJsonMessage(const JsonMessage& message,
                              std::stringstream& captured_output,
                              bool streaming = false) {
  if (message["content"].is_array()) {
    for (const auto& content : message["content"]) {
      if (content["type"] == "text") {
        captured_output << content["text"].get<std::string>();
        std::cout << content["text"].get<std::string>();
      }
    }
    if (!streaming) {
      captured_output << std::endl << std::flush;
      std::cout << std::endl << std::flush;
    } else {
      captured_output << std::flush;
      std::cout << std::flush;
    }
  } else if (message["content"]["text"].is_string()) {
    if (!streaming) {
      captured_output << message["content"]["text"].get<std::string>()
                      << std::endl
                      << std::flush;
      std::cout << message["content"]["text"].get<std::string>() << std::endl
                << std::flush;
    } else {
      captured_output << message["content"]["text"].get<std::string>()
                      << std::flush;
      std::cout << message["content"]["text"].get<std::string>() << std::flush;
    }
  } else {
    return absl::InvalidArgumentError("Invalid message: " + message.dump());
  }
  return absl::OkStatus();
}

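🔴 New: Multi-token prediction (MTP)

Multi-token prediction (MTP) is a performance optimization that significantly improves decoding speed. MTP is recommended for all tasks on the GPU backend.

To use MTP, enable speculative decoding in the advanced settings of the engine configuration.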

// 1. Define model assets and engine settings.
auto model_assets = ModelAssets::Create(model_path);
CHECK_OK(model_assets);

auto engine_settings = EngineSettings::CreateDefault(
    *model_assets,
    /*backend=*/litert::lm::Backend::GPU);
CHECK_OK(engine_settings);

// 2. Enable MTP via speculative decoding in advanced settings.
litert::lm::AdvancedSettings advanced_settings;
advanced_settings.enable_speculative_decoding = true;
engine_settings->GetMutableMainExecutorSettings().SetAdvancedSettings(
    advanced_settings);

// 3. Create the main Engine object.
absl::StatusOr<std::unique_ptr<Engine>> engine = Engine::CreateEngine(
    *engine_settings);
CHECK_OK(engine);

// The same steps to create Conversation and send messages as above...

Multimodal content

// To use multimodality, create the engine with the vision and/or audio
// backends that match the modalities you plan to use.
auto engine_settings = EngineSettings::CreateDefault(
    *model_assets,
    /*backend=*/litert::lm::Backend::CPU,
    /*vision_backend=*/litert::lm::Backend::GPU,
    /*audio_backend=*/litert::lm::Backend::CPU);

// The same steps to create Engine and Conversation as above...

// Send message to the LLM with image data.
absl::StatusOr<Message> model_message = (*conversation)->SendMessage(
    JsonMessage{
        {"role", "user"},
        {"content", { // Now content must be an array.
          {
            {"type", "text"}, {"text", "Describe the following image: "}
          },
          {
            {"type", "image"}, {"path", "/file/path/to/image.jpg"}
          }
        }},
    });
CHECK_OK(model_message);

// Print the model message.
std::cout << *model_message << std::endl;

// Send message to the LLM with audio data.
model_message = (*conversation)->SendMessage(
    JsonMessage{
        {"role", "user"},
        {"content", { // Now content must be an array.
          {
            {"type", "text"}, {"text", "Transcribe the audio: "}
          },
          {
            {"type", "audio"}, {"path", "/file/path/to/audio.wav"}
          }
        }},
    });
CHECK_OK(model_message);

// Print the model message.
std::cout << *model_message << std::endl;

// The content can include multiple image or audio data.
model_message = (*conversation)->SendMessage(
    JsonMessage{
        {"role", "user"},
        {"content", { // Now content must be an array.
          {
            {"type", "text"}, {"text", "First briefly describe the two images "}
          },
          {
            {"type", "image"}, {"path", "/file/path/to/image1.jpg"}
          },
          {
            {"type", "text"}, {"text", "and "}
          },
          {
            {"type", "image"}, {"path", "/file/path/to/image2.jpg"}
          },
          {
            {"type", "text"}, {"text", " then transcribe the content in the audio"}
          },
          {
            {"type", "audio"}, {"path", "/file/path/to/audio.wav"}
          }
        }},
    });
CHECK_OK(model_message);

// Print the model message.
std::cout << *model_message << std::endl;

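Using Conversation with tools

For details on using tools with the Conversation API, see Advanced usage.

Conversation components

Conversation can be thought of as a delegate that maintains the Session and handles the complex data processing on the user's behalf.

I/O types

The core input and output format of the Conversation API is Message. It is currently implemented as JsonMessage, a type alias for ordered_json, which is a flexible nested key-value data structure.

The Conversation API operates on messages in and messages out, mimicking a typical chat experience. Because Message is flexible, users can include any fields that a particular prompt template or LLM requires, which lets LiteRT-LM support a wide range of models.

There is no single strict standard, but most prompt templates and models expect Message to follow conventions similar to those used in Gemini API Content or the OpenAI message structure.

A Message must contain a role, which identifies who sent the message. The content can be as simple as a text string.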

{
  "role": "model", // Represent who the message is sent from.
  "content": "Hello World!" // Naive text only content.
}

For multimodal input, content is a list of parts. As with Message, a part is not a predefined data structure but an ordered key-value data type. The specific fields depend on what the prompt template and model expect.

{
  "role": "user",
  "content": [  // Multimodal content.
    // Now the content is composed of parts
    {
      "type": "text",
      "text": "Describe the image in details: "
    },
    {
      "type": "image",
      "path": "/path/to/image.jpg"
    }
  ]
}

For multimodal parts, the following formats are supported; they are handled by data_utils.h.

{
  "type": "text",
  "text": "this is a text"
}

{
  "type": "image",
  "path": "/path/to/image.jpg"
}

{
  "type": "image",
  "blob": "base64 encoded image bytes as string"
}

{
  "type": "audio",
  "path": "/path/to/audio.wav"
}

{
  "type": "audio",
  "blob": "base64 encoded audio bytes as string"
}
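
The blob variants above carry the raw bytes as a base64 string. As a minimal sketch of producing such a string with only the standard library, the helper below encodes bytes with the standard Base64 alphabet from RFC 4648, including '=' padding. Base64Encode is a hypothetical helper, not part of LiteRT-LM, and whether the blob field expects exactly this variant is an assumption to verify against data_utils.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Encodes raw bytes with the standard Base64 alphabet (RFC 4648, '=' padded).
// Assumption: the "blob" field accepts this variant; check data_utils.h.
std::string Base64Encode(std::string_view bytes) {
  static constexpr char kAlphabet[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  out.reserve((bytes.size() + 2) / 3 * 4);
  std::size_t i = 0;
  // Each full 3-byte group becomes 4 output characters.
  for (; i + 3 <= bytes.size(); i += 3) {
    const std::uint32_t n = (static_cast<std::uint8_t>(bytes[i]) << 16) |
                            (static_cast<std::uint8_t>(bytes[i + 1]) << 8) |
                            static_cast<std::uint8_t>(bytes[i + 2]);
    out += kAlphabet[(n >> 18) & 63];
    out += kAlphabet[(n >> 12) & 63];
    out += kAlphabet[(n >> 6) & 63];
    out += kAlphabet[n & 63];
  }
  // A trailing group of 1 or 2 bytes is padded with '='.
  const std::size_t rest = bytes.size() - i;
  if (rest == 1) {
    const std::uint32_t n = static_cast<std::uint8_t>(bytes[i]) << 16;
    out += kAlphabet[(n >> 18) & 63];
    out += kAlphabet[(n >> 12) & 63];
    out += "==";
  } else if (rest == 2) {
    const std::uint32_t n = (static_cast<std::uint8_t>(bytes[i]) << 16) |
                            (static_cast<std::uint8_t>(bytes[i + 1]) << 8);
    out += kAlphabet[(n >> 18) & 63];
    out += kAlphabet[(n >> 12) & 63];
    out += kAlphabet[(n >> 6) & 63];
    out += '=';
  }
  return out;
}
```

The returned string would then be placed in the "blob" field of an image or audio part, after reading the file contents into memory.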

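Prompt template

To stay flexible across model variants, PromptTemplate is implemented as a thin wrapper around Minja. Minja is a C++ implementation of the Jinja template engine that processes JSON input and produces a formatted prompt.

Jinja is a widely adopted format for LLM prompt templates.

The Jinja template must strictly match the structure that the instruction-tuned model expects. Model releases typically include a standard Jinja template for using the model correctly.

The Jinja template used by a model is provided by the metadata in the model file.

Note: Even a small change to the prompt caused by incorrect formatting can significantly degrade model performance. For details, see "Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting".

Preface

Preface sets the initial context of the conversation. It can include initial messages, tool definitions, and any other background information the LLM needs to start the interaction. This provides functionality similar to the Gemini API system instruction and Gemini API Tools.

Preface has the following fields:

messages: The messages in the preface. They provide the initial background for the conversation, for example conversation history, prompt-engineering system instructions, or few-shot examples.
tools: The tools the model can use in the conversation. The tool format is not fixed, but in most cases it follows the Gemini API FunctionDeclaration.
extra_context: Additional context that keeps the preface extensible, so you can customize the context information a model needs to start the conversation. For example:
- enable_thinking for models with a thinking mode, such as Qwen3 or SmolLM3-3B.

An example preface that provides an initial system instruction and tools, and disables thinking mode: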

Preface preface = JsonPreface({
  .messages = {
    // A list of messages; each message is an object with role and content.
    {
      {"role", "system"},
      {"content", "You are a model that can do function calling."}
    }
  },
  .tools = {
    {
      {"name", "get_weather"},
      {"description", "Returns the weather for a given location."},
      {"parameters", {
        {"type", "object"},
        {"properties", {
          {"location", {
            {"type", "string"},
            {"description", "The location to get the weather for."}
          }}
        }},
        {"required", {"location"}}
      }}
    },
    {
      {"name", "get_stock_price"},
      {"description", "Returns the stock price for a given stock symbol."},
      {"parameters", {
        {"type", "object"},
        {"properties", {
          {"stock_symbol", {
            {"type", "string"},
            {"description", "The stock symbol to get the price for."}
          }}
        }},
        {"required", {"stock_symbol"}}
      }}
    }
  },
  .extra_context = {
    {"enable_thinking", false}
  }
});

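History

Conversation keeps a list of all message exchanges in the session. This history is essential for rendering the prompt template, because a Jinja prompt template typically needs the entire conversation history to generate the correct prompt for the LLM.

However, a LiteRT-LM Session is stateful and processes input incrementally. To bridge this gap, Conversation renders the prompt template twice to produce the required incremental prompt: once with the history up to the previous turn, and once including the current message. It then compares the two rendered prompts and extracts the new portion to send to the Session.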
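
The render-twice-and-diff idea can be sketched with the standard library alone. Everything in this example is a hypothetical stand-in: ChatMessage replaces the JSON message, RenderPrompt replaces the Jinja template with an invented "<role>: content" layout, and IncrementalPrompt shows only the prefix comparison, not how Conversation is actually implemented.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a role-based message.
struct ChatMessage {
  std::string role;
  std::string content;
};

// Toy stand-in for a Jinja prompt template: renders the whole history.
// The "<role>: content\n" layout is invented for illustration only.
std::string RenderPrompt(const std::vector<ChatMessage>& history) {
  std::string prompt;
  for (const ChatMessage& m : history) {
    prompt += "<" + m.role + ">: " + m.content + "\n";
  }
  return prompt;
}

// Renders the template twice and returns only the text that the new message
// adds. Returns an empty string if the old rendering is not a prefix of the
// new one; a real implementation would need another strategy in that case.
std::string IncrementalPrompt(const std::vector<ChatMessage>& history,
                              const ChatMessage& new_message) {
  const std::string before = RenderPrompt(history);
  std::vector<ChatMessage> extended = history;
  extended.push_back(new_message);
  const std::string after = RenderPrompt(extended);
  if (after.compare(0, before.size(), before) != 0) return "";
  return after.substr(before.size());
}
```

With a history of two turns, IncrementalPrompt returns just the rendered text of the third message, which is the part a stateful session has not yet seen.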

ConversationConfig

ConversationConfig is used to initialize a Conversation instance. You can create the configuration in two ways:

From an Engine: Uses the default SessionConfig associated with the engine.
From a specific SessionConfig: Gives finer control over the session settings.

Beyond the session settings, ConversationConfig lets you further customize the behavior of a Conversation:

Provide a Preface.
Override the default PromptTemplate.
Override the default DataProcessorConfig.

These overrides are especially useful for fine-tuned models, which may need a different configuration or prompt template than the base model they derive from.

MessageCallback

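MessageCallback is the callback function you must implement when using the asynchronous SendMessageAsync method.

The callback signature is absl::AnyInvocable<void(absl::StatusOr<Message>)>. The function is triggered under the following conditions:

When a new chunk of the Message is received from the model.
When an error occurs during LiteRT-LM message processing.
When LLM inference completes; the callback is triggered with an empty Message (for example JsonMessage()) to signal the end of the response.

For an example implementation, see the asynchronous call in step 6.

Note: The Message received in the callback contains only the latest chunk of the model output, not the entire message history.

For example, suppose the complete model response expected from a blocking SendMessage call is: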

{
  "role": "model",
  "content": [
    {
      "type": "text",
      "text": "Hello World!"
    }
  ]
}

The SendMessageAsync callback may then be invoked multiple times, each time with the next piece of the text:

// 1st Message
{
  "role": "model",
  "content": [
    {
      "type": "text",
      "text": "He"
    }
  ]
}

// 2nd Message
{
  "role": "model",
  "content": [
    {
      "type": "text",
      "text": "llo"
    }
  ]
}

// 3rd Message
{
  "role": "model",
  "content": [
    {
      "type": "text",
      "text": " Wo"
    }
  ]
}

// 4th Message
{
  "role": "model",
  "content": [
    {
      "type": "text",
      "text": "rl"
    }
  ]
}

// 5th Message
{
  "role": "model",
  "content": [
    {
      "type": "text",
      "text": "d!"
    }
  ]
}

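If you need the complete response during the asynchronous stream, your implementation must accumulate these chunks. Alternatively, once the asynchronous call completes, the complete response is available as the last entry in the History.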
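
One way to accumulate chunks is a small helper that the callback feeds. The sketch below uses only the standard library: ChunkAccumulator is a hypothetical class, each chunk is a plain string standing in for the text extracted from a Message, and std::nullopt stands in for the empty Message that signals the end of the response.

```cpp
#include <optional>
#include <string>

// Collects streamed text chunks into one response. Each OnChunk call mirrors
// one callback invocation: a text chunk, or std::nullopt as a stand-in for
// the empty Message that marks the end of the response.
class ChunkAccumulator {
 public:
  void OnChunk(const std::optional<std::string>& chunk) {
    if (done_) return;  // Ignore anything after the end marker.
    if (!chunk.has_value()) {
      done_ = true;
      return;
    }
    text_ += *chunk;
  }

  bool done() const { return done_; }
  const std::string& text() const { return text_; }

 private:
  std::string text_;
  bool done_ = false;
};
```

In a real callback you would extract the "text" field from each part of the received chunk and pass it to OnChunk. If the callback can run on a different thread than the reader, access to the accumulator needs synchronization, which this sketch omits.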

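Advanced usage

Constrained decoding

LiteRT-LM supports constrained decoding, which lets you enforce a specific structure on the model's output, such as a JSON schema, a regular expression pattern, or grammar rules.

To enable it, set EnableConstrainedDecoding(true) on ConversationConfig and provide a ConstraintProviderConfig (for example LlGuidanceConfig, which supports regex, JSON, and grammar constraints). Then pass the constraint through the OptionalArgs of SendMessage.

Example: regex constraint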

LlGuidanceConstraintArg constraint_arg;
constraint_arg.constraint_type = LlgConstraintType::kRegex;
constraint_arg.constraint_string = "a+b+"; // Force output to match this regex

auto response = conversation->SendMessage(
    user_message,
    {.decoding_constraint = constraint_arg}
);

For details, including support for JSON schemas and Lark grammars, see the constrained decoding documentation.
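
When testing an application that relies on a regex constraint, it can help to check the returned text against the same pattern. The sketch below does this with std::regex; MatchesConstraint is a hypothetical helper, and it assumes the pattern means the same thing in std::regex's default ECMAScript grammar as in the constraint engine, which holds for a simple pattern like a+b+ but is not guaranteed in general.

```cpp
#include <regex>
#include <string>

// Returns true if the whole output matches the pattern.
// Assumption: the pattern is valid in std::regex's ECMAScript grammar and
// means the same thing there as in the constraint engine.
bool MatchesConstraint(const std::string& output, const std::string& pattern) {
  return std::regex_match(output, std::regex(pattern));
}
```

For the pattern a+b+, an output such as "aabbb" passes, while "ba" or an empty string does not.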

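Tool use

Tool calling lets the LLM request the execution of client-side functions. Define tools in the conversation's Preface, keyed by name. When the model outputs a tool call, capture it, run the corresponding function in your application, and return the result to the model.

High-level flow:

Declare tools: Define the tools (name, description, parameters) in the Preface JSON.
Detect calls: Check model_message["tool_calls"] in the response.
Execute: Run your application logic for the requested tool.
Respond: Send the model a message with role: "tool" that contains the tool output.

For details and a complete chat loop example, see the tool use documentation.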
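
The "Execute" step usually comes down to a lookup from tool name to application function. The sketch below shows one way to structure that lookup with the standard library only. ToolRegistry, ToolArgs, ToolFn, and GetWeather are all hypothetical names, arguments are simplified to string key-value pairs rather than JSON, and parsing tool_calls and sending the role: "tool" reply are left out.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Simplified tool arguments: string key-value pairs instead of JSON.
using ToolArgs = std::map<std::string, std::string>;
using ToolFn = std::function<std::string(const ToolArgs&)>;

// Maps tool names (as declared in the Preface) to application functions.
class ToolRegistry {
 public:
  void Register(const std::string& name, ToolFn fn) {
    tools_[name] = std::move(fn);
  }

  // Runs the named tool. Returns std::nullopt if the model requested a tool
  // that the application never registered.
  std::optional<std::string> Call(const std::string& name,
                                  const ToolArgs& args) const {
    const auto it = tools_.find(name);
    if (it == tools_.end()) return std::nullopt;
    return it->second(args);
  }

 private:
  std::map<std::string, ToolFn> tools_;
};

// Hypothetical tool implementation that returns canned output.
std::string GetWeather(const ToolArgs& args) {
  const auto it = args.find("location");
  if (it == args.end()) return "error: missing location";
  return "Sunny in " + it->second;
}
```

After registering GetWeather under "get_weather", the application would call Call with the name and arguments taken from the model's tool call and send the returned string back in the tool message. Handling the std::nullopt case explicitly keeps an unknown tool name from crashing the chat loop.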