[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["缺少我需要的資訊","missingTheInformationINeed","thumb-down"],["過於複雜/步驟過多","tooComplicatedTooManySteps","thumb-down"],["過時","outOfDate","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["示例/程式碼問題","samplesCodeIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-07-24 (世界標準時間)。"],[],[],null,["# GPU delegates for LiteRT\n\nUsing graphics processing units (GPUs) to run your machine learning (ML) models\ncan dramatically improve the performance of your model and the user experience\nof your ML-enabled applications. LiteRT enables the use of GPUs and\nother specialized processors through hardware driver called\n[*delegates*](./delegates). Enabling use of GPUs with your LiteRT ML\napplications can provide the following benefits:\n\n- **Speed** - GPUs are built for high throughput of massively parallel workloads. This design makes them well-suited for deep neural nets, which consist of a huge number of operators, each working on input tensors that can be processed in parallel, which typically results in lower latency. In the best scenario, running your model on a GPU may run fast enough to enable real-time applications that were not previously possible.\n- **Power efficiency** - GPUs carry out ML computations in a very efficient and optimized manner, typically consuming less power and generating less heat than the same task running on CPUs.\n\nThis document provides an overview of GPUs support in LiteRT, and some\nadvanced uses for GPU processors. For more specific information about\nimplementing GPU support on specific platforms, see the following guides:\n\n- [GPU support for Android](../android/gpu)\n- [GPU support for iOS](../ios/gpu)\n\nGPU ML operations support\n-------------------------\n\nThere are some limitations to what TensorFlow ML operations, or *ops*, can be\naccelerated by the LiteRT GPU delegate. The delegate supports the\nfollowing ops in 16-bit and 32-bit float precision:\n\n- `ADD`\n- `AVERAGE_POOL_2D`\n- `CONCATENATION`\n- `CONV_2D`\n- `DEPTHWISE_CONV_2D v1-2`\n- `EXP`\n- `FULLY_CONNECTED`\n- `LOGICAL_AND`\n- `LOGISTIC`\n- `LSTM v2 (Basic LSTM only)`\n- `MAX_POOL_2D`\n- `MAXIMUM`\n- `MINIMUM`\n- `MUL`\n- `PAD`\n- `PRELU`\n- `RELU`\n- `RELU6`\n- `RESHAPE`\n- `RESIZE_BILINEAR v1-3`\n- `SOFTMAX`\n- `STRIDED_SLICE`\n- `SUB`\n- `TRANSPOSE_CONV`\n\nBy default, all ops are only supported at version 1. Enabling the [quantization\nsupport](#quantized-models) enables the appropriate versions, for example, ADD\nv2.\n\n### Troubleshooting GPU support\n\nIf some of the ops are not supported by the GPU delegate, the framework will\nonly run a part of the graph on the GPU and the remaining part on the CPU. Due\nto the high cost of CPU/GPU synchronization, a split execution mode like this\noften results in slower performance than when the whole network is run on the\nCPU alone. In this case, the application generates warning, such as: \n\n WARNING: op code #42 cannot be handled by this delegate.\n\nThere is no callback for failures of this type, since this is not an actual\nrun-time failure. When testing execution of your model with the GPU delegate,\nyou should be alert for these warnings. 
Example models
--------------

The following example models are built to take advantage of GPU acceleration
with LiteRT and are provided for reference and testing:

- [MobileNet v1 (224x224) image
  classification](https://ai.googleblog.com/2017/06/mobilenets-open-source-models-for.html)
  - An image classification model designed for mobile and embedded vision
    applications. ([model](https://www.kaggle.com/models/google/mobilenet-v1/tensorFlow2/100-224-classification/2))
- [DeepLab segmentation
  (257x257)](https://ai.googleblog.com/2018/03/semantic-image-segmentation-with.html)
  - An image segmentation model that assigns semantic labels, such as dog, cat,
    or car, to every pixel in the input image. ([model](https://www.kaggle.com/models/tensorflow/deeplabv3/tfLite/default/1))
- [MobileNet SSD object
  detection](https://ai.googleblog.com/2018/07/accelerated-training-and-inference-with.html)
  - An object detection model that detects multiple objects with bounding
    boxes. ([model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/mobile_ssd_v2_float_coco.tflite))
- [PoseNet for pose
  estimation](https://github.com/tensorflow/tfjs-models/tree/master/pose-detection)
  - A vision model that estimates the poses of people in an image or video. ([model](https://www.kaggle.com/models/tensorflow/posenet-mobilenet/tfLite/float-075/1))

Optimizing for GPUs
-------------------

The following techniques can help you get better performance when running
models on GPU hardware using the LiteRT GPU delegate:

- **Reshape operations** - Some operations that are quick on a CPU may have a
  high cost for the GPU on mobile devices. Reshape operations are particularly
  expensive to run, including `BATCH_TO_SPACE`, `SPACE_TO_BATCH`,
  `SPACE_TO_DEPTH`, and so forth. You should closely examine your use of
  reshape operations, and consider whether they may have been applied only for
  exploring data or for early iterations of your model. Removing them can
  significantly improve performance.

- **Image data channels** - On GPU, tensor data is sliced into 4-channels, so a
  computation on a tensor with the shape `[B,H,W,5]` performs about the same as
  on a tensor of shape `[B,H,W,8]`, but significantly worse than on
  `[B,H,W,4]`. If the camera hardware you are using supports image frames in
  RGBA, feeding that 4-channel input is significantly faster, since it avoids a
  memory copy from 3-channel RGB to 4-channel RGBX (see the sketch after this
  list).

- **Mobile-optimized models** - For best performance, consider retraining your
  classifier with a mobile-optimized network architecture. Optimization for
  on-device inferencing can dramatically reduce latency and power consumption
  by taking advantage of mobile hardware features.
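To make the channel-padding cost concrete, here is a small, self-contained
sketch of the per-frame copy that a 3-channel RGB pipeline has to perform
before filling a 4-channel input tensor. The helper name `PadRgbToRgbx` is
hypothetical; a camera that already delivers RGBA frames lets you skip this
step entirely.

```c++
#include <cstddef>
#include <cstdint>
#include <vector>

// Pad an interleaved RGB (HWC, 3 channels) buffer to RGBX (4 channels) so it
// matches the GPU delegate's 4-channel tensor layout. This is exactly the
// per-frame copy that an RGBA camera feed avoids.
std::vector<uint8_t> PadRgbToRgbx(const uint8_t* rgb, int width, int height) {
  std::vector<uint8_t> rgbx(static_cast<size_t>(width) * height * 4);
  for (int i = 0; i < width * height; ++i) {
    rgbx[i * 4 + 0] = rgb[i * 3 + 0];
    rgbx[i * 4 + 1] = rgb[i * 3 + 1];
    rgbx[i * 4 + 2] = rgb[i * 3 + 2];
    rgbx[i * 4 + 3] = 0;  // Padding channel; ignored by the model.
  }
  return rgbx;
}
```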
Advanced GPU support
--------------------

You can use additional, advanced techniques with GPU processing to enable even
better performance for your models, including quantization and serialization.
The following sections describe these techniques in further detail.

### Using quantized models

This section explains how the GPU delegate accelerates 8-bit quantized models,
including the following:

- Models trained with [Quantization-aware training](https://www.tensorflow.org/model_optimization/guide/quantization/training)
- Post-training [dynamic-range quantization](../models/post_training_quant)
- Post-training [full-integer quantization](../models/post_training_integer_quant)

To optimize performance, use models that have both floating-point input and
output tensors.

#### How does this work?

Since the GPU backend only supports floating-point execution, quantized models
are run by giving the backend a 'floating-point view' of the original model. At
a high level, this entails the following steps:

- *Constant tensors* (such as weights and biases) are de-quantized once into
  GPU memory. This operation happens when the delegate is enabled for LiteRT.

- *Inputs and outputs* to the GPU program, if 8-bit quantized, are de-quantized
  and quantized (respectively) for each inference. This operation is done on
  the CPU using LiteRT's optimized kernels.

- *Quantization simulators* are inserted between operations to mimic quantized
  behavior. This approach is necessary for models where ops expect activations
  to follow the bounds learned during quantization.

For information about enabling this feature with the GPU delegate, see the
following:

- Using [quantized models with GPU on Android](../android/gpu#quantized-models)
- Using [quantized models with GPU on iOS](../ios/gpu#quantized-models)
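As a rough C++ sketch, the quantized-model path can be requested explicitly
through the delegate options. The `TFLITE_GPU_EXPERIMENTAL_FLAGS_ENABLE_QUANT`
flag exists in the `TfLiteGpuDelegateOptionsV2` API; recent releases enable
quantized support by default, so treat the explicit flag below as documenting
intent rather than a required step, and see the platform guides linked above
for the authoritative setup.

```c++
#include "tensorflow/lite/delegates/gpu/delegate.h"

// Enable the GPU delegate's quantized-model path. Constant tensors are
// de-quantized once at delegate initialization; quantized inputs and outputs
// are converted on the CPU for each inference, as described above.
TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
options.experimental_flags |= TFLITE_GPU_EXPERIMENTAL_FLAGS_ENABLE_QUANT;

TfLiteDelegate* delegate = TfLiteGpuDelegateV2Create(&options);
// Attach to an interpreter as shown earlier, then release with
// TfLiteGpuDelegateV2Delete(delegate) after the interpreter is destroyed.
```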
### Reducing initialization time with serialization

The GPU delegate feature allows you to load pre-compiled kernel code and model
data that were serialized and saved on disk from previous runs. This approach
avoids re-compilation and can reduce startup time by up to 90%. This
improvement is achieved by trading disk space for time savings. You can enable
this feature with a few configuration options, as shown in the following code
examples:

### C++

```c++
TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
options.experimental_flags |= TFLITE_GPU_EXPERIMENTAL_FLAGS_ENABLE_SERIALIZATION;
options.serialization_dir = kTmpDir;
options.model_token = kModelToken;

auto* delegate = TfLiteGpuDelegateV2Create(&options);
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return false;
```

### Java

```java
GpuDelegate delegate = new GpuDelegate(
    new GpuDelegate.Options().setSerializationParams(
        /* serializationDir= */ serializationDir,
        /* modelToken= */ modelToken));

Interpreter.Options options = (new Interpreter.Options()).addDelegate(delegate);
```

When using the serialization feature, make sure your code complies with these
implementation rules:

- Store the serialization data in a directory that is not accessible to other
  apps. On Android devices, use
  [`getCodeCacheDir()`](https://developer.android.com/reference/android/content/Context#getCodeCacheDir()),
  which points to a location that is private to the current application.
- The model token must be unique to the device for the specific model. You can
  compute a model token by generating a fingerprint from the model data using
  libraries such as
  [`farmhash::Fingerprint64`](https://github.com/google/farmhash).

| **Note:** Use of this serialization feature requires the [OpenCL SDK](https://github.com/KhronosGroup/OpenCL-SDK).
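As a hypothetical sketch of the second rule, a model token might be derived by
fingerprinting the model contents with `farmhash::Fingerprint64` (the exact
namespace depends on how farmhash is built) and passing the resulting string as
`options.model_token`. The helper `ComputeModelToken` and its hex formatting
are illustrative only.

```c++
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

#include "farmhash.h"  // Fingerprint64; the namespace is build-configurable.

// Illustrative helper: derive a serialization token from the model bytes.
// Keep the returned string alive for as long as the delegate uses it, since
// model_token is a raw const char* pointer.
std::string ComputeModelToken(const char* model_data, size_t model_size) {
  const uint64_t fingerprint = farmhash::Fingerprint64(model_data, model_size);
  char token[17];
  std::snprintf(token, sizeof(token), "%016llx",
                static_cast<unsigned long long>(fingerprint));
  return std::string(token);
}
```

The resulting token would then be assigned to `options.model_token` alongside
`options.serialization_dir` in the C++ example above.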