The following document outlines the specification for LiteRT's 8-bit quantization scheme. This is intended to assist hardware developers in providing hardware support for inference with quantized LiteRT models.
Specification summary
We are providing a specification, and we can only provide some guarantees on behaviour if the spec is followed. We also understand different hardware may have preferences and restrictions that cause slight deviations when implementing the spec, resulting in implementations that are not bit-exact. While that may be acceptable in most cases (and we will provide a suite of tests that, to the best of our knowledge, include per-operation tolerances gathered from several models), the nature of machine learning (and deep learning in the most common case) makes it impossible to provide any hard guarantees.
8-bit quantization approximates floating-point values using the following formula.
\[real\_value = (int8\_value - zero\_point) \times scale\]
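As an illustration, here is a minimal NumPy sketch of this de-quantization step (the scale, zero point, and values are hypothetical):

```python
import numpy as np

# Hypothetical quantization parameters for a single tensor.
scale = 0.05
zero_point = -3

# int8 values as stored in the quantized model.
int8_values = np.array([-128, -3, 0, 42, 127], dtype=np.int8)

# real_value = (int8_value - zero_point) * scale
real_values = (int8_values.astype(np.int32) - zero_point) * scale
print(real_values)  # approximated floating-point values
```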
Per-axis (also known as per-channel in Conv ops) or per-tensor weights are represented by int8 two's complement values in the range [-127, 127], with the zero point equal to 0. Per-tensor activations/inputs are represented by int8 two's complement values in the range [-128, 127], with a zero point in the range [-128, 127].
There are other exceptions for particular operations, documented below.
Signed integer vs unsigned integer
LiteRT quantization primarily prioritizes tooling and kernels for int8 quantization for 8-bit.
This is for the convenience of symmetric quantization being represented by a zero point equal to 0. Additionally, many backends have further optimizations for int8xint8 accumulation.
Per-axis vs per-tensor
Per-tensor quantization means there is one scale and/or zero point per entire tensor. Per-axis quantization means there is one scale and/or zero_point per slice in the quantized_dimension. The quantized dimension specifies the dimension of the tensor's shape that the scales and zero points correspond to. For example, a tensor t with dims=[4, 3, 2, 1] and quantization params scale=[1.0, 2.0, 3.0], zero_point=[1, 2, 3], quantization_dimension=1 will be quantized across the second dimension of t:
t[:, 0, :, :] will have scale[0]=1.0, zero_point[0]=1
t[:, 1, :, :] will have scale[1]=2.0, zero_point[1]=2
t[:, 2, :, :] will have scale[2]=3.0, zero_point[2]=3
Often, the quantized_dimension is the output_channel of the weights of convolutions, but in theory it can be the dimension that corresponds to each dot product in the kernel implementation, allowing more quantization granularity without performance implications. This yields large improvements in accuracy.
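A minimal NumPy sketch of per-axis de-quantization under the parameters above (the tensor contents are hypothetical):

```python
import numpy as np

# Per-axis quantization params from the example above.
scale = np.array([1.0, 2.0, 3.0])
zero_point = np.array([1, 2, 3])
quantized_dimension = 1

# Hypothetical int8 tensor with dims=[4, 3, 2, 1].
q = np.random.randint(-128, 128, size=(4, 3, 2, 1), dtype=np.int8)

# Broadcast scale/zero_point along the quantized dimension, then apply
# real_value = (int8_value - zero_point) * scale slice by slice, so that
# q[:, i, :, :] uses scale[i] and zero_point[i].
shape = [1] * q.ndim
shape[quantized_dimension] = -1
real = (q.astype(np.int32) - zero_point.reshape(shape)) * scale.reshape(shape)
```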
TFLite has per-axis support for a growing number of operations. At the time of writing, support exists for Conv2d and DepthwiseConv2d.
Symmetric vs asymmetric
Activations are asymmetric: they can have their zero point anywhere within the signed int8 range [-128, 127]. Many activations are asymmetric in nature, and a zero point is a relatively inexpensive way to effectively get up to an extra binary bit of precision. Since activations are only multiplied by constant weights, the constant zero-point value can be optimized fairly heavily.
Weights are symmetric: they are forced to have a zero point equal to 0. Weight values are multiplied by dynamic input and activation values. This means there would be an unavoidable runtime cost of multiplying the zero point of the weight by the activation value. By enforcing that the zero point is 0, we avoid this cost.
Explanation of the math: this is similar to section 2.3 in arXiv:1712.05877, except for the difference that we allow the scale values to be per-axis. This generalizes readily, as follows:
$A$ is an $m \times n$ matrix of quantized activations.
$B$ is an $n \times p$ matrix of quantized weights.
Consider multiplying the $j$th row of $A$, $a_j$, by the $k$th column of $B$, $b_k$, both of length $n$. The quantized integer values and zero-point values are $q_a$, $z_a$ and $q_b$, $z_b$ respectively.
\[a_j \cdot b_k = \sum_{i=0}^{n} a_{j}^{(i)} b_{k}^{(i)} = \sum_{i=0}^{n} (q_{a}^{(i)} - z_a) (q_{b}^{(i)} - z_b) = \sum_{i=0}^{n} q_{a}^{(i)} q_{b}^{(i)} - \sum_{i=0}^{n} q_{a}^{(i)} z_b - \sum_{i=0}^{n} q_{b}^{(i)} z_a + \sum_{i=0}^{n} z_a z_b\]
The \(\sum_{i=0}^{n} q_{a}^{(i)} q_{b}^{(i)}\) term is unavoidable, since it performs the dot product of the input values and the weight values.
The \(\sum_{i=0}^{n} q_{b}^{(i)} z_a\) and \(\sum_{i=0}^{n} z_a z_b\) terms are made up of constants that stay the same for every inference invocation, and can therefore be pre-calculated.
The \(\sum_{i=0}^{n} q_{a}^{(i)} z_b\) term has to be computed on every inference, since the activations change on every inference. By enforcing weights to be symmetric we can remove the cost of this term.
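The following NumPy sketch (with hypothetical sizes and values) spells out this decomposition and shows that, once $z_b = 0$, the only per-inference work left is the raw integer dot product plus precomputed constants:

```python
import numpy as np

n = 8
q_a = np.random.randint(-128, 128, size=n).astype(np.int32)  # activations, change every inference
q_b = np.random.randint(-127, 128, size=n).astype(np.int32)  # weights, constant
z_a, z_b = 5, 0  # activation zero point is free; weight zero point forced to 0

# Left-hand side: the zero-point-adjusted dot product (scales omitted for clarity).
full = np.sum((q_a - z_a) * (q_b - z_b))

# Right-hand side, term by term, as in the formula above.
dot = np.sum(q_a * q_b)                            # unavoidable integer dot product
precomputed = -z_a * np.sum(q_b) + n * z_a * z_b   # constants, computed once
per_inference = -z_b * np.sum(q_a)                 # vanishes because z_b == 0

assert full == dot + precomputed + per_inference
```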
int8 quantized operator specifications
Below we describe the quantization requirements for our int8 TFLite kernels:
ADD
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
AVERAGE_POOL_2D
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
CONCATENATION
Input ...:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
CONV_2D
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1 (Weight):
data_type : int8
range : [-127, 127]
granularity: per-axis (dim = 0)
restriction: zero_point = 0
Input 2 (Bias):
data_type : int32
range : [int32_min, int32_max]
granularity: per-axis
restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
DEPTHWISE_CONV_2D
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1 (Weight):
data_type : int8
range : [-127, 127]
granularity: per-axis (dim = 3)
restriction: zero_point = 0
Input 2 (Bias):
data_type : int32
range : [int32_min, int32_max]
granularity: per-axis
restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
FULLY_CONNECTED
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1 (Weight):
data_type : int8
range : [-127, 127]
granularity: per-axis (dim = 0)
restriction: zero_point = 0
Input 2 (Bias):
data_type : int32
range : [int32_min, int32_max]
granularity: per-tensor
restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
L2_NORMALIZATION
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: (scale, zero_point) = (1.0 / 128.0, 0)
LOGISTIC
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: (scale, zero_point) = (1.0 / 256.0, -128)
MAX_POOL_2D
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
MUL
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
RESHAPE
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
RESIZE_BILINEAR
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
SOFTMAX
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: (scale, zero_point) = (1.0 / 256.0, -128)
SPACE_TO_DEPTH
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
TANH
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: (scale, zero_point) = (1.0 / 128.0, 0)
PAD
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
GATHER
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
BATCH_TO_SPACE_ND
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
SPACE_TO_BATCH_ND
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
TRANSPOSE
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
MEAN
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
SUB
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
SUM
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
SQUEEZE
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
LOG_SOFTMAX
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: (scale, zero_point) = (16.0 / 256.0, 127)
MAXIMUM
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
ARG_MAX
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
MINIMUM
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
LESS
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
PADV2
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
GREATER
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
GREATER_EQUAL
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
LESS_EQUAL
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
SLICE
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
EQUAL
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
NOT_EQUAL
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
SHAPE
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
QUANTIZE (Requantization)
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
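As a worked illustration of the bias restriction that appears in CONV_2D and DEPTHWISE_CONV_2D above, here is a minimal sketch (with hypothetical scale values) of deriving the int32 bias quantization parameters:

```python
import numpy as np

# Hypothetical quantization parameters.
input0_scale = 0.02                        # per-tensor activation scale
input1_scale = np.array([0.1, 0.05, 0.2])  # per-axis weight scales (one per output channel)

# Per the spec: (scale, zero_point) = (input0_scale * input1_scale[...], 0).
bias_scale = input0_scale * input1_scale
bias_zero_point = 0

# Quantizing a hypothetical float bias into int32 under this restriction.
bias_float = np.array([0.5, -1.25, 3.0])
bias_int32 = np.round(bias_float / bias_scale).astype(np.int32)
```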