Building and Running ExecuTorch with XNNPACK Backend
The following tutorial will familiarize you with leveraging the ExecuTorch XNNPACK Delegate to accelerate your ML models on CPU hardware. It covers exporting and serializing a model to a binary file targeting the XNNPACK Delegate backend, and running the model on a supported target platform. To get started quickly, use the scripts in the ExecuTorch repository, which include instructions for exporting and generating binary files for a few sample models demonstrating the flow.
In this tutorial, you will learn how to export an XNNPACK-lowered model and run it on a target platform.
Lowering a model to XNNPACK
import torch
import torchvision.models as models
from torch.export import export, ExportedProgram
from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import EdgeProgramManager, ExecutorchProgramManager, to_edge_transform_and_lower
from executorch.exir.backend.backend_api import to_backend
mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )
exported_program: ExportedProgram = export(mobilenet_v2, sample_inputs)
edge: EdgeProgramManager = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()],
)
We will go through this example with the MobileNetV2 pretrained model downloaded from the TorchVision library. The lowering flow starts after exporting the model with torch.export: we call the to_edge_transform_and_lower API and pass in the XnnpackPartitioner. The partitioner identifies the subgraphs suitable for the XNNPACK backend delegate to consume. The identified subgraphs are then serialized with the XNNPACK Delegate flatbuffer schema, and each subgraph is replaced with a call to the XNNPACK Delegate.
>>> print(edge.exported_program().graph_module)
GraphModule(
  (lowered_module_0): LoweredBackendModule()
  (lowered_module_1): LoweredBackendModule()
)

def forward(self, b_features_0_1_num_batches_tracked, ..., x):
    lowered_module_0 = self.lowered_module_0
    lowered_module_1 = self.lowered_module_1
    executorch_call_delegate_1 = torch.ops.higher_order.executorch_call_delegate(lowered_module_1, x); lowered_module_1 = x = None
    getitem_53 = executorch_call_delegate_1[0]; executorch_call_delegate_1 = None
    aten_view_copy_default = executorch_exir_dialects_edge__ops_aten_view_copy_default(getitem_53, [1, 1280]); getitem_53 = None
    aten_clone_default = executorch_exir_dialects_edge__ops_aten_clone_default(aten_view_copy_default); aten_view_copy_default = None
    executorch_call_delegate = torch.ops.higher_order.executorch_call_delegate(lowered_module_0, aten_clone_default); lowered_module_0 = aten_clone_default = None
    getitem_52 = executorch_call_delegate[0]; executorch_call_delegate = None
    return (getitem_52,)
We print the graph after lowering above to show the new nodes that were inserted to call the XNNPACK Delegate. The subgraph being delegated to XNNPACK is the first argument at each call site. It can be observed that the majority of the convolution-relu-add blocks and linear blocks were able to be delegated to XNNPACK. We can also see the operators which could not be lowered to the XNNPACK delegate, such as clone and view_copy.
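For larger models the printed graph can be long, so it can help to summarize the delegation programmatically. The snippet below is a minimal sketch that walks the FX graph produced above, counting delegate call sites and collecting the operators that stayed outside the delegate; it matches delegate calls by node name (as seen in the printout), which is one simple heuristic rather than an official API.

# Minimal sketch: summarize delegation in the lowered graph.
# Assumes `edge` is the EdgeProgramManager created by to_edge_transform_and_lower above.
graph_module = edge.exported_program().graph_module

delegate_calls = 0
non_delegated_ops = []
for node in graph_module.graph.nodes:
    if node.op != "call_function":
        continue  # skip get_attr/placeholder/output nodes
    if node.name.startswith("executorch_call_delegate"):
        delegate_calls += 1  # a subgraph handed off to the XNNPACK delegate
    else:
        non_delegated_ops.append(str(node.target))  # ops left in the edge dialect

print(f"delegate call sites: {delegate_calls}")
print(f"non-delegated ops: {non_delegated_ops}")
# For MobileNetV2 this reports two delegate calls and leaves ops such as
# view_copy and clone outside the delegate, matching the printed graph above.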
exec_prog = edge.to_executorch()
with open("xnnpack_mobilenetv2.pte", "wb") as file:
exec_prog.write_to_file(file)
After lowering to the XNNPACK program, we can then prepare it for ExecuTorch and save the model as a .pte file. .pte is the binary format that stores the serialized ExecuTorch graph.
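Before moving the .pte file onto a device, it can be handy to sanity-check it from Python. The sketch below assumes the installed ExecuTorch Python package was built with the XNNPACK backend (the prebuilt wheels are) and uses the pybindings loader; the exact pybindings module path and API have changed between releases, so treat this as illustrative rather than canonical.

import torch
# Optional sanity check via the ExecuTorch pybindings (module path may differ by version).
from executorch.extension.pybindings.portable_lib import _load_for_executorch

# Load the serialized program and run it with a random input of the expected shape.
executorch_module = _load_for_executorch("xnnpack_mobilenetv2.pte")
outputs = executorch_module.forward([torch.randn(1, 3, 224, 224)])
print(outputs[0].shape)  # expect a (1, 1000) logits tensor for MobileNetV2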
Lowering a Quantized Model to XNNPACK
The XNNPACK delegate can also execute symmetrically quantized models. To understand the quantization flow and learn how to quantize models, refer to the custom quantization note. For the purposes of this tutorial, we will leverage the quantize() Python helper function added to the executorch/executorch/examples folder.
from torch.export import export_for_training
from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower
mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )
mobilenet_v2 = export_for_training(mobilenet_v2, sample_inputs).module() # 2-stage export for quantization path
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
get_symmetric_quantization_config,
XNNPACKQuantizer,
)
def quantize(model, example_inputs):
    """This is the official recommended flow for quantization in pytorch 2.0 export"""
    print(f"Original model: {model}")
    quantizer = XNNPACKQuantizer()
    # if we set is_per_channel to True, we also need to add out_variant of quantize_per_channel/dequantize_per_channel
    operator_config = get_symmetric_quantization_config(is_per_channel=False)
    quantizer.set_global(operator_config)
    m = prepare_pt2e(model, quantizer)
    # calibration
    m(*example_inputs)
    m = convert_pt2e(m)
    print(f"Quantized model: {m}")
    # make sure we can export to flat buffer
    return m
quantized_mobilenetv2 = quantize(mobilenet_v2, sample_inputs)
Quantization requires a two-stage export. First, we use the export_for_training API to capture the model before handing it to the quantize utility function. After the quantization step, we can then leverage the XNNPACK delegate to lower the quantized exported model graph. From here, the procedure is the same as lowering a non-quantized model to XNNPACK.
# Continued from earlier...
edge = to_edge_transform_and_lower(
    export(quantized_mobilenetv2, sample_inputs),
    compile_config=EdgeCompileConfig(_check_ir_validity=False),
    partitioner=[XnnpackPartitioner()]
)
exec_prog = edge.to_executorch()
with open("qs8_xnnpack_mobilenetv2.pte", "wb") as file:
exec_prog.write_to_file(file)
Lowering with the aot_compiler.py script
We have also provided a script to quickly lower and export a few example models. You can run it to generate lowered fp32 and quantized models. The script is provided purely for convenience and performs exactly the same steps as those listed in the previous two sections.
python -m examples.xnnpack.aot_compiler --model_name="mv2" --quantize --delegate
Note in the example above:
- the --model_name flag specifies the model to use
- the --quantize flag controls whether the model should be quantized or not
- the --delegate flag controls whether we attempt to lower parts of the graph to the XNNPACK delegate
The generated model file will be named [model_name]_xnnpack_[qs8/fp32].pte depending on the arguments supplied.
Running the XNNPACK Model with CMake
After exporting the XNNPACK-delegated model, we can now try running it with example inputs using CMake. We can build and use the xnn_executor_runner, which is a sample wrapper around the ExecuTorch runtime and XNNPACK backend. We first begin by configuring the CMake build like so:
# cd to the root of executorch repo
cd executorch
# Get a clean cmake-out directory
./install_requirements.sh --clean
mkdir cmake-out
# Configure cmake
cmake \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_ENABLE_LOGGING=ON \
    -DPYTHON_EXECUTABLE=python \
    -Bcmake-out .
Then you can build the runtime components with:
cmake --build cmake-out -j9 --target install --config Release
Now you should be able to find the executable built at ./cmake-out/backends/xnnpack/xnn_executor_runner. You can run it with your generated model like so:
./cmake-out/backends/xnnpack/xnn_executor_runner --model_path=./mv2_xnnpack_fp32.pte
# or to run the quantized variant
./cmake-out/backends/xnnpack/xnn_executor_runner --model_path=./mv2_xnnpack_q8.pte