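Intro to LLMs in ExecuTorch¶

Welcome to the LLM Manual! This manual provides a practical example of using ExecuTorch to onboard your own large language models (LLMs). Our primary goal is a clear and concise guide to integrating your LLM with our system.

Please note that this project is a demonstration, not a fully functional example with optimal performance. Components such as the sampler and tokenizer are therefore provided only in their most basic form, and the results the model produces may vary and will not always be optimal.

We encourage users to treat this project as a starting point and adapt it to their needs, which includes creating your own versions of the tokenizer, sampler, acceleration backends, and other components. We hope this project serves as a useful guide in your journey with LLMs and ExecuTorch.

To deploy Llama with optimal performance, see the Llama guide.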

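Table of Contents¶

Prerequisites
Hello World example
Quantization
Using mobile acceleration
Debugging and profiling
How to use custom kernels
How to build mobile apps

Prerequisites¶

To follow this guide, you'll need to clone the ExecuTorch repository and install dependencies. ExecuTorch recommends Python 3.10 and suggests using Conda to manage your environment. Conda is not required, but be aware that you may need to replace python/pip with python3/pip3 depending on your environment.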

Conda

To install Miniconda, see the Miniconda installation instructions.

# Create a directory for this example.
mkdir et-nanogpt
cd et-nanogpt

# Clone the ExecuTorch repository and submodules.
mkdir third-party
git clone -b release/0.4 https://github.com/pytorch/executorch.git third-party/executorch
cd third-party/executorch
git submodule update --init

# Create a conda environment and install requirements.
conda create -yn executorch python=3.10.0
conda activate executorch
./install_requirements.sh

cd ../..

pyenv-virtualenv

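To install pyenv-virtualenv, see the pyenv-virtualenv installation instructions.

Importantly, if pyenv is installed through brew, it does not automatically enable pyenv in the terminal, which leads to errors. Run the following commands to enable it. See the pyenv-virtualenv installation guide above for how to add them to your .bashrc or .zshrc so that you don't need to run them manually.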

eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

# Create a directory for this example.
mkdir et-nanogpt
cd et-nanogpt

pyenv install -s 3.10
pyenv virtualenv 3.10 executorch
pyenv activate executorch

# Clone the ExecuTorch repository and submodules.
mkdir third-party
git clone -b release/0.4 https://github.com/pytorch/executorch.git third-party/executorch
cd third-party/executorch
git submodule update --init

# Install requirements.
PYTHON_EXECUTABLE=python ./install_requirements.sh

cd ../..

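For more information, see Setting Up ExecuTorch.

Running a Large Language Model Locally¶

This example uses Karpathy's nanoGPT, a minimal implementation of GPT-2 124M. The guide applies to other language models as well, since ExecuTorch is model-invariant.

There are two steps to running a model with ExecuTorch:

Export the model. This step preprocesses it into a format suitable for runtime execution.
At runtime, load the model file and run it with the ExecuTorch runtime.

The export step happens ahead of time, typically as part of the application build or when the model changes. The resulting .pte file is distributed with the application. At runtime, the application loads the .pte file and passes it to the ExecuTorch runtime.

Step 1. Exporting to ExecuTorch¶

Exporting takes a PyTorch model and converts it into a format that can run efficiently on consumer devices.

For this example, you will need the nanoGPT model and the corresponding tokenizer vocabulary.

curl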

curl https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py -O
curl https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json -O

wget

wget https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py
wget https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json

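Converting the model into a format optimized for standalone execution takes two steps. First, use the PyTorch export function to convert the PyTorch model into a platform-independent intermediate representation. Then use the ExecuTorch to_edge and to_executorch methods to prepare the model for on-device execution. This creates a .pte file that a desktop or mobile application can load at runtime.

Create a file called export_nanogpt.py with the following contents: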

# export_nanogpt.py

import torch

from executorch.exir import EdgeCompileConfig, to_edge
from torch.nn.attention import sdpa_kernel, SDPBackend
from torch.export import export, export_for_training

from model import GPT

# Load the model.
model = GPT.from_pretrained('gpt2')

# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
example_inputs = (torch.randint(0, 100, (1, model.config.block_size), dtype=torch.long), )

# Set up dynamic shape configuration. This allows the sizes of the input tensors
# to differ from the sizes of the tensors in `example_inputs` during runtime, as
# long as they adhere to the rules specified in the dynamic shape configuration.
# Here we set the range of 0th model input's 1st dimension as
# [0, model.config.block_size].
# See https://pytorch.org/executorch/main/concepts.html#dynamic-shapes
# for details about creating dynamic shapes.
dynamic_shape = (
    {1: torch.export.Dim("token_dim", max=model.config.block_size)},
)

# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
    m = export_for_training(model, example_inputs, dynamic_shapes=dynamic_shape).module()
    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)

# Convert the model into a runnable ExecuTorch program.
edge_config = EdgeCompileConfig(_check_ir_validity=False)
edge_manager = to_edge(traced_model,  compile_config=edge_config)
et_program = edge_manager.to_executorch()

# Save the ExecuTorch program to a file.
with open("nanogpt.pte", "wb") as file:
    file.write(et_program.buffer)

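To export, run the script with python export_nanogpt.py (or python3, as appropriate for your environment). It will generate a nanogpt.pte file in the current directory.

For more information, see Exporting to ExecuTorch and torch.export.

Step 2. Invoking the Runtime¶

ExecuTorch provides a set of runtime APIs and types to load and run models.

Create a file called main.cpp with the following contents: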

// main.cpp

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

#include "basic_sampler.h"
#include "basic_tokenizer.h"

#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>
#include <executorch/runtime/core/evalue.h>
#include <executorch/runtime/core/exec_aten/exec_aten.h>
#include <executorch/runtime/core/result.h>

using executorch::aten::ScalarType;
using executorch::aten::Tensor;
using executorch::extension::from_blob;
using executorch::extension::Module;
using executorch::runtime::EValue;
using executorch::runtime::Result;

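Model inputs and outputs take the form of tensors. A tensor can be thought of as a multi-dimensional array. The ExecuTorch EValue class provides a wrapper around tensors and other ExecuTorch data types.

Since the LLM generates one token at a time, the driver code needs to invoke the model repeatedly, building the output token by token. Each generated token is passed as input for the next run.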

// main.cpp

// The value of the gpt2 `<|endoftext|>` token.
#define ENDOFTEXT_TOKEN 50256

std::string generate(
    Module& llm_model,
    std::string& prompt,
    BasicTokenizer& tokenizer,
    BasicSampler& sampler,
    size_t max_input_length,
    size_t max_output_length) {
  // Convert the input text into a list of integers (tokens) that represents it,
  // using the string-to-token mapping that the model was trained on. Each token
  // is an integer that represents a word or part of a word.
  std::vector<int64_t> input_tokens = tokenizer.encode(prompt);
  std::vector<int64_t> output_tokens;

  for (auto i = 0u; i < max_output_length; i++) {
    // Convert the input_tokens from a vector of int64_t to EValue. EValue is a
    // unified data type in the ExecuTorch runtime.
    auto inputs = from_blob(
        input_tokens.data(),
        {1, static_cast<int>(input_tokens.size())},
        ScalarType::Long);

    // Run the model. It will return a tensor of logits (unnormalized
    // log-probabilities).
    auto logits_evalue = llm_model.forward(inputs);

    // Convert the output logits from EValue to std::vector, which is what the
    // sampler expects.
    Tensor logits_tensor = logits_evalue.get()[0].toTensor();
    std::vector<float> logits(
        logits_tensor.data_ptr<float>(),
        logits_tensor.data_ptr<float>() + logits_tensor.numel());

    // Sample the next token from the logits.
    int64_t next_token = sampler.sample(logits);

    // Break if we reached the end of the text.
    if (next_token == ENDOFTEXT_TOKEN) {
      break;
    }

    // Add the next token to the output.
    output_tokens.push_back(next_token);

    std::cout << tokenizer.decode({next_token});
    std::cout.flush();

    // Update next input.
    input_tokens.push_back(next_token);
    if (input_tokens.size() > max_input_length) {
      input_tokens.erase(input_tokens.begin());
    }
  }

  std::cout << std::endl;

  // Convert the output tokens into a human-readable string.
  std::string output_string = tokenizer.decode(output_tokens);
  return output_string;
}
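
The control flow of generate() can be sketched in Python without ExecuTorch. This is only an illustration of the loop, not the runtime API: fake_model is a hypothetical stand-in that returns one logit per vocabulary entry, and EOS is an arbitrary end-of-text ID chosen for a five-token toy vocabulary.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-text token ID for this toy example


def greedy(logits: List[float]) -> int:
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def generate(model: Callable[[List[int]], List[float]],
             prompt_tokens: List[int],
             max_input_length: int,
             max_output_length: int) -> List[int]:
    input_tokens = list(prompt_tokens)
    output_tokens: List[int] = []
    for _ in range(max_output_length):
        logits = model(input_tokens)
        next_token = greedy(logits)
        if next_token == EOS:
            break
        output_tokens.append(next_token)
        # Feed the new token back in as part of the next input.
        input_tokens.append(next_token)
        # Drop the oldest token once the input window is full.
        if len(input_tokens) > max_input_length:
            input_tokens.pop(0)
    return output_tokens


def fake_model(tokens: List[int]) -> List[float]:
    """Toy model over 5 tokens: always favors (last token + 1) % 5."""
    logits = [0.0] * 5
    logits[(tokens[-1] + 1) % 5] = 1.0
    return logits


tokens_out = generate(fake_model, [1], max_input_length=4, max_output_length=10)
```

As in the C++ version, generation stops either at the end-of-text token or after max_output_length tokens, whichever comes first.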

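The Module class handles loading the .pte file and preparing it for execution.

The tokenizer is responsible for converting a human-readable prompt string into the numerical form the model expects. To do this, the tokenizer associates short substrings with token IDs. The tokens can be thought of as representing words or parts of words, although in practice they may be arbitrary sequences of characters.

The tokenizer loads its vocabulary from a file that contains the mapping between each token ID and the text it represents. Call tokenizer.encode() and tokenizer.decode() to convert between string and token representations.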
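
To make the encode/decode round trip concrete, here is a minimal Python sketch. It is not the algorithm in basic_tokenizer.h: the ToyTokenizer class, its greedy longest-match strategy, and the seven-entry vocabulary are assumptions made purely for illustration.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical tokenizer: greedy longest-match over a string->ID vocabulary."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self.encode_map = dict(vocab)
        self.decode_map = {token_id: text for text, token_id in vocab.items()}
        self.max_len = max(len(text) for text in vocab)

    def encode(self, text: str) -> List[int]:
        tokens: List[int] = []
        pos = 0
        while pos < len(text):
            # Try the longest candidate substring first, then shorter ones.
            for length in range(min(self.max_len, len(text) - pos), 0, -1):
                piece = text[pos:pos + length]
                if piece in self.encode_map:
                    tokens.append(self.encode_map[piece])
                    pos += length
                    break
            else:
                raise ValueError(f"no token covers {text[pos]!r}")
        return tokens

    def decode(self, tokens: List[int]) -> str:
        return "".join(self.decode_map[t] for t in tokens)


vocab = {"Hello": 0, " world": 1, "!": 2, "H": 3, "e": 4, "l": 5, "o": 6}
tok = ToyTokenizer(vocab)
ids = tok.encode("Hello world!")
text = tok.decode(ids)
```

Note how "Hell" falls back to single-character tokens because no longer entry matches; tokens are substrings, not necessarily whole words.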

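The sampler is responsible for selecting the next token based on the logits (unnormalized log-probabilities) output by the model. The LLM returns a logit value for each possible next token. The sampler chooses which token to use based on some strategy. The simplest approach, used here, is to take the token with the highest logit value.

Samplers may provide configurable options, such as a configurable amount of randomness in the output selection, penalties for repeated tokens, and biases to prioritize or de-prioritize specific tokens.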
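
The two strategies mentioned above, picking the highest logit and adding a configurable amount of randomness, can be sketched as follows. This is an illustrative Python function, not the interface of basic_sampler.h; the temperature parameter is one common way to control randomness and is an assumption here.

```python
import math
import random
from typing import List, Optional


def sample(logits: List[float],
           temperature: float = 0.0,
           rng: Optional[random.Random] = None) -> int:
    """Pick a token index. temperature <= 0 means greedy (argmax) selection."""
    if temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


greedy_choice = sample([0.1, 2.5, -1.0])
random_choice = sample([0.1, 2.5, -1.0], temperature=1.0, rng=random.Random(0))
```

Lower temperatures concentrate the choice on the highest logit, while higher temperatures spread it across more tokens.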

// main.cpp

int main() {
  // Set up the prompt. This provides the seed text for the model to elaborate.
  std::cout << "Enter model prompt: ";
  std::string prompt;
  std::getline(std::cin, prompt);

  // The tokenizer is used to convert between tokens (used by the model) and
  // human-readable strings.
  BasicTokenizer tokenizer("vocab.json");

  // The sampler is used to sample the next token from the logits.
  BasicSampler sampler = BasicSampler();

  // Load the exported nanoGPT program, which was generated via the previous
  // steps.
  Module model("nanogpt.pte", Module::LoadMode::MmapUseMlockIgnoreErrors);

  const auto max_input_tokens = 1024;
  const auto max_output_tokens = 30;
  std::cout << prompt;
  generate(
      model, prompt, tokenizer, sampler, max_input_tokens, max_output_tokens);
}

Finally, download the following files into the same directory as main.cpp:

curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/basic_sampler.h
curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/basic_tokenizer.h

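To learn more, see the Runtime APIs Tutorial.

Building and Running¶

ExecuTorch uses the CMake build system. To compile and link against the ExecuTorch runtime, include the ExecuTorch project via add_subdirectory and link against executorch and additional dependencies.

Create a file named CMakeLists.txt with the following contents: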

# CMakeLists.txt

cmake_minimum_required(VERSION 3.19)
project(nanogpt_runner)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)

# Set options for executorch build.
option(EXECUTORCH_ENABLE_LOGGING "" ON)
option(EXECUTORCH_BUILD_EXTENSION_DATA_LOADER "" ON)
option(EXECUTORCH_BUILD_EXTENSION_MODULE "" ON)
option(EXECUTORCH_BUILD_EXTENSION_TENSOR "" ON)
option(EXECUTORCH_BUILD_KERNELS_OPTIMIZED "" ON)

# Include the executorch subdirectory.
add_subdirectory(
  ${CMAKE_CURRENT_SOURCE_DIR}/third-party/executorch
  ${CMAKE_BINARY_DIR}/executorch
)

add_executable(nanogpt_runner main.cpp)
target_link_libraries(
  nanogpt_runner
  PRIVATE executorch
          extension_module_static # Provides the Module class
          extension_tensor # Provides the TensorPtr class
          optimized_native_cpu_ops_lib # Provides baseline cross-platform
                                       # kernels
)

At this point, the working directory should contain the following files:

CMakeLists.txt
main.cpp
basic_tokenizer.h
basic_sampler.h
export_nanogpt.py
model.py
vocab.json
nanogpt.pte

If all of these are present, you can now build and run:

./install_requirements.sh --clean
(mkdir cmake-out && cd cmake-out && cmake ..)
cmake --build cmake-out -j10
./cmake-out/nanogpt_runner

You should see the message:

Enter model prompt:

Type some seed text for the model and press enter. Here we use "Hello world!" as an example prompt:

Enter model prompt: Hello world!
Hello world!

I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in

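At this point, it is likely to run very slowly. This is because ExecuTorch hasn't been told to optimize for specific hardware (delegation), and because it is doing all of the calculations in 32-bit floating point (no quantization).

Delegation¶

While ExecuTorch provides a portable, cross-platform implementation for all operators, it also provides specialized backends for a number of different targets. These include, but are not limited to, x86 and ARM CPU acceleration via the XNNPACK backend, Apple acceleration via the Core ML backend and the Metal Performance Shader (MPS) backend, and GPU acceleration via the Vulkan backend.

Because optimizations are specific to a given backend, each .pte file is specific to the backend(s) targeted at export. To support multiple devices, such as XNNPACK acceleration for Android and Core ML for iOS, export a separate .pte file for each backend.

To delegate to a backend at export time, ExecuTorch provides the to_backend() function on the EdgeProgramManager object, which takes a backend-specific partitioner object. The partitioner is responsible for finding the parts of the computation graph that the target backend can accelerate, and to_backend() delegates the matching parts to that backend for acceleration and optimization. Any part of the computation graph that is not delegated is executed by the ExecuTorch operator implementations.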
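
The idea behind partitioning can be illustrated with a deliberately simplified Python sketch. It treats the model as a flat list of operator names rather than a graph, and the supported set is a hypothetical stand-in for what a backend can run; a real partitioner such as XnnpackPartitioner works on graph nodes and applies many more constraints.

```python
from typing import List, Set, Tuple


def partition(ops: List[str], supported: Set[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive ops into ("delegated" | "portable", [ops]) segments."""
    segments: List[Tuple[str, List[str]]] = []
    for op in ops:
        kind = "delegated" if op in supported else "portable"
        if segments and segments[-1][0] == kind:
            # Extend the current segment while the kind stays the same.
            segments[-1][1].append(op)
        else:
            segments.append((kind, [op]))
    return segments


ops = ["view_copy", "addmm", "where", "softmax", "add"]
supported = {"view_copy", "addmm", "softmax", "add"}  # hypothetical backend support
segments = partition(ops, supported)
```

Here the unsupported where operator splits the model into two delegated segments with one portable operator in between, which mirrors how an undelegated operator can fragment a lowered graph.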

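To delegate the exported model to a specific backend, first import its partitioner and edge compile config from the ExecuTorch codebase, then call to_backend on the EdgeProgramManager object with an instance of the partitioner.

Here's an example of how to delegate nanoGPT to XNNPACK (if you're deploying to an Android phone, for instance):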

# export_nanogpt.py

# Load partitioner for Xnnpack backend
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Model to be delegated to specific backend should use specific edge compile config
from executorch.backends.xnnpack.utils.configs import get_xnnpack_edge_compile_config
from executorch.exir import EdgeCompileConfig, to_edge

import torch
from torch.export import export
from torch.nn.attention import sdpa_kernel, SDPBackend
from torch.export import export_for_training

from model import GPT

# Load the nanoGPT model.
model = GPT.from_pretrained('gpt2')

# Create example inputs. This is used in the export process to provide
# hints on the expected shape of the model input.
example_inputs = (
        torch.randint(0, 100, (1, model.config.block_size - 1), dtype=torch.long),
    )

# Set up dynamic shape configuration. This allows the sizes of the input tensors
# to differ from the sizes of the tensors in `example_inputs` during runtime, as
# long as they adhere to the rules specified in the dynamic shape configuration.
# Here we set the range of 0th model input's 1st dimension as
# [0, model.config.block_size - 1].
# See https://pytorch.org/executorch/main/concepts.html#dynamic-shapes
# for details about creating dynamic shapes.
dynamic_shape = (
    {1: torch.export.Dim("token_dim", max=model.config.block_size - 1)},
)

# Trace the model, converting it to a portable intermediate representation.
# The torch.no_grad() call tells PyTorch to exclude training-specific logic.
with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad():
    m = export_for_training(model, example_inputs, dynamic_shapes=dynamic_shape).module()
    traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape)

# Convert the model into a runnable ExecuTorch program.
# To be further lowered to Xnnpack backend, `traced_model` needs xnnpack-specific edge compile config
edge_config = get_xnnpack_edge_compile_config()
edge_manager = to_edge(traced_model, compile_config=edge_config)

# Delegate exported model to Xnnpack backend by invoking `to_backend` function with Xnnpack partitioner.
edge_manager = edge_manager.to_backend(XnnpackPartitioner())
et_program = edge_manager.to_executorch()

# Save the Xnnpack-delegated ExecuTorch program to a file.
with open("nanogpt.pte", "wb") as file:
    file.write(et_program.buffer)

Additionally, update CMakeLists.txt to build the XNNPACK backend and link it to the ExecuTorch runner.

cmake_minimum_required(VERSION 3.19)
project(nanogpt_runner)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)

# Set options for executorch build.
option(EXECUTORCH_ENABLE_LOGGING "" ON)
option(EXECUTORCH_BUILD_EXTENSION_DATA_LOADER "" ON)
option(EXECUTORCH_BUILD_EXTENSION_MODULE "" ON)
option(EXECUTORCH_BUILD_EXTENSION_TENSOR "" ON)
option(EXECUTORCH_BUILD_KERNELS_OPTIMIZED "" ON)
option(EXECUTORCH_BUILD_XNNPACK "" ON) # Build with Xnnpack backend

# Include the executorch subdirectory.
add_subdirectory(
  ${CMAKE_CURRENT_SOURCE_DIR}/third-party/executorch
  ${CMAKE_BINARY_DIR}/executorch
)

add_executable(nanogpt_runner main.cpp)
target_link_libraries(
  nanogpt_runner
  PRIVATE executorch
          extension_module_static # Provides the Module class
          extension_tensor # Provides the TensorPtr class
          optimized_native_cpu_ops_lib # Provides baseline cross-platform
                                       # kernels
          xnnpack_backend # Provides the XNNPACK CPU acceleration backend
)

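Keep the rest of the code unchanged. For more details, see Exporting to ExecuTorch and Invoking the Runtime.

At this point, the working directory should contain the following files: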

CMakeLists.txt
main.cpp
basic_tokenizer.h
basic_sampler.h
export_nanogpt.py
model.py
vocab.json

If all of these are present, you can now export the XNNPACK-delegated .pte model:

python export_nanogpt.py

It will generate nanogpt.pte in the same working directory.

Then build and run the model with:

(rm -rf cmake-out && mkdir cmake-out && cd cmake-out && cmake ..)
cmake --build cmake-out -j10
./cmake-out/nanogpt_runner

You should see the message:

Enter model prompt:

Type some seed text for the model and press enter. Here we use "Hello world!" as an example prompt:

Enter model prompt: Hello world!
Hello world!

I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in

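The delegated model should be noticeably faster than the non-delegated model.

For more information on backend delegation, see the ExecuTorch guides for the XNNPACK backend, the Core ML backend, and the Qualcomm AI Engine Direct backend.

Quantization¶

Quantization refers to a set of techniques for running calculations and storing tensors using lower-precision types. Compared to 32-bit floating point, using 8-bit integers can provide both a significant speedup and a reduction in memory usage. There are many approaches to quantizing a model, varying in the amount of pre-processing required, the data types used, and the impact on model accuracy and performance.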
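
As a worked example of what "lower precision" means, the Python sketch below quantizes a few float values to 8-bit integers with a single symmetric scale and then converts them back. It illustrates the arithmetic only; the function names are made up for this example and the scheme is simpler than what the XNNPACK quantizer does.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to int8 values in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
```

Each restored value differs from the original by at most half a quantization step (scale / 2), which is the accuracy cost paid for storing one byte per value instead of four.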

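Because compute and memory are highly constrained on mobile devices, some form of quantization is necessary to ship large models on consumer electronics. In particular, large language models such as Llama2 may require quantizing model weights to 4 bits or less.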
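
A rough back-of-the-envelope calculation shows why bit width matters. The Python sketch below estimates weight storage for a model of about 124 million parameters (the GPT-2 size used in this guide) at different precisions. It counts the weights only and ignores quantization metadata such as scales, activations, and runtime overhead.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store the weights alone, rounded up to whole bytes."""
    return (num_params * bits_per_weight + 7) // 8


GPT2_PARAMS = 124_000_000  # "GPT-2 124M", rounded

for bits in (32, 8, 4):
    megabytes = weight_bytes(GPT2_PARAMS, bits) / 1e6
    print(f"{bits:>2}-bit weights: {megabytes:.0f} MB")
```

The same arithmetic scales linearly with parameter count, so a model with billions of parameters benefits proportionally more from low-bit weights.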

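Leveraging quantization requires transforming the model before export. PyTorch provides the pt2e (PyTorch 2 Export) API for this purpose. This example targets CPU acceleration using the XNNPACK delegate, so it needs the XNNPACK-specific quantizer. Targeting a different backend requires the corresponding quantizer.

To use 8-bit integer dynamic quantization with the XNNPACK delegate, call prepare_pt2e, calibrate the model by running it with a representative input, and then call convert_pt2e. This updates the computational graph to use quantized operators where available.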

# export_nanogpt.py

from executorch.backends.transforms.duplicate_dynamic_quant_chain import (
    DuplicateDynamicQuantChainPass,
)
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    get_symmetric_quantization_config,
    XNNPACKQuantizer,
)
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

# Use dynamic, per-channel quantization.
xnnpack_quant_config = get_symmetric_quantization_config(
    is_per_channel=True, is_dynamic=True
)
xnnpack_quantizer = XNNPACKQuantizer()
xnnpack_quantizer.set_global(xnnpack_quant_config)

m = export_for_training(model, example_inputs).module()

# Annotate the model for quantization. This prepares the model for calibration.
m = prepare_pt2e(m, xnnpack_quantizer)

# Calibrate the model using representative inputs. This allows the quantization
# logic to determine the expected range of values in each tensor.
m(*example_inputs)

# Perform the actual quantization.
m = convert_pt2e(m, fold_quantize=False)
DuplicateDynamicQuantChainPass()(m)

traced_model = export(m, example_inputs)

Additionally, add or update the to_backend() call to use XnnpackPartitioner. This instructs ExecuTorch to optimize the model for CPU execution via the XNNPACK backend.

from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
    XnnpackPartitioner,
)

edge_manager = to_edge(traced_model, compile_config=edge_config)
edge_manager = edge_manager.to_backend(XnnpackPartitioner()) # Lower to XNNPACK.
et_program = edge_manager.to_executorch()

Finally, ensure that the runner links against the xnnpack_backend target in CMakeLists.txt.

add_executable(nanogpt_runner main.cpp)
target_link_libraries(
    nanogpt_runner
    PRIVATE
    executorch
    extension_module_static # Provides the Module class
    extension_tensor # Provides the TensorPtr class
    optimized_native_cpu_ops_lib # Provides baseline cross-platform kernels
    xnnpack_backend) # Provides the XNNPACK CPU acceleration backend

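For more information, see Quantization in ExecuTorch.

Profiling and Debugging¶

After lowering a model by calling to_backend(), you may want to see what got delegated and what didn't. ExecuTorch provides utility methods that give insight into the delegation. You can use this information to gain visibility into the underlying computation and diagnose potential performance issues. Model authors can use it to structure the model in a way that is compatible with the target backend.

Visualizing the Delegation¶

The get_delegation_info() method provides a summary of what happened to the model after the to_backend() call: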

from executorch.devtools.backend_debug import get_delegation_info
from tabulate import tabulate

# ... After call to to_backend(), but before to_executorch()
graph_module = edge_manager.exported_program().graph_module
delegation_info = get_delegation_info(graph_module)
print(delegation_info.get_summary())
df = delegation_info.get_operator_delegation_dataframe()
print(tabulate(df, headers="keys", tablefmt="fancy_grid"))

For nanoGPT targeting the XNNPACK backend, you might see the following (note that the numbers below are for illustration purposes only and actual values may vary):

Total delegated subgraphs: 145
Number of delegated nodes: 350
Number of non-delegated nodes: 760

	op_type	# in_delegated_graphs	# in_non_delegated_graphs
0	aten__softmax_default	12	0
1	aten_add_tensor	37	0
2	aten_addmm_default	48	0
3	aten_any_dim	0	12
	…
25	aten_view_copy_default	96	122
	…
30	Total	350	760
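
One quick way to read such a summary is to turn the counts into a delegated fraction. The Python sketch below does this for the illustrative totals in the table; the helper name is made up for this example and the numbers are the sample values shown above, not measurements.

```python
def delegated_fraction(delegated: int, non_delegated: int) -> float:
    """Share of nodes that ended up in delegated graphs (0.0 if there are none)."""
    total = delegated + non_delegated
    return delegated / total if total else 0.0


overall = delegated_fraction(350, 760)   # "Total" row
view_copy = delegated_fraction(96, 122)  # aten_view_copy_default row

print(f"overall: {overall:.1%}, view_copy: {view_copy:.1%}")
```

A low overall fraction, or a frequently used operator that is mostly non-delegated, is a hint about where to look when diagnosing performance.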

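From the table, the operator aten_view_copy_default appears 96 times in delegated graphs and 122 times in non-delegated graphs. For a more detailed view, use the format_delegated_graph() method to get a formatted string of the whole graph, or use print_delegated_graph() to print it directly: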

from executorch.exir.backend.utils import format_delegated_graph
graph_module = edge_manager.exported_program().graph_module
print(format_delegated_graph(graph_module))

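For large models this may produce a large amount of output. Consider using "Control+F" or "Command+F" to locate the operator you're interested in (e.g. "aten_view_copy_default"). Observe which instances are not under lowered graphs.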
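
Instead of searching by hand, the formatted string can also be filtered programmatically. The Python sketch below is a minimal illustration: find_op_lines is a hypothetical helper, and sample_text is a shortened, hand-written stand-in for the string returned by format_delegated_graph().

```python
from typing import List


def find_op_lines(graph_text: str, op_name: str) -> List[str]:
    """Return the stripped lines of graph_text that mention op_name."""
    return [line.strip() for line in graph_text.splitlines() if op_name in line]


# Abbreviated stand-in for real output; argument lists are omitted.
sample_text = """\
%aten_where_self_22 : [num_users=1] = call_function[target=aten.where.self]
backend_id: XnnpackBackend
lowered graph():
    %aten_view_copy_default : [num_users=1] = call_function[target=aten.view_copy.default]
"""

matches = find_op_lines(sample_text, "aten_view_copy_default")
for line in matches:
    print(line)
```

In a real export script you would pass the result of format_delegated_graph(graph_module) in place of sample_text.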

In the fragment of the nanoGPT output below, observe that a transformer module has been delegated to XNNPACK while the where operator has not.

%aten_where_self_22 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.where.self](args = (%aten_logical_not_default_33, %scalar_tensor_23, %scalar_tensor_22), kwargs = {})
%lowered_module_144 : [num_users=1] = get_attr[target=lowered_module_144]
backend_id: XnnpackBackend
lowered graph():
    %p_transformer_h_0_attn_c_attn_weight : [num_users=1] = placeholder[target=p_transformer_h_0_attn_c_attn_weight]
    %p_transformer_h_0_attn_c_attn_bias : [num_users=1] = placeholder[target=p_transformer_h_0_attn_c_attn_bias]
    %getitem : [num_users=1] = placeholder[target=getitem]
    %sym_size : [num_users=2] = placeholder[target=sym_size]
    %aten_view_copy_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.view_copy.default](args = (%getitem, [%sym_size, 768]), kwargs = {})
    %aten_permute_copy_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.permute_copy.default](args = (%p_transformer_h_0_attn_c_attn_weight, [1, 0]), kwargs = {})
    %aten_addmm_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.addmm.default](args = (%p_transformer_h_0_attn_c_attn_bias, %aten_view_copy_default, %aten_permute_copy_default), kwargs = {})
    %aten_view_copy_default_1 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.view_copy.default](args = (%aten_addmm_default, [1, %sym_size, 2304]), kwargs = {})
    return [aten_view_copy_default_1]

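Performance Analysis¶

Through the ExecuTorch Developer Tools, users can profile model execution and get timing information for each operator in the model.

Prerequisites¶

ETRecord generation (Optional)¶

An ETRecord is an artifact generated at export time that contains model graphs and source-level metadata linking the ExecuTorch program to the original PyTorch model. You can view all profiling events without an ETRecord, but with one you can also link each event to the type of operator being executed, the module hierarchy, and stack traces of the original PyTorch source code. For more information, see the ETRecord docs.

In your export script, after calling to_edge() and to_executorch(), call generate_etrecord() with the EdgeProgramManager from to_edge() and the ExecuTorchProgramManager from to_executorch(). Make sure to copy the EdgeProgramManager, as the call to to_backend() mutates the graph in-place.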

# export_nanogpt.py

import copy
from executorch.devtools import generate_etrecord

# Make the deep copy immediately after to_edge()
edge_manager_copy = copy.deepcopy(edge_manager)

# ...
# Generate ETRecord right after to_executorch()
etrecord_path = "etrecord.bin"
generate_etrecord(etrecord_path, edge_manager_copy, et_program)

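Run the export script and the ETRecord will be generated as etrecord.bin.

ETDump generation¶

An ETDump is an artifact generated at runtime that contains a trace of the model execution. For more information, see the ETDump docs.

Include the ETDump header in your code.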

// main.cpp

#include <executorch/devtools/etdump/etdump_flatcc.h>

Create an instance of the ETDumpGen class and pass it to the Module constructor.

std::unique_ptr<ETDumpGen> etdump_gen_ = std::make_unique<ETDumpGen>();
Module model("nanogpt.pte", Module::LoadMode::MmapUseMlockIgnoreErrors, std::move(etdump_gen_));

After calling generate(), save the ETDump to a file. You can capture multiple model runs in a single trace, if desired.

ETDumpGen* etdump_gen = static_cast<ETDumpGen*>(model.event_tracer());

ET_LOG(Info, "ETDump size: %zu blocks", etdump_gen->get_num_blocks());
etdump_result result = etdump_gen->get_etdump_data();
if (result.buf != nullptr && result.size > 0) {
    // On a device with a file system, users can just write it to a file.
    FILE* f = fopen("etdump.etdp", "w+");
    fwrite((uint8_t*)result.buf, 1, result.size, f);
    fclose(f);
    free(result.buf);
}

Additionally, update CMakeLists.txt to build with the Developer Tools and enable events to be traced and logged into the ETDump:

option(EXECUTORCH_ENABLE_EVENT_TRACER "" ON)
option(EXECUTORCH_BUILD_DEVTOOLS "" ON)

# ...

target_link_libraries(
    # ... omit existing ones
    etdump) # Provides event tracing and logging

target_compile_options(executorch PUBLIC -DET_EVENT_TRACER_ENABLED)
target_compile_options(portable_ops_lib PUBLIC -DET_EVENT_TRACER_ENABLED)

Build and run the runner, and you will see a file named "etdump.etdp" generated. (Note that this time we build in release mode to get around a flatccrt build limitation.)

(rm -rf cmake-out && mkdir cmake-out && cd cmake-out && cmake -DCMAKE_BUILD_TYPE=Release ..)
cmake --build cmake-out -j10
./cmake-out/nanogpt_runner

Analyze with Inspector APIs¶

Once you've collected the debug artifacts, the ETDump and optionally an ETRecord, you can use the Inspector API to view performance information.

from executorch.devtools import Inspector

inspector = Inspector(etdump_path="etdump.etdp")
# If you also generated an ETRecord, then pass that in as well: `inspector = Inspector(etdump_path="etdump.etdp", etrecord="etrecord.bin")`

with open("inspector_out.txt", "w") as file:
    inspector.print_data_tabular(file)

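This prints the performance data in a tabular format to "inspector_out.txt", with each row being a profiling event.

To learn more about the Inspector and the rich functionality it provides, see the Inspector API Reference.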

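Custom Kernels¶

With the ExecuTorch custom operator APIs, custom operator and kernel authors can easily bring their kernels into PyTorch/ExecuTorch.

There are three steps to use custom kernels in ExecuTorch:

Write the custom kernel using ExecuTorch types.
Compile and link the custom kernel to both the AOT Python environment and the runtime binary.
Apply a source-to-source transformation that swaps an operator for the custom op.

Writing a Custom Kernel¶

Define your custom operator schema for both the functional variant (used in AOT compilation) and the out variant (used in the ExecuTorch runtime). The schema needs to follow the PyTorch ATen convention (see native_functions.yaml).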

custom_linear(Tensor weight, Tensor input, Tensor(?) bias) -> Tensor

custom_linear.out(Tensor weight, Tensor input, Tensor(?) bias, *, Tensor(a!) out) -> Tensor(a!)

Write your custom kernel according to the schema defined above. Use the EXECUTORCH_LIBRARY macro to make the kernel available to the ExecuTorch runtime.

// custom_linear.h / custom_linear.cpp
#include <executorch/runtime/kernel/kernel_includes.h>

Tensor& custom_linear_out(const Tensor& weight, const Tensor& input, optional<Tensor> bias, Tensor& out) {
    // calculation
    return out;
}

// Register as myop::custom_linear.out
EXECUTORCH_LIBRARY(myop, "custom_linear.out", custom_linear_out);

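To make this operator available in PyTorch, you can define a wrapper around the ExecuTorch custom kernel. Note that the ExecuTorch implementation uses ExecuTorch tensor types, while the PyTorch wrapper uses ATen tensors.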

// custom_linear_pytorch.cpp

#include "custom_linear.h"
#include <torch/library.h>

at::Tensor custom_linear(const at::Tensor& weight, const at::Tensor& input, std::optional<at::Tensor> bias) {

    // initialize out
    at::Tensor out = at::empty({weight.size(1), input.size(1)});

    // wrap kernel in custom_linear.cpp into ATen kernel
    WRAP_TO_ATEN(custom_linear_out, 3)(weight, input, bias, out);

    return out;
}

// Register the operator with PyTorch.
TORCH_LIBRARY(myop,  m) {
    m.def("custom_linear(Tensor weight, Tensor input, Tensor(?) bias) -> Tensor", custom_linear);
    m.def("custom_linear.out(Tensor weight, Tensor input, Tensor(?) bias, *, Tensor(a!) out) -> Tensor(a!)", WRAP_TO_ATEN(custom_linear_out, 3));
}

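Compile and Link the Custom Kernel¶

To make it available to the ExecuTorch runtime, compile custom_linear.h/cpp into the binary target. You can also build the kernel as a dynamically loaded library (.so or .dylib) and link it.

To make it available to PyTorch, package custom_linear.h, custom_linear.cpp, and custom_linear_pytorch.cpp into a dynamically loaded library (.so or .dylib) and load it into the Python environment. This is needed so that PyTorch recognizes the custom operator at export time.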

import torch
torch.ops.load_library("libcustom_linear.so")

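Once loaded, you can use the custom operator in PyTorch code.

For more information, see PyTorch Custom Operators and ExecuTorch Kernel Registration.

Using a Custom Operator in a Model¶

The custom operator can be used explicitly in the PyTorch model, or you can write a transformation that replaces instances of a core operator with the custom variant. For this example, you could find all instances of torch.nn.Linear and replace them with CustomLinear.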

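import torch.nn as nn

# CustomLinear is assumed to be a user-defined nn.Module that calls the custom operator.
def replace_linear_with_custom_linear(module):
    # Walk the direct children and swap each nn.Linear for a CustomLinear.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(
                module,
                name,
                CustomLinear(child.in_features, child.out_features, child.bias),
            )
        else:
            # Recurse into containers and other composite modules.
            replace_linear_with_custom_linear(child)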
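The remaining steps are the same as in the normal flow. You can now run this module in eager mode as well as export it to ExecuTorch.

How to Build Mobile Apps¶

See the instructions for building and running LLMs using ExecuTorch on iOS and Android.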