Inductor CPU backend debugging and profiling¶

Created On: Jul 01, 2023 | Last Updated: Jul 23, 2024 | Last Verified: Nov 05, 2024

Authors: Xuan Liao, Haozhe Zhu, Jiong Gong, Weihan Wang

Overview¶

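PyTorch 2.0 introduced a compilation API called torch.compile. This feature offers a significant speedup over eager mode execution through graph-level optimization powered by the default Inductor backend.

This tutorial provides an in-depth introduction to debugging and performance profiling on the Inductor CPU backend by delving into the details of torch.compile.

You may also want to look at the related torch.compile tutorials on basic usage, comprehensive troubleshooting, and GPU-specific topics such as GPU performance profiling.

We start with a motivating example that triggers compilation issues and accuracy problems, and walk through the debugging process used to pinpoint them.

By enabling logging and exploring the generated low-level code, you can narrow down the failure step by step and finally identify the root cause.

After that, we discuss how to profile the compiled code and, through a performance comparison with eager mode, explain why torch.compile can provide an additional performance boost over its eager counterpart.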

Debugging¶

Here is a simple example that runs torch.compile with Inductor, so that its result can be compared with eager mode:

import torch

def foo1(x1, x2):
    a = torch.neg(x1)
    b = torch.maximum(x2, a)
    y = torch.cat([b], dim=0)
    return y

x1 = torch.randint(256, (1, 8), dtype=torch.uint8)
x2 = torch.randint(256, (8390, 8), dtype=torch.uint8)

compiled_foo1 = torch.compile(foo1)
result = compiled_foo1(x1, x2)


The correct implementation of neg in the cpp codegen is as follows:

def neg1(x):
    return f"decltype({x})(-{x})"

To demonstrate debugging, we will later modify this function into a wrong version.
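To make the role of this one-line function concrete, here is a minimal sketch of how such a string template turns into a line of generated C++. The helper names `neg_correct`, `neg_plain`, and `emit_assignment` are hypothetical and chosen only for illustration; they mirror the snippets in this tutorial and are not Inductor's actual internal API.

```python
def neg_correct(x: str) -> str:
    # Cast the negated value back to the operand's own C++ type.
    return f"decltype({x})(-{x})"


def neg_plain(x: str) -> str:
    # No cast: the C++ compiler is free to promote the result type.
    return f"-{x}"


def emit_assignment(target: str, expr: str) -> str:
    # Hypothetical helper: format one statement of a generated kernel body.
    return f"auto {target} = {expr};"


print(emit_assignment("tmp2", neg_correct("tmp1")))
print(emit_assignment("tmp2", neg_plain("tmp1")))
```

The only difference between the two generated lines is the `decltype(...)` cast, which is the detail the debugging walkthrough in this tutorial revolves around.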

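Getting more logging information¶

By default, running this simple example provides no debugging information. To get more useful debugging and logging information, we usually add the TORCH_COMPILE_DEBUG environment variable, as shown below: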

TORCH_COMPILE_DEBUG=1 python xx.py

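This prints more debug information in the output logs and also dumps the intermediate IRs generated during the codegen process. You can find the paths of the dumped files in the log, as shown below: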

torch._inductor.debug: [WARNING] model___20 debug trace: /tmp/torchinductor_root/rx/crxfi2ybd7yp5sbj2pnhw33wfhtdw7wumvrobyp5sjvdui5ktjc2.debug

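In this directory, the following files are saved for debugging purposes:

File	Description
`fx_graph_runnable.py`	Executable FX graph, after decomposition, before pattern match
`fx_graph_transformed.py`	Transformed FX graph, after pattern match
`ir_pre_fusion.txt`	Inductor IR before fusion
`ir_post_fusion.txt`	Inductor IR after fusion
`output_code.py`	Generated Python code for the graph, with C++/Triton kernels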
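As a small convenience, the debug directory can be pulled out of the log line and checked against the file list in the table. This is only a sketch: the helper names `debug_dir_from_log` and `missing_debug_files` are hypothetical, and the regular expression assumes the log line has the same shape as the one shown above, which may differ between PyTorch versions.

```python
import re
from pathlib import Path

# File names taken from the table in this section.
EXPECTED_FILES = [
    "fx_graph_runnable.py",
    "fx_graph_transformed.py",
    "ir_pre_fusion.txt",
    "ir_post_fusion.txt",
    "output_code.py",
]


def debug_dir_from_log(line: str):
    # Pull the path that follows "debug trace:" out of one log line.
    match = re.search(r"debug trace:\s*(\S+)", line)
    return Path(match.group(1)) if match else None


def missing_debug_files(debug_dir: Path):
    # Report which of the expected files are absent from the directory.
    return [name for name in EXPECTED_FILES if not (debug_dir / name).exists()]


log_line = (
    "torch._inductor.debug: [WARNING] model___20 debug trace: "
    "/tmp/torchinductor_root/rx/crxfi2ybd7yp5sbj2pnhw33wfhtdw7wumvrobyp5sjvdui5ktjc2.debug"
)
debug_dir = debug_dir_from_log(log_line)
print(debug_dir)
print(missing_debug_files(debug_dir))
```

On a machine where that directory does not exist, the second line simply lists all five file names.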

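Note that fx_graph_runnable.py and output_code.py are both runnable and editable, which makes debugging easier. Below are the main parts of the code extracted from these files, with each generated C++ line correlated to its FX code line.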

fx_graph_runnable:

def forward1(self, arg0_1, arg1_1):
    neg = torch.ops.aten.neg.default(arg0_1);  arg0_1 = None
    maximum = torch.ops.aten.maximum.default(arg1_1, neg);  arg1_1 = neg = None
    clone = torch.ops.aten.clone.default(maximum);  maximum = None
    return (clone,)

C++ kernel in output_code:

import torch
from torch._inductor.async_compile import AsyncCompile
async_compile = AsyncCompile()

cpp_fused_cat_maximum_neg_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       const unsigned char* in_ptr1,
                       unsigned char* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(8390L); i0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(8L); i1+=static_cast<long>(1L))
            {
                auto tmp0 = in_ptr0[static_cast<long>(i1 + (8L*i0))];
                auto tmp1 = in_ptr1[static_cast<long>(i1)];
                // Corresponding FX code line: neg = torch.ops.aten.neg.default(arg0_1);  arg0_1 = None
                auto tmp2 = decltype(tmp1)(-tmp1);
                // Corresponding FX code line: maximum = torch.ops.aten.maximum.default(arg1_1, neg);  arg1_1 = neg = None
                auto tmp3 = max_propagate_nan(tmp0, tmp2);
                // Corresponding FX code line: clone = torch.ops.aten.clone.default(maximum);  maximum = None
                out_ptr0[static_cast<long>(i1 + (8L*i0))] = tmp3;
            }
        }
    }
}''')

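Determine component of error¶

When encountering errors or accuracy problems, a straightforward way to find the bug is to narrow down the problem. The first thing to do is to determine the component where the error occurs. Luckily, this can be done simply by changing the backend of torch.compile.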

Code	Description
`torch.compile(fn, backend="eager")`	Enable Dynamo
`torch.compile(fn, backend="aot_eager")`	Enable Dynamo + AOT Autograd
`torch.compile(fn, backend="inductor")`	Enable Dynamo + AOT Autograd + Inductor

If the model runs successfully when the backend is set to eager or aot_eager but fails with inductor, we can narrow the failure down to Inductor.
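The triage described above can be sketched as a small loop over the three backends. The function `first_failing_backend` and the runners below are hypothetical names introduced for illustration. `run_with_torch_compile` shows how the loop might be wired to the example from this tutorial and requires PyTorch, so the sketch only executes `fake_run`, a stand-in that pretends Inductor is the broken component.

```python
def first_failing_backend(run, backends=("eager", "aot_eager", "inductor")):
    """Return (backend, exception) for the first backend whose run fails, else None."""
    for backend in backends:
        try:
            run(backend)
        except Exception as exc:  # any failure counts for triage purposes
            return backend, exc
    return None


def run_with_torch_compile(backend):
    # Example runner; requires PyTorch and is not executed by this sketch.
    import torch

    def fn(x1, x2):
        return torch.cat([torch.maximum(x2, torch.neg(x1))], dim=0)

    x1 = torch.randint(256, (1, 8), dtype=torch.uint8)
    x2 = torch.randint(256, (8390, 8), dtype=torch.uint8)
    torch._dynamo.reset()  # drop compiled-code caches between backends
    torch.compile(fn, backend=backend)(x1, x2)


def fake_run(backend):
    # Stand-in that pretends only Inductor is broken.
    if backend == "inductor":
        raise RuntimeError("C++ compile error")


print(first_failing_backend(fake_run))
```

Because the backends are ordered from the fewest components to the most, the first failing entry points at the component that was added last.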

Compilation error¶

As we know, the chain of transformations in graph-level optimization looks like this:

torch.neg (Python) -> torch.ops.aten.neg.default (within FX graph) -> ops.neg (within IR node) -> tmp2 = -tmp1 (within C++ kernel)

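If you encounter a compilation error, something went wrong when compiling the C++ kernels in the output code. This type of error indicates that a bug was introduced when lowering IR nodes to output code. The root cause of a compilation error is usually shown in the traceback log.

For example, suppose the neg function is modified as follows: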

def neg2(x):
    return f"-{x}"

The log then reports a compile error with a rather clear reason.

 torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
 CppCompileError: C++ compile error
 /tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp: In function ‘void kernel(const unsigned char*, const unsigned char*, unsigned char*)’:
 /tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp:17:57: error: no matching function for call to ‘max_propagate_nan(unsigned char&, int&)’
   17 |                 auto tmp3 = max_propagate_nan(tmp0, tmp2);
        |                                                         ^
 In file included from /tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp:2:
 /tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h:27:17: note: candidate: ‘template<class scalar_t> scalar_t max_propagate_nan(scalar_t, scalar_t)’
 27 | inline scalar_t max_propagate_nan(scalar_t a, scalar_t b) {
      |                 ^~~~~~~~~~~~~~~~~
 /tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h:27:17: note:   template argument deduction/substitution failed:
/tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp:17:57: note:   deduced conflicting types for parameter ‘scalar_t’ (‘unsigned char’ and ‘int’)
 17 |                 auto tmp3 = max_propagate_nan(tmp0, tmp2);
      |                                                         ^

Let us also look at the corresponding C++ kernel in the output code and at the IR node.

C++ kernel:

#include "/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                    const unsigned char* in_ptr1,
                    unsigned char* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(8390L); i0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(8L); i1+=static_cast<long>(1L))
            {
                auto tmp0 = in_ptr0[static_cast<long>(i1 + (8L*i0))];
                auto tmp1 = in_ptr1[static_cast<long>(i1)];
                auto tmp2 = -tmp1;
                auto tmp3 = max_propagate_nan(tmp0, tmp2);
                out_ptr0[static_cast<long>(i1 + (8L*i0))] = tmp3;
            }
        }
    }
}

IR node:

buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep('buf0', c0, {c0: 67120})]
buf0.unmet_dependencies = []
buf0.met_dependencies =
    [   MemoryDep('arg0_1', c1, {c0: 8390, c1: 8}),
        MemoryDep('arg1_1', c0, {c0: 67120})]
buf0.users = [NodeUser(node=OUTPUT, can_inplace=False)]
buf0.group.device = cpu
buf0.group.iteration = ((8390, 8), ())
buf0.sizes = ([8390, 8], [])
class buf0_loop_body:
    var_ranges = {z0: 8390, z1: 8}
    index0 = 8*z0 + z1
    index1 = z1
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index1')
        load_1 = ops.load('arg0_1', get_index_1)
        neg = ops.neg(load_1)
        maximum = ops.maximum(load, neg)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf0', get_index_2, maximum, None)
        return store
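To read the loop body above more easily, it can help to interpret it by hand. The sketch below is a pure-Python stand-in, not Inductor code: the function name `run_loop_body` is hypothetical, the buffers are plain lists, and uint8 negation is modeled as wrapping modulo 256. It shows how `index0` walks the flat row-major buffer while `index1` reuses the same row of `arg0_1` for every `z0`, which is how the broadcast is expressed.

```python
def run_loop_body(arg0_1, arg1_1, rows, cols):
    # arg0_1 holds `cols` elements and is broadcast across rows (index1 = z1).
    # arg1_1 and buf0 are flat row-major buffers (index0 = cols * z0 + z1).
    buf0 = [0] * (rows * cols)
    for z0 in range(rows):
        for z1 in range(cols):
            index0 = cols * z0 + z1
            index1 = z1
            load = arg1_1[index0]
            load_1 = arg0_1[index1]
            neg = (-load_1) % 256  # uint8 negation modeled as wrap-around
            maximum = max(load, neg)
            buf0[index0] = maximum
    return buf0


# Tiny 2 x 3 example instead of the 8390 x 8 shape used in the tutorial.
print(run_loop_body([1, 0, 255], [0, 5, 10, 250, 255, 3], rows=2, cols=3))
```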

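According to the traceback log, the compilation error is caused by inconsistent data types of the inputs to max_propagate_nan. By checking the C++ kernel, we can see that tmp2 is no longer unsigned char after applying -: C++ integer promotion turns it into int, while tmp0 is still unsigned char. We can easily match - and max_propagate_nan in the C++ kernel with ops.neg and ops.maximum in the IR node, respectively.

We have now found the root cause: the cpp codegen implementation of ops.neg silently changes the data type when negating.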
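The type mismatch can be mimicked in a few lines of Python. This is a toy model under stated assumptions, not C++ semantics in full: values are represented as hypothetical `(type_name, value)` pairs, unary minus on an unsigned char is modeled as yielding an int (integer promotion), and the `decltype(x)(-x)` cast is modeled as wrapping the result modulo 256.

```python
def negate_promoted(value_u8: int):
    # Models `-x` in C++: unsigned char is promoted to int before negation.
    return ("int", -value_u8)


def negate_with_cast(value_u8: int):
    # Models `decltype(x)(-x)`: negate as int, then convert back to
    # unsigned char, which wraps modulo 256.
    return ("unsigned char", (-value_u8) % 256)


def same_scalar_type(a, b):
    # Models template deduction for max_propagate_nan(scalar_t, scalar_t):
    # both arguments must have exactly the same type.
    return a[0] == b[0]


tmp0 = ("unsigned char", 7)
print(same_scalar_type(tmp0, negate_promoted(5)))
print(same_scalar_type(tmp0, negate_with_cast(5)))
print(negate_with_cast(5))
```

With the cast in place both arguments are unsigned char again, which matches the correct neg implementation shown earlier in this tutorial.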

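Accuracy debugging¶

Otherwise, if the model fails with other errors or has accuracy problems, you can use the PyTorch debugging tool called Minifier.

The core idea of Minifier is to keep removing nodes and inputs from the graph until it finds the minimal graph that still shows the problem. It helps to automatically generate a minified problematic graph through 4 strategies: truncating the suffix, delta debugging, eliminating dead code, and removing unused inputs.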
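The idea of shrinking a graph while the failure persists can be illustrated with a toy example. The code below is not the Minifier implementation: it treats the graph as a flat list of node names, covers only simplified versions of two of the four strategies, and uses a hypothetical predicate `still_fails` in place of actually compiling and running a graph.

```python
def truncate_suffix(nodes, still_fails):
    # Keep the shortest prefix of the graph that still reproduces the failure.
    for end in range(1, len(nodes) + 1):
        if still_fails(nodes[:end]):
            return nodes[:end]
    return nodes


def delta_debug(nodes, still_fails):
    # Greedily drop one node at a time while the failure persists.
    current = list(nodes)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if candidate and still_fails(candidate):
                current = candidate
                changed = True
                break
    return current


def minify(nodes, still_fails):
    return delta_debug(truncate_suffix(nodes, still_fails), still_fails)


graph = ["arg0_1", "arg1_1", "neg", "maximum", "clone", "output"]


def still_fails(candidate):
    # Pretend the bug lives in `neg`, which needs its input `arg0_1`.
    return "neg" in candidate and "arg0_1" in candidate


print(minify(graph, still_fails))
```

The toy run ends with only the input and the faulty node left, which is the kind of result a minifier aims for.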

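We will now show the debugging process for an accuracy problem with the help of Minifier. An accuracy problem refers to the case where the outputs of the eager and inductor backends differ.

For example, we modify the example as follows: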

from torch._dynamo.utils import same

def foo2(x1, x2):
    a = torch.neg(x1)
    b = torch.maximum(x2, a)
    y = torch.cat([b], dim=0)
    return y

x1 = torch.randn((1, 8), dtype=torch.float32)
x2 = torch.randn((8390, 8), dtype=torch.float32)

expected_result = foo2(x1, x2)

compiled_foo2 = torch.compile(foo2)
actual_result = compiled_foo2(x1, x2)

assert same(expected_result, actual_result) == True

And also modify the neg function:

def neg3(x):
    return f"decltype({x})(2 * {x})"

An accuracy problem is then raised as follows:

torch._dynamo.utils: [ERROR] Accuracy failed: allclose not within tol=0.0001
Traceback (most recent call last):
  File "test_script.py", line 18, in <module>
    assert same(expected_result, actual_result) == True
AssertionError
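To see why the comparison fails, the example can be replayed on a tiny input in plain Python. This is only a sketch: `neg_ref`, `neg_buggy`, `foo`, and `same_within_tol` are hypothetical stand-ins, nested lists replace tensors, and the tolerance check is a much simpler rule than the one `torch._dynamo.utils.same` applies.

```python
def neg_ref(x):
    # Reference behaviour, as computed in eager mode.
    return -x


def neg_buggy(x):
    # Mirrors the deliberately wrong "2 * x" codegen.
    return 2 * x


def foo(x1, x2, neg):
    # x1 is a single row that is broadcast against every row of x2.
    return [[max(v, neg(a)) for v, a in zip(row, x1)] for row in x2]


def same_within_tol(expected, actual, tol=1e-4):
    # Simplified element-wise absolute tolerance check.
    return all(
        abs(e - a) <= tol
        for e_row, a_row in zip(expected, actual)
        for e, a in zip(e_row, a_row)
    )


x1 = [1.0, -2.0]
x2 = [[0.0, 0.0]]
expected = foo(x1, x2, neg_ref)
actual = foo(x1, x2, neg_buggy)
print(expected, actual, same_within_tol(expected, actual))
```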

To debug an accuracy problem with Minifier, two environment variables are needed:

TORCHDYNAMO_REPRO_AFTER="aot" TORCHDYNAMO_REPRO_LEVEL=4 python xx.py
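If the failing script is launched from another Python program rather than from a shell, the same two variables can be passed through the child process environment. The helper names `minifier_env` and `run_script_with_minifier` are hypothetical; only the two variable names and their values come from the command above.

```python
import os
import subprocess
import sys


def minifier_env(base_env=None, after="aot", level=4):
    # Copy the environment and add the two variables from the command above.
    env = dict(os.environ if base_env is None else base_env)
    env["TORCHDYNAMO_REPRO_AFTER"] = after
    env["TORCHDYNAMO_REPRO_LEVEL"] = str(level)
    return env


def run_script_with_minifier(script_path):
    # Launch the failing script in a child process with the variables set.
    # Not called here, since it needs a real script to run.
    return subprocess.run([sys.executable, script_path], env=minifier_env(), check=False)


print(minifier_env({}))
```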

This gives us log output that shows the steps of the minifying process:

Started off with 6 nodes

Trying granularity 2
Strategy: Truncate suffix (G: 2) (6 nodes, 2 inputs)
SUCCESS: Went from 6 to 4 nodes

Trying granularity 4
Strategy: Remove unused inputs (G: 4) (4 nodes, 2 inputs)
SUCCESS: Went from 4 to 3 nodes

After running, we get the final minified graph containing the target node neg:

def forward2(self, arg0_1):
    neg = torch.ops.aten.neg.default(arg0_1);  arg0_1 = None
    return (neg,)

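For more usage details about Minifier, please refer to Troubleshooting.

Performance profiling¶

In this section, we demonstrate the process of profiling a model compiled with the Inductor CPU backend. In the example below, we benchmark the Hugging Face Transformer model MobileBertForQuestionAnswering in both eager mode and Inductor graph mode. The execution time and the speedup ratio of Inductor are printed after the benchmark. We use an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz and run the benchmark on the first socket to demonstrate the optimizations in this section. We set the following environment variables as a best practice for benchmarking on Intel(R) CPUs.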

export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export KMP_AFFINITY=granularity=fine,compact,1,0
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
numactl -C 0-31 -m 0 python bench.py

# bench.py
import torch
from transformers import MobileBertForQuestionAnswering
# Initialize an eager model
model = MobileBertForQuestionAnswering.from_pretrained("csarron/mobilebert-uncased-squad-v2")
seq_length = 128
bs = 128
vocab_size = model.config.vocab_size
input = torch.randint(0, vocab_size, (bs, seq_length), dtype=torch.int64)
input_dict = {"input_ids": input}

# Initialize the inductor model
compiled_model = torch.compile(model)
with torch.no_grad():
    compiled_model(**input_dict)

NUM_ITERS=50
import timeit
with torch.no_grad():
    # warmup
    for _ in range(10):
        model(**input_dict)
    eager_t = timeit.timeit("model(**input_dict)", number=NUM_ITERS, globals=globals())

with torch.no_grad():
    # warmup
    for _ in range(10):
        compiled_model(**input_dict)
    inductor_t = timeit.timeit("compiled_model(**input_dict)", number=NUM_ITERS, globals=globals())
print(f"eager use: {eager_t * 1000 / NUM_ITERS} ms/iter")
print(f"inductor use: {inductor_t * 1000 / NUM_ITERS} ms/iter")
print(f"speed up ratio: {eager_t / inductor_t}")


Output:

eager use: 802.1023553796113 ms/iter
inductor use: 339.95180135127157 ms/iter
speed up ratio: 2.359459053287382

In our own testing, we find that the Inductor CPU backend speeds up the model by about 2.36x.
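As a quick check of that figure, the ratio can be recomputed from the two per-iteration times printed above. The variable names are chosen for illustration; the numbers are copied from the Output block.

```python
# Per-iteration latencies copied from the Output block above (milliseconds).
eager_ms = 802.1023553796113
inductor_ms = 339.95180135127157

speedup = eager_ms / inductor_ms
saved_ms = eager_ms - inductor_ms

print(f"speed up ratio: {speedup:.3f}")
print(f"time saved per iteration: {saved_ms:.1f} ms")
```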

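Next, let's dive into the performance at the operator level to understand where the speedup comes from. PyTorch Profiler is a good tool for this. The Inductor CPU backend supports reporting the time of the fused kernels to the profiler through the enable_kernel_profile configuration option: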

from torch._inductor import config
config.cpp.enable_kernel_profile = True

Following the steps in PyTorch Profiler, we are able to get the profiling table and trace files.

# bench.py
from torch.profiler import profile, schedule, ProfilerActivity
RESULT_DIR = "./prof_trace"
my_schedule = schedule(
    skip_first=10,
    wait=5,
    warmup=5,
    active=1,
    repeat=5)

def trace_handler(p):
    output = p.key_averages().table(sort_by="self_cpu_time_total", row_limit=20)
    # print(output)
    p.export_chrome_trace(f"{RESULT_DIR}/{p.step_num}.json")

for _ in range(10):
    model(**input_dict)  # compiled_model(**input_dict) to get inductor model profiling

total = 0
with profile(
    activities=[ProfilerActivity.CPU],
    schedule=my_schedule,
    on_trace_ready=trace_handler
) as p:
    for _ in range(50):
        model(**input_dict)  # compiled_model(**input_dict) to get inductor model profiling
        p.step()
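The schedule arguments decide which of the 50 profiled iterations are actually recorded. The function below is a simplified re-implementation, written from the documented behaviour of `torch.profiler.schedule` as we understand it, and is meant only as a reading aid; `schedule_phase` is a hypothetical name and the phases are plain strings rather than profiler actions.

```python
def schedule_phase(step, skip_first=10, wait=5, warmup=5, active=1, repeat=5):
    # Steps before `skip_first` are ignored entirely.
    if step < skip_first:
        return "none"
    step -= skip_first
    cycle = wait + warmup + active
    # After `repeat` full cycles the profiler stays idle.
    if repeat > 0 and step // cycle >= repeat:
        return "none"
    position = step % cycle
    if position < wait:
        return "none"
    if position < wait + warmup:
        return "warmup"
    return "record"


recorded_steps = [s for s in range(50) if schedule_phase(s) == "record"]
print(recorded_steps)
```

Under these assumptions, each cycle takes 11 steps after the first 10 are skipped, so a 50-iteration loop completes only 3 recording cycles even though repeat is set to 5.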

We get the following profiling table for the eager-mode model (some columns omitted):

-------------------------  ------------  ------------  ------------
                     Name   CPU total %     CPU total    # of Calls
-------------------------  ------------  ------------  ------------
              aten::addmm        45.73%     370.814ms           362
                aten::add        19.89%     161.276ms           363
              aten::copy_        14.97%     121.416ms           488
                aten::mul         9.02%      73.154ms           194
          aten::clamp_min         8.81%      71.444ms            96
                aten::bmm         5.46%      44.258ms            48
            ProfilerStep*       100.00%     810.920ms             1
                aten::div         2.89%      23.447ms            24
           aten::_softmax         1.00%       8.087ms            24
             aten::linear        46.48%     376.888ms           362
              aten::clone         2.77%      22.430ms            98
                  aten::t         0.31%       2.502ms           362
               aten::view         0.14%       1.161ms           850
          aten::transpose         0.17%       1.377ms           386
       aten::index_select         0.12%     952.000us             3
             aten::expand         0.12%     986.000us           458
             aten::matmul         8.31%      67.420ms            48
                aten::cat         0.09%     703.000us             1
         aten::as_strided         0.08%     656.000us           963
               aten::relu         8.86%      71.864ms            96
-------------------------  ------------  ------------  ------------
Self CPU time total: 810.920ms

Similarly, we get the table for the model compiled with Inductor (some columns omitted):

-----------------------------------------------  ------------  ------------  ------------
                                           Name   CPU total %     CPU total    # of Calls
-----------------------------------------------  ------------  ------------  ------------
                               mkl::_mkl_linear        68.79%     231.573ms           362
                                      aten::bmm         8.02%      26.992ms            48
                                  ProfilerStep*       100.00%     336.642ms             1
  graph_0_cpp_fused_constant_pad_nd_embedding_0         0.27%     915.000us             1
                                    aten::empty         0.27%     911.000us           362
 graph_0_cpp_fused__mkl_linear_add_mul_relu_151         0.27%     901.000us             1
 graph_0_cpp_fused__mkl_linear_add_mul_relu_226         0.27%     899.000us             1
 graph_0_cpp_fused__mkl_linear_add_mul_relu_361         0.27%     898.000us             1
 graph_0_cpp_fused__mkl_linear_add_mul_relu_121         0.27%     895.000us             1
  graph_0_cpp_fused__mkl_linear_add_mul_relu_31         0.27%     893.000us             1
  graph_0_cpp_fused__mkl_linear_add_mul_relu_76         0.26%     892.000us             1
 graph_0_cpp_fused__mkl_linear_add_mul_relu_256         0.26%     892.000us             1
 graph_0_cpp_fused__mkl_linear_add_mul_relu_346         0.26%     892.000us             1
 graph_0_cpp_fused__mkl_linear_add_mul_relu_241         0.26%     891.000us             1
 graph_0_cpp_fused__mkl_linear_add_mul_relu_316         0.26%     891.000us             1
  graph_0_cpp_fused__mkl_linear_add_mul_relu_91         0.26%     890.000us             1
 graph_0_cpp_fused__mkl_linear_add_mul_relu_106         0.26%     890.000us             1
 graph_0_cpp_fused__mkl_linear_add_mul_relu_211         0.26%     890.000us             1
  graph_0_cpp_fused__mkl_linear_add_mul_relu_61         0.26%     889.000us             1
 graph_0_cpp_fused__mkl_linear_add_mul_relu_286         0.26%     889.000us             1
-----------------------------------------------  ------------  ------------  ------------
Self CPU time total: 336.642ms

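From the profiling table of the eager model, we can see that the most time-consuming ops are [aten::addmm, aten::add, aten::copy_, aten::mul, aten::clamp_min, aten::bmm]. Comparing with the profiling table of the inductor model, we notice an mkl::_mkl_linear entry and multiple fused kernels named graph_0_cpp_fused_*. These are the major optimizations applied to the inductor model. Let us discuss them separately.

(1) Regarding mkl::_mkl_linear: you may notice that this kernel is called 362 times, exactly the same number as aten::linear in the eager model profiling table. The CPU total of aten::linear is 376.888 ms, while that of mkl::_mkl_linear is 231.573 ms. This suggests a speedup of about 1.63x for the "linear" part. The speedup mainly comes from packing the weight tensor into a block memory format and invoking cblas_sgemm_compute within the Inductor CPU backend to get better cache behavior during GEMM computations.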
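The ratio can be reproduced directly from the two profiling tables. The sketch below also relates the time saved in the linear part to the total time saved between the two "Self CPU time total" lines; that second figure is our own back-of-the-envelope estimate from the tables, not a number reported by the profiler.

```python
# Values copied from the two profiling tables above (milliseconds).
aten_linear_ms = 376.888
mkl_linear_ms = 231.573
eager_total_ms = 810.920
inductor_total_ms = 336.642

linear_speedup = aten_linear_ms / mkl_linear_ms
linear_saved_ms = aten_linear_ms - mkl_linear_ms
total_saved_ms = eager_total_ms - inductor_total_ms
linear_share = linear_saved_ms / total_saved_ms

print(f"linear speedup: {linear_speedup:.2f}x")
print(f"linear part saves {linear_saved_ms:.3f} ms of {total_saved_ms:.3f} ms ({linear_share:.1%})")
```

By this estimate, less than a third of the saved time comes from the linear kernels, which suggests that most of the gain has to come from elsewhere.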

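(2) Regarding the other memory-intensive ops: in our test, the end-to-end latency of the eager/inductor model is 802/339 ms. Subtracting the linear time from each side leaves roughly 425 ms versus 108 ms, so we can roughly infer a speedup of about 3.9x for the other memory-intensive ops. Let's read the generated code to understand how Inductor achieves this. You can find the generated kernel by searching for cpp_fused__mkl_linear_add_mul_relu_151 in output_code.py: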

cpp_fused__mkl_linear_add_mul_relu_151 = async_compile.cpp('''
#include <ATen/record_function.h>
#include "/tmp/torchinductor_root/lr/clrlgu27q4ggd472umdzwsu6qcpqxcuusjxqvx2hwitjbujiiz7z.h"
extern "C" void kernel(float* in_out_ptr0,
                       const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       const float* in_ptr3)
{
    RECORD_FUNCTION("graph_0_cpp_fused__mkl_linear_add_mul_relu_151", c10::ArrayRef<c10::IValue>({}));
    #pragma omp parallel num_threads(32)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(16384L); i0+=static_cast<long>(1L))
            {
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(512L); i1+=static_cast<long>(8L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(i1 + (512L*i0)));
                    auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(i1));
                    auto tmp3 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + static_cast<long>(i1 + (512L*i0)));
                    auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(i1));
                    auto tmp7 = at::vec::Vectorized<float>::loadu(in_ptr3 + static_cast<long>(i1));
                    auto tmp2 = tmp0 + tmp1;
                    auto tmp4 = tmp2 + tmp3;
                    auto tmp6 = tmp4 * tmp5;
                    auto tmp8 = tmp6 + tmp7;
                    tmp8.store(in_out_ptr0 + static_cast<long>(i1 + (512L*i0)));
                }
            }
        }
    }
}''')

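From the generated code above, we can see that this kernel performs a typical loop fusion on [add, add, mul, add]. Executed one by one, these ops are memory-bound, which is the bottleneck that prevents good performance; fusing them avoids repeated passes over memory. To get a more intuitive feeling for this optimization, we can infer the sizes and strides of the inputs and benchmark this [add, add, mul, add] pattern on its own.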

# bench.py
def func(arg_0, arg_1, arg_2, arg_3, arg_4):
    add_0 = arg_0 + arg_1
    add_1 = add_0 + arg_2
    mul_1 = add_1 * arg_3
    add_2 = mul_1 + arg_4
    arg_2 = add_2
    return arg_2

arg_0 = torch.rand(16384, 512)
arg_1 = torch.rand(1, 512)
arg_2 = torch.zeros(16384, 512)
arg_3 = torch.rand(1, 512)
arg_4 = torch.rand(1, 512)

input = (arg_0, arg_1, arg_2, arg_3, arg_4)
inductor_func = torch.compile(func)
with torch.no_grad():
    inductor_func(*input)

import timeit
NUM_ITERS=100
with torch.no_grad():
    # warmup
    for _ in range(10):
        func(*input)
    eager_t = timeit.timeit("func(*input)", number=NUM_ITERS, globals=globals())

with torch.no_grad():
    # warmup
    for _ in range(10):
        inductor_func(*input)
    inductor_t = timeit.timeit("inductor_func(*input)", number=NUM_ITERS, globals=globals())
print(f"eager use: {eager_t * 1000 / NUM_ITERS} ms/iter")
print(f"inductor use: {inductor_t * 1000 / NUM_ITERS} ms/iter")
print(f"speed up ratio: {eager_t / inductor_t}")

Output:

eager use: 5.780875144992024 ms/iter
inductor use: 0.9588955780491233 ms/iter
speed up ratio: 6.0286805751604735

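This is just one example. The profiling table shows that all element-wise ops in this model are fused automatically by Inductor. You can read more of the kernels in output_code.py.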
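The difference between running the four ops one by one and running them as one fused loop can be sketched in plain Python. This is an illustration of the loop fusion technique only, under simplifying assumptions: `unfused` and `fused` are hypothetical names, all inputs are flat lists of equal length (no broadcasting), and Python lists stand in for buffers in memory.

```python
def unfused(a0, a1, a2, a3, a4):
    # Four separate passes, each materialising a full temporary list.
    add_0 = [x + y for x, y in zip(a0, a1)]
    add_1 = [x + y for x, y in zip(add_0, a2)]
    mul_1 = [x * y for x, y in zip(add_1, a3)]
    add_2 = [x + y for x, y in zip(mul_1, a4)]
    return add_2


def fused(a0, a1, a2, a3, a4):
    # One pass: each element is loaded once and the result stored once,
    # with no intermediate buffers.
    out = [0.0] * len(a0)
    for i in range(len(a0)):
        out[i] = ((a0[i] + a1[i]) + a2[i]) * a3[i] + a4[i]
    return out


inputs = (
    [1.0, 2.0, 3.0],
    [0.5, 0.5, 0.5],
    [0.0, 1.0, 2.0],
    [2.0, 2.0, 2.0],
    [1.0, 1.0, 1.0],
)
print(unfused(*inputs))
print(fused(*inputs))
```

Both versions compute the same values; the fused one simply avoids writing and re-reading three intermediate results, which is where a memory-bound pattern spends most of its time.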

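Conclusion¶

This document gives an in-depth tutorial on the Inductor CPU backend.

With motivating examples, we walk through the process of debugging and profiling. The main idea is to narrow down the problem.

We demonstrate step by step how to dig into an issue and find the root cause of a failure, with the help of debug logging and the Minifier tool. First determine which component the failure occurs in, then try to generate the smallest snippet of code that reproduces the failure.

When the performance with Inductor is better than that of eager mode, we provide a solid analytical method for performance profiling. We show how to find the time-consuming hotspots with PyTorch Profiler and how to identify the operator-level or kernel-level reasons that explain the difference.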
