Ascend PyTorch pitfalls

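I installed PyTorch and torch_npu on an Ascend 910B. Because I plan to install vLLM afterwards, torch_npu has to be a special (dev) build rather than the regular release.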

pip install torch==2.5.1 --extra-index /

pip install numpy==1.26.4

mkdir pta
cd pta
wget .5.1/20250320.3/pytorch_v2.5.1_py310.tar.gz
tar -xvf pytorch_v2.5.1_py310.tar.gz
pip install ./torch_npu-2.5.1.dev20250320-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
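Before running anything on the NPU, it can help to confirm that the three packages installed above ended up in the active environment at the expected versions. The following is a minimal sketch that only reads package metadata, so it does not need the NPU driver; the helper name `installed_versions` is made up for illustration and is not part of torch or torch_npu.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names taken from the pip commands above.
    for name, version in installed_versions(["torch", "torch_npu", "numpy"]).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

If torch_npu shows up as not installed, or torch is not 2.5.1, the pip commands most likely ran against a different Python environment than the one being used to run the script.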

Once the installation finishes, run example_npu.py, which contains the following:

import torch
import torch_npu

x = torch.randn(2, 2).npu()
y = torch.randn(2, 2).npu()
z = x.mm(y)

print(z)

However, running python example_npu.py fails with this error:

/data/miniconda3/envs/ascend/lib/python3.10/site-packages/torch_npu/utils/path_manager.py:82: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
  warnings.warn(f"Warning: The {path} owner does not match the current user.")
/data/miniconda3/envs/ascend/lib/python3.10/site-packages/torch_npu/utils/path_manager.py:82: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC2/x86_64-linux/ascend_toolkit_install.info owner does not match the current user.
  warnings.warn(f"Warning: The {path} owner does not match the current user.")
[W NPUCachingAllocator.cpp:623] Warning: expandable_segments currently defaults to false. You can enable this feature by `export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True`. (function operator())
Traceback (most recent call last):
  File "./pta/example_npu.py", line 4, in <module>
    x = torch.randn(2, 2).npu()
  File "/data/miniconda3/envs/ascend/lib/python3.10/site-packages/torch/utils/backend_registration.py", line 153, in wrap_tensor_to
    device_idx = _normalization_device(custom_backend_name, device)
  File "/data/miniconda3/envs/ascend/lib/python3.10/site-packages/torch/utils/backend_registration.py", line 109, in _normalization_device
    return _get_current_device_index()
  File "/data/miniconda3/envs/ascend/lib/python3.10/site-packages/torch/utils/backend_registration.py", line 103, in _get_current_device_index
    return getattr(getattr(torch, custom_backend_name), _get_device_index)()
  File "/data/miniconda3/envs/ascend/lib/python3.10/site-packages/torch_npu/npu/utils.py", line 59, in current_device
    torch_npu.npu._lazy_init()
  File "/data/miniconda3/envs/ascend/lib/python3.10/site-packages/torch_npu/npu/__init__.py", line 214, in _lazy_init
    torch_npu._C._npu_init()
RuntimeError: Initialize:torch_npu/csrc/core/npu/sys_ctrl/npu_sys_ctrl.cpp:217 NPU function error: at_npu::native::AclSetCompileopt(aclCompileOpt::ACL_PRECISION_MODE, precision_mode), error code is 500001
[ERROR] 2025-04-23-11:06:03 (PID:4586, Device:0, RankID:-1) ERR00100 PTA call acl api failed
[Error]: The internal ACL of the system is incorrect.
        Rectify the fault based on the error information in the ascend log.
EC0010: 2025-04-23-11:06:03.331.980 Failed to import Python module [ModuleNotFoundError: No module named 'scipy'.].
        Solution: Check that all required components are properly installed and the specified Python path matches the Python installation directory. (If the path does not match the directory, run set_env.sh in the installation package.)
        TraceBack (most recent call last):
        AOE Failed to call InitCannKB
        [GraphOpt][InitializeInner][InitTbeFunc] Failed to init tbe.[FUNC:InitializeInner][FILE:tbe_op_store_adapter][LINE:1719]
        [SubGraphOpt][PreCompileOp][InitAdapter] InitializeAdapter adapter [tbe_op_adapter] failed! Ret [4294967295][FUNC:InitializeAdapter][FILE:op_store_adapter_manager][LINE:79]
        [SubGraphOpt][PreCompileOp][Init] Initialize op store adapter failed, OpsStoreName[tbe-custom].[FUNC:Initialize][FILE:op_store_adapter_manager][LINE:120]
        [FusionMngr][Init] Op store adapter manager init failed.[FUNC:Initialize][FILE:fusion_manager][LINE:117]
        PluginManager InvokeAll failed.[FUNC:Initialize][FILE:ops_kernel_manager][LINE:82]
        OpsManager initialize failed.[FUNC:InnerInitialize][FILE:gelib][LINE:234]
        GELib::InnerInitialize failed.[FUNC:Initialize][FILE:gelib][LINE:162]
        GEInitialize failed.[FUNC:GEInitialize][FILE:ge_api][LINE:334]
        [Initialize][Ge]GEInitialize failed. ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        [Init][Compiler]Init compiler failed[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]

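At first glance the output is overwhelming, and it is not at all obvious what went wrong. The RuntimeError and the long GE/TBE call chain are only symptoms. Reading carefully, the real cause is the EC0010 line: "Failed to import Python module [ModuleNotFoundError: No module named 'scipy'.]".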
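Picking that one line out of a long CANN trace by eye is tedious, so a small filter can do it. The sketch below is an illustration only: `find_missing_modules` is a hypothetical helper, and it assumes the missing dependency is always reported in Python's standard `No module named '...'` wording, as it is in the log above.

```python
import re

# Matches Python's standard ModuleNotFoundError message.
MISSING_MODULE = re.compile(r"No module named '([^']+)'")


def find_missing_modules(log_text):
    """Return the module names reported as missing, in order, without duplicates."""
    seen = []
    for name in MISSING_MODULE.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Two lines copied from the error output above.
sample = """\
[Error]: The internal ACL of the system is incorrect.
EC0010: 2025-04-23-11:06:03.331.980 Failed to import Python module [ModuleNotFoundError: No module named 'scipy'.].
"""

print(find_missing_modules(sample))  # ['scipy']
```

To use it on a real run, redirect the script's stderr to a file and pass the file contents to the function.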

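Run pip install scipy and execute the script again. It then fails with a similar error for another missing dependency. After repeating this install-and-retry cycle a few times, the script runs normally and prints: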

Warning: Device do not support double dtype now, dtype cast repalce with float.
tensor([[ 1.0745,  1.2646],
        [-1.1924, -2.2859]], device='npu:0')
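The install-and-retry loop can be shortened by checking for several likely dependencies in one pass. The sketch below is only an assumption about what might be missing: apart from scipy, the original error logs do not name the other modules, so the `CANDIDATES` table is a guess at Python packages that CANN's compiler components commonly import and should be adjusted to whatever your own logs report. `missing_packages` is a hypothetical helper name.

```python
from importlib.util import find_spec

# ASSUMPTION: only scipy is confirmed by the log above. The other entries are
# illustrative guesses; edit this table to match your own error messages.
# Keys are import names, values are the names to pass to pip.
CANDIDATES = {
    "scipy": "scipy",
    "decorator": "decorator",
    "attr": "attrs",
    "psutil": "psutil",
    "yaml": "pyyaml",
    "sympy": "sympy",
    "cffi": "cffi",
}


def missing_packages(candidates):
    """Return pip names for every top-level module that cannot be located."""
    return [pip_name for module, pip_name in candidates.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages(CANDIDATES)
    if missing:
        print("pip install " + " ".join(missing))
    else:
        print("all candidate modules are importable")
```

The script only prints a suggested pip command instead of running it, so nothing is installed without being reviewed first.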