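Installing Paddle on Ascend 910B and troubleshooting runtime errors

Installing the latest Paddle release on the 910B is fairly easy.

Installation

Following the official site's instructions: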

pip install paddlepaddle==3.0.0 -i /
pip install paddle-custom-npu==3.0.0 -i /

Some third-party dependencies may also be missing and will raise errors later at run time, so install them now as well:

pip install scikit-image albumentations pyclipper shapely lmdb rapidfuzz visualdl
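
To see which of the packages listed above are still missing before running anything heavier, a small check along these lines may help. This is only a sketch: the function name `missing_packages` is made up for illustration, and the mapping assumes the usual import names (for example, `scikit-image` is imported as `skimage`).

```python
import importlib.util

# pip package name -> module name used in "import ..."
REQUIRED = {
    "scikit-image": "skimage",
    "albumentations": "albumentations",
    "pyclipper": "pyclipper",
    "shapely": "shapely",
    "lmdb": "lmdb",
    "rapidfuzz": "rapidfuzz",
    "visualdl": "visualdl",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be located."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", ", ".join(missing) if missing else "none")
```

Anything it reports can be passed straight to `pip install`.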

Run Paddle's built-in check to verify the installation:

python -c "import paddle; paddle.utils.run_check()"

First error: libmki.so: cannot open shared object file

I0425 19:27:30.366289 197268 init:237] ENV [CUSTOM_DEVICE_ROOT]=/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device
I0425 19:27:30.366333 197268 init:146] Try loading custom device libs from: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/__init__.py", line 38, in <module>
    from .base import core  # noqa: F401
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/base/__init__.py", line 204, in <module>
    __bootstrap__()
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/base/__init__.py", line 196, in __bootstrap__
    core.init_devices()
ValueError: (InvalidArgument) Fail to open library: /data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so with error: libmki.so: cannot open shared object file: No such file or directory
  [Hint: dso_handle should not be null.] (at /paddle/paddle/fluid/platform/init:150)

libmki.so cannot be found. Searching with find /usr/local/Ascend/ -name "libmki.so" shows that it lives at /usr/local/Ascend/nnal/atb/8.0.0/atb/cxx_abi_0/lib/libmki.so.

Run the following command to add that directory to LD_LIBRARY_PATH.

source /usr/local/Ascend/nnal/atb/set_env.sh
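
Because `source` only changes the current shell, it can be worth confirming that the library is now reachable from the same shell that will run Paddle. The following is a rough sketch, not part of Paddle or the Ascend toolkit: `find_on_ld_path` is a hypothetical helper, and it only inspects LD_LIBRARY_PATH (the dynamic loader also consults other locations such as the ld.so cache).

```python
import os


def find_on_ld_path(lib_name, ld_path=None):
    """Return the LD_LIBRARY_PATH directories that contain lib_name."""
    if ld_path is None:
        ld_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in ld_path.split(os.pathsep):
        # Skip empty entries produced by leading/trailing/double colons.
        if directory and os.path.isfile(os.path.join(directory, lib_name)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    dirs = find_on_ld_path("libmki.so")
    print(dirs if dirs else "libmki.so is not on LD_LIBRARY_PATH")
```

If the list is empty after sourcing set_env.sh, the script was probably sourced in a different shell session.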

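Run the check again, and a second error appears.

Second error: ValueError: paddle.distributed initialize error

On a single card, the check has already succeeded at this point.

With multiple cards, however, the single-card check passes and the multi-card check fails.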

python -c "import paddle; paddle.utils.run_check()"

which: no ccache in (/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/data/miniconda3/envs/ascend-3.10.14/bin:/data/miniconda3/condabin:/data/go/bin:/data/jdk1.8.0_144/bin:/data/npu/cmake-3.19.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/app/.ft:/data/apache-maven-3.6.3/bin:/usr/local/app/.ft)
/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: .md
  warnings.warn(warning_message)
I0425 19:39:24.058059 247656 init:237] ENV [CUSTOM_DEVICE_ROOT]=/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device
I0425 19:39:24.058089 247656 init:146] Try loading custom device libs from: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:24.602092 247656 custom_device_load:52] Succeed in loading custom runtime in lib: /data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0425 19:39:24.602130 247656 custom_device_load:59] Skipped lib [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so]: no custom engine Plugin symbol in this lib.
I0425 19:39:24.605474 247656 custom_kernel:63] Succeed in loading 358 custom kernel(s) from loaded lib(s), will be used like native ones.
I0425 19:39:24.605615 247656 init:158] Finished in LoadCustomDevice with libs_path: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:24.605638 247656 init:243] CustomDevice: npu, visible devices count: 4
Running verify PaddlePaddle program ... 
I0425 19:39:25.046949 247656 pir_interpreter:1541] New Executor is Running ...
I0425 19:39:25.049239 247656 pir_interpreter:1564] pir interpreter is running by multi-thread mode ...
PaddlePaddle works well on 1 npu.
which: no ccache in (/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/data/miniconda3/envs/ascend-3.10.14/bin:/data/miniconda3/condabin:/data/go/bin:/data/jdk1.8.0_144/bin:/data/npu/cmake-3.19.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/app/.ft:/data/apache-maven-3.6.3/bin:/usr/local/app/.ft)
which: no ccache in (/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/data/miniconda3/envs/ascend-3.10.14/bin:/data/miniconda3/condabin:/data/go/bin:/data/jdk1.8.0_144/bin:/data/npu/cmake-3.19.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/app/.ft:/data/apache-maven-3.6.3/bin:/usr/local/app/.ft)
/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: .md
  warnings.warn(warning_message)
/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: .md
  warnings.warn(warning_message)
which: no ccache in (/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/data/miniconda3/envs/ascend-3.10.14/bin:/data/miniconda3/condabin:/data/go/bin:/data/jdk1.8.0_144/bin:/data/npu/cmake-3.19.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/app/.ft:/data/apache-maven-3.6.3/bin:/usr/local/app/.ft)
/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: .md
  warnings.warn(warning_message)
which: no ccache in (/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/data/miniconda3/envs/ascend-3.10.14/bin:/data/miniconda3/condabin:/data/go/bin:/data/jdk1.8.0_144/bin:/data/npu/cmake-3.19.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/app/.ft:/data/apache-maven-3.6.3/bin:/usr/local/app/.ft)
/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: .md
  warnings.warn(warning_message)
I0425 19:39:34.973912 248453 init:237] ENV [CUSTOM_DEVICE_ROOT]=/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device
I0425 19:39:34.973942 248453 init:146] Try loading custom device libs from: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:34.974052 248454 init:237] ENV [CUSTOM_DEVICE_ROOT]=/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device
I0425 19:39:34.974077 248454 init:146] Try loading custom device libs from: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:34.990751 248455 init:237] ENV [CUSTOM_DEVICE_ROOT]=/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device
I0425 19:39:34.990775 248455 init:146] Try loading custom device libs from: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:34.993745 248452 init:237] ENV [CUSTOM_DEVICE_ROOT]=/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device
I0425 19:39:34.993765 248452 init:146] Try loading custom device libs from: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:35.513886 248453 custom_device_load:52] Succeed in loading custom runtime in lib: /data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0425 19:39:35.513922 248453 custom_device_load:59] Skipped lib [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so]: no custom engine Plugin symbol in this lib.
I0425 19:39:35.514699 248454 custom_device_load:52] Succeed in loading custom runtime in lib: /data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0425 19:39:35.514722 248454 custom_device_load:59] Skipped lib [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so]: no custom engine Plugin symbol in this lib.
I0425 19:39:35.516956 248453 custom_kernel:63] Succeed in loading 358 custom kernel(s) from loaded lib(s), will be used like native ones.
I0425 19:39:35.517093 248453 init:158] Finished in LoadCustomDevice with libs_path: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:35.517115 248453 init:243] CustomDevice: npu, visible devices count: 4
I0425 19:39:35.517756 248454 custom_kernel:63] Succeed in loading 358 custom kernel(s) from loaded lib(s), will be used like native ones.
I0425 19:39:35.517896 248454 init:158] Finished in LoadCustomDevice with libs_path: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:35.517915 248454 init:243] CustomDevice: npu, visible devices count: 4
I0425 19:39:35.523406 248455 custom_device_load:52] Succeed in loading custom runtime in lib: /data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0425 19:39:35.523430 248455 custom_device_load:59] Skipped lib [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so]: no custom engine Plugin symbol in this lib.
I0425 19:39:35.526330 248455 custom_kernel:63] Succeed in loading 358 custom kernel(s) from loaded lib(s), will be used like native ones.
I0425 19:39:35.526466 248455 init:158] Finished in LoadCustomDevice with libs_path: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:35.526491 248455 init:243] CustomDevice: npu, visible devices count: 4
I0425 19:39:35.539834 248452 custom_device_load:52] Succeed in loading custom runtime in lib: /data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0425 19:39:35.539861 248452 custom_device_load:59] Skipped lib [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so]: no custom engine Plugin symbol in this lib.
I0425 19:39:35.542869 248452 custom_kernel:63] Succeed in loading 358 custom kernel(s) from loaded lib(s), will be used like native ones.
I0425 19:39:35.542997 248452 init:158] Finished in LoadCustomDevice with libs_path: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:35.543021 248452 init:243] CustomDevice: npu, visible devices count: 4
======================= Modified FLAGS detected =======================
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)
=======================================================================
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)
======================= Modified FLAGS detected =======================
=======================================================================
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)
=======================================================================
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)
=======================================================================
[2025-04-25 19:39:36,304] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 4 npus. This may be caused by:
 1. There is not enough GPUs visible on your system
 2. Some GPUs are occupied by other process now
 3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on  
 to test your NCCL, or reinstall it following .html
[2025-04-25 19:39:36,304] [ WARNING] install_check.py:297 - 
 Original Error is: 

----------------------------------------------
Process 1 terminated with the following error:
----------------------------------------------

Traceback (most recent call last):
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 380, in _func_wrapper
    result = func(*args)
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/install_check.py", line 183, in train_for_run_parallel
    paddle.distributed.init_parallel_env()
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 1067, in init_parallel_env
    _check_var_exists(FLAGS_selected_custom_devices)
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 947, in _check_var_exists
    raise ValueError(
ValueError: paddle.distributed initialize error, environment variable FLAGS_selected_s is needed, but not set.

PaddlePaddle is installed successfully ONLY for single npu! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/install_check.py", line 302, in run_check
    raise e
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/install_check.py", line 283, in run_check
    _run_parallel(device_list)
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/install_check.py", line 212, in _run_parallel
    paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 627, in spawn
    while not context.join():
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 431, in join
    self._throw_exception(error_index)
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 453, in _throw_exception
    raise Exception(msg)
Exception: 

----------------------------------------------
Process 1 terminated with the following error:
----------------------------------------------

Traceback (most recent call last):
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 380, in _func_wrapper
    result = func(*args)
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/install_check.py", line 183, in train_for_run_parallel
    paddle.distributed.init_parallel_env()
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 1067, in init_parallel_env
    _check_var_exists(FLAGS_selected_custom_devices)
  File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 947, in _check_var_exists
    raise ValueError(
ValueError: paddle.distributed initialize error, environment variable FLAGS_selected_s is needed, but not set.

The important lines are:

PaddlePaddle meets some problem with 4 npus. This may be caused by:
ValueError: paddle.distributed initialize error, environment variable FLAGS_selected_s is needed, but not set.

PaddlePaddle is installed successfully ONLY for single npu!
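
With four processes writing to the same terminal, picking these lines out by eye is tedious. One way to do it is a small filter like the one below. It is a sketch only: `key_lines` and the pattern are my own choices, tuned to the messages seen in this particular log, not anything Paddle provides.

```python
import re

# Messages worth keeping from a run_check() log (assumed from the log above).
KEY_PATTERN = re.compile(
    r"ValueError|PaddlePaddle (works well|meets some problem|is installed)"
)


def key_lines(log_text):
    """Keep lines matching KEY_PATTERN, dropping exact duplicates in order."""
    seen = set()
    kept = []
    for line in log_text.splitlines():
        line = line.strip()
        if KEY_PATTERN.search(line) and line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


sample = """\
PaddlePaddle works well on 1 npu.
I0425 19:39:35.543021 248452 init:243] CustomDevice: npu, visible devices count: 4
ValueError: paddle.distributed initialize error, environment variable FLAGS_selected_s is needed, but not set.
ValueError: paddle.distributed initialize error, environment variable FLAGS_selected_s is needed, but not set.
PaddlePaddle is installed successfully ONLY for single npu! Let's start deep learning with PaddlePaddle now.
"""
for line in key_lines(sample):
    print(line)
```

The duplicate ValueError is collapsed because the same traceback is printed once by the worker and once by the parent process.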

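Following run_check into init_parallel_env (paddle/distributed/parallel.py, as the traceback shows), the error is raised here. parallel_env.device_type is an empty string, which is why the variable name in the message comes out as FLAGS_selected_s.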

    if backend == "xccl":
        FLAGS_selected_custom_devices = (
            f'FLAGS_selected_{parallel_env.device_type}s'
        )
        _check_var_exists(FLAGS_selected_custom_devices)

def _check_var_exists(var_name):
    var = getenv_or_backup(var_name, None)
    if var is None:
        raise ValueError(
            "paddle.distributed initialize error, "
            f"environment variable {var_name} is needed, but not set."
        )

parallel_env.device_type is assigned in ParallelEnv.__init__, and its value comes from os.getenv("PADDLE_XCCL_BACKEND", "").

    def __init__(self):
        self._rank = int(os.getenv("PADDLE_TRAINER_ID", "0"))
        self._world_size = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
        self._device_type = str(os.getenv("PADDLE_XCCL_BACKEND", ""))
        self._pg_timeout = int(os.getenv("PADDLE_PG_TIMEOUT", "1800000"))

        # imperative only support one gpu or xpu
        if self._device_type != "":
            FLAGS_selected_custom_devices = (
                f'FLAGS_selected_{self._device_type}s'
            )
            selected_custom_devices = os.getenv(
                FLAGS_selected_custom_devices, "0"
            ).split(",")
            self._device_id = int(selected_custom_devices[0])
        else:
            if core.is_compiled_with_cuda():
                selected_gpus = os.getenv("FLAGS_selected_gpus", "0").split(",")
                self._device_id = int(selected_gpus[0])
            elif core.is_compiled_with_xpu():
                selected_xpus = os.getenv("FLAGS_selected_xpus", "0").split(",")
                self._device_id = int(selected_xpus[0])
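
Putting the two excerpts together, the odd variable name in the error message can be reproduced without Paddle at all. The snippet below is a standalone sketch that mirrors only the string formatting; `selected_flag_name` is a made-up helper, not a Paddle function.

```python
import os


def selected_flag_name(environ=None):
    """Build the variable name the same way the excerpts above do."""
    environ = os.environ if environ is None else environ
    device_type = str(environ.get("PADDLE_XCCL_BACKEND", ""))
    return f"FLAGS_selected_{device_type}s"


# PADDLE_XCCL_BACKEND unset: the device type is "", so the name is malformed.
print(selected_flag_name({}))                              # FLAGS_selected_s
# With the backend named, the expected per-device flag name appears.
print(selected_flag_name({"PADDLE_XCCL_BACKEND": "npu"}))  # FLAGS_selected_npus
```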

So on an NPU, PADDLE_XCCL_BACKEND presumably has to be set to something. Should it be npu?

Searching the Paddle GitHub repositories for PADDLE_XCCL_BACKEND turns up the following line:

ENV PADDLE_XCCL_BACKEND=npu

That confirms it.

Prefix the command with this variable and run it again.

> PADDLE_XCCL_BACKEND=npu python -c "import paddle; paddle.utils.run_check()"

======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)
=======================================================================
I0425 21:02:37.260022  8098 tcp_utils:185] The server starts to listen on IP_ANY:48893; setting synclog to 2048
I0425 21:02:37.260236  8098 tcp_utils:134] Successfully connected to 127.0.0.1:48893
I0425 21:02:40.243685  8101 tcp_utils:134] Successfully connected to 127.0.0.1:48893
I0425 21:02:40.243683  8100 tcp_utils:134] Successfully connected to 127.0.0.1:48893
I0425 21:02:40.247510  8099 tcp_utils:134] Successfully connected to 127.0.0.1:48893
...........I0425 21:03:35.296976  8150 tcp_store:292] receive shutdown event and so quit from MasterDaemon run loop
PaddlePaddle works well on 4 npus.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

The check passes.
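
The inline `PADDLE_XCCL_BACKEND=npu` prefix applies to that one command only. When the check is launched from a Python script instead, the variable can be put into the child's environment as sketched below. The helper names `env_with_backend` and `run_paddle_check` are hypothetical; the only Paddle call used is the same `paddle.utils.run_check()` as above.

```python
import os
import subprocess
import sys


def env_with_backend(backend="npu", base=None):
    """Copy the environment and set PADDLE_XCCL_BACKEND unless already set."""
    env = dict(os.environ if base is None else base)
    env.setdefault("PADDLE_XCCL_BACKEND", backend)
    return env


def run_paddle_check():
    """Run paddle.utils.run_check() in a child process; return its exit code."""
    cmd = [sys.executable, "-c", "import paddle; paddle.utils.run_check()"]
    return subprocess.run(cmd, env=env_with_backend()).returncode
```

Calling `run_paddle_check()` on the NPU machine should then behave like the command line above, assuming set_env.sh was already sourced in the parent shell so that LD_LIBRARY_PATH is inherited too.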