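Installing Paddle on Ascend 910B and troubleshooting runtime errors
Installing the latest Paddle release on the 910B is fairly straightforward.
Installation
Follow the official website:
pip install paddlepaddle==3.0.0 -i /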
pip install paddle-custom-npu==3.0.0 -i /
Some third-party dependencies may also be missing and will raise errors later at run time, so install them up front:
pip install scikit-image albumentations pyclipper shapely lmdb rapidfuzz visualdl
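To see ahead of time which of these dependencies are still missing, a small check along the following lines may help. It is only a sketch: the helper name `missing_modules` is made up for illustration, and the import names are assumed to match the pip package names except for scikit-image, which is imported as `skimage`.

```python
import importlib.util

# Import names assumed for the pip packages listed above;
# scikit-image is imported as "skimage".
DEPENDENCY_MODULES = [
    "skimage",
    "albumentations",
    "pyclipper",
    "shapely",
    "lmdb",
    "rapidfuzz",
    "visualdl",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("missing:", missing_modules(DEPENDENCY_MODULES))
```

Anything it prints as missing can be installed with pip before running the check below.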
Run Paddle's check command to verify that the installation succeeded:
python -c "import paddle; paddle.utils.run_check()"
First error: libmki.so: cannot open shared object file
I0425 19:27:30.366289 197268 init:237] ENV [CUSTOM_DEVICE_ROOT]=/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device
I0425 19:27:30.366333 197268 init:146] Try loading custom device libs from: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/__init__.py", line 38, in <module>
from .base import core # noqa: F401
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/base/__init__.py", line 204, in <module>
__bootstrap__()
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/base/__init__.py", line 196, in __bootstrap__
core.init_devices()
ValueError: (InvalidArgument) Fail to open library: /data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so with error: libmki.so: cannot open shared object file: No such file or directory
[Hint: dso_handle should not be null.] (at /paddle/paddle/fluid/platform/init:150)
libmki.so was not found. Searching with find /usr/local/Ascend/ -name "libmki.so" shows that it is located at /usr/local/Ascend/nnal/atb/8.0.0/atb/cxx_abi_0/lib/libmki.so.
Run the following command to add that path to LD_LIBRARY_PATH:
source /usr/local/Ascend/nnal/atb/set_env.sh
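The find-and-compare step can also be scripted. The sketch below is not part of Paddle or the Ascend toolkit; `find_library` and `on_library_path` are hypothetical helpers that locate the file and test whether its directory already appears in an LD_LIBRARY_PATH string. Paths are resolved with `os.path.realpath`, on the assumption that a `latest` entry may be a symlink to a versioned directory such as `8.0.0`.

```python
import os


def find_library(root, filename):
    """Walk root and return every directory that contains filename."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            hits.append(dirpath)
    return sorted(hits)


def on_library_path(directory, ld_library_path):
    """True if directory is one of the entries of an LD_LIBRARY_PATH string."""
    # realpath resolves symlinks, so "latest" and a versioned path can match.
    entries = {os.path.realpath(p) for p in ld_library_path.split(":") if p}
    return os.path.realpath(directory) in entries


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    for directory in find_library("/usr/local/Ascend", "libmki.so"):
        print(directory, "on LD_LIBRARY_PATH:", on_library_path(directory, current))
```

Running it once before and once after sourcing set_env.sh should show whether the script added the directory.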
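Run the check again; a second error appears.
Second error: ValueError: paddle.distributed initialize error
With a single card, the check already succeeds at this point.
With multiple cards, however, the single-card check passes and the multi-card check fails.
python -c "import paddle; paddle.utils.run_check()"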
which: no ccache in (/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/data/miniconda3/envs/ascend-3.10.14/bin:/data/miniconda3/condabin:/data/go/bin:/data/jdk1.8.0_144/bin:/data/npu/cmake-3.19.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/app/.ft:/data/apache-maven-3.6.3/bin:/usr/local/app/.ft)
/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: .md
warnings.warn(warning_message)
I0425 19:39:24.058059 247656 init:237] ENV [CUSTOM_DEVICE_ROOT]=/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device
I0425 19:39:24.058089 247656 init:146] Try loading custom device libs from: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:24.602092 247656 custom_device_load:52] Succeed in loading custom runtime in lib: /data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0425 19:39:24.602130 247656 custom_device_load:59] Skipped lib [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so]: no custom engine Plugin symbol in this lib.
I0425 19:39:24.605474 247656 custom_kernel:63] Succeed in loading 358 custom kernel(s) from loaded lib(s), will be used like native ones.
I0425 19:39:24.605615 247656 init:158] Finished in LoadCustomDevice with libs_path: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:24.605638 247656 init:243] CustomDevice: npu, visible devices count: 4
Running verify PaddlePaddle program ...
I0425 19:39:25.046949 247656 pir_interpreter:1541] New Executor is Running ...
I0425 19:39:25.049239 247656 pir_interpreter:1564] pir interpreter is running by multi-thread mode ...
PaddlePaddle works well on 1 npu.
which: no ccache in (/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/data/miniconda3/envs/ascend-3.10.14/bin:/data/miniconda3/condabin:/data/go/bin:/data/jdk1.8.0_144/bin:/data/npu/cmake-3.19.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/app/.ft:/data/apache-maven-3.6.3/bin:/usr/local/app/.ft)
which: no ccache in (/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/data/miniconda3/envs/ascend-3.10.14/bin:/data/miniconda3/condabin:/data/go/bin:/data/jdk1.8.0_144/bin:/data/npu/cmake-3.19.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/app/.ft:/data/apache-maven-3.6.3/bin:/usr/local/app/.ft)
/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: .md
warnings.warn(warning_message)
/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: .md
warnings.warn(warning_message)
which: no ccache in (/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/data/miniconda3/envs/ascend-3.10.14/bin:/data/miniconda3/condabin:/data/go/bin:/data/jdk1.8.0_144/bin:/data/npu/cmake-3.19.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/app/.ft:/data/apache-maven-3.6.3/bin:/usr/local/app/.ft)
/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: .md
warnings.warn(warning_message)
which: no ccache in (/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/Ascend/ascend-toolkit/latest/tools/ccec_compiler/bin:/data/miniconda3/envs/ascend-3.10.14/bin:/data/miniconda3/condabin:/data/go/bin:/data/jdk1.8.0_144/bin:/data/npu/cmake-3.19.8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/app/.ft:/data/apache-maven-3.6.3/bin:/usr/local/app/.ft)
/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: .md
warnings.warn(warning_message)
I0425 19:39:34.973912 248453 init:237] ENV [CUSTOM_DEVICE_ROOT]=/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device
I0425 19:39:34.973942 248453 init:146] Try loading custom device libs from: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:34.974052 248454 init:237] ENV [CUSTOM_DEVICE_ROOT]=/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device
I0425 19:39:34.974077 248454 init:146] Try loading custom device libs from: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:34.990751 248455 init:237] ENV [CUSTOM_DEVICE_ROOT]=/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device
I0425 19:39:34.990775 248455 init:146] Try loading custom device libs from: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:34.993745 248452 init:237] ENV [CUSTOM_DEVICE_ROOT]=/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device
I0425 19:39:34.993765 248452 init:146] Try loading custom device libs from: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:35.513886 248453 custom_device_load:52] Succeed in loading custom runtime in lib: /data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0425 19:39:35.513922 248453 custom_device_load:59] Skipped lib [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so]: no custom engine Plugin symbol in this lib.
I0425 19:39:35.514699 248454 custom_device_load:52] Succeed in loading custom runtime in lib: /data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0425 19:39:35.514722 248454 custom_device_load:59] Skipped lib [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so]: no custom engine Plugin symbol in this lib.
I0425 19:39:35.516956 248453 custom_kernel:63] Succeed in loading 358 custom kernel(s) from loaded lib(s), will be used like native ones.
I0425 19:39:35.517093 248453 init:158] Finished in LoadCustomDevice with libs_path: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:35.517115 248453 init:243] CustomDevice: npu, visible devices count: 4
I0425 19:39:35.517756 248454 custom_kernel:63] Succeed in loading 358 custom kernel(s) from loaded lib(s), will be used like native ones.
I0425 19:39:35.517896 248454 init:158] Finished in LoadCustomDevice with libs_path: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:35.517915 248454 init:243] CustomDevice: npu, visible devices count: 4
I0425 19:39:35.523406 248455 custom_device_load:52] Succeed in loading custom runtime in lib: /data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0425 19:39:35.523430 248455 custom_device_load:59] Skipped lib [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so]: no custom engine Plugin symbol in this lib.
I0425 19:39:35.526330 248455 custom_kernel:63] Succeed in loading 358 custom kernel(s) from loaded lib(s), will be used like native ones.
I0425 19:39:35.526466 248455 init:158] Finished in LoadCustomDevice with libs_path: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:35.526491 248455 init:243] CustomDevice: npu, visible devices count: 4
I0425 19:39:35.539834 248452 custom_device_load:52] Succeed in loading custom runtime in lib: /data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so
I0425 19:39:35.539861 248452 custom_device_load:59] Skipped lib [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device/libpaddle-custom-npu.so]: no custom engine Plugin symbol in this lib.
I0425 19:39:35.542869 248452 custom_kernel:63] Succeed in loading 358 custom kernel(s) from loaded lib(s), will be used like native ones.
I0425 19:39:35.542997 248452 init:158] Finished in LoadCustomDevice with libs_path: [/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle_custom_device]
I0425 19:39:35.543021 248452 init:243] CustomDevice: npu, visible devices count: 4
======================= Modified FLAGS detected =======================
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)
=======================================================================
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)
======================= Modified FLAGS detected =======================
=======================================================================
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)
=======================================================================
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)
=======================================================================
[2025-04-25 19:39:36,304] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 4 npus. This may be caused by:
1. There is not enough GPUs visible on your system
2. Some GPUs are occupied by other process now
3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on
to test your NCCL, or reinstall it following .html
[2025-04-25 19:39:36,304] [ WARNING] install_check.py:297 -
Original Error is:
----------------------------------------------
Process 1 terminated with the following error:
----------------------------------------------
Traceback (most recent call last):
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 380, in _func_wrapper
result = func(*args)
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/install_check.py", line 183, in train_for_run_parallel
paddle.distributed.init_parallel_env()
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 1067, in init_parallel_env
_check_var_exists(FLAGS_selected_custom_devices)
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 947, in _check_var_exists
raise ValueError(
ValueError: paddle.distributed initialize error, environment variable FLAGS_selected_s is needed, but not set.
PaddlePaddle is installed successfully ONLY for single npu! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/install_check.py", line 302, in run_check
raise e
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/install_check.py", line 283, in run_check
_run_parallel(device_list)
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/install_check.py", line 212, in _run_parallel
paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 627, in spawn
while not context.join():
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 431, in join
self._throw_exception(error_index)
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 453, in _throw_exception
raise Exception(msg)
Exception:
----------------------------------------------
Process 1 terminated with the following error:
----------------------------------------------
Traceback (most recent call last):
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 380, in _func_wrapper
result = func(*args)
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/utils/install_check.py", line 183, in train_for_run_parallel
paddle.distributed.init_parallel_env()
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 1067, in init_parallel_env
_check_var_exists(FLAGS_selected_custom_devices)
File "/data/miniconda3/envs/ascend-3.10.14/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 947, in _check_var_exists
raise ValueError(
ValueError: paddle.distributed initialize error, environment variable FLAGS_selected_s is needed, but not set.
The important lines are:
PaddlePaddle meets some problem with 4 npus. This may be caused by:
ValueError: paddle.distributed initialize error, environment variable FLAGS_selected_s is needed, but not set.
PaddlePaddle is installed successfully ONLY for single npu!
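Looking further into the run_check source shows that the error is raised at the following location, where parallel_env.device_type is empty, which is why the variable name in the message comes out as FLAGS_selected_s.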
if backend == "xccl":
    FLAGS_selected_custom_devices = (
        f'FLAGS_selected_{parallel_env.device_type}s'
    )
    _check_var_exists(FLAGS_selected_custom_devices)

def _check_var_exists(var_name):
    var = getenv_or_backup(var_name, None)
    if var is None:
        raise ValueError(
            "paddle.distributed initialize error, "
            f"environment variable {var_name} is needed, but not set."
        )
parallel_env.device_type is assigned in ParallelEnv.__init__, and its value is taken from os.getenv("PADDLE_XCCL_BACKEND", "").
def __init__(self):
    self._rank = int(os.getenv("PADDLE_TRAINER_ID", "0"))
    self._world_size = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
    self._device_type = str(os.getenv("PADDLE_XCCL_BACKEND", ""))
    self._pg_timeout = int(os.getenv("PADDLE_PG_TIMEOUT", "1800000"))
    # imperative only support one gpu or xpu
    if self._device_type != "":
        FLAGS_selected_custom_devices = (
            f'FLAGS_selected_{self._device_type}s'
        )
        selected_custom_devices = os.getenv(
            FLAGS_selected_custom_devices, "0"
        ).split(",")
        self._device_id = int(selected_custom_devices[0])
    else:
        if core.is_compiled_with_cuda():
            selected_gpus = os.getenv("FLAGS_selected_gpus", "0").split(",")
            self._device_id = int(selected_gpus[0])
        elif core.is_compiled_with_xpu():
            selected_xpus = os.getenv("FLAGS_selected_xpus", "0").split(",")
            self._device_id = int(selected_xpus[0])
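The chain from environment variable to error message can be reproduced without Paddle. The following is a simplified sketch, not Paddle's actual code: it reads from a plain dict instead of the process environment, it replaces getenv_or_backup with a dictionary lookup, and the function names are invented for illustration.

```python
def selected_devices_flag(env):
    """Build the flag name the way the excerpts above do, from a dict-like env."""
    device_type = str(env.get("PADDLE_XCCL_BACKEND", ""))
    return f"FLAGS_selected_{device_type}s"


def check_var_exists(var_name, env):
    """Simplified stand-in for _check_var_exists (no backup lookup)."""
    if env.get(var_name) is None:
        raise ValueError(
            "paddle.distributed initialize error, "
            f"environment variable {var_name} is needed, but not set."
        )


if __name__ == "__main__":
    # Without PADDLE_XCCL_BACKEND the device type is empty.
    print(selected_devices_flag({}))
    # With it set to npu, the flag name carries the device type.
    print(selected_devices_flag({"PADDLE_XCCL_BACKEND": "npu"}))
```

With an empty environment the flag name degenerates to FLAGS_selected_s, which matches the name in the error message above; with PADDLE_XCCL_BACKEND=npu it becomes FLAGS_selected_npus.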
So on NPU, PADDLE_XCCL_BACKEND presumably needs to be set to some value. Should that value be npu?
Searching Paddle's GitHub repository for PADDLE_XCCL_BACKEND turns up the following line:
ENV PADDLE_XCCL_BACKEND=npu
That confirms it.
Prefix the command with this variable, as follows, and run the check again.
> PADDLE_XCCL_BACKEND=npu python -c "import paddle; paddle.utils.run_check()"
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)
=======================================================================
I0425 21:02:37.260022 8098 tcp_utils:185] The server starts to listen on IP_ANY:48893; setting synclog to 2048
I0425 21:02:37.260236 8098 tcp_utils:134] Successfully connected to 127.0.0.1:48893
I0425 21:02:40.243685 8101 tcp_utils:134] Successfully connected to 127.0.0.1:48893
I0425 21:02:40.243683 8100 tcp_utils:134] Successfully connected to 127.0.0.1:48893
I0425 21:02:40.247510 8099 tcp_utils:134] Successfully connected to 127.0.0.1:48893
...........I0425 21:03:35.296976 8150 tcp_store:292] receive shutdown event and so quit from MasterDaemon run loop
PaddlePaddle works well on 4 npus.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
The check succeeds.
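The same invocation can be wrapped in a script, for example when the variable should not be exported shell-wide. This is a sketch under the assumption that the child process only needs PADDLE_XCCL_BACKEND added to its inherited environment, as in the command above; `env_with_xccl_backend` and `run_paddle_check` are hypothetical helper names.

```python
import os
import subprocess
import sys


def env_with_xccl_backend(base_env, backend="npu"):
    """Return a copy of base_env with PADDLE_XCCL_BACKEND set."""
    env = dict(base_env)
    env["PADDLE_XCCL_BACKEND"] = backend
    return env


def run_paddle_check(env):
    """Run Paddle's install check in a child process and return its exit code."""
    cmd = [sys.executable, "-c", "import paddle; paddle.utils.run_check()"]
    return subprocess.run(cmd, env=env).returncode


if __name__ == "__main__":
    sys.exit(run_paddle_check(env_with_xccl_backend(os.environ)))
```

Copying the environment instead of mutating os.environ keeps the setting local to the child process.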