Programming Q&A. Published: 2022-06-01. Source site: 大佬教程 (code.js-code.com)

GPU allocation in Slurm: --gres vs --gpus-per-task, and mpirun vs srun

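There are two ways to allocate GPUs in Slurm: the general --gres=gpu:N parameter, or specific parameters such as --gpus-per-task=N. There are also two ways to launch MPI tasks in a batch script: with srun, or with the usual mpirun (when OpenMPI is compiled with Slurm support). I found some surprising differences in behaviour between these methods.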

I'm submitting a batch job with sbatch, where the basic script is the following:

#!/bin/bash

#SBATCH --job-name=sim_1        # job name (default is the name of this file)
#SBATCH --output=log.%x.job_%j  # file name for stdout/stderr (%x will be replaced with the job name, %j with the job ID)
#SBATCH --time=1:00:00          # maximum wall time allocated for the job (D-H:MM:SS)
#SBATCH --partition=gpXY        # put the job into the GPU partition
#SBATCH --exclusive             # request exclusive allocation of resources
#SBATCH --mem=20G               # RAM per node
#SBATCH --threads-per-core=1    # do not use hyperthreads (i.e. cpus = physical cores below)
#SBATCH --cpus-per-task=4       # number of cpus per process

## nodes allocation
#SBATCH --nodes=2               # number of nodes
#SBATCH --ntasks-per-node=2     # MPI processes per node

## GPU allocation - variant A
#SBATCH --gres=gpu:2            # number of GPUs per node (gres=gpu:N)

## GPU allocation - variant B
## #SBATCH --gpus-per-task=1       # number of GPUs per process
## #SBATCH --gpu-bind=single:1     # bind each process to its own GPU (single:<tasks_per_gpu>)

# start the job in the directory it was submitted from
cd "$SLURM_SUBMIT_DIR"

# program execution - variant 1
mpirun ./sim

# program execution - variant 2
#srun ./sim

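The #SBATCH options in the first block are fairly obvious and uninteresting. The behaviour I describe next is observable when the job runs on at least 2 nodes. I run 2 tasks per node because we have 2 GPUs per node. Finally, there are two variants of GPU allocation (A and B) and two variants of program execution (1 and 2), hence 4 variants in total: A1, A2, B1, B2.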
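For reference, the four combinations can also be written as command-line submissions. This is only a sketch under assumptions that are not part of the script above: a hypothetical job.sh that contains no GPU directives and starts the program with "$LAUNCHER ./sim", and a hypothetical helper function variant_cmd.

```shell
#!/bin/bash
# Hypothetical helper: print the sbatch command for one of the four variants.
# Assumes job.sh has no GPU directives and runs the program as "$LAUNCHER ./sim".
variant_cmd() {
    local gpu_opts launcher
    # First character selects the GPU allocation variant.
    case "${1:0:1}" in
        A) gpu_opts="--gres=gpu:2" ;;
        B) gpu_opts="--gpus-per-task=1 --gpu-bind=single:1" ;;
        *) echo "unknown GPU variant: $1" >&2; return 1 ;;
    esac
    # Second character selects the launcher variant.
    case "${1:1:1}" in
        1) launcher="mpirun" ;;
        2) launcher="srun" ;;
        *) echo "unknown launcher variant: $1" >&2; return 1 ;;
    esac
    echo "sbatch $gpu_opts --export=ALL,LAUNCHER=$launcher job.sh"
}

# Print the submission command for each combination.
for v in A1 A2 B1 B2; do
    variant_cmd "$v"
done
```

The function only prints the commands, so it can be checked without access to a cluster.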

Variant A1 (--gres=gpu:2, mpirun)

Variant A2 (--gres=gpu:2, srun)

In both variants A1 and A2, the job executes correctly with optimal performance, and we have the following output in the log:

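Rank 0: rank on node is 0, using GPU ID 0 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 1: rank on node is 1, using GPU ID 1 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 2: rank on node is 0, using GPU ID 0 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 3: rank on node is 1, using GPU ID 1 of 2, CUDA_VISIBLE_DEVICES=0,1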
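The source of ./sim is not shown, so the following is only a sketch of how a line of this shape could be produced from a shell wrapper. It assumes that the GPU ID is the local rank modulo the number of devices listed in CUDA_VISIBLE_DEVICES, and that the ranks come from the OMPI_COMM_WORLD_* variables set by mpirun or the SLURM_PROCID / SLURM_LOCALID variables set by srun. The function name pick_gpu is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: choose a GPU index from the local rank and the
# comma-separated device list found in CUDA_VISIBLE_DEVICES.
# Prints "<gpu_id> <gpu_count>", or fails if no device is visible.
pick_gpu() {
    local local_rank="$1"
    local visible="$2"
    local -a devices
    IFS=',' read -r -a devices <<< "$visible"
    local count="${#devices[@]}"
    if (( count == 0 )); then
        echo "no visible GPUs" >&2
        return 1
    fi
    # Assumption: simple round-robin over the visible devices.
    echo "$(( local_rank % count )) $count"
}

# mpirun (Open MPI) sets the OMPI_* variables, srun sets the SLURM_* ones.
rank="${OMPI_COMM_WORLD_RANK:-${SLURM_PROCID:-0}}"
local_rank="${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}"

# Print the report line only when at least one GPU is visible.
if read -r gpu_id gpu_count < <(pick_gpu "$local_rank" "${CUDA_VISIBLE_DEVICES:-}"); then
    echo "Rank $rank: rank on node is $local_rank, using GPU ID $gpu_id of $gpu_count, CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
fi
```

Under this assumption, a task with local rank 1 that sees only CUDA_VISIBLE_DEVICES=0 gets GPU ID 0 of 1, so two tasks on a node that both see the same single device would end up on the same GPU.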

Variant B1 (--gpus-per-task=1, mpirun)

The job is not executed correctly: the GPUs are mapped incorrectly because CUDA_VISIBLE_DEVICES=0 on the second node:

Rank 0: rank on node is 0, using GPU ID 0 of 1, CUDA_VISIBLE_DEVICES=0
Rank 3: rank on node is 1, CUDA_VISIBLE_DEVICES=0

Note that this variant behaves the same with and without --gpu-bind=single:1.

Variant B2 (--gpus-per-task=1, --gpu-bind=single:1, srun)

The GPUs are mapped correctly (because of --gpu-bind=single:1, each process now sees only one GPU):

Rank 0: rank on node is 0, CUDA_VISIBLE_DEVICES=0
Rank 1: rank on node is 1, CUDA_VISIBLE_DEVICES=1
Rank 2: rank on node is 0, CUDA_VISIBLE_DEVICES=1

However, an MPI error appears when the ranks start to communicate (a similar message is repeated once for each rank):

--------------------------------------------------------------------------
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
  Hostname:                         gp11
  cuIpcOpenMemHandle return value:  217
  address:                          0x7f40ee000000
Check the cuda.h file for what the return value means. A possible cause
for this is not enough free device memory.  Try to reduce the device
memory footprint of your application.
--------------------------------------------------------------------------

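Although it says "This is an unrecoverable error", the execution seems to proceed just fine, except that the log is littered with messages like this (presumably one message per MPI communication call):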

[gp11:122211] Failed to register remote memory, rc=-1
[gp11:122212] Failed to register remote memory, rc=-1
[gp12:62725] Failed to register remote memory, rc=-1
[gp12:62724] Failed to register remote memory, rc=-1

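Clearly this is an OpenMPI error message. I found an old thread about this error, which suggested using --mca btl_smcuda_use_cuda_ipc 0 to disable CUDA IPC. However, since srun is used to launch the program in this case, I have no idea how to pass such parameters to OpenMPI.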
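One option that may be worth trying, shown here only as a sketch: Open MPI also reads MCA parameters from environment variables named OMPI_MCA_<parameter>, and srun passes the batch script's environment on to the tasks by default, so the setting can be exported before the srun line. Whether disabling CUDA IPC actually removes the errors on this cluster has not been verified here.

```shell
#!/bin/bash
# Environment-variable form of "mpirun --mca btl_smcuda_use_cuda_ipc 0",
# so that the setting also reaches processes started by srun.
export OMPI_MCA_btl_smcuda_use_cuda_ipc=0

# Launch only inside a Slurm allocation; ./sim is the program from the batch script.
if [[ -n "${SLURM_JOB_ID:-}" ]]; then
    srun ./sim
fi
```

The same export also takes effect with mpirun, so the line would not need to change between the two launch variants.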

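Note that in this variant, --gpu-bind=single:1 affects only the visible GPUs (CUDA_VISIBLE_DEVICES). Even without this option, each task is still able to select the correct GPU, and the errors still appear.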

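Any idea what is going on and how to address the errors in variants B1 and B2? Ideally we would like to use --gpus-per-task, which is more flexible than --gres=gpu:... (it is one parameter fewer to change when we change --ntasks-per-node). Using mpirun vs srun does not matter to us.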
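To illustrate the "one parameter fewer" point: with --gres, the per-node GPU count has to be kept in step with --ntasks-per-node by hand, whereas --gpus-per-task stays fixed. A minimal sketch; the helper name gres_for is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: the per-node --gres value needed so that every task
# on a node gets gpus_per_task GPUs (default: 1 GPU per task).
gres_for() {
    local ntasks_per_node="$1"
    local gpus_per_task="${2:-1}"
    echo "gpu:$(( ntasks_per_node * gpus_per_task ))"
}

for n in 1 2; do
    # Variant A: two options have to change together.
    echo "A: --ntasks-per-node=$n --gres=$(gres_for "$n")"
    # Variant B: only --ntasks-per-node changes.
    echo "B: --ntasks-per-node=$n --gpus-per-task=1"
done
```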

We have Slurm 20.11.5.1, OpenMPI 4.0.5 (built with --with-cuda and --with-slurm) and CUDA 11.2.2. The operating system is Arch Linux. The network is 10G Ethernet (no InfiniBand or OmniPath). Let me know if I should include more information.
