Jean Zay: Memory allocation with Slurm on GPU partitions

The Slurm options --mem, --mem-per-cpu and --mem-per-gpu do not currently allow you to properly configure the memory allocation per node of your job on Jean Zay. The memory allocation per node is automatically determined by the number of reserved CPUs per node.

To adjust the amount of memory per node allocated to your job, you must adjust the number of CPUs reserved per task/process (in addition to the number of tasks and/or GPUs) by specifying the following option in your batch scripts, or when using salloc in interactive mode:

--cpus-per-task=...      # --cpus-per-task=1 by default

Be careful: by default, --cpus-per-task=1. If you do not change this value as explained below, you will not have access to all of the memory potentially available per reserved task/GPU. In particular, the processes running on the CPUs risk quickly exceeding their memory allocation.

The maximum value that can be specified for --cpus-per-task depends on the number of tasks (processes) requested per node (--ntasks-per-node=...) and on the node profile (total number of cores per node), which in turn depends on the partition used.
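As a rough illustration of this dependency, the sketch below divides the cores of a node by the number of tasks per node. The helper name max_cpus_per_task is hypothetical, and the calculation assumes physical cores (--hint=nomultithread) with all tasks placed on a single node; the core counts are those given for each partition further down this page.

```shell
#!/bin/bash
# Hypothetical helper: largest --cpus-per-task that lets all tasks fit on one node.
# Usage: max_cpus_per_task <cores_per_node> <ntasks_per_node>
max_cpus_per_task() {
    local cores_per_node=$1 ntasks_per_node=$2
    echo $(( cores_per_node / ntasks_per_node ))   # integer division, rounds down
}

max_cpus_per_task 40 4    # default gpu partition, 4 tasks per node
max_cpus_per_task 24 8    # gpu_p2 partition, 8 tasks per node
max_cpus_per_task 48 8    # gpu_p4 partition, 8 tasks per node
```

With one task per GPU, these calls return the per-GPU values recommended in the partition sections of this page.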

Note that memory overflows can also occur at the GPU level, because each GPU has its own memory, whose size varies depending on the partition used.

On the default gpu partition

Each node of the default gpu partition offers 160 GB of usable memory and 40 CPU cores. The memory allocation is therefore computed automatically on the basis of:

  • 160/40 = 4 GB per reserved CPU core if hyperthreading is deactivated (Slurm option --hint=nomultithread).

Each compute node of the default gpu partition has 4 GPUs and 40 CPU cores. You can therefore reserve 1/4 of the node memory per GPU by requesting 10 CPUs (i.e. 1/4 of the 40 cores) per GPU:

--cpus-per-task=10     # reserves 1/4 of the node memory per GPU (default gpu partition)

In this way, you have access to 4*10 = 40 GB of node memory per GPU if hyperthreading is deactivated (if not, half of that memory).
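Put together, a single-GPU job on the default gpu partition could be sketched as follows. This is only a minimal header, not a complete production script: the job name is hypothetical, and the final lines simply print the memory expected from the 4 GB per core rule, falling back to the requested value when the script is run outside Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=gpu_mem_demo     # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1         # one task
#SBATCH --gres=gpu:1                # one GPU
#SBATCH --cpus-per-task=10          # 1/4 of the 40 cores, hence 1/4 of the node memory
#SBATCH --hint=nomultithread        # reserve physical cores (hyperthreading deactivated)

GB_PER_CORE=4                              # 160 GB / 40 cores on the default gpu partition
cpus="${SLURM_CPUS_PER_TASK:-10}"          # set by Slurm when --cpus-per-task is given
expected_gb=$(( cpus * GB_PER_CORE ))
echo "Expected node memory for this task: ${expected_gb} GB"
```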

Note that you can request more than 40 GB of memory per GPU if a process needs it. However, the job will then be overcharged: Slurm accounts for additional GPU resources that are not used, so the GPU hours consumed by the job are calculated as if you had reserved more GPUs, with no benefit for the computation time (see the comments at the bottom of the page).

On the gpu_p2 partition

The gpu_p2 partition is divided into two subpartitions:

  • The gpu_p2s subpartition with 360 GB usable memory per node
  • The gpu_p2l subpartition with 720 GB usable memory per node

As each node of this partition contains 24 CPU cores, the memory allocation is automatically determined on the basis of:

  • 360/24 = 15 GB per reserved CPU core on the gpu_p2s subpartition if hyperthreading is deactivated (Slurm option --hint=nomultithread)
  • 720/24 = 30 GB per reserved CPU core on the gpu_p2l subpartition if hyperthreading is deactivated

Each compute node of the gpu_p2 partition contains 8 GPUs and 24 CPU cores. You can therefore reserve 1/8 of the node memory per GPU by reserving 3 CPUs (i.e. 1/8 of the 24 cores) per GPU:

--cpus-per-task=3    # reserves 1/8 of the node memory per GPU (gpu_p2 partition)

In this way, you have access to:

  • 15*3 = 45 GB of node memory per GPU on the gpu_p2s subpartition
  • 30*3 = 90 GB of node memory per GPU on the gpu_p2l subpartition

if hyperthreading is deactivated (if not, half of that memory).
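The arithmetic for the two subpartitions can be checked with a few lines of shell. The function name mem_per_task_gb is hypothetical, and the sketch only covers the case where hyperthreading is deactivated (--hint=nomultithread).

```shell
#!/bin/bash
# Hypothetical helper: node memory (in GB) reachable by one task,
# assuming hyperthreading is deactivated.
# Usage: mem_per_task_gb <node_mem_gb> <cores_per_node> <cpus_per_task>
mem_per_task_gb() {
    local node_mem_gb=$1 cores_per_node=$2 cpus_per_task=$3
    echo $(( node_mem_gb * cpus_per_task / cores_per_node ))
}

# 24 cores per node and --cpus-per-task=3 on both subpartitions
for entry in "gpu_p2s 360" "gpu_p2l 720"; do
    read -r name node_mem_gb <<< "$entry"
    echo "${name}: $(mem_per_task_gb "$node_mem_gb" 24 3) GB per GPU"
done
```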

Note that you can request more than 45 GB (with gpu_p2s) or 90 GB (with gpu_p2l) of memory per GPU if a process needs it. However, the job will then be overcharged: Slurm accounts for additional GPU resources that are not used, so the GPU hours consumed by the job are calculated as if you had reserved more GPUs, with no benefit for the computation time (see the comments at the bottom of the page).

On the gpu_p4 partition

Each node of the gpu_p4 partition offers 720 GB of usable memory and 48 CPU cores. The memory allocation is therefore computed automatically on the basis of:

  • 720/48 = 15 GB per reserved CPU core if hyperthreading is deactivated (Slurm option --hint=nomultithread).

Each compute node of the gpu_p4 partition has 8 GPUs and 48 CPU cores. You can therefore reserve 1/8 of the node memory per GPU by requesting 6 CPUs (i.e. 1/8 of the 48 cores) per GPU:

--cpus-per-task=6     # reserves 1/8 of the node memory per GPU (gpu_p4 partition)

In this way, you have access to 6*15 = 90 GB of node memory per GPU if hyperthreading is deactivated (if not, half of that memory).
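As an illustration, a job using all 8 GPUs of one gpu_p4 node with one task per GPU might start with the header below. The job name is hypothetical, and selecting the partition with --partition=gpu_p4 is an assumption not stated on this page. The last lines only print the total memory expected from the 15 GB per core rule.

```shell
#!/bin/bash
#SBATCH --job-name=p4_full_node     # hypothetical job name
#SBATCH --partition=gpu_p4          # assumption: partition selected with --partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8         # one task per GPU
#SBATCH --gres=gpu:8                # all 8 GPUs of the node
#SBATCH --cpus-per-task=6           # 8 tasks x 6 cores = 48 cores, i.e. the whole node
#SBATCH --hint=nomultithread        # reserve physical cores (hyperthreading deactivated)

GB_PER_CORE=15                             # 720 GB / 48 cores on the gpu_p4 partition
tasks="${SLURM_NTASKS_PER_NODE:-8}"        # set by Slurm when --ntasks-per-node is given
cpus="${SLURM_CPUS_PER_TASK:-6}"           # set by Slurm when --cpus-per-task is given
total_gb=$(( tasks * cpus * GB_PER_CORE ))
echo "Expected node memory for the job: ${total_gb} GB (${cpus} cores per task)"
```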

Note that you can request more than 90 GB of memory per GPU if a process needs it. However, the job will then be overcharged: Slurm accounts for additional GPU resources that are not used, so the GPU hours consumed by the job are calculated as if you had reserved more GPUs, with no benefit for the computation time (see the comments at the bottom of the page).

Comments

  • You can ask for more memory per GPU by increasing the value of --cpus-per-task, as long as the resulting request does not exceed the total amount of memory available on the node. Be careful: the computing hours are counted proportionately. For example, if you ask for 1 GPU on the default gpu partition by specifying --ntasks=1 --gres=gpu:1 --cpus-per-task=20, the job will be charged the same as a job running on 2 GPUs because of --cpus-per-task=20.
  • If you reserve a node in exclusive mode, you have access to the entire memory capacity of the node, regardless of the value of --cpus-per-task. The job will be charged the same as a job running on an entire node.
  • The amount of memory allocated to your job can be seen by running the command:
    $ scontrol show job $JOBID     # look for the value of the "mem" variable

    Important: While the job is waiting in the queue (PENDING), Slurm estimates the memory allocated to the job based on logical cores. Therefore, if you have reserved physical cores (with --hint=nomultithread), the value shown can be half the expected value. It is updated and becomes correct once the job starts.

  • To reserve resources on the prepost partition, you may refer to: Memory allocation with Slurm on CPU partitions. The GPU which is available on each node of the prepost partition is automatically allocated to you without needing to specify the --gres=gpu:1 option.
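The proportional accounting described in the first comment can be approximated as follows. This is only a sketch: the helper name billed_gpus is hypothetical, and rounding up to a whole number of GPUs is an assumption (the page only gives the case of 20 CPUs being charged as 2 GPUs on the default gpu partition).

```shell
#!/bin/bash
# Hypothetical helper: number of GPUs a single-node job is charged for.
# Assumption: the CPU share is rounded up to a whole number of GPUs.
# Usage: billed_gpus <gpus_requested> <total_cpus_reserved> <cpus_per_gpu>
billed_gpus() {
    local gpus=$1 cpus=$2 cpus_per_gpu=$3
    local by_cpu=$(( (cpus + cpus_per_gpu - 1) / cpus_per_gpu ))   # ceiling division
    if [ "$by_cpu" -gt "$gpus" ]; then
        echo "$by_cpu"    # the CPU (and memory) reservation dominates
    else
        echo "$gpus"      # the GPU reservation dominates
    fi
}

billed_gpus 1 20 10   # default gpu partition: --gres=gpu:1 --cpus-per-task=20
billed_gpus 1 10 10   # default gpu partition: --gres=gpu:1 --cpus-per-task=10
```

The first call reproduces the example from the comment (charged as 2 GPUs); the second corresponds to the recommended setting, where no extra GPU is charged.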