Queue System Concepts for Fox Users

This page explains some of the core concepts of the Slurm queue system.

For an overview of Slurm concepts, see Slurm's beginner guide: Quick Start.

Partition

The nodes on a cluster are divided into sets, called partitions. Partitions can overlap.

Some job types on Fox are implemented as partitions, meaning that one specifies --partition to select the job type -- for instance accel.
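
For example, here is a minimal sketch of a job script that selects the accel partition to get a GPU node. The account name ec11, the resource sizes and the GPU request syntax are only illustrations; adjust them to your own project and needs.

#!/bin/bash
# Minimal sketch; "ec11" is a placeholder for your own project account.
#SBATCH --job-name=gpu-test
#SBATCH --account=ec11
#SBATCH --partition=accel     # select the GPU ("accel") partition
#SBATCH --gpus=1
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=4G

nvidia-smi                    # show the GPU allocated to the job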

QoS - Quality of Service

A QoS is a way to assign properties and limitations to jobs. It can be used to give jobs different priorities, and to add or change the limits on the jobs, for instance the size or length of jobs, or the number of jobs running at the same time.

Some job types on Fox are implemented as a QoS, meaning that one specifies --qos to select the job type -- for instance devel. The job will then (by default) run in the standard (normal) partition, but with different properties.
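
The QoS can be given either in the job script or on the sbatch command line. A minimal sketch; the script name is a placeholder:

#SBATCH --qos=devel

or, equivalently, when submitting:

$ sbatch --qos=devel jobscript.sh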

Account

An account is an entity that can be assigned a quota for resource usage. All jobs run in an account, and the job's usage is subtracted from the account's quota.

Accounts can also have restrictions, like how many jobs can run in them at the same time, or which reservations their jobs can use.

On Fox, each project has its own account with the same name as the project, "ecNNN". We use accounts mainly for accounting resource usage.
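
In practice, the account is selected with --account in the job script (or on the sbatch command line). A minimal sketch; ec11 is a hypothetical project name:

#SBATCH --account=ec11

One way to list which accounts you have access to is the standard Slurm command sacctmgr (assuming it is available to ordinary users, as it usually is):

$ sacctmgr show associations where user=$USER format=account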

Read more about projects and accounting.

Jobs

Jobs are submitted to the job queue, and start running on the assigned compute nodes when enough resources are available.
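
A sketch of the typical workflow; the script name is a placeholder, and the job ID is just an example:

$ sbatch jobscript.sh         # submit the job script to the queue
$ squeue --me                 # list your pending and running jobs
$ scontrol show job 357055    # show details about a specific job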

Job step

A job is divided into one or more job steps. Each time a job runs srun or mpirun, a new job step is created. Job steps are normally executed sequentially, one after the other. In addition to these, the batch job script itself, which runs on the first of the allocated nodes, is considered a job step (named batch).
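
As a sketch, a job script like the following would produce a batch step plus two numbered steps, 0 and 1. The account and program names are placeholders:

#!/bin/bash
# Minimal sketch; "ec11", "./preprocess" and "./solver" are placeholders.
#SBATCH --job-name=two-steps
#SBATCH --account=ec11
#SBATCH --ntasks=4
#SBATCH --time=00:30:00
#SBATCH --mem-per-cpu=2G

# The job script itself runs as the "batch" step.
srun ./preprocess     # becomes job step 0
srun ./solver         # becomes job step 1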

sacct will show the job steps. For instance:

$ sacct -j 357055
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
357055              DFT     normal    nn9180k        256  COMPLETED      0:0
357055.batch      batch               nn9180k         32  COMPLETED      0:0
357055.exte+     extern               nn9180k        256  COMPLETED      0:0
357055.0      pmi_proxy               nn9180k          8  COMPLETED      0:0

The first line here is the job allocation. Then come the job script step (batch), an artificial step that we can ignore here (extern), and finally a job step corresponding to an mpirun or srun (step 0). Further steps would be numbered 1, 2, etc.

Tasks

Each job step starts one or more tasks, which correspond to processes. So for instance, the processes (MPI ranks) in an MPI job step are tasks. This is why one specifies --ntasks etc. in job scripts to select the number of processes to run in an MPI job.

All tasks in a job step are started at the same time, and they run in parallel on the nodes of the job. srun and mpirun take care of starting the right number of processes on the right nodes.
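
For instance, here is a sketch of an MPI job asking for 16 tasks; the account and program names are placeholders:

#!/bin/bash
# Minimal sketch of an MPI job; "ec11" and "./my_mpi_program" are placeholders.
#SBATCH --job-name=mpi-job
#SBATCH --account=ec11
#SBATCH --ntasks=16           # 16 tasks = 16 MPI ranks
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=2G

srun ./my_mpi_program         # srun starts 16 processes across the allocated nodes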

(Unfortunately, Slurm also refers to the individual instances of array jobs as array tasks.)

CPUs

Each task of a job step by default gets access to one CPU. For multithreaded programs, it is usually best to have access to the same number of CPUs as the program has threads. This is done by specifying the job parameter --cpus-per-task. So for instance, if your program uses 4 threads, --cpus-per-task=4 is usually a good choice.
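
Here is a sketch of a job script for a program that runs 4 threads (for example with OpenMP); the account and program names are placeholders:

#!/bin/bash
# Minimal sketch of a multithreaded job; "ec11" and "./my_threaded_program" are placeholders.
#SBATCH --job-name=threaded-job
#SBATCH --account=ec11
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4     # one task with 4 CPUs, one per thread
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=2G

# For OpenMP programs, match the thread count to the CPUs allocated by Slurm.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
./my_threaded_program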

Note that when we use the term "CPU" in this documentation, we are technically referring to a CPU core, sometimes known as a physical CPU core. On the Fox GPU nodes, each CPU core has two hyperthreads, sometimes known as logical CPUs, and some programs will report the total number of hyperthreads as the number of CPUs on the node. But in this documentation, and in the Slurm setup on Fox, we only count the (physical) cores as CPUs. On the regular compute nodes, there are no hyperthreads, so there is no possibility of confusion there.


CC Attribution: This page is maintained by the University of Oslo IT FFU-BT group. It has either been modified from, or is a derivative of, "Queue system concepts" by NRIS under CC-BY-4.0. Changes: Add section "CPUs".