Slurm Overview
All computing work on the cluster is run through a job scheduling system called Slurm. Slurm does the following:
- Job Distribution: Distributes jobs across the cluster, allowing work to run in parallel across multiple machines.
- Job Queueing: Queues jobs when no computing resources are immediately available.
- Resource Tracking: Tracks each user's resource consumption over time and balances job priorities so that everyone gets fair access to cluster resources (see the example below).
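For example, submitting a script to the scheduler and checking on it looks roughly like this (a sketch; the script name my_job.sh is a placeholder):

```bash
# Submit a job script to the queue; Slurm prints the assigned job ID.
sbatch my_job.sh

# List your own jobs and whether they are running (R) or pending (PD).
squeue -u $USER

# Show your fairshare usage, which feeds into job priority.
sshare -u $USER
```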
Slurm Components
- Login Node (cluster.stat.washington.edu): The gateway to the cluster. You log in here via SSH to submit jobs and manage your files.
- Scheduler / Queue Controller: The controller manages the job queue, deciding when and where jobs run on the cluster. It ensures resources are used efficiently and fairly.
- Compute Nodes: These nodes do the actual computing work, which the Slurm controller distributes to them.
- Partitions: Groups of compute nodes organized by resource constraints, designed to help distribute work fairly across the cluster. Each partition may have different limits on resources such as runtime and memory (see the example after the diagram below).
```mermaid
flowchart LR
    A[Your computer] -->|ssh| B(Login Node)
    B -->|sbatch / srun| C(Slurm Queue Controller)
    subgraph Slurm Partition
        D1[Compute Node 1]
        D2[Compute Node 2]
        D3[Compute Node 3]
        D4[Compute Node N]
    end
    C -->|slurm| D1
    C -->|slurm| D2
    C -->|slurm| D3
    C -->|slurm| D4
```
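In practice, the path in the diagram corresponds to commands like the following sketch (your-username and my_job.sh are placeholders):

```bash
# Connect to the login node, the only machine you access directly.
ssh your-username@cluster.stat.washington.edu

# List the partitions, their time limits, and the state of their compute nodes.
sinfo

# Hand a job script to the queue controller, which assigns it to a compute node.
sbatch my_job.sh
```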
Cluster Job Terminology
- Job: The overall unit of work submitted to the cluster, defined by a job script; it encompasses everything the user wants to run in that submission.
- Job Step: A subdivision of a job consisting of one or more tasks; each srun invocation inside a job launches a job step. This is typically a single iteration of a parallel job.
- Task: The smallest unit of work. Multiple tasks can run concurrently within a job step. For most Stat cluster jobs, each job step runs only a single task (see the example script below).
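As a rough sketch of how these terms map onto a batch script (the resource values and the analysis.R script name are placeholders, not recommendations):

```bash
#!/bin/bash
# The job: everything defined by this script and its #SBATCH directives.
#SBATCH --job-name=example
#SBATCH --ntasks=1        # number of tasks to run; most Stat cluster jobs use one
#SBATCH --time=01:00:00   # placeholder runtime limit
#SBATCH --mem=4G          # placeholder memory request

# Each srun invocation launches one job step; with --ntasks=1 it runs a single task.
srun Rscript analysis.R
```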