Slurm Troubleshooting

Evaluating Job Failures & Resource Consumption with sacct-detail

sacct-detail shows the status of pending, running, and recently completed jobs. It is the primary tool for diagnosing failed jobs and for comparing a completed job's actual memory and time consumption against the limits you requested with sbatch/srun.

sacct-detail Output Reference

| Field | Description | Notes |
|---|---|---|
| JobID | The unique identifier assigned to the job by Slurm. | For job arrays, the array iteration is appended. |
| State | The current state of the job (e.g., PENDING, RUNNING, COMPLETED). | Failures due to exceeding your Slurm limits will be shown here (e.g., OUT_OF_MEMORY, TIMEOUT). |
| Timelimit | The time limit set by your sbatch/srun parameters: days-hours:minutes:seconds. | |
| Start | The date and time when the job started. | If the job is still pending, this will be Unknown. |
| End | The date and time when the job ended. | Unknown for running or pending jobs. |
| Elapsed | The total elapsed time that the job has been running. | Updated in real time for running jobs. |
| ReqMem | The amount of memory requested for the job. | Displayed in the same units used by your sbatch/srun command (e.g., 2048M or 2G). |
| MaxRSS | The peak memory usage of the job during execution, in kilobytes. | If your job memory limit was specified in megabytes, divide this number by 1024 for comparison. |
| NNodes | The number of nodes allocated to the job. | |
| NCPUS | The number of cores allocated to the job. | |
| NodeList | The list of nodes on which the job is running or ran. | |
| ExitCode | The exit status of the job. | See Slurm Exit Codes. |
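
The comparison of MaxRSS against ReqMem, and of Elapsed against Timelimit, can be sketched in a few lines of Python. This is only an illustration: the helper names (to_kib, to_seconds, usage_fraction) are hypothetical and not part of sacct-detail, the sample values are made up, and it assumes memory values carry a K, M, G, or T suffix with 1024-based units, with a bare number treated as kilobytes as described for MaxRSS.

```python
# Hypothetical helpers for comparing sacct-detail fields; not part of Slurm.

# Multipliers to kilobytes, assuming 1024-based units.
_UNIT_KIB = {"K": 1, "M": 1024, "G": 1024 ** 2, "T": 1024 ** 3}


def to_kib(value: str) -> float:
    """Convert a memory string such as '2G', '2048M' or '1572864K' to kilobytes."""
    value = value.strip().upper()
    if value and value[-1] in _UNIT_KIB:
        return float(value[:-1]) * _UNIT_KIB[value[-1]]
    # Assumption: a bare number is already in kilobytes (as for MaxRSS).
    return float(value)


def to_seconds(value: str) -> int:
    """Convert 'days-hours:minutes:seconds' (days optional) to seconds."""
    days, _, clock = value.strip().rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    while len(parts) < 3:          # tolerate a shorter 'minutes:seconds' form
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((int(days or 0) * 24 + hours) * 60 + minutes) * 60 + seconds


def usage_fraction(used: float, limit: float) -> float:
    """Return used / limit, e.g. 0.75 means 75% of the limit was consumed."""
    return used / limit


# Made-up sample values for illustration only.
mem_used = usage_fraction(to_kib("1572864K"), to_kib("2G"))
time_used = usage_fraction(to_seconds("00:45:00"), to_seconds("01:00:00"))
print(f"memory: {mem_used:.0%} of request, time: {time_used:.0%} of limit")
```

A job whose memory fraction sits close to 1.0 is a candidate for a larger memory request, while a very low fraction suggests the request could be reduced.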

Slurm Exit Codes

When a job step finishes running, sacct-detail's ExitCode column shows the exit status for each step in the format exit code:exit signal. The exit code is the status returned by your code when the job step finishes. If that status is anything other than 0, the State column will report the job step as failed. The meaning of a given exit code depends on the application you are running; for common exit codes in Linux, see http://tldp.org/LDP/abs/html/exitcodes.html.

If the job step was terminated due to receiving a signal, then the second number listed in the ExitCode column of sacct-detail will be the signal that ended the job.
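
As a rough illustration of how the two numbers can be read, the following Python sketch splits an ExitCode value and describes it. The function names are hypothetical and the sample values are invented; signal names are looked up with Python's standard signal module, so they reflect the machine running the script, which is assumed here to be a Linux host.

```python
import signal
from typing import Tuple


def parse_exit_code(field: str) -> Tuple[int, int]:
    """Split an ExitCode value such as '0:9' into (exit_code, signal_number)."""
    code, _, sig = field.strip().partition(":")
    return int(code), int(sig or 0)


def describe(field: str) -> str:
    """Return a short human-readable description of an ExitCode value."""
    code, sig = parse_exit_code(field)
    if sig:
        try:
            name = signal.Signals(sig).name
        except ValueError:
            # Signal number not known on this platform.
            name = "unknown"
        return f"terminated by signal {sig} ({name})"
    if code:
        return f"exited with non-zero status {code}"
    return "exited successfully"


# Invented sample values for illustration only.
for sample in ("0:0", "1:0", "0:9"):
    print(sample, "->", describe(sample))
```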

Common Slurm Problems

Stuck Interactive Sessions

Some applications freeze when their memory allocation is too low, rather than exceeding the allocation and causing the Slurm session to end with an Out Of Memory error. Common examples are Matlab and R (particularly when installing packages).

To correct this:

  • Exit the stuck command with Control + C
  • Close the interactive session with exit or Control + D
  • Retry your srun session with a larger memory allocation

"Illegal Instruction" Job Failures

Job failures with "illegal instruction" or core dump errors are usually caused by running binaries compiled with newer CPU features that are not present on all cluster nodes. This typically occurs when software or libraries have been compiled on a partition other than the build partition.

To correct this, delete any compiled binaries or installed libraries, then reinstall them using an interactive session on the build partition.
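
To see which CPU features differ between two nodes, one option is to compare the flags line of /proc/cpuinfo from each. The sketch below is an assumption-laden illustration, not a cluster-provided tool: the function names are hypothetical, the sample text is invented, and it assumes x86 Linux nodes, where /proc/cpuinfo contains a line beginning with "flags".

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the set of CPU feature flags from the text of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. a non-x86 system)


def missing_flags(compiled_on: str, ran_on: str) -> set:
    """Flags available where the code was compiled but absent where it ran."""
    return cpu_flags(compiled_on) - cpu_flags(ran_on)


# Invented sample text; on a real node you could instead read the file:
#   with open("/proc/cpuinfo") as f: text = f.read()
compile_node = "model name\t: example\nflags\t\t: fpu sse2 avx2 avx512f\n"
failing_node = "model name\t: example\nflags\t\t: fpu sse2 avx2\n"

print(sorted(missing_flags(compile_node, failing_node)))
```

A non-empty result means the binary may use instructions that the failing node cannot execute, which is consistent with the rebuild-on-the-build-partition fix described above.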

Missing Software or Modules

See Environment Modules: Troubleshooting