Slurm Troubleshooting
Evaluating Job Failures & Resource Consumption with sacct-detail
sacct-detail shows the status of pending, running, and recently completed jobs. It is the primary tool for diagnosing failed jobs and for comparing a completed job's actual memory and time consumption with the limits you requested with sbatch/srun.
sacct-detail Output Reference
Field | Description | Notes |
---|---|---|
JobID | The unique identifier assigned to the job by Slurm. | For job arrays, the array iteration is appended. |
State | The current state of the job (e.g., PENDING, RUNNING, COMPLETED). | Failures due to exceeding your Slurm limits are shown here (e.g., OUT_OF_MEMORY, TIMEOUT). |
Timelimit | The time limit set by your sbatch/srun parameters, formatted as days-hours:minutes:seconds. | |
Start | The date and time when the job started. | If the job is still pending, this will be Unknown. |
End | The date and time when the job ended. | Unknown for running or pending jobs. |
Elapsed | The total time that the job has been running. | Updated in real time for running jobs. |
ReqMem | The amount of memory requested for the job. | Displayed in the same units used by your sbatch/srun command (e.g., 2048M or 2G). |
MaxRSS | The peak memory usage of the job during execution, in kilobytes. | If your job memory limit was specified in megabytes, divide this number by 1024 for comparison. |
NNodes | The number of nodes allocated to the job. | |
NCPUS | The number of cores allocated to the job. | |
NodeList | The list of nodes on which the job is running or ran. | |
ExitCode | The exit status of the job. | See Slurm Exit Codes. |
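The comparisons described in the table (MaxRSS against ReqMem, and Elapsed against Timelimit) can be done by hand, but a small script makes the unit conversions explicit. The following is a minimal Python sketch, not part of sacct-detail itself: the function names are hypothetical, the sample values are made up for illustration, and it assumes the field formats listed in the table (MaxRSS in kilobytes, times as days-hours:minutes:seconds with the days part optional). It also tolerates a trailing n or c on ReqMem, on the assumption that some Slurm versions append one.

```python
import re


def to_seconds(text):
    """Convert '[days-]hours:minutes:seconds' to a number of seconds."""
    # rpartition leaves days empty when no 'days-' prefix is present.
    days, _, clock = text.strip().rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((int(days or 0) * 24 + hours) * 60 + minutes) * 60 + seconds


def reqmem_to_mb(text):
    """Convert a ReqMem value such as '2048M' or '2G' to megabytes."""
    # The optional trailing n/c suffix is an assumption about older output.
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT])[nc]?", text.strip())
    if match is None:
        raise ValueError(f"unrecognised ReqMem value: {text!r}")
    value, unit = float(match.group(1)), match.group(2)
    factor = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}[unit]
    return value * factor


def maxrss_kb_to_mb(text):
    """Convert MaxRSS in kilobytes ('1572864' or '1572864K') to megabytes."""
    return float(text.strip().rstrip("K")) / 1024


if __name__ == "__main__":
    # Illustrative values only; substitute the fields from your own job.
    memory_fraction = maxrss_kb_to_mb("1572864K") / reqmem_to_mb("2G")
    time_fraction = to_seconds("00:45:00") / to_seconds("01:00:00")
    print(f"memory used: {memory_fraction:.0%} of request")
    print(f"time used:   {time_fraction:.0%} of limit")
```

A fraction close to 1 for either value suggests the job is at risk of an OUT_OF_MEMORY or TIMEOUT failure on a slightly larger input; a very small fraction suggests the request could be reduced.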
Slurm Exit Codes
When a job step finishes running, sacct-detail's ExitCode column shows the exit status for each step in the format exit code:exit signal. The exit code is the status returned by your code when the job step finishes. If that status is anything other than 0, the State column will show that the job step failed. The meaning of the exit code varies depending on the application you are running; for common exit codes in Linux, see http://tldp.org/LDP/abs/html/exitcodes.html.
If the job step was terminated by a signal, the second number in the ExitCode column of sacct-detail is the signal that ended the job.
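To illustrate how the two numbers in the ExitCode column are read, here is a minimal Python sketch. The function name describe_exit_code is hypothetical, and the signal names come from Python's standard signal module on the machine where the script runs, which is assumed to match the compute nodes.

```python
import signal


def describe_exit_code(field):
    """Interpret an ExitCode value of the form 'exit code:exit signal'."""
    code_text, _, signal_text = field.strip().partition(":")
    code, signum = int(code_text), int(signal_text or 0)
    if signum:
        # A non-zero second number means the step was ended by a signal.
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    if code:
        # A non-zero first number is the status returned by the application.
        return f"failed with exit code {code}"
    return "completed successfully"


if __name__ == "__main__":
    for value in ("0:0", "1:0", "0:15"):
        print(value, "->", describe_exit_code(value))
```

The signal is checked first because, when a step is killed by a signal, the second number is what explains why the step ended.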
Common Slurm Problems
Stuck Interactive Sessions
Some applications freeze when their memory allocation is too low, rather than exceeding the allocation and causing the Slurm session to end with an Out Of Memory error. Common examples are Matlab and R (particularly when installing packages).
To correct this:
- Exit the stuck command with Control + C
- Close the interactive session with exit or Control + D
- Retry your srun session with a larger memory allocation
"Illegal Instruction" Job Failures
Job failures with "illegal instruction" or core dump errors are usually caused by running binaries compiled with newer CPU features that are not present on all cluster nodes. This typically occurs when software or libraries have been compiled on a partition other than the build partition.
To correct this, delete any compiled binaries or installed libraries, then reinstall them using an interactive session on the build partition.