SLURM Job Output#
Reading `.out` and `.err` files is critical for debugging.
When you submit a SLURM job using `sbatch job.slurm`, SLURM automatically captures your program’s output in two key files:
| File | Description |
|---|---|
| `*.out` | Captures standard output — anything your script prints via `print`, `puts`, or `echo` |
| `*.err` | Captures standard error — runtime errors, syntax issues, missing files, MPI problems, etc. |
These files are automatically named using the job ID and should be your first stop when debugging a failed or stalled job.
Example#
You will find the output files in the base folder where your SLURM job was executed, or where its results were transferred.
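For illustration, here is a minimal sketch of a job script that names both files explicitly. The `-o`/`-e` directives and the `%j` job-ID pattern are standard SLURM; the job name, module line, and executable are placeholders that depend on your system.

```bash
#!/bin/bash
#SBATCH -J opensees-demo            # job name (placeholder)
#SBATCH -o opensees-demo.%j.out     # standard output; %j expands to the job ID
#SBATCH -e opensees-demo.%j.err     # standard error, kept in its own file
#SBATCH -N 1                        # nodes
#SBATCH -n 1                        # tasks
#SBATCH -t 00:30:00                 # wall-clock limit

module load opensees                # placeholder -- use your system's module name
OpenSees model.tcl                  # stdout goes to the .out file,
                                    # stderr goes to the .err file
```

After `sbatch job.slurm` finishes, the two files appear in the submission folder, named with the numeric job ID.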
What to Look For#
**`*.out` – Standard Output**

This file logs the normal progress of your job, such as:

- Echoed input parameters or timestamps
- Printed results or summaries
- Completion messages from OpenSees or your script
- Debugging output you inserted (e.g., `print("step 1 done")`)

If this file is empty, it might mean your script failed very early, before any standard output was produced.
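A cheap way to make the `.out` file more useful is to echo timestamps and progress markers from the job script itself. This is just a sketch; the OpenSees call is a placeholder for your own analysis command.

```bash
# Inside the body of your job script:
echo "Job $SLURM_JOB_ID started on $(hostname) at $(date)"   # appears in .out
echo "Running model.tcl"                                      # echo key inputs

OpenSees model.tcl                                            # placeholder analysis step

echo "OpenSees exited with code $? at $(date)"                # completion marker
```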
**`*.err` – Standard Error**

This file contains critical diagnostics:

- Tcl or Python syntax errors
- Missing or misnamed input files
- Failed module loads (e.g., a missing OpenSees executable)
- MPI startup issues (`OpenSeesSP` or `OpenSeesMP`)
- Permission problems (e.g., trying to write to a read-only directory)

Even if your job produced output, always check the `.err` file — it may reveal warnings or silent failures that don’t stop the job but indicate something went wrong.
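A quick way to inspect both files from a terminal is with standard shell tools; the `xxxxx` names below are placeholders for your job’s actual file names.

```bash
# Errors usually appear near the end of each stream.
tail -n 40 xxxxx.out
tail -n 40 xxxxx.err

# Scan the error stream for common red flags (case-insensitive, with line numbers).
grep -inE "error|warning|no such file|permission denied" xxxxx.err
```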
Best Practice#
Always check both files after a job finishes (or fails). These files are stored with your input script. You can view them in JupyterHub or the Data Depot, which you’d access from the Job-Status page.
In most cases, they will tell you exactly what went wrong, or confirm that your job completed successfully.
If you’re debugging an OpenSees model, this is where you’ll find:

- The stack trace of a failed script
- Errors about file paths, mesh loading, or convergence issues
- Output from `puts` or `print` statements that help trace execution
Failed Job Example#
Check the files:
`xxxxx.err`:

```
mpirun: error: unable to open hostfile: No such file or directory
ERROR: MPI process failed to start
child process exited abnormally
```

`xxxx.out`: `<empty>`
Diagnosis:

MPI tried to launch `OpenSeesSP` but couldn’t find the required hostfile. This typically happens when:

- You didn’t request multiple nodes but your environment needs one (a possible fix is sketched below)
- Your `mpirun` configuration doesn’t match the scheduler
- OpenSees isn’t correctly installed or loaded
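If the cause is the resource request rather than the installation, one possible fix is to request the nodes explicitly and let SLURM supply the MPI layout so no hand-written hostfile is needed. This is a sketch only: it assumes `OpenSeesSP` is on your PATH and that your site’s `mpirun` is SLURM-aware; node and task counts are placeholders.

```bash
#!/bin/bash
#SBATCH -N 2                        # request the nodes MPI expects
#SBATCH -n 32                       # total MPI tasks across those nodes
#SBATCH -o opensees-sp.%j.out
#SBATCH -e opensees-sp.%j.err
#SBATCH -t 01:00:00

# Launch with the task count SLURM allocated instead of a manual hostfile.
mpirun -np "$SLURM_NTASKS" OpenSeesSP model.tcl
```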
Troubleshooting Checklist#
Use this list when your SLURM job doesn’t behave as expected:
| Step | What to Check | File |
|---|---|---|
| 1 | Did the job run at all? Look for start/stop messages. | `.out` |
| 2 | Is there a syntax or runtime error? | `.err` |
| 3 | Any missing input files or path typos? | `.err` |
| 4 | Are MPI commands/formats correct? | `.err` |
| 5 | Are you using the correct executable (`OpenSees`, `OpenSeesSP`, or `OpenSeesMP`)? | `.err` |
| 6 | Does the output stop partway through? Check for crashes. | `.out` |
| 7 | Does your script use absolute or relative paths? | Both |
| 8 | Is there a SLURM-specific error (e.g., exceeded time limit)? | `.err` |
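Several of these checks can be scripted. The helper below is a rough sketch, not a standard tool: pass it the two files your job produced, and it uses ordinary `tail` and `grep` calls to surface the most common problems.

```bash
#!/bin/bash
# Rough helper for the checklist above -- pass the two files your job produced.
# Usage (hypothetical): ./check_job.sh xxxxx.out xxxxx.err
OUT="$1"
ERR="$2"

echo "== Steps 1 & 6: did the job start, progress, and finish? =="
tail -n 20 "$OUT"

echo "== Steps 2-5: errors, missing files, MPI or executable problems =="
grep -inE "error|no such file|not found|mpi" "$ERR" || echo "(no obvious errors)"

echo "== Step 8: SLURM-side problems such as the time limit =="
grep -iE "time limit|cancelled|killed" "$ERR" || echo "(no SLURM kill messages)"
```

If your cluster keeps accounting data, `sacct -j <jobid>` can also report the job’s final State and ExitCode for step 8.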
Tip

- Empty `.out` file? Your job likely failed before any output was printed.
- Empty `.err` file? Great! But check `.out` for unexpected early exits.
- Still unsure? Insert `puts "Starting..."` or `print("Reached step 2")` to help trace the point of failure — you can also follow the files live, as sketched below.