# Ranks in MPI Jobs
*MPI ranks, nodes, and scratch space: what you need to know first.*
Before using node-local /tmp storage safely in an MPI job, it is important to understand how parallel processes are organized and identified at runtime. The file-management patterns that follow rely on these concepts to avoid file collisions, race conditions, and data loss.
This section introduces the core MPI and Slurm concepts you will see referenced throughout the storage chapter.
## What is an MPI rank?
In an MPI job, your program is launched multiple times in parallel. Each running instance is called a rank.
- Every rank executes the same program.
- Each rank has a unique integer ID.
- Ranks typically cooperate by exchanging data via MPI.
The rank ID is commonly referred to as the **MPI rank** (0, 1, 2, ..., N−1). In Slurm-launched jobs (including Tapis jobs on TACC systems), this value is exposed as the environment variable `SLURM_PROCID`.
Example:

- `SLURM_PROCID=0` → rank 0 (often used for coordination or aggregation)
- `SLURM_PROCID=7` → the 8th MPI process
If two ranks write to the same filename, one silently overwrites the other's data unless their outputs are separated. This is why rank-aware paths are essential.
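A minimal sketch in bash of a rank-aware output path (the `/tmp/${USER}/${SLURM_JOB_ID}` layout is illustrative, not a site convention; the script is assumed to be launched once per rank, e.g. via `srun`):

```bash
#!/bin/bash
# Each rank derives its own output path from SLURM_PROCID, so two
# ranks on the same node never write to the same file in /tmp.
RANK="${SLURM_PROCID:-0}"
OUTDIR="/tmp/${USER}/${SLURM_JOB_ID:-nojob}"   # illustrative layout
mkdir -p "$OUTDIR"
echo "partial result from rank ${RANK}" > "${OUTDIR}/rank_${RANK}.dat"
```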
## Nodes vs. ranks: why this distinction matters
A node is a physical (or virtual) machine in the cluster.
Each node has:

- its own CPUs
- its own memory
- its own `/tmp` directory

Multiple MPI ranks typically run on the same node.
Slurm provides three important identifiers:

| Concept | Meaning | Slurm variable |
|---|---|---|
| Rank ID | Global MPI process index | `SLURM_PROCID` |
| Node ID | Which node in the allocation | `SLURM_NODEID` |
| Local rank | Rank index within a node | `SLURM_LOCALID` |
Example: rank 12 may have:

- `SLURM_NODEID=1` (the second node)
- `SLURM_LOCALID=4` (the 5th rank on that node)
This matters because all ranks on the same node share the same `/tmp` directory.
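A quick way to see these identifiers in action is to print them from every rank; a sketch, run once per rank under `srun`:

```bash
#!/bin/bash
# Each rank reports its global rank, node index, and node-local rank.
# Ranks that print the same SLURM_NODEID share one /tmp.
echo "host=$(hostname) global_rank=${SLURM_PROCID}" \
     "node=${SLURM_NODEID} local_rank=${SLURM_LOCALID}"
```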
## Common environment variables
These variables are automatically defined by Slurm (and therefore by Tapis on Slurm systems):
| Variable | Description |
|---|---|
| `SLURM_PROCID` | Global MPI rank ID |
| `SLURM_LOCALID` | Rank index within a node |
| `SLURM_NODEID` | Node index within the job allocation |
| `SLURM_JOB_ID` | Scheduler job identifier |
| `USER` | Unix username |
| `SLURM_SUBMIT_DIR` | Shared job execution directory |
Throughout this chapter, these variables are used to:

- create unique per-rank or per-node directories
- coordinate file copies safely
- control cleanup behavior
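Combined, they let each rank build a collision-free scratch path and remove it when the rank exits. A minimal sketch (the directory layout is an assumption, not a site convention):

```bash
#!/bin/bash
# Per-rank scratch directory built from the job ID and global rank;
# the trap removes it when this rank's script exits, for any reason.
SCRATCH="/tmp/${USER}/${SLURM_JOB_ID}/rank_${SLURM_PROCID}"
mkdir -p "$SCRATCH"
trap 'rm -rf "$SCRATCH"' EXIT
# ... write temporary files under "$SCRATCH" here ...
```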
## Why rank-aware file management is essential
Without rank-aware file paths:

- ranks overwrite each other's temporary files
- partial files are read before being fully written
- jobs fail intermittently and are difficult to debug

By explicitly separating files by rank and node, you gain:

- deterministic behavior
- reproducibility
- safe use of high-performance node-local storage
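One common pattern that addresses both collisions and partial reads is to let a single rank per node stage shared input into `/tmp` and publish a sentinel file once the copy is complete. A hedged sketch (`$SHARED_INPUT` is a hypothetical path on the shared filesystem; production code would also add a timeout):

```bash
#!/bin/bash
LOCAL_COPY="/tmp/${SLURM_JOB_ID}/input.dat"
DONE_FLAG="${LOCAL_COPY}.done"

if [ "${SLURM_LOCALID}" -eq 0 ]; then
    # Exactly one rank per node performs the copy ...
    mkdir -p "$(dirname "$LOCAL_COPY")"
    cp "$SHARED_INPUT" "$LOCAL_COPY"
    touch "$DONE_FLAG"   # ... then signals that the file is complete.
else
    # Other ranks on the node wait for the sentinel, so they never
    # read a partially written copy.
    while [ ! -f "$DONE_FLAG" ]; do sleep 1; done
fi
```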
## How this fits into the broader storage model
In summary:

- **Shared filesystems** (e.g., Work, Scratch):
  - single job directory
  - visible to all nodes
  - ideal for inputs and final outputs
- **Node-local `/tmp`:**
  - fast but ephemeral
  - requires explicit management
  - ideal for temporary, I/O-intensive data
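As a preview, here is a minimal copy-in/compute/copy-out sketch that combines the two tiers (`$WORKDIR` and `my_solver` are hypothetical placeholders for a shared directory and your application):

```bash
#!/bin/bash
LOCAL="/tmp/${SLURM_JOB_ID}/rank_${SLURM_PROCID}"
mkdir -p "$LOCAL"

cp "${WORKDIR}/input.dat" "${LOCAL}/"              # stage in from shared storage
my_solver "${LOCAL}/input.dat" "${LOCAL}/out.dat"  # compute with fast node-local I/O
cp "${LOCAL}/out.dat" "${WORKDIR}/out_rank_${SLURM_PROCID}.dat"  # stage out results
rm -rf "$LOCAL"                                    # /tmp is ephemeral: clean up
```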
The MPI-safe patterns that follow build directly on these concepts and show how to combine correctness and performance in large-scale parallel workflows.