Node-Local Files in MPI#
MPI-safe patterns for per-rank scratch files (node-local /tmp)
When you use /tmp in an MPI job, remember:

- /tmp is node-local, so each node has its own /tmp
- multiple ranks on the same node share the same /tmp namespace
- if you don't separate per-rank (and sometimes per-node) paths, ranks will overwrite each other's files
Why /tmp requires special handling in MPI jobs#
The /tmp directory is:

- Fast (node-local disk or RAM-backed)
- Not shared across nodes
- Shared by all ranks on the same node
- Typically purged when the job ends (the exact policy is system-dependent)
As a result:

- /tmp/input.dat is visible to all ranks on that node
- Rank collisions will occur unless filenames or directories are separated
- Files in /tmp must be explicitly copied back to shared storage
This is fundamentally different from Tapis-staged directories on shared filesystems, where all nodes see the same paths.
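The collision behavior is easy to see for yourself. A hypothetical demo (all names are illustrative): simulate two ranks that share one node's /tmp, with a mktemp sandbox standing in for the real /tmp so the sketch can run anywhere.

```shell
#!/usr/bin/env bash
set -euo pipefail

NODE_TMP="$(mktemp -d)"   # stands in for one node's /tmp

# Unscoped path: both "ranks" write the same file; the later write wins.
for rank in 0 1; do
  echo "data from rank ${rank}" > "${NODE_TMP}/input.dat"
done
cat "${NODE_TMP}/input.dat"        # only rank 1's data survives

# Rank-scoped paths: no collision.
for rank in 0 1; do
  mkdir -p "${NODE_TMP}/rank_${rank}"
  echo "data from rank ${rank}" > "${NODE_TMP}/rank_${rank}/input.dat"
done
ls "${NODE_TMP}"/rank_*/input.dat  # two distinct files survive
```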
Managing /tmp Files by Rank#
Below are safe, practical patterns you can drop into tapisjob_app.sh (or any Slurm-launched MPI wrapper).
Pattern 1 — One unique scratch directory per rank
This is the simplest and safest approach: every rank creates and uses its own folder.
# In your MPI-launched command context
# Assumes Slurm; adjust env vars if needed.
RANK="${SLURM_PROCID:-0}"
JOBID="${SLURM_JOB_ID:-manual}"
SCR_ROOT="/tmp/${USER}/tapis_${JOBID}"
SCR_RANK="${SCR_ROOT}/rank_${RANK}"
mkdir -p "${SCR_RANK}"
# Copy in only what this rank needs (or common files, if small).
# The "|| true" keeps the script going when an optional input is absent;
# drop it if a missing input should be a hard error.
cp -p "input_${RANK}.dat" "${SCR_RANK}/" 2>/dev/null || true
cp -p common.dat "${SCR_RANK}/" 2>/dev/null || true
# Run using node-local paths
cd "${SCR_RANK}"
./solver common.dat "input_${RANK}.dat" > "rank_${RANK}.log" 2>&1
# Copy required outputs back to the shared filesystem
mkdir -p "${TAPIS_JOB_WORKDIR}/logs"
cp -p "output_${RANK}.dat" "${TAPIS_JOB_WORKDIR}/" 2>/dev/null || true
cp -p "rank_${RANK}.log" "${TAPIS_JOB_WORKDIR}/logs/" 2>/dev/null || true
Why it’s safe: no file collisions, easy debugging (each rank has its own working directory).
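Before submitting, the path logic above can be rehearsed on a workstation by faking the Slurm variables; a minimal sketch (the exported values are arbitrary stand-ins):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fake the Slurm environment so the Pattern 1 path logic can be checked locally.
export USER="${USER:-tester}"
export SLURM_PROCID=3 SLURM_JOB_ID=12345

RANK="${SLURM_PROCID:-0}"
JOBID="${SLURM_JOB_ID:-manual}"
SCR_ROOT="/tmp/${USER}/tapis_${JOBID}"
SCR_RANK="${SCR_ROOT}/rank_${RANK}"
mkdir -p "${SCR_RANK}"

echo "${SCR_RANK}"   # /tmp/<user>/tapis_12345/rank_3
```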
Pattern 2 — One scratch directory per node, then per-rank inside it
This is useful when you want a shared node-local cache (e.g., a big lookup table) but still avoid rank collisions.
RANK="${SLURM_PROCID:-0}"
LOCALID="${SLURM_LOCALID:-0}" # rank index within node
NODEID="${SLURM_NODEID:-0}" # node index within allocation
JOBID="${SLURM_JOB_ID:-manual}"
SCR_NODE="/tmp/${USER}/tapis_${JOBID}/node_${NODEID}"
SCR_RANK="${SCR_NODE}/rank_${LOCALID}"
mkdir -p "${SCR_RANK}"
# (Optional) node-shared cache directory
SCR_CACHE="${SCR_NODE}/cache"
mkdir -p "${SCR_CACHE}"
Typical use: one rank per node populates ${SCR_CACHE}, others read from it.
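For a 2-node job with two ranks per node, Pattern 2 yields the hierarchy sketched below; node and rank counts are illustrative, and a mktemp sandbox stands in for each node's /tmp so the sketch runs anywhere:

```shell
#!/usr/bin/env bash
set -euo pipefail

SANDBOX="$(mktemp -d)"
ROOT="${SANDBOX}/tapis_12345"

# Build the Pattern 2 hierarchy: per-node dirs, each with per-rank dirs + cache.
for node in 0 1; do
  mkdir -p "${ROOT}/node_${node}/cache"
  for lid in 0 1; do
    mkdir -p "${ROOT}/node_${node}/rank_${lid}"
  done
done

# Prints node_*/cache and node_*/rank_* directories in sorted order
find "${ROOT}" -mindepth 1 -type d | sort
```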
Pattern 3 — “One rank per node” does the heavy copy (node-local caching)
Use this when you have a large common file (e.g., ground motions, big meshes) that would be wasteful to copy once per rank.
Goal: copy once per node, then all ranks on that node read it from /tmp.
NODEID="${SLURM_NODEID:-0}"
LOCALID="${SLURM_LOCALID:-0}"
JOBID="${SLURM_JOB_ID:-manual}"
SCR_NODE="/tmp/${USER}/tapis_${JOBID}/node_${NODEID}"
SCR_CACHE="${SCR_NODE}/cache"
mkdir -p "${SCR_CACHE}"
COMMON_SRC="${TAPIS_JOB_WORKDIR}/common_big.dat"
COMMON_DST="${SCR_CACHE}/common_big.dat"
# Only LOCALID==0 copies the large file on each node
if [[ "${LOCALID}" == "0" ]]; then
    cp -p "${COMMON_SRC}" "${COMMON_DST}"
fi
# A barrier is needed so other ranks don't read before the copy finishes;
# two options follow.
Barrier options (choose one):
A) Separate srun steps (recommended in Slurm batch scripts):

# In the batch script, make the per-node copy its own step before the solver:
srun --ntasks-per-node=1 cp -p "${TAPIS_JOB_WORKDIR}/common_big.dat" /tmp/common_big.dat

This works because srun does not return until every task in the step has started and finished, so the next step cannot begin before all node-local copies are complete. Note this applies between steps in the batch script; calling srun from inside an already-running MPI rank is not a reliable barrier.
B) File-based barrier (portable, but more management):
BARRIER="${SCR_NODE}/.ready"
if [[ "${LOCALID}" == "0" ]]; then
    cp -p "${COMMON_SRC}" "${COMMON_DST}"
    touch "${BARRIER}"
fi
# Everyone waits until the node-local copy is ready
while [[ ! -f "${BARRIER}" ]]; do
    sleep 0.1
done
Then each rank can safely run:
./solver "${COMMON_DST}" other_inputs...
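The file-based barrier in option B can be rehearsed on a single machine: a background process plays the non-leader rank and spins on the flag while the "leader" copies and then releases. A sketch under that setup (all paths are mktemp sandbox stand-ins):

```shell
#!/usr/bin/env bash
set -euo pipefail

SCR_NODE="$(mktemp -d)"                  # stands in for the node-local dir
COMMON_SRC="${SCR_NODE}/src.dat"
COMMON_DST="${SCR_NODE}/cache/common.dat"
BARRIER="${SCR_NODE}/.ready"
mkdir -p "${SCR_NODE}/cache"
echo "big shared input" > "${COMMON_SRC}"

# "Follower" rank: wait for the flag, then read the cached copy.
(
  while [[ ! -f "${BARRIER}" ]]; do sleep 0.05; done
  cat "${COMMON_DST}" > "${SCR_NODE}/follower_saw.txt"
) &

sleep 0.2                                # let the follower start spinning
cp -p "${COMMON_SRC}" "${COMMON_DST}"    # "leader" copies...
touch "${BARRIER}"                       # ...then releases the barrier
wait                                     # join the follower
cat "${SCR_NODE}/follower_saw.txt"       # prints: big shared input
```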
Pattern 4 — Output aggregation: avoid “N ranks writing one shared file”
A very common failure mode is all ranks appending to one output file on shared storage or even /tmp.
Safer alternatives:
4A) Per-rank outputs, then merge on rank 0
RANK="${SLURM_PROCID:-0}"
OUTDIR="${TAPIS_JOB_WORKDIR}/outputs"
mkdir -p "${OUTDIR}"
# each rank writes its own output
./solver ... > "${OUTDIR}/rank_${RANK}.out" 2>&1
# rank 0 merges (optional). Do this only after all ranks have finished
# writing, e.g. as a separate step or after a barrier; note the glob
# sorts lexicographically, not numerically.
if [[ "${RANK}" == "0" ]]; then
    cat "${OUTDIR}"/rank_*.out > "${OUTDIR}/merged.out"
fi
4B) Use MPI-IO or a parallel output format (best for large data)
If your code supports HDF5/MPI-IO or parallel NetCDF, that’s typically more scalable than ad-hoc text aggregation.
Pattern 5 — Cleanup that won’t break your run
You generally want cleanup, but you don’t want rank races deleting shared paths too early.
Rule of thumb:

- each rank may delete its own rank directory
- only one rank per node (LOCALID==0) deletes the node directory
- rank 0 deletes the job-level scratch root (optional)
RANK="${SLURM_PROCID:-0}"
LOCALID="${SLURM_LOCALID:-0}"
NODEID="${SLURM_NODEID:-0}"
JOBID="${SLURM_JOB_ID:-manual}"
SCR_ROOT="/tmp/${USER}/tapis_${JOBID}"
SCR_NODE="${SCR_ROOT}/node_${NODEID}"
SCR_RANK="${SCR_NODE}/rank_${LOCALID}"
# delete per-rank
rm -rf "${SCR_RANK}"
# node leader deletes node dir after a brief wait
if [[ "${LOCALID}" == "0" ]]; then
    # small delay helps avoid races in some workflows
    sleep 0.2
    rm -rf "${SCR_NODE}"
fi
# (Optional) global cleanup by rank 0
if [[ "${RANK}" == "0" ]]; then
    sleep 0.5
    rm -rf "${SCR_ROOT}"
fi
If you ever want post-mortem debugging, skip cleanup or gate it behind a flag like KEEP_TMP=1.
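The KEEP_TMP gate can be a few lines at the top of your cleanup code. A minimal sketch (function and variable names are illustrative, with a mktemp sandbox in place of the real scratch dir):

```shell
#!/usr/bin/env bash
set -euo pipefail

SCR_RANK="$(mktemp -d)"          # stands in for the per-rank scratch dir
touch "${SCR_RANK}/debug.log"

cleanup() {
  if [[ "${KEEP_TMP:-0}" == "1" ]]; then
    echo "KEEP_TMP=1: keeping ${SCR_RANK} for post-mortem" >&2
    return 0
  fi
  rm -rf "${SCR_RANK}"
}

KEEP_TMP=1 cleanup               # scratch dir survives for debugging
[[ -d "${SCR_RANK}" ]] && echo "kept"
KEEP_TMP=0 cleanup               # normal run: scratch dir is removed
[[ ! -d "${SCR_RANK}" ]] && echo "removed"
```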
Quick decision guide#
- Most robust: Pattern 1 (per-rank dir)
- Best for big shared inputs: Pattern 3 (per-node cache + barrier)
- Best for outputs: Pattern 4A (per-rank files + merge) or MPI-IO formats
- Least fragile cleanup: Pattern 5 (hierarchical cleanup)
Combination Bash utility block
The following block combines all of the above. You can reuse it across all your Tapis app wrappers:
#!/usr/bin/env bash
#===============================================================================
# MPI / Slurm / Tapis: node-local /tmp utilities (drop-in)
#
# Goal:
# Safe, reusable helpers for per-rank scratch dirs, per-node caching,
# lightweight barriers, and staging outputs back to the shared job directory.
#
# Designed for:
# Slurm-launched MPI jobs (srun/mpirun within a Tapis job wrapper).
#
# Notes:
# - /tmp is node-local and ephemeral. Use only for temporary files.
# - Avoid collisions by always using rank- or node-scoped paths.
# - Prefer "copy once per node" for large shared inputs.
#===============================================================================
set -euo pipefail
#-----------------------------
# Environment discovery
#-----------------------------
mpi_rank() { echo "${SLURM_PROCID:-0}"; }
mpi_localid(){ echo "${SLURM_LOCALID:-0}"; }
mpi_nodeid() { echo "${SLURM_NODEID:-0}"; }
job_id() { echo "${SLURM_JOB_ID:-manual}"; }
user_name() { echo "${USER:-user}"; }
# Shared work directory (Tapis sets this; fall back to current working dir)
shared_workdir() {
    if [[ -n "${TAPIS_JOB_WORKDIR:-}" ]]; then
        echo "${TAPIS_JOB_WORKDIR}"
    else
        pwd
    fi
}
# Root for all node-local scratch for this job
tmp_root() {
    local jid; jid="$(job_id)"
    echo "/tmp/$(user_name)/tapis_${jid}"
}
# Node-scoped scratch root on this node
node_tmp_root() {
    local node; node="$(mpi_nodeid)"
    echo "$(tmp_root)/node_${node}"
}
#-----------------------------
# Logging helpers
#-----------------------------
log() { echo "[tmp-util] $*" >&2; }
die() { echo "[tmp-util][ERROR] $*" >&2; exit 1; }
#-----------------------------
# Core utilities
#-----------------------------
# make_rank_tmp [optional_subdir]
# Creates and prints a unique per-rank scratch directory on this node.
make_rank_tmp() {
    local sub="${1:-}"
    local lid; lid="$(mpi_localid)"
    local base; base="$(node_tmp_root)/rank_${lid}"
    local path="${base}"
    if [[ -n "${sub}" ]]; then
        path="${base}/${sub}"
    fi
    mkdir -p "${path}"
    echo "${path}"
}
# make_node_tmp [optional_subdir]
# Creates and prints a node-scoped scratch directory (shared by ranks on node).
make_node_tmp() {
    local sub="${1:-}"
    local base; base="$(node_tmp_root)"
    local path="${base}"
    if [[ -n "${sub}" ]]; then
        path="${base}/${sub}"
    fi
    mkdir -p "${path}"
    echo "${path}"
}
# node_leader: true if this rank is "leader" on node (LOCALID==0)
node_leader() {
    [[ "$(mpi_localid)" == "0" ]]
}
# global_leader: true if this rank is rank 0 in the MPI world
global_leader() {
    [[ "$(mpi_rank)" == "0" ]]
}
# node_cache_file <source_path> <dest_basename> [cache_subdir]
# Copy a file ONCE PER NODE into node-local cache, then wait until ready.
# Returns full path to cached file.
node_cache_file() {
    local src="${1:?source_path required}"
    local name="${2:?dest_basename required}"
    local sub="${3:-cache}"
    local cache_dir; cache_dir="$(make_node_tmp "${sub}")"
    local dst="${cache_dir}/${name}"
    local ready="${cache_dir}/.${name}.ready"
    # Only one rank per node copies
    if node_leader; then
        if [[ ! -f "${src}" ]]; then
            die "node_cache_file: source not found: ${src}"
        fi
        # Copy atomically: write to temp then mv
        local tmp="${dst}.tmp.$$"
        log "Node leader copying to cache: ${src} -> ${dst}"
        cp -p "${src}" "${tmp}"
        mv -f "${tmp}" "${dst}"
        : > "${ready}"
    fi
    # All ranks wait until cache is ready
    while [[ ! -f "${ready}" ]]; do
        sleep 0.1
    done
    echo "${dst}"
}
# mpi_barrier [barrier_name]
# Lightweight node-local release gate: the node leader touches a flag file
# and the other ranks on the same node spin until it appears.
# Caveats:
#   - One-shot: the leader does not wait for followers, and the flag file
#     persists, so use a fresh name for each synchronization point.
#   - Node-local only: ranks on other nodes are not synchronized.
#     For a *global* barrier across all nodes, use global_file_barrier below.
#
# For most workflows, you only need node-local barriers (per-node cache).
mpi_barrier() {
    local name="${1:-barrier}"
    local dir; dir="$(make_node_tmp "sync")"
    local flag="${dir}/.${name}.ready"
    if node_leader; then
        : > "${flag}"
    fi
    while [[ ! -f "${flag}" ]]; do
        sleep 0.05
    done
}
# global_file_barrier [barrier_name] [shared_dir]
# Global barrier across all ranks/nodes using the shared filesystem.
# Use sparingly: it adds shared-FS metadata traffic. One-shot: the token
# and RELEASE files persist, so use a fresh name for each barrier point.
global_file_barrier() {
    local name="${1:-global_barrier}"
    local shared="${2:-$(shared_workdir)}"
    local n="${SLURM_NTASKS:-}"
    [[ -n "${n}" ]] || die "global_file_barrier requires SLURM_NTASKS to be set."
    local bdir="${shared}/.barriers/${name}"
    mkdir -p "${bdir}"
    local r; r="$(mpi_rank)"
    local token="${bdir}/rank_${r}"
    : > "${token}"
    # rank 0 waits for all tokens, then releases everyone
    local release="${bdir}/RELEASE"
    if global_leader; then
        log "Global barrier: waiting for ${n} ranks..."
        local i
        for (( i = 0; i < n; i++ )); do
            while [[ ! -f "${bdir}/rank_${i}" ]]; do
                sleep 0.05
            done
        done
        : > "${release}"
    fi
    while [[ ! -f "${release}" ]]; do
        sleep 0.05
    done
}
# stage_in <src> <dst_dir>
# Copy a file/dir from shared filesystem into a given /tmp directory.
# - Preserves timestamps/permissions.
# - Works for files and directories.
stage_in() {
    local src="${1:?src required}"
    local dst_dir="${2:?dst_dir required}"
    mkdir -p "${dst_dir}"
    if [[ -d "${src}" ]]; then
        log "Staging in directory: ${src} -> ${dst_dir}/"
        cp -a "${src}" "${dst_dir}/"
    else
        log "Staging in file: ${src} -> ${dst_dir}/"
        cp -p "${src}" "${dst_dir}/"
    fi
}
# stage_out <src> <dst_dir>
# Copy a file/dir from /tmp back to shared filesystem.
# - Creates destination dir.
# - For directories, copies recursively.
stage_out() {
    local src="${1:?src required}"
    local dst_dir="${2:?dst_dir required}"
    mkdir -p "${dst_dir}"
    if [[ -d "${src}" ]]; then
        log "Staging out directory: ${src} -> ${dst_dir}/"
        cp -a "${src}" "${dst_dir}/"
    else
        log "Staging out file: ${src} -> ${dst_dir}/"
        cp -p "${src}" "${dst_dir}/"
    fi
}
# safe_rm <path>
# Defensive delete for scratch paths (only under /tmp/<user>/tapis_<jobid>).
safe_rm() {
    local p="${1:?path required}"
    local root; root="$(tmp_root)"
    if [[ "${p}" != "${root}"* ]]; then
        die "Refusing to delete outside job tmp root. path=${p} root=${root}"
    fi
    rm -rf "${p}"
}
# cleanup_tmp [KEEP_TMP env honored]
# Hierarchical cleanup:
# - each rank deletes its own rank dir
# - node leader deletes node dir
# - global leader deletes root (optional)
cleanup_tmp() {
    if [[ "${KEEP_TMP:-0}" == "1" ]]; then
        log "KEEP_TMP=1 set; skipping cleanup."
        return 0
    fi
    local lid; lid="$(mpi_localid)"
    local node; node="$(mpi_nodeid)"
    local rdir; rdir="$(node_tmp_root)/rank_${lid}"
    local ndir; ndir="$(tmp_root)/node_${node}"
    local root; root="$(tmp_root)"
    # per-rank cleanup
    [[ -d "${rdir}" ]] && safe_rm "${rdir}" || true
    # node leader cleans node directory after a brief delay
    if node_leader; then
        sleep 0.2
        [[ -d "${ndir}" ]] && safe_rm "${ndir}" || true
    fi
    # global leader can remove root
    if global_leader; then
        sleep 0.5
        [[ -d "${root}" ]] && safe_rm "${root}" || true
    fi
}
#===============================================================================
# Example usage (copy/paste into your wrapper as needed)
#===============================================================================
#
# # 1) Create per-rank scratch dir
# RANK_DIR="$(make_rank_tmp)"
# cd "${RANK_DIR}"
#
# # 2) Cache a large shared input once per node
# COMMON_SHARED="$(shared_workdir)/inputs/ground_motions.bin"
# COMMON_LOCAL="$(node_cache_file "${COMMON_SHARED}" "ground_motions.bin")"
#
# # 3) Stage rank-specific input and run
# stage_in "$(shared_workdir)/inputs/input_$(mpi_rank).dat" "${RANK_DIR}"
# ./solver "${COMMON_LOCAL}" "input_$(mpi_rank).dat" > "rank_$(mpi_rank).log" 2>&1
#
# # 4) Stage outputs back to shared FS
# stage_out "${RANK_DIR}/rank_$(mpi_rank).log" "$(shared_workdir)/logs"
# stage_out "${RANK_DIR}/output_$(mpi_rank).dat" "$(shared_workdir)/outputs"
#
# # 5) Cleanup (or set KEEP_TMP=1 to preserve for debugging)
# cleanup_tmp
#
#===============================================================================
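The safe_rm guard is worth rehearsing in isolation. A self-contained sketch of the same prefix check (the safe_rm here is a local re-implementation, not the sourced helper, and a mktemp sandbox stands in for the job tmp root):

```shell
#!/usr/bin/env bash
set -euo pipefail

ROOT="$(mktemp -d)"              # stands in for /tmp/<user>/tapis_<jobid>

# Refuse to delete anything that is not under the job's tmp root.
safe_rm() {
  local p="${1:?path required}"
  if [[ "${p}" != "${ROOT}"* ]]; then
    echo "refusing to delete outside ${ROOT}: ${p}" >&2
    return 1
  fi
  rm -rf "${p}"
}

mkdir -p "${ROOT}/rank_0"
safe_rm "${ROOT}/rank_0" && echo "inside root: deleted"
safe_rm "/etc" || echo "outside root: refused"
```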