Node-Local Files in MPI

MPI-safe patterns for per-rank scratch files (node-local /tmp)

When you use /tmp in an MPI job, remember:

/tmp is node-local, so each node has its own /tmp
multiple ranks on the same node share the same /tmp namespace
if you don’t separate paths per rank (and sometimes per node), ranks will overwrite each other’s files
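
The separation described above can be sketched as follows. This is a minimal illustration, not the page's own method: it assumes each rank is launched by Slurm so that SLURM_JOB_ID, SLURM_NODEID, and SLURM_LOCALID are set per task, and the function name rank_scratch_path and the directory layout are hypothetical.

```shell
# Hypothetical helper: print a scratch path unique to this job, node, and rank.
# Assumes Slurm sets SLURM_JOB_ID, SLURM_NODEID and SLURM_LOCALID for each task;
# the fallbacks only exist so the function can be tried outside a job.
rank_scratch_path() {
  printf '/tmp/%s/job_%s/node_%s/rank_%s\n' \
    "${USER:-user}" \
    "${SLURM_JOB_ID:-manual}" \
    "${SLURM_NODEID:-0}" \
    "${SLURM_LOCALID:-0}"
}

# Two ranks on the same node resolve to different directories.
(SLURM_LOCALID=0; rank_scratch_path)
(SLURM_LOCALID=1; rank_scratch_path)
```

Including the job ID in the path also keeps two jobs from the same user apart if they happen to land on the same node.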

Why /tmp requires special handling in MPI jobs

The /tmp directory is:

Fast (node-local disk or RAM-backed)
Not shared across nodes
Shared by all ranks on the same node
Deleted when the job ends

As a result:

/tmp/input.dat is the same file for every rank on that node
Ranks on a node will overwrite each other’s files unless filenames or directories are separated per rank
Files in /tmp must be explicitly copied back to shared storage before the job ends
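
The copy-back step in the last point can be sketched like this. The function name copy_back and both directory arguments are hypothetical, and the demo uses throwaway directories from mktemp as stand-ins for a rank's /tmp scratch and the shared job directory.

```shell
# Sketch: copy a rank's results from node-local scratch to shared storage.
# copy_back and both directory arguments are hypothetical names.
copy_back() {
  cb_src="${1:?scratch dir required}"
  cb_dst="${2:?shared dir required}"
  mkdir -p "${cb_dst}" || return 1
  # The trailing "/." copies the directory contents, hidden files included,
  # and -p preserves timestamps and permissions.
  cp -Rp "${cb_src}/." "${cb_dst}/"
}

# Demo with temporary directories standing in for /tmp and shared storage.
demo_scratch="$(mktemp -d)"
demo_shared="$(mktemp -d)"
echo "result from rank 0" > "${demo_scratch}/output_0.dat"
copy_back "${demo_scratch}" "${demo_shared}/outputs"
ls "${demo_shared}/outputs"
```

In a wrapper, one option is to register the call with `trap '...' EXIT` so results are copied even when the solver fails, keeping in mind that a trap cannot run if the scheduler kills the process outright.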

This is fundamentally different from Tapis-staged directories on shared filesystems, where all nodes see the same paths.

Managing /tmp files by Rank

Below are safe, practical patterns you can drop into tapisjob_app.sh (or any Slurm-launched MPI wrapper).

Quick decision guide

Most robust: Pattern 1 (per-rank dir)
Best for big shared inputs: Pattern 3 (per-node cache + barrier)
Best for outputs: Pattern 4A (per-rank files + merge) or MPI-IO formats
Least fragile cleanup: Pattern 5 (hierarchical cleanup)
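
The "per-rank files + merge" option is not spelled out in this section. As an assumption, it is read here as: each rank writes its own output file, and a single process then concatenates them in rank order. The name merge_rank_outputs and the out_<rank>.dat naming are hypothetical.

```shell
# Sketch of "per-rank files + merge" (an assumed reading of Pattern 4A).
# merge_rank_outputs and the out_<rank>.dat naming are hypothetical.
merge_rank_outputs() {
  mo_dir="${1:?directory with per-rank files required}"
  mo_ntasks="${2:?number of ranks required}"
  mo_merged="${3:?merged output path required}"

  : > "${mo_merged}"
  mo_i=0
  # Count ranks explicitly: a glob would sort out_10.dat before out_2.dat.
  while [ "${mo_i}" -lt "${mo_ntasks}" ]; do
    mo_part="${mo_dir}/out_${mo_i}.dat"
    # Fail loudly rather than silently producing a partial merge.
    [ -f "${mo_part}" ] || { echo "missing ${mo_part}" >&2; return 1; }
    cat "${mo_part}" >> "${mo_merged}"
    mo_i=$((mo_i + 1))
  done
}

# Demo: three simulated ranks, merged in rank order.
demo_dir="$(mktemp -d)"
for r in 0 1 2; do echo "rank ${r}" > "${demo_dir}/out_${r}.dat"; done
merge_rank_outputs "${demo_dir}" 3 "${demo_dir}/merged.dat"
cat "${demo_dir}/merged.dat"
```

In a real job the per-rank files would first be copied from each node's /tmp to shared storage, since the merging process can only read files on its own node or on the shared filesystem.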

Combined Bash utility block

The following block combines all of the above. You can reuse it across all your Tapis app wrappers:

#!/usr/bin/env bash
#===============================================================================
# MPI / Slurm / Tapis: node-local /tmp utilities (drop-in)
#
# Goal:
#   Safe, reusable helpers for per-rank scratch dirs, per-node caching,
#   lightweight barriers, and staging outputs back to the shared job directory.
#
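# Designed for:
#   Slurm-launched MPI jobs within a Tapis job wrapper, where each rank is
#   started by srun so that SLURM_PROCID / SLURM_LOCALID / SLURM_NODEID are
#   set per task. Under mpirun these are not set per rank; substitute your
#   launcher's variables (e.g. Open MPI's OMPI_COMM_WORLD_RANK and
#   OMPI_COMM_WORLD_LOCAL_RANK) in the discovery functions below.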
#
# Notes:
#   - /tmp is node-local and ephemeral. Use only for temporary files.
#   - Avoid collisions by always using rank- or node-scoped paths.
#   - Prefer "copy once per node" for large shared inputs.
#===============================================================================

set -euo pipefail

#-----------------------------
# Environment discovery
#-----------------------------
mpi_rank()   { echo "${SLURM_PROCID:-0}"; }
mpi_localid(){ echo "${SLURM_LOCALID:-0}"; }
mpi_nodeid() { echo "${SLURM_NODEID:-0}"; }

job_id()     { echo "${SLURM_JOB_ID:-manual}"; }
user_name()  { echo "${USER:-user}"; }

# Shared work directory (Tapis sets this; fall back to current working dir)
shared_workdir() {
  if [[ -n "${TAPIS_JOB_WORKDIR:-}" ]]; then
    echo "${TAPIS_JOB_WORKDIR}"
  else
    pwd
  fi
}

# Root for all node-local scratch for this job
tmp_root() {
  local jid; jid="$(job_id)"
  echo "/tmp/$(user_name)/tapis_${jid}"
}

# Node-scoped scratch root on this node
node_tmp_root() {
  local node; node="$(mpi_nodeid)"
  echo "$(tmp_root)/node_${node}"
}

#-----------------------------
# Logging helpers
#-----------------------------
log()  { echo "[tmp-util] $*" >&2; }
die()  { echo "[tmp-util][ERROR] $*" >&2; exit 1; }

#-----------------------------
# Core utilities
#-----------------------------

# make_rank_tmp [optional_subdir]
# Creates and prints a unique per-rank scratch directory on this node.
make_rank_tmp() {
  local sub="${1:-}"
  local lid; lid="$(mpi_localid)"
  local base; base="$(node_tmp_root)/rank_${lid}"
  local path="${base}"
  if [[ -n "${sub}" ]]; then
    path="${base}/${sub}"
  fi
  mkdir -p "${path}"
  echo "${path}"
}

# make_node_tmp [optional_subdir]
# Creates and prints a node-scoped scratch directory (shared by ranks on node).
make_node_tmp() {
  local sub="${1:-}"
  local base; base="$(node_tmp_root)"
  local path="${base}"
  if [[ -n "${sub}" ]]; then
    path="${base}/${sub}"
  fi
  mkdir -p "${path}"
  echo "${path}"
}

# node_leader: true if this rank is "leader" on node (LOCALID==0)
node_leader() {
  [[ "$(mpi_localid)" == "0" ]]
}

# global_leader: true if this rank is rank 0 in the MPI world
global_leader() {
  [[ "$(mpi_rank)" == "0" ]]
}

# node_cache_file <source_path> <dest_basename> [cache_subdir]
# Copy a file ONCE PER NODE into node-local cache, then wait until ready.
# Returns full path to cached file.
node_cache_file() {
  local src="${1:?source_path required}"
  local name="${2:?dest_basename required}"
  local sub="${3:-cache}"

  local cache_dir; cache_dir="$(make_node_tmp "${sub}")"
  local dst="${cache_dir}/${name}"
  local ready="${cache_dir}/.${name}.ready"

  # Only one rank per node copies
  if node_leader; then
    if [[ ! -f "${src}" ]]; then
      die "node_cache_file: source not found: ${src}"
    fi
    # Copy atomically: write to temp then mv
    local tmp="${dst}.tmp.$$"
    log "Node leader copying to cache: ${src} -> ${dst}"
    cp -p "${src}" "${tmp}"
    mv -f "${tmp}" "${dst}"
    : > "${ready}"
  fi

  # All ranks wait until cache is ready
  while [[ ! -f "${ready}" ]]; do
    sleep 0.1
  done

  echo "${dst}"
}

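# mpi_barrier [barrier_name]
# Lightweight node-local synchronization point (not a full MPI barrier).
# Implementation:
#   - The node leader creates a flag file; the other ranks on the node poll
#     until it exists. The leader itself does not wait for the other ranks.
#   - The flag file persists, so use a distinct barrier_name per sync point.
#   - Optionally, you can replace with an "srun no-op step" barrier if desired.
#
# For most workflows, you only need node-local synchronization (per-node cache).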
mpi_barrier() {
  local name="${1:-barrier}"
  local dir; dir="$(make_node_tmp "sync")"
  local flag="${dir}/.${name}.ready"

  # Node-local barrier: node leader touches flag; others wait.
  if node_leader; then
    : > "${flag}"
  fi
  while [[ ! -f "${flag}" ]]; do
    sleep 0.05
  done

  # NOTE: This is a node-local barrier (ranks on same node).
  # For a *global* barrier across all nodes, use global_file_barrier below.
}

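# global_file_barrier [barrier_name] [shared_dir]
# Global barrier across all ranks/nodes using the shared filesystem.
# Use sparingly: it adds shared FS metadata traffic.
# Token and RELEASE files are never removed, so each call needs a unique
# barrier_name; reusing a name makes the second call return immediately.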
global_file_barrier() {
  local name="${1:-global_barrier}"
  local shared="${2:-$(shared_workdir)}"
  local n="${SLURM_NTASKS:-}"

  [[ -n "${n}" ]] || die "global_file_barrier requires SLURM_NTASKS to be set."

  local bdir="${shared}/.barriers/${name}"
  mkdir -p "${bdir}"

  local r; r="$(mpi_rank)"
  local token="${bdir}/rank_${r}"
  : > "${token}"

  # rank 0 waits for all tokens then releases
  local release="${bdir}/RELEASE"
  if global_leader; then
    log "Global barrier: waiting for ${n} ranks..."
    local i
    for (( i=0; i< n; i++ )); do
      while [[ ! -f "${bdir}/rank_${i}" ]]; do
        sleep 0.05
      done
    done
    : > "${release}"
  fi

  while [[ ! -f "${release}" ]]; do
    sleep 0.05
  done
}

# stage_in <src> <dst_dir>
# Copy a file/dir from shared filesystem into a given /tmp directory.
# - Preserves timestamps/permissions.
# - Works for files and directories.
stage_in() {
  local src="${1:?src required}"
  local dst_dir="${2:?dst_dir required}"
  mkdir -p "${dst_dir}"

  if [[ -d "${src}" ]]; then
    log "Staging in directory: ${src} -> ${dst_dir}/"
    cp -a "${src}" "${dst_dir}/"
  else
    log "Staging in file: ${src} -> ${dst_dir}/"
    cp -p "${src}" "${dst_dir}/"
  fi
}

# stage_out <src> <dst_dir>
# Copy a file/dir from /tmp back to shared filesystem.
# - Creates destination dir.
# - For directories, copies recursively.
stage_out() {
  local src="${1:?src required}"
  local dst_dir="${2:?dst_dir required}"
  mkdir -p "${dst_dir}"

  if [[ -d "${src}" ]]; then
    log "Staging out directory: ${src} -> ${dst_dir}/"
    cp -a "${src}" "${dst_dir}/"
  else
    log "Staging out file: ${src} -> ${dst_dir}/"
    cp -p "${src}" "${dst_dir}/"
  fi
}

# safe_rm <path>
# Defensive delete for scratch paths (only under /tmp/<user>/tapis_<jobid>).
safe_rm() {
  local p="${1:?path required}"
  local root; root="$(tmp_root)"

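  # Accept only the root itself or paths strictly below it; a bare prefix test
  # would also match siblings such as "${root}_backup".
  if [[ "${p}" != "${root}" && "${p}" != "${root}/"* ]]; then
    die "Refusing to delete outside job tmp root. path=${p} root=${root}"
  fi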
  rm -rf "${p}"
}

# cleanup_tmp [KEEP_TMP env honored]
# Hierarchical cleanup:
#   - each rank deletes its own rank dir
#   - node leader deletes node dir
#   - global leader deletes root (optional)
cleanup_tmp() {
  if [[ "${KEEP_TMP:-0}" == "1" ]]; then
    log "KEEP_TMP=1 set; skipping cleanup."
    return 0
  fi

  local lid; lid="$(mpi_localid)"
  local node; node="$(mpi_nodeid)"
  local rdir; rdir="$(node_tmp_root)/rank_${lid}"
  local ndir; ndir="$(tmp_root)/node_${node}"
  local root; root="$(tmp_root)"

  # per-rank cleanup
  [[ -d "${rdir}" ]] && safe_rm "${rdir}" || true

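  # node leader removes the whole node directory after a brief delay; this also
  # deletes any remaining rank dirs, so call cleanup_tmp only once all ranks on
  # the node have finished their work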
  if node_leader; then
    sleep 0.2
    [[ -d "${ndir}" ]] && safe_rm "${ndir}" || true
  fi

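  # global leader removes the job root on its own node; /tmp is node-local, so
  # the (by then empty) root directory on other nodes is left in place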
  if global_leader; then
    sleep 0.5
    [[ -d "${root}" ]] && safe_rm "${root}" || true
  fi
}

#===============================================================================
# Example usage (copy/paste into your wrapper as needed)
#===============================================================================
#
# # 1) Create per-rank scratch dir
# RANK_DIR="$(make_rank_tmp)"
# cd "${RANK_DIR}"
#
# # 2) Cache a large shared input once per node
# COMMON_SHARED="$(shared_workdir)/inputs/ground_motions.bin"
# COMMON_LOCAL="$(node_cache_file "${COMMON_SHARED}" "ground_motions.bin")"
#
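# # 3) Stage rank-specific input and run
# #    (after the cd above, ./solver would resolve inside RANK_DIR, so call the
# #    solver by full path; its location in the shared job dir is an assumption)
# stage_in "$(shared_workdir)/inputs/input_$(mpi_rank).dat" "${RANK_DIR}"
# "$(shared_workdir)/solver" "${COMMON_LOCAL}" "input_$(mpi_rank).dat" > "rank_$(mpi_rank).log" 2>&1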
#
# # 4) Stage outputs back to shared FS
# stage_out "${RANK_DIR}/rank_$(mpi_rank).log" "$(shared_workdir)/logs"
# stage_out "${RANK_DIR}/output_$(mpi_rank).dat" "$(shared_workdir)/outputs"
#
# # 5) Cleanup (or set KEEP_TMP=1 to preserve for debugging)
# cleanup_tmp
#
#===============================================================================
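
The polling loops in node_cache_file, mpi_barrier, and global_file_barrier wait indefinitely. If the leader exits early (for example through die when the source file is missing), the remaining ranks keep polling until the scheduler ends the job. A bounded wait is one way to guard against that. The helper below is a hypothetical addition, not part of the utility block above, and its 300-second default is an arbitrary assumption.

```shell
# Hypothetical helper: wait for a flag file, but give up after a deadline.
# Usage: wait_for_file <path> [timeout_seconds]
# Returns 0 once the file exists, 1 if the timeout is reached first.
wait_for_file() {
  wf_path="${1:?path required}"
  wf_timeout="${2:-300}"
  wf_waited=0
  while [ ! -e "${wf_path}" ]; do
    if [ "${wf_waited}" -ge "${wf_timeout}" ]; then
      echo "timed out after ${wf_timeout}s waiting for ${wf_path}" >&2
      return 1
    fi
    sleep 1
    wf_waited=$((wf_waited + 1))
  done
}

# Demo: a background process creates the flag after one second.
demo_flag="$(mktemp -d)/.ready"
( sleep 1; : > "${demo_flag}" ) &
wait_for_file "${demo_flag}" 10 && echo "flag seen"
wait
```

A rank that times out can then fail with a clear message instead of hanging, for example `wait_for_file "${ready}" 600 || die "cache never became ready"` inside a function like node_cache_file.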
