Execution Strategies
How workloads are mapped onto compute systems
An execution strategy describes how a workload is launched, distributed, coordinated, and completed on a computing system. While the workload defines computational behavior, the execution strategy defines control flow, parallel structure, and resource usage.
Crucially, execution strategies are independent of tools. The same strategy may be implemented using JupyterHub, SLURM scripts, or Tapis apps; what changes is the level of automation, not the intent.
Why Execution Strategies Matter
Execution strategies sit between scientific intent and computing tools.
They answer questions such as:
Should tasks run independently or in coordination?
Should the workflow be one long job or many small jobs?
Is performance limited by CPU, memory, communication, or I/O?
Does scaling mean more tasks, larger tasks, or longer runs?
Understanding execution strategies prevents common pitfalls such as:
oversubscribing memory,
underutilizing nodes,
overwhelming the filesystem with small files,
or adding resources that reduce performance.
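The first two pitfalls are two sides of the same calculation: how many tasks fit on a node is bounded by both its cores and its memory. The sketch below is a minimal illustration, assuming a known peak memory per task; the function names are hypothetical and the node sizes used in the usage note are placeholders, not the specification of any particular system.

```python
def tasks_per_node(cores: int, node_mem_gb: float, mem_per_task_gb: float,
                   cores_per_task: int = 1) -> int:
    """Return how many tasks fit on one node without oversubscribing
    either its cores or its memory."""
    if cores_per_task < 1 or mem_per_task_gb <= 0:
        raise ValueError("cores_per_task must be >= 1 and mem_per_task_gb > 0")
    by_cores = cores // cores_per_task
    by_memory = int(node_mem_gb // mem_per_task_gb)
    # The tighter of the two limits decides.
    return min(by_cores, by_memory)


def idle_core_fraction(cores: int, node_mem_gb: float, mem_per_task_gb: float,
                       cores_per_task: int = 1) -> float:
    """Fraction of the node's cores left idle when it is packed as
    tightly as the memory limit allows."""
    used = tasks_per_node(cores, node_mem_gb, mem_per_task_gb,
                          cores_per_task) * cores_per_task
    return 1.0 - used / cores
```

For a placeholder node with 128 cores and 256 GB of memory, tasks needing 8 GB each are memory-bound: `tasks_per_node(128, 256, 8)` gives 32 tasks, so three quarters of the cores would sit idle unless each task can use several cores.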
The Three Core Execution Dimensions
Every execution strategy is shaped by how a workload behaves along three fundamental dimensions:
1. Task Independence
Do individual tasks depend on each other?
Independent tasks → can run in any order or in parallel (Monte Carlo, parameter sweeps)
Dependent tasks → must run in sequence or coordinated steps (time-marching simulations, multi-stage workflows)
2. Resource Coupling
Do tasks share memory or communicate frequently?
Loosely coupled → minimal communication, file-based exchange
Tightly coupled → frequent synchronization, shared state, MPI
3. Time Structure
How does the workload evolve over time?
Short-lived tasks → many fast jobs, scheduling overhead matters
Long-running tasks → stability, checkpointing, and walltime matter
Iterative tasks → repeated execution with evolving state
These dimensions, not the software, determine the appropriate execution strategy.
Importantly, a single workflow may change execution strategy over its lifetime. For example, it may start as a set of embarrassingly parallel runs during exploration and evolve into a tightly coupled execution at scale.
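One way to read the dimensions above is as a small decision procedure. The sketch below is only one possible reading, not a rule from this guide: the `WorkloadProfile` fields and the order of the checks are assumptions made for illustration, and the returned labels are the strategy names introduced in the next section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    """Hypothetical summary of a workload along the dimensions above."""
    independent_tasks: bool    # task independence
    tightly_coupled: bool      # resource coupling
    multi_stage: bool = False  # time structure: distinct stages


def suggest_strategy(profile: WorkloadProfile) -> str:
    """Return a first-guess strategy label for a workload profile.

    The ordering is a heuristic assumption: staged workflows are treated
    as pipelines first, then coupling is checked, then independence.
    """
    if profile.multi_stage:
        return "pipeline / multi-stage"
    if profile.tightly_coupled:
        return "tightly coupled MPI"
    if profile.independent_tasks:
        return "embarrassingly parallel"
    return "single large batch"
```

A parameter sweep would be described as `WorkloadProfile(independent_tasks=True, tightly_coupled=False)`, while a domain-decomposed simulation would set `tightly_coupled=True`. Real workloads rarely fit one label cleanly, so the result is a starting point rather than a verdict.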
Common Execution Strategies
Below are the most common execution strategies used on DesignSafe and similar HPC platforms.
1. Embarrassingly Parallel Execution
Best for: Monte Carlo, parameter sweeps, batch preprocessing
Each task runs independently
Memory per task is usually modest, but total memory grows with the number of concurrent tasks
Scales horizontally across many cores or nodes
Typical patterns:
Job arrays
Parameterized batch jobs
Task launchers
Key risk: scheduling overhead and an explosion of small files that strains the filesystem
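A common way to limit both risks is to give each job-array task a batch of parameters rather than a single one, and to write one output file per task. The sketch below assumes a 0-based array (for example, one submitted with `--array=0-3`); `SLURM_ARRAY_TASK_ID` is the environment variable SLURM sets inside array tasks, while the function names are hypothetical.

```python
import os


def params_for_task(params: list, task_id: int, n_tasks: int) -> list:
    """Return the parameters handled by array task `task_id` (0-based).

    A strided split keeps the batches within one item of each other in
    size, even when len(params) is not a multiple of n_tasks.
    """
    if not 0 <= task_id < n_tasks:
        raise ValueError("task_id must satisfy 0 <= task_id < n_tasks")
    return params[task_id::n_tasks]


def current_task_id(default: int = 0) -> int:
    """Read the array index set by SLURM, falling back to `default`
    when running outside a job array (e.g., a local test)."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))
```

With 10 parameter sets and 4 array tasks, task 0 handles items 0, 4, and 8, and task 3 handles items 3 and 7. Four scheduled tasks and four output files replace ten of each.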
2. Single Large Batch Execution
Best for: Stepwise simulations, long-running solvers
One job runs for a long time
Memory footprint is stable
Parallelism is moderate or internal to the job (for example, multithreading within one node)
Typical patterns:
Single SLURM job
Multi-core shared-memory execution
Checkpoint/restart cycles
Key risk: underutilization if parallelism is limited
3. Tightly Coupled MPI Execution
Best for: Large structural models, domain-decomposed simulations
Tasks exchange data frequently
Strong synchronization requirements
Memory and network performance dominate
Typical patterns:
MPI ranks per node
Domain decomposition
Collective communication
Key risk: communication overhead and load imbalance
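Domain decomposition and load imbalance can be illustrated without an MPI library. The sketch below assumes a one-dimensional domain of equally expensive cells split into contiguous blocks, one per rank; the function names are hypothetical, and a real code would obtain `rank` and `n_ranks` from its MPI communicator.

```python
def block_range(n_cells: int, n_ranks: int, rank: int) -> tuple[int, int]:
    """Return the half-open range [start, stop) of cells owned by `rank`
    under a balanced block decomposition.

    When n_cells is not divisible by n_ranks, the first
    `n_cells % n_ranks` ranks each take one extra cell.
    """
    if not 0 <= rank < n_ranks:
        raise ValueError("rank must satisfy 0 <= rank < n_ranks")
    base, extra = divmod(n_cells, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def load_imbalance(work_per_rank: list[float]) -> float:
    """Ratio of the maximum to the mean work per rank.

    1.0 means perfectly balanced. In a synchronized step every rank
    waits for the slowest one, so this ratio approximates the slowdown
    relative to an ideal split.
    """
    mean = sum(work_per_rank) / len(work_per_rank)
    return max(work_per_rank) / mean
```

Splitting 10 cells over 4 ranks gives block sizes of 3, 3, 2, and 2, for an imbalance of 1.2. Cells with unequal cost, such as regions of nonlinear behavior in a structural model, can produce much larger ratios even when the cell counts are equal.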
4. Pipeline / Multi-Stage Execution
Best for: Preprocess → simulate → postprocess workflows
Workload is decomposed into stages
Each stage may use a different execution strategy
Intermediate data must be staged carefully
Typical patterns:
Sequential job chaining
Workflow managers
Conditional execution
Key risk: data movement dominates runtime
5. Accelerated Execution (GPU / Specialized Hardware)
Best for: ML training, large matrix operations, some preprocessing
High compute intensity
Performance sensitive to memory layout and data transfer
Often paired with CPU preprocessing
Typical patterns:
GPU-enabled batch jobs
Hybrid CPU/GPU pipelines
Key risk: idle accelerators due to poor data staging
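One way to reduce idle time is to overlap CPU-side data loading with accelerator compute, so the next batch is ready when the current one finishes. The sketch below shows the idea using only the Python standard library: a background thread loads batches into a bounded queue while the caller consumes them. The `load_batch` callable is a hypothetical stand-in for reading and preprocessing one batch; GPU frameworks provide their own data loaders, and this is not a substitute for them.

```python
import queue
import threading


def prefetching_batches(load_batch, n_batches: int, depth: int = 2):
    """Yield batches 0..n_batches-1 in order while a background thread
    loads up to `depth` batches ahead of the consumer.

    `load_batch(i)` is assumed to return batch `i`.
    """
    q = queue.Queue(maxsize=depth)  # bounded, so memory use stays capped
    done = object()                 # unique end-of-stream marker

    def producer():
        try:
            for i in range(n_batches):
                q.put(load_batch(i))  # blocks while the queue is full
            q.put(done)
        except Exception as exc:
            q.put(exc)  # hand loader errors over to the consumer

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = q.get()
        if item is done:
            return
        if isinstance(item, Exception):
            raise item
        yield item
```

A compute loop would iterate with `for batch in prefetching_batches(load_batch, n_batches): ...` and run its accelerator step in the loop body. If loading one batch takes longer than computing on it, prefetching alone cannot keep the accelerator busy, and the preprocessing stage itself needs more CPU resources or faster storage.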
Execution Strategy ≠ Platform
A critical distinction:
Execution strategies describe structure — not tools.
The same strategy can be implemented using:
JupyterHub (interactive, exploratory)
SLURM batch scripts (manual control)
Tapis apps (automated, repeatable workflows)
The strategy stays the same; only the level of automation and orchestration changes.
Execution Strategy ≠ Resource Size
A common misconception is that scaling a workload simply means adding more resources.
Many workloads fail to scale because the execution strategy does not match the workload structure.
Examples:
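Adding nodes to a tightly coupled simulation may slow it down once communication cost outgrows the compute time saved
Running many tiny tasks one after another inside a single multi-core job may leave most cores idle
GPU jobs whose input data is not preprocessed and staged fast enough may leave accelerators idle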
Choosing the right execution strategy is often more important than choosing the largest system.
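The first example can be made concrete with a toy runtime model: a fixed serial part, a parallel part that shrinks as ranks are added, and a communication cost that grows with the number of ranks. The linear communication term and the default values below are assumptions chosen only to show the shape of the curve, not measurements of any real code.

```python
def modeled_runtime(ranks: int, serial: float = 10.0,
                    parallel: float = 1000.0,
                    comm_per_rank: float = 0.5) -> float:
    """Toy runtime model (arbitrary time units):
    serial part + parallel part / ranks + communication cost * ranks."""
    return serial + parallel / ranks + comm_per_rank * ranks


def best_rank_count(max_ranks: int, **model_params) -> int:
    """Return the rank count in 1..max_ranks that minimizes the
    modeled runtime."""
    return min(range(1, max_ranks + 1),
               key=lambda p: modeled_runtime(p, **model_params))
```

With the default values, the modeled runtime is lowest at 45 ranks and rises beyond that: 256 ranks take longer than 64. The exact numbers are meaningless, but the pattern of a minimum followed by a slowdown is what a strong-scaling test on a tightly coupled code often looks like.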
Looking Ahead
In later chapters, these execution strategies will be mapped to:
Interactive environments (e.g., JupyterHub)
Batch systems (SLURM)
Automated pipelines (Tapis applications)
The goal is not to lock you into a single approach, but to give you a strategy-first mindset for building scalable, reusable computational workflows.
Guiding Principle
Performance problems are usually strategy problems, not hardware problems.