Run a Tapis App#
Two Views of the Same Process (User Workflow + Tapis Runtime)
A Tapis job submission has one lifecycle, but it can be described from two perspectives:
A. What the app-user does (the “front” process)#
User-facing workflow: what you choose and what you provide (Portal / CLI / Tapipy)
Install Tapipy
Run this once to install the SDK:
pip install tapipy
Tapipy may already be installed in JupyterHub.
Connect to Tapis
Create the client and log in:
from tapipy.tapis import Tapis
# Replace with your credentials
t = Tapis(
    base_url="https://tacc.tapis.io",
    username="your-username",
    password="your-password",
    account_type="tacc"
)
t.get_tokens() # Log in to Tapis
Tip: You only need to call get_tokens() once per session.
1) Choose an app (and version)
You specify an appId and version (e.g., opensees-mp-s3 or ...-latest).
This determines:
the input schema (what files/params are allowed)
the runtime style (ZIP vs container)
the wrapper entrypoint (what actually runs on the cluster)
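Not sure what is available? You can query the Apps service first. A minimal sketch (the appId shown is illustrative):
# List apps you can run, then inspect one (appId is illustrative)
apps = t.apps.getApps()
for app in apps:
    print(app.id, app.version)

app = t.apps.getAppLatestVersion(appId="opensees-mp-s3")
print(app.version, app.runtime)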
2) Provide inputs and parameters
Supply input files/directories and any runtime parameters defined by the app schema.
You provide:
file inputs (files/dirs that must be staged to the execution system); see the sketch after this list
parameters (simple values like flags, paths, numeric settings)
optional environment variables (modules, pip installs, custom toggles)
archive system ID (e.g., “tacc-archive”)
archive directory where outputs should be stored (archiveSystemDir)
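As a sketch, file inputs and parameters are structured values in the submission request. The field names (sourceUrl, targetPath, appArgs, envVariables) follow the Tapis v3 job-submission schema; the specific names and paths below are illustrative:
# Illustrative structures; the app's schema defines which inputs/args are allowed
file_inputs = [
    {
        "name": "inputDirectory",                            # input name defined by the app
        "sourceUrl": "tapis://mysystem/work/myuser/model/",  # where Tapis stages from
        "targetPath": "inputDirectory"                       # path inside the job directory
    }
]
parameter_set = {
    "appArgs": [{"name": "mainProgram", "arg": "OpenSeesMP"}],
    "envVariables": [{"key": "MY_FLAG", "value": "1"}]
}
# Later: t.jobs.submitJob(..., fileInputs=file_inputs, parameterSet=parameter_set)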
3) Define execution settings (job attributes)
You request compute resources:
nodes, tasks/cores, walltime, queue/partition
optional scheduler extras (reservation, constraints), as sketched below
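These map to top-level job attributes in the submission request. A sketch with illustrative values (queue names and counts are system-specific):
# Resource requests as job attributes (illustrative values)
resource_attrs = dict(
    nodeCount=2,
    coresPerNode=48,
    maxMinutes=120,                  # walltime, in minutes
    execSystemLogicalQueue="skx"     # logical queue defined on the execution system
)
# Scheduler extras go in parameterSet.schedulerOptions:
scheduler_extras = {
    "schedulerOptions": [{"name": "reservation", "arg": "--reservation=myres"}]
}
# Later: t.jobs.submitJob(..., **resource_attrs, parameterSet=scheduler_extras)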
4) Submit the job
Tapis submits the job to the execution system and tracks status.
Example Submission using Tapipy
job = t.jobs.submitJob(
    name="my-first-job",
    appId="hello-world",
    appVersion="1.0",
    parameterSet={},    # populate per the app's schema if needed
    fileInputs=[],      # or provide input files here
    archiveSystemId="tacc-archive",
    archiveSystemDir="myuser/outputs/hello-job",
    archiveOnAppError=True
)
print("Job submitted!")
print("Job UUID:", job.uuid)
print("Status:", job.status)
5) Monitor execution
You track status from:
Portal, CLI, or API/Tapipy
logs produced by Slurm (stdout/stderr) and by your wrapper script (summary logs)
You can check on your job:
job = t.jobs.getJob(jobUuid=job.uuid)
print("Current Status:", job.status)
Or just the status field directly:
status = t.jobs.getJobStatus(jobUuid=job.uuid)
print(status.status)
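For unattended runs, a simple polling loop works. This sketch reuses the job object from the submission above and assumes the terminal statuses listed in the next section:
import time

# Poll until the job reaches a terminal state
terminal = {"FINISHED", "FAILED", "CANCELLED"}
while True:
    status = t.jobs.getJobStatus(jobUuid=job.uuid).status
    print("Status:", status)
    if status in terminal:
        break
    time.sleep(30)  # be gentle; status rarely changes second-to-second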
Job Status Values (for Filtering)
Common values you can use for the status field:
PENDING
QUEUED
STAGING_INPUTS
RUNNING
FINISHED
FAILED
CANCELLED
PAUSED
BLOCKED
You can filter jobs by status like:
jobs = t.jobs.listJobs(status='FINISHED')
Or via search:
search_query = json.dumps({"status": "FAILED"})
jobs = t.jobs.listJobs(search=search_query)
6) Retrieve outputs
When complete, outputs are available in the archive location and via the Files service for browsing/downloading/reuse.
Outputs are archived to the configured archive system/path.
You can browse, download, and reuse results in later workflows.
List available files
files = t.jobs.getJobOutputList(jobUuid=job.uuid, outputPath="/")
for f in files:
    print(f.name, f.size)
Download a file:
output = t.jobs.getJobOutputDownload(
    jobUuid=job.uuid,
    outputPath="stdout.txt"
)
with open("stdout.txt", "wb") as f:
    f.write(output)
The file paths (like “stdout.txt”) depend on how your app writes output.
Full Example Script
Submit → poll → list outputs
from tapipy.tapis import Tapis
import json
t = Tapis(
    base_url="https://tacc.tapis.io",
    username="your-username",
    password="your-password",
    account_type="tacc"
)
t.get_tokens()
job = t.jobs.submitJob(
    name="my-first-job",
    appId="hello-world",
    appVersion="1.0",
    # parameterSet / fileInputs structure varies by app definition
    parameterSet={},
    fileInputs=[],
    archiveSystemId="tacc-archive",
    archiveSystemDir="myuser/outputs/hello-job",
    archiveOnAppError=True
)
print("Job UUID:", job.uuid)
print("Status:", job.status)
# Poll status
job2 = t.jobs.getJob(jobUuid=job.uuid)
print("Current Status:", job2.status)
# Filter jobs (example)
search_query = json.dumps({"status": "FAILED"})
failed_jobs = t.jobs.listJobs(search=search_query)
print("Failed jobs returned:", len(failed_jobs))
B. What Tapis does (the “internal” runtime process)#
Runtime workflow: what Tapis automates on the execution system (SSH + filesystem + scheduler + archiving)
The internal runtime workflow (stage → submit → run → archive)
This is the same lifecycle, described by the system actions that occur after you click “Run Job” (or submit via the API). The exact details vary by execution system and runtime type, but the pattern is consistent:
validate → stage → unpack/prepare → submit → monitor → archive
1) Job Definition: Validation + job record creation
A Tapis App is defined by:
app.json → inputs, parameters, environment, runtime type
tapisjob_app.sh → wrapper script executed on the HPC system
optional supporting files (profiles, modules, docs)
When you submit a job, Tapis:
Validates your request against app.json
required inputs present
parameter types correct
enums/schema constraints satisfied
strictFileInputs enforced (if enabled)
Creates a job UUID and stores the resolved configuration (effective values)
Only after validation does the job move into staging.
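For orientation, here is a heavily abridged sketch of what an app definition might contain, expressed as a Python dict. Field names follow the Tapis v3 Apps schema; all values are illustrative, not a real app:
# Abridged, illustrative app definition (real apps define much more)
app_def = {
    "id": "hello-world",
    "version": "1.0",
    "runtime": "ZIP",            # or SINGULARITY for container apps
    "containerImage": "tapis://mysystem/apps/hello-world/app.zip",
    "jobType": "BATCH",
    "jobAttributes": {
        "execSystemId": "mysystem",
        "fileInputs": [{"name": "inputDirectory", "inputMode": "REQUIRED"}],
        "parameterSet": {
            "appArgs": [{"name": "mainProgram", "inputMode": "REQUIRED"}]
        }
    }
}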
2) Staging inputs (file-transfer phase #1)
Tapis prepares the execution environment on the HPC system:
Creates a job working directory (location depends on the Execution System definition)
Stages your input directory/files into the job’s working directory
Stages the runtime asset (ZIP bundle or container image reference)
Applies permissions and writes internal metadata for tracking
No execution occurs in staging — this phase is file preparation.
3) Runtime preparation (ZIP unpack / container plan)
A. ZIP runtime
Tapis copies the ZIP into the job directory
Extracts it in place
Makes tapisjob_app.sh executable
In a ZIP runtime, the extracted bundle is effectively your “app container” — just implemented as a portable archive.
B. Container runtime (Singularity/Apptainer)
Tapis ensures the image is available on the system
Plans bind mounts (exec/input/output paths)
Encodes the container command into the scheduler script
Tapis itself does not “run the container”; the scheduler-run script does.
4) Scheduler submission
Tapis constructs a scheduler batch script (e.g., for Slurm) using:
queue/partition
node/task/core counts
time limits
scheduler options (reservations, constraints)
execution-system profile settings
It injects your resource requests and the runtime command into that script, submits it (e.g., via sbatch), and stores the scheduler job ID so it can poll state.
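Conceptually, the generated script looks something like the sketch below (illustrative Slurm directives, not the exact script Tapis writes):
# Illustrative shape of the generated batch script (not Tapis's actual output)
batch_script = """#!/bin/bash
#SBATCH -J my-first-job
#SBATCH -p skx            # queue/partition from the job request
#SBATCH -N 2              # nodeCount
#SBATCH -n 96             # total tasks
#SBATCH -t 02:00:00       # walltime from maxMinutes
./tapisjob_app.sh         # wrapper entrypoint (ZIP runtime)
"""
print(batch_script)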
5) Wrapper script execution (your code runs here)
This is where the app logic lives.
The app’s tapisjob_app.sh typically:
Initializes timers and logs
Loads modules (from defaults and/or user-provided lists)
Configures Python (optionally installs packages)
Chooses a launcher
MPI → ibrun / srun
serial → direct execution
Runs the main executable
OpenSeesMP, OpenSeesSP, OpenSees, python3, etc.
Writes summary logs and exits with a code that Tapis can capture
Tapis does not interfere with what happens inside the wrapper — it only observes job state and outputs.
The app’s wrapper script (e.g., tapisjob_app.sh) controls:
module loads
python environment / pip installs
serial vs MPI launchers (ibrun, srun, etc.)
what gets written to output/logs
cleanup
6) Archiving outputs (file-transfer phase #2)
After the app’s wrapper exits:
Tapis creates the archive directory on the archive system/path
Copies outputs (excluding anything filtered by archiveFilter)
Includes Slurm logs (stdout/stderr) and wrapper logs
If archiveOnAppError=true, it still archives even when the job fails
This archiving phase can be slow if the output contains many small files.
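You can steer what gets copied with the archiveFilter block of parameterSet. A sketch (the patterns are illustrative):
# Keep results, skip bulky scratch files (patterns are illustrative)
parameter_set = {
    "archiveFilter": {
        "includes": ["results/*", "*.out"],
        "excludes": ["scratch/*"],
        "includeLaunchFiles": True   # also keep the generated batch script and logs
    }
}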
7) Completion and user visibility
Once archived, you can:
browse outputs in the portal or via the Files API
download logs and results
compare runs across job UUIDs
rerun with modified parameters
share results with collaborators (where supported)
This completes the Tapis lifecycle: stage → prepare → run → archive
The key idea is that Tapis is an orchestrator: it stages files, generates a scheduler script, submits to Slurm, monitors, then archives outputs.
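Because outputs land on the archive system, you can also reach them through the Files service directly. A sketch reusing the illustrative archive system/path from the earlier examples:
# Browse and fetch archived outputs via the Files service
listing = t.files.listFiles(systemId="tacc-archive", path="myuser/outputs/hello-job")
for item in listing:
    print(item.name, item.size)

data = t.files.getContents(systemId="tacc-archive", path="myuser/outputs/hello-job/stdout.txt")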
The lifecycle at a glance (swimlane)#
USER (Portal / CLI / Tapipy) TAPIS (Jobs Service + Files) HPC (Stampede3 / Slurm)
─────────────────────────────────── ───────────────────────────────────────── ─────────────────────────
1) Pick app + version ───────────────▶ Validate request (app schema)
2) Provide inputs/params ─────────────▶ Create job record (UUID, config)
3) Request resources ─────────────────▶ Stage inputs + runtime (file transfer)
Build batch script
sbatch batch_script ─────────────────────▶ Queue (PENDING)
Poll scheduler status ◀────────────────── Run (RUNNING)
4) Monitor status ◀─────────────────── Map scheduler states to Tapis states
5) Get results ◀────────────────────── Archive outputs (file transfer)
Provide outputs via Files API
On shared systems like Stampede3, jobs may queue before running due to demand — this delay is the trade-off for accessing powerful resources.
Appendix: Tapis job execution (SSH + Slurm timeline)#
Tapis does not run jobs internally. The Jobs service automates what you would otherwise do manually on an HPC system:
SSH into the execution system (as the effective HPC user)
Create job directories
Stage inputs and runtime assets
Write a scheduler batch script
Submit and monitor the scheduler job
Archive outputs and expose them via the Files service
Condensed behind-the-scenes timeline
SSH → mkdir job directories
SSH + Files → stage inputs
SSH → copy/unpack ZIP (or locate container image)
SSH → write scheduler script
SSH → submit (sbatch)
SSH → poll (squeue / sacct)
SSH → collect output metadata
Files Service → deliver outputs
Where to look when debugging#
If stuck in STAGING_INPUTS → input transfers, too many files, remote source delays
If stuck in QUEUED/PENDING → scheduler wait time (partition, allocation, walltime)
If failing in RUNNING → wrapper logic, module loads, env vars, executable errors
If slow after FINISHED → archiving overhead (again: too many files)
Important performance note: many “slow jobs” are not slow because compute is slow — they’re slow because file transfer is slow. The input staging and output archiving phases can dominate runtime when there are many small files. When possible: reduce file counts, reuse common datasets from Work/Scratch, or bundle inputs/outputs as a ZIP/TAR that you unpack/pack inside your wrapper.
Practical debugging: “Where is my time going?” Use the lifecycle to localize bottlenecks
When users say “Tapis is slow,” it usually means one of these stages:
Slow before RUNNING → input staging or queue wait
Slow after FINISHED → archiving (lots of files or large directories)
Slow during RUNNING → your executable/runtime environment
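To see which stage is eating time, the job history records a timestamp for each state change. A minimal sketch (exact event fields can vary by version, hence the getattr fallback):
# Print state transitions with timestamps to localize the bottleneck
history = t.jobs.getJobHistory(jobUuid=job.uuid)
for event in history:
    print(event.created, event.event, getattr(event, "description", ""))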
File-transfer advice (high impact):
Minimize file count (thousands of small files is worse than one big file)
Keep common datasets (e.g., ground motions) in Work/Scratch, and reuse them
Bundle inputs/outputs as ZIP/TAR and extract/pack inside the wrapper
Consider writing intermediate results to Work/Scratch and collecting only what you need at the end
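The bundling idea needs nothing beyond the standard library. A sketch (paths are illustrative); pair it with a matching extract step inside your wrapper script:
import tarfile

# Pack a directory of many small files into one archive before staging
with tarfile.open("inputs.tar.gz", "w:gz") as tar:
    tar.add("model_inputs", arcname=".")   # one file to transfer instead of thousands
# Stage inputs.tar.gz as a single file input; extract it in tapisjob_app.sh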