filter_tapis_jobs_df()

filter_tapis_jobs_df()#

filter_tapis_jobs_df(SelectCriteria, filtered_df, displayIt=False)

This function filters a Tapis jobs DataFrame, which you typically create with get_tapis_jobs_df(), based on flexible user-defined selection criteria provided in a Python dictionary.

It’s designed to support the various time-related columns typically found in Tapis job metadata, as well as any other column you’d like to filter.

Supported filtering logic#

It intelligently handles different types of data:

1. Time-based ranges#

For example, if your DataFrame has columns like:

created_unix (seconds since epoch)
created_dt (UTC datetime)
created_date (just date)

and you provide:

SelectCriteria = {'created': ['2024-08-01', '2024-08-31']}

the function:

recognizes created is a time field,
converts your string dates to Unix timestamps,
and filters created_unix to include only jobs submitted in August 2024.

It automatically handles:

‘*_unix’ columns: with Unix second comparisons
‘*_dt’ columns: with datetime comparisons
‘*_date’ columns: with pure date comparisons

2. General isin list matching#

For non-time columns (like appId, status), you can pass lists:

SelectCriteria = {'appId': ['opensees-mp-s3', 'opensees-sp-s3']}

and it filters rows where appId matches any of those values.

3. Single date or exact value matching#

For single values (not a list), e.g.:

SelectCriteria = {'created': '2024-08-15'}

it checks for jobs submitted exactly on that date. For other fields (like status), it does simple == matching.

How it works internally#

Determines the type of filter needed based on your key and value.
If the key is a time column (created, remoteStarted, ended, lastUpdated) or any of their suffixes (_unix, _dt, _date), it applies special logic:
- If you give a range (a list with two values), it does a between-filter.
- If you give a single date string, it tries to match that date.
Otherwise, if you pass a list of values, it applies an isin filter.
Or if it’s a single value, it does a direct equality check.

Example usage#

criteria = {
    'status': ['FINISHED'],
    'appId': ['opensees-mp-s3'],
    'created': ['2024-08-01', '2024-08-31']  # filter by date range
}
uuids, df_filtered = filter_tapis_jobs_df(criteria, JobsData_df, displayIt=True)

This will:

Filter the DataFrame to jobs finished in August 2024, submitted with the opensees-mp-s3 app.
Print a summary and show the filtered dataframe.

Return values#

filtered_uuid: a list of the uuid strings for all matching jobs.
filtered_df: the filtered DataFrame itself.

Notes & robust design#

The function checks if each key exists in your DataFrame before trying to filter on it, so you can safely pass broad SelectCriteria.
It handles malformed or missing timestamp strings gracefully by converting them to -1, which effectively excludes them from time filters.
The display will show how many jobs matched, their UUIDs (if any), and the top of the filtered DataFrame.

In short:#

This is your central, reusable function for slicing and dicing Tapis job metadata, making it easy to build interactive dashboards or pipelines that explore your DesignSafe Tapis workflows.

Files#

You can find these files in Community Data.

filter_tapis_jobs_df.py

def filter_tapis_jobs_df(SelectCriteria, filtered_df, displayIt=False):
    """
    Filter a Tapis jobs DataFrame based on a flexible SelectCriteria dictionary.

    Parameters
    ----------
    SelectCriteria : dict
        A dictionary where keys are column names (like 'status', 'appId', or time fields)
        and values are either:
            - A list of values for `isin` checks,
            - A two-element list for ranges (especially dates),
            - Or a single value for exact matching.

    filtered_df : pandas.DataFrame
        The DataFrame to filter, typically generated by get_tapis_jobs_df().
        Should include time columns like created_unix, created_dt, created_date.

    displayIt : bool, optional
        If True, prints a summary of how many jobs matched, their UUIDs,
        and displays the filtered DataFrame.

    Returns
    -------
    tuple
        (filtered_uuid, filtered_df)
        - filtered_uuid : list of uuids matching the filters
        - filtered_df : the filtered DataFrame

    Filtering Logic
    ---------------
    - Recognizes time fields: 'created', 'remoteStarted', 'ended', 'lastUpdated'
      and handles any of their suffixes (_unix, _dt, _date) intelligently.
    - If the SelectCriteria key is one of these, it supports:
        - Ranges: ['2024-08-01', '2024-08-31'] → filters on unix or datetime columns.
        - Single dates: '2024-08-15' → matches that specific day.
    - For all other keys:
        - Lists are used with isin().
        - Single values are checked with ==.

    Notes
    -----
    - Internally converts timestamp strings to Unix time to allow robust comparisons.
    - Handles missing or malformed timestamps gracefully by excluding them.
    - Will skip any keys not present in the DataFrame.
    """
    # Silvia Mazzoni, 2025
    from datetime import datetime, timezone
    import re
    
    def convert_time_unix(timestamp_str):
        try:
            if not timestamp_str:
                return -1
            ts_clean = timestamp_str.rstrip('Z')
            if 'T' in ts_clean:
                fmt = "%Y-%m-%dT%H:%M:%S.%f" if re.search(r'\.\d+', ts_clean) else "%Y-%m-%dT%H:%M:%S"
            else:
                fmt = "%Y-%m-%d"
            outTime = datetime.strptime(ts_clean, fmt)
            outTime = outTime.replace(tzinfo=timezone.utc)
            return outTime.timestamp()
        except Exception:
            return -1

    time_keys = ['created', 'remoteStarted', 'ended', 'lastUpdated']
    time_keys_unix = [f"{k}_unix" for k in time_keys]
    time_keys_dt = [f"{k}_dt" for k in time_keys]
    time_keys_date = [f"{k}_date" for k in time_keys]

    for key, values in SelectCriteria.items():
        if key not in filtered_df.columns:
            continue

        # Handle ranges
        if isinstance(values, list) and len(values) == 2:
            if key in time_keys or key in time_keys_unix:
                # convert date strings to unix
                min_time = convert_time_unix(values[0]) if key in time_keys else values[0]
                max_time = convert_time_unix(values[1]) if key in time_keys else values[1]
                filtered_df = filtered_df[
                    (filtered_df[f"{key}_unix"] >= min_time) & (filtered_df[f"{key}_unix"] <= max_time)
                ]
            elif key in time_keys_dt:
                filtered_df = filtered_df[
                    (filtered_df[key] >= values[0]) & (filtered_df[key] <= values[1])
                ]
            elif key in time_keys_date:
                filtered_df = filtered_df[
                    (filtered_df[key] >= values[0]) & (filtered_df[key] <= values[1])
                ]
            else:
                filtered_df = filtered_df[filtered_df[key].isin(values)]

        # Handle single date match on _date
        elif key in time_keys + time_keys_unix + time_keys_dt + time_keys_date:
            try:
                target_date = datetime.strptime(values, "%Y-%m-%d").date()
                for tk in time_keys:
                    if key.startswith(tk):
                        filtered_df = filtered_df[filtered_df[f"{tk}_date"] == target_date]
            except Exception:
                pass

        # Handle single exact value
        else:
            filtered_df = filtered_df[filtered_df[key] == values]

    filtered_uuid = list(filtered_df['uuid'])

    if displayIt:
        print(f'Found {len(filtered_df)} jobs')
        if len(filtered_uuid) > 0:
            print('-- uuid --')
            display(filtered_uuid)
        print('-- Job Metadata --')
        display(filtered_df)

    return filtered_uuid, filtered_df