filter_tapis_jobs_df()#

filter_tapis_jobs_df(SelectCriteria, filtered_df, displayIt=False)

This function filters a Tapis jobs DataFrame, which you typically create with get_tapis_jobs_df(), based on flexible user-defined selection criteria provided in a Python dictionary.

It’s designed to support the various time-related columns typically found in Tapis job metadata, as well as any other column you’d like to filter.


Supported filtering logic#

It intelligently handles different types of data:

1. Time-based ranges#

For example, if your DataFrame has columns like:

  • created_unix (seconds since epoch)

  • created_dt (UTC datetime)

  • created_date (just date)

and you provide:

SelectCriteria = {'created': ['2024-08-01', '2024-08-31']}

the function:

  • recognizes created is a time field,

  • converts your string dates to Unix timestamps,

  • and filters created_unix to include only jobs submitted in August 2024.

It automatically handles:

  • ‘*_unix’ columns: with Unix second comparisons

  • ‘*_dt’ columns: with datetime comparisons

  • ‘*_date’ columns: with pure date comparisons

2. General isin list matching#

For non-time columns (like appId, status), you can pass lists:

SelectCriteria = {'appId': ['opensees-mp-s3', 'opensees-sp-s3']}

and it filters rows where appId matches any of those values.

3. Single date or exact value matching#

For single values (not a list), e.g.:

SelectCriteria = {'created': '2024-08-15'}

it checks for jobs submitted exactly on that date. For other fields (like status), it does simple == matching.


How it works internally#

  1. Determines the type of filter needed based on your key and value.

  2. If the key is a time column (created, remoteStarted, ended, lastUpdated) or any of their suffixes (_unix, _dt, _date), it applies special logic:

    • If you give a range (a list with two values), it does a between-filter.

    • If you give a single date string, it tries to match that date.

  3. Otherwise, if you pass a list of values, it applies an isin filter.

  4. Or if it’s a single value, it does a direct equality check.


Example usage#

criteria = {
    'status': ['FINISHED'],
    'appId': ['opensees-mp-s3'],
    'created': ['2024-08-01', '2024-08-31']  # filter by date range
}
uuids, df_filtered = filter_tapis_jobs_df(criteria, JobsData_df, displayIt=True)

This will:

  • Filter the DataFrame to jobs finished in August 2024, submitted with the opensees-mp-s3 app.

  • Print a summary and show the filtered dataframe.


Return values#

  • filtered_uuid: a list of the uuid strings for all matching jobs.

  • filtered_df: the filtered DataFrame itself.


Notes & robust design#

  • The function checks if each key exists in your DataFrame before trying to filter on it, so you can safely pass broad SelectCriteria.

  • It handles malformed or missing timestamp strings gracefully by converting them to -1, which effectively excludes them from time filters.

  • The display will show how many jobs matched, their UUIDs (if any), and the top of the filtered DataFrame.


In short:#

This is your central, reusable function for slicing and dicing Tapis job metadata, making it easy to build interactive dashboards or pipelines that explore your DesignSafe Tapis workflows.

Files#

You can find these files in Community Data.

filter_tapis_jobs_df.py
def filter_tapis_jobs_df(SelectCriteria, filtered_df, displayIt=False):
    """
    Filter a Tapis jobs DataFrame based on a flexible SelectCriteria dictionary.

    Parameters
    ----------
    SelectCriteria : dict
        A dictionary where keys are column names (like 'status', 'appId', or time fields)
        and values are either:
            - A list of values for `isin` checks,
            - A two-element list for ranges (especially dates),
            - Or a single value for exact matching.

    filtered_df : pandas.DataFrame
        The DataFrame to filter, typically generated by get_tapis_jobs_df().
        Should include time columns like created_unix, created_dt, created_date.

    displayIt : bool, optional
        If True, prints a summary of how many jobs matched, their UUIDs,
        and displays the filtered DataFrame.

    Returns
    -------
    tuple
        (filtered_uuid, filtered_df)
        - filtered_uuid : list of uuids matching the filters
        - filtered_df : the filtered DataFrame

    Filtering Logic
    ---------------
    - Recognizes time fields: 'created', 'remoteStarted', 'ended', 'lastUpdated'
      and handles any of their suffixes (_unix, _dt, _date) intelligently.
    - If the SelectCriteria key is one of these, it supports:
        - Ranges: ['2024-08-01', '2024-08-31'] → filters on unix or datetime columns.
        - Single dates: '2024-08-15' → matches that specific day.
    - For all other keys:
        - Lists are used with isin().
        - Single values are checked with ==.

    Notes
    -----
    - Internally converts timestamp strings to Unix time to allow robust comparisons.
    - Handles missing or malformed timestamps gracefully by excluding them.
    - Will skip any keys not present in the DataFrame.
    """
    # Silvia Mazzoni, 2025
    from datetime import datetime, timezone
    import re
    
    def convert_time_unix(timestamp_str):
        try:
            if not timestamp_str:
                return -1
            ts_clean = timestamp_str.rstrip('Z')
            if 'T' in ts_clean:
                fmt = "%Y-%m-%dT%H:%M:%S.%f" if re.search(r'\.\d+', ts_clean) else "%Y-%m-%dT%H:%M:%S"
            else:
                fmt = "%Y-%m-%d"
            outTime = datetime.strptime(ts_clean, fmt)
            outTime = outTime.replace(tzinfo=timezone.utc)
            return outTime.timestamp()
        except Exception:
            return -1

    time_keys = ['created', 'remoteStarted', 'ended', 'lastUpdated']
    time_keys_unix = [f"{k}_unix" for k in time_keys]
    time_keys_dt = [f"{k}_dt" for k in time_keys]
    time_keys_date = [f"{k}_date" for k in time_keys]

    for key, values in SelectCriteria.items():
        if key not in filtered_df.columns:
            continue

        # Handle ranges
        if isinstance(values, list) and len(values) == 2:
            if key in time_keys or key in time_keys_unix:
                # convert date strings to unix
                min_time = convert_time_unix(values[0]) if key in time_keys else values[0]
                max_time = convert_time_unix(values[1]) if key in time_keys else values[1]
                filtered_df = filtered_df[
                    (filtered_df[f"{key}_unix"] >= min_time) & (filtered_df[f"{key}_unix"] <= max_time)
                ]
            elif key in time_keys_dt:
                filtered_df = filtered_df[
                    (filtered_df[key] >= values[0]) & (filtered_df[key] <= values[1])
                ]
            elif key in time_keys_date:
                filtered_df = filtered_df[
                    (filtered_df[key] >= values[0]) & (filtered_df[key] <= values[1])
                ]
            else:
                filtered_df = filtered_df[filtered_df[key].isin(values)]

        # Handle single date match on _date
        elif key in time_keys + time_keys_unix + time_keys_dt + time_keys_date:
            try:
                target_date = datetime.strptime(values, "%Y-%m-%d").date()
                for tk in time_keys:
                    if key.startswith(tk):
                        filtered_df = filtered_df[filtered_df[f"{tk}_date"] == target_date]
            except Exception:
                pass

        # Handle single exact value
        else:
            filtered_df = filtered_df[filtered_df[key] == values]

    filtered_uuid = list(filtered_df['uuid'])

    if displayIt:
        print(f'Found {len(filtered_df)} jobs')
        if len(filtered_uuid) > 0:
            print('-- uuid --')
            display(filtered_uuid)
        print('-- Job Metadata --')
        display(filtered_df)

    return filtered_uuid, filtered_df