Task configuration
The task components are the core of the task. However, the task YAML in its entirety includes everything required to run a task. In addition to the task components, this includes the datasets to run on, task configuration settings, authentication details and other metadata.
A complete task file is a YAML document that is valid against the Bitfount task schema. You can check this in your YAML editor of choice by referencing the Bitfount task schema at the top of the file like so:
# yaml-language-server: $schema=https://docs.bitfount.com/schemas/task-spec.json
Validation is performed automatically when a task is uploaded via the Bitfount App or Hub.
In Bitfount terminology, a task initiator is often referred to as a modeller.
Meanwhile, a task runner is often referred to as a Pod (Processor of Data), which contains one or more datasets to run the task on.
In some scenarios, the Pod and the modeller may be the same entity on the same machine; in other scenarios, they may be different entities on different machines.
Required fields
In addition to the core task component, the only other required field is pods, which lists the dataset identifiers where the task will be sent to run.
- task: the task definition. See Task components for details.
- pods: a dictionary containing a list of dataset identifiers where the task will be sent to run.
When using the Bitfount App, dataset identifiers are automatically overwritten with the identifier of the dataset that the task run is triggered against. A typical task YAML uploaded via the app therefore usually carries a placeholder rather than a real dataset identifier, like so:
pods:
  identifiers:
    - <replace-with-dataset-identifier>
If running a task using the SDK, the dataset identifiers must be specified explicitly:
pods:
  identifiers:
    - alice/sensitive-data
    - bob/sensitive-data
    - charlie/sensitive-data
Optional fields
In addition to the required fields, the following optional fields can be specified:
- modeller: specifies authentication details. The default is OIDC device code authentication, which triggers an interactive prompt requiring the user to validate a code in their browser. The options are key-based, oidc-auth-code and oidc-device-code.
  Tip: For app-based runs, set this to key-based to use RSA keys and avoid interactive prompts.
- run_on_new_data_only: whether to run the task only on new data that has not been seen in previous runs. Defaults to false. This has no effect on the first run of a task on a specific dataset. Subsequent runs process only data that has not been seen in previous runs on that same dataset.
- batched_execution: whether to run the task in batches. Defaults to false. If enabled, the task is split into batches of records and each batch is processed sequentially. This is useful for large datasets that cannot be held in memory in their entirety. The task can only switch this on or off; the number of records in each batch is determined by the environment where the dataset is held. In the app, it can be configured in the app settings; in the SDK, it is set with the BITFOUNT_TASK_BATCH_SIZE environment variable.
- test_run: whether to run on a small subset of the data for quick validation. Defaults to false. This is useful for testing the task configuration and ensuring that the task will run correctly before running on the full dataset. The number of records processed is determined by the environment where the dataset is held. In the app, it can be configured in the app settings; in the SDK, it is set with the BITFOUNT_TEST_RUN_NUMBER_OF_FILES environment variable. Only applies to file-based datasets.
- force_rerun_failed_files: whether to force re-running failed files at the end of the task. Defaults to true. Failed files are files that failed to process during the main body of the task run. Only applies to file-based datasets, and only if all of the following conditions are met:
  - Batched execution is enabled in the task configuration.
  - Batch resilience is enabled in the environment where the dataset is held. Defaults to enabled in the app settings.
  - Individual file retry is enabled in the environment where the dataset is held. Defaults to enabled in the app settings.
- template: a dictionary containing template definitions for the task. See Templated fields for details.
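As a rough sketch of how some of these options might be combined, the fragment below configures a quick validation pass before a full run. It assumes the options sit at the top level of the task YAML, next to pods and task, and that the authentication method is set through identity_verification_method under modeller, as in the minimal complete example. The particular combination of values is illustrative only.

```yaml
# Illustrative fragment: a quick validation run before processing the full dataset.
# These keys sit alongside the required pods and task keys.
modeller:
  identity_verification_method: key-based  # RSA keys, so no interactive browser prompt
test_run: true               # process only a small subset (file-based datasets only)
run_on_new_data_only: false  # the default, written out for clarity
```

Once the test run succeeds, setting test_run back to false (or removing it) would run the task on the full dataset.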
If running a task using the SDK, you may also need to specify the project ID explicitly as a top-level key in order to use your project-specific access to a particular dataset or model that is part of the task.
- project_id: associates the run with a specific project. When using the app, this is omitted as a task may be associated with multiple projects.
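For an SDK run, a fragment along the following lines shows where project_id would sit relative to the dataset identifiers. The value is a placeholder made up for illustration, not a real project ID, and the dataset identifier is reused from the earlier example.

```yaml
# Illustrative SDK fragment: explicit dataset identifiers plus a project association.
project_id: <replace-with-project-id>  # hypothetical placeholder, substitute your own
pods:
  identifiers:
    - alice/sensitive-data
```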
Minimal complete example
# yaml-language-server: $schema=https://docs.bitfount.com/schemas/task-spec.json
modeller:
  identity_verification_method: key-based
pods:
  identifiers:
    - <replace-with-dataset-identifier>
batched_execution: true
test_run: false
force_rerun_failed_files: true
run_on_new_data_only: false
task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.ModelInference
      model:
        bitfount_model:
          model_ref: MyModel
          model_version: 2
          username: my-user
    - name: bitfount.CSVReportAlgorithm
  data_structure:
    select:
      include:
        - image_path