Running multi-dataset tasks
Multi-dataset tasks are a feature of the Bitfount SDK. These tasks cannot be initiated from the Bitfount App, but datasets connected in the App can participate in multi-dataset tasks as long as no two of them are connected via the same Pod or App instance. If you are connecting datasets:
- via the App: each dataset involved in the task must be on a different machine
- via the SDK: each dataset involved in the task must be on a different Pod, but they can be on the same machine
Multi-dataset tasks let you send a task to multiple datasets at once. The task runs on each dataset in parallel, with results sent back to the modeller. Depending on the task, these results may be aggregated as part of the task or returned separately for each dataset. There are two main types of multi-dataset tasks:
- Federated learning tasks: these tasks train a model on multiple separate datasets without needing to centralise the data. The model is trained on each dataset in parallel, with the modeller regularly orchestrating the averaging of the updated model parameters. Learn more about federated learning in the original paper by McMahan et al (2016) or our blog post.
- SQL tasks: these tasks run a SQL query against multiple datasets at once. Depending on the task, the results from each dataset may be aggregated into a single result before being returned to the modeller, or returned separately for each dataset.
Federated learning
For federated learning tasks, the protocol used is bitfount.FederatedAveraging and the algorithm used is bitfount.FederatedModelTraining. In the example below, we are training a model on all the data in a tabular dataset with TARGET as the target column.
This is the same kind of task used for fine-tuning. The only difference is that you specify multiple datasets instead of a single one.
pods:
  identifiers:
    - <replace-with-dataset-identifier-1>
    - <replace-with-dataset-identifier-2>
    - <replace-with-dataset-identifier-3>
task:
  protocol:
    name: bitfount.FederatedAveraging
    arguments:
      steps_between_parameter_updates: 10
  algorithm:
    name: bitfount.FederatedModelTraining
    model:
      bitfount_model:
        model_ref: <replace-with-model-identifier>
        model_version: 1
        username: <replace-with-model-owner-username>
      hyperparameters:
        steps: 100
        batch_size: 32
        learning_rate: 0.0001
  aggregator:
    secure: False # Set to True to use secure aggregation
  data_structure:
    schema_requirements: "full"
    assign:
      target: TARGET
If dataset sizes are significantly different, the task may sit idle on many of the machines while others are still running, so for optimal efficiency it's advised to choose datasets of roughly the same size. If this is not possible, specify the training duration in steps rather than epochs to ensure the same amount of training happens on each dataset. When training is specified in steps, pass only steps_between_parameter_updates to the FederatedAveraging protocol; when it is specified in epochs, pass epochs_between_parameter_updates instead.
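As a sketch of the epoch-based variant, only the keys that change relative to the steps-based example above are shown below; the epoch counts are illustrative, not recommended values.

```yaml
task:
  protocol:
    name: bitfount.FederatedAveraging
    arguments:
      epochs_between_parameter_updates: 1 # replaces steps_between_parameter_updates
  algorithm:
    model:
      hyperparameters:
        epochs: 5 # replaces steps
```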
SQL tasks
For SQL tasks, the protocol used is bitfount.ResultsOnly and the algorithm used is bitfount.SqlQuery. In the example below, we run the same SQL query against three datasets, each named ehr-records-2025. Recall from the SQL task documentation that when running a SQL task against a non-SQL-based dataset (e.g. a CSVSource dataset or otherwise), the table name is the dataset identifier without the username, wrapped in backticks (``). Since the same query is run on each dataset, the dataset name must be the same across the three different users.
pods:
  identifiers:
    - alice/ehr-records-2025
    - bob/ehr-records-2025
    - charlie/ehr-records-2025
task:
  protocol:
    name: bitfount.ResultsOnly
    arguments:
      save_location: ["Modeller"]
  algorithm:
    name: bitfount.SqlQuery
    arguments:
      query: "SELECT * FROM `ehr-records-2025` LIMIT 10"
  data_structure:
    table_config:
      table: ehr-records-2025
In the example above, we are not using an aggregator, so the results from each dataset are returned separately. If you want to aggregate the results into a single result, specify an aggregator in exactly the same way as in the federated learning task. The results are saved as a CSV on the modeller side.
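The separate-versus-aggregated distinction can be sketched locally with SQLite. This is illustrative Python only, not the Bitfount SDK: the table name, column, and row counts are invented, and SQLite stands in for the per-dataset query execution.

```python
import sqlite3

# One query string works against every data owner because each holds
# a table with the same name.
QUERY = "SELECT COUNT(*) AS n FROM ehr_records"

def make_pod(rows):
    """One in-memory database standing in for one dataset/Pod."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ehr_records (patient_id INTEGER)")
    conn.executemany("INSERT INTO ehr_records VALUES (?)", [(r,) for r in rows])
    return conn

pods = [make_pod(range(n)) for n in (10, 20, 30)]

# Without an aggregator: one result per dataset, returned separately
separate = [conn.execute(QUERY).fetchone()[0] for conn in pods]
print(separate)    # [10, 20, 30]

# With an aggregator: the results are combined into a single result
aggregated = sum(separate)
print(aggregated)  # 60
```

The aggregation step here is a simple sum; what "aggregated" means in a real task depends on the query and the aggregator configured.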
If your SQL query runs against a SQL-based dataset (e.g. an OMOPSource dataset), your query can operate on datasets of different names without issue.