Running multi-dataset tasks
Multi-dataset tasks are a feature of the Bitfount SDK. These tasks cannot be initiated from the Bitfount App, but datasets connected in the App can participate in multi-dataset tasks as long as no two of them are connected via the same Pod or App instance. If you are connecting datasets:
- via the App: each dataset involved in the task must be on a different machine
- via the SDK: each dataset involved in the task must be on a different Pod, but they can be on the same machine
Multi-dataset tasks let you send a task to multiple datasets at once. The task runs on each dataset in parallel, with results sent back to the modeller. Depending on the task, these results may be aggregated as part of the task or returned separately for each dataset. There are two main types of multi-dataset tasks:
- Federated learning tasks: these tasks train a model on multiple separate datasets without needing to centralise the data. The model is trained on each dataset in parallel, with the modeller regularly orchestrating the averaging of the updated model parameters. Learn more about federated learning in the original paper by McMahan et al (2016) or our blog post.
- SQL tasks: these tasks run a SQL query against multiple datasets at once. Depending on the task, the results from each dataset may be aggregated into a single result before being returned to the modeller, or returned separately for each dataset.
Federated learning
For federated learning tasks, the protocol used is bitfount.FederatedAveraging and the algorithm used is bitfount.FederatedModelTraining. In the example below, we are training a model on all the data in a tabular dataset with TARGET as the target column.
This is the same kind of task used for fine-tuning. The only difference is that you specify multiple datasets instead of a single one.
pods:
  identifiers:
    - <replace-with-dataset-identifier-1>
    - <replace-with-dataset-identifier-2>
    - <replace-with-dataset-identifier-3>
task:
  protocol:
    name: bitfount.FederatedAveraging
    arguments:
      steps_between_parameter_updates: 10
  algorithm:
    name: bitfount.FederatedModelTraining
    model:
      bitfount_model:
        model_ref: <replace-with-model-identifier>
        model_version: 1
        username: <replace-with-model-owner-username>
      hyperparameters:
        steps: 100
        batch_size: 32
        learning_rate: 0.0001
  aggregator:
    secure: False # Set to True to use secure aggregation
  data_structure:
    schema_requirements: "full"
    assign:
      target: TARGET
If dataset sizes are significantly different, the task may sit idle on many of the machines while others are still running, so for optimal efficiency it's advised to choose datasets of roughly the same size. If this is not possible, specify the training duration in steps rather than epochs to ensure the same amount of training happens on each dataset. When training is specified in steps, pass only steps_between_parameter_updates to the FederatedAveraging protocol; when it is specified in epochs, pass epochs_between_parameter_updates instead.
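As a sketch of the epoch-based variant, only the keys that change relative to the steps-based example above are shown below; the epoch counts are illustrative, not recommended values.

```yaml
task:
  protocol:
    name: bitfount.FederatedAveraging
    arguments:
      epochs_between_parameter_updates: 1 # replaces steps_between_parameter_updates
  algorithm:
    model:
      hyperparameters:
        epochs: 5 # replaces steps
```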
SQL tasks
For SQL tasks, the protocol used is bitfount.ResultsOnly and the algorithm used is bitfount.SqlQuery. In the example below, we run the same SQL query against three datasets, each named ehr-records-2025. Recall from the SQL task documentation that when running a SQL task against a non-SQL-based dataset (e.g. a CSVSource dataset or otherwise), the table name is the dataset identifier without the username, wrapped in backticks (``). Since the same query is run on each dataset, the dataset name must be the same across the three different users.
pods:
  identifiers:
    - alice/ehr-records-2025
    - bob/ehr-records-2025
    - charlie/ehr-records-2025
task:
  protocol:
    name: bitfount.ResultsOnly
    arguments:
      save_location: ["Modeller"]
  algorithm:
    name: bitfount.SqlQuery
    arguments:
      query: "SELECT * FROM `ehr-records-2025` LIMIT 10"
  data_structure:
    table_config:
      table: ehr-records-2025
In the example above, we are not using an aggregator, so the results from each dataset are returned separately. If you want to aggregate the results into a single result, specify an aggregator in exactly the same way as in the federated learning task. The results are saved as a CSV on the modeller side.
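The separate-versus-aggregated distinction can be sketched locally with SQLite. This is illustrative Python only, not the Bitfount SDK: the table name, column, and row counts are invented, and SQLite stands in for the per-dataset query execution.

```python
import sqlite3

# One query string works against every data owner because each holds
# a table with the same name.
QUERY = "SELECT COUNT(*) AS n FROM ehr_records"

def make_pod(rows):
    """One in-memory database standing in for one dataset/Pod."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ehr_records (patient_id INTEGER)")
    conn.executemany("INSERT INTO ehr_records VALUES (?)", [(r,) for r in rows])
    return conn

pods = [make_pod(range(n)) for n in (10, 20, 30)]

# Without an aggregator: one result per dataset, returned separately
separate = [conn.execute(QUERY).fetchone()[0] for conn in pods]
print(separate)    # [10, 20, 30]

# With an aggregator: the results are combined into a single result
aggregated = sum(separate)
print(aggregated)  # 60
```

The aggregation step here is a simple sum; what "aggregated" means in a real task depends on the query and the aggregator configured.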
If your SQL query runs against a SQL-based dataset (e.g. an OMOPSource dataset), your query can operate on datasets of different names without issue.