# Data loaders
Data loaders bridge datasources and your model code. They handle batching, shuffling, device placement, and any collation logic required to convert raw rows into tensors or arrays.
## Responsibilities
- Fetch samples from the datasource and apply the correct preprocessing pipeline.
- Batch inputs, pad variable-length fields if needed, and move tensors to the right device (GPU/CPU) — see the sketch after this list.
- Provide deterministic iteration when you set seeds, and configurable shuffling for training.
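To make these responsibilities concrete, here is a minimal, generic PyTorch sketch — not Bitfount code — that pads variable-length inputs, shuffles deterministically via a seeded generator, and leaves a hook for device placement:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy dataset: variable-length feature sequences with scalar targets.
data = [(torch.randn(n, 4), torch.tensor(0.0)) for n in (3, 5, 2, 7)]

def collate(batch):
    # Pad variable-length inputs to the longest sequence in the batch,
    # then stack the fixed-size targets into a single tensor.
    xs, ys = zip(*batch)
    return pad_sequence(xs, batch_first=True), torch.stack(ys)

loader = DataLoader(
    data,
    batch_size=2,
    shuffle=True,
    collate_fn=collate,
    generator=torch.Generator().manual_seed(42),  # seeded => deterministic order
)

for x, y in loader:
    x, y = x.to("cpu"), y.to("cpu")  # device placement would happen here
```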
## Using Bitfount DataLoaders
Bitfount provides wrappers around the standard PyTorch `DataLoader` class to make it compatible with Bitfount. These are used by default when creating a model and can be returned by calling the `train_dataloader()`, `val_dataloader()` and `test_dataloader()` methods on your model. These dataloaders are used by the `fit()`, `evaluate()` and `predict()` methods on your model, respectively.
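As a hedged sketch of how this fits together (the model class and its constructor arguments below are illustrative placeholders, not a specific Bitfount API):

```python
# `MyModel`, `datastructure` and `schema` are illustrative placeholders.
model = MyModel(datastructure=datastructure, schema=schema)

model.fit(data=datasource)  # iterates train_dataloader() internally
model.evaluate()            # iterates val_dataloader() internally
model.predict()             # iterates test_dataloader() internally

train_dl = model.train_dataloader()  # or grab a wrapped dataloader directly
```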
## Output format
When implementing your model for training, you will be implementing the `training_step()`, `validation_step()` and `test_step()` methods. These methods receive a batch of data as input, so you don't need to worry about iterating over the dataloader yourself. However, if you are using the model for inference, you will need to iterate over the dataloader to get the batches of data; other than that, the format of the batch is exactly the same. For example:
```python
def predict(self, data: Optional[BaseSource] = None, **_: Any) -> PredictReturnType:
    preds = []
    for batch in self.test_dataloader():
        x, y = batch[:2]
        # ... run the forward pass and append the outputs to `preds`
    return PredictReturnType(preds=preds)
```
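By contrast, the step methods receive each batch directly, so no manual iteration is needed; a minimal sketch (the forward pass and loss function here are illustrative):

```python
def training_step(self, batch: Any, batch_idx: int) -> Any:
    # `batch` is already one batch from the train dataloader.
    x, y = batch[:2]
    y_hat = self(x)                  # illustrative forward pass
    return self.loss_func(y_hat, y)  # illustrative loss
```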
Due to the various ways in which data can be structured, the format of the batch depends on the data structure and schema that were used to create the model. At the top level, a batch is a 2- or 3-element tuple of the form `(x, y)` or `(x, y, data_key)`, where `x` is the input tensors, `y` is the target tensors and, for file-based datasources, `data_key` is the list of paths to the files that populated the batch, in case we need to link back to them. For non-file-based datasources, the tuple has only 2 elements. If there are no target tensors, such as in the case of inference, `y` will still exist but will be an empty tensor. In most cases, we can ignore `data_key` and focus on `x` and `y` as follows if we are doing training or validation:
```python
x, y = batch[:2]
```

Or just `x` if we are doing inference:

```python
x = batch[0]
```
The shape of `x` and `y` will depend on the data and batch size:

- `y` is a tensor of shape `(batch_size, num_targets)`, where `num_targets` is the number of target columns, in the case of tabular data. In the case of image data for segmentation tasks, `y` will be a 4D tensor of shape `(batch_size, channels, height, width)` (BCHW).
- `x` itself is again a tuple of tensors of the form `(tabular, image, support)`, where the image tensor, if there is a single image column, is a 4D tensor in BCHW format, and the tabular and support tensors are 2D tensors of shape `(batch_size, num_features)`. If there are multiple image columns, `image` will instead be a list of BCHW tensors. At least one of `tabular` or `image` will always be present, so the shape of `x` could also be written as `(tabular, support)`, `(image, support)` or `(tabular, image, support)`. The support columns are deprecated and will be removed in a future version; for now, you only need to know that their presence is dictated by the `ignore_support_cols` argument to the `BitfountDataBunch` class (the class that creates the dataloaders within `initialise_model()`), but you can safely ignore them regardless. For instance, if you know that there are no tabular columns, an example unpacking of `x` could be:

```python
images, _sup = x
```
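Putting this together, a hedged sketch of unpacking `x` for the different cases (whether tabular and image columns exist is known from your datastructure; the flag below is illustrative):

```python
# x is (tabular, support), (image, support) or (tabular, image, support).
if len(x) == 3:       # both tabular and image columns present
    tabular, images, _sup = x
elif has_image_cols:  # illustrative flag derived from your datastructure
    images, _sup = x
else:                 # tabular columns only
    tabular, _sup = x
```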
In the case of text data, this is not converted to tensors but rather included in the tabular data as-is. You will need to tokenize the text data as part of your model's `training_step()`, `validation_step()` and `test_step()` methods.
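For instance, a hedged sketch of tokenization inside a step method, assuming a Hugging Face tokenizer (the tokenizer choice and the way the text column is extracted are illustrative, not part of the Bitfount API):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative choice

def training_step(self, batch, batch_idx):
    x, y = batch[:2]
    tabular, *_ = x                    # the text arrives as-is here
    texts = [str(t) for t in tabular]  # illustrative extraction of raw strings
    tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # ... feed `tokens` to the model
```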
## Using your own dataloaders
Proceed at your own risk. If you use your own dataloaders, in addition to ensuring that their output format matches the expected input format of the model, you will also need to ensure that they respect the following:
- the data structure and schema that were used to create the model, i.e. the `selected_cols`, `image_cols` and `target` columns, and any transformations that were applied to the data
- the specified data splits between training, validation and test sets, including shuffling if specified. For the Bitfount dataloaders, we have implemented a reservoir sampling algorithm to ensure that the data is shuffled in a deterministic manner even when we can't load the entire dataset into memory (see the sketch after this list)
- the protocol-level batching logic (i.e. `batched_execution`). This batching is at a higher level than the model-level batching logic (i.e. `batch_size`). When running in batched execution mode, the protocol-level batching logic overrides the available files in the datasource by using the `selected_file_names_override` attribute of the datasource. To access only the files that are available that iteration, you will need to iterate over `selected_file_names_iter()` as opposed to `yield_data()` on the datasource
- whether they are returning batches according to `steps` or `epochs`, stopping iteration at the correct time
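On the shuffling point, the sketch below shows the classic reservoir sampling idea (Algorithm R) for drawing a deterministic sample from a stream too large to hold in memory. It illustrates the technique, not Bitfount's actual implementation:

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")

def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Deterministically sample k items from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed => reproducible selection
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive upper bound
            if j < k:
                reservoir[j] = item      # replace with decreasing probability
    return reservoir
```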
There is no requirement to use Bitfount DataLoaders. If you want to use your own dataloaders, create a custom DataLoader class nested inside your model class and override the `train_dataloader()`, `val_dataloader()` and `test_dataloader()` methods to return it, as in the skeleton below.
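As a hedged skeleton (`selected_file_names_iter()` is the datasource method mentioned above; the class and its helper are illustrative placeholders):

```python
from typing import Any, Iterator

class MyDataLoader:
    """Minimal custom dataloader skeleton honouring protocol-level batching."""

    def __init__(self, datasource: Any, batch_size: int) -> None:
        self.datasource = datasource
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[tuple]:
        # In batched execution mode, only the files selected for this
        # protocol-level batch are visible via selected_file_names_iter().
        for file_name in self.datasource.selected_file_names_iter():
            yield self._load_batch(file_name)

    def _load_batch(self, file_name: str) -> tuple:
        # Placeholder: load and preprocess the file per the schema, returning
        # a batch in the (x, y, data_key) format described above.
        raise NotImplementedError
```

Nest a class like this inside your model and return an instance of it from the overridden `train_dataloader()`, `val_dataloader()` and `test_dataloader()` methods.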