# Data loaders
Data loaders bridge datasources and your model code. They handle batching, shuffling, device placement, and any collation logic required to convert raw rows into tensors or arrays.
## Responsibilities
- Fetch samples from the datasource and apply the correct preprocessing pipeline.
- Batch inputs, pad variable-length fields if needed, and move tensors to the right device (GPU/CPU) — see the sketch after this list.
- Provide deterministic iteration when you set seeds, and configurable shuffling for training.
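To make these responsibilities concrete, here is a minimal, generic PyTorch sketch — not Bitfount code — that pads variable-length inputs, shuffles deterministically via a seeded generator, and leaves a hook for device placement:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy dataset: variable-length feature sequences with scalar targets.
data = [(torch.randn(n, 4), torch.tensor(0.0)) for n in (3, 5, 2, 7)]

def collate(batch):
    # Pad variable-length inputs to the longest sequence in the batch,
    # then stack the fixed-size targets into a single tensor.
    xs, ys = zip(*batch)
    return pad_sequence(xs, batch_first=True), torch.stack(ys)

loader = DataLoader(
    data,
    batch_size=2,
    shuffle=True,
    collate_fn=collate,
    generator=torch.Generator().manual_seed(42),  # seeded => deterministic order
)

for x, y in loader:
    x, y = x.to("cpu"), y.to("cpu")  # device placement would happen here
```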
## Using Bitfount DataLoaders
Bitfount provides wrappers around the standard PyTorch `DataLoader` class to make it compatible with Bitfount. These are used by default when creating a model and can be returned by calling the `train_dataloader()`, `val_dataloader()` and `test_dataloader()` methods on your model. These dataloaders are used by the `fit()`, `evaluate()` and `predict()` methods on your model, respectively.
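As a hedged sketch of how this fits together (the model class and its constructor arguments below are illustrative placeholders, not a specific Bitfount API):

```python
# `MyModel`, `datastructure` and `schema` are illustrative placeholders.
model = MyModel(datastructure=datastructure, schema=schema)

model.fit(data=datasource)  # iterates train_dataloader() internally
model.evaluate()            # iterates val_dataloader() internally
model.predict()             # iterates test_dataloader() internally

train_dl = model.train_dataloader()  # or grab a wrapped dataloader directly
```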
## Output format
When implementing your model for training, you will be implementing the `training_step()`, `validation_step()` and `test_step()` methods. These methods receive a batch of data as input, so you don't need to worry about iterating over the dataloader yourself. However, if you are using the model for inference, you will need to iterate over the dataloader to get the batches of data; other than that, the format of the batch is exactly the same. For example:
```python
def predict(self, data: Optional[BaseSource] = None, **_: Any) -> PredictReturnType:
    preds = []
    for batch in self.test_dataloader():
        x, y = batch[:2]
        # ... run the forward pass and append the outputs to `preds`
    return PredictReturnType(preds=preds)
```
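By contrast, the step methods receive each batch directly, so no manual iteration is needed; a minimal sketch (the forward pass and loss function here are illustrative):

```python
def training_step(self, batch: Any, batch_idx: int) -> Any:
    # `batch` is already one batch from the train dataloader.
    x, y = batch[:2]
    y_hat = self(x)                  # illustrative forward pass
    return self.loss_func(y_hat, y)  # illustrative loss
```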
Due to the various ways in which data can be structured, the format of the batch depends on the data structure and schema that were used to create the model. At the top level, a batch is a 2- or 3-element tuple of the form `(x, y)` or `(x, y, data_key)`, where `x` is the input tensors, `y` is the target tensors and, for file-based datasources, `data_key` is the list of paths to the files that populated the batch, in case we need to link back to them. For non-file-based datasources, the tuple has only 2 elements. If there are no target tensors, such as in the case of inference, `y` will still exist but will be an empty tensor. In most cases, we can ignore `data_key` and focus on `x` and `y` as follows if we are doing training or validation:
```python
x, y = batch[:2]
```

Or just `x` if we are doing inference:

```python
x = batch[0]
```
The shape of `x` and `y` will depend on the data and batch size:

- `y` is a tensor of shape `(batch_size, num_targets)`, where `num_targets` is the number of target columns, in the case of tabular data. In the case of image data for segmentation tasks, `y` will be a 4D tensor of shape `(batch_size, channels, height, width)` (BCHW).
- `x` itself is again a tuple of tensors of the form `(tabular, image, support)`, where the image tensor, if there is a single image column, is a 4D tensor in BCHW format, and the tabular and support tensors are 2D tensors of shape `(batch_size, num_features)`. If there are multiple image columns, `image` will instead be a list of BCHW tensors. At least one of `tabular` or `image` will always be present, so the shape of `x` could also be written as `(tabular, support)`, `(image, support)` or `(tabular, image, support)`. The support columns are deprecated and will be removed in a future version; for now, you only need to know that their presence is dictated by the `ignore_support_cols` argument to the `BitfountDataBunch` class (the class that creates the dataloaders within `initialise_model()`), but you can safely ignore them regardless. For instance, if you know that there are no tabular columns, an example unpacking of `x` could be:

```python
images, _sup = x
```
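Putting this together, a hedged sketch of unpacking `x` for the different cases (whether tabular and image columns exist is known from your datastructure; the flag below is illustrative):

```python
# x is (tabular, support), (image, support) or (tabular, image, support).
if len(x) == 3:       # both tabular and image columns present
    tabular, images, _sup = x
elif has_image_cols:  # illustrative flag derived from your datastructure
    images, _sup = x
else:                 # tabular columns only
    tabular, _sup = x
```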
In the case of text data, this is not converted to tensors but rather included in the tabular data as-is. You will need to tokenize the text data as part of your model's `training_step()`, `validation_step()` and `test_step()` methods.
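For instance, a hedged sketch of tokenization inside a step method, assuming a Hugging Face tokenizer (the tokenizer choice and the way the text column is extracted are illustrative, not part of the Bitfount API):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative choice

def training_step(self, batch, batch_idx):
    x, y = batch[:2]
    tabular, *_ = x                    # the text arrives as-is here
    texts = [str(t) for t in tabular]  # illustrative extraction of raw strings
    tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # ... feed `tokens` to the model
```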
## Using your own dataloaders
Proceed at your own risk. If you use your own dataloaders, in addition to ensuring that their output format matches the expected input format of the model, you will also need to ensure that they respect the following:
- the data structure and schema that were used to create the model, i.e. the `selected_cols`, `image_cols` and `target` columns, and any transformations that were applied to the data
- the specified data splits between training, validation and test sets, including shuffling if specified. For the Bitfount dataloaders, we have implemented a reservoir sampling algorithm to ensure that the data is shuffled in a deterministic manner even when we can't load the entire dataset into memory (see the sketch after this list)
- the protocol-level batching logic (i.e. `batched_execution`). This batching is at a higher level than the model-level batching logic (i.e. `batch_size`). When running in batched execution mode, the protocol-level batching logic overrides the available files in the datasource by using the `selected_file_names_override` attribute of the datasource. To access only the files that are available that iteration, you will need to iterate over `selected_file_names_iter()` as opposed to `yield_data()` on the datasource
- whether they are returning batches according to `steps` or `epochs`, stopping iteration at the correct time
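On the shuffling point, the sketch below shows the classic reservoir sampling idea (Algorithm R) for drawing a deterministic sample from a stream too large to hold in memory. It illustrates the technique, not Bitfount's actual implementation:

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")

def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Deterministically sample k items from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed => reproducible selection
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive upper bound
            if j < k:
                reservoir[j] = item      # replace with decreasing probability
    return reservoir
```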
There is no requirement to use Bitfount DataLoaders. If you want to use your own dataloaders, create a custom DataLoader class nested inside your model class and override the `train_dataloader()`, `val_dataloader()` and `test_dataloader()` methods to return it, as in the skeleton below.
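As a hedged skeleton (`selected_file_names_iter()` is the datasource method mentioned above; the class and its helper are illustrative placeholders):

```python
from typing import Any, Iterator

class MyDataLoader:
    """Minimal custom dataloader skeleton honouring protocol-level batching."""

    def __init__(self, datasource: Any, batch_size: int) -> None:
        self.datasource = datasource
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[tuple]:
        # In batched execution mode, only the files selected for this
        # protocol-level batch are visible via selected_file_names_iter().
        for file_name in self.datasource.selected_file_names_iter():
            yield self._load_batch(file_name)

    def _load_batch(self, file_name: str) -> tuple:
        # Placeholder: load and preprocess the file per the schema, returning
        # a batch in the (x, y, data_key) format described above.
        raise NotImplementedError
```

Nest a class like this inside your model and return an instance of it from the overridden `train_dataloader()`, `val_dataloader()` and `test_dataloader()` methods.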