
Testing your model

Once you have migrated your model to the format required by Bitfount, you should test it to ensure it is working as expected before uploading it to the Bitfount Hub. This is particularly important for training or fine-tuning models as they are more complex than models that are only used for inference or evaluation.

The best way to validate your model is to run it against a local datasource. This bypasses the need to create a task in Bitfount, which can be cumbersome for rapid iteration.

tip

Validate your model before sharing it with collaborators or publishing it to a project. Lightweight checks catch most issues early and save time when running tasks on remote datasets.

Local validation

The first step to validating your model is to create a local datasource. Recall that a datasource is the object that represents the type of data being connected to a Pod and encapsulates the specific logic required for loading and processing that kind of data. A datasource connected to a Pod is called a dataset, but in this case we don't want to connect to a Pod just yet - we only want to use the datasource locally for testing purposes.

Datasource

Start by choosing the appropriate datasource for your data. Bitfount supports a variety of datasources, each with its own features and capabilities, as documented in the Connecting datasets guide. See the API reference for your chosen datasource for a list of its required and optional arguments. Most datasources take a path argument pointing to the location of the data.

from bitfount.data import CSVSource

datasource = CSVSource(path="path/to/your/data.csv")

tip

All datasources can be imported from the bitfount.data namespace rather than having to import from the specific datasource module.

Datasources don't apply any transformations or pre-processing to the data. They are simply an iterable wrapper around the data, yielding it in the form of a pandas DataFrame. Regardless of the type of data, the datasource's internal representation is always a pandas DataFrame. Datasources have two main methods:

  • yield_data(): Returns an iterator that yields batches of data as specified by the partition_size argument.
  • get_data(): Returns a single batch of data as specified by the data_keys argument.
info

One of the core principles of how data is handled in Bitfount is that data is never loaded into memory unless it is absolutely necessary. This is why datasources are designed to be iterable rather than loaded into memory all at once.

Under the hood in Bitfount, yield_data is the method that is typically used to feed data to an algorithm. get_data is only used in certain cases with a small selection of data_keys. It is not advised to use get_data to return the entire dataset as it will be very memory intensive and may well crash the system if the dataset is too large.

for batch in datasource.yield_data(partition_size=32):
    print(batch)
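
If you do need to pull a specific handful of records, get_data can be used as sketched below. This is a minimal sketch: the key values are purely illustrative and the exact form of data_keys depends on the datasource, so check its API reference.

# Hypothetical sketch: fetch a small selection of records by key.
# The exact key format depends on the datasource (see its API reference).
small_batch = datasource.get_data(data_keys=["example-key-1", "example-key-2"])
print(small_batch)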

Image Datasources

Many users work with imaging datasets (medical or otherwise), so it is important to understand how images are handled within Bitfount datasources. When connecting a directory of image files, each file corresponds to a single row in the internal pandas DataFrame. The DataFrame will have a column for the raw image data as a numpy array, which is always called Pixel Data. It will also have a number of columns for the metadata associated with the image. For medical images, these columns can number in the hundreds and correspond to DICOM tags (or the equivalent for other imaging formats). For files that contain multiple images, for instance slices of a volumetric image, the DataFrame will have a column for each slice. In this case, the columns are numbered sequentially starting from 0, e.g. Pixel Data 0, Pixel Data 1, etc.

Where possible, images are not loaded into memory unless absolutely necessary. To aid this, the datasource caches the underlying dataframe on the file system, with the exception of the Pixel Data columns, which are replaced by placeholders. When calling yield_data or get_data, the datasource automatically loads and returns the cached dataframe, which does not contain the raw image data. If you need to access the raw image data, pass use_cache=False to the yield_data or get_data methods.

from bitfount.data import DICOMSource

datasource = DICOMSource(path="path/to/your/data")
for batch in datasource.yield_data(partition_size=32, use_cache=False):
    print(batch["Pixel Data"].shape)

Schema

A schema is a serialisable representation of the data in a datasource. Schemas are automatically generated for each dataset when it is connected to a Pod and are displayed on the Hub. They contain information about the columns in the dataframe, their data types, their semantic types, and optional descriptions. Models require a schema to be provided when they are instantiated, and it must match the schema of the dataset that will be fed to the model for training, evaluation or inference.

A partial or full schema can be generated for a datasource using the BitfountSchema class. When a datasource is connected to a Pod, a partial schema is generated by default based on the first batch of data, and a full schema generation process is then triggered in the background. Full schema generation can take some time to complete depending on the size of the dataset. If your dataset is quite homogeneous, the full schema generation may not be necessary.

from bitfount.data import BitfountSchema

schema = BitfountSchema(name="your-dataset-name")
# Generate a partial schema
schema.generate_partial_schema(datasource=datasource)
# Or generate a full schema
schema.generate_full_schema(datasource=datasource)

The schema can then be serialised and visualised using methods such as dumps and to_json.

print(schema.dumps())

The data types of the columns in the schema are inferred from the data in the datasource and cannot be changed. The semantic types of the columns are also inferred from the data types, but these often require knowledge of the data or domain to be accurate. They can therefore be overridden by passing the force_stypes argument to the generate_full_schema method. The available semantic types are:

  • categorical: For columns where the values (strings or integers) are categorical in nature. For instance, if a column contains different integer values, Bitfount will interpret it as a continuous column by default, so it must be overridden to categorical if the integers represent different categories.
  • continuous: This is the default semantic type for all numerical columns unless overridden.
  • image: For image data columns, such as Pixel Data. For CSV datasources where a column contains the path to an image file, the semantic type must be overridden to image if the images are to be treated as such.
  • text: For text data. By default, all string columns are treated as text columns unless overridden to a different semantic type such as categorical or image.
  • image_prefix: A utility semantic type for when there are multiple image columns with a common prefix, to avoid having to specify each column name individually.
tip

When connecting a dataset using the App or Hub, you can override the semantic types of the columns by editing the schema in the UI after the dataset has been connected.

Certain columns can also be excluded from the schema generation process by passing the ignore_cols argument to the generate_full_schema method.

schema.generate_full_schema(
    datasource=datasource,
    force_stypes={"image": ["Pixel Data"], "categorical": ["Patient's Sex"]},
    ignore_cols=["Patient ID", "Study ID", "Series ID"],
)
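
Alternatively, for datasets with many numbered image columns (e.g. Pixel Data 0, Pixel Data 1, etc.), the image_prefix semantic type can save you from listing every column individually. A minimal sketch, assuming image_prefix is passed through force_stypes in the same way as the other semantic types:

schema.generate_full_schema(
    datasource=datasource,
    # Treat every column whose name starts with "Pixel Data" as an image column
    force_stypes={"image_prefix": ["Pixel Data"]},
)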

DataStructure

We were introduced to the DataStructure in YAML format as a core task component in the Writing tasks section. It defines the structure of the data that will be fed to the model: where the schema reflects the structure of the data as it exists in the datasource, the DataStructure defines the modifications needed to feed that data to the model in the way the model expects.

Typically, the most important parts of the data structure are specifying which columns to include or exclude, which columns are images, and which column(s) to map to the target variable if you are training or fine-tuning. If your model is only used for inference, you don't need to specify a target column. A typical data structure might look like this:

from bitfount.data import DataStructure

data_structure = DataStructure(
    selected_cols=["Pixel Data", "Target", "Patient's Sex", "Age"],
    image_cols=["Pixel Data"],
    target=["Target"],
)

For a full list of the available arguments, see the API reference.

If the data contains image columns, some basic batch transformations are applied to them by default when the data is fed to the model. These transformations are:

  • Resize: Resize the image to 224x224 pixels
  • Normalize: Normalize the image using ImageNet statistics
  • ToTensorV2: Convert the image to a PyTorch tensor

Albumentations is the library of choice for applying these transformations. Learn more about how to use Albumentations to customise the transformations in the Transformations section.
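
For reference, the default pipeline described above corresponds roughly to the following Albumentations composition. This is a sketch for illustration only; the exact defaults applied by Bitfount may differ slightly.

import albumentations as A
from albumentations.pytorch import ToTensorV2

# Approximate equivalent of the default image batch transformations
default_transforms = A.Compose([
    A.Resize(224, 224),  # resize to 224x224 pixels
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),  # ImageNet statistics
    ToTensorV2(),  # convert to a PyTorch tensor
])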

Feeding the data to the model

Once you have created the datasource, schema and data structure, you can instantiate and initialise your model and feed it the data. The model needs to be instantiated with the DataStructure and schema objects that were created earlier. After this, the model must be initialised by calling the initialise_model method. This method creates the model under the hood by calling the create_model method and saving the result to the self._model attribute. It also creates the PyTorch data loaders from the datasource which will be used to feed the data to the model. You can learn more about the dataloaders in the DataLoaders section.

To feed the data to the model, you then call one of the fit, predict or evaluate methods. The fit method is used for training and fine-tuning, the predict method is used for inference, and the evaluate method is used for evaluating the model on a dataset.

For training, this might look like:

model = MyModel(datastructure=data_structure, schema=schema, epochs=10, batch_size=32)
model.initialise_model(datasource)
results = model.fit(datasource)

Whereas for inference, it might look like:

model = MyModel(datastructure=data_structure, schema=schema)
model.initialise_model(datasource)
predictions = model.predict(datasource)

Calling .fit() or .predict() on a model will automatically feed the data to the model and return the results.
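
Evaluation follows the same pattern. A minimal sketch, assuming the DataStructure specifies a target column so that predictions can be compared against ground truth:

model = MyModel(datastructure=data_structure, schema=schema)
model.initialise_model(datasource)
metrics = model.evaluate(datasource)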

tip

You can find an end-to-end example of how to validate your model locally in the Tutorials section.