
Connecting datasets

This page covers how to connect datasets to a Pod using the Bitfount SDK. Datasets connected through the SDK are also visible in the Bitfount Desktop application and the Hub, but they cannot be configured there. Under the hood, a dataset is powered by a datasource, which is the object that represents the type of data being connected to a Pod and encapsulates the specific logic required for loading and processing that kind of data.

Reminder

Recall that datasets are part of a Pod, which is the entity that contains the datasets and enables them to be used in tasks.

Available Datasources

Bitfount supports connecting various types of datasets to a Pod, organised by domain. For detailed API documentation on all datasource classes, see the Datasources API reference.

info

Datasources are the objects that represent the type of data being connected to a Pod and encapsulate the specific logic required for loading and processing that kind of data. Learn more about how they work here.

General Datasets

  • CSV files (CSVSource) - Structured tabular data from CSV (Comma-Separated Values) files. Supports local file paths, URLs, and custom read_csv options for flexible data loading.
  • Image folders (ImageSource) - Collections of image files in common formats such as JPG and PNG. Images are loaded from a directory and can optionally infer class labels from the folder structure.
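To make the idea of inferring class labels from the folder structure concrete, here is a minimal standard-library sketch. It is not ImageSource's implementation: the function name infer_labels_from_folders, the set of accepted extensions, and the rule that an image's immediate parent folder is its label are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict

# Assumption: only these extensions count as images in this sketch.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def infer_labels_from_folders(root: Path) -> Dict[str, str]:
    """Map each image (path relative to root) to its parent folder name.

    Hypothetical helper: images placed directly in `root` have no
    enclosing class folder, so they are skipped rather than labelled.
    """
    root = Path(root)
    labels: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        is_image = path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
        if is_image and path.parent != root:
            labels[path.relative_to(root).as_posix()] = path.parent.name
    return labels
```

With a layout such as images/cat/a.jpg and images/dog/b.png, calling infer_labels_from_folders(Path("images")) would label the first file "cat" and the second "dog".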

Healthcare Datasets

Ophthalmic Datasets

  • Heidelberg Eye Explorer data (HeidelbergSource) - Retinal imaging data from Heidelberg Engineering devices, loaded from .sdb (Spectralis Database) files.
  • Topcon data (TopconSource) - Ophthalmic imaging from Topcon equipment, supporting various OCT and fundus imaging formats.
  • DICOM Ophthalmology data (DICOMOphthalmologySource) - Ophthalmic datasets in DICOM format, including data from Zeiss and other manufacturers, with support for OCT and SLO image extraction.

For specific API documentation on ophthalmic datasources, see the Ophthalmology Datasources API reference.

Connecting a dataset using the SDK

See the tutorials on Running a Pod for examples of how to connect CSV and Image folder datasets using the SDK. A DICOM dataset is connected to a Pod in much the same way, using the DICOMSource class instead.

tip

Multiple datasets can be connected to a single Pod using the SDK by passing a list of DatasourceContainerConfig objects to the datasources argument of the Pod class.

Pod configuration objects

  • PodDetailsConfig provides human-readable metadata for a dataset (for example display_name and description), which is shown in the Bitfount Desktop application and Hub.
  • PodDataConfig carries the operational options required to load data, such as datasource_args (for example path, connection strings, or ophthalmology flags), optional force_stypes to give control over column semantic types, and file_system_filters to filter files based on various criteria.

Example: Connecting a DICOM dataset using the SDK

This example shows how to connect a DICOM dataset to a Pod using the SDK. It also demonstrates how to filter files based on various criteria, such as file extension, file creation date, and file size.

run_dicom_pod.py
import logging

from bitfount import (
    DICOMSource,
    Pod,
    setup_loggers,
)
from bitfount.data.datasources.types import Date
from bitfount.runners.config_schemas import (
    DatasourceContainerConfig,
    FileSystemFilterConfig,
    PodDataConfig,
    PodDetailsConfig,
)

loggers = setup_loggers([logging.getLogger("bitfount")])

if __name__ == "__main__":
    datasource_details = PodDetailsConfig(
        display_name="My DICOM Dataset",
        description="This Pod contains data from my DICOM dataset",
    )
    datasource_args = {"path": "/path/to/dicom/dataset"}
    datasource = DICOMSource(**datasource_args)
    data_config = PodDataConfig(
        datasource_args=datasource_args,
        # DICOM frames are identified by the prefix "Pixel Data"
        force_stypes={"image_prefix": ["Pixel Data"]},
        file_system_filters=FileSystemFilterConfig(
            file_extension="dcm",
            file_creation_min_date=Date(2025, 1, 1),
            min_file_size=1.0,  # 1MB
        ),
    )
    pod = Pod(
        name="my-pod",
        datasources=[
            DatasourceContainerConfig(
                name="my-dicom-dataset",
                datasource=datasource,
                datasource_details=datasource_details,
                data_config=data_config,
            )
        ],
    )
    pod.start()
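The three filters used in the example above (file extension, minimum creation date, minimum file size) can be pictured with the standard-library sketch below. It only illustrates the kind of check such filters describe and is not how Bitfount implements FileSystemFilterConfig. The function filter_files, its parameter names, the 1 MB = 1024 * 1024 bytes conversion, and the fallback to modification time on platforms without a creation timestamp are all assumptions.

```python
import datetime as dt
from pathlib import Path
from typing import List, Optional

BYTES_PER_MB = 1024 * 1024  # assumption: sizes are expressed in binary megabytes


def _creation_date(path: Path) -> dt.date:
    """Best-effort creation date; falls back to modification time."""
    stat = path.stat()
    # st_birthtime is not available on every platform (assumed fallback: st_mtime).
    timestamp = getattr(stat, "st_birthtime", stat.st_mtime)
    return dt.date.fromtimestamp(timestamp)


def filter_files(
    root: Path,
    file_extension: Optional[str] = None,
    min_creation_date: Optional[dt.date] = None,
    min_file_size_mb: Optional[float] = None,
) -> List[Path]:
    """Return files under `root` that pass every filter that is set."""
    wanted_suffix = None
    if file_extension is not None:
        # Accept both "dcm" and ".dcm", case-insensitively.
        wanted_suffix = "." + file_extension.lstrip(".").lower()

    selected: List[Path] = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if wanted_suffix is not None and path.suffix.lower() != wanted_suffix:
            continue
        if min_creation_date is not None and _creation_date(path) < min_creation_date:
            continue
        if min_file_size_mb is not None:
            if path.stat().st_size < min_file_size_mb * BYTES_PER_MB:
                continue
        selected.append(path)
    return selected
```

Under these assumptions, filter_files(Path("/path/to/dicom/dataset"), "dcm", dt.date(2025, 1, 1), 1.0) would keep only .dcm files created on or after 1 January 2025 that are at least 1 MB in size, mirroring the intent of the configuration in the example.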