
Connecting datasets

This page covers how to connect datasets to a Pod using the Bitfount SDK. Datasets connected using the SDK are also visible in the Bitfount Desktop application and the Hub, but they cannot be configured there. Under the hood, a dataset is powered by a datasource: the object that represents the type of data being connected to a Pod and encapsulates the logic required to load and process that kind of data.

Reminder

Recall that datasets are part of a Pod, which is the entity that contains the datasets and enables them to be used in tasks.

Available Datasources

Bitfount supports connecting various types of datasets to a Pod, organised by domain. For detailed API documentation on all datasource classes, see the Datasources API reference.

info

Datasources are the objects that represent the type of data being connected to a Pod and encapsulate the specific logic required for loading and processing that kind of data. Learn more about how they work here.

General Datasets

  • CSV files (CSVSource) - Structured tabular data
  • Image folders (ImageSource) - Collections of image files

Healthcare Datasets

  • DICOM files (DICOMSource) - Medical imaging data in DICOM format
  • OMOP databases (OMOPSource) - Observational Medical Outcomes Partnership common data model
  • InterMine databases (InterMineSource) - Biological data warehouses

Ophthalmic Datasets

  • Heidelberg Eye Explorer data (HeidelbergSource) - Retinal imaging data from Heidelberg devices
  • Topcon data (TopconSource) - Ophthalmic imaging from Topcon equipment
  • DICOM Ophthalmology data (DICOMOphthalmologySource) - General ophthalmic datasets in DICOM format (including Zeiss)

For specific API documentation on ophthalmic datasources, see the Ophthalmology Datasources API reference.

Connecting a dataset using the SDK

See the tutorials on Running a Pod for examples of how to connect CSV and image folder datasets using the SDK. A DICOM dataset can be connected to a Pod in much the same way, using the DICOMSource class instead.

tip

Multiple datasets can be connected to a single Pod using the SDK by passing a list of DatasourceContainerConfig objects to the datasources argument of the Pod class.

Pod configuration objects

  • PodDetailsConfig provides human-readable metadata for a dataset (for example display_name and description), which is displayed in the Bitfount Desktop application and Hub.
  • PodDataConfig carries the operational options required to load data: datasource_args (for example path, connection strings, or ophthalmology flags), optional force_stypes to control column semantic types, and optional file_system_filters to filter files by criteria such as extension, creation date, and size.

Example: Connecting a DICOM dataset using the SDK

This example shows how to connect a DICOM dataset to a Pod using the SDK. It also demonstrates how to filter files based on various criteria, such as file extension, file creation date, and file size.

run_dicom_pod.py
```python
import logging

from bitfount import (
    DICOMSource,
    Pod,
    setup_loggers,
)
from bitfount.data.datasources.types import Date
from bitfount.runners.config_schemas import (
    DatasourceContainerConfig,
    FileSystemFilterConfig,
    PodDataConfig,
    PodDetailsConfig,
)

loggers = setup_loggers([logging.getLogger("bitfount")])

if __name__ == "__main__":
    datasource_details = PodDetailsConfig(
        display_name="My DICOM Dataset",
        description="This Pod contains data from my DICOM dataset",
    )
    datasource_args = {"path": "/path/to/dicom/dataset"}
    datasource = DICOMSource(**datasource_args)
    data_config = PodDataConfig(
        datasource_args=datasource_args,
        # DICOM frames are identified by the prefix "Pixel Data"
        force_stypes={"image_prefix": ["Pixel Data"]},
        file_system_filters=FileSystemFilterConfig(
            file_extension="dcm",
            file_creation_min_date=Date(2025, 1, 1),
            min_file_size=1.0,  # 1MB
        ),
    )
    pod = Pod(
        name="my-pod",
        datasources=[
            DatasourceContainerConfig(
                name="my-dicom-dataset",
                datasource=datasource,
                datasource_details=datasource_details,
                data_config=data_config,
            )
        ],
    )
    pod.start()
```
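To make the three filter criteria in this example concrete, the sketch below mimics them in plain Python without using Bitfount at all. It is not Bitfount's implementation: FileInfo and passes_filters are hypothetical names introduced only for illustration, and the matching rules (case-insensitive extension, inclusive minimum date, 1 MB taken as 1024 × 1024 bytes) are assumptions that may differ from what FileSystemFilterConfig actually does.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumption: "MB" means binary megabytes. The real filter may use 10**6 bytes.
BYTES_PER_MB = 1024 * 1024


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for the file metadata a filter would inspect."""

    name: str
    created: date
    size_bytes: int


def passes_filters(
    info: FileInfo,
    file_extension: Optional[str] = None,
    file_creation_min_date: Optional[date] = None,
    min_file_size: Optional[float] = None,  # in MB, as in the example above
) -> bool:
    """Return True if the file satisfies every filter that is set."""
    if file_extension is not None:
        # Assumption: accept "dcm" or ".dcm" and ignore case.
        wanted = "." + file_extension.lower().lstrip(".")
        if not info.name.lower().endswith(wanted):
            return False
    # Assumption: the minimum creation date is inclusive.
    if file_creation_min_date is not None and info.created < file_creation_min_date:
        return False
    if min_file_size is not None and info.size_bytes < min_file_size * BYTES_PER_MB:
        return False
    return True


files = [
    FileInfo("scan_a.dcm", date(2025, 3, 14), 5 * BYTES_PER_MB),  # passes all filters
    FileInfo("scan_b.dcm", date(2024, 11, 2), 5 * BYTES_PER_MB),  # created too early
    FileInfo("scan_c.dcm", date(2025, 2, 1), 200_000),            # smaller than 1 MB
    FileInfo("notes.txt", date(2025, 2, 1), 5 * BYTES_PER_MB),    # wrong extension
]

# Same criteria as the Pod example: "dcm" files created on or after
# 2025-01-01 that are at least 1 MB in size.
kept = [f.name for f in files if passes_filters(f, "dcm", date(2025, 1, 1), 1.0)]
print(kept)
```

Under these assumptions only scan_a.dcm survives, because the criteria are combined with a logical AND: a file must satisfy every filter that is set. Leaving a criterion as None disables it in this sketch.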