Connecting datasets
This page covers how to connect datasets to a Pod using the Bitfount SDK. Datasets connected through the SDK are also visible in the Bitfount Desktop application and the Hub, but they cannot be configured there. Under the hood, a dataset is powered by a datasource, which is the object that represents the type of data being connected to a Pod and encapsulates the specific logic required for loading and processing that kind of data.
Recall that datasets are part of a Pod, which is the entity that contains the datasets and enables them to be used in tasks.
Available Datasources
Bitfount supports connecting various types of datasets to a Pod, organised by domain. For detailed API documentation on all datasource classes, see the Datasources API reference.
Learn more about how datasources work here.
General Datasets
- CSV files (CSVSource) - Structured tabular data from CSV (Comma-Separated Values) files. Supports local file paths, URLs, and custom read_csv options for flexible data loading.
- Image folders (ImageSource) - Collections of image files in common formats such as JPG and PNG. Images are loaded from a directory and can optionally infer class labels from the folder structure.
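To illustrate what inferring class labels from the folder structure can mean, here is a minimal standard-library sketch. It assumes a layout of one subfolder per class directly under the dataset root; the function name `infer_labels_from_folders` and the extension set are hypothetical, and this is not how ImageSource itself is implemented.

```python
import tempfile
from pathlib import Path
from typing import Dict

# Assumed set of image extensions for this sketch only.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def infer_labels_from_folders(root: Path) -> Dict[str, str]:
    """Map each image (path relative to root) to its parent folder name.

    Assumes the layout root/<class_name>/<image file>.
    """
    labels: Dict[str, str] = {}
    for path in sorted(root.glob("*/*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            labels[path.relative_to(root).as_posix()] = path.parent.name
    return labels


if __name__ == "__main__":
    # Build a throwaway folder tree and print the inferred labels.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for class_name, file_name in [("cat", "a.jpg"), ("dog", "b.png")]:
            (root / class_name).mkdir()
            (root / class_name / file_name).touch()
        print(infer_labels_from_folders(root))
```

Files that are not images, or that sit directly in the root rather than in a class subfolder, are ignored in this sketch.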
Healthcare Datasets
- DICOM files (DICOMSource) - Medical imaging data in DICOM (Digital Imaging and Communications in Medicine) format, the international standard for transmitting, storing, and sharing medical images.
- NIfTI files (NIFTISource) - NIfTI (Neuroimaging Informatics Technology Initiative) is an open file format commonly used to store brain imaging data obtained using Magnetic Resonance Imaging (MRI) methods. The file format supports .nii and compressed .nii.gz extensions.
- OMOP databases (OMOPSource) - The Observational Medical Outcomes Partnership (OMOP) Common Data Model is a standardised schema for organising observational health data. Supports versions v3.0, v5.3, and v5.4.
- InterMine databases (InterMineSource) - InterMine is an open-source biological data warehouse developed by the University of Cambridge, providing integrated access to genomic and proteomic data.
Ophthalmic Datasets
- Heidelberg Eye Explorer data (HeidelbergSource) - Retinal imaging data from Heidelberg Engineering devices, loaded from .sdb (Spectralis Database) files.
- Topcon data (TopconSource) - Ophthalmic imaging from Topcon equipment, supporting various OCT and fundus imaging formats.
- DICOM Ophthalmology data (DICOMOphthalmologySource) - Ophthalmic datasets in DICOM format, including data from Zeiss and other manufacturers, with support for OCT and SLO image extraction.
For specific API documentation on ophthalmic datasources, see the Ophthalmology Datasources API reference.
Connecting a dataset using the SDK
See the tutorials on Running a Pod for examples of how to connect CSV and Image folder datasets using the SDK.
A DICOM dataset can be connected to a Pod in much the same way, using the DICOMSource class instead.
Multiple datasets can be connected to a single Pod using the SDK by passing a list of DatasourceContainerConfig objects to the datasources argument of the Pod class.
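As a rough sketch of the multi-dataset case, the function below builds one Pod from a list of two DatasourceContainerConfig objects. It assumes that CSVSource and DICOMSource both accept a path argument and can be imported from the top-level bitfount package; the function name, dataset names, and paths are placeholders. The SDK imports are placed inside the function so the snippet can be loaded without Bitfount installed.

```python
def build_multi_dataset_pod(csv_path: str, dicom_path: str):
    """Build (but do not start) a Pod that serves a CSV and a DICOM dataset."""
    # Lazy imports: the Bitfount SDK is only needed when the function is called.
    from bitfount import CSVSource, DICOMSource, Pod
    from bitfount.runners.config_schemas import (
        DatasourceContainerConfig,
        PodDataConfig,
        PodDetailsConfig,
    )

    def container(name, datasource, path, display_name):
        # One container per dataset: the datasource plus its two config objects.
        return DatasourceContainerConfig(
            name=name,
            datasource=datasource,
            datasource_details=PodDetailsConfig(
                display_name=display_name,
                description=f"Dataset loaded from {path}",
            ),
            data_config=PodDataConfig(datasource_args={"path": path}),
        )

    # Assumption: both datasource classes take `path` as a keyword argument.
    return Pod(
        name="my-pod",
        datasources=[
            container(
                "my-csv-dataset", CSVSource(path=csv_path), csv_path, "My CSV Dataset"
            ),
            container(
                "my-dicom-dataset",
                DICOMSource(path=dicom_path),
                dicom_path,
                "My DICOM Dataset",
            ),
        ],
    )
```

Calling `build_multi_dataset_pod("/path/to/data.csv", "/path/to/dicom/dataset").start()` would then start a single Pod serving both datasets, provided the assumptions above hold for your SDK version.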
Pod configuration objects
- PodDetailsConfig provides human-readable metadata for a dataset (for example display_name and description) for display in the Bitfount Desktop application and Hub.
- PodDataConfig carries the operational options required to load data, such as datasource_args (for example path, connection strings, or ophthalmology flags), optional force_stypes to give control over column semantic types, and file_system_filters to filter files based on various criteria.
Example: Connecting a DICOM dataset using the SDK
This example shows how to connect a DICOM dataset to a Pod using the SDK. It also demonstrates how to filter files based on various criteria, such as file extension, file creation date, and file size.
```python
import logging

from bitfount import (
    DICOMSource,
    Pod,
    setup_loggers,
)
from bitfount.data.datasources.types import Date
from bitfount.runners.config_schemas import (
    DatasourceContainerConfig,
    FileSystemFilterConfig,
    PodDataConfig,
    PodDetailsConfig,
)

loggers = setup_loggers([logging.getLogger("bitfount")])

if __name__ == "__main__":
    datasource_details = PodDetailsConfig(
        display_name="My DICOM Dataset",
        description="This Pod contains data from my DICOM dataset",
    )
    datasource_args = {"path": "/path/to/dicom/dataset"}
    datasource = DICOMSource(**datasource_args)
    data_config = PodDataConfig(
        datasource_args=datasource_args,
        # DICOM frames are identified by the prefix "Pixel Data"
        force_stypes={"image_prefix": ["Pixel Data"]},
        file_system_filters=FileSystemFilterConfig(
            file_extension="dcm",
            file_creation_min_date=Date(2025, 1, 1),
            min_file_size=1.0,  # 1MB
        ),
    )
    pod = Pod(
        name="my-pod",
        datasources=[
            DatasourceContainerConfig(
                name="my-dicom-dataset",
                datasource=datasource,
                datasource_details=datasource_details,
                data_config=data_config,
            )
        ],
    )
    pod.start()
```
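To make the effect of the three file-system filters more concrete, the following standard-library sketch applies the same kinds of criteria to a folder of files. It is an illustration only, not Bitfount's implementation: `FilterSketch`, `passes_filters`, and `select_files` are hypothetical names, the standard `datetime.date` stands in for Bitfount's Date type, and one MB is assumed to mean 1024 * 1024 bytes.

```python
import tempfile
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import List, Optional

BYTES_PER_MB = 1024 * 1024  # assumption: "MB" means 2**20 bytes in this sketch


@dataclass
class FilterSketch:
    """Hypothetical stand-in for the filter options; not a Bitfount class."""

    file_extension: Optional[str] = None
    file_creation_min_date: Optional[date] = None
    min_file_size: Optional[float] = None  # in MB


def passes_filters(path: Path, filters: FilterSketch) -> bool:
    """Return True if the file satisfies every filter that is set."""
    stat = path.stat()
    if filters.file_extension is not None:
        wanted = filters.file_extension.lower().lstrip(".")
        if path.suffix.lower().lstrip(".") != wanted:
            return False
    if filters.file_creation_min_date is not None:
        # st_ctime is the creation time on Windows but the last metadata
        # change on most Unix systems, so this is only an approximation.
        created = datetime.fromtimestamp(stat.st_ctime).date()
        if created < filters.file_creation_min_date:
            return False
    if filters.min_file_size is not None:
        if stat.st_size < filters.min_file_size * BYTES_PER_MB:
            return False
    return True


def select_files(root: Path, filters: FilterSketch) -> List[str]:
    """List the names of files under root that pass the filters, sorted."""
    return sorted(
        p.name for p in root.rglob("*") if p.is_file() and passes_filters(p, filters)
    )


if __name__ == "__main__":
    # Create three throwaway files and apply filters like those in the example.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "scan.dcm").write_bytes(b"\0" * (2 * BYTES_PER_MB))
        (root / "tiny.dcm").write_bytes(b"\0" * 10)
        (root / "notes.txt").write_bytes(b"\0" * (2 * BYTES_PER_MB))
        print(select_files(root, FilterSketch("dcm", date(2000, 1, 1), 1.0)))
```

Each filter left as None is skipped, so the criteria combine with AND semantics: a file must satisfy every filter that is set. Whether the SDK combines its filters in exactly this way is an assumption of the sketch.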