# Bitfount Documentation


## demo-projects.md

# Demo projects

Get started quickly with pre-configured demo projects designed by Bitfount to
help you explore the platform with minimal setup.

![Demo projects](demo-projects.png)

## Fine-tune & run inference with RETFound

RETFound, developed by researchers
at Moorfields Eye Hospital and University College London, is the first
foundation model trained on retinal images. Learn how to fine-tune RETFound to
classify images relevant to your specific research needs. This demo walks you
through the full process—from fine-tuning the model to running inference on new
datasets.

:::info
We also offer Python tutorials for our SDK that run on Google Colab.
:::

## Why use RETFound?

Previously, training AI to analyse retinal images required building separate
models from scratch for each disease—a time-consuming, expensive, and
data-intensive process.

With RETFound, you can start with a pre-trained model, fine-tune it for a
specific disease or classification task, and train on fewer labeled images using
just a single GPU.

This means faster, more efficient AI model development, unlocking new
possibilities for analysing retinal diseases.

![Demo project task run](product-demo-task-run-min.png)

## Getting started

In this tutorial you will learn how to:

1. Connect a training dataset of retinal images.
2. Fine-tune RETFound for a specific classification task.
3. Run inference on a new dataset using your fine-tuned model.

By the end, you will be able to adapt RETFound to your research needs and
generate insights from retinal images with ease.

### Step 1: Connect a training dataset

The training dataset will need to consist of retinal images grouped into
categories (classes) that the fine-tuned model will learn to recognise.

:::tip
**Connecting OCTs?** Make sure your dataset consists of individual B-scan images rather than full volumetric scans as RETFound does not classify volumetric scans.
:::

#### Organising the dataset

Before connecting the dataset, arrange your image files into the following
folder structure:

📂 **Dataset folder** (top-level folder you will connect to Bitfount)\
📂 **Data split folders** (separate folders for the train, validation, and test splits, e.g. 60–80%
training, 10–20% validation, 10–20% test)\
📂 **Class label folders** (subfolders within each split representing different diseases or severity levels)

_Example dataset folder structure with training, validation, and test splits_

#### Connecting the dataset

1. Connect the dataset from the Datasets page in Bitfount, or connect a new
   dataset directly when linking a dataset in the demo project.
2. Choose the folder that contains your images.
3. Check the option `Use folder names and structure for training tasks`.
4. Connect the dataset.

:::note
Folder-inferred data splits and class labels are only supported for DICOM or Heidelberg formats. Contact our support team if your data is in another format.
:::

### Step 2: Fine-tune the RETFound model

Fine-tuning adapts RETFound to your specific dataset, optimising its performance
for your research.

#### Setting up the fine-tuning task

1. Join the RETFound fine-tuning demo project.
2. Link your training dataset to the project.
3. Select the relevant RETFound model. Ensure the model version matches your
   dataset (OCT or Color Fundus).
4. Set task parameters:

| Parameter         | Description                                                                                                                                                                                                    |
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Learning rate** | Controls how quickly the model learns. A lower value means slower but steadier learning, while a higher value speeds things up but can make learning less reliable.                                            |
| **Epochs**        | The number of times the model goes through the entire dataset. More epochs can improve accuracy but also increase training time and risk overfitting. **The RETFound paper suggests starting with 50 epochs**. |
| **Labels**        | Enter the class labels you defined in your training dataset folder structure. Choose conditions you have sample data for, or test with any labeled public dataset.                                             |
| **Batch size**    | The number of samples processed before updating the model. Larger batches can speed up training but require more memory.                                                                                       |
| **Image column**  | The dataset column that contains image data for model training and fine-tuning. The default is `Pixel Data 0` unless configured differently.                                                                   |
| **Target column** | The dataset column that contains the image labels. The default is `BITFOUNT-INFERRED-LABEL` unless specified otherwise.                                                                                        |

5. Run the task. Task processing time will depend on a number of factors
   including the size of data you connected, the batch size, the number of
   epochs, your machine's processing capabilities and more.
6. Once the task completes, review the results by navigating to the task run.
   The output includes a CSV file summarising the learning process, and three
   checkpoint files, including the `model_best.pth` checkpoint file, which
   represents the highest-performing fine-tuned model. The `model_best.pth`
   checkpoint file can be used in subsequent projects to obtain predictions on
   unlabelled images.

![Fine-tuning task results](RETFound-task-results-ft.png)

### Step 3: Run inference on new data

Now that you have fine-tuned RETFound, it's time to test it on unlabelled
images!

#### Preparing the test dataset

1. Curate a folder of new images (no need to add class labels).
2. Connect the dataset to Bitfount.

:::warning
**Do not** check the option `Use folder names and structure for training tasks`.
:::

#### Running inference

Inference is the process of using your fine-tuned model to analyse a new
dataset. During this step, the model will classify each image based on the
categories it was trained on, generating predictions as output.

1. Join the RETFound inference demo project (_Classify retinal images using
   RETFound and a local checkpoint file_).
2. Link the test dataset you prepared.
3. Set task parameters:

| Parameter           | Description                                                                |
| ------------------- | -------------------------------------------------------------------------- |
| **Model version**   | Ensure it matches your dataset type (e.g. Color Fundus or OCT).            |
| **Class outputs**   | Use the same labels defined during fine-tuning.                            |
| **Checkpoint file** | Select the `model_best.pth` checkpoint file from your fine-tuning results. |
| **Image column**    | Defaults to `Pixel Data 0`, unless configured differently.                 |

4. Run the task.

### Interpreting predictions

The task results are provided in a CSV file for easy review. This file includes
metadata about your input images, but the most important columns are:

- First column: The file path of each image.
- Last columns: The predicted probabilities for each class (the number of these
  columns depends on the classes defined during fine-tuning).

Each score represents the model's confidence that an image belongs to a specific
class. The values for all classes will sum to 1, with higher numbers indicating
greater confidence in the model's prediction.
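As a sketch, the predicted class for each row can be read off by taking the highest-probability column. The column names and values below are illustrative; your CSV will use the class labels you defined during fine-tuning:

```python
import csv
import io

# Hypothetical inference results: the first column is the image path,
# the last columns hold one probability per class.
csv_text = """file_path,DRUSEN,NORMAL
scan_001.dcm,0.91,0.09
scan_002.dcm,0.12,0.88
"""

class_columns = ["DRUSEN", "NORMAL"]

predictions = []
for row in csv.DictReader(io.StringIO(csv_text)):
    scores = {label: float(row[label]) for label in class_columns}
    # Probabilities across all classes sum to 1; the highest is the prediction.
    best = max(scores, key=scores.get)
    predictions.append((row["file_path"], best, scores[best]))

for path, label, confidence in predictions:
    print(f"{path}: {label} ({confidence:.0%})")
```

Here `scan_001.dcm` would be reported as `DRUSEN` with 91% confidence.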

![Inference task results](RETFound-task-results-inference.png)

## FAQs

**What is a foundation model?**\
A model trained on a broad dataset (over 1 million retinal images) that can be fine-tuned for specific tasks.

**What is fine-tuning?**\
Training an existing foundation model on a specific dataset to specialise in a particular task.

**What are classes?**\
Categories a model classifies images into (e.g. 'Diabetic Retinopathy' vs. 'Normal').

**How many images do I need?**\
Start with at least 100 images per class, though more data improves performance.

**What kind of images can be used?**\
There are two versions of RETFound: one trained on colour fundus images and one trained on OCT images. Note that the model does not classify volumetric scans, so OCT data must be supplied as individual B-scans.


## faqs.md

# FAQs

**What does it mean when the “Run task” button is greyed out?**

The **Run task** button is disabled when the selected dataset cannot accept a new task. This can occur for the following reasons:

1. **The dataset is not connected to the EHR**

   This project requires an active EHR connection for every dataset. If the site has not logged into their EHR through the Bitfount app, the task cannot be started.

   → Please contact the site and ask them to log in to their EHR instance within Bitfount.

2. **The dataset is currently at capacity**

   The dataset is already processing another task and cannot accept a second one.

   → You will need to wait until the current task completes before trying again.

3. **The dataset is offline**

   The dataset is not online and therefore cannot receive a task.

   → Contact the dataset owner and request that they bring the dataset back online.

**The dataset I want to run a task on is marked in red as offline. What should I do?**

If a dataset is marked **offline**, it means Bitfount cannot connect to the dataset at the site. You will need to contact the site that owns the dataset. Common reasons for an offline status include:

- The dataset owner has logged out of the Bitfount application.
- The Bitfount laptop is powered off, or the Bitfount OS user is logged out.
- The imaging drive is unmounted locally at the site and can no longer be accessed.

  → The site’s IT Support team will need to remount the drive.

- The Bitfount laptop has lost the read permissions required to access the imaging drive.

  → The site’s IT Support team will need to restore these credentials.

- The site’s internet connection is unstable, preventing Bitfount from retrieving the model from the Bitfount Hub and preparing it for local analysis.

  → The site should check and restore their internet connection.

**Can I run multiple task runs at a time?**

At present, Bitfount does not support running multiple task runs in parallel from a single modeller instance. This limitation applies whether you are connecting to **remote datasets** or running a **local dataset** on your own machine. In both cases, one modeller can initiate a task for **only one dataset at a time**, so task runs must be started sequentially.

The same limitation applies on the dataset side: a **single remote machine** hosting Bitfount can process **only one dataset task run at a time**, even if multiple datasets are configured on that machine.

If you need to run multiple task runs simultaneously across different sites, the current workaround is to use **separate Bitfount instances on separate devices**. Each device can initiate a task toward a different remote dataset, allowing those runs to proceed in parallel.

```mermaid
---
title: Bitfount - Current Task Run Architecture
---

flowchart LR
    subgraph "Local Modeller Instances"
        direction TB
        UA[User A  Bitfount Modeller]
        UB[User B  Bitfount Modeller]
    end

    subgraph "Remote Datasets (Pods)"
        direction TB
        S1[Site 1 Pod  Remote Dataset]
        S2[Site 2 Pod  Remote Dataset]
    end

    UA -->|Task Run| S1
    UB -->|Task Run| S2
    UA -.->|One modeller to one dataset at a time| S2

    style UA fill:#a6e7ff,stroke:#003366,stroke-width:2
    style UB fill:#a6e7ff,stroke:#003366,stroke-width:2
    style S1 fill:#a6e7ff,stroke:#003366,stroke-width:2
    style S2 fill:#a6e7ff,stroke:#003366,stroke-width:2

    linkStyle 0 stroke:#1ca3ff,stroke-width:3,color:#1ca3ff
    linkStyle 1 stroke:#1ca3ff,stroke-width:3,color:#1ca3ff
    linkStyle 2 stroke:#f36f21,stroke-width:2,color:#f36f21,stroke-dasharray:5 5
```

_The dotted line indicates that multiple task runs are not supported concurrently._

**It’s been a long time since I kicked off the task and it is still running. When will it finish?**

Task run times in Bitfount can vary widely depending on several factors, including:

- The dataset owner’s **local network speed**
- The dataset owner’s **internet connection quality**
- The **hardware specifications** of the machine hosting the dataset
- The **size and number of files** being processed
- The **complexity** of the task or analysis being performed

Because of the variety of these conditions, Bitfount cannot predict the exact duration of a task. Once the task completes, the Collaborator who initiated it will automatically receive an **email notification** confirming that the task has finished and is ready for review.

**How will I know when the task is complete?**

The Collaborator who initiates a task run will be notified by an email alert when the task has either completed or aborted.

**How do I pause or quit a task run?**

Bitfount does not currently support pausing an active task. If a task needs to be stopped, it must be **terminated by the dataset owner**, as Bitfount’s execution flow is designed so that control originates from the machine hosting the data.

To end a task run, the dataset owner should use the **Windows Task Manager** on the machine where the dataset is connected. From there, they must manually select **“End task”** for both the **Bitfount application** and the **Bitfount Orchestrator** processes. Ending both ensures that the environment restarts cleanly the next time a task is run.

Once terminated, the task initiator will see a notification indicating that the task has aborted. There is no risk to data or system stability when ending these processes.

If the dataset owner is unavailable or the task does not terminate as expected, please contact **support@bitfount.com** for assistance.

**How do I know that the dataset owner has applied the correct filters to the dataset?**

Bitfount is designed to give dataset owners full control over their data and to support strong Information Governance practices. As part of this, **only the dataset owner can view or modify the dataset-level filters applied at connection time**. These filters are not visible to collaborators, and any changes made by the dataset owner will apply **only to future task runs**, not tasks that have already been completed.

If you believe the current filters need to be updated—for example, to include additional data or adjust the criteria being queried—you will need to **contact the dataset owner**. They can reconnect the dataset using the same data source but with updated filters applied. This ensures that any modifications are governed and explicitly approved by the data custodian.

Filters that may be applied include:

- Modality - derived from the image headers
- Date of birth - derived from the image headers
- Date created - derived from the file metadata
- Date modified - derived from the file metadata
- B-scan Min and Max - derived from the image headers
- File size Min and Max - derived from the file metadata
- Filter files missing required fields for calculations - skips files that lack values in the image header fields needed to derive a calculated value.

**What is the difference between dataset-level filters and task-level filters?**

Bitfount supports two types of data filtering:

**Dataset-level filters** are configured by the **dataset owner** when connecting a dataset. These filters:

- Are set at connection time and apply to all task runs on that dataset
- Are only visible to and modifiable by the dataset owner
- Define the maximum scope of data that can be accessed
- Require reconnecting the dataset to change

**Task-level filters** are configured by the **task initiator** at runtime. These filters:

- Are specified in the task YAML under `data_structure.filter`
- Can be different for each task run
- Can be templated to allow users to configure them via the UI
- Are combined with dataset-level filters using "most restrictive wins" logic

:::info
Task-level filters can only **further restrict** the data beyond what the dataset owner has configured—they cannot expand access to data that has been filtered out at the dataset level.
:::

**Example**: A dataset owner connects their imaging data with a dataset-level filter for `modality: OCT`. A collaborator then runs a task with a task-level filter for `min-frames: 50`. The resulting data will include only OCT images with at least 50 B-scan frames.
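This "most restrictive wins" behaviour can be sketched in plain Python. The field names and filter shapes below are illustrative, not the actual Bitfount filter schema:

```python
# Hypothetical file metadata for a connected imaging dataset.
files = [
    {"path": "a.dcm", "modality": "OCT", "frames": 60},
    {"path": "b.dcm", "modality": "OCT", "frames": 30},
    {"path": "c.dcm", "modality": "CFP", "frames": 1},
]


def dataset_filter(f):
    # Set by the dataset owner at connection time: OCT only.
    return f["modality"] == "OCT"


def task_filter(f):
    # Set by the task initiator at runtime: at least 50 B-scan frames.
    return f["frames"] >= 50


# "Most restrictive wins": a file must pass BOTH filters, so the task
# can only narrow, never widen, the scope the owner configured.
visible = [f for f in files if dataset_filter(f) and task_filter(f)]
print([f["path"] for f in visible])  # only a.dcm passes both filters
```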

For more details on configuring task-level filters, see Task-level filters.

**As a dataset owner, a dataset I had previously connected with my Bitfount account is offline and listed as remote. How do I reconnect to it?**

Bitfount marks a dataset as **remote/offline** when it cannot find the local configuration file—called the **pod-config file**—that stores the details required to connect to your dataset.

The pod-config file is created on the **specific machine and OS user account** that originally connected the dataset. It is not shared across devices or operating system accounts. Bitfount uses this local file by design to ensure that your data remains on the custodian’s system, and that no connection or configuration information leaves your secure environment. This is part of Bitfount’s privacy-preserving model, which ensures that datasets are never transferred or centrally stored.

Your dataset may appear as **remote or offline** if:

- You are logged into **a different OS user account** on the same machine
- You are logged into Bitfount on **another computer**
- The original pod-config file is not accessible, was deleted, or has been moved

Because these environments do not have a copy of the pod-config file, they cannot establish the secure connection required, and Bitfount marks the dataset as remote.

**To reconnect the dataset:**

- Log in to Bitfount on the **same machine** where the dataset was originally connected
- Log in as the **same OS user** who performed the initial connection
- Then reopen Bitfount and navigate to your dataset to bring it back online

If you no longer have access to the original machine or OS account, you will need to **reconnect the dataset as new** by repeating the initial dataset connection steps.

**Can changes be made to the parameters of the demo projects available in Bitfount?**

No. Demo projects are fixed and cannot be customised.

If your use case requires different settings or functionality, we’d be happy to discuss options. Please contact us at **support@bitfount.com**.

**How should my data be structured for model fine-tuning projects?**

Your dataset must follow a standard machine-learning folder structure with three top-level folders:

- **train/**
- **validation/**
- **test/**

Inside each folder, images must be placed into separate subfolders representing the **labels** you want the model to learn (e.g., `train/DRUSEN/…`, `train/NORMAL/…`).

Each subfolder should contain the relevant images.
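As a sanity check before connecting, a short script can verify this layout. The helper below is illustrative and not part of the Bitfount SDK:

```python
import tempfile
from pathlib import Path


def check_structure(root: Path) -> list[str]:
    """Report problems with a train/validation/test fine-tuning layout.

    Illustrative helper, not part of the Bitfount SDK.
    """
    problems = []
    for split in ("train", "validation", "test"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing split folder: {split}/")
            continue
        labels = [d.name for d in split_dir.iterdir() if d.is_dir()]
        if not labels:
            problems.append(f"{split}/ has no class label subfolders")
    return problems


# Build a minimal valid layout: <root>/train/DRUSEN/, <root>/train/NORMAL/, ...
root = Path(tempfile.mkdtemp())
for split in ("train", "validation", "test"):
    for label in ("DRUSEN", "NORMAL"):
        (root / split / label).mkdir(parents=True)

print(check_structure(root))  # an empty list means the layout is valid
```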

**When connecting a dataset, what variables are mandatory?**

The following fields must be provided when creating a dataset connection:

- **Dataset name**
- **DICOM folder location** — the file path to the directory containing your imaging data
- **Use folder names and structure for training tasks** — this option must be enabled so Bitfount can correctly infer labels and structure

**I want to try your demo projects but I'm unsure about connecting my own data first. Where can I find data to trial a project?**

All demo projects include a **small, built-in sample dataset** so you can explore Bitfount without connecting your own data.

For peace of mind, remember that **no imaging data ever leaves your institution**, and all analysis occurs locally on the device connected to Bitfount.

**When setting up a fine-tuning task run, what are the configurable task parameters for?**

These parameters allow you to control how the model trains. A quick overview:

- **Learning rate** — How quickly the model updates during training.
  - Lower = slower but more stable training
  - Higher = faster but riskier
- **Epochs** — Number of full passes through the dataset.
  - Increasing epochs can improve learning but may cause overfitting.
- **Labels** — The set of classes the model should learn from your folder structure.
- **Batch size** — How many images are processed at once.
  - Larger batches improve training speed but require more memory.
- **Image column** — The dataset column containing image references (for dataset-driven training workflows).
- **Target column** — The column containing labels.

  If your folders follow a structure like `test/class_1/image_1.jpg`, selecting **BITFOUNT_INFERRED_LABEL** will automatically extract labels from folder names.
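The folder-name convention can be illustrated with a small sketch. Bitfount performs this inference internally; the helper below is only a hypothetical illustration of the rule that an image's parent folder is its label:

```python
from pathlib import PurePosixPath


def inferred_label(file_path: str) -> str:
    """Return the class label implied by a path like test/class_1/image_1.jpg.

    Illustrative only: the image's parent folder name is taken as its label.
    """
    return PurePosixPath(file_path).parent.name


print(inferred_label("test/class_1/image_1.jpg"))    # class_1
print(inferred_label("train/DRUSEN/oct_0042.jpeg"))  # DRUSEN
```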

**What is the suggested Bitfount workflow for using the fine-tuning and classification demo projects?**

The best workflow depends on whether you want to try **classification**, **fine-tuning**, or a full **end-to-end pipeline**:

**1. If you want to try classification only:**

Start with:

- **Classify retinal images using RETFound fine tuned on Kermany**

**2. If you want to try fine-tuning a retinal model:**

Choose a demo based on your imaging modality:

- **Colour fundus** → _Fine tune the RETFound retinal colour fundus foundation model_
- **OCT** → _Fine tune the RETFound retinal OCT foundation model_

**3. If you want the end-to-end workflow (fine-tune → classify):**

1. Fine-tune using one of the RETFound fine-tuning demos
2. Then classify using:
   - **Classify retinal images using RETFound and a local checkpoint file**

Your fine-tuning task must generate a checkpoint named **`model_best`** to be used in the classification project.

**Available demo projects:**

1. **Fine tune the RETFound retinal colour fundus foundation model**
   - Fine-tune Moorfields Eye Hospital’s RETFound model on colour fundus images.
2. **Fine tune the RETFound retinal OCT foundation model**
   - Fine-tune the RETFound model on OCT images.
3. **Classify retinal images using RETFound fine tuned on Kermany**
   - Classify OCT images into: `CNV`, `DME`, `DRUSEN`, `NORMAL`.
4. **Classify retinal images using RETFound and a local checkpoint file**
   - Classify fundus or OCT images using your own fine-tuned model checkpoint (`model_best`).


## security.md

# Security

## Firewalls

One of the fundamental architectural choices of the Bitfount platform, which differs from many other federated architectures, is that Bitfount follows a messaging architecture. This means that services connecting to Bitfount only make outgoing HTTP connections and can happily sit behind a firewall.

## Encryption

All data entering or leaving Bitfount uses TLS/HTTPS, and all messages are 256-bit AES end-to-end encrypted. This removes any requirement to trust Bitfount with respect to raw data or task results.

## Your data

Data accessed via Bitfount can be hosted locally or in cloud infrastructure. Data never leaves its location and is not accessible to Bitfount unless access is granted.

The only information shared with Bitfount is metadata. More information on the metadata Bitfount has access to can be found in our privacy policy.

## Bitfount's own security

Bitfount takes security very seriously. Security is a core part of what our product aims to help with! The following are some of the things we are doing to make sure our own code and infrastructure are secure:

- Automated security tests on all our code
- Regular penetration tests on all our services
- Monitoring tools to try to catch intrusions and incidents
- Segregated production environment with limited human access
- Various process-level security policies, including a secure development policy
- ISO 27001 certified, HIPAA compliant, GDPR compliant, UK Cyber Essentials Plus certified, and NHS Data Security and Protection Toolkit (DSPT) compliant
- Access to Bitfount is protected by strong authentication and authorization controls, with user passwords not being held by Bitfount
- Bitfount's authentication (Auth0) and infrastructure (AWS) providers hold industry-leading security certifications such as SOC 2 Type II, ISO 27018 and ISO 27001


## for-data-scientists/models-and-datasets/connecting-datasets.md

# Connecting datasets

This page covers how to connect datasets to a Pod using the Bitfount SDK. Any datasets connected using the SDK will also be visible in the Bitfount Desktop application and Hub but they won't be configurable. Under the hood, a dataset is powered by a **datasource**, which is the object that represents the type of data being connected to a Pod and encapsulates the specific logic required for loading and processing that kind of data.

:::info[Reminder]
Recall that datasets are part of a Pod, which is the entity that contains the datasets and enables them to be used in tasks.
:::

## Available Datasources

Bitfount supports connecting various types of datasets to a Pod, organised by domain. For detailed API documentation on all datasource classes, see the Datasources API reference.

:::info

**Datasources** are the objects that represent the type of data being connected to a Pod and encapsulate the specific logic required for loading and processing that kind of data. Learn more about how they work here.

:::

### General Datasets

- **CSV files** (`CSVSource`) - Structured tabular data from CSV (Comma-Separated Values) files. Supports local file paths, URLs, and custom `read_csv` options for flexible data loading.
- **Image folders** (`ImageSource`) - Collections of image files in common formats such as JPG and PNG. Images are loaded from a directory and can optionally infer class labels from the folder structure.

### Healthcare Datasets

- **DICOM files** (`DICOMSource`) - Medical imaging data in DICOM (Digital Imaging and Communications in Medicine) format, the international standard for transmitting, storing, and sharing medical images.
- **NIfTI files** (`NIFTISource`) - NIfTI (Neuroimaging Informatics Technology Initiative) is an open file format commonly used to store brain imaging data obtained using Magnetic Resonance Imaging (MRI) methods. The file format supports `.nii` and compressed `.nii.gz` extensions.
- **OMOP databases** (`OMOPSource`) - The Observational Medical Outcomes Partnership (OMOP) Common Data Model is a standardised schema for organising observational health data. Supports versions v3.0, v5.3, and v5.4.
- **InterMine databases** (`InterMineSource`) - InterMine is an open-source biological data warehouse developed by the University of Cambridge, providing integrated access to genomic and proteomic data.

### Ophthalmic Datasets

- **Heidelberg Eye Explorer data** (`HeidelbergSource`) - Retinal imaging data from Heidelberg Engineering devices, loaded from `.sdb` (Spectralis Database) files.
- **Topcon data** (`TopconSource`) - Ophthalmic imaging from Topcon equipment, supporting various OCT and fundus imaging formats.
- **DICOM Ophthalmology data** (`DICOMOphthalmologySource`) - Ophthalmic datasets in DICOM format, including data from Zeiss and other manufacturers, with support for OCT and SLO image extraction.

For specific API documentation on ophthalmic datasources, see the Ophthalmology Datasources API reference.

## Connecting a dataset using the SDK

See the tutorials on Running a Pod for examples of how to connect CSV and Image folder datasets using the SDK.
A DICOM dataset can be connected to a Pod in much the same way, simply using the `DICOMSource` class instead.

:::tip

Multiple datasets can be connected to a single Pod using the SDK by passing a list of `DatasourceContainerConfig` objects to the `datasources` argument of the `Pod` class.

:::

### Pod configuration objects

- `PodDetailsConfig` provides human-readable metadata for a dataset (for example `display_name` and `description`) for display in the Bitfount Desktop application and Hub.
- `PodDataConfig` carries the operational options required to load data, such as `datasource_args` (for example `path`, connection strings, or ophthalmology flags), optional `force_stypes` to give control over column semantic types, and `file_system_filters` to filter files based on various criteria.

### Example: Connecting a DICOM dataset using the SDK

This example shows how to connect a DICOM dataset to a Pod using the SDK. It also demonstrates how to filter files based on various criteria, such as file extension, file creation date, and file size.

```python showLineNumbers title="run_dicom_pod.py"
import logging

from bitfount import (
    DICOMSource,
    Pod,
    setup_loggers,
)
from bitfount.data.datasources.types import Date
from bitfount.runners.config_schemas import (
    DatasourceContainerConfig,
    FileSystemFilterConfig,
    PodDataConfig,
    PodDetailsConfig,
)

loggers = setup_loggers([logging.getLogger("bitfount")])

if __name__ == "__main__":
    datasource_details = PodDetailsConfig(
        display_name="My DICOM Dataset",
        description="This Pod contains data from my DICOM dataset",
    )
    datasource_args = {"path": "/path/to/dicom/dataset"}
    datasource = DICOMSource(**datasource_args)
    data_config = PodDataConfig(
        datasource_args=datasource_args,
        # DICOM frames are identified by the prefix "Pixel Data"
        force_stypes={"image_prefix": ["Pixel Data"]},
        file_system_filters=FileSystemFilterConfig(
            file_extension="dcm",
            file_creation_min_date=Date(2025, 1, 1),
            min_file_size=1.0,  # 1 MB
        ),
    )

    pod = Pod(
        name="my-pod",
        datasources=[
            DatasourceContainerConfig(
                name="my-dicom-dataset",
                datasource=datasource,
                datasource_details=datasource_details,
                data_config=data_config,
            )
        ],
    )
    pod.start()

```


## for-data-scientists/models-and-datasets/installing-the-sdk.md

# Installing the SDK

Most Bitfount functionality can be achieved using the Bitfount Desktop application. However, for more complex use cases, the SDK provides a more flexible and powerful way to interact with Bitfount. Anything that can be done in Bitfount Desktop can also be done using the SDK. In this guide, we cover how to install the SDK and use it to connect datasets, test machine learning models, and run federated tasks.

The Bitfount SDK is published on PyPI and can be installed simply using pip.

```bash
pip install bitfount
```

The SDK requires a Python 3.12 environment and can be installed on macOS (Apple Silicon only), Linux, and Windows.

If running on Windows or Linux with an NVIDIA GPU and CUDA installed, you will need to modify the command to include the appropriate PyTorch wheel index for your CUDA version. For instance, if you have CUDA 12.6 installed, you would use the following command:

```bash
pip install bitfount -f https://download.pytorch.org/whl/cu126
```

Alternatively, find the appropriate command for your OS and CUDA version here and install that in your environment first before installing the `bitfount` package as normal.


## for-data-scientists/models-and-datasets/bringing-your-models-to-bitfount/for-inference-or-evaluation.md

# Inference and Evaluation

This page covers how to bring your existing PyTorch or ONNX models onto the Bitfount platform specifically for inference or evaluation tasks. These are tasks that specifically use the `bitfount.ModelInference` or `bitfount.ModelEvaluation` algorithms. Models that do not perform any training or fine-tuning can be substantially simpler to onboard to Bitfount.

Bitfount supports two types of models:

- **PyTorch models**
- **ONNX models**

Regardless of the type of model you are bringing onto the Bitfount platform, the only thing you need to do is make sure that your model implements the `InferrableModelProtocol` interface for inference or the `EvaluableModelProtocol` interface for evaluation. Both protocols require the implementation of `initialise_model`, which is used to initialise the model, and `deserialize`, which is used to load the model parameters from a file. You will then additionally need to implement either the `predict` method for inference or the `evaluate` method for evaluation (or both if you want to support both).

:::tip

To validate that your model has implemented the appropriate interface correctly, you can use an `isinstance` check:

```python
assert isinstance(MyModel(), InferrableModelProtocol)
```

:::
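To see what such an interface looks like in practice, here is an illustrative skeleton using `typing.Protocol`. The method names come from the description above, but the signatures are assumptions; the real protocols live in the Bitfount SDK and may differ:

```python
from typing import Any, Protocol, runtime_checkable

# Illustrative stand-in for Bitfount's InferrableModelProtocol; the exact
# signatures in the SDK may differ.
@runtime_checkable
class InferrableModelProtocol(Protocol):
    def initialise_model(self, datasource: Any) -> None: ...
    def deserialize(self, path: str) -> None: ...
    def predict(self, datasource: Any) -> Any: ...

class MyModel:
    """Toy model providing the three required methods."""

    def initialise_model(self, datasource: Any) -> None:
        self._ready = True

    def deserialize(self, path: str) -> None:
        pass  # would load model parameters from `path`

    def predict(self, datasource: Any) -> list:
        return []

# runtime_checkable protocols verify method presence at runtime.
assert isinstance(MyModel(), InferrableModelProtocol)
```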

## PyTorch

To make migration easier we've provided two base classes for PyTorch inference models:

- `PytorchInferenceModel`
- `PytorchLightningInferenceModel`

These base classes do not _need_ to be used but may make migration easier as they provide a lot of the functionality you need to get started. For examples of how to implement a PyTorch inference model using these base classes, see the PyTorch inference model tutorial.

## ONNX

Open Neural Network Exchange (ONNX) is a popular open standard format for representing machine learning models across many popular frameworks such as PyTorch, TensorFlow, and scikit-learn. Models are typically not written in pure ONNX but rather in a framework-specific language such as PyTorch or TensorFlow and then converted to ONNX. We recommend the same approach for bringing your model to Bitfount in ONNX format.

### Converting your model to ONNX

For more information on how to convert your model to ONNX, see the relevant documentation for your framework or the ONNX documentation. PyTorch, for instance, has built-in support for converting models to ONNX [documentation], whereas TensorFlow requires a dedicated library for converting models to ONNX [documentation].

A simple example of converting a PyTorch model to ONNX is shown below:

```python showLineNumbers title="convert_to_onnx.py"
import torch
from torch import nn

class BinaryClassificationModel(nn.Module):
    """Simple binary classification model with 2 input features."""

    def __init__(self, in_features: int = 2) -> None:
        """Initialise the binary classification model.

        Args:
            in_features: Number of input features (default: 2 for A and B).
        """
        super().__init__()
        self.linear = nn.Linear(in_features, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass through the model.

        Args:
            x: Input tensor of shape (batch_size, in_features).

        Returns:
            Output tensor of shape (batch_size, 1) with sigmoid activation.
        """
        x = self.linear(x)
        x = self.sigmoid(x)
        return x

if __name__ == "__main__":
    model = BinaryClassificationModel()
    model.eval()

    # Create dummy input for tracing
    dummy_input = torch.randn(1, 2)  # 2 features

    # Export to ONNX
    model_path = "binary_classification.onnx"
    torch.onnx.export(
        model,
        (dummy_input,),
        str(model_path),
        input_names=["x"],
        output_names=["y"],
        export_params=True,
        opset_version=13,
        dynamic_axes={"x": {0: "batch_size"}, "y": {0: "batch_size"}},
    )

```

### Encapsulating your model

Things work a little differently for ONNX models because ONNX serialises the entire model graph alongside the weights. This means you don't actually need to expose your underlying model code when you upload your model to Bitfount. However, you still need to encapsulate your model with the appropriate `InferrableModelProtocol` or `EvaluableModelProtocol` interface so that Bitfount is able to interact with it.

:::tip

Bitfount allows models to be uploaded either _publicly_ or _privately_. However, even a private model will display the underlying model code when access has been granted to collaborators. If you want to keep your model code private, we recommend converting your model to an ONNX model.

:::

To make things easier, we have provided a base class for ONNX inference models: `ONNXModel`. This base class implements much of the boilerplate required to get started. See the API documentation for more details on the `ONNXModel` class.

Using this class could make your ONNX model code look as simple as:

```python showLineNumbers title="binary_classification_onnx_model.py"
from bitfount.backends.onnx.models import ONNXModel

class BinaryClassificationModel(ONNXModel):
    """Binary classification model using ONNX."""

```

The code is not missing: for most use cases there is simply no need to implement or override anything unless the input data or the output values require special handling. The `ONNXModel` class implements all of the required functionality, and anything specific to your model architecture has been serialised alongside the model weights.

For more information on how to upload the model code and `.onnx` file to Bitfount, see the Uploading your model page. The `.onnx` file is the file that contains the model graph alongside the model weights.


## for-data-scientists/models-and-datasets/bringing-your-models-to-bitfount/for-training-or-fine-tuning.md

# Training and Fine-tuning

If you don't already have a model that you can use for inference or evaluation tasks, you can train a new model on Bitfount. Training typically refers to updating a model's weights from scratch on a dataset, i.e. starting with a randomly initialised model, whereas fine-tuning refers to taking a pre-trained model and updating its weights only slightly to suit your specific task or dataset. The process itself is the same in both cases, and you will ultimately end up with a model that you can use for inference or evaluation tasks.

The interface required for training or fine-tuning models is naturally more complex than the interface required for inference or evaluation tasks and is currently only supported for PyTorch Lightning models.

## Required interface

In order to train a model on Bitfount, you need to extend the `PyTorchBitfountModelv2` class. Details on this can be found in the documentation in the API Reference.

The `PyTorchBitfountModelv2` class uses the PyTorch Lightning library to provide high-level implementation options for a model in the PyTorch framework. This means you only have to implement the methods that dictate how model training should be performed.

In addition to subclassing the `PyTorchBitfountModelv2` class, you will need to implement the following methods:

- `__init__()`: how to set up the model
- `configure_optimizers()`: how optimizers should be configured in the model
- `create_model()`: how to create the model
- `forward()`: how to perform a forward pass in the model
- `_training_step()`: what one training step in the model looks like
- `_validation_step()`: what one validation step in the model looks like
- `_test_step()`: what one test step in the model looks like

### Classification models

Classification models are a very common type of model and are used to classify data into one of a number of classes. For this reason, we have provided some utilities to help you implement a classification model. These are:

- `PyTorchClassifierMixIn`: a mixin that provides helper methods and attributes for a classification model
- `get_torchvision_classification_model`: a function that creates a pre-trained classification model from the torchvision library

#### PyTorchClassifierMixIn

The `PyTorchClassifierMixIn` class requires the `multilabel` argument to be provided signifying whether a given record can belong to multiple classes. In exchange, it sets the `n_classes` attribute automatically based on the number of classes in the specified target column of the dataset and also provides a `do_output_activation` method that can be used to apply the appropriate activation function to the model's output (sigmoid or softmax) based on the number of classes and whether the problem is a multi-label problem. You may find many examples using this mixin class (such as the example below) but it is not required for your model to use it. **If you do use this mixin class, make sure to specify the mixin class _first_ in the model's inheritance hierarchy:**

```python
class MyClassificationModel(PyTorchClassifierMixIn, PyTorchBitfountModelv2):
    ...
```

#### get_torchvision_classification_model

The `get_torchvision_classification_model` function is a helper function that creates a pre-trained classification model from the `torchvision` library. It takes the following arguments:

- `model_name`: the name of the model to create. This can be any model supported by the torchvision library.
- `pretrained`: whether to return a pre-trained model (typically trained on ImageNet) or a randomly initialised model
- `num_classes`: the number of classes in the model which determines the output size of the model

It can be used directly in your model's `create_model` method to return a pre-trained classification model to be used for fine-tuning.

```python showLineNumbers
from bitfount.backends.pytorch.models.nn import get_torchvision_classification_model

class MyClassificationModel(PyTorchClassifierMixIn, PyTorchBitfountModelv2):
    ...

    def create_model(self) -> nn.Module:
        """Creates the model to use."""
        model = get_torchvision_classification_model(
            model_name="resnet18", pretrained=True, num_classes=self.n_classes
        )
        return model
```

## Full example

This example shows a simple logistic regression model that can be used for binary or multi-class classification tasks.

```python showLineNumbers title="logistic_regression_model.py"
from __future__ import annotations

from typing import Any

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchmetrics.functional import accuracy

from bitfount.backends.pytorch import PyTorchBitfountModelv2
from bitfount.backends.pytorch.models.base_models import (
    _TEST_STEP_OUTPUT,
    _TRAIN_STEP_OUTPUT,
    PyTorchClassifierMixIn,
    _OptimizerType,
)
from bitfount.types import _StrAnyDict

class LogisticRegressionModel(PyTorchClassifierMixIn, PyTorchBitfountModelv2):
    """A Logistic/Softmax Regression model built using PyTorch Lightning.

    This implements a single linear layer which acts as a Logistic Regression
    (for binary) or Softmax Regression (for multi-class) classifier.
    """

    def __init__(
        self, learning_rate: float = 0.0001, weight_decay: float = 0.0, **kwargs: Any
    ) -> None:
        """Initializes the LogisticRegressionModel.

        Args:
            learning_rate: The step size for the optimizer. Controls how much to
                change the model in response to the estimated error each time the
                model weights are updated.
            weight_decay: L2 regularization penalty. Adds a term to the loss function
                proportional to the sum of the squared weights, preventing the model
                from becoming too complex (overfitting).
            **kwargs: Additional arguments passed to the base PyTorchBitfountModelv2.
                This includes 'steps' (training iterations per round) or 'epochs'.
        """
        super().__init__(**kwargs)
        self.learning_rate = learning_rate
        self.weight_decay = weight_decay

    def create_model(self) -> nn.Module:
        """Creates the model architecture.

        Logistic Regression is essentially a single Linear layer mapping
        input features to class logits. The activation (Sigmoid/Softmax) is
        handled by the loss function (CrossEntropyLoss) during training.
        """
        # Single linear layer mapping input features -> output classes
        return nn.Linear(self.datastructure.input_size, self.n_classes)

    def forward(self, x: Any) -> Any:
        """Defines the operations we want to use for prediction."""
        x, sup = x
        assert self._model is not None
        # Pass through the linear layer
        x = self._model(x.float())
        return x

    def _training_step(self, batch: Any, batch_idx: int) -> _TRAIN_STEP_OUTPUT:
        """Computes and returns the training loss for a batch of data."""
        if self.skip_training_batch(batch_idx):
            return None  # type: ignore[return-value] # reason: Allow None to skip a batch. # noqa: E501
        x, y = batch
        y_hat = self(x)
        # CrossEntropyLoss in PyTorch combines LogSoftmax and NLLLoss.
        # We squeeze y to ensure it is 1D (N,) as expected by CrossEntropyLoss for
        # class indices.
        loss = F.cross_entropy(y_hat, y.squeeze())
        return loss

    def _validation_step(self, batch: Any, batch_idx: int) -> _StrAnyDict:
        """Operates on a single batch of data from the validation set."""
        x, y = batch
        preds = self(x)
        # Ensure y is squeezed for loss calculation
        loss = F.cross_entropy(preds, y.squeeze())

        # Apply softmax to get probabilities for accuracy calculation
        preds_prob = F.softmax(preds, dim=1)

        acc = accuracy(
            preds_prob, y.squeeze(), task="multiclass", num_classes=self.n_classes
        )

        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", acc, prog_bar=True)
        return {
            "val_loss": loss,
            "val_acc": acc,
        }

    def _test_step(self, batch: Any, batch_idx: int) -> _TEST_STEP_OUTPUT:
        """Operates on a single batch of data from the test set."""
        x, y = batch
        preds = self(x)
        preds = F.softmax(preds, dim=1)

        return {"predictions": preds, "targets": y}

    def configure_optimizers(self) -> _OptimizerType:
        """Configure the optimizer."""
        # Using AdamW optimizer with L2 regularization via weight_decay.
        optimizer = torch.optim.AdamW(
            self.parameters(), lr=self.learning_rate, weight_decay=self.weight_decay
        )
        return optimizer
```

## Tutorials

For more complex models, we have two tutorials that walk you through the process of training a model on Bitfount:

- Training a Custom Model:
  This tutorial walks you through the process of training a tabular classification model on CSV data
- Training a Custom Segmentation Model:
  This tutorial walks you through the process of training a segmentation model on an image dataset


## for-data-scientists/models-and-datasets/bringing-your-models-to-bitfount/testing-your-model.md

# Testing your model

Once you have migrated your model to the format required by Bitfount, you should test it to ensure it is working as expected before uploading it to the Bitfount Hub. This is particularly important for training or fine-tuning models as they are more complex than models that are only used for inference or evaluation.

The best way to validate your model is to run it against a local datasource. This bypasses the need to create a task in Bitfount, which can be cumbersome for rapid iteration.

:::tip

Validate your model before sharing it with collaborators or publishing it to a project. Lightweight checks catch most issues early and save time when running tasks on remote datasets.

:::

## Local validation

The first step to validating your model is to create a local datasource. Recall that a datasource is the object that represents the type of data being connected to a Pod and encapsulates the specific logic required for loading and processing that kind of data. A datasource connected to a Pod is called a **dataset** but in this case, we don't want to connect to a Pod just yet - we only want to use the datasource locally for testing purposes.

### Datasource

Start by choosing the appropriate datasource for your data. Bitfount supports a variety of datasources, each with its own unique features and capabilities as documented in the Connecting datasets guide. See the API reference for your chosen datasource for a list of the required and optional arguments. Most datasources take a `path` argument pointing to the location of the data.

```python
from bitfount.data import CSVSource

datasource = CSVSource(path="path/to/your/data.csv")
```

:::tip
All datasources can be imported from the `bitfount.data` namespace rather than having to import from the specific datasource module.
:::

Datasources don't make any changes to the data itself as far as transformations and pre-processing are concerned. They are simply an _iterable_ wrapper around the data which yields data in the form of a `pandas` DataFrame. Regardless of the type of data, the datasource's internal representation of the data is always a `pandas` DataFrame. Datasources have two main methods:

- `yield_data()`: Returns an iterator that yields batches of data as specified by the `partition_size` argument.
- `get_data()`: Returns a single batch of data as specified by the `data_keys` argument.

:::info

One of the core principles of how data is handled in Bitfount is that the data is never loaded into memory unless it is absolutely necessary. This is why datasources are designed to be _iterable_ - they are not designed to be loaded into memory all at once.

:::

Under the hood in Bitfount, `yield_data` is the method that is typically used to feed data to an algorithm. `get_data` is only used in certain cases with a small selection of `data_keys`. It is not advised to use `get_data` to return the entire dataset as it will be very memory intensive and may well crash the system if the dataset is too large.

```python
for batch in datasource.yield_data(partition_size=32):
    print(batch)
```

#### Image Datasources

Many users work with imaging datasets (medical or otherwise) so it is important to understand how images are handled under the hood within Bitfount datasources. When connecting a directory of image files, each file corresponds to a single row in the internal `pandas` DataFrame. The DataFrame will have a column for the raw image data as a numpy array, which is always called `Pixel Data`. It will also have a number of columns for the metadata associated with the image. For medical images, these columns can number in the hundreds and correspond to DICOM tags (or equivalent for other imaging formats). For files that contain multiple images, for instance slices of a volumetric image, the DataFrame will have a column for each slice. In this case, they are numbered sequentially starting from 0, e.g. `Pixel Data 0`, `Pixel Data 1`, etc.

Images are not loaded into memory unless absolutely necessary. To aid in this, the datasource caches the underlying dataframe in the file system _with the exception_ of the `Pixel Data` columns, which are replaced by placeholders. When calling `yield_data` or `get_data`, the datasource automatically loads and returns this cached dataframe, which does not contain the raw image data. If you need to access the raw image data, you can do so by passing the `use_cache=False` argument to the `yield_data` or `get_data` methods.

```python
from bitfount.data import DICOMSource

datasource = DICOMSource(path="path/to/your/data")
for batch in datasource.yield_data(partition_size=32, use_cache=False):
    print(batch["Pixel Data"].shape)
```

### Schema

A schema is a serialisable representation of the data in a datasource. Schemas are automatically generated for each dataset when it is connected to a Pod and displayed on the hub. They contain information about the columns in the dataframe, the data types of the columns, the semantic types of the columns, and optional descriptions of the columns. Models require a schema to be provided when they are instantiated which must match the schema of the dataset that will be fed to the model for training, evaluation or inference.

A partial or full schema can be generated for a datasource using the `BitfountSchema` class. A partial schema is generated by default when a datasource is connected to a Pod based on the first batch of data. A full schema generation process is triggered in the background once the partial schema is generated. The full schema generation can take some time to complete depending on the size of the dataset. If your dataset is quite homogeneous, the full schema generation may not be necessary.

```python
from bitfount.data import BitfountSchema

schema = BitfountSchema(name="your-dataset-name")
# Generate a partial schema
schema.generate_partial_schema(datasource=datasource)
# Or generate a full schema
schema.generate_full_schema(datasource=datasource)
```

The schema can then be serialised and visualised using methods such as `dumps` and `to_json`.

```python
print(schema.dumps())
```

The data types of the columns in the schema are inferred from the data in the datasource and are out of your control. The semantic types of the columns are also inferred from the data types but these often require knowledge of the data or domain to be accurate. These can therefore be overridden by passing the `force_stypes` argument to the `generate_full_schema` method. The available semantic types are:

- `categorical`: For columns where the values (strings or integers) are categorical in nature. For instance, if the column contains different integer values, Bitfount will interpret this as a continuous column by default so it must be overridden to `categorical` if the integers represent different categories.
- `continuous`: This is the default semantic type for all numerical columns unless overridden.
- `image`: For image data columns, such as `Pixel Data`. For CSV datasources where a column contains the path to an image file, the semantic type must be overridden to `image` if the images are to be treated as such.
- `text`: For text data. By default, all string columns are treated as text columns unless overridden to a different semantic type such as `categorical` or `image`.
- `image_prefix`: A utility semantic type where there are multiple image columns with a common prefix to avoid having to specify each column name individually.

:::tip
When connecting a dataset using the App or Hub, you can override the semantic types of the columns by editing the schema in the UI after the dataset has been connected.
:::

Certain columns may also be ignored from the schema generation process by passing the `ignore_cols` argument to the `generate_full_schema` method.

```python
schema.generate_full_schema(
    datasource=datasource,
    force_stypes={"image": ["Pixel Data"], "categorical": ["Patient's Sex"]},
    ignore_cols=["Patient ID", "Study ID", "Series ID"]
)
```

### DataStructure

We were introduced to the `DataStructure` in YAML format in the Writing tasks section. It is a core task component that defines the structure of the data that will be fed to the model. Where the schema of the datasource reflects the structure of that data, the DataStructure defines the necessary modifications to that structure in order to feed the data to the model in the way that the model expects.

Typically, the most important parts of the data structure are specifying which columns to include or exclude from the data, which columns are images and which column(s) to map to the target variable if you are doing training or fine-tuning. If your model is only used for inference, you don't need to specify a target column. A typical data structure might look like this:

```python
from bitfount.data import DataStructure

data_structure = DataStructure(
    selected_cols=["Pixel Data", "Target", "Patient's Sex", "Age"],
    image_cols=["Pixel Data"],
    target=["Target"],
)
```

For a full list of the available arguments, see the API reference.

If the data contains image columns, some basic batch transformations are also applied by default to the image columns when the data is fed to the model. These transformations are:

- `Resize`: resize the image to 224x224 pixels
- `Normalize`: normalize the image to ImageNet statistics
- `ToTensorV2`: convert the image to a PyTorch tensor

Albumentations is the library of choice for applying these transformations. Learn more about how to use Albumentations to customise the transformations in the Transformations section.
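Conceptually, the default pipeline amounts to the following (a NumPy sketch of the three operations for illustration only; Bitfount applies the real Albumentations transforms, and the nearest-neighbour resize here is a simplification):

```python
import numpy as np

# ImageNet channel statistics used by the default Normalize step.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def default_image_transform(img: np.ndarray) -> np.ndarray:
    """Sketch of the default transforms: resize to 224x224 (nearest-neighbour
    here for simplicity; Albumentations interpolates), normalize to ImageNet
    statistics, and reorder HWC -> CHW as ToTensorV2 does."""
    h, w, _ = img.shape
    rows = np.arange(224) * h // 224
    cols = np.arange(224) * w // 224
    resized = img[rows][:, cols]              # (224, 224, 3)
    normalized = (resized - IMAGENET_MEAN) / IMAGENET_STD
    return normalized.transpose(2, 0, 1)      # (3, 224, 224)

out = default_image_transform(np.random.rand(300, 400, 3))
print(out.shape)  # (3, 224, 224)
```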

### Feeding the data to the model

Once you have created the datasource, schema and data structure, you can instantiate and initialise your model and feed the data to the model. The model needs to be instantiated with the DataStructure and schema objects that were created earlier. After this, the model must be initialised by calling the `initialise_model` method. This method creates the model under the hood by calling the `create_model` method and saving the model to the `self._model` attribute. It also creates the PyTorch data loaders from the datasource which will be used to feed the data to the model. You can learn more about the dataloaders in the DataLoaders section.

To feed the data to the model, you then need to call either the `fit`, `predict` or `evaluate` methods on the model. The `fit` method is used for training and fine-tuning the model and the `predict` method is used for inference. The `evaluate` method is used for evaluating the model on a dataset.

For training, this might look like:

```python
model = MyModel(datastructure=data_structure, schema=schema, epochs=10, batch_size=32)
model.initialise_model(datasource)
results = model.fit(datasource)
```

Whereas for inference, it might look like:

```python
model = MyModel(datastructure=data_structure, schema=schema)
model.initialise_model(datasource)
predictions = model.predict(datasource)
```

Calling `fit()` or `predict()` on a model will automatically feed the data to the model and return the results.

:::tip
You can find an end-to-end example of how to validate your model locally in the Tutorials section.
:::


## for-data-scientists/models-and-datasets/bringing-your-models-to-bitfount/uploading-your-model.md

# Uploading your model

Once you have validated your model locally, you are ready to make it available to tasks, projects and datasets on Bitfount. Uploading registers your code, weights and metadata, all of which are versioned and can be updated as needed.

## What gets packaged

- **Model code** that implements the relevant Bitfount interfaces (inference, evaluation, training, or fine-tuning).
- **Weights or checkpoints**: for example, `.pth` or `.onnx` files
- **Metadata**: display name, description, version, visibility (public or private), and any licensing notes

## Upload options

- **Bitfount Hub or Desktop**: create a model entry, attach your code archive and weight files, and set visibility. This is often the fastest route when collaborating.
- **SDK**: script the registration step so you can integrate uploads into CI pipelines. Supply storage locations for code and weights plus any parameters your task runners need.

### Hub or Desktop App

This is the most straightforward way to upload your model to Bitfount. To upload a model, you need to have a Bitfount account and be logged into either the Bitfount App or Hub.

1. Navigate to the "Models" tab in the left sidebar.
2. Click the "Upload model" button in the top right corner.
3. Either paste or upload the model code and weight files you want to upload along with the name, description (including any licensing notes you want to add) and the visibility (public or private).
4. Click the "Upload" button.

  Create model page

Take care when choosing the visibility of your model. If you choose public, your model won't be discoverable by other users, but it will be usable by all Bitfount users in their tasks and projects. If you choose private, your model will only be visible and accessible to you and your collaborators.

### SDK

To upload a model using the SDK, you need to create a `BitfountModelReference` object. This object contains the model code, weights and metadata and can be used to reference the model in tasks and projects. Point the `model_ref` argument to the path of the model code file you want to upload.

:::caution[Important]
Ensure the name of the model code file matches the name of your model class and the name of the model on the Hub.
:::

```python
from pathlib import Path
from bitfount import BitfountModelReference

reference_model = BitfountModelReference(
    model_ref=Path("MyModel.py"),
    datastructure=datastructure,
    schema=schema,
    hyperparameters={"epochs": 2},  # Epochs or steps need to be provided even for inference models
    private=True,
)
```

## Limitations

- The model code must only contain one model class definition at the top level of the file i.e. the class that implements the relevant Bitfount interfaces (inference, evaluation, training/fine-tuning). If your model makes use of multiple model classes, you will need to nest them within the top level class
- The model code can't be more than 3MB in size
- The model weights can't be more than 500MB in size. Please contact support@bitfount.com if you need to upload a larger model.
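As a pre-flight check before uploading, you can verify your files against these limits; `within_upload_limits` is a hypothetical helper (not part of the Bitfount SDK), and it assumes the limits are plain megabytes:

```python
from pathlib import Path

# Documented upload limits, assumed to be plain megabytes.
MAX_CODE_BYTES = 3 * 1000 * 1000       # 3MB model code limit
MAX_WEIGHTS_BYTES = 500 * 1000 * 1000  # 500MB weights limit

def within_upload_limits(code_path: Path, weights_path: Path) -> bool:
    """Return True if both files are within the documented size limits."""
    return (
        code_path.stat().st_size <= MAX_CODE_BYTES
        and weights_path.stat().st_size <= MAX_WEIGHTS_BYTES
    )
```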

## Versioning and updates

- When updating a model, make sure to edit the most recent version to ensure changes from previous versions are not lost
- Consider keeping a changelog of changes to the model in each version in the model description, for example:
  ```markdown
  v1: Initial release
  v2: Changed optimizer to AdamW and added L2 regularization
  v3: Optimised for faster inference
  ```
- Each version of the model has its own associated description which can be edited without triggering a new version

:::tip

After uploading your model for the first time, make sure to validate it by running a small task in the Bitfount App to confirm it still works as expected. You can then share the model with collaborators or create tasks that link it to projects once validated.

:::


## for-data-scientists/models-and-datasets/bringing-your-models-to-bitfount/data-pipeline/data-loaders.md

# Data loaders

Data loaders bridge datasources and your model code. They handle batching, shuffling, device placement, and any collation logic required to convert raw rows into tensors or arrays.

## Responsibilities

- Fetch samples from the datasource and apply the correct preprocessing pipeline.
- Batch inputs, pad variable-length fields if needed, and move tensors to the right device (GPU/CPU).
- Provide deterministic iteration when you set seeds, and configurable shuffling for training.

## Using Bitfount DataLoaders

Bitfount has created wrappers around the standard PyTorch DataLoader class to make it compatible with Bitfount. These are used by default when creating a model and can be returned by calling the `train_dataloader()`, `val_dataloader()` and `test_dataloader()` methods on your model. These dataloaders are used by the `fit()`, `evaluate()` and `predict()` methods on your model respectively.

### Output format

When implementing your model, if you are using the model for training, you will be implementing the `training_step()`, `validation_step()` and `test_step()` methods. These methods already give you a batch of data as input so you don't need to worry about iterating over the dataloader. However, if you are using the model for inference, you will need to iterate over the dataloader to get the batches of data. Other than that, the format of the batch is exactly the same.

```python title="Inference example"
def predict(self, data: Optional[BaseSource] = None, **_: Any) -> PredictReturnType:
    preds = []
    for batch in self.test_dataloader():
        x, y = batch[:2]
        # ... run the model on x and append the outputs to preds
    return PredictReturnType(preds=preds)
```

Due to the various ways in which data can be structured, the format of the batch is dependent on the data structure and schema that were used to create the model. At the top level, a batch is a 2 or 3-element tuple:

```math
(x, y, [data\_key])
```

where `x` contains the input tensors, `y` the target tensors, and, if we are using a file-based datasource, `data_key` is the list of paths to the files that populated the batch, in case we need to link back to them. If we are using a non-file-based datasource, the tuple will only have 2 elements. If there are no target tensors, such as in the case of inference, `y` will still exist but will be an empty tensor. In most cases, we can ignore `data_key` and focus on `x` and `y` as follows if we are doing training or validation:

```python
x, y = batch[:2]
```

Or just `x` if we are doing inference:

```python
x = batch[0]
```

The shape of `x` and `y` will depend on the data and batch size:

- `y` is a tensor of shape `(batch_size, num_targets)` where `num_targets` is the number of target columns in the case of tabular data. In the case of image data for segmentation tasks, `y` will be a 4D tensor of shape `(batch_size, channels, height, width)` (BCHW).
- `x` itself is again a tuple of tensors of the form:

  ```math
  ([tabular], [image], [support])
  ```

  where the image tensor, if there is a single image column, is a 4D tensor in BCHW format and the tabular and support tensors are 2D tensors of shape `(batch_size, num_features)`. If there are multiple image columns, `image` will instead be a list of BCHW tensors. At least one of `tabular` or `image` will always be present. The support columns are deprecated and will be removed in a future version; for now, their presence is dictated by the `ignore_support_cols` argument to the `BitfountDataBunch` class (the class that creates the dataloaders within `initialise_model()`), but you can safely ignore them regardless.

  This means the shape of `x` could also be written as follows:

  ```math
  (tabular, [support])\quad | \quad(image, [support])\quad |\quad (tabular, image, [support])
  ```

  For instance, if you know that there are no tabular columns, an example unpacking of `x` could be:

  ```python
  images, _sup = x
  ```

:::info

Text data is not converted to tensors but rather included in the tabular data _as-is_. You will need to tokenize the text as part of your model's `training_step()`, `validation_step()` and `test_step()` methods.

:::
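
Putting the batch formats above together, a small helper can normalise both layouts. This is an illustrative sketch (the `unpack_batch` name is ours, not part of the Bitfount API):

```python
def unpack_batch(batch):
    """Split a dataloader batch into (x, y, data_keys).

    File-based datasources yield 3-element batches where the final
    element is the list of file paths; other datasources yield
    2-element batches, in which case data_keys is None.
    """
    if len(batch) == 3:
        x, y, data_keys = batch
    else:
        x, y = batch
        data_keys = None
    return x, y, data_keys
```

Inside a `training_step()` you could then call `x, y, _ = unpack_batch(batch)` regardless of the datasource type.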

## Using your own dataloaders

:::danger

Proceed at your own risk. If you are using your own dataloaders, in addition to ensuring that the output format of your dataloader matches the expected input format of the model, you will also need to ensure that the dataloader respects the following:

- the data structure and schema that were used to create the model, i.e. the `selected_cols`, `image_cols` and `target` columns, and any transformations that were applied to the data
- the specified data splits between training, validation and test sets, including shuffling if specified. For the Bitfount dataloaders, we have implemented a reservoir sampling algorithm to ensure that the data is shuffled in a deterministic manner even when the entire dataset cannot be loaded into memory.
- the protocol-level batching logic (i.e. `batched_execution`). This batching sits at a higher level than the model-level batching logic (i.e. `batch_size`). When running in batched execution mode, the protocol-level batching logic will override the available files in the datasource via the `selected_file_names_override` attribute of the datasource. To access only the files available in a given iteration, iterate over `selected_file_names_iter()` rather than `yield_data()` on the datasource.
- whether batches are returned according to `steps` or `epochs`, stopping iteration at the correct point

:::

There is no requirement to use Bitfount DataLoaders. If you want to use your own dataloaders, you will need to create a custom DataLoader class nested inside your model class. Override the `train_dataloader()`, `val_dataloader()` and `test_dataloader()` methods to return your own dataloaders.
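
As a minimal sketch of that shape (all class names here are illustrative stand-ins, and a real implementation must also meet the requirements listed in the warning above):

```python
class _CustomDataLoader:
    """Toy dataloader yielding (x, y) batches from in-memory samples."""

    def __init__(self, samples, batch_size=2):
        self.samples = samples
        self.batch_size = batch_size

    def __iter__(self):
        for i in range(0, len(self.samples), self.batch_size):
            chunk = self.samples[i : i + self.batch_size]
            # Batches must match the (x, y) format the model expects.
            yield tuple(s[0] for s in chunk), tuple(s[1] for s in chunk)


class MyModel:  # stand-in for your Bitfount model class
    def __init__(self, test_samples):
        self._test_samples = test_samples

    def test_dataloader(self):
        return _CustomDataLoader(self._test_samples)
```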


## for-data-scientists/models-and-datasets/bringing-your-models-to-bitfount/data-pipeline/postprocessing.md

# Postprocessing

Postprocessing turns raw model outputs into human-friendly results and task artefacts. This is an optional final step in the data pipeline and is not required for all models.

:::info

Postprocessing is supported by the following algorithms:

- `bitfount.ModelInference`
- `bitfount.HuggingFaceImageClassificationInference`
- `bitfount.HuggingFaceNERInference`
- `bitfount.HuggingFaceTextClassificationInference`

:::

## Available postprocessors

Bitfount provides a suite of built-in postprocessors to handle common output-preparation needs. You can mix and match them, even chaining several together using the `compound` type.

### Built-in postprocessor types

#### General postprocessors

These postprocessors can be used with any of the supported algorithms listed above.

| Name                | Description                                                                        | Example Use Case                                      |
| ------------------- | ---------------------------------------------------------------------------------- | ----------------------------------------------------- |
| `rename`            | Rename DataFrame columns.                                                          | Change "pred" column to "Prediction".                 |
| `transform`         | Apply a transformation function from `bitfount.transformations` on output columns. | Apply softmax or custom transform to logits.          |
| `json_restructure`  | Move fields between levels in nested JSON structures.                              | Move a key from nested JSON upwards for flattening.   |
| `string_to_json`    | Parse columns containing JSON as strings into JSON objects.                        | Safely load prediction results stored as strings.     |
| `json_key_rename`   | Rename keys within JSON fields in columns.                                         | Change "class1" to "cat" inside prediction JSON.      |
| `json_wrap_in_list` | Wrap JSON fields in an additional list.                                            | Ensure all predictions are in a JSON array format.    |
| `compound`          | Chain multiple postprocessors together in sequence.                                | Apply `string_to_json` followed by `json_key_rename`. |
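
As a sketch of how chaining with `compound` might look in YAML (the nested `postprocessors` key and the per-postprocessor arguments shown here are assumptions; consult the API documentation for the exact schema):

```yaml
postprocessors:
  - type: compound
    postprocessors: # assumed argument name for the chained steps
      - type: string_to_json
        columns: ["predictions"]
      - type: json_key_rename
        columns: ["predictions"]
        mapping:
          class1: cat
```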

#### Hugging Face postprocessors

These postprocessors are designed specifically for use with Hugging Face algorithms.

| Name                             | Description                                                                                                 | Example Use Case                                                         |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
| `huggingface_apply_id_to_labels` | Maps model output IDs to human-readable labels using a mapping file from the Hugging Face model repository. | Convert numeric class IDs to descriptive labels for multi-headed models. |
| `ner_deidentification`           | De-identifies text by replacing named entities detected by NER models with placeholder tokens.              | Remove patient names from clinical text after NER inference.             |

### Example configuration in YAML

The postprocessors can be supplied as a list of dictionaries. The only required key is `type`, which refers to the name of the postprocessor to use. The other keys are specific to each postprocessor and are passed to it as keyword arguments.

More information about the available postprocessors and their arguments can be found in the API documentation:

- General postprocessors
- Hugging Face postprocessors
- Base classes and compound postprocessor

#### ModelInference example

```yaml
algorithm:
  - name: bitfount.ModelInference
    model:
      bitfount_model:
        model_ref: MyModel
        model_version: 2
        username: amin-nejad
      hyperparameters:
        batch_size: 8
    arguments:
      postprocessors:
        - type: rename
          columns: ["logits"]
          mapping:
            logits: probabilities
        - type: transform
          columns: ["probabilities"]
          transform: softmax
```

#### Hugging Face image classification example

```yaml
algorithm:
  - name: bitfount.HuggingFaceImageClassificationInference
    arguments:
      model_id: google/vit-base-patch16-224
      postprocessors:
        - type: huggingface_apply_id_to_labels
          model_id: google/vit-base-patch16-224
          filepath: config.json
          key: id2label
```


## for-data-scientists/models-and-datasets/bringing-your-models-to-bitfount/data-pipeline/transformations.md

# Transformations

Batch transformations are handled by `albumentations` and are applied to images by the DataLoaders when the data is fed to the model. The default transformations are:

- `Resize`: Resize the image to 224x224 pixels
- `Normalize`: Normalize the image to ImageNet statistics
- `ToTensorV2`: Convert the image to a PyTorch tensor

These transformations are applied to the image data regardless of the task being performed (training, inference, evaluation). However, the batch transformations can be customised so that different segments of the data (training, validation, test) receive different transformations. For training in particular, it is important to apply augmentation transformations to the training set but not the validation/test sets, to help avoid overfitting.

A full list of the available transformations is available in the Albumentations documentation.

## Custom transformations

Transformations can be specified either in code or in YAML. In code, create a `DataStructure` object and pass the transformations to the `batch_transforms` argument. In the example below, where there is a single image column, different transformations are specified for the training and validation sets.

```python
from bitfount.data import DataStructure

data_structure = DataStructure(
    selected_cols=["image_col", "target"],
    image_cols=["image_col"],
    target=["target"],
    batch_transforms=[
        {
            "albumentations": {
                "step": "train",
                "output": True,
                "arg": "image_col",
                "transformations": [
                    {"Resize": {"height": 224, "width": 224}},
                    {"HorizontalFlip": {"p": 0.5}},
                    "RandomBrightnessContrast",
                    "Normalize",
                    "ToTensorV2",
                ],
            }
        },
        {
            "albumentations": {
                "step": "validation",
                "output": True,
                "arg": "image_col",
                "transformations": [
                    {"Resize": {"height": 224, "width": 224}},
                    "Normalize",
                    "ToTensorV2",
                ],
            }
        },
    ],
)
```

In YAML, you can specify the transformations in the `transform` section of the task YAML file. The below YAML example specifies the exact same transformations as the code example above.

```yaml
task:
  data_structure:
    select:
      include: ["image_col", "target"]
    assign:
      target: target
      image_cols: ["image_col"]
    transform:
      batch:
        - albumentations:
            step: train
            output: true
            arg: image_col
            transformations:
              - { "Resize": { "height": 224, "width": 224 } }
              - { "HorizontalFlip": { "p": 0.5 } }
              - "RandomBrightnessContrast"
              - "Normalize"
              - "ToTensorV2"
        - albumentations:
            step: validation
            output: true
            arg: image_col
            transformations:
              [
                { "Resize": { "height": 224, "width": 224 } },
                "Normalize",
                "ToTensorV2",
              ]
```

### Multiple image columns

In cases where there are multiple image columns that share the same transformations, you can pass the transformations to the `image_prefix_batch_transforms` argument of the `DataStructure` object instead of listing them individually in `batch_transforms`; they will be applied to all image columns with the same prefix. In YAML, specify the transformations in the `transform.image` section of the DataStructure instead of the `transform.batch` section. In both cases, omit the `arg` argument from the transformations.
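
For example, a training-time transform applied to all image columns might look like the following YAML sketch (field names mirror the `transform.batch` example above, with `arg` omitted; check the API documentation for the exact schema):

```yaml
task:
  data_structure:
    transform:
      image:
        - albumentations:
            step: train
            output: true
            transformations:
              - { "Resize": { "height": 224, "width": 224 } }
              - "Normalize"
              - "ToTensorV2"
```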

## Tips for reproducibility

- Keep the same pre- and post-processing for training, validation, and inference
- Apply augmentations only to the training set
- Make transformations deterministic when running inference and evaluation tasks, i.e. avoid transformations that take a probability (`p`) as an argument


## for-data-scientists/running-tasks/running-a-task-using-the-sdk.md

# Running a task using the SDK

Use the Bitfount Python SDK to submit and manage tasks programmatically. This may be useful for more technical users who want to automate tasks or integrate Bitfount into their existing workflows.

## Prerequisites

- Bitfount SDK installed - see SDK Installation
- Access to the dataset(s) and model(s) you intend to use
  - The model must be either public, owned by you, or part of a project's task in which you are an owner or collaborator
  - The dataset must be either owned by you, or part of a project in which you are an owner or collaborator and whose dataset owner has granted permission for other project collaborators to use it. If you own the dataset, it can be connected through the App or through the SDK as described in the Connecting Datasets guide

## Methods

There are two main methods for running a task using the SDK:

- Running a task from the command line pointing to a task YAML file
- Running a task from a Python script

### Running a task from a task file

Running a task from the command line is as easy as running the following command:

```bash
bitfount run_modeller 
```

This will run the task and output the results to the console as well as to a logfile in a subdirectory of the current working directory called `bitfount_logs`.

:::tip

Make sure you have specified the dataset identifiers and the project id correctly in the task YAML file to avoid timeouts or authentication errors.

:::

### Running a task from a Python script

Running a task from a Python script requires creating all the necessary components of the task as Python objects, ending with the protocol object. The protocol object is the entry point for the task: it orchestrates the algorithms and handles communication between the different parties in the task. Calling the `run` method on the protocol object, with the dataset identifiers passed as an argument, will kick off the task.

There are several examples of how to start tasks this way in the Tutorials section.


## for-data-scientists/running-tasks/running-multi-dataset-tasks.md

# Running multi-dataset tasks

:::info

Multi-dataset tasks are a feature of the Bitfount SDK. These tasks cannot be initiated from the Bitfount App, but datasets connected in the App can participate in multi-dataset tasks as long as they are not connected to the same Pod or App instance. If you are connecting datasets:

- **via the App:** each dataset involved in the task must be on a different machine
- **via the SDK:** each dataset involved in the task must be on a different Pod, but they can be on the same machine

:::

Multi-dataset tasks let you send a task to multiple datasets at once. The task will be run on each dataset in parallel, with results sent back to the modeller. Depending on the task, these results may optionally be aggregated as part of the task or returned separately for each dataset. There are two main types of multi-dataset tasks:

- **Federated learning tasks:** these are used to train a model on multiple separate datasets without needing to centralise the data. The model is trained on each dataset in parallel, with regular orchestration by the modeller to average the updated model parameters. Learn more about federated learning in the original paper by McMahan et al. (2016) or our blog post.
- **SQL tasks:** these are used to run a SQL query against multiple datasets at once. Depending on the task configuration, the results may be aggregated into a single result returned to the modeller, or returned separately for each dataset.

## Federated learning

For federated learning tasks, the protocol used is `bitfount.FederatedAveraging` and the algorithm used is `bitfount.FederatedModelTraining`. In the example below, we are training a model on all the data in a tabular dataset with `TARGET` as the target column.

:::info

This is the exact same kind of task as used for fine-tuning. The only difference is that instead of specifying a single dataset, you specify multiple datasets.

:::

```yaml
pods:
  identifiers:
    - 
    - 
    - 

task:
  protocol:
    name: bitfount.FederatedAveraging
    arguments:
      steps_between_parameter_updates: 10
  algorithm:
    name: bitfount.FederatedModelTraining
    model:
      bitfount_model:
        model_ref: 
        model_version: 1
        username: 
      hyperparameters:
        steps: 100
        batch_size: 32
        learning_rate: 0.0001
  aggregator:
    secure: False # Set to True to use secure aggregation
  data_structure:
    schema_requirements: "full"
    assign:
      target: TARGET
```

:::note

If dataset sizes differ significantly, the task may sit idle on many of the machines while others are still running, so for optimal efficiency choose datasets of roughly the same size. If this is not possible, specify the training in `steps` rather than `epochs` to ensure the same amount of training happens on each dataset. If training is specified in `steps`, pass only `steps_between_parameter_updates` to the `FederatedAveraging` protocol; similarly, if training is specified in `epochs`, pass `epochs_between_parameter_updates` instead.

:::

## SQL tasks

For SQL tasks, the protocol used is `bitfount.ResultsOnly` and the algorithm used is `bitfount.SqlQuery`. In the example below, we are running a SQL query against an `ehr-records-2025` dataset belonging to each of three users. Recall from the SQL task documentation that if a SQL task is run against a non-SQL-based dataset (e.g. a `CSVSource` dataset or otherwise), the table name is the dataset identifier without the username, wrapped in backticks (\`\`). Since the same query is run on each dataset, the dataset name must be the same across the three different users.

```yaml
pods:
  identifiers:
    - alice/ehr-records-2025
    - bob/ehr-records-2025
    - charlie/ehr-records-2025

task:
  protocol:
    name: bitfount.ResultsOnly
    arguments:
      save_location: ["Modeller"]
  algorithm:
    name: bitfount.SqlQuery
    arguments:
      query: "SELECT * FROM `ehr-records-2025` LIMIT 10"
  data_structure:
    table_config:
      table: ehr-records-2025
```

In the example above, we are not using an aggregator, so the results from each dataset will be returned separately. If you want to aggregate the results into a single result, you can specify an aggregator in exactly the same way as in the federated learning task. The results will be saved as a CSV on the modeller side.

:::note
If your SQL query runs against a SQL-based dataset (i.e. an `OMOPSource` dataset), your query can operate on datasets of different names without issue.
:::
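
As a sketch, aggregating the SQL results only requires adding the same aggregator block as in the federated learning example above:

```yaml
task:
  # protocol, algorithm and data_structure as in the SQL example above, plus:
  aggregator:
    secure: False # Set to True to use secure aggregation
```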


## for-data-scientists/task-catalogue/evaluation.md

# Model Evaluation

A model evaluation task simply runs a trained model on a dataset much like the inference task, but instead of returning the results, it returns a set of metrics about the model's performance on that dataset.

## Metrics

The metrics returned are dictated by the type of model detected. The algorithm looks for the presence of `ClassifierMixIn`, `RegressorMixIn` or `SegmentationMixIn` in the model's inheritance hierarchy and determines the type of metrics to return accordingly. The `RegressorMixIn` and `SegmentationMixIn` mixins are currently only used for tagging purposes and have no configuration options, whereas the `ClassifierMixIn` has logic for determining the type of classification problem, which in turn determines the metrics returned.

| Model Type                | Metrics                                                                                                       |
| ------------------------- | ------------------------------------------------------------------------------------------------------------- |
| Binary Classification     | `Accuracy`, `Precision`, `Recall`, `F1 Score`, `ROC AUC`, `Brier Loss`                                        |
| Multiclass Classification | `Accuracy`, `Precision`, `Recall`, `F1 Score`, `ROC AUC`                                                      |
| Multilabel Classification | `Accuracy`, `Precision`, `Recall`, `F1 Score`, `ROC AUC`                                                      |
| Regression                | `Mean Absolute Error`, `Mean Squared Error`, `R2 Score`, `Root Mean Squared Error`, `Kolmogorov-Smirnov Test` |
| Segmentation              | `IoU`, `Dice Coefficients`, `Dice Score`                                                                      |

:::tip
Mixin classes must be specified _first_ in the model's inheritance hierarchy.
:::
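
To illustrate why the ordering matters, here is a toy example with stand-in classes (the real mixins live in the `bitfount` package): listing the mixin first means its attributes take precedence in the method resolution order, while the inheritance checks used for metric selection still succeed.

```python
class ClassifierMixIn:  # stand-in, not the real Bitfount mixin
    model_type = "classifier"


class BaseModel:  # stand-in for the model base class
    model_type = "base"


# Mixin listed first in the inheritance hierarchy, as required.
class MyClassifier(ClassifierMixIn, BaseModel):
    pass
```

With the order reversed, `MyClassifier.model_type` would resolve to `"base"` instead.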

## Results

As the name implies, the `bitfount.ResultsOnly` protocol simply returns the results from the model evaluation task. The results are returned as a dictionary mapping metric names (strings) to metric values (floats). By default, the results are not persisted anywhere. If you are running the protocol via the SDK, this behaviour may be fine because the results are returned to a variable which you can access. However, if the protocol is run as part of a task in the app, the results are lost unless you specify a save location by setting the `save_location` argument on the `bitfount.ResultsOnly` protocol. The available save locations are:

- `Worker`: Save the results to the worker side.
- `Modeller`: Save the results to the modeller side.

Both locations can be specified to save the results on both the worker and modeller sides.

## Example

An example task file for using a model evaluation task is shown below:

```yaml
pods:
  identifiers:
    - 

modeller:
  identity_verification_method: key-based

task:
  protocol:
    name: bitfount.ResultsOnly
    arguments:
      save_location:
        - Worker
        - Modeller
  algorithm:
    - name: bitfount.ModelEvaluation
      arguments:
        model:
          bitfount_model:
            username: amin-nejad
            model_ref: HeartDiseaseModel
            model_version: 3
  data_structure:
    select:
      include:
        - Age
        - Gender
        - Chest_Pain_Type
        - Resting_Blood_Pressure
        - Cholesterol
        - Fasting_Blood_Sugar
        - Resting_ECG
        - Max_Heart_Rate
        - Exercise_Induced_Angina
        - ST_Depression
        - ST_Slope
        - Number_of_Major_Vessels
        - Thalassemia
    assign:
      target: Heart_Disease
    data_split:
      args:
        shuffle: true
        test_percentage: 0
        validation_percentage: 100 # 100% of the data is used for the evaluation task
      data_splitter: percentage
```


## for-data-scientists/task-catalogue/fine-tuning.md

# Model Fine-tuning

Model fine-tuning tasks are supported for both Bitfount and TIMM models.

## Bitfount models

The protocol used is `bitfount.FederatedAveraging` and the algorithm used is `bitfount.FederatedModelTraining`. This combination also supports federated learning tasks where the model is trained on multiple datasets in a federated manner. In this case, we are only using a single dataset.

:::tip

For more information on running federated learning tasks, please refer to the documentation here.

:::

An example task file for using a Bitfount-hosted model for fine-tuning is shown below. In this case, the model is a simple binary classification model. The features are not specified, meaning that all columns in the dataset will be used for training.

```yaml
modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

batched_execution: false
test_run: false
run_on_new_data_only: false

task:
  protocol:
    name: bitfount.FederatedAveraging
    arguments:
      epochs_between_parameter_updates: 10 # No need to share the model weights until the end of the training
  algorithm:
    - name: bitfount.FederatedModelTraining
      arguments:
        modeller_checkpointing: true # Whether to save the last checkpoint on the modeller side
        checkpoint_filename: best_checkpoint.pt
      model:
        bitfount_model:
          model_ref: MyBinaryClassificationModel
          model_version: 1
          username: bitfount
        hyperparameters:
          epochs: 10
          batch_size: "{{ batch_size }}"
          learning_rate: "{{ learning_rate }}"
          weight_decay: "{{ weight_decay }}"
  aggregator:
    secure: False
  data_structure:
    schema_requirements: partial
    assign:
      target:
        - "{{ target_column_name }}"

template:
  batch_size:
    label: "Batch size"
    tooltip: "Number of samples per batch during training."
    type: "number"
    default: 8
  learning_rate:
    label: "Learning rate"
    tooltip: "Learning rate for the model optimizer."
    type: "number"
    default: 0.0001
  weight_decay:
    label: "Weight decay"
    tooltip: "Weight decay (L2 regularization) for the model optimizer."
    type: "number"
    default: 0.01
  target_column_name:
    label: "Target column"
    tooltip: "The column containing dataset labels."
    type:
      schema_column_name:
        semantic_type: "categorical"
```

## TIMM models

A good example of a TIMM model is RETFound (Retina foundation), a multiclass image classification model. The example task file below shows how to use it in a multiclass image classification task. This algorithm is only compatible with the `bitfount.ResultsOnly` protocol, which simply runs the task and returns the results (if any) to the modeller. The new model parameters are _not_ part of the results returned by the algorithm: they are only saved on the Pod side, i.e. where the dataset is located, so the modeller may only receive metrics about the training process.

:::tip

Take a look at the RETFound demo project to easily run this task in the Bitfount app.

:::

```yaml
modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

task:
  protocol:
    name: bitfount.ResultsOnly
  algorithm:
    - arguments:
        model_id: bitfount/RETFound_MAE
        labels:
          - "0"
          - "1"
          - "2"
          - "3"
          - "4"
        args:
          epochs: 1
          batch_size: 32
          num_classes: 5
      name: bitfount.TIMMFineTuning
  data_structure:
    table_config:
      table: 
    select:
      include:
        - Image name
        - Retinopathy grade
    assign:
      target:
        - Retinopathy grade
```


## for-data-scientists/task-catalogue/inference.md

# Model Inference

A selection of example task files for using various models in inference tasks is shown below, all using the `bitfount.InferenceAndCSVReport` protocol. This is the most appropriate protocol for most inference tasks, as it runs the inference and then writes a CSV report of the results on the Pod side. Alternatively, you may use the `bitfount.InferenceAndReturnCSVReport` protocol, which instead sends the CSV results back to the modeller. This is useful if the dataset is remote from the task initiator, i.e. on a different machine.

:::info[Reminder]
The **Pod** is the entity that contains the datasets to run the task on. The **modeller** is the entity that initiates the task.
:::

## Bitfount-hosted models

An example task file for using a Bitfount-hosted model for inference is shown below. In this case, the model is a binary classification model for predicting chronic kidney disease based on a set of features. The features are not templated in this example, meaning that only datasets with all of those exact column names would be deemed compatible with the task.

```yaml
modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

batched_execution: true

task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.ModelInference
      model:
        bitfount_model:
          username: amin-nejad
          model_ref: ChronicKidneyDiseaseModel
          model_version: 1
    - name: bitfount.CSVReportAlgorithm
  data_structure:
    select:
      include:
        - Age
        - Gender
        - Creatinine (mg/dL)
        - Albumin (g/dL)
        - HbA1c (%)
        - Glucose (mg/dL)
        - Triglycerides (mg/dL)
```

## Hugging Face models

An example task file for using a Hugging Face text classification model for sentiment analysis is shown below:

```yaml
modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

batched_execution: true

task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.HuggingFaceTextClassificationInference
      arguments:
        top_k: 3
        model_id: finiteautomata/bertweet-base-sentiment-analysis
        target_column_name: "{{ target_column_name }}"
    - name: bitfount.CSVReportAlgorithm

template:
  target_column_name:
    label: "Target column"
    type:
      schema_column_name:
        semantic_type: text
```

## TIMM models

The below example task file shows how to use a TIMM model in a multiclass image classification task using the RETFound (Retina foundation) model:

```yaml
modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

batched_execution: True

task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.TIMMInference
      arguments:
        model_id: "bitfount/RETFound_MAE_OCT_CNV_DME_DRU"
        class_outputs:
          - CNV (%)
          - DME (%)
          - DRUSEN (%)
          - NORMAL (%)
    - name: bitfount.CSVReportAlgorithm
      arguments:
        original_cols:
          - _original_filename
          - Filename
  data_structure:
    schema_requirements: "partial"
    select:
      include:
        - "{{ image_column_name }}"

template:
  image_column_name:
    label: "Image column"
    tooltip: "The dataset column that contains image data for predictions"
    default: "Pixel Data 0"
    type:
      schema_column_name:
        semantic_type: "image"
```

## MONAI models

The below example task file shows how to use a UNEST MONAI bundle for whole brain segmentation on NIfTI neuroimaging data:

```yaml
modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

batched_execution: true
test_run: false
force_rerun_failed_files: true

task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.MONAIBundleInference
      arguments:
        bundle_name: "wholeBrainSeg_Large_UNEST_segmentation"
        batch_size: 1
        num_workers: 1
        dataframe_output: true
        # nifti_output: true  # Enable to save NIfTI files to same directory as CSV
    - name: bitfount.CSVReportAlgorithm
  data_structure:
    compatible_datasources:
      - NIFTISource
    select:
      include:
        - "Pixel Data"
```


## for-data-scientists/task-catalogue/sql.md

# SQL

A SQL task lets you run SQL queries on a dataset and optionally return the results. It is a powerful tool for data analysis and manipulation. The task below saves the results of the SQL query to both the modeller and the Pod side.

:::info

If running your SQL query against a non-SQL-based dataset (e.g. a `CSVSource` dataset or otherwise), the table name will be the dataset identifier without the username, wrapped in backticks (\`\`). Please ensure your SQL query operates on that table so that it is correctly parsed, e.g. ``SELECT MAX(G) AS MAX_OF_G FROM `my-dataset-identifier` ``.

If running a SQL task against a SQL-based dataset (i.e. an `OMOPSource` dataset), you can write your query as normal.

:::
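
The table-name derivation described above can be sketched in Python. This is a hypothetical helper for illustration only, not part of the Bitfount SDK:

```python
def table_name_for(dataset_identifier):
    """Derive the backtick-quoted table name for a non-SQL-based dataset.

    Hypothetical helper: the table name is the dataset identifier with
    any leading "username/" namespace removed, wrapped in backticks.
    """
    # rpartition returns ("", "", name) when there is no "/" present.
    _, _, name = dataset_identifier.rpartition("/")
    return f"`{name}`"


# Building the example query from the note above:
query = f"SELECT MAX(G) AS MAX_OF_G FROM {table_name_for('alice/my-dataset-identifier')}"
print(query)
# SELECT MAX(G) AS MAX_OF_G FROM `my-dataset-identifier`
```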

## Example

```yaml
modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

batched_execution: false
test_run: false
run_on_new_data_only: false

task:
  protocol:
    name: bitfount.ResultsOnly
    arguments:
      save_location: "{{ save_location }}"
  algorithm:
    - name: bitfount.SqlQuery
      arguments:
        query: "{{ query }}"
  data_structure:
    # Schema is not required for this task since we are returning all columns regardless
    schema_requirements: empty
    compatible_datasources:
      - CSVSource

template:
  query:
    type: string
    default: "SELECT * FROM `table` LIMIT 100"
    label: Query
    tooltip: The SQL query to execute.
  save_location:
    label: "Save Location"
    tooltip: "Specify where to save the results."
    type: "array"
    items:
      type: "string"
    minItems: 1
    default:
      - Modeller
      - Worker
```


## for-data-scientists/writing-tasks/referencing-a-model.md

# Referencing a model

Bitfount currently supports referencing models from the following providers:

- Bitfount Hub
- Hugging Face
  - Model Hub
  - TIMM (Pytorch image models)
- MONAI (Medical Open Network for AI)
  - MONAI Model Zoo

## Bitfount-hosted models

Models hosted in Bitfount are referenced inside an algorithm that requires a model (such as `bitfount.ModelInference`) via the `model.bitfount_model` block:

- **model_ref**: the identifier of the model in the Bitfount Hub (excluding the username)
- **model_version**: integer version to pin; if omitted, the latest version will be used. Pinning the version is recommended to avoid unexpected changes to the model as well as access issues.
- **username**: model owner/namespace

Hyperparameters for the model can be set separately within the `model` block:

- **hyperparameters**: arguments to pass to the model constructor. For instance, batch size is a commonly set hyperparameter. The accepted hyperparameters differ between models; if you have access to view the code, you can check the model's constructor arguments to determine them.

```yaml
task:
  algorithm:
    - name: bitfount.ModelInference
      model:
        bitfount_model:
          model_ref: CatDogImageClassifier
          model_version: 3
          username: research-user
        hyperparameters:
          batch_size: 8
```

:::tip
Bitfount models can be used for inference, evaluation and fine-tuning tasks and can be toggled between public and private.
Models from the Hugging Face model hub however _must_ be made public in order to be used in Bitfount.
More information about uploading your models to the Bitfount Hub can be found here.
:::

## Hugging Face models

Hugging Face models are invoked via dedicated Bitfount algorithms that are specific to the model type. You pass a `model_id` (e.g., `google/vit-base-patch16-224`) in the algorithm `arguments`.
Do not use the `model` block for Hugging Face models. Confusingly, Hugging Face refers to these model types as _tasks_.

### Available task types

The _task_ types defined by Hugging Face and supported by Bitfount can be found below with their corresponding Bitfount algorithm:

- **Image classification**: `bitfount.HuggingFaceImageClassificationInference`
- **Image segmentation**: `bitfount.HuggingFaceImageSegmentationInference`
- **Image-text-to-text**: `bitfount.HuggingFaceImageTextGenerationInference`
- **Text classification**: `bitfount.HuggingFaceTextClassificationInference`
- **Text generation**: `bitfount.HuggingFaceTextGenerationInference`

:::info
Hugging Face models are currently only supported for inference tasks within Bitfount.
:::

:::tip
Make sure to choose a model that is compatible with the algorithm you are using.
The links above will take you to the Hugging Face model hub filtered for that specific task type.
:::

#### Image segmentation example

```yaml
task:
  algorithm:
    - name: bitfount.HuggingFaceImageSegmentationInference
      arguments:
        model_id: CIDAS/clipseg-rd64-refined
        dataframe_output: true
        batch_size: 1
  data_structure:
    select:
      include:
        - image_path
```

#### Image-text-to-text example

The image-text-to-text task type enables the use of vision-language models that take both an image and a text prompt as input and generate a text response. This is useful for tasks such as medical image captioning or visual question answering. A notable example is MedGemma, a medical vision-language model.

```yaml
task:
  algorithm:
    - name: bitfount.HuggingFaceImageTextGenerationInference
      arguments:
        model_id: google/medgemma-1.5-4b-it
        max_new_tokens: 500
        prompt_template: "Describe the findings in this medical image given the following clinical notes: {context}"
  data_structure:
    select:
      include:
        - image_path
        - clinical_notes
```

:::tip
The `prompt_template` argument is optional. If provided, the `{context}` placeholder will be replaced by the value from the context column. If omitted, the context column value is used as the prompt directly. See the API documentation for the full list of configuration options.
:::
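
The prompt resolution described in the tip above can be illustrated with a minimal Python sketch. The function name and shape are illustrative assumptions, not Bitfount's actual implementation:

```python
def resolve_prompt(context_value, prompt_template=None):
    """Sketch of the prompt resolution behaviour described above.

    If a template is supplied, the {context} placeholder is replaced
    with the context column's value; otherwise the context value is
    used as the prompt directly.
    """
    if prompt_template is not None:
        return prompt_template.format(context=context_value)
    return context_value


notes = "Patient reports blurred vision in the left eye."
print(resolve_prompt(notes, "Describe the findings given: {context}"))
print(resolve_prompt(notes))  # no template: the notes become the prompt
```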

### TIMM models

TIMM (PyTorch Image Models) is a popular library that provides a collection of the latest pretrained image models.
Originally developed independently by Ross Wightman, it has now been brought under the Hugging Face umbrella.

TIMM models are supported by Bitfount for both inference and fine-tuning tasks via the `bitfount.TIMMInference`
and `bitfount.TIMMFineTuning` algorithms respectively. The model is specified in the same way as Hugging Face models via the `model_id` argument.

#### TIMM fine-tuning example

Hyperparameters for the model can be set separately within the `args` block. The full list of hyperparameters can be found
here. As the timm documentation notes, the variety of training args is large and not all combinations of options
(or even individual options) have been fully tested.

```yaml
task:
  protocol:
    name: bitfount.ResultsOnly
  algorithm:
    - name: bitfount.TIMMFineTuning
      arguments:
        model_id: bitfount/RETFound_MAE
        labels:
          - "0"
          - "1"
          - "2"
          - "3"
          - "4"
        args:
          epochs: 1
          batch_size: 32
          num_classes: 5
```

#### TIMM inference example

```yaml
task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.TIMMInference
      arguments:
        model_id: bitfount/RETFound_MAE
        num_classes: 5
    - name: bitfount.CSVReportAlgorithm
```

## MONAI models

MONAI (Medical Open Network for AI) is a PyTorch-based framework specialising in deep learning for medical imaging. Bitfount supports running inference using pre-trained models from the MONAI Model Zoo via the `bitfount.MONAIBundleInference` algorithm.

MONAI models are referenced by their bundle name from the MONAI Model Zoo. The algorithm downloads the specified bundle and runs inference using the bundle's pre-trained weights and preprocessing pipeline.

:::warning[Hardware Recommendation]
MONAI models can be computationally intensive. Running on CPU can be very slow, so **CUDA-enabled GPU** hardware is strongly recommended. Note that MPS (Apple Silicon) is **not supported** for MONAI models.
:::

### MONAI inference example

```yaml
task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.MONAIBundleInference
      arguments:
        bundle_name: wholeBrainSeg_Large_UNEST_segmentation
        nifti_output: true
        batch_size: 1
    - name: bitfount.CSVReportAlgorithm
  data_structure:
    select:
      include:
        - image_path
```

For the full list of configuration options, see the MONAIBundleInference API documentation.


## for-data-scientists/writing-tasks/task-components.md

# Task Components

As we've seen, a Bitfount task is the brain of a project. It specifies what will run on any dataset linked to the project, in what order and over what view of the data. Tasks are written in YAML format, and at a high level, are made up of 3 key components:

- **Protocol**: orchestrates the run and lifecycle
- **Algorithm(s)**: the units of work to execute (can be a list)
- **Data structure**: how to select, assign and transform input data

A minimal skeleton might look something like this:

```yaml
task:
  protocol:
    name: bitfount.ResultsOnly
    arguments: { ... }
  algorithm:
    - name: bitfount.ModelInference
      arguments: { ... }
  data_structure:
    select:
      include:
        - image_path
```

### Protocols

- **What they are**: the task's entry point that orchestrates the algorithms and handles communication between different parties within the task. A given protocol will only be compatible with a certain set of algorithms.
- **How to specify**: each entry takes a `name` and `arguments`. Use the prefix `bitfount.` followed by the protocol name. A full list of protocols can be found here. The `arguments` may be optional and are used to configure the protocol. Search for the protocol in the API documentation to see its available arguments.
- **Examples**:
  - `bitfount.InferenceAndCSVReport`: runs model inference and writes a CSV report from the results

```yaml
task:
  protocol:
    name: bitfount.InferenceAndCSVReport
    arguments: { ... }
```

### Algorithms

- **What they are**: the concrete steps executed by the protocol. You can supply a single algorithm or a list; lists run each algorithm in order. How the output of one algorithm feeds into the next is baked into the protocol in which they are used, so a given algorithm will only be compatible with a certain set of protocols.
- **How to specify**: each entry takes a `name` and optional `arguments`. Use the prefix `bitfount.` followed by the algorithm name. A full list of algorithms can be found here. The `arguments` may be optional and are used to configure the algorithm. Search for the algorithm in the API documentation to see its available arguments. Algorithms that require a model to be passed in accept a separate `model` block, see Referencing a model for more information.
- **Common patterns**:
  - Model inference (e.g., `bitfount.ModelInference`, `bitfount.HuggingFaceImageClassificationInference`)
  - Post-processing (e.g., calculations, matching)
  - Reporting (e.g., `bitfount.CSVReportAlgorithm`)

```yaml
task:
  algorithm:
    - name: bitfount.ModelInference
      arguments: { ... }
      model: { ... } # see "Referencing a model"
    - name: bitfount.CSVReportAlgorithm
      arguments: { ... }
```

### Data Structures

Defines what the data should look like before it is passed to the algorithms in the task.

:::tip
More information about the data structure arguments can be found here.
:::

:::note
The data structure is currently only used to define the input data for tasks that use a model.
:::

- **table_config**: optional configuration to select a specific table from the datasource if the datasource has multiple tables.
- **select**: choose columns to include/exclude from the data; `include_prefix` can be helpful for datasets that have multiple image columns.
- **assign**: map column names to semantic roles (e.g., `image_prefix`, `target`).
- **transform**: define dataset/batch/image transforms to apply to the data (e.g., Albumentations pipelines, grayscale handling). Important for tasks that use a model. More information about the transform arguments can be found here.
- **data_split**: optional configuration for defining how to split data into train/validation/test sets.
- **compatible_datasources**: list of dataset types that are compatible with this data structure configuration.
- **schema_requirements**: specify dataset schema requirements level (`"empty"`, `"partial"`, or `"full"`), or a dictionary mapping requirement levels to specific dataset types. Defaults to `"partial"`.
- **filter**: optional task-level filters to apply at runtime. These filters allow the task initiator to further restrict which data is processed, without modifying the dataset connection. See Task-level filters below for details.

```yaml
task:
  data_structure:
    compatible_datasources:
      - DICOMOphthalmologySource
      - HeidelbergSource
    schema_requirements: partial
    data_split:
      args:
        shuffle: false
        test_percentage: 100
        validation_percentage: 0
      data_splitter: percentage
    assign:
      image_prefix: Pixel Data
    select:
      include:
        - Columns
        - Rows
      include_prefix: Pixel Data
    transform:
      image:
        - albumentations:
            step: test
            output: true
            transformations:
              - ToTensorV2
```

### Task-level Filters

Task-level filters allow the task initiator to specify data filtering criteria at runtime, without requiring the dataset owner to modify their dataset connection. This is useful when the same dataset needs to be queried with different criteria across different task runs.

:::info
Task-level filters are applied **in addition to** any dataset-level filters configured by the dataset owner at connection time. The resulting filter is the intersection of both—meaning task-level filters can only **further restrict** the data, never expand it beyond what the dataset owner has allowed.
:::

Filters are specified as a list of filter objects, each containing a `filter_type` and `value`:

```yaml
task:
  data_structure:
    filter:
      - filter_type: modality
        value: OCT
      - filter_type: min-frames
        value: 50
      - filter_type: scan-acquisition-min-date
        value:
          year: 2020
          month: 1
```
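
Conceptually, the "most restrictive wins" combination of dataset-level and task-level filters can be sketched as follows. This is an illustrative Python sketch covering only numeric `min-*`/`max-*` bounds, not Bitfount's actual implementation (which also handles dates, modalities, and the other filter types):

```python
def combine_filters(dataset_filters, task_filters):
    """Sketch of "most restrictive wins" filter combination.

    For min-* filters the larger bound wins; for max-* filters the
    smaller bound wins. Illustrative only.
    """
    combined = {}
    for f in dataset_filters + task_filters:
        key, value = f["filter_type"], f["value"]
        if key not in combined:
            combined[key] = value
        elif key.startswith("min-"):
            combined[key] = max(combined[key], value)  # tighter lower bound
        elif key.startswith("max-"):
            combined[key] = min(combined[key], value)  # tighter upper bound
    return [{"filter_type": k, "value": v} for k, v in combined.items()]


dataset_side = [{"filter_type": "min-frames", "value": 30}]
task_side = [{"filter_type": "min-frames", "value": 50},
             {"filter_type": "max-file-size", "value": 200}]
print(combine_filters(dataset_side, task_side))
# The task's stricter min-frames (50) wins; max-file-size is added.
```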

#### Available filter types

| Filter Type                  | Value Type         | Description                                |
| ---------------------------- | ------------------ | ------------------------------------------ |
| `modality`                   | `"OCT"` or `"SLO"` | Filter by imaging modality                 |
| `min-frames`                 | integer            | Minimum number of B-scan frames            |
| `max-frames`                 | integer            | Maximum number of B-scan frames            |
| `min-file-size`              | number (MB)        | Minimum file size in megabytes             |
| `max-file-size`              | number (MB)        | Maximum file size in megabytes             |
| `file-creation-min-date`     | date object        | Earliest file creation date                |
| `file-creation-max-date`     | date object        | Latest file creation date                  |
| `file-modification-min-date` | date object        | Earliest file modification date            |
| `file-modification-max-date` | date object        | Latest file modification date              |
| `min-dob`                    | date object        | Minimum patient date of birth              |
| `max-dob`                    | date object        | Maximum patient date of birth              |
| `scan-acquisition-min-date`  | date object        | Earliest scan acquisition date             |
| `scan-acquisition-max-date`  | date object        | Latest scan acquisition date               |
| `check-required-fields`      | list of strings    | Required DICOM fields that must be present |
| `series-description`         | string             | Filter by DICOM series description         |

:::tip
Date values are specified as objects with `year` (required), and optional `month` and `day` fields:

```yaml
value:
  year: 2023
  month: 6
  day: 15
```

:::
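
Assuming omitted `month` and `day` fields default to the start of the period (an assumption, not confirmed by the schema), a date object can be converted to a concrete date like so:

```python
import datetime


def parse_filter_date(value):
    """Sketch: turn a filter date object into a datetime.date.

    `year` is required; `month` and `day` are assumed to default to 1
    when omitted (an assumption for illustration).
    """
    return datetime.date(value["year"], value.get("month", 1), value.get("day", 1))


print(parse_filter_date({"year": 2023, "month": 6, "day": 15}))  # 2023-06-15
print(parse_filter_date({"year": 2020}))  # 2020-01-01
```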


## for-data-scientists/writing-tasks/task-configuration.md

# Task configuration

The task components are the core of the task. However, the task YAML in its entirety includes _everything_ required to run a task. In addition to the task components, this includes the datasets to run on, task configuration settings, authentication details and other metadata.

A complete task file is a YAML document deemed valid according to the Bitfount task schema. This can be validated in your YAML editor of choice by referencing the Bitfount task schema at the top of the file like so:

```yaml
# yaml-language-server: $schema=https://docs.bitfount.com/schemas/task-spec.json
```

This is done automatically when uploading a task via the Bitfount App or Hub.

:::tip
In Bitfount terminology, a task initiator is often referred to as a **modeller**.

Meanwhile, a task runner is often referred to as a **Pod** (Processor of Data) which will contain one or more datasets to run the task on.

In some scenarios, the Pod and the modeller may be the same entity on the same machine, in other scenarios, they may be different entities on different machines.
:::

## Required fields

In addition to the core `task` component, the only other required field is the `pods` field which contains a list of dataset identifiers where the task will be sent to run.

- **task**: the task definition. See Task components for details.
- **pods**: a dictionary containing a list of dataset identifiers where the task will be sent to run.

When using the Bitfount App, dataset identifiers are automatically overwritten with the identifier that the task run is triggered against, so a typical task YAML uploaded via the app leaves the dataset identifiers unspecified, like so:

```yaml
pods:
  identifiers:
    - 
```

:::warning[Advanced Usage]

If running a task using the SDK, the dataset identifiers must be specified explicitly.

```yaml
pods:
  identifiers:
    - alice/sensitive-data
    - bob/sensitive-data
    - charlie/sensitive-data
```

:::

## Optional fields

In addition to the required fields, the following optional fields can be specified:

- **modeller**: specifies authentication details via `identity_verification_method`. The default is OIDC device code authentication, which triggers an interactive prompt requiring the user to validate a code in their browser. The options are `key-based`, `oidc-auth-code`, and `oidc-device-code`.

  :::tip
  For app-based runs, set to `key-based` to use RSA keys and avoid interactive prompts.
  :::

- **run_on_new_data_only**: whether to run the task on only new data that has not been seen in previous runs. Defaults to false. This will have no effect on the first run of a task _on a specific dataset_. Subsequent runs will only process new data that has not been seen in previous runs _on that dataset only_.
- **batched_execution**: whether to run the task in batches. Defaults to false. If enabled, the task will be split into batches of records and each batch will be processed sequentially. This is useful for large datasets that cannot be held in memory in their entirety. The task can only switch this on or off; the number of records in each batch is determined by the environment where the dataset is held. If using the app, configure it in the app settings; if using the SDK, set the `BITFOUNT_TASK_BATCH_SIZE` environment variable.
- **test_run**: run on a small subset for a quick validation. Defaults to false. This is useful for testing the task configuration and ensuring that the task will run correctly before running on the full dataset. The number of records that are processed is determined by the environment where the dataset is held. If using the app, configure it in the app settings; if using the SDK, set the `BITFOUNT_TEST_RUN_NUMBER_OF_FILES` environment variable. Only applies to file-based datasets.
- **force_rerun_failed_files**: whether to force re-running failed files at the end of the task. Defaults to true. Failed files are files that failed to process during the main body of the task run. Only applies to file-based datasets if the following conditions are met:
  - Batched execution is enabled in the task configuration.
  - Batch resilience is enabled in the environment where the dataset is held. Defaults to enabled in the app settings.
  - Individual file retry is enabled in the environment where the dataset is held. Defaults to enabled in the app settings.
- **template**: a dictionary containing template definitions for the task. See Templated fields for details.
- **task.data_structure.filter**: task-level filters to apply at runtime. These allow task initiators to further restrict which data is processed without modifying the dataset connection. Filters are combined with any dataset-level filters using "most restrictive wins" logic. See Task-level filters for available filter types and examples.
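
For SDK runs, the environment variables mentioned above control the batch size and test-run file count; a shell sketch with illustrative values (16 and 5 are assumptions, tune them to your environment):

```shell
# Illustrative values only: adjust for your dataset and hardware.
export BITFOUNT_TASK_BATCH_SIZE=16          # records per batch when batched_execution is enabled
export BITFOUNT_TEST_RUN_NUMBER_OF_FILES=5  # files processed when test_run is enabled

echo "Batch size: $BITFOUNT_TASK_BATCH_SIZE"
```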

:::warning[Advanced Usage]

If running a task using the SDK, you may also need to specify the project ID explicitly as a top-level key in order to use your project-specific access to a particular dataset or model that is part of the task.

- `project_id`: is used to associate the run to a specific project. When using the app, this is omitted as a task may be associated with multiple projects.
  :::

## Minimal complete example

```yaml
# yaml-language-server: $schema=https://docs.bitfount.com/schemas/task-spec.json

modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

batched_execution: true
test_run: false
force_rerun_failed_files: true
run_on_new_data_only: false

task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.ModelInference
      model:
        bitfount_model:
          model_ref: MyModel
          model_version: 2
          username: my-user
    - name: bitfount.CSVReportAlgorithm
  data_structure:
    select:
      include:
        - image_path
```


## for-data-scientists/writing-tasks/task-upload.md

# Uploading tasks

Once you have written your task YAML, you can upload it to the Bitfount Hub using the UI. This will make it available to other users to run on their datasets.

:::warning
All tasks are currently public but not discoverable by other users. This means that users cannot search for another user's tasks by name or description but can navigate to their task if they have the URL.
:::

## Uploading a task

To upload a task, you need to have a Bitfount account and be logged into either the Bitfount App or Hub.

1. Navigate to the "Tasks" tab in the left sidebar.
2. Click the "Upload task" button in the top right corner.
3. Either paste or upload the task YAML file you want to upload along with the name, description, type and any tags you want to add.
4. Click the "Upload" button.

### Task type

The task type is the type of task you are uploading. When creating a Project, users will be able to filter by task type to find tasks that are suitable for their project. The available task types are:

- Image Classification
- Image Segmentation
- Object Detection
- Tabular Analytics
- Tabular Classification
- Tabular Regression
- Text Classification
- Text Generation

### Task tags

The task tags are a list of tags that will be displayed for any projects that use the task. These are not used for filtering and are only used for display purposes. At least one tag is required. The available task tags are:

- Prediction
- Training
- Evaluation
- Querying
- Comparison
- Ophthalmology

## Validation

The task YAML will be automatically validated against the Bitfount task schema. Any errors will be displayed in the UI with a red squiggly underline beneath the offending line and a tooltip showing the error message.


## for-data-scientists/writing-tasks/templated-fields.md

# Templated fields

Templating is a powerful feature that lets you build re-usable task configurations with user-supplied inputs. Define inputs under a top-level `template` block, then reference them with `{{ variable_name }}` in place of the actual value. This is particularly useful for model IDs, column names, and other values that may vary between runs, as it avoids hardcoding values in the task file and having to create multiple task files for different values.

:::info
Templated variables are inserted into the task at task runtime after the project has been created. The only exception to this is the `model_slug` field which can be specified either at project creation time _or_ at task runtime.
:::
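
Conceptually, template substitution replaces each `{{ name }}` placeholder in the task YAML with the user-supplied value. A minimal Python sketch for string values only (not Bitfount's actual implementation):

```python
import re


def render_template(task_yaml, values):
    """Sketch: replace "{{ name }}" placeholders with supplied values.

    Illustrative only; real substitution is handled by Bitfount at
    task runtime and covers non-string types too.
    """
    def replace(match):
        return str(values[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, task_yaml)


snippet = 'model_id: "{{ retfound_model }}"'
print(render_template(snippet, {"retfound_model": "bitfount/RETFound_MAE"}))
# model_id: "bitfount/RETFound_MAE"
```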

## Where templating is supported

Templating is supported for a range of simple and domain-specific field types. The sections below show how to define a template input for each supported type.

### string

Use a `string` template when you want a free-text value. You can optionally enforce a `pattern` using a regular expression and provide a `default`.

```yaml
template:
  run_label:
    label: "Run label"
    type: string
    default: "baseline"
    pattern: "^[a-z0-9_-]+$"
```
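
The `pattern` constraint is a regular expression that the supplied value must match. Checking candidate values against the example pattern above:

```python
import re

# The pattern from the example template: lowercase letters, digits,
# underscores, and hyphens only.
pattern = r"^[a-z0-9_-]+$"

print(bool(re.match(pattern, "baseline")))     # True
print(bool(re.match(pattern, "Run Label 1")))  # False: uppercase and spaces
```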

### boolean

Use a `boolean` template for simple on/off switches.

```yaml
template:
  test_run:
    label: "Test run"
    type: boolean
    default: false
```

### number

Use a `number` template for numeric values, with optional lower bound and a default.

```yaml
template:
  top_k:
    label: "Number of top predictions to return"
    type: number
    minimum: 1
    default: 5
```

### array

Use an `array` template when you need the user to supply a list of values, for example multiple labels or column names. Specify `items.type: string`, and optionally `minItems` and a `default` list.

```yaml
template:
  include_labels:
    label: "Labels to include"
    type:
      array:
        items:
          type: string
        minItems: 1
        default:
          - "cat"
          - "dog"
```

:::info
The array type currently only supports strings.
:::

### file_path

Use a `file_path` template to open the user's file explorer and let them select a file, constrained by extension where appropriate.

```yaml
template:
  input_csv:
    label: "Input CSV file"
    type:
      file_path:
        extension: ".csv"
```

### model_slug

Use a `model_slug` template to expose a model picker allowing the user to select a model from a supported provider and library. The `provider` and `library` are required fields whilst `pipeline_tag` and `author` can be optionally provided to further restrict the available models.

:::info
The model_slug type currently only supports `huggingface` as the provider.
:::

The full list of available libraries can be found on the Hugging Face model hub, but they include the likes of:

- transformers
- timm
- keras
- pytorch
- tensorflow
- jax

`pipeline_tag` is the type of model you are looking for (or the _task_ as referred to by Hugging Face). For example, `image-classification`, `image-segmentation`, `text-classification`, `text-generation` as mentioned on the Referencing a Model page.

Meanwhile the `author` is simply the username of the model owner.

```yaml
template:
  hf_model_slug:
    label: "Hugging Face Image Classification Model"
    type:
      model_slug:
        provider: huggingface
        library: transformers
        pipeline_tag: image-classification
        author: google
```

:::info
The model slug can be specified either at project creation time or at task runtime. At project creation time, you will be presented with an option to choose the model from a picker as well as the option to allow the user to then override this model at task runtime.

A screenshot of the Bitfount task configuration UI with a templated model slug.
:::

### schema_column_name

Use a `schema_column_name` template when you want the user to choose a single column from their chosen dataset restricted by semantic type (for example `categorical`, `continuous`, `image`, or `text`). For instance, if you are working with an image model you may want to restrict the user to selecting only image columns.

```yaml
template:
  target_column_name:
    label: "Target column"
    type:
      schema_column_name:
        semantic_type: image
```

### schema_column_name_array

Use a `schema_column_name_array` template when the field you are templating takes an array of columns such as `data_structure.select.include` or `data_structure.assign.image_cols`.

```yaml
template:
  feature_columns:
    label: "Feature columns"
    type:
      schema_column_name_array:
        semantic_type: continuous
```

### task_filters

Use a `task_filters` template to allow users to configure data filtering criteria at task runtime. This presents a filter configuration UI where users can specify which data should be included based on file metadata, patient information, and imaging parameters.

```yaml
template:
  data_filters:
    label: "Data filters"
    type: task_filters
    tooltip: "Configure filters to restrict which data is processed"
```

When referenced in the task, the filters are applied to `data_structure.filter`:

```yaml
task:
  data_structure:
    filter: "{{ data_filters }}"
```

:::info
Task-level filters configured via templates are combined with any dataset-level filters set by the dataset owner at connection time. The resulting filter uses "most restrictive wins" logic—task filters can only further restrict the data, never expand it.
:::

Available filter options include:

- **Modality**: OCT or SLO imaging modality
- **B-scan frames**: Minimum and maximum frame counts
- **File size**: Minimum and maximum file size in MB
- **Date filters**: File creation, modification, scan acquisition, and patient date of birth ranges
- **Required fields**: DICOM fields that must be present
- **Series description**: Filter by DICOM series description

See Task-level filters for the complete list of filter types and their value formats.

## Referencing template variables

Use double curly braces within quotation marks to reference templated values:

```yaml
task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.TIMMInference
      arguments:
        model_id: "{{ retfound_model }}"
        class_outputs: "{{ class_outputs }}"
        checkpoint_path: "{{ checkpoint_path }}"
    - name: bitfount.CSVReportAlgorithm
      arguments:
        original_cols:
          - _original_filename
          - Filename
  data_structure:
    select:
      include:
        - "{{ image_column_name }}"
    table_config:
      table: 

template:
  test_run:
    type: boolean
    label: Test run
    default: false
    tooltip: >-
      Run the task with a small subset of data to test the configuration before
      running on the full dataset.
  class_outputs:
    type: array
    items:
      type: string
    label: Class outputs
    default:
      - CNV (%)
      - DME (%)
      - DRUSEN (%)
      - NORMAL (%)
    tooltip: >-
      The number of output categories for a classification task. Must match the
      number of labels.
    minItems: 1
  retfound_model:
    type:
      model_slug:
        author: bitfount
        library: timm
        provider: huggingface
    label: RETFound model
    tooltip: Select the RETFound model to be used in this task
  checkpoint_path:
    type:
      file_path:
        extension: tar
    label: Checkpoint file
    tooltip: >-
      Load a previously saved model checkpoint file to resume training from a
      previous state instead of starting from scratch.
  image_column_name:
    type:
      schema_column_name:
        semantic_type: image
    label: Image column
    default: Pixel Data 0
    tooltip: >-
      The dataset column that contains image data for model training and
      fine-tuning
  force_rerun_failed_files:
    type: boolean
    label: Re-run failed files
    default: true
    tooltip: >-
      Include files that failed in the last task run. Turn this off to skip
      them.
test_run: "{{ test_run }}"
force_rerun_failed_files: "{{ force_rerun_failed_files }}"
```

The templated YAML above would yield the following task configuration UI:

  Templated fields in the Bitfount task configuration UI


## getting-started/about-bitfount.md

# Introduction

Bitfount exists to safely unlock the value of sensitive data for the benefit of
humankind. We enable data collaborations **without** needing to transfer data to
other parties, an approach known as federated data science.

## What can you do with Bitfount?

You can use Bitfount to securely adopt AI in a range of scenarios, including:

- **Model inference:** Get predictions on your data and run a wide selection of
  models locally. All data remains behind your firewall with results only
  accessible to you.
- **Fine-tune models:** More efficient than training a model from scratch. Adapt
  foundation models on your sensitive data to complete specific downstream tasks
  you are interested in.
- **Federated learning:** The traditional application of federated data science.
  Federated learning enables you to train models across multiple distributed
  datasets.
- **Federated evaluation:** Test model performance on a variety of real-world
  data you don't have access to in raw form.
- **Private set intersection:** Determine the overlapping records in two (or
  more) disparate datasets without providing access to the underlying raw data
  of either dataset to any other collaborators.
- **Private analytics:** Run analysis queries and retrieve valuable insights.
  Data custodians remain in control of what kinds of metrics can be retrieved.

If you would like to learn more about what you could achieve with Bitfount,
please contact us at support@bitfount.com.

## How Bitfount works

With Bitfount, AI models can be securely deployed and run locally in sensitive
environments such as hospitals and clinics. This means:

- Data never leaves its original location.
- Insights are generated without writing a single line of code.

Before exploring federated data science, let's first look at how AI models are
typically built and used today.

### How AI models work

AI models use data to learn and perform tasks—like detecting diseases in medical
images or powering voice assistants. To work well, AI models need lots of
training data, which is often:

- Stored in different locations (e.g., across hospitals, banks, or mobile devices).
- Too sensitive to share due to privacy laws and security risks.

### The traditional approach: Data centralisation

The traditional way to solve this problem is data centralisation—bringing all
the data together in one place, like a data lake or cloud server.

  All data is moved to a single location for analysis and insights

However, in industries like healthcare and finance, where data is highly
sensitive, this approach has serious limitations. Personal information, medical
records, or financial details can't simply be shared due to privacy risks, legal
restrictions, and strict regulations.

AI has enormous potential to solve big problems—but how can we apply it to
sensitive data without compromising security? This is where federated data
science comes in.

### The alternative: Federated data science

Rather than moving data, **federated data science sends the AI model to the
data**.

This enables organisations to:

- Train and improve AI models on datasets they never actually see—even when the
  data is spread across multiple locations.
- Ensure compliance with privacy regulations.
- Collaborate securely while keeping full control of their data.

  Insights are shared, while data remains securely in its original location

This keeps information safe while still allowing AI to learn and improve.
Organisations get the insights they need without sharing private data. Let's
look at some real-world applications.

## How is federated data science used?

Federated data science is already making AI safer and more effective. Some
real-world uses include:

- Healthcare: Training AI to detect diseases without sharing patient records.
- Finance: Banks spotting fraud patterns without exposing customer data.
- Smartphones: Improving voice assistants without collecting users'
  conversations.

### Training AI securely with federated learning

Federated Learning (FL) allows AI models to be trained across multiple locations
without sharing raw data. Instead of collecting data in one place, the model
learns from distributed datasets stored across different institutions, servers,
or devices. Here's how it works:

1. **Model setup:** A global AI model is prepared and sent to multiple locations
   where data exists.
2. **Local training:** Each site trains the model using its own data but never
   shares the raw data itself.
3. **Model updates:** Each site sends back only the improvements (updated model
   parameters), not the data.
4. **Aggregation:** A central system combines all updates to refine the global
   model. The most common method, Federated Averaging (FedAvg), ensures sites
   with more data have a greater impact on training.
5. **Repeat:** The improved model is sent back for further training, repeating
   the process until it reaches peak performance.

This approach distributes the computing workload across many locations, making
AI training more efficient, scalable, and privacy-preserving.
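The aggregation step (4) can be sketched in a few lines. This is a minimal illustration of Federated Averaging, not Bitfount's implementation: each site reports its updated parameters together with its local example count, and the server takes a weighted average so sites with more data contribute proportionally more.

```python
# Minimal FedAvg sketch: weight each site's parameters by its share of
# the total training examples, then sum. Sites never share raw data —
# only these parameter updates.

def fedavg(site_updates):
    """site_updates: list of (parameters, num_examples) tuples,
    where parameters is a list of floats (a flattened model)."""
    total_examples = sum(n for _, n in site_updates)
    global_params = [0.0] * len(site_updates[0][0])
    for params, n in site_updates:
        weight = n / total_examples  # larger sites get more influence
        for i, p in enumerate(params):
            global_params[i] += weight * p
    return global_params

# Two sites: the second holds three times as much data, so its
# parameters dominate the average.
print(fedavg([([1.0, 2.0], 100), ([5.0, 6.0], 300)]))  # [4.0, 5.0]
```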


## getting-started/ehr-integration.md

# EHR Integration

Bitfount supports the integration of the following EHR systems:

- NextGen
- Nextech
- ModMed
- Optivate/EyeMD
- Epic

If you would like to learn more about connecting one of these EHR systems (or any other FHIR compatible EHR) with Bitfount, please contact us at support@bitfount.com.


## getting-started/installation.md

# Installation

A quick overview to get you up and running with Bitfount.

## Creating an account

Firstly, create your Bitfount account.
Your username will be your primary identifier within the Bitfount ecosystem and
can be used to connect datasets, run tasks, use models or join projects for
collaborations.

## Accessing Bitfount

After creating your account, we recommend installing the right software to make
the most of the Bitfount platform.

1. **Bitfount Desktop:** Connect datasets
   and run tasks easily with our desktop application. Bitfount Desktop provides
   a no-code interface for managing projects and datasets, allowing you to
   securely run pre-built AI models and SQL tasks locally.
2. **Bitfount SDK:** For data scientists
   and technical users to interact directly with Bitfount APIs. The SDK supports
   the deployment of models, provisioning task templates to our desktop
   app users, and more complex use cases like configuring entire federated data
   collaboration networks.
3. **Bitfount Hub:** Both SDK and Desktop users
   leverage the Hub as a point of authentication. The cloud-based Hub mirrors
   the functionality of Bitfount Desktop but does not facilitate connecting
   datasets or running tasks.

## Installing Bitfount Desktop

Installing Bitfount Desktop is as simple as downloading and running the
installer from our website. When you first launch Bitfount Desktop you will be
prompted to sign into your Bitfount account and link it to the application.

We support two app versions:

1. Windows - for Windows 10 or later
2. macOS - for Apple Silicon machines

If you don't have access to a machine with either of these operating systems,
please get in contact with our support team.

:::info
Looking to interact directly with Bitfount APIs and provision tasks? Please visit our SDK guides for details about installation and running federated analyses.
:::

### Hardware recommendations

Data science tasks are compute-intensive and run more efficiently with appropriate hardware. We recommend installing Bitfount on a machine with a GPU (Graphics Processing Unit). This could be any Apple device with a Silicon chip, or any machine running Windows and fitted with an NVIDIA GPU.

### Software updates

We always recommend running the latest version of the Bitfount app. As we roll out new features, you will be notified in-app when updates are available; each update is accompanied by release notes outlining the changes.


## getting-started/settings.md

# Application Settings

In the Application Settings page, you can configure various settings for the Bitfount application. Settings are organised into the following sections: Aggregate tracker and Orchestrator.

## Aggregate tracker

There are some projects that make use of an "aggregate" report i.e. a single file that tracks various metrics over the course of multiple task runs within a project. Each task run will still produce its own results, but the aggregate report will aggregate the results of all task runs. This is enabled by default for the projects that support this feature. However, what can be configured is the location of the aggregate report. Often, it can be beneficial to set this to a shared location where non-Bitfount users can also access the report.

aggregate-tracker-settings.png

## Orchestrator

The Orchestrator is the engine in the Bitfount app that coordinates the execution of tasks. Certain configuration settings that pertain to the orchestrator have been exposed here for adjustment.

orchestrator-settings.png

### The number of files to load for test runs

Some tasks allow a test run to be performed on a subset of the files in the dataset to identify any issues with the data or task configuration more quickly. If your project task supports this, it will be visible as a checkbox in the task run page. This setting controls how many files are processed when running a test run. Default is 1.

### Enable batch resilience

Continue with the task even when a batch of data raises an error, instead of aborting the entire task run. Default is enabled.

### Retry failed files

Automatically re-run any files that failed as part of a batch individually, after batch processing completes. Often, a single broken file can cause an entire batch of data to fail even though the rest of the data is valid. This setting allows you to automatically re-run the failed files individually to make sure that otherwise valid data is not ignored. Only applies if batch resilience is enabled. Default is enabled.

### Heidelberg DoB Fix

Apply a fix for Heidelberg files where dates of birth before 1944-11-07 may otherwise be misparsed. Default is enabled.

### Allow extra Zeiss Transfer Syntaxes

Enable additional DICOM TransferSyntaxUIDs to be considered when decoding Zeiss images. This feature is still under development, so it is disabled by default.

### Orchestrator log level

Controls the level of detail stored in Orchestrator logs. Options: `DEBUG`, `INFO`, `WARN`, `ERROR`. Default is `INFO`.

### Network drive robustness

Enable additional robustness checks for network drive operations. Use if data is located on a network drive but the connection may be unreliable. This may be slower and consume more resources. Default is disabled.

### Task batch size

Number of files processed in each batch. Larger batches can speed up processing but require more memory, so adjust this setting to suit the available memory on the machine. Default is 16.

### Max consecutive batch failures

Maximum consecutive batch failures before the task is marked as failed. Use `unset` to disable the limit. Default is 5. This setting is only applicable if batch resilience is enabled.

### Hugging Face User Access Token

You can configure a Hugging Face User Access Token to authenticate with the Hugging Face Hub when downloading models that require access permissions (e.g. gated models). The token is sent from the modeller to the pod when a task is initiated, and dictates the modeller's permissions for accessing Hugging Face models.

:::warning[Privacy & Security]
When setting up your Hugging Face token, we strongly recommend using a **read-only (fine-grained) token**. This ensures that the token can only be used to download models and cannot be used to modify any resources on your Hugging Face account. Since the token is transmitted to the pod as part of task execution, using a read-only token minimises security risk.
:::

### MPS (Apple Silicon)

If you are running Bitfount on a Mac with Apple Silicon (M1, M2, M3, M4 chips), you can enable **MPS (Metal Performance Shaders)** acceleration. MPS leverages Apple's GPU hardware to significantly speed up machine learning model inference and training compared to running on CPU alone.

This setting only applies to tasks running locally on the device — it has no effect on remote datasets or pods running on other machines.

:::info
MPS is not supported by all model types. If a model or algorithm does not support MPS, the setting will be ignored and the task will fall back to CPU.
:::


## getting-started/platform-overview/datasets.md

# Datasets

Datasets in Bitfount act as references to your data, storing only metadata and schema—not the raw data itself.
Your datasets always remain on your system and are never transferred or stored by Bitfount.

This guide covers how to connect a dataset to Bitfount, link it to a project,
and manage dataset access.

## Connecting datasets

Before using a dataset in a project, you must first connect it to Bitfount using
Bitfount Desktop. Connecting a dataset to
Bitfount is like registering it—only its metadata (name, description, and
schema) is stored, never the raw data itself.

### Format

It's important to ensure your dataset is formatted correctly to be compatible
with the task used in the project. If you are joining an existing project,
please check with the project contact to ensure your dataset meets the
requirements for the task.

### Selecting a data source

To connect a dataset, click `Connect dataset` either from the `Datasets` page,
or within the project when you link a dataset, and choose from the available
data sources supported by Bitfount.

product-modal-datasources-min.png

:::tip
If your dataset contains DICOM files and you intend to run Ophthalmic tasks, we recommend selecting the **DICOM (Ophthalmology)** data source for optimal compatibility.
:::

After selecting a data source, enter a dataset name and, optionally, a
description, then click `Connect dataset`. The system will then process the
connection, making the dataset available within Bitfount.

Once connected, the dataset should appear `Online`.

:::note
**Can't find the data source you need?** Please reach out to the Bitfount support team—we're happy to help you connect your dataset to Bitfount.
:::

### Schema

When you connect a dataset, Bitfount automatically generates a schema that
defines the column names and data types within your dataset. This schema is used
to verify compatibility with the task used in a project, and **does not contain
any actual data** (such as patient records), only structural information about
the dataset.

If you are working with data scientists, they may also reference the schema to
design analyses and tasks that align with your dataset's structure.

product-schema-min.png
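For illustration only (the layout and field names here are assumptions, not Bitfount's exact schema format), a generated schema records structural information along these lines:

```yaml
# Illustrative sketch only — structural metadata, never actual data values.
MyRetinalDataset:
  columns:
    Patient ID: string
    Acquisition DateTime: datetime
    Pixel Data 0: image   # semantic type: image
```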

## Managing datasets

### Status

When you start Bitfount Desktop, the system automatically attempts to establish a connection with all connected datasets, whether they are online or offline.

If needed, you can manually take a dataset offline from the `Settings` tab, which will temporarily disable task execution for that dataset.

:::info
Tasks cannot run until Bitfount has finished connecting all datasets at startup
:::

### History

A full audit trail is available for datasets via the `Activity history` tab. To
view project-specific activity, navigate to the same tab in the relevant
project.

### Archiving

From the `Settings` tab you can archive your dataset. Archiving does not delete
the raw data source connected to Bitfount. Archived datasets can be unarchived
and reused in projects when appropriate.

### Access

You can view all projects the dataset is currently linked to via the
`Linked projects` tab on the dataset's detail page. You can unlink a dataset
from a project at any time by clicking the `Unlink dataset` button within the
project's `Datasets` tab.


## getting-started/platform-overview/models.md

# Models

Generally, AI models are a core component of tasks. They are programs developed
by data scientists designed to analyse datasets to find patterns and make
predictions. Models are leveraged to complete tasks related to computer vision,
natural language processing, and applied in a range of other AI domains.

## Inspecting models

Just like tasks, model code can also be inspected if you have permission to access the model. Model owners
can mark models as private or public.

model-overview.png

## Tasks with interchangeable models

In Bitfount, tasks define the model that will run within a project. Some tasks offer flexibility,
allowing you to choose from a selection of open-source models.

If a task includes a model that requires usage approval, it will be marked as `Pending Approval`.
The model owner must approve its use before you can run the task.

model-selection.png

## Creating a model

The platform supports the integration of user-provided models which, once uploaded, can
be templated into tasks to train, run evaluation, or run inference. If you are a data
scientist, please see the guide,
or visit our SDK Tutorials for more information on implementation.

## Uploading your model

Already have a model you want to use in a task? You can upload it to Bitfount using the Hub or Desktop App. See our guide for more information.


## getting-started/platform-overview/projects.md

# Projects

Projects are the core workspace in Bitfount, where you extract valuable insights
from your datasets and collaborate securely with others. Here, collaborators can
link their datasets and run the task assigned to the project.

- If you have been invited to collaborate, skip ahead to
  Joining a project.
- If you are creating a project from scratch, keep reading to learn how to set
  it up.

## Creating a project

Projects can be created by simply navigating to the projects tab and clicking
the `Create project` button.

product-projects-min.png

### Project metadata

The table below outlines all the metadata that can be defined within a project.

| Metadata type            | Definition                                                                                              |
| ------------------------ | ------------------------------------------------------------------------------------------------------- |
| Project name             | Title of the project on the platform (3–50 characters)                                                  |
| Description              | 1-2 sentence description of the goals of the project                                                    |
| Official link (optional) | URL to more information or an official brochure associated with the project                             |
| Organisation (optional)  | Person leading or organisation sponsoring the project                                                   |
| Contact email (optional) | Contact for all collaborator-related queries                                                            |
| Duration (optional)      | Timeline for the project                                                                                |
| Project terms (optional) | Terms & conditions of the project. If included, these must be accepted by collaborators before joining. |
| Task                     | Predefined machine learning tasks or analyses available within the project                              |

### Defining project terms

If your project requires terms and conditions, we recommend consulting with
legal advisors to ensure they are appropriate. Bitfount does not verify or
enforce project terms beyond standard role-based data access permissions.

Consider including details on:

- Confidentiality: Protecting sensitive information.
- Exclusivity: Restrictions on participation or data use.
- Data Management & Collection: How data is handled within the project.
- Data Subject Privacy: Ensuring compliance with relevant privacy regulations.
- Participation Rules: Defining who can join and under what conditions.
- Scope of Tasks: Outlining the specific analyses or AI tasks to be performed.

If your terms are too long or reference multiple documents, you can link to
hosted documents instead of including them in full.

### Selecting a task

A Bitfount task is the _brain_ of a project. It specifies the algorithm(s) that
will run on any dataset linked to the project which can include the use of AI
models as well as other data science operations.

To add a task to your project, click `Add task` and browse the available
options. Bitfount offers a variety of pre-built tasks, some of which allow you
to choose from a selection of open-source models.

:::caution
You cannot change or remove the task once you create the project
:::

task-selection.png

Each task has a unique set of input parameters that must be configured by collaborators
for the task to run successfully. These parameters are visible when the project is created,
allowing collaborators to review and set the required values before running the task.

:::info
Looking to create your own tasks to use in Bitfount Desktop? Please
refer to Task Templates &
Models
in the Data Scientist documentation.
:::

## Managing a project

Once you have set up a project, you may want to invite collaborators,
edit project metadata, or archive the project.

### Inviting collaborators

To invite new collaborators, select the project, navigate to the
`Collaborators` tab, click the `Invite collaborators` button and enter the
email or username of the users you wish to invite. Any users invited will receive an email
invitation to join the project.

Once the user has created a Bitfount account, they will be able to review the
project details and must accept the project terms (if defined) before joining.

Once the user joins, they can link their dataset and run the task associated
within the project.

You can remove a collaborator at any time by navigating to the 'Collaborators'
tab. Once a collaborator is removed, any of their connected datasets will also
be unlinked.

### Updating or archiving a project

To edit a project's metadata, click the three dots on the projects page and
select `Edit project`.

Archiving a project can be achieved by navigating to the `Settings` tab and
selecting `Archive projects`. Any datasets will be unlinked from the archived
project and running the associated task will be blocked.

product-archive-project-min.png

Projects can be restored at any time by returning to the settings tab and
clicking `Restore project`.

## Monitoring project activity

We recognise how important it is to have sufficient oversight of how
collaborators are interacting with one another's data or tasks to fulfil the
needs of the project.

Different users can see different views as follows:

- **Project Owners** can view model usage as well as activity history for the
  whole project, including when projects were created and invitations that were
  issued or revoked; task run history is visible only for their own datasets.
- **Data Custodians** can view the activity history related to their dataset
  only, allowing them to see when they last ran a task, and any results related
  to that task run.

### Accessing logs

Logs are technical audit trails of a user's interaction with Bitfount and can
hold useful information for the Bitfount team to help resolve any technical
issues that might occur. To retrieve log files, click the `Logs` link in the
sidebar.

## Joining a project

If you have been invited to collaborate on a project, you will receive an email
invitation to join the project.

Joining a project allows you to link datasets and run the assigned task.
Before joining, review the project details, task configuration, models, and any
available terms and conditions to ensure alignment with your expectations.

If you are unsure about the project's scope, consider consulting your project contact
or legal team before proceeding.

Finally, when you're happy to continue, go ahead and click
`Accept and join project`.

product-join-modal-min.png

## Linking datasets to projects

After creating or joining a project, the next step is to connect and link a dataset
(or link an already connected dataset - see more on our
Datasets page).
This is essential because the project's task can only be run on linked datasets.

Linking a dataset ensures that the assigned task can access the necessary data
while keeping it securely stored in its original location.

To run the associated task for the project you need to click the `Link dataset`
button.

After selecting the dataset, Bitfount will automatically check if the dataset schema
is compatible with the task. Once this is complete, you will see your linked dataset
within the project.

If the schema check returns an error, please review the dataset schema and ensure that
the expected columns are present in the data and named accordingly.

:::info
Learn more about connecting and managing datasets on our
Datasets page
:::

## Running tasks

Once you have linked a dataset you are ready to run the project's associated
task(s) by clicking the `Run task` button within the `Task runs` tab. Before
running, you must first set any parameters required to run the task. These will
vary based on the task.

Task completion times will depend on the type and size of the dataset, the
complexity of the task, and available compute resources on your machine.

Once the task is complete, any results will appear within the task run.

product-task-complete-min.png

## Interpreting results

The output generated from a successful task run will vary based on the algorithm
used in the task. This could take the form of a PDF report, CSV file or other
formats depending on the configuration of the task. For more details on how to
interpret results, please reach out to your project contact or
support@bitfount.com.

## Next steps

You can now go ahead and join, or create, your first project. Alternatively,
get up and running more quickly with one of our
Demo projects.


## getting-started/platform-overview/tasks.md

# Tasks

A Bitfount task is the _brain_ of a project. It specifies the algorithm(s) that
will run on any dataset linked to the project which can include the use of AI
models as well as other data science operations.

## Selecting a task

Available tasks can be viewed when clicking the `Add task` button within a
project, or by navigating to the `Tasks` tab. Bitfount hosts off-the-shelf tasks
that are provisioned in our demo projects; other users can also create their own
tasks for use in projects.

task-selection.png

## Inspecting a task

If you have been invited to join a project, a task will already have been added
by the project owner. Before joining, you will be able to review the task
configuration by clicking on the task card in the project. This will show
details about the protocols, algorithms, and models that will run on your
dataset.

product-task-details-min.png

## Running a task

Tasks are run within projects to generate insights from your datasets.

### How to run a task

1. Navigate to the project and click `New task run`.
2. Link a dataset that is compatible with the project's task.
3. Add or adjust any required task parameters.
4. Click `Run task` to begin processing.

:::note
Task completion time depends on dataset size, task complexity, and available computing resources.
:::

### Viewing results

Once the task run is complete, results can be accessed via the task run.
Depending on the task, output formats may include CSV files, reports, or other
structured data formats. For guidance on interpreting results, refer to your
project lead or contact the
Bitfount support team.

:::info
Task results are only accessible to the owner of the dataset and remain completely private. Bitfount does not have access to any results.
:::

## Creating a task

If you're looking to build tasks and provision them on the Bitfount platform please see the documentation here.

