# Bitfount Documentation


## demo-projects.md

# Demo projects

Get started quickly with pre-configured demo projects designed by Bitfount to
help you explore the platform with minimal setup.

![Demo projects](demo-projects.png)

## Fine-tune & run inference with RETFound

RETFound, developed by researchers
at Moorfields Eye Hospital and University College London, is the first
foundation model trained on retinal images. Learn how to fine-tune RETFound to
classify images relevant to your specific research needs. This demo walks you
through the full process—from fine-tuning the model to running inference on new
datasets.

:::info
We also offer Python tutorials for our SDK that run on Google Colab.
:::

## Why use RETFound?

Previously, training AI to analyse retinal images required building separate
models from scratch for each disease—a time-consuming, expensive, and
data-intensive process.

With RETFound, you can start with a pre-trained model, fine-tune it for a
specific disease or classification task, and train on fewer labeled images using
just a single GPU.

This means faster, more efficient AI model development, unlocking new
possibilities for analysing retinal diseases.

![Demo project task run](product-demo-task-run-min.png)

## Getting started

In this tutorial you will learn how to:

1. Connect a training dataset of retinal images.
2. Fine-tune RETFound for a specific classification task.
3. Run inference on a new dataset using your fine-tuned model.

By the end, you will be able to adapt RETFound to your research needs and
generate insights from retinal images with ease.

### Step 1: Connect a training dataset

The training dataset will need to consist of retinal images grouped into
categories (classes) that the fine-tuned model will learn to recognise.

:::tip
**Connecting OCTs?** Make sure your dataset consists of individual B-scan images rather than full volumetric scans as RETFound does not classify volumetric scans.
:::

#### Organising the dataset

Before connecting the dataset, arrange your image files into the following
folder structure:

📂 **Dataset folder** (top-level folder you will connect to Bitfount)\
📂 **Data split folders** (separate folders for the train, validation, and test splits, e.g. 60–80%
training, 10–20% validation, 10–20% test)\
📂 **Class label folders** (subfolders within each split representing different diseases or severity levels)

_Example dataset folder structure with training, validation, and test splits_

#### Connecting the dataset

1. Connect the dataset from the Datasets page in Bitfount, or connect a new
   dataset directly when linking a dataset in the demo project.
2. Choose the folder that contains your images.
3. Check the option `Use folder names and structure for training tasks`.
4. Connect the dataset.

:::note
Folder-inferred data splits and class labels are only supported for DICOM or Heidelberg formats. Contact our support team if your data is in another format.
:::

### Step 2: Fine-tune the RETFound model

Fine-tuning adapts RETFound to your specific dataset, optimising its performance
for your research.

#### Setting up the fine-tuning task

1. Join the RETFound fine-tuning demo project.
2. Link your training dataset to the project.
3. Select the relevant RETFound model. Ensure the model version matches your
   dataset (OCT or Color Fundus).
4. Set task parameters:

| Parameter         | Description                                                                                                                                                                                                    |
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Learning rate** | Controls how quickly the model learns. A lower value means slower but steadier learning, while a higher value speeds things up but can make learning less reliable.                                            |
| **Epochs**        | The number of times the model goes through the entire dataset. More epochs can improve accuracy but also increase training time and risk overfitting. **The RETFound paper suggests starting with 50 epochs**. |
| **Labels**        | Enter the class labels you defined in your training dataset folder structure. Choose conditions you have sample data for, or test with any labeled public dataset.                                             |
| **Batch size**    | The number of samples processed before updating the model. Larger batches can speed up training but require more memory.                                                                                       |
| **Image column**  | The dataset column that contains image data for model training and fine-tuning. The default is `Pixel Data 0` unless configured differently.                                                                   |
| **Target column** | The dataset column that contains the image labels. The default is `BITFOUNT-INFERRED-LABEL` unless specified otherwise.                                                                                        |

5. Run the task. Task processing time will depend on a number of factors
   including the size of data you connected, the batch size, the number of
   epochs, your machine's processing capabilities and more.
6. Once the task completes, review the results by navigating to the task run.
   The output includes a CSV file summarising the learning process, and three
   checkpoint files, including the `model_best.pth` checkpoint file, which
   represents the highest-performing fine-tuned model. The `model_best.pth`
   checkpoint file can be used in subsequent projects to obtain predictions on
   unlabelled images.

![Fine-tuning task results](RETFound-task-results-ft.png)

### Step 3: Run inference on new data

Now that you have fine-tuned RETFound, it's time to test it on unlabelled
images!

#### Preparing the test dataset

1. Curate a folder of new images (no need to add class labels).
2. Connect the dataset to Bitfount.

:::warning
**Do not** check the option `Use folder names and structure for training tasks`.
:::

#### Running inference

Inference is the process of using your fine-tuned model to analyse a new
dataset. During this step, the model will classify each image based on the
categories it was trained on, generating predictions as output.

1. Join the RETFound inference demo project (_Classify retinal images using
   RETFound and a local checkpoint file_).
2. Link the test dataset you prepared.
3. Set task parameters:

| Parameter           | Description                                                                |
| ------------------- | -------------------------------------------------------------------------- |
| **Model version**   | Ensure it matches your dataset type (e.g. Color Fundus or OCT).            |
| **Class outputs**   | Use the same labels defined during fine-tuning.                            |
| **Checkpoint file** | Select the `model_best.pth` checkpoint file from your fine-tuning results. |
| **Image column**    | Defaults to `Pixel Data 0`, unless configured differently.                 |

4. Run the task.

### Interpreting predictions

The task results are provided in a CSV file for easy review. This file includes
metadata about your input images, but the most important columns are:

- First column: The file path of each image.
- Last columns: The predicted probabilities for each class (the number of these
  columns depends on the classes defined during fine-tuning).

Each score represents the model's confidence that an image belongs to a specific
class. The values for all classes will sum to 1, with higher numbers indicating
greater confidence in the model's prediction.
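As a sketch, the predicted class for each row can be read off by taking the highest-probability column. The column names and values below are illustrative; your CSV will use the class labels you defined during fine-tuning:

```python
import csv
import io

# Hypothetical inference results: the first column is the image path,
# the last columns hold one probability per class.
csv_text = """file_path,DRUSEN,NORMAL
scan_001.dcm,0.91,0.09
scan_002.dcm,0.12,0.88
"""

class_columns = ["DRUSEN", "NORMAL"]

predictions = []
for row in csv.DictReader(io.StringIO(csv_text)):
    scores = {label: float(row[label]) for label in class_columns}
    # Probabilities across all classes sum to 1; the highest is the prediction.
    best = max(scores, key=scores.get)
    predictions.append((row["file_path"], best, scores[best]))

for path, label, confidence in predictions:
    print(f"{path}: {label} ({confidence:.0%})")
```

Here `scan_001.dcm` would be reported as `DRUSEN` with 91% confidence.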

![Inference task results](RETFound-task-results-inference.png)

## FAQs

**What is a foundation model?**\
A model trained on a broad dataset (over 1 million retinal images) that can be fine-tuned for specific tasks.

**What is fine-tuning?**\
Training an existing foundation model on a specific dataset to specialise in a particular task.

**What are classes?**\
Categories a model classifies images into (e.g. 'Diabetic Retinopathy' vs. 'Normal').

**How many images do I need?**\
Start with at least 100 images per class, though more data improves performance.

**What kind of images can be used?**\
There are two versions of RETFound: one trained on colour fundus images and one trained on OCT images. Note that the model does not classify volumetric scans, so OCT data must be supplied as individual B-scans.


## faqs.md

# FAQs

**What does it mean when the “Run task” button is greyed out?**

The **Run task** button is disabled when the selected dataset cannot accept a new task. This can occur for the following reasons:

1. **The dataset is not connected to the EHR**

   This project requires an active EHR connection for every dataset. If the site has not logged into their EHR through the Bitfount app, the task cannot be started.

   → Please contact the site and ask them to log in to their EHR instance within Bitfount.

2. **The dataset is currently at capacity**

   The dataset is already processing another task and cannot accept a second one.

   → You will need to wait until the current task completes before trying again.

3. **The dataset is offline**

   The dataset is not online and therefore cannot receive a task.

   → Contact the dataset owner and request that they bring the dataset back online.

**The dataset I want to run a task on is marked in red as offline. What should I do?**

If a dataset is marked **offline**, it means Bitfount cannot connect to the dataset at the site. You will need to contact the site that owns the dataset. Common reasons for an offline status include:

- The dataset owner has logged out of the Bitfount application.
- The Bitfount laptop is powered off, or the Bitfount OS user is logged out.
- The imaging drive is unmounted locally at the site and can no longer be accessed.

  → The site’s IT Support team will need to remount the drive.

- The Bitfount laptop has lost the read permissions required to access the imaging drive.

  → The site’s IT Support team will need to restore these credentials.

- The site’s internet connection is unstable, preventing Bitfount from retrieving the model from the Bitfount Hub and preparing it for local analysis.

  → The site should check and restore their internet connection.

**Can I run multiple task runs at a time?**

At present, Bitfount does not support running multiple task runs in parallel from a single modeller instance. This limitation applies whether you are connecting to **remote datasets** or running a **local dataset** on your own machine. In both cases, one modeller can initiate a task for **only one dataset at a time**, so task runs must be started sequentially.

The same limitation applies on the dataset side: a **single remote machine** hosting Bitfount can process **only one dataset task run at a time**, even if multiple datasets are configured on that machine.

If you need to run multiple task runs simultaneously across different sites, the current workaround is to use **separate Bitfount instances on separate devices**. Each device can initiate a task toward a different remote dataset, allowing those runs to proceed in parallel.

```mermaid
---
title: Bitfount - Current Task Run Architecture
---

flowchart LR
    subgraph "Local Modeller Instances"
        direction TB
        UA[User A  Bitfount Modeller]
        UB[User B  Bitfount Modeller]
    end

    subgraph "Remote Datasets (Pods)"
        direction TB
        S1[Site 1 Pod  Remote Dataset]
        S2[Site 2 Pod  Remote Dataset]
    end

    UA -->|Task Run| S1
    UB -->|Task Run| S2
    UA -.->|One modeller to one dataset at a time| S2

    style UA fill:#a6e7ff,stroke:#003366,stroke-width:2
    style UB fill:#a6e7ff,stroke:#003366,stroke-width:2
    style S1 fill:#a6e7ff,stroke:#003366,stroke-width:2
    style S2 fill:#a6e7ff,stroke:#003366,stroke-width:2

    linkStyle 0 stroke:#1ca3ff,stroke-width:3,color:#1ca3ff
    linkStyle 1 stroke:#1ca3ff,stroke-width:3,color:#1ca3ff
    linkStyle 2 stroke:#f36f21,stroke-width:2,color:#f36f21,stroke-dasharray:5 5
```

_The dotted line indicates that multiple task runs are not supported concurrently._

**It’s been a long time since I kicked off the task and it is still running. When will it finish?**

Task run times in Bitfount can vary widely depending on several factors, including:

- The dataset owner’s **local network speed**
- The dataset owner’s **internet connection quality**
- The **hardware specifications** of the machine hosting the dataset
- The **size and number of files** being processed
- The **complexity** of the task or analysis being performed

Because of the variety of these conditions, Bitfount cannot predict the exact duration of a task. Once the task completes, the Collaborator who initiated it will automatically receive an **email notification** confirming that the task has finished and is ready for review.

**How will I know when the task is complete?**

The Collaborator who initiates a task run will be notified by an email alert when the task has either completed or aborted.

**How do I pause or quit a task run?**

Bitfount does not currently support pausing an active task. If a task needs to be stopped, it must be **terminated by the dataset owner**, as Bitfount’s execution flow is designed so that control originates from the machine hosting the data.

To end a task run, the dataset owner should use the **Windows Task Manager** on the machine where the dataset is connected. From there, they must manually select **“End task”** for both the **Bitfount application** and the **Bitfount Orchestrator** processes. Ending both ensures that the environment restarts cleanly the next time a task is run.

Once terminated, the task initiator will see a notification indicating that the task has aborted. There is no risk to data or system stability when ending these processes.

If the dataset owner is unavailable or the task does not terminate as expected, please contact **support@bitfount.com** for assistance.

**How do I know that the dataset owner has applied the correct filters to the dataset?**

Bitfount is designed to give dataset owners full control over their data and to support strong Information Governance practices. As part of this, **only the dataset owner can view or modify the dataset-level filters applied at connection time**. These filters are not visible to collaborators, and any changes made by the dataset owner will apply **only to future task runs**, not tasks that have already been completed.

If you believe the current filters need to be updated—for example, to include additional data or adjust the criteria being queried—you will need to **contact the dataset owner**. They can reconnect the dataset using the same data source but with updated filters applied. This ensures that any modifications are governed and explicitly approved by the data custodian.

Filters that may be applied include:

- Modality - derived from the image headers
- Date of birth - derived from the image headers
- Date created - derived from the file metadata
- Date modified - derived from the file metadata
- B-scan Min and Max - derived from the image headers
- File size Min and Max - derived from the file metadata
- Filter files missing required fields for calculations - skips files that lack values in the image header fields needed to derive a calculated value.

**What is the difference between dataset-level filters and task-level filters?**

Bitfount supports two types of data filtering:

**Dataset-level filters** are configured by the **dataset owner** when connecting a dataset. These filters:

- Are set at connection time and apply to all task runs on that dataset
- Are only visible to and modifiable by the dataset owner
- Define the maximum scope of data that can be accessed
- Require reconnecting the dataset to change

**Task-level filters** are configured by the **task initiator** at runtime. These filters:

- Are specified in the task YAML under `data_structure.filter`
- Can be different for each task run
- Can be templated to allow users to configure them via the UI
- Are combined with dataset-level filters using "most restrictive wins" logic

:::info
Task-level filters can only **further restrict** the data beyond what the dataset owner has configured—they cannot expand access to data that has been filtered out at the dataset level.
:::

**Example**: A dataset owner connects their imaging data with a dataset-level filter for `modality: OCT`. A collaborator then runs a task with a task-level filter for `min-frames: 50`. The resulting data will include only OCT images with at least 50 B-scan frames.
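This "most restrictive wins" behaviour can be sketched in plain Python. The field names and filter shapes below are illustrative, not the actual Bitfount filter schema:

```python
# Hypothetical file metadata for a connected imaging dataset.
files = [
    {"path": "a.dcm", "modality": "OCT", "frames": 60},
    {"path": "b.dcm", "modality": "OCT", "frames": 30},
    {"path": "c.dcm", "modality": "CFP", "frames": 1},
]


def dataset_filter(f):
    # Set by the dataset owner at connection time: OCT only.
    return f["modality"] == "OCT"


def task_filter(f):
    # Set by the task initiator at runtime: at least 50 B-scan frames.
    return f["frames"] >= 50


# "Most restrictive wins": a file must pass BOTH filters, so the task
# can only narrow, never widen, the scope the owner configured.
visible = [f for f in files if dataset_filter(f) and task_filter(f)]
print([f["path"] for f in visible])  # only a.dcm passes both filters
```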

For more details on configuring task-level filters, see Task-level filters.

**As a dataset owner, a dataset I had previously connected with my Bitfount account is offline and listed as remote. How do I reconnect to it?**

Bitfount marks a dataset as **remote/offline** when it cannot find the local configuration file—called the **pod-config file**—that stores the details required to connect to your dataset.

The pod-config file is created on the **specific machine and OS user account** that originally connected the dataset. It is not shared across devices or operating system accounts. Bitfount uses this local file by design to ensure that your data remains on the custodian’s system, and that no connection or configuration information leaves your secure environment. This is part of Bitfount’s privacy-preserving model, which ensures that datasets are never transferred or centrally stored.

Your dataset may appear as **remote or offline** if:

- You are logged into **a different OS user account** on the same machine
- You are logged into Bitfount on **another computer**
- The original pod-config file is not accessible, was deleted, or has been moved

Because these environments do not have a copy of the pod-config file, they cannot establish the secure connection required, and Bitfount marks the dataset as remote.

**To reconnect the dataset:**

- Log in to Bitfount on the **same machine** where the dataset was originally connected
- Log in as the **same OS user** who performed the initial connection
- Then reopen Bitfount and navigate to your dataset to bring it back online

If you no longer have access to the original machine or OS account, you will need to **reconnect the dataset as new** by repeating the initial dataset connection steps.

**Can changes be made to the parameters of the demo projects available in Bitfount?**

No. Demo projects are fixed and cannot be customised.

If your use case requires different settings or functionality, we’d be happy to discuss options. Please contact us at **support@bitfount.com**.

**How should my data be structured for model fine-tuning projects?**

Your dataset must follow a standard machine-learning folder structure with three top-level folders:

- **train/**
- **validation/**
- **test/**

Inside each folder, images must be placed into separate subfolders representing the **labels** you want the model to learn (e.g., `train/DRUSEN/…`, `train/NORMAL/…`).

Each subfolder should contain the relevant images.
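As a sanity check before connecting, a short script can verify this layout. The helper below is illustrative and not part of the Bitfount SDK:

```python
import tempfile
from pathlib import Path


def check_structure(root: Path) -> list[str]:
    """Report problems with a train/validation/test fine-tuning layout.

    Illustrative helper, not part of the Bitfount SDK.
    """
    problems = []
    for split in ("train", "validation", "test"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing split folder: {split}/")
            continue
        labels = [d.name for d in split_dir.iterdir() if d.is_dir()]
        if not labels:
            problems.append(f"{split}/ has no class label subfolders")
    return problems


# Build a minimal valid layout: <root>/train/DRUSEN/, <root>/train/NORMAL/, ...
root = Path(tempfile.mkdtemp())
for split in ("train", "validation", "test"):
    for label in ("DRUSEN", "NORMAL"):
        (root / split / label).mkdir(parents=True)

print(check_structure(root))  # an empty list means the layout is valid
```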

**When connecting a dataset, what variables are mandatory?**

The following fields must be provided when creating a dataset connection:

- **Dataset name**
- **DICOM folder location** — the file path to the directory containing your imaging data
- **Use folder names and structure for training tasks** — this option must be enabled so Bitfount can correctly infer labels and structure

**I want to try your demo projects but I'm unsure about connecting my own data first. Where can I find data to trial a project?**

All demo projects include a **small, built-in sample dataset** so you can explore Bitfount without connecting your own data.

For peace of mind, remember that **no imaging data ever leaves your institution**, and all analysis occurs locally on the device connected to Bitfount.

**When setting up a fine-tuning task run, what are the configurable task parameters for?**

These parameters allow you to control how the model trains. A quick overview:

- **Learning rate** — How quickly the model updates during training.
  - Lower = slower but more stable training
  - Higher = faster but riskier
- **Epochs** — Number of full passes through the dataset.
  - Increasing epochs can improve learning but may cause overfitting.
- **Labels** — The set of classes the model should learn from your folder structure.
- **Batch size** — How many images are processed at once.
  - Larger batches improve training speed but require more memory.
- **Image column** — The dataset column containing image references (for dataset-driven training workflows).
- **Target column** — The column containing labels.

  If your folders follow a structure like `test/class_1/image_1.jpg`, selecting **BITFOUNT_INFERRED_LABEL** will automatically extract labels from folder names.
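The folder-name convention can be illustrated with a small sketch. Bitfount performs this inference internally; the helper below is only a hypothetical illustration of the rule that an image's parent folder is its label:

```python
from pathlib import PurePosixPath


def inferred_label(file_path: str) -> str:
    """Return the class label implied by a path like test/class_1/image_1.jpg.

    Illustrative only: the image's parent folder name is taken as its label.
    """
    return PurePosixPath(file_path).parent.name


print(inferred_label("test/class_1/image_1.jpg"))    # class_1
print(inferred_label("train/DRUSEN/oct_0042.jpeg"))  # DRUSEN
```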

**What is the suggested Bitfount workflow for using the fine-tuning and classification demo projects?**

The best workflow depends on whether you want to try **classification**, **fine-tuning**, or a full **end-to-end pipeline**:

**1. If you want to try classification only:**

Start with:

- **Classify retinal images using RETFound fine tuned on Kermany**

**2. If you want to try fine-tuning a retinal model:**

Choose a demo based on your imaging modality:

- **Colour fundus** → _Fine tune the RETFound retinal colour fundus foundation model_
- **OCT** → _Fine tune the RETFound retinal OCT foundation model_

**3. If you want the end-to-end workflow (fine-tune → classify):**

1. Fine-tune using one of the RETFound fine-tuning demos
2. Then classify using:
   - **Classify retinal images using RETFound and a local checkpoint file**

Your fine-tuning task must generate a checkpoint named **`model_best`** to be used in the classification project.

**Available demo projects:**

1. **Fine tune the RETFound retinal colour fundus foundation model**
   - Fine-tune Moorfields Eye Hospital’s RETFound model on colour fundus images.
2. **Fine tune the RETFound retinal OCT foundation model**
   - Fine-tune the RETFound model on OCT images.
3. **Classify retinal images using RETFound fine tuned on Kermany**
   - Classify OCT images into: `CNV`, `DME`, `DRUSEN`, `NORMAL`.
4. **Classify retinal images using RETFound and a local checkpoint file**
   - Classify fundus or OCT images using your own fine-tuned model checkpoint (`model_best`).


## security.md

# Security

## Firewalls

One of the fundamental architectural choices of the Bitfount platform, which differs from many other federated architectures, is that Bitfount follows a messaging architecture. This means that services connecting to Bitfount only make outgoing HTTP connections and can happily sit behind a firewall.

## Encryption

All data entering or leaving Bitfount uses TLS/HTTPS, and all messages are 256-bit AES end-to-end encrypted. This removes any requirement to trust Bitfount with respect to raw data or task results.

## Your data

Data accessed via Bitfount can be hosted locally or in cloud infrastructure. Data never leaves its location and is not accessible to Bitfount unless access is granted.

The only information shared with Bitfount is metadata. More information on the metadata Bitfount has access to can be found in our privacy policy.

## Bitfount's own security

Bitfount takes security very seriously. Security is a core part of what our product aims to help with! The following are some of the things we are doing to make sure our own code and infrastructure are secure:

- Automated security tests on all our code
- Regular penetration tests on all our services
- Monitoring tools to try to catch intrusions and incidents
- Segregated production environment with limited human access
- Various process-level security policies, including a secure development policy
- ISO 27001 certified, HIPAA compliant, GDPR compliant, UK Cyber Essentials Plus certified, and NHS Data Security and Protection Toolkit (DSPT) compliant
- Access to Bitfount is protected by strong authentication and authorization controls, with user passwords not being held by Bitfount
- Bitfount's authentication (Auth0) and infrastructure (AWS) providers hold industry-leading security certifications such as SOC 2 Type II, ISO 27018 and ISO 27001


## for-data-scientists/models-and-datasets/connecting-datasets.md

# Connecting datasets

This page covers how to connect datasets to a Pod using the Bitfount SDK. Any datasets connected using the SDK will also be visible in the Bitfount Desktop application and Hub but they won't be configurable. Under the hood, a dataset is powered by a **datasource**, which is the object that represents the type of data being connected to a Pod and encapsulates the specific logic required for loading and processing that kind of data.

:::info[Reminder]
Recall that datasets are part of a Pod, which is the entity that contains the datasets and enables them to be used in tasks.
:::

## Available Datasources

Bitfount supports connecting various types of datasets to a Pod, organised by domain. For detailed API documentation on all datasource classes, see the Datasources API reference.

:::info

**Datasources** are the objects that represent the type of data being connected to a Pod and encapsulate the specific logic required for loading and processing that kind of data. Learn more about how they work here.

:::

### General Datasets

- **CSV files** (`CSVSource`) - Structured tabular data from CSV (Comma-Separated Values) files. Supports local file paths, URLs, and custom `read_csv` options for flexible data loading.
- **Image folders** (`ImageSource`) - Collections of image files in common formats such as JPG and PNG. Images are loaded from a directory and can optionally infer class labels from the folder structure.

### Healthcare Datasets

- **DICOM files** (`DICOMSource`) - Medical imaging data in DICOM (Digital Imaging and Communications in Medicine) format, the international standard for transmitting, storing, and sharing medical images.
- **NIfTI files** (`NIFTISource`) - NIfTI (Neuroimaging Informatics Technology Initiative) is an open file format commonly used to store brain imaging data obtained using Magnetic Resonance Imaging (MRI) methods. The file format supports `.nii` and compressed `.nii.gz` extensions.
- **OMOP databases** (`OMOPSource`) - The Observational Medical Outcomes Partnership (OMOP) Common Data Model is a standardised schema for organising observational health data. Supports versions v3.0, v5.3, and v5.4.
- **InterMine databases** (`InterMineSource`) - InterMine is an open-source biological data warehouse developed by the University of Cambridge, providing integrated access to genomic and proteomic data.

### Ophthalmic Datasets

- **Heidelberg Eye Explorer data** (`HeidelbergSource`) - Retinal imaging data from Heidelberg Engineering devices, loaded from `.sdb` (Spectralis Database) files.
- **Topcon data** (`TopconSource`) - Ophthalmic imaging from Topcon equipment, supporting various OCT and fundus imaging formats.
- **DICOM Ophthalmology data** (`DICOMOphthalmologySource`) - Ophthalmic datasets in DICOM format, including data from Zeiss and other manufacturers, with support for OCT and SLO image extraction.

For specific API documentation on ophthalmic datasources, see the Ophthalmology Datasources API reference.

## Connecting a dataset using the SDK

See the tutorials on Running a Pod for examples of how to connect CSV and Image folder datasets using the SDK.
A DICOM dataset can be connected to a Pod in much the same way, simply using the `DICOMSource` class instead.

:::tip

Multiple datasets can be connected to a single Pod using the SDK by passing a list of `DatasourceContainerConfig` objects to the `datasources` argument of the `Pod` class.

:::

### Pod configuration objects

- `PodDetailsConfig` provides human-readable metadata for a dataset (for example `display_name` and `description`) for display in the Bitfount Desktop application and Hub.
- `PodDataConfig` carries the operational options required to load data, such as `datasource_args` (for example `path`, connection strings, or ophthalmology flags), optional `force_stypes` to give control over column semantic types, and `file_system_filters` to filter files based on various criteria.

### Example: Connecting a DICOM dataset using the SDK

This example shows how to connect a DICOM dataset to a Pod using the SDK. It also demonstrates how to filter files based on various criteria, such as file extension, file creation date, and file size.

```python showLineNumbers title="run_dicom_pod.py"
import logging

from bitfount import (
    DICOMSource,
    Pod,
    setup_loggers,
)
from bitfount.data.datasources.types import Date
from bitfount.runners.config_schemas import (
    DatasourceContainerConfig,
    FileSystemFilterConfig,
    PodDataConfig,
    PodDetailsConfig,
)

loggers = setup_loggers([logging.getLogger("bitfount")])

if __name__ == "__main__":
    datasource_details = PodDetailsConfig(
        display_name="My DICOM Dataset",
        description="This Pod contains data from my DICOM dataset",
    )
    datasource_args = {"path": "/path/to/dicom/dataset"}
    datasource = DICOMSource(**datasource_args)
    data_config = PodDataConfig(
        datasource_args=datasource_args,
        # DICOM frames are identified by the prefix "Pixel Data"
        force_stypes={"image_prefix": ["Pixel Data"]},
        file_system_filters=FileSystemFilterConfig(
            file_extension="dcm",
            file_creation_min_date=Date(2025, 1, 1),
            min_file_size=1.0,  # 1 MB
        ),
    )

    pod = Pod(
        name="my-pod",
        datasources=[
            DatasourceContainerConfig(
                name="my-dicom-dataset",
                datasource=datasource,
                datasource_details=datasource_details,
                data_config=data_config,
            )
        ],
    )
    pod.start()

```


## for-data-scientists/models-and-datasets/installing-the-sdk.md

# Installing the SDK

Most Bitfount functionality can be achieved using the Bitfount Desktop application. However, for more complex use cases, the SDK provides a more flexible and powerful way to interact with Bitfount. Anything that can be done in Bitfount Desktop can also be done using the SDK. In this guide, we cover how to install the SDK and use it to connect datasets, test machine learning models, and run federated tasks.

The Bitfount SDK is published on PyPI and can be installed simply using pip.

```bash
pip install bitfount
```

The SDK requires a Python 3.12 environment and can be installed on macOS (Apple Silicon only), Linux, and Windows.

If running on Windows or Linux with an NVIDIA GPU and CUDA installed, you will need to modify the command to include the appropriate PyTorch wheel index for your CUDA version. For instance, if you have CUDA 12.6 installed, you would use the following command:

```bash
pip install bitfount -f https://download.pytorch.org/whl/cu126
```

Alternatively, find the appropriate command for your OS and CUDA version here and install that in your environment first before installing the `bitfount` package as normal.


## for-data-scientists/models-and-datasets/bringing-your-models-to-bitfount/for-inference-or-evaluation.md

# Inference and Evaluation

This page covers how to bring your existing PyTorch or ONNX models onto the Bitfount platform specifically for inference or evaluation tasks. These are tasks that specifically use the `bitfount.ModelInference` or `bitfount.ModelEvaluation` algorithms. Models that do not perform any training or fine-tuning can be substantially simpler to onboard to Bitfount.

Bitfount supports two types of models:

- **PyTorch models**
- **ONNX models**

Regardless of the type of model you are bringing onto the Bitfount platform, the only thing you need to do is make sure that your model implements the `InferrableModelProtocol` interface for inference or the `EvaluableModelProtocol` interface for evaluation. Both protocols require the implementation of `initialise_model`, which is used to initialise the model, and `deserialize`, which is used to load the model parameters from a file. You will then additionally need to implement either the `predict` method for inference or the `evaluate` method for evaluation (or both if you want to support both).

:::tip

To validate that your model has implemented the appropriate interface correctly, you can use an `isinstance` check:

```python
assert isinstance(MyModel(), InferrableModelProtocol)
```

:::
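To see what such an interface looks like in practice, here is an illustrative skeleton using `typing.Protocol`. The method names come from the description above, but the signatures are assumptions; the real protocols live in the Bitfount SDK and may differ:

```python
from typing import Any, Protocol, runtime_checkable

# Illustrative stand-in for Bitfount's InferrableModelProtocol; the exact
# signatures in the SDK may differ.
@runtime_checkable
class InferrableModelProtocol(Protocol):
    def initialise_model(self, datasource: Any) -> None: ...
    def deserialize(self, path: str) -> None: ...
    def predict(self, datasource: Any) -> Any: ...

class MyModel:
    """Toy model providing the three required methods."""

    def initialise_model(self, datasource: Any) -> None:
        self._ready = True

    def deserialize(self, path: str) -> None:
        pass  # would load model parameters from `path`

    def predict(self, datasource: Any) -> list:
        return []

# runtime_checkable protocols verify method presence at runtime.
assert isinstance(MyModel(), InferrableModelProtocol)
```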

## PyTorch

To make migration easier we've provided two base classes for PyTorch inference models:

- `PytorchInferenceModel`
- `PytorchLightningInferenceModel`

These base classes do not _need_ to be used but may make migration easier as they provide a lot of the functionality you need to get started. For examples of how to implement a PyTorch inference model using these base classes, see the PyTorch inference model tutorial.

## ONNX

Open Neural Network Exchange (ONNX) is a popular open standard format for representing machine learning models across many popular frameworks such as PyTorch, TensorFlow, and scikit-learn. Models are typically not written in pure ONNX but rather in a framework-specific language such as PyTorch or TensorFlow and then converted to ONNX. We recommend the same approach for bringing your model to Bitfount in ONNX format.

### Converting your model to ONNX

For more information on how to convert your model to ONNX, see the relevant documentation for your framework or the ONNX documentation. PyTorch, for instance, has built-in support for converting models to ONNX [documentation], whereas TensorFlow requires a dedicated library for converting models to ONNX [documentation].

A simple example of converting a PyTorch model to ONNX is shown below:

```python showLineNumbers title="convert_to_onnx.py"
import torch
from torch import nn

class BinaryClassificationModel(nn.Module):
    """Simple binary classification model with 2 input features."""

    def __init__(self, in_features: int = 2) -> None:
        """Initialise the binary classification model.

        Args:
            in_features: Number of input features (default: 2 for A and B).
        """
        super().__init__()
        self.linear = nn.Linear(in_features, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass through the model.

        Args:
            x: Input tensor of shape (batch_size, in_features).

        Returns:
            Output tensor of shape (batch_size, 1) with sigmoid activation.
        """
        x = self.linear(x)
        x = self.sigmoid(x)
        return x

if __name__ == "__main__":
    model = BinaryClassificationModel()
    model.eval()

    # Create dummy input for tracing
    dummy_input = torch.randn(1, 2)  # 2 features

    # Export to ONNX
    model_path = "binary_classification.onnx"
    torch.onnx.export(
        model,
        (dummy_input,),
        str(model_path),
        input_names=["x"],
        output_names=["y"],
        export_params=True,
        opset_version=13,
        dynamic_axes={"x": {0: "batch_size"}, "y": {0: "batch_size"}},
    )

```

### Encapsulating your model

Things work a little differently for ONNX models because ONNX serialises the entire model graph alongside the weights. This means you don't actually need to expose your underlying model code when you upload your model to Bitfount. However, you still need to encapsulate your model with the appropriate `InferrableModelProtocol` or `EvaluableModelProtocol` interface so that Bitfount is able to interact with it.

:::tip

Bitfount allows models to be uploaded either _publicly_ or _privately_. However, even a private model will display the underlying model code when access has been granted to collaborators. If you want to keep your model code private, we recommend converting your model to an ONNX model.

:::

To make things easier, we have provided a base class for ONNX inference models: `ONNXModel`. This base class implements much of the boilerplate required to get started. See the API documentation for more details on the `ONNXModel` class.

Using this class could make your ONNX model code look as simple as:

```python showLineNumbers title="binary_classification_onnx_model.py"
from bitfount.backends.onnx.models import ONNXModel

class BinaryClassificationModel(ONNXModel):
    """Binary classification model using ONNX."""

```

The code is not missing: for most use cases there is simply no need to implement or override anything unless the input data or the output values require special handling. The `ONNXModel` class implements all of the required functionality, and anything specific to your model architecture has been serialised alongside the model weights.

For more information on how to upload the model code and `.onnx` file to Bitfount, see the Uploading your model page. The `.onnx` file is the file that contains the model graph alongside the model weights.


## for-data-scientists/models-and-datasets/bringing-your-models-to-bitfount/for-training-or-fine-tuning.md

# Training and Fine-tuning

If you don't already have a model that you can use for inference or evaluation tasks, you can train a new model on Bitfount. Training typically refers to updating a model's weights from scratch on a dataset, i.e. starting with a randomly initialised model, whereas fine-tuning refers to taking a pre-trained model and updating its weights only slightly to suit your specific task or dataset. The process itself is the same in both cases, and you will ultimately end up with a model that you can use for inference or evaluation tasks.

The interface required for training or fine-tuning models is naturally more complex than the interface required for inference or evaluation tasks and is currently only supported for PyTorch Lightning models.

## Required interface

In order to train a model on Bitfount, you need to extend the `PyTorchBitfountModelv2` class. Details on this can be found in the documentation in the API Reference.

The `PyTorchBitfountModelv2` class uses the PyTorch Lightning library to provide high-level implementation options for a model in the PyTorch framework. This means you only have to implement the methods that dictate how model training should be performed.

In addition to subclassing the `PyTorchBitfountModelv2` class, you will need to implement the following methods:

- `__init__()`: how to set up the model
- `configure_optimizers()`: how optimizers should be configured in the model
- `create_model()`: how to create the model
- `forward()`: how to perform a forward pass in the model
- `_training_step()`: what one training step in the model looks like
- `_validation_step()`: what one validation step in the model looks like
- `_test_step()`: what one test step in the model looks like

### Classification models

Classification models are a very common type of model and are used to classify data into one of a number of classes. For this reason, we have provided some utilities to help you implement a classification model. These are:

- `PyTorchClassifierMixIn`: a mixin that provides helper methods and attributes for a classification model
- `get_torchvision_classification_model`: a function that creates a pre-trained classification model from the torchvision library

#### PyTorchClassifierMixIn

The `PyTorchClassifierMixIn` class requires the `multilabel` argument to be provided signifying whether a given record can belong to multiple classes. In exchange, it sets the `n_classes` attribute automatically based on the number of classes in the specified target column of the dataset and also provides a `do_output_activation` method that can be used to apply the appropriate activation function to the model's output (sigmoid or softmax) based on the number of classes and whether the problem is a multi-label problem. You may find many examples using this mixin class (such as the example below) but it is not required for your model to use it. **If you do use this mixin class, make sure to specify the mixin class _first_ in the model's inheritance hierarchy:**

```python
class MyClassificationModel(PyTorchClassifierMixIn, PyTorchBitfountModelv2):
    ...
```

#### get_torchvision_classification_model

The `get_torchvision_classification_model` function is a helper function that creates a pre-trained classification model from the `torchvision` library. It takes the following arguments:

- `model_name`: the name of the model to create. This can be any model supported by the torchvision library.
- `pretrained`: whether to return a pre-trained model (typically trained on ImageNet) or a randomly initialised model
- `num_classes`: the number of classes in the model which determines the output size of the model

It can be used directly in your model's `create_model` method to return a pre-trained classification model to be used for fine-tuning.

```python showLineNumbers
from bitfount.backends.pytorch.models.nn import get_torchvision_classification_model

class MyClassificationModel(PyTorchClassifierMixIn, PyTorchBitfountModelv2):
    ...

    def create_model(self) -> nn.Module:
        """Creates the model to use."""
        model = get_torchvision_classification_model(
            model_name="resnet18", pretrained=True, num_classes=self.n_classes
        )
        return model
```

## Full example

This example shows a simple logistic regression model that can be used for binary or multi-class classification tasks.

```python showLineNumbers title="logistic_regression_model.py"
from __future__ import annotations

from typing import Any

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchmetrics.functional import accuracy

from bitfount.backends.pytorch import PyTorchBitfountModelv2
from bitfount.backends.pytorch.models.base_models import (
    _TEST_STEP_OUTPUT,
    _TRAIN_STEP_OUTPUT,
    PyTorchClassifierMixIn,
    _OptimizerType,
)
from bitfount.types import _StrAnyDict

class LogisticRegressionModel(PyTorchClassifierMixIn, PyTorchBitfountModelv2):
    """A Logistic/Softmax Regression model built using PyTorch Lightning.

    This implements a single linear layer which acts as a Logistic Regression
    (for binary) or Softmax Regression (for multi-class) classifier.
    """

    def __init__(
        self, learning_rate: float = 0.0001, weight_decay: float = 0.0, **kwargs: Any
    ) -> None:
        """Initializes the LogisticRegressionModel.

        Args:
            learning_rate: The step size for the optimizer. Controls how much to
                change the model in response to the estimated error each time the
                model weights are updated.
            weight_decay: L2 regularization penalty. Adds a term to the loss function
                proportional to the sum of the squared weights, preventing the model
                from becoming too complex (overfitting).
            **kwargs: Additional arguments passed to the base PyTorchBitfountModelv2.
                This includes 'steps' (training iterations per round) or 'epochs'.
        """
        super().__init__(**kwargs)
        self.learning_rate = learning_rate
        self.weight_decay = weight_decay

    def create_model(self) -> nn.Module:
        """Creates the model architecture.

        Logistic Regression is essentially a single Linear layer mapping
        input features to class logits. The activation (Sigmoid/Softmax) is
        handled by the loss function (CrossEntropyLoss) during training.
        """
        # Single linear layer mapping input features -> output classes
        return nn.Linear(self.datastructure.input_size, self.n_classes)

    def forward(self, x: Any) -> Any:
        """Defines the operations we want to use for prediction."""
        x, sup = x
        assert self._model is not None
        # Pass through the linear layer
        x = self._model(x.float())
        return x

    def _training_step(self, batch: Any, batch_idx: int) -> _TRAIN_STEP_OUTPUT:
        """Computes and returns the training loss for a batch of data."""
        if self.skip_training_batch(batch_idx):
            return None  # type: ignore[return-value] # reason: Allow None to skip a batch. # noqa: E501
        x, y = batch
        y_hat = self(x)
        # CrossEntropyLoss in PyTorch combines LogSoftmax and NLLLoss.
        # We squeeze y to ensure it is 1D (N,) as expected by CrossEntropyLoss for
        # class indices.
        loss = F.cross_entropy(y_hat, y.squeeze())
        return loss

    def _validation_step(self, batch: Any, batch_idx: int) -> _StrAnyDict:
        """Operates on a single batch of data from the validation set."""
        x, y = batch
        preds = self(x)
        # Ensure y is squeezed for loss calculation
        loss = F.cross_entropy(preds, y.squeeze())

        # Apply softmax to get probabilities for accuracy calculation
        preds_prob = F.softmax(preds, dim=1)

        acc = accuracy(
            preds_prob, y.squeeze(), task="multiclass", num_classes=self.n_classes
        )

        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", acc, prog_bar=True)
        return {
            "val_loss": loss,
            "val_acc": acc,
        }

    def _test_step(self, batch: Any, batch_idx: int) -> _TEST_STEP_OUTPUT:
        """Operates on a single batch of data from the test set."""
        x, y = batch
        preds = self(x)
        preds = F.softmax(preds, dim=1)

        return {"predictions": preds, "targets": y}

    def configure_optimizers(self) -> _OptimizerType:
        """Configure the optimizer."""
        # Using AdamW optimizer with L2 regularization via weight_decay.
        optimizer = torch.optim.AdamW(
            self.parameters(), lr=self.learning_rate, weight_decay=self.weight_decay
        )
        return optimizer
```

## Tutorials

For more complex models, we have two tutorials that walk you through the process of training a model on Bitfount:

- Training a Custom Model:
  This tutorial walks you through the process of training a tabular classification model on CSV data
- Training a Custom Segmentation Model:
  This tutorial walks you through the process of training a segmentation model on an image dataset


## for-data-scientists/models-and-datasets/bringing-your-models-to-bitfount/testing-your-model.md

# Testing your model

Once you have migrated your model to the format required by Bitfount, you should test it to ensure it is working as expected before uploading it to the Bitfount Hub. This is particularly important for training or fine-tuning models as they are more complex than models that are only used for inference or evaluation.

The best way to validate your model is to run it against a local datasource. This bypasses the need to create a task in Bitfount, which can be cumbersome for rapid iteration.

:::tip

Validate your model before sharing it with collaborators or publishing it to a project. Lightweight checks catch most issues early and save time when running tasks on remote datasets.

:::

## Local validation

The first step to validating your model is to create a local datasource. Recall that a datasource is the object that represents the type of data being connected to a Pod and encapsulates the specific logic required for loading and processing that kind of data. A datasource connected to a Pod is called a **dataset** but in this case, we don't want to connect to a Pod just yet - we only want to use the datasource locally for testing purposes.

### Datasource

Start by choosing the appropriate datasource for your data. Bitfount supports a variety of datasources, each with its own unique features and capabilities as documented in the Connecting datasets guide. See the API reference for your chosen datasource for a list of the required and optional arguments. Most datasources take a `path` argument pointing to the location of the data.

```python
from bitfount.data import CSVSource

datasource = CSVSource(path="path/to/your/data.csv")
```

:::tip
All datasources can be imported from the `bitfount.data` namespace rather than having to import from the specific datasource module.
:::

Datasources don't make any changes to the data itself as far as transformations and pre-processing are concerned. They are simply an _iterable_ wrapper around the data which yields data in the form of a `pandas` DataFrame. Regardless of the type of data, the datasource's internal representation of the data is always a `pandas` DataFrame. Datasources have two main methods:

- `yield_data()`: Returns an iterator that yields batches of data as specified by the `partition_size` argument.
- `get_data()`: Returns a single batch of data as specified by the `data_keys` argument.

:::info

One of the core principles of how data is handled in Bitfount is that the data is never loaded into memory unless it is absolutely necessary. This is why datasources are designed to be _iterable_ - they are not designed to be loaded into memory all at once.

:::

Under the hood in Bitfount, `yield_data` is the method that is typically used to feed data to an algorithm. `get_data` is only used in certain cases with a small selection of `data_keys`. It is not advised to use `get_data` to return the entire dataset as it will be very memory intensive and may well crash the system if the dataset is too large.

```python
for batch in datasource.yield_data(partition_size=32):
    print(batch)
```

#### Image Datasources

Many users work with imaging datasets (medical or otherwise) so it is important to understand how images are handled under the hood within Bitfount datasources. When connecting a directory of image files, each file corresponds to a single row in the internal `pandas` DataFrame. The DataFrame will have a column for the raw image data as a numpy array, which is always called `Pixel Data`. It will also have a number of columns for the metadata associated with the image. For medical images, these columns can number in the hundreds and correspond to DICOM tags (or equivalent for other imaging formats). For files that contain multiple images, for instance slices of a volumetric image, the DataFrame will have a column for each slice. In this case, they are numbered sequentially starting from 0, e.g. `Pixel Data 0`, `Pixel Data 1`, etc.

Images are not loaded into memory unless absolutely necessary. To aid in this, the datasource caches the underlying dataframe in the file system _with the exception_ of the `Pixel Data` columns, which are replaced by placeholders. When calling `yield_data` or `get_data`, the datasource automatically loads and returns this cached dataframe, which does not contain the raw image data. If you need to access the raw image data, you can do so by passing the `use_cache=False` argument to the `yield_data` or `get_data` methods.

```python
from bitfount.data import DICOMSource

datasource = DICOMSource(path="path/to/your/data")
for batch in datasource.yield_data(partition_size=32, use_cache=False):
    print(batch["Pixel Data"].shape)
```

### Schema

A schema is a serialisable representation of the data in a datasource. Schemas are automatically generated for each dataset when it is connected to a Pod and displayed on the hub. They contain information about the columns in the dataframe, the data types of the columns, the semantic types of the columns, and optional descriptions of the columns. Models require a schema to be provided when they are instantiated which must match the schema of the dataset that will be fed to the model for training, evaluation or inference.

A partial or full schema can be generated for a datasource using the `BitfountSchema` class. A partial schema is generated by default when a datasource is connected to a Pod based on the first batch of data. A full schema generation process is triggered in the background once the partial schema is generated. The full schema generation can take some time to complete depending on the size of the dataset. If your dataset is quite homogeneous, the full schema generation may not be necessary.

```python
from bitfount.data import BitfountSchema

schema = BitfountSchema(name="your-dataset-name")
# Generate a partial schema
schema.generate_partial_schema(datasource=datasource)
# Or generate a full schema
schema.generate_full_schema(datasource=datasource)
```

The schema can then be serialised and visualised using methods such as `dumps` and `to_json`.

```python
print(schema.dumps())
```

The data types of the columns in the schema are inferred from the data in the datasource and are out of your control. The semantic types of the columns are also inferred from the data types but these often require knowledge of the data or domain to be accurate. These can therefore be overridden by passing the `force_stypes` argument to the `generate_full_schema` method. The available semantic types are:

- `categorical`: For columns where the values (strings or integers) are categorical in nature. For instance, if the column contains different integer values, Bitfount will interpret this as a continuous column by default so it must be overridden to `categorical` if the integers represent different categories.
- `continuous`: This is the default semantic type for all numerical columns unless overridden.
- `image`: For image data columns, such as `Pixel Data`. For CSV datasources where a column contains the path to an image file, the semantic type must be overridden to `image` if the images are to be treated as such.
- `text`: For text data. By default, all string columns are treated as text columns unless overridden to a different semantic type such as `categorical` or `image`.
- `image_prefix`: A utility semantic type where there are multiple image columns with a common prefix to avoid having to specify each column name individually.

:::tip
When connecting a dataset using the App or Hub, you can override the semantic types of the columns by editing the schema in the UI after the dataset has been connected.
:::

Certain columns may also be ignored from the schema generation process by passing the `ignore_cols` argument to the `generate_full_schema` method.

```python
schema.generate_full_schema(
    datasource=datasource,
    force_stypes={"image": ["Pixel Data"], "categorical": ["Patient's Sex"]},
    ignore_cols=["Patient ID", "Study ID", "Series ID"]
)
```

### DataStructure

We were introduced to the `DataStructure` in YAML format in the Writing tasks section. It is a core task component that defines the structure of the data that will be fed to the model. Where the schema of the datasource reflects the structure of that data, the DataStructure defines the necessary modifications to that structure in order to feed the data to the model in the way that the model expects.

Typically, the most important parts of the data structure are specifying which columns to include or exclude from the data, which columns are images and which column(s) to map to the target variable if you are doing training or fine-tuning. If your model is only used for inference, you don't need to specify a target column. A typical data structure might look like this:

```python
from bitfount.data import DataStructure

data_structure = DataStructure(
    selected_cols=["Pixel Data", "Target", "Patient's Sex", "Age"],
    image_cols=["Pixel Data"],
    target=["Target"],
)
```

For a full list of the available arguments, see the API reference.

If the data contains image columns, some basic batch transformations are also applied by default to the image columns when the data is fed to the model. These transformations are:

- `Resize`: resize the image to 224x224 pixels
- `Normalize`: normalize the image to ImageNet statistics
- `ToTensorV2`: convert the image to a PyTorch tensor

Albumentations is the library of choice for applying these transformations. Learn more about how to use Albumentations to customise the transformations in the Transformations section.
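Conceptually, the default pipeline amounts to the following (a NumPy sketch of the three operations for illustration only; Bitfount applies the real Albumentations transforms, and the nearest-neighbour resize here is a simplification):

```python
import numpy as np

# ImageNet channel statistics used by the default Normalize step.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def default_image_transform(img: np.ndarray) -> np.ndarray:
    """Sketch of the default transforms: resize to 224x224 (nearest-neighbour
    here for simplicity; Albumentations interpolates), normalize to ImageNet
    statistics, and reorder HWC -> CHW as ToTensorV2 does."""
    h, w, _ = img.shape
    rows = np.arange(224) * h // 224
    cols = np.arange(224) * w // 224
    resized = img[rows][:, cols]              # (224, 224, 3)
    normalized = (resized - IMAGENET_MEAN) / IMAGENET_STD
    return normalized.transpose(2, 0, 1)      # (3, 224, 224)

out = default_image_transform(np.random.rand(300, 400, 3))
print(out.shape)  # (3, 224, 224)
```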

### Feeding the data to the model

Once you have created the datasource, schema and data structure, you can instantiate and initialise your model and feed the data to the model. The model needs to be instantiated with the DataStructure and schema objects that were created earlier. After this, the model must be initialised by calling the `initialise_model` method. This method creates the model under the hood by calling the `create_model` method and saving the model to the `self._model` attribute. It also creates the PyTorch data loaders from the datasource which will be used to feed the data to the model. You can learn more about the dataloaders in the DataLoaders section.

To feed the data to the model, you then need to call either the `fit`, `predict` or `evaluate` methods on the model. The `fit` method is used for training and fine-tuning the model and the `predict` method is used for inference. The `evaluate` method is used for evaluating the model on a dataset.

For training, this might look like:

```python
model = MyModel(datastructure=data_structure, schema=schema, epochs=10, batch_size=32)
model.initialise_model(datasource)
results = model.fit(datasource)
```

Whereas for inference, it might look like:

```python
model = MyModel(datastructure=data_structure, schema=schema)
model.initialise_model(datasource)
predictions = model.predict(datasource)
```

Calling `fit()` or `predict()` on a model will automatically feed the data to the model and return the results.

:::tip
You can find an end-to-end example of how to validate your model locally in the Tutorials section.
:::


## for-data-scientists/models-and-datasets/bringing-your-models-to-bitfount/uploading-your-model.md

# Uploading your model

Once you have validated your model locally, you are ready to make it available to tasks, projects and datasets on Bitfount. Uploading registers your code, weights and metadata, all of which are versioned and can be updated as needed.

## What gets packaged

- **Model code** that implements the relevant Bitfount interfaces (inference, evaluation, training, or fine-tuning).
- **Weights or checkpoints**: for example, `.pth` or `.onnx` files
- **Metadata**: display name, description, version, visibility (public or private), and any licensing notes

## Upload options

- **Bitfount Hub or Desktop**: create a model entry, attach your code archive and weight files, and set visibility. This is often the fastest route when collaborating.
- **SDK**: script the registration step so you can integrate uploads into CI pipelines. Supply storage locations for code and weights plus any parameters your task runners need.

### Hub or Desktop App

This is the most straightforward way to upload your model to Bitfount. To upload a model, you need to have a Bitfount account and be logged into either the Bitfount App or Hub.

1. Navigate to the "Models" tab in the left sidebar.
2. Click the "Upload model" button in the top right corner.
3. Either paste or upload the model code and weight files you want to upload along with the name, description (including any licensing notes you want to add) and the visibility (public or private).
4. Click the "Upload" button.

  Create model page

Take care when choosing the visibility of your model. If you choose public, your model won't be discoverable by other users, but it will be usable by all Bitfount users in their tasks and projects. If you choose private, your model will only be visible and accessible to you and your collaborators.

### SDK

To upload a model using the SDK, you need to create a `BitfountModelReference` object. This object contains the model code, weights and metadata and can be used to reference the model in tasks and projects. Point the `model_ref` argument to the path of the model code file you want to upload.

:::caution[Important]
Ensure the name of the model code file matches the name of your model class and the name of the model on the Hub.
:::

```python
from pathlib import Path
from bitfount import BitfountModelReference

reference_model = BitfountModelReference(
    model_ref=Path("MyModel.py"),
    datastructure=datastructure,
    schema=schema,
    hyperparameters={"epochs": 2},  # Epochs or steps need to be provided even for inference models
    private=True,
)
```

## Limitations

- The model code must only contain one model class definition at the top level of the file i.e. the class that implements the relevant Bitfount interfaces (inference, evaluation, training/fine-tuning). If your model makes use of multiple model classes, you will need to nest them within the top level class
- The model code can't be more than 3MB in size
- The model weights can't be more than 500MB in size. Please contact support@bitfount.com if you need to upload a larger model.
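As a pre-flight check before uploading, you can verify your files against these limits; `within_upload_limits` is a hypothetical helper (not part of the Bitfount SDK), and it assumes the limits are plain megabytes:

```python
from pathlib import Path

# Documented upload limits, assumed to be plain megabytes.
MAX_CODE_BYTES = 3 * 1000 * 1000       # 3MB model code limit
MAX_WEIGHTS_BYTES = 500 * 1000 * 1000  # 500MB weights limit

def within_upload_limits(code_path: Path, weights_path: Path) -> bool:
    """Return True if both files are within the documented size limits."""
    return (
        code_path.stat().st_size <= MAX_CODE_BYTES
        and weights_path.stat().st_size <= MAX_WEIGHTS_BYTES
    )
```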

## Versioning and updates

- When updating a model, make sure to edit the most recent version to ensure changes from previous versions are not lost
- Consider keeping a changelog of changes to the model in each version in the model description, for example:
  ```markdown
  v1: Initial release
  v2: Changed optimizer to AdamW and added L2 regularization
  v3: Optimised for faster inference
  ```
- Each version of the model has its own associated description which can be edited without triggering a new version

:::tip

After uploading your model for the first time, make sure to validate it by running a small task in the Bitfount App to confirm it still works as expected. You can then share the model with collaborators or create tasks that link it to projects once validated.

:::


## for-data-scientists/models-and-datasets/bringing-your-models-to-bitfount/data-pipeline/data-loaders.md

# Data loaders

Data loaders bridge datasources and your model code. They handle batching, shuffling, device placement, and any collation logic required to convert raw rows into tensors or arrays.

## Responsibilities

- Fetch samples from the datasource and apply the correct preprocessing pipeline.
- Batch inputs, pad variable-length fields if needed, and move tensors to the right device (GPU/CPU).
- Provide deterministic iteration when you set seeds, and configurable shuffling for training.

## Using Bitfount DataLoaders

Bitfount has created wrappers around the standard PyTorch DataLoader class to make it compatible with Bitfount. These are used by default when creating a model and can be returned by calling the `train_dataloader()`, `val_dataloader()` and `test_dataloader()` methods on your model. These dataloaders are used by the `fit()`, `evaluate()` and `predict()` methods on your model respectively.

### Output format

When implementing your model, if you are using the model for training, you will be implementing the `training_step()`, `validation_step()` and `test_step()` methods. These methods already give you a batch of data as input so you don't need to worry about iterating over the dataloader. However, if you are using the model for inference, you will need to iterate over the dataloader to get the batches of data. Other than that, the format of the batch is exactly the same.

```python title="Inference example"
def predict(self, data: Optional[BaseSource] = None, **_: Any) -> PredictReturnType:
    preds = []
    for batch in self.test_dataloader():
        x, y = batch[:2]
        # ... run the model on x and append the outputs to preds
    return PredictReturnType(preds=preds)
```

Due to the various ways in which data can be structured, the format of the batch is dependent on the data structure and schema that were used to create the model. At the top level, a batch is a 2 or 3-element tuple:

```math
(x, y, [data\_key])
```

where `x` contains the input tensors, `y` the target tensors, and, if we are using a file-based datasource, `data_key` is the list of paths to the files that populated the batch, in case we need to link back to them. If we are using a non-file-based datasource, the tuple will only have 2 elements. If there are no target tensors, such as in the case of inference, `y` will still exist but will be an empty tensor. In most cases, we can ignore `data_key` and focus on `x` and `y` as follows if we are doing training or validation:

```python
x, y = batch[:2]
```

Or just `x` if we are doing inference:

```python
x = batch[0]
```

The shape of `x` and `y` will depend on the data and batch size:

- `y` is a tensor of shape `(batch_size, num_targets)` where `num_targets` is the number of target columns in the case of tabular data. In the case of image data for segmentation tasks, `y` will be a 4D tensor of shape `(batch_size, channels, height, width)` (BCHW).
- `x` itself is again a tuple of tensors of the form:

  ```math
  ([tabular], [image], [support])
  ```

  where the image tensor, if there is a single image column, is a 4D tensor in BCHW format and the tabular and support tensors are 2D tensors of shape `(batch_size, num_features)`. If there are multiple image columns, `image` will instead be a list of BCHW tensors. At least one of `tabular` or `image` will always be present. The support columns are deprecated and will be removed in a future version; for now, their presence is dictated by the `ignore_support_cols` argument to the `BitfountDataBunch` class (the class that creates the dataloaders within `initialise_model()`), but you can safely ignore them regardless.

  This means the shape of `x` could also be written as follows:

  ```math
  (tabular, [support])\quad | \quad(image, [support])\quad |\quad (tabular, image, [support])
  ```

  For instance, if you know that there are no tabular columns, an example unpacking of `x` could be:

  ```python
  images, _sup = x
  ```

:::info

Text data is not converted to tensors but rather included in the tabular data _as-is_. You will need to tokenize the text as part of your model's `training_step()`, `validation_step()` and `test_step()` methods.

:::
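
Putting the batch formats above together, a small helper can normalise both layouts. This is an illustrative sketch (the `unpack_batch` name is ours, not part of the Bitfount API):

```python
def unpack_batch(batch):
    """Split a dataloader batch into (x, y, data_keys).

    File-based datasources yield 3-element batches where the final
    element is the list of file paths; other datasources yield
    2-element batches, in which case data_keys is None.
    """
    if len(batch) == 3:
        x, y, data_keys = batch
    else:
        x, y = batch
        data_keys = None
    return x, y, data_keys
```

Inside a `training_step()` you could then call `x, y, _ = unpack_batch(batch)` regardless of the datasource type.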

## Using your own dataloaders

:::danger

Proceed at your own risk. If you are using your own dataloaders, in addition to ensuring that the output format of your dataloader matches the expected input format of the model, you will also need to ensure that the dataloader respects the following:

- the data structure and schema that were used to create the model, i.e. the `selected_cols`, `image_cols` and `target` columns, and any transformations that were applied to the data
- the specified data splits between training, validation and test sets, including shuffling if specified. For the Bitfount dataloaders, we have implemented a reservoir sampling algorithm to ensure that the data is shuffled in a deterministic manner even when the entire dataset cannot be loaded into memory.
- the protocol-level batching logic (i.e. `batched_execution`). This batching sits at a higher level than the model-level batching logic (i.e. `batch_size`). When running in batched execution mode, the protocol-level batching logic will override the available files in the datasource via the `selected_file_names_override` attribute of the datasource. To access only the files available in a given iteration, iterate over `selected_file_names_iter()` rather than `yield_data()` on the datasource.
- whether batches are returned according to `steps` or `epochs`, stopping iteration at the correct point

:::

There is no requirement to use Bitfount DataLoaders. If you want to use your own dataloaders, you will need to create a custom DataLoader class nested inside your model class. Override the `train_dataloader()`, `val_dataloader()` and `test_dataloader()` methods to return your own dataloaders.
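
As a minimal sketch of that shape (all class names here are illustrative stand-ins, and a real implementation must also meet the requirements listed in the warning above):

```python
class _CustomDataLoader:
    """Toy dataloader yielding (x, y) batches from in-memory samples."""

    def __init__(self, samples, batch_size=2):
        self.samples = samples
        self.batch_size = batch_size

    def __iter__(self):
        for i in range(0, len(self.samples), self.batch_size):
            chunk = self.samples[i : i + self.batch_size]
            # Batches must match the (x, y) format the model expects.
            yield tuple(s[0] for s in chunk), tuple(s[1] for s in chunk)


class MyModel:  # stand-in for your Bitfount model class
    def __init__(self, test_samples):
        self._test_samples = test_samples

    def test_dataloader(self):
        return _CustomDataLoader(self._test_samples)
```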


## for-data-scientists/models-and-datasets/bringing-your-models-to-bitfount/data-pipeline/postprocessing.md

# Postprocessing

Postprocessing turns raw model outputs into human-friendly results and task artefacts. This is an optional final step in the data pipeline and is not required for all models.

:::info

Postprocessing is supported by the following algorithms:

- `bitfount.ModelInference`
- `bitfount.HuggingFaceImageClassificationInference`
- `bitfount.HuggingFaceNERInference`
- `bitfount.HuggingFaceTextClassificationInference`

:::

## Available postprocessors

Bitfount provides a suite of built-in postprocessors to handle common output-preparation needs. You can mix and match them, even chaining several together using the `compound` type.

### Built-in postprocessor types

#### General postprocessors

These postprocessors can be used with any of the supported algorithms listed above.

| Name                | Description                                                                        | Example Use Case                                      |
| ------------------- | ---------------------------------------------------------------------------------- | ----------------------------------------------------- |
| `rename`            | Rename DataFrame columns.                                                          | Change "pred" column to "Prediction".                 |
| `transform`         | Apply a transformation function from `bitfount.transformations` on output columns. | Apply softmax or custom transform to logits.          |
| `json_restructure`  | Move fields between levels in nested JSON structures.                              | Move a key from nested JSON upwards for flattening.   |
| `string_to_json`    | Parse columns containing JSON as strings into JSON objects.                        | Safely load prediction results stored as strings.     |
| `json_key_rename`   | Rename keys within JSON fields in columns.                                         | Change "class1" to "cat" inside prediction JSON.      |
| `json_wrap_in_list` | Wrap JSON fields in an additional list.                                            | Ensure all predictions are in a JSON array format.    |
| `compound`          | Chain multiple postprocessors together in sequence.                                | Apply `string_to_json` followed by `json_key_rename`. |
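
As a sketch of how chaining with `compound` might look in YAML (the nested `postprocessors` key and the per-postprocessor arguments shown here are assumptions; consult the API documentation for the exact schema):

```yaml
postprocessors:
  - type: compound
    postprocessors: # assumed argument name for the chained steps
      - type: string_to_json
        columns: ["predictions"]
      - type: json_key_rename
        columns: ["predictions"]
        mapping:
          class1: cat
```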

#### Hugging Face postprocessors

These postprocessors are designed specifically for use with Hugging Face algorithms.

| Name                             | Description                                                                                                 | Example Use Case                                                         |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
| `huggingface_apply_id_to_labels` | Maps model output IDs to human-readable labels using a mapping file from the Hugging Face model repository. | Convert numeric class IDs to descriptive labels for multi-headed models. |
| `ner_deidentification`           | De-identifies text by replacing named entities detected by NER models with placeholder tokens.              | Remove patient names from clinical text after NER inference.             |

### Example configuration in YAML

The postprocessors can be supplied as a list of dictionaries. The only required key is `type`, which refers to the name of the postprocessor to use. The other keys are specific to each postprocessor and are passed to it as keyword arguments.

More information about the available postprocessors and their arguments can be found in the API documentation:

- General postprocessors
- Hugging Face postprocessors
- Base classes and compound postprocessor

#### ModelInference example

```yaml
algorithm:
  - name: bitfount.ModelInference
    model:
      bitfount_model:
        model_ref: MyModel
        model_version: 2
        username: amin-nejad
      hyperparameters:
        batch_size: 8
    arguments:
      postprocessors:
        - type: rename
          columns: ["logits"]
          mapping:
            logits: probabilities
        - type: transform
          columns: ["probabilities"]
          transform: softmax
```

#### Hugging Face image classification example

```yaml
algorithm:
  - name: bitfount.HuggingFaceImageClassificationInference
    arguments:
      model_id: google/vit-base-patch16-224
      postprocessors:
        - type: huggingface_apply_id_to_labels
          model_id: google/vit-base-patch16-224
          filepath: config.json
          key: id2label
```


## for-data-scientists/models-and-datasets/bringing-your-models-to-bitfount/data-pipeline/transformations.md

# Transformations

Batch transformations are handled by `albumentations` and are applied to images by the DataLoaders when the data is fed to the model. The default transformations are:

- `Resize`: Resize the image to 224x224 pixels
- `Normalize`: Normalize the image to ImageNet statistics
- `ToTensorV2`: Convert the image to a PyTorch tensor

These transformations are applied to the image data regardless of the task being performed (training, inference, evaluation). However, the batch transformations can be customised so that different segments of the data (training, validation, test) receive different transformations. For training in particular, it is important to apply augmentation transformations to the training set but not the validation/test sets, to help avoid overfitting.

A full list of the available transformations is available in the Albumentations documentation.

## Custom transformations

Transformations can be specified either in code or in YAML. In code, create a `DataStructure` object and pass the transformations to the `batch_transforms` argument. In the example below, where there is a single image column, different transformations are specified for the training and validation sets.

```python
from bitfount.data import DataStructure

data_structure = DataStructure(
    selected_cols=["image_col", "target"],
    image_cols=["image_col"],
    target=["target"],
    batch_transforms=[
        {
            "albumentations": {
                "step": "train",
                "output": True,
                "arg": "image_col",
                "transformations": [
                    {"Resize": {"height": 224, "width": 224}},
                    {"HorizontalFlip": {"p": 0.5}},
                    "RandomBrightnessContrast",
                    "Normalize",
                    "ToTensorV2",
                ],
            }
        },
        {
            "albumentations": {
                "step": "validation",
                "output": True,
                "arg": "image_col",
                "transformations": [
                    {"Resize": {"height": 224, "width": 224}},
                    "Normalize",
                    "ToTensorV2",
                ],
            }
        },
    ],
)
```

In YAML, you can specify the transformations in the `transform` section of the task YAML file. The below YAML example specifies the exact same transformations as the code example above.

```yaml
task:
  data_structure:
    select:
      include: ["image_col", "target"]
    assign:
      target: target
      image_cols: ["image_col"]
    transform:
      batch:
        - albumentations:
            step: train
            output: true
            arg: image_col
            transformations:
              - { "Resize": { "height": 224, "width": 224 } }
              - { "HorizontalFlip": { "p": 0.5 } }
              - "RandomBrightnessContrast"
              - "Normalize"
              - "ToTensorV2"
        - albumentations:
            step: validation
            output: true
            arg: image_col
            transformations:
              [
                { "Resize": { "height": 224, "width": 224 } },
                "Normalize",
                "ToTensorV2",
              ]
```

### Multiple image columns

In cases where there are multiple image columns that share the same transformations, you can pass the transformations to the `image_prefix_batch_transforms` argument of the `DataStructure` object instead of listing them individually in `batch_transforms`; they will be applied to all image columns with the same prefix. In YAML, specify the transformations in the `transform.image` section of the DataStructure instead of the `transform.batch` section. In both cases, omit the `arg` argument from the transformations.
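
For example, a training-time transform applied to all image columns might look like the following YAML sketch (field names mirror the `transform.batch` example above, with `arg` omitted; check the API documentation for the exact schema):

```yaml
task:
  data_structure:
    transform:
      image:
        - albumentations:
            step: train
            output: true
            transformations:
              - { "Resize": { "height": 224, "width": 224 } }
              - "Normalize"
              - "ToTensorV2"
```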

## Tips for reproducibility

- Keep the same pre- and post-processing for training, validation, and inference
- Apply augmentations only to the training set
- Make transformations deterministic when running inference and evaluation tasks, i.e. avoid transformations that take a probability (`p`) as an argument


## for-data-scientists/running-tasks/running-a-task-using-the-sdk.md

# Running a task using the SDK

Use the Bitfount Python SDK to submit and manage tasks programmatically. This may be useful for more technical users who want to automate tasks or integrate Bitfount into their existing workflows.

## Prerequisites

- Bitfount SDK installed - see SDK Installation
- Access to the dataset(s) and model(s) you intend to use
  - The model must be either public, owned by you, or part of a project's task in which you are an owner or collaborator
  - The dataset must be either owned by you, or part of a project in which you are an owner or collaborator and whose dataset owner has granted permission for other project collaborators to use it. If you own the dataset, it can be connected through the App or through the SDK as described in the Connecting Datasets guide

## Methods

There are two main methods for running a task using the SDK:

- Running a task from the command line pointing to a task YAML file
- Running a task from a Python script

### Running a task from a task file

Running a task from the command line is as easy as running the following command:

```bash
bitfount run_modeller 
```

This will run the task and output the results to the console as well as to a logfile in a subdirectory of the current working directory called `bitfount_logs`.

:::tip

Make sure you have specified the dataset identifiers and the project id correctly in the task YAML file to avoid timeouts or authentication errors.

:::

### Running a task from a Python script

Running a task from a Python script requires creating all the necessary components of the task as Python objects, ending with the protocol object. The protocol object is the entry point for the task: it orchestrates the algorithms and handles communication between the different parties in the task. Calling the `run` method on the protocol object, with the dataset identifiers passed as an argument, will kick off the task.

There are several examples of how to start tasks this way in the Tutorials section.


## for-data-scientists/running-tasks/running-multi-dataset-tasks.md

# Running multi-dataset tasks

:::info

Multi-dataset tasks are a feature of the Bitfount SDK. These tasks cannot be initiated from the Bitfount App, but datasets connected in the App can participate in multi-dataset tasks as long as they are not connected to the same Pod or App instance. If you are connecting datasets:

- **via the App:** each dataset involved in the task must be on a different machine
- **via the SDK:** each dataset involved in the task must be on a different Pod, but they can be on the same machine

:::

Multi-dataset tasks let you send a task to multiple datasets at once. The task will be run on each dataset in parallel, with results sent back to the modeller. Depending on the task, these results may optionally be aggregated as part of the task or returned separately for each dataset. There are two main types of multi-dataset tasks:

- **Federated learning tasks:** these are used to train a model on multiple separate datasets without needing to centralise the data. The model is trained on each dataset in parallel, with regular orchestration by the modeller to average the updated model parameters. Learn more about federated learning in the original paper by McMahan et al. (2016) or our blog post.
- **SQL tasks:** these are used to run a SQL query against multiple datasets at once. Depending on the task configuration, the results may be aggregated into a single result returned to the modeller, or returned separately for each dataset.

## Federated learning

For federated learning tasks, the protocol used is `bitfount.FederatedAveraging` and the algorithm used is `bitfount.FederatedModelTraining`. In the example below, we are training a model on all the data in a tabular dataset with `TARGET` as the target column.

:::info

This is the exact same kind of task as used for fine-tuning. The only difference is that instead of specifying a single dataset, you specify multiple datasets.

:::

```yaml
pods:
  identifiers:
    - 
    - 
    - 

task:
  protocol:
    name: bitfount.FederatedAveraging
    arguments:
      steps_between_parameter_updates: 10
  algorithm:
    name: bitfount.FederatedModelTraining
    model:
      bitfount_model:
        model_ref: 
        model_version: 1
        username: 
      hyperparameters:
        steps: 100
        batch_size: 32
        learning_rate: 0.0001
  aggregator:
    secure: False # Set to True to use secure aggregation
  data_structure:
    schema_requirements: "full"
    assign:
      target: TARGET
```

:::note

If dataset sizes differ significantly, the task may sit idle on many of the machines while others are still running, so for optimal efficiency choose datasets of roughly the same size. If this is not possible, specify the training in `steps` rather than `epochs` to ensure the same amount of training happens on each dataset. If training is specified in `steps`, pass only `steps_between_parameter_updates` to the `FederatedAveraging` protocol; similarly, if training is specified in `epochs`, pass `epochs_between_parameter_updates` instead.

:::

## SQL tasks

For SQL tasks, the protocol used is `bitfount.ResultsOnly` and the algorithm used is `bitfount.SqlQuery`. In the example below, we are running a SQL query against an `ehr-records-2025` dataset belonging to each of three users. Recall from the SQL task documentation that if a SQL task is run against a non-SQL-based dataset (e.g. a `CSVSource` dataset or otherwise), the table name is the dataset identifier without the username, wrapped in backticks (\`\`). Since the same query is run on each dataset, the dataset name must be the same across the three different users.

```yaml
pods:
  identifiers:
    - alice/ehr-records-2025
    - bob/ehr-records-2025
    - charlie/ehr-records-2025

task:
  protocol:
    name: bitfount.ResultsOnly
    arguments:
      save_location: ["Modeller"]
  algorithm:
    name: bitfount.SqlQuery
    arguments:
      query: "SELECT * FROM `ehr-records-2025` LIMIT 10"
  data_structure:
    table_config:
      table: ehr-records-2025
```

In the example above, we are not using an aggregator, so the results from each dataset will be returned separately. If you want to aggregate the results into a single result, you can specify an aggregator in exactly the same way as in the federated learning task. The results will be saved as a CSV on the modeller side.

:::note
If your SQL query runs against a SQL-based dataset (i.e. an `OMOPSource` dataset), your query can operate on datasets of different names without issue.
:::
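
As a sketch, aggregating the SQL results only requires adding the same aggregator block as in the federated learning example above:

```yaml
task:
  # protocol, algorithm and data_structure as in the SQL example above, plus:
  aggregator:
    secure: False # Set to True to use secure aggregation
```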


## for-data-scientists/task-catalogue/evaluation.md

# Model Evaluation

A model evaluation task simply runs a trained model on a dataset much like the inference task, but instead of returning the results, it returns a set of metrics about the model's performance on that dataset.

## Metrics

The metrics returned are dictated by the type of model detected. The algorithm looks for the presence of `ClassifierMixIn`, `RegressorMixIn` or `SegmentationMixIn` in the model's inheritance hierarchy and determines the type of metrics to return accordingly. The `RegressorMixIn` and `SegmentationMixIn` mixins are currently only used for tagging purposes and have no configuration options, whereas the `ClassifierMixIn` has logic for determining the type of classification problem, which in turn determines the metrics returned.

| Model Type                | Metrics                                                                                                       |
| ------------------------- | ------------------------------------------------------------------------------------------------------------- |
| Binary Classification     | `Accuracy`, `Precision`, `Recall`, `F1 Score`, `ROC AUC`, `Brier Loss`                                        |
| Multiclass Classification | `Accuracy`, `Precision`, `Recall`, `F1 Score`, `ROC AUC`                                                      |
| Multilabel Classification | `Accuracy`, `Precision`, `Recall`, `F1 Score`, `ROC AUC`                                                      |
| Regression                | `Mean Absolute Error`, `Mean Squared Error`, `R2 Score`, `Root Mean Squared Error`, `Kolmogorov-Smirnov Test` |
| Segmentation              | `IoU`, `Dice Coefficients`, `Dice Score`                                                                      |

:::tip
Mixin classes must be specified _first_ in the model's inheritance hierarchy.
:::
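
To illustrate why the ordering matters, here is a toy example with stand-in classes (the real mixins live in the `bitfount` package): listing the mixin first means its attributes take precedence in the method resolution order, while the inheritance checks used for metric selection still succeed.

```python
class ClassifierMixIn:  # stand-in, not the real Bitfount mixin
    model_type = "classifier"


class BaseModel:  # stand-in for the model base class
    model_type = "base"


# Mixin listed first in the inheritance hierarchy, as required.
class MyClassifier(ClassifierMixIn, BaseModel):
    pass
```

With the order reversed, `MyClassifier.model_type` would resolve to `"base"` instead.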

## Results

As the name implies, the `bitfount.ResultsOnly` protocol simply returns the results from the model evaluation task. The results are returned as a dictionary mapping metric names (strings) to metric values (floats). By default, the results are not persisted anywhere. If you are running the protocol via the SDK, this behaviour may be fine because the results are returned to a variable which you can access. However, if the protocol is run as part of a task in the app, the results are lost unless you specify a save location by setting the `save_location` argument on the `bitfount.ResultsOnly` protocol. The available save locations are:

- `Worker`: Save the results to the worker side.
- `Modeller`: Save the results to the modeller side.

Both locations can be specified to save the results on both the worker and modeller sides.

## Example

An example task file for using a model evaluation task is shown below:

```yaml
pods:
  identifiers:
    - 

modeller:
  identity_verification_method: key-based

task:
  protocol:
    name: bitfount.ResultsOnly
    arguments:
      save_location:
        - Worker
        - Modeller
  algorithm:
    - name: bitfount.ModelEvaluation
      arguments:
        model:
          bitfount_model:
            username: amin-nejad
            model_ref: HeartDiseaseModel
            model_version: 3
  data_structure:
    select:
      include:
        - Age
        - Gender
        - Chest_Pain_Type
        - Resting_Blood_Pressure
        - Cholesterol
        - Fasting_Blood_Sugar
        - Resting_ECG
        - Max_Heart_Rate
        - Exercise_Induced_Angina
        - ST_Depression
        - ST_Slope
        - Number_of_Major_Vessels
        - Thalassemia
    assign:
      target: Heart_Disease
    data_split:
      args:
        shuffle: true
        test_percentage: 0
        validation_percentage: 100 # 100% of the data is used for the evaluation task
      data_splitter: percentage
```


## for-data-scientists/task-catalogue/fine-tuning.md

# Model Fine-tuning

Model fine-tuning tasks are supported for both Bitfount and TIMM models.

## Bitfount models

The protocol used is `bitfount.FederatedAveraging` and the algorithm used is `bitfount.FederatedModelTraining`. This combination also supports federated learning tasks where the model is trained on multiple datasets in a federated manner. In this case, we are only using a single dataset.

:::tip

For more information on running federated learning tasks, please refer to the documentation here.

:::

An example task file for using a Bitfount-hosted model for fine-tuning is shown below. In this case, the model is a simple binary classification model. The features are not specified, meaning that all columns in the dataset will be used for training.

```yaml
modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

batched_execution: false
test_run: false
run_on_new_data_only: false

task:
  protocol:
    name: bitfount.FederatedAveraging
    arguments:
      epochs_between_parameter_updates: 10 # No need to share the model weights until the end of the training
  algorithm:
    - name: bitfount.FederatedModelTraining
      arguments:
        modeller_checkpointing: true # Whether to save the last checkpoint on the modeller side
        checkpoint_filename: best_checkpoint.pt
      model:
        bitfount_model:
          model_ref: MyBinaryClassificationModel
          model_version: 1
          username: bitfount
        hyperparameters:
          epochs: 10
          batch_size: "{{ batch_size }}"
          learning_rate: "{{ learning_rate }}"
          weight_decay: "{{ weight_decay }}"
  aggregator:
    secure: False
  data_structure:
    schema_requirements: partial
    assign:
      target:
        - "{{ target_column_name }}"

template:
  batch_size:
    label: "Batch size"
    tooltip: "Number of samples per batch during training."
    type: "number"
    default: 8
  learning_rate:
    label: "Learning rate"
    tooltip: "Learning rate for the model optimizer."
    type: "number"
    default: 0.0001
  weight_decay:
    label: "Weight decay"
    tooltip: "Weight decay (L2 regularization) for the model optimizer."
    type: "number"
    default: 0.01
  target_column_name:
    label: "Target column"
    tooltip: "The column containing dataset labels."
    type:
      schema_column_name:
        semantic_type: "categorical"
```

## TIMM models

A good example of a TIMM model is RETFound (Retina foundation), a multiclass image classification model. The example task file below shows how to use it in a multiclass image classification task. This algorithm is only compatible with the `bitfount.ResultsOnly` protocol, which simply runs the task and returns the results (if any) to the modeller. The new model parameters are _not_ part of the results returned by the algorithm: they are only saved on the Pod side, i.e. where the dataset is located, so the modeller may only receive metrics about the training process.

:::tip

Take a look at the RETFound demo project to easily run this task in the Bitfount app.

:::

```yaml
modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

task:
  protocol:
    name: bitfount.ResultsOnly
  algorithm:
    - arguments:
        model_id: bitfount/RETFound_MAE
        labels:
          - "0"
          - "1"
          - "2"
          - "3"
          - "4"
        args:
          epochs: 1
          batch_size: 32
          num_classes: 5
      name: bitfount.TIMMFineTuning
  data_structure:
    table_config:
      table: 
    select:
      include:
        - Image name
        - Retinopathy grade
    assign:
      target:
        - Retinopathy grade
```


## for-data-scientists/task-catalogue/inference.md

# Model Inference

A selection of example task files for using various models in inference tasks is shown below, all using the `bitfount.InferenceAndCSVReport` protocol. This is the most appropriate protocol for most inference tasks, as it runs the inference and then writes a CSV report of the results on the Pod side. Alternatively, you may use the `bitfount.InferenceAndReturnCSVReport` protocol, which instead sends the CSV results back to the modeller. This is useful if the dataset is remote from the task initiator, i.e. on a different machine.

:::info[Reminder]
The **Pod** is the entity that contains the datasets to run the task on. The **modeller** is the entity that initiates the task.
:::

## Bitfount-hosted models

An example task file for using a Bitfount-hosted model for inference is shown below. In this case, the model is a binary classification model for predicting chronic kidney disease based on a set of features. The features are not templated in this example, meaning that only datasets with all of those exact column names would be deemed compatible with the task.

```yaml
modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

batched_execution: true

task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.ModelInference
      model:
        bitfount_model:
          username: amin-nejad
          model_ref: ChronicKidneyDiseaseModel
          model_version: 1
    - name: bitfount.CSVReportAlgorithm
  data_structure:
    select:
      include:
        - Age
        - Gender
        - Creatinine (mg/dL)
        - Albumin (g/dL)
        - HbA1c (%)
        - Glucose (mg/dL)
        - Triglycerides (mg/dL)
```

## Hugging Face models

An example task file for using a Hugging Face text classification model for sentiment analysis is shown below:

```yaml
modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

batched_execution: true

task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.HuggingFaceTextClassificationInference
      arguments:
        top_k: 3
        model_id: finiteautomata/bertweet-base-sentiment-analysis
        target_column_name: "{{ target_column_name }}"
    - name: bitfount.CSVReportAlgorithm

template:
  target_column_name:
    label: "Target column"
    type:
      schema_column_name:
        semantic_type: text
```

## TIMM models

The below example task file shows how to use a TIMM model in a multiclass image classification task using the RETFound (Retina foundation) model:

```yaml
modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

batched_execution: True

task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.TIMMInference
      arguments:
        model_id: "bitfount/RETFound_MAE_OCT_CNV_DME_DRU"
        class_outputs:
          - CNV (%)
          - DME (%)
          - DRUSEN (%)
          - NORMAL (%)
    - name: bitfount.CSVReportAlgorithm
      arguments:
        original_cols:
          - _original_filename
          - Filename
  data_structure:
    schema_requirements: "partial"
    select:
      include:
        - "{{ image_column_name }}"

template:
  image_column_name:
    label: "Image column"
    tooltip: "The dataset column that contains image data for predictions"
    default: "Pixel Data 0"
    type:
      schema_column_name:
        semantic_type: "image"
```

## MONAI models

The below example task file shows how to use a UNEST MONAI bundle for whole brain segmentation on NIfTI neuroimaging data:

```yaml
modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

batched_execution: true
test_run: false
force_rerun_failed_files: true

task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.MONAIBundleInference
      arguments:
        bundle_name: "wholeBrainSeg_Large_UNEST_segmentation"
        batch_size: 1
        num_workers: 1
        dataframe_output: true
        # nifti_output: true  # Enable to save NIfTI files to same directory as CSV
    - name: bitfount.CSVReportAlgorithm
  data_structure:
    compatible_datasources:
      - NIFTISource
    select:
      include:
        - "Pixel Data"
```


## for-data-scientists/task-catalogue/sql.md

# SQL

A SQL task lets you run SQL queries on a dataset and optionally return the results. It is a powerful tool for data analysis and manipulation. The task below saves the results of the SQL query to both the modeller and the Pod side.

:::info

If running your SQL query against a non-SQL-based dataset (e.g. a `CSVSource` dataset or otherwise), the table name will be the dataset identifier without the username, wrapped in backticks (\`\`). Please ensure your SQL query operates on that table so that it is correctly parsed, e.g. ``SELECT MAX(G) AS MAX_OF_G FROM `my-dataset-identifier` ``.

If running a SQL task against a SQL-based dataset (i.e. an `OMOPSource` dataset), you can write your query as normal.

:::
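
The table-name derivation described above can be sketched in Python. This is a hypothetical helper for illustration only, not part of the Bitfount SDK:

```python
def table_name_for(dataset_identifier):
    """Derive the backtick-quoted table name for a non-SQL-based dataset.

    Hypothetical helper: the table name is the dataset identifier with
    any leading "username/" namespace removed, wrapped in backticks.
    """
    # rpartition returns ("", "", name) when there is no "/" present.
    _, _, name = dataset_identifier.rpartition("/")
    return f"`{name}`"


# Building the example query from the note above:
query = f"SELECT MAX(G) AS MAX_OF_G FROM {table_name_for('alice/my-dataset-identifier')}"
print(query)
# SELECT MAX(G) AS MAX_OF_G FROM `my-dataset-identifier`
```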

## Example

```yaml
modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

batched_execution: false
test_run: false
run_on_new_data_only: false

task:
  protocol:
    name: bitfount.ResultsOnly
    arguments:
      save_location: "{{ save_location }}"
  algorithm:
    - name: bitfount.SqlQuery
      arguments:
        query: "{{ query }}"
  data_structure:
    # Schema is not required for this task since we are returning all columns regardless
    schema_requirements: empty
    compatible_datasources:
      - CSVSource

template:
  query:
    type: string
    default: "SELECT * FROM `table` LIMIT 100"
    label: Query
    tooltip: The SQL query to execute.
  save_location:
    label: "Save Location"
    tooltip: "Specify where to save the results."
    type: "array"
    items:
      type: "string"
    minItems: 1
    default:
      - Modeller
      - Worker
```


## for-data-scientists/writing-tasks/referencing-a-model.md

# Referencing a model

Bitfount currently supports referencing models from the following providers:

- Bitfount Hub
- Hugging Face
  - Model Hub
  - TIMM (Pytorch image models)
- MONAI (Medical Open Network for AI)
  - MONAI Model Zoo

## Bitfount-hosted models

Models hosted in Bitfount are referenced inside an algorithm that requires a model (such as `bitfount.ModelInference`) via the `model.bitfount_model` block:

- **model_ref**: the identifier of the model in the Bitfount Hub (excluding the username)
- **model_version**: integer version to pin; if omitted, the latest version will be used. Pinning the version is recommended to avoid unexpected changes to the model as well as access issues.
- **username**: model owner/namespace

Hyperparameters for the model can be set separately within the `model` block:

- **hyperparameters**: arguments to pass to the model constructor. For instance, batch size is a commonly set hyperparameter. The accepted hyperparameters differ between models; if you have access to view the code, you can check the model's constructor arguments to determine them.

```yaml
task:
  algorithm:
    - name: bitfount.ModelInference
      model:
        bitfount_model:
          model_ref: CatDogImageClassifier
          model_version: 3
          username: research-user
        hyperparameters:
          batch_size: 8
```

:::tip
Bitfount models can be used for inference, evaluation and fine-tuning tasks and can be toggled between public and private.
Models from the Hugging Face model hub however _must_ be made public in order to be used in Bitfount.
More information about uploading your models to the Bitfount Hub can be found here.
:::

## Hugging Face models

Hugging Face models are invoked via dedicated Bitfount algorithms that are specific to the model type. You pass a `model_id` (e.g., `google/vit-base-patch16-224`) in the algorithm `arguments`.
Do not use the `model` block for Hugging Face models. Confusingly, Hugging Face refers to these model types as _tasks_.

### Available task types

The _task_ types defined by Hugging Face and supported by Bitfount can be found below with their corresponding Bitfount algorithm:

- **Image classification**: `bitfount.HuggingFaceImageClassificationInference`
- **Image segmentation**: `bitfount.HuggingFaceImageSegmentationInference`
- **Image-text-to-text**: `bitfount.HuggingFaceImageTextGenerationInference`
- **Text classification**: `bitfount.HuggingFaceTextClassificationInference`
- **Text generation**: `bitfount.HuggingFaceTextGenerationInference`

:::info
Hugging Face models are currently only supported for inference tasks within Bitfount.
:::

:::tip
Make sure to choose a model that is compatible with the algorithm you are using.
The links above will take you to the Hugging Face model hub filtered for that specific task type.
:::

#### Image segmentation example

```yaml
task:
  algorithm:
    - name: bitfount.HuggingFaceImageSegmentationInference
      arguments:
        model_id: CIDAS/clipseg-rd64-refined
        dataframe_output: true
        batch_size: 1
  data_structure:
    select:
      include:
        - image_path
```

#### Image-text-to-text example

The image-text-to-text task type enables the use of vision-language models that take both an image and a text prompt as input and generate a text response. This is useful for tasks such as medical image captioning or visual question answering. A notable example is MedGemma, a medical vision-language model.

```yaml
task:
  algorithm:
    - name: bitfount.HuggingFaceImageTextGenerationInference
      arguments:
        model_id: google/medgemma-1.5-4b-it
        max_new_tokens: 500
        prompt_template: "Describe the findings in this medical image given the following clinical notes: {context}"
  data_structure:
    select:
      include:
        - image_path
        - clinical_notes
```

:::tip
The `prompt_template` argument is optional. If provided, the `{context}` placeholder will be replaced by the value from the context column. If omitted, the context column value is used as the prompt directly. See the API documentation for the full list of configuration options.
:::
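
The prompt resolution described in the tip above can be illustrated with a minimal Python sketch. The function name and shape are illustrative assumptions, not Bitfount's actual implementation:

```python
def resolve_prompt(context_value, prompt_template=None):
    """Sketch of the prompt resolution behaviour described above.

    If a template is supplied, the {context} placeholder is replaced
    with the context column's value; otherwise the context value is
    used as the prompt directly.
    """
    if prompt_template is not None:
        return prompt_template.format(context=context_value)
    return context_value


notes = "Patient reports blurred vision in the left eye."
print(resolve_prompt(notes, "Describe the findings given: {context}"))
print(resolve_prompt(notes))  # no template: the notes become the prompt
```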

### TIMM models

TIMM (PyTorch Image Models) is a popular library that provides a collection of the latest pretrained image models.
Originally developed independently by Ross Wightman, it has now been brought under the Hugging Face umbrella.

TIMM models are supported by Bitfount for both inference and fine-tuning tasks via the `bitfount.TIMMInference`
and `bitfount.TIMMFineTuning` algorithms respectively. The model is specified in the same way as Hugging Face models via the `model_id` argument.

#### TIMM fine-tuning example

Hyperparameters for the model can be set separately within the `args` block. The full list of hyperparameters can be found
here. As the timm documentation notes, the variety of training args is large and not all combinations of options
(or even individual options) have been fully tested.

```yaml
task:
  protocol:
    name: bitfount.ResultsOnly
  algorithm:
    - name: bitfount.TIMMFineTuning
      arguments:
        model_id: bitfount/RETFound_MAE
        labels:
          - "0"
          - "1"
          - "2"
          - "3"
          - "4"
        args:
          epochs: 1
          batch_size: 32
          num_classes: 5
```

#### TIMM inference example

```yaml
task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.TIMMInference
      arguments:
        model_id: bitfount/RETFound_MAE
        num_classes: 5
    - name: bitfount.CSVReportAlgorithm
```

## MONAI models

MONAI (Medical Open Network for AI) is a PyTorch-based framework specialising in deep learning for medical imaging. Bitfount supports running inference using pre-trained models from the MONAI Model Zoo via the `bitfount.MONAIBundleInference` algorithm.

MONAI models are referenced by their bundle name from the MONAI Model Zoo. The algorithm downloads the specified bundle and runs inference using the bundle's pre-trained weights and preprocessing pipeline.

:::warning[Hardware Recommendation]
MONAI models can be computationally intensive. Running on CPU can be very slow, so **CUDA-enabled GPU** hardware is strongly recommended. Note that MPS (Apple Silicon) is **not supported** for MONAI models.
:::

### MONAI inference example

```yaml
task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.MONAIBundleInference
      arguments:
        bundle_name: wholeBrainSeg_Large_UNEST_segmentation
        nifti_output: true
        batch_size: 1
    - name: bitfount.CSVReportAlgorithm
  data_structure:
    select:
      include:
        - image_path
```

For the full list of configuration options, see the MONAIBundleInference API documentation.


## for-data-scientists/writing-tasks/task-components.md

# Task Components

As we've seen, a Bitfount task is the brain of a project. It specifies what will run on any dataset linked to the project, in what order and over what view of the data. Tasks are written in YAML format, and at a high level, are made up of 3 key components:

- **Protocol**: orchestrates the run and lifecycle
- **Algorithm(s)**: the units of work to execute (can be a list)
- **Data structure**: how to select, assign and transform input data

A minimal skeleton might look something like this:

```yaml
task:
  protocol:
    name: bitfount.ResultsOnly
    arguments: { ... }
  algorithm:
    - name: bitfount.ModelInference
      arguments: { ... }
  data_structure:
    select:
      include:
        - image_path
```

### Protocols

- **What they are**: the task's entry point that orchestrates the algorithms and handles communication between different parties within the task. A given protocol will only be compatible with a certain set of algorithms.
- **How to specify**: each entry takes a `name` and `arguments`. Use the prefix `bitfount.` followed by the protocol name. A full list of protocols can be found here. The `arguments` may be optional and are used to configure the protocol. Search for the protocol in the API documentation to see its available arguments.
- **Examples**:
  - `bitfount.InferenceAndCSVReport`: runs model inference and writes a CSV report from the results

```yaml
task:
  protocol:
    name: bitfount.InferenceAndCSVReport
    arguments: { ... }
```

### Algorithms

- **What they are**: the concrete steps executed by the protocol. You can supply a single algorithm or a list; lists run each algorithm in order. How the output of one algorithm feeds into the next is baked into the protocol in which they are used, so a given algorithm will only be compatible with a certain set of protocols.
- **How to specify**: each entry takes a `name` and optional `arguments`. Use the prefix `bitfount.` followed by the algorithm name. A full list of algorithms can be found here. The `arguments` may be optional and are used to configure the algorithm. Search for the algorithm in the API documentation to see its available arguments. Algorithms that require a model to be passed in accept a separate `model` block, see Referencing a model for more information.
- **Common patterns**:
  - Model inference (e.g., `bitfount.ModelInference`, `bitfount.HuggingFaceImageClassificationInference`)
  - Post-processing (e.g., calculations, matching)
  - Reporting (e.g., `bitfount.CSVReportAlgorithm`)

```yaml
task:
  algorithm:
    - name: bitfount.ModelInference
      arguments: { ... }
      model: { ... } # see "Referencing a model"
    - name: bitfount.CSVReportAlgorithm
      arguments: { ... }
```

### Data Structures

Defines what the data should look like before it is passed to the algorithms in the task.

:::tip
More information about the data structure arguments can be found here.
:::

:::note
The data structure is currently only used to define the input data for tasks that use a model.
:::

- **table_config**: optional configuration to select a specific table from the datasource if the datasource has multiple tables.
- **select**: choose columns to include/exclude from the data; `include_prefix` can be helpful for datasets that have multiple image columns.
- **assign**: map column names to semantic roles (e.g., `image_prefix`, `target`).
- **transform**: define dataset/batch/image transforms to apply to the data (e.g., Albumentations pipelines, grayscale handling). Important for tasks that use a model. More information about the transform arguments can be found here.
- **data_split**: optional configuration for defining how to split data into train/validation/test sets.
- **compatible_datasources**: list of dataset types that are compatible with this data structure configuration.
- **schema_requirements**: specify dataset schema requirements level (`"empty"`, `"partial"`, or `"full"`), or a dictionary mapping requirement levels to specific dataset types. Defaults to `"partial"`.
- **filter**: optional task-level filters to apply at runtime. These filters allow the task initiator to further restrict which data is processed, without modifying the dataset connection. See Task-level filters below for details.

```yaml
task:
  data_structure:
    compatible_datasources:
      - DICOMOphthalmologySource
      - HeidelbergSource
    schema_requirements: partial
    data_split:
      args:
        shuffle: false
        test_percentage: 100
        validation_percentage: 0
      data_splitter: percentage
    assign:
      image_prefix: Pixel Data
    select:
      include:
        - Columns
        - Rows
      include_prefix: Pixel Data
    transform:
      image:
        - albumentations:
            step: test
            output: true
            transformations:
              - ToTensorV2
```

### Task-level Filters

Task-level filters allow the task initiator to specify data filtering criteria at runtime, without requiring the dataset owner to modify their dataset connection. This is useful when the same dataset needs to be queried with different criteria across different task runs.

:::info
Task-level filters are applied **in addition to** any dataset-level filters configured by the dataset owner at connection time. The resulting filter is the intersection of both—meaning task-level filters can only **further restrict** the data, never expand it beyond what the dataset owner has allowed.
:::

Filters are specified as a list of filter objects, each containing a `filter_type` and `value`:

```yaml
task:
  data_structure:
    filter:
      - filter_type: modality
        value: OCT
      - filter_type: min-frames
        value: 50
      - filter_type: scan-acquisition-min-date
        value:
          year: 2020
          month: 1
```
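
Conceptually, the "most restrictive wins" combination of dataset-level and task-level filters can be sketched as follows. This is an illustrative Python sketch covering only numeric `min-*`/`max-*` bounds, not Bitfount's actual implementation (which also handles dates, modalities, and the other filter types):

```python
def combine_filters(dataset_filters, task_filters):
    """Sketch of "most restrictive wins" filter combination.

    For min-* filters the larger bound wins; for max-* filters the
    smaller bound wins. Illustrative only.
    """
    combined = {}
    for f in dataset_filters + task_filters:
        key, value = f["filter_type"], f["value"]
        if key not in combined:
            combined[key] = value
        elif key.startswith("min-"):
            combined[key] = max(combined[key], value)  # tighter lower bound
        elif key.startswith("max-"):
            combined[key] = min(combined[key], value)  # tighter upper bound
    return [{"filter_type": k, "value": v} for k, v in combined.items()]


dataset_side = [{"filter_type": "min-frames", "value": 30}]
task_side = [{"filter_type": "min-frames", "value": 50},
             {"filter_type": "max-file-size", "value": 200}]
print(combine_filters(dataset_side, task_side))
# The task's stricter min-frames (50) wins; max-file-size is added.
```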

#### Available filter types

| Filter Type                  | Value Type         | Description                                |
| ---------------------------- | ------------------ | ------------------------------------------ |
| `modality`                   | `"OCT"` or `"SLO"` | Filter by imaging modality                 |
| `min-frames`                 | integer            | Minimum number of B-scan frames            |
| `max-frames`                 | integer            | Maximum number of B-scan frames            |
| `min-file-size`              | number (MB)        | Minimum file size in megabytes             |
| `max-file-size`              | number (MB)        | Maximum file size in megabytes             |
| `file-creation-min-date`     | date object        | Earliest file creation date                |
| `file-creation-max-date`     | date object        | Latest file creation date                  |
| `file-modification-min-date` | date object        | Earliest file modification date            |
| `file-modification-max-date` | date object        | Latest file modification date              |
| `min-dob`                    | date object        | Minimum patient date of birth              |
| `max-dob`                    | date object        | Maximum patient date of birth              |
| `scan-acquisition-min-date`  | date object        | Earliest scan acquisition date             |
| `scan-acquisition-max-date`  | date object        | Latest scan acquisition date               |
| `check-required-fields`      | list of strings    | Required DICOM fields that must be present |
| `series-description`         | string             | Filter by DICOM series description         |

:::tip
Date values are specified as objects with `year` (required), and optional `month` and `day` fields:

```yaml
value:
  year: 2023
  month: 6
  day: 15
```

:::
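
Assuming omitted `month` and `day` fields default to the start of the period (an assumption, not confirmed by the schema), a date object can be converted to a concrete date like so:

```python
import datetime


def parse_filter_date(value):
    """Sketch: turn a filter date object into a datetime.date.

    `year` is required; `month` and `day` are assumed to default to 1
    when omitted (an assumption for illustration).
    """
    return datetime.date(value["year"], value.get("month", 1), value.get("day", 1))


print(parse_filter_date({"year": 2023, "month": 6, "day": 15}))  # 2023-06-15
print(parse_filter_date({"year": 2020}))  # 2020-01-01
```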


## for-data-scientists/writing-tasks/task-configuration.md

# Task configuration

The task components are the core of the task. However, the task YAML in its entirety includes _everything_ required to run a task. In addition to the task components, this includes the datasets to run on, task configuration settings, authentication details and other metadata.

A complete task file is a YAML document deemed valid according to the Bitfount task schema. This can be validated in your YAML editor of choice by referencing the Bitfount task schema at the top of the file like so:

```yaml
# yaml-language-server: $schema=https://docs.bitfount.com/schemas/task-spec.json
```

This is done automatically when uploading a task via the Bitfount App or Hub.

:::tip
In Bitfount terminology, a task initiator is often referred to as a **modeller**.

Meanwhile, a task runner is often referred to as a **Pod** (Processor of Data) which will contain one or more datasets to run the task on.

In some scenarios, the Pod and the modeller may be the same entity on the same machine, in other scenarios, they may be different entities on different machines.
:::

## Required fields

In addition to the core `task` component, the only other required field is the `pods` field which contains a list of dataset identifiers where the task will be sent to run.

- **task**: the task definition. See Task components for details.
- **pods**: a dictionary containing a list of dataset identifiers where the task will be sent to run.

When using the Bitfount App, dataset identifiers are automatically overwritten with the identifier that the task run is triggered against, so a typical task YAML uploaded via the app leaves the dataset identifiers unspecified, like so:

```yaml
pods:
  identifiers:
    - 
```

:::warning[Advanced Usage]

If running a task using the SDK, the dataset identifiers must be specified explicitly.

```yaml
pods:
  identifiers:
    - alice/sensitive-data
    - bob/sensitive-data
    - charlie/sensitive-data
```

:::

## Optional fields

In addition to the required fields, the following optional fields can be specified:

- **modeller**: specifies authentication details via `identity_verification_method`. The default is OIDC device code authentication, which triggers an interactive prompt requiring the user to validate a code in their browser. The options are `key-based`, `oidc-auth-code`, and `oidc-device-code`.

  :::tip
  For app-based runs, set to `key-based` to use RSA keys and avoid interactive prompts.
  :::

- **run_on_new_data_only**: whether to run the task on only new data that has not been seen in previous runs. Defaults to false. This will have no effect on the first run of a task _on a specific dataset_. Subsequent runs will only process new data that has not been seen in previous runs _on that dataset only_.
- **batched_execution**: whether to run the task in batches. Defaults to false. If enabled, the task will be split into batches of records and each batch will be processed sequentially. This is useful for large datasets that cannot be held in memory in their entirety. The task can only switch this on or off; the number of records in each batch is determined by the environment where the dataset is held. If using the app, configure it in the app settings; if using the SDK, set the `BITFOUNT_TASK_BATCH_SIZE` environment variable.
- **test_run**: run on a small subset for a quick validation. Defaults to false. This is useful for testing the task configuration and ensuring that the task will run correctly before running on the full dataset. The number of records that are processed is determined by the environment where the dataset is held. If using the app, configure it in the app settings; if using the SDK, set the `BITFOUNT_TEST_RUN_NUMBER_OF_FILES` environment variable. Only applies to file-based datasets.
- **force_rerun_failed_files**: whether to force re-running failed files at the end of the task. Defaults to true. Failed files are files that failed to process during the main body of the task run. Only applies to file-based datasets if the following conditions are met:
  - Batched execution is enabled in the task configuration.
  - Batch resilience is enabled in the environment where the dataset is held. Defaults to enabled in the app settings.
  - Individual file retry is enabled in the environment where the dataset is held. Defaults to enabled in the app settings.
- **template**: a dictionary containing template definitions for the task. See Templated fields for details.
- **task.data_structure.filter**: task-level filters to apply at runtime. These allow task initiators to further restrict which data is processed without modifying the dataset connection. Filters are combined with any dataset-level filters using "most restrictive wins" logic. See Task-level filters for available filter types and examples.
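
For SDK runs, the environment variables mentioned above control the batch size and test-run file count; a shell sketch with illustrative values (16 and 5 are assumptions, tune them to your environment):

```shell
# Illustrative values only: adjust for your dataset and hardware.
export BITFOUNT_TASK_BATCH_SIZE=16          # records per batch when batched_execution is enabled
export BITFOUNT_TEST_RUN_NUMBER_OF_FILES=5  # files processed when test_run is enabled

echo "Batch size: $BITFOUNT_TASK_BATCH_SIZE"
```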

:::warning[Advanced Usage]

If running a task using the SDK, you may also need to specify the project ID explicitly as a top-level key in order to use your project-specific access to a particular dataset or model that is part of the task.

- `project_id`: is used to associate the run to a specific project. When using the app, this is omitted as a task may be associated with multiple projects.
  :::

## Minimal complete example

```yaml
# yaml-language-server: $schema=https://docs.bitfount.com/schemas/task-spec.json

modeller:
  identity_verification_method: key-based

pods:
  identifiers:
    - 

batched_execution: true
test_run: false
force_rerun_failed_files: true
run_on_new_data_only: false

task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.ModelInference
      model:
        bitfount_model:
          model_ref: MyModel
          model_version: 2
          username: my-user
    - name: bitfount.CSVReportAlgorithm
  data_structure:
    select:
      include:
        - image_path
```


## for-data-scientists/writing-tasks/task-upload.md

# Uploading tasks

Once you have written your task YAML, you can upload it to the Bitfount Hub using the UI. This will make it available to other users to run on their datasets.

:::warning
All tasks are currently public but not discoverable by other users. This means that users cannot search for another user's tasks by name or description but can navigate to their task if they have the URL.
:::

## Uploading a task

To upload a task, you need to have a Bitfount account and be logged into either the Bitfount App or Hub.

1. Navigate to the "Tasks" tab in the left sidebar.
2. Click the "Upload task" button in the top right corner.
3. Either paste or upload the task YAML file you want to upload along with the name, description, type and any tags you want to add.
4. Click the "Upload" button.

### Task type

The task type is the type of task you are uploading. When creating a Project, users will be able to filter by task type to find tasks that are suitable for their project. The available task types are:

- Image Classification
- Image Segmentation
- Object Detection
- Tabular Analytics
- Tabular Classification
- Tabular Regression
- Text Classification
- Text Generation

### Task tags

The task tags are a list of tags that will be displayed for any projects that use the task. These are not used for filtering and are only used for display purposes. At least one tag is required. The available task tags are:

- Prediction
- Training
- Evaluation
- Querying
- Comparison
- Ophthalmology

## Validation

The task YAML will be automatically validated against the Bitfount task schema. Any errors will be displayed in the UI with a red squiggly underline beneath the offending line and a tooltip showing the error message.


## for-data-scientists/writing-tasks/templated-fields.md

# Templated fields

Templating is a powerful feature that lets you build re-usable task configurations with user-supplied inputs. Define inputs under a top-level `template` block, then reference them with `{{ variable_name }}` in place of the actual value. This is particularly useful for model IDs, column names, and other values that may vary between runs, as it avoids hardcoding values in the task file and having to create multiple task files for different values.

:::info
Templated variables are inserted into the task at task runtime after the project has been created. The only exception to this is the `model_slug` field which can be specified either at project creation time _or_ at task runtime.
:::
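
Conceptually, template substitution replaces each `{{ name }}` placeholder in the task YAML with the user-supplied value. A minimal Python sketch for string values only (not Bitfount's actual implementation):

```python
import re


def render_template(task_yaml, values):
    """Sketch: replace "{{ name }}" placeholders with supplied values.

    Illustrative only; real substitution is handled by Bitfount at
    task runtime and covers non-string types too.
    """
    def replace(match):
        return str(values[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, task_yaml)


snippet = 'model_id: "{{ retfound_model }}"'
print(render_template(snippet, {"retfound_model": "bitfount/RETFound_MAE"}))
# model_id: "bitfount/RETFound_MAE"
```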

## Where templating is supported

Templating is supported for a range of simple and domain-specific field types. The sections below show how to define a template input for each supported type.

### string

Use a `string` template when you want a free-text value. You can optionally enforce a `pattern` using a regular expression and provide a `default`.

```yaml
template:
  run_label:
    label: "Run label"
    type: string
    default: "baseline"
    pattern: "^[a-z0-9_-]+$"
```
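
The `pattern` constraint is a regular expression that the supplied value must match. Checking candidate values against the example pattern above:

```python
import re

# The pattern from the example template: lowercase letters, digits,
# underscores, and hyphens only.
pattern = r"^[a-z0-9_-]+$"

print(bool(re.match(pattern, "baseline")))     # True
print(bool(re.match(pattern, "Run Label 1")))  # False: uppercase and spaces
```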

### boolean

Use a `boolean` template for simple on/off switches.

```yaml
template:
  test_run:
    label: "Test run"
    type: boolean
    default: false
```

### number

Use a `number` template for numeric values, with optional lower bound and a default.

```yaml
template:
  top_k:
    label: "Number of top predictions to return"
    type: number
    minimum: 1
    default: 5
```

### array

Use an `array` template when you need the user to supply a list of values, for example multiple labels or column names. Specify `items.type: string`, and optionally `minItems` and a `default` list.

```yaml
template:
  include_labels:
    label: "Labels to include"
    type:
      array:
        items:
          type: string
        minItems: 1
        default:
          - "cat"
          - "dog"
```

:::info
The array type currently only supports strings.
:::

### file_path

Use a `file_path` template to open the user's file explorer and let them select a file, constrained by extension where appropriate.

```yaml
template:
  input_csv:
    label: "Input CSV file"
    type:
      file_path:
        extension: ".csv"
```

### model_slug

Use a `model_slug` template to expose a model picker allowing the user to select a model from a supported provider and library. The `provider` and `library` are required fields whilst `pipeline_tag` and `author` can be optionally provided to further restrict the available models.

:::info
The model_slug type currently only supports `huggingface` as the provider.
:::

The full list of available libraries can be found on the Hugging Face model hub, but they include the likes of:

- transformers
- timm
- keras
- pytorch
- tensorflow
- jax

`pipeline_tag` is the type of model you are looking for (or the _task_ as referred to by Hugging Face). For example, `image-classification`, `image-segmentation`, `text-classification`, `text-generation` as mentioned on the Referencing a Model page.

Meanwhile the `author` is simply the username of the model owner.

```yaml
template:
  hf_model_slug:
    label: "Hugging Face Image Classification Model"
    type:
      model_slug:
        provider: huggingface
        library: transformers
        pipeline_tag: image-classification
        author: google
```

:::info
The model slug can be specified either at project creation time or at task runtime. At project creation time, you will be presented with an option to choose the model from a picker as well as the option to allow the user to then override this model at task runtime.

A screenshot of the Bitfount task configuration UI with a templated model slug.
:::

### schema_column_name

Use a `schema_column_name` template when you want the user to choose a single column from their chosen dataset restricted by semantic type (for example `categorical`, `continuous`, `image`, or `text`). For instance, if you are working with an image model you may want to restrict the user to selecting only image columns.

```yaml
template:
  target_column_name:
    label: "Target column"
    type:
      schema_column_name:
        semantic_type: image
```

### schema_column_name_array

Use a `schema_column_name_array` template when the field you are templating takes an array of columns such as `data_structure.select.include` or `data_structure.assign.image_cols`.

```yaml
template:
  feature_columns:
    label: "Feature columns"
    type:
      schema_column_name_array:
        semantic_type: continuous
```

### task_filters

Use a `task_filters` template to allow users to configure data filtering criteria at task runtime. This presents a filter configuration UI where users can specify which data should be included based on file metadata, patient information, and imaging parameters.

```yaml
template:
  data_filters:
    label: "Data filters"
    type: task_filters
    tooltip: "Configure filters to restrict which data is processed"
```

When referenced in the task, the filters are applied to `data_structure.filter`:

```yaml
task:
  data_structure:
    filter: "{{ data_filters }}"
```

:::info
Task-level filters configured via templates are combined with any dataset-level filters set by the dataset owner at connection time. The resulting filter uses "most restrictive wins" logic—task filters can only further restrict the data, never expand it.
:::

Available filter options include:

- **Modality**: OCT or SLO imaging modality
- **B-scan frames**: Minimum and maximum frame counts
- **File size**: Minimum and maximum file size in MB
- **Date filters**: File creation, modification, scan acquisition, and patient date of birth ranges
- **Required fields**: DICOM fields that must be present
- **Series description**: Filter by DICOM series description

See Task-level filters for the complete list of filter types and their value formats.

## Referencing template variables

Use double curly braces within quotation marks to reference templated values:

```yaml
task:
  protocol:
    name: bitfount.InferenceAndCSVReport
  algorithm:
    - name: bitfount.TIMMInference
      arguments:
        model_id: "{{ retfound_model }}"
        class_outputs: "{{ class_outputs }}"
        checkpoint_path: "{{ checkpoint_path }}"
    - name: bitfount.CSVReportAlgorithm
      arguments:
        original_cols:
          - _original_filename
          - Filename
  data_structure:
    select:
      include:
        - "{{ image_column_name }}"
    table_config:
      table: 

template:
  test_run:
    type: boolean
    label: Test run
    default: false
    tooltip: >-
      Run the task with a small subset of data to test the configuration before
      running on the full dataset.
  class_outputs:
    type: array
    items:
      type: string
    label: Class outputs
    default:
      - CNV (%)
      - DME (%)
      - DRUSEN (%)
      - NORMAL (%)
    tooltip: >-
      The number of output categories for a classification task. Must match the
      number of labels.
    minItems: 1
  retfound_model:
    type:
      model_slug:
        author: bitfount
        library: timm
        provider: huggingface
    label: RETFound model
    tooltip: Select the RETFound model to be used in this task
  checkpoint_path:
    type:
      file_path:
        extension: tar
    label: Checkpoint file
    tooltip: >-
      Load a previously saved model checkpoint file to resume training from a
      previous state instead of starting from scratch.
  image_column_name:
    type:
      schema_column_name:
        semantic_type: image
    label: Image column
    default: Pixel Data 0
    tooltip: >-
      The dataset column that contains image data for model training and
      fine-tuning
  force_rerun_failed_files:
    type: boolean
    label: Re-run failed files
    default: true
    tooltip: >-
      Include files that failed in the last task run. Turn this off to skip
      them.
test_run: "{{ test_run }}"
force_rerun_failed_files: "{{ force_rerun_failed_files }}"
```

The templated YAML above would yield the following task configuration UI:

  Templated fields in the Bitfount task configuration UI


## getting-started/about-bitfount.md

# Introduction

Bitfount exists to safely unlock the value of sensitive data for the benefit of
humankind. We enable data collaborations **without** needing to transfer data to
other parties, an approach known as federated data science.

## What can you do with Bitfount?

You can use Bitfount to securely adopt AI in a range of scenarios, including:

- **Model inference:** Get predictions on your data and run a wide selection of
  models locally. All data remains behind your firewall with results only
  accessible to you.
- **Fine-tune models:** More efficient than training a model from scratch. Adapt
  foundation models on your sensitive data to complete specific downstream tasks
  you are interested in.
- **Federated learning:** The traditional application of federated data science.
  Federated learning enables you to train models across multiple distributed
  datasets.
- **Federated evaluation:** Test model performance on a variety of real-world
  data you don't have access to in raw form.
- **Private set intersection:** Determine the overlapping records in two (or
  more) disparate datasets without providing access to the underlying raw data
  of either dataset to any other collaborators.
- **Private analytics:** Run analysis queries and retrieve valuable insights.
  Data custodians remain in control of what kinds of metrics can be retrieved.

If you would like to learn more about what you could achieve with Bitfount,
please contact us at support@bitfount.com.

## How Bitfount works

With Bitfount, AI models can be securely deployed and run locally in sensitive
environments such as hospitals and clinics. This means:

- Data never leaves its original location.
- Insights are generated without writing a single line of code.

Before exploring federated data science, let's first look at how AI models are
typically built and used today.

### How AI models work

AI models use data to learn and perform tasks—like detecting diseases in medical
images or powering voice assistants. To work well, AI models need lots of
training data, which is often:

- Stored in different locations (e.g., across hospitals, banks, or mobile devices).
- Too sensitive to share due to privacy laws and security risks.

### The traditional approach: Data centralisation

The traditional way to solve this problem is data centralisation—bringing all
the data together in one place, like a data lake or cloud server.

  All data is moved to a single location for analysis and insights

However, in industries like healthcare and finance, where data is highly
sensitive, this approach has serious limitations. Personal information, medical
records, or financial details can't simply be shared due to privacy risks, legal
restrictions, and strict regulations.

AI has enormous potential to solve big problems—but how can we apply it to
sensitive data without compromising security? This is where federated data
science comes in.

### The alternative: Federated data science

Rather than moving data, **federated data science sends the AI model to the
data**.

This enables organisations to:

- Train and improve AI models on datasets they never actually see—even when the
  data is spread across multiple locations.
- Ensure compliance with privacy regulations.
- Collaborate securely while keeping full control of their data.

  Insights are shared, while data remains securely in its original location

This keeps information safe while still allowing AI to learn and improve.
Organisations get the insights they need without sharing private data. Let's
look at some real-world applications.

## How is federated data science used?

Federated data science is already making AI safer and more effective. Some
real-world uses include:

- Healthcare: Training AI to detect diseases without sharing patient records.
- Finance: Banks spotting fraud patterns without exposing customer data.
- Smartphones: Improving voice assistants without collecting users'
  conversations.

### Training AI securely with federated learning

Federated Learning (FL) allows AI models to be trained across multiple locations
without sharing raw data. Instead of collecting data in one place, the model
learns from distributed datasets stored across different institutions, servers,
or devices. Here's how it works:

1. **Model setup:** A global AI model is prepared and sent to multiple locations
   where data exists.
2. **Local training:** Each site trains the model using its own data but never
   shares the raw data itself.
3. **Model updates:** Each site sends back only the improvements (updated model
   parameters), not the data.
4. **Aggregation:** A central system combines all updates to refine the global
   model. The most common method, Federated Averaging (FedAvg), ensures sites
   with more data have a greater impact on training.
5. **Repeat:** The improved model is sent back for further training, repeating
   the process until it reaches peak performance.

This approach distributes the computing workload across many locations, making
AI training more efficient, scalable, and privacy-preserving.
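The aggregation step (4) can be sketched in a few lines. This is a minimal illustration of Federated Averaging, not Bitfount's implementation: each site reports its updated parameters together with its local example count, and the server takes a weighted average so sites with more data contribute proportionally more.

```python
# Minimal FedAvg sketch: weight each site's parameters by its share of
# the total training examples, then sum. Sites never share raw data —
# only these parameter updates.

def fedavg(site_updates):
    """site_updates: list of (parameters, num_examples) tuples,
    where parameters is a list of floats (a flattened model)."""
    total_examples = sum(n for _, n in site_updates)
    global_params = [0.0] * len(site_updates[0][0])
    for params, n in site_updates:
        weight = n / total_examples  # larger sites get more influence
        for i, p in enumerate(params):
            global_params[i] += weight * p
    return global_params

# Two sites: the second holds three times as much data, so its
# parameters dominate the average.
print(fedavg([([1.0, 2.0], 100), ([5.0, 6.0], 300)]))  # [4.0, 5.0]
```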


## getting-started/ehr-integration.md

# EHR Integration

Bitfount supports the integration of the following EHR systems:

- NextGen
- Nextech
- ModMed
- Optivate/EyeMD
- Epic

If you would like to learn more about connecting one of these EHR systems (or any other FHIR compatible EHR) with Bitfount, please contact us at support@bitfount.com.


## getting-started/installation.md

# Installation

A quick overview to get you up and running with Bitfount.

## Creating an account

Firstly, create your Bitfount account.
Your username will be your primary identifier within the Bitfount ecosystem and
can be used to connect datasets, run tasks, use models or join projects for
collaborations.

## Accessing Bitfount

After creating your account, we recommend installing the right software to make
the most of the Bitfount platform.

1. **Bitfount Desktop:** Connect datasets
   and run tasks easily with our desktop application. Bitfount Desktop provides
   a no-code interface for managing projects and datasets, allowing you to
   securely run pre-built AI models and SQL tasks locally.
2. **Bitfount SDK:** For data scientists
   and technical users to interact directly with Bitfount APIs. The SDK supports
   the deployment of models, provisioning task templates to our desktop
   app users, and more complex use cases like configuring entire federated data
   collaboration networks.
3. **Bitfount Hub:** Both SDK and Desktop users
   leverage the Hub as a point of authentication. The cloud-based Hub mirrors
   the functionality of Bitfount Desktop but does not facilitate connecting
   datasets or running tasks.

## Installing Bitfount Desktop

Installing Bitfount Desktop is as simple as downloading and running the
installer from our website. When you first launch Bitfount Desktop you will be
prompted to sign into your Bitfount account and link it to the application.

We support two app versions:

1. Windows - for Windows 10 or later
2. macOS - for Apple Silicon machines

If you don't have access to a machine with either of these operating systems,
please get in contact with our support team.

:::info
Looking to interact directly with Bitfount APIs and provision tasks? Please visit our SDK guides for details about installation and running federated analyses.
:::

### Hardware recommendations

Data science tasks are compute-intensive and run more efficiently with appropriate hardware. We recommend installing Bitfount on a machine with a GPU (Graphics Processing Unit). This could be any Apple device with a Silicon chip, or any machine running Windows and fitted with an NVIDIA GPU.

### Software updates

We always recommend running the latest version of the Bitfount app. As we roll out new features, you will be notified in-app when updates are available; each update is accompanied by release notes outlining the changes.


## getting-started/settings.md

# Application Settings

In the Application Settings page, you can configure various settings for the Bitfount application. Settings are organised into the following sections: Aggregate tracker and Orchestrator.

## Aggregate tracker

There are some projects that make use of an "aggregate" report i.e. a single file that tracks various metrics over the course of multiple task runs within a project. Each task run will still produce its own results, but the aggregate report will aggregate the results of all task runs. This is enabled by default for the projects that support this feature. However, what can be configured is the location of the aggregate report. Often, it can be beneficial to set this to a shared location where non-Bitfount users can also access the report.

aggregate-tracker-settings.png

## Orchestrator

The Orchestrator is the engine in the Bitfount app that coordinates the execution of tasks. Certain configuration settings that pertain to the orchestrator have been exposed here for adjustment.

orchestrator-settings.png

### The number of files to load for test runs

Some tasks allow a test run to be performed on a subset of the files in the dataset to identify any issues with the data or task configuration more quickly. If your project task supports this, it will be visible as a checkbox in the task run page. This setting controls how many files are processed when running a test run. Default is 1.

### Enable batch resilience

Continue with the task even when a batch of data raises an error, instead of aborting the entire task run. Default is enabled.

### Retry failed files

Automatically re-run any files that failed as part of a batch individually, after batch processing completes. Often, a single broken file can cause an entire batch of data to fail even though the rest of the data is valid. This setting allows you to automatically re-run the failed files individually to make sure that otherwise valid data is not ignored. Only applies if batch resilience is enabled. Default is enabled.

### Heidelberg DoB Fix

Apply a fix for Heidelberg files where dates of birth before 1944-11-07 may otherwise be misparsed. Default is enabled.

### Allow extra Zeiss Transfer Syntaxes

Enable additional DICOM TransferSyntaxUIDs to be considered when decoding Zeiss images. This feature is still under development, so it is disabled by default.

### Orchestrator log level

Controls the level of detail stored in Orchestrator logs. Options: `DEBUG`, `INFO`, `WARN`, `ERROR`. Default is `INFO`.

### Network drive robustness

Enable additional robustness checks for network drive operations. Use if data is located on a network drive but the connection may be unreliable. This may be slower and consume more resources. Default is disabled.

### Task batch size

Number of files processed in each batch. Larger batches can speed up processing but require more memory, so adjust this setting to suit the available memory on the machine. Default is 16.

### Max consecutive batch failures

Maximum consecutive batch failures before the task is marked as failed. Use `unset` to disable the limit. Default is 5. This setting is only applicable if batch resilience is enabled.

### Hugging Face User Access Token

You can configure a Hugging Face User Access Token to authenticate with the Hugging Face Hub when downloading models that require access permissions (e.g. gated models). The token is sent from the modeller to the pod when a task is initiated, and dictates the modeller's permissions for accessing Hugging Face models.

:::warning[Privacy & Security]
When setting up your Hugging Face token, we strongly recommend using a **read-only (fine-grained) token**. This ensures that the token can only be used to download models and cannot be used to modify any resources on your Hugging Face account. Since the token is transmitted to the pod as part of task execution, using a read-only token minimises security risk.
:::

### MPS (Apple Silicon)

If you are running Bitfount on a Mac with Apple Silicon (M1, M2, M3, M4 chips), you can enable **MPS (Metal Performance Shaders)** acceleration. MPS leverages Apple's GPU hardware to significantly speed up machine learning model inference and training compared to running on CPU alone.

This setting only applies to tasks running locally on the device — it has no effect on remote datasets or pods running on other machines.

:::info
MPS is not supported by all model types. If a model or algorithm does not support MPS, the setting will be ignored and the task will fall back to CPU.
:::


## getting-started/platform-overview/datasets.md

# Datasets

Datasets in Bitfount act as references to your data, storing only metadata and schema—not the raw data itself.
Your datasets always remain on your system and are never transferred or stored by Bitfount.

This guide covers how to connect a dataset to Bitfount, link it to a project,
and manage dataset access.

## Connecting datasets

Before using a dataset in a project, you must first connect it to Bitfount using
Bitfount Desktop. Connecting a dataset to
Bitfount is like registering it—only its metadata (name, description, and
schema) is stored, never the raw data itself.

### Format

It's important to ensure your dataset is formatted correctly to be compatible
with the task used in the project. If you are joining an existing project,
please check with the project contact to ensure your dataset meets the
requirements for the task.

### Selecting a data source

To connect a dataset, click `Connect dataset` either from the `Datasets` page,
or within the project when you link a dataset, and choose from the available
data sources supported by Bitfount.

product-modal-datasources-min.png

:::tip
If your dataset contains DICOM files and you intend to run Ophthalmic tasks, we recommend selecting the **DICOM (Ophthalmology)** data source for optimal compatibility.
:::

After selecting a data source, enter a dataset name and, optionally, a
description, then click `Connect dataset`. The system will then process the
connection, making the dataset available within Bitfount.

Once connected, the dataset should appear `Online`.

:::note
**Can't find the data source you need?** Please reach out to the Bitfount support team—we're happy to help you connect your dataset to Bitfount.
:::

### Schema

When you connect a dataset, Bitfount automatically generates a schema that
defines the column names and data types within your dataset. This schema is used
to verify compatibility with the task used in a project, and **does not contain
any actual data** (such as patient records), only structural information about
the dataset.

If you are working with data scientists, they may also reference the schema to
design analyses and tasks that align with your dataset's structure.

product-schema-min.png
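For illustration only (the layout and field names here are assumptions, not Bitfount's exact schema format), a generated schema records structural information along these lines:

```yaml
# Illustrative sketch only — structural metadata, never actual data values.
MyRetinalDataset:
  columns:
    Patient ID: string
    Acquisition DateTime: datetime
    Pixel Data 0: image   # semantic type: image
```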

## Managing datasets

### Status

When you start Bitfount Desktop, the system automatically attempts to establish a connection with all connected datasets, whether they are online or offline.

If needed, you can manually take a dataset offline from the `Settings` tab, which will temporarily disable task execution for that dataset.

:::info
Tasks cannot run until Bitfount has finished connecting all datasets at startup
:::

### History

A full audit trail is available for datasets via the `Activity history` tab. To
view project-specific activity, navigate to the same tab in the relevant
project.

### Archiving

From the `Settings` tab you can archive your dataset. Archiving does not delete
the raw data source connected to Bitfount. Archived datasets can be unarchived
and reused in projects when appropriate.

### Access

You can view all projects the dataset is currently linked to via the
`Linked projects` tab on the dataset's detail page. You can unlink a dataset
from a project at any time by clicking the `Unlink dataset` button within the
project's `Datasets` tab.


## getting-started/platform-overview/models.md

# Models

Generally, AI models are a core component of tasks. They are programs developed
by data scientists designed to analyse datasets to find patterns and make
predictions. Models are leveraged to complete tasks related to computer vision,
natural language processing, and applied in a range of other AI domains.

## Inspecting models

Just like tasks, model code can also be inspected if you have permission to access the model. Model owners
can mark models as private or public.

model-overview.png

## Tasks with interchangeable models

In Bitfount, tasks define the model that will run within a project. Some tasks offer flexibility,
allowing you to choose from a selection of open-source models.

If a task includes a model that requires usage approval, it will be marked as `Pending Approval`.
The model owner must approve its use before you can run the task.

model-selection.png

## Creating a model

The platform supports the integration of user-provided models which, once uploaded, can
be templated into tasks to train, run evaluation, or run inference. If you are a data
scientist, please see the guide,
or visit our SDK Tutorials for more information on implementation.

## Uploading your model

Already have a model you want to use in a task? You can upload it to Bitfount using the Hub or Desktop App. See our guide for more information.


## getting-started/platform-overview/projects.md

# Projects

Projects are the core workspace in Bitfount, where you extract valuable insights
from your datasets and collaborate securely with others. Here, collaborators can
link their datasets and run the task assigned to the project.

- If you have been invited to collaborate, skip ahead to
  Joining a project.
- If you are creating a project from scratch, keep reading to learn how to set
  it up.

## Creating a project

Projects can be created by simply navigating to the projects tab and clicking
the `Create project` button.

product-projects-min.png

### Project metadata

The table below outlines all the metadata that can be defined within a project.

| Metadata type            | Definition                                                                                              |
| ------------------------ | ------------------------------------------------------------------------------------------------------- |
| Project name             | Title of the project on the platform (3–50 characters)                                                  |
| Description              | 1-2 sentence description of the goals of the project                                                    |
| Official link (optional) | URL to more information or an official brochure associated with the project                             |
| Organisation (optional)  | Person leading or organisation sponsoring the project                                                   |
| Contact email (optional) | Contact for all collaborator-related queries                                                            |
| Duration (optional)      | Timeline for the project                                                                                |
| Project terms (optional) | Terms & conditions of the project. If included, these must be accepted by collaborators before joining. |
| Task                     | Predefined machine learning tasks or analyses available within the project                              |

### Defining project terms

If your project requires terms and conditions, we recommend consulting with
legal advisors to ensure they are appropriate. Bitfount does not verify or
enforce project terms beyond standard role-based data access permissions.

Consider including details on:

- Confidentiality: Protecting sensitive information.
- Exclusivity: Restrictions on participation or data use.
- Data Management & Collection: How data is handled within the project.
- Data Subject Privacy: Ensuring compliance with relevant privacy regulations.
- Participation Rules: Defining who can join and under what conditions.
- Scope of Tasks: Outlining the specific analyses or AI tasks to be performed.

If your terms are too long or reference multiple documents, you can link to
hosted documents instead of including them in full.

### Selecting a task

A Bitfount task is the _brain_ of a project. It specifies the algorithm(s) that
will run on any dataset linked to the project which can include the use of AI
models as well as other data science operations.

To add a task to your project, click `Add task` and browse the available
options. Bitfount offers a variety of pre-built tasks, some of which allow you
to choose from a selection of open-source models.

:::caution
You cannot change or remove the task once you create the project
:::

task-selection.png

Each task has a unique set of input parameters that must be configured by collaborators
for the task to run successfully. These parameters are visible when the project is created,
allowing collaborators to review and set the required values before running the task.

:::info
Looking to create your own tasks to use in Bitfount Desktop? Please
refer to Task Templates &
Models
in the Data Scientist documentation.
:::

## Managing a project

Once you have set up a project, you may want to invite collaborators,
edit project metadata, or archive the project.

### Inviting collaborators

To invite new collaborators, select the project, navigate to the
`Collaborators` tab, click the `Invite collaborators` button and enter the
email or username of the users you wish to invite. Any users invited will receive an email
invitation to join the project.

Once the user has created a Bitfount account, they will be able to review the
project details and must accept the project terms (if defined) before joining.

Once the user joins, they can link their dataset and run the task associated
within the project.

You can remove a collaborator at any time by navigating to the 'Collaborators'
tab. Once a collaborator is removed, any of their connected datasets will also
be unlinked.

### Updating or archiving a project

To edit a project's metadata, click the three dots on the projects page and
select `Edit project`.

Archiving a project can be achieved by navigating to the `Settings` tab and
selecting `Archive projects`. Any datasets will be unlinked from the archived
project and running the associated task will be blocked.

product-archive-project-min.png

Projects can be restored at any time by returning to the settings tab and
clicking `Restore project`.

## Monitoring project activity

We recognise how important it is to have sufficient oversight of how
collaborators are interacting with one another's data or tasks to fulfil the
needs of the project.

Different users can see different views as follows:

- **Project Owners** can view model usage as well as activity history for the
  whole project, including when projects were created and invitations that were
  issued or revoked; task run history is visible only for their own datasets.
- **Data Custodians** can view the activity history related to their dataset
  only, allowing them to see when they last ran a task, and any results related
  to that task run.

### Accessing logs

Logs are technical audit trails of a user's interaction with Bitfount and can
hold useful information for the Bitfount team to help resolve any technical
issues that might occur. To retrieve log files, click the `Logs` link in the
sidebar.

## Joining a project

If you have been invited to collaborate on a project, you will receive an email
invitation to join the project.

Joining a project allows you to link datasets and run the assigned task.
Before joining, review the project details, task configuration, models, and any
available terms and conditions to ensure alignment with your expectations.

If you are unsure about the project's scope, consider consulting your project contact
or legal team before proceeding.

Finally, when you're happy to continue, go ahead and click
`Accept and join project`.

product-join-modal-min.png

## Linking datasets to projects

After creating or joining a project, the next step is to connect and link a dataset
(or link an already connected dataset - see more on our
Datasets page).
This is essential because the project's task can only be run on linked datasets.

Linking a dataset ensures that the assigned task can access the necessary data
while keeping it securely stored in its original location.

To run the associated task for the project you need to click the `Link dataset`
button.

After selecting the dataset, Bitfount will automatically check if the dataset schema
is compatible with the task. Once this is complete, you will see your linked dataset
within the project.

If the schema check returns an error, please review the dataset schema and ensure that
the expected columns are present in the data and named accordingly.

:::info
Learn more about connecting and managing datasets on our
Datasets page
:::

## Running tasks

Once you have linked a dataset you are ready to run the project's associated
task(s) by clicking the `Run task` button within the `Task runs` tab. Before
running, you must first set any parameters required to run the task. These will
vary based on the task.

Task completion times will depend on the type and size of the dataset, the
complexity of the task, and available compute resources on your machine.

Once the task is complete, any results will appear within the task run.

product-task-complete-min.png

## Interpreting results

The output generated from a successful task run will vary based on the algorithm
used in the task. This could take the form of a PDF report, CSV file or other
formats depending on the configuration of the task. For more details on how to
interpret results, please reach out to your project contact or
support@bitfount.com.

## Next steps

You can now go ahead and join, or create, your first project. Alternatively,
get up and running more quickly with one of our
Demo projects.


## getting-started/platform-overview/tasks.md

# Tasks

A Bitfount task is the _brain_ of a project. It specifies the algorithm(s) that
will run on any dataset linked to the project which can include the use of AI
models as well as other data science operations.

## Selecting a task

Available tasks can be viewed when clicking the `Add task` button within a
project, or by navigating to the `Tasks` tab. Bitfount hosts off-the-shelf tasks
that are provisioned in our demo projects; other users can also create their own
tasks for use in projects.

task-selection.png

## Inspecting a task

If you have been invited to join a project, a task will already have been added
by the project owner. Before joining, you will be able to review the task
configuration by clicking on the task card in the project. This will show
details about the protocols, algorithms, and models that will run on your
dataset.

product-task-details-min.png

## Running a task

Tasks are run within projects to generate insights from your datasets.

### How to run a task

1. Navigate to the project and click `New task run`.
2. Link a dataset that is compatible with the project's task.
3. Add or adjust any required task parameters.
4. Click `Run task` to begin processing.

:::note
Task completion time depends on dataset size, task complexity, and available computing resources.
:::

### Viewing results

Once the task run is complete, results can be accessed via the task run.
Depending on the task, output formats may include CSV files, reports, or other
structured data formats. For guidance on interpreting results, refer to your
project lead or contact the
Bitfount support team.

:::info
Task results are only accessible to the owner of the dataset and remain completely private. Bitfount does not have access to any results.
:::

## Creating a task

If you're looking to build tasks and provision them on the Bitfount platform please see the documentation here.

