
Preparing to Connect Data

Before you can connect your data to a Pod for analysis, you may want to work with your colleagues or partners to answer important questions about the data and how it will be used. If you have a simple dataset or use case, you can likely configure your Pod based on the examples provided in the Bitfount tutorials or in Data Source Configuration Best Practices without referring too much to this guide.

However, for more complex use cases or for your own reference, the answers to the questions below will dictate which arguments you choose when configuring the Pod using the Pod class. You may wish to consider them before moving to the next step!

Pod Nomenclature

Naming Pods clearly is important for searchability within the Bitfount Hub and for avoiding errors when working with Pods. Names are specified using two arguments:

  • name: An argument for the Pod class; this is the name used when interacting with the Pod via YAML or the Bitfount Python API.
  • display_name: An argument for the PodDetailsConfig class; this is the name you or your partners will see displayed in the Bitfount Hub when exploring or authorising Pods.

1. What are best practices for naming Pods?

If you will be working with colleagues or external partners on authorised Pods, it is a good idea to name your Pods such that collaborators can easily find and work with them based on the data they'd expect to be connected to the Pods. Typically, we suggest:

  • Keep the name and display name as similar as possible and human-readable, unless everyone who will be interacting with the Pods shares an understood set of database codes.
  • Avoid underscores and punctuation in names; names containing them will be rejected. If you need to separate words in the name argument, use hyphens.
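As a sketch of this guidance, a hypothetical helper (sanitise_pod_name is our own illustration, not part of the Bitfount API) could derive a valid name from a human-readable display name:

```python
import re

def sanitise_pod_name(display_name: str) -> str:
    """Derive a hyphenated, lowercase Pod name from a display name.

    Illustrative only: replaces whitespace/underscores with hyphens and
    strips remaining punctuation, per the naming guidance above.
    """
    name = display_name.strip().lower()
    name = re.sub(r"[\s_]+", "-", name)      # spaces/underscores -> hyphens
    name = re.sub(r"[^a-z0-9-]", "", name)   # drop punctuation
    return re.sub(r"-{2,}", "-", name).strip("-")

print(sanitise_pod_name("Census Income (2024) Demo"))  # census-income-2024-demo
```

Keeping the name mechanically derived from the display name makes the two easy to match up in the Bitfount Hub.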

2. What happens if I make a mistake on Pod names or create two Pods with the same name?

The name argument is the source of truth for creating new or overwriting existing Pods. If you wish to change a Pod's display name, you can easily do so by re-running the Pod configuration and using the same name argument as you did previously with the new display_name you'd like to specify.

If you specify different name arguments in different Pod configurations with the same display name, however, this will create two different Pods with the same display name. This may cause confusion among your colleagues or partners, so it is best practice to check that you are not creating a duplicate Pod prior to configuring a new Pod.

Deleting a Pod is not yet supported by default. Please reach out via the Bitfount Forum's Support and Feedback section if you run into issues with Pod naming and configurations.

Data Sources

Data sources are Bitfount's term for the format or database type from which a data custodian connects datasets for analysis and permissioning within the Bitfount Hub. They are specified in the Pod class using the datasource argument.

1. In which format or database type is my data currently?

Bitfount supports the file types and databases below by default. If your data is not in one of these formats or accessible by database connection, you may wish to convert it to one of these data sources. Note that if your dataset is a set of image files, Bitfount supports connecting these via any data source (see Data Source Configuration Best Practices for an example). We also provide the option to use custom DataSource plugins if desired.

All Bitfount-supported data sources leverage pandas to ensure your file or database contents are compatible with our systems. We do not impose any Bitfount-specific limitations; however, if you run into errors connecting your data, you may need to specify keyword arguments as a dictionary for Bitfount to pass through to pandas.

2. What kind of analyses will I or my partners wish to perform on the dataset(s)?

The analysis you or your partners wish to perform will affect the data source you choose. Most default data sources support Bitfount-supported task elements by default. However, if your data is in a multi-sheet Excel file and you or your partners will wish to perform tasks across sheets, you must convert your file to SQLite format as demonstrated in Data Source Configuration Best Practices.
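The shape of that Excel-to-SQLite conversion can be sketched as follows. This uses in-memory rows in place of an actual Excel read (which would typically use pandas.read_excel with sheet_name=None), and an in-memory database in place of a real output file; table contents are invented for illustration:

```python
import sqlite3

# Stand-in for sheets read from a multi-sheet Excel file.
sheets = {
    "patients": [("id", "age"), (1, 34), (2, 51)],
    "visits": [("patient_id", "visit_date"), (1, "2023-01-10")],
}

conn = sqlite3.connect(":memory:")  # a real conversion would use a file path
for table, rows in sheets.items():
    header, data = rows[0], rows[1:]
    # Table/column names come from our own trusted constants above;
    # real code should validate them before interpolating into SQL.
    conn.execute(f"CREATE TABLE {table} ({', '.join(header)})")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", data)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM patients").fetchone()[0])  # 2
```

Each sheet becomes its own table in the SQLite file, which is what makes cross-sheet tasks possible downstream.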

Supported DataSources

Connecting data to a Bitfount Pod is done by specifying the appropriate DataSource, which is Bitfount’s class for enabling a Pod to read and access data in its correct format. Bitfount currently supports the following DataSources:

DataSource   | Description                                        | Supported Configuration Mechanisms
CSV          | Supports connection of comma-delimited .csv files  | YAML, Bitfount Python API
Database     | Supports connection of databases                   | YAML, Bitfount Python API
DataFrame    | Supports connection of Pandas dataframe structures | Bitfount Python API
Excel file   | Supports connection of standard Excel files        | YAML, Bitfount Python API
Intermine    | Supports connection to Intermine databases         | YAML, Bitfount Python API

For detailed examples on when to use and how to configure each data source type, see Data Source Configuration Best Practices. For more technical details, see the datasources classes page in our API Reference guide. If you don’t see your preferred DataSource here, you may wish to contribute a custom DataSource plugin to Bitfount. Please see Using Custom Data Sources for more details.

DataSources requiring additional licensing

We also maintain plugins that support various specialised use cases:

DataSource | Description                                         | Supported Configuration Mechanisms
Heidelberg | Supports connection of Heidelberg Eye Explorer data | YAML, Bitfount Python API
TopCon     | Supports connection of TopCon data                  | YAML, Bitfount Python API

These currently require additional licensing. Please contact us on info@bitfount.com if you would like to explore further.

Supported Databases

info

The bitfount package does not install datasource-specific dependencies, so ensure you’ve installed any additional dependencies required to connect your datasource.

Postgres Installation

Bitfount supports most SQL databases as data sources. To use a Postgres database as a DataSource within the Bitfount platform, you must have the following packages installed:

Package         | Version
bitfount        | ≥ 0.5.15
psycopg2-binary | ≥ 2.7.4

We also support any databases supported by sqlalchemy: https://docs.sqlalchemy.org/en/14/dialects/.
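SQLAlchemy addresses databases by connection URL in the standard dialect+driver form, so a Postgres connection string will typically look like the one built below. The helper function and all credentials/hostnames here are illustrative placeholders, not Bitfount API:

```python
def make_postgres_url(user: str, password: str, host: str,
                      database: str, port: int = 5432) -> str:
    """Build a SQLAlchemy-style connection URL: dialect+driver://user:password@host:port/db.

    Illustrative helper; real code should URL-escape the password.
    """
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"

print(make_postgres_url("analyst", "secret", "db.example.com", "census"))
# postgresql+psycopg2://analyst:secret@db.example.com:5432/census
```

Other dialects from the SQLAlchemy page linked above follow the same URL scheme with a different dialect+driver prefix.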

Intermine Installation

To set up an IntermineSource within the Bitfount platform, you must have the following packages installed:

Package   | Version
bitfount  | ≥ 0.5.15
intermine | ≥ 1.13.0

Want us to support a specific file format or database type not listed here? Please provide feedback in the Bitfount Forum's Product Feedback section or up-vote others' requests.

Data Configuration

The PodDataConfig class enables you to specify arguments dictating how data will be presented to you or your collaborators and 'read' when performing tasks. Within the Pod class, these arguments are passed via pod_data_config, which is an optional argument. To determine whether you need to configure any of these settings, ask yourself:

1. Does my dataset contain any fields/features (including images) for which default data types are not already specified or readable by default?

If you have any fields whose values might not conform to common data standards, it is worthwhile to use the force_stypes argument to specify the semantic type of each field/feature in the dataset. Bitfount will attempt to assign semantic types by default, so this parameter is optional unless your data includes images; image columns must be explicitly mapped using the image parameter. Details on this parameter can be found in PodDataConfig.

2. Do I want to exclude any columns from my dataset from being used for analysis?

You can exclude columns of your dataset from being connected to Bitfount using the ignore_cols argument, which is defined in your relevant data source class (i.e. CSVSource, DatabaseSource, etc.) and passed directly through the data source specification within your Pod configuration. This is most commonly used to remove personally identifiable information or internal-use fields that will not be relevant to partners or to the analysis they wish to perform. This argument is helpful if you don't want to create a new cut of the data for every collaboration. Take note of any fields you wish to ignore before configuring a Pod.
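The effect of ignoring columns can be pictured with a small stand-in (drop_columns and the field names are our own illustration; in practice you simply pass ignore_cols to the data source):

```python
def drop_columns(rows, ignore_cols):
    """Return rows with the named columns removed (stand-in for ignore_cols)."""
    return [{k: v for k, v in row.items() if k not in ignore_cols} for row in rows]

rows = [
    {"patient_name": "Ann", "age": 34, "internal_ref": "X1"},
    {"patient_name": "Bob", "age": 51, "internal_ref": "X2"},
]
# Remove PII and internal-use fields before analysis.
print(drop_columns(rows, {"patient_name", "internal_ref"}))
# [{'age': 34}, {'age': 51}]
```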

3. Where and how will my file or database be located/accessed?

Depending on your selected data source, there are additional parameters you need to specify in order to properly connect the data. These are defined at the DataSource level. When using the YAML format, you will need to pass them via the datasource_args argument. Examples include file paths, database connection URLs and credentials, or other authentication requirements.

See below for a list of additional required parameters by data source type:

CSVSource
  • path: Path or URL to the .csv file.

DatabaseSource
  • db_conn: Takes arguments specified in DatabaseConnection to specify the database URL and authentication variables.
  • database_partition_size: Option to dictate how the database is partitioned.
  • max_row_buffer: An integer representing the maximum number of rows to stream from the database at a given time; useful for large datasets where a user may run into memory issues when querying or training.

DataFrameSource
  • data: The dataframe to be loaded.

ExcelSource
  • path: The path or URL to the Excel file.
  • sheet_name: The name(s) of the sheet(s) to connect. By default, all sheets are loaded.

IntermineSource
  • service_url: Your Intermine database URL.
  • token: Your Intermine authentication token.

Example configurations for each data source type are available in Data Source Configuration Best Practices.
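As a rough sketch of how such parameters appear in a YAML configuration (the exact layout and keys accepted are defined by your Bitfount version, and the path here is a placeholder; consult Data Source Configuration Best Practices for canonical examples):

```yaml
name: census-income-demo
datasource: CSVSource
datasource_args:
  path: /data/census-income.csv   # placeholder path
```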

4. Will my data source consist of multiple files or image files?

If your data source contains references to file paths and you want to change the root folder, you can optionally use the modifiers parameter to specify the file path prefix or extension. This is required for image files stored in a directory so that Bitfount can cycle through all of your image data via one data source. An example of the usage of the modifiers parameter is demonstrated in the Training on Images tutorial.

5. Will I or my partners perform ML or SQL tasks requiring test, training, or validation sets of data? Will the experiments require consistency?

If you are unfamiliar with this process, it's common for ML engineers or data scientists to require subsets of data to perform different functions when developing a model or performing analysis. Within the PodDataConfig class, Bitfount provides data custodians with a mechanism for dictating how to split datasets for these purposes within the data_split argument. Using this argument requires you to specify the DataSplitConfig, which takes the data_splitter parameter.

By default, data_splitter assigns 10% of the dataset to a test sample, 10% to a validation sample, and 80% to a sample used for training. If you wish to change the defaults, pass a PercentageSplitter specifying the validation and test sample percentages via the data_splitter argument of the data source; the unspecified portion will be assigned to the training set. For example:

...
datasource=CSVSource(..., data_splitter=PercentageSplitter(validation_percentage=30, test_percentage=10))
...

If your or your partners' analyses require consistent splits, you will also want to specify the seed parameter in the DataSource. The seed parameter dictates the starting point for any randomisation task a data scientist wishes to perform. The default value for this parameter is 42; however, you can specify any integer if you'd like. If you specify the seed, your configuration will look like:

...
datasource=CSVSource(..., seed=100, ...)
...
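The splitting behaviour itself can be pictured with a small stand-alone sketch. This mimics what a percentage splitter with a fixed seed does; it is not the Bitfount implementation:

```python
import random

def percentage_split(rows, validation_percentage, test_percentage, seed=42):
    """Deterministically shuffle rows, then split into train/validation/test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # same seed -> same split every run
    n = len(shuffled)
    n_val = n * validation_percentage // 100
    n_test = n * test_percentage // 100
    return (
        shuffled[n_val + n_test:],       # training = the unspecified remainder
        shuffled[:n_val],                # validation
        shuffled[n_val:n_val + n_test],  # test
    )

train, val, test = percentage_split(list(range(100)), 30, 10, seed=100)
print(len(train), len(val), len(test))  # 60 30 10
```

Because the shuffle is seeded, re-running a task reproduces the identical split, which is what gives collaborators consistent experiments.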

6. Does my data require a bit of additional cleaning?

Bitfount supports the removal of NaNs and the normalisation of numeric values if you'd like. Just set auto_tidy=True in the PodDataConfig when configuring the Pod if you would like us to perform this step on your behalf; otherwise, this parameter is optional.
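As a rough picture of what such tidying involves (this generic sketch approximates NaN removal and normalisation; the exact behaviour of auto_tidy is defined by Bitfount):

```python
import math

def tidy(values):
    """Drop NaNs, then min-max normalise the remaining values to [0, 1]."""
    clean = [v for v in values if not math.isnan(v)]
    lo, hi = min(clean), max(clean)
    if hi == lo:
        return [0.0 for _ in clean]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in clean]

print(tidy([2.0, float("nan"), 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```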

Data Schemas

When you connect a dataset to a Pod, you will need to define its schema to ensure you or your collaborators are able to correctly perform tasks on the dataset later.

The data schema is displayed on the Pod’s profile page in the Bitfount Hub so that Data Scientists can understand which data fields are available to them for analysis and their semantic types (e.g. integers, strings, floats, etc). If you’ve correctly specified a DataSource, but do not specify the schema for the Pod, Bitfount will attempt to define the schema on your behalf. However, you may wish to consider the following before leaving the schema argument as None:

1. Are the column or file headers for my dataset what we want to see in the Bitfount Hub or to use in performing tasks? Are they human-readable?

Bitfount currently does not support the alteration or specification of field/feature headers, so we recommend you set file or column headers to human-readable names or references you or your collaborators will understand prior to connecting data to a Pod.

2. Do I want to include multiple tables or Excel sheets in the Pod?

The BitfountSchema class allows you to specify multiple table_names, as well as descriptions for tables or columns. This allows you to associate multiple tables from a given data source with a single Pod if desired. The schema can then be passed to the Pod class via the schema argument.

Multi-Pod Interactions

Data Scientists are generally able to perform tasks across any Pods to which they have access, as long as the Pods meet the multi-Pod ML or SQL task requirements. However, if a Data Scientist wishes to use the SecureAggregation protocol, the Pods they specify will each need to appear in the approved_pods list of every other relevant Pod. To determine whether you need to leverage the approved_pods argument of the Pod class when configuring a Pod, ask:

Will I or my collaborators need to use this Pod's data in combination with that of another Pod for Secure Aggregation?

If no, do not specify the approved_pods argument. If yes, be sure to specify the list of Pods across which you are comfortable having tasks run in concert with the Pod you are configuring. Keep in mind:

  • Any Pods you list will be permissible for querying or running ML tasks in combination with one another only if a Data Scientist has permission to access all Pods in the list.
  • If the Pods you list do not also include your Pod in their approved_pods list(s), Data Scientists with access to those Pods still will not be able to perform SecureAggregation tasks across them.
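The mutual-listing requirement can be sketched as a small check (can_securely_aggregate and the Pod names are our own illustration, not Bitfount API):

```python
def can_securely_aggregate(pods_to_use, approved_pods):
    """True only if every Pod in the task lists every other Pod in its approved_pods."""
    return all(
        other in approved_pods.get(pod, [])
        for pod in pods_to_use
        for other in pods_to_use
        if other != pod
    )

approved_pods = {
    "hospital-a": ["hospital-b"],
    "hospital-b": ["hospital-a"],
    "hospital-c": [],  # has not approved anyone
}
print(can_securely_aggregate(["hospital-a", "hospital-b"], approved_pods))  # True
print(can_securely_aggregate(["hospital-a", "hospital-c"], approved_pods))  # False
```

The second call fails because hospital-c does not list hospital-a, even though a Data Scientist might have access to both.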

Privacy-Preserving Configurations

Pod owners have the option to specify additional privacy-preserving controls atop datasets connected to a given Pod. This is typically done based on the DP Modeller role you would assign to a given user with access to your Pod. However, if you will always want differential privacy to be applied to your dataset, you can override these user-level controls and assign them at the Pod-level. This allows you to enforce the guarantees various privacy-preserving techniques provide if desired. Today, Bitfount supports configurable differential privacy controls. To determine whether you need to set the differential_privacy argument, ask yourself:

Is my dataset sensitive to the degree it requires additional privacy protections, and/or do I have concerns that my partners will perform malicious attacks against the data?

Most datasets do not require additional privacy-preserving controls by default, in which case specifying differential_privacy is unnecessary. You may wish to apply these controls if you are dealing with highly regulated or sensitive data, such as patient healthcare records or financial transaction records.

We cover the basics of differential privacy in the Privacy-Preserving Techniques tutorial. In short, a privacy budget is typically determined based on your risk tolerance for the given dataset -- a rule of thumb is that the lower your risk tolerance, the lower one should set the budget. However, you must also balance this against the "usefulness" of the data to the Data Scientist: if you set the budget too low, the data will no longer provide the Data Scientist with valuable insights. Given this trade-off, we have set what we believe to be a reasonable default budget of 3 for each task -- we believe this provides sufficient privacy protection for somewhat sensitive data where the Data Custodian grants access only to relatively trusted parties.
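The budget-utility trade-off can be made concrete with the standard Laplace mechanism, where the noise scale for a query with sensitivity Δ and budget ε is Δ/ε, so a smaller budget means proportionally more noise (a textbook sketch; Bitfount's internal privacy accounting may differ):

```python
def laplace_scale(sensitivity, epsilon):
    """Noise scale b = sensitivity / epsilon for the Laplace mechanism."""
    return sensitivity / epsilon

# A COUNT query has sensitivity 1; compare noise at different budgets.
print(laplace_scale(1.0, 3.0))  # ~0.333 at the default budget of 3
print(laplace_scale(1.0, 0.5))  # 2.0 at a stricter budget: far noisier answers
```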

Next Steps

Now that you've thought through what you'll need to create and run a Pod, it's time to connect some data! Head to Connecting Data & Running Pods for more details.