Data Source Configuration Best Practices

In Connecting Data & Running Pods, we outlined the mechanisms for configuring Pods. In this guide, we provide examples of each mechanism by data source, along with best practices for each data source type. Note that the parameter combinations used in the Pod configurations here are illustrative and do not cover all possible configurations. For more details on Pod configuration, please refer to the API Reference.

CSV

A local .csv file is the most common data source for Bitfount data connections.

Best Practices

For .csv files:

  • Pods based on .csv files run only as long as they are not interrupted, meaning Pods configured to point to local files will go offline if the machine on which they are hosted is turned off or experiences a transient disconnection. Bitfount will attempt to bring Pods back online after an interruption, but in most cases they will need to be re-started manually.
  • Best practice is to point to a .csv file hosted on a server not tied to a user's local machine.
  • Exclude any personally identifiable information fields from the Pod configuration specification.
  • Bitfount will automatically generate the schema for a .csv file based on its header row. Please ensure you include a header row in your .csv file with the column names you wish to be reflected in the Pod's schema, as in the sketch below.
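
For illustration, here is a minimal sketch of producing a .csv with a suitable header row using pandas; the file name and column names are hypothetical:

import pandas as pd

# The header row ("age", "workclass", "TARGET") becomes the set of
# column names reflected in the Pod's schema.
df = pd.DataFrame(
    {"age": [39, 50], "workclass": ["State-gov", "Private"], "TARGET": [0, 1]}
)
df.to_csv("census_income.csv", index=False)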

YAML Configuration Example

The configuration YAML file needs to follow the format specified in the PodConfig class:

name: <enter-pod-name-for-system-id>
datasource: CSVSource
pod_details_config:
  display_name: <Enter Name You'll See in the Hub>
  description: >
    This is a description of the data connected to this Pod with any relevant details for potential collaborators.
data_config:
  ignore_cols: ["Name", "DOB", "National Tax Number"]
  force_stypes:
    enter-your-pod-name:
      categorical: ["TARGET", "workclass", "marital-status", "occupation", "relationship", "race", "native-country", "gender", "education"]
  datasource_args:
    path: <PATH TO CSV>/<filename>.csv
    seed: 100
    data_split: 30,10

Bitfount Python API Configuration Example

Using the Python API is quite similar to specification via YAML. With the Python API, we configure Pods using the PodDetailsConfig and PodDataConfig classes. The former is required to specify the display name and description of the Pod, whilst the latter is used to customise the schema and underlying data in the data source. For more information, refer to the config_schemas reference guide. Note that Pod names cannot include underscores. Here is a .csv Python API configuration example:

pod = Pod(
    name="enter-pod-name-for-system-id",
    datasource=CSVSource(
        "/PATH/OR_URL/TO/YOUR/CSV_FILE.csv",
        seed=100,
        data_splitter=PercentageSplitter(validation_percentage=30, test_percentage=10),
    ),
    pod_details_config=PodDetailsConfig(
        display_name="Hub Pod Display Name",
        description="This is a description of the data connected to this Pod with any relevant details for potential collaborators.",
    ),
    # Specify the structure of the dataset
    data_config=PodDataConfig(
        # Specify stypes for fields (optional)
        force_stypes={
            "enter-pod-name-for-system-id": {
                "categorical": ["target"],
            },
        },
    ),
)

SQL Database

Connecting a database is our recommended approach to Pod creation if you will engage in a lot of asynchronous collaboration with Data Scientists. More details on the DatabaseSource class can be found here.

Best Practices

  • Ensure any databases connected are tabular.
  • Connect 'always-on' databases which operate independently of local machines to ensure collaborators can always interact with your Pod, even if you yourself are offline.
  • If including multiple tables within a single database connection, ensure you are happy for all tables to be accessed by any data scientist to whom you give access to the Pod.

YAML Configuration Example

When setting up a Pod with a SQL database data source, the Pod owner has the discretion of giving access to a single table in a database, to multiple tables within the database, or to a specific view of the data within the database. The basic YAML file format remains the same as that of the CSVSource.

Single or Multiple Tables: To set up a Pod with access to a single table, the database connection string and the single table name must be defined.

name: single-table-database
pod_details_config:
  display_name: Single table database pod
  description: This pod provides access to data from a single table in a database
datasource: DatabaseSource
data_config:
  datasource_args:
    db_conn: postgresql://{USERNAME}:{DBPASSWORD}@{DBHOST}/{DBNAME}
    db_conn_kwargs:
      table_names: [table_name1]
caution

If table_names is not specified, all tables the Database user has permission to access will be accessible. To specify multiple tables, add the additional table names to the table_names parameter. For example, table_names: [table_name1, table_name2].
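
Equivalently, here is a Python API sketch of a multi-table connection. This assumes DatabaseConnection accepts a table_names argument; check the DatabaseConnection reference for the exact signature.

db_conn = DatabaseConnection(
    "postgresql://USERNAME:DBPASSWORD@DBHOST/DBNAME",
    table_names=["table_name1", "table_name2"],  # assumed parameter: only these tables are exposed
)
pod = Pod(
    name="multi-table-database",
    datasource=DatabaseSource(db_conn),
    pod_details_config=PodDetailsConfig(
        display_name="Multi table database pod",
        description="This pod provides access to two tables in a database.",
    ),
)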

View: To populate a Pod with a restricted view of the data in the database, replace the table_names field with a query field containing the definition of the view, keeping other descriptive fields the same, e.g.:

...
data_config:
  datasource_args:
    db_conn: postgresql://{USERNAME}:{DBPASSWORD}@{DBHOST}/{DBNAME}
    db_conn_kwargs:
      query: >
        SELECT * FROM table_name1 t1
        JOIN table_name2 t2 ON t1.id = t2.id
...
tip

Bitfount can parameterise YAML values via environment variables. If using environment variables, set values by entering export ENV_VAR_NAME=env_var_value in the terminal. Replace credentials or other values in your config with ${}-wrapped variable names. Here we have specified the db_conn as postgresql://${USERNAME}:${DBPASSWORD}@${DBHOST}/${DBNAME}, but this could equally have been specified as db_conn: ${DB_CONN}. Supported databases are those with SQLAlchemy dialects: https://docs.sqlalchemy.org/en/14/dialects/.

caution

Environment variables may need to be exported if you are setting up the Pod in a subprocess.
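
For example, here is a sketch of passing an exported variable through to a subprocess explicitly; run_pod.py stands in for a hypothetical script that starts your Pod from the YAML config:

import os
import subprocess

# Merge the exported variable into the environment handed to the subprocess,
# so ${DB_CONN} references in the YAML config can resolve.
env = {**os.environ, "DB_CONN": "postgresql://user:password@host/dbname"}
subprocess.run(["python", "run_pod.py", "config.yaml"], env=env, check=True)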

The bitfount package does not install database-specific dependencies. Please see Preparing to Connect Data for recommended dependencies for a select number of databases.

Bitfount Python API Configuration Example

Connecting a database via the Bitfount Python API requires you to specify the database's connection parameters like so:

import os

db_password = os.environ["DB_PASSWORD"]
db_conn = DatabaseConnection(conn=f"postgresql://USERNAME:{db_password}@DBHOST/DBNAME")
pod = Pod(
    name="enter-pod-name-for-system-id",
    datasource=DatabaseSource(
        db_conn,
        seed=100,
        data_splitter=PercentageSplitter(validation_percentage=30, test_percentage=10),
    ),
    pod_details_config=PodDetailsConfig(
        display_name="Hub Pod Display Name",
        description="This is a description of the data connected to this Pod with any relevant details for potential collaborators.",
    ),
    # Specify the structure of the dataset
    data_config=PodDataConfig(
        # Specify stypes for fields (optional)
        force_stypes={
            "enter-pod-name-for-system-id": {
                "categorical": ["target"],
            },
        },
    ),
)

DataFrame

Best Practices

  • Ensure you know the structure of the DataFrame prior to Pod configuration.

YAML Configuration Example

YAML configuration is not supported for DataFrame data sources.

Bitfount Python API Configuration Example

The main difference between connecting DataFrame sources and connecting other source types is the requirement to instantiate the data source with a pd.DataFrame object. An example of how to do this is as follows:

data_structure = {'col1': [1, 2], 'col2': [3, 4]}
dataframe_object = pd.DataFrame(data=data_structure)
pod = Pod(
    name="enter-pod-name-for-system-id",
    datasource=DataFrameSource(
        dataframe_object,
        data_splitter=PercentageSplitter(validation_percentage=30, test_percentage=10),
    ),
    pod_details_config=PodDetailsConfig(
        display_name="Hub Pod Display Name",
        description="This is a description of the data connected to this Pod with any relevant details for potential collaborators.",
    ),
    # Specify the structure of the dataset
    data_config=PodDataConfig(
        # Specify stypes for fields (optional)
        force_stypes={
            "enter-pod-name-for-system-id": {
                "categorical": ["target"],
            },
        },
    ),
)

Excel File

Bitfount supports Excel file sources in much the same way as CSV files. See below for examples and best practices.

Best Practices

  • Avoid connecting an Excel file with multiple sheets you will wish to query across. If you have a multi-sheet Excel file, we suggest converting it into a SQLite file database and using SQLite as the source instead (see the conversion script in the SQLite section below).
  • Ensure your Excel file has inherited a supported encoding (this should occur by default) and adheres to the pandas requirements.

YAML Configuration Example

Connecting an Excel file via YAML is comparable to connecting a CSV file. See below for an example configuration file:

name: <enter-pod-name-for-system-id>
datasource: ExcelSource
pod_details_config:
  display_name: <Enter Name You'll See in the Hub>
  description: >
    This is a description of the data connected to this Pod with any relevant details for potential collaborators.
data_config:
  ignore_cols: ["Name", "DOB", "National Tax Number"]
  force_stypes:
    enter-your-pod-name:
      categorical: ["TARGET", "workclass", "marital-status", "occupation", "relationship", "race", "native-country", "gender", "education"]
  datasource_args:
    path: <PATH TO EXCEL>/<filename>.xlsx
    seed: 100 # optional
  data_split:
    data_splitter: percentage
    args:
      test_percentage: 30
      validation_percentage: 10
      shuffle: False # if no shuffling of the data is required

Bitfount Python API Configuration Example

Again, connecting an Excel file via API is comparable to connecting a CSV file. See below for an example configuration command:

pod = Pod(
    name="enter-pod-name-for-system-id",
    datasource=ExcelSource(
        "/PATH/OR_URL/TO/YOUR/EXCEL_FILE.xlsx",
        seed=100,
        data_splitter=PercentageSplitter(validation_percentage=30, test_percentage=10),
    ),
    pod_details_config=PodDetailsConfig(
        display_name="Hub Pod Display Name",
        description="This is a description of the data connected to this Pod with any relevant details for potential collaborators.",
    ),
    # Specify the structure of the dataset
    data_config=PodDataConfig(
        # Specify stypes for fields (optional)
        force_stypes={
            "enter-pod-name-for-system-id": {
                "categorical": ["target"],
            },
        },
    ),
)

SQLite

We recommend using SQLite sources if you want to connect an Excel file with multiple sheets/tables. Using SQLite will allow you to run tasks between tables (with the exception of PrivateSQL, which does not support joins) without needing to go through the more complex process of uploading your file contents to a separate postgres database server. SQLite acts the same as a DatabaseSource.

Best Practices

  • Convert any Excel or .csv files to SQLite prior to configuring the Pod. We provide a Python script template for this below.
  • Ensure you know the accepted SQL field types prior to connecting the Pod; the sketch below uses concrete SQLite types.
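
For reference, here is a minimal sketch using concrete SQLite storage classes (TEXT, INTEGER, REAL) in place of the FIELD_TYPE placeholders used in the template below; the table and column names are hypothetical:

import sqlite3

conn = sqlite3.connect("example.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS patients (
        patient_id INTEGER PRIMARY KEY,  -- integer storage class
        name TEXT,                       -- text storage class
        bmi REAL                         -- floating-point storage class
    );
    """
)
conn.commit()
conn.close()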

Converting Excel or CSV to SQLite

Note: you can execute this directly, or split it into cells in a notebook as desired.

import pandas as pd
import sqlite3

# Read each sheet of the Excel file into a DataFrame
sheet_name_1 = pd.read_excel(
    '/PATH/TO/YOUR/EXCEL.xlsx',
    sheet_name='<sheet_name_1>',
    header=1
)
sheet_name_2 = pd.read_excel(
    '/PATH/TO/YOUR/EXCEL.xlsx',
    sheet_name='<sheet_name_2>',
    header=1
)

db_conn = sqlite3.connect("</path_to_file/final_db_name.db>")
c = db_conn.cursor()

# sheet_name_1 (replace FIELD_TYPE with an accepted SQLite type, e.g. TEXT, INTEGER, REAL)
c.execute(
    """
    CREATE TABLE IF NOT EXISTS sheet_name_1 (
        field_1 FIELD_TYPE PRIMARY KEY,
        field_2 FIELD_TYPE,
        field_3 FIELD_TYPE
    );
    """
)
# sheet_name_2
c.execute(
    """
    CREATE TABLE IF NOT EXISTS sheet_name_2 (
        field_1 FIELD_TYPE PRIMARY KEY,
        field_2 FIELD_TYPE,
        field_3 FIELD_TYPE
    );
    """
)

# Write the DataFrames into the corresponding tables
sheet_name_1.to_sql('sheet_name_1', db_conn, if_exists='append', index=False)
sheet_name_2.to_sql('sheet_name_2', db_conn, if_exists='append', index=False)

# Test the connection
result = pd.read_sql("SELECT * FROM sheet_name_1 LIMIT 5", db_conn)
print(result)

# Test again via a fresh connection
db_conn2 = sqlite3.connect("</path_to_file/final_db_name.db>")
res = pd.read_sql("SELECT * FROM sheet_name_1 LIMIT 5", db_conn2)
print(res)

YAML Configuration Example

Connecting a SQLite database file to a Pod uses the same DatabaseSource as exemplified above. The only difference is that rather than pointing to a hosted database, the SQLite configuration uses the path to the local file:

name: sqlite-table-database
pod_details_config:
  display_name: <sqlite pod name>
  description: This pod provides access to data from tables in a sqlite file
datasource: DatabaseSource
data_config:
  datasource_args:
    db_conn: sqlite:///path/to/local/file.db
    table_names: [sheet_name_1, sheet_name_2]

Bitfount Python API Configuration Example

The SQLite data source requires importing additional classes from outside the bitfount library, like so:

from sqlalchemy import create_engine

from bitfount import DatabaseSource, Pod
from bitfount.data.utils import DatabaseConnection
from bitfount.runners.config_schemas import PodDetailsConfig

pod = Pod(
    name="enter-pod-name-for-system-id",
    datasource=DatabaseSource(
        db_conn=DatabaseConnection(create_engine('sqlite:///path/to/local/file.db'))
    ),
    pod_details_config=PodDetailsConfig(
        display_name="Hub Display Name",
        description='This is a description of the data in my SQLite database file',
    ),
)

Image Sources

Images can be stored in any supported data source and configured as in the examples above. However, data sources must be configured to have an image reference column indicating the names of the image files.

Best Practices

  • If connecting image data via a database or other non-local source, be sure to use the force_stypes parameter and classify image columns as "image" (see the sketch after this list).
  • If connecting image data via files on your local machine, create a .csv reference file with a column listing the file names of all the images you wish to connect to the Pod. This column can be something as simple as:
image_file_name
0001.png
...

You can label this column however you wish, though you must be sure to reference it when generating the PodDataConfig for your DataSource.

  • If using a .csv source, we recommend you take note of the filepaths for both the .csv file and the images themselves.
  • Place all images in the same folder or cloud services bucket.
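
Putting these together, here is a minimal sketch (with hypothetical names and paths) of a Pod backed by a .csv reference file whose image_file_name column points at local image files:

pod = Pod(
    name="image-pod-name",
    datasource=CSVSource("/PATH/TO/YOUR/image_references.csv"),
    pod_details_config=PodDetailsConfig(
        display_name="Image Pod Display Name",
        description="Pod connecting the images listed in a reference .csv file.",
    ),
    data_config=PodDataConfig(
        # Classify the reference column as an image column so Bitfount treats
        # its values as image file paths rather than plain text.
        force_stypes={
            "image-pod-name": {
                "image": ["image_file_name"],
            },
        },
    ),
)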

For a detailed example on connecting an image data Pod, see the Running an Image Data Pod tutorial.

FAQ

Don't see a data source you're hoping to use? See Using Custom Data Sources or reach out to us with your request!