Bitfount-Supported Data Structures

Data Structures are parameters to Model-based algorithms. For this class of algorithm, the model needs a definition of how the data in Pods should be mapped into tensors that can be used by the algorithm. The Data Structure defines this mapping.

For those familiar with machine learning methods, you can think of the Data Structure as a declarative counterpart to the PyTorch and TensorFlow DataLoader and Dataset concepts.

Bitfount currently supports only a single DataStructure, which maps data into two dimensions (especially useful for classification), along with additional dimensions to allow for multitask learning and weightings.

The simplest use of a DataStructure is to specify the columns from the Pod that should be mapped into the first dimension via the selected_cols parameter and to specify which columns from the Pod should be mapped to the second dimension via the target parameter:

```python
ds = DataStructure(selected_cols=["A", "B", "C"], target=["C"])
```

The details of this mapping depend on the types of the columns A, B and C. If they are all simple int or float types, the first dimension of the data will be formatted as a simple vector [a, b], and the second as [c]. If a column is a categorical variable, a one-hot encoding is generated, with the schema defining the order of the categorical values. If a column is an image path, the image is resized to a fixed size and its pixel values are included in the data.
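To make this mapping concrete, here is a minimal pure-Python sketch of the behaviour described above. The helper function and variable names are illustrative only, not Bitfount internals; in practice the mapping is performed by the library using the Pod's schema.

```python
# Hypothetical sketch of the column-to-tensor mapping described above.
# Names and helpers here are illustrative, not part of the Bitfount API.

def one_hot(value, categories):
    """Encode a categorical value as a one-hot vector, using a
    schema-defined category order."""
    return [1.0 if value == category else 0.0 for category in categories]

# Assumed schema-defined order for the categorical target column "C".
c_categories = ["cat", "dog"]

row = {"A": 1.5, "B": 2.0, "C": "dog"}

# First dimension: the selected numeric feature columns as a simple vector.
features = [row["A"], row["B"]]

# Second dimension: the categorical target column, one-hot encoded.
target = one_hot(row["C"], c_categories)

print(features)  # [1.5, 2.0]
print(target)    # [0.0, 1.0]
```

Note how the position of the 1.0 in the target vector is determined entirely by the category order, which is why a consistent schema matters.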

note

The schema used for one-hot encodings is currently specified in the model, rather than in the data structure.

In many cases, you may want to run transformations on the data in a remote Pod before inputting it into the model. For this purpose, Bitfount provides a set of transformations. These are specified in the batch_transforms parameter as follows (note that separate transformations should be specified for each of the train, validation, and test steps):

```python
datastructure = DataStructure(
    target="TARGET",
    selected_cols=["image", "TARGET"],
    image_cols=["image"],
    batch_transforms=[
        {
            "albumentations": {
                "step": "train",
                "output": True,
                "arg": "image",
                "transformations": [
                    {"Resize": {"height": 224, "width": 224}},
                    "Normalize",
                    "HorizontalFlip",
                    "ToTensorV2",
                ],
            }
        },
        {
            "albumentations": {
                "step": "validation",
                "output": True,
                "arg": "image",
                "transformations": [
                    {"Resize": {"height": 224, "width": 224}},
                    "Normalize",
                    "ToTensorV2",
                ],
            }
        },
        {
            "albumentations": {
                "step": "test",
                "output": True,
                "arg": "image",
                "transformations": [
                    {"Resize": {"height": 224, "width": 224}},
                    "Normalize",
                    "ToTensorV2",
                ],
            }
        },
    ],
)
```

Currently, only image transformations via the Albumentations library are supported, and the transformation names are exactly as in the Albumentations API.
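To illustrate how the per-step structure of the spec works, the following pure-Python sketch (not Bitfount code; the helper name is hypothetical) selects the transformation list for a given step from a batch_transforms list like the one above:

```python
# Illustrative helper, not part of the Bitfount API: pick out the
# transformation list for a given step from a batch_transforms spec.

batch_transforms = [
    {"albumentations": {"step": "train", "output": True, "arg": "image",
                        "transformations": [
                            {"Resize": {"height": 224, "width": 224}},
                            "Normalize", "HorizontalFlip", "ToTensorV2"]}},
    {"albumentations": {"step": "validation", "output": True, "arg": "image",
                        "transformations": [
                            {"Resize": {"height": 224, "width": 224}},
                            "Normalize", "ToTensorV2"]}},
]

def transforms_for_step(spec, step):
    """Return the list of transformation names/configs for `step`."""
    for entry in spec:
        config = entry["albumentations"]
        if config["step"] == step:
            return config["transformations"]
    return []

# The train step includes the random HorizontalFlip augmentation;
# the validation step deliberately omits it.
print(transforms_for_step(batch_transforms, "train"))
```

Keeping random augmentations such as HorizontalFlip out of the validation and test steps is what makes evaluation deterministic, which is why the spec separates transformations per step.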

More technical detail and variations in specification for the DataStructure class can be found in the API Reference.

When using a custom model, you are also able to write arbitrary code for data loading as described in Custom Models.