datastructure

Classes concerning data structures.

DataStructures provide information about the columns of a BaseSource for a specific Modelling Job.

Classes

BaseDataStructure

class BaseDataStructure():

Base DataStructure class.

Subclasses

DataStructure

class DataStructure(
    table: Optional[Union[str, Mapping[str, str]]] = None,
    query: Optional[Union[str, Mapping[str, str]]] = None,
    schema_types_override: Optional[Union[SchemaOverrideMapping, Mapping[str, SchemaOverrideMapping]]] = None,
    target: Optional[Union[str, List[str]]] = None,
    ignore_cols: List[str] = [],
    selected_cols: List[str] = [],
    data_splitter: Optional[DatasetSplitter] = None,
    loss_weights_col: Optional[str] = None,
    multihead_col: Optional[str] = None,
    multihead_size: Optional[int] = None,
    ignore_classes_col: Optional[str] = None,
    image_cols: Optional[List[str]] = None,
    batch_transforms: Optional[List[Dict[str, _JSONDict]]] = None,
    dataset_transforms: Optional[List[Dict[str, _JSONDict]]] = None,
    auto_convert_grayscale_images: bool = True,
):

Information about the columns of a BaseSource.

This component provides the desired structure of data to be used by discriminative machine learning models.

Note

If the datastructure includes image columns, batch transformations will be applied to them.

Arguments

  • table: The table in the Pod schema to be used for local data for single pod tasks. If executing a remote task involving multiple pods, this should be a mapping of Pod names to table names. Defaults to None.
  • query: The SQL query to be applied to the data. It should be a string for local data or single-pod tasks, or a mapping of Pod names to queries if multiple pods are involved in the task. Defaults to None.
  • schema_types_override: A mapping that defines the new data types that will be returned after the SQL query is executed. For a local training task it is a mapping of semantic types to column names; for a remote task it is a mapping of each Pod name to its new columns and types. If a column is defined as "categorical", the mapping should include a mapping to the categories. Required if a SQL query is provided. E.g. {'Pod_id': {'categorical': [{'col1': {'value_1': 0, 'value_2': 1}}], 'continuous': ['col2']}} for remote training, or {'categorical': [{'col1': {'value_1': 0, 'value_2': 1}}], 'continuous': ['col2']} for local training. Defaults to None.
  • target: The training target column or list of columns.
  • ignore_cols: A list of columns to ignore when getting the data. Defaults to an empty list.
  • selected_cols: A list of columns to select when getting the data. Defaults to an empty list.
  • data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
  • loss_weights_col: A column name which provides a weight to be given to each sample in loss function. Defaults to None.
  • multihead_col: A categorical column whereby the number of unique values will determine number of heads in a Neural Network. Used for multitask training. Defaults to None.
  • multihead_size: The number of unique values in the multihead_col. Used for multitask training. Required if multihead_col is provided. Defaults to None.
  • ignore_classes_col: A column name denoting which classes to ignore in a multilabel multiclass classification problem. Each value is expected to contain a list of numbers corresponding to the indices of the classes to be ignored as per the order provided in target. E.g. [0,2,3]. An empty list can be provided (e.g. []) to avoid ignoring any classes for some samples. Defaults to None.
  • image_cols: A list of columns that will be treated as images in the data.
  • batch_transforms: A list of dictionaries of transformations to apply to batches. Defaults to None.
  • dataset_transforms: A list of dictionaries of transformations to apply to the whole dataset. Defaults to None.
  • auto_convert_grayscale_images: Whether or not to automatically convert grayscale images to RGB. Defaults to True.

Raises

  • DataStructureError: If query is provided as well as either selected_cols or ignore_cols.
  • DataStructureError: If both ignore_cols and selected_cols are provided.
  • DataStructureError: If the multihead_col is provided without multihead_size.
  • ValueError: If a batch transformation name is not recognised.
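
An illustrative construction, assuming DataStructure is importable from the top-level bitfount package (the table, query, and column names below are hypothetical):

    from bitfount import DataStructure

    # Table-based structure for a single-pod task.
    ds = DataStructure(
        table="census-data",
        target="income",
        selected_cols=["age", "workclass", "occupation", "income"],
    )

    # Query-based structure: a schema_types_override describing the
    # resulting columns is required whenever a query is supplied, and
    # selected_cols/ignore_cols must not be passed alongside a query.
    ds_query = DataStructure(
        query="SELECT age, income FROM census",
        schema_types_override={
            "continuous": ["age"],
            "categorical": [{"income": {"<=50K": 0, ">50K": 1}}],
        },
        target="income",
    )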

Ancestors

  • BaseDataStructure

Variables

  • static auto_convert_grayscale_images : bool
  • static batch_transforms : Optional[List[Dict[str, Dict[str, Any]]]]
  • static dataset_transforms : Optional[List[Dict[str, Dict[str, Any]]]]
  • static ignore_classes_col : Optional[str]
  • static ignore_cols : List[str]
  • static image_cols : Optional[List[str]]
  • static loss_weights_col : Optional[str]
  • static multihead_col : Optional[str]
  • static multihead_size : Optional[int]
  • static nested_fields : ClassVar[Dict[str, Mapping[str, Any]]]
  • static query : Union[str, Mapping[str, str], None]
  • static schema_types_override : Union[Mapping[Literal['categorical', 'continuous', 'image', 'text'], List[Union[str, Mapping[str, Mapping[str, int]]]]], Mapping[str, Mapping[Literal['categorical', 'continuous', 'image', 'text'], List[Union[str, Mapping[str, Mapping[str, int]]]]]], None]
  • static selected_cols : List[str]
  • static table : Union[str, Mapping[str, str], None]
  • static target : Union[List[str], str, None]

Static methods


create_datastructure

def create_datastructure(
    table_config: DataStructureTableConfig,
    select: DataStructureSelectConfig,
    transform: DataStructureTransformConfig,
    assign: DataStructureAssignConfig,
    data_split: Optional[DataSplitConfig] = None,
    *,
    schema: BitfountSchema,
) -> DataStructure:

Creates a datastructure based on the YAML config and pod schema.

Arguments

  • table_config: The table in the Pod schema to be used for local data. If executing a remote task, this should be a mapping of Pod names to table names.
  • select: The configuration for columns to be included/excluded from the DataStructure.
  • transform: The configuration for dataset and batch transformations to be applied to the data.
  • assign: The configuration for special columns in the DataStructure.
  • data_split: The configuration for splitting the data into training, test, validation.
  • schema: The Bitfount schema of the target pod.

Returns

A DataStructure object.
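
A hedged sketch of driving this factory from parsed YAML config. The field names on the config objects below are assumptions for illustration only, and import paths for the config classes are omitted; consult Bitfount's YAML config schemas for the real definitions:

    # Hypothetical config field names, shown for illustration.
    table_config = DataStructureTableConfig(table="census-data")
    select = DataStructureSelectConfig(include=["age", "income"])
    transform = DataStructureTransformConfig()
    assign = DataStructureAssignConfig(target="income")

    ds = DataStructure.create_datastructure(
        table_config=table_config,
        select=select,
        transform=transform,
        assign=assign,
        schema=pod_schema,  # a BitfountSchema fetched from the target pod
    )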

Methods


apply_dataset_transformations

def apply_dataset_transformations(self, datasource: BaseSource) -> BaseSource:

Applies transformations to whole dataset.

Arguments

  • datasource: The BaseSource object to be transformed.

Returns

datasource: The transformed datasource.
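
For example, given a DataStructure constructed with dataset_transforms and any BaseSource instance datasource:

    # Returns the datasource with the dataset-level transformations
    # (supplied via dataset_transforms at construction time) applied.
    datasource = ds.apply_dataset_transformations(datasource)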

get_batch_transformations

def get_batch_transformations(self) -> Optional[List[BatchTimeOperation]]:

Returns batch transformations to be performed as callables.

Returns

A list of batch transformations to be passed to TransformationProcessor.
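
A minimal usage sketch, assuming ds is a DataStructure constructed with batch_transforms:

    batch_ops = ds.get_batch_transformations()
    if batch_ops is not None:
        # Each element is a BatchTimeOperation, ready to be handed to
        # a TransformationProcessor at batch time.
        print([type(op).__name__ for op in batch_ops])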

get_columns_ignored_for_training

def get_columns_ignored_for_training(self, table_schema: TableSchema) -> List[str]:

Collects all the extra columns that will not be used in model training.

Arguments

  • table_schema: The schema of the table.

Returns

ignore_cols_aux: A list of columns that will be ignored when training a model.
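
For instance, given the TableSchema of the relevant table:

    # Returns every column that will be excluded when training a
    # model, including those listed in ignore_cols.
    ignored = ds.get_columns_ignored_for_training(table_schema)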

get_pod_identifiers

def get_pod_identifiers(self) -> Optional[List[str]]:

Returns a list of pod identifiers specified in the table attribute.

These may actually be logical pods, or datasources.

If there are no pod identifiers specified, returns None.
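
A sketch of the behaviour described above (pod identifiers hypothetical); when table is a plain string there are no identifiers, so None is returned:

    ds = DataStructure(
        table={"my-org/pod-a": "table_a", "my-org/pod-b": "table_b"},
        target="income",
    )
    ds.get_pod_identifiers()  # ["my-org/pod-a", "my-org/pod-b"]

    DataStructure(table="census-data", target="income").get_pod_identifiers()  # None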

get_table_name

def get_table_name(self, data_identifier: Optional[str] = None) -> str:

Returns the relevant table name of the DataStructure.

Arguments

  • data_identifier: The identifier of the pod/logical pod/datasource to retrieve the table of.

Returns

The table name of the DataStructure corresponding to the data_identifier provided, or just the local table name if running locally.

Raises

  • ValueError: If the data_identifier is not provided and there are different table names for different pods.
  • KeyError: If the data_identifier is not in the collection of tables specified for different pods.
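
Continuing the multi-pod sketch shown under get_pod_identifiers (identifiers hypothetical):

    ds.get_table_name("my-org/pod-a")  # "table_a"
    ds.get_table_name()                # ValueError: table names differ per pod
    ds.get_table_name("my-org/pod-c")  # KeyError: identifier not in the mapping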

get_table_schema

def get_table_schema(
    self,
    schema: BitfountSchema,
    data_identifier: Optional[str] = None,
    datasource: Optional[BaseSource] = None,
) -> TableSchema:

Returns the table schema based on the datastructure arguments.

This will return either the new schema defined by the schema_types_override if the datastructure has been initialised with a query, or the relevant table schema if the datastructure has been initialised with a table name.

Arguments

  • schema: The BitfountSchema either taken from the pod or provided by the user when defining a model.
  • data_identifier: The pod/logical pod/datasource identifier on which the model will be trained. Defaults to None.
  • datasource: The datasource on which the model will be trained. Defaults to None.

Raises

  • BitfountSchemaError: If the table is not found.
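
A minimal sketch, assuming pod_schema is a BitfountSchema retrieved from the pod:

    # For a table-based DataStructure this returns the matching
    # TableSchema; for a query-based one the schema is built from
    # schema_types_override instead.
    table_schema = ds.get_table_schema(
        schema=pod_schema,
        data_identifier="my-org/pod-a",
    )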

set_columns_after_transformations

def set_columns_after_transformations(self, transforms: List[Dict[str, _JSONDict]]) -> None:

Updates the selected/ignored columns based on the transformations applied.

It updates self.selected_cols by adding on the new names of columns after transformations are applied, and removing the original columns unless explicitly specified to keep.

Arguments

  • transforms: A list of transformations to be applied to the data.
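
An illustrative call; the transformation spec below is hypothetical and stands in for whatever dataset transformations the task actually uses:

    # Hypothetical transformation spec for illustration only. After the
    # call, self.selected_cols holds the post-transformation column
    # names, with replaced originals removed unless flagged to be kept.
    ds.set_columns_after_transformations(
        [{"normalize": {"col": "age", "keep_original": False}}]
    )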

set_training_column_split_by_semantic_type

def set_training_column_split_by_semantic_type(self, schema: TableSchema) -> None:

Sets the column split by type from the schema.

This method splits the selected columns from the dataset based on their semantic type.

Arguments

  • schema: The TableSchema for the data.
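
For example, given the TableSchema for the data:

    # Partitions the selected columns into per-semantic-type groups
    # (e.g. continuous vs. categorical) according to the schema.
    ds.set_training_column_split_by_semantic_type(table_schema)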

set_training_input_size

def set_training_input_size(self, schema: TableSchema) -> None:

Sets the input size for model training.

Arguments

  • schema: The schema of the table.
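
A final usage sketch, typically called once the table schema is known:

    # Derives and stores the model input size from the table schema.
    ds.set_training_input_size(table_schema)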