datastructure

Classes concerning data structures.

DataStructures provide information about the columns of a BaseSource for a specific Modelling Job.

Classes

BaseDataStructure

class BaseDataStructure():

Base DataStructure class.

Subclasses

DataStructure

class DataStructure(
    table: Optional[Union[str, Mapping[str, str]]] = None,
    query: Optional[Union[str, Mapping[str, str]]] = None,
    schema_types_override: Optional[Union[SchemaOverrideMapping, Mapping[str, SchemaOverrideMapping]]] = None,
    target: Optional[Union[str, List[str]]] = None,
    ignore_cols: List[str] = [],
    selected_cols: List[str] = [],
    data_splitter: Optional[DatasetSplitter] = None,
    loss_weights_col: Optional[str] = None,
    multihead_col: Optional[str] = None,
    multihead_size: Optional[int] = None,
    ignore_classes_col: Optional[str] = None,
    image_cols: Optional[List[str]] = None,
    batch_transforms: Optional[List[Dict[str, _JSONDict]]] = None,
    dataset_transforms: Optional[List[Dict[str, _JSONDict]]] = None,
    auto_convert_grayscale_images: bool = True,
):

Information about the columns of a BaseSource.

This component provides the desired structure of data to be used by discriminative machine learning models.

Note

If the datastructure includes image columns, batch transformations will be applied to them.

Arguments

  • table: The table in the Pod schema to be used for local data for single pod tasks. If executing a remote task involving multiple pods, this should be a mapping of Pod names to table names. Defaults to None.
  • query: The SQL query to be applied to the data. It should be a string for local data or single-pod tasks, or a mapping of Pod names to queries if multiple pods are involved in the task. Defaults to None.
  • schema_types_override: A mapping that defines the new data types that will be returned after the SQL query is executed. For a local training task it is a mapping of semantic types to column names; for a remote task it is a mapping of each Pod name to its new columns and types. If a column is defined as "categorical", the mapping should include a mapping to the categories. Required if a SQL query is provided. E.g. {'Pod_id': {'categorical': [{'col1': {'value_1': 0, 'value_2': 1}}], 'continuous': ['col2']}} for remote training, or {'categorical': [{'col1': {'value_1': 0, 'value_2': 1}}], 'continuous': ['col2']} for local training. Defaults to None.
  • target: The training target column or list of columns.
  • ignore_cols: A list of columns to ignore when getting the data. Defaults to an empty list.
  • selected_cols: A list of columns to select when getting the data. Defaults to an empty list.
  • data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
  • loss_weights_col: A column name which provides a weight to be given to each sample in loss function. Defaults to None.
  • multihead_col: A categorical column whereby the number of unique values will determine number of heads in a Neural Network. Used for multitask training. Defaults to None.
  • multihead_size: The number of unique values in the multihead_col. Used for multitask training. Required if multihead_col is provided. Defaults to None.
  • ignore_classes_col: A column name denoting which classes to ignore in a multilabel multiclass classification problem. Each value is expected to contain a list of numbers corresponding to the indices of the classes to be ignored as per the order provided in target. E.g. [0,2,3]. An empty list can be provided (e.g. []) to avoid ignoring any classes for some samples. Defaults to None.
  • image_cols: A list of columns that will be treated as images in the data.
  • batch_transforms: A list of dictionaries of transformations to apply to batches. Defaults to None.
  • dataset_transforms: A list of dictionaries of transformations to apply to the whole dataset. Defaults to None.
  • auto_convert_grayscale_images: Whether or not to automatically convert grayscale images to RGB. Defaults to True.

Raises

  • DataStructureError: If query is provided as well as either selected_cols or ignore_cols.
  • DataStructureError: If both ignore_cols and selected_cols are provided.
  • DataStructureError: If the multihead_col is provided without multihead_size.
  • ValueError: If a batch transformation name is not recognised.
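
An illustrative construction, assuming DataStructure is importable from the top-level bitfount package (the table, query, and column names below are hypothetical):

    from bitfount import DataStructure

    # Table-based structure for a single-pod task.
    ds = DataStructure(
        table="census-data",
        target="income",
        selected_cols=["age", "workclass", "occupation", "income"],
    )

    # Query-based structure: a schema_types_override describing the
    # resulting columns is required whenever a query is supplied, and
    # selected_cols/ignore_cols must not be passed alongside a query.
    ds_query = DataStructure(
        query="SELECT age, income FROM census",
        schema_types_override={
            "continuous": ["age"],
            "categorical": [{"income": {"<=50K": 0, ">50K": 1}}],
        },
        target="income",
    )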

Ancestors

  • BaseDataStructure

Variables

  • static auto_convert_grayscale_images : bool
  • static batch_transforms : Optional[List[Dict[str, Dict[str, Any]]]]
  • static dataset_transforms : Optional[List[Dict[str, Dict[str, Any]]]]
  • static ignore_classes_col : Optional[str]
  • static ignore_cols : List[str]
  • static image_cols : Optional[List[str]]
  • static loss_weights_col : Optional[str]
  • static multihead_col : Optional[str]
  • static multihead_size : Optional[int]
  • static nested_fields : ClassVar[Dict[str, Mapping[str, Any]]]
  • static query : Union[str, Mapping[str, str], None]
  • static schema_types_override : Union[Mapping[Literal['categorical', 'continuous', 'image', 'text'], List[Union[str, Mapping[str, Mapping[str, int]]]]], Mapping[str, Mapping[Literal['categorical', 'continuous', 'image', 'text'], List[Union[str, Mapping[str, Mapping[str, int]]]]]], None]
  • static selected_cols : List[str]
  • static table : Union[str, Mapping[str, str], None]
  • static target : Union[List[str], str, None]

Static methods


create_datastructure

def create_datastructure(
    table_config: DataStructureTableConfig,
    select: DataStructureSelectConfig,
    transform: DataStructureTransformConfig,
    assign: DataStructureAssignConfig,
    data_split: Optional[DataSplitConfig] = None,
    *,
    schema: BitfountSchema,
) -> DataStructure:

Creates a datastructure based on the YAML config and pod schema.

Arguments

  • table_config: The table in the Pod schema to be used for local data. If executing a remote task, this should be a mapping of Pod names to table names.
  • select: The configuration for columns to be included/excluded from the DataStructure.
  • transform: The configuration for dataset and batch transformations to be applied to the data.
  • assign: The configuration for special columns in the DataStructure.
  • data_split: The configuration for splitting the data into training, test, validation.
  • schema: The Bitfount schema of the target pod.

Returns

A DataStructure object.
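
A hedged sketch of driving this factory from parsed YAML config. The field names on the config objects below are assumptions for illustration only, and import paths for the config classes are omitted; consult Bitfount's YAML config schemas for the real definitions:

    # Hypothetical config field names, shown for illustration.
    table_config = DataStructureTableConfig(table="census-data")
    select = DataStructureSelectConfig(include=["age", "income"])
    transform = DataStructureTransformConfig()
    assign = DataStructureAssignConfig(target="income")

    ds = DataStructure.create_datastructure(
        table_config=table_config,
        select=select,
        transform=transform,
        assign=assign,
        schema=pod_schema,  # a BitfountSchema fetched from the target pod
    )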

Methods


apply_dataset_transformations

def apply_dataset_transformations(self, datasource: BaseSource) -> BaseSource:

Applies transformations to whole dataset.

Arguments

  • datasource: The BaseSource object to be transformed.

Returns

datasource: The transformed datasource.
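
For example, given a DataStructure constructed with dataset_transforms and any BaseSource instance datasource:

    # Returns the datasource with the dataset-level transformations
    # (supplied via dataset_transforms at construction time) applied.
    datasource = ds.apply_dataset_transformations(datasource)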

get_batch_transformations

def get_batch_transformations(self) -> Optional[List[BatchTimeOperation]]:

Returns batch transformations to be performed as callables.

Returns

A list of batch transformations to be passed to TransformationProcessor.
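
A minimal usage sketch, assuming ds is a DataStructure constructed with batch_transforms:

    batch_ops = ds.get_batch_transformations()
    if batch_ops is not None:
        # Each element is a BatchTimeOperation, ready to be handed to
        # a TransformationProcessor at batch time.
        print([type(op).__name__ for op in batch_ops])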

get_columns_ignored_for_training

def get_columns_ignored_for_training(self, table_schema: TableSchema) -> List[str]:

Collects all the extra columns that will not be used in model training.

Arguments

  • table_schema: The schema of the table.

Returns

ignore_cols_aux: A list of columns that will be ignored when training a model.
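
For instance, given the TableSchema of the relevant table:

    # Returns every column that will be excluded when training a
    # model, including those listed in ignore_cols.
    ignored = ds.get_columns_ignored_for_training(table_schema)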

get_pod_identifiers

def get_pod_identifiers(self) -> Optional[List[str]]:

Returns a list of pod identifiers specified in the table attribute.

These may actually be logical pods, or datasources.

If there are no pod identifiers specified, returns None.
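
A sketch of the behaviour described above (pod identifiers hypothetical); when table is a plain string there are no identifiers, so None is returned:

    ds = DataStructure(
        table={"my-org/pod-a": "table_a", "my-org/pod-b": "table_b"},
        target="income",
    )
    ds.get_pod_identifiers()  # ["my-org/pod-a", "my-org/pod-b"]

    DataStructure(table="census-data", target="income").get_pod_identifiers()  # None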

get_table_name

def get_table_name(self, data_identifier: Optional[str] = None) -> str:

Returns the relevant table name of the DataStructure.

Arguments

  • data_identifier: The identifier of the pod/logical pod/datasource to retrieve the table of.

Returns

The table name of the DataStructure corresponding to the data_identifier provided, or just the local table name if running locally.

Raises

  • ValueError: If the data_identifier is not provided and there are different table names for different pods.
  • KeyError: If the data_identifier is not in the collection of tables specified for different pods.
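
Continuing the multi-pod sketch shown under get_pod_identifiers (identifiers hypothetical):

    ds.get_table_name("my-org/pod-a")  # "table_a"
    ds.get_table_name()                # ValueError: table names differ per pod
    ds.get_table_name("my-org/pod-c")  # KeyError: identifier not in the mapping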

get_table_schema

def get_table_schema(
    self,
    schema: BitfountSchema,
    data_identifier: Optional[str] = None,
    datasource: Optional[BaseSource] = None,
) -> TableSchema:

Returns the table schema based on the datastructure arguments.

This will return either the new schema defined by the schema_types_override if the datastructure has been initialised with a query, or the relevant table schema if the datastructure has been initialised with a table name.

Arguments

  • schema: The BitfountSchema either taken from the pod or provided by the user when defining a model.
  • data_identifier: The pod/logical pod/datasource identifier on which the model will be trained. Defaults to None.
  • datasource: The datasource on which the model will be trained. Defaults to None.

Raises

  • BitfountSchemaError: If the table is not found.
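
A minimal sketch, assuming pod_schema is a BitfountSchema retrieved from the pod:

    # For a table-based DataStructure this returns the matching
    # TableSchema; for a query-based one the schema is built from
    # schema_types_override instead.
    table_schema = ds.get_table_schema(
        schema=pod_schema,
        data_identifier="my-org/pod-a",
    )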

set_columns_after_transformations

def set_columns_after_transformations(self, transforms: List[Dict[str, _JSONDict]]) -> None:

Updates the selected/ignored columns based on the transformations applied.

It updates self.selected_cols by adding on the new names of columns after transformations are applied, and removing the original columns unless explicitly specified to keep.

Arguments

  • transforms: A list of transformations to be applied to the data.
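
An illustrative call; the transformation spec below is hypothetical and stands in for whatever dataset transformations the task actually uses:

    # Hypothetical transformation spec for illustration only. After the
    # call, self.selected_cols holds the post-transformation column
    # names, with replaced originals removed unless flagged to be kept.
    ds.set_columns_after_transformations(
        [{"normalize": {"col": "age", "keep_original": False}}]
    )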

set_training_column_split_by_semantic_type

def set_training_column_split_by_semantic_type(self, schema: TableSchema) -> None:

Sets the column split by type from the schema.

This method splits the selected columns from the dataset based on their semantic type.

Arguments

  • schema: The TableSchema for the data.
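
For example, given the TableSchema for the data:

    # Partitions the selected columns into per-semantic-type groups
    # (e.g. continuous vs. categorical) according to the schema.
    ds.set_training_column_split_by_semantic_type(table_schema)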

set_training_input_size

def set_training_input_size(self, schema: TableSchema) -> None:

Sets the input size for model training.

Arguments

  • schema: The schema of the table.
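
A final usage sketch, typically called once the table schema is known:

    # Derives and stores the model input size from the table schema.
    ds.set_training_input_size(table_schema)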