views

Support for different "views" over existing datasets.

These allow constraining the usable data that is exposed to a modeller, or only presenting a transformed view to the modeller rather than the raw underlying data.

Classes

DataView

class DataView(datasource: BaseSource, source_dataset_name: str, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):

Base class for datasource views.

Arguments

  • datasource: The BaseSource the view is generated from.

Subclasses

  • DropColsDataview
  • SQLDataView
  • bitfount.data.datasources.views._DataViewFromFileIterableSource
  • bitfount.data.datasources.views._EmptyDataview

Variables

  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    Raises: DataNotLoadedError: If the data has not been loaded yet.

  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash.

  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns _is_task_running for the view and the parent datasource.
  • iterable : bool - This returns False if the DataSource does not subclass IterableSource.

    However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.

  • multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

    However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.
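The hash property above is described as depending only on static schema information (column names and content types), not on row content. A minimal sketch of that idea follows. The helper name `schema_hash`, the use of SHA-256, and the sorted-JSON canonical form are assumptions made for illustration, not the library's actual implementation.

```python
import hashlib
import json
from typing import Dict


def schema_hash(dtypes: Dict[str, str]) -> str:
    """Hypothetical helper: digest of column names and dtype names only."""
    # Sort by column name so that column order does not affect the digest.
    canonical = json.dumps(sorted(dtypes.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Row values are never passed in, so adding rows cannot change the digest.
digest = schema_hash({"age": "int64", "name": "object"})
same_schema = schema_hash({"name": "object", "age": "int64"})
# Changing a column's type makes the schema incompatible and changes the digest.
changed_schema = schema_hash({"age": "float64", "name": "object"})
```

Because only the schema is hashed, two loads of a growing dataset compare equal as long as the columns and their types stay the same.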

Methods


get_column

def get_column(self, col_name: str, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

BaseSource.get_column :

Get a single column from dataset.

Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.

get_column_names

def get_column_names(self, **kwargs: Any) -> Iterable[str]:

Inherited from:

BaseSource.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, **kwargs: Any) -> Optional[pandas.core.frame.DataFrame]:

Inherited from:

BaseSource.get_data :

Implement this method to load and return dataset.

get_dtypes

def get_dtypes(self, **kwargs: Any) -> _Dtypes:

Inherited from:

BaseSource.get_dtypes :

Implement this method to get the columns and column types from dataset.

get_values

def get_values(self, col_names: List[str], **kwargs: Any) -> Dict[str, Iterable[Any]]:

Inherited from:

BaseSource.get_values :

Get distinct values from list of columns.

load_data

def load_data(self, **kwargs: Any) -> None:

Loads data from the underlying datasource.

DropColsDataview

class DropColsDataview(datasource: BaseSource, drop_cols: Union[str, Sequence[str]], source_dataset_name: str, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):

A data view that presents data with columns removed.

Arguments

  • data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
  • datasource: The BaseSource the view is generated from.
  • ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.

Attributes

  • data: A Dataframe-type object which contains the data.
  • data_splitter: Approach used for splitting the data into training, test, validation.
  • seed: Random number seed. Used for setting random seed for all libraries.

Variables

  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    Raises: DataNotLoadedError: If the data has not been loaded yet.

  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash.

  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns _is_task_running for the view and the parent datasource.
  • iterable : bool - This returns False if the DataSource does not subclass IterableSource.

    However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.

  • multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

    However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.
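The iterable and multi_table descriptions above imply that callers should read the property rather than rely on an isinstance check, since a subclass re-implements the property and may still report False. A minimal sketch of that override pattern, using hypothetical stand-in classes (`SourceSketch`, `IterableSourceSketch`) rather than the library's own:

```python
class SourceSketch:
    """Hypothetical base class: a plain source is never iterable."""

    @property
    def iterable(self) -> bool:
        return False


class IterableSourceSketch(SourceSketch):
    """Hypothetical subclass that re-implements the property."""

    def __init__(self, iterable: bool = True) -> None:
        self._iterable = iterable

    @property
    def iterable(self) -> bool:
        # The subclass decides; inheriting from it does not guarantee True.
        return self._iterable


eager = IterableSourceSketch(iterable=False)
```

Here `isinstance(eager, IterableSourceSketch)` is True while `eager.iterable` is False, which is the situation the property description warns about.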

Methods


get_column

def get_column(self: BaseSource, col_name: str, *args: Any, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

DataView.get_column :

Get a single column from dataset.

Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.

get_column_names

def get_column_names(self, **kwargs: Any) -> Iterable[str]:

Inherited from:

DataView.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, **kwargs: Any) -> pandas.core.frame.DataFrame:

Loads and returns data from underlying dataset.

Drops the columns specified in the view.

Returns: A DataFrame-type object which contains the data.

Raises

  • ValueError: if no data is returned from the original datasource.
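The get_data behaviour described above (load from the underlying source, remove the dropped columns, raise ValueError when nothing comes back) can be sketched as follows. To stay dependency-free, rows are modelled as a list of dicts instead of a pandas DataFrame, and `ListSource` and `DropColsViewSketch` are hypothetical stand-ins, not library classes.

```python
from typing import Dict, List, Optional, Sequence, Union

Row = Dict[str, object]


class ListSource:
    """Hypothetical datasource stand-in: returns rows, or None if empty."""

    def __init__(self, rows: Optional[List[Row]]) -> None:
        self._rows = rows

    def get_data(self) -> Optional[List[Row]]:
        return self._rows


class DropColsViewSketch:
    """Hypothetical view that hides the columns named in drop_cols."""

    def __init__(
        self, source: ListSource, drop_cols: Union[str, Sequence[str]]
    ) -> None:
        self._source = source
        # Accept either a single column name or a sequence of names.
        self._drop_cols = (
            {drop_cols} if isinstance(drop_cols, str) else set(drop_cols)
        )

    def get_data(self) -> List[Row]:
        rows = self._source.get_data()
        if rows is None:
            raise ValueError("No data returned from the underlying datasource.")
        # Build new rows so the underlying source is left unmodified.
        return [
            {key: value for key, value in row.items() if key not in self._drop_cols}
            for row in rows
        ]


source = ListSource([{"id": 1, "name": "a", "ssn": "x"}])
visible = DropColsViewSketch(source, "ssn").get_data()
```

The view returns copies with the column removed, so the raw data held by the source still contains it.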

get_dtypes

def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) -> _Dtypes:

Inherited from:

DataView.get_dtypes :

Implement this method to get the columns and column types from dataset.

get_values

def get_values(self, col_names: List[str], **kwargs: Any) -> Dict[str, Iterable[Any]]:

Get distinct values from columns in dataset.

Arguments

  • col_names: The list of the columns whose distinct values should be returned.

Returns: The distinct values of the requested columns, as a mapping from column name to a series of distinct values.
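The shape of the get_values result (one entry per requested column, each holding that column's distinct values) can be illustrated with a short sketch. The function `get_values_sketch` and the list-of-dicts row format are assumptions for illustration, and values are assumed to be hashable.

```python
from typing import Any, Dict, Iterable, List


def get_values_sketch(
    rows: List[Dict[str, Any]], col_names: List[str]
) -> Dict[str, Iterable[Any]]:
    """Hypothetical helper: distinct values per requested column."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return {
        col: list(dict.fromkeys(row[col] for row in rows)) for col in col_names
    }


rows = [
    {"city": "Leeds", "grade": "A"},
    {"city": "York", "grade": "A"},
    {"city": "Leeds", "grade": "B"},
]
distinct = get_values_sketch(rows, ["city", "grade"])
```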

load_data

def load_data(self, **kwargs: Any) -> None:

Inherited from:

DataView.load_data :

Loads data from the underlying datasource.

DropColsFileSystemIterableDataview

class DropColsFileSystemIterableDataview(datasource: BaseSource, drop_cols: Union[str, Sequence[str]], source_dataset_name: str, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):

A data view that presents filesystem iterable data with columns removed.

Arguments

  • data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
  • datasource: The BaseSource the view is generated from.
  • ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.

Attributes

  • data: A Dataframe-type object which contains the data.
  • data_splitter: Approach used for splitting the data into training, test, validation.
  • seed: Random number seed. Used for setting random seed for all libraries.

Raises

  • ValueError: if the underlying datasource is not of FileSystemIterableSource type.
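The constructor check described above, which rejects any datasource that is not a filesystem iterable source, can be sketched as a simple isinstance guard. All class names here (`PlainSource`, `FileSystemSource`, `FileSystemViewSketch`) are hypothetical stand-ins, and how the library performs the check internally is not stated in this reference.

```python
class PlainSource:
    """Hypothetical stand-in for a generic datasource."""


class FileSystemSource(PlainSource):
    """Hypothetical stand-in for a filesystem iterable datasource."""


class FileSystemViewSketch:
    """Hypothetical view that only accepts filesystem iterable sources."""

    def __init__(self, datasource: PlainSource) -> None:
        # Fail early, at construction time, rather than on first data access.
        if not isinstance(datasource, FileSystemSource):
            raise ValueError(
                "Expected a filesystem iterable source, got "
                f"{type(datasource).__name__}."
            )
        self.datasource = datasource


accepted = FileSystemViewSketch(FileSystemSource())
```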

Variables

  • cache_images : bool - Returns cache_images for the view.
  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    Raises: DataNotLoadedError: If the data has not been loaded yet.

  • fast_load : bool - Returns fast_load for the view.
  • file_names - Get filenames for views generated from FileSystemIterableSource.
  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash.

  • image_columns : Set[str] - Returns image_columns for the view, excluding those in drop_cols.
  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns _is_task_running for the view and the parent datasource.
  • iterable : bool - Returns iterable for the view and the parent datasource.
  • multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

    However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.

  • new_file_names_only_set : Optional[Set[str]] - Returns new_file_names_only_set for the view.
  • selected_file_names : List[str] - Returns selected_file_names for the view.
  • selected_file_names_override : List[str] - Returns selected_file_names_override for the view.

Methods


clear_file_names_cache

def clear_file_names_cache(self) -> None:

Clear the file names cache.

get_column

def get_column(self: BaseSource, col_name: str, *args: Any, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

DataView.get_column :

Get a single column from dataset.

Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.

get_column_names

def get_column_names(self, **kwargs: Any) -> Iterable[str]:

Inherited from:

DropColsDataview.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, **kwargs: Any) -> pandas.core.frame.DataFrame:

Inherited from:

DropColsDataview.get_data :

Loads and returns data from underlying dataset.

Drops the columns specified in the view.

Returns: A DataFrame-type object which contains the data.

Raises

  • ValueError: if no data is returned from the original datasource.

get_dtypes

def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) -> _Dtypes:

Inherited from:

DataView.get_dtypes :

Implement this method to get the columns and column types from dataset.

get_values

def get_values(self, col_names: List[str], **kwargs: Any) -> Dict[str, Iterable[Any]]:

Inherited from:

DropColsDataview.get_values :

Get distinct values from columns in dataset.

Arguments

  • col_names: The list of the columns whose distinct values should be returned.

Returns: The distinct values of the requested columns, as a mapping from column name to a series of distinct values.

load_data

def load_data(self, **kwargs: Any) -> None:

Inherited from:

DropColsDataview.load_data :

Loads data from the underlying datasource.

yield_data

def yield_data(self, file_names: Optional[List[str]] = None, **kwargs: Any) -> Iterator[pandas.core.frame.DataFrame]:

Yields the view's data as DataFrames, optionally restricted to the given file_names.
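Going by its signature, yield_data takes an optional list of file names and returns an iterator of DataFrames. A minimal sketch of that generator pattern follows, assuming one batch per file. The class `FileViewSketch` and the list-of-dicts batches are hypothetical and used in place of pandas DataFrames to keep the example dependency-free.

```python
from typing import Dict, Iterator, List, Optional

Row = Dict[str, object]


class FileViewSketch:
    """Hypothetical view over per-file batches of rows."""

    def __init__(self, rows_by_file: Dict[str, List[Row]]) -> None:
        self._rows_by_file = rows_by_file

    def yield_data(
        self, file_names: Optional[List[str]] = None
    ) -> Iterator[List[Row]]:
        # With no explicit selection, iterate over every known file.
        names = list(self._rows_by_file) if file_names is None else file_names
        for name in names:
            # One batch per file, produced lazily as the caller iterates.
            yield self._rows_by_file[name]


view = FileViewSketch({"a.csv": [{"x": 1}], "b.csv": [{"x": 2}]})
all_batches = list(view.yield_data())
only_b = list(view.yield_data(["b.csv"]))
```

Yielding lazily means a caller that stops early never pays for the files it did not reach.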

SQLDataView

class SQLDataView(datasource: BaseSource, query: str, pod_name: str, source_dataset_name: str, connector: PodDbConnector, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):

A data view that presents data with SQL query applied.

Arguments

  • data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
  • datasource: The BaseSource the view is generated from.
  • ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.

Attributes

  • data: A Dataframe-type object which contains the data.
  • data_splitter: Approach used for splitting the data into training, test, validation.
  • seed: Random number seed. Used for setting random seed for all libraries.

Raises

  • ValueError: if the underlying datasource is of IterableSource type.

Variables

  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    Raises: DataNotLoadedError: If the data has not been loaded yet.

  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash.

  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns _is_task_running for the view and the parent datasource.
  • iterable : bool - This returns False if the DataSource does not subclass IterableSource.

    However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.

  • multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

    However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.
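The data property is documented as raising DataNotLoadedError until the data has been loaded. A minimal sketch of that guard is shown below. The `DataNotLoadedError` class is redefined locally as a stand-in, and `LoadableViewSketch` is a hypothetical class, so neither should be read as the library's implementation.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


class DataNotLoadedError(Exception):
    """Local stand-in for the library exception of the same name."""


class LoadableViewSketch:
    """Hypothetical view whose data is only available after load_data()."""

    def __init__(self, rows: List[Row]) -> None:
        self._pending = rows
        self._data: Optional[List[Row]] = None

    def load_data(self) -> None:
        # Copy the rows so later changes to the input list are not reflected.
        self._data = list(self._pending)

    @property
    def data(self) -> List[Row]:
        if self._data is None:
            raise DataNotLoadedError("Call load_data() before accessing data.")
        return self._data


view = LoadableViewSketch([{"x": 1}])
```

Accessing `view.data` at this point raises the error; after `view.load_data()` it returns the rows.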

Methods


get_column

def get_column(self: BaseSource, col_name: str, *args: Any, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

DataView.get_column :

Get a single column from dataset.

Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.

get_column_names

def get_column_names(self, **kwargs: Any) -> Iterable[str]:

Inherited from:

DataView.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, **kwargs: Any) -> pandas.core.frame.DataFrame:

Loads and returns data from underlying dataset.

Applies the SQL query specified in the view.

Returns: A DataFrame-type object which contains the data.

Raises

  • ValueError: if the table specified in the query is not found.
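The SQL view behaviour described above (run the view's query against the stored data and raise ValueError when the referenced table is missing) can be approximated with the standard-library sqlite3 module. The in-memory database, the `patients` table, and the helper `run_view_query` are assumptions for illustration. The real view works through a PodDbConnector, whose interface is not covered here.

```python
import sqlite3
from typing import List, Tuple


def run_view_query(conn: sqlite3.Connection, query: str) -> List[Tuple]:
    """Hypothetical helper: apply a view's query and return the rows."""
    try:
        return conn.execute(query).fetchall()
    except sqlite3.OperationalError as exc:
        # SQLite raises OperationalError for a missing table (among other
        # failures); re-raise it as ValueError to mirror the documented error.
        raise ValueError(f"Query failed: {exc}") from exc


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)", [(1, 34), (2, 71), (3, 58)]
)

# The view exposes only the rows and columns selected by its query.
over_50 = run_view_query(
    conn, "SELECT id, age FROM patients WHERE age > 50 ORDER BY id"
)
```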

get_dtypes

def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) -> _Dtypes:

Inherited from:

DataView.get_dtypes :

Implement this method to get the columns and column types from dataset.

get_tables

def get_tables(self) -> List[str]:

Get the datasource tables from the pod database.

get_values

def get_values(self, col_names: List[str], table_name: Optional[str] = None, **kwargs: Any) -> Dict[str, Iterable[Any]]:

Get distinct values from columns in the dataset.

Arguments

  • col_names: The list of the columns whose distinct values should be returned.

Returns: The distinct values of the requested columns, as a mapping from column name to a series of distinct values.

load_data

def load_data(self, **kwargs: Any) -> None:

Inherited from:

DataView.load_data :

Loads data from the underlying datasource.

SQLFileSystemIterableDataView

class SQLFileSystemIterableDataView(datasource: BaseSource, query: str, pod_name: str, source_dataset_name: str, connector: PodDbConnector, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):

A data view that presents filesystem iterable data with SQL query applied.

Arguments

  • data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
  • datasource: The BaseSource the view is generated from.
  • ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.

Attributes

  • data: A Dataframe-type object which contains the data.
  • data_splitter: Approach used for splitting the data into training, test, validation.
  • seed: Random number seed. Used for setting random seed for all libraries.

Raises

  • ValueError: if the underlying datasource is not of FileSystemIterableSource type.

Variables

  • cache_images : bool - Returns cache_images for the view.
  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    Raises: DataNotLoadedError: If the data has not been loaded yet.

  • fast_load : bool - Returns fast_load for the view.
  • file_names - Get filenames for views generated from FileSystemIterableSource.
  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash.

  • image_columns : Set[str] - Returns image_columns for the view.
  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns _is_task_running for the view and the parent datasource.
  • iterable : bool - Returns iterable for the view and the parent datasource.
  • multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

    However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.

  • new_file_names_only_set : Optional[Set[str]] - Returns new_file_names_only_set for the view.
  • selected_file_names : List[str] - Returns selected_file_names for the view.
  • selected_file_names_override : List[str] - Returns selected_file_names_override for the view.

Methods


clear_file_names_cache

def clear_file_names_cache(self) -> None:

Clear the file names cache.

get_column

def get_column(self: BaseSource, col_name: str, *args: Any, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

DataView.get_column :

Get a single column from dataset.

Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.

get_column_names

def get_column_names(self, **kwargs: Any) -> Iterable[str]:

Inherited from:

SQLDataView.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, **kwargs: Any) -> pandas.core.frame.DataFrame:

Inherited from:

SQLDataView.get_data :

Loads and returns data from underlying dataset.

Applies the SQL query specified in the view.

Returns: A DataFrame-type object which contains the data.

Raises

  • ValueError: if the table specified in the query is not found.

get_dtypes

def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) -> _Dtypes:

Inherited from:

DataView.get_dtypes :

Implement this method to get the columns and column types from dataset.

get_tables

def get_tables(self) -> List[str]:

Inherited from:

SQLDataView.get_tables :

Get the datasource tables from the pod database.

get_values

def get_values(self, col_names: List[str], table_name: Optional[str] = None, **kwargs: Any) -> Dict[str, Iterable[Any]]:

Inherited from:

SQLDataView.get_values :

Get distinct values from columns in the dataset.

Arguments

  • col_names: The list of the columns whose distinct values should be returned.

Returns: The distinct values of the requested columns, as a mapping from column name to a series of distinct values.

load_data

def load_data(self, **kwargs: Any) -> None:

Inherited from:

SQLDataView.load_data :

Loads data from the underlying datasource.

yield_data

def yield_data(self, file_names: Optional[List[str]] = None, **kwargs: Any) -> Iterator[pandas.core.frame.DataFrame]:

Yields the view's data as DataFrames, optionally restricted to the given file_names.