views

Support for different "views" over existing datasets.

Views can restrict the data that is exposed to a modeller, or present a transformed version of the data to the modeller instead of the raw underlying data.

Classes

DataView

class DataView(datasource: BaseSource, source_dataset_name: str, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):

Base class for datasource views.

Arguments

datasource: The BaseSource the view is generated from.

Ancestors

BaseSource
abc.ABC

Subclasses

DropColsDataview
SQLDataView
bitfount.data.datasources.views._DataViewFromFileIterableSource
bitfount.data.datasources.views._EmptyDataview

Variables

data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

Raises: DataNotLoadedError: If the data has not been loaded yet.

hash : str - The hash associated with this BaseSource.

This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

Returns: The hexdigest of the DataFrame hash.
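As an illustration of a hash that depends only on static schema information, the sketch below hashes column names and dtypes with SHA-256. The function name schema_hash and the choice of SHA-256 are assumptions made for this example; this is not Bitfount's actual hashing code.

```python
import hashlib

import pandas as pd


def schema_hash(df: pd.DataFrame) -> str:
    """Hash column names and dtypes only, never the cell values (illustrative)."""
    # Sorting makes the digest independent of column order (a choice of this sketch).
    schema = sorted((str(name), str(dtype)) for name, dtype in df.dtypes.items())
    return hashlib.sha256(repr(schema).encode("utf-8")).hexdigest()


small = pd.DataFrame({"age": [31, 47], "name": ["a", "b"]})
larger = pd.DataFrame({"age": [31, 47, 58], "name": ["a", "b", "c"]})
widened = larger.assign(height=[1.7, 1.8, 1.6])

# Adding rows keeps the hash stable; adding a column changes it.
same_after_new_rows = schema_hash(small) == schema_hash(larger)
changed_after_new_col = schema_hash(small) != schema_hash(widened)
```

Because only names and types feed the digest, two frames with different row counts but the same format compare equal, which mirrors the consistency property described above.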

is_initialised : bool - Checks if BaseSource was initialised.

is_task_running : bool - Returns _is_task_running for the view and the parent datasource.

iterable : bool - This returns False if the DataSource does not subclass IterableSource.

However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.

multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.

Methods

get_column

def get_column(self, col_name: str, **kwargs: Any) ‑> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

BaseSource.get_column :

Get a single column from dataset.

Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.

get_column_names

def get_column_names(self, **kwargs: Any) ‑> Iterable[str]:

Inherited from:

BaseSource.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, **kwargs: Any) ‑> Optional[pandas.core.frame.DataFrame]:

Inherited from:

BaseSource.get_data :

Implement this method to load and return dataset.

get_dtypes

def get_dtypes(self, **kwargs: Any) ‑> _Dtypes:

Inherited from:

BaseSource.get_dtypes :

Implement this method to get the columns and column types from dataset.

get_values

def get_values(self, col_names: List[str], **kwargs: Any) ‑> Dict[str, Iterable[Any]]:

Inherited from:

BaseSource.get_values :

Get distinct values from list of columns.

load_data

def load_data(self, **kwargs: Any) ‑> None:

Loads data from the underlying datasource.
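The relationship between load_data and the data property can be pictured with a small stand-in class. Everything here is hypothetical: TinySource is not a Bitfount class, and DataNotLoadedError is redefined locally only so the sketch runs on its own.

```python
from typing import Optional

import pandas as pd


class DataNotLoadedError(Exception):
    """Local stand-in for the error named in the docs."""


class TinySource:
    """Hypothetical source that only exposes data after load_data() is called."""

    def __init__(self, records: dict) -> None:
        self._records = records
        self._data: Optional[pd.DataFrame] = None

    def load_data(self) -> None:
        # Materialise the dataframe; nothing is available before this call.
        self._data = pd.DataFrame(self._records)

    @property
    def data(self) -> pd.DataFrame:
        if self._data is None:
            raise DataNotLoadedError("Call load_data() before accessing data.")
        return self._data


source = TinySource({"age": [31, 47]})
try:
    source.data
    raised = False
except DataNotLoadedError:
    raised = True

source.load_data()
loaded_ages = list(source.data["age"])
```

Accessing data before load_data raises, and the same access succeeds afterwards.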

DropColsDataview

class DropColsDataview(datasource: BaseSource, drop_cols: Union[str, Sequence[str]], source_dataset_name: str, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):

A data view that presents data with columns removed.

Arguments

data_splitter: Approach used for splitting the data into training, test, and validation sets. Defaults to None.
datasource: The BaseSource the view is generated from.
drop_cols: Column or list of columns to remove from the view.
ignore_cols: Column or list of columns to be ignored from the data. Defaults to None.
modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
seed: Random number seed. Used for setting the random seed for all libraries. Defaults to None.

Attributes

data: A DataFrame-type object which contains the data.
data_splitter: Approach used for splitting the data into training, test, and validation sets.
seed: Random number seed. Used for setting the random seed for all libraries.

Ancestors

DataView
BaseSource
abc.ABC

Subclasses

DropColsFileSystemIterableDataview

Variables

data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

Raises: DataNotLoadedError: If the data has not been loaded yet.

hash : str - The hash associated with this BaseSource.

This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

Returns: The hexdigest of the DataFrame hash.

is_initialised : bool - Checks if BaseSource was initialised.

is_task_running : bool - Returns _is_task_running for the view and the parent datasource.

iterable : bool - This returns False if the DataSource does not subclass IterableSource.

However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.

multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.

Methods

get_column

def get_column(self: BaseSource, col_name: str, *args: Any, **kwargs: Any) ‑> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

DataView.get_column :

Get a single column from dataset.

Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.

get_column_names

def get_column_names(self, **kwargs: Any) ‑> Iterable[str]:

Inherited from:

DataView.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, **kwargs: Any) ‑> pandas.core.frame.DataFrame:

Loads and returns data from the underlying dataset.

Drops the columns specified in the view.

Returns: A DataFrame-type object which contains the data.

Raises

ValueError: if no data is returned from the original datasource.
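The behaviour described for this get_data (load from the source, remove the view's columns, raise if nothing came back) can be sketched with plain pandas. The helper name get_data_without_cols is hypothetical, and ignoring unknown column names is a choice of this sketch, not documented behaviour.

```python
from typing import Optional, Sequence, Union

import pandas as pd


def get_data_without_cols(
    source_data: Optional[pd.DataFrame],
    drop_cols: Union[str, Sequence[str]],
) -> pd.DataFrame:
    """Return a copy of source_data with drop_cols removed (illustrative)."""
    if source_data is None:
        raise ValueError("No data returned from the original datasource.")
    # Accept a single column name or a sequence of names, as the signature allows.
    cols = [drop_cols] if isinstance(drop_cols, str) else list(drop_cols)
    # errors="ignore" skips names that are absent (an assumption of this sketch).
    return source_data.drop(columns=cols, errors="ignore")


raw = pd.DataFrame({"age": [31, 47], "name": ["a", "b"], "ssn": ["x", "y"]})
view = get_data_without_cols(raw, ["name", "ssn"])
single = get_data_without_cols(raw, "ssn")
```

DataFrame.drop returns a new frame, so the raw data keeps all of its columns while the view exposes only the remainder.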

get_dtypes

def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) ‑> _Dtypes:

Inherited from:

DataView.get_dtypes :

Implement this method to get the columns and column types from dataset.

get_values

def get_values(self, col_names: List[str], **kwargs: Any) ‑> Dict[str, Iterable[Any]]:

Get distinct values from columns in dataset.

Arguments

col_names: The list of the columns whose distinct values should be returned.

Returns: The distinct values of the requested columns as a mapping from column name to a series of distinct values.
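A minimal sketch of the mapping shape that get_values describes, using pandas directly. The function name distinct_values is hypothetical and this is not the library's implementation.

```python
from typing import Any, Dict, Iterable, List

import pandas as pd


def distinct_values(df: pd.DataFrame, col_names: List[str]) -> Dict[str, Iterable[Any]]:
    """Map each requested column name to its distinct values (illustrative)."""
    # Series.unique() keeps values in order of first appearance.
    return {col: df[col].unique() for col in col_names}


df = pd.DataFrame({"colour": ["red", "blue", "red"], "size": [1, 1, 2]})
values = distinct_values(df, ["colour", "size"])
```

Each key is a column name and each value is an array-like of that column's distinct entries.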

load_data

def load_data(self, **kwargs: Any) ‑> None:

Inherited from:

DataView.load_data :

Loads data from the underlying datasource.

DropColsFileSystemIterableDataview

class DropColsFileSystemIterableDataview(datasource: BaseSource, drop_cols: Union[str, Sequence[str]], source_dataset_name: str, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):

A data view that presents filesystem iterable data with columns removed.

Arguments

data_splitter: Approach used for splitting the data into training, test, and validation sets. Defaults to None.
datasource: The BaseSource the view is generated from.
drop_cols: Column or list of columns to remove from the view.
ignore_cols: Column or list of columns to be ignored from the data. Defaults to None.
modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
seed: Random number seed. Used for setting the random seed for all libraries. Defaults to None.

Attributes

data: A DataFrame-type object which contains the data.
data_splitter: Approach used for splitting the data into training, test, and validation sets.
seed: Random number seed. Used for setting the random seed for all libraries.

Raises

ValueError: if the underlying datasource is not of FileSystemIterableSource type.

Ancestors

DropColsDataview
bitfount.data.datasources.views._DataViewFromFileIterableSource
DataView
BaseSource
abc.ABC

Variables

cache_images : bool - Returns cache_images for the view.

data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

Raises: DataNotLoadedError: If the data has not been loaded yet.

fast_load : bool - Returns fast_load for the view.

file_names - Get filenames for views generated from FileSystemIterableSource.

hash : str - The hash associated with this BaseSource.

This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

Returns: The hexdigest of the DataFrame hash.

image_columns : Set[str] - Returns image_columns for the view, excluding those in drop_cols.

is_initialised : bool - Checks if BaseSource was initialised.

is_task_running : bool - Returns _is_task_running for the view and the parent datasource.

iterable : bool - Returns iterable for the view and the parent datasource.

multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.

new_file_names_only_set : Optional[Set[str]] - Returns new_file_names_only_set for the view.

selected_file_names : List[str] - Returns selected_file_names for the view.

selected_file_names_override : List[str] - Returns selected_file_names_override for the view.

Methods

clear_file_names_cache

def clear_file_names_cache(self) ‑> None:

Clear the file names cache.

get_column

def get_column(self: BaseSource, col_name: str, *args: Any, **kwargs: Any) ‑> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

DataView.get_column :

Get a single column from dataset.

Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.

get_column_names

def get_column_names(self, **kwargs: Any) ‑> Iterable[str]:

Inherited from:

DropColsDataview.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, **kwargs: Any) ‑> pandas.core.frame.DataFrame:

Inherited from:

DropColsDataview.get_data :

Loads and returns data from the underlying dataset.

Drops the columns specified in the view.

Returns: A DataFrame-type object which contains the data.

Raises

ValueError: if no data is returned from the original datasource.

get_dtypes

def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) ‑> _Dtypes:

Inherited from:

DataView.get_dtypes :

Implement this method to get the columns and column types from dataset.

get_values

def get_values(self, col_names: List[str], **kwargs: Any) ‑> Dict[str, Iterable[Any]]:

Inherited from:

DropColsDataview.get_values :

Get distinct values from columns in dataset.

Arguments

col_names: The list of the columns whose distinct values should be returned.

Returns: The distinct values of the requested columns as a mapping from column name to a series of distinct values.

load_data

def load_data(self, **kwargs: Any) ‑> None:

Inherited from:

DropColsDataview.load_data :

Loads data from the underlying datasource.

yield_data

def yield_data(self, file_names: Optional[List[str]] = None, **kwargs: Any) ‑> Iterator[pandas.core.frame.DataFrame]:

Yields data from the underlying datasource as an iterator of DataFrames, optionally for the given file_names.
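Going by its signature, yield_data takes an optional list of file names and returns an iterator of DataFrames. The sketch below shows that pattern with an in-memory stand-in for files. The FILES mapping, the drop_cols default, and the one-frame-per-file behaviour are all assumptions for illustration, not documented behaviour.

```python
from typing import Dict, Iterator, List, Optional, Sequence

import pandas as pd

# Hypothetical stand-in for records that would normally be read from files.
FILES: Dict[str, dict] = {
    "scan_1.csv": {"id": [1], "path": ["/a"], "label": ["x"]},
    "scan_2.csv": {"id": [2], "path": ["/b"], "label": ["y"]},
}


def yield_data(
    file_names: Optional[List[str]] = None,
    drop_cols: Sequence[str] = ("path",),
) -> Iterator[pd.DataFrame]:
    """Lazily yield one frame per file with drop_cols removed (illustrative)."""
    names = list(FILES) if file_names is None else file_names
    for name in names:
        # Each frame is built only when the consumer asks for it.
        yield pd.DataFrame(FILES[name]).drop(columns=list(drop_cols))


all_frames = list(yield_data())
one_frame = list(yield_data(["scan_2.csv"]))
```

Yielding frames one at a time keeps memory use bounded when the full dataset is too large to load in a single call.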

SQLDataView

class SQLDataView(datasource: BaseSource, query: str, pod_name: str, source_dataset_name: str, connector: PodDbConnector, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):

A data view that presents data with SQL query applied.

Arguments

data_splitter: Approach used for splitting the data into training, test, and validation sets. Defaults to None.
datasource: The BaseSource the view is generated from.
ignore_cols: Column or list of columns to be ignored from the data. Defaults to None.
modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
query: The SQL query applied to the underlying data to produce the view.
seed: Random number seed. Used for setting the random seed for all libraries. Defaults to None.

Attributes

data: A DataFrame-type object which contains the data.
data_splitter: Approach used for splitting the data into training, test, and validation sets.
seed: Random number seed. Used for setting the random seed for all libraries.

Raises

ValueError: if the underlying datasource is of IterableSource type.

Ancestors

DataView
BaseSource
abc.ABC

Subclasses

SQLFileSystemIterableDataView

Variables

data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

Raises: DataNotLoadedError: If the data has not been loaded yet.

hash : str - The hash associated with this BaseSource.

This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

Returns: The hexdigest of the DataFrame hash.

is_initialised : bool - Checks if BaseSource was initialised.

is_task_running : bool - Returns _is_task_running for the view and the parent datasource.

iterable : bool - This returns False if the DataSource does not subclass IterableSource.

However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.

multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.

Methods

get_column

def get_column(self: BaseSource, col_name: str, *args: Any, **kwargs: Any) ‑> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

DataView.get_column :

Get a single column from dataset.

Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.

get_column_names

def get_column_names(self, **kwargs: Any) ‑> Iterable[str]:

Inherited from:

DataView.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, **kwargs: Any) ‑> pandas.core.frame.DataFrame:

Loads and returns data from the underlying dataset.

Applies the SQL query specified in the view.

Returns: A DataFrame-type object which contains the data.

Raises

ValueError: if the table specified in the query is not found.
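To make the idea of a query-backed view concrete, here is a sketch that uses an in-memory SQLite database in place of the pod database. The helper names get_tables and query_view, the explicit table_name argument, and the use of SQLite are assumptions of this example; how the real view resolves tables through its connector is not shown in this reference.

```python
import sqlite3
from typing import List

import pandas as pd


def get_tables(conn: sqlite3.Connection) -> List[str]:
    """List table names in the SQLite database (illustrative)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return [name for (name,) in rows]


def query_view(conn: sqlite3.Connection, query: str, table_name: str) -> pd.DataFrame:
    """Run the query and return a DataFrame, failing early on unknown tables."""
    if table_name not in get_tables(conn):
        raise ValueError(f"Table {table_name!r} not found.")
    return pd.read_sql_query(query, conn)


conn = sqlite3.connect(":memory:")
pd.DataFrame({"age": [31, 47, 58], "site": ["a", "b", "a"]}).to_sql(
    "patients", conn, index=False
)
view = query_view(conn, "SELECT age FROM patients WHERE site = 'a'", "patients")
```

The modeller only ever sees the rows and columns the query selects, here the age column for one site.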

get_dtypes

def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) ‑> _Dtypes:

Inherited from:

DataView.get_dtypes :

Implement this method to get the columns and column types from dataset.

get_tables

def get_tables(self) ‑> List[str]:

Get the datasource tables from the pod database.

get_values

def get_values(self, col_names: List[str], table_name: Optional[str] = None, **kwargs: Any) ‑> Dict[str, Iterable[Any]]:

Get distinct values from columns in the dataset.

Arguments

col_names: The list of the columns whose distinct values should be returned.

Returns: The distinct values of the requested columns as a mapping from column name to a series of distinct values.

load_data

def load_data(self, **kwargs: Any) ‑> None:

Inherited from:

DataView.load_data :

Loads data from the underlying datasource.

SQLFileSystemIterableDataView

class SQLFileSystemIterableDataView(datasource: BaseSource, query: str, pod_name: str, source_dataset_name: str, connector: PodDbConnector, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):

A data view that presents filesystem iterable data with SQL query applied.

Arguments

data_splitter: Approach used for splitting the data into training, test, and validation sets. Defaults to None.
datasource: The BaseSource the view is generated from.
ignore_cols: Column or list of columns to be ignored from the data. Defaults to None.
modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
query: The SQL query applied to the underlying data to produce the view.
seed: Random number seed. Used for setting the random seed for all libraries. Defaults to None.

Attributes

data: A DataFrame-type object which contains the data.
data_splitter: Approach used for splitting the data into training, test, and validation sets.
seed: Random number seed. Used for setting the random seed for all libraries.

Raises

ValueError: if the underlying datasource is not of FileSystemIterableSource type.

Ancestors

SQLDataView
bitfount.data.datasources.views._DataViewFromFileIterableSource
DataView
BaseSource
abc.ABC

Variables

cache_images : bool - Returns cache_images for the view.

data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

Raises: DataNotLoadedError: If the data has not been loaded yet.

fast_load : bool - Returns fast_load for the view.

file_names - Get filenames for views generated from FileSystemIterableSource.

hash : str - The hash associated with this BaseSource.

This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

Returns: The hexdigest of the DataFrame hash.

image_columns : Set[str] - Returns image_columns for the view.

is_initialised : bool - Checks if BaseSource was initialised.

is_task_running : bool - Returns _is_task_running for the view and the parent datasource.

iterable : bool - Returns iterable for the view and the parent datasource.

multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.

new_file_names_only_set : Optional[Set[str]] - Returns new_file_names_only_set for the view.

selected_file_names : List[str] - Returns selected_file_names for the view.

selected_file_names_override : List[str] - Returns selected_file_names_override for the view.

Methods

clear_file_names_cache

def clear_file_names_cache(self) ‑> None:

Clear the file names cache.

get_column

def get_column(self: BaseSource, col_name: str, *args: Any, **kwargs: Any) ‑> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

DataView.get_column :

Get a single column from dataset.

Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.

get_column_names

def get_column_names(self, **kwargs: Any) ‑> Iterable[str]:

Inherited from:

SQLDataView.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, **kwargs: Any) ‑> pandas.core.frame.DataFrame:

Inherited from:

SQLDataView.get_data :

Loads and returns data from the underlying dataset.

Applies the SQL query specified in the view.

Returns: A DataFrame-type object which contains the data.

Raises

ValueError: if the table specified in the query is not found.

get_dtypes

def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) ‑> _Dtypes:

Inherited from:

DataView.get_dtypes :

Implement this method to get the columns and column types from dataset.

get_tables

def get_tables(self) ‑> List[str]:

Inherited from:

SQLDataView.get_tables :

Get the datasource tables from the pod database.

get_values

def get_values(self, col_names: List[str], table_name: Optional[str] = None, **kwargs: Any) ‑> Dict[str, Iterable[Any]]:

Inherited from:

SQLDataView.get_values :

Get distinct values from columns in the dataset.

Arguments

col_names: The list of the columns whose distinct values should be returned.

Returns: The distinct values of the requested columns as a mapping from column name to a series of distinct values.

load_data

def load_data(self, **kwargs: Any) ‑> None:

Inherited from:

SQLDataView.load_data :

Loads data from the underlying datasource.

yield_data

def yield_data(self, file_names: Optional[List[str]] = None, **kwargs: Any) ‑> Iterator[pandas.core.frame.DataFrame]:

Yields data from the underlying datasource as an iterator of DataFrames, optionally for the given file_names.
