views
Support for different "views" over existing datasets.
These allow constraining the usable data that is exposed to a modeller, or only presenting a transformed view to the modeller rather than the raw underlying data.
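As a rough illustration of the idea, the sketch below uses plain Python stand-ins rather than the library's own classes. `ToySource` and `ToyView` are hypothetical names, and a table is modelled as a dict of column lists instead of a DataFrame. The view exposes only a transformed version of the source's data.

```python
from typing import Callable, Dict, List

Table = Dict[str, List[object]]  # column name -> column values


class ToySource:
    """Hypothetical stand-in for a datasource holding the raw data."""

    def __init__(self, table: Table) -> None:
        self._table = table

    def get_data(self) -> Table:
        # Return a copy so callers cannot mutate the raw data.
        return {name: list(values) for name, values in self._table.items()}


class ToyView:
    """Hypothetical view: applies a transformation before exposing data."""

    def __init__(self, source: ToySource, transform: Callable[[Table], Table]) -> None:
        self._source = source
        self._transform = transform

    def get_data(self) -> Table:
        # The modeller only ever sees the transformed table.
        return self._transform(self._source.get_data())


source = ToySource({"age": [31, 47], "name": ["Ana", "Bo"]})
# Constrain the usable data by hiding the "name" column.
view = ToyView(source, lambda t: {k: v for k, v in t.items() if k != "name"})
exposed = view.get_data()
```

The source itself is left untouched; only what passes through the view changes.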
Classes
DataView
class DataView( datasource: BaseSource, source_dataset_name: str, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None,):
Base class for datasource views.
Arguments
datasource
: The BaseSource the view is generated from.
Subclasses
- DropColsDataview
- SQLDataView
- bitfount.data.datasources.views._DataViewFromFileIterableSource
- bitfount.data.datasources.views._EmptyDataview
Variables
- data : pandas.core.frame.DataFrame
  A property containing the underlying dataframe if the data has been loaded.
  Raises: DataNotLoadedError: If the data has not been loaded yet.
- hash : str
  The hash associated with this BaseSource.
  This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.
  Returns: The hexdigest of the DataFrame hash.
- is_initialised : bool
  Checks if BaseSource was initialised.
- is_task_running : bool
  Returns _is_task_running for the view and the parent datasource.
- iterable : bool
  This returns False if the DataSource does not subclass IterableSource.
  However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.
- multi_table : bool
  This returns False if the DataSource does not subclass MultiTableSource.
  However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.
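The behaviour described for `hash` can be sketched as follows. This is only an illustration: the choice of SHA-256, the JSON serialisation and the function name `schema_hash` are assumptions, not the library's actual implementation. The point is that only column names and types feed the hash, so adding rows does not change it.

```python
import hashlib
import json
from typing import Dict


def schema_hash(dtypes: Dict[str, str]) -> str:
    """Hash column names and types only; row contents are never read."""
    # Sort the items so the result does not depend on column order.
    canonical = json.dumps(sorted(dtypes.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same schema, regardless of how many rows the table holds.
first = schema_hash({"age": "int64", "name": "object"})
second = schema_hash({"name": "object", "age": "int64"})
# A changed column type yields a different hash.
changed = schema_hash({"age": "float64", "name": "object"})
```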
Methods
get_column
def get_column( self, col_name: str, **kwargs: Any,) ‑> Union[numpy.ndarray, pandas.core.series.Series]:
Inherited from:
Get a single column from the dataset.
Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.
get_column_names
def get_column_names(self, **kwargs: Any) ‑> Iterable[str]:
Inherited from:
Get the column names as an iterable.
get_data
def get_data(self, **kwargs: Any) ‑> Optional[pandas.core.frame.DataFrame]:
Inherited from:
Implement this method to load and return the dataset.
get_dtypes
def get_dtypes(self, **kwargs: Any) ‑> _Dtypes:
Inherited from:
Implement this method to get the columns and column types from the dataset.
get_values
def get_values(self, col_names: List[str], **kwargs: Any) ‑> Dict[str, Iterable[Any]]:
Inherited from:
Get distinct values from a list of columns.
load_data
def load_data(self, **kwargs: Any) ‑> None:
Loads data from the underlying datasource.
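The relationship between `load_data` and the `data` property can be pictured with the minimal sketch below. `ToyLazySource` is a hypothetical class, and `DataNotLoadedError` is defined locally here as a stand-in for the error named in the documentation; neither is the library's code.

```python
from typing import Dict, List, Optional

Table = Dict[str, List[object]]  # column name -> column values


class DataNotLoadedError(Exception):
    """Local stand-in for the error named in the documentation."""


class ToyLazySource:
    """Hypothetical source whose data is only available after load_data()."""

    def __init__(self, table: Table) -> None:
        self._raw = table
        self._data: Optional[Table] = None

    @property
    def data(self) -> Table:
        if self._data is None:
            raise DataNotLoadedError("Call load_data() before accessing data.")
        return self._data

    def load_data(self) -> None:
        # A real source would read from disk or a database here.
        self._data = self._raw


src = ToyLazySource({"age": [31, 47]})
try:
    src.data
    raised = False
except DataNotLoadedError:
    raised = True

src.load_data()
loaded = src.data
```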
DropColsDataview
class DropColsDataview( datasource: BaseSource, drop_cols: Union[str, Sequence[str]], source_dataset_name: str, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None,):
A data view that presents data with columns removed.
Arguments
data_splitter
: Approach used for splitting the data into training, test, validation. Defaults to None.
datasource
: The BaseSource the view is generated from.
ignore_cols
: Column/list of columns to be ignored from the data. Defaults to None.
modifiers
: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
seed
: Random number seed. Used for setting random seed for all libraries. Defaults to None.
Attributes
data
: A DataFrame-type object which contains the data.
data_splitter
: Approach used for splitting the data into training, test, validation.
seed
: Random number seed. Used for setting random seed for all libraries.
Variables
- data : pandas.core.frame.DataFrame
  A property containing the underlying dataframe if the data has been loaded.
  Raises: DataNotLoadedError: If the data has not been loaded yet.
- hash : str
  The hash associated with this BaseSource.
  This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.
  Returns: The hexdigest of the DataFrame hash.
- is_initialised : bool
  Checks if BaseSource was initialised.
- is_task_running : bool
  Returns _is_task_running for the view and the parent datasource.
- iterable : bool
  This returns False if the DataSource does not subclass IterableSource.
  However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.
- multi_table : bool
  This returns False if the DataSource does not subclass MultiTableSource.
  However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.
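The wording around `iterable` and `multi_table` describes a pattern where the base class answers False and the specialised base forces each subclass to answer for itself. A possible shape for that pattern, using hypothetical `Toy*` classes rather than the library's own, is sketched here.

```python
from abc import ABC, abstractmethod


class ToyBaseSource:
    @property
    def iterable(self) -> bool:
        # Sources that do not subclass ToyIterableSource are never iterable.
        return False


class ToyIterableSource(ToyBaseSource, ABC):
    @property
    @abstractmethod
    def iterable(self) -> bool:
        """Each concrete subclass must decide for itself."""


class EagerSource(ToyIterableSource):
    @property
    def iterable(self) -> bool:
        # Subclasses the iterable base, yet still reports False.
        return False


class StreamingSource(ToyIterableSource):
    @property
    def iterable(self) -> bool:
        return True


plain = ToyBaseSource().iterable
eager = EagerSource().iterable
streaming = StreamingSource().iterable
```

Inheriting from the iterable base is therefore necessary but not sufficient for the property to be True.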
Methods
get_column
def get_column( self: BaseSource, col_name: str, *args: Any, **kwargs: Any,) ‑> Union[numpy.ndarray, pandas.core.series.Series]:
Inherited from:
Get a single column from the dataset.
Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.
get_column_names
def get_column_names(self, **kwargs: Any) ‑> Iterable[str]:
Inherited from:
Get the column names as an iterable.
get_data
def get_data(self, **kwargs: Any) ‑> pandas.core.frame.DataFrame:
Loads and returns data from the underlying dataset.
Drops the columns specified in the view.
Returns
A DataFrame-type object which contains the data.
Raises
ValueError
: If no data is returned from the original datasource.
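The behaviour described for `get_data` could look roughly like the sketch below. It is written against a plain dict-of-lists table rather than a DataFrame, and `drop_columns` is a hypothetical helper, not a function from the library. It accepts either a single column name or a sequence of names, mirroring the `drop_cols` type in the class signature.

```python
from typing import Dict, List, Optional, Sequence, Union

Table = Dict[str, List[object]]  # column name -> column values


def drop_columns(data: Optional[Table], drop_cols: Union[str, Sequence[str]]) -> Table:
    """Return a copy of the table without the requested columns."""
    if data is None:
        # Mirrors the documented ValueError when the source returns no data.
        raise ValueError("No data returned from the underlying datasource.")
    # Normalise a single column name to a set of names.
    to_drop = {drop_cols} if isinstance(drop_cols, str) else set(drop_cols)
    return {name: values for name, values in data.items() if name not in to_drop}


table = {"age": [31, 47], "name": ["Ana", "Bo"], "id": [1, 2]}
without_name = drop_columns(table, "name")
only_age = drop_columns(table, ["name", "id"])
```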
get_dtypes
def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) ‑> _Dtypes:
Inherited from:
Implement this method to get the columns and column types from the dataset.
get_values
def get_values(self, col_names: List[str], **kwargs: Any) ‑> Dict[str, Iterable[Any]]:
Get distinct values from columns in the dataset.
Arguments
col_names
: The list of the columns whose distinct values should be returned.
Returns
The distinct values of the requested columns, as a mapping from column name to a series of distinct values.
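The shape of the return value can be illustrated with a small stand-alone sketch. `distinct_values` is a hypothetical function operating on a dict-of-lists table; the ordering of the distinct values (first occurrence) is an assumption made for the example, not something the documentation specifies.

```python
from typing import Dict, List

Table = Dict[str, List[object]]  # column name -> column values


def distinct_values(data: Table, col_names: List[str]) -> Dict[str, List[object]]:
    """Map each requested column name to its distinct values."""
    # dict.fromkeys keeps the first occurrence of each value, in order.
    return {col: list(dict.fromkeys(data[col])) for col in col_names}


table = {"colour": ["red", "green", "red"], "size": [1, 1, 2]}
result = distinct_values(table, ["colour", "size"])
```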
load_data
def load_data(self, **kwargs: Any) ‑> None:
Inherited from:
Loads data from the underlying datasource.
DropColsFileSystemIterableDataview
class DropColsFileSystemIterableDataview( datasource: BaseSource, drop_cols: Union[str, Sequence[str]], source_dataset_name: str, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None,):
A data view that presents filesystem iterable data with columns removed.
Arguments
data_splitter
: Approach used for splitting the data into training, test, validation. Defaults to None.
datasource
: The BaseSource the view is generated from.
ignore_cols
: Column/list of columns to be ignored from the data. Defaults to None.
modifiers
: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
seed
: Random number seed. Used for setting random seed for all libraries. Defaults to None.
Attributes
data
: A DataFrame-type object which contains the data.
data_splitter
: Approach used for splitting the data into training, test, validation.
seed
: Random number seed. Used for setting random seed for all libraries.
Raises
ValueError
: if the underlying datasource is not of FileSystemIterableSource type.
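A constructor-time type check of the kind described under Raises might be sketched like this. All three classes are hypothetical stand-ins with a `Toy` prefix; the real constructor takes further arguments that are omitted here.

```python
class ToyBaseSource:
    """Hypothetical stand-in for a generic datasource."""


class ToyFileSystemIterableSource(ToyBaseSource):
    """Hypothetical stand-in for a filesystem-backed iterable datasource."""


class ToyFileSystemView:
    """Hypothetical view that only accepts filesystem iterable sources."""

    def __init__(self, datasource: ToyBaseSource) -> None:
        if not isinstance(datasource, ToyFileSystemIterableSource):
            raise ValueError(
                "Underlying datasource must be a filesystem iterable source, "
                f"got {type(datasource).__name__}."
            )
        self.datasource = datasource


accepted = ToyFileSystemView(ToyFileSystemIterableSource())
try:
    ToyFileSystemView(ToyBaseSource())
    rejected = False
except ValueError:
    rejected = True
```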