dataframe_source
Module containing the DataFrameSource class.
The DataFrameSource class handles loading data that is held in memory in a pandas DataFrame.
Classes
DataFrameSource
class DataFrameSource(data: pd.DataFrame, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):
Data source for loading dataframes.
Arguments
data
: The dataframe to be loaded.
data_splitter
: Approach used for splitting the data into training, test and validation sets. Defaults to None.
ignore_cols
: Column or list of columns to ignore in the data. Defaults to None.
modifiers
: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
seed
: Random number seed. Used for setting the random seed for all libraries. Defaults to None.
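Because ignore_cols accepts either a single column name or a sequence of names, the two forms have to be normalised before columns can be dropped. The following is a minimal sketch of that idea using plain pandas. The helper name drop_ignored is hypothetical, and this is not necessarily how DataFrameSource applies the argument internally.

```python
from typing import Optional, Sequence, Union

import pandas as pd


def drop_ignored(
    df: pd.DataFrame,
    ignore_cols: Optional[Union[str, Sequence[str]]] = None,
) -> pd.DataFrame:
    """Return a copy of df without the ignored columns (hypothetical helper)."""
    if ignore_cols is None:
        return df
    # A str is itself a sequence of characters, so check for it first.
    cols = [ignore_cols] if isinstance(ignore_cols, str) else list(ignore_cols)
    # DataFrame.drop raises KeyError if a listed column does not exist.
    return df.drop(columns=cols)


frame = pd.DataFrame({"id": [1, 2], "age": [30, 41], "note": ["a", "b"]})
single = drop_ignored(frame, "id")
several = drop_ignored(frame, ["id", "note"])
```

Here single keeps the age and note columns, while several keeps only age.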
Attributes
data_splitter
: Approach used for splitting the data into training, test, validation.
seed
: Random number seed. Used for setting random seed for all libraries.
Ancestors
Variables
data : pandas.core.frame.DataFrame
- A property containing the underlying dataframe if the data has been loaded.
Raises: DataNotLoadedError: If the data has not been loaded yet.
hash : str
- The hash associated with this BaseSource.
This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types, but NOT any of the content itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame remains compatible in its format.
Returns: The hexdigest of the DataFrame hash.
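To illustrate why a hash built only from column names and types stays stable when rows are added, here is a minimal sketch. It is not the library's actual implementation: the function name schema_hash, the use of SHA-256, the JSON encoding and the sorting of columns are all assumptions made for this example.

```python
import hashlib
import json
from typing import Mapping


def schema_hash(dtypes: Mapping[str, str]) -> str:
    """Hash column names and type names only; cell values never enter the hash."""
    # Sorting makes the digest independent of column order (an assumption of this sketch).
    canonical = json.dumps(sorted(dtypes.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


schema = {"age": "int64", "name": "object"}
reordered = {"name": "object", "age": "int64"}
changed = {"age": "float64", "name": "object"}

same = schema_hash(schema) == schema_hash(reordered)
different = schema_hash(schema) != schema_hash(changed)
```

Since only the schema is hashed, appending rows cannot change the digest, whereas changing a column's type does.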
is_initialised : bool
- Checks if BaseSource was initialised.
is_task_running : bool
- Returns True if a task is running.
iterable : bool
- This returns False if the DataSource does not subclass IterableSource. However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.
multi_table : bool
- This returns False if the DataSource does not subclass MultiTableSource. However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.
Methods
get_column
def get_column(self: BaseSource, col_name: str, *args: Any, **kwargs: Any) ‑> Union[numpy.ndarray, pandas.core.series.Series]:
Inherited from:
Get a single column from the dataset.
Used in the ColumnAverage algorithm, as well as to iterate over image columns for the purposes of schema generation.
get_column_names
def get_column_names(self, **kwargs: Any) ‑> Iterable[str]:
Inherited from:
Get the column names as an iterable.
get_data
def get_data(self, **kwargs: Any) ‑> pandas.core.frame.DataFrame:
Loads and returns data from the DataFrame dataset.
Returns: A DataFrame-type object which contains the data.
get_dtypes
def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) ‑> _Dtypes:
Inherited from:
Implement this method to get the columns and column types from the dataset.
get_values
def get_values(self, col_names: List[str], **kwargs: Any) ‑> Dict[str, Iterable[Any]]:
Get distinct values from columns in DataFrame dataset.
Arguments
col_names
: The list of the columns whose distinct values should be returned.
Returns: The distinct values of the requested columns, as a mapping from column name to a series of distinct values.
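The shape of the return value described above can be sketched with plain pandas. This is an illustration only and not the method's actual code: the function name distinct_values is hypothetical, and it takes the dataframe as an explicit argument rather than reading it from a data source.

```python
from typing import Any, Dict, Iterable, List

import pandas as pd


def distinct_values(df: pd.DataFrame, col_names: List[str]) -> Dict[str, Iterable[Any]]:
    """Map each requested column name to its distinct values (hypothetical helper)."""
    # Series.unique() returns the distinct values in order of first appearance.
    return {col: df[col].unique() for col in col_names}


df = pd.DataFrame({"city": ["Paris", "Rome", "Paris"], "visits": [1, 1, 2]})
values = distinct_values(df, ["city", "visits"])
```

Each key in values is one of the requested column names, and each value is an array holding that column's distinct entries.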
load_data
def load_data(self, **kwargs: Any) ‑> None:
Inherited from:
Load the data for the datasource.
Raises
TypeError
: If data format is not supported.