dataframe_source

Module containing DataFrameSource class.

DataFrameSource class handles loading data stored in memory in a pandas dataframe.

Classes

DataFrameSource

class DataFrameSource(
    data: pd.DataFrame,
    data_splitter: Optional[DatasetSplitter] = None,
    seed: Optional[int] = None,
    modifiers: Optional[Dict[str, DataPathModifiers]] = None,
    ignore_cols: Optional[Union[str, Sequence[str]]] = None,
):

Data source for loading dataframes.

Arguments

  • data: The dataframe to be loaded.
  • data_splitter: Approach used for splitting the data into training, test, and validation sets. Defaults to None.
  • ignore_cols: Column or list of columns to be ignored in the data. Defaults to None.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • seed: Random number seed. Used for setting the random seed for all libraries. Defaults to None.
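
Because ignore_cols accepts either a single column name or a sequence of names, it can help to see how such an argument might be normalised. The following is a minimal pandas-only sketch, not the library's implementation: drop_ignored_columns is a hypothetical helper, and silently skipping unknown column names is an assumption of the sketch.

```python
from typing import Optional, Sequence, Union

import pandas as pd


def drop_ignored_columns(
    df: pd.DataFrame,
    ignore_cols: Optional[Union[str, Sequence[str]]] = None,
) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df without the ignored columns."""
    if ignore_cols is None:
        return df.copy()
    # A bare string is treated as one column name, not as a sequence of characters.
    cols = [ignore_cols] if isinstance(ignore_cols, str) else list(ignore_cols)
    # errors="ignore" skips names that are not present (an assumption of this sketch).
    return df.drop(columns=cols, errors="ignore")


df = pd.DataFrame({"id": [1, 2], "age": [30, 41], "notes": ["a", "b"]})
single = drop_ignored_columns(df, "notes")
several = drop_ignored_columns(df, ["id", "notes"])
```

Here single keeps the id and age columns, several keeps only age, and the original df is left unchanged.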

Attributes

  • data_splitter: Approach used for splitting the data into training, test, and validation sets.
  • seed: Random number seed. Used for setting the random seed for all libraries.

Variables

  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    Raises: DataNotLoadedError: If the data has not been loaded yet.

  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash.

  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns True if a task is running.
  • iterable : bool - This returns False if the DataSource does not subclass IterableSource.

    However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.

  • multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

    However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.
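
The hash property above is described as depending on column names and content types but not on the values themselves. One way to picture that behaviour is the pandas-only sketch below. It is an illustration under stated assumptions, not the library's actual hashing scheme: schema_hash is a hypothetical helper, and the choice of SHA-256 over a sorted JSON payload is made up for this example.

```python
import hashlib
import json

import pandas as pd


def schema_hash(df: pd.DataFrame) -> str:
    """Hypothetical helper: hash column names and dtypes, never the cell values."""
    schema = {str(col): str(dtype) for col, dtype in df.dtypes.items()}
    # sort_keys makes the result independent of column order (a choice of this sketch).
    payload = json.dumps(schema, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


df = pd.DataFrame({"age": [30, 41], "name": ["Ann", "Bob"]})
# Same columns and dtypes, one extra row.
more_rows = pd.concat(
    [df, pd.DataFrame({"age": [25], "name": ["Cy"]})], ignore_index=True
)
# Same columns, but the dtype of "age" has changed.
retyped = df.astype({"age": "float64"})
```

In this sketch, schema_hash(df) and schema_hash(more_rows) are equal because only rows were added, while schema_hash(retyped) differs because a column type changed.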

Methods


get_column

def get_column(self: BaseSource, col_name: str, *args: Any, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

BaseSource.get_column :

Get a single column from the dataset.

Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.

get_column_names

def get_column_names(self, **kwargs: Any) -> Iterable[str]:

Inherited from:

BaseSource.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, **kwargs: Any) -> pandas.core.frame.DataFrame:

Loads and returns data from the DataFrame dataset.

Returns: A DataFrame-type object which contains the data.

get_dtypes

def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) -> _Dtypes:

Inherited from:

BaseSource.get_dtypes :

Implement this method to get the columns and column types from the dataset.

get_values

def get_values(self, col_names: List[str], **kwargs: Any) -> Dict[str, Iterable[Any]]:

Get distinct values from columns in the DataFrame dataset.

Arguments

  • col_names: The list of the columns whose distinct values should be returned.

Returns: The distinct values of the requested columns as a mapping from column name to a series of distinct values.
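
The shape of this return value can be sketched with plain pandas. The helper below, distinct_values, is a hypothetical stand-in written for illustration; it assumes the distinct values are obtained with Series.unique(), which may not match what get_values does internally.

```python
from typing import Any, Dict, Iterable, List

import pandas as pd


def distinct_values(df: pd.DataFrame, col_names: List[str]) -> Dict[str, Iterable[Any]]:
    """Hypothetical stand-in: map each requested column to its distinct values."""
    # Series.unique() keeps values in order of first appearance.
    return {col: df[col].unique() for col in col_names}


df = pd.DataFrame(
    {"colour": ["red", "blue", "red"], "size": [1, 2, 2], "id": [1, 2, 3]}
)
values = distinct_values(df, ["colour", "size"])
```

Only the requested columns appear as keys, so id is absent from values even though it exists in the dataframe.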

load_data

def load_data(self, **kwargs: Any) -> None:

Inherited from:

BaseSource.load_data :

Load the data for the datasource.

Raises

  • TypeError: If data format is not supported.