
excel_source

Module containing ExcelSource class.

ExcelSource class handles loading of Excel data.

Classes

ExcelSource

class ExcelSource(path: Union[os.PathLike, pydantic.networks.AnyUrl, str], sheet_name: Union[str, Sequence[str], None] = None, column_names: Optional[List[str]] = None, dtype: Optional[Dict[str, Union[ForwardRef('ExtensionDtype'), str, numpy.dtype, Type[Union[str, complex, bool, object]]]]] = None, read_excel_kwargs: Optional[Dict[str, Any]] = None, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):

Data source for loading Excel files.

info

To use this data source, you must install a backend library for reading Excel files. The currently supported engines are “xlrd”, “openpyxl”, “odf” and “pyxlsb”.

info

By default, the first row is used as the column names unless column_names or the header keyword argument is provided.

Arguments

  • read_excel_kwargs: Additional keyword arguments to be passed to pandas.read_excel.
  • column_names: The names of the columns if not using the first row of the sheet. Can only be used for single-sheet Excel files.
  • data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
  • dtype: The dtypes of the columns.
  • ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • path: The path or URL to the Excel file.
  • seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.
  • sheet_name: The name(s) of the sheet(s) to load. If not provided, all sheets will be loaded.

Attributes

  • data: A DataFrame-type object which contains the data.
  • data_splitter: Approach used for splitting the data into training, test, validation.
  • seed: Random number seed. Used for setting random seed for all libraries.

Raises

  • TypeError: If the path does not have a file extension that denotes an Excel file.
  • ValueError: If multiple sheet names are provided and column names are also provided.
  • ValueError: If sheets are referenced which do not exist in the Excel file.
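The three error conditions above can be illustrated with a small standalone sketch. This is not the library's implementation: `check_excel_args`, `EXCEL_EXTENSIONS` and the `available_sheets` parameter are hypothetical names, and the set of accepted extensions is an assumption.

```python
from pathlib import PurePath
from typing import List, Optional, Sequence, Union

# Assumption: a plausible set of spreadsheet extensions; the real list may differ.
EXCEL_EXTENSIONS = {".xls", ".xlsx", ".xlsm", ".xlsb", ".ods"}


def check_excel_args(
    path: str,
    available_sheets: Sequence[str],
    sheet_name: Union[str, Sequence[str], None] = None,
    column_names: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: return the sheets to load, or raise as documented."""
    if PurePath(path).suffix.lower() not in EXCEL_EXTENSIONS:
        raise TypeError(f"{path!r} does not look like an Excel file")

    # Normalise sheet_name: None means "all sheets", a str means one sheet.
    if sheet_name is None:
        requested = list(available_sheets)
    elif isinstance(sheet_name, str):
        requested = [sheet_name]
    else:
        requested = list(sheet_name)

    # column_names only makes sense when exactly one sheet is loaded.
    if len(requested) > 1 and column_names is not None:
        raise ValueError("column_names can only be used with a single sheet")

    missing = [s for s in requested if s not in available_sheets]
    if missing:
        raise ValueError(f"Sheets not found in file: {missing}")
    return requested
```

In this sketch, leaving `sheet_name` as None on a multi-sheet file while also passing `column_names` is treated as the multi-sheet case; whether the real class behaves the same way is an assumption.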

Variables

  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    Raises: DataNotLoadedError: If the data has not been loaded yet.

  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash.

  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns True if a task is running.
  • iterable : bool - This returns False if the DataSource does not subclass IterableSource.

    However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.

  • multi_table : bool - Attribute to specify whether the datasource is multi table.
  • table_names : List[str] - Excel sheet names in datasource.
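The idea behind the hash property, a digest that depends on column names and content types but not on cell values, can be sketched as follows. This is a minimal illustration only: `schema_hash` is a hypothetical function, and the choice of SHA-256, JSON serialisation and order-independence are all assumptions rather than the library's actual scheme.

```python
import hashlib
import json
from typing import Dict


def schema_hash(dtypes: Dict[str, str]) -> str:
    """Hypothetical: hash column names and type names only, never cell values."""
    # Sorting makes the digest independent of column order (an assumption).
    canonical = json.dumps(sorted(dtypes.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two datasets with the same schema hash identically, however many rows they hold.
before = schema_hash({"age": "int64", "name": "object"})
after_more_rows = schema_hash({"name": "object", "age": "int64"})
changed_type = schema_hash({"age": "float64", "name": "object"})
```

Because only the schema is fed into the digest, appending rows leaves the value unchanged, while changing a column's type produces a different value.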

Methods


get_column

def get_column(self: BaseSource, col_name: str, *args: Any, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

MultiTableSource.get_column:

Implement this method to get single column from dataset.

get_column_names

def get_column_names(self, table_name: Optional[str] = None, **kwargs: Any) -> Iterable[str]:

Get the column names in the Excel dataset.

Arguments

  • table_name: The name of the table from which the column names should be loaded. Defaults to None.

Returns: The list of column names from the requested table or the single table if not a multi-table instance.

Raises

  • ValueError: If the table name provided does not exist.
  • ValueError: If the data is multi-table but no table name is provided.
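The table-name rules described for get_column_names can be sketched with plain dictionaries. `resolve_columns` and the `sheets` mapping (sheet name to column names) are hypothetical stand-ins for the loaded workbook, and the sketch assumes at least one sheet is present.

```python
from typing import Dict, List, Optional


def resolve_columns(
    sheets: Dict[str, List[str]], table_name: Optional[str] = None
) -> List[str]:
    """Hypothetical: pick the column names for one sheet, following the documented rules."""
    if table_name is not None:
        if table_name not in sheets:
            raise ValueError(f"Table {table_name!r} does not exist")
        return list(sheets[table_name])
    if len(sheets) > 1:
        # Multi-table data needs an explicit table name.
        raise ValueError("table_name is required for multi-table data")
    # Single-table case: return the only sheet's columns.
    return list(next(iter(sheets.values())))


workbook = {"customers": ["id", "name"], "orders": ["id", "customer_id", "total"]}
order_columns = resolve_columns(workbook, "orders")
single_columns = resolve_columns({"Sheet1": ["a", "b"]})
```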

get_data

def get_data(self, table_name: Optional[str] = None, **kwargs: Any) -> Optional[pandas.core.frame.DataFrame]:

Loads and returns data from the Excel dataset.

Arguments

  • table_name: Table name for multi table data sources. This comes from the DataStructure and is ignored if sql_query has been provided.

Returns: A DataFrame-type object which contains the data.

Raises

  • ValueError: If the table name provided does not exist.

get_dtypes

def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) -> _Dtypes:

Inherited from:

MultiTableSource.get_dtypes:

Implement this method to get the columns and column types from dataset.

get_values

def get_values(self, col_names: List[str], table_name: Optional[str] = None, **kwargs: Any) -> Dict[str, Iterable[Any]]:

Get the distinct values from columns in the Excel dataset.

Arguments

  • col_names: The list of the columns whose distinct values should be returned.
  • table_name: The name of the table from which the columns should be loaded. Defaults to None.

Returns: The distinct values of the requested columns as a mapping from column name to a series of distinct values.

Raises

  • ValueError: If the table name provided does not exist.
  • ValueError: If the data is multi-table but no table name is provided.
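The shape of the get_values return value, a mapping from each requested column name to its distinct values, can be sketched without any Excel backend. `distinct_values` and the row-of-dicts input are hypothetical; the real method works on the loaded sheet and may return a different iterable type or ordering.

```python
from typing import Any, Dict, List


def distinct_values(
    rows: List[Dict[str, Any]], col_names: List[str]
) -> Dict[str, List[Any]]:
    """Hypothetical: map each requested column to its distinct values."""
    # dict.fromkeys de-duplicates while keeping first-seen order
    # (values must be hashable for this to work).
    return {
        col: list(dict.fromkeys(row[col] for row in rows)) for col in col_names
    }


rows = [
    {"city": "Paris", "visits": 1},
    {"city": "Rome", "visits": 1},
    {"city": "Paris", "visits": 2},
]
result = distinct_values(rows, ["city", "visits"])
```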

load_data

def load_data(self, **kwargs: Any) -> None:

Inherited from:

MultiTableSource.load_data:

Load the data for the datasource.

Raises

  • TypeError: If data format is not supported.