base_source

Module containing BaseSource class.

BaseSource is the abstract data source class from which all concrete data sources must inherit.

Classes

BaseSource

class BaseSource(
    data_splitter: Optional[DatasetSplitter] = None,
    seed: Optional[int] = None,
    modifiers: Optional[Dict[str, DataPathModifiers]] = None,
    ignore_cols: Optional[Union[str, Sequence[str]]] = None,
):

Abstract Base Source from which all other data sources must inherit.

Arguments

  • data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
  • seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
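
Because BaseSource is abstract, a concrete subclass must provide the data-access methods listed under Methods below. The following is a minimal sketch of a hypothetical CSV-backed subclass; the import path and the exact set of required overrides are assumptions, so check the library source for the authoritative requirements:

    import pandas as pd

    from bitfount.data.datasources.base_source import BaseSource  # assumed import path

    class CSVSource(BaseSource):  # hypothetical illustrative subclass
        def __init__(self, csv_path: str, **kwargs):
            super().__init__(**kwargs)
            self.csv_path = csv_path

        def get_data(self, **kwargs) -> pd.DataFrame:
            # Load and return the full dataset.
            return pd.read_csv(self.csv_path)

        def get_dtypes(self, **kwargs) -> dict:
            # Column names mapped to column types (a plain dict stands in
            # for the library's internal _Dtypes type here).
            return dict(pd.read_csv(self.csv_path, nrows=100).dtypes)

        def get_column(self, col_name: str, **kwargs) -> pd.Series:
            return pd.read_csv(self.csv_path, usecols=[col_name])[col_name]

        def get_column_names(self, **kwargs):
            return list(pd.read_csv(self.csv_path, nrows=0).columns)

        def get_values(self, col_names, **kwargs):
            # Distinct values per requested column.
            df = pd.read_csv(self.csv_path, usecols=col_names)
            return {col: df[col].unique() for col in col_names}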

Attributes

  • data: A Dataframe-type object which contains the data.
  • data_splitter: Approach used for splitting the data into training, test, validation.
  • seed: Random number seed. Used for setting random seed for all libraries.

Variables

  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    Raises: DataNotLoadedError: If the data has not been loaded yet.

  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash. (A conceptual sketch of this kind of schema hashing appears after this list.)

  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns True if a task is running.
  • iterable : bool - This returns False if the DataSource does not subclass IterableSource.

    However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.

  • multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

    However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.
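
The exact hashing scheme behind the hash property is internal to the library, but conceptually a schema hash of this kind is computed from column names and types only, so the digest stays stable as compatible rows are added. A purely illustrative sketch:

    import hashlib

    def schema_hash(dtypes: dict) -> str:
        # Hash only static schema information (column names and types),
        # never row contents, so adding compatible data leaves it unchanged.
        canonical = repr(sorted((name, str(dtype)) for name, dtype in dtypes.items()))
        return hashlib.sha256(canonical.encode()).hexdigest()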

Methods


get_column

def get_column(self, col_name: str, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:

Get a single column from the dataset.

Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.

get_column_names

def get_column_names(self, **kwargs: Any) -> Iterable[str]:

Get the column names as an iterable.

get_data

def get_data(self, **kwargs: Any) -> Optional[pandas.core.frame.DataFrame]:

Implement this method to load and return the dataset.

get_dtypes

def get_dtypes(self, **kwargs: Any) -> _Dtypes:

Implement this method to get the columns and column types from the dataset.

get_values

def get_values(self, col_names: List[str], **kwargs: Any) -> Dict[str, Iterable[Any]]:

Get distinct values from a list of columns.

load_data

def load_data(self, **kwargs: Any) -> None:

Load the data for the datasource.

Raises

  • TypeError: If the data format is not supported.

FileSystemIterableSource

class FileSystemIterableSource(
    path: Union[os.PathLike, str],
    output_path: Optional[Union[os.PathLike, str]] = None,
    iterable: bool = False,
    fast_load: bool = False,
    file_extension: Optional[_SingleOrMulti[str]] = None,
    strict: bool = False,
    cache_images: bool = True,
    file_creation_cutoff_date: Optional[Union[Date, DateTD]] = None,
    file_modification_cutoff_date: Optional[Union[Date, DateTD]] = None,
    min_file_size: Optional[float] = None,
    max_file_size: Optional[float] = None,
    partition_size: int = 100,
    data_splitter: Optional[DatasetSplitter] = None,
    seed: Optional[int] = None,
    modifiers: Optional[Dict[str, DataPathModifiers]] = None,
    ignore_cols: Optional[Union[str, Sequence[str]]] = None,
):

Abstract base source that supports iterating over file-based data.

This is used for iterable data sources whose data is stored as files on disk.

Arguments

  • cache_images: Whether to cache images in the file system. Defaults to True. This is ignored if fast_load is True.
  • data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
  • fast_load: Whether the data will be loaded in fast mode. This is used to determine whether the data will be iterated over during set up for schema generation and splitting (where necessary). Only relevant if iterable is True, otherwise it is ignored. Defaults to False.
  • file_creation_cutoff_date: The oldest possible date to consider for file creation. If None, all files will be considered. Defaults to None.
  • file_extension: File extension(s) of the data files. If None, all files will be searched. Can either be a single file extension or a list of file extensions. Case-insensitive. Defaults to None.
  • file_modification_cutoff_date: The oldest possible date to consider for file modification. If None, all files will be considered. Defaults to None.
  • ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
  • iterable: Whether the data source is iterable. This is used to determine whether the data source can be used in a streaming context during a task. Defaults to False.
  • max_file_size: The maximum file size in megabytes to consider. If None, all files will be considered. Defaults to None.
  • min_file_size: The minimum file size in megabytes to consider. If None, all files will be considered. Defaults to None.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • output_path: The path where intermediary output files are saved. Defaults to 'preprocessed/'.
  • partition_size: The size of each partition when iterating over the data.
  • path: Path to the directory which contains the data files. Subdirectories will be searched recursively.
  • seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.
  • strict: Whether file loading should be limited strictly to files with the explicit file extension provided. If True, only those files will be loaded. Otherwise, the given path will be scanned for files of the same type as the provided file extension. Only relevant if file_extension is provided. Defaults to False.
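
As an illustration of how the file-filtering arguments above combine, a hypothetical concrete subclass (here called MyFileSource, an invented name) might be instantiated as follows; the keyword arguments are those documented above:

    from datetime import date

    source = MyFileSource(  # hypothetical concrete subclass of FileSystemIterableSource
        path="/data/scans",                          # searched recursively
        file_extension=[".png", ".jpeg"],            # case-insensitive
        file_creation_cutoff_date=date(2023, 1, 1),  # skip files created before this
        min_file_size=0.1,                           # megabytes
        max_file_size=50.0,                          # megabytes
        iterable=True,
        partition_size=100,
    )
    source.load_data()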

Attributes

  • data: A Dataframe-type object which contains the data.
  • data_splitter: Approach used for splitting the data into training, test, validation.
  • seed: Random number seed. Used for setting random seed for all libraries.

Variables

  • static skipped_files : Set[str]
  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    If the datasource is iterable, this will raise an exception.

    Raises: IterableDataSourceError: If the datasource is set to iterable. DataNotLoadedError: If the data has not been loaded yet.

  • file_names - Returns a cached list of file names in the directory.
  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash.

  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns True if a task is running.
  • iterable : bool - Defines whether the data source is iterable.

    This is defined by the user when instantiating the class.

  • multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

    However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.

  • path : pathlib.Path - Resolved absolute path to data.

    Provides a consistent version of the path provided by the user which should work throughout, regardless of operating system and directory structure.

  • selected_file_names : List[str] - Returns a list of selected file names.

    Selected file names are affected by the selected_file_names_override and new_file_names_only attributes.

  • stale : bool - Whether the data source is stale.

    This is defined by whether the data is loaded and the number of files matches the number of rows in the dataframe.

Static methods


get_num_workers

def get_num_workers(file_names: List[str]) -> int:

Inherited from:

MultiProcessingMixIn.get_num_workers :

Gets the number of workers to use for multiprocessing.

Ensures that the number of workers is at least 1 and at most equal to MAX_NUM_MULTIPROCESSING_WORKERS. If the number of files is less than MAX_NUM_MULTIPROCESSING_WORKERS, then we use the number of files as the number of workers, unless the number of machine cores is also less than MAX_NUM_MULTIPROCESSING_WORKERS, in which case we use the lower of the two.

Arguments

  • file_names: The list of file names to load.

Returns The number of workers to use for multiprocessing.
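
The selection logic described above reduces to a clamped minimum. A sketch of the equivalent computation (the value of MAX_NUM_MULTIPROCESSING_WORKERS shown here is an assumption; the real constant is defined by the library):

    import multiprocessing

    MAX_NUM_MULTIPROCESSING_WORKERS = 8  # assumed value for illustration

    def get_num_workers(file_names: list) -> int:
        # At least 1 worker; at most the smallest of the file count,
        # the machine's core count, and the configured maximum.
        return max(
            1,
            min(len(file_names), multiprocessing.cpu_count(), MAX_NUM_MULTIPROCESSING_WORKERS),
        )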

Methods


clear_file_names_cache

def clear_file_names_cache(self) -> None:

Clears the list of selected file names.

This allows the datasource to pick up any new files that have been added to the directory since the last time it was cached.

get_column

def get_column(self, col_name: str, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:

Loads and returns a single column from the dataset.

Arguments

  • col_name: The name of the column which should be loaded.
  • **kwargs: Additional keyword arguments to pass to the load_data method if the data is stale.

Returns The requested column as a series.

get_column_names

def get_column_names(self, **kwargs: Any) -> Iterable[str]:

Inherited from:

IterableSource.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, file_names: Optional[List[str]] = None, use_cache: bool = True, **kwargs: Any) -> Optional[pandas.core.frame.DataFrame]:

Returns data if the datasource is not iterable, otherwise None.

We don't reload the data if it has already been loaded. We are assuming that files are only added, not removed, meaning we can rely on simply matching the number of files detected with the length of the dataframe to ascertain whether the data has already been loaded.

get_dtypes

def get_dtypes(self, **kwargs: Any) -> _Dtypes:

Loads and returns the column names and types of the dataframe.

If fast_load is set to True and the data is stale, only the first file will be loaded to get the column names and types. Otherwise, all files will be loaded. We only perform this optimisation if the data is stale because if it's not, it's faster to just return the dtypes of the already loaded data.

Arguments

  • **kwargs: Additional keyword arguments to pass to the load_data method if the data is stale.

Returns A mapping from column names to column types.

get_file_creation_date

def get_file_creation_date(self, path: str) -> datetime.date:

Get the creation date of a file with consideration for different OSs.

Caution

It is not possible to get the creation date of a file on Linux. This method will return the last modification date instead. This will impact filtering of files by date.

Arguments

  • path: The path to the file.

Returns The creation date of the file.
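
A common cross-platform way to implement this behaviour (not necessarily the library's exact code) uses st_birthtime where the OS provides it and falls back to the modification time on Linux:

    import os
    import platform
    from datetime import date, datetime

    def file_creation_date(path: str) -> date:
        stat = os.stat(path)
        if platform.system() == "Windows":
            ts = stat.st_ctime  # creation time on Windows
        else:
            # st_birthtime exists on macOS/BSD; Linux exposes no creation
            # time, so fall back to the last modification time.
            ts = getattr(stat, "st_birthtime", stat.st_mtime)
        return datetime.fromtimestamp(ts).date()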

get_file_last_modification_date

def get_file_last_modification_date(self, path: str) -> datetime.date:

Get the last modification date of a file.

Arguments

  • path: The path to the file.

Returns The last modification date of the file.

get_values

def get_values(self, col_names: List[str], **kwargs: Any) -> Dict[str, Iterable[Any]]:

Get distinct values from columns in the dataset.

Arguments

  • col_names: The list of the columns whose distinct values should be returned.
  • **kwargs: Additional keyword arguments to pass to the load_data method if the data is stale.

Returns The distinct values of the requested columns as a mapping from column name to a series of distinct values.

load_data

def load_data(self, **kwargs: Any) -> None:

Load the data for the datasource.

Arguments

  • **kwargs: Keyword arguments to be passed to underlying functions.

Raises

  • TypeError: If the data format is not supported.

partition

def partition(self, iterable: Sequence, partition_size: int = 1) -> Iterable:

Takes an iterable and yields partitions of size partition_size.

The final partition may be smaller than partition_size due to the variable length of the iterable.
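
A minimal sketch of this partitioning behaviour over a sequence:

    from typing import Iterator, Sequence

    def partition(iterable: Sequence, partition_size: int = 1) -> Iterator[Sequence]:
        # Yield successive slices of at most partition_size items;
        # the final slice may be shorter.
        for start in range(0, len(iterable), partition_size):
            yield iterable[start:start + partition_size]

    list(partition([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]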

use_file_multiprocessing

def use_file_multiprocessing(self, file_names: List[str]) -> bool:

Inherited from:

MultiProcessingMixIn.use_file_multiprocessing :

Check if file multiprocessing should be used.

Returns True if file multiprocessing has been enabled by the environment variable and the number of workers would be greater than 1, otherwise False. There is no need to use file multiprocessing if we are just going to use one worker - it would be slower than just loading the data in the main process.

Returns True if file multiprocessing should be used, otherwise False.
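
In outline, the check combines the environment toggle with the worker count. A self-contained sketch (the environment variable name and maximum shown here are assumptions for illustration; the library defines the actual toggle and constant):

    import multiprocessing
    import os

    MAX_NUM_MULTIPROCESSING_WORKERS = 8  # assumed value for illustration

    def use_file_multiprocessing(file_names: list) -> bool:
        # Hypothetical environment variable name.
        enabled = os.environ.get("FILE_MULTIPROCESSING_ENABLED", "").lower() in ("1", "true")
        num_workers = max(
            1,
            min(len(file_names), multiprocessing.cpu_count(), MAX_NUM_MULTIPROCESSING_WORKERS),
        )
        # One worker would be slower than loading in the main process.
        return enabled and num_workers > 1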

yield_data

def yield_data(self, file_names: Optional[List[str]] = None, use_cache: bool = True, **kwargs: Any) -> Iterator[pandas.core.frame.DataFrame]:

Yields data in batches from files that match the given file names.

Arguments

  • file_names: An optional list of file names to use for yielding data. Otherwise, all files that have already been found will be used. file_names is always provided when this method is called from the Dataset as part of a task.

Raises

  • ValueError: If no file names are provided and no files have been found.
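
A typical consumption pattern, assuming source is an instance of a concrete FileSystemIterableSource subclass that has already discovered its files:

    # Each batch is a pandas DataFrame covering one partition of files.
    for batch in source.yield_data():
        print(len(batch), list(batch.columns))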

FileSystemIterableSourceInferrable

class FileSystemIterableSourceInferrable(
    path: Union[os.PathLike, str],
    infer_class_labels_from_filepaths: bool = False,
    data_cache: Optional[DataPersister] = None,
    output_path: Optional[Union[os.PathLike, str]] = None,
    iterable: bool = False,
    fast_load: bool = False,
    file_extension: Optional[_SingleOrMulti[str]] = None,
    strict: bool = False,
    cache_images: bool = True,
    file_creation_cutoff_date: Optional[Union[Date, DateTD]] = None,
    file_modification_cutoff_date: Optional[Union[Date, DateTD]] = None,
    min_file_size: Optional[float] = None,
    max_file_size: Optional[float] = None,
    partition_size: int = 100,
    data_splitter: Optional[DatasetSplitter] = None,
    seed: Optional[int] = None,
    modifiers: Optional[Dict[str, DataPathModifiers]] = None,
    ignore_cols: Optional[Union[str, Sequence[str]]] = None,
):

Abstract base source that supports iterating over file-based data.

This is used for iterable data sources whose data is stored as files on disk.

Arguments

  • cache_images: Whether to cache images in the file system. Defaults to True. This is ignored if fast_load is True.
  • data_cache: A DataPersister instance to use if data caching is desired.
  • data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
  • fast_load: Whether the data will be loaded in fast mode. This is used to determine whether the data will be iterated over during set up for schema generation and splitting (where necessary). Only relevant if iterable is True, otherwise it is ignored. Defaults to False.
  • file_creation_cutoff_date: The oldest possible date to consider for file creation. If None, all files will be considered. Defaults to None.
  • file_extension: File extension(s) of the data files. If None, all files will be searched. Can either be a single file extension or a list of file extensions. Case-insensitive. Defaults to None.
  • file_modification_cutoff_date: The oldest possible date to consider for file modification. If None, all files will be considered. Defaults to None.
  • ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
  • infer_class_labels_from_filepaths: Whether class labels should be inferred from the file paths and added to the data. Labels default to the name of the first directory level within self.path, but can go a level deeper if the data splitter is provided with infer_data_split_labels set to True (see the example after this list). Defaults to False.
  • iterable: Whether the data source is iterable. This is used to determine whether the data source can be used in a streaming context during a task. Defaults to False.
  • max_file_size: The maximum file size in megabytes to consider. If None, all files will be considered. Defaults to None.
  • min_file_size: The minimum file size in megabytes to consider. If None, all files will be considered. Defaults to None.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • output_path: The path where intermediary output files are saved. Defaults to 'preprocessed/'.
  • partition_size: The size of each partition when iterating over the data.
  • path: Path to the directory which contains the data files. Subdirectories will be searched recursively.
  • seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.
  • strict: Whether file loading should be limited strictly to files with the explicit file extension provided. If True, only those files will be loaded. Otherwise, the given path will be scanned for files of the same type as the provided file extension. Only relevant if file_extension is provided. Defaults to False.
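
For example, with infer_class_labels_from_filepaths=True, a directory layout like the following would yield one label per file based on the first directory level under path (the exact name of the generated label column is library-defined):

    /data/
        cat/img_001.png   -> class label "cat"
        dog/img_042.png   -> class label "dog"

    # With a data splitter configured with infer_data_split_labels=True,
    # a split level above the class directories can also be inferred:
    /data/
        train/cat/img_001.png   -> split "train", class label "cat"
        test/dog/img_042.png    -> split "test",  class label "dog"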

Attributes

  • data: A Dataframe-type object which contains the data.
  • data_splitter: Approach used for splitting the data into training, test, validation.
  • seed: Random number seed. Used for setting random seed for all libraries.

Variables

  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    If the datasource is iterable, this will raise an exception.

    Raises: IterableDataSourceError: If the datasource is set to iterable. DataNotLoadedError: If the data has not been loaded yet.

  • file_names - Returns a cached list of file names in the directory.
  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash.

  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns True if a task is running.
  • iterable : bool - Defines whether the data source is iterable.

    This is defined by the user when instantiating the class.

  • multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

    However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.

  • path : pathlib.Path - Resolved absolute path to data.

    Provides a consistent version of the path provided by the user which should work throughout, regardless of operating system and directory structure.

  • selected_file_names : List[str] - Returns a list of selected file names.

    Selected file names are affected by the selected_file_names_override and new_file_names_only attributes.

  • stale : bool - Whether the data source is stale.

    This is defined by whether the data is loaded and the number of files matches the number of rows in the dataframe.

Static methods


get_num_workers

def get_num_workers(file_names: List[str]) -> int:

Inherited from:

FileSystemIterableSource.get_num_workers :

Gets the number of workers to use for multiprocessing.

Ensures that the number of workers is at least 1 and at most equal to MAX_NUM_MULTIPROCESSING_WORKERS. If the number of files is less than MAX_NUM_MULTIPROCESSING_WORKERS, then we use the number of files as the number of workers, unless the number of machine cores is also less than MAX_NUM_MULTIPROCESSING_WORKERS, in which case we use the lower of the two.

Arguments

  • file_names: The list of file names to load.

Returns The number of workers to use for multiprocessing.

Methods


clear_file_names_cache

def clear_file_names_cache(self) -> None:

Inherited from:

FileSystemIterableSource.clear_file_names_cache :

Clears the list of selected file names.

This allows the datasource to pick up any new files that have been added to the directory since the last time it was cached.

get_column

def get_column(self, col_name: str, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

FileSystemIterableSource.get_column :

Loads and returns a single column from the dataset.

Arguments

  • col_name: The name of the column which should be loaded.
  • **kwargs: Additional keyword arguments to pass to the load_data method if the data is stale.

Returns The requested column as a series.

get_column_names

def get_column_names(self, **kwargs: Any) -> Iterable[str]:

Inherited from:

FileSystemIterableSource.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, file_names: Optional[List[str]] = None, use_cache: bool = True, **kwargs: Any) -> Optional[pandas.core.frame.DataFrame]:

Inherited from:

FileSystemIterableSource.get_data :

Returns data if the datasource is not iterable, otherwise None.

We don't reload the data if it has already been loaded. We are assuming that files are only added, not removed, meaning we can rely on simply matching the number of files detected with the length of the dataframe to ascertain whether the data has already been loaded.

get_dtypes

def get_dtypes(self, **kwargs: Any) -> _Dtypes:

Inherited from:

FileSystemIterableSource.get_dtypes :

Loads and returns the column names and types of the dataframe.

If fast_load is set to True and the data is stale, only the first file will be loaded to get the column names and types. Otherwise, all files will be loaded. We only perform this optimisation if the data is stale because if it's not, it's faster to just return the dtypes of the already loaded data.

Arguments

  • **kwargs: Additional keyword arguments to pass to the load_data method if the data is stale.

Returns A mapping from column names to column types.

get_file_creation_date

def get_file_creation_date(self, path: str) -> datetime.date:

Inherited from:

FileSystemIterableSource.get_file_creation_date :

Get the creation date of a file with consideration for different OSs.

Caution

It is not possible to get the creation date of a file on Linux. This method will return the last modification date instead. This will impact filtering of files by date.

Arguments

  • path: The path to the file.

Returns The creation date of the file.

get_file_last_modification_date

def get_file_last_modification_date(self, path: str) -> datetime.date:

Inherited from:

FileSystemIterableSource.get_file_last_modification_date :

Get the last modification date of a file.

Arguments

  • path: The path to the file.

Returns The last modification date of the file.

get_values

def get_values(self, col_names: List[str], **kwargs: Any) -> Dict[str, Iterable[Any]]:

Inherited from:

FileSystemIterableSource.get_values :

Get distinct values from columns in the dataset.

Arguments

  • col_names: The list of the columns whose distinct values should be returned.
  • **kwargs: Additional keyword arguments to pass to the load_data method if the data is stale.

Returns The distinct values of the requested columns as a mapping from column name to a series of distinct values.

load_data

def load_data(self, **kwargs: Any) -> None:

Inherited from:

FileSystemIterableSource.load_data :

Load the data for the datasource.

Arguments

  • **kwargs: Keyword arguments to be passed to underlying functions.

Raises

  • TypeError: If the data format is not supported.

partition

def partition(self, iterable: Sequence, partition_size: int = 1) -> Iterable:

Inherited from:

FileSystemIterableSource.partition :

Takes an iterable and yields partitions of size partition_size.

The final partition may be smaller than partition_size due to the variable length of the iterable.

use_file_multiprocessing

def use_file_multiprocessing(self, file_names: List[str]) -> bool:

Inherited from:

FileSystemIterableSource.use_file_multiprocessing :

Check if file multiprocessing should be used.

Returns True if file multiprocessing has been enabled by the environment variable and the number of workers would be greater than 1, otherwise False. There is no need to use file multiprocessing if we are just going to use one worker - it would be slower than just loading the data in the main process.

Returns True if file multiprocessing should be used, otherwise False.

yield_data

def yield_data(self, file_names: Optional[List[str]] = None, use_cache: bool = True, **kwargs: Any) -> Iterator[pandas.core.frame.DataFrame]:

Inherited from:

FileSystemIterableSource.yield_data :

Yields data in batches from files that match the given file names.

Arguments

  • file_names: An optional list of file names to use for yielding data. Otherwise, all files that have already been found will be used. file_names is always provided when this method is called from the Dataset as part of a task.

Raises

  • ValueError: If no file names are provided and no files have been found.

IterableSource

class IterableSource(
    partition_size: int = 100,
    data_splitter: Optional[DatasetSplitter] = None,
    seed: Optional[int] = None,
    modifiers: Optional[Dict[str, DataPathModifiers]] = None,
    ignore_cols: Optional[Union[str, Sequence[str]]] = None,
):

Abstract base source that supports iterating over the data.

This is used for streaming data in batches as opposed to loading the entire dataset into memory.

Arguments

  • data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
  • ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • partition_size: The size of each partition when iterating over the data.
  • seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.

Attributes

  • data: A Dataframe-type object which contains the data.
  • data_splitter: Approach used for splitting the data into training, test, validation.
  • seed: Random number seed. Used for setting random seed for all libraries.

Variables

  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    If the datasource is iterable, this will raise an exception.

    Raises: IterableDataSourceError: If the datasource is set to iterable. DataNotLoadedError: If the data has not been loaded yet.

  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash.

  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns True if a task is running.
  • iterable : bool - Implement this method to define whether the data source is iterable.

    The datasource must inherit from IterableSource if this is True. However, the inverse is not necessarily True.

  • multi_table : bool - This returns False if the DataSource does not subclass MultiTableSource.

    However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.

Methods


get_column

def get_column(self, col_name: str, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

BaseSource.get_column :

Get a single column from the dataset.

Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.

get_column_names

def get_column_names(self, **kwargs: Any) -> Iterable[str]:

Inherited from:

BaseSource.get_column_names :

Get the column names as an iterable.

get_data

def get_data(self, **kwargs: Any) -> Optional[pandas.core.frame.DataFrame]:

This method must return None if the data source is iterable.

get_dtypes

def get_dtypes(self, **kwargs: Any) -> _Dtypes:

Inherited from:

BaseSource.get_dtypes :

Implement this method to get the columns and column types from the dataset.

get_values

def get_values(self, col_names: List[str], **kwargs: Any) -> Dict[str, Iterable[Any]]:

Implement this method to get distinct values from a list of columns.

load_data

def load_data(self, **kwargs: Any) -> None:

Inherited from:

BaseSource.load_data :

Load the data for the datasource.

Raises

  • TypeError: If the data format is not supported.

yield_data

def yield_data(self, **kwargs: Any) -> Iterator[pandas.core.frame.DataFrame]:

Implement this method to yield dataframes.
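
A minimal sketch of a streaming implementation, assuming a hypothetical subclass that reads a large CSV in chunks and that the configured partition size is available on the instance as self.partition_size (an assumption; check the library for the actual attribute name):

    from typing import Any, Iterator

    import pandas as pd

    from bitfount.data.datasources.base_source import IterableSource  # assumed import path

    class CSVStreamSource(IterableSource):  # hypothetical illustrative subclass
        def __init__(self, csv_path: str, **kwargs: Any):
            super().__init__(**kwargs)
            self.csv_path = csv_path

        def yield_data(self, **kwargs: Any) -> Iterator[pd.DataFrame]:
            # Stream the file partition_size rows at a time instead of
            # loading everything into memory.
            yield from pd.read_csv(self.csv_path, chunksize=self.partition_size)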

MultiProcessingMixIn

class MultiProcessingMixIn():

MixIn class for multiprocessing of _get_data.

Variables

  • static skipped_files : Set[str]

Static methods


get_num_workers

def get_num_workers(file_names: List[str]) -> int:

Gets the number of workers to use for multiprocessing.

Ensures that the number of workers is at least 1 and at most equal to MAX_NUM_MULTIPROCESSING_WORKERS. If the number of files is less than MAX_NUM_MULTIPROCESSING_WORKERS, then we use the number of files as the number of workers, unless the number of machine cores is also less than MAX_NUM_MULTIPROCESSING_WORKERS, in which case we use the lower of the two.

Arguments

  • file_names: The list of file names to load.

Returns The number of workers to use for multiprocessing.

Methods


use_file_multiprocessing

def use_file_multiprocessing(self, file_names: List[str]) -> bool:

Check if file multiprocessing should be used.

Returns True if file multiprocessing has been enabled by the environment variable and the number of workers would be greater than 1, otherwise False. There is no need to use file multiprocessing if we are just going to use one worker - it would be slower than just loading the data in the main process.

Returns True if file multiprocessing should be used, otherwise False.

MultiTableSource

class MultiTableSource(
    data_splitter: Optional[DatasetSplitter] = None,
    seed: Optional[int] = None,
    modifiers: Optional[Dict[str, DataPathModifiers]] = None,
    ignore_cols: Optional[Union[str, Sequence[str]]] = None,
):

Abstract base source that supports multiple tables.

This class is used to define a data source that supports multiple tables. Such datasources do not necessarily have multiple tables, though.

Arguments

  • data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
  • ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.

Attributes

  • data: A Dataframe-type object which contains the data.
  • data_splitter: Approach used for splitting the data into training, test, validation.
  • seed: Random number seed. Used for setting random seed for all libraries.

Variables

  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    Raises: DataNotLoadedError: If the data has not been loaded yet.

  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash.

  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns True if a task is running.
  • iterable : bool - This returns False if the DataSource does not subclass IterableSource.

    However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.

  • multi_table : bool - Implement this method to define whether the data source is multi-table.

    The datasource must inherit from MultiTableSource if this is True. However, the inverse is not necessarily True.

  • table_names : List[str] - Implement this to return a list of table names.

    If there is only one table, it should return a list with one element.
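
A minimal sketch of a multi-table subclass backed by SQLite (the subclass is hypothetical; only the overridden names come from the interface documented here):

    import sqlite3
    from typing import Any, List, Optional

    import pandas as pd

    from bitfount.data.datasources.base_source import MultiTableSource  # assumed import path

    class SQLiteSource(MultiTableSource):  # hypothetical illustrative subclass
        def __init__(self, db_path: str, **kwargs: Any):
            super().__init__(**kwargs)
            self.db_path = db_path

        @property
        def table_names(self) -> List[str]:
            # A list with one element if the database has a single table.
            with sqlite3.connect(self.db_path) as con:
                rows = con.execute(
                    "SELECT name FROM sqlite_master WHERE type='table'"
                ).fetchall()
            return [r[0] for r in rows]

        def get_data(
            self, table_name: Optional[str] = None, **kwargs: Any
        ) -> Optional[pd.DataFrame]:
            # Load the requested table, defaulting to the first one.
            table = table_name or self.table_names[0]
            with sqlite3.connect(self.db_path) as con:
                return pd.read_sql_query(f"SELECT * FROM {table}", con)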

Methods


get_column

def get_column(self, col_name: str, table_name: Optional[str] = None, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:

Implement this method to get a single column from the dataset.

get_column_names

def get_column_names(self, table_name: Optional[str] = None, **kwargs: Any) -> Iterable[str]:

Implement this method to get the column names as an iterable.

get_data

def get_data(self, table_name: Optional[str] = None, **kwargs: Any) -> Optional[pandas.core.frame.DataFrame]:

Implement this method to load and return the dataset.

get_dtypes

def get_dtypes(self, table_name: Optional[str] = None, **kwargs: Any) -> _Dtypes:

Inherited from:

BaseSource.get_dtypes :

Implement this method to get the columns and column types from the dataset.

get_values

def get_values(self, col_names: List[str], table_name: Optional[str] = None, **kwargs: Any) -> Dict[str, Iterable[Any]]:

Implement this method to get distinct values from a list of columns.

load_data

def load_data(self, **kwargs: Any) -> None:

Inherited from:

BaseSource.load_data :

Load the data for the datasource.

Raises

  • TypeError: If the data format is not supported.