base_source
Module containing BaseSource class.
BaseSource is the abstract data source class from which all concrete data sources must inherit.
Classes
BaseSource
class BaseSource( data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None,):
Abstract Base Source from which all other data sources must inherit.
Arguments
data_splitter
: Approach used for splitting the data into training, test, validation. Defaults to None.
seed
: Random number seed. Used for setting random seed for all libraries. Defaults to None.
modifiers
: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
ignore_cols
: Column/list of columns to be ignored from the data. Defaults to None.
Attributes
data
: A Dataframe-type object which contains the data.
data_splitter
: Approach used for splitting the data into training, test, validation.
seed
: Random number seed. Used for setting random seed for all libraries.
Subclasses
- IterableSource
- MultiTableSource
- CSVSource
- DataFrameSource
- bitfount.data.datasources.empty_source._EmptySource
- DataView
Variables
-
data : pandas.core.frame.DataFrame
- A property containing the underlying dataframe if the data has been loaded.
Raises: DataNotLoadedError: If the data has not been loaded yet.
-
hash : str
- The hash associated with this BaseSource. This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.
Returns: The hexdigest of the DataFrame hash.
is_initialised : bool
- Checks if BaseSource was initialised.
is_task_running : bool
- Returns True if a task is running.
-
iterable : bool
- This returns False if the DataSource does not subclass IterableSource. However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.
-
multi_table : bool
- This returns False if the DataSource does not subclass MultiTableSource. However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.
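The hash property above is described as depending only on column names and content types, never on the rows themselves. A minimal sketch of that idea follows. It is not the library's implementation: the function name `schema_hash`, the use of SHA-256, and sorting by column name are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Mapping


def schema_hash(dtypes: Mapping[str, str]) -> str:
    """Hypothetical helper: hash column names and type names only."""
    # Sorting by column name is a choice of this sketch, so that the
    # digest does not depend on the order in which columns are listed.
    canonical = json.dumps(sorted(dtypes.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same schema (rows are never consulted), listed in a different order.
january = {"age": "int64", "name": "object"}
february = {"name": "object", "age": "int64"}
# A changed column type gives an incompatible schema.
changed = {"age": "float64", "name": "object"}

print(schema_hash(january) == schema_hash(february))
print(schema_hash(january) == schema_hash(changed))
```

Because only the schema is hashed, appending rows to a dataset leaves the digest unchanged, which matches the consistency property described for `hash`.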
Methods
get_column
def get_column( self, col_name: str, **kwargs: Any,) ‑> Union[numpy.ndarray, pandas.core.series.Series]:
Get a single column from the dataset.
Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.
get_column_names
def get_column_names(self, **kwargs: Any) ‑> Iterable[str]:
Get the column names as an iterable.
get_data
def get_data(self, **kwargs: Any) ‑> Optional[pandas.core.frame.DataFrame]:
Implement this method to load and return the dataset.
get_dtypes
def get_dtypes(self, **kwargs: Any) ‑> _Dtypes:
Implement this method to get the columns and column types from the dataset.
get_values
def get_values(self, col_names: List[str], **kwargs: Any) ‑> Dict[str, Iterable[Any]]:
Get the distinct values from a list of columns.
load_data
def load_data(self, **kwargs: Any) ‑> None:
Load the data for the datasource.
Raises
TypeError
: If data format is not supported.
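To make the contract of the methods above concrete, here is a rough, self-contained sketch. It does not import or subclass the real BaseSource: `SketchSource`, `InMemorySource` and the `Table` alias are hypothetical stand-ins, a plain dict of lists replaces the DataFrame, `RuntimeError` stands in for DataNotLoadedError, and the assumption that `load_data` delegates to `get_data` is not confirmed by this page.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List, Optional

# Stand-in for illustration only: a dict of lists replaces the DataFrame.
Table = Dict[str, List[Any]]


class SketchSource(ABC):
    """Hypothetical stand-in mirroring the documented method names."""

    def __init__(self) -> None:
        self._data: Optional[Table] = None

    @abstractmethod
    def get_data(self, **kwargs: Any) -> Optional[Table]:
        """Load and return the dataset."""

    def load_data(self, **kwargs: Any) -> None:
        # Assumption: loading delegates to get_data and rejects other formats.
        data = self.get_data(**kwargs)
        if data is not None and not isinstance(data, dict):
            raise TypeError("unsupported data format")
        self._data = data

    @property
    def data(self) -> Table:
        if self._data is None:
            # RuntimeError stands in for the library's DataNotLoadedError.
            raise RuntimeError("data has not been loaded yet")
        return self._data

    def get_column_names(self, **kwargs: Any) -> Iterable[str]:
        return list(self.data)

    def get_column(self, col_name: str, **kwargs: Any) -> List[Any]:
        return self.data[col_name]

    def get_values(
        self, col_names: List[str], **kwargs: Any
    ) -> Dict[str, Iterable[Any]]:
        # Distinct values per column, kept in first-seen order.
        return {name: list(dict.fromkeys(self.data[name])) for name in col_names}


class InMemorySource(SketchSource):
    """Concrete source that only has to implement get_data."""

    def __init__(self, rows: Table) -> None:
        super().__init__()
        self._rows = rows

    def get_data(self, **kwargs: Any) -> Optional[Table]:
        return self._rows


source = InMemorySource({"colour": ["red", "blue", "red"], "size": [1, 2, 2]})
source.load_data()
print(list(source.get_column_names()))
print(source.get_values(["colour", "size"]))
```

The point of the sketch is the division of labour: a concrete source supplies the loading logic, and the column-level accessors can then be expressed in terms of the loaded data.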
FileSystemIterableSource
class FileSystemIterableSource( path: Union[os.PathLike, str], output_path: Optional[Union[os.PathLike, str]] = None, iterable: bool = False, fast_load: bool = False, file_extension: Optional[_SingleOrMulti[str]] = None, strict: bool = False, cache_images: bool = True, file_creation_cutoff_date: Optional[Union[Date, DateTD]] = None, file_modification_cutoff_date: Optional[Union[Date, DateTD]] = None, min_file_size: Optional[float] = None, max_file_size: Optional[float] = None, partition_size: int = 100, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None,):
Abstract base source that supports iterating over file-based data.
This is used for iterable data sources whose data is stored as files on disk.
Arguments
cache_images
: Whether to cache images in the file system. Defaults to True. This is ignored if fast_load is True.
data_splitter
: Approach used for splitting the data into training, test, validation. Defaults to None.
fast_load
: Whether the data will be loaded in fast mode. This is used to determine whether the data will be iterated over during set up for schema generation and splitting (where necessary). Only relevant if iterable is True, otherwise it is ignored. Defaults to False.
file_creation_cutoff_date
: The oldest possible date to consider for file creation. If None, all files will be considered. Defaults to None.
file_extension
: File extension(s) of the data files. If None, all files will be searched. Can either be a single file extension or a list of file extensions. Case-insensitive. Defaults to None.
file_modification_cutoff_date
: The oldest possible date to consider for file modification. If None, all files will be considered. Defaults to None.
ignore_cols
: Column/list of columns to be ignored from the data. Defaults to None.
iterable
: Whether the data source is iterable. This is used to determine whether the data source can be used in a streaming context during a task. Defaults to False.
max_file_size
: The maximum file size in megabytes to consider. If None, all files will be considered. Defaults to None.
min_file_size
: The minimum file size in megabytes to consider. If None, all files will be considered. Defaults to None.
modifiers
: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
output_path
: The path where to save intermediary output files. Defaults to 'preprocessed/'.
partition_size
: The size of each partition when iterating over the data.
path
: Path to the directory which contains the data files. Subdirectories will be searched recursively.
seed
: Random number seed. Used for setting random seed for all libraries. Defaults to None.
strict
: Whether file loading should be strictly done on files with the explicit file extension provided. If set to True will only load those files in the dataset. Otherwise, it will scan the given path for files of the same type as the provided file extension. Only relevant if file_extension is provided. Defaults to False.
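Several of the arguments above act as filters on which files are considered: the recursive search under path, the case-insensitive file_extension, the size bounds in megabytes, and the modification cutoff date. The sketch below shows one way such filtering could look using only the standard library. It is not the library's code: `select_files` and its parameter names are hypothetical, one megabyte is assumed to be 2^20 bytes, and the creation-date cutoff and strict mode are left out.

```python
import tempfile
from datetime import date, datetime
from pathlib import Path
from typing import List, Optional, Sequence

BYTES_PER_MB = 1024 * 1024  # assumption: the page does not define "megabyte"


def select_files(
    root: Path,
    extensions: Optional[Sequence[str]] = None,
    modified_since: Optional[date] = None,
    min_mb: Optional[float] = None,
    max_mb: Optional[float] = None,
) -> List[str]:
    """Hypothetical helper: recursively list files that pass every filter."""
    wanted = None if extensions is None else {e.lower() for e in extensions}
    selected = []
    for path in sorted(root.rglob("*")):  # rglob walks subdirectories too
        if not path.is_file():
            continue
        # Extension match is case-insensitive; None means "any extension".
        if wanted is not None and path.suffix.lower() not in wanted:
            continue
        stat = path.stat()
        size_mb = stat.st_size / BYTES_PER_MB
        if min_mb is not None and size_mb < min_mb:
            continue
        if max_mb is not None and size_mb > max_mb:
            continue
        # Files last modified before the cutoff date are skipped.
        if modified_since is not None:
            if datetime.fromtimestamp(stat.st_mtime).date() < modified_since:
                continue
        selected.append(str(path))
    return selected


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.CSV").write_text("x\n1\n")
    (root / "nested" / "b.csv").write_text("x\n2\n")
    (root / "notes.txt").write_text("ignore me")
    names = [Path(p).name for p in select_files(root, extensions=[".csv"])]
    too_small = select_files(root, min_mb=1.0)

print(names)
print(too_small)
```

Passing None for a filter disables it, mirroring the "If None, all files will be considered" wording used for the documented arguments.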
Attributes
data
: A Dataframe-type object which contains the data.data_splitter
: Approach used for splitting the data into training, test, validation.seed
: Random number seed. Used for setting random seed for all libraries.
Ancestors
Subclasses
Variables
- static
data_cache : Optional[DataPersister]
- static
skipped_files : Set[str]
-
data : pandas.core.frame.DataFrame
- A property containing the underlying dataframe if the data has been loaded. If the datasource is iterable, this will raise an exception.
Raises: IterableDataSourceError: If the datasource is set to iterable. DataNotLoadedError: If the data has not been loaded yet.
file_names
- Returns a cached list of file names in the directory.
-
hash : str
- The hash associated with this BaseSource. This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.
Returns: The hexdigest of the DataFrame hash.
is_initialised : bool
- Checks if BaseSource was initialised.
is_task_running : bool
- Returns True if a task is running.
-
iterable : bool
- Defines whether the data source is iterable. This is defined by the user when instantiating the class.
-
multi_table : bool
- This returns False if the DataSource does not subclass MultiTableSource. However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.
-
path : pathlib.Path
- Resolved absolute path to data. Provides a consistent version of the path provided by the user which should work throughout regardless of operating system and of directory structure.
-
selected_file_names : List[str]
- Returns a list of selected file names. Selected file names are affected by the selected_file_names_override and new_file_names_only attributes.
-
stale : bool
- Whether the data source is stale. This is defined by whether the data is loaded and the number of files matches the number of rows in the dataframe.
Static methods
get_num_workers
def get_num_workers(file_names: List[str]) ‑> int:
Inherited from:
MultiProcessingMixIn.get_num_workers :
Gets the number of workers to use for multiprocessing.
Ensures that the number of workers is at least 1 and at most equal to MAX_NUM_MULTIPROCESSING_WORKERS. If the number of files is less than MAX_NUM_MULTIPROCESSING_WORKERS, the number of files is used as the number of workers, unless the number of machine cores is also less than MAX_NUM_MULTIPROCESSING_WORKERS, in which case the lower of the two is used.
Arguments
file_names
: The list of file names to load.
Returns: The number of workers to use for multiprocessing.
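The clamping rule described for get_num_workers can be sketched as below. This is an approximation under stated assumptions, not the library's code: the value 5 for MAX_NUM_MULTIPROCESSING_WORKERS is made up for the example, `sketch_num_workers` and its `cpu_count` parameter are hypothetical, and taking the minimum of all three quantities also covers the case of many files on a machine with few cores, which the description does not spell out.

```python
import os
from typing import List, Optional

# Assumed value for illustration; the real constant is not given on this page.
MAX_NUM_MULTIPROCESSING_WORKERS = 5


def sketch_num_workers(
    file_names: List[str], cpu_count: Optional[int] = None
) -> int:
    """Hypothetical helper: clamp the worker count to [1, MAX]."""
    # cpu_count can be injected to make the example deterministic.
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    # Never more workers than the cap, the files, or the cores.
    upper = min(MAX_NUM_MULTIPROCESSING_WORKERS, len(file_names), cores)
    # Never fewer than one worker, even for an empty file list.
    return max(1, upper)


print(sketch_num_workers([], cpu_count=8))
print(sketch_num_workers(["a", "b", "c"], cpu_count=8))
print(sketch_num_workers(["f"] * 20, cpu_count=8))
print(sketch_num_workers(["f"] * 20, cpu_count=2))
```

With the assumed cap of 5, the four calls cover an empty list, fewer files than the cap, more files than the cap, and fewer cores than the cap.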
Methods
clear_file_names_cache
def clear_file_names_cache(self) ‑> None:
Clears the list of selected file names.
This allows the datasource to pick up any new files that have been added to the directory since the last time it was cached.
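The effect of clearing the cache can be illustrated with a small stand-in class. `CachedDirectoryListing` is hypothetical and only mimics the documented behaviour of a cached file_names listing that is rebuilt after clear_file_names_cache is called; how the real class stores its cache is not shown on this page.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


class CachedDirectoryListing:
    """Hypothetical stand-in: a cached listing that can be invalidated."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self._file_names: Optional[List[str]] = None

    @property
    def file_names(self) -> List[str]:
        # Scan the directory only when there is no cached listing.
        if self._file_names is None:
            self._file_names = sorted(
                str(p) for p in self.path.rglob("*") if p.is_file()
            )
        return self._file_names

    def clear_file_names_cache(self) -> None:
        # Dropping the cache forces a fresh scan on the next access.
        self._file_names = None


with tempfile.TemporaryDirectory() as tmp:
    listing = CachedDirectoryListing(Path(tmp))
    (Path(tmp) / "first.csv").write_text("x\n")
    before = len(listing.file_names)   # first scan, result is cached
    (Path(tmp) / "second.csv").write_text("x\n")
    cached = len(listing.file_names)   # still the cached listing
    listing.clear_file_names_cache()
    after = len(listing.file_names)    # rescanned, sees the new file

print(before, cached, after)
```

The middle reading shows why the method exists: until the cache is cleared, a file added to the directory stays invisible to the listing.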
get_column
def get_column( self, col_name: str, **kwargs: Any,) ‑> Union[numpy.ndarray, pandas.core.series.Series]:
Loads and returns single column from the dataset.
Arguments
col_name
: The name of the column which should be loaded.
**kwargs
: Additional keyword arguments to pass to the load_data method if the data is stale.
Returns: The requested column as a series.
get_column_names
def get_column_names(self, **kwargs: Any) ‑> Iterable[str]:
Inherited from:
IterableSource.get_column_names :
Get the column names as an iterable.
get_data
def get_data( self, file_names: Optional[List[str]] = None, use_cache: bool =