base
Base interface for data persistence implementations.
Classes
BulkResult
class BulkResult( file_name_column: str, cached: Optional[pd.DataFrame], misses: list[Path], skipped: list[str] = [],):Container for the results of a bulk_get call.
Variables
- static
cached : Optional[pd.DataFrame]
- static
file_name_column : str
- static
misses : list[Path]
- static
skipped : list[str]
data: Ordered DataFrame with the cached data, excluding the file name column.
hits: Ordered Series of the file names that were cache hits, possibly including duplicates.
Methods
get_cached_by_filename
def get_cached_by_filename(self, file_name: str) ‑> Optional[pd.DataFrame]:Dataframe with cached data for a single file.
May contain multiple rows (e.g. for e2e files that contain several images).
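To make the relationship between the fields concrete, here is a minimal sketch built around a hypothetical stand-in class, `BulkResultSketch`; it is not the real `BulkResult`. It assumes `cached` holds one row per cached record with the file name stored in `file_name_column`, and the behaviour of `data` and `hits` is inferred from the descriptions above.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional

import pandas as pd


@dataclass
class BulkResultSketch:
    """Hypothetical stand-in mirroring the documented BulkResult fields."""

    file_name_column: str
    cached: Optional[pd.DataFrame]
    misses: list[Path]
    skipped: list[str] = field(default_factory=list)

    @property
    def data(self) -> Optional[pd.DataFrame]:
        # Cached rows without the file name column (assumed behaviour).
        if self.cached is None:
            return None
        return self.cached.drop(columns=[self.file_name_column])

    @property
    def hits(self) -> Optional[pd.Series]:
        # File names of the cached rows; duplicates are kept.
        if self.cached is None:
            return None
        return self.cached[self.file_name_column]

    def get_cached_by_filename(self, file_name: str) -> Optional[pd.DataFrame]:
        # All cached rows for one file, or None when there are none.
        if self.cached is None:
            return None
        rows = self.cached[self.cached[self.file_name_column] == file_name]
        return None if rows.empty else rows


result = BulkResultSketch(
    file_name_column="file",
    cached=pd.DataFrame(
        {"file": ["scan.e2e", "scan.e2e", "eye.png"], "width": [512, 512, 300]}
    ),
    misses=[Path("missing.png")],
)
scan_rows = result.get_cached_by_filename("scan.e2e")
```

In this sketch a file that produced several cached rows comes back as a multi-row frame, which matches the note above about files containing several images.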
CacheClearResult
class CacheClearResult(*args, **kwargs):Result structure for cache clearing operations.
Ancestors
- builtins.dict
Variables
- static
error : Optional[str]
- static
file_existed : bool
- static
file_path : Optional[str]
- static
success : bool
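The `builtins.dict` ancestor together with the typed keys suggests a `TypedDict`-style structure, though that is an inference from this listing. The sketch below uses a hypothetical `CacheClearResultSketch` and a hypothetical helper `clear_file` to show how such a result might be filled in; treating an already-missing file as a success is an assumption of the sketch.

```python
import os
import tempfile
from typing import Optional, TypedDict


class CacheClearResultSketch(TypedDict):
    """Hypothetical mirror of the documented CacheClearResult keys."""

    success: bool
    file_existed: bool
    file_path: Optional[str]
    error: Optional[str]


def clear_file(path: str) -> CacheClearResultSketch:
    # Hypothetical helper: delete a file and report what happened.
    existed = os.path.exists(path)
    try:
        if existed:
            os.remove(path)
    except OSError as exc:
        return {
            "success": False,
            "file_existed": existed,
            "file_path": path,
            "error": str(exc),
        }
    return {"success": True, "file_existed": existed, "file_path": path, "error": None}


# Create a throwaway file standing in for the cache storage.
fd, cache_path = tempfile.mkstemp(suffix=".cache")
os.close(fd)

first = clear_file(cache_path)   # file is present and gets removed
second = clear_file(cache_path)  # file is already gone
```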
DataPersister
class DataPersister( file_name_column: str, lock: Optional[_Lock] = None, bulk_partition_size: Optional[int] = None,):Abstract interface for data persistence/caching implementations.
Static methods
prep_data_for_caching
def prep_data_for_caching( data: pd.DataFrame, image_cols: Optional[Collection[str]] = None,) ‑> pd.DataFrame:Prepares data for caching.
This involves removing or replacing values that should not be cached or that make no sense to cache, such as image data, or file paths that are only relevant while the files are actually in use.
Does not mutate input dataframe.
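The non-mutating pattern can be sketched as follows. `prep_sketch` is a hypothetical simplification that only drops the named image columns; the real method may also replace values or handle other columns.

```python
from collections.abc import Collection
from typing import Optional

import pandas as pd


def prep_sketch(
    data: pd.DataFrame, image_cols: Optional[Collection[str]] = None
) -> pd.DataFrame:
    # Hypothetical simplification: only drops image columns that are present.
    cols_to_drop = [col for col in (image_cols or []) if col in data.columns]
    # DataFrame.drop returns a new frame, so the input is left untouched.
    return data.drop(columns=cols_to_drop)


raw = pd.DataFrame({"patient_id": [1, 2], "pixels": [b"\x00", b"\x01"]})
prepared = prep_sketch(raw, image_cols=["pixels"])
```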
Methods
bulk_get
def bulk_get(self, files: Sequence[Union[str, Path]]) ‑> BulkResult:Get the persisted data for several files.
Returns only misses if no data has been persisted, if the persisted data is out of date, or if an error was encountered.
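A rough sketch of the hit/miss split, assuming a plain dictionary as a hypothetical cache. The names `cache`, `requested`, `hit_frames`, `cached` and `misses` are illustrative only and say nothing about how a concrete implementation stores its data.

```python
from pathlib import Path

import pandas as pd

# Hypothetical cache contents keyed by file name.
cache = {
    "a.png": pd.DataFrame({"file": ["a.png"], "width": [100]}),
    "b.png": pd.DataFrame({"file": ["b.png"], "width": [200]}),
}
requested = ["a.png", "c.png", "b.png"]

# Files found in the cache contribute rows; the rest are reported as misses.
hit_frames = [cache[name] for name in requested if name in cache]
misses = [Path(name) for name in requested if name not in cache]

# With no hits there is nothing to concatenate, so only misses would remain.
cached = pd.concat(hit_frames, ignore_index=True) if hit_frames else None
```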
bulk_set
def bulk_set( self, data: pd.DataFrame, original_file_col: str = '_original_filename',) ‑> None:Sets multiple cache entries from a dataframe.
The dataframe must indicate the original file that each row is associated with. By default this is the _original_filename column.
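One way to picture the per-file association is to group the rows by the original file column. This is only a sketch of the idea using `DataFrame.groupby`; whether an implementation groups rows this way internally is an assumption.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "_original_filename": ["a.e2e", "a.e2e", "b.png"],
        "width": [512, 512, 300],
    }
)

# Split the frame into one sub-frame per original file, keeping first-seen order.
per_file = {
    file_name: rows
    for file_name, rows in data.groupby("_original_filename", sort=False)
}
```

Each value in `per_file` is the kind of single-file frame that `set` expects for one file.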
clear_cache_file
def clear_cache_file(self) ‑> CacheClearResult:Delete the cache storage completely.
Returns Dictionary with results of the cache clearing operation.
get
def get(self, file: Union[str, Path]) ‑> Optional[pd.DataFrame]:Get the persisted data for a given file.
Returns None if no data has been persisted, if the persisted data is out of date, or if an error was encountered.
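The three `None` cases can be sketched with a hypothetical in-memory cache. The staleness rule used here (the source file's modification time is newer than when it was cached) is an assumption; this page does not say how "out of date" is determined. `set_entry` and `get_entry` are illustrative names, and plain strings stand in for the cached data.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical cache: canonical path -> (mtime at caching time, data).
_cache: dict[str, tuple[float, str]] = {}


def set_entry(file: Path, data: str) -> None:
    _cache[str(file.resolve())] = (file.stat().st_mtime, data)


def get_entry(file: Path) -> Optional[str]:
    entry = _cache.get(str(file.resolve()))
    if entry is None:
        return None  # nothing persisted
    cached_mtime, data = entry
    try:
        if file.stat().st_mtime > cached_mtime:
            return None  # file changed after it was cached
    except OSError:
        return None  # any error is treated as a miss
    return data


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "scan.txt"
    source.write_text("v1")
    set_entry(source, "parsed v1")
    fresh = get_entry(source)

    # Simulate a later modification by moving the mtime forward.
    stat = source.stat()
    os.utime(source, (stat.st_atime, stat.st_mtime + 10))
    stale = get_entry(source)
    never_cached = get_entry(Path(tmp) / "other.txt")
```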
get_all_cached_file_paths
def get_all_cached_file_paths(self) ‑> list[str]:Get list of all cached file paths.
Returns List of canonical file paths (as strings) that have entries in the cache.
get_all_skipped_files
def get_all_skipped_files(self) ‑> list[str]:Get list of all skipped file paths.
Returns List of file paths that have been marked as skipped.
get_skip_reason_summary
def get_skip_reason_summary(self) ‑> pd.DataFrame:Get aggregate statistics of skip reasons.
Returns DataFrame with columns: reason_code, reason_description, file_count
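A summary with those three columns could be produced from per-file skip records roughly as follows. The `skipped` frame, its `file_path` column, and the reason codes and descriptions are all made up for illustration.

```python
import pandas as pd

# Hypothetical skip records: one row per skipped file.
skipped = pd.DataFrame(
    {
        "file_path": ["a.dcm", "b.dcm", "c.dcm"],
        "reason_code": [1, 2, 1],
        "reason_description": ["unreadable", "missing field", "unreadable"],
    }
)

# Count files per reason and expose the count as a regular column.
summary = (
    skipped.groupby(["reason_code", "reason_description"])
    .size()
    .reset_index(name="file_count")
)
```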
is_file_skipped
def is_file_skipped(self, file: Union[str, Path]) ‑> bool:Check if a file has been previously skipped.
Arguments
file: The file path to check.
Returns True if the file has been marked as skipped, False otherwise.
mark_file_skipped
def mark_file_skipped(self, file: Union[str, Path], reason: FileSkipReason) ‑> None:Mark a file as skipped with the given reason.
Arguments
file: The file path that was skipped.
reason: The reason why the file was skipped.
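The skip-tracking methods fit together roughly as in this in-memory sketch. `SkipReasonSketch` is a hypothetical stand-in for `FileSkipReason`, whose real members are not listed on this page, and `SkipTracker` is an illustrative class, not part of the interface.

```python
from enum import Enum
from pathlib import Path
from typing import Union


class SkipReasonSketch(Enum):
    """Hypothetical stand-in for FileSkipReason; the real members may differ."""

    UNREADABLE = "unreadable"
    MISSING_FIELD = "missing field"


class SkipTracker:
    """Hypothetical in-memory skip registry keyed by path string."""

    def __init__(self) -> None:
        self._skipped: dict[str, SkipReasonSketch] = {}

    def mark_file_skipped(
        self, file: Union[str, Path], reason: SkipReasonSketch
    ) -> None:
        # Normalise through Path so str and Path arguments map to the same key.
        self._skipped[str(Path(file))] = reason

    def is_file_skipped(self, file: Union[str, Path]) -> bool:
        return str(Path(file)) in self._skipped

    def get_all_skipped_files(self) -> list[str]:
        return list(self._skipped)


tracker = SkipTracker()
tracker.mark_file_skipped(Path("scans/broken.dcm"), SkipReasonSketch.UNREADABLE)
```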
set
def set(self, file: Union[str, Path], data: pd.DataFrame) ‑> None:Set the persisted data for a given file.
Any existing data for the file is overwritten.
The data should contain only the rows that relate to that file.
unset
def unset(self, file: Union[str, Path]) ‑> None:Deletes the persisted data for a given file.
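The contract of `set`, `get` and `unset` can be sketched with a dictionary-backed class. `InMemoryPersisterSketch` is hypothetical and does not subclass the real `DataPersister`; it omits locking, staleness checks and skip tracking.

```python
from pathlib import Path
from typing import Optional, Union

import pandas as pd


class InMemoryPersisterSketch:
    """Hypothetical persister; it does not subclass the real DataPersister."""

    def __init__(self) -> None:
        self._store: dict[str, pd.DataFrame] = {}

    def set(self, file: Union[str, Path], data: pd.DataFrame) -> None:
        # Overwrites any existing entry; stores a copy to avoid aliasing.
        self._store[str(Path(file))] = data.copy()

    def get(self, file: Union[str, Path]) -> Optional[pd.DataFrame]:
        return self._store.get(str(Path(file)))

    def unset(self, file: Union[str, Path]) -> None:
        # Removing a missing entry is a no-op in this sketch.
        self._store.pop(str(Path(file)), None)


persister = InMemoryPersisterSketch()
persister.set("scan.png", pd.DataFrame({"width": [100]}))
persister.set("scan.png", pd.DataFrame({"width": [200]}))  # overwrites
after_set = persister.get(Path("scan.png"))
persister.unset("scan.png")
after_unset = persister.get("scan.png")
```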