base

Base interface for data persistence implementations.

Classes

BulkResult

class BulkResult(file_name_column: str, cached: Optional[pd.DataFrame], misses: list[Path], skipped: list[str] = []):

Container for the results of a bulk_get call.

Variables

  • static cached : Optional[pd.DataFrame]
  • static file_name_column : str
  • static misses : list[Path]
  • static skipped : list[str]
  • data - Ordered DataFrame with cached data, excluding the file names.
  • hits - Ordered Series of file name hits, possibly including duplicates.

Methods


get_cached_by_filename

def get_cached_by_filename(self, file_name: str) -> Optional[pd.DataFrame]:

Returns a DataFrame with the cached data for a single file.

May contain multiple rows (e.g. for e2e files that contain several images).
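To illustrate why one file can yield several rows, the sketch below filters a cached DataFrame on its file name column. The column name `_original_filename`, the sample values, and the helper `rows_for_file` are assumptions made for this example. Returning None when nothing matches is also an assumption, not confirmed library behaviour.

```python
from typing import Optional

import pandas as pd

# Hypothetical cached table: one e2e file contributed two images (two rows).
cached = pd.DataFrame(
    {
        "_original_filename": ["scan_a.e2e", "scan_a.e2e", "scan_b.dcm"],
        "image_index": [0, 1, 0],
    }
)


def rows_for_file(
    df: Optional[pd.DataFrame], file_name_column: str, file_name: str
) -> Optional[pd.DataFrame]:
    """Return every cached row for file_name, or None if there is none."""
    if df is None:
        return None
    matches = df[df[file_name_column] == file_name]
    return matches if not matches.empty else None


multi = rows_for_file(cached, "_original_filename", "scan_a.e2e")
missing = rows_for_file(cached, "_original_filename", "unknown.dcm")
```

Here `multi` holds both rows for `scan_a.e2e`, while `missing` is None because no row matches.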

CacheClearResult

class CacheClearResult(*args, **kwargs):

Result structure for cache clearing operations.

Ancestors

  • builtins.dict

Variables

  • static error : Optional[str]
  • static file_existed : bool
  • static file_path : Optional[str]
  • static success : bool
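The reference only states that CacheClearResult derives from dict and carries these four fields. A minimal sketch, assuming a TypedDict-like shape, of how a caller might interpret such a result follows. The names `ClearResultSketch` and `describe` and the sample path are hypothetical.

```python
from typing import Optional, TypedDict


class ClearResultSketch(TypedDict):
    """Stand-in with the same four keys as CacheClearResult."""

    success: bool
    file_existed: bool
    file_path: Optional[str]
    error: Optional[str]


def describe(result: ClearResultSketch) -> str:
    """Turn a clear result into a short human-readable message."""
    if not result["success"]:
        return f"failed: {result['error']}"
    if result["file_existed"]:
        return f"removed {result['file_path']}"
    return "nothing to remove"


ok = ClearResultSketch(
    success=True, file_existed=True, file_path="/tmp/cache.db", error=None
)
failed = ClearResultSketch(
    success=False, file_existed=True, file_path="/tmp/cache.db", error="locked"
)
```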

DataPersister

class DataPersister(file_name_column: str, lock: Optional[_Lock] = None, bulk_partition_size: Optional[int] = None):

Abstract interface for data persistence/caching implementations.

Ancestors

Static methods


prep_data_for_caching

def prep_data_for_caching(data: pd.DataFrame, image_cols: Optional[Collection[str]] = None) -> pd.DataFrame:

Prepares data for caching.

This involves removing or replacing content that should not be cached or is pointless to cache, such as image data or file paths that are only relevant while the files are actually in use.

Does not mutate the input DataFrame.
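The non-mutating contract can be illustrated with a small sketch. It assumes that preparation simply drops the listed image columns. The real method may instead replace values or handle further columns, and `prep_sketch` and the sample column names are hypothetical.

```python
from typing import Collection, Optional

import pandas as pd


def prep_sketch(
    data: pd.DataFrame, image_cols: Optional[Collection[str]] = None
) -> pd.DataFrame:
    """Return a new frame without the image columns (assumed behaviour)."""
    to_drop = [col for col in (image_cols or []) if col in data.columns]
    # DataFrame.drop returns a new frame by default, so the input is untouched.
    return data.drop(columns=to_drop)


raw = pd.DataFrame({"pixels": [b"\x00\x01"], "patient_age": [54]})
prepared = prep_sketch(raw, image_cols=["pixels"])
```

After the call, `prepared` has only the `patient_age` column, and `raw` still has both columns.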

Methods


bulk_get

def bulk_get(self, files: Sequence[Union[str, Path]]) -> BulkResult:

Get the persisted data for several files.

Returns only misses if no data has been persisted, if the persisted data is out of date, or if an error was encountered.
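The split into cached data and misses might look roughly like the sketch below. It uses a plain dictionary as a hypothetical store and returns a tuple rather than a BulkResult. Staleness checks and error handling are left out, and `store` and `bulk_get_sketch` are invented for the example.

```python
from pathlib import Path
from typing import Optional, Sequence, Union

import pandas as pd

# Hypothetical in-memory cache keyed by file path string.
store = {
    "a.dcm": pd.DataFrame({"_original_filename": ["a.dcm"], "value": [1]}),
}


def bulk_get_sketch(
    files: Sequence[Union[str, Path]],
) -> tuple[Optional[pd.DataFrame], list[Path]]:
    """Return (cached rows or None, files with no cache entry)."""
    hits = [store[str(f)] for f in files if str(f) in store]
    misses = [Path(f) for f in files if str(f) not in store]
    cached = pd.concat(hits, ignore_index=True) if hits else None
    return cached, misses


cached, misses = bulk_get_sketch(["a.dcm", Path("b.dcm")])
nothing, all_missed = bulk_get_sketch(["c.dcm"])
```

In the second call nothing is cached, so the cached part is None and every requested file is reported as a miss.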

bulk_set

def bulk_set(self, data: pd.DataFrame, original_file_col: str = '_original_filename') -> None:

Sets multiple cache entries from a single DataFrame.

The DataFrame must indicate the original file that each row is associated with. By default this is read from the _original_filename column.
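One way to picture the role of the file column is to group the rows by it and store one entry per file. The sketch below makes that assumption. The dictionary `store`, the function `bulk_set_sketch`, and the ValueError for a missing column are illustrative choices, not the library's documented behaviour.

```python
import pandas as pd

# Hypothetical in-memory cache: one DataFrame per original file.
store: dict[str, pd.DataFrame] = {}


def bulk_set_sketch(
    data: pd.DataFrame, original_file_col: str = "_original_filename"
) -> None:
    """Store one entry per distinct value of original_file_col."""
    if original_file_col not in data.columns:
        raise ValueError(f"missing column: {original_file_col}")
    # Rows that belong to the same file stay together in a single entry.
    for file_name, rows in data.groupby(original_file_col, sort=False):
        store[str(file_name)] = rows.reset_index(drop=True)


bulk_set_sketch(
    pd.DataFrame(
        {
            "_original_filename": ["a.e2e", "a.e2e", "b.dcm"],
            "image_index": [0, 1, 0],
        }
    )
)
```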

clear_cache_file

def clear_cache_file(self) -> CacheClearResult:

Delete the cache storage completely.

Returns a dictionary with the results of the cache clearing operation.

get

def get(self, file: Union[str, Path]) -> Optional[pd.DataFrame]:

Get the persisted data for a given file.

Returns None if no data has been persisted, if the persisted data is out of date, or if an error was encountered.

get_all_cached_file_paths

def get_all_cached_file_paths(self) -> list[str]:

Get list of all cached file paths.

Returns a list of canonical file paths (as strings) that have entries in the cache.

get_all_skipped_files

def get_all_skipped_files(self) -> list[str]:

Get list of all skipped file paths.

Returns a list of file paths that have been marked as skipped.

get_skip_reason_summary

def get_skip_reason_summary(self) -> pd.DataFrame:

Get aggregate statistics of skip reasons.

Returns a DataFrame with the columns reason_code, reason_description, and file_count.
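A summary with these three columns could be derived from per-file skip records along the following lines. The `skipped` table, its `file_path` column, and the reason codes and descriptions are invented for the sketch. How the library actually stores skip records is not stated here.

```python
import pandas as pd

# Hypothetical skip records; codes and descriptions are invented.
skipped = pd.DataFrame(
    {
        "file_path": ["a.dcm", "b.dcm", "c.dcm"],
        "reason_code": [1, 1, 2],
        "reason_description": ["unreadable", "unreadable", "no pixel data"],
    }
)

# Count the files per reason and expose the count as file_count.
summary = (
    skipped.groupby(["reason_code", "reason_description"])
    .size()
    .reset_index(name="file_count")
)
```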

is_file_skipped

def is_file_skipped(self, file: Union[str, Path]) -> bool:

Check if a file has been previously skipped.

Arguments

  • file: The file path to check.

Returns True if the file has been marked as skipped, False otherwise.

mark_file_skipped

def mark_file_skipped(self, file: Union[str, Path], reason: FileSkipReason) -> None:

Mark a file as skipped with the given reason.

Arguments

  • file: The file path that was skipped.
  • reason: The reason why the file was skipped.
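The skip-tracking methods can be pictured as a small registry that maps file paths to reasons. The sketch below is a toy in-memory version. `SkipReasonSketch` is a hypothetical stand-in for FileSkipReason, whose real members are not listed in this reference, and `SkipRegistry` is likewise invented.

```python
from enum import Enum
from pathlib import Path
from typing import Union


class SkipReasonSketch(Enum):
    """Hypothetical stand-in for FileSkipReason; members are invented."""

    UNREADABLE = "unreadable"
    NO_PIXEL_DATA = "no pixel data"


class SkipRegistry:
    """Toy registry mirroring the skip-related method names."""

    def __init__(self) -> None:
        self._skipped: dict[str, SkipReasonSketch] = {}

    def mark_file_skipped(
        self, file: Union[str, Path], reason: SkipReasonSketch
    ) -> None:
        # Paths are normalised to strings so str and Path keys agree.
        self._skipped[str(file)] = reason

    def is_file_skipped(self, file: Union[str, Path]) -> bool:
        return str(file) in self._skipped

    def get_all_skipped_files(self) -> list[str]:
        return list(self._skipped)


registry = SkipRegistry()
registry.mark_file_skipped(Path("a.dcm"), SkipReasonSketch.UNREADABLE)
```

A file marked with a Path is then found when queried with the equivalent string, and unmarked files report False.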

set

def set(self, file: Union[str, Path], data: pd.DataFrame) -> None:

Set the persisted data for a given file.

If existing data is already set, it will be overwritten.

The data should only be the data that is related to that file.

unset

def unset(self, file: Union[str, Path]) -> None:

Deletes the persisted data for a given file.
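The get, set, and unset semantics described in this section can be summarised with a toy in-memory persister. This is a sketch under simplifying assumptions: it does not subclass DataPersister, it ignores locking and staleness checks, and the class name `InMemoryPersisterSketch` is hypothetical.

```python
from pathlib import Path
from typing import Optional, Union

import pandas as pd


class InMemoryPersisterSketch:
    """Toy persister: set overwrites, get returns None on a miss."""

    def __init__(self) -> None:
        self._store: dict[str, pd.DataFrame] = {}

    def set(self, file: Union[str, Path], data: pd.DataFrame) -> None:
        # Overwrites any existing entry; a copy keeps later edits out.
        self._store[str(file)] = data.copy()

    def get(self, file: Union[str, Path]) -> Optional[pd.DataFrame]:
        return self._store.get(str(file))

    def unset(self, file: Union[str, Path]) -> None:
        # Removing a file that was never cached is treated as a no-op.
        self._store.pop(str(file), None)


persister = InMemoryPersisterSketch()
persister.set("a.dcm", pd.DataFrame({"value": [1]}))
persister.set("a.dcm", pd.DataFrame({"value": [2]}))  # replaces the first entry
after_overwrite = persister.get("a.dcm")
persister.unset("a.dcm")
after_unset = persister.get("a.dcm")
```

The second `set` call replaces the first entry, and once `unset` has run, `get` reports a miss by returning None.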