sqlite
A data persistence implementation backed by an SQLite database.
Classes
CacheInfoTableBase
class CacheInfoTableBase():Cache information entry ORM.
Represents the table in the database that corresponds to cache validity
information. In particular, it stores the table's primary key, file
(the canonical path of the file in question), and the time the cache
was last updated for that file.
This is a mix-in designed to be used with the EntityName pattern: https://github.com/sqlalchemy/sqlalchemy/wiki/EntityName
Variables
- static
cache_updated_at : sqlalchemy.orm.base.Mapped[datetime.datetime]
- static
data
- static
file : sqlalchemy.orm.base.Mapped[str]
DataTableBase
class DataTableBase():Cached data entry ORM.
The specific structure of this table depends on the data being stored in
it (which is why deferred reflection is used); the table is initialised at the
first set() call, and its schema is determined at that point.
Some things are consistent though; the data must have:
- an integer primary key column (data_cache_id)
- a text column called _source_canonical_path, which stores a canonical filepath and has a foreign key constraint on the cache info table.
This is a mix-in designed to be used with the EntityName pattern: https://github.com/sqlalchemy/sqlalchemy/wiki/EntityName
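The dynamic-schema idea can be illustrated with a minimal sketch using the stdlib sqlite3 module (the real class uses SQLAlchemy with deferred reflection, so the helper name and type mapping here are hypothetical):

```python
import sqlite3

def init_data_table(conn, first_row):
    """Create a cached-data table whose columns are derived from the first row.

    Hypothetical sketch: column names come from the first set() payload, plus
    the two mandatory metadata columns described above.
    """
    type_map = {int: "INTEGER", float: "REAL"}
    cols = ", ".join(
        f'"{name}" {type_map.get(type(value), "TEXT")}'
        for name, value in first_row.items()
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cached_data ("
        "data_cache_id INTEGER PRIMARY KEY, "
        "_source_canonical_path TEXT, "
        f"{cols})"
    )

conn = sqlite3.connect(":memory:")
init_data_table(conn, {"patient_id": "p1", "age": 42})
cols = [row[1] for row in conn.execute("PRAGMA table_info(cached_data)")]
```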
SQLiteDataPersister
class SQLiteDataPersister(sqlite_path: Path, *args: Any, **kwargs: Any):A data caching implementation that uses an SQLite database.
This implementation maintains three related tables in the SQLite database:
1. cache_info table:
- Tracks metadata for successfully cached files
- Schema: file (TEXT PRIMARY KEY), cache_updated_at (DATETIME)
- Purpose: Determines cache validity by comparing file modification times
2. cached_data table:
- Stores the actual processed data from successful files
- Schema: Dynamically determined from first cached file DataFrame + metadata columns
- Always includes: data_cache_id (INT PRIMARY KEY), _source_canonical_path (TEXT)
- Purpose: Fast retrieval of processed data without re-parsing files
- Relationship: Foreign key to cache_info.file with CASCADE DELETE
- Note: If images are present in the datasource, they will not be cached and the file will have to be processed again to obtain them.
3. skipped_files table:
- Tracks files that were skipped during processing
- Schema: file_path (TEXT PRIMARY KEY), reason_code (INT), skip_time (DATETIME)
- Purpose: Avoid reprocessing files that will inevitably fail
- Reason codes map to specific failure types in FileSkipReason enum
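The three-table layout, including the CASCADE DELETE relationship, can be sketched as raw DDL (the real schema is managed by SQLAlchemy, so exact types may differ; note that SQLite only enforces the cascade when foreign keys are enabled on the connection):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for CASCADE in SQLite
conn.executescript("""
CREATE TABLE cache_info (
    file TEXT PRIMARY KEY,
    cache_updated_at DATETIME
);
CREATE TABLE cached_data (
    data_cache_id INTEGER PRIMARY KEY,
    _source_canonical_path TEXT
        REFERENCES cache_info(file) ON DELETE CASCADE
);
CREATE TABLE skipped_files (
    file_path TEXT PRIMARY KEY,
    reason_code INT,
    skip_time DATETIME
);
""")

# Deleting the cache_info row removes the dependent cached_data rows too.
conn.execute("INSERT INTO cache_info VALUES ('/data/a.csv', '2024-01-01')")
conn.execute(
    "INSERT INTO cached_data (_source_canonical_path) VALUES ('/data/a.csv')"
)
conn.execute("DELETE FROM cache_info WHERE file = '/data/a.csv'")
remaining = conn.execute("SELECT COUNT(*) FROM cached_data").fetchone()[0]
```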
Database Lifecycle:
- cache_info + cached_data tables: Created on first successful file processing
- skipped_files table: Created immediately on SQLiteDataPersister initialization
- All tables support concurrent access via SQLAlchemy sessions and optional locking
Performance Benefits:
- Data cache: Eliminates re-parsing of files when only tabular data is needed
- Skip tracking: Eliminates re-parsing of incompatible files
Skip Tracking Methods:
is_file_skipped(): Check if a file was previously skipped
mark_file_skipped(): Mark a file as skipped with a specific reason
get_all_skipped_files(): Get a detailed report of all skipped files
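A minimal stand-alone version of the skip-tracking pair, using sqlite3 directly and a plain integer where the real API takes a FileSkipReason enum member:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE skipped_files ("
    "file_path TEXT PRIMARY KEY, reason_code INT, skip_time DATETIME)"
)

def mark_file_skipped(path, reason_code):
    # INSERT OR REPLACE keeps only the latest skip record per path.
    conn.execute(
        "INSERT OR REPLACE INTO skipped_files VALUES (?, ?, ?)",
        (path, reason_code, datetime.now(timezone.utc).isoformat()),
    )

def is_file_skipped(path):
    row = conn.execute(
        "SELECT 1 FROM skipped_files WHERE file_path = ?", (path,)
    ).fetchone()
    return row is not None

mark_file_skipped("/data/bad.dcm", 3)
skipped = is_file_skipped("/data/bad.dcm")
not_skipped = is_file_skipped("/data/good.dcm")
```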
Arguments
sqlite_path: Path to the SQLite database file
*args, **kwargs: Additional arguments passed to DataPersister
Ancestors
- DataPersister
Variables
db_prepped : bool - Whether the database has been fully initialised.
Static methods
prep_data_for_caching
def prep_data_for_caching( data: pd.DataFrame, image_cols: Optional[Collection[str]] = None,) ‑> pd.DataFrame:Inherited from:
DataPersister.prep_data_for_caching :
Prepares data ready for caching.
This involves removing/replacing things that aren't supposed to be cached or that it makes no sense to cache, such as image data or file paths that won't be relevant except for when the files are actually being used.
Does not mutate input dataframe.
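The same idea, sketched with plain dicts rather than a DataFrame (the function name and shape here are illustrative, not the real implementation): columns that should not be cached, such as image payloads, are dropped from copies of the rows, leaving the input untouched.

```python
def prep_rows_for_caching(rows, image_cols=()):
    """Return copies of rows with image columns removed; does not mutate input."""
    drop = set(image_cols)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]

rows = [{"file": "/data/a.dcm", "pixels": b"\x00\x01", "age": 42}]
cleaned = prep_rows_for_caching(rows, image_cols=["pixels"])
```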
Methods
bulk_get
def bulk_get(self, files: Sequence[Union[str, Path]]) ‑> BulkResult:Inherited from:
Get the persisted data for several files.
Returns only misses if no data has been persisted, if it is out of date, or an error was otherwise encountered.
bulk_set
def bulk_set( self, data: pd.DataFrame, original_file_col: str = '_original_filename',) ‑> None:Inherited from:
Bulk set a bunch of cache entries from a dataframe.
The dataframe must indicate the original file that each row is associated
with. This is the _original_filename column by default.
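Internally this means partitioning the rows by their source file before writing each group's cache entry. A plain-dict sketch of that grouping step (the real method operates on a pandas DataFrame):

```python
from collections import defaultdict

def group_rows_by_file(rows, original_file_col="_original_filename"):
    """Group rows by the file they came from, one cache entry per group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[original_file_col]].append(row)
    return dict(grouped)

rows = [
    {"_original_filename": "/data/a.csv", "value": 1},
    {"_original_filename": "/data/b.csv", "value": 2},
    {"_original_filename": "/data/a.csv", "value": 3},
]
grouped = group_rows_by_file(rows)
```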
clear_cache_file
def clear_cache_file(self) ‑> CacheClearResult:Inherited from:
DataPersister.clear_cache_file :
Delete the cache storage completely.
Returns Dictionary with results of the cache clearing operation.
get
def get(self, file: Union[str, Path]) ‑> Optional[pd.DataFrame]:Inherited from:
Get the persisted data for a given file.
Returns None if no data has been persisted, if it is out of date, or an error was otherwise encountered.
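The out-of-date check works by comparing the file's modification time against the stored cache_updated_at timestamp. A self-contained sketch of that comparison (the helper name is illustrative):

```python
import os
import tempfile
from datetime import datetime, timezone

def cache_is_stale(path, cache_updated_at):
    """True if the file was modified after the cache entry was written."""
    mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return mtime > cache_updated_at

# A file written just before the cache timestamp is considered fresh.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"data")
    path = f.name

fresh = cache_is_stale(path, datetime.now(timezone.utc))
os.unlink(path)
```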
get_all_cached_file_paths
def get_all_cached_file_paths(self) ‑> list[str]:Inherited from:
DataPersister.get_all_cached_file_paths :
Get list of all cached file paths.
Returns List of canonical file paths (as strings) that have entries in the cache.
get_all_cached_files
def get_all_cached_files(self) ‑> list[str]:Get all file paths currently stored in the cache.
Returns A list of canonical file paths (as strings) that have entries in the cache. Returns an empty list if the database hasn't been initialised yet.
get_all_skipped_files
def get_all_skipped_files(self) ‑> list[str]:Inherited from:
DataPersister.get_all_skipped_files :
Get list of all skipped file paths.
Returns List of file paths that have been marked as skipped.
get_cached_distinct_values
def get_cached_distinct_values( self, columns: Sequence[str], file_paths: Optional[Sequence[Union[str, Path]]] = None,) ‑> dict[str, list[Any]]:Inherited from:
DataPersister.get_cached_distinct_values :
Get distinct values for columns from cache, optionally scoped to files.
get_cached_dtype_sample
def get_cached_dtype_sample( self, file_paths: Optional[Sequence[Union[str, Path]]] = None, limit: int = 100,) ‑> pd.DataFrame:Inherited from:
DataPersister.get_cached_dtype_sample :
Get a bounded cache sample for dtype reconciliation.
get_cached_row_count
def get_cached_row_count( self, file_paths: Optional[Sequence[Union[str, Path]]] = None,) ‑> int:Inherited from:
DataPersister.get_cached_row_count :
Get row count from cached data, optionally scoped to selected files.
get_cached_table_columns
def get_cached_table_columns(self) ‑> list[str]:Inherited from:
DataPersister.get_cached_table_columns :
Get all column names currently present in cached data storage.
Returns an empty list if cache is not initialised or an error occurs.
get_column_for_id
def get_column_for_id( self, id_value: str, id_column: str, target_column: str,) ‑> list[Any]:Inherited from:
DataPersister.get_column_for_id :
Get all values of a target column for rows matching a given ID.
Queries the cached data for all entries where id_column equals
id_value and returns the corresponding values from
target_column.
Arguments
id_value: The ID value to match against.
id_column: The name of the column containing IDs to filter on.
target_column: The name of the column whose values should be returned.
Returns
A list of values from target_column for all matching rows.
Returns an empty list if no matches are found, the cache is not
initialised, or an error occurs.
get_column_values_for_files
def get_column_values_for_files( self, file_paths: Sequence[Union[str, Path]], columns: Sequence[str],) ‑> dict[str, dict[str, Any]]:Inherited from:
DataPersister.get_column_values_for_files :
Get specific column values for multiple files via targeted queries.
Retrieves only the requested columns from the cache for the given
files, avoiding loading full rows into DataFrames. This is
significantly more efficient than bulk_get when only a subset of
columns is needed (e.g. during filtering).
Arguments
file_paths: The file paths to query.
columns: The column names to retrieve from the cached data.
Returns
A nested dict mapping file_path -> {column_name -> value}.
Files not found in the cache are omitted from the result.
Returns an empty dict if the cache is not initialised or an
error occurs.
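The targeted-query optimisation can be sketched with sqlite3: select only the requested columns for the given paths and build the nested dict, so files missing from the cache simply produce no rows (table and column names here are illustrative).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cached_data ("
    "data_cache_id INTEGER PRIMARY KEY, "
    "_source_canonical_path TEXT, label TEXT, size INT)"
)
conn.executemany(
    "INSERT INTO cached_data (_source_canonical_path, label, size) "
    "VALUES (?, ?, ?)",
    [("/data/a.csv", "cat", 10), ("/data/b.csv", "dog", 20)],
)

def column_values_for_files(file_paths, columns):
    # Fetch only the requested columns rather than whole rows.
    placeholders = ",".join("?" * len(file_paths))
    sql = (
        f"SELECT _source_canonical_path, {', '.join(columns)} "
        f"FROM cached_data WHERE _source_canonical_path IN ({placeholders})"
    )
    return {
        row[0]: dict(zip(columns, row[1:]))
        for row in conn.execute(sql, list(file_paths))
    }

values = column_values_for_files(["/data/a.csv", "/data/missing.csv"], ["label"])
```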
get_skip_reason_summary
def get_skip_reason_summary(self) ‑> pandas.core.frame.DataFrame:Inherited from:
DataPersister.get_skip_reason_summary :
Get aggregate statistics of skip reasons.
Returns DataFrame with columns: reason_code, reason_description, file_count
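The aggregation itself is a GROUP BY over the skipped_files table; a minimal sketch (the real method additionally maps reason codes to FileSkipReason descriptions and returns a DataFrame):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE skipped_files ("
    "file_path TEXT PRIMARY KEY, reason_code INT, skip_time DATETIME)"
)
conn.executemany(
    "INSERT INTO skipped_files VALUES (?, ?, ?)",
    [("/a", 1, "2024-01-01"), ("/b", 1, "2024-01-02"), ("/c", 2, "2024-01-03")],
)

# Count skipped files per reason code.
summary = conn.execute(
    "SELECT reason_code, COUNT(*) AS file_count "
    "FROM skipped_files GROUP BY reason_code ORDER BY reason_code"
).fetchall()
```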
is_file_skipped
def is_file_skipped(self, file: Union[str, Path]) ‑> bool:Inherited from:
DataPersister.is_file_skipped :
Check if a file has been previously skipped.
Arguments
file: The file path to check.
Returns True if the file has been marked as skipped, False otherwise.
mark_file_skipped
def mark_file_skipped(self, file: Union[str, Path], reason: FileSkipReason) ‑> None:Inherited from:
DataPersister.mark_file_skipped :
Mark a file as skipped with the given reason.
Wraps the underlying _mark_file_skipped implementation with error
handling so that a failure to persist the skip record (e.g. a transient
OS/network error) does not propagate up and crash the caller.
Arguments
file: The file path that was skipped.
reason: The reason why the file was skipped.
set
def set(self, file: Union[str, Path], data: pd.DataFrame) ‑> None:Inherited from:
Set the persisted data for a given file.
If existing data is already set, it will be overwritten.
The data should only be the data that is related to that file.
touch
def touch(self, file_paths: Optional[Sequence[Union[str, Path]]] = None) ‑> None:Inherited from:
Mark the given cached entries as recently validated.
This signals to the cache that the entries for the given files are still current and should not be considered stale. The concrete effect depends on the implementation.
Files not present in the cache are silently ignored.
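For the SQLite backend the natural effect is refreshing cache_updated_at for the matched rows; an UPDATE simply matches nothing for unknown paths, which is why they are silently ignored. A sketch under those assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cache_info (file TEXT PRIMARY KEY, cache_updated_at TEXT)"
)
conn.execute(
    "INSERT INTO cache_info VALUES ('/data/a.csv', '2024-01-01T00:00:00')"
)

def touch(file_paths, now="2024-06-01T00:00:00"):
    # Paths with no cache entry match no rows, so they are silently ignored.
    conn.executemany(
        "UPDATE cache_info SET cache_updated_at = ? WHERE file = ?",
        [(now, p) for p in file_paths],
    )

touch(["/data/a.csv", "/data/unknown.csv"])
updated = conn.execute(
    "SELECT cache_updated_at FROM cache_info WHERE file = '/data/a.csv'"
).fetchone()[0]
```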
unset
def unset(self, file: Union[str, Path]) ‑> None:Deletes the persisted data for the given file.
SkippedFilesTableBase
class SkippedFilesTableBase():Skipped files tracking table ORM.
Tracks files that have been skipped during processing to avoid reprocessing them.
Variables
- static
file_path : sqlalchemy.orm.base.Mapped[str]
- static
reason_code : sqlalchemy.orm.base.Mapped[int]
- static
skip_time : sqlalchemy.orm.base.Mapped[datetime.datetime]