encoding_utils
Utilities for reading text files when the encoding is unknown or determined externally.
Useful for user-provided or cross-platform files (e.g. CSV/JSON/YAML saved on Windows) where the encoding may be UTF-8 with BOM, cp1252, or similar.
Functions
read_text_with_encoding_fallback
def read_text_with_encoding_fallback(
    path: Union[os.PathLike[str], str],
    encodings: tuple[str, ...] = ('utf-8-sig', 'utf-8', 'cp1252', 'latin-1'),
) -> tuple[str, str]

Read a file and decode using the first successful encoding.
Tries each encoding in order; the first that does not raise UnicodeDecodeError is used. Encodings are not disjoint (e.g. ASCII is valid in both UTF-8 and cp1252), so order expresses preference (UTF-8 first).
Arguments
path: Path to the file to read.
encodings: Tuple of encoding names to try. Defaults to utf-8-sig, utf-8, cp1252, latin-1.
Returns
A tuple (decoded_text, encoding_used).
Raises
UnicodeDecodeError: Only if the last encoding in the sequence fails (latin-1 accepts every byte, so this should not occur with the default encodings).
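The fallback logic above can be sketched as follows. This is a minimal implementation consistent with the documented behavior, not necessarily the module's actual code; it reads the raw bytes once and attempts each decode in order.

```python
import os
from typing import Union


def read_text_with_encoding_fallback(
    path: Union[os.PathLike, str],
    encodings: tuple = ('utf-8-sig', 'utf-8', 'cp1252', 'latin-1'),
) -> tuple:
    """Decode the file at `path` with the first encoding that succeeds."""
    # Read raw bytes once so every decode attempt sees identical data.
    with open(path, 'rb') as f:
        raw = f.read()
    for i, encoding in enumerate(encodings):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # Re-raise only for the last encoding in the sequence.
            if i == len(encodings) - 1:
                raise
    raise ValueError('encodings must not be empty')
```

Because latin-1 maps every byte value to a character, the default sequence always succeeds; the UnicodeDecodeError path is reachable only with a custom `encodings` tuple.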
strip_bom_from_text
def strip_bom_from_text(text: str) -> str

Strip BOM and UTF-8 BOM mojibake from the start and end of text.
Arguments
text: String that may have leading/trailing BOM or BOM mojibake.
Returns
Text with BOM and BOM mojibake removed from both ends, then whitespace-stripped.
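A sketch of this function, assuming "BOM mojibake" means the three-character sequence 'ï»¿' (the UTF-8 BOM bytes EF BB BF mis-decoded as cp1252 or latin-1); the marker list below is an assumption, not the module's actual constant:

```python
# U+FEFF is the BOM itself; 'ï»¿' is the UTF-8 BOM mis-decoded
# as cp1252/latin-1 (assumed markers for illustration).
_BOM_MARKERS = ('\ufeff', '\u00ef\u00bb\u00bf')


def strip_bom_from_text(text: str) -> str:
    """Remove BOM and BOM mojibake from both ends, then strip whitespace."""
    changed = True
    while changed:
        changed = False
        for marker in _BOM_MARKERS:
            if text.startswith(marker):
                text = text[len(marker):]
                changed = True
            if text.endswith(marker):
                text = text[:-len(marker)]
                changed = True
    return text.strip()
```

The loop repeats until no marker matches, so stacked artifacts (e.g. a mojibake sequence followed by a real BOM) are removed in one call.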