encoding_utils

Utilities for reading text files when encoding is unknown or external.

Useful for user-provided or cross-platform files (e.g. CSV/JSON/YAML saved on Windows) where the encoding may be UTF-8 with BOM, cp1252, or similar.

Module

Functions

read_text_with_encoding_fallback

def read_text_with_encoding_fallback(
    path: Union[os.PathLike[str], str],
    encodings: tuple[str, ...] = ('utf-8-sig', 'utf-8', 'cp1252', 'latin-1'),
) -> tuple[str, str]:

Read a file and decode using the first successful encoding.

Tries each encoding in order; the first that does not raise UnicodeDecodeError is used. Encodings are not disjoint (e.g. pure ASCII is valid in both UTF-8 and cp1252), so the order expresses preference (UTF-8 variants first).

Arguments

  • path: Path to the file to read.
  • encodings: Tuple of encoding names to try. Defaults to utf-8-sig, utf-8, cp1252, latin-1.

Returns (decoded_text, encoding_used).

Raises

  • UnicodeDecodeError: Only if the last encoding in the sequence fails (latin-1 accepts every byte, so this should not occur with the default encodings).
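The fallback loop described above can be sketched as follows. This is a hypothetical re-implementation consistent with the documented behavior, not the module's actual source:

```python
import os
from typing import Union

def read_text_with_encoding_fallback(
    path: Union[os.PathLike, str],
    encodings: tuple = ('utf-8-sig', 'utf-8', 'cp1252', 'latin-1'),
) -> tuple:
    """Decode the file at `path` with the first encoding that succeeds."""
    with open(path, 'rb') as f:
        data = f.read()
    # Try every encoding except the last, swallowing decode failures.
    for encoding in encodings[:-1]:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last encoding: let UnicodeDecodeError propagate to the caller.
    # With the default list this never raises, since latin-1 maps
    # every byte value to a code point.
    return data.decode(encodings[-1]), encodings[-1]
```

For example, a file containing the cp1252 bytes `b"caf\xe9"` fails to decode as utf-8-sig and utf-8 (a lone `\xe9` is an invalid UTF-8 sequence), so the sketch falls through to cp1252 and returns `("café", "cp1252")`.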

strip_bom_from_text

def strip_bom_from_text(text: str) -> str:

Strip BOM and UTF-8 BOM mojibake from the start and end of text.

Arguments

  • text: String that may have leading/trailing BOM or BOM mojibake.

Returns text with BOM and BOM mojibake removed from both ends, then whitespace-stripped.
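The behavior above can be sketched like this. A "BOM mojibake" is the character sequence `ï»¿`, which appears when the UTF-8 BOM bytes `b"\xef\xbb\xbf"` are mistakenly decoded as cp1252 or latin-1. This is a hypothetical re-implementation consistent with the documented behavior, not the module's actual source:

```python
# Markers to remove: the real BOM code point and its cp1252/latin-1 mojibake.
_BOM_MARKERS = ("\ufeff", "ï»¿")

def strip_bom_from_text(text: str) -> str:
    """Remove BOM and BOM mojibake from both ends of `text`, then strip."""
    changed = True
    while changed:  # repeat in case markers are stacked, e.g. "\ufeffï»¿..."
        changed = False
        for marker in _BOM_MARKERS:
            if text.startswith(marker):
                text = text[len(marker):]
                changed = True
            if text.endswith(marker):
                text = text[:-len(marker)]
                changed = True
    return text.strip()
```

For example, `strip_bom_from_text("ï»¿data\ufeff")` removes the mojibake prefix and the BOM suffix, yielding `"data"`.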