pandas_utils

Utility functions for interacting with pandas.

Module

Functions

append_dataframe_to_csv

def append_dataframe_to_csv(
    csv_file: Union[str, os.PathLike], df: pd.DataFrame,
) -> pathlib.Path:

Append or write a dataframe to a CSV file.

Handles appending a dataframe to an already existing CSV file that may contain differing columns.

Additionally, handles safe writing to file, where a new file will be created if the desired one is inaccessible for some reason.

Arguments

csv_file: The CSV file path to append/write to.
df: The dataframe to append.

Returns: The actual path the CSV was written to, which may differ from the requested one if that file was inaccessible.
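The column-handling behaviour described above can be illustrated with a simplified sketch. This is not the library's implementation: `append_with_column_union` is a hypothetical helper, it reads the whole existing file into memory, and it omits the fallback to a different path when the file is inaccessible.

```python
import tempfile
from pathlib import Path

import pandas as pd


def append_with_column_union(csv_path: Path, df: pd.DataFrame) -> Path:
    """Hypothetical helper: append df to csv_path, tolerating differing columns."""
    if not csv_path.exists():
        # Nothing to append to yet, so this is a plain write.
        df.to_csv(csv_path, index=False)
        return csv_path

    # concat aligns columns by name; cells that have no value in one
    # of the two frames are filled with NaN.
    existing = pd.read_csv(csv_path)
    combined = pd.concat([existing, df], ignore_index=True)
    combined.to_csv(csv_path, index=False)
    return csv_path


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "results.csv"
    append_with_column_union(target, pd.DataFrame({"a": [1], "b": [2]}))
    append_with_column_union(target, pd.DataFrame({"b": [3], "c": [4]}))
    result = pd.read_csv(target)  # columns a, b, c with two rows
```

After the second call the file holds the union of both column sets, with empty cells where a row had no value for a column.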

calculate_age

def calculate_age(
    dob: pd.Timestamp | datetime | date,
    comparison_date: Optional[pd.Timestamp | datetime | date] = None,
) -> int:

Given a date of birth, calculate age at a target date.

If no target date is supplied, use today.

Arguments

dob: Date of birth (should be a pandas Timestamp or a Python datetime/date).
comparison_date: The date to calculate age at. Defaults to today.

Returns: The age at the target date.
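One common way to compute an age in whole years is sketched below. `age_on` is a hypothetical name, and the real function may treat edge cases such as leap-day birthdays differently.

```python
from datetime import date
from typing import Optional


def age_on(dob: date, comparison_date: Optional[date] = None) -> int:
    """Hypothetical helper: whole years elapsed between dob and comparison_date."""
    target = comparison_date if comparison_date is not None else date.today()
    # Subtract one year if the birthday has not yet occurred in the target year.
    had_birthday = (target.month, target.day) >= (dob.month, dob.day)
    return target.year - dob.year - (0 if had_birthday else 1)


age_before_birthday = age_on(date(1990, 6, 15), date(2020, 6, 14))  # 29
age_on_birthday = age_on(date(1990, 6, 15), date(2020, 6, 15))  # 30
```

Because `datetime` and `pd.Timestamp` are subclasses of `date`, the same sketch accepts all three input types.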

calculate_ages

def calculate_ages(
    dobs: pd.Series[pd.Timestamp | datetime | date] | TimestampSeries,
    comparison_date: Optional[pd.Timestamp | datetime | date] = None,
) -> pd.Series[int]:

Given a series of dates of birth, calculate ages at a target date.

If no target date is supplied, use today.

Arguments

dobs: Series of dates of birth (should be pandas Timestamps or Python datetimes/dates).
comparison_date: The date to calculate age at. Defaults to today.

Returns: A series of the ages at the target date.
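A vectorised version of the same calculation might look like the sketch below. `ages_on` is a hypothetical name, missing dates (NaT) are not handled, and the library's own approach may differ.

```python
import pandas as pd


def ages_on(dobs: pd.Series, comparison_date=None) -> pd.Series:
    """Hypothetical helper: whole-year ages for a series of dates of birth."""
    if comparison_date is None:
        target = pd.Timestamp.today()
    else:
        target = pd.Timestamp(comparison_date)
    dobs = pd.to_datetime(dobs)
    # True where the birthday has already occurred in the target year.
    had_birthday = (dobs.dt.month < target.month) | (
        (dobs.dt.month == target.month) & (dobs.dt.day <= target.day)
    )
    return (target.year - dobs.dt.year - (~had_birthday).astype(int)).astype(int)


dobs = pd.Series(pd.to_datetime(["1990-06-15", "1990-06-14"]))
ages = ages_on(dobs, "2020-06-14")
```

In this example the first person has not yet had their 2020 birthday on the comparison date, while the second has.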

conditional_dataframe_yielder

def conditional_dataframe_yielder(
    dfs: Iterable[pd.DataFrame],
    condition: Callable[[pd.DataFrame], pd.DataFrame],
    reset_index: bool = True,
) -> collections.abc.Generator[pandas.core.frame.DataFrame, None, None]:

Create a generator that conditionally yields rows from a set of dataframes.

This replicates the standard .loc conditional indexing available on a whole dataframe, but for an iterable of dataframes, such as the one returned when reading a CSV file in chunks.

Arguments

dfs: An iterable of dataframes to conditionally yield rows from.
condition: A callable that takes in a dataframe, applies a condition, and returns the edited/filtered dataframe.
reset_index: Whether the index of the yielded dataframes should be reset. If True, a standard integer index is used that is consistent between the yielded dataframes (e.g. if yielded dataframe 10 ends with index 42, yielded dataframe 11 will start with index 43).
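The behaviour described above, including the running index across yielded dataframes, can be sketched as follows. `filtered_chunks` is a hypothetical stand-in. Whether the real function also skips empty results is not stated here, so this sketch yields every filtered chunk.

```python
from typing import Callable, Iterable, Iterator

import pandas as pd


def filtered_chunks(
    dfs: Iterable[pd.DataFrame],
    condition: Callable[[pd.DataFrame], pd.DataFrame],
    reset_index: bool = True,
) -> Iterator[pd.DataFrame]:
    """Hypothetical helper: apply condition to each dataframe and yield the result."""
    next_label = 0
    for df in dfs:
        filtered = condition(df)
        if reset_index:
            # Continue the integer index where the previous chunk stopped.
            stop = next_label + len(filtered)
            filtered = filtered.set_axis(pd.RangeIndex(next_label, stop), axis=0)
            next_label = stop
        yield filtered


chunks = [pd.DataFrame({"x": [1, 5, 8]}), pd.DataFrame({"x": [9, 2, 7]})]
kept = list(filtered_chunks(chunks, lambda df: df.loc[df["x"] > 4]))
index_values = [list(df.index) for df in kept]
```

In practice `dfs` would often be the reader returned by `pd.read_csv(..., chunksize=...)`, so only one chunk is held in memory at a time.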

dataframe_iterable_join

def dataframe_iterable_join(
    joiners: Iterable[pd.DataFrame],
    joinee: pd.DataFrame,
    reset_joiners_index: bool = False,
) -> collections.abc.Generator[pandas.core.frame.DataFrame, None, None]:

Performs a dataframe join against a collection of dataframes.

This replicates the standard .join() method available on a whole dataframe, but for an iterable of dataframes, such as the one returned when reading a CSV file in chunks.

For each joiner in joiners, this is equivalent to:

joiner.join(joinee)

Arguments

joiners: The collection of dataframes that should be joined against the joinee.
joinee: The single dataframe that the others should be joined against.
reset_joiners_index: Whether the index of the joiners dataframes should be reset as they are processed. If True, a standard integer index is used that is consistent between the yielded dataframes (e.g. if yielded dataframe 10 ends with index 42, yielded dataframe 11 will start with index 43).
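The core of the per-chunk join can be sketched as below. `join_each` is a hypothetical name, the data is made up for illustration, and the `reset_joiners_index` handling is omitted for brevity.

```python
from typing import Iterable, Iterator

import pandas as pd


def join_each(
    joiners: Iterable[pd.DataFrame], joinee: pd.DataFrame
) -> Iterator[pd.DataFrame]:
    """Hypothetical helper: join every dataframe in joiners against joinee."""
    for joiner in joiners:
        # DataFrame.join performs a left join on the index by default.
        yield joiner.join(joinee)


joinee = pd.DataFrame({"label": ["a", "b", "c"]}, index=[0, 1, 2])
joiners = [
    pd.DataFrame({"x": [10, 20]}, index=[0, 1]),
    pd.DataFrame({"x": [30]}, index=[2]),
]
joined = pd.concat(list(join_each(joiners, joinee)))
```

Concatenating the yielded dataframes gives the same rows as joining the unchunked dataframe in one call, provided the chunk indexes line up with the joinee's index.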

find_bitfount_id_column

def find_bitfount_id_column(df: pd.DataFrame) -> Optional[str]:

Find the actual column name for bitfount id in the DataFrame.

Arguments

df: DataFrame to search for bitfount id column.

Returns: The actual column name if found, None otherwise.

find_column_name

def find_column_name(
    dataframe_or_columns: pd.DataFrame | pd.Index | Collection[str],
    potential_names: Collection[str],
) -> Optional[str]:

Find the actual column name used, given a set of potential column names.

Arguments

dataframe_or_columns: The dataframe whose column names should be matched, or the column names themselves as an Index or a collection of strings.
potential_names: The collection of potential column names.

Returns: The found matching column name or None if no matching column name was found.
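A minimal sketch of this kind of lookup is shown below. `first_matching_name` is a hypothetical helper, and it assumes exact, case-sensitive matching with priority given to the order of `potential_names`. The real function's matching rules are not described here and may differ, and the column names are invented for illustration.

```python
from typing import Collection, Optional


def first_matching_name(
    columns: Collection[str], potential_names: Collection[str]
) -> Optional[str]:
    """Hypothetical helper: return the first potential name present in columns."""
    for name in potential_names:
        # Assumption: exact, case-sensitive comparison.
        if name in columns:
            return name
    return None


found = first_matching_name(["patient_id", "dob", "surname"], ["date_of_birth", "dob"])
missing = first_matching_name(["patient_id"], ["date_of_birth", "dob"])
```

With a dataframe, the same sketch could be called with `list(df.columns)` as the first argument.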

find_dob_column

def find_dob_column(df: pd.DataFrame) -> Optional[str]:

Find the actual column name for date of birth in the DataFrame.

Arguments

df: DataFrame to search for dob column.

Returns: The actual column name if found, None otherwise.

find_family_name_column

def find_family_name_column(df: pd.DataFrame) -> Optional[str]:

Find the actual column name for family/last name in the DataFrame.

Arguments

df: DataFrame to search for family name column.

Returns: The actual column name if found, None otherwise.

find_full_name_column

def find_full_name_column(df: pd.DataFrame) -> Optional[str]:

Find the actual column name for full name in the DataFrame.

Arguments

df: DataFrame to search for full name column.

Returns: The actual column name if found, None otherwise.

find_given_name_column

def find_given_name_column(df: pd.DataFrame) -> Optional[str]:

Find the actual column name for given/first name in the DataFrame.

Arguments

df: DataFrame to search for given name column.

Returns: The actual column name if found, None otherwise.

rewrite_csv_with_new_columns

def rewrite_csv_with_new_columns(
    csv_file: Union[str, os.PathLike], new_column_index: pd.Index,
) -> pathlib.Path:

Rewrite an existing dataframe CSV with a new set of columns.

This is useful when new columns need to be added to the CSV file. The function reads the original CSV in chunks, changes the column index, and writes the result to a new file. Once the new file has been fully written, it replaces the original CSV.

Additionally, handles safe writing to file, where a new file will be created if the desired one is inaccessible for some reason.

Arguments

csv_file: The CSV file to rewrite.
new_column_index: A pandas Index representing the new set of columns to use.

Returns: The actual path the CSV was written to, which may differ from the requested one if that file was inaccessible.
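The read-in-chunks, rewrite, then replace sequence described above can be sketched as follows. `rewrite_with_columns`, the `.tmp` suffix, and the `chunksize` parameter are assumptions made for illustration. The fallback to a different path when the file is inaccessible is omitted.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd


def rewrite_with_columns(
    csv_path: Path, new_columns: pd.Index, chunksize: int = 2
) -> Path:
    """Hypothetical helper: rewrite csv_path so that it uses new_columns."""
    tmp_path = csv_path.with_name(csv_path.name + ".tmp")
    # Write the new header first so the file exists even if there are no rows.
    pd.DataFrame(columns=new_columns).to_csv(tmp_path, index=False)
    with pd.read_csv(csv_path, chunksize=chunksize) as reader:
        for chunk in reader:
            # reindex adds missing columns (filled with NaN) and drops extras.
            chunk.reindex(columns=new_columns).to_csv(
                tmp_path, mode="a", header=False, index=False
            )
    # Swap the new file in only after it has been written in full.
    os.replace(tmp_path, csv_path)
    return csv_path


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "data.csv"
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}).to_csv(target, index=False)
    rewrite_with_columns(target, pd.Index(["a", "b", "c"]))
    rewritten = pd.read_csv(target)
```

Writing to a separate file and swapping it in at the end means a failure partway through leaves the original CSV untouched.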
