pandas_utils
Utility functions for interacting with pandas.
Module
Functions
append_dataframe_to_csv
def append_dataframe_to_csv( csv_file: Union[str, os.PathLike], df: pd.DataFrame,) ‑> pathlib.Path:
Append or write a dataframe to a CSV file.
Handles appending a dataframe to an already existing CSV file that may contain differing columns.
Additionally, handles safe writing to file, where a new file will be created if the desired one is inaccessible for some reason.
Arguments
csv_file
: The CSV file path to append/write to.
df
: The dataframe to append.
Returns The actual path the CSV was written to, which may differ from the requested one if that file was inaccessible.
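The column-handling behaviour described above can be illustrated with a minimal sketch. This is not the library's implementation: `append_with_column_union` is a hypothetical stand-in, it assumes that new columns are added to the end of the existing header, and it leaves out the fallback to a different file when the requested one is inaccessible.

```python
import os
import tempfile
from pathlib import Path
from typing import Union

import pandas as pd


def append_with_column_union(csv_file: Union[str, os.PathLike], df: pd.DataFrame) -> Path:
    """Hypothetical helper: append df to csv_file, widening the header if needed."""
    path = Path(csv_file)
    if not path.exists():
        # Nothing to append to, so this is a plain write.
        df.to_csv(path, index=False)
        return path

    # Read only the header row to learn the existing column order.
    existing_cols = list(pd.read_csv(path, nrows=0).columns)
    new_cols = [c for c in df.columns if c not in existing_cols]
    all_cols = existing_cols + new_cols

    if new_cols:
        # The header has to change, so the old rows are rewritten with the wider column set.
        old = pd.read_csv(path)
        combined = pd.concat([old, df], ignore_index=True).reindex(columns=all_cols)
        combined.to_csv(path, index=False)
    else:
        # Same header: align the column order and append without repeating the header.
        df.reindex(columns=all_cols).to_csv(path, mode="a", header=False, index=False)
    return path


# Usage: the second append introduces column "c"; the third only reorders columns.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "out.csv"
    append_with_column_union(target, pd.DataFrame({"a": [1], "b": [2]}))
    append_with_column_union(target, pd.DataFrame({"b": [3], "c": [4]}))
    append_with_column_union(target, pd.DataFrame({"c": [6], "a": [5], "b": [7]}))
    result = pd.read_csv(target)
```

Rows written before a column existed simply hold missing values for that column after the rewrite.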
calculate_age
def calculate_age( dob: pd.Timestamp | datetime | date, comparison_date: Optional[pd.Timestamp | datetime | date] = None,) ‑> int:
Given a date of birth, calculate age at a target date.
If no target date is supplied, use today.
Arguments
dob
: Date of birth (should be pandas Timestamp or python datetime/date).
comparison_date
: The date to calculate age at. Defaults to today.
Returns The age at the target date.
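As an illustration of the calculation, the sketch below counts whole years and subtracts one if the birthday has not yet been reached in the target year. `age_at` is a hypothetical stand-in written with the standard library only; the library function may handle edge cases (such as 29 February birthdays) differently.

```python
from datetime import date
from typing import Optional


def age_at(dob: date, comparison_date: Optional[date] = None) -> int:
    """Hypothetical stand-in: whole years elapsed between dob and comparison_date."""
    target = comparison_date if comparison_date is not None else date.today()
    # Subtract one year if the birthday has not yet occurred in the target year.
    had_birthday = (target.month, target.day) >= (dob.month, dob.day)
    return target.year - dob.year - (0 if had_birthday else 1)


age_before = age_at(date(1990, 6, 15), date(2024, 6, 14))  # day before the birthday
age_on = age_at(date(1990, 6, 15), date(2024, 6, 15))  # on the birthday
```

Because only the `year`, `month` and `day` attributes are read, the same sketch also accepts `datetime` and `pd.Timestamp` values.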
calculate_ages
def calculate_ages( dobs: pd.Series[pd.Timestamp | datetime | date] | TimestampSeries, comparison_date: Optional[pd.Timestamp | datetime | date] = None,) ‑> pd.Series[int]:
Given a series of dates of birth, calculate ages at a target date.
If no target date is supplied, use today.
Arguments
dobs
: Series of dates of birth (should be pandas Timestamps or python datetimes/dates).
comparison_date
: The date to calculate age at. Defaults to today.
Returns A series of the ages at the target date.
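A vectorised version of the same idea might look like the sketch below. `ages_at` is a hypothetical name, and the sketch assumes the series contains no missing values; it is not the library's code.

```python
import pandas as pd


def ages_at(dobs: pd.Series, comparison_date=None) -> pd.Series:
    """Hypothetical stand-in: whole-year ages for a series of dates of birth."""
    target = pd.Timestamp(comparison_date) if comparison_date is not None else pd.Timestamp.today()
    dobs = pd.to_datetime(dobs)
    # True where the birthday falls later in the target year than the target date.
    birthday_pending = (dobs.dt.month > target.month) | (
        (dobs.dt.month == target.month) & (dobs.dt.day > target.day)
    )
    return (target.year - dobs.dt.year - birthday_pending.astype(int)).astype(int)


dobs = pd.Series(pd.to_datetime(["1990-06-15", "2000-01-01"]))
ages = ages_at(dobs, "2024-06-14")
```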
conditional_dataframe_yielder
def conditional_dataframe_yielder( dfs: Iterable[pd.DataFrame], condition: Callable[[pd.DataFrame], pd.DataFrame], reset_index: bool = True,) ‑> collections.abc.Generator[pandas.core.frame.DataFrame, None, None]:
Create a generator that conditionally yields rows from a set of dataframes.
This replicates the standard .loc conditional indexing that can be used on a whole dataframe in a manner that can be applied to an iterable of dataframes such as is returned when chunking a CSV file.
Arguments
dfs
: An iterable of dataframes to conditionally yield rows from.
condition
: A callable that takes in a dataframe, applies a condition, and returns the edited/filtered dataframe.
reset_index
: Whether the index of the yielded dataframes should be reset. If True, a standard integer index is used that is consistent between the yielded dataframes (e.g. if yielded dataframe 10 ends with index 42, yielded dataframe 11 will start with index 43).
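The index behaviour described for reset_index can be sketched as follows. `filter_chunks` is a hypothetical stand-in rather than the library's implementation, and it assumes that every chunk is yielded, including chunks left empty by the condition.

```python
import pandas as pd


def filter_chunks(dfs, condition, reset_index=True):
    """Hypothetical stand-in: apply condition to each dataframe and yield the result."""
    next_start = 0
    for df in dfs:
        filtered = condition(df)
        if reset_index:
            # Continue the integer index from where the previous yielded chunk stopped.
            new_index = pd.RangeIndex(next_start, next_start + len(filtered))
            filtered = filtered.set_axis(new_index, axis=0)
            next_start += len(filtered)
        yield filtered


chunks = [pd.DataFrame({"x": [1, 5, 9]}), pd.DataFrame({"x": [2, 8, 7]})]
kept = list(filter_chunks(chunks, lambda df: df.loc[df["x"] > 4]))
combined = pd.concat(kept)
```

In practice `chunks` would typically come from `pd.read_csv(..., chunksize=...)`, so only one chunk needs to be held in memory at a time.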
dataframe_iterable_join
def dataframe_iterable_join( joiners: Iterable[pd.DataFrame], joinee: pd.DataFrame, reset_joiners_index: bool = False,) ‑> collections.abc.Generator[pandas.core.frame.DataFrame, None, None]:
Performs a dataframe join against a collection of dataframes.
This replicates the standard .join() method that can be used on a whole dataframe in a manner that can be applied to an iterable of dataframes such as is returned when chunking a CSV file.
This is equivalent to:
joiner.join(joinee)
Arguments
joiners
: The collection of dataframes that should be joined against the joinee.
joinee
: The single dataframe that the others should be joined against.
reset_joiners_index
: Whether the index of the joiners dataframes should be reset as they are processed. If True, a standard integer index is used that is consistent between the yielded dataframes (e.g. if yielded dataframe 10 ends with index 42, yielded dataframe 11 will start with index 43).
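A minimal sketch of the chunk-wise join is shown below, assuming the default index-on-index left join of `DataFrame.join`. `join_chunks` is a hypothetical name and this is not the library's implementation.

```python
import pandas as pd


def join_chunks(joiners, joinee, reset_joiners_index=False):
    """Hypothetical stand-in: yield joiner.join(joinee) for each joiner in turn."""
    next_start = 0
    for joiner in joiners:
        if reset_joiners_index:
            # Give each chunk a running integer index so it lines up with the joinee's rows.
            new_index = pd.RangeIndex(next_start, next_start + len(joiner))
            joiner = joiner.set_axis(new_index, axis=0)
            next_start += len(joiner)
        yield joiner.join(joinee)


joinee = pd.DataFrame({"label": ["a", "b", "c", "d"]})
chunks = [pd.DataFrame({"x": [10, 20]}), pd.DataFrame({"x": [30, 40]})]
joined = pd.concat(list(join_chunks(chunks, joinee, reset_joiners_index=True)))
```

Without the index reset, both chunks in this example would carry the index 0 to 1 and would each be matched against the first two rows of the joinee.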
find_bitfount_id_column
def find_bitfount_id_column(df: pd.DataFrame) ‑> Optional[str]:
Find the actual column name for bitfount id in the DataFrame.
Arguments
df
: DataFrame to search for bitfount id column
Returns The actual column name if found, None otherwise
find_column_name
def find_column_name( dataframe_or_columns: pd.DataFrame | pd.Index | Collection[str], potential_names: Collection[str],) ‑> Optional[str]:
Find the actual column name used, given a set of potential column names.
Arguments
dataframe_or_columns
: The dataframe to match column names against or the column names as an Index or list of column names.
potential_names
: The collection of potential column names.
Returns The found matching column name or None if no matching column name was found.
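The lookup can be pictured with the sketch below. The matching rule is an assumption: the documentation does not say how names are compared, so this hypothetical `first_matching_column` compares case-insensitively and returns the column name exactly as it appears in the data.

```python
from typing import Optional

import pandas as pd


def first_matching_column(dataframe_or_columns, potential_names) -> Optional[str]:
    """Hypothetical stand-in: return the first column whose name matches a candidate."""
    if isinstance(dataframe_or_columns, pd.DataFrame):
        columns = dataframe_or_columns.columns
    else:
        columns = dataframe_or_columns
    # Assumption: comparison ignores case; the real function may be stricter or looser.
    wanted = {name.lower() for name in potential_names}
    for col in columns:
        if str(col).lower() in wanted:
            return str(col)
    return None


df = pd.DataFrame(columns=["ID", "Date_Of_Birth"])
found = first_matching_column(df, ["dob", "date_of_birth"])
missing = first_matching_column(["a", "b"], ["dob"])
```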
find_dob_column
def find_dob_column(df: pd.DataFrame) ‑> Optional[str]:
Find the actual column name for date of birth in the DataFrame.
Arguments
df
: DataFrame to search for dob column
Returns The actual column name if found, None otherwise
find_family_name_column
def find_family_name_column(df: pd.DataFrame) ‑> Optional[str]:
Find the actual column name for family/last name in the DataFrame.
Arguments
df
: DataFrame to search for family name column
Returns The actual column name if found, None otherwise
find_full_name_column
def find_full_name_column(df: pd.DataFrame) ‑> Optional[str]:
Find the actual column name for full name in the DataFrame.
Arguments
df
: DataFrame to search for name column
Returns The actual column name if found, None otherwise
find_given_name_column
def find_given_name_column(df: pd.DataFrame) ‑> Optional[str]:
Find the actual column name for given/first name in the DataFrame.
Arguments
df
: DataFrame to search for given name column
Returns The actual column name if found, None otherwise
rewrite_csv_with_new_columns
def rewrite_csv_with_new_columns( csv_file: Union[str, os.PathLike], new_column_index: pd.Index,) ‑> pathlib.Path:
Rewrite an existing dataframe CSV with a new set of columns.
This is of use when new columns need to be added to the CSV file. The function reads the original CSV in chunks, applies the new column index, and writes the result out to a new file. Once the new file has been fully written, it replaces the original CSV.
Additionally, handles safe writing to file, where a new file will be created if the desired one is inaccessible for some reason.
Arguments
csv_file
: The CSV file to rewrite.
new_column_index
: A pandas Index representing the new set of columns to use.
Returns The actual path the CSV was written to, which may differ from the requested one if that file was inaccessible.
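The read, re-column, write, replace sequence described above can be sketched as follows. `rewrite_with_columns` and its `chunksize` parameter are hypothetical, the use of `reindex` to apply the new columns is an assumption, and the fallback to a different path for inaccessible files is not shown.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd


def rewrite_with_columns(csv_file, new_columns: pd.Index, chunksize: int = 10_000) -> Path:
    """Hypothetical stand-in: rewrite csv_file chunk by chunk with a new column set."""
    path = Path(csv_file)
    # Write to a temporary file in the same directory so the final replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(suffix=".csv", dir=path.parent)
    os.close(fd)
    first = True
    with pd.read_csv(path, chunksize=chunksize) as reader:
        for chunk in reader:
            # Columns missing from the chunk are filled with missing values.
            chunk.reindex(columns=new_columns).to_csv(
                tmp_name, mode="w" if first else "a", header=first, index=False
            )
            first = False
    # Swap the new file in only once it has been completely written.
    os.replace(tmp_name, path)
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    source = Path(tmp_dir) / "data.csv"
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}).to_csv(source, index=False)
    rewrite_with_columns(source, pd.Index(["a", "b", "c"]), chunksize=2)
    rewritten = pd.read_csv(source)
```

Writing to a separate file first means the original CSV stays intact if the rewrite fails partway through.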