csv_source

Module containing the CSVSource class, which handles loading of CSV data.

Classes

CSVSource

class CSVSource(
    path: Union[os.PathLike[str], AnyUrl, str],
    read_csv_kwargs: Optional[dict[str, Any]] = None,
    modifiers: Optional[dict[str, DataPathModifiers]] = None,
    data_splitter: Optional[DatasetSplitter] = None,
    seed: Optional[int] = None,
    ignore_cols: Optional[Union[str, Sequence[str]]] = None,
    iterable: bool = True,
    partition_size: int = 16,
    required_fields: Optional[dict[str, Any]] = None,
    name: Optional[str] = None,
):

Data source for loading csv files.

Arguments

data_splitter: Deprecated argument, will be removed in a future release. Defaults to None. Not used.
ignore_cols: Column or list of columns to drop from the data. Defaults to None.
modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
name: The name for the datasource. Optional, defaults to None.
partition_size: The size of each partition when iterating over the data in a batched fashion. Defaults to 16.
path: The path or URL to the csv file.
read_csv_kwargs: Additional arguments to be passed as a dictionary to pandas.read_csv. Defaults to None.
seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.

Attributes

seed: Random number seed. Used for setting random seed for all libraries.

Ancestors

Variables

accessibility_details : Optional[AccessibilityDetails] - Check if the CSV file is accessible.

Returns: None if accessible, or a dict with error details if not.

is_accessible : bool - Check if datasource is currently accessible.

Returns True if accessibility_details is None (no errors). This is a convenience property that wraps accessibility_details.
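The relationship between the two properties can be sketched as follows. This is a minimal illustration only: AccessibilitySketch is a hypothetical stand-in rather than the real CSVSource, and the shape of the error dictionary is an assumption.

```python
from typing import Any, Optional


class AccessibilitySketch:
    """Hypothetical stand-in for a datasource; not the real CSVSource."""

    def __init__(self, error: Optional[dict[str, Any]] = None) -> None:
        # None means "no problem found"; a dict carries error details.
        self._error = error

    @property
    def accessibility_details(self) -> Optional[dict[str, Any]]:
        return self._error

    @property
    def is_accessible(self) -> bool:
        # Convenience wrapper: accessible exactly when there are no details.
        return self.accessibility_details is None


healthy = AccessibilitySketch()
# The "reason" key is an assumed layout for the error details.
broken = AccessibilitySketch({"reason": "file not found"})
print(healthy.is_accessible, broken.is_accessible)  # True False
```

Callers that only need a yes/no answer can check is_accessible and fall back to accessibility_details when they need to report why access failed.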

is_file_iterable : bool - Returns True if the datasource iterates over files.

Subclasses that iterate over files (e.g., FileSystemIterableSource) should override this to return True.

is_initialised : bool - Checks if BaseSource was initialised.

is_task_running : bool - Returns True if a task is running.

supports_project_db : bool - Whether the datasource supports the project database.

Each datasource needs to implement its own methods to define what its project database table should look like. If the datasource does not implement the methods to get the table creation query and columns, it does not support the project database.

Methods

add_hook

def add_hook(self, hook: DataSourceHook) ‑> None:

Inherited from:

BaseSource.add_hook :

Add a hook to the datasource.

apply_ignore_cols

def apply_ignore_cols(self, df: pd.DataFrame) ‑> pandas.core.frame.DataFrame:

Inherited from:

BaseSource.apply_ignore_cols :

Apply ignored columns to dataframe, dropping columns as needed.

Returns: A copy of the dataframe with ignored columns removed, or the original dataframe if this datasource does not specify any ignore columns.

apply_ignore_cols_iter

def apply_ignore_cols_iter(
    self, dfs: Iterator[pd.DataFrame],
) ‑> collections.abc.Iterator[pandas.core.frame.DataFrame]:

Inherited from:

BaseSource.apply_ignore_cols_iter :

Apply ignored columns to dataframes from iterator.

apply_modifiers

def apply_modifiers(self, df: pd.DataFrame) ‑> pandas.core.frame.DataFrame:

Inherited from:

BaseSource.apply_modifiers :

Apply column modifiers to the dataframe.

If no modifiers are specified, returns the dataframe unchanged.

get_data

def get_data(
    self,
    data_keys: SingleOrMulti[str] | SingleOrMulti[int],
    *,
    use_cache: bool = True,
    **kwargs: Any,
) ‑> Optional[pandas.core.frame.DataFrame]:

Inherited from:

BaseSource.get_data :

Get data corresponding to the provided data key(s).

Can be used to return data for a single data key or for multiple at once. If used for multiple, the order of the output dataframe must match the order of the keys provided.

Arguments

data_keys: Key(s) for which to get data. These may be things such as file names, UUIDs, etc. Can also be a list of integers if the datasource has an integer index.
use_cache: Whether the cache should be used to retrieve data for these keys. Note that cached data may have some elements, particularly image-related fields such as image data or file paths, replaced with placeholder values when stored in the cache. If data_cache is set on the instance, retrieved data will be written to the cache regardless of this argument.
**kwargs: Additional keyword arguments.

Returns: A dataframe containing the data, ordered to match the order of keys in data_keys, or None if no data for those keys was available.

get_datasource_metrics

def get_datasource_metrics(
    self, use_skip_codes: bool = False, data: Optional[pd.DataFrame] = None,
) ‑> DatasourceSummaryStats:

Inherited from:

BaseSource.get_datasource_metrics :

Get metadata about this datasource.

This can be used to store information about the datasource that may be useful for debugging or tracking purposes. The metadata will be stored in the project database.

Arguments

use_skip_codes: Whether to use the skip reason codes as the keys in the skip_reasons dictionary, rather than the existing reason descriptions.
data: The data to use for getting the metrics.

Returns: A dictionary containing metadata about this datasource.

get_project_db_sqlite_columns

def get_project_db_sqlite_columns(self) ‑> list[str]:

Inherited from:

BaseSource.get_project_db_sqlite_columns :

Implement this method to get the required columns.

This is used by the "run on new data only" feature to add data to the task table in the project database.

get_project_db_sqlite_create_table_query

def get_project_db_sqlite_create_table_query(self) ‑> str:

Inherited from:

BaseSource.get_project_db_sqlite_create_table_query :

Implement this method to return the required columns and types.

This is used by the "run on new data only" feature. This should be in the format that can be used after a "CREATE TABLE" statement and is used to create the task table in the project database.
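How the two project database methods might fit together can be sketched with the standard-library sqlite3 module. Everything specific here is an assumption for illustration: the table name, the column names, and the choice to return a parenthesised column definition list are hypothetical, not the library's actual values.

```python
import sqlite3

# Hypothetical return values for the two methods; a real datasource defines its own.
column_definitions = '("datapoint_hash" TEXT PRIMARY KEY, "results" TEXT)'
column_names = ["datapoint_hash", "results"]
table_name = "task_results"  # hypothetical task table name

conn = sqlite3.connect(":memory:")

# The definitions string is appended directly after the CREATE TABLE statement.
conn.execute(f'CREATE TABLE "{table_name}" {column_definitions}')

# The column list drives the INSERT used to record processed data points.
quoted_columns = ", ".join(f'"{name}"' for name in column_names)
placeholders = ", ".join("?" for _ in column_names)
conn.execute(
    f'INSERT INTO "{table_name}" ({quoted_columns}) VALUES ({placeholders})',
    ("abc123", "done"),
)

created = [row[1] for row in conn.execute(f"PRAGMA table_info({table_name})")]
rows = conn.execute(f'SELECT * FROM "{table_name}"').fetchall()
conn.close()

print(created)  # ['datapoint_hash', 'results']
print(rows)     # [('abc123', 'done')]
```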

get_schema

def get_schema(self) ‑> dict[str, typing.Any]:

Inherited from:

BaseSource.get_schema :

Get the pre-defined schema for this datasource.

This method should be overridden by datasources that have pre-defined schemas (i.e., those with has_predefined_schema = True).

Returns: The schema as a dictionary.

Raises

NotImplementedError: If the datasource doesn't have a pre-defined schema.
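The override pattern described above can be sketched as follows. Both classes are hypothetical stand-ins, and the schema layout shown is an assumption rather than the format the library uses.

```python
from typing import Any, Optional


class BaseSketch:
    """Hypothetical base class; stands in for the real base datasource."""

    has_predefined_schema = False

    def get_schema(self) -> dict[str, Any]:
        raise NotImplementedError("This datasource has no pre-defined schema.")


class FixedSchemaSketch(BaseSketch):
    """Hypothetical datasource that ships with a known schema."""

    has_predefined_schema = True

    def get_schema(self) -> dict[str, Any]:
        # Assumed layout, for illustration only.
        return {"columns": {"id": "int", "label": "str"}}


def schema_or_none(source: BaseSketch) -> Optional[dict[str, Any]]:
    # Check the flag first so callers never trigger NotImplementedError.
    return source.get_schema() if source.has_predefined_schema else None


print(schema_or_none(BaseSketch()))         # None
print(schema_or_none(FixedSchemaSketch()))  # {'columns': {'id': 'int', 'label': 'str'}}
```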

merge_and_validate_filters

def merge_and_validate_filters(
    self, datasource_level_filters: FilterConfig, task_level_filters: list[TaskFilter],
) ‑> MergedFilterConfig:

Inherited from:

BaseSource.merge_and_validate_filters :

Merge and validate the filters from the datasource and the task.

Returns a MergedFilterConfig with resolved filter values from both datasource and task-level filters, using intersection logic (most restrictive wins).
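To illustrate what "most restrictive wins" means for an intersection, here is a minimal sketch using a date range. DateRange and intersect are hypothetical names; the real FilterConfig, TaskFilter, and MergedFilterConfig types may carry different fields, and raising ValueError on non-overlapping ranges is an assumption about how validation could behave.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DateRange:
    """Hypothetical filter shape; not the library's FilterConfig or TaskFilter."""

    start: Optional[date] = None
    end: Optional[date] = None


def intersect(a: DateRange, b: DateRange) -> DateRange:
    # Most restrictive wins: the latest start and the earliest end.
    starts = [d for d in (a.start, b.start) if d is not None]
    ends = [d for d in (a.end, b.end) if d is not None]
    merged = DateRange(start=max(starts, default=None), end=min(ends, default=None))
    # Assumed validation step: an empty intersection is treated as an error.
    if merged.start is not None and merged.end is not None and merged.start > merged.end:
        raise ValueError("Datasource-level and task-level filters do not overlap.")
    return merged


datasource_level = DateRange(start=date(2024, 1, 1), end=date(2024, 12, 31))
task_level = DateRange(start=date(2024, 6, 1))
merged = intersect(datasource_level, task_level)
print(merged.start, merged.end)  # 2024-06-01 2024-12-31
```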

partition

def partition(
    self, iterable: Iterable[_I], partition_size: int = 1,
) ‑> collections.abc.Iterable[collections.abc.Sequence[~_I]]:

Inherited from:

BaseSource.partition :

Takes an iterable and yields partitions of size partition_size.

The final partition may be less than size partition_size due to the variable length of the iterable.
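The partitioning behaviour can be sketched as a standalone function. This is one possible implementation built on itertools.islice, not necessarily how BaseSource.partition is written; the check on partition_size is an added assumption.

```python
from collections.abc import Iterable, Iterator, Sequence
from itertools import islice
from typing import TypeVar

_I = TypeVar("_I")


def partition(iterable: Iterable[_I], partition_size: int = 1) -> Iterator[Sequence[_I]]:
    """Yield consecutive chunks of at most partition_size items."""
    if partition_size < 1:
        # Assumed guard; the documented method does not state this behaviour.
        raise ValueError("partition_size must be a positive integer")
    iterator = iter(iterable)
    # islice pulls up to partition_size items; an empty chunk ends the loop.
    while chunk := list(islice(iterator, partition_size)):
        yield chunk


chunks = list(partition(range(7), partition_size=3))
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The last chunk holds a single item because seven elements do not divide evenly into groups of three.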

remove_hook

def remove_hook(self, hook: DataSourceHook) ‑> None:

Inherited from:

BaseSource.remove_hook :

Remove a hook from the datasource.

yield_data

def yield_data(
    self,
    data_keys: Optional[SingleOrMulti[str] | SingleOrMulti[int]] = None,
    *,
    use_cache: bool = True,
    partition_size: Optional[int] = None,
    **kwargs: Any,
) ‑> collections.abc.Iterator[pandas.core.frame.DataFrame]:

Inherited from:

BaseSource.yield_data :

Yields data in batches from this source.

If data_keys is specified, only yield from that subset of the data. Otherwise, iterate through the whole datasource.

Arguments

data_keys: An optional list of data keys to use for yielding data. Otherwise, all data in the datasource will be considered. data_keys is always provided when this method is called from the Dataset as part of a task. Can also be a list of integers if the datasource has an integer index.
use_cache: Whether the cache should be used to retrieve data for these data points. Note that cached data may have some elements, particularly image-related fields such as image data or file paths, replaced with placeholder values when stored in the cache. If data_cache is set on the instance, retrieved data will be written to the cache regardless of this argument.
partition_size: The number of data elements to load/yield in each iteration. If not provided, defaults to the partition size configured in the datasource.
**kwargs: Additional keyword arguments.
