csv_source
Module containing CSVSource class.
CSVSource class handles loading of CSV data.
Classes
CSVSource
class CSVSource(path: Union[os.PathLike, AnyUrl, str], read_csv_kwargs: Optional[Dict[str, Any]] = None, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):
Data source for loading csv files.
Arguments
data_splitter
: Approach used for splitting the data into training, test, validation. Defaults to None.
ignore_cols
: Column/list of columns to be ignored from the data. Defaults to None.
modifiers
: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
path
: The path or URL to the csv file.
read_csv_kwargs
: Additional arguments to be passed as a dictionary to pandas.read_csv. Defaults to None.
seed
: Random number seed. Used for setting random seed for all libraries. Defaults to None.
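As a rough illustration of how read_csv_kwargs and ignore_cols relate to pandas, the sketch below forwards a kwargs dictionary to pandas.read_csv and then drops the ignored columns. The helper name load_csv_sketch is hypothetical and this is not CSVSource's actual implementation; it only assumes pandas is installed.

```python
import io
from typing import Any, Dict, Optional, Sequence, Union

import pandas as pd


def load_csv_sketch(
    path_or_buffer: Any,
    read_csv_kwargs: Optional[Dict[str, Any]] = None,
    ignore_cols: Optional[Union[str, Sequence[str]]] = None,
) -> pd.DataFrame:
    """Hypothetical helper mirroring two of the CSVSource arguments."""
    # Forward any extra options straight to pandas.read_csv.
    df = pd.read_csv(path_or_buffer, **(read_csv_kwargs or {}))
    # Accept either a single column name or a sequence of names.
    if isinstance(ignore_cols, str):
        ignore_cols = [ignore_cols]
    if ignore_cols:
        # errors="ignore" skips names that are not present in the file.
        df = df.drop(columns=list(ignore_cols), errors="ignore")
    return df


# A semicolon-separated file, read from memory for the sake of the example.
csv_text = "id;age;notes\n1;34;a\n2;51;b\n"
df = load_csv_sketch(
    io.StringIO(csv_text),
    read_csv_kwargs={"sep": ";"},
    ignore_cols="notes",
)
print(list(df.columns))  # ['id', 'age']
```

Passing a single string or a list for ignore_cols behaves the same way here, which matches the Union[str, Sequence[str]] type in the signature.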
Attributes
data
: A Dataframe-type object which contains the data.
data_splitter
: Approach used for splitting the data into training, test, validation.
seed
: Random number seed. Used for setting random seed for all libraries.
Ancestors
Variables
data : pandas.core.frame.DataFrame
- A property containing the underlying dataframe if the data has been loaded.
Raises: DataNotLoadedError: If the data has not been loaded yet.
hash : str
- The hash associated with this BaseSource. This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.
Returns: The hexdigest of the DataFrame hash.
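To make the idea of a schema-only hash concrete, here is a minimal sketch that hashes column names and dtypes and ignores cell values. The function name schema_hash_sketch, the use of SHA-256 and the sorting of columns are all assumptions made for illustration; the source does not state how the real hash is computed.

```python
import hashlib
import json

import pandas as pd


def schema_hash_sketch(df: pd.DataFrame) -> str:
    """Hash column names and dtypes only, never the cell values."""
    # Sorting by column name makes the digest independent of column order
    # (an assumption of this sketch, not a documented behaviour).
    schema = sorted((str(col), str(dtype)) for col, dtype in df.dtypes.items())
    return hashlib.sha256(json.dumps(schema).encode("utf-8")).hexdigest()


small = pd.DataFrame({"age": [34, 51], "name": ["a", "b"]})
larger = pd.DataFrame({"age": [34, 51, 29], "name": ["a", "b", "c"]})
retyped = pd.DataFrame({"age": [34.5, 51.0], "name": ["a", "b"]})

# Adding rows keeps the hash; changing a column's dtype changes it.
same = schema_hash_sketch(small) == schema_hash_sketch(larger)
changed = schema_hash_sketch(small) != schema_hash_sketch(retyped)
print(same, changed)  # True True
```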
is_initialised : bool
- Checks if BaseSource was initialised.
is_task_running : bool
- Returns True if a task is running.
iterable : bool
- This returns False if the DataSource does not subclass IterableSource. However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.
multi_table : bool
- This returns False if the DataSource does not subclass MultiTableSource. However, this property must be re-implemented in MultiTableSource, therefore it is not necessarily True if the DataSource inherits from MultiTableSource.
Methods
get_column
def get_column(self: BaseSource, col_name: str, *args: Any, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:
Inherited from:
Get a single column from the dataset.
Used in the ColumnAverage algorithm as well as to iterate over image columns for the purposes of schema generation.
get_column_names
def get_column_names(self, **kwargs: Any) -> Iterable[str]:
Inherited from:
Get the column names as an iterable.
get_data
def get_data(self, **kwargs: Any) -> pandas.core.frame.DataFrame:
Loads and returns data from CSV dataset.
Returns: A DataFrame-type object which contains the data.
Raises
DataSourceError
: If the CSV file cannot be opened.
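The error behaviour described above can be sketched as wrapping the file-open failure from pandas in a dedicated exception. DataSourceErrorSketch is a local stand-in for the library's DataSourceError, and read_csv_or_raise is a hypothetical helper; which underlying exceptions the real get_data catches is not stated in the source.

```python
import pandas as pd


class DataSourceErrorSketch(Exception):
    """Local stand-in for the library's DataSourceError (hypothetical)."""


def read_csv_or_raise(path: str) -> pd.DataFrame:
    """Read a CSV file, converting open failures into a single error type."""
    try:
        return pd.read_csv(path)
    except OSError as e:
        # FileNotFoundError and PermissionError are both subclasses of OSError.
        raise DataSourceErrorSketch(f"Unable to open CSV file: {path}") from e


try:
    read_csv_or_raise("this_file_does_not_exist_12345.csv")
    raised = False
except DataSourceErrorSketch:
    raised = True
```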
get_dtypes
def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) -> _Dtypes:
Inherited from:
Implement this method to get the columns and column types from the dataset.
get_values
def get_values(self, col_names: List[str], **kwargs: Any) -> Dict[str, Iterable[Any]]:
Get distinct values from columns in CSV dataset.
Arguments
col_names
: The list of the columns whose distinct values should be returned.
Returns: The distinct values of the requested columns as a mapping from column name to a series of distinct values.
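The shape of that return value can be illustrated with plain pandas. The sketch below builds a mapping from each requested column name to its distinct values; distinct_values_sketch is a hypothetical name and this is only one plausible way to produce such a mapping, not the method's actual code.

```python
from typing import Any, Dict, Iterable, List

import pandas as pd


def distinct_values_sketch(
    df: pd.DataFrame, col_names: List[str]
) -> Dict[str, Iterable[Any]]:
    """Map each requested column name to its distinct values."""
    # Series.unique() keeps values in order of first appearance.
    return {col: df[col].unique() for col in col_names}


df = pd.DataFrame(
    {"colour": ["red", "blue", "red"], "size": [1, 1, 2], "id": [10, 11, 12]}
)
values = distinct_values_sketch(df, ["colour", "size"])
print({col: list(vals) for col, vals in values.items()})
```

Columns that are not listed in col_names, such as id here, do not appear in the result.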
load_data
def load_data(self, **kwargs: Any) -> None:
Inherited from:
Load the data for the datasource.
Raises
TypeError
: If data format is not supported.