database_source
Module containing DatabaseSource class.
DatabaseSource class handles loading data stored in a SQL database.
Module
Functions
auto_validate
def auto_validate(f: Callable) -> Callable:
Decorator which validates the database connection.
Classes
DatabaseSource
class DatabaseSource(db_conn: Union[DatabaseConnection, str], partition_size: int = 100, max_row_buffer: int = 500, db_conn_kwargs: Optional[Dict[str, Any]] = None, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None):
Data source for loading data from databases.
This datasource subclasses both MultiTableSource and IterableSource, so it can load data from a single table or from multiple tables, and it can load data all at once or iteratively. However, these two choices are tied together: to load data from multiple tables, you must use the iterable functionality; to load data from a single table, you must use the non-iterable functionality.
Arguments
data_splitter
: Approach used for splitting the data into training, test, validation. Defaults to None.
ignore_cols
: Column/list of columns to be ignored from the data. Defaults to None.
modifiers
: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
partition_size
: The size of each partition when iterating over the data.
seed
: Random number seed. Used for setting random seed for all libraries. Defaults to None.
Attributes
data
: A DataFrame-type object which contains the data.
data_splitter
: Approach used for splitting the data into training, test, validation.
seed
: Random number seed. Used for setting random seed for all libraries.
Ancestors
Variables
con : sqlalchemy.engine.base.Engine
- SQLAlchemy engine. Connection options are set to stream results using a server side cursor where possible (depends on the database backend's support for this feature) with a maximum client side row buffer of self.max_row_buffer rows.
data : pandas.core.frame.DataFrame
- A property containing the underlying dataframe if the data has been loaded. Raises: DataNotLoadedError: If the data has not been loaded yet.
hash : str
- The hash associated with this BaseSource. This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.
Returns: The hexdigest of the DataFrame hash.
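The idea of a hash that depends only on the schema can be illustrated with a minimal sketch. This is not the library's implementation: the function name schema_hash, the use of SHA-256, and the JSON encoding of column names and type names are all assumptions made for the example.

```python
import hashlib
import json
from typing import Dict


def schema_hash(dtypes: Dict[str, str]) -> str:
    """Hypothetical helper: hash column names and type names only.

    Row contents never enter this function, so adding rows to a table
    with the same schema cannot change the digest.
    """
    # Sort keys so the digest does not depend on column ordering.
    canonical = json.dumps(dtypes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


schema = {"age": "int64", "name": "object"}
reordered = {"name": "object", "age": "int64"}
changed = {"age": "float64", "name": "object"}

print(schema_hash(schema) == schema_hash(reordered))  # same schema, same digest
print(schema_hash(schema) == schema_hash(changed))    # a type changed, digest differs
```

Whether the real hash ignores column order is not stated above; the sorting step is a design choice of this sketch.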
is_initialised : bool
- Checks if BaseSource was initialised.
is_task_running : bool
- Returns True if a task is running.
iterable : bool
- Defines whether the data source is iterable. For the database source, it can only be iterable if multi_table is True.
multi_table : bool
- Attribute to specify whether the datasource is multi-table. This returns True if the database connection has multiple tables and no query is provided when creating the datasource.
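The relationship between multi_table and iterable can be sketched as two small predicates. The function names and the plain-list inputs are hypothetical stand-ins for the state held by the database connection, and "multiple tables" is read here as "more than one table name".

```python
from typing import List, Optional


def is_multi_table(table_names: List[str], query: Optional[str]) -> bool:
    """True when the connection has several tables and no query was given."""
    return len(table_names) > 1 and query is None


def is_iterable(table_names: List[str], query: Optional[str]) -> bool:
    """Assumption of this sketch: iterable exactly when multi-table."""
    return is_multi_table(table_names, query)


print(is_multi_table(["patients", "visits"], None))             # several tables, no query
print(is_multi_table(["patients", "visits"], "SELECT 1"))       # a query was provided
print(is_iterable(["patients"], None))                          # single table
```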
query : Optional[str]
- A database query as a string. The query is resolved in the following order:
- The query specified in the database connection.
- The table name specified in the database connection if just 1 table.
- The query specified by the datastructure (if multi-table).
- The table name specified by the datastructure (if multi-table).
- None.
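The resolution order listed above can be sketched as a chain of fallbacks. The function and parameter names are hypothetical, and turning a table name into a SELECT * statement is an assumption of this sketch; the text above does not say how a table name is converted into a query.

```python
from typing import List, Optional


def resolve_query(
    conn_query: Optional[str],
    conn_table_names: List[str],
    ds_query: Optional[str],
    ds_table_name: Optional[str],
) -> Optional[str]:
    """Return the first available query, following the documented order."""
    # 1. A query set on the database connection wins.
    if conn_query is not None:
        return conn_query
    # 2. A connection with exactly one table uses that table.
    #    (Assumption: the table name is expanded to SELECT *.)
    if len(conn_table_names) == 1:
        return f'SELECT * FROM "{conn_table_names[0]}"'
    # 3. Multi-table: a query supplied by the datastructure.
    if ds_query is not None:
        return ds_query
    # 4. Multi-table: a table name supplied by the datastructure.
    if ds_table_name is not None:
        return f'SELECT * FROM "{ds_table_name}"'
    # 5. Nothing available.
    return None


print(resolve_query(None, ["patients", "visits"], None, "visits"))
```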
table_names : List[str]
- Database table names.
Methods
get_column
def get_column(self: BaseSource, col_name: str, *args: Any, **kwargs: Any) -> Union[numpy.ndarray, pandas.core.series.Series]:
Inherited from:
Implement this method to get single column from dataset.
get_column_names
def get_column_names(self, table_name: Optional[str] = None, **kwargs: Any) -> Iterable[str]:
Get the column names as an iterable.
Arguments
table_name
: The name of the table to load. Only required for multi-table databases.
Returns A list of the column names for the target table.
Raises
ValueError
: If the data is multi-table but no table name provided.
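The behaviour described for get_column_names can be sketched against an in-memory SQLite database using only the standard library. The column_names function, the example tables, and the extra check for unknown table names are hypothetical; the real method works through the SQLAlchemy engine instead.

```python
import sqlite3
from typing import List, Optional


def column_names(
    conn: sqlite3.Connection,
    table_names: List[str],
    table_name: Optional[str] = None,
) -> List[str]:
    """Hypothetical stand-in for get_column_names."""
    if table_name is None:
        if len(table_names) != 1:
            # Mirrors the documented ValueError for multi-table data.
            raise ValueError("table_name is required for a multi-table database")
        table_name = table_names[0]
    if table_name not in table_names:
        # Extra guard added by this sketch; also keeps the f-string below safe.
        raise ValueError(f"Unknown table: {table_name}")
    # LIMIT 0 fetches no rows but still populates cursor.description.
    cursor = conn.execute(f'SELECT * FROM "{table_name}" LIMIT 0')
    return [col[0] for col in cursor.description]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, age INTEGER)")
conn.execute("CREATE TABLE visits (id INTEGER, patient_id INTEGER, day TEXT)")
tables = ["patients", "visits"]

print(column_names(conn, tables, "visits"))
```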
get_data
def get_data(self, table_name: Optional[str] = None, sql_query: Optional[str] = None, **kwargs: Any) -> Optional[pandas.core.frame.DataFrame]:
Loads and returns data from Database dataset.
Arguments
sql_query
: A SQL query string required for multi table data sources. This comes from the DataStructure and takes precedence over the table_name.
table_name
: Table name for multi table data sources. This comes from the DataStructure and is ignored if sql_query has been provided.
Returns A DataFrame-type object which contains the data or None if the data is multi-table.
get_dtypes
def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) -> _Dtypes:
Inherited from:
Implement this method to get the columns and column types from dataset.
get_values
def get_values(self, col_names: List[str], table_name: Optional[str] = None, **kwargs: Any) -> Dict[str, Iterable[Any]]:
Get distinct values from columns in Database dataset.
Arguments
col_names
: The list of the columns whose distinct values should be returned.
table_name
: The name of the table in which the columns exist. Required for multi-table databases.
Returns The distinct values of the requested columns as a mapping from column name to a series of distinct values.
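A mapping from column name to distinct values can be built with one SELECT DISTINCT per column. The sketch below uses the standard-library sqlite3 module and a hypothetical distinct_values function and example table; it illustrates the shape of the return value, not the library's code.

```python
import sqlite3
from typing import Any, Dict, List


def distinct_values(
    conn: sqlite3.Connection, table_name: str, col_names: List[str]
) -> Dict[str, List[Any]]:
    """Hypothetical stand-in for get_values: one DISTINCT query per column."""
    values: Dict[str, List[Any]] = {}
    for col in col_names:
        # Identifiers are interpolated directly, so this sketch assumes
        # table and column names come from a trusted source.
        rows = conn.execute(
            f'SELECT DISTINCT "{col}" FROM "{table_name}"'
        ).fetchall()
        values[col] = [row[0] for row in rows]
    return values


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id INTEGER, ward TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, "A"), (1, "B"), (2, "A"), (3, "A")],
)
result = distinct_values(conn, "visits", ["patient_id", "ward"])
print(result)
```

SQL does not guarantee the order of DISTINCT results, so callers that need a stable order should sort the values themselves.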
load_data
def load_data(self, **kwargs: Any) -> None:
Inherited from:
Load the data for the datasource.
Raises
TypeError
: If data format is not supported.
validate
def validate(self) -> None:
Validate the database connection.
This method does not revalidate the connection if it has already been validated.
Raises
ArgumentError
: If the database connection is not a valid sqlalchemy database url.
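One way a validate-once method and a validating decorator could fit together is sketched below. Everything here is hypothetical: ToySource, auto_validate_sketch, the "://" check, and the use of ValueError in place of SQLAlchemy's ArgumentError are stand-ins chosen so the example runs with the standard library alone.

```python
import functools
from typing import Any, Callable, List


def auto_validate_sketch(f: Callable) -> Callable:
    """Hypothetical decorator: validate the connection before calling f."""

    @functools.wraps(f)
    def wrapper(self: Any, *args: Any, **kwargs: Any) -> Any:
        self.validate()
        return f(self, *args, **kwargs)

    return wrapper


class ToySource:
    def __init__(self, url: str) -> None:
        self.url = url
        self.validation_runs = 0  # counts real validations, for illustration
        self._validated = False

    def validate(self) -> None:
        # Skip the work if the connection was already validated.
        if self._validated:
            return
        # Toy check standing in for real database URL parsing.
        if "://" not in self.url:
            raise ValueError(f"Not a valid database URL: {self.url}")
        self.validation_runs += 1
        self._validated = True

    @auto_validate_sketch
    def table_names(self) -> List[str]:
        return ["patients"]


source = ToySource("sqlite:///example.db")
source.table_names()
source.table_names()
print(source.validation_runs)  # validation ran once despite two calls
```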
yield_data
def yield_data(self, query: Optional[str] = None, **kwargs: Any) -> Iterator[pandas.core.frame.DataFrame]:
Yields data from the database in partitions from the provided query.
If query is not provided, the query from the datastructure is used.
Arguments
query
: An optional query to use for yielding data. Otherwise the query from the datastructure is used. A query is always provided when this method is called from the Dataset as part of a task.
Raises
ValueError
: If no query is provided and the datastructure has no query either.
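Yielding data in partitions can be sketched with a generator that reads a fixed number of rows at a time. This example uses the standard-library sqlite3 module and yields lists of tuples, whereas the method above yields DataFrames; the yield_partitions function, its default_query parameter (standing in for the datastructure's query), and the example table are hypothetical.

```python
import sqlite3
from typing import Any, Iterator, List, Optional, Tuple


def yield_partitions(
    conn: sqlite3.Connection,
    query: Optional[str] = None,
    default_query: Optional[str] = None,
    partition_size: int = 100,
) -> Iterator[List[Tuple[Any, ...]]]:
    """Hypothetical stand-in for yield_data, yielding lists of row tuples."""
    # Fall back to the default query when none is passed explicitly.
    query = query if query is not None else default_query
    if query is None:
        raise ValueError("No query provided and no default query available")
    cursor = conn.execute(query)
    while True:
        # Fetch at most partition_size rows per iteration.
        rows = cursor.fetchmany(partition_size)
        if not rows:
            break
        yield rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(250)])

sizes = [
    len(part)
    for part in yield_partitions(conn, "SELECT n FROM numbers", partition_size=100)
]
print(sizes)  # [100, 100, 50]
```

Because this is a generator, the ValueError is raised when the first partition is requested rather than when the function is called. With pandas installed, pandas.read_sql with its chunksize argument returns a comparable iterator of DataFrames.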