database_source

Module containing DatabaseSource class.

DatabaseSource class handles loading data stored in a SQL database.

Module

Functions

auto_validate

def auto_validate(f: Callable) -> Callable:

Decorator which validates the database connection.

Classes

DatabaseSource

class DatabaseSource(
    db_conn: Union[DatabaseConnection, str],
    partition_size: int = 100,
    max_row_buffer: int = 500,
    db_conn_kwargs: Optional[Dict[str, Any]] = None,
    data_splitter: Optional[DatasetSplitter] = None,
    seed: Optional[int] = None,
    modifiers: Optional[Dict[str, DataPathModifiers]] = None,
    ignore_cols: Optional[Union[str, Sequence[str]]] = None,
):

Data source for loading data from databases.

This datasource subclasses both MultiTableSource and IterableSource. This means that it can be used to load data from a single table or multiple tables. It also means that it can be used to load data all at once or iteratively. These two capabilities are coupled, however: to load data from multiple tables you must use the iterable functionality, and to load data from a single table you must use the non-iterable functionality.
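
For illustration only, the sketch below constructs a single-table source from a plain connection string, which is accepted because db_conn is typed Union[DatabaseConnection, str]; the import path, URL, and seed value are assumptions rather than part of this API:

    from database_source import DatabaseSource  # package prefix assumed

    # A plain SQLAlchemy-style URL can be passed as db_conn.
    source = DatabaseSource(
        db_conn="postgresql://user:password@localhost:5432/mydb",  # placeholder URL
        partition_size=100,
        seed=42,
    )
    source.validate()   # checks the connection; raises ArgumentError for a bad URL
    source.load_data()  # a single-table source loads all data at once
    df = source.data    # DataFrame; raises DataNotLoadedError if not yet loaded

For a connection exposing multiple tables, the data must instead be consumed iteratively via yield_data (see below).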

Arguments

  • data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
  • ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
  • modifiers: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
  • partition_size: The size of each partition when iterating over the data.
  • seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.

Attributes

  • data: A Dataframe-type object which contains the data.
  • data_splitter: Approach used for splitting the data into training, test, validation.
  • seed: Random number seed. Used for setting random seed for all libraries.

Variables

  • con : sqlalchemy.engine.base.Engine - SQLAlchemy engine.

    Connection options are set to stream results using a server-side cursor where possible (this depends on the database backend's support for the feature), with a maximum client-side row buffer of self.max_row_buffer rows.

  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    Raises: DataNotLoadedError: If the data has not been loaded yet.

  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash.

  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns True if a task is running.
  • iterable : bool - Defines whether the data source is iterable.

    For the database source, it can only be iterable if multi_table is True.

  • multi_table : bool - Attribute to specify whether the datasource is multi-table.

    This returns True if the database connection has multiple tables and no query is provided when creating the datasource.

  • query : Optional[str] - A database query as a string.

    The query is resolved in the following order:

    1. The query specified in the database connection.
    2. The table name specified in the database connection if just 1 table.
    3. The query specified by the datastructure (if multi-table).
    4. The table name specified by the datastructure (if multi-table).
    5. None.
  • table_names : List[str] - Database table names.
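
As a rough illustration of how these variables relate for a connection that exposes several tables and no explicit query (table names are placeholders, not part of this API):

    source.multi_table   # True: several tables and no query were provided
    source.iterable      # True: a multi-table source can only be used iteratively
    source.table_names   # e.g. ["patients", "visits"]  (placeholder names)
    source.query         # resolved in the order listed above; may be None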

Methods


get_column

def get_column(
    self: BaseSource, col_name: str, *args: Any, **kwargs: Any,
) -> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

MultiTableSource.get_column :

Implement this method to get a single column from the dataset.

get_column_names

def get_column_names(
    self, table_name: Optional[str] = None, **kwargs: Any,
) -> Iterable[str]:

Get the column names as an iterable.

Arguments

  • table_name: The name of the table which should be loaded. Only required for multi-table databases.

Returns

A list of the column names for the target table.

Raises

  • ValueError: If the data is multi-table but no table name is provided.
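
A hedged usage sketch for a multi-table connection; the table name is a placeholder:

    # table_name is only needed for multi-table databases;
    # omitting it there raises ValueError.
    columns = source.get_column_names(table_name="patients")  # placeholder table
    print(list(columns))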

get_data

def get_data(
    self,
    table_name: Optional[str] = None,
    sql_query: Optional[str] = None,
    **kwargs: Any,
) -> Optional[pandas.core.frame.DataFrame]:

Loads and returns data from the database.

Arguments

  • sql_query: A SQL query string required for multi-table data sources. This comes from the DataStructure and takes precedence over the table_name.
  • table_name: Table name for multi-table data sources. This comes from the DataStructure and is ignored if sql_query has been provided.

Returns

A DataFrame-type object which contains the data, or None if the data is multi-table.
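
Although this method is normally driven by the DataStructure, a direct call might look like the following sketch:

    # For a single-table source no arguments are needed; multi-table sources
    # return None here and should be consumed via yield_data instead.
    df = source.get_data()
    if df is not None:
        print(df.dtypes)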

get_dtypes

def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) -> _Dtypes:

Inherited from:

MultiTableSource.get_dtypes :

Implement this method to get the columns and column types from the dataset.

get_values

def get_values(
    self, col_names: List[str], table_name: Optional[str] = None, **kwargs: Any,
) -> Dict[str, Iterable[Any]]:

Get distinct values from columns in the database.

Arguments

  • col_names: The list of the columns whose distinct values should be returned.
  • table_name: The name of the table in which the columns exist. Required for multi-table databases.

Returns

The distinct values of the requested columns as a mapping from column name to a series of distinct values.
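
For example (column and table names below are placeholders):

    distinct = source.get_values(
        col_names=["diagnosis", "site"],   # placeholder column names
        table_name="patients",             # required for multi-table databases
    )
    for col, values in distinct.items():
        print(col, list(values))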

load_data

def load_data(self, **kwargs: Any) -> None:

Inherited from:

MultiTableSource.load_data :

Load the data for the datasource.

Raises

  • TypeError: If the data format is not supported.

validate

def validate(self) -> None:

Validate the database connection.

This method does not revalidate the connection if it has already been validated.

Raises

  • ArgumentError: If the database connection is not a valid SQLAlchemy database URL.

yield_data

def yield_data(
    self, query: Optional[str] = None, **kwargs: Any,
) -> Iterator[pandas.core.frame.DataFrame]:

Yields data from the database in partitions from the provided query.

If query is not provided, the query from the datastructure is used.

Arguments

  • query: An optional query to use for yielding data. Otherwise the query from the datastructure is used. A query is always provided when this method is called from the Dataset as part of a task.

Raises

  • ValueError: If no query is provided and the datastructure has no query either.
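
A minimal iteration sketch, assuming a multi-table source and a purely illustrative query string:

    for partition in source.yield_data("SELECT * FROM patients"):  # placeholder query
        # Each partition is a pandas DataFrame sized according to partition_size.
        print(len(partition))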