utils

Utility functions concerning data.

Module

Functions

check_datastructure_schema_compatibility

def check_datastructure_schema_compatibility(datastructure: DataStructure, schema: BitfountSchema, data_identifier: Optional[str] = None) -> Tuple[DataStructureSchemaCompatibility, List[str]]:

Compare a datastructure from a task and a data schema for compatibility.

Currently, this checks that requested columns exist in the target schema.

Query-based datastructures are not supported.

Arguments

  • datastructure: The datastructure for the task.
  • schema: The overall schema for the pod in question.
  • data_identifier: If the datastructure specifies multiple pods then the data identifier is needed to identify which part of the datastructure refers to the pod in question.

Returns

A tuple of the compatibility level (a DataStructureSchemaCompatibility value) and a list of strings describing every compatibility warning or issue found.
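The column check described above can be sketched without the library. This is a minimal illustration, not the actual implementation: `Compatibility` and `check_columns` are hypothetical stand-ins, plain column-name lists replace the DataStructure and BitfountSchema objects, and mapping a missing column to INCOMPATIBLE is an assumption.

```python
from enum import Enum, auto
from typing import Iterable, List, Tuple


class Compatibility(Enum):
    """Hypothetical stand-in for DataStructureSchemaCompatibility."""

    COMPATIBLE = auto()
    WARNING = auto()
    INCOMPATIBLE = auto()
    ERROR = auto()


def check_columns(
    requested: Iterable[str], schema_columns: Iterable[str]
) -> Tuple[Compatibility, List[str]]:
    """Report requested columns that are absent from the schema."""
    try:
        available = set(schema_columns)
        missing = [col for col in requested if col not in available]
    except TypeError as exc:
        # The check itself failed (e.g. unhashable column names).
        return Compatibility.ERROR, [f"Could not check compatibility: {exc}"]
    if missing:
        # Assumption: any missing column makes the pair incompatible.
        return (
            Compatibility.INCOMPATIBLE,
            [f"Column '{col}' is missing from the schema." for col in missing],
        )
    return Compatibility.COMPATIBLE, []


level, issues = check_columns(["age", "height"], ["age", "weight"])
```

Returning the issues alongside the level lets a caller log every problem at once rather than stopping at the first missing column.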

Classes

DataStructureSchemaCompatibility

class DataStructureSchemaCompatibility(value, names=None, *, module=None, qualname=None, type=None, start=1):

The level of compatibility between a datastructure and a pod/table schema.

Denotes 4 different levels of compatibility:

  • COMPATIBLE: Compatible to our knowledge.
  • WARNING: Might be compatible but there might still be runtime incompatibility issues.
  • INCOMPATIBLE: Clearly incompatible.
  • ERROR: An error occurred whilst trying to check compatibility.

Ancestors

Variables

  • static COMPATIBLE
  • static ERROR
  • static INCOMPATIBLE
  • static WARNING

DatabaseConnection

class DatabaseConnection(con: Union[str, sqlalchemy.engine.base.Engine], db_schema: Optional[str] = None, query: Optional[str] = None, table_names: Optional[List[str]] = None):

Encapsulates database connection information for a BaseSource.

If a query is provided, or if table_names contains only one table, the database is queried for the data. The database connection is then closed and the resulting DataFrame is stored and used in the BaseSource.

danger

If you are creating a multi-table Pod, ensure that the connection you provide only has access to the schemas and tables you wish to share, and that this access has suitably restricted permissions, i.e. SELECT only.

table_names limits the Pod schema to only those tables you specify, but it does not prevent a Modeller from accessing other tables in the schema, or indeed tables in other schemas, by guessing their names.

If only a single table is provided or a query is provided to combine multiple tables into one table, the Modeller will have no access to the database.

Arguments

  • con: A database URI provided as a string or a SQLAlchemy Engine. This should include the database name, user, password, host, port, etc.
  • db_schema: The database schema to use. If not provided, the default schema will be used.
  • query: The SQL query to be executed as a string.
  • table_names: Name(s) of SQL table(s) in database.

Attributes

  • multi_table: Whether or not the database connection is for multiple tables.

Raises

  • DatabaseMissingTableError: If db_schema (or the default schema if not provided) does not contain any tables, or if any of the specified tables can't be found in the schema.
  • DatabaseSchemaNotFoundError: If db_schema is provided but can't be found in the database.
  • DatabaseModificationError: If query is provided and contains an 'INTO' clause.
  • ValueError: If both query and table_names are provided.
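Two of the rules listed above, the INTO check and the query/table_names exclusivity, can be sketched as plain argument checks. This is an illustration under assumptions, not the library's code: `check_connection_args` and `ModificationError` are hypothetical names, and the word-boundary regex is a deliberately simple way to spot an INTO clause (it would also flag the word inside a string literal).

```python
import re
from typing import List, Optional


class ModificationError(Exception):
    """Hypothetical stand-in for DatabaseModificationError."""


def check_connection_args(
    query: Optional[str] = None, table_names: Optional[List[str]] = None
) -> None:
    """Apply two of the documented argument rules before any database access."""
    if query is not None and table_names is not None:
        raise ValueError("Provide either query or table_names, not both.")
    # Assumption: a case-insensitive whole-word match is enough to detect INTO.
    if query is not None and re.search(r"\bINTO\b", query, flags=re.IGNORECASE):
        raise ModificationError("Queries containing an INTO clause are not allowed.")


# A plain SELECT passes; identifiers that merely start with "into" are not flagged.
check_connection_args(query="SELECT into_col FROM patients")
check_connection_args(table_names=["patients", "visits"])
```

Checks like these do not replace database-level permissions: the connection itself should still be limited to SELECT.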

Variables

  • static con : Union[str, sqlalchemy.engine.base.Engine]
  • static db_schema : Optional[str]
  • static query : Optional[str]
  • static table_names : Optional[List[str]]
  • validated : bool - Whether or not the database connection has been validated.

Methods


validate

def validate(self) -> None:

Validates the database connection.

This method is called by the corresponding validate method on the DatabaseSource which wraps the DatabaseConnection. The reason this does not happen on instantiation is that the Pod is responsible for validating the connection so that if there is an error, it is raised in the scope of the Pod's error handling hooks.

note

This method does not revalidate the connection if it has already been validated.
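The deferred, run-once behaviour described above can be sketched with a simple flag. `LazyValidated`, `_run_checks`, and `checks_run` are hypothetical names used only for illustration; the real checks (schema and table lookups) are replaced by a counter.

```python
class LazyValidated:
    """Hypothetical stand-in showing deferred, run-once validation."""

    def __init__(self) -> None:
        # Nothing is checked at construction time; the owner decides when.
        self.validated = False
        self.checks_run = 0

    def _run_checks(self) -> None:
        # Placeholder for the real checks; here it only counts invocations.
        self.checks_run += 1

    def validate(self) -> None:
        if self.validated:
            return  # Already validated: do not repeat the checks.
        self._run_checks()
        self.validated = True


conn = LazyValidated()
conn.validate()
conn.validate()  # Second call is a no-op.
```

Setting the flag only after the checks succeed means a failed validation can be retried by calling validate again.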