utils
Utility functions concerning data.
Module
Functions
check_datastructure_schema_compatibility
def check_datastructure_schema_compatibility(datastructure: DataStructure, schema: BitfountSchema, data_identifier: Optional[str] = None) -> Tuple[DataStructureSchemaCompatibility, List[str]]:
Compare a datastructure from a task and a data schema for compatibility.
Currently, this checks that requested columns exist in the target schema.
Query-based datastructures are not supported.
Arguments
datastructure
: The datastructure for the task.
schema
: The overall schema for the pod in question.
data_identifier
: If the datastructure specifies multiple pods, the data identifier is needed to identify which part of the datastructure refers to the pod in question.
Returns
A tuple of the compatibility level (a DataStructureSchemaCompatibility value) and a list of strings describing every compatibility warning or issue found.
Classes
DataStructureSchemaCompatibility
class DataStructureSchemaCompatibility(value, names=None, *, module=None, qualname=None, type=None, start=1):
The level of compatibility between a datastructure and a pod/table schema.
Denotes 4 different levels of compatibility:
- COMPATIBLE: Compatible to our knowledge.
- WARNING: Might be compatible but there might still be runtime incompatibility issues.
- INCOMPATIBLE: Clearly incompatible.
- ERROR: An error occurred whilst trying to check compatibility.
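The column-existence check and the four levels described above can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: `Compatibility` is a local stand-in for DataStructureSchemaCompatibility (its member values are arbitrary), and `check_columns` is a hypothetical helper that works on plain column names rather than on DataStructure and BitfountSchema objects. It is not Bitfount's implementation.

```python
from enum import Enum
from typing import List, Optional, Set, Tuple


class Compatibility(Enum):
    """Hypothetical stand-in mirroring the four documented levels."""

    COMPATIBLE = 1
    WARNING = 2
    INCOMPATIBLE = 3
    ERROR = 4


def check_columns(
    requested: List[str], schema_columns: Optional[Set[str]]
) -> Tuple[Compatibility, List[str]]:
    """Return a compatibility level plus one message per issue found."""
    try:
        # A requested column that the schema does not contain is an issue.
        missing = [col for col in requested if col not in schema_columns]
    except TypeError as exc:
        # The check itself failed, e.g. no schema columns were available.
        return Compatibility.ERROR, [f"Could not check columns: {exc}"]
    if missing:
        return (
            Compatibility.INCOMPATIBLE,
            [f"Column '{col}' not found in schema" for col in missing],
        )
    return Compatibility.COMPATIBLE, []
```

A caller would typically unpack the tuple, stop on INCOMPATIBLE or ERROR, and log the returned messages. The sketch never returns WARNING; a fuller check might use that level for cases it cannot decide statically.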
DatabaseConnection
class DatabaseConnection(con: Union[str, sqlalchemy.engine.base.Engine], db_schema: Optional[str] = None, query: Optional[str] = None, table_names: Optional[List[str]] = None):
Encapsulates database connection information for a BaseSource.
If a query is provided or if table_names contains only one table, the database will be queried for the data, after which the database connection will be closed and the resulting DataFrame will be used and stored in the BaseSource.
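The "query, then close" behaviour described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not Bitfount's code: it uses `sqlite3` instead of a SQLAlchemy connection, a plain list of rows in place of a DataFrame, and a hypothetical helper named `load_then_close`.

```python
import sqlite3
from typing import List, Tuple


def load_then_close(con: sqlite3.Connection, query: str) -> List[Tuple]:
    """Run the query, keep the rows in memory, and always close the connection."""
    try:
        return con.execute(query).fetchall()
    finally:
        con.close()


# Build a throwaway in-memory database to query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE patients (id INTEGER, age INTEGER)")
con.executemany("INSERT INTO patients VALUES (?, ?)", [(1, 34), (2, 51)])

# The rows remain usable after the connection has been closed.
rows = load_then_close(con, "SELECT id, age FROM patients ORDER BY id")
```

Because the data is materialised before the connection is closed, nothing downstream needs access to the database itself.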
If you are creating a multi-table Pod, ensure that the connection you provide only has access to the schemas and tables you wish to share and that this access has suitably restricted permissions, i.e. SELECT only.
table_names limits the Pod schema to only those tables you specify, but it does not prevent a Modeller from accessing other tables in the schema, or indeed tables in other schemas, by guessing their names.
If only a single table is provided or a query is provided to combine multiple tables into one table, the Modeller will have no access to the database.
Arguments
con
: A database URI provided as a string or a SQLAlchemy Engine. This should include the database name, user, password, host, port, etc.
db_schema
: The database schema to use. If not provided, the default schema will be used.
query
: The SQL query to be executed as a string.
table_names
: Name(s) of SQL table(s) in database.
Attributes
multi_table
: Whether or not the database connection is for multiple tables.
Raises
DatabaseMissingTableError
: If db_schema (or the default schema if not provided) does not contain any tables or any of the specified tables can't be found in the schema.
DatabaseSchemaNotFoundError
: If db_schema is provided but can't be found in the database.
DatabaseModificationError
: If query is provided and contains an 'INTO' clause.
ValueError
: If both query and table_names are provided.
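Two of the conditions listed above depend only on the arguments, so they can be sketched without a database. The snippet below is a hedged illustration, not the library's implementation: `ModificationError` is a local stand-in for DatabaseModificationError, `check_connection_args` is a hypothetical helper, and the regular-expression test for INTO is an assumption about how such a clause might be detected.

```python
import re
from typing import List, Optional


class ModificationError(Exception):
    """Hypothetical stand-in for DatabaseModificationError."""


def check_connection_args(
    query: Optional[str] = None, table_names: Optional[List[str]] = None
) -> None:
    """Raise if the arguments violate the documented rules."""
    # query and table_names are mutually exclusive.
    if query is not None and table_names is not None:
        raise ValueError("Provide either query or table_names, not both.")
    # Reject queries that would write results into another table.
    # Word boundaries avoid matching identifiers such as "info".
    if query is not None and re.search(r"\bINTO\b", query, flags=re.IGNORECASE):
        raise ModificationError("Query must not contain an INTO clause.")
```

A keyword search like this is deliberately naive: it would also reject a query that merely contains the word INTO inside a string literal, so a real check may need proper SQL parsing.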
Variables
- static con : Union[str, sqlalchemy.engine.base.Engine]
- static db_schema : Optional[str]
- static query : Optional[str]
- static table_names : Optional[List[str]]
- validated : bool
  Whether or not the database connection has been validated.
Methods
validate
def validate(self) -> None:
Validates the database connection.
This method is called by the corresponding validate method on the DatabaseSource which wraps the DatabaseConnection. The reason this does not happen on instantiation is that the Pod is responsible for validating the connection so that if there is an error, it is raised in the scope of the Pod's error handling hooks.
This method does not revalidate the connection if it has already been validated.
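The deferred, run-once validation described above follows a common pattern, sketched here with a hypothetical class. `LazyValidatedConnection`, `_run_checks` and `checks_run` are illustrative names only; the real DatabaseConnection performs database-level checks that this sketch replaces with a trivial one.

```python
class LazyValidatedConnection:
    """Hypothetical stand-in showing validation deferred until validate()."""

    def __init__(self, con: str) -> None:
        # No checks run here, so construction never raises.
        self.con = con
        self.validated = False
        self.checks_run = 0  # Counts how often the checks actually ran.

    def _run_checks(self) -> None:
        # Placeholder for the real schema and table checks.
        if not self.con:
            raise ValueError("Empty connection string.")
        self.checks_run += 1

    def validate(self) -> None:
        """Validate once; later calls return immediately."""
        if self.validated:
            return
        self._run_checks()
        self.validated = True
```

Keeping the checks out of `__init__` lets the caller decide where a failure surfaces, which matches the stated reason for having the Pod trigger validation.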