Skip to main content

dicom_ophthalmology_source

Module containing DICOMOphthalmologySource class.

DICOMOphthalmologySource class handles loading of ophthalmological DICOM data.

Classes

CZMError

class CZMError(*args, **kwargs):

Errors related to processing of Zeiss DICOMs.

Ancestors

DICOMOphthalmologyCSVColumns

class DICOMOphthalmologyCSVColumns(    oct_column: Optional[str] = 'oct', slo_column: Optional[str] = None,):

Arguments for ophthalmology columns in the csv.

Arguments

  • oct_column: The column pointing to OCT files in the filename. Refers to Optical Coherence Tomography (OCT), typically these are a series of 2D images used to show a cross-section of the tissue layers in the retina (specifically the macula), combined to form a 3D image. Defaults to 'oct'.
  • slo_column: The string pointing to SLO files in the filename. Refers to Scanning Laser Ophthalmoscope (SLO), typically referred to as an 'en-face' image of the retina (specifically the macula). Defaults to None.

Variables

  • static oct_column : Optional[str]
  • static slo_column : Optional[str]

DICOMOphthalmologySource

class DICOMOphthalmologySource(    path: Union[os.PathLike, str],    dicom_ophthalmology_csv_columns: Optional[Union[DICOMOphthalmologyCSVColumns, _DICOMOphthalmologyCSVColumnsTD]] = None,    file_extension: Optional[str] = '.dcm',    images_only: bool = True,    data_cache: Optional[DataPersister] = None,    infer_class_labels_from_filepaths: bool = False,    output_path: Optional[Union[os.PathLike, str]] = None,    iterable: bool = True,    fast_load: bool = True,    strict: bool = False,    cache_images: bool = False,    file_creation_min_date: Optional[Union[Date, DateTD]] = None,    file_modification_min_date: Optional[Union[Date, DateTD]] = None,    file_creation_max_date: Optional[Union[Date, DateTD]] = None,    file_modification_max_date: Optional[Union[Date, DateTD]] = None,    min_file_size: Optional[float] = None,    max_file_size: Optional[float] = None,    partition_size: int = 16,    data_splitter: Optional[DatasetSplitter] = None,    seed: Optional[int] = None,    modifiers: Optional[dict[str, DataPathModifiers]] = None,    ignore_cols: Optional[Union[str, Sequence[str]]] = None,    ophthalmology_args: Optional[Union[OphthalmologyDataSourceArgs, _OphthalmologyDataSourceArgsTD]] = None,):

Data source for loading DICOM ophthalmology files.

Arguments

  • ****kwargs**: Keyword arguments passed to the parent base classes.
  • cache_images: Whether to cache images in the file system. Defaults to False. This is ignored if fast_load is True.
  • data_cache: A DataPersister instance to use for data caching.
  • data_splitter: Approach used for splitting the data into training, test, validation. Defaults to None.
  • dicom_ophthalmology_csv_columns: Which columns in the CSV, if using a CSV, relate to the OCT/SLO files.
  • fast_load: Whether the data will be loaded in fast mode. This is used to determine whether the data will be iterated over during set up for schema generation and splitting (where necessary). Only relevant if iterable is True, otherwise it is ignored. Defaults to True.
  • file_creation_max_date: The newest possible date to consider for file creation. If None, this filter will not be applied. Defaults to None.
  • file_creation_min_date: The oldest possible date to consider for file creation. If None, this filter will not be applied. Defaults to None.
  • file_extension: The file extension of the DICOM files. Defaults to '.dcm'.
  • file_modification_max_date: The newest possible date to consider for file modification. If None, this filter will not be applied. Defaults to None.
  • file_modification_min_date: The oldest possible date to consider for file modification. If None, this filter will not be applied. Defaults to None.
  • ignore_cols: Column/list of columns to be ignored from the data. Defaults to None.
  • images_only: If True, only dicom files containing image data will be loaded. If the file does not contain any image data, or it does but there was an error loading or saving the image(s), the whole file will be skipped. Defaults to True.
  • infer_class_labels_from_filepaths: Whether class labels should be added to the data based on the filepath of the files. Defaults to the first directory within self.path, but can go a level deeper if the datasplitter is provided with infer_data_split_labels set to true
  • iterable: Whether the data source is iterable. This is used to determine whether the data source can be used in a streaming context during a task. Defaults to True.
  • max_file_size: The maximum file size in megabytes to consider. If None, all files will be considered. Defaults to None.
  • min_file_size: The minimum file size in megabytes to consider. If None, all files will be considered. Defaults to None.
  • modifiers: Dictionary used for modifying paths/ extensions in the dataframe. Defaults to None.
  • ophthalmology_args: Arguments for ophthalmology modality data.
  • output_path: The path where to save intermediary output files. Defaults to 'preprocessed/'.
  • partition_size: The size of each partition when iterating over the data.
  • path: The path to the directory containing the DICOM files or to a CSV containing paths to the DICOM files.
  • seed: Random number seed. Used for setting random seed for all libraries. Defaults to None.
  • strict: Whether File loading should be strictly done on files with the explicit file extension provided. If set to True will only load those files in the dataset. Otherwise, it will scan the given path for files of the same type as the provided file extension. Only relevant if file_extension is provided. Defaults to False.

Attributes

  • data: A Dataframe-type object which contains the data.
  • data_splitter: Approach used for splitting the data into training, test, validation.
  • seed: Random number seed. Used for setting random seed for all libraries.

Raises

  • ValueError: If iterable is False or fast_load is False or cache_images is True.

Ancestors

Variables

  • data : pandas.core.frame.DataFrame - A property containing the underlying dataframe if the data has been loaded.

    If the datasource is iterable, this will raise an exception.

    Raises: IterableDataSourceError: If the datasource is set to iterable. DataNotLoadedError: If the data has not been loaded yet.

  • drop_row_on_missing_slo : bool - Returns whether to drop the OCT row if the corresponding SLO file is missing.

    This is only relevant if match_slo is True.

  • file_names : list - Returns a list of file names in the directory.

    These file paths will be absolute paths, regardless of their origin.

  • hash : str - The hash associated with this BaseSource.

    This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.

    Returns: The hexdigest of the DataFrame hash.

  • is_initialised : bool - Checks if BaseSource was initialised.
  • is_task_running : bool - Returns True if a task is running.
  • iterable : bool - Defines whether the data source is iterable.

    This is defined by the user when instantiating the class.

  • match_slo : bool - Returns whether the SLO files should be matched to the OCT files.
  • modality : Literal['OCT', 'SLO', None] - Returns the modality of the data source.
  • path : pathlib.Path - Resolved absolute path to data.

    Provides a consistent version of the path provided by the user which should work throughout regardless of operating system and of directory structure.

  • selected_file_names : list - Returns a list of selected file names.

    Selected file names are affected by the selected_file_names_override and new_file_names_only attributes.

  • slo_file_names : list - Returns a list of SLO file names in the directory.

    These file paths will be absolute paths, regardless of their origin.

Static methods


get_num_workers

def get_num_workers(file_names: list[str])> int:

Inherited from:

DICOMSource.get_num_workers :

Gets the number of workers to use for multiprocessing.

Ensures that the number of workers is at least 1 and at most equal to MAX_NUM_MULTIPROCESSING_WORKERS. If the number of files is less than MAX_NUM_MULTIPROCESSING_WORKERS, then we use the number of files as the number of workers. Unless the number of machine cores is also less than MAX_NUM_MULTIPROCESSING_WORKERS, in which case we use the lower of the two.

Arguments

  • file_names: The list of file names to load.

Returns The number of workers to use for multiprocessing.

Methods


clear_file_names_cache

def clear_file_names_cache(self)> None:

Inherited from:

DICOMSource.clear_file_names_cache :

Clears the list of selected file names.

This allows the datasource to pick up any new files that have been added to the directory since the last time it was cached.

file_names_iter

def file_names_iter(    self, as_strs: bool = False,)> Union[collections.abc.Iterator[pathlib.Path], collections.abc.Iterator[str]]:

Inherited from:

DICOMSource.file_names_iter :

Iterate over files in a directory, yielding those that match the criteria.

Arguments

  • as_strs: By default the files yielded will be yielded as Path objects. If this is True, yield them as strings instead.

get_column

def get_column(    self: BaseSource, col_name: str, *args: Any, **kwargs: Any,)> Union[numpy.ndarray, pandas.core.series.Series]:

Inherited from:

FileSystemIterableSourceInferrable.get_column :

Loads and returns single column from the dataset.

Arguments

  • col_name: The name of the column which should be loaded.
  • **kwargs: Additional keyword arguments to pass to the load_data method if the data is stale.

Returns The column request as a series.

get_column_names

def get_column_names(self, **kwargs: Any)> collections.abc.Iterable:

Inherited from:

DICOMSource.get_column_names :

Get column names for fast-load datasource.

get_data

def get_data(self, **kwargs: Any)> Optional[pandas.core.frame.DataFrame]:

Inherited from:

DICOMSource.get_data :

This method must return None if the data source is iterable.

get_dtypes

def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any)> _Dtypes:

Inherited from:

FileSystemIterableSourceInferrable.get_dtypes :

Get dtypes for iterable datasource.

get_project_db_sqlite_columns

def get_project_db_sqlite_columns(self)> list:

Inherited from:

DICOMSource.get_project_db_sqlite_columns :

Returns the required columns to identify a data point.

get_project_db_sqlite_create_table_query

def get_project_db_sqlite_create_table_query(self)> str:

Inherited from:

DICOMSource.get_project_db_sqlite_create_table_query :

Returns the required columns and types to identify a data point.

The file name is used as the primary key and the last modified date is used to determine if the file has been updated since the last time it was processed. If there is a conflict on the file name, the row is replaced with the new data to ensure that the last modified date is always up to date.

get_values

def get_values(self, col_names: list[str], **kwargs: Any)> dict:

Inherited from:

DICOMSource.get_values :

Get distinct values from columns in the dataset.

Arguments

  • col_names: The list of the columns whose distinct values should be returned.
  • **kwargs: Additional keyword arguments to pass to the load_data method if the data is stale.

Returns The distinct values of the requested column as a mapping from col name to a series of distinct values.

load_data

def load_data(self, **kwargs: Any)> None:

Inherited from:

DICOMSource.load_data :

Load the data for the datasource.

Raises

  • TypeError: If data format is not supported.

partition

def partition(    self, iterable: Sequence, partition_size: int = 1,)> collections.abc.Iterable:

Inherited from:

DICOMSource.partition :

Takes an iterable and yields partitions of size partition_size.

The final partition may be less than size partition_size due to the variable length of the iterable.

process_sequence_field

def process_sequence_field(self, elem: _DICOMSequenceField)> Optional[dict]:

Process 'Shared Functional Groups Sequence' sequence field.

This sequence field contains the slice thickness and pixel spacing data. Other fields are ignored.

Arguments

  • elem: The DICOM data element.

Returns A dictionary containing the processed sequence data or None.

use_file_multiprocessing

def use_file_multiprocessing(self, file_names: list[str])> bool:

Inherited from:

DICOMSource.use_file_multiprocessing :

Check if file multiprocessing should be used.

Returns True if file multiprocessing has been enabled by the environment variable and the number of workers would be greater than 1, otherwise False. There is no need to use file multiprocessing if we are just going to use one worker - it would be slower than just loading the data in the main process.

Returns True if file multiprocessing should be used, otherwise False.

yield_data

def yield_data(    self,    file_names: Optional[list[str]] = None,    use_cache: bool = True,    partition_size: Optional[int] = None,    **kwargs: Any,)> collections.abc.Iterator:

Inherited from:

DICOMSource.yield_data :

Yields data in batches from files that match the given file names.

Arguments

  • file_names: An optional list of file names to use for yielding data. Otherwise, all files that have already been found will be used. file_names is always provided when this method is called from the Dataset as part of a task.
  • use_cache: Whether the cache should be used to retrieve data for these files. Note that cached data may have some elements, particularly image-related fields such as image data or file paths, replaced with placeholder values when stored in the cache. If datacache is set on the instance, data will be _set in the cache, regardless of this argument.
  • partition_size: The number of file names to load in each iteration.
  • ****kwargs**: Additional keyword arguments.

Raises

  • ValueError: If no file names provided and no files have been found.

MissingDICOMFieldError

class MissingDICOMFieldError(*args, **kwargs):

Error for DICOMSources.

We raise this if an expected DICOM field is missing.

Ancestors