nccidxclean.clean subpackage#

Submodules#

nccidxclean.clean.binary_and_cat module#

Parse and clean binary and categorical columns.

nccidxclean.clean.binary_and_cat.binarise_lung_csv(patients_df)[source]#

Converts lung disease to a binary value plus an ‘unknown’ category, because some of the original categories had very few values.

Parameters:

patients_df (pd.DataFrame) – dataframe of clinical data

Returns:

dataframe of clinical data

Return type:

pd.DataFrame

nccidxclean.clean.binary_and_cat.parse_binary_and_cat(patients_df)[source]#

Parses the binary and categorical columns. The original binary and categorical functions have been merged and the following added/changed:

  • code to allow for unknown values to be kept, e.g. binary field with values of [“0”, “1”] now has possible values of [“0”, “1”, “2”] where ‘2’ is unknown.

  • code to handle multiple Lung Diseases added (e.g. 2,4 = COPD and asthma) and save any additional lung disease info in a new field.

  • code to turn multiple cvs entries (e.g. ‘1,2’) to 4 (i.e. ‘multiple’), rather than extracting the first one.

  • handles ‘Yes’ in diabetes field

  • handles discrepancies for Sandwell & Birmingham in pmh_hypertension, pmh_cvs_disease, and pmh_lung_disease.
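The value handling listed above can be illustrated with a minimal sketch. It works on plain Python values rather than a DataFrame, and it is not the package's implementation: the helper names `parse_binary` and `parse_cvs`, and the accepted spellings `no` and `unknown`, are assumptions made for illustration. Only the codes 2 (unknown) and 4 (multiple) and the ‘Yes’ entry come from the description above.

```python
from typing import Optional

UNKNOWN = 2   # documented code for 'unknown' in a binary field
MULTIPLE = 4  # documented code for several cvs entries, e.g. '1,2'


def parse_binary(raw: Optional[str]) -> Optional[int]:
    """Map a raw binary entry to 0, 1 or 2; missing or unrecognised stays None."""
    if raw is None:
        return None
    text = str(raw).strip().lower()
    if text in {"1", "yes"}:        # 'Yes' appears in the diabetes field
        return 1
    if text in {"0", "no"}:         # 'no' is an assumed spelling
        return 0
    if text in {"2", "unknown"}:    # 'unknown' is an assumed spelling
        return UNKNOWN
    return None


def parse_cvs(raw: Optional[str]) -> Optional[int]:
    """Collapse comma-separated cvs codes to MULTIPLE instead of keeping the first."""
    if raw is None:
        return None
    codes = [c.strip() for c in str(raw).split(",") if c.strip()]
    if not codes:
        return None
    if len(codes) > 1:
        return MULTIPLE
    return int(codes[0]) if codes[0].isdigit() else None
```

Keeping ‘unknown’ as its own code, rather than folding it into missing, lets downstream users decide whether the two should be treated alike.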

Parameters:

patients_df (pd.DataFrame) – dataframe of clinical data

Returns:

dataframe of clinical data

Return type:

pd.DataFrame

nccidxclean.clean.dicts_and_maps module#

Contains common lists and dictionaries for use in the package.

nccidxclean.clean.enrich_with_dcm module#

nccidxclean.clean.enrich_with_dcm.add_dicom_update(patients, images)[source]#

Fills in missing values for Sex and Age from imaging dicom headers. The updated data is saved in the columns ‘sex_update’ and ‘age_update’.

Parameters:
  • patients (pd.DataFrame) – The patient clinical DataFrame that needs filling in.

  • images (Collection[pd.DataFrame]) – List/tuple of image metadata dataframes, e.g., [xrays, cts, mris].


Returns:

Patient data with updated sex and age information completed using the image metadata.

Return type:

pd.DataFrame
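As a rough illustration of the fill-in step, the sketch below uses lists of dictionaries in place of DataFrames. The key names `Pseudonym`, `sex` and `age`, and the helper names, are assumptions; `PatientSex` and `PatientAge` are the standard DICOM attribute keywords, and the `nnnY` pattern is the DICOM Age String form for years. The real function may resolve conflicting image metadata differently.

```python
import re
from typing import Dict, List, Optional

Row = Dict[str, object]


def parse_dicom_age(age: object) -> Optional[int]:
    """Convert a DICOM Age String such as '045Y' to whole years, else None."""
    if not isinstance(age, str):
        return None
    match = re.fullmatch(r"(\d{3})Y", age.strip())
    return int(match.group(1)) if match else None


def add_updates(patients: List[Row], images: List[List[Row]]) -> List[Row]:
    """Add 'sex_update' and 'age_update', filled from image metadata when missing."""
    lookup: Dict[object, Dict[str, object]] = {}
    for table in images:                      # e.g. [xrays, cts, mris]
        for row in table:
            entry = lookup.setdefault(row["Pseudonym"], {})
            if row.get("PatientSex") in {"F", "M"}:
                entry.setdefault("sex", row["PatientSex"])   # first value wins
            age = parse_dicom_age(row.get("PatientAge"))
            if age is not None:
                entry.setdefault("age", age)
    updated = []
    for patient in patients:
        patient = dict(patient)               # leave the input untouched
        found = lookup.get(patient["Pseudonym"], {})
        for field in ("sex", "age"):
            current = patient.get(field)
            patient[f"{field}_update"] = current if current is not None else found.get(field)
        updated.append(patient)
    return updated
```

Writing to separate `*_update` keys, as the function description states, keeps the original clinical values available for comparison.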

nccidxclean.clean.enrich_with_dcm.extract_dcm_metadata(base_dir, img_subdirs)[source]#

Extracts metadata from DICOM files.

Parameters:
  • base_dir (Union[str, os.PathLike]) – base directory - should have same structure as original S3 bucket

  • img_subdirs (Dict[str, Union[str, os.PathLike]]) – dictionary of modalities and their subdirectory

Returns:

dictionary of the modality and a dataframe of the extracted metadata

Return type:

Dict[str, pd.DataFrame]

nccidxclean.clean.enrich_with_dcm.save_dcm_updates(patients)[source]#

Encrypts and saves the updates to the clinical data from the dicoms involving the age and sex features.

Parameters:

patients (pd.DataFrame) – cleaned + dicom enriched dataframe

Returns:

None

Return type:

NoReturn

nccidxclean.clean.ethnicity_and_sex module#

Remaps sex and ethnicity.

nccidxclean.clean.ethnicity_and_sex.remap_ethnicity(patients_df)[source]#

Remap ethnicities to standardised groupings.

Parameters:

patients_df (pd.DataFrame) – dataframe of clinical data

Returns:

dataframe of clinical data with ethnicities remapped

Return type:

pd.DataFrame

nccidxclean.clean.ethnicity_and_sex.remap_sex(patients_df)[source]#

The original NCCID code remaps sex to F/M/Unknown and converts any missing values to Unknown.

This function differs as follows:

  • Missing values are left missing, to be consistent with the other fields. The developer can choose whether to treat unknown and np.nan as equivalent.

  • Converts the Sandwell sex codes to match the schema. For female patients, values of 0, 1 and 2 are reported for Sandwell and West Birmingham (2 = Female, 1 = Male). To match the other centres (0 = Female, 1 = Male), values were corrected based on the information from the json files.

Parameters:

patients_df (pd.DataFrame) – dataframe of clinical data

Returns:

dataframe of clinical data with sex remapped

Return type:

pd.DataFrame
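The centre-specific coding can be sketched as a lookup that depends on the submitting centre. This is only an illustration: the centre label string and the helper name are assumptions, and a Sandwell value of 0 is left missing here because the package resolved those cases from the json files, which a standalone sketch cannot reproduce.

```python
from typing import Optional

SANDWELL = "Sandwell and West Birmingham"   # assumed centre label

DEFAULT_CODES = {0: "F", 1: "M"}            # schema used by the other centres
SANDWELL_CODES = {2: "F", 1: "M"}           # coding reported for Sandwell


def remap_sex_code(code: Optional[int], centre: str) -> Optional[str]:
    """Return 'F' or 'M' for a raw sex code; missing or ambiguous codes stay None."""
    if code is None:
        return None                         # missing is left missing
    codes = SANDWELL_CODES if centre == SANDWELL else DEFAULT_CODES
    return codes.get(code)
```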

nccidxclean.clean.fix_headers_and_order module#

Fixes known mistakes in column headers, places into logical order of submission spreadsheet, and selects columns for output.

nccidxclean.clean.fix_headers_and_order.fix_headers(patients_df)[source]#

Fixes known mistakes in column headers. Is always run last as it acts on the cleaned columns.

Orders columns in same order as submission script.

Originally, the date of the swab for the ‘negative’ patients was saved in the ‘swabdate’ field, however, this does not make it immediately clear that this is only for negative patients. Here ‘swabdate’ is renamed to ‘negative_swab_date’ for that reason.

Parameters:

patients_df (pd.DataFrame) – dataframe of clinical data

Returns:

dataframe of clinical data

Return type:

pd.DataFrame

nccidxclean.clean.fix_headers_and_order.order_columns(patients_df)[source]#

Put columns in an order consistent with the submission spreadsheet with original columns followed by cleaned columns. This assists in spotting systematic errors.

Parameters:

patients_df (pd.DataFrame) – dataframe of clinical data

Returns:

dataframe of clinical data

Return type:

pd.DataFrame

nccidxclean.clean.fix_headers_and_order.select_output_cols(patients_df, cols)[source]#

Selects columns to be output.

Parameters:
  • patients_df (pd.DataFrame) – dataframe of all clinical data, both original and cleaned

  • cols (str) – columns to return, options are:

      ‘spo2_imputed_cleaned_only’: returns the cleaned columns only, with spo2 imputed from pao2.

      ‘spo2_imputed_with_original’: returns the original and cleaned columns, with spo2 and pao2.

      ‘o2_split_cleaned_only’: returns the cleaned columns only, with spo2 and pao2.

      ‘o2_split_with_original’: returns the original and cleaned columns, with spo2 and pao2.

      ‘all_cleaned_only’: returns all possible cleaned columns.

      ‘all_with_original’: returns all possible original and cleaned columns.

Returns:

dataframe containing only columns requested

Return type:

pd.DataFrame

nccidxclean.clean.geh_col_shift module#

Corrects known error for George Eliot Hospital where data is shifted one-column-to-the-left.

nccidxclean.clean.geh_col_shift.column_shift(patients_df)[source]#

Originally, the majority of columns for George Eliot Hospital (GEH) had their data entered in the column one-to-the-left when placed in the order of the submission spreadsheet. Most have now been corrected but a small number remain, which are fixed by this function.

All of these errors have a date entered in the ‘Current NSAID used’, ‘Troponin T’ and ‘Final COVID Status’ columns allowing them to be identified and corrected.

Parameters:

patients_df (pd.DataFrame) – dataframe of clinical data

Returns:

dataframe of clinical data

Return type:

pd.DataFrame

nccidxclean.clean.inferences module#

Performs logical inferences on the data to reduce missing data.

nccidxclean.clean.inferences.inferences(patients_df, inference_pipeline=('update_final_covid_status', 'death_inferences', 'last_known_alive_inferences', 'itu_and_intubation_inferences', 'ckd_inferences', 'calculate_pf_ratio'))[source]#

Applies functions which perform inferences. Pipeline of inference functions may be modified as desired. All functions are applied by default.

Parameters:
  • patients_df (pd.DataFrame) – dataframe of clinical data

  • inference_pipeline (Collection) – inference functions to apply

Returns:

dataframe of clinical data

Return type:

pd.DataFrame
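The configurable pipeline can be pictured as a registry of named steps applied in order. The sketch below is an assumption about the general pattern, not the package's code: the two step bodies are simplified stand-ins that only borrow their names from the default pipeline, and the keys `date_of_death`, `death`, `pao2`, `fio2` and `pf_ratio` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def death_inferences(rows: List[Row]) -> List[Row]:
    """Stand-in step: a recorded date of death implies the patient died."""
    return [{**r, "death": 1} if r.get("date_of_death") else dict(r) for r in rows]


def calculate_pf_ratio(rows: List[Row]) -> List[Row]:
    """Stand-in step: P/F ratio = PaO2 divided by FiO2 expressed as a fraction."""
    out = []
    for r in rows:
        r = dict(r)
        if r.get("pao2") is not None and r.get("fio2"):
            r["pf_ratio"] = r["pao2"] / (r["fio2"] / 100)
        out.append(r)
    return out


REGISTRY: Dict[str, Step] = {
    "death_inferences": death_inferences,
    "calculate_pf_ratio": calculate_pf_ratio,
}


def run_inferences(rows: List[Row], pipeline: Sequence[str] = tuple(REGISTRY)) -> List[Row]:
    """Apply the named steps in order; unknown names raise a ValueError."""
    for name in pipeline:
        if name not in REGISTRY:
            raise ValueError(f"unknown inference step: {name}")
        rows = REGISTRY[name](rows)
    return rows
```

Passing a shorter tuple, for example `("death_inferences",)`, skips the remaining steps, which mirrors how `inference_pipeline` can be trimmed.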

nccidxclean.clean.numeric module#

Parses and cleans numeric columns.

nccidxclean.clean.numeric.clean_numeric(patients_df)[source]#

Cleans the numerical columns. May only be performed after the NHSx _coerce_numeric_columns function.

Parameters:

patients_df (pd.DataFrame) – dataframe of clinical data

Returns:

dataframe of clinical data

Return type:

pd.DataFrame

nccidxclean.clean.numeric.clip_numeric(patients_df)[source]#

Removes values outside of expected limits. Is called after other numeric functions.

New:
  • Additional clipping of numerical fields

  • Removal of non-integer numbers from integer fields

Parameters:

patients_df (pd.DataFrame) – dataframe of clinical data

Returns:

dataframe of clinical data

Return type:

pd.DataFrame
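A minimal sketch of the two rules, shown on single values. The field names and limits below are placeholders chosen for illustration only; they are not the limits used by the package and are not clinical guidance.

```python
from typing import Dict, Optional, Tuple

# Hypothetical limits, for illustration only.
LIMITS: Dict[str, Tuple[float, float]] = {
    "heart_rate": (20, 250),
    "temperature": (30.0, 45.0),
}
INTEGER_FIELDS = {"heart_rate"}   # assumed to hold whole numbers only


def clip_value(field: str, value: Optional[float]) -> Optional[float]:
    """Return the value if plausible for the field, otherwise None (removed)."""
    if value is None:
        return None
    low, high = LIMITS[field]
    if not low <= value <= high:
        return None                           # outside the expected limits
    if field in INTEGER_FIELDS and float(value) != int(value):
        return None                           # non-integer in an integer field
    return value
```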

nccidxclean.clean.numeric.rescale_fio2(patients_df, ltrs_to_percent=True)[source]#
Remaps FiO2 entries to the % scale. Changes:
  1. 0.5 is more likely 0.5L (23%) than 50%;

  2. Minimum oxygen is 21% (room air), not 0% -> 0 should be 21% and minimum value should be 21%;

  3. Handle the ‘Any supplemental oxygen: FiO2’ data.

Parameters:
  • patients_df (pd.DataFrame) – dataframe of clinical data

  • ltrs_to_percent (bool, default True) – convert values suspected to be in L to % scale

Returns:

dataframe of clinical data

Return type:

pd.DataFrame
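The rules above can be sketched for a single value as follows. The conversion of 4 percentage points per litre per minute above room air is a common rule of thumb that reproduces the 0.5 L to 23% example; treating every value between 0 and 21 as a flow rate, capping at 100%, and returning None when `ltrs_to_percent` is False are assumptions of this sketch and may not match the package.

```python
from typing import Optional

ROOM_AIR = 21.0        # minimum possible FiO2, in percent
PERCENT_PER_LITRE = 4  # rule-of-thumb increase per L/min of supplemental oxygen


def rescale_fio2_value(value: Optional[float], ltrs_to_percent: bool = True) -> Optional[float]:
    """Map one FiO2 entry onto the percent scale (21 to 100)."""
    if value is None:
        return None
    if value == 0:
        return ROOM_AIR                       # 0 means no supplemental oxygen
    if 0 < value < ROOM_AIR:
        if not ltrs_to_percent:
            return None                       # cannot be a valid percentage
        return min(ROOM_AIR + PERCENT_PER_LITRE * value, 100.0)
    if ROOM_AIR <= value <= 100:
        return float(value)
    return None                               # negative or above 100
```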

nccidxclean.clean.parse_dates module#

Parses and cleans date columns.

nccidxclean.clean.parse_dates.parse_date_columns(patients_df)[source]#
Additions to original cleaning pipeline:
  1. Conversion of dates stored as numbers (in the Excel date format).

  2. Adjustment of Leicester swab dates that were in UK rather than US format.

Parameters:

patients_df (pd.DataFrame) – dataframe of clinical data

Returns:

dataframe of clinical data

Return type:

pd.DataFrame
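Both additions can be illustrated with the standard library. The sketch assumes the Windows Excel date system, in which serial numbers count days from 1899-12-30, and a `dd/mm/yyyy` text layout for day-first dates; the helper names are hypothetical and the package's own parsing may differ.

```python
from datetime import date, datetime, timedelta

EXCEL_EPOCH = date(1899, 12, 30)   # day zero of the Windows Excel date system


def excel_serial_to_date(serial: float) -> date:
    """Convert an Excel serial day number to a calendar date (time of day dropped)."""
    return EXCEL_EPOCH + timedelta(days=int(serial))


def parse_day_first(text: str) -> date:
    """Parse a date written day-first (UK style), e.g. '03/04/2020' is 3 April."""
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()
```

Read month-first, the same string `03/04/2020` would be 4 March, which is why dates from a centre using the other convention need an explicit adjustment.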

nccidxclean.clean.remap_hospitals module#

Remaps hospital names from abbreviations.

nccidxclean.clean.remap_hospitals.check_new_centres(patients_df)[source]#

Provides a warning if a new centre is included which was not used in the development data. Data from these centres should be checked.

Parameters:

patients_df (pd.DataFrame) – dataframe of clinical data

Returns:

dataframe of clinical data

Return type:

pd.DataFrame
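The check amounts to a set difference followed by a warning per unseen centre. In the sketch below the helper name and the use of the `logging` module are assumptions, and the list of known centres is supplied by the caller rather than taken from the package.

```python
import logging
from typing import Iterable, List, Optional

logger = logging.getLogger(__name__)


def find_new_centres(centres: Iterable[Optional[str]], known: Iterable[str]) -> List[str]:
    """Return centres absent from the known list, logging a warning for each."""
    new = sorted({c for c in centres if c is not None} - set(known))
    for centre in new:
        logger.warning("Centre %r was not in the development data; check its data.", centre)
    return new
```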

nccidxclean.clean.remap_hospitals.remap_hospitals(patients_df)[source]#

Remap hospital names from abbreviations.

Parameters:

patients_df (pd.DataFrame) – dataframe of clinical data

Returns:

dataframe of clinical data

Return type:

pd.DataFrame

nccidxclean.clean.sense_check module#

Performs logical sense checks on the data after cleaning to warn of potential errors and remove data where necessary.

nccidxclean.clean.sense_check.sense_check(patients_df)[source]#

Sense checks data to identify potential errors.

Parameters:

patients_df (pd.DataFrame) – dataframe of clinical data

Returns:

dataframe of clinical data

Return type:

pd.DataFrame

nccidxclean.clean.sense_check.sense_check_dicom_dates(patients_df, imaging_df)[source]#
  1. Checks that no patients have a scan dated after they supposedly died in the clinical data.

  2. Updates date_last_known_alive in the clinical data, if patient had a scan.

Parameters:
  • patients_df (pd.DataFrame) – dataframe of clinical data

  • imaging_df (pd.DataFrame) – dataframe containing metadata from the DICOM imaging files

Returns:

the clinical data and image metadata dataframes

Return type:

pd.DataFrame, pd.DataFrame
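The two checks can be sketched for one patient at a time. The key `date_of_death` and the helper name are assumptions, and the choice not to update `date_last_known_alive` when a scan postdates the recorded death is a decision made for this sketch, not a documented behaviour.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

Patient = Dict[str, Optional[date]]


def check_scan_dates(patient: Patient, scan_dates: List[date]) -> Tuple[Patient, List[str]]:
    """Flag scans dated after death and push date_last_known_alive forward."""
    patient = dict(patient)                   # leave the input untouched
    messages: List[str] = []
    if not scan_dates:
        return patient, messages
    latest = max(scan_dates)
    died = patient.get("date_of_death")
    if died is not None and latest > died:
        messages.append(f"scan on {latest} is after recorded death on {died}")
        return patient, messages              # conflicting data: no update
    last_alive = patient.get("date_last_known_alive")
    if last_alive is None or latest > last_alive:
        patient["date_last_known_alive"] = latest
    return patient, messages
```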

nccidxclean.clean.utils module#

Utility functions to:
  1. extract data from old jsons to compare to current data

  2. produce outputs for checking warnings/errors from the log file

  3. merge checked data with the cleaned dataframe

nccidxclean.clean.utils.create_error_report(patients_df, log_path)[source]#

Produces an Excel file, saved in the current working directory, for checking warnings/errors.

Parameters:
  • patients_df (pd.DataFrame) – clinical data output by pipeline

  • log_path (str) – path to the log file

Returns:

clinical data

Return type:

pd.DataFrame

nccidxclean.clean.utils.create_xclean_id(patient)[source]#

Creates a unique ID for each patient using their original data. This may be used as a password to encrypt the data, so that only users who hold the original data can access it.

Parameters:

patient (pd.Series) – row of dataframe representing a patient

Returns:

unique ID for the patient generated from their original data

Return type:

str
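One way to derive a repeatable ID from a patient's original values is to hash a canonical string of them. The sketch below is an assumption about the general idea only: the documentation does not state which fields or which hash the package uses, and SHA-256 over all key/value pairs is chosen here purely for illustration.

```python
import hashlib
from typing import Mapping


def make_patient_id(patient: Mapping[str, object]) -> str:
    """Hash the patient's original values into a 64-character hex string."""
    # Sort the keys so the same data always yields the same ID.
    canonical = "|".join(f"{key}={patient[key]}" for key in sorted(patient))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the ID depends only on the original data, anyone holding that data can regenerate it, while it cannot be read off the cleaned output.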

nccidxclean.clean.utils.extract_old_data(x, base_path, patient_subdir)[source]#

Finds columns / data fields deleted from the most recent jsons compared to earlier jsons.

Parameters:
  • x (pd.Series) – input patient series from clinical data dataframe.

  • base_path (Path) – The base path to the JSON directory

  • patient_subdir (Path) – The subdirectory for the patient

Returns:

output patient series including new column containing deleted patient data

Return type:

pd.Series

nccidxclean.clean.utils.merge_checks_with_cleaned_df(patients_df, numeric_errs_path, date_errs_path)[source]#

Merges the error checked/amended data into the cleaned df.

Parameters:
  • patients_df (pd.DataFrame) – cleaned nccid clinical data

  • numeric_errs_path (str) – path to updated csv of numerical warnings/errors

  • date_errs_path (str) – path to updated csv of date warnings/errors

Returns:

updated nccid clinical data

Return type:

pd.DataFrame

nccidxclean.clean.utils.read_in_clinical_data(base_dir, clin_subdir)[source]#

Reads in the clinical data from the json files.

Parameters:
  • base_dir (Union[str, os.PathLike]) – base directory - should have same structure as original S3 bucket

  • clin_subdir (Union[str, os.PathLike]) – subdirectory containing the clinical data json files

Returns:

dataframe of clinical data with a row for each patient

Return type:

pd.DataFrame

nccidxclean.clean.utils.update_with_prev_changes(patients_df)[source]#

Updates the cleaned clinical data with previously saved manual changes and those made from previous DICOM enrichment (stored in encrypted form in the data_updates folder of the package).

Parameters:

patients_df (pd.DataFrame) – cleaned dataframe

Returns:

cleaned dataframe with data updated from stored changes

Return type:

pd.DataFrame