nccidxclean.clean subpackage#
Submodules#
nccidxclean.clean.binary_and_cat module#
Parse and clean binary and categorical columns.
- nccidxclean.clean.binary_and_cat.binarise_lung_csv(patients_df)[source]#
Converts the lung disease field to binary plus unknown, because some categories had very few values.
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
- Returns:
dataframe of clinical data
- Return type:
pd.DataFrame
- nccidxclean.clean.binary_and_cat.parse_binary_and_cat(patients_df)[source]#
Parses the binary and categorical columns. The original binary and categorical functions have been merged and the following added/changed:
code to allow for unknown values to be kept, e.g. binary field with values of [“0”, “1”] now has possible values of [“0”, “1”, “2”] where ‘2’ is unknown.
code to handle multiple Lung Diseases added (e.g. 2,4 = COPD and asthma) and save any additional lung disease info in a new field.
code to turn multiple cvs entries (e.g. ‘1,2’) to 4 (i.e. ‘multiple’), rather than extracting the first one.
handles ‘Yes’ in diabetes field
handles discrepancies for Sandwell & Birmingham in pmh_hypertension, pmh_cvs_disease, and pmh_lung_disease.
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
- Returns:
dataframe of clinical data
- Return type:
pd.DataFrame
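As a rough illustration of two of the changes listed above, the sketch below collapses a multi-valued cvs entry to '4' ('multiple') and splits a multi-valued lung disease entry into a first code plus the remainder. The helper names, and the choice to keep the first lung code in the main field, are assumptions for illustration and not the package's implementation.

```python
from typing import List, Optional, Tuple

MULTIPLE_CVS = "4"  # code for 'multiple', as described in the docstring above


def _split_codes(value: Optional[str]) -> List[str]:
    """Split a comma-separated entry such as '1,2' into clean codes."""
    if value is None:
        return []
    return [code.strip() for code in str(value).split(",") if code.strip()]


def collapse_cvs(value: Optional[str]) -> Optional[str]:
    """Return '4' for multi-valued entries, the single code otherwise."""
    codes = _split_codes(value)
    if not codes:
        return None  # missing stays missing
    return MULTIPLE_CVS if len(codes) > 1 else codes[0]


def split_lung_disease(value: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical split: the first code stays, the rest goes to an extra field."""
    codes = _split_codes(value)
    if not codes:
        return None, None
    extra = ",".join(codes[1:]) or None
    return codes[0], extra
```

Either helper could be applied to a column with `Series.map`; which code stays in the main lung disease field is a design choice that this sketch does not settle.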
nccidxclean.clean.dicts_and_maps module#
Contains common lists and dictionaries for use in the package.
nccidxclean.clean.enrich_with_dcm module#
- nccidxclean.clean.enrich_with_dcm.add_dicom_update(patients, images)[source]#
Fills in missing values for Sex and Age from imaging dicom headers. The updated data is saved in the columns ‘sex_update’ and ‘age_update’.
- Parameters:
patients (DataFrame) – The patient clinical DataFrame that needs filling in.
images (Collection[DataFrame]) – List/tuple of image metadata dataframes, e.g., [xrays, cts, mris].
- Returns:
Patient data with updated sex and age information completed using the image metadata.
- Return type:
pd.DataFrame
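The idea of filling missing values from image metadata can be sketched with pandas as follows. Only the output column 'sex_update' comes from the description above; the key column 'Pseudonym', the clinical column 'sex' and the metadata column 'PatientSex' are assumed names, and the function itself is hypothetical.

```python
from typing import Collection

import pandas as pd


def fill_sex_from_images(
    patients: pd.DataFrame,
    images: Collection[pd.DataFrame],
    key: str = "Pseudonym",  # assumed patient identifier column
) -> pd.DataFrame:
    """Fill missing 'sex' values from image metadata into 'sex_update'."""
    # Stack all modalities into one table of image metadata.
    meta = pd.concat(list(images), ignore_index=True)
    # First non-missing sex recorded for each patient across all images.
    first_sex = (
        meta.dropna(subset=["PatientSex"]).groupby(key)["PatientSex"].first()
    )
    out = patients.copy()
    # Keep the clinical value where present, otherwise use the image value.
    out["sex_update"] = out["sex"].fillna(out[key].map(first_sex))
    return out
```

Patients with no usable image metadata keep a missing value in 'sex_update'. The same pattern would apply to age, given a suitable metadata column.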
- nccidxclean.clean.enrich_with_dcm.extract_dcm_metadata(base_dir, img_subdirs)[source]#
Extracts metadata from DICOM files.
- Parameters:
base_dir (Union[str, os.PathLike]) – base directory - should have same structure as original S3 bucket
img_subdirs (Dict[str, Union[str, os.PathLike]]) – dictionary of modalities and their subdirectory
- Returns:
dictionary of the modality and a dataframe of the extracted metadata
- Return type:
Dict[str, pd.DataFrame]
nccidxclean.clean.ethnicity_and_sex module#
Remaps sex and ethnicity.
- nccidxclean.clean.ethnicity_and_sex.remap_ethnicity(patients_df)[source]#
Remaps ethnicities to standardised groupings.
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
- Returns:
dataframe of clinical data with ethnicities remapped
- Return type:
pd.DataFrame
- nccidxclean.clean.ethnicity_and_sex.remap_sex(patients_df)[source]#
NCCID code: Remaps sex to F/M/Unknown. Converts any missing values to Unknown.
- This function:
Leaves missing values missing, to be consistent with the other fields. The developer can choose whether to treat Unknown and np.nan as equivalent.
Converts the Sandwell sex codes to match the schema. For Sandwell and West Birmingham, values of 0, 1 and 2 are reported (2 = Female, 1 = Male). To match the other centres (0 = Female, 1 = Male), the values were corrected based on the information in the json files.
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
- Returns:
dataframe of clinical data with sex remapped
- Return type:
pd.DataFrame
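A minimal sketch of centre-specific code remapping is shown below. The code tables follow the description above (2 = Female, 1 = Male for Sandwell; 0 = Female, 1 = Male elsewhere). The centre label, the helper name and the treatment of any other code as 'Unknown' are assumptions; the package itself corrected ambiguous Sandwell values using the json files, which this sketch does not attempt.

```python
from typing import Optional

SANDWELL = "Sandwell and West Birmingham"  # assumed centre label
DEFAULT_CODES = {"0": "F", "1": "M"}
SANDWELL_CODES = {"2": "F", "1": "M"}


def remap_sex_code(code: Optional[str], centre: str) -> Optional[str]:
    """Map a raw sex code to F/M/Unknown, leaving missing values missing."""
    if code is None:
        return None
    codes = SANDWELL_CODES if centre == SANDWELL else DEFAULT_CODES
    # Assumption: any code not in the table is reported as 'Unknown'.
    return codes.get(str(code).strip(), "Unknown")
```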
nccidxclean.clean.fix_headers_and_order module#
Fixes known mistakes in column headers, places into logical order of submission spreadsheet, and selects columns for output.
- nccidxclean.clean.fix_headers_and_order.fix_headers(patients_df)[source]#
Fixes known mistakes in column headers. Is always run last as it acts on the cleaned columns.
Orders columns in same order as submission script.
Originally, the swab date for the ‘negative’ patients was saved in the ‘swabdate’ field, a name that does not make it clear that the date applies only to negative patients. For that reason, ‘swabdate’ is renamed to ‘negative_swab_date’ here.
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
- Returns:
dataframe of clinical data
- Return type:
pd.DataFrame
- nccidxclean.clean.fix_headers_and_order.order_columns(patients_df)[source]#
Put columns in an order consistent with the submission spreadsheet with original columns followed by cleaned columns. This assists in spotting systematic errors.
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
- Returns:
dataframe of clinical data
- Return type:
pd.DataFrame
- nccidxclean.clean.fix_headers_and_order.select_output_cols(patients_df, cols)[source]#
Selects columns to be output.
- Parameters:
patients_df (pd.DataFrame) – dataframe of all clinical data, both original and cleaned
cols (str) – columns to return. Options are:
‘spo2_imputed_cleaned_only’: returns the cleaned columns only, with spo2 imputed from pao2.
‘spo2_imputed_with_original’: returns the original and cleaned columns, with spo2 and pao2.
‘o2_split_cleaned_only’: returns the cleaned columns only, with spo2 and pao2.
‘o2_split_with_original’: returns the original and cleaned columns, with spo2 and pao2.
‘all_cleaned_only’: returns all possible cleaned columns.
‘all_with_original’: returns all possible original and cleaned columns.
- Returns:
dataframe containing only columns requested
- Return type:
pd.DataFrame
nccidxclean.clean.geh_col_shift module#
Corrects known error for George Eliot Hospital where data is shifted one-column-to-the-left.
- nccidxclean.clean.geh_col_shift.column_shift(patients_df)[source]#
Originally, the majority of columns for George Eliot Hospital (GEH) had their data entered in the column one-to-the-left when placed in the order of the submission spreadsheet. Most have now been corrected but a small number remain, which are fixed by this function.
All of these errors have a date entered in the ‘Current NSAID used’, ‘Troponin T’ and ‘Final COVID Status’ columns allowing them to be identified and corrected.
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
- Returns:
dataframe of clinical data
- Return type:
pd.DataFrame
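The detect-and-shift idea can be sketched on a single row as below. Only the column name 'Final COVID Status' comes from the description above; the neighbouring column names, the date pattern and the helper names are hypothetical, and the sketch assumes the last listed column is empty in a shifted row.

```python
import re
from typing import Dict, List, Optional

# Assumed date format (e.g. 12/03/2020); the real data may use others.
DATE_PATTERN = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")


def looks_like_date(value: object) -> bool:
    """True if the value is a string matching the assumed date pattern."""
    return isinstance(value, str) and DATE_PATTERN.match(value.strip()) is not None


def shift_row_right(
    row: Dict[str, Optional[str]], ordered_cols: List[str], flag_col: str
) -> Dict[str, Optional[str]]:
    """Move each value one column to the right if flag_col holds a date."""
    if not looks_like_date(row.get(flag_col)):
        return dict(row)  # row is not affected by the shift error
    fixed = dict(row)
    fixed[ordered_cols[0]] = None  # the true value of the first column is unknown
    for left, right in zip(ordered_cols, ordered_cols[1:]):
        fixed[right] = row.get(left)
    return fixed


cols = ["Final COVID Status", "hypothetical_date_col", "hypothetical_code_col"]
shifted = {
    "Final COVID Status": "12/03/2020",
    "hypothetical_date_col": "1",
    "hypothetical_code_col": None,
}
repaired = shift_row_right(shifted, cols, flag_col="Final COVID Status")
```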
nccidxclean.clean.inferences module#
Performs logical inferences on the data to reduce missing data.
- nccidxclean.clean.inferences.inferences(patients_df, inference_pipeline=('update_final_covid_status', 'death_inferences', 'last_known_alive_inferences', 'itu_and_intubation_inferences', 'ckd_inferences', 'calculate_pf_ratio'))[source]#
Applies functions which perform inferences. Pipeline of inference functions may be modified as desired. All functions are applied by default.
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
inference_pipeline (Collection) – inference functions to apply
- Returns:
dataframe of clinical data
- Return type:
pd.DataFrame
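One way to implement a configurable pipeline of named steps is a registry that maps names to callables, sketched below on plain records rather than a DataFrame. The step name 'death_inferences' appears in the signature above, but the function body here is a toy stand-in, and `apply_pipeline` and `REGISTRY` are hypothetical names.

```python
from typing import Callable, Collection, Dict, List

Records = List[dict]
Step = Callable[[Records], Records]


def toy_death_inference(records: Records) -> Records:
    """Toy stand-in: mark a patient as dead when a death date is present."""
    return [
        {**rec, "death": 1} if rec.get("date_of_death") else dict(rec)
        for rec in records
    ]


# Hypothetical registry mapping step names to functions.
REGISTRY: Dict[str, Step] = {"death_inferences": toy_death_inference}


def apply_pipeline(
    records: Records, pipeline: Collection[str], registry: Dict[str, Step]
) -> Records:
    """Apply each named step in order, failing loudly on unknown names."""
    for name in pipeline:
        if name not in registry:
            raise ValueError(f"Unknown inference step: {name!r}")
        records = registry[name](records)
    return records
```

Passing a shorter tuple of names skips the other steps, which matches the description that the pipeline may be modified as desired.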
nccidxclean.clean.numeric module#
Parses and cleans numeric columns.
- nccidxclean.clean.numeric.clean_numeric(patients_df)[source]#
Cleans the numerical columns. May only be performed after the NHSx _coerce_numeric_columns function.
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
- Returns:
dataframe of clinical data
- Return type:
pd.DataFrame
- nccidxclean.clean.numeric.clip_numeric(patients_df)[source]#
Removes values outside of expected limits. Is called after the other numeric functions.
- New:
Additional clipping of numerical fields
Removal of non-integer numbers from integer fields
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
- Returns:
dataframe of clinical data
- Return type:
pd.DataFrame
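The two rules above (removing out-of-range values and removing non-integer numbers from integer fields) can be sketched for a single value as follows. The helper name and the limits in the usage lines are illustrative assumptions, not the limits used by the package.

```python
from typing import Optional


def clip_value(
    value: Optional[float], low: float, high: float, integer_only: bool = False
) -> Optional[float]:
    """Return the value if it is plausible, otherwise None (treated as missing)."""
    if value is None:
        return None
    # Out-of-range values (and NaN, which fails both comparisons) are removed.
    if not (low <= value <= high):
        return None
    # Integer fields should not contain fractional numbers.
    if integer_only and float(value) != int(value):
        return None
    return value


# Illustrative limits only.
temperature = clip_value(37.2, low=30.0, high=45.0)
implausible = clip_value(120.0, low=30.0, high=45.0)
```

Note that values are removed rather than clamped to the nearest limit, in line with the description above.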
- nccidxclean.clean.numeric.rescale_fio2(patients_df, ltrs_to_percent=True)[source]#
- Remaps FiO2 entries to the % scale. Changes:
0.5 is more likely 0.5L (23%) than 50%;
Minimum oxygen is 21% (room air), not 0% -> 0 should be 21% and minimum value should be 21%;
Handle the ‘Any supplemental oxygen: FiO2’ data.
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
ltrs_to_percent (bool, default True) – convert values suspected to be in L to % scale
- Returns:
dataframe of clinical data
- Return type:
pd.DataFrame
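A single-value sketch of the rescaling logic is given below. It assumes a rule of thumb of 21% plus 4% per litre of oxygen, which reproduces the 0.5L = 23% example above, and it assumes that every non-zero value below 21 is a flow rate in litres. Both assumptions, and the helper name, are illustrative and may differ from the package's rules.

```python
from typing import Optional

ROOM_AIR = 21.0  # minimum FiO2 in %, as stated above


def rescale_fio2_value(
    value: Optional[float], ltrs_to_percent: bool = True
) -> Optional[float]:
    """Map a raw FiO2 entry onto the % scale, or None if it cannot be used."""
    if value is None or value < 0 or value > 100:
        return None
    if value == 0:
        return ROOM_AIR  # 0 is taken to mean room air
    if value >= ROOM_AIR:
        return float(value)  # already a plausible percentage
    if ltrs_to_percent:
        # Assumed rule of thumb: 21% + 4% per litre, capped at 100%.
        return min(100.0, ROOM_AIR + 4.0 * value)
    return None  # below 21% and not converted: treated as missing
```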
nccidxclean.clean.parse_dates module#
Parses and cleans date columns.
- nccidxclean.clean.parse_dates.parse_date_columns(patients_df)[source]#
- Additions to original cleaning pipeline:
Conversion of dates stored as numbers (in the Excel date format).
Adjustment of Leicester swab dates that were in UK rather than US format.
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
- Returns:
dataframe of clinical data
- Return type:
pd.DataFrame
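Both additions can be illustrated with the standard library. Excel's 1900 date system stores dates as day counts, and an offset from 30 December 1899 gives correct results for dates after February 1900. The helper names and the 'dd/mm/yyyy' text format are assumptions for this sketch.

```python
from datetime import date, datetime, timedelta

# Offset that matches Excel's 1900 date system for dates after Feb 1900.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial: float) -> date:
    """Convert an Excel serial number to a date, ignoring any time fraction."""
    return EXCEL_EPOCH + timedelta(days=int(serial))


def parse_uk_date(text: str) -> date:
    """Parse a day-first date string (assumed format dd/mm/yyyy)."""
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()


new_year = excel_serial_to_date(43831)   # date(2020, 1, 1)
third_april = parse_uk_date("03/04/2020")  # date(2020, 4, 3), not 4 March
```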
nccidxclean.clean.remap_hospitals module#
Remaps hospital names from abbreviations.
- nccidxclean.clean.remap_hospitals.check_new_centres(patients_df)[source]#
Provides a warning if a new centre is included that was not used in the development data. Data from these centres should be checked.
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
- Returns:
dataframe of clinical data
- Return type:
pd.DataFrame
nccidxclean.clean.sense_check module#
Performs logical sense checks on the data after cleaning to warn of potential errors and remove data where necessary.
- nccidxclean.clean.sense_check.sense_check(patients_df)[source]#
Sense checks data to identify potential errors.
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
- Returns:
dataframe of clinical data
- Return type:
pd.DataFrame
- nccidxclean.clean.sense_check.sense_check_dicom_dates(patients_df, imaging_df)[source]#
Checks that no patients have a scan dated after they supposedly died in the clinical data.
Updates date_last_known_alive in the clinical data, if patient had a scan.
- Parameters:
patients_df (pd.DataFrame) – dataframe of clinical data
imaging_df (pd.DataFrame) – dataframe containing metadata from the DICOM imaging files
- Returns:
the clinical data and image metadata dataframes
- Return type:
pd.DataFrame, pd.DataFrame
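The first check described above can be sketched with pandas as follows. The column names 'Pseudonym', 'date_of_death' and 'StudyDate' and the function name are assumptions; the sketch only reports the affected patients and does not update date_last_known_alive.

```python
from typing import List

import pandas as pd


def scans_after_death(
    patients: pd.DataFrame, imaging: pd.DataFrame, key: str = "Pseudonym"
) -> List[str]:
    """Return IDs of patients whose latest scan is dated after their death."""
    # Latest scan date per patient from the image metadata.
    last_scan = imaging.groupby(key)["StudyDate"].max()
    merged = patients.assign(last_scan=patients[key].map(last_scan))
    # Comparisons involving a missing date (NaT) evaluate to False.
    flagged = merged[merged["last_scan"] > merged["date_of_death"]]
    return flagged[key].tolist()
```

Patients without scans, or without a recorded date of death, are never flagged because comparisons against missing dates are false.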
nccidxclean.clean.utils module#
- Utility functions to:
extract data from old jsons to compare to current data
produce outputs for checking warnings/errors from the log file
merge checked data with the cleaned dataframe
- nccidxclean.clean.utils.create_error_report(patients_df, log_path)[source]#
Produces an Excel file, saved in the current working directory, for checking warnings/errors.
- Parameters:
patients_df (pd.DataFrame) – clinical data output by pipeline
log_path (str) – path to the log file
- Returns:
clinical data
- Return type:
pd.DataFrame
- nccidxclean.clean.utils.create_xclean_id(patient)[source]#
Creates a unique ID for each patient using their original data. This may be used as a password to encrypt the data, so that only users who hold the original data can access it.
- Parameters:
patient (pd.Series) – row of dataframe representing a patient
- Returns:
unique ID for the patient generated from their original data
- Return type:
str
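One way to derive a reproducible ID from a patient's original data is to hash a canonical string built from the fields, as sketched below. The use of SHA-256, the field ordering and the function name are assumptions for illustration; the package's actual derivation is not described here and may differ.

```python
import hashlib
from typing import Mapping


def make_row_id(original_fields: Mapping[str, object]) -> str:
    """Hash the original field values into a stable hexadecimal ID."""
    # Sort the keys so the ID does not depend on column order.
    canonical = "|".join(
        f"{key}={original_fields[key]}" for key in sorted(original_fields)
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the ID depends only on the original values, anyone holding the same original row can regenerate it, which is the property the description above relies on.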
- nccidxclean.clean.utils.extract_old_data(x, base_path, patient_subdir)[source]#
Finds columns / data fields deleted from the most recent jsons compared to earlier jsons.
- Parameters:
x (pd.Series) – input patient series from clinical data dataframe.
base_path (Path) – The base path to the JSON directory
patient_subdir (Path) – The subdirectory for the patient
- Returns:
output patient series including new column containing deleted patient data
- Return type:
pd.Series
- nccidxclean.clean.utils.merge_checks_with_cleaned_df(patients_df, numeric_errs_path, date_errs_path)[source]#
Merges the error checked/amended data into the cleaned df.
- Parameters:
patients_df (pd.DataFrame) – cleaned nccid clinical data
numeric_errs_path (str) – path to updated csv of numerical warnings/errors
date_errs_path (str) – path to updated csv of date warnings/errors
- Returns:
updated nccid clinical data
- Return type:
pd.DataFrame
- nccidxclean.clean.utils.read_in_clinical_data(base_dir, clin_subdir)[source]#
Reads in the clinical data from the json files.
- Parameters:
base_dir (Union[str, os.PathLike]) – base directory - should have same structure as original S3 bucket
clin_subdir (Union[str, os.PathLike]) – subdirectory containing the clinical data json files
- Returns:
dataframe of clinical data with a row for each patient
- Return type:
pd.DataFrame
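Reading a directory tree of json files into one record per file can be sketched as below. The directory layout (any nesting of '*.json' files under the clinical subdirectory), the one-object-per-file assumption and the function name are hypothetical; the real bucket structure may need a more specific pattern.

```python
import json
import os
from pathlib import Path
from typing import List, Union


def load_json_records(
    base_dir: Union[str, os.PathLike], clin_subdir: Union[str, os.PathLike]
) -> List[dict]:
    """Load every json file below base_dir/clin_subdir into a list of dicts."""
    root = Path(base_dir) / clin_subdir
    records = []
    # Sorted so that the row order is reproducible between runs.
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            records.append(json.load(handle))
    return records
```

The resulting list can be passed to `pd.DataFrame(records)` to obtain a dataframe of records.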
- nccidxclean.clean.utils.update_with_prev_changes(patients_df)[source]#
Updates the cleaned clinical data with previously saved manual changes and with changes made during previous DICOM enrichment (stored in encrypted form in the data_updates folder of the package).
- Parameters:
patients_df (pd.DataFrame) – cleaned dataframe
- Returns:
cleaned dataframe with data updated from stored changes
- Return type:
pd.DataFrame