Binary and Categorical Columns

Original NCCID Cleaning Pipeline

Binary Columns

NCCID Function: _parse_binary_columns

  • Mapped unknowns to missing, i.e. np.nan.

  • Converted 0 (int), “0.0” and “0” (strings) to False and 1 (int), “1.0” and “1” (strings) to True.

  • Mapped and merged the “PMH h1pertension” column into “pmh_hypertension”.

  • Merged “PMH diabetes mellitus type II” into “pmh_diabetes_mellitus_type_2”.

  • Filled blanks in “pmh_diabetes_mellitus_type_2” with values from “PMH diabetes mellitus TYPE II”, and ignored “PMH diabetes mellitus TYPE I”.
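The binary steps above can be sketched roughly as follows. This is a minimal illustration and not the actual `_parse_binary_columns` code: the helper names `parse_binary` and `merge_binary`, the lookup table, and the example data are all assumptions made for the sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup; the spellings handled by the real function may differ.
BINARY_MAP = {0: False, "0": False, "0.0": False, 1: True, "1": True, "1.0": True}


def parse_binary(series: pd.Series) -> pd.Series:
    """Map 0/1-like values to booleans; anything else (e.g. 'Unknown') becomes NaN."""
    return series.map(lambda value: BINARY_MAP.get(value, np.nan))


def merge_binary(df: pd.DataFrame, source: str, target: str) -> pd.DataFrame:
    """Fill gaps in `target` with parsed values from `source`, then drop `source`."""
    out = df.copy()
    out[target] = parse_binary(out[target]).combine_first(parse_binary(out[source]))
    return out.drop(columns=[source])


# Illustrative data only.
df = pd.DataFrame(
    {
        "pmh_hypertension": [True, np.nan, np.nan],
        "PMH h1pertension": ["0", "1.0", "Unknown"],
    }
)
merged = merge_binary(df, source="PMH h1pertension", target="pmh_hypertension")
```

In this sketch the existing value in the target column wins, and the misspelled column only fills rows where the target is blank.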

Categorical Columns

NCCID Function: _parse_cat_columns

  • Ensured values are in the list of possible values.

  • Extracted the integer value from “Pack year history”.

  • Stripped digits from strings and excluded values outside of the schema.

  • Mapped “Unknown” categories to missing (np.nan) where they existed.
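A rough sketch of the categorical steps above, not the actual `_parse_cat_columns` code. It reads the "digits" bullet as keeping the first integer found in each value, which is an assumption; the helper names, the `allowed` set, and the `unknown` code are likewise hypothetical.

```python
import re

import numpy as np
import pandas as pd


def first_int(value):
    """Return the first integer found in `value`, or None if there is none."""
    match = re.search(r"\d+", str(value))
    return int(match.group()) if match else None


def parse_categorical(series: pd.Series, allowed: set, unknown=None) -> pd.Series:
    """Keep codes listed in `allowed`; map unknown or out-of-schema values to NaN."""

    def _clean(value):
        if pd.isna(value):
            return np.nan
        code = first_int(value)
        if code is None or code == unknown or code not in allowed:
            return np.nan
        return code

    return series.map(_clean)


# Illustrative values and schema only.
raw = pd.Series(["1", "3.0", "7", "Unknown", None])
cleaned = parse_categorical(raw, allowed={0, 1, 2, 3})
pack_years = first_int("20 pack years")
```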


NCCIDxClean

New Function: parse_binary_and_cat

The binary and categorical functions have been merged and the following added/changed:

  • Unknown values are retained using an additional code to distinguish them from missing values. For example, a binary field previously had unknowns mapped to np.nan, leaving values of [“0”, “1”]. Now the column has possible values of [“0”, “1”, “2”], where “2” means “Unknown”.

  • ‘1es’ entries in the PMH diabetes mellitus type II field are now handled.

  • PMH CVS disease and PMH Lung disease are both now converted into binary features (+ unknown), due to the low number of cases in most categories.

  • For both PMH CVS disease and PMH Lung disease, the original code meaning and any additional disease information are retained in a list in a new field for each: pmh_cvs_disease_info and pmh_lung_disease_info. For example:

      - If both ‘1’ and ‘2’ (a history of myocardial infarction and angina respectively) were included in the PMH CVS disease entry, pmh_cvs_disease_info will be [MI, Angina].

      - Where multiple codes have been entered for PMH Lung disease (e.g. 2,4 = COPD and asthma), both corresponding values are retained in pmh_lung_disease_info as [COPD, Asthma].
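The two ideas above (keeping “Unknown” as its own code, and expanding multi-code entries into a list of labels) might look like the sketch below. It is not the `parse_binary_and_cat` implementation: the helper names, the raw spelling "unknown", the comma separator, and the partial `CVS_CODES` lookup (only the two codes named in the text) are assumptions.

```python
import numpy as np
import pandas as pd

UNKNOWN_CODE = "2"  # per the text, '2' denotes 'Unknown' in a binary field

# Hypothetical raw spellings mapped to the three retained codes.
BINARY_WITH_UNKNOWN = {
    "0": "0", "0.0": "0",
    "1": "1", "1.0": "1",
    "unknown": UNKNOWN_CODE,
}

# Partial lookup: only the two codes named in the text are included.
CVS_CODES = {"1": "MI", "2": "Angina"}


def parse_binary_keep_unknown(series: pd.Series) -> pd.Series:
    """Return '0', '1' or '2' (unknown); genuinely missing values stay NaN."""

    def _clean(value):
        if pd.isna(value):
            return np.nan
        return BINARY_WITH_UNKNOWN.get(str(value).strip().lower(), np.nan)

    return series.map(_clean)


def codes_to_info(entry, lookup: dict) -> list:
    """Turn an entry such as '1,2' into the list of labels it stands for."""
    if pd.isna(entry):
        return []
    codes = [code.strip() for code in str(entry).split(",")]
    return [lookup[code] for code in codes if code in lookup]


# Illustrative values only.
flags = parse_binary_keep_unknown(pd.Series(["0", 1, "Unknown", None]))
info = codes_to_info("1,2", CVS_CODES)
```

Keeping missing values as NaN while giving “Unknown” its own code is what lets later analysis tell "not recorded" apart from "recorded as unknown".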

New Function: binarise_lung_csv

  • PMH CVS disease and PMH Lung disease are both now converted into binary fields, due to the low number of cases in most categories.

  • You can prevent this from occurring in the default pipeline by setting the collapse_pmh parameter/flag to False.
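As a hedged illustration of the collapse described above, the sketch below reduces a multi-code entry to “0”, “1” or “2” and skips the step when `collapse_pmh` is False. It is not the `binarise_lung_csv` implementation: `binarise` and `collapse_column` are hypothetical helpers, the caller supplies which codes mean "none" and "unknown", and the "none" code of "0" in the example is an assumption.

```python
import numpy as np
import pandas as pd


def binarise(entry, none_codes: set, unknown_codes: set):
    """Collapse an entry to '0' (no disease), '1' (any disease) or '2' (unknown)."""
    if pd.isna(entry):
        return np.nan
    codes = {code.strip() for code in str(entry).split(",") if code.strip()}
    if not codes:
        return np.nan
    if codes - none_codes - unknown_codes:
        return "1"  # at least one code that names a disease
    if codes & unknown_codes:
        return "2"
    return "0"


def collapse_column(series, none_codes, unknown_codes, collapse_pmh=True):
    """Apply `binarise` to a column unless collapsing has been switched off."""
    if not collapse_pmh:
        return series
    return series.map(lambda entry: binarise(entry, none_codes, unknown_codes))


# Illustrative values; "0" as the 'none' code is an assumption.
lung = pd.Series(["2,4", "0", "6", None])
collapsed = collapse_column(lung, none_codes={"0"}, unknown_codes={"6"})
untouched = collapse_column(lung, none_codes={"0"}, unknown_codes={"6"}, collapse_pmh=False)
```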

Discrepancies for Sandwell & West Birmingham

Discrepancies for Sandwell & West Birmingham in pmh_cvs_disease and pmh_lung_disease are now handled:

  • PMH CVS Disease: All COVID-positive patients had a ‘1’ for PMH CVS disease, indicating a history of myocardial infarction. As a result, all of these values were set to missing (np.nan).

  • PMH Lung Disease: Only values of ‘4’ (COPD) and ‘6’ (Unknown) were included. This issue was handled by making this field binary (+ unknown).
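The PMH CVS disease fix above amounts to blanking one column for one site. A minimal sketch, assuming a site column named `SubmittingCentre` and a helper `blank_column_for_site` (both hypothetical, as is the example data):

```python
import numpy as np
import pandas as pd


def blank_column_for_site(df, site, column, site_col="SubmittingCentre"):
    """Set `column` to NaN for every row submitted by `site`."""
    out = df.copy()
    out.loc[out[site_col] == site, column] = np.nan
    return out


# Illustrative data only.
records = pd.DataFrame(
    {
        "SubmittingCentre": ["Sandwell & West Birmingham", "Other Trust", "Sandwell & West Birmingham"],
        "pmh_cvs_disease": ["1", "2", "1"],
    }
)
fixed = blank_column_for_site(records, "Sandwell & West Birmingham", "pmh_cvs_disease")
```

Rows from other sites are left unchanged, so the correction stays local to the affected submissions.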