Binary and Categorical Columns#
Original NCCID Cleaning Pipeline#
Binary Columns#
NCCID Function: _parse_binary_columns
Mapped unknowns to missing, i.e. np.nan.
Converted 0 (int), “0.0” and “0” (strings) to False and 1 (int), “1.0” and “1” (strings) to True.
Mapped and merged the “PMH h1pertension” column into “pmh_hypertension”.
Merged “PMH diabetes mellitus type II” into “pmh_diabetes_mellitus_type_2”.
Filled blanks in “pmh_diabetes_mellitus_type_2” with “PMH diabetes mellitus TYPE II” and ignored “PMH diabetes mellitus TYPE I”.
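The binary steps above can be sketched roughly as follows. This is a minimal illustration and not the actual body of _parse_binary_columns: the helper names `parse_binary` and `merge_columns` and the `BINARY_MAP` dictionary are hypothetical, and the sketch assumes that any value outside the 0/1 variants (including "Unknown") should become np.nan.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup: 0/1 as ints or strings map to booleans.
# (In Python 0.0 == 0 and 1.0 == 1, so float inputs hit the same keys.)
BINARY_MAP = {0: False, "0": False, "0.0": False,
              1: True, "1": True, "1.0": True}


def parse_binary(series: pd.Series) -> pd.Series:
    """Map 0/1-like values to booleans; anything else becomes NaN."""
    return series.map(lambda value: BINARY_MAP.get(value, np.nan))


def merge_columns(df: pd.DataFrame, target: str, source: str) -> pd.DataFrame:
    """Fill blanks in `target` from `source`, then drop `source`."""
    out = df.copy()
    out[target] = out[target].combine_first(out[source])
    return out.drop(columns=[source])


example = pd.DataFrame({
    "pmh_hypertension": ["1", None, "Unknown"],
    "PMH h1pertension": [None, "0.0", None],
})
merged = merge_columns(example, "pmh_hypertension", "PMH h1pertension")
parsed = parse_binary(merged["pmh_hypertension"])
```

Here the blank in pmh_hypertension is filled from the misspelled column before parsing, so the second row ends up as False rather than missing, while "Unknown" is mapped to np.nan.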
Categorical Columns#
NCCID Function: _parse_cat_columns
Ensured values were in the list of possible values.
Extracted the integer value from “Pack year history”.
Stripped digits from strings and excluded values outside of the schema.
Mapped “Unknown” categories to missing (np.nan) if they existed.
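A rough sketch of these categorical steps is given below. It is not the real _parse_cat_columns: the function names, the `allowed` set and the `unknown` code are all hypothetical, and "stripped digits from strings" is read here as pulling the first run of digits out of each value.

```python
import re

import numpy as np
import pandas as pd


def parse_categorical(series: pd.Series, allowed: set, unknown=None) -> pd.Series:
    """Keep digit codes listed in `allowed`; map everything else to NaN."""
    def clean(value):
        if pd.isna(value):
            return np.nan
        match = re.search(r"\d+", str(value))  # first run of digits, e.g. "2.0" -> "2"
        if match is None:
            return np.nan
        code = match.group()
        # The unknown code and out-of-schema codes both become missing.
        if code == unknown or code not in allowed:
            return np.nan
        return code
    return series.map(clean)


def parse_pack_years(series: pd.Series) -> pd.Series:
    """Extract the first integer found in each free-text entry."""
    extracted = series.astype(str).str.extract(r"(\d+)", expand=False)
    return pd.to_numeric(extracted, errors="coerce")


# Hypothetical schema: codes 0-3, where "3" stands for "Unknown".
categories = parse_categorical(
    pd.Series(["1", "2.0", "7", "3", None]),
    allowed={"0", "1", "2", "3"},
    unknown="3",
)
pack_years = parse_pack_years(pd.Series(["20 pack years", "approx 5", None]))
```

In this example only the first two categorical entries survive (as "1" and "2"); the out-of-schema "7", the unknown code "3" and the blank all become np.nan.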
NCCIDxClean#
New Function: parse_binary_and_cat
The binary and categorical functions have been merged and the following added/changed:
Unknown values are retained using an additional code to distinguish them from missing values. For example, previously a binary field would have unknowns mapped to np.nan, leaving values of [“0”, “1”]. Now this column would have possible values of [“0”, “1”, “2”], where ‘2’ is ‘Unknown’.
‘1es’ entries in the PMH diabetes mellitus type II field are now handled.
PMH CVS disease and PMH Lung disease are both now converted into binary features (+ unknown), due to the low number of cases in most categories.
For both PMH CVS disease and PMH Lung disease, the original code meaning and any additional disease information are retained in a list in a new field for each of these: pmh_cvs_disease_info and pmh_lung_disease_info. For example:
- If a ‘1’ and a ‘2’ (a history of myocardial infarction and angina respectively) were included in the PMH CVS disease entry, pmh_cvs_disease_info will be [MI, Angina].
- Where multiple codes have been entered for PMH Lung Disease (e.g. 2,4 = COPD and asthma), both corresponding values are now retained in pmh_lung_disease_info as [COPD, Asthma].
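The collapse into a binary (+ unknown) field with a companion info list might look something like the sketch below. It is illustrative only: the function name `collapse_entry` is hypothetical, only the codes 1 (MI) and 2 (Angina) are taken from the text above, and the "no disease" and "unknown" input codes are placeholders that would need checking against the NCCID schema.

```python
import numpy as np
import pandas as pd

# Codes 1 and 2 come from the text; the two codes below are assumptions.
CVS_LABELS = {"1": "MI", "2": "Angina"}
NONE_CODE = "0"     # assumed code for "no CVS disease"
UNKNOWN_CODE = "9"  # hypothetical code for "Unknown"


def collapse_entry(entry):
    """Return (binary_code, info_list) for a raw entry such as '1,2'."""
    if pd.isna(entry):
        return np.nan, []
    codes = [c.strip() for c in str(entry).split(",") if c.strip()]
    info = [CVS_LABELS[c] for c in codes if c in CVS_LABELS]
    if info:
        return "1", info   # at least one recognised disease code
    if UNKNOWN_CODE in codes:
        return "2", []     # unknown is kept as its own code, not NaN
    if NONE_CODE in codes:
        return "0", []
    return np.nan, []


raw = pd.Series(["1,2", "0", "9", None])
results = [collapse_entry(value) for value in raw]
collapsed = pd.DataFrame({
    "pmh_cvs_disease": [binary for binary, _ in results],
    "pmh_cvs_disease_info": [info for _, info in results],
})
```

Keeping the labels in a separate list column means the binary field stays simple for modelling while the original detail remains available for anyone who needs it.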
New Function: binarise_lung_csv
PMH CVS disease and PMH Lung disease are both now converted into binary fields, due to the low number of cases in most categories.
You can prevent this from occurring in the default pipeline by setting the collapse_pmh parameter/flag to False.
Discrepancies for Sandwell & West Birmingham#
Discrepancies for Sandwell & West Birmingham in pmh_cvs_disease and pmh_lung_disease are now handled:
PMH CVS Disease: All COVID-positive patients had a ‘1’ for PMH CVS disease, indicating a history of myocardial infarction. As a result, all of these values were set to missing (np.nan).
PMH Lung Disease: Only values of ‘4’ (COPD) and ‘6’ (Unknown) were included. This issue was handled by making this field binary (+ unknown).
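As a minimal sketch of the PMH CVS disease fix, the implausible values can be blanked for the affected centre only. The function name `mask_sandwell_cvs` and the `submitting_centre` column name are assumptions made for illustration; the real pipeline may identify the trust differently and may also filter on COVID status.

```python
import numpy as np
import pandas as pd


def mask_sandwell_cvs(
    df: pd.DataFrame,
    centre_col: str = "submitting_centre",            # assumed column name
    centre_name: str = "Sandwell & West Birmingham",
) -> pd.DataFrame:
    """Set pmh_cvs_disease to NaN for rows from the named centre."""
    out = df.copy()
    is_centre = out[centre_col] == centre_name
    out.loc[is_centre, "pmh_cvs_disease"] = np.nan
    return out


records = pd.DataFrame({
    "submitting_centre": ["Sandwell & West Birmingham", "Other Trust"],
    "pmh_cvs_disease": ["1", "1"],
})
masked = mask_sandwell_cvs(records)
```

Rows from other centres are left untouched, and the input frame is copied rather than modified in place.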