Installation and Usage
Installation
The package and dependencies, including the original NHSx pipeline, may be installed using pip:
pip install git+https://gitlab.developers.cam.ac.uk/maths/cia/covid-19-projects/nccidxclean
Alternatively, the git repository may be cloned and installed locally:
git clone https://gitlab.developers.cam.ac.uk/maths/cia/covid-19-projects/nccidxclean
cd nccidxclean
pip install .
It is advised that the package be installed in a virtual environment (conda or venv).
Usage
The package may be run from the command line or from within Python.
To run the package on the command line:
nccidxclean <base_path> <clinical_subdir> --xray_subdir --ct_subdir --eda
The cleaned output data is stored in a ./data/ directory, which is created in the working directory.
For additional information on command line usage, please see the docs.
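When the command-line tool needs to be launched from a script, the invocation above can be assembled in Python. The following is a minimal sketch and not part of the package: build_command and run_if_installed are hypothetical helpers, the paths are placeholders, and only the two positional arguments shown above are passed by default. Any optional flags should be added as described in the docs.

```python
import shutil
import subprocess
from typing import List, Sequence


def build_command(base_path: str, clinical_subdir: str,
                  extra_args: Sequence[str] = ()) -> List[str]:
    """Assemble the argument list for the nccidxclean command (hypothetical helper)."""
    return ["nccidxclean", base_path, clinical_subdir, *extra_args]


def run_if_installed(cmd: List[str]) -> bool:
    """Run cmd only if its executable is on PATH; return True if it was run."""
    if shutil.which(cmd[0]) is None:
        return False
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(cmd, check=True)
    return True


# Placeholder paths: replace with the location of your own NCCID data.
cmd = build_command("/path/to/nccid", "clinical")
print(" ".join(cmd))
```

Calling run_if_installed(cmd) returns False rather than raising when the entry point is missing, which usually means the package was installed into a different environment.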
An example notebook is provided which demonstrates how the cleaning process may be performed step-by-step, from reading in the clinical data to enriching missing values using the DICOM metadata.
To run the default module pipeline on a pandas DataFrame in Python, use:
import nccidxclean as xclean
xclean_df = xclean.xclean_nccid(df)
The pipeline may be modified to remove steps, use the original NHSx modules,
and select which data features are returned. Further parameters let the user specify
whether to convert FiO2 values given in litres to percentages (fio2_ltrs_to_percent)
and whether to collapse ‘PMH Lung Disease’ and ‘PMH CVS Disease’ into binary + unknown features (collapse_pmh).
Please see the API documentation for more information.
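To illustrate what collapsing a multi-category field into a "binary + unknown" feature can look like, here is a minimal sketch in plain Python. It is not the package's implementation: collapse_to_binary is a hypothetical helper, and the raw labels and the output labels ("Yes", "No", "Unknown") are assumptions chosen for the example. The real NCCID category codes and the encoding used by collapse_pmh may differ.

```python
from typing import List, Optional

# Hypothetical raw labels; the real NCCID category codes may differ.
NO_DISEASE_LABELS = {"none", "no"}
UNKNOWN_LABELS = {"", "unknown", "nan"}


def collapse_to_binary(value: Optional[str]) -> str:
    """Collapse a multi-category history field to 'Yes', 'No' or 'Unknown'."""
    if value is None:
        return "Unknown"
    label = str(value).strip().lower()
    if label in UNKNOWN_LABELS:
        return "Unknown"
    if label in NO_DISEASE_LABELS:
        return "No"
    # Any specific disease label counts as the condition being present.
    return "Yes"


raw: List[Optional[str]] = ["Asthma", "None", None, "COPD", "Unknown"]
collapsed = [collapse_to_binary(v) for v in raw]
print(collapsed)
```

Keeping "Unknown" as its own category, rather than folding it into "No", avoids treating missing history as evidence of absence.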
Data Warnings and Errors!
Three .csv files containing patient data for manual review are generated and stored
in the for_review folder in the working directory.
This data should be reviewed and amended (e.g. in a spreadsheet application), and the changes
then merged with the cleaned data using the nccidxclean.clean.utils.merge_checks_with_cleaned_df function. The amendments are then stored in an encrypted format so that
they can be re-applied during future deployments.
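Before opening the review files in a spreadsheet, it can help to see how many rows each one contains. The sketch below uses only the Python standard library; load_review_files is a hypothetical helper, and the only assumptions are that the for_review folder sits in the current working directory and that the files are UTF-8 encoded. It does not call merge_checks_with_cleaned_df, whose arguments are described in the API documentation.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_review_files(review_dir: Path) -> Dict[str, List[dict]]:
    """Read every .csv file in review_dir into a list of row dictionaries."""
    tables: Dict[str, List[dict]] = {}
    for csv_path in sorted(review_dir.glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            tables[csv_path.name] = list(csv.DictReader(handle))
    return tables


# The for_review folder is expected in the working directory.
review_dir = Path.cwd() / "for_review"
if review_dir.is_dir():
    for name, rows in load_review_files(review_dir).items():
        print(f"{name}: {len(rows)} rows to review")
```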