Running the Pipeline
Important
To use these tools, you need to provide a BASE_PATH in the notebook that points to the location of the data pulled
from the NCCID S3 bucket. Your local directory structure should match the original S3 structure. You can set the local
path to your NCCID data by changing the DEFAULT_PATH variable in the notebook, or alternatively by setting the
environment variable NCCID_DATA_DIR (e.g., in .bashrc).
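As a minimal sketch of how this path resolution could look, assuming the environment variable takes precedence over DEFAULT_PATH (the source does not state the precedence, and the helper name resolve_base_path and the placeholder path are hypothetical):

```python
import os
from pathlib import Path

# Fallback location; replace with the directory holding your local copy
# of the NCCID data. This value is a hypothetical placeholder.
DEFAULT_PATH = "/path/to/nccid/data"


def resolve_base_path(default=DEFAULT_PATH):
    """Return the data root, preferring NCCID_DATA_DIR when it is set."""
    return Path(os.getenv("NCCID_DATA_DIR", default))


BASE_PATH = resolve_base_path()
```

With this arrangement, exporting NCCID_DATA_DIR in .bashrc overrides the notebook default without editing the notebook itself.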
To run NCCIDxClean from Python, an example of the full pipeline, detailing all cleaning steps, is given in
this notebook.
This is a modified version of the
data ingestion example in the NHSx NCCID
cleaning package, which generates tabular patient clinical data and imaging metadata files (.csv) using the submodule
etl.py.
In this Jupyter notebook:
1. DICOM metadata is read in, converted to a pandas dataframe, and saved as a .csv file for each modality (xrays.csv, cts.csv, and mris.csv).
2. The raw clinical data is read in and converted to a pandas dataframe. The most recent JSON ‘data’ file (for COVID-positive) or ‘status’ file (for COVID-negative) is parsed for each patient in the directory tree.
3. The default nccidxclean pipeline is run on the clinical data.
4. Potential errors in the cleaned data are saved in ./for_review/ and should be checked. The amended data is then merged back into the cleaned data.
5. DICOM metadata is used to enrich blanks in the ‘age’ and ‘ethnicity’ fields using the etl.py submodule in the NHSx NCCID Cleaning tool.
6. Dates in the DICOM files are sense-checked against those in the cleaned clinical data.
7. The final tabular clinical data file is saved in another .csv file (patients.csv).
8. The raw data is run through the original NHSx cleaning pipeline and this data is then enriched using the DICOM metadata, with the output saved as nhsx_patients.csv.
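To illustrate the kind of date sense check mentioned in the steps above, here is a rough pandas sketch. The source does not say which dates are compared, so the rule used here (an imaging study dated after the recorded date of death) is an illustrative assumption, as are the function name flag_date_conflicts and the column names Pseudonym, StudyDate and date_of_death.

```python
import pandas as pd


def flag_date_conflicts(clinical, dicom, id_col="Pseudonym",
                        study_col="StudyDate", death_col="date_of_death"):
    """Return imaging studies dated after the patient's recorded death.

    Column names and the rule itself are assumptions for illustration only.
    """
    merged = dicom[[id_col, study_col]].merge(
        clinical[[id_col, death_col]], on=id_col, how="inner"
    )
    study = pd.to_datetime(merged[study_col], errors="coerce")
    death = pd.to_datetime(merged[death_col], errors="coerce")
    # Comparisons involving NaT evaluate to False, so rows with a
    # missing date on either side are never flagged.
    return merged[study > death].reset_index(drop=True)


# Toy data: only p1 has a study dated after the recorded date of death.
clinical = pd.DataFrame({
    "Pseudonym": ["p1", "p2", "p3"],
    "date_of_death": ["2020-05-01", None, "2020-06-10"],
})
dicom = pd.DataFrame({
    "Pseudonym": ["p1", "p2", "p3"],
    "StudyDate": ["2020-05-03", "2020-04-20", "2020-06-01"],
})
conflicts = flag_date_conflicts(clinical, dicom)
```

Rows returned by a check like this could be written out for manual review rather than corrected automatically, in the same spirit as the ./for_review/ step.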
Important
The new pipeline merges the DICOM updates directly into the ‘age’ and ‘sex’ fields rather than using separate ‘age_update’ and ‘sex_update’ fields, as we found these led to confusion when utilising the data. The original age and sex remain available in the ‘age_b4dcm’ and ‘sex_b4dcm’ features, although these are not included in the output by default.
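The behaviour described here can be sketched in pandas as follows. This is not the package's implementation: the helper merge_update is hypothetical, it assumes both tables share a patient index, and it assumes DICOM values only fill blanks rather than overwrite existing entries. It is shown for the ‘sex’ field only.

```python
import pandas as pd


def merge_update(patients, dicom_values, field, backup_col):
    """Fill blanks in `field` from DICOM values, keeping the original copy.

    Hypothetical helper: both frames are assumed to share the same index.
    """
    out = patients.copy()
    out[backup_col] = out[field]                           # pre-DICOM value
    out[field] = out[field].fillna(dicom_values[field])   # fill blanks only
    return out


# Toy data: p2 has no recorded sex in the clinical table.
patients = pd.DataFrame({"sex": ["F", None, "M"]}, index=["p1", "p2", "p3"])
dicom_values = pd.DataFrame({"sex": ["F", "M", "F"]}, index=["p1", "p2", "p3"])

merged = merge_update(patients, dicom_values, "sex", "sex_b4dcm")
```

Keeping the untouched value in a separate column means the merged field can be used directly while the pre-DICOM value stays recoverable for auditing.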