Running the Pipeline

Important

To use these tools, set a BASE_PATH in the notebook that points to the data pulled from the NCCID S3 bucket. Your local directory structure should match the original S3 structure. You can set the local path to your NCCID data below by changing the DEFAULT_PATH variable, or alternatively by setting the environment variable NCCID_DATA_DIR, e.g. in .bashrc.
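As a minimal sketch of how the path could be resolved, the snippet below prefers the NCCID_DATA_DIR environment variable and falls back to DEFAULT_PATH. The helper name `resolve_base_path` and the placeholder path are assumptions made for illustration; the notebook itself may set BASE_PATH differently.

```python
import os
from pathlib import Path

# Placeholder fallback (an assumption): replace with your local copy of the NCCID data.
DEFAULT_PATH = "/path/to/nccid/data"


def resolve_base_path(default=DEFAULT_PATH, env_var="NCCID_DATA_DIR"):
    """Return the data root, preferring the environment variable when it is set."""
    # An unset or empty environment variable falls back to the default path.
    return Path(os.environ.get(env_var) or default).expanduser()


BASE_PATH = resolve_base_path()
```

With this pattern, adding `export NCCID_DATA_DIR=/your/local/copy` to .bashrc avoids editing the notebook on each machine.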

To run NCCIDxClean from Python, this notebook gives an example of the full pipeline and details all cleaning steps. It is a modified version of the data ingestion example in the NHSx NCCID cleaning package, which generates tabular patient clinical data and imaging metadata files (.csv) using the submodule etl.py.

In this Jupyter notebook:

  1. DICOM metadata is read in, converted to a pandas dataframe, and saved as a .csv file for each modality (xrays.csv, cts.csv, and mris.csv).

  2. The raw clinical data is read in and converted to a pandas dataframe. The most recent JSON ‘data’ file (for COVID-positive) or ‘status’ file (for COVID-negative) is parsed for each patient in the directory tree.

  3. The default nccidxclean pipeline is run on the clinical data.

  4. Potential errors in the cleaned data are saved in ./for_review/ and should be checked. The amended data is then merged back into the cleaned data.

  5. DICOM metadata is used to enrich blanks in the ‘age’ and ‘ethnicity’ fields using the etl.py submodule in the NHSx NCCID Cleaning tool.

  6. Dates in the DICOM files are sense-checked against those in the cleaned clinical data.

  7. The final tabular clinical data file is saved in another .csv file (patients.csv).

  8. The raw data is also run through the original NHSx cleaning pipeline and then enriched using the DICOM metadata, with the output saved as nhsx_patients.csv.
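To illustrate step 2, here is a rough sketch of selecting the most recent clinical JSON file per patient. It is not the notebook's own code. The file naming (`data_YYYY-MM-DD.json` and `status_YYYY-MM-DD.json` inside one folder per patient), the preference for a ‘data’ file over a ‘status’ file when both exist, and the `patient_id` key are all assumptions made for illustration.

```python
import json
import re
from pathlib import Path

# Assumed naming convention: data_YYYY-MM-DD.json or status_YYYY-MM-DD.json
FILE_PATTERN = re.compile(r"^(data|status)_(\d{4}-\d{2}-\d{2})\.json$")


def latest_clinical_file(patient_dir):
    """Return the newest 'data' file for a patient, else the newest 'status' file."""
    newest = {}  # kind -> (ISO date string, path)
    for path in Path(patient_dir).glob("*.json"):
        match = FILE_PATTERN.match(path.name)
        if match is None:
            continue
        kind, date = match.groups()
        # ISO-formatted dates compare correctly as plain strings.
        if kind not in newest or date > newest[kind][0]:
            newest[kind] = (date, path)
    for kind in ("data", "status"):
        if kind in newest:
            return newest[kind][1]
    return None


def load_clinical_records(root):
    """Parse one JSON record per patient folder found directly under `root`."""
    records = []
    for patient_dir in sorted(Path(root).iterdir()):
        if not patient_dir.is_dir():
            continue
        path = latest_clinical_file(patient_dir)
        if path is None:
            continue
        with open(path, encoding="utf-8") as handle:
            record = json.load(handle)
        record["patient_id"] = patient_dir.name  # hypothetical key name
        records.append(record)
    return records
```

The returned list of dictionaries can be passed to `pandas.DataFrame(records)` to obtain the tabular form that the later steps operate on.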

Important

The new pipeline merges the DICOM updates into the ‘age’ and ‘sex’ parameters rather than using ‘age_update’ and ‘sex_update’, as we found this led to confusion when utilising the data. The original age and sex remain available in the ‘age_b4dcm’ and ‘sex_b4dcm’ features, although these are not included in the output by default.
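The merge behaviour described above can be sketched in pandas as follows, using ‘sex’ as the example; the same pattern would apply to ‘age’. This is an illustration rather than the package's implementation. The helper `merge_dicom_update` is hypothetical, and the sketch assumes that DICOM values only fill blanks rather than overwrite existing entries.

```python
import pandas as pd


def merge_dicom_update(df, column, update_column, backup_suffix="_b4dcm"):
    """Fill blanks in `column` from `update_column`, keeping the original as a backup."""
    out = df.copy()
    # Preserve the pre-DICOM values, e.g. 'sex' -> 'sex_b4dcm'.
    out[column + backup_suffix] = out[column]
    # Assumption: existing values win; only missing entries take the DICOM value.
    out[column] = out[column].fillna(out[update_column])
    return out.drop(columns=[update_column])


patients = pd.DataFrame(
    {
        "sex": ["F", None, "M"],
        "sex_update": [None, "M", "M"],
    }
)
merged = merge_dicom_update(patients, "sex", "sex_update")
```

Here `merged` has a single ‘sex’ column with the blank filled from the DICOM value, plus ‘sex_b4dcm’ holding the original entries.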