nccidxclean.figures subpackage#

Submodules#

nccidxclean.figures.categorical#

This module contains the CategoricalField class and the functions that produce the categorical figures used in our write-up, comparing the nccid-cleaning and nccidxclean pipelines (Figures 2 and 6).

N.B. The prepare_dataframes_for_comparison and allocate_hospital_codes functions must be run prior to producing any of the figures.

class nccidxclean.figures.categorical.CategoricalField[source]#

Bases: object

A class to hold the data for a single categorical field, and to produce plots for the figures in the write-up.

Methods

cvs_map_values

map_values

methods_hist_by_hosp

remap_column_names

rename_field

results_hist_by_dataset

__init__(nhs_data, xtd_data, original_field, cleaned_field=None)[source]#

cvs_map_values(nhs_field_map, cam_field_map)[source]#

map_values(field_map)[source]#

methods_hist_by_hosp(axis, field_hue_order, field_palette, dataset, multi='stack')[source]#

remap_column_names(new_label, first_map=False)[source]#

rename_field(label)[source]#

results_hist_by_dataset(axis, field_hue_order, field_palette, multi='stack', shrink_bars=0.5)[source]#

nccidxclean.figures.categorical.make_figure_2(n_pos, x_pos)[source]#

Generate Figure 2 - histograms of sex (with and without DICOM enrichment), sex, pmh hypertension, pmh cvs disease for each of the three datasets (raw, nccid-cleaning pipeline, nccidxclean pipeline).

Parameters:
  • n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning

  • x_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean

Returns:

figure and axis containing figure 2

Return type:

plt.Figure, plt.Axes

nccidxclean.figures.categorical.make_figure_6(n_pos, x_pos, hosp_code_dict)[source]#

Generate Figure 6 - histograms of age and sex at each hospital in the dataset cleaned by nccid-cleaning.

Parameters:
  • n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning

  • x_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean

  • hosp_code_dict (Dict) – dictionary containing hospital codes and names

Returns:

figure and axis containing figure 6

Return type:

plt.Figure, plt.Axes

nccidxclean.figures.dates#

This module produces the date figures used in our write-up to compare the nccid-cleaning and nccidxclean pipelines (Figures 3 and 7).

N.B. The prepare_dataframes_for_comparison and allocate_hospital_codes functions must be run prior to producing any of the figures.

nccidxclean.figures.dates.extract_day_and_month(df)[source]#

Extracts the day and month from each date field in the dataframe.

Parameters:

df (pd.DataFrame) – dataframe containing dates

Returns:

dataframe containing extracted day and month from each cleaned date field

Return type:

pd.DataFrame
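
To illustrate the transformation described above, here is a minimal sketch; it is not the package's implementation. The function name `extract_day_and_month_sketch`, the `date_columns` argument, the `swab_date` column and the `_day` / `_month` suffixes are all assumptions made for this example.

```python
import pandas as pd


def extract_day_and_month_sketch(df, date_columns):
    """Return a copy of df with <col>_day and <col>_month added for each date column."""
    out = df.copy()
    for col in date_columns:
        # Missing or unparseable values become NaT instead of raising an error.
        dates = pd.to_datetime(out[col], errors="coerce")
        out[f"{col}_day"] = dates.dt.day
        out[f"{col}_month"] = dates.dt.month
    return out


# Hypothetical column name and values, for illustration only.
example = pd.DataFrame({"swab_date": ["2020-04-03", "2020-11-12", None]})
result = extract_day_and_month_sketch(example, ["swab_date"])
```

Rows with a missing date end up with missing day and month values, so they drop out of any histogram of days of the month.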

nccidxclean.figures.dates.get_gov_data()[source]#

Downloads the latest COVID-19 positivity data from the gov.uk API.

nccidxclean.figures.dates.get_min_gap(x)[source]#

Calculates the minimum gap between imaging and a PCR for each patient.

Parameters:

x (pd.Series) – row of dataframe corresponding to a patient

Returns:

row of dataframe with minimum gap between imaging and a PCR

Return type:

pd.Series
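
A row-wise calculation of this kind might look like the sketch below. It is an illustration under stated assumptions, not the package's code: the column names `pcr_date`, `study_dates` and `min_gap_days` are hypothetical, each row is assumed to hold a list of imaging dates, and the gap is taken as an absolute number of days.

```python
import pandas as pd


def min_gap_sketch(row):
    """Return a copy of the row with the smallest absolute PCR-to-imaging gap in days."""
    out = row.copy()
    pcr = row["pcr_date"]
    # Ignore missing imaging dates.
    imaging = [d for d in row["study_dates"] if pd.notna(d)]
    if pd.isna(pcr) or not imaging:
        out["min_gap_days"] = float("nan")
    else:
        out["min_gap_days"] = min(abs((d - pcr).days) for d in imaging)
    return out


# Hypothetical data: one row per patient.
patients = pd.DataFrame({
    "pcr_date": pd.to_datetime(["2020-04-10", "2020-05-01", None]),
    "study_dates": [
        [pd.Timestamp("2020-04-08"), pd.Timestamp("2020-04-15")],
        [pd.Timestamp("2020-05-01")],
        [pd.Timestamp("2020-05-01")],
    ],
})
with_gaps = patients.apply(min_gap_sketch, axis=1)
```

Applying the function with `axis=1` matches the documented signature, which takes one row and returns one row.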

nccidxclean.figures.dates.make_figure_3(n_pos, x_pos, n_dates, x_dates, hosp_code_dict, code_sorter)[source]#

Produces figure 3 from the paper, demonstrating the mean minimum time between PCR and imaging at each hospital and the distribution of days of the month in the data cleaned by the nccid-cleaning and nccidxclean pipelines.

Parameters:
  • n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning

  • x_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean

  • n_dates (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning with day and month extracted for each date and after merging with dates from dicom metadata

  • x_dates (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean with day and month extracted for each date and after merging with dates from dicom metadata

  • hosp_code_dict (Dict) – dictionary containing hospital codes and names

  • code_sorter (Dict) – dictionary containing hospital codes and sorting order

Returns:

figure and axis containing figure 3

Return type:

plt.Figure, plt.Axes

nccidxclean.figures.dates.make_figure_7(n_pos, hosp_code_dict)[source]#

Generates figure 7 - the distribution of dates of positive RT-PCR results at three hospitals where the distribution suggested a change in the date format, alongside the distribution of dates for the remaining NCCID hospitals and the proportion of positive PCR tests in England.

Parameters:
  • n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning

  • hosp_code_dict (Dict) – dictionary containing hospital codes and names

Returns:

figure and axis containing figure 7

Return type:

plt.Figure, plt.Axes
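
To make the date-format concern concrete, the sketch below shows how a single date string can be read in two ways. It is a general illustration and not part of the package: the slash-separated format and the helper name `parse_both_ways` are assumptions.

```python
from datetime import datetime


def parse_both_ways(text):
    """Read text as day/month/year and as month/day/year.

    Returns a (day_first, month_first) tuple; a reading that is not a
    valid date is returned as None.
    """
    readings = []
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            readings.append(datetime.strptime(text, fmt).date())
        except ValueError:
            readings.append(None)
    return tuple(readings)


# Both readings are valid when the day is 12 or lower, so the string is ambiguous.
ambiguous = parse_both_ways("03/04/2020")
# Only the day-first reading is valid when the day is above 12.
unambiguous = parse_both_ways("25/04/2020")
```

Because only days 1 to 12 can be mistaken for months, a switch between the two formats tends to distort the distribution of days and months rather than produce parsing errors.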

nccidxclean.figures.dates.merge_pcr_and_study_dates(df, img_df)[source]#

Merges the PCR and imaging dates and calculates the minimum gap between them.

Parameters:
  • df (pd.DataFrame) – dataframe containing pcr dates

  • img_df (pd.DataFrame) – dataframe containing imaging dates from dicom headers

Returns:

dataframe containing merged pcr and imaging dates

Return type:

pd.DataFrame
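
The merge step could be sketched as follows. This covers only the merge, not the gap calculation, and it is not the package's implementation: the key column `patient_id` and the columns `pcr_date`, `study_date` and `study_dates` are hypothetical names chosen for the example.

```python
import pandas as pd


def merge_dates_sketch(df, img_df, key="patient_id"):
    """Attach each patient's imaging dates (as a list) to the clinical dataframe."""
    study_dates = (
        img_df.groupby(key)["study_date"]
        .apply(list)            # one list of imaging dates per patient
        .rename("study_dates")
        .reset_index()
    )
    # A left merge keeps patients who have no imaging dates.
    return df.merge(study_dates, on=key, how="left")


# Hypothetical data, for illustration only.
clinical = pd.DataFrame({
    "patient_id": ["a", "b"],
    "pcr_date": pd.to_datetime(["2020-04-10", "2020-05-01"]),
})
imaging = pd.DataFrame({
    "patient_id": ["a", "a"],
    "study_date": pd.to_datetime(["2020-04-08", "2020-04-15"]),
})
merged = merge_dates_sketch(clinical, imaging)
```

Patients without any imaging record keep a missing value in `study_dates`, so a later gap calculation has to handle that case explicitly.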

nccidxclean.figures.dates.prepare_dataframe_dates(n_pos, x_pos, img_df)[source]#

Prepares the dataframes containing the dates for plotting.

Parameters:
  • n_pos (pd.DataFrame) – dataframe containing data cleaned by the nccid-cleaning pipeline.

  • x_pos (pd.DataFrame) – dataframe containing data cleaned by the nccidxclean pipeline.

  • img_df (pd.DataFrame) – dataframe containing imaging dates from dicom headers

Returns:

tuple of dataframes containing the dates, now ready for plotting

Return type:

Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]

nccidxclean.figures.dates.remap_melted_date_names(df)[source]#

Remaps the melted date names ready for plotting.

Parameters:

df (pd.DataFrame) – melted dataframe

Returns:

melted dataframe with remapped names

Return type:

pd.DataFrame

nccidxclean.figures.numeric#

This module contains functions used to create the figures for the numerical fields used in our write-up (Figures 4,5,8).

N.B. The prepare_dataframes_for_comparison and allocate_hospital_codes functions must be run prior to producing any of the figures.

nccidxclean.figures.numeric.clean_raw_data(df_raw)[source]#

Cleans the raw data ready for plotting.

Parameters:

df_raw (pd.DataFrame) – dataframe containing the raw data to be cleaned

Returns:

cleaned dataframe

Return type:

pd.DataFrame

nccidxclean.figures.numeric.create_numeric_results_fig(df, field, axis)[source]#

Creates a histogram for a numeric field.

Parameters:
  • df (pd.DataFrame) – dataframe containing the data

  • field (str) – field to plot

  • axis (plt.Axes) – axis to plot on

Returns:

axis containing the histogram

Return type:

plt.Axes

nccidxclean.figures.numeric.is_float(string)[source]#

Checks if a string can be converted to a float.

Parameters:

string (str) – input variable

Returns:

boolean indicating whether the string can be converted to a float

Return type:

bool
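
A check of this kind is usually written with a try/except around `float()`. The sketch below shows that common pattern; the name `is_float_sketch` is hypothetical and the package's own function may differ in detail, for example in how it treats non-string input.

```python
def is_float_sketch(value):
    """Return True if float(value) succeeds, False otherwise."""
    try:
        float(value)
    except (TypeError, ValueError):
        # ValueError: text that is not a number; TypeError: e.g. None.
        return False
    return True


checks = {text: is_float_sketch(text) for text in ["1.5", "1e3", " 42 ", "1,5", "abc", ""]}
```

Note that `float()` also accepts strings such as "nan" and "inf", so a check like this treats them as convertible.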

nccidxclean.figures.numeric.make_figure_4(n_pos, x_pos)[source]#

Generates figure 4 from the paper - histograms of FiO2, Creatinine, Troponin and D-Dimer for the raw data, nccid-cleaning pipeline, and nccidxclean pipeline.

Parameters:
  • n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning

  • x_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean

Returns:

figure and axis containing figure 4

Return type:

plt.Figure, plt.Axes

nccidxclean.figures.numeric.make_figure_5(n_pos, x_pos, hosp_code_dict)[source]#

Produces figure 5 from the paper demonstrating the distribution of pao2 / spo2 in the data cleaned by the nccid-cleaning and nccidxclean pipelines.

Parameters:
  • n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning

  • x_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean

  • hosp_code_dict (Dict) – dictionary containing hospital codes and names

Returns:

figure and axis containing figure 5

Return type:

plt.Figure, plt.Axes

nccidxclean.figures.numeric.make_figure_8(n_pos, hosp_code_dict, sorter)[source]#

Produces figure 8, a box plot of the pao2 values for each hospital generated by the nccid-cleaning pipeline.

Parameters:
  • n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning

  • hosp_code_dict (Dict) – dictionary containing hospital codes and names

  • sorter (Dict) – dictionary to sort the hospitals

Returns:

figure and axis containing figure 8

Return type:

plt.Figure, plt.Axes

nccidxclean.figures.overall_numbers#

This module contains the code to generate Figure 1 of the paper, which shows the overall number of missing values after using both pipelines.

N.B. The prepare_dataframes_for_comparison and allocate_hospital_codes functions must be run prior to producing any of the figures.

nccidxclean.figures.overall_numbers.get_total_missing(n_pos, x_pos)[source]#

Returns the total number of missing values and dates in each dataset.

Parameters:
  • n_pos (pd.DataFrame) – dataframe cleaned by the nccid-cleaning pipeline for positive pcr patients

  • x_pos (pd.DataFrame) – dataframe cleaned by the nccidxclean pipeline for positive pcr patients

Returns:

dataframes showing the total number of missing values and dates in each dataset

Return type:

pd.DataFrame, pd.DataFrame
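
Counting missing values per field for the two pipelines can be sketched with `isna().sum()`. This is a simplified illustration: it returns one combined table rather than the two dataframes documented above, and the function name, column labels and example data are assumptions.

```python
import pandas as pd


def total_missing_sketch(n_pos, x_pos):
    """Count missing values per column for each pipeline's output."""
    return pd.DataFrame({
        "nccid-cleaning": n_pos.isna().sum(),
        "nccidxclean": x_pos.isna().sum(),
    })


# Hypothetical cleaned data with the same columns in both dataframes.
n_example = pd.DataFrame({"age": [54, None, None], "sex": ["F", None, "M"]})
x_example = pd.DataFrame({"age": [54, 61, None], "sex": ["F", "M", "M"]})
summary = total_missing_sketch(n_example, x_example)
```

The comparison is only meaningful when both dataframes share the same columns, which is what the preparation step in `prepare_dataframes` is documented to ensure.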

nccidxclean.figures.overall_numbers.make_figure_1(n_pos, x_pos)[source]#

Generate Figure 1 of the paper, which shows the overall number of missing values after using both pipelines.

Parameters:
  • n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning

  • x_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean

Returns:

figure and axis containing figure 1

Return type:

plt.Figure, plt.Axes

nccidxclean.figures.prepare_dataframes#

The functions in this module prepare the dataframes from the nccid-cleaning and nccidxclean pipelines for comparison. Without this preparation, the columns of the two dataframes are not equivalent, so it must be run before producing any of the figures.

nccidxclean.figures.prepare_dataframes.allocate_hospital_codes(df)[source]#

Allocates a code to each hospital in the dataframe.

Parameters:

df (pd.DataFrame) – dataframe containing hospital names and submitting centres

Returns:

dataframe containing hospital names, submitting centres and allocated code

Return type:

pd.DataFrame

nccidxclean.figures.prepare_dataframes.prepare_dataframes_for_comparison(nhsx_df, xtd_df)[source]#

Prepares the dataframes from the nccid-cleaning and nccidxclean pipelines for comparison.

Parameters:
  • nhsx_df (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning

  • xtd_df (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean

Returns:

the prepared nhsx_df and xtd_df dataframes

Return type:

pd.DataFrame, pd.DataFrame

nccidxclean.figures.sns_settings#

Sets the seaborn settings, including palettes, for all figures.