nccidxclean.figures subpackage
Submodules
nccidxclean.figures.categorical
This module contains the CategoricalField class and the functions to produce the categorical figures used in our write-up, comparing the nccid-cleaning and nccidxclean pipelines (Figures 2, 6).
N.B. The prepare_dataframes_for_comparison and allocate_hospital_codes functions must be run prior to producing any of the figures.
- class nccidxclean.figures.categorical.CategoricalField
Bases: object
A class to hold the data for a single categorical field, and to produce plots for the figures in the write-up.
Methods
cvs_map_values
map_values
methods_hist_by_hosp
remap_column_names
rename_field
results_hist_by_dataset
- nccidxclean.figures.categorical.make_figure_2(n_pos, x_pos)
Generate Figure 2 - histograms of sex (with and without dicom enrichment), sex, pmh hypertension, pmh cvs disease for each of the three datasets (raw, nhsx pipeline, nccidxclean pipeline).
- Parameters:
n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning
x_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean
- Returns:
figure and axis containing figure 2
- Return type:
plt.Figure, plt.Axes
- nccidxclean.figures.categorical.make_figure_6(n_pos, x_pos, hosp_code_dict)
Generate Figure 6 - histograms of age and sex at each hospital in the dataset cleaned by nccid-cleaning.
- Parameters:
n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning
x_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean
hosp_code_dict (Dict) – dictionary containing hospital codes and names
- Returns:
figure and axis containing figure 6
- Return type:
plt.Figure, plt.Axes
nccidxclean.figures.dates
This module produces the date figures used in our write-up to compare the nccid-cleaning and nccidxclean pipelines (Figures 3,7).
N.B. The prepare_dataframes_for_comparison and allocate_hospital_codes functions must be run prior to producing any of the figures.
- nccidxclean.figures.dates.extract_day_and_month(df)
Extracts the day and month from each date field in the dataframe.
- Parameters:
df (pd.DataFrame) – dataframe containing dates
- Returns:
dataframe containing extracted day and month from each cleaned date field
- Return type:
pd.DataFrame
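The idea of splitting each date into its day and month can be illustrated without pandas. The following is only a minimal sketch on a single record: the function name `extract_day_and_month_sketch`, the field names, and the `_day` / `_month` suffixes are hypothetical and are not taken from the package, which operates on a dataframe.

```python
from datetime import date
from typing import Dict, Optional


def extract_day_and_month_sketch(
    record: Dict[str, Optional[date]]
) -> Dict[str, Optional[int]]:
    """Return '<field>_day' and '<field>_month' entries for each date field."""
    out: Dict[str, Optional[int]] = {}
    for field, value in record.items():
        # A missing date stays missing in both derived entries.
        out[f"{field}_day"] = value.day if value is not None else None
        out[f"{field}_month"] = value.month if value is not None else None
    return out


# Hypothetical field names, used only for illustration.
example = extract_day_and_month_sketch(
    {"swab_date": date(2020, 4, 3), "admission_date": None}
)
```

Separating day from month in this way makes it possible to look at how the days of the month are distributed, which is what Figure 3 plots.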
- nccidxclean.figures.dates.get_gov_data()
Downloads the latest COVID positivity data from the GOV.UK API.
- nccidxclean.figures.dates.get_min_gap(x)
Calculates the minimum gap between imaging and a PCR for each patient.
- Parameters:
x (pd.Series) – row of dataframe corresponding to a patient
- Returns:
row of dataframe with minimum gap between imaging and a PCR
- Return type:
pd.Series
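As a rough illustration of what a "minimum gap" between imaging and PCR dates could mean, here is a stdlib-only sketch. It assumes the gap is the smallest absolute difference in days between any imaging date and any PCR date for one patient. That definition, the function name `min_gap_days`, and the use of plain lists instead of a dataframe row are all assumptions, not the package's implementation.

```python
from datetime import date
from typing import Iterable, Optional


def min_gap_days(
    pcr_dates: Iterable[date], imaging_dates: Iterable[date]
) -> Optional[int]:
    """Smallest absolute gap in days between any imaging date and any PCR date."""
    pcr = list(pcr_dates)
    imaging = list(imaging_dates)
    if not pcr or not imaging:
        # No gap can be computed if either side has no dates.
        return None
    return min(abs((img - swab).days) for img in imaging for swab in pcr)


gap = min_gap_days(
    [date(2020, 4, 1), date(2020, 4, 20)],   # PCR dates
    [date(2020, 4, 5), date(2020, 4, 18)],   # imaging dates
)
```

In this example the closest pair is 18 April (imaging) and 20 April (PCR), so `gap` is 2.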
- nccidxclean.figures.dates.make_figure_3(n_pos, x_pos, n_dates, x_dates, hosp_code_dict, code_sorter)
Produces figure 3 from the paper, showing the mean minimum time between PCR and imaging at each hospital and the distribution of days of the month in the data cleaned by the nccid-cleaning and nccidxclean pipelines.
- Parameters:
n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning
x_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean
n_dates (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning with day and month extracted for each date and after merging with dates from dicom metadata
x_dates (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean with day and month extracted for each date and after merging with dates from dicom metadata
hosp_code_dict (Dict) – dictionary containing hospital codes and names
code_sorter (Dict) – dictionary containing hospital codes and sorting order
- Returns:
figure and axis containing figure 3
- Return type:
plt.Figure, plt.Axes
- nccidxclean.figures.dates.make_figure_7(n_pos, hosp_code_dict)
Generates figure 7 - the distribution of dates of positive RT-PCR results at 3 hospitals where the distribution suggested a change in the date format, alongside the distribution of dates for the rest of the NCCID hospitals and the proportion of positive PCR tests in England.
- Parameters:
n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning
hosp_code_dict (Dict) – dictionary containing hospital codes and names
- Returns:
figure and axis containing figure 7
- Return type:
plt.Figure, plt.Axes
- nccidxclean.figures.dates.merge_pcr_and_study_dates(df, img_df)
Merges the PCR and imaging dates and calculates the minimum gap between them.
- Parameters:
df (pd.DataFrame) – dataframe containing pcr dates
img_df (pd.DataFrame) – dataframe containing imaging dates from dicom headers
- Returns:
dataframe containing merged pcr and imaging dates
- Return type:
pd.DataFrame
- nccidxclean.figures.dates.prepare_dataframe_dates(n_pos, x_pos, img_df)
Prepares the dataframes containing the dates for plotting.
- Parameters:
n_pos (pd.DataFrame) – dataframe containing data cleaned by the nccid-cleaning pipeline.
x_pos (pd.DataFrame) – dataframe containing data cleaned by the nccidxclean pipeline.
img_df (pd.DataFrame) – dataframe containing imaging dates from dicom headers
- Returns:
tuple of dataframes containing the dates, now ready for plotting
- Return type:
Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]
nccidxclean.figures.numeric
This module contains functions used to create the figures for the numerical fields used in our write-up (Figures 4,5,8).
N.B. The prepare_dataframes_for_comparison and allocate_hospital_codes functions must be run prior to producing any of the figures.
- nccidxclean.figures.numeric.clean_raw_data(df_raw)
Cleans the raw data ready for plotting.
- Parameters:
df_raw (pd.DataFrame) – dataframe containing the raw data to be cleaned
- Returns:
cleaned dataframe
- Return type:
pd.DataFrame
- nccidxclean.figures.numeric.create_numeric_results_fig(df, field, axis)
Creates a histogram for a numeric field.
- Parameters:
df (pd.DataFrame) – dataframe containing the data
field (str) – field to plot
axis (plt.Axes) – axis to plot on
- Returns:
axis containing the histogram
- Return type:
plt.Axes
- nccidxclean.figures.numeric.is_float(string)
Checks if a string can be converted to a float.
- Parameters:
string (str) – input variable
- Returns:
boolean indicating whether the string can be converted to a float
- Return type:
bool
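A check of this kind is commonly written by attempting the conversion and catching the failure. The sketch below shows that pattern. It is not the package's source, and the name `is_float_sketch` is hypothetical.

```python
def is_float_sketch(value) -> bool:
    """Return True if float(value) succeeds, False otherwise."""
    try:
        float(value)
    except (TypeError, ValueError):
        # ValueError: text that is not numeric; TypeError: e.g. None.
        return False
    return True


checks = {s: is_float_sketch(s) for s in ["3.5", "1e-3", "abc", ""]}
```

With this approach, strings such as "nan" and "inf" also count as convertible, because Python's `float` accepts them. Raw values like these may need separate handling before plotting.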
- nccidxclean.figures.numeric.make_figure_4(n_pos, x_pos)
Generates figure 4 from the paper - histograms of FiO2, Creatinine, Troponin and D-Dimer for the raw data, nccid-cleaning pipeline, and nccidxclean pipeline.
- Parameters:
n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning
x_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean
- Returns:
figure and axis containing figure 4
- Return type:
plt.Figure, plt.Axes
- nccidxclean.figures.numeric.make_figure_5(n_pos, x_pos, hosp_code_dict)
Produces figure 5 from the paper demonstrating the distribution of pao2 / spo2 in the data cleaned by the nccid-cleaning and nccidxclean pipelines.
- Parameters:
n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning
x_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean
hosp_code_dict (Dict) – dictionary containing hospital codes and names
- Returns:
figure and axis containing figure 5
- Return type:
plt.Figure, plt.Axes
- nccidxclean.figures.numeric.make_figure_8(n_pos, hosp_code_dict, sorter)
Produces figure 8, a box plot of the pao2 values for each hospital generated by the nccid-cleaning pipeline.
- Parameters:
n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning
hosp_code_dict (Dict) – dictionary containing hospital codes and names
sorter (Dict) – dictionary to sort the hospitals
- Returns:
figure and axis containing figure 8
- Return type:
plt.Figure, plt.Axes
nccidxclean.figures.overall_numbers
This file contains the code to generate Figure 1 of the paper providing the overall number of missing values after using both pipelines.
N.B. The prepare_dataframes_for_comparison and allocate_hospital_codes functions must be run prior to producing any of the figures.
- nccidxclean.figures.overall_numbers.get_total_missing(n_pos, x_pos)
Returns the total number of missing values and dates in each dataset.
- Parameters:
n_pos (pd.DataFrame) – dataframe cleaned by the nccid-cleaning pipeline for positive pcr patients
x_pos (pd.DataFrame) – dataframe cleaned by the nccidxclean pipeline for positive pcr patients
- Returns:
dataframes showing the total number of missing values and dates in each dataset
- Return type:
pd.DataFrame, pd.DataFrame
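Counting missing values per field in each dataset can be sketched with plain Python. This is only an illustration of the idea: the helper `count_missing`, the field names, and the use of `None` as the missing-value marker are assumptions, whereas the real function works on the two pandas dataframes.

```python
from typing import Dict, List, Optional


def count_missing(records: List[Dict[str, Optional[object]]]) -> Dict[str, int]:
    """Count, per field, how many records have no value (None)."""
    counts: Dict[str, int] = {}
    for record in records:
        for field, value in record.items():
            counts[field] = counts.get(field, 0) + (1 if value is None else 0)
    return counts


# Two toy datasets standing in for the outputs of the two pipelines.
n_missing = count_missing([{"age": 54, "sex": None}, {"age": None, "sex": None}])
x_missing = count_missing([{"age": 54, "sex": "F"}, {"age": None, "sex": "M"}])
```

Comparing `n_missing` with `x_missing` field by field gives the kind of per-pipeline totals that Figure 1 summarises.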
- nccidxclean.figures.overall_numbers.make_figure_1(n_pos, x_pos)
Generate Figure 1 of the paper providing the overall number of missing values after using both pipelines.
- Parameters:
n_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning
x_pos (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean
- Returns:
figure and axis containing figure 1
- Return type:
Tuple[plt.Figure, plt.Axes]
nccidxclean.figures.prepare_dataframes
The functions in this module prepare the dataframes from the nccid-cleaning and nccidxclean pipelines for comparison. Without this step, the columns of the two dataframes are not equivalent, so it must be run before producing any of the figures.
- nccidxclean.figures.prepare_dataframes.allocate_hospital_codes(df)
Allocates a code to each hospital in the dataframe.
- Parameters:
df (pd.DataFrame) – dataframe containing hospital names and submitting centres
- Returns:
dataframe containing hospital names, submitting centres and allocated code
- Return type:
pd.DataFrame
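The documentation does not state how codes are chosen. Purely as an illustration of allocating one stable code per hospital, the sketch below numbers hospitals in order of first appearance. The `H1`, `H2`, ... format and the name `allocate_codes_sketch` are hypothetical.

```python
from typing import Dict, Iterable


def allocate_codes_sketch(hospitals: Iterable[str]) -> Dict[str, str]:
    """Map each distinct hospital name to a code, in order of first appearance."""
    codes: Dict[str, str] = {}
    for name in hospitals:
        if name not in codes:
            # Hypothetical code format; the package may use a different scheme.
            codes[name] = f"H{len(codes) + 1}"
    return codes


codes = allocate_codes_sketch(["Hospital A", "Hospital B", "Hospital A"])
```

A mapping like this lets the figures label hospitals by a short code instead of by name.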
- nccidxclean.figures.prepare_dataframes.prepare_dataframes_for_comparison(nhsx_df, xtd_df)
Prepares the dataframes from the nccid-cleaning and nccidxclean pipelines for comparison.
- Parameters:
nhsx_df (pd.DataFrame) – dataframe containing clinical data cleaned by nccid-cleaning
xtd_df (pd.DataFrame) – dataframe containing clinical data cleaned by nccidxclean
- Returns:
nhsx_df, xtd_df
- Return type:
pd.DataFrame, pd.DataFrame
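The ordering requirement stated in each module (prepare the dataframes and allocate hospital codes before plotting) might look like the following in practice. This is a sketch under assumptions: the wrapper `build_figure_1` is hypothetical, the prepared dataframes are assumed to be usable directly as `n_pos` and `x_pos`, and passing the prepared nccidxclean dataframe to allocate_hospital_codes is a guess.

```python
def build_figure_1(nhsx_df, xtd_df):
    """Hypothetical wrapper: prepare both dataframes, then plot Figure 1."""
    # Imported lazily so this sketch can be defined without the package installed.
    from nccidxclean.figures.overall_numbers import make_figure_1
    from nccidxclean.figures.prepare_dataframes import (
        allocate_hospital_codes,
        prepare_dataframes_for_comparison,
    )

    # Step 1: make the columns of the two pipelines' outputs comparable.
    n_df, x_df = prepare_dataframes_for_comparison(nhsx_df, xtd_df)

    # Step 2: allocate hospital codes (assumption: applied to the prepared data).
    hosp_codes = allocate_hospital_codes(x_df)

    # Step 3: produce the figure from the prepared dataframes.
    fig_and_ax = make_figure_1(n_df, x_df)
    return fig_and_ax, hosp_codes
```

Figures that take a `hosp_code_dict` argument would also need that dictionary built from the allocated codes. How it is built is not described here.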
nccidxclean.figures.sns_settings
Sets the seaborn settings, including palettes, for all figures.