Tidy Data

Handout 06
Date: 2022-11-06
Topic: Tidy Data

Literature
Handout
Ismay & Kim (2022) Chapter

Recap

Ismay & Kim Introduction Chapter 4: dplyr functions

Collecting Data Example

Go to https://data.worldbank.org.
Search ‘tuberculosis’ and choose Incidence of Tuberculosis (per 100,000 people).
Choose Databank in the menu on the right hand side at the bottom of the webpage.
Left hand side; Country menu, choose Countries; Years: choose all years in the range 2000 - 2020.
Download options: choose Advanced options, choose CSV and make some smart choices; download the data to your laptop.
Create a new R script and write code to tidy the downloaded data.

library(tidyverse)
tbc_download <- read_csv("/Users/Rnew/500Courses/IOM/datafiles/P_Data_Extract_From_World_Development_Indicators/ca4634cd-d4ef-465c-816c-d70954bda653_Data.csv")
tail(tbc_download)

  #data cleaning
#make a work copy of the dataset
tbc <- tbc_download
#rename the columns to comply with conventions (no spaces, not beginning with a number)
#paste0() functions is handy here
names(tbc) <- c("INDICATOR", "CODE", "COUNTRY", "COUNTRYCODE",
                paste0("YR", 2000:2021))
#remove NA's in COUNTRY variable
tbc <- tbc %>% 
  filter(!is.na(COUNTRY))

Long and Wide format

The data in the example tbc dataset are in wide format.
In this format the tbc incidence for every year is a variable; it’s more tidy to consider year as a variable so that an observation consists of values for the variables COUNTRY, COUNTRYCODE, YEAR, TBC_INCIDENCE.

#transform into long format with dplyr::pivot_longer()
tbc_long <- tbc %>% 
  pivot_longer(cols = YR2000:YR2020, names_to = "YEAR", values_to = "TBC_INC") %>% 
  select(c("COUNTRY", "COUNTRYCODE", "YEAR", "TBC_INC"))

#transform the TBC_INC inot numeric format
tbc_long <- tbc_long %>% 
  mutate(TBC_INC = as.numeric(TBC_INC))

#to transform the data back to a wide format use dplyr::pivot_wider()
tbc_wide <- tbc_long %>% 
  pivot_wider(names_from = "YEAR",
              values_from = "TBC_INC")

#summarizing the data with five number summary for each year

tbc_summary <- tbc_long %>% 
  group_by(YEAR) %>% 
  summarize(COUNT = n(),
            MIN = min(TBC_INC, na.rm = TRUE),
            Q1 = quantile(TBC_INC, 0.25, na.rm=  TRUE),
            MEDIAN = median(TBC_INC, na.rm = TRUE),
            Q3 = quantile(TBC_INC, 0.75, na.rm = TRUE),
            MAX = max(TBC_INC, na.rm = TRUE))
#an average value is not appropriate in this case (why not?)

Contingency Tables

Contingency Tables are important when analysing the association between two categorical variables.
An example from healthcare.
People can use a self test to test whether they are infected with a certain virus. The result of this test is POS (effected according to the test) or NEG (not infected according to the test). The test is not always correct. To test how good the self test is, a sample of people are asked to take the self-test and after that they are tested with a test in a test centre which is 100% reliable.

library(tidyverse)
test_data <- read_rds("datafiles/test_data.rds")
head(test_data)

Create a two-way table to assess the goodness of the test.

test_results_table <- table(test_data$TEST, test_data$TRUTH)
test_results_table

     
      INFECTED NOT_INFECTED
  POS       98           57
  NEG        0         9845

The two-way table can be used to assess the goodness of the test.
In data analysis, a table like this in which test results are compared with what is actually true, is called a Confusion Matrix.

The structure of the table is shown in Table 1.
The code in the script uses the base R table() function. It is possible to use the transposed table, i.e. the TRUTH values in the rows and the TEST results in the columns.

Table 1
Structure of a Confusion Matrix

		TRUTH
		POSITIVE	NEGATIVE
TEST	POSITIVE	TP	FP
	NEGATIVE	FN	TN
TOTAL

Adding totals can be done with the base R addmargins() function.

addmargins(test_results_table)

     
      INFECTED NOT_INFECTED   Sum
  POS       98           57   155
  NEG        0         9845  9845
  Sum       98         9902 10000

Relative frequencies can be calculated as well; for that, first the question has to be answered which of the relative frequencies are meant. There are three options:

the cell values as proportions of the total number of tests; function prop.table()
the cell values as the proportion of the row totals; function prop.table(x, margins = 1)
the cell values as proportions of the column totals; function prop.table(x, margins = 2)

prop.table(test_results_table) %>% addmargins()

     
      INFECTED NOT_INFECTED    Sum
  POS   0.0098       0.0057 0.0155
  NEG   0.0000       0.9845 0.9845
  Sum   0.0098       0.9902 1.0000

The numbers in this table show the proportions of the overall total for each cell. It shows that the overall accuracy, that is the total proportion/percentage, of correct test results is high: 0.0098 + 0.9845 = 0.9943.

The next table shows the proportions of the column totals.

prop.table(test_results_table, margin = 2) %>% addmargins(1) %>% round(4)

     
      INFECTED NOT_INFECTED
  POS   1.0000       0.0058
  NEG   0.0000       0.9942
  Sum   1.0000       1.0000

It shows that for all infected people, the test result is POS; for the not-infected people the test result is NEG in 99.4% of all cases.
Does this mean that this is a reliable test?
To answer that question, calculate the cell values as proportions of the row sums.

prop.table(test_results_table, margin = 1) %>% addmargins(2) %>% round(4)

     
      INFECTED NOT_INFECTED    Sum
  POS   0.6323       0.3677 1.0000
  NEG   0.0000       1.0000 1.0000

This table says that from all people that have a POS test result, 63.2% is infected and 36.8% is not. From the people who have a NEG test result, none are infected.

In terms of probabilities:

P(POS | INFECTED) = 1.0000; True Positive Rate (TPR) or Sensitivity
P(NEG | INFECTED) = 0.0000; False Negative Rate

P(NEG | NOT_INFECTED) = 0.9942; True Negative Rate (TNR) or Specificity
P(POS | NOT_INFECTED) = 0.0058; False Positive Rate

P(INFECTED | POS) = 0.6323; Positive Predicted Value (PPV) or Precision
P(NOT_INFECTED | POS) = 0.3677; False Discovery Rate (FDR)

P(NOT_INFECTED | NEG) = 1.0000; Negative Predicted Value (NPV)
P(INFECTED | NEG) = 0.0000; False Omission Rate (FOR)