Handout 06
Date: 2022-11-06
Topic: Tidy Data
Literature:
- Handout
- Ismay & Kim (2022) Chapter
- Ismay & Kim Introduction Chapter 4: dplyr functions
1. Go to https://data.worldbank.org.
2. Search 'tuberculosis' and choose Incidence of Tuberculosis (per 100,000 people).
3. Choose DataBank in the menu on the right-hand side at the bottom of the webpage.
4. On the left-hand side, in the Country menu, choose Countries; under Years, choose all years in the range 2000-2020.
5. Under Download options, choose Advanced options, choose CSV, make some smart choices, and download the data to your laptop.
Create a new R script and write code to tidy the downloaded data.
library(tidyverse)
tbc_download <- read_csv("/Users/Rnew/500Courses/IOM/datafiles/P_Data_Extract_From_World_Development_Indicators/ca4634cd-d4ef-465c-816c-d70954bda653_Data.csv")
tail(tbc_download)
#data cleaning
#make a work copy of the dataset
tbc <- tbc_download
#rename the columns to comply with conventions (no spaces, not beginning with a number)
#the paste0() function is handy here
names(tbc) <- c("INDICATOR", "CODE", "COUNTRY", "COUNTRYCODE",
paste0("YR", 2000:2021))
#remove rows with NA in the COUNTRY variable
tbc <- tbc %>%
filter(!is.na(COUNTRY))
The data in the example tbc dataset are in wide format: the TBC incidence
for every year is a separate column. It is tidier to treat year as a
variable, so that an observation consists of values for the variables
COUNTRY, COUNTRYCODE, YEAR, TBC_INC.
#transform into long format with tidyr::pivot_longer()
tbc_long <- tbc %>%
pivot_longer(cols = YR2000:YR2020, names_to = "YEAR", values_to = "TBC_INC") %>%
select(c("COUNTRY", "COUNTRYCODE", "YEAR", "TBC_INC"))
#transform TBC_INC into numeric format
tbc_long <- tbc_long %>%
mutate(TBC_INC = as.numeric(TBC_INC))
#to transform the data back to a wide format use tidyr::pivot_wider()
tbc_wide <- tbc_long %>%
pivot_wider(names_from = "YEAR",
values_from = "TBC_INC")
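As a quick sanity check on the round trip (a sketch, assuming the tbc_long and tbc_wide objects created above), the wide table should have one row per country and one column per year plus the two identifier columns:

```r
# n_distinct() comes from dplyr (loaded via tidyverse).
# tbc_wide: COUNTRY + COUNTRYCODE + one column per year in tbc_long.
stopifnot(
  nrow(tbc_wide) == n_distinct(tbc_long$COUNTRY),
  ncol(tbc_wide) == 2 + n_distinct(tbc_long$YEAR)
)
```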
#summarizing the data with five number summary for each year
tbc_summary <- tbc_long %>%
group_by(YEAR) %>%
summarize(COUNT = n(),
MIN = min(TBC_INC, na.rm = TRUE),
Q1 = quantile(TBC_INC, 0.25, na.rm= TRUE),
MEDIAN = median(TBC_INC, na.rm = TRUE),
Q3 = quantile(TBC_INC, 0.75, na.rm = TRUE),
MAX = max(TBC_INC, na.rm = TRUE))
#an average value is not appropriate in this case (why not?)
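To see why the mean is not appropriate here, compare it with the median per year; incidence data are strongly right-skewed, so a few countries with very high incidence pull the mean far above the median (a sketch, assuming tbc_long from above):

```r
# Mean vs. median per year: for right-skewed data the mean is
# inflated by extreme values, while the median is robust.
tbc_long %>%
  group_by(YEAR) %>%
  summarize(MEAN = mean(TBC_INC, na.rm = TRUE),
            MEDIAN = median(TBC_INC, na.rm = TRUE)) %>%
  head()
```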
Contingency Tables are important when analysing the association
between two categorical variables.
An example from healthcare: people can use a self-test to check whether
they are infected with a certain virus. The result of this test is POS
(infected according to the test) or NEG (not infected according to the
test). The test is not always correct. To assess how good the self-test
is, a sample of people are asked to take the self-test and are then
tested in a test centre with a test that is 100% reliable.
library(tidyverse)
test_data <- read_rds("datafiles/test_data.rds")
head(test_data)
Create a two-way table to assess the goodness of the test.
test_results_table <- table(test_data$TEST, test_data$TRUTH)
test_results_table
INFECTED NOT_INFECTED
POS 98 57
NEG 0 9845
The two-way table can be used to assess the goodness of the test.
In data analysis, a table like this, in which test results are compared
with what is actually true, is called a Confusion Matrix.
The structure of the table is shown in Table 1.
The code in the script uses the base R table() function. It is possible
to use the transposed table, i.e. the TRUTH values in the rows and the
TEST results in the columns.
Table 1
Structure of a Confusion Matrix
|                | TRUTH: POSITIVE | TRUTH: NEGATIVE | TOTAL             |
|----------------|-----------------|-----------------|-------------------|
| TEST: POSITIVE | TP              | FP              | TP + FP           |
| TEST: NEGATIVE | FN              | TN              | FN + TN           |
| TOTAL          | TP + FN         | FP + TN         | TP + FP + FN + TN |
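The four cells of the confusion matrix can be extracted from test_results_table by their row (TEST) and column (TRUTH) labels; a sketch using the labels shown in the output above:

```r
# Extract the confusion-matrix cells by name; the counts correspond
# to the two-way table printed earlier.
TP <- test_results_table["POS", "INFECTED"]      # true positives
FP <- test_results_table["POS", "NOT_INFECTED"]  # false positives
FN <- test_results_table["NEG", "INFECTED"]      # false negatives
TN <- test_results_table["NEG", "NOT_INFECTED"]  # true negatives
```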
Adding totals can be done with the base R addmargins() function.
addmargins(test_results_table)
INFECTED NOT_INFECTED Sum
POS 98 57 155
NEG 0 9845 9845
Sum 98 9902 10000
Relative frequencies can be calculated as well; for that, one first has to decide which relative frequencies are meant. There are three options: proportions of the overall total, of the column totals, or of the row totals. The first option:
prop.table(test_results_table) %>% addmargins()
INFECTED NOT_INFECTED Sum
POS 0.0098 0.0057 0.0155
NEG 0.0000 0.9845 0.9845
Sum 0.0098 0.9902 1.0000
The numbers in this table are the proportions of the overall total for each cell. They show that the overall accuracy, that is, the total proportion of correct test results, is high: 0.0098 + 0.9845 = 0.9943.
The next table shows the proportions of the column totals.
prop.table(test_results_table, margin = 2) %>% addmargins(1) %>% round(4)
INFECTED NOT_INFECTED
POS 1.0000 0.0058
NEG 0.0000 0.9942
Sum 1.0000 1.0000
It shows that for all infected people, the test result is POS; for
the not-infected people the test result is NEG in 99.4% of all
cases.
Does this mean that this is a reliable test?
To answer that question, calculate the cell values as proportions of the
row sums.
prop.table(test_results_table, margin = 1) %>% addmargins(2) %>% round(4)
INFECTED NOT_INFECTED Sum
POS 0.6323 0.3677 1.0000
NEG 0.0000 1.0000 1.0000
This table says that of all people with a POS test result, 63.2% are infected and 36.8% are not. Of the people with a NEG test result, none are infected.
In terms of probabilities:
P(POS | INFECTED) = 1.0000; True Positive Rate (TPR) or Sensitivity
P(NEG | INFECTED) = 0.0000; False Negative Rate (FNR)
P(NEG | NOT_INFECTED) = 0.9942; True Negative Rate (TNR) or Specificity
P(POS | NOT_INFECTED) = 0.0058; False Positive Rate (FPR)
P(INFECTED | POS) = 0.6323; Positive Predictive Value (PPV) or Precision
P(NOT_INFECTED | POS) = 0.3677; False Discovery Rate (FDR)
P(NOT_INFECTED | NEG) = 1.0000; Negative Predictive Value (NPV)
P(INFECTED | NEG) = 0.0000; False Omission Rate (FOR)
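These rates can be computed directly from the cell counts of the two-way table; a sketch using the counts shown above:

```r
# Cell counts from test_results_table above
TP <- 98; FP <- 57; FN <- 0; TN <- 9845

sensitivity <- TP / (TP + FN)  # TPR: P(POS | INFECTED)
specificity <- TN / (TN + FP)  # TNR: P(NEG | NOT_INFECTED)
ppv         <- TP / (TP + FP)  # Precision: P(INFECTED | POS)
npv         <- TN / (TN + FN)  # P(NOT_INFECTED | NEG)
round(c(TPR = sensitivity, TNR = specificity, PPV = ppv, NPV = npv), 4)
#    TPR    TNR    PPV    NPV
# 1.0000 0.9942 0.6323 1.0000
```

Note that despite the perfect sensitivity, the PPV is only 0.6323, because the infection is rare in the sample (98 out of 10000 people).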