Handout 04

Recap

In this course we work with structured datasets, which come in many different file formats.
Popular formats include csv, xlsx (MS Excel) and json.
R can import a great variety of dataset formats; it is a matter of finding the appropriate function.
In this course readr::read_csv() is used for csv files, openxlsx::read.xlsx() for MS Excel files.
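
As a minimal sketch of the import step, the code below writes a small csv file to a temporary location and reads it back with readr::read_csv(). The file contents and the names tmp_file and weather_small are made up for illustration; the commented-out line shows the analogous call for an Excel file, with a hypothetical file name.

```r
library(readr)

# Hypothetical example data, written to a temporary csv file
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("DATETIME,TEMP",
             "2020-01-01,8.5",
             "2020-01-02,9.1"), tmp_file)

# Import the csv file; read_csv() returns a tibble
weather_small <- read_csv(tmp_file, show_col_types = FALSE)

# Inspect the column names and the data type of each column
str(weather_small)

# An MS Excel file is imported in a similar way (file name is hypothetical):
# weather_xlsx <- openxlsx::read.xlsx("weather.xlsx", sheet = 1)
```

Checking the result with str() right after importing is a quick way to see whether every column received the expected data type.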

Graphing Data

Graphing data is at the heart of every stage of data analysis, the exploratory stage as well as the explanatory stage.
This course uses the ggplot2 package for graphing data; the workhorse is the ggplot() function. See handout03.
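
As a reminder of the basic pattern, here is a minimal sketch of a ggplot() call. It uses the built-in cars data set rather than course data, and the axis labels are chosen for illustration only.

```r
library(ggplot2)

# Built-in example data set: speed and stopping distance of cars
p <- ggplot(data = cars, mapping = aes(x = speed, y = dist)) +
  geom_point() +
  labs(x = "Speed (mph)", y = "Stopping distance (ft)")

# Draw the plot
print(p)
```

The data and the aesthetic mapping go into ggplot(); the geom_ layer determines the type of graph.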

We work through the remaining part of handout03 and the following scripts:

ggplot_barplot.R
ggplot_histogram.R
ggplot_boxplot.R
ggplot_barplot2.R
ggplot_scatterplot.R
ggplot_timeseries.R

R Objects

R distinguishes different kinds of objects. The most important ones are:

Vectors
Data Frames/ Tibbles
Functions
Lists; an object that consists of components, which may be of different types and lengths
Matrices; a two-dimensional object with rows and columns in which all elements have the same data type
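
The object kinds listed above can be illustrated with a short sketch. All names and values are made up for illustration.

```r
# Vector: an ordered collection of elements of one data type
temps <- c(21.5, 23.0, 19.8)

# Data frame: a table whose columns are vectors of equal length
weather <- data.frame(DAY = c("Mon", "Tue", "Wed"), TEMP = temps)

# Function: takes arguments and returns a value
to_fahrenheit <- function(celsius) celsius * 9 / 5 + 32

# List: components may differ in type and length
station <- list(name = "Amman", temps = temps, data = weather)

# Matrix: two-dimensional, all elements share one data type
m <- matrix(1:6, nrow = 2, ncol = 3)

class(temps)          # "numeric"
class(weather)        # "data.frame"
class(to_fahrenheit)  # "function"
class(station)        # "list"
is.matrix(m)          # TRUE
```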

Data Types

R distinguishes different data types, such as:

Numeric
Integer
Character
Date; standard format: yyyy-mm-dd
Logical (TRUE / FALSE)

All elements of a vector have the same data type. Every column in a data frame is a vector, so every element in a column has the same data type. Note that in MS Excel the cells in one column can hold different data types; in R, as in other statistical programs, this is not possible. Every column is a variable and a variable has one data type.
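
A short sketch of the data types listed above, and of what happens when values of different types are combined in one vector. The object names are chosen for illustration.

```r
x_num  <- 3.14                    # numeric
x_int  <- 42L                     # integer; the L suffix marks an integer
x_chr  <- "Amman"                 # character
x_date <- as.Date("2024-03-15")   # Date in the standard yyyy-mm-dd format
x_lgl  <- TRUE                    # logical

class(x_num)   # "numeric"
class(x_int)   # "integer"
class(x_chr)   # "character"
class(x_date)  # "Date"
class(x_lgl)   # "logical"

# A vector holds one data type, so mixed input is converted to a common type
mixed <- c(1, "a", TRUE)
class(mixed)   # "character"
mixed          # "1" "a" "TRUE"
```

This conversion is why a single stray text value in a numeric column can turn the whole column into a character variable on import.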

Factor

A factor is a vector object used to specify a discrete classification (categorization, grouping). Factors can be ordered or unordered.
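
A minimal sketch of an unordered and an ordered factor. The category names are made up for illustration.

```r
# Unordered factor: categories without a natural order
season <- factor(c("winter", "summer", "summer", "spring"))
levels(season)   # "spring" "summer" "winter" (alphabetical by default)
table(season)    # counts per category

# Ordered factor: the order of the levels is specified explicitly
heat <- factor(c("mild", "hot", "cold", "hot"),
               levels = c("cold", "mild", "hot"),
               ordered = TRUE)
heat < "hot"     # TRUE FALSE TRUE FALSE
```

Comparisons such as < are only meaningful for ordered factors; for an unordered factor R returns NA with a warning.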

Subsetting Data Frames

There are different ways to subset a data frame. For instance:

df$VAR1; select one column, the variable VAR1
df[3,5]; select one cell, the cell in row 3 and column 5
df[1:100, 2:4]; select the first 100 rows and columns 2, 3 and 4
df[5,]; select row 5
df[,10]; select column 10
df[c(1,3,5), c(2,3)]; select the cells in the 1st, 3rd and 5th row and the 2nd and 3rd column
df[,c("VAR1", "VAR2")]; select a data frame with two columns, the variables VAR1 and VAR2
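
The subsetting forms above can be tried on a small data frame. The data frame below is hypothetical and has only 5 rows and 3 columns, so the indices are smaller than in the list.

```r
# Hypothetical data frame with 5 rows and 3 columns
df <- data.frame(VAR1 = 1:5,
                 VAR2 = c("a", "b", "c", "d", "e"),
                 VAR3 = c(2.5, 3.5, 4.5, 5.5, 6.5),
                 stringsAsFactors = FALSE)

df$VAR1                    # one column, returned as a vector
df[3, 2]                   # one cell: row 3, column 2, here "c"
df[1:3, 2:3]               # rows 1 to 3, columns 2 and 3
df[5, ]                    # row 5, returned as a data frame
df[, 3]                    # column 3, returned as a vector
df[c(1, 3, 5), c(1, 2)]    # rows 1, 3 and 5, columns 1 and 2
df[, c("VAR1", "VAR3")]    # columns selected by name
```

Note that a tibble behaves slightly differently: selecting a single column with square brackets, as in df[, 3], returns a one-column tibble instead of a vector.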

EXERCISE
Open a new R script file.
Import the Amman Weather data set (https://iom.zwannen.nl/datafiles/amman_weather.csv).
Write code that:

selects the TEMP, TEMPMIN and TEMPMAX variables
selects the first 500 rows in the data set
selects the last 3650 rows and the variables DATETIME, TEMP, HUMIDITY, WINDSPEED and PRECIP.

Functions dplyr::select() and dplyr::filter()

Logical expressions can also be used inside the square brackets to select columns and filter rows. A preferable alternative is to use functions from the dplyr package: select() to select columns and filter() to filter rows.
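
To make the comparison concrete, the sketch below does the same subsetting twice, once with a logical expression in square brackets and once with dplyr. The miniature weather table is hypothetical; only the variable names follow the Amman data set.

```r
library(dplyr)

# Hypothetical miniature weather data
weather <- tibble(
  DATETIME = as.Date(c("2022-07-01", "2022-07-02", "2022-07-03")),
  TEMPMAX  = c(38.2, 41.0, 42.5),
  PRECIP   = c(0, 0, 1.2)
)

# Base R: a logical expression selects rows, a name vector selects columns
hot_base <- weather[weather$TEMPMAX > 40, c("DATETIME", "TEMPMAX")]

# dplyr: the same selection with filter() and select()
hot_dplyr <- weather %>%
  filter(TEMPMAX > 40) %>%
  select(DATETIME, TEMPMAX)

hot_base
hot_dplyr
```

Besides readability there is one practical difference: when the condition evaluates to NA for a row, square-bracket subsetting returns a row of missing values, while filter() drops that row.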

Example Code (1)

library(tidyverse)
library(lubridate)
amman_weather <- read_csv("https://iom.zwannen.nl/datafiles/amman_weather.csv")

#select DATETIME, TEMP, TEMPMAX and TEMPMIN variables
amman_temp <- amman_weather %>% 
  select(DATETIME, TEMP, TEMPMAX, TEMPMIN)

#filter the days on which TEMPMAX was more than 40 degrees Celsius
amman_hot <- amman_weather %>% 
  filter(TEMPMAX > 40)

#filter the days with no precipitation
amman_dry <- amman_weather %>% 
  filter(PRECIP == 0)

#filter the days with precipitation
amman_wet <- amman_weather %>% 
  filter(PRECIP != 0)

#filter the day(s) with the highest TEMPMAX
amman_hotst <- amman_weather %>% 
  filter(TEMPMAX == max(TEMPMAX))
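
A note on filtering by the maximum: max() returns NA when the vector contains missing values, in which case a comparison such as TEMPMAX == max(TEMPMAX) keeps no rows. Whether the Amman data contains missing values is not checked here; the sketch below uses a made-up table named toy to show the effect and the na.rm = TRUE remedy.

```r
library(dplyr)

# Hypothetical data with one missing TEMPMAX value
toy <- tibble(DAY = 1:4, TEMPMAX = c(35.1, NA, 41.3, 41.3))

max(toy$TEMPMAX)                 # NA, because of the missing value
max(toy$TEMPMAX, na.rm = TRUE)   # 41.3

# Without na.rm the comparison is NA for every row, so no rows are kept
none <- toy %>% filter(TEMPMAX == max(TEMPMAX))

# With na.rm = TRUE both days that share the maximum are kept
hottest <- toy %>% filter(TEMPMAX == max(TEMPMAX, na.rm = TRUE))
```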

Example Code (2)

tbc <- read_csv("https://iom.zwannen.nl/datafiles/tbc_data.csv")

#change variable names
names(tbc) <- c("YEAR", "YEAR_CODE", "COUNTRY_NAME", "COUNTRY_CODE",
               "TBC_INC", "TBC_DEATH_RATE")

#filter data from Jordan for the years 2010-2020
tbc_jordan <- tbc %>% 
  filter(COUNTRY_CODE == "JOR" & YEAR >= 2010 & YEAR <= 2020)

#filter data from Jordan, Syria, Israel and Egypt from 2000 onwards
tbc_me <- tbc %>% 
  filter(YEAR >= 2000 & (COUNTRY_CODE == "JOR" | COUNTRY_CODE == "SYR" |
           COUNTRY_CODE == "ISR" | COUNTRY_CODE == "EGY"))

#alternative
tbc_me_alt <- tbc %>% 
  filter(YEAR >= 2000 & COUNTRY_CODE %in% c("JOR", "SYR", "ISR", "EGY"))
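
Why the alternative gives the same result can be checked on a plain vector. The sketch below uses a made-up vector of country codes and compares the chain of == and | with %in%.

```r
# Hypothetical vector of country codes
codes  <- c("JOR", "NLD", "SYR", "EGY", "USA", "ISR")
wanted <- c("JOR", "SYR", "ISR", "EGY")

# One comparison per country, combined with OR
with_or <- codes == "JOR" | codes == "SYR" | codes == "ISR" | codes == "EGY"

# %in% tests each element for membership in the set
with_in <- codes %in% wanted

with_in                       # TRUE FALSE TRUE TRUE FALSE TRUE
identical(with_or, with_in)   # TRUE
```

Note that COUNTRY_CODE == c("JOR", "SYR", "ISR", "EGY") is not an equivalent shortcut: == compares element by element and recycles the shorter vector, so it does not test membership.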