Recap

In this course we work with structured datasets, which come in many different formats.
Popular formats include csv, xlsx (MS Excel) and json.
R can import a great variety of dataset formats; it’s a matter of finding the appropriate function.
In this course readr::read_csv() is used for csv files and openxlsx::read.xlsx() for MS Excel files.
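
As a minimal sketch of the import step, the code below writes a small made-up csv file to a temporary location and reads it back with readr::read_csv(). The file contents and the names tmp_file and import_demo are assumptions for illustration only; they are not part of the course data.

```r
library(readr)

# Write a small, made-up csv file to a temporary location (illustration only)
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("DATETIME,TEMP,PRECIP",
             "2021-01-01,8.5,0",
             "2021-01-02,9.1,2.3",
             "2021-01-03,7.8,0"),
           tmp_file)

# Import the csv file; read_csv() returns a tibble and guesses the column types
import_demo <- read_csv(tmp_file, show_col_types = FALSE)

# Inspect the imported data
print(import_demo)
str(import_demo)
```

An MS Excel file would be imported in the same way, with openxlsx::read.xlsx() and the path to the xlsx file.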

Graphing Data

Graphing data is at the heart of data analysis in all stages, in the exploratory stage as well as in the explanatory stage.
This course uses the ggplot2 package for graphing data; the workhorse is the ggplot() function. See handout03.
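
To give an impression of what a ggplot() call looks like, here is a minimal sketch. It uses mtcars, a data set that ships with R, as a stand-in; the choice of data set, variables and labels is an assumption for illustration and is not taken from handout03.

```r
library(ggplot2)

# mtcars is a data set that ships with R; it is used here only as an example
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # map columns to the x and y axes
  geom_point() +                              # draw one point per row
  labs(title = "Fuel efficiency versus weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon")

# Printing the plot object draws the graph
print(p)
```

The pattern is always the same: ggplot() receives the data and the aesthetic mapping, and layers such as geom_point() are added with +.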

Work through the remaining part of handout03 and the following scripts:

  • ggplot_barplot.R
  • ggplot_histogram.R
  • ggplot_boxplot.R
  • ggplot_barplot2.R
  • ggplot_scatterplot.R
  • ggplot_timeseries.R

R Objects

R distinguishes different kinds of objects. The most important ones are:

  1. Vectors
  2. Data Frames / Tibbles
  3. Functions
  4. Lists; an object whose components can be of different kinds
  5. Matrices; a two-dimensional object in which all elements have the same data type
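
The object kinds listed above can be sketched in a few lines of base R. All names and values below (temps, cities, to_fahrenheit, course, m) are made up for illustration.

```r
# 1. Vector: a sequence of elements of one data type
temps <- c(21.5, 23.0, 19.8)

# 2. Data frame: a table in which every column is a vector
cities <- data.frame(CITY = c("Amman", "Irbid", "Aqaba"),
                     TEMP = temps)

# 3. Function: an object that takes input and returns a result
to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}

# 4. List: the components may be of different kinds
course <- list(name = "demo", temps = temps, data = cities)

# 5. Matrix: two dimensions, all elements of one data type
m <- matrix(1:6, nrow = 2, ncol = 3)

# class() reports what kind of object a name refers to
class(temps)          # "numeric"
class(cities)         # "data.frame"
class(to_fahrenheit)  # "function"
class(course)         # "list"
class(m)              # "matrix" "array" in R 4.0 and later

# Apply the function to the whole vector at once
to_fahrenheit(temps)
```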

Data Types

R distinguishes different data types, such as:

  1. Numeric
  2. Integer
  3. Character
  4. Date; standard format: yyyy-mm-dd
  5. Logical (TRUE / FALSE)

All elements of a vector have the same data type. Every column in a data frame is a vector, so every element in a column has the same data type. Notice that in MS Excel the elements in one column can be of different data types, but in R, as in other statistical programs, this is not possible. Every column is a variable and a variable has one data type.
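
A short sketch of the data types and of the one-type-per-vector rule; the variable names and values are made up for illustration.

```r
x_num  <- 3.14                    # numeric
x_int  <- 42L                     # integer (note the L suffix)
x_chr  <- "Amman"                 # character
x_date <- as.Date("2021-06-15")   # date in the standard yyyy-mm-dd format
x_lgl  <- TRUE                    # logical

class(x_num)   # "numeric"
class(x_int)   # "integer"
class(x_chr)   # "character"
class(x_date)  # "Date"
class(x_lgl)   # "logical"

# Mixing types in one vector: R converts all elements to a single type
mixed <- c(1, "two", TRUE)
class(mixed)   # "character"
mixed
```

Because of this automatic conversion, a single stray text value in a numeric column turns the whole column into character data, which is worth checking after every import.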

Factor

A factor is a vector object used to specify a discrete classification (categorization, grouping). Factors can be ordered or unordered.
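
A minimal sketch of an unordered and an ordered factor; the category names are made up for illustration.

```r
# Unordered factor: categories without a natural order
season <- factor(c("winter", "summer", "summer", "spring"))
levels(season)   # levels are sorted alphabetically by default
table(season)    # counts per category

# Ordered factor: the order of the levels is given explicitly
heat <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)
heat
heat < "high"    # comparisons are meaningful for ordered factors
```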

Subsetting Data Frames

There are different ways to subset a data frame. For instance:

  1. df$VAR1; select one column, the variable VAR1
  2. df[3,5]; select one cell, the cell in row 3 and column 5
  3. df[1:100, 2:4]; select the first 100 rows and columns 2, 3 and 4
  4. df[5,]; select row 5
  5. df[,10]; select column 10
  6. df[c(1,3,5), c(2,3)]; select the cells in the 1st, 3rd and 5th rows in the 2nd and 3rd columns
  7. df[,c("VAR1", "VAR2")]; select a data frame with two columns, the variables VAR1 and VAR2
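
The subsetting forms above can be tried out on a small data frame. The data frame below is made up for illustration, so the row and column numbers are smaller than in the list.

```r
# Small made-up data frame, for illustration only
df <- data.frame(ID   = 1:6,
                 VAR1 = c(10, 20, 30, 40, 50, 60),
                 VAR2 = c("a", "b", "c", "d", "e", "f"),
                 VAR3 = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))

df$VAR1                   # one column, returned as a vector
df[3, 2]                  # one cell: row 3, column 2
df[1:4, 2:3]              # rows 1 to 4, columns 2 and 3
df[5, ]                   # row 5, returned as a data frame
df[, 4]                   # column 4, returned as a vector
df[c(1, 3, 5), c(2, 3)]   # rows 1, 3 and 5 in columns 2 and 3
df[, c("VAR1", "VAR2")]   # columns selected by name
```

Note that with a tibble instead of a base data frame, df[, 4] returns a one-column tibble rather than a vector.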

EXERCISE
Open a new R script file.
Import the Amman Weather data set (https://iom.zwannen.nl/datafiles/amman_weather.csv).
Write code that:

  1. selects the TEMP, TEMPMIN and TEMPMAX variables
  2. selects the first 500 rows in the data set
  3. selects the last 3650 rows and the variables DATETIME, TEMP, HUMIDITY, WINDSPEED and PRECIP.

Functions dplyr::select() and dplyr::filter()

It’s also possible to use logical expressions to select columns and filter rows. Another, preferable option is to use functions from the dplyr package to select columns and filter rows.
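
Subsetting with logical expressions in base R can be sketched as follows. The data frame demo_days and its values are made up for illustration; only the column names mimic the weather data.

```r
# Small made-up data frame, for illustration only
demo_days <- data.frame(DAY     = 1:5,
                        TEMPMAX = c(35, 41, 38, 42, 29),
                        PRECIP  = c(0, 0, 1.2, 0, 4.5))

# A logical expression yields one TRUE/FALSE value per row
is_hot <- demo_days$TEMPMAX > 40
is_hot

# Used in the row position, it keeps only the rows where it is TRUE
hot_days <- demo_days[is_hot, ]
hot_days

# Conditions can be combined with & (and) and | (or)
warm_dry <- demo_days[demo_days$TEMPMAX > 30 & demo_days$PRECIP == 0, ]
warm_dry

# A logical vector in the column position selects columns
demo_days[, c(TRUE, TRUE, FALSE)]
```

The dplyr functions express the same selections without repeating the name of the data frame inside the brackets, which is one reason they tend to be easier to read.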

Example Code (1)

library(tidyverse)
library(lubridate)
amman_weather <- read_csv("https://iom.zwannen.nl/datafiles/amman_weather.csv")

#select DATETIME, TEMP, TEMPMAX and TEMPMIN variables
amman_temp <- amman_weather %>% 
  select(DATETIME, TEMP, TEMPMAX, TEMPMIN)

#filter the days on which TEMPMAX was more than 40 degrees Celsius
amman_hot <- amman_weather %>% 
  filter(TEMPMAX > 40)

#filter the days with no precipitation
amman_dry <- amman_weather %>% 
  filter(PRECIP == 0)

#filter the days with precipitation
amman_wet <- amman_weather %>% 
  filter(PRECIP != 0)

#filter the day(s) with the highest TEMPMAX
amman_hotst <- amman_weather %>% 
  filter(TEMPMAX == max(TEMPMAX))

Example Code (2)

tbc <- read_csv("https://iom.zwannen.nl/datafiles/tbc_data.csv")

#change variable names
names(tbc) <- c("YEAR", "YEAR_CODE", "COUNTRY_NAME", "COUNTRY_CODE",
               "TBC_INC", "TBC_DEATH_RATE")

#filter data from Jordan for the years 2010-2020
tbc_jordan <- tbc %>% 
  filter(COUNTRY_CODE == "JOR" & YEAR >= 2010 & YEAR <= 2020)

#filter data from Jordan, Syria, Israel and Egypt from 2000 onwards
tbc_me <- tbc %>% 
  filter(YEAR >= 2000 & (COUNTRY_CODE == "JOR" | COUNTRY_CODE == "SYR" |
           COUNTRY_CODE == "ISR" | COUNTRY_CODE == "EGY"))

#alternative
tbc_me_alt <- tbc %>% 
  filter(YEAR >= 2000 & COUNTRY_CODE %in% c("JOR", "SYR", "ISR", "EGY"))