Data Wrangling

Handout 05
Date: 2022-10-30
Topic: Data Wrangling

Literature
Handout
Ismay & Kim (2022) Chapter 3

Recap

Ismay & Kim, introduction to Chapter 3: recap of the five named graphs

The dplyr package comes with a large number of functions to wrangle (transform) data in preparation for further analysis.

select() (3.8.1); to select variables
the %>% (pipe) operator (3.1)
filter() (3.2); to filter rows based on one or more conditions
summarize() (3.3); to generate summary statistics (see below)
group_by() (3.4); to generate summary statistics per category
mutate() (3.5); to add new variables
arrange() (3.6); to sort the data set
join functions (3.7; this section covers only inner_join()); see script join_functions.R
- inner_join()
- left_join()
- right_join()
- full_join()
- semi_join()
- anti_join()
transmute(); adds new variables and drops existing ones
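
The verbs listed above can be sketched on a small made-up example. The data frames sales and managers, and all of their columns, are hypothetical and are not taken from the handout scripts or from Ismay & Kim; this is only a minimal illustration of how the functions chain together.

```r
library(dplyr)

# Hypothetical data, for illustration only
sales <- tibble(
  id     = 1:6,
  region = c("North", "North", "South", "South", "East", "East"),
  price  = c(100, 150, 200, 250, 300, 350),
  units  = c(1, 2, 3, 4, 5, 6)
)
managers <- tibble(
  region  = c("North", "South", "West"),
  manager = c("Ann", "Bob", "Cy")
)

# Single-table verbs chained with the pipe operator
by_region <- sales %>%
  select(region, price, units) %>%        # keep three variables
  filter(price >= 150) %>%                # keep rows with a price of at least 150
  mutate(revenue = price * units) %>%     # add a new variable
  group_by(region) %>%                    # one group per region
  summarize(n = n(), total_revenue = sum(revenue)) %>%
  arrange(desc(total_revenue))            # sort, largest total first

# transmute() keeps only the variables it names or creates
revenue_only <- sales %>% transmute(id, revenue = price * units)

# Join functions, all matching on the shared variable region
inner  <- inner_join(sales, managers, by = "region")  # rows with a match in both tables
left   <- left_join(sales, managers, by = "region")   # all rows of sales
right  <- right_join(sales, managers, by = "region")  # all rows of managers
full   <- full_join(sales, managers, by = "region")   # all rows of both tables
semi   <- semi_join(sales, managers, by = "region")   # rows of sales with a match, sales columns only
anti   <- anti_join(sales, managers, by = "region")   # rows of sales without a match
```

Comparing nrow() across the six join results is a quick way to see how unmatched regions ("East" in sales, "West" in managers) are handled by each join.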

The most common summary statistics are listed here, by type of variable.

Categorical Variable

Numerical Variable

Number of Observations
Measures of Center
- Median
- Average or Mean
- Trimmed Mean
Measures of Spread
- Range
- IQR (Interquartile Range)
- Variance / Standard Deviation; measures of spread around the Mean
- Coefficient of Variation (= SD/Mean)
Measure of Skewness
- Skewness Coefficient
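
As a rough sketch, the statistics above can be computed in a single summarize() call. The sample values and the column names are made up for illustration. Base R has no skewness function, so the moment-based coefficient is written out by hand here; this is one common definition and an assumption on our part, other definitions (and packages) exist.

```r
library(dplyr)

# Hypothetical sample with one large value, so the distribution is right-skewed
df <- tibble(x = c(2, 4, 4, 5, 7, 9, 11, 30))

stats <- df %>%
  summarize(
    n_obs       = n(),
    med         = median(x),
    avg         = mean(x),
    trimmed_avg = mean(x, trim = 0.125),  # drops one value at each end (12.5% of 8)
    range_width = max(x) - min(x),
    iqr         = IQR(x),
    variance    = var(x),
    std_dev     = sd(x),
    coef_var    = sd(x) / mean(x),
    # moment-based skewness: third central moment / (second central moment)^1.5
    skew        = mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5
  )

stats
```

In this sample the mean lies above both the median and the trimmed mean, and the skewness coefficient is positive, which is what a single large value tends to produce.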

Two Categorical Variables

Number of Observations
Two-Way Table or Contingency Table
- Absolute Frequencies
- Relative Frequencies as proportions of total number of observations
- Relative Frequencies as proportions of row totals
- Relative Frequencies as proportions of column totals
Measure of Correlation (Association) between the two variables
- Cramér’s V
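
A minimal base R sketch of the two-way tables and of Cramér's V follows. It uses the built-in mtcars data and treats cyl (number of cylinders) and am (transmission type) as categorical; that choice of data is ours, not the handout's. Cramér's V is computed by hand from the chi-squared statistic, assuming the usual formula sqrt(chi2 / (n * (min(rows, columns) - 1))).

```r
# Two-way table of cylinders by transmission type
tab <- table(cyl = mtcars$cyl, am = mtcars$am)

tab                                      # absolute frequencies
prop_total <- prop.table(tab)            # proportions of the total number of observations
prop_row   <- prop.table(tab, margin = 1)  # proportions of row totals
prop_col   <- prop.table(tab, margin = 2)  # proportions of column totals

# Chi-squared statistic without continuity correction.
# Some expected counts are small here, so R would warn that the
# approximation may be inaccurate; the warning is suppressed because
# only the statistic itself is needed.
chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)

n_obs     <- sum(tab)
cramers_v <- sqrt(chi2 / (n_obs * (min(dim(tab)) - 1)))
cramers_v
```

Cramér's V lies between 0 (no association) and 1 (perfect association).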

Two Numerical Variables

Number of Observations
Two-Way Table, only if both variables take a limited number of distinct values
Measure of Correlation (Association) between the two variables
- Pearson’s Correlation Coefficient (or in short: Correlation Coefficient)
- Spearman’s Rank Correlation Coefficient
- Kendall’s Rank Correlation Coefficient
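
The three coefficients above are all available through the method argument of cor() in base R. The sketch below uses weight (wt) and fuel consumption (mpg) from the built-in mtcars data; the choice of variables is only an example.

```r
x <- mtcars$wt   # weight
y <- mtcars$mpg  # miles per gallon

r_pearson  <- cor(x, y, method = "pearson")   # the default method
r_spearman <- cor(x, y, method = "spearman")  # based on ranks
r_kendall  <- cor(x, y, method = "kendall")   # based on concordant and discordant pairs

c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall)
```

All three coefficients are negative for this pair of variables: heavier cars tend to have lower mpg.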

One Categorical and one Numerical Variable

Number of Observations
Number of Observations per Category
Summary Statistics for the Numerical Variable per Category
Correlation between the two variables
- no one-size-fits-all measure
- perform ANOVA to analyse differences between group means
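
A possible sketch of these steps, again on the built-in mtcars data, with cyl as the categorical variable and mpg as the numerical one. The object names cars_df, per_group and fit are made up for this example.

```r
library(dplyr)

# Treat the number of cylinders as a categorical variable
cars_df <- mtcars %>% mutate(cyl = factor(cyl))

n_total <- nrow(cars_df)  # number of observations

# Number of observations and summary statistics per category
per_group <- cars_df %>%
  group_by(cyl) %>%
  summarize(
    n_obs   = n(),
    avg_mpg = mean(mpg),
    med_mpg = median(mpg),
    sd_mpg  = sd(mpg)
  )
per_group

# One-way ANOVA: do the group means of mpg differ between cylinder classes?
fit <- aov(mpg ~ cyl, data = cars_df)
summary(fit)
```

The ANOVA table reports an F statistic and a p-value for the null hypothesis that all group means are equal; it does not say which groups differ.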

EXERCISE
Work through script: ppd_london.R
ppd: price paid data; the script analyses property sold data from London.
Metadata can be found here.