Handout 05
Date: 2022-10-30
Topic: Data Wrangling

Literature
Handout
Ismay & Kim (2022) Chapter 3

Recap

Ismay & Kim Introduction Chapter 3: five named graphs

Wrangling functions from dplyr

The dplyr package comes with a great number of functions to wrangle (transform) data in preparation for further analysis.

  • select() (3.8.1); to select variables
  • the %>% (pipe) operator (3.1)
  • filter() (3.2); to filter rows based on one or more conditions
  • summarize() (3.3); to generate summary statistics (see below)
  • group_by() (3.4); to generate summary statistics per category
  • mutate() (3.5); to add new variables
  • arrange() (3.6); to sort the data set
  • join functions (3.7; this section covers only inner_join()); see script join_functions.R
    • inner_join()
    • left_join()
    • right_join()
    • full_join()
    • semi_join()
    • anti_join()
  • transmute(); adds new variables and drops existing ones
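
As a rough illustration of how these verbs chain together with the pipe, here is a minimal sketch. It assumes the built-in mtcars data set as a stand-in for the course data; the derived variable hp_per_wt and the small tables owners and homes are made up for this example.

```r
library(dplyr)

# mtcars ships with base R; it stands in for the course data here (assumption).
cars_summary <- mtcars %>%
  select(mpg, cyl, hp, wt) %>%         # keep four variables
  filter(hp > 100) %>%                 # keep rows with more than 100 hp
  mutate(hp_per_wt = hp / wt) %>%      # new variable: hp per 1000 lbs of weight
  group_by(cyl) %>%                    # one group per number of cylinders
  summarize(n_cars = n(),
            mean_mpg = mean(mpg),
            mean_hp_per_wt = mean(hp_per_wt)) %>%
  arrange(desc(mean_mpg))              # highest average mpg first

cars_summary

# Two small made-up tables to show the join functions (hypothetical data).
owners <- tibble(id = c(1, 2, 3), owner = c("Ann", "Bob", "Cem"))
homes  <- tibble(id = c(2, 3, 4), town = c("Leeds", "York", "Bath"))

joined_inner <- inner_join(owners, homes, by = "id")  # ids found in both tables
joined_left  <- left_join(owners, homes, by = "id")   # all owners; town is NA if no match
joined_anti  <- anti_join(owners, homes, by = "id")   # owners without a matching home
```

Each verb takes a data frame and returns a data frame, which is what makes the step-by-step chaining with %>% possible.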

Summary Statistics

The most common summary statistics are listed below, by type of variable.

Categorical Variable

  • Number of Observations
  • Frequency Table with absolute or relative frequencies per category
  • Modal Class; the class with the highest frequency
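
The statistics for a single categorical variable could be computed along these lines. This is a sketch only: it assumes the built-in mtcars data and treats cyl (number of cylinders) as the categorical variable.

```r
library(dplyr)

# Frequency table for cyl, treated here as a categorical variable (assumption).
freq <- mtcars %>%
  count(cyl, name = "abs_freq") %>%             # absolute frequency per category
  mutate(rel_freq = abs_freq / sum(abs_freq))   # relative frequency per category

n_obs <- sum(freq$abs_freq)                     # number of observations

# Modal class: the category with the highest frequency.
modal_class <- freq %>% slice_max(abs_freq, n = 1)

freq
modal_class
```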

Numerical Variable

  • Number of Observations
  • Measures of Center
    • Median
    • Average or Mean
    • Trimmed Mean
  • Measures of Spread
    • Range
    • IQR (Inter Quartile Range)
    • Variance / Standard Deviation; measures of spread around the Mean
    • Coefficient of Variation (= SD/Mean)
  • Measure of Skewness
    • Skewness Coefficient
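
A sketch of these statistics for one numerical variable, assuming mpg from the built-in mtcars data. The 10% trimming fraction is an arbitrary choice for illustration, and the skewness line uses the simple moment-based coefficient; add-on packages may use a slightly different definition.

```r
library(dplyr)

num_summary <- mtcars %>%
  summarize(
    n_obs        = n(),
    median_mpg   = median(mpg),
    mean_mpg     = mean(mpg),
    trimmed_mean = mean(mpg, trim = 0.1),   # drops 10% of values at each end
    range_mpg    = max(mpg) - min(mpg),
    iqr_mpg      = IQR(mpg),
    var_mpg      = var(mpg),
    sd_mpg       = sd(mpg),
    cv_mpg       = sd(mpg) / mean(mpg),     # coefficient of variation
    # moment-based skewness: third central moment / (second central moment)^1.5
    skew_mpg     = mean((mpg - mean(mpg))^3) / mean((mpg - mean(mpg))^2)^1.5
  )

num_summary
```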

Two Categorical Variables

  • Number of Observations
  • Two-Way Table or Contingency Table
    • Absolute Frequencies
    • Relative Frequencies as proportions of total number of observations
    • Relative Frequencies as proportions of row totals
    • Relative Frequencies as proportions of column totals
  • Measure of Correlation (Association) between the two variables
    • Cramer’s V
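
The two-way tables and Cramer's V could be obtained roughly as follows. The sketch assumes the built-in mtcars data with cyl and gear treated as categorical variables, and computes Cramer's V by hand from the chi-squared statistic rather than relying on an add-on package.

```r
# Two-way table of absolute frequencies (cyl and gear treated as categorical).
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)
n_obs <- sum(tab)                            # number of observations

prop_total <- prop.table(tab)                # proportions of the grand total
prop_row   <- prop.table(tab, margin = 1)    # proportions of row totals
prop_col   <- prop.table(tab, margin = 2)    # proportions of column totals

# Cramer's V = sqrt(chi-squared / (n * (min(rows, columns) - 1))).
# Warnings are suppressed because some expected counts are small in this tiny data set.
chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
cramers_v <- sqrt(as.numeric(chi2) / (n_obs * (min(dim(tab)) - 1)))

tab
cramers_v
```

Cramer's V lies between 0 (no association) and 1 (perfect association).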

Two Numerical Variables

  • Number of Observations
  • Two-Way Table, only if both variables take a limited number of distinct values
  • Measures of Correlation (Association) between the two variables
    • Pearson’s Correlation Coefficient (or in short: Correlation Coefficient)
    • Spearman’s Rank Correlation Coefficient
    • Kendall’s Rank Correlation Coefficient
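
All three coefficients are available through the method argument of the base R function cor(). A minimal sketch, assuming mpg and wt from the built-in mtcars data as the two numerical variables:

```r
# Correlation between fuel economy (mpg) and weight (wt) in mtcars.
pearson  <- cor(mtcars$mpg, mtcars$wt, method = "pearson")   # linear association
spearman <- cor(mtcars$mpg, mtcars$wt, method = "spearman")  # based on ranks
kendall  <- cor(mtcars$mpg, mtcars$wt, method = "kendall")   # based on concordant pairs

c(pearson = pearson, spearman = spearman, kendall = kendall)
```

All three coefficients range from -1 to 1; the two rank-based versions are less sensitive to outliers and to non-linear but monotone relationships.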

One Categorical and one Numerical Variable

  • Number of Observations
  • Number of Observations per Category
  • Summary Statistics for the Numerical Variable per Category
  • Correlation between the two variables
    • no one-size-fits-all measure
    • perform ANOVA to analyse differences between group means
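
Per-category summaries and a one-way ANOVA might look like this. The sketch assumes the built-in mtcars data, with cyl as the categorical variable and mpg as the numerical one.

```r
library(dplyr)

# Summary statistics for mpg per category of cyl.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(n_cars = n(),
            mean_mpg = mean(mpg),
            median_mpg = median(mpg),
            sd_mpg = sd(mpg))

by_cyl

# One-way ANOVA: do the group means of mpg differ between cyl categories?
fit <- aov(mpg ~ factor(cyl), data = mtcars)
anova_table <- summary(fit)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]

anova_table
```

A small p-value suggests that at least one group mean differs from the others; it does not say which one.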

EXERCISE
Work through script: ppd_london.R
ppd: price paid data; the script analyses data on properties sold in London.
Metadata can be found here.