Topic: regression analysis
EXERCISE
Download the file with Dutch reimbursed healthcare costs per
municipality in 2018: vektis2018_extended.xlsx. This file has two
worksheets, one with metadata and one with the data. To import the data
in R, use: openxlsx::read.xlsx("datafiles/vektis2018_extended.xlsx", sheet =
"data").
Draw a random sample of 100 observations from this data set, for
example with the base R sample() function or with sample_n() from the
dplyr package (dplyr is part of the tidyverse).
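
The import and sampling step could look roughly like the sketch below. It assumes the workbook sits in a datafiles/ folder relative to the working directory, as in the read.xlsx() call above. The helper name draw_sample, the object names vektis and vektis_sample, and the seed value are illustrative choices, not part of the exercise.

```r
# Hypothetical helper: draw n rows at random, without replacement.
draw_sample <- function(data, n = 100, seed = 2018) {
  set.seed(seed)  # arbitrary fixed seed, only to make the sample reproducible
  data[sample(nrow(data), n), , drop = FALSE]
}

# Path taken from the exercise text; adjust it to where the file was saved.
path <- "datafiles/vektis2018_extended.xlsx"

if (file.exists(path)) {
  vektis <- openxlsx::read.xlsx(path, sheet = "data")
  vektis_sample <- draw_sample(vektis)
  # A dplyr alternative would be: dplyr::sample_n(vektis, 100)
  str(vektis_sample)  # quick check of column names and types
}
```

Fixing the seed is optional; without it every run draws a different sample, so the correlations and model estimates below will vary from run to run.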
- Create a correlation matrix with the numerical variables in the data
set. Which variable has the highest correlation with
COSTS_PER_INSURED_YEAR?
- Comment on what can be seen in the correlation matrix.
- MODEL1: Generate a linear regression model with
COSTS_PER_INSURED_YEAR as response variable (Y-variable) and AGE_AVERAGE
as explanatory variable (X-variable).
- MODEL2: Generate a linear regression model with
COSTS_PER_INSURED_YEAR as response variable (Y-variable) and AGE_MEDIAN
as explanatory variable (X-variable).
- Compare MODEL1 with MODEL2: which model is more useful for
explaining the variation in COSTS_PER_INSURED_YEAR across the
municipalities?
- MODEL3: Generate a multiple linear regression model with
COSTS_PER_INSURED_YEAR as response variable (Y-variable) and several
explanatory variables; use the correlation matrix to select the
features (explanatory variables) for the model.
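
One possible way to approach the steps above is sketched below. It assumes the 100 sampled rows are stored in a data frame called vektis_sample (a name chosen here for illustration). The helper functions rank_correlations, fit_model and compare_models are hypothetical, and taking the three most strongly correlated variables for MODEL3 is only a starting point, not a prescribed selection rule.

```r
# Correlation of every numeric column with the response, strongest first.
rank_correlations <- function(data, response) {
  numeric_data <- data[sapply(data, is.numeric)]
  cors <- cor(numeric_data, use = "pairwise.complete.obs")[, response]
  cors <- cors[names(cors) != response]       # drop the response itself
  cors[order(abs(cors), decreasing = TRUE)]   # sort by absolute correlation
}

# Fit a linear model of the response on one or more predictors.
fit_model <- function(data, response, predictors) {
  lm(reformulate(predictors, response = response), data = data)
}

# Put R-squared, adjusted R-squared and residual standard error side by side.
compare_models <- function(...) {
  models <- list(...)
  data.frame(
    model         = names(models),
    r_squared     = sapply(models, function(m) summary(m)$r.squared),
    adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
    sigma         = sapply(models, function(m) summary(m)$sigma),
    row.names     = NULL
  )
}

# Usage, assuming `vektis_sample` holds the 100 sampled rows.
if (exists("vektis_sample")) {
  response <- "COSTS_PER_INSURED_YEAR"
  ranked <- rank_correlations(vektis_sample, response)
  print(head(ranked))

  model1 <- fit_model(vektis_sample, response, "AGE_AVERAGE")
  model2 <- fit_model(vektis_sample, response, "AGE_MEDIAN")
  model3 <- fit_model(vektis_sample, response, names(ranked)[1:3])

  print(compare_models(MODEL1 = model1, MODEL2 = model2, MODEL3 = model3))
  print(summary(model3))
}
```

Because MODEL1 and MODEL2 each have one explanatory variable, their R-squared values can be compared directly. For MODEL3, adjusted R-squared is the fairer measure, and explanatory variables that correlate strongly with each other (AGE_AVERAGE and AGE_MEDIAN are likely candidates) add little when included together, so it is worth checking the full correlation matrix before settling on the selection.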