Linear and Bayesian Regression Models with tidymodels package

As a data scientist, you need to distinguish between regression predictive models and classification predictive models. Clear understanding of these models helps to choose the best one for a specific use case. In a nutshell, regression predictive models andclassification predictive models` fall under supervised machine learning. The main difference between them is that the output variable—in regression is numerical (or continuous) while that for classification is categorical (or discrete).

A big part of machine learning is classification — we want to know what class or group an observation belongs to. Therefore, the ability to precisely classify observations to their groups is valuable for various business applications like predicting the future based on historical data.

For example, when provided with a dataset about environmental variables, and you are asked to predict productivity, that is a regression task because productivity measured in term of chlorophyll concentration will be a continuous output.

In this post we will focus on regression. We will learn the steps of modelling using tidymodels (Kuhn and Wickham 2020 b). We first explore the data and check if it fit for modelling, we then split the dataset into a training and testing sets. Next, we will create a recipe object and define our model. Lastly, we will train a specified model and evaluate its performance.

I will use the same dataset for three different model’s algorithms. Example of the common regression algorithms include random forest, linear regression, support vector regression (SVR) and bayes. Some algorithms, such as logistic regression, have the name regression in their functions but they are not regression algorithms.

To follow use code in this article, you will need tidymodels (Kuhn and Wickham 2020 b) and tidyverse packages (Wickham 2017) installed in your machine. You can install them from CRAN. The chunk below highlight lines of code to install the packages if they are yet in your PC.

model.packages = c("tidymodels", "tidyverse")

install.packages(model.packages)

Once installed, you can load the packages in your session. We often we use several packages to accomplish a task. Even in this post, though seems only two packages are loaded, these are ecosystems which comes with dozens of packages bundled with them.

require(tidyverse)
require(tidymodels)

The CRTR data

We use the data collected with the Institute of Marine Sciences of the University of Dar es Salaam to illustrate the concept. The data was collected through the Coral Reef Targeted Research and Capacity Building for Management (CRTR) project between 2008 and 2009. The dataset contains;

Chemical variables —nitrate, phosphate, salinity, pH, dissolved oxygen and ammonia
Physical variables — temperature
Biological variables— chlorophyll-a

Because the variables are organized in sheets of Excel spreadsheet, i used a read_excel function from readxl package (Wickham and Bryan 2018) to read the file from the sheet. And because there are several sheet, the processed was iterated with a for loop. Data from each sheet was allocated in the list file. The chunk below highlight the code used to read files in sheets.

variables = c("salinity", "temperature", "do", "ph", "chl", "ammonia", "phosphate", "nitrate")

crtr.list = list()

for (i in 1:length(variables)){
  
  crtr.list[[i]] = readxl::read_excel("crtr.xlsx", sheet = i) %>% 
    mutate(variable = variables[i]) 

}

The data was untidy and unsuitable for visualization and modelling in R. Therefore, the first thing I had to deal with the data was to tidy the variables in the dataset to a right format that tidymodels and tidyverse recognizes. First the dataset was unlisted with bind_rows function and the data was pivoted to long format and then back to wide format with only the variable of interested selected.

## organize in long form
crtr.long = crtr.list %>% 
  bind_rows() %>%
  pivot_longer(cols =2:5, names_to = "sites", values_to = "data" ) 

## organize in the wide form
crtr.wide = crtr.long %>%
  pivot_wider(names_from = variable, values_from = data) %>%
  mutate(month = lubridate::month(Date, label = TRUE, 
                                  abb = TRUE) %>% as.character()) %>%
  mutate_if(is.character, as.factor) %>%
  mutate_if(is.numeric, round, digits = 2)  %>%
  select(date = Date, month,sites, chl, everything())

Let’s us glimpse the dataset

crtr.wide %>% 
  glimpse()

Rows: 52
Columns: 11
$ date        <dttm> 2008-03-01, 2008-03-01, 2008-03-01, 2008-03-01, 2008-0...
$ month       <fct> Mar, Mar, Mar, Mar, Apr, Apr, Apr, Apr, May, May, May, ...
$ sites       <fct> Pongwe, Mnemba, Chumbe, Bawe, Pongwe, Mnemba, Chumbe, B...
$ chl         <dbl> 0.09, 0.26, 0.56, 0.43, 0.47, 1.01, 0.63, 1.39, 0.37, 0...
$ salinity    <dbl> 35.0, 34.0, 32.0, 32.0, 35.0, 35.0, 34.0, 34.0, 36.0, 3...
$ temperature <dbl> 28.8, 28.4, 28.0, 28.0, 28.2, 27.7, 28.1, 26.7, 27.0, 2...
$ do          <dbl> 6.11, 5.95, 6.16, NA, 6.35, 6.14, 7.01, 6.31, 6.15, 6.1...
$ ph          <dbl> 7.86, 7.88, 7.73, NA, 7.87, 7.88, 7.86, 7.91, 7.68, 7.8...
$ ammonia     <dbl> 0.55, 0.80, 0.74, 0.94, 0.56, 0.72, 0.53, 0.97, 0.56, 0...
$ phosphate   <dbl> 0.28, 0.28, 1.31, 1.90, 0.28, 0.32, 1.16, 0.84, 0.28, 0...
$ nitrate     <dbl> 0.04, 0.07, 3.26, 3.34, 0.03, 0.47, 1.45, 0.84, 0.06, 0...

As a first step in modeling, it’s always a good idea to explore the variables in the dataset. Figure 1 is a pairplot that compare each pair of variables as scatterplots in the lower diagonal, densities on the diagonal and correlations written in the upper diagonal (Schloerke et al. 2018). Figure 2 show the correlation between chlorophyll-a (outcome) with other six predictor variables using a linear and quadratic equations is unfit for these dataset.

crtr.wide %>%
  select(-salinity)%>%
  mutate(season = lubridate::month(date) %>% as.integer(),
         season = replace(season,season %in% c(1:4,11:12), "NE"),
         season = replace(season,season %in% c(5:10), "SE"))%>%
  GGally::ggscatmat(columns = 4:10,color="season", alpha=1, corMethod = "spearman")+
  ggsci::scale_color_jco()+
  ggpubr::theme_pubclean()+
  theme(strip.background = element_blank(), 
        legend.position = "right",
        legend.key = element_blank())

Figure 1: Pair plot of numerical variables

require(wesanderson)

wesa = wes_palettes %>% names()

crtr.wide %>%
  select(-salinity)%>%
  filter(nitrate < 1 & phosphate < 1.2 & chl < 1) %>% 
  pivot_longer(cols = 5:10, names_to = "predictor", values_to = "data") %>%
  # filter(sites == "Bawe")%>%
  ggplot(aes(x = data, y = chl))+
  scale_y_continuous(trans = scales::sqrt_trans(), labels = function(x) round(x,2))+
  # scale_x_continuous(trans = scales::sqrt_trans(), labels = function(x) round(x,2))+
  geom_jitter()+
  geom_smooth(se = FALSE, method = "lm", formula = "y ~ poly(x,2)", aes(color = "Quadratic"))+
  geom_smooth(se = FALSE, method = "lm", formula = "y ~ x", aes(color = "Linear"))+
   ggsci::scale_color_jco()+
  facet_wrap(~predictor, scales = "free_x")+
  ggpubr::theme_pubclean()+
  theme(strip.background.x = element_blank(), legend.key = element_blank(), 
        legend.position = "right", panel.background = element_rect(colour = "black"))

Figure 2: Correalation of

Resample

In machine learning, one risk is the machine learns too well our sample data and is then less accurate during a real-world testing. This phenomenon is called overtraining or overfitting. We overcome this problem by splitting the dataset into a training and testing sets. The training set is used to train the model while the test set is reserved to later estimate how well the model might work on new or wild data.

The splitting is based on ratios and the widely used ratios include 80/20, 75/25, or 7/30, with the training data receiving a bigger proportion of the dataset and the test set get the remaining small portion.

For our sample that has only 52 observations, it make sense to use 70/30 split ratio. we use the fraction set.seed(4595) from base R to fix the random number generator from rsample package (Kuhn, Chow, and Wickham 2020). This prevent generating new data in each execution.

the function initial_split from the rsample package is designed to split the dataset into a training and testing sets. We purse the data to be split and the proportion that serve as a cutpoint of the two sets.

set.seed(4595)

crtr.split = crtr.clean %>%
  rsample::initial_split(prop = 0.7)

crtr.split

<Training/Validation/Total>
<29/12/41>

Given the 41 total observations, we reserve 12 observations as a test set and kept 70% of the dataset (29 observation) as train set. From the crtr.split object, we pull both the train set with the training function and the test set with a testing function.

## pull train set
crtr.train = crtr.split %>% 
  training()

## pull test set
crtr.test = crtr.split %>% 
  testing()

We can have a glimpse of the train dataset using a glimpse function from dplyr package (Wickham et al. 2018).

crtr.train %>% glimpse()

Rows: 29
Columns: 7
$ chl         <dbl> 0.26, 0.38, 0.43, 0.30, 0.64, 0.35, 0.22, 0.36, 0.14, 0...
$ temperature <dbl> 28.4, 26.5, 26.5, 26.4, 25.7, 25.4, 26.2, 26.4, 25.5, 2...
$ do          <dbl> 5.95, 5.94, 5.90, 6.10, 6.21, 6.01, 6.44, 6.26, 5.64, 6...
$ ph          <dbl> 7.88, 7.77, 7.82, 7.85, 7.84, 7.79, 7.76, 7.85, 7.83, 7...
$ ammonia     <dbl> 0.80, 0.60, 0.64, 0.90, 0.72, 0.67, 0.65, 0.46, 0.63, 0...
$ phosphate   <dbl> 0.28, 0.34, 0.42, 0.68, 0.32, 0.52, 0.86, 0.26, 0.42, 0...
$ nitrate     <dbl> 0.07, 0.06, 0.38, 0.68, 0.08, 0.42, 0.78, 0.08, 0.37, 0...

The printed output shows that we have seven variables and all are numeric

recipe

The recipes package (Kuhn and Wickham 2020 a) define a recipe object that we will use for modeling and to conduct preprocessing of variables. The four main functions are recipe(), prep(), juice() and bake(). recipe() defines the operations on the data and the associated roles. Once the preprocessing steps are defined, any parameters are estimated using prep().

Recipes can be created manually by sequentially adding roles to variables in a data set. First, we will create a recipe object from the train set data and then specify the processing steps and transform the data with step_*. once the recipe is ready we prep it. For example, to create a simple recipe containing only an outcome and predictors and have the predictors normalized and missing values in predictors imputed:

crtr.recipe = crtr.train %>%
  recipe(chl ~ .) %>%
  step_normalize(all_numeric()) %>%
  step_corr(all_numeric())%>%
  step_knnimpute(all_numeric()) %>%
  prep()

crtr.recipe

Data Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6

Training data contained 29 data points and no missing data.

Operations:

Centering and scaling for temperature, do, ph, ammonia, ... [trained]
Correlation filter removed no terms [trained]
K-nearest neighbor imputation for do, ph, ammonia, phosphate, ... [trained]

Once the data are ready for transformation, the juices() extract transformed training set while the bake() function create a new testing set.

crtr.training = crtr.recipe %>%
  juice()

crtr.testing = crtr.recipe %>%
  bake(crtr.test)

Build Models

Random Forest

We begin by fitting a random forest model.

Make random forest model

We specify the model using the parsnip package (Kuhn and Vaughan 2020 a). This package provides a tidy, unified interface to models for a range of models without getting bogged down in the syntactical minutiae of the underlying packages. The parsnip package help us to specify ;

the type of model e.g random forest,
the mode of the model whether is regression or classification
the computational engine is the name of the R package.

Based on the information above, we can use parsnip package to build our model as;

rf.model = rand_forest() %>%
  set_engine(engine = "ranger") %>%
  set_mode(mode = "regression")

rf.model

Random Forest Model Specification (regression)

Computational engine: ranger

Train random forest model

Once we have specified the model type, engine and mode, the model can be trained with the fit function. We simply parse into the fit the specified model and the transformed training set extracted from the prepped recipe.

rf.fit = rf.model %>%
  fit(chl ~ ., data = crtr.training)

predict with random forest

To get our predicted results, we use the predict() function to find the estimated chlorophyll-a. First, let’s generate the estimated chlorophyll-a values by simply parse the random forest model rf.model we specified and the transformed testing set we created from a prepped recipe. We also stitch the predicted values to the transformed train set with bind_cols function;

rf.predict = rf.fit %>%
  predict(crtr.testing) %>%
  bind_cols(crtr.testing) 

rf.predict

# A tibble: 12 x 8
     .pred temperature      do      ph ammonia phosphate nitrate     chl
     <dbl>       <dbl>   <dbl>   <dbl>   <dbl>     <dbl>   <dbl>   <dbl>
 1  0.0278       0.750  0.884  -0.947    0.258   -0.560   -0.625 -1.50  
 2 -0.0202       0.370  1.44   -0.870    0.291   -0.560   -0.677  0.504 
 3  0.0453      -0.392  0.977  -2.33     0.291   -0.560   -0.521 -0.0237
 4 -0.173       -0.265  0.861  -1.41     0.588    0.0415   1.66  -0.235 
 5 -0.0443      -0.201  1.07   -1.02     1.35     2.01     2.49   0.346 
 6 -0.274       -1.15   0.0720  0.284    0.291   -0.961   -0.573  0.399 
 7 -0.140       -0.265 -0.624   0.515    0.158   -1.12    -0.625 -0.499 
 8  0.0790       0.814 -0.392  -0.0239   2.80    -0.480   -0.625 -0.763 
 9 -0.193        0.814 -1.39   -0.485   -1.10    -0.720   -0.573  0.0819
10 -0.231        1.00  -1.41   -0.101    0.621    0.482    0.206  1.35  
11 -0.417        0.370 -0.856   0.438   -1.10    -0.640   -0.573  1.61  
12  1.30         1.38  -0.949   2.44     0.456    0.362   -0.677 -0.657

When making predictions, the tidymodels convention is to always produce a tibble of results with standardized column names. This makes it easy to combine the original data and the predictions in a usable format:

Evaluate the rf model

So far, we have built a model and preprocessed data with a recipe. We predicted new data as a way to bundle a parsnip model and recipe together. The next step is to assess or evaluate the accuracy of the model. We use a metrics function from yardstick package (Kuhn and Vaughan 2020 b)to assess the accuracy of the model by comparing the predicted versus the original outcome variable. Note that we use the predicted dataset we just computed using predict function.

rf.predict %>%
  metrics(truth = chl, estimate = .pred)

# A tibble: 3 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1.10 
2 rsq     standard       0.221
3 mae     standard       0.857

Linear regression approach

Make linear model

The good of tidymodels is that when we change the model, we do not need to start over again from the beginning but rather change the engine. For instance, we replace the ranger engine with lm to specify the linear model using the parsnip package (Kuhn and Vaughan 2020 a) .

lm.model = linear_reg() %>%
  set_engine(engine = "lm") %>%
  set_mode(mode = "regression")

Train Linear model

Once we have specified the model type, engine and mode, the model can be trained with the fit function;

lm.fit = lm.model %>%
  fit(chl ~ ., data = crtr.training)

Predict with linear model

Once the model is fitted, This fitted object lm_fit has the lm model output built-in, which you can access with lm_fit$fit, but there are some benefits to using the fitted parsnip model object when it comes to predicting. To predict the values we use predict function and parse the model and standardized testing values we computed from the recipe (R Core Team 2018). Note that here we use the transformed test set and not the original from the split object. In this case we use the model to predict the value and stitch the testing values using the bind_cols function;

lm.predict = lm.fit %>%
  predict(crtr.testing) %>%
  bind_cols(crtr.testing) 

lm.predict

# A tibble: 12 x 8
     .pred temperature      do      ph ammonia phosphate nitrate     chl
     <dbl>       <dbl>   <dbl>   <dbl>   <dbl>     <dbl>   <dbl>   <dbl>
 1 -0.236        0.750  0.884  -0.947    0.258   -0.560   -0.625 -1.50  
 2 -0.0649       0.370  1.44   -0.870    0.291   -0.560   -0.677  0.504 
 3 -0.728       -0.392  0.977  -2.33     0.291   -0.560   -0.521 -0.0237
 4 -0.764       -0.265  0.861  -1.41     0.588    0.0415   1.66  -0.235 
 5  0.0740      -0.201  1.07   -1.02     1.35     2.01     2.49   0.346 
 6  0.0151      -1.15   0.0720  0.284    0.291   -0.961   -0.573  0.399 
 7 -0.135       -0.265 -0.624   0.515    0.158   -1.12    -0.625 -0.499 
 8  0.323        0.814 -0.392  -0.0239   2.80    -0.480   -0.625 -0.763 
 9 -0.772        0.814 -1.39   -0.485   -1.10    -0.720   -0.573  0.0819
10 -0.117        1.00  -1.41   -0.101    0.621    0.482    0.206  1.35  
11 -0.263        0.370 -0.856   0.438   -1.10    -0.640   -0.573  1.61  
12  1.13         1.38  -0.949   2.44     0.456    0.362   -0.677 -0.657

Evaluate linear model

Once we have our lm.predict dataset that contains the predicted and test values, we can now use the metrics fiction to evaluate the accuracy of the model.

lm.predict%>%
  metrics(truth = chl, estimate = .pred)

# A tibble: 3 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      1.07  
2 rsq     standard      0.0337
3 mae     standard      0.929

Estimate stats

Sometimes you may wish to plot predicted values with different predictors. To do that you need to create a new tidied data from the the model with tidy function and parse interval = TRUE argument as shown in the code below. This create a tibble shown below and the same data is plotted in figure 3.

lm.fit.stats = lm.fit %>% 
  tidy(interval = TRUE)

lm.fit.stats

# A tibble: 7 x 5
  term         estimate std.error statistic p.value
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  1.25e-15     0.172  7.25e-15  1.00  
2 temperature -4.82e- 2     0.258 -1.87e- 1  0.854 
3 do           1.83e- 1     0.214  8.58e- 1  0.400 
4 ph           3.88e- 1     0.307  1.26e+ 0  0.220 
5 ammonia      1.59e- 1     0.209  7.64e- 1  0.453 
6 phosphate    3.98e- 1     0.225  1.77e+ 0  0.0913
7 nitrate     -3.00e- 1     0.274 -1.09e+ 0  0.286

lm.fit.stats %>%
  slice(-1) %>%
  ggplot(aes(x = term, y = estimate)) +
  geom_point(size = 4)+
  geom_errorbar(aes(ymin = estimate-std.error, ymax = estimate+std.error), width = .2)+
  scale_y_continuous(breaks = seq(-1.5,1.5,0.5))+
  ggpubr::theme_pubclean()+
  theme(axis.text = element_text(size = 10))+
  labs(x = "", y = "Estimated chl")

Figure 3: Estimated value of chlorophyll concentration at different predictors

Bayesian approach

Make Bayes Model

For Bayesian, we also change the engine and specified are called prior and prior_intercept. It turns out that linear_reg() has a stan engine. Since these prior distribution arguments are specific to the Stan software, they are passed as arguments to parsnip::set_engine().

prior.dist = rstanarm::student_t(df = 1)

set.seed(401)

## make the parsnip model
bayes.model = linear_reg() %>%
  set_engine(engine = "stan", 
             prior_intercept = prior.dist, 
             prior = prior.dist) %>%
  set_mode(mode = "regression")

This kind of Bayesian analysis (like many models) involves randomly generated numbers in its fitting procedure. We can use set.seed() to ensure that the same (pseudo-)random numbers are generated each time we run this code. The number 123 isn’t special or related to our data; it is just a “seed” used to choose random numbers.

Train Bayes model

Once we have defined the Bayesian model, we train it using a transformed testing set;

## train the bayes model
bayes.fit = bayes.model%>%
  fit(chl ~ ., data = crtr.testing)

bayes.fit

parsnip model object

Fit time:  1.7s 
stan_glm
 family:       gaussian [identity]
 formula:      chl ~ .
 observations: 12
 predictors:   7
------
            Median MAD_SD
(Intercept)  0.4    0.4  
temperature -0.5    0.6  
do          -0.4    0.4  
ph          -0.2    0.3  
ammonia     -0.3    0.3  
phosphate    0.6    0.8  
nitrate     -0.2    0.6  

Auxiliary parameter(s):
      Median MAD_SD
sigma 1.0    0.3   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

Predict Bayes fit

bayes.predict = bayes.fit %>%
  predict(crtr.testing) %>%
  bind_cols(crtr.testing)

bayes.predict

# A tibble: 12 x 8
     .pred temperature      do      ph ammonia phosphate nitrate     chl
     <dbl>       <dbl>   <dbl>   <dbl>   <dbl>     <dbl>   <dbl>   <dbl>
 1 -0.500        0.750  0.884  -0.947    0.258   -0.560   -0.625 -1.50  
 2 -0.570        0.370  1.44   -0.870    0.291   -0.560   -0.677  0.504 
 3  0.211       -0.392  0.977  -2.33     0.291   -0.560   -0.521 -0.0237
 4 -0.0976      -0.265  0.861  -1.41     0.588    0.0415   1.66  -0.235 
 5  0.475       -0.201  1.07   -1.02     1.35     2.01     2.49   0.346 
 6  0.280       -1.15   0.0720  0.284    0.291   -0.961   -0.573  0.399 
 7  0.0727      -0.265 -0.624   0.515    0.158   -1.12    -0.625 -0.499 
 8 -0.902        0.814 -0.392  -0.0239   2.80    -0.480   -0.625 -0.763 
 9  0.681        0.814 -1.39   -0.485   -1.10    -0.720   -0.573  0.0819
10  0.554        1.00  -1.41   -0.101    0.621    0.482    0.206  1.35  
11  0.552        0.370 -0.856   0.438   -1.10    -0.640   -0.573  1.61  
12 -0.119        1.38  -0.949   2.44     0.456    0.362   -0.677 -0.657

Evaluate Bayes model

bayes.predict %>%
  metrics(truth = chl, estimate = .pred)

# A tibble: 3 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.646
2 rsq     standard       0.415
3 mae     standard       0.533

Bayes.fit.stats

To update the parameter table, the tidy() method is once again used:

bayes.stats = bayes.fit %>% 
  tidy(intervals = TRUE)

bayes.stats

# A tibble: 7 x 5
  term        estimate std.error  lower upper
  <chr>          <dbl>     <dbl>  <dbl> <dbl>
1 (Intercept)    0.373     0.425 -0.394 1.07 
2 temperature   -0.472     0.564 -1.44  0.559
3 do            -0.428     0.416 -1.14  0.340
4 ph            -0.170     0.341 -0.772 0.461
5 ammonia       -0.318     0.312 -0.864 0.244
6 phosphate      0.600     0.757 -0.812 1.89 
7 nitrate       -0.206     0.625 -1.22  0.942

bayes.stats %>% 
  slice(-1) %>%
  ggplot(aes(x = term, y = estimate)) +
  geom_point(size = 4)+
  geom_errorbar(aes(ymin = lower, ymax = upper), width = .2)+
  scale_y_continuous(breaks = seq(-1.5,1.5,0.5))+
  ggpubr::theme_pubclean()+
  theme(axis.text = element_text(size = 10))+
  labs(x = "", y = "Estimated chl")

References

Kuhn, Max, Fanny Chow, and Hadley Wickham. 2020. Rsample: General Resampling Infrastructure. https://CRAN.R-project.org/package=rsample.

Kuhn, Max, and Davis Vaughan. 2020a. Parsnip: A Common Api to Modeling and Analysis Functions. https://CRAN.R-project.org/package=parsnip.

———. 2020b. Yardstick: Tidy Characterizations of Model Performance. https://CRAN.R-project.org/package=yardstick.

Kuhn, Max, and Hadley Wickham. 2020a. Recipes: Preprocessing Tools to Create Design Matrices. https://CRAN.R-project.org/package=recipes.

———. 2020b. Tidymodels: Easily Install and Load the ’Tidymodels’ Packages. https://CRAN.R-project.org/package=tidymodels.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Schloerke, Barret, Jason Crowley, Di Cook, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Joseph Larmarange. 2018. GGally: Extension to ’Ggplot2’. https://CRAN.R-project.org/package=GGally.

Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.

Wickham, Hadley, and Jennifer Bryan. 2018. Readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.