As a data scientist, you need to distinguish between regression predictive models
and classification predictive models
. Clear understanding of these models helps to choose the best one for a specific use case. In a nutshell, regression predictive models and
classification predictive models` fall under supervised machine learning. The main difference between them is that the output variable—in regression is numerical (or continuous) while that for classification is categorical (or discrete).
A big part of machine learning is classification
— we want to know what class or group an observation belongs to. Therefore, the ability to precisely classify observations to their groups is valuable for various business applications like predicting the future based on historical data.
For example, when provided with a dataset about environmental variables, and you are asked to predict productivity, that is a regression task because productivity measured in term of chlorophyll concentration will be a continuous output.
In this post we will focus on regression. We will learn the steps of modelling using tidymodels (Kuhn and Wickham 2020b). We first explore the data and check if it fit for modelling, we then split the dataset into a training and testing sets. Next, we will create a recipe object and define our model. Lastly, we will train a specified model and evaluate its performance.
I will use the same dataset for three different model’s algorithms. Example of the common regression algorithms include random forest
, linear regression
, support vector regression (SVR)
and bayes
. Some algorithms, such as logistic regression
, have the name regression
in their functions but they are not regression algorithms.
To follow use code in this article, you will need tidymodels (Kuhn and Wickham 2020b) and tidyverse packages (Wickham 2017) installed in your machine. You can install them from CRAN. The chunk below highlight lines of code to install the packages if they are yet in your PC.
model.packages = c("tidymodels", "tidyverse")
install.packages(model.packages)
Once installed, you can load the packages in your session. We often we use several packages to accomplish a task. Even in this post, though seems only two packages are loaded, these are ecosystems which comes with dozens of packages bundled with them.
require(tidyverse)
require(tidymodels)
The CRTR data
We use the data collected with the Institute of Marine Sciences of the University of Dar es Salaam to illustrate the concept. The data was collected through the Coral Reef Targeted Research and Capacity Building for Management (CRTR) project between 2008 and 2009. The dataset contains;
- Chemical variables —nitrate, phosphate, salinity, pH, dissolved oxygen and ammonia
- Physical variables — temperature
- Biological variables— chlorophyll-a
Because the variables are organized in sheets of Excel spreadsheet, i used a read_excel
function from readxl package (Wickham and Bryan 2018) to read the file from the sheet. And because there are several sheet, the processed was iterated with a for
loop. Data from each sheet was allocated in the list file. The chunk below highlight the code used to read files in sheets.
variables = c("salinity", "temperature", "do", "ph", "chl", "ammonia", "phosphate", "nitrate")
crtr.list = list()
for (i in 1:length(variables)){
crtr.list[[i]] = readxl::read_excel("crtr.xlsx", sheet = i) %>%
mutate(variable = variables[i])
}
The data was untidy and unsuitable for visualization and modelling in R. Therefore, the first thing I had to deal with the data was to tidy the variables in the dataset to a right format that tidymodels and tidyverse recognizes. First the dataset was unlisted with bind_rows
function and the data was pivoted to long format and then back to wide format with only the variable of interested selected.
## organize in long form
crtr.long = crtr.list %>%
bind_rows() %>%
pivot_longer(cols =2:5, names_to = "sites", values_to = "data" )
## organize in the wide form
crtr.wide = crtr.long %>%
pivot_wider(names_from = variable, values_from = data) %>%
mutate(month = lubridate::month(Date, label = TRUE,
abb = TRUE) %>% as.character()) %>%
mutate_if(is.character, as.factor) %>%
mutate_if(is.numeric, round, digits = 2) %>%
select(date = Date, month,sites, chl, everything())
Let’s us glimpse the dataset
crtr.wide %>%
glimpse()
Rows: 52
Columns: 11
$ date <dttm> 2008-03-01, 2008-03-01, 2008-03-01, 2008-03-01, 2008-0...
$ month <fct> Mar, Mar, Mar, Mar, Apr, Apr, Apr, Apr, May, May, May, ...
$ sites <fct> Pongwe, Mnemba, Chumbe, Bawe, Pongwe, Mnemba, Chumbe, B...
$ chl <dbl> 0.09, 0.26, 0.56, 0.43, 0.47, 1.01, 0.63, 1.39, 0.37, 0...
$ salinity <dbl> 35.0, 34.0, 32.0, 32.0, 35.0, 35.0, 34.0, 34.0, 36.0, 3...
$ temperature <dbl> 28.8, 28.4, 28.0, 28.0, 28.2, 27.7, 28.1, 26.7, 27.0, 2...
$ do <dbl> 6.11, 5.95, 6.16, NA, 6.35, 6.14, 7.01, 6.31, 6.15, 6.1...
$ ph <dbl> 7.86, 7.88, 7.73, NA, 7.87, 7.88, 7.86, 7.91, 7.68, 7.8...
$ ammonia <dbl> 0.55, 0.80, 0.74, 0.94, 0.56, 0.72, 0.53, 0.97, 0.56, 0...
$ phosphate <dbl> 0.28, 0.28, 1.31, 1.90, 0.28, 0.32, 1.16, 0.84, 0.28, 0...
$ nitrate <dbl> 0.04, 0.07, 3.26, 3.34, 0.03, 0.47, 1.45, 0.84, 0.06, 0...
As a first step in modeling, it’s always a good idea to explore the variables in the dataset. Figure 1 is a pairplot that compare each pair of variables as scatterplots in the lower diagonal, densities on the diagonal and correlations written in the upper diagonal (Schloerke et al. 2018). Figure 2 show the correlation between chlorophyll-a (outcome) with other six predictor variables using a linear and quadratic equations is unfit for these dataset.
crtr.wide %>%
select(-salinity)%>%
mutate(season = lubridate::month(date) %>% as.integer(),
season = replace(season,season %in% c(1:4,11:12), "NE"),
season = replace(season,season %in% c(5:10), "SE"))%>%
GGally::ggscatmat(columns = 4:10,color="season", alpha=1, corMethod = "spearman")+
ggsci::scale_color_jco()+
ggpubr::theme_pubclean()+
theme(strip.background = element_blank(),
legend.position = "right",
legend.key = element_blank())
require(wesanderson)
wesa = wes_palettes %>% names()
crtr.wide %>%
select(-salinity)%>%
filter(nitrate < 1 & phosphate < 1.2 & chl < 1) %>%
pivot_longer(cols = 5:10, names_to = "predictor", values_to = "data") %>%
# filter(sites == "Bawe")%>%
ggplot(aes(x = data, y = chl))+
scale_y_continuous(trans = scales::sqrt_trans(), labels = function(x) round(x,2))+
# scale_x_continuous(trans = scales::sqrt_trans(), labels = function(x) round(x,2))+
geom_jitter()+
geom_smooth(se = FALSE, method = "lm", formula = "y ~ poly(x,2)", aes(color = "Quadratic"))+
geom_smooth(se = FALSE, method = "lm", formula = "y ~ x", aes(color = "Linear"))+
ggsci::scale_color_jco()+
facet_wrap(~predictor, scales = "free_x")+
ggpubr::theme_pubclean()+
theme(strip.background.x = element_blank(), legend.key = element_blank(),
legend.position = "right", panel.background = element_rect(colour = "black"))
Resample
In machine learning, one risk is the machine learns too well our sample data and is then less accurate during a real-world testing. This phenomenon is called overtraining or overfitting. We overcome this problem by splitting the dataset into a training and testing sets. The training set is used to train the model while the test set is reserved to later estimate how well the model might work on new or wild data.
The splitting is based on ratios and the widely used ratios include 80/20, 75/25, or 7/30, with the training data receiving a bigger proportion of the dataset and the test set get the remaining small portion.
For our sample that has only 52 observations, it make sense to use 70/30 split ratio. we use the fraction set.seed(4595)
from base R to fix the random number generator from rsample package (Kuhn, Chow, and Wickham 2020). This prevent generating new data in each execution.
the function initial_split
from the rsample package is designed to split the dataset into a training and testing sets. We purse the data to be split and the proportion that serve as a cutpoint of the two sets.
set.seed(4595)
crtr.split = crtr.clean %>%
rsample::initial_split(prop = 0.7)
crtr.split
<Training/Validation/Total>
<29/12/41>
Given the 41 total observations, we reserve 12 observations as a test set and kept 70% of the dataset (29 observation) as train set. From the crtr.split
object, we pull both the train set with the training
function and the test set with a testing
function.
## pull train set
crtr.train = crtr.split %>%
training()
## pull test set
crtr.test = crtr.split %>%
testing()
We can have a glimpse of the train dataset using a glimpse
function from dplyr package (Wickham et al. 2018).
crtr.train %>% glimpse()
Rows: 29
Columns: 7
$ chl <dbl> 0.26, 0.38, 0.43, 0.30, 0.64, 0.35, 0.22, 0.36, 0.14, 0...
$ temperature <dbl> 28.4, 26.5, 26.5, 26.4, 25.7, 25.4, 26.2, 26.4, 25.5, 2...
$ do <dbl> 5.95, 5.94, 5.90, 6.10, 6.21, 6.01, 6.44, 6.26, 5.64, 6...
$ ph <dbl> 7.88, 7.77, 7.82, 7.85, 7.84, 7.79, 7.76, 7.85, 7.83, 7...
$ ammonia <dbl> 0.80, 0.60, 0.64, 0.90, 0.72, 0.67, 0.65, 0.46, 0.63, 0...
$ phosphate <dbl> 0.28, 0.34, 0.42, 0.68, 0.32, 0.52, 0.86, 0.26, 0.42, 0...
$ nitrate <dbl> 0.07, 0.06, 0.38, 0.68, 0.08, 0.42, 0.78, 0.08, 0.37, 0...
The printed output shows that we have seven variables and all are numeric
recipe
The recipes package (Kuhn and Wickham 2020a) define a recipe object that we will use for modeling and to conduct preprocessing of variables. The four main functions are recipe()
, prep()
, juice()
and bake()
. recipe()
defines the operations on the data and the associated roles. Once the preprocessing steps are defined, any parameters are estimated using prep()
.
Recipes can be created manually by sequentially adding roles to variables in a data set. First, we will create a recipe object from the train set data and then specify the processing steps and transform the data with step_*
. once the recipe is ready we prep it. For example, to create a simple recipe containing only an outcome and predictors and have the predictors normalized and missing values in predictors imputed:
crtr.recipe = crtr.train %>%
recipe(chl ~ .) %>%
step_normalize(all_numeric()) %>%
step_corr(all_numeric())%>%
step_knnimpute(all_numeric()) %>%
prep()
crtr.recipe
Data Recipe
Inputs:
role #variables
outcome 1
predictor 6
Training data contained 29 data points and no missing data.
Operations:
Centering and scaling for temperature, do, ph, ammonia, ... [trained]
Correlation filter removed no terms [trained]
K-nearest neighbor imputation for do, ph, ammonia, phosphate, ... [trained]
Once the data are ready for transformation, the juices()
extract transformed training set while the bake()
function create a new testing set.
crtr.training = crtr.recipe %>%
juice()
crtr.testing = crtr.recipe %>%
bake(crtr.test)
Build Models
Random Forest
We begin by fitting a random forest model.
Make random forest model
We specify the model using the parsnip package (Kuhn and Vaughan 2020a). This package provides a tidy, unified interface to models for a range of models without getting bogged down in the syntactical minutiae of the underlying packages. The parsnip package help us to specify ;
- the
type
of model e.g random forest, - the
mode
of the model whether isregression
orclassification
- the computational
engine
is the name of the R package.
Based on the information above, we can use parsnip package to build our model as;
rf.model = rand_forest() %>%
set_engine(engine = "ranger") %>%
set_mode(mode = "regression")
rf.model
Random Forest Model Specification (regression)
Computational engine: ranger
Train random forest model
Once we have specified the model type, engine and mode, the model can be trained with the fit
function. We simply parse into the fit the specified model and the transformed training set extracted from the prepped recipe.
rf.fit = rf.model %>%
fit(chl ~ ., data = crtr.training)
predict with random forest
To get our predicted results, we use the predict()
function to find the estimated chlorophyll-a. First, let’s generate the estimated chlorophyll-a values by simply parse the random forest model rf.model
we specified and the transformed testing set we created from a prepped recipe. We also stitch the predicted values to the transformed train set with bind_cols
function;
rf.predict = rf.fit %>%
predict(crtr.testing) %>%
bind_cols(crtr.testing)
rf.predict
# A tibble: 12 x 8
.pred temperature do ph ammonia phosphate nitrate chl
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0278 0.750 0.884 -0.947 0.258 -0.560 -0.625 -1.50
2 -0.0202 0.370 1.44 -0.870 0.291 -0.560 -0.677 0.504
3 0.0453 -0.392 0.977 -2.33 0.291 -0.560 -0.521 -0.0237
4 -0.173 -0.265 0.861 -1.41 0.588 0.0415 1.66 -0.235
5 -0.0443 -0.201 1.07 -1.02 1.35 2.01 2.49 0.346
6 -0.274 -1.15 0.0720 0.284 0.291 -0.961 -0.573 0.399
7 -0.140 -0.265 -0.624 0.515 0.158 -1.12 -0.625 -0.499
8 0.0790 0.814 -0.392 -0.0239 2.80 -0.480 -0.625 -0.763
9 -0.193 0.814 -1.39 -0.485 -1.10 -0.720 -0.573 0.0819
10 -0.231 1.00 -1.41 -0.101 0.621 0.482 0.206 1.35
11 -0.417 0.370 -0.856 0.438 -1.10 -0.640 -0.573 1.61
12 1.30 1.38 -0.949 2.44 0.456 0.362 -0.677 -0.657
When making predictions, the tidymodels convention is to always produce a tibble of results with standardized column names. This makes it easy to combine the original data and the predictions in a usable format:
Evaluate the rf model
So far, we have built a model and preprocessed data with a recipe. We predicted new data as a way to bundle a parsnip model and recipe together. The next step is to assess or evaluate the accuracy of the model. We use a metrics
function from yardstick package (Kuhn and Vaughan 2020b)to assess the accuracy of the model by comparing the predicted versus the original outcome variable. Note that we use the predicted dataset we just computed using predict
function.
rf.predict %>%
metrics(truth = chl, estimate = .pred)
# A tibble: 3 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 1.10
2 rsq standard 0.221
3 mae standard 0.857
Linear regression approach
Make linear model
The good of tidymodels is that when we change the model, we do not need to start over again from the beginning but rather change the engine. For instance, we replace the ranger
engine with lm
to specify the linear model using the parsnip package (Kuhn and Vaughan 2020a) .
lm.model = linear_reg() %>%
set_engine(engine = "lm") %>%
set_mode(mode = "regression")
Train Linear model
Once we have specified the model type, engine and mode, the model can be trained with the fit
function;
lm.fit = lm.model %>%
fit(chl ~ ., data = crtr.training)
Predict with linear model
Once the model is fitted, This fitted object lm_fit has the lm model output built-in, which you can access with lm_fit$fit, but there are some benefits to using the fitted parsnip model object when it comes to predicting. To predict the values we use predict
function and parse the model and standardized testing values we computed from the recipe (R Core Team 2018). Note that here we use the transformed test set and not the original from the split object. In this case we use the model to predict the value and stitch the testing values using the bind_cols
function;
lm.predict = lm.fit %>%
predict(crtr.testing) %>%
bind_cols(crtr.testing)
lm.predict
# A tibble: 12 x 8
.pred temperature do ph ammonia phosphate nitrate chl
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.236 0.750 0.884 -0.947 0.258 -0.560 -0.625 -1.50
2 -0.0649 0.370 1.44 -0.870 0.291 -0.560 -0.677 0.504
3 -0.728 -0.392 0.977 -2.33 0.291 -0.560 -0.521 -0.0237
4 -0.764 -0.265 0.861 -1.41 0.588 0.0415 1.66 -0.235
5 0.0740 -0.201 1.07 -1.02 1.35 2.01 2.49 0.346
6 0.0151 -1.15 0.0720 0.284 0.291 -0.961 -0.573 0.399
7 -0.135 -0.265 -0.624 0.515 0.158 -1.12 -0.625 -0.499
8 0.323 0.814 -0.392 -0.0239 2.80 -0.480 -0.625 -0.763
9 -0.772 0.814 -1.39 -0.485 -1.10 -0.720 -0.573 0.0819
10 -0.117 1.00 -1.41 -0.101 0.621 0.482 0.206 1.35
11 -0.263 0.370 -0.856 0.438 -1.10 -0.640 -0.573 1.61
12 1.13 1.38 -0.949 2.44 0.456 0.362 -0.677 -0.657
Evaluate linear model
Once we have our lm.predict
dataset that contains the predicted and test values, we can now use the metrics
fiction to evaluate the accuracy of the model.
lm.predict%>%
metrics(truth = chl, estimate = .pred)
# A tibble: 3 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 1.07
2 rsq standard 0.0337
3 mae standard 0.929
Estimate stats
Sometimes you may wish to plot predicted values with different predictors. To do that you need to create a new tidied data from the the model with tidy
function and parse interval = TRUE
argument as shown in the code below. This create a tibble shown below and the same data is plotted in figure 3.
lm.fit.stats = lm.fit %>%
tidy(interval = TRUE)
lm.fit.stats
# A tibble: 7 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.25e-15 0.172 7.25e-15 1.00
2 temperature -4.82e- 2 0.258 -1.87e- 1 0.854
3 do 1.83e- 1 0.214 8.58e- 1 0.400
4 ph 3.88e- 1 0.307 1.26e+ 0 0.220
5 ammonia 1.59e- 1 0.209 7.64e- 1 0.453
6 phosphate 3.98e- 1 0.225 1.77e+ 0 0.0913
7 nitrate -3.00e- 1 0.274 -1.09e+ 0 0.286
lm.fit.stats %>%
slice(-1) %>%
ggplot(aes(x = term, y = estimate)) +
geom_point(size = 4)+
geom_errorbar(aes(ymin = estimate-std.error, ymax = estimate+std.error), width = .2)+
scale_y_continuous(breaks = seq(-1.5,1.5,0.5))+
ggpubr::theme_pubclean()+
theme(axis.text = element_text(size = 10))+
labs(x = "", y = "Estimated chl")
Bayesian approach
Make Bayes Model
For Bayesian, we also change the engine and specified are called prior
and prior_intercept
. It turns out that linear_reg()
has a stan
engine. Since these prior distribution arguments are specific to the Stan software, they are passed as arguments to parsnip::set_engine()
.
prior.dist = rstanarm::student_t(df = 1)
set.seed(401)
## make the parsnip model
bayes.model = linear_reg() %>%
set_engine(engine = "stan",
prior_intercept = prior.dist,
prior = prior.dist) %>%
set_mode(mode = "regression")
This kind of Bayesian analysis (like many models) involves randomly generated numbers in its fitting procedure. We can use set.seed()
to ensure that the same (pseudo-)random numbers are generated each time we run this code. The number 123 isn’t special or related to our data; it is just a “seed” used to choose random numbers.
Train Bayes model
Once we have defined the Bayesian model, we train it using a transformed testing set;
## train the bayes model
bayes.fit = bayes.model%>%
fit(chl ~ ., data = crtr.testing)
bayes.fit
parsnip model object
Fit time: 1.7s
stan_glm
family: gaussian [identity]
formula: chl ~ .
observations: 12
predictors: 7
------
Median MAD_SD
(Intercept) 0.4 0.4
temperature -0.5 0.6
do -0.4 0.4
ph -0.2 0.3
ammonia -0.3 0.3
phosphate 0.6 0.8
nitrate -0.2 0.6
Auxiliary parameter(s):
Median MAD_SD
sigma 1.0 0.3
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
Predict Bayes fit
bayes.predict = bayes.fit %>%
predict(crtr.testing) %>%
bind_cols(crtr.testing)
bayes.predict
# A tibble: 12 x 8
.pred temperature do ph ammonia phosphate nitrate chl
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.500 0.750 0.884 -0.947 0.258 -0.560 -0.625 -1.50
2 -0.570 0.370 1.44 -0.870 0.291 -0.560 -0.677 0.504
3 0.211 -0.392 0.977 -2.33 0.291 -0.560 -0.521 -0.0237
4 -0.0976 -0.265 0.861 -1.41 0.588 0.0415 1.66 -0.235
5 0.475 -0.201 1.07 -1.02 1.35 2.01 2.49 0.346
6 0.280 -1.15 0.0720 0.284 0.291 -0.961 -0.573 0.399
7 0.0727 -0.265 -0.624 0.515 0.158 -1.12 -0.625 -0.499
8 -0.902 0.814 -0.392 -0.0239 2.80 -0.480 -0.625 -0.763
9 0.681 0.814 -1.39 -0.485 -1.10 -0.720 -0.573 0.0819
10 0.554 1.00 -1.41 -0.101 0.621 0.482 0.206 1.35
11 0.552 0.370 -0.856 0.438 -1.10 -0.640 -0.573 1.61
12 -0.119 1.38 -0.949 2.44 0.456 0.362 -0.677 -0.657
Evaluate Bayes model
bayes.predict %>%
metrics(truth = chl, estimate = .pred)
# A tibble: 3 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.646
2 rsq standard 0.415
3 mae standard 0.533
Bayes.fit.stats
To update the parameter table, the tidy()
method is once again used:
bayes.stats = bayes.fit %>%
tidy(intervals = TRUE)
bayes.stats
# A tibble: 7 x 5
term estimate std.error lower upper
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.373 0.425 -0.394 1.07
2 temperature -0.472 0.564 -1.44 0.559
3 do -0.428 0.416 -1.14 0.340
4 ph -0.170 0.341 -0.772 0.461
5 ammonia -0.318 0.312 -0.864 0.244
6 phosphate 0.600 0.757 -0.812 1.89
7 nitrate -0.206 0.625 -1.22 0.942
bayes.stats %>%
slice(-1) %>%
ggplot(aes(x = term, y = estimate)) +
geom_point(size = 4)+
geom_errorbar(aes(ymin = lower, ymax = upper), width = .2)+
scale_y_continuous(breaks = seq(-1.5,1.5,0.5))+
ggpubr::theme_pubclean()+
theme(axis.text = element_text(size = 10))+
labs(x = "", y = "Estimated chl")
References
Kuhn, Max, Fanny Chow, and Hadley Wickham. 2020. Rsample: General Resampling Infrastructure. https://CRAN.R-project.org/package=rsample.
Kuhn, Max, and Davis Vaughan. 2020a. Parsnip: A Common Api to Modeling and Analysis Functions. https://CRAN.R-project.org/package=parsnip.
———. 2020b. Yardstick: Tidy Characterizations of Model Performance. https://CRAN.R-project.org/package=yardstick.
Kuhn, Max, and Hadley Wickham. 2020a. Recipes: Preprocessing Tools to Create Design Matrices. https://CRAN.R-project.org/package=recipes.
———. 2020b. Tidymodels: Easily Install and Load the ’Tidymodels’ Packages. https://CRAN.R-project.org/package=tidymodels.
R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Schloerke, Barret, Jason Crowley, Di Cook, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Joseph Larmarange. 2018. GGally: Extension to ’Ggplot2’. https://CRAN.R-project.org/package=GGally.
Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.
Wickham, Hadley, and Jennifer Bryan. 2018. Readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.