One of the most important components of a data management strategy is the proper handling of missing values. Often you will find datasets that have used either zero or NA to represent missing values in cell of a data frame or table. The question that may sober our mind is that, which of the two way of storing missing value in the data is right? You will notice that everyone who deals with data has to deal with this important distinction. And far too often people get it wrong.
Given the prevalence of PC’s in the government and business world, most people have Excel on their desktop and many small datasets are generated in Excel and disseminated as Excel files. Spreadsheet software like Microsoft Excel (Kraus 2014) and OpenOffice Calc (Oualline and Oualline 2018) is ubiquitous.
One of the problems in using spreadsheet software to work with data is that the default data validation settings on these tools are too forgiving. By default, data entered into a cell can be of any type — string, float, decimal, date, etc. Though these tools offers features to properly encode the “missing value”, data entry in spreadsheet tend to ignore them.
Unlike spreadsheet, R data stored in data frame are type specific and you can not mix them in one variable [R Core Team (2018). R store missing values by the symbol NA
(not available). Impossible values (e.g., dividing by zero) are represented by the symbol NaN
(not a number). To those accustomed to working with missing values in spreadsheet software they find hard to handle data in R and handles missing values in R to them may require a shift in thinking.
On this post, I will present a case in which you have received a data with zero as missing values and how you can convert to NA
—a standard form for missing value in R. Let’s first load the package we will use for this post.
require(tidyverse)
The tidyverse is an ecosystem of packages for data import, manipulate, model, visualize and share. For the purpose of learning, we first create a random sea surface temperature for three sites recorded for thirty days in March 2020. Table 1 shows a sample of temperature records in the three sites for four consecutive days generated in the chunk below;
sst = tibble(data = rnorm(n = 90, mean = 28.5, sd = 1.5),
sites = rep(c("Pemba", "Zanzibar", "Mafia"), times = 30),
day = rep(seq(lubridate::dmy(010320),lubridate::dmy(300320), by = "day"), each = 3)) %>%
select(day, everything())
Day | Sites | Temperature |
---|---|---|
2020-03-01 | Pemba | 28.98 |
2020-03-01 | Zanzibar | 27.34 |
2020-03-01 | Mafia | 28.97 |
2020-03-02 | Pemba | 29.99 |
2020-03-02 | Zanzibar | 26.94 |
2020-03-02 | Mafia | 29.87 |
2020-03-03 | Pemba | 27.46 |
2020-03-03 | Zanzibar | 26.42 |
2020-03-03 | Mafia | 29.80 |
2020-03-04 | Pemba | 29.80 |
2020-03-04 | Zanzibar | 31.00 |
2020-03-04 | Mafia | 29.41 |
Once we have the dataset, we can introduce the zero value in the temperature variable at 5, 9, 13, 18, 22, 27, 31, 36, 40, 45, 49, 54, 58, 63, 67, 72, 76, 81, 85, 90 index of the data frame.
sst[seq(5, 90, length.out = 20) %>% as.integer(),2] = 0
To have a clear view of the position of zero, I first converted the long form table of the sea surface temperature to wider form using the pivot_wider
function from tidyr package (Wickham and Henry 2018). Table 2 shows the temperature in the wider format with four columns. The first column represent the data of temperature records and the second to the fourth columns represent the temperature records of the three channels. Take note of the position of zero at each channel
sst.wide = sst %>%
pivot_wider(names_from = sites,
values_from = data,
id_cols = day)
Day | Pemba | Zanzibar | Mafia |
---|---|---|---|
2020-03-01 | 28.98 | 27.34 | 28.97 |
2020-03-02 | 29.99 | 0.00 | 29.87 |
2020-03-03 | 27.46 | 26.42 | 0.00 |
2020-03-04 | 29.80 | 31.00 | 29.41 |
2020-03-05 | 0.00 | 27.12 | 27.39 |
2020-03-06 | 29.43 | 28.09 | 0.00 |
2020-03-07 | 27.26 | 27.87 | 29.66 |
2020-03-08 | 0.00 | 27.44 | 28.26 |
2020-03-09 | 28.67 | 29.34 | 0.00 |
2020-03-10 | 27.92 | 28.59 | 26.61 |
2020-03-11 | 0.00 | 29.41 | 30.28 |
2020-03-12 | 26.21 | 28.38 | 0.00 |
2020-03-13 | 26.91 | 27.45 | 30.32 |
2020-03-14 | 0.00 | 28.54 | 28.17 |
2020-03-15 | 29.65 | 27.28 | 0.00 |
2020-03-16 | 26.34 | 26.06 | 27.20 |
2020-03-17 | 0.00 | 30.29 | 29.44 |
2020-03-18 | 29.97 | 27.36 | 0.00 |
2020-03-19 | 27.60 | 29.10 | 25.51 |
2020-03-20 | 0.00 | 29.45 | 30.10 |
2020-03-21 | 29.42 | 30.44 | 0.00 |
2020-03-22 | 29.55 | 32.12 | 28.96 |
2020-03-23 | 0.00 | 29.57 | 29.29 |
2020-03-24 | 30.24 | 26.54 | 0.00 |
2020-03-25 | 27.21 | 27.87 | 29.92 |
2020-03-26 | 0.00 | 28.83 | 30.43 |
2020-03-27 | 26.14 | 30.63 | 0.00 |
2020-03-28 | 28.24 | 28.15 | 27.10 |
2020-03-29 | 0.00 | 27.81 | 27.93 |
2020-03-30 | 30.47 | 27.48 | 0.00 |
In short, table 2 show that there are several days with zero value. But we know that it is impossible to have zero sea surface temperature. This indicate a common problem that you will often face when you deal with data from different sources. The data found in the real world is that data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of missing values. To make matters even more complicated, different data sources may indicate missing data in different ways.
Thus, for us to work with this data, we need to convert the 0 entries in the three channels to NA
. To do that, we use the mutate_at()
function from dplyr (Wickham et al. 2018) to specify which variables we want to apply our mutating function to, and we use the if_else()
function to specify what to replace the value with if the condition is true
or false
. The chunk below summarize how to convert the zero cell to NA
and the result is presented in table 3
sst.clean = sst.wide %>%
mutate_at(vars(Pemba, Zanzibar, Mafia),
function(.var){
if_else(.var ==0, ## if the value is equal to zero
true = as.numeric(NA), ## replace it with NA
false = .var ## else leave the value as it is
)
}
)
Day | Pemba | Zanzibar | Mafia |
---|---|---|---|
2020-03-01 | 28.98 | 27.34 | 28.97 |
2020-03-02 | 29.99 | NA | 29.87 |
2020-03-03 | 27.46 | 26.42 | NA |
2020-03-04 | 29.80 | 31.00 | 29.41 |
2020-03-05 | NA | 27.12 | 27.39 |
2020-03-06 | 29.43 | 28.09 | NA |
2020-03-07 | 27.26 | 27.87 | 29.66 |
2020-03-08 | NA | 27.44 | 28.26 |
2020-03-09 | 28.67 | 29.34 | NA |
2020-03-10 | 27.92 | 28.59 | 26.61 |
2020-03-11 | NA | 29.41 | 30.28 |
2020-03-12 | 26.21 | 28.38 | NA |
2020-03-13 | 26.91 | 27.45 | 30.32 |
2020-03-14 | NA | 28.54 | 28.17 |
2020-03-15 | 29.65 | 27.28 | NA |
2020-03-16 | 26.34 | 26.06 | 27.20 |
2020-03-17 | NA | 30.29 | 29.44 |
2020-03-18 | 29.97 | 27.36 | NA |
2020-03-19 | 27.60 | 29.10 | 25.51 |
2020-03-20 | NA | 29.45 | 30.10 |
2020-03-21 | 29.42 | 30.44 | NA |
2020-03-22 | 29.55 | 32.12 | 28.96 |
2020-03-23 | NA | 29.57 | 29.29 |
2020-03-24 | 30.24 | 26.54 | NA |
2020-03-25 | 27.21 | 27.87 | 29.92 |
2020-03-26 | NA | 28.83 | 30.43 |
2020-03-27 | 26.14 | 30.63 | NA |
2020-03-28 | 28.24 | 28.15 | 27.10 |
2020-03-29 | NA | 27.81 | 27.93 |
2020-03-30 | 30.47 | 27.48 | NA |
Summary
Here we have seen how to convert inproper filled cell of data frame with missing values to NA
in for all columns in data frame using tidyverse functions and we have seen a nested function that specifically designed to handle these missing values in a uniform way. Missing data is a fact of life in real-world datasets, and we’ll face often when we works with data from other sources.
Remarks
If creating functions is a hurdle task to you, dplyr has a na_if()
function that convert a value that meet a specified condition to NA
for all the variables. For instance we could also change the zero values in the dataset into NA as shwon in the chunk below;
sst.wide %>%
na_if(0)
# A tibble: 30 x 4
day Pemba Zanzibar Mafia
<date> <dbl> <dbl> <dbl>
1 2020-03-01 29.0 27.3 29.0
2 2020-03-02 30.0 NA 29.9
3 2020-03-03 27.5 26.4 NA
4 2020-03-04 29.8 31.0 29.4
5 2020-03-05 NA 27.1 27.4
6 2020-03-06 29.4 28.1 NA
7 2020-03-07 27.3 27.9 29.7
8 2020-03-08 NA 27.4 28.3
9 2020-03-09 28.7 29.3 NA
10 2020-03-10 27.9 28.6 26.6
# ... with 20 more rows
Reference
Kraus, Daniel. 2014. “Consolidated Data Analysis and Presentation Using an Open-Source Add-in for the Microsoft Excel Spreadsheet Software.” Medical Writing 23 (1): 25–28.
Oualline, Steve, and Grace Oualline. 2018. “Working with Spreadsheets in Libreoffice Calc.” In Practical Free Alternatives to Commercial Software, 89–117. Springer.
R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and Lionel Henry. 2018. Tidyr: Easily Tidy Data with ’Spread()’ and ’Gather()’ Functions. https://CRAN.R-project.org/package=tidyr.