Introduction
A word cloud is a visual representation of words in which more frequent words appear larger and bolder and less frequent words appear smaller. In other words, the more often a word occurs within a body of text, the larger it appears in the word cloud. This lets you see immediately which words are most prominent and perhaps most important. This kind of data analysis is widely referred to as text data mining, while others call it text analytics. Text mining is the process of deriving high-quality information from text. High-quality information is typically derived by discovering patterns and trends through means such as statistical pattern learning. In this post I illustrate the steps you need to create a word cloud with text from the Bible!
In this post I analyse text from the last book of the Old Testament, Malachi. Briefly, the name Malachi is Hebrew and means "my messenger" in English. After many years of exile, the Israelites returned to their land and expected to enjoy the unlimited blessing of God. This did not prove to be so, and as a result people began to doubt whether God really cared for them. Malachi, the prophet, replied that the fault was on their side, not God's. They had, by their sins, created barriers that hindered the flow and enjoyment of God's love.
Packages
We need to load the packages that provide the functions used in this post. A package can be loaded with either the library() or the require() function. I prefer require(), and it is the one I use to load the packages in the chunk below:
require(tm)
require(RColorBrewer)
require(tidyverse)
require(magrittr)
require(ggwordcloud)
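One practical difference between the two functions is that library() stops with an error when a package is missing, whereas require() only warns and returns FALSE, so a missing package may go unnoticed until a later line fails. The sketch below is one way to check up front. It assumes the package names used in this post, and the helper name missing_pkgs is made up for illustration:

```r
# Packages used in this post (sacred is called via sacred:: later on).
pkgs <- c("tm", "RColorBrewer", "tidyverse", "magrittr", "ggwordcloud", "sacred")

# Hypothetical helper: return the names of packages that are not installed.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

not_installed <- missing_pkgs(pkgs)
if (length(not_installed) > 0) {
  message("Install these first: ", paste(not_installed, collapse = ", "))
}
```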
Text Data
Coene (2019) developed the sacred package, which bundles the whole King James Version (KJV) of the Bible. It contains all 66 books, with each verse stored in its own row. We can obtain it with the line of code below:
scripture = sacred::king_james_version
The Bible comes as a data frame with five columns; a sample of its rows is shown in Table 1.
Number | Book | Chapter | Verse | Text |
---|---|---|---|---|
43 | JOH | 10 | 2 | But he that entereth in by the door is the shepherd of the sheep. |
44 | ACT | 28 | 21 | And they said unto him, We neither received letters out of Judaea concerning thee, neither any of the brethren that came shewed or spake any harm of thee. |
55 | TI2 | 24 | 6 | For I am now ready to be offered, and the time of my departure is at hand. |
38 | ZAC | 13 | 3 | And it shall come to pass, that when any shall yet prophesy, then his father and his mother that begat him shall say unto him, Thou shalt not live; for thou speakest lies in the name of the LORD: and his father and his mother that begat him shall thrust him through when he prophesieth. |
43 | JOH | 4 | 25 | The woman saith unto him, I know that Messias cometh, which is called Christ: when he is come, he will tell us all things. |
24 | JER | 51 | 58 | Thus saith the LORD of hosts; The broad walls of Babylon shall be utterly broken, and her high gates shall be burned with fire; and the people shall labour in vain, and the folk in the fire, and they shall be weary. |
19 | PSA | 38 | 11 | My lovers and my friends stand aloof from my sore; and my kinsmen stand afar off. |
24 | JER | 37 | 5 | Then Pharaoh’s army was come forth out of Egypt: and when the Chaldeans that besieged Jerusalem heard tidings of them, they departed from Jerusalem. |
47 | CO2 | 23 | 16 | Nevertheless when it shall turn to the Lord, the vail shall be taken away. |
58 | HEB | 1 | 3 | Who being the brightness of his glory, and the express image of his person, and upholding all things by the word of his power, when he had by himself purged our sins, sat down on the right hand of the Majesty on high. |
Because our interest is in a single book of the KJV, I pick only the text from Malachi; you can pick any book with the filter() function from the dplyr package (Wickham et al. 2018). I have noticed that the functions that follow work better with a tibble than with a vector, so please avoid selecting the column with the $ operator. Otherwise you may get an error and struggle to recognise its cause.
malachi = scripture %>%
filter(book == "mal")%>%
select(text)
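A likely explanation for the tibble-versus-vector behaviour is that VectorSource() treats each element of its input as a separate document. A character vector has one element per verse, while a one-column data frame is a list with a single element, so the whole book ends up as one document and the later rename(word = 1, freq = 2) finds exactly one frequency column. The base R sketch below only illustrates the difference in length, using made-up rows rather than the real scripture table:

```r
# Toy stand-in for the scripture table (made-up rows, not real verses).
verses <- data.frame(
  book = c("mal", "mal", "gen"),
  text = c("first verse", "second verse", "another book"),
  stringsAsFactors = FALSE
)

# Selecting with $ gives a character vector: one element per verse.
as_vector <- verses$text[verses$book == "mal"]

# Keeping the column as a data frame gives a list with a single element.
as_frame <- verses[verses$book == "mal", "text", drop = FALSE]

length(as_vector)  # 2: would become two documents
length(as_frame)   # 1: would become one document
```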
Because the text mining package we are going to use likes to work with documents structured as a corpus, we ought to convert the text into corpus format. The chunk below shows the code for transforming the text into a corpus:
malachi.corpus = malachi %>%
tm::VectorSource() %>%
tm::VCorpus()
Once the corpus is created, it is cleaned further by replacing special characters with spaces using the tm_map() function. For example, you can replace special characters such as "/", "@" and "|" with a space:
toSpace = content_transformer(function (x , pattern )
gsub(pattern, " ", x))
malachi.corpus = malachi.corpus %>%
tm_map(toSpace, "/") %>%
tm_map(toSpace, "@") %>%
tm_map(toSpace, "\\|")
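What toSpace does to the text can be seen without tm. The sketch below applies the same gsub() call to a made-up string (the sample text is not from Malachi, and to_space is just a plain-function stand-in). Note that "|" has to be escaped because gsub() treats the pattern as a regular expression:

```r
# Same replacement logic as toSpace, applied to a plain character vector.
to_space <- function(x, pattern) gsub(pattern, " ", x)

sample_text <- "priest/people@temple|altar"  # made-up example string

cleaned <- to_space(sample_text, "/")
cleaned <- to_space(cleaned, "@")
cleaned <- to_space(cleaned, "\\|")  # an unescaped "|" means alternation in a regex

print(cleaned)
# [1] "priest people temple altar"
```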
We can then clean the document further by removing stopwords and converting all words to lower case. The chunk below shows the lines of code used to clean the text and remove stopwords from the Book of Malachi.
malachi.corpus = malachi.corpus %>%
tm_map(FUN = content_transformer(tolower)) %>% # Convert the text to lower case
tm_map(FUN = removeNumbers) %>% # Remove numbers
tm_map(removeWords, stopwords("english")) %>% # Remove common English stopwords
tm_map(removeWords, c("ye", "o", "unto", "yet", "thee", "wherein", "neither", "shall",
"saith", "host", "will", "offer", "say")) %>% # Remove extra words; lower case because the text was lowercased above
tm_map(removePunctuation) %>% # Remove punctuation
tm_map(stripWhitespace) # Collapse repeated whitespace
# malachi.corpus %>% inspect()
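The order of these steps matters: once the text has been lowercased, the words to remove must also be given in lower case, and stripping whitespace comes last because the earlier removals leave gaps behind. The sketch below is a base R approximation of the same sequence on a made-up sentence. It does not use tm's own functions, and the short stopword list is hypothetical:

```r
verse <- "O ye priests, 123 times I have loved you"  # made-up text

step1 <- tolower(verse)                               # lower case
step2 <- gsub("[0-9]+", "", step1)                    # remove numbers
step3 <- gsub("\\b(o|ye|i|have|you)\\b", "", step2,   # remove whole words only
              perl = TRUE)
step4 <- gsub("[[:punct:]]+", "", step3)              # remove punctuation
step5 <- trimws(gsub("\\s+", " ", step4))             # collapse whitespace

print(step5)
# [1] "priests times loved"
```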
Once the document is clean, it is the right time to compute the frequency of each word. We can achieve this with the TermDocumentMatrix() function from the tm package. The result is then converted from a term-document matrix to a plain matrix and then to a data frame. Because the words are embedded as row names, we turn them into a column with the rownames_to_column() function from the tibble package (Müller and Wickham 2018) and give the columns meaningful names.
# Note: stemming = TRUE needs the SnowballC package to be installed
malachi.corpus.tb = malachi.corpus %>%
tm::TermDocumentMatrix(control = list(removeNumbers = TRUE,
stopwords = TRUE,
stemming = TRUE)) %>%
as.matrix() %>% as.data.frame() %>%
tibble::rownames_to_column() %>%
dplyr::rename(word = 1, freq = 2) %>%
dplyr::arrange(desc(freq))
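The counting step can be illustrated in base R. This is a rough sketch on a made-up sentence, not what tm does internally: it splits the text on spaces, tabulates the words and sorts by frequency, giving a word/freq table of the same shape as malachi.corpus.tb:

```r
clean_text <- "lord loved jacob lord hated esau lord"  # made-up, already cleaned

# Split into words and count how often each one occurs.
words  <- strsplit(clean_text, " ", fixed = TRUE)[[1]]
counts <- table(words)

# Turn the counts into a two-column data frame, most frequent word first.
freq_tb <- data.frame(word = names(counts),
                      freq = as.integer(counts),
                      stringsAsFactors = FALSE)
freq_tb <- freq_tb[order(-freq_tb$freq, freq_tb$word), ]
rownames(freq_tb) <- NULL

print(freq_tb)
```

Here "lord" comes first with a frequency of 3, and the other four words each occur once.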
Figure 1 is a word cloud of the entire Book of Malachi, created using the chunk below. Note that mapping only the frequency keeps every word horizontal.
ggplot(data = malachi.corpus.tb,
aes(label = word, size = freq, col = as.character(freq))) +
geom_text_wordcloud(rm_outside = TRUE, max_steps = 1,
grid_size = 1, eccentricity = .9)+
scale_size_area(max_size = 20)+
scale_color_brewer(palette = "Paired", direction = -1)+
theme_void()
However, if you want some words to be rotated, you should pass the angle aesthetic in aes() and supply a column that contains the angle. In this case we compute an angle that rotates a random subset of about 40% of the words by 90 degrees. The chunk for computing the angle is shown below:
malachi.corpus.tb = malachi.corpus.tb %>%
mutate(angle = 90 * sample(c(0, 1), n(), replace = TRUE, prob = c(60, 40)))
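Because sample() draws each word's angle independently, the share of rotated words is only approximately 40% and changes from run to run unless a seed is set. The weights in prob do not need to sum to one, which is why c(60, 40) works. A small base R check, using 1000 hypothetical words instead of the Malachi table:

```r
set.seed(1)  # fix the random draw so the result is reproducible

n_words <- 1000  # hypothetical number of words
angle <- 90 * sample(c(0, 1), n_words, replace = TRUE, prob = c(60, 40))

table(angle)                      # counts of horizontal (0) and rotated (90) words
rotated_share <- mean(angle == 90)
print(rotated_share)              # close to, but not exactly, 0.4
```

Calling set.seed() before the mutate() step in the same way makes the rotated subset, and therefore the figure, reproducible.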
Figure 2 is the word cloud of the Book of Malachi with some words rotated by 90 degrees. It was generated with the lines of code in the chunk below:
ggplot(data = malachi.corpus.tb,
aes(label = word, size = freq, angle = angle, col = as.character(freq))) +
geom_text_wordcloud(rm_outside = TRUE, max_steps = 1,
grid_size = 1, eccentricity = .9)+
scale_size_area(max_size = 20)+
scale_color_brewer(palette = "Paired", direction = -1)+
theme_void()
References
Coene, John. 2019. Sacred: Set of Sacred Texts Bible Data in Tidy Format. http://sacred.john-coene.com.
Müller, Kirill, and Hadley Wickham. 2018. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.