R : word frequency in a dataframe


Alright, so in this short tutorial we’ll calculate word frequencies and visualize them.

It’s a relatively simple task.
BUT when it comes to stopwords and a language other than English, there can be some difficulties.

I have a dataframe with a text field in Russian.

Step 0 : Install required libraries

install.packages("tidyverse")
install.packages("tidytext")
install.packages("tm")
library(tidyverse)
library(tidytext)
library(tm)

Step 1 : Create stopwords dataframe

#create stopwords DF
rus_stopwords = data.frame(word = stopwords("ru"))

Step 2 : Tokenize

new_df <- video %>% unnest_tokens(word, text) %>% anti_join(rus_stopwords, by = "word")


# anti_join() - function that removes the stopwords
# video       - the name of our dataframe
# word        - the name of the new column
# text        - the column that holds our text
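To build intuition for what "tokenize, then anti-join the stopwords" does, here is a toy base-R sketch. The sample strings, the tiny stopword list, and the variable names are all made up for illustration; the real pipeline above uses unnest_tokens() and anti_join().

```r
# A base-R approximation of "tokenize, then drop stopwords".
texts <- c("Привет мир", "мир и слово")            # hypothetical sample texts ("Hello world", "world and a word")
rus_stop <- c("и")                                 # pretend stopword list ("and")
tokens <- tolower(unlist(strsplit(texts, "\\s+"))) # split on whitespace and lowercase, like unnest_tokens()
tokens <- tokens[!tokens %in% rus_stop]            # keep only non-stopwords, like anti_join()
print(tokens)
```

This is only a sketch: unnest_tokens() also strips punctuation and handles many token types, which a bare strsplit() does not.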

Step 3 : Count words

frequency_dataframe = new_df %>% count(word) %>% arrange(desc(n))
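count(word) followed by arrange(desc(n)) is the tidyverse way of tabulating and sorting. The same idea in base R, on a made-up token vector, just to show what the step computes:

```r
# Count each distinct token and sort by frequency, descending.
tokens <- c("мир", "слово", "мир", "привет", "мир")  # hypothetical tokens
freq <- sort(table(tokens), decreasing = TRUE)        # like count(word) %>% arrange(desc(n))
print(freq)
```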

Step 4 (Optional) : Take only the first 20 rows of the dataframe

short_dataframe = head(frequency_dataframe, 20)

Step 5 : Visualize with ggplot

ggplot(short_dataframe, aes(x = word, y = n, fill = word)) + geom_col() 
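With plain geom_col() the bars come out in alphabetical order, and long Russian words can overlap on the x axis. A sketch of a more readable variant using standard ggplot2 functions (the tiny data frame here is made up; in the tutorial you would pass short_dataframe instead):

```r
library(ggplot2)  # already loaded via tidyverse in Step 0

df <- data.frame(word = c("мир", "слово"), n = c(3, 1))  # hypothetical counts
p <- ggplot(df, aes(x = reorder(word, n), y = n, fill = word)) +
  geom_col() +
  coord_flip() +                    # horizontal bars: long words stay readable
  theme(legend.position = "none") + # the legend just repeats the axis labels
  labs(x = "word", y = "count")
print(p)
```

reorder(word, n) sorts the bars by frequency, so the most common word ends up at the top of the flipped chart.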

So in my case it looked like this:

[Screenshot of the resulting bar chart]



Reference: Source link

Sr. SDET M Mehedi Zaman

Currently working as Sr. SDET at Robi Axiata Limited, a subsidiary of Axiata Group.
