Taking text data to the next level - Unsupervised and supervised approaches in NLP @R-Ladies Bergen

I had the great pleasure of talking about NLP at R-Ladies Bergen yesterday. Thanks to everyone for making this event so much fun! The talk covers both unsupervised and supervised approaches and introduces quanteda, an R package for performing NLP tasks.
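
To give a feeling for what working with quanteda looks like, here is a minimal sketch of the usual first steps: turning raw text into a corpus, tokenizing it, and building a document-feature matrix (DFM). It assumes quanteda version 3 or later is installed. The two toy documents are made up for illustration and are not the data used in the talk.

```r
# A minimal sketch, assuming quanteda (version 3 or later) is installed
library(quanteda)

# Hypothetical toy documents (not the data set used in the talk)
texts <- c(
  doc1 = "Text data is fun, and text data is everywhere.",
  doc2 = "R makes working with text data easier."
)

# Step 1: build a corpus from the character vector
corp <- corpus(texts)

# Step 2: tokenize, drop punctuation, lowercase, and remove English stopwords
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# Step 3: build the document-feature matrix (DFM)
mydfm <- dfm(toks)

# Inspect the most frequent features
topfeatures(mydfm, 5)
```

A DFM like `mydfm` is the starting point for most of the plots and models in the talk.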

All material can be accessed here (including slides, raw and deployed code as well as the recording). The talk itself is heavily based on this blogpost.

Here are some further insights into the talk:

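# Plot a word cloud
# (in quanteda >= 3, textplot_wordcloud() lives in quanteda.textplots;
# wes_palette() comes from the wesanderson package)
textplot_wordcloud(
  # Load the DFM object
  mydfm,
  # Define the minimum number of times a word has to occur
  min_count = 3,
  # Define the maximum number of words to plot
  max_words = 500,
  # Define a color
  color = wes_palette("Darjeeling1")
)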
# This code is heavily inspired by Julia Silge's blog post
# (https://juliasilge.com/blog/sherlock-holmes-stm/)
# This code is heavily inspired by this blog post:
# (https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/advancing-text-mining/)

data %>%
# Generate the country name for each country using the 
# `countrycode()` command
  dplyr::mutate(countryname = countrycode(ccode, "iso3c", "country.name")) %>%
# Filter and only select specific countries that we want to compare
  dplyr::filter(countryname %in% c(
    "United Kingdom"
  )) %>%
# Now comes the plotting part :-)
  ggplot() +
# We do a bar plot that has the years on the x-axis and the level of the
# net sentiment on the y-axis
# We also color it so that all net sentiments greater than 0 get a
# different color
  geom_col(aes(
    x = year,
    y = net_perc,
    fill = (net_perc > 0)
  )) +
# Here we define the colors as well as the labels and title of the legend
  scale_fill_manual(
    name = "Sentiment",
    labels = c("Negative", "Positive"),
    values = c("#C93312", "#446455")
  ) +
# Now we add the axes labels
  xlab("Time") +
  ylab("Net sentiment") +
# And do a facet_wrap by country to get a more meaningful visualization
  facet_wrap(~ countryname)
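
The plot above relies on a `net_perc` column whose construction is not shown here. As a rough sketch, and assuming net sentiment is defined as the share of positive words minus the share of negative words (in percent), it could be computed like this in base R. The word lists and the `net_sentiment()` helper are hypothetical and only serve as an illustration; with quanteda you would typically count dictionary matches with `dfm_lookup()` instead.

```r
# Hypothetical tiny word lists; a real analysis would use a full
# sentiment dictionary
positive_words <- c("good", "great", "peace")
negative_words <- c("bad", "war", "crisis")

# Hypothetical helper: net sentiment as a percentage of all tokens
# (assumed definition: 100 * (positive - negative) / total tokens)
net_sentiment <- function(text) {
  # Lowercase and split on anything that is not a letter
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  # Drop empty strings that the split can leave behind
  tokens <- tokens[tokens != ""]
  n_pos <- sum(tokens %in% positive_words)
  n_neg <- sum(tokens %in% negative_words)
  100 * (n_pos - n_neg) / length(tokens)
}

# Values above 0 indicate more positive than negative words
net_sentiment("Peace is good, but war is bad and the crisis is bad.")
```

With this definition, the sign of the value maps directly onto the "Negative" and "Positive" fill colors used in the bar plot.
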
# Inspired by: https://bit.ly/37MCEHg

# Get the 30 top features from the DFM
freq_feature <- topfeatures(mydfm, 30)

# Create a data.frame for ggplot
data <- data.frame(list(
  term = names(freq_feature),
  frequency = unname(freq_feature)
))

# Plot the plot
data %>%
  # Call ggplot
  ggplot() +
  # Add geom_segment (this will give us the lines of the lollipops)
  geom_segment(aes(
    x = reorder(term, frequency),
    xend = reorder(term, frequency),
    y = 0,
    yend = frequency
  ), color = "grey") +
# Call a point plot with the terms on the x-axis 
# and the frequency on the y-axis
  geom_point(aes(x = reorder(term, frequency), y = frequency)) +
# Flip the plot
  coord_flip() +
# Add labels for the axes
  xlab("") +
  ylab("Absolute frequency of the features")

data %>%
# Generate the continent for each country using the `countrycode()` command
  dplyr::mutate(continent = countrycode(ccode, "iso3c", "continent", 
                            custom_match = c("YUG" = "Europe"))) %>%
# We group by continent and year to generate the average sentiment by
# continent and year
  group_by(continent, year) %>%
  dplyr::mutate(avg = mean(net_perc)) %>%
# We now plot it
  ggplot() +
# Using a line chart with year on the x-axis, the average sentiment 
# by continent on the y-axis and colored by continent
  geom_line(aes(x = year, y = avg, col = continent)) +
# Define the colors
  scale_color_manual(name = "", values = wes_palette("Darjeeling1")) +
# Label the axes
  xlab("Time") +
  ylab("Average net sentiment") 

The figures above show the output of some of the more basic supervised and unsupervised NLP models that we covered during the talk. As you work more with text data, you will see that the field has much more to offer, including document similarity, text generation, and even chatbots, all of which you can build with the knowledge and the same simple steps that I presented in the talk 👩🏼‍💻
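
To make one of these next steps a bit more concrete, here is a small base R sketch of document similarity using cosine similarity on term counts. The count matrix and the `cosine_similarity()` helper are hypothetical and made up for illustration; on a real DFM you would more likely use a ready-made function such as `textstat_simil()` from the quanteda.textstats package.

```r
# Hypothetical toy document-term counts (rows = documents, columns = terms)
counts <- rbind(
  doc1 = c(text = 2, data = 2, fun = 1),
  doc2 = c(text = 1, data = 1, fun = 0),
  doc3 = c(text = 0, data = 0, fun = 3)
)

# Hypothetical helper: cosine similarity between two count vectors
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# doc1 and doc2 share most of their vocabulary, doc1 and doc3 much less
cosine_similarity(counts["doc1", ], counts["doc2", ])
cosine_similarity(counts["doc1", ], counts["doc3", ])
```

For non-negative counts, values close to 1 mean that two documents use words in similar proportions, while values close to 0 mean they share hardly any vocabulary.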

If you want more resources, you can access them here:
