Machine-learning assisted vaccine sentiment and misinformation detection dashboard — part 1

song yunju
6 min read · Mar 16, 2022

Project contributors: Mustafa Arif, Anup Deb, Saad Jameel, Bill (Yuan Hong) Sun, Melodie (Yunju) Song

Social media platforms such as Twitter, Facebook, and Reddit are the pulse of our generation, where anyone can share their opinion on what matters to them. From these platforms, digital epidemiologists extract posts as unstructured free text to explore why people oppose mask-wearing during the COVID-19 pandemic (He et al, 2021), detect outbreaks (Yousefinaghani et al, 2019), estimate influenza prevalence (Broniatowski et al, 2013), and even measure quality of life (Zivanovic, et al, 2020).

As a Health Systems Impact Fellow funded jointly by the Canadian Institutes of Health Research (CIHR) and Public Health Ontario (aka the Ontario Agency for Health Protection and Promotion), I worked with students at the Department of Mechanical Engineering and the Department of Computer Sciences at the University of Toronto, and students at the Department of Occupational and Population Health at Ryerson University, to design a semi-automated vaccine sentiment detection dashboard that can tell us, in real time, whether a tweet is pro-vaccine or anti-vaccine.

Background

The motivation behind developing a dashboard that tracks daily anti- and pro-vaccine sentiment is simple: we want a longitudinal understanding of anti-vaccine prevalence without needing to distribute and collect surveys. Dr. Matthew Salganik, the author of “Bit by Bit: Social Research in the Digital Age” and the co-organizer of the Summer Institutes in Computational Social Science (SICSS), a book and course that are both freely available to social science researchers, argues that researchers who deploy traditional social surveys face obstacles such as non-response, incomplete responses, and Hawthorne effects in written or phone surveys. Surveys also take time to produce and collect, and finding the right population to distribute the survey to can be time-consuming and require follow-up. Ready-made big data, by contrast, is abundant and available (e.g., social media posts that are collected and stored not for research but can be repurposed to answer social science questions).

Since the late 2000s, Canadians have become increasingly vaccine hesitant due to concerns about vaccine efficacy and safety; at any point in time, 2–3 people out of 10 will refuse or delay vaccination. The reasons for vaccine hesitancy are complex, and targeted interventions are hard to design in Canada: vaccine-hesitant individuals are neither less nor more educated than the average Canadian, but they are more inclined to mistrust medical professionals and government institutions broadly, and they tend to visit a primary care physician less often, which is where interventions or opportunities to communicate vaccine information occur. It is, however, certain that the vaccine-hesitant population is exposed to anti-vaccine misinformation.

According to the “Canadian Perspectives Survey Series 3” conducted in June 2020 by Statistics Canada, roughly 57.5% of respondents said they were very likely to get a COVID-19 vaccine when it became available, while some were “somewhat unlikely” (5.1%) or “very unlikely” (9.0%) to do so.

Figure 1. Canadians’ willingness to get a COVID-19 vaccine when it becomes available, source: Statistics Canada, Canadian Perspectives Survey Series 3 (June 2020).

After COVID-19 vaccines were approved and made available, Statistics Canada estimated that the share of unvaccinated people varies by age group, with the highest rate (15%) in the 30–39-year-old cohort. This seems to show that surveyed Canadians’ intention to vaccinate matches the actual vaccination rate.

Figure 2. COVID-19 vaccine uptake by age group, adult population of the 10 provinces, source: Statistics Canada, Canadian Community Health Survey (June — August 2021).

What about the relationship between the actual vaccination rate and what people express on social media? Can we estimate population-level immunization coverage using public social media posts? The short answer is yes, provided that tweets expressing receipt of a vaccination are correctly identified and counted once for each unique Twitter user. One of the earlier large-scale studies built a tweet classifier to track flu vaccinations by geographic location and gender from 2013 to 2014, and compared patterns in tweet volume with the flu vaccination coverage rates collected by the Centers for Disease Control and Prevention (CDC) (Huang et al, 2017, AAAI). The team manually annotated 10,000 tweets as “yes: has the person already received a vaccine, or do they intend to receive the vaccine in the future” or “no” (Fleiss’ kappa interrater reliability: 0.793). Using these annotated tweets as part of the training dataset, they trained a model that detected patterns of vaccination intention, vaccination mentions, and actual vaccination rates (see Figure 3).

Figure 3. A research team collected flu vaccine tweets to show that the vaccine prevalence estimates on Twitter coincide with CDC’s actual vaccination data (image cropped from Huang et al, 2017).
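The interrater reliability reported for the annotated tweets is a Fleiss' kappa, which measures how much a fixed number of annotators agree beyond what chance alone would produce. Below is a minimal sketch of the standard formula in plain Python. The function name `fleiss_kappa` and the toy ratings are made up for illustration; they are not the study's data or code, and the number of annotators used in the study is not assumed here.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Assumes every item was labelled by the same number of raters (>= 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # How many raters chose each label, per item.
    per_item = [Counter(item) for item in ratings]

    # Observed agreement: mean share of agreeing rater pairs per item.
    p_bar = sum(
        (sum(c * c for c in counts.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for counts in per_item
    ) / n_items

    # Chance agreement: sum of squared overall label proportions.
    totals = Counter()
    for counts in per_item:
        totals.update(counts)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())

    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example (hypothetical labels): 4 tweets, 3 annotators each.
toy_ratings = [
    ["yes", "yes", "yes"],
    ["no", "no", "no"],
    ["yes", "yes", "no"],
    ["no", "no", "yes"],
]
print(round(fleiss_kappa(toy_ratings), 3))  # 0.333
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so the reported 0.793 indicates that the annotators agreed far more often than chance would predict.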

Over the past decade, it has become increasingly difficult to estimate vaccine coverage using social media data because of noise. Noise in data science refers to collected information that does not contribute to our understanding of the ground truth. It has grown with the advent of bots, vaccine-related campaigns, anti-vaccine activity, and the many vaccine options available to prevent the same disease.

Take COVID-19 vaccination-related tweets as an example. First, if we were estimating vaccination intent (true positive) and an anti-vaccine or disinformation campaign deployed bots to post “intention to refuse vaccination” tweets, we would see too many anti-vaccine tweets in our sample, which would not reflect the actual number of people who oppose vaccines. Second, mass pro-vaccination campaigns run by organizations on social media, such as “book a vaccination appointment for you and your loved ones”, can skew the sample in the other direction, creating too many intention-to-vaccinate tweets in the sample. Third, the availability of many different kinds of vaccines to prevent a single disease can cause classification errors. If a person tweeted “I will not get an AZ, I am waiting for a newer safer vaccine!”, this tweet should ideally be classified as “yes: intention to vaccinate”, but it is uncertain whether an automated classification algorithm can be trained to such a sensitivity (i.e., the proportion of true intention-to-vaccinate tweets that the model correctly identifies) without sacrificing specificity (i.e., the proportion of true intention-to-not-vaccinate tweets that the model correctly identifies).
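The trade-off described above can be made concrete with the standard definitions: sensitivity is the number of true positives divided by all tweets that truly express an intention to vaccinate, and specificity is the number of true negatives divided by all tweets that truly express an intention not to vaccinate. The sketch below computes both from a list of true and predicted labels. The function name and the toy labels are hypothetical and only illustrate the arithmetic.

```python
def sensitivity_specificity(y_true, y_pred, positive="yes"):
    """Return (sensitivity, specificity) for binary labels.

    `positive` is the label treated as "intention to vaccinate";
    every other label is treated as the negative class.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # intends to vaccinate, classified as such
            else:
                fn += 1  # intends to vaccinate, missed by the classifier
        else:
            if pred == positive:
                fp += 1  # does not intend to vaccinate, wrongly flagged
            else:
                tn += 1  # does not intend to vaccinate, classified as such

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example (made-up labels): 3 true "yes" tweets and 2 true "no" tweets.
truth = ["yes", "yes", "yes", "no", "no"]
predicted = ["yes", "no", "yes", "no", "yes"]
sens, spec = sensitivity_specificity(truth, predicted)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A tweet like the “waiting for a newer safer vaccine” example would count as a false negative if the classifier labels it “no”, which lowers sensitivity; making the classifier more willing to say “yes” would recover it but risks more false positives, which lowers specificity.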

Given the challenges of estimating coverage using social media data, the second-best approach for learning which areas to target when improving vaccine confidence is to conduct sentiment analysis. Sentiment analysis, or opinion mining, uses natural language processing (NLP) to determine whether a topic, item, or product induces positive, negative, or neutral feelings in people. Further, data scientists retrieve the geographic location, time, and URLs of these tweets to see what types of links pro-vaccine and anti-vaccine users share with others, where the users are located, and which words are most tied to anti- or pro-vaccine sentiment. All of this can help public health practitioners design communication strategies to promote vaccine confidence.

Given these considerations, the 5 objectives of our dashboard are as follows:

  • Identify trends in vaccine sentiment on Twitter with over 80% accuracy and 70% recall on a day-to-day basis.
  • Link vaccine hesitancy trends with media and key news events in order to identify the factors behind changes in sentiment.
  • Identify sources of vaccine hesitancy hotspots within Canada by city and province.
  • Present all the collected statistics and data on an interactive and convenient dashboard in the form of graphs, visualizations and infographics that multiple users can access simultaneously.
  • Automate daily data pulls from Twitter onto the dashboard to update it daily.
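To illustrate the first objective, here is a minimal sketch of how already-classified tweets could be rolled up into a day-by-day sentiment trend. It is not the project's pipeline: the function name `daily_anti_share`, the `"pro"`/`"anti"` labels, and the sample dates are assumptions made up for this example.

```python
from collections import Counter, defaultdict
from datetime import date


def daily_anti_share(labelled_tweets):
    """Map each day to the share of that day's tweets labelled "anti".

    `labelled_tweets` is an iterable of (day, label) pairs, where `day`
    is a datetime.date and `label` is "pro" or "anti" (hypothetical labels).
    """
    per_day = defaultdict(Counter)
    for day, label in labelled_tweets:
        per_day[day][label] += 1

    # Sort by date so the result reads as a time series.
    return {
        day: per_day[day]["anti"] / sum(per_day[day].values())
        for day in sorted(per_day)
    }


# Toy example with made-up classifications.
sample = [
    (date(2022, 3, 1), "pro"),
    (date(2022, 3, 1), "anti"),
    (date(2022, 3, 2), "anti"),
    (date(2022, 3, 1), "pro"),
    (date(2022, 3, 1), "pro"),
]
for day, share in daily_anti_share(sample).items():
    print(day.isoformat(), f"{share:.2f}")
```

Plotting this series over time is one simple way to line up changes in sentiment with news events, as the second objective describes.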

In the second part of the series, we will share how we collected the data and how we annotated and processed the tweets. We will also discuss the kinds of models, hyperparameter tuning, and features we chose, and share the dashboard architecture.

  • To understand more about vaccine hesitancy, please refer to the federally funded Canadian Immunization Research Network (CIRN)’s Social Sciences and Humanities Network (SSHN). It consists of leading scientists across the provinces such as Dr. Eve Dube, Dr. Natasha Crowcroft, Dr. Noni McDonald, Dr. Kumanan Wilson, and Dr. Jordan Tustin, who have helped define vaccine hesitancy, identified the contextual reasons Canadians refuse vaccination online and offline, and evaluated methods to increase vaccine accessibility and confidence among diverse populations.
