Misinformation versus Facts: Understanding the Influence of News regarding COVID-19 Vaccines on Vaccine Uptake

Background. Online discourse about the COVID-19 vaccines contains a great deal of both fact-based information and misinformation. Method. Using a sample of nearly four million geotagged English tweets and data from the CDC COVID Data Tracker, we conducted the Fama-MacBeth regression with the Newey-West adjustment to understand the influence of both misinformation and fact-based news on Twitter on COVID-19 vaccine uptake in the US from April 19, 2021, when all US adults became vaccine eligible, to June 30, 2021, after controlling for state-level factors such as demographics, education, and pandemic severity. We identified tweets related to either misinformation or fact-based news by analyzing their URLs. Results. A one-percent increase in fact-related Twitter users is associated with an approximately 0.87 decrease (B = −0.87, SE = 0.25, p < .001) in the number of daily new vaccinated people per hundred. No significant relationship was found between the percentage of fake-news-related users and the vaccination rate. Conclusion. The negative association between the percentage of fact-related users and the vaccination rate might be due to a combination of a larger user-level influence and the negative impact of online social endorsement on vaccination intent.


Introduction
Many people read news on social media today, yet the veracity of that news is not guaranteed. Waszak et al. studied the most frequently shared health web links on Polish social media platforms and found that 40% of them contained fake news [1]. Fake news regarding the COVID-19 pandemic is particularly concerning. By identifying and analyzing 1,225 pieces of COVID-19 fake news, Naeem et al. concluded that fake news is pervasive on social media, putting public health at risk [2]. Among health-related fake news, vaccine-related news [3] has the most fallacious content [1]. A recent study showed that misinformation induced a decline in vaccination intent of 6.2% in the UK and 6.4% in the USA among those who previously intended to take the vaccine [4]. To support the COVID-19 vaccination effort, Rzymski et al. [5] suggested tracking and tackling emerging and circulating fake news, and Montagni et al. [6] argued for increasing people's ability to detect fake news. Additionally, collaboration with the media and other organizations should be pursued, given that citizens do not support the involvement of government authorities in the direct control of news [7]. By studying anti-vaccination sentiment on Facebook, Hoffman et al. concluded that it would be valuable for health professionals to deliver targeted information to different sub-groups of individuals through social networks [8]. In this study, we examined the scale and scope of the influence of misinformation and fact-based news about COVID-19 vaccines on social media platforms on vaccine uptake. To summarize, this work (1) quantitatively analyzed the effect of fake news and fact-based news on vaccine uptake in the U.S. using the Fama-MacBeth regression with the Newey-West adjustment and (2) compared the user characteristics of the fact-related and fake-news-related users.
Seemingly counter-intuitively, the percentage of fact-related users is significantly negatively associated with the vaccination rate, while no significant correlation is found between the percentage of fake-news-related users and the vaccination rate. The fact-related users have relatively more social capital than the fake-news-related users. Most of the frequent keywords in the user descriptions of the fake-news-related users are political.

Twitter Data
We used the Twitter streaming API to collect related tweets that were publicly available. The search keywords and hashtags are COVID-19 vaccine-related or general vaccine-related, including "vaccine", "vaccinated", "immunization", "covidvaccine", and "#vaccine". Slang and misspellings of the related keywords were also included, namely "vacinne", "vacine", "antivax", and "anti vax". Tweets related only to other vaccine topics such as MMR, autism, HPV, tuberculosis, tetanus, hepatitis B, flu shots, or flu vaccines were removed using a keyword search. Moreover, since this study focused on tweets posted by U.S. Twitter users, we used the geo-location disclosed in users' profiles to filter out the tweets of non-US users. Following Lyu et al. [9], locations with noise were excluded. Nearly four million geotagged tweets and retweets posted from April 19, 2021 to June 30, 2021 were collected.
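As a rough illustration of this filtering step, a keyword-based relevance check could look like the sketch below. The keyword lists are reconstructed from the description above; the authors' exact lists and matching logic may differ.

```python
# Hypothetical keyword lists reconstructed from the text;
# the study's exact lists and matching rules may differ.
VACCINE_KEYWORDS = {"vaccine", "vaccinated", "immunization",
                    "covidvaccine", "#vaccine",
                    "vacinne", "vacine", "antivax", "anti vax"}
OTHER_VACCINE_TOPICS = {"mmr", "autism", "hpv", "tuberculosis",
                        "tetanus", "hepatitis b", "flu shot", "flu vaccine"}

def is_relevant(text: str) -> bool:
    """Keep a tweet if it matches a vaccine keyword and is not
    exclusively about another (non-COVID) vaccine topic."""
    lower = text.lower()
    has_vaccine_term = any(kw in lower for kw in VACCINE_KEYWORDS)
    only_other_topic = (any(kw in lower for kw in OTHER_VACCINE_TOPICS)
                        and "covid" not in lower)
    return has_vaccine_term and not only_other_topic
```

A tweet mentioning both COVID-19 and another vaccine topic is kept, since it is not exclusively about the other topic.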

CDC COVID-19 Data
The daily state-level number of people with at least one dose, confirmed cases, and deaths per hundred were extracted from the CDC COVID Data Tracker [10].

Census Data
Multiple factors including demographics, socioeconomic status, political affiliation and population density have been found to be related with people's intent to accept a COVID-19 vaccine [11,12].
These were considered as control variables in our study. The corresponding state-level estimates were extracted from the latest American Community Survey.

2020 National Popular Vote Data
The results of the 2020 national popular vote [14] were used to estimate the political affiliation of individual states. Since the vote shares of Biden and Trump sum to nearly 100%, we used only Biden's share. To keep consistency among the variables, the state-level shares were chosen.

Tweets Classification
On the one hand, automated fake news detection methods have been proposed by multiple studies.
To characterize fake news, Zhou et al. represented the spread network at different levels and found that networks of misinformation spread more widely, travel farther, and are denser [15].
Horne and Adali found that fake news has longer titles, uses simpler sentences, and is more similar to satire than real news [16]. Sentiment of the content has also been shown to be an important feature for detecting fake news [17]. Jin et al. [18] proposed a recurrent neural network with an attention mechanism to fuse multimodal features for effective rumor detection. On the other hand, researchers have also relied on fact-checking groups to detect misinformation [19]. Compared to automated fake news detection methods, this kind of approach has a higher true positive rate and a lower false positive rate; its relatively low recall rate could be a disadvantage. However, a previous study has shown that it still enabled researchers to reveal important patterns and insights [19]. To obtain a better and more precise understanding of the influence of each type of news, we detected misinformation using the second type of approach.
In particular, following the method of Bovet and Makse [19], we classified the tweets into (1) fake-news-related, (2) fact-related, and (3) others, by examining the URLs (if any) of the tweets. More specifically, if the URL's domain name was judged, based on the opinions of communication scholars, to belong to a website containing fake news, conspiracy theories, unreliable content, or extremely biased news, the tweets associated with (i.e., containing/retweeting/quoting) this URL were classified as fake-news-related. It is noteworthy that not only websites containing fake news, but also those containing conspiracy theories, unreliable content, or extremely biased news, were included in this group. Websites containing extremely biased news are sources "that come from a particular point of view and may rely on propaganda, decontextualized information, and opinions distorted as facts by www.opensources.co" [19]. For simplicity, we refer to this group of websites containing fake news, conspiracy theories, unreliable content, or extremely biased news as fake-news-related.
If the URL's domain name was judged to belong to a traditional, fact-based news outlet, the tweets associated with this URL were classified as fact-related. If a tweet was not associated with any URL, or its URLs' domain names were identified as neither fake-news-related nor fact-related, the tweet was classified as others. Most URLs were shortened. The domain names assigned as fake, conspiracy, bias, and unreliable were included in our study.
The curated list of fact-related websites, composed of 77 unique domain names, was reported by Bovet and Makse [19]. They identified the most important traditional news outlets by manually inspecting the list of top 250 URLs' domain names.
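A minimal sketch of this URL-based labeling, assuming shortened URLs have already been expanded. The domain sets here are a handful of hypothetical examples standing in for the much longer curated lists of Bovet and Makse [19] (e.g., 77 fact-based outlets).

```python
from urllib.parse import urlparse

# Illustrative stand-ins for the curated lists of Bovet and Makse [19];
# the real lists contain many more domains.
FAKE_NEWS_DOMAINS = {"infowars.com", "beforeitsnews.com"}
FACT_DOMAINS = {"nytimes.com", "reuters.com"}

def classify_tweet(urls):
    """Label a tweet by the domain names of its (already expanded) URLs."""
    for url in urls:
        domain = urlparse(url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[len("www."):]
        if domain in FAKE_NEWS_DOMAINS:
            return "fake-news-related"
        if domain in FACT_DOMAINS:
            return "fact-related"
    return "others"
```

Tweets without URLs, or with URLs outside both lists, fall through to "others", mirroring the three-way classification above.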
Using this approach, we assumed that Twitter users did not post a tweet containing a link to a fake news outlet or a fact-based news outlet merely to indicate whether they thought the content was fact or fake, as in the following example tweets:
• This article in [fake-news-related URL] is apparently fake/a fact.
• This article in [fact-related URL] is apparently fake/a fact.
Instead, we assumed that the Twitter users shared a similar opinion with the content they posted.
To verify this assumption and the robustness of our approach, we randomly sampled 100 unique tweets each from the identified fake-news-related (fake news, conspiracy theories, unreliable content, or extremely biased news) and fact-related tweets and inspected whether they met our assumption. After manually reading the sampled tweets, we found that all of them did. In fact, the majority of the tweets are just a short sentence that summarizes the content to which the URL links.

Preprocessing
The daily state-level numbers of people with at least one dose, confirmed cases, and deaths per hundred were transformed using a two-step procedure. First, we calculated the lag-1 differences of these three variables. Next, we smoothed the data using a simple moving average. According to the CDC vaccination data [10], there is a weekly seasonal pattern in the number of daily new vaccinated people (i.e., the lag-1 difference): the number typically peaks on Thursdays or Fridays and reaches its lowest on weekends. Therefore, we applied a 7-day moving average to the vaccination data. To maintain consistency, the lag-1 differences of confirmed cases and deaths were processed in the same way.
As for the Twitter data, since Twitter users could post tweets repeatedly, the series of (1) the percentage of unique Twitter users who posted fake-news-related tweets and (2) the percentage of unique Twitter users who posted fact-related tweets were processed only with a 7-day moving average.
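The two-step transformation can be sketched in a few lines of pandas (a minimal illustration with hypothetical numbers; the study's exact implementation may differ):

```python
import numpy as np
import pandas as pd

def preprocess(cumulative: pd.Series) -> pd.Series:
    """Lag-1 difference, then a trailing 7-day simple moving average,
    mirroring the two-step transformation described above."""
    daily_new = cumulative.diff()              # cumulative -> daily new counts
    return daily_new.rolling(window=7).mean()  # smooth the weekly seasonality

# Toy cumulative series (hypothetical numbers, not CDC data)
smoothed = preprocess(pd.Series(np.arange(10, dtype=float) ** 2))
```

The first seven entries of the smoothed series are NaN, since the rolling window needs a full week of daily differences before producing a value.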

Fama-MacBeth Regression
In our study, we attempted to analyze five time series, but most of them are non-stationary.
For example, the time series of the vaccination data show a declining trend during our study period.
Noticeably, the vaccination data, at this stage, had already been transformed into lag-1 differences. To avoid the spurious regression problem, which might lead to an incorrectly estimated linear relationship between non-stationary time series variables [20], we conducted the Fama-MacBeth regression [21] with the Newey-West adjustment [22], which has also been applied in several previous studies to address the time effect in areas such as finance [23] and public health and epidemiology [24].
The optimal number of lags was selected automatically using a nonparametric method [25]. Apart from the time series data, we added control variables from the aforementioned data sources including the Census data and the 2020 National Popular Vote data. Table 1 summarizes the dependent, independent, and control variables.
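The two-step procedure can be sketched as follows. This is a simplified illustration, not the study's actual code: the automatic lag selection [25] is replaced by a fixed lag count, and the variable names are ours.

```python
import numpy as np

def fama_macbeth(y, X, nw_lags=4):
    """Fama-MacBeth two-step estimator with Newey-West standard errors.

    y : (T, N) dependent variable, one cross-section (N states) per day.
    X : (T, N, K) regressors; an intercept is added internally.
    """
    T, N, K = X.shape
    betas = np.empty((T, K + 1))
    for t in range(T):
        # Step 1: cross-sectional OLS for each day t
        Xt = np.column_stack([np.ones(N), X[t]])
        betas[t], *_ = np.linalg.lstsq(Xt, y[t], rcond=None)

    # Step 2: average the daily coefficients over time
    mean_b = betas.mean(axis=0)
    demeaned = betas - mean_b

    # Newey-West (Bartlett-kernel HAC) standard error of each mean coefficient
    se = np.empty(K + 1)
    for k in range(K + 1):
        e = demeaned[:, k]
        var = np.dot(e, e) / T
        for lag in range(1, nw_lags + 1):
            w = 1.0 - lag / (nw_lags + 1.0)
            var += 2.0 * w * np.dot(e[lag:], e[:-lag]) / T
        se[k] = np.sqrt(var / T)
    return mean_b, se
```

Because the coefficients are estimated day by day and only then averaged, the Newey-West correction on the coefficient time series absorbs the serial dependence that would otherwise bias the standard errors.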

Results
Using the aforementioned URL-based tweet classification method, we detected 26,998 fake-news-related (fake news, conspiracy theories, unreliable content, or extremely biased news) and 456,061 fact-related tweets. There were 10,925 unique Twitter users associated with fake news and 159,283 associated with fact-based news. Interestingly, 6,839 users were associated with both fake news and fact-based news, accounting for 62.6% of the fake-news-related users and 4.3% of the fact-related users, respectively. This suggests that people who were associated with fake news were more likely to also be associated with fact-based news, but not the other way around.
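The asymmetry in the overlap follows directly from the counts above:

```python
# Counts of unique Twitter users reported above
fake_users = 10_925    # associated with fake news
fact_users = 159_283   # associated with fact-based news
both = 6_839           # associated with both

share_of_fake = 100 * both / fake_users  # share of fake-news-related users
share_of_fact = 100 * both / fact_users  # share of fact-related users
```

Nearly two thirds of the fake-news-related users also shared fact-based news, whereas only about one in twenty-three fact-related users shared fake news.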
The state-level percentages of fake-news-related and fact-related Twitter users are presented in Figure 1. However, our findings differ from those of Loomba et al. [4] with respect to the effect of misinformation: they found that exposure to misinformation induces a decline in vaccination intent, while, as shown in Table 2, we did not find a significant relationship between the percentage of fake-news-related users and the vaccination rate. To better understand this discrepancy in the effects of misinformation (fake news, conspiracy theories, unreliable content, or extremely biased news), we compared the user characteristics of the two groups, including the numbers of followers, friends, statuses, favorites, and listed memberships. There is significant evidence (p < .05) that the social capital of these two groups of users differs. Specifically, the users who have posted fact-related tweets but not fake-news-related tweets have more followers, friends, statuses, and listed memberships, and give more favorites (i.e., likes). Moreover, by performing a proportion z-test on the percentages of verified users in the two groups, we found significantly more verified users among the fact-related Twitter users (p < .05). We further plotted word clouds of the user descriptions of these two groups in Figure 2, in which the size of a word is proportional to its frequency. Apart from "love" and "life", which appear in both groups, a clear difference can be observed between the keywords of the two groups. Political keywords such as "maga", "conservative" and "Trump" appear in the user descriptions of the users who posted fake-news-related tweets. In contrast, there are fewer political keywords in the user descriptions of the users who posted fact-related tweets, although "blm" and "blacklivesmatter" do appear.
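The proportion z-test used above can be implemented without external dependencies. This is a generic sketch of the pooled two-proportion test; the counts in the usage test are hypothetical, not the study's actual verified-user counts.

```python
from math import sqrt, erf

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided pooled two-proportion z-test.

    x1, x2 : numbers of 'successes' (e.g., verified users) in each group.
    n1, n2 : group sizes. Returns the z statistic and two-sided p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: 2 * P(Z > |z|), via the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

The pooled estimate is appropriate here because the null hypothesis is that the two groups share the same underlying proportion of verified users.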

Discussions
We identified a significant negative correlation between the percentage of U.S. Twitter users who were associated with fact-based news and the U.S. COVID-19 vaccination rates during the period when all U.S. adults were eligible for the COVID-19 vaccines. We found no significant effects of misinformation (fake news, conspiracy theories, unreliable content, or extremely biased news) on the vaccination rates. The negative relationship between fact-based news and the vaccination rate is consistent with the questionnaire-based randomized controlled trials conducted by Loomba et al. [4]. However, we found a discrepancy in the effects of misinformation. As acknowledged by Loomba et al. [4], their study "does not replicate a real-world social media platform environment where information exposure is a complex combination of what is shown to a person by the platform's algorithms and what is shared by their friends or followers [26]". By comparing the user characteristics of the fact-related and fake-news-related users, we found significant evidence that the fact-related users tend to have greater online influence: they have more followers, friends, statuses, and listed memberships, give more favorites (i.e., likes), and are more likely to be verified users. We further qualitatively compared the words extracted from the user descriptions of these two groups and found clear differences. The fake-news-related users tend to have similar user profiles, as political keywords such as "maga", "conservative" and "Trump" were frequently observed in their user descriptions. Moreover, the number of detected fact-related users is almost 15 times the number of detected fake-news-related users. Taken together, these findings indicate that the fake-news-related users combine a smaller online influence with a tendency for selective exposure to homogeneous opinions [27,28] that may create echo chambers [29,30].
At first glance, it might seem counterintuitive that more fact-related news is associated with a lower vaccination rate. However, this pattern was consistently found in both survey-based studies [4,31] and our social media-based study. The reason could be that more fact-related news about the vaccines raises not only more discussion but also more concerns; this non-positive perception of the vaccines might induce a decline in vaccination intent among people who were already hesitant. Chadwick et al. [31] conducted a survey-based study to explore the implications of online social endorsement for the COVID-19 vaccination program. They found that the effects of online social endorsement are complex in terms of the people who consume them: people who give less priority to active monitoring of news are more likely to be associated with discouragement of vaccination compared to people who actively seek news. It is notable that the users we captured using our methods are the ones who posted tweets. Based on our results, the people who have posted fact-related tweets but have not posted fake-news-related tweets have a relatively larger audience. The people in that audience who do not post fact-related tweets can be considered less active than the people who do. Therefore, according to the findings of Chadwick et al. [31], these people might become more vaccine hesitant after consuming a growing amount of the news, which could be the reason for the negative association we found in our study. Future research can further explore this pattern by investigating the effects of online social endorsement on vaccination intent [31] using social media data in a real-world environment.
With respect to the control variables, the patterns are in line with the ongoing vaccination trends [10]. In June 2021, the estimated percentages of people 18 years and older with at least one dose among White alone (not Hispanic or Latino), Black or African American alone, and Asian alone were 66.8, 56.7, and 85.0, respectively. Our results likewise show a negative association between the percentage of Black or African American alone and the vaccination rate, and a positive association for the percentage of Asian alone. No statistically significant relationship was found between the percentage of persons aged 65 and over and the vaccination rate, which is in line with our expectation, since this demographic group was among the first to become eligible for the COVID-19 vaccines in the U.S. By the start of our study period, over 78% of people aged 65 years and over had already received at least one dose [10].
Echoing Bertoncello et al. [32], we found that states with more people holding a Bachelor's degree or higher tend to have higher vaccination rates.
The findings of our study should be interpreted with caution, as there are limitations in terms of the representativeness of online behaviours and potential biases in the type of people using Twitter. However, multiple previous studies have shown that these kinds of online activities are representative of real-world patterns in many areas such as diet [33,34] and public health [11,35]. More importantly, as also shown in our study, social media-based studies to some extent overcome the challenges encountered by survey-based methods [4]. Ideally, a future research direction is to explore the combination of survey-based and social media-based methods to improve robustness while addressing the drawbacks of each.
Moreover, this work identified fake-news-related and fact-related tweets using only URLs fact-checked by human experts, which could introduce a sample bias since not all fake-news-related or fact-related tweets contain URLs. However, one advantage of this approach over text-based machine learning or deep learning methods [18,36,37] is its high precision. Shahi and Nandini [38] presented a manually annotated multilingual cross-domain dataset of 5,182 fact-checked news articles for COVID-19 and used a BERT-based classification model [39] for fake/fact detection; the overall precision was only 0.78.
Although this result was achieved without fine-tuning, it suggests that there are gaps in the precision rate between expert-labeled and machine-detected results. In the future, we intend to combine these methods to detect fake news more reliably. In addition, other advanced time series models can be explored to perform the regression analysis for spatial and temporal patterns.

Conclusion
In this study, we identified tweets related to either misinformation (fake news, conspiracy theories, unreliable content, or extremely biased news) or fact-based news posted from April 19, 2021 to June 30, 2021 on Twitter. After performing the Fama-MacBeth regression with the Newey-West adjustment, we found that the percentage of fact-related users is significantly negatively associated with the vaccination rate, while no significant relationship was found between the percentage of fake-news-related users and the vaccination rate. We further compared the user characteristics of the fact-related and fake-news-related users and found that fact-related users have significantly more social capital. The fake-news-related users are similar to one another in terms of social capital as well as their user descriptions. Our findings are mostly consistent with those of previous survey-based studies.
More importantly, we conducted our study by passively observing the social media data in an attempt to address the issue that previous survey-based studies did not replicate a real-world social media platform environment, enabling us to have a better understanding of the mechanism of the relationship between vaccine-related news and vaccination rates.