This article was written by Sadia Afroz in cooperation with Vibhor Sehgal, an AI researcher at Avast.
A recent report by NISC found that roughly 50% of cybersecurity professionals regard misinformation as a significant threat to their organizations. “Fake news”, misinformation, and disinformation are all terms for fabricated news with no verifiable facts, sources, or quotes. A 2020 online poll conducted across 142 countries found that 57% of internet users considered fake news the biggest online threat. Disinformation has, in other words, become a new breed of cybersecurity threat. As fake news becomes more prominent, we need systems that respond to it accordingly.
Online misinformation ecosystem
To understand the online misinformation ecosystem, we build two networks of misinformational and informational domains: a domain-level hyperlink network and a social-media-level link-sharing network. The domain-level network represents the hyperlinking relationships between domains. The social-media network represents the link-sharing behavior of social network users. We collated and curated several public databases of previously identified misinformational and informational domains. The domain-level hyperlink network is constructed by scraping all hyperlink tags (<a href="…">…</a>) from these domains.
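The hyperlink-scraping step can be sketched with Python's standard-library HTML parser. The class name and sample page below are illustrative stand-ins, not part of the study's actual pipeline:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class HyperlinkExtractor(HTMLParser):
    """Collect the target domain of every <a href="..."> tag on a page."""
    def __init__(self):
        super().__init__()
        self.domains = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.domains.append(urlparse(value).netloc)

# A tiny made-up page with two outgoing links.
page = ('<html><body><a href="https://example.com/story">story</a>'
        '<a href="https://imdb.com/title">cast</a></body></html>')
parser = HyperlinkExtractor()
parser.feed(page)
print(parser.domains)  # ['example.com', 'imdb.com']
```

Each extracted domain then becomes a candidate edge endpoint in the hyperlink graph.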
In total, we found four data sets consisting of 1,707 domains. There is, however, overlap between these datasets, which once removed yields 1,389 distinct misinformational domains. There are several limitations to using these domains directly in our analyses. Some datasets only provide the headline of the offending news article, from which we had to perform a reverse heuristic Google search to identify the source domain. This reverse search does not always identify the offending domain; entertainment stories, for example, often lead to domains like imdb.com and people.com. To contend with these limitations, we ranked the 1,389 domains to down-rank mislabelled domains like imdb.com. Despite this ranking system, a dozen clearly non-misinformational domains remained in our data set, like theonion.com and huffingtonpost.com. These domains were manually removed, yielding a total of 1,059 domains.
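The de-duplication and manual curation described above amounts to a set union followed by an exclusion list; a minimal sketch, with made-up domain names in place of the actual datasets:

```python
# Merging overlapping domain lists and removing known false positives.
# The list contents are illustrative, not the study's actual data.
datasets = [
    {"fakenews1.example", "fakenews2.example", "theonion.com"},
    {"fakenews2.example", "fakenews3.example"},
    {"fakenews1.example", "imdb.com"},
    {"fakenews3.example"},
]
manual_exclusions = {"theonion.com", "imdb.com"}  # satire / mislabelled

distinct = set().union(*datasets)       # de-duplicate across datasets
curated = distinct - manual_exclusions  # drop clearly non-misinfo domains
print(sorted(curated))
```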
We paired these 1,059 misinformational domains with 1,059 informational domains corresponding to the top-ranked Alexa domains (which we manually verified to be trustworthy). After scraping all informational and misinformational domains, we constructed an unweighted, directed graph of hyperlinks in which the nodes are domains and a directed edge connects each domain to the domains it hyperlinks. Each domain in the network is assigned a label of “misinfo”, “info”, or “none”; the “none” label covers domains in neither the misinformational nor the informational data set. We see a stark difference: 17.90% of hyperlinks on misinfo domains link to other misinfo domains, as compared to only 0.62% for info domains. Similarly, albeit a smaller effect, 4.37% of hyperlinks on misinfo domains are to info domains, as compared to 13.45% for info domains.
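The per-label link fractions reported above can be computed directly from such a labeled directed graph. This is a minimal sketch over a toy graph, not the study's code; the domain names and labels are invented:

```python
# Toy directed hyperlink graph: source domain -> set of target domains.
edges = {
    "misinfoA.example": {"misinfoB.example", "trusted.example", "other.example"},
    "misinfoB.example": {"misinfoA.example"},
    "trusted.example":  {"other-trusted.example", "other.example"},
}
# Domains absent from this map default to the "none" label.
labels = {
    "misinfoA.example": "misinfo", "misinfoB.example": "misinfo",
    "trusted.example": "info", "other-trusted.example": "info",
}

def link_fraction(source_label, target_label):
    """Fraction of outgoing hyperlinks from source_label domains
    that point at target_label domains."""
    total = hits = 0
    for src, targets in edges.items():
        if labels.get(src, "none") != source_label:
            continue
        for dst in targets:
            total += 1
            hits += labels.get(dst, "none") == target_label
    return hits / total if total else 0.0

print(link_fraction("misinfo", "misinfo"))  # 0.5 (2 of 4 outgoing edges)
print(link_fraction("misinfo", "info"))     # 0.25
```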
Figure 1: Hyperlink connectivity between misinformation (red) and informational domains. The nodes colored red correspond to misinformation domains, and the remaining nodes correspond to different categories of informational domains.
Link Sharing on Social Media
We investigate the ability to identify misinformational domains by tracking the hyperlinks shared by certain social-media users. Because of its relative ease of access, we focus on Twitter’s publicly available user data. In particular, we enlist two Twitter APIs: (1) the Search Tweets API allows filtering tweets by matching a query term against a tweet’s keywords, hashtags, or shared URLs. We filter tweets by matching shared URLs against our misinfo/info URL dataset, surfacing which users are sharing a particular domain; and (2) the Get Tweet Timelines API allows querying all tweets of the users surfaced by the Search Tweets API. In our case, we extract the domain URLs shared by these users. Although we don’t consider them here, the data returned by both APIs contains geo-location, replied-to, time, and other attributes that could be leveraged in the future.
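The URL-matching step (deciding whether a shared link points at a domain in our dataset) can be sketched as follows. The tweet structure and domain sets here are simplified stand-ins, not Twitter's actual API payload:

```python
from urllib.parse import urlparse

# Toy misinfo/info domain sets; the real lists come from the curated dataset.
misinfo_domains = {"fakenews.example"}
info_domains = {"trusted.example"}

# Simplified stand-in for tweets returned by the Search Tweets API.
tweets = [
    {"user": "alice", "urls": ["https://fakenews.example/story?id=1"]},
    {"user": "bob",   "urls": ["https://trusted.example/news/2"]},
    {"user": "alice", "urls": ["https://unrelated.example/x"]},
]

def label_of(url):
    """Map a shared URL to the label of its host domain."""
    host = urlparse(url).netloc.lower()
    if host in misinfo_domains:
        return "misinfo"
    if host in info_domains:
        return "info"
    return "none"

# Surface which users share which labeled domains.
shares = [(t["user"], label_of(u)) for t in tweets for u in t["urls"]]
print(shares)  # [('alice', 'misinfo'), ('bob', 'info'), ('alice', 'none')]
```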
Figure 2: A magnified view of eight almost fully connected .news domains, all owned by Webseed.
We trained a logistic regression (LR) classifier using 75% of the data and evaluated it on the remaining 25%. The LR hyperparameters are tuned to maximize the F1-score. On the test set, with a support of 113/176 misinfo/info domains, the LR classifier achieved an F1-score of 0.75/0.85, with precision 0.78/0.84 and recall 0.73/0.87.
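The per-class precision, recall, and F1 figures above follow the standard definitions; a minimal sketch with made-up labels, not the study's actual test set:

```python
def prf1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class, from paired label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Six made-up test examples.
y_true = ["misinfo", "misinfo", "info", "info", "misinfo", "info"]
y_pred = ["misinfo", "info",    "info", "info", "misinfo", "misinfo"]
p, r, f = prf1(y_true, y_pred, positive="misinfo")
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```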
As with most aspects of cybersecurity, technological solutions to misinformation will themselves have to be multi-faceted. With some 500 hours of video uploaded to YouTube every minute and over a billion posts to Facebook each day, the massive scale of social media makes tackling misinformation an enormous challenge. We propose that, in conjunction with complementary approaches, addressing misinformation at the domain level holds promise for disrupting large-scale misinformation campaigns.
About the authors
Sadia Afroz, PhD, is a staff scientist at Avast and a research scientist at the International Computer Science Institute (ICSI). Before joining ICSI, she was a postdoc at UC Berkeley and a PhD student at Drexel University. Her work focuses on anti-censorship, anonymity, and adversarial learning. Her work on adversarial authorship attribution received the 2013 Privacy Enhancing Technology (PET) award, the best student paper award at the 2012 Privacy Enhancing Technologies Symposium (PETS), and the 2014 ACM SIGSAC dissertation award (runner-up). More about her research can be found at https://www.eecs.berkeley.edu/~sa499/