CyberSec&AI Connected 2021 | BLOG

Fighting Misinformation with AI

Sadia Afroz

“5G causes COVID-19.” If you believed this statement was true, even for a moment, you have fallen into the trap of fake news. Fake news, especially during COVID-19, has become extremely dangerous. Myths like “hydroxychloroquine can fight the virus” have led to assaults, arson attacks, and even deaths. Overall, COVID-19-related fake news has killed over 800 people and resulted in over 5,800 hospitalizations. Misinformation is now being disseminated on a massive scale, and it has severe repercussions.

This article was written by Sadia Afroz in cooperation with Vibhor Sehgal, an AI researcher at Avast.

A recent report by NISC found that roughly 50% of cybersecurity professionals regard misinformation as a significant threat to their organizations. “Fake news”, whether misinformation or disinformation, is fabricated news with no verifiable facts, sources, or quotes. A 2020 online poll conducted across 142 countries found that 57% of internet users considered fake news to be the biggest threat. Believe it or not, disinformation has been labeled a new breed of cybersecurity threat. As fake news becomes more prominent, we need to adopt systems that respond to its threat accordingly.

Online misinformation ecosystem

To understand the online misinformation ecosystem, we build two networks of misinformational and informational domains: a domain-level hyperlink network and a social-media-level link-sharing network. The domain-level network represents the hyperlinking relationships between domains. The social-media network represents the link-sharing behavior of social network users. We collated and curated several public databases of previously identified misinformational and informational domains. The domain-level hyperlink network is constructed by scraping all hyperlink tags (<a href="…">…</a>) from these domains.
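
For concreteness, here is a minimal sketch of this scraping step, assuming pages are fetched with requests and parsed with BeautifulSoup; the outgoing_domains helper and the seed domain are illustrative, not the exact pipeline used in this work.

```python
# Sketch: extract outgoing hyperlinks from a domain's front page and map them
# to the target domains. Library choices and helper names are assumptions.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

def outgoing_domains(domain: str) -> set[str]:
    """Return the set of distinct domains hyperlinked from `domain`'s front page."""
    url = f"https://{domain}"
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    targets = set()
    for a in soup.find_all("a", href=True):
        href = urljoin(url, a["href"])            # resolve relative links
        netloc = urlparse(href).netloc.lower().removeprefix("www.")
        if netloc and netloc != domain:           # keep only cross-domain links
            targets.add(netloc)
    return targets

# Example: outgoing hyperlink edges from one seed domain
print(outgoing_domains("example.com"))
```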

In total, we found four data sets consisting of 1,707 domains. There is, however, overlap between these datasets, which once removed yields 1,389 distinct misinformational domains. There are several limitations to using these domains directly in our analyses. Some of the databases only provide the headline of the offending news article, from which we had to perform a reverse heuristic Google search to identify the source domain. This reverse search does not always identify the offending domain; entertainment stories, for example, often lead to domains like imdb.com and people.com. To contend with these limitations, we applied a ranking to the 1,389 domains to down-rank mislabelled domains like imdb.com. Despite this ranking system, a dozen clearly non-misinformational domains remained in our data set, like theonion.com and huffingtonpost.com. These domains were manually removed, yielding a total of 1,059 domains.

Domain Scraping

We paired these 1,059 misinformational domains with 1,059 informational domains corresponding to the top-ranked Alexa domains (which we manually verified are trustworthy). After scraping all informational and misinformational domains, we constructed an unweighted, directed graph of hyperlinks in which the nodes are domains and a directed edge connects a domain to each domain it hyperlinks to. Each domain in the network created from the misinformational and informational domains is assigned a label of “misinfo”, “info”, or “none”; the “none” label is used for domains that appear in neither data set. We see a large difference: 17.90% of hyperlinks on misinfo domains link to other misinfo domains, compared to only 0.62% for info domains. Similarly, albeit a smaller effect, 4.37% of hyperlinks on misinfo domains point to info domains, compared to 13.45% for info domains.
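
A rough sketch of how such a graph and the per-label link fractions could be computed, assuming networkx and a few placeholder domains and labels standing in for the full data set:

```python
# Sketch: build the directed hyperlink graph and compute, for each source label,
# the fraction of outgoing edges that point at misinfo vs. info domains.
# `edges` and `labels` are placeholders for the scraped data described above.
import networkx as nx
from collections import Counter

edges = [("somefakesite.news", "anotherfakesite.news"),
         ("somefakesite.news", "cdc.gov"),
         ("nytimes.com", "cdc.gov")]                      # (source, target) pairs
labels = {"somefakesite.news": "misinfo", "anotherfakesite.news": "misinfo",
          "nytimes.com": "info", "cdc.gov": "info"}       # domain -> label

G = nx.DiGraph()
G.add_edges_from(edges)

def outlink_fractions(G, labels, source_label):
    """Fraction of hyperlinks from `source_label` domains going to each label."""
    counts = Counter()
    for src, dst in G.edges():
        if labels.get(src, "none") == source_label:
            counts[labels.get(dst, "none")] += 1
    total = sum(counts.values()) or 1
    return {lab: n / total for lab, n in counts.items()}

print(outlink_fractions(G, labels, "misinfo"))   # e.g. {'misinfo': 0.5, 'info': 0.5}
print(outlink_fractions(G, labels, "info"))
```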


Figure 1: Hyperlink connectivity between misinformation (red) and informational domains. The nodes colored red correspond to misinformation domains, and the remaining nodes correspond to different categories of informational domains.

Link Sharing on Social Media

We investigate the ability to identify misinformational domains by tracking the hyperlinks shared by certain social-media users. Because of the relative ease of access, we focus on Twitter’s publicly available user data. In particular, we enlist two Twitter APIs: (1) the Search Tweets API, which allows filtering tweets based on a query term matched against a tweet’s keywords, hashtags, or shared URLs. We filter tweets by matching shared URLs against our misinfo/info URL dataset, surfacing which users are sharing a particular domain; and (2) the Get Tweet Timelines API, which allows querying the tweets of the users surfaced by the Search Tweets API. In our case, we extract the domain URLs shared by the Twitter users surfaced in the previous step. Although we don’t consider them here, the data returned by both APIs contains geo-location, replied-to, time, and other attributes that could be leveraged in the future.
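
A minimal sketch of this two-step collection, here using the tweepy client (version 4+) against the standard v1.1 search and user-timeline endpoints; the credentials, query form, and helper names are assumptions for illustration, not the exact collection code:

```python
# Sketch: (1) find users who tweeted links to a given domain, then
# (2) collect the domains of all URLs those users have shared.
import tweepy
from urllib.parse import urlparse

auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def users_sharing(domain, max_tweets=100):
    """Step 1: users whose recent tweets mention/link the given domain."""
    # Standard search matches the domain string in tweet text and shared URLs.
    tweets = api.search_tweets(q=domain, count=max_tweets, tweet_mode="extended")
    return {t.user.screen_name for t in tweets}

def domains_shared_by(screen_name, max_tweets=200):
    """Step 2: domains of all URLs shared on a user's timeline."""
    shared = set()
    for t in api.user_timeline(screen_name=screen_name, count=max_tweets,
                               tweet_mode="extended"):
        for u in t.entities.get("urls", []):
            shared.add(urlparse(u["expanded_url"]).netloc.lower().removeprefix("www."))
    return shared

for user in users_sharing("somefakesite.news"):
    print(user, domains_shared_by(user))
```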


Figure 2: A magnified view of eight almost fully connected .news domains, all owned by Webseed.

We trained a logistic regression (LR) classifier using 75% of the data and then evaluated it on the remaining 25%. The LR hyperparameters were tuned to maximize the F1-score. On testing, the LR classifier, with a support of 113/176 misinfo/info domains, achieved an F1-score of 0.75/0.85, with precision of 0.78/0.84 and recall of 0.73/0.87.
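
The setup can be sketched roughly as follows, assuming scikit-learn and using random placeholder features and labels in place of the hyperlink and link-sharing features described above:

```python
# Sketch: 75/25 split and a logistic regression whose regularization strength
# is tuned for F1. X and y are placeholders, not the study's actual features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))            # placeholder per-domain features
y = rng.integers(0, 2, size=1000)          # placeholder labels: 0 = info, 1 = misinfo

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1, 10]},
                      scoring="f1")        # tune the hyperparameter for F1
search.fit(X_train, y_train)

print(classification_report(y_test, search.predict(X_test),
                            target_names=["info", "misinfo"]))
```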

Conclusion

As with most aspects of cybersecurity, technological solutions for addressing misinformation will themselves have to be multi-faceted. With some 500 hours of video uploaded to YouTube every minute, and over a billion posts to Facebook each day, the massive scale of social media makes tackling misinformation an enormous challenge. We propose that, in conjunction with complementary approaches, addressing misinformation at the domain level holds promise for disrupting large-scale misinformation campaigns.

About the authors

Sadia Afroz, PhD, is a staff scientist at Avast and a research scientist at the International Computer Science Institute (ICSI). Before joining ICSI, she was a postdoc at UC Berkeley and a PhD student at Drexel University. Her work focuses on anti-censorship, anonymity, and adversarial learning. Her work on adversarial authorship attribution received the 2013 Privacy Enhancing Technology (PET) award, the best student paper award at the 2012 Privacy Enhancing Technology Symposium (PETS), and the 2014 ACM SIGSAC dissertation award (runner-up). More about her research can be found at https://www.eecs.berkeley.edu/~sa499/


