Within the context of vera.ai, we first identified certain challenges in training automated fact-checking (AFC) systems.
To develop and train effective AFC systems, collected evidence must meet specific criteria and, crucially, must not be leaked from existing fact-checking articles. The problem of leaked evidence arises when information from previously fact-checked articles is used during training, which makes the model less effective at handling new, not-yet-fact-checked misinformation. Additionally, external information retrieved by the model must be credible, so that the model is not fed unreliable or false data. Addressing these issues is crucial for achieving realistic and practical fact-checking results.
To address these challenges, we developed the “CREDible, Unreliable, or LEaked” (CREDULE) dataset by modifying, merging, and extending previous datasets such as MultiFC, PolitiFact, PUBHEALTH, NELA-GT, Fake News Corpus, and Getting Real About Fake News. It contains 91,632 samples from 2016 to 2022, equally distributed across three classes: Credible, Unreliable, and Leaked.
The dataset includes short texts (titles) and long texts (full articles), along with metadata such as date, domain, URL, topic, and credibility scores.
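For illustration, one plausible way to represent a single CREDULE sample in code is sketched below. The field names and types are our own assumptions based on the description above, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative class labels taken from the dataset name; the actual encoding may differ.
LABELS = ("credible", "unreliable", "leaked")

@dataclass
class CreduleSample:
    """Hypothetical record mirroring the fields described above (not the real schema)."""
    title: str                           # short text (article title)
    full_text: str                       # long text (full article body)
    date: str                            # publication date, e.g. "2021-03-15"
    domain: str                          # source domain, e.g. "example-news.com"
    url: str                             # article URL
    topic: Optional[str]                 # topic category, if available
    credibility_score: Optional[float]   # domain credibility score, if available
    label: str                           # one of LABELS
```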
Having created the CREDULE dataset, we then developed the EVidence VERification Network (EVVER-Net), a neural network designed to detect leaked and unreliable evidence during the evidence retrieval process. EVVER-Net can be integrated into an AFC pipeline to ensure that only credible information is used during training. The model leverages large pre-trained transformer-based text encoders and integrates information from credibility and bias scores.
EVVER-Net demonstrated strong performance, achieving up to 89.0% accuracy without credibility scores and 94.4% with them.
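As a rough illustration of how a transformer-based text encoder can be combined with such scores, the sketch below concatenates a pre-trained encoder's text representation with a scalar credibility score before a three-way classification head. This is a minimal, hypothetical example: the encoder choice, layer sizes, and feature handling are our own assumptions and do not reflect the actual EVVER-Net implementation (bias scores could be concatenated in the same way).

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EvidenceClassifier(nn.Module):
    """Hypothetical sketch: transformer text encoding + credibility score -> 3 classes."""

    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # +1 for the scalar credibility score concatenated to the text embedding.
        self.classifier = nn.Sequential(
            nn.Linear(hidden + 1, 256),
            nn.ReLU(),
            nn.Linear(256, 3),  # credible / unreliable / leaked
        )

    def forward(self, input_ids, attention_mask, credibility):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        text_repr = out.last_hidden_state[:, 0]  # [CLS]-token representation
        features = torch.cat([text_repr, credibility.unsqueeze(-1)], dim=-1)
        return self.classifier(features)

# Minimal usage example with made-up inputs.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = EvidenceClassifier()
batch = tokenizer(["Example evidence text ..."], return_tensors="pt",
                  truncation=True, padding=True)
scores = torch.tensor([0.8])  # hypothetical domain credibility score in [0, 1]
logits = model(batch["input_ids"], batch["attention_mask"], scores)
```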
To examine the evidence included in widely used AFC datasets, we then applied EVVER-Net to LIAR-Plus, MOCHEG, FACTIFY, NewsCLIPpings, and VERITE. EVVER-Net identified up to 98.5% of items in LIAR-Plus, 95.0% in the "Refute" class of FACTIFY, and 83.6% in MOCHEG as leaked evidence. For instance, in the FACTIFY dataset, the claim "Microsoft bought Sony for $121 billion" was accompanied by evidence sourced from a previously fact-checked article, which limits a model's ability to detect and verify new claims that have not yet been fact-checked.
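In practice, such a model could act as a filter in an AFC pipeline, discarding evidence flagged as leaked or unreliable before it reaches training or verification. The snippet below is a purely illustrative sketch of that filtering step, assuming predicted labels are already available; the column names and data are hypothetical.

```python
import pandas as pd

# Hypothetical evidence table: each row is a retrieved evidence snippet with a
# label predicted by an evidence-verification model such as the sketch above.
evidence = pd.DataFrame({
    "claim_id": [1, 1, 2],
    "evidence_text": ["snippet A", "snippet B", "snippet C"],
    "predicted_label": ["credible", "leaked", "unreliable"],
})

# Keep only credible evidence before it is used for training or verification.
clean_evidence = evidence[evidence["predicted_label"] == "credible"]
print(clean_evidence)
```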
NB 1: More (technical) details about what has been briefly illustrated above can be found in the paper “Credible, Unreliable or Leaked?: Evidence verification for enhanced automated fact-checking” by Zacharias Chrysidis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos and Panagiotis C. Petrantonakis, which was presented at the 3rd ACM International Workshop on Multimedia AI against Disinformation, held at ACM ICMR.
NB 2: This is a condensed and further edited version of an article that first appeared on the MeVer team's website. You may also want to refer to the original article for more details.
Authors: Stefanos-Iordanis Papadopoulos, Olga Papadopoulou, Symeon Papadopoulos (all MeVer team at CERTH-ITI)
Editor: Jochen Spangenberg (DW)