Datasets

Here you can find a list of datasets (including respective links and references) produced in the course of the project.

You may also want to check out our presence on Zenodo, where we also list datasets, or the Data Management Plan.

“IDMT Audio Provenance Analysis Dataset” Milica Gerhardt, Luca Cuccovillo & Patrick Aichroth

April 11, 2024

This dataset contains two distinct collections tailored for evaluating audio provenance analysis solutions within specified scenarios: Singular Composition and Multi-Source Composition. For a comprehensive understanding of these scenarios and the process behind generating the test files, please consult the referenced publication.

This dataset accompanies the respective publication. If you use it, please cite: M. Gerhardt, L. Cuccovillo and P. Aichroth, "Audio Provenance Analysis in Heterogeneous Media Sets," 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 2024, pp. 4387-4396, doi: 10.1109/CVPRW63382.2024.00442.

Show dataset

"M3Dsynth: A Dataset of Medical 3D Images with AI-Generated Local Manipulations" Giada Zingarini, Davide Cozzolino, Riccardo Corvi, Giovanni Poggi & Luisa Verdoliva

April 01, 2024

M3Dsynth is a large dataset of manipulated Computed Tomography (CT) lung images. We create manipulated images by injecting or removing lung cancer nodules in real CT scans, using three different methods based on Generative Adversarial Networks (GAN) or Diffusion Models (DM), for a total of 8,577 manipulated samples. Experiments show that these images easily fool automated diagnostic tools. We also tested several state-of-the-art forensic detectors and demonstrated that, once trained on the proposed dataset, they are able to accurately detect and localize manipulated synthetic content, even when training and test sets are not aligned, showing good generalization ability.

Show dataset

"EUvsDisinfo: a Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles (Dataset)" João Leite, Olesya Razuvayevskaya, Kalina Bontcheva & Carolina Scarton

January 15, 2024

This is the dataset and metadata accompanying the paper submission titled "EUvsDisinfo: a Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles".

Show dataset

"VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias" Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos & Panagiotis Petrantonakis

January 08, 2024

VERITE (VERification of Image-TExt pairs) is an annotated evaluation benchmark for multimodal (image-caption) misinformation detection that accounts for unimodal biases.

Show dataset

"Synthbuster: Towards Detection of Diffusion Model Generated Images" Quentin Bammey

November 02, 2023

Dataset of 9,000 AI-generated images, described in the paper “Synthbuster: Towards Detection of Diffusion Model Generated Images” (Quentin Bammey, 2023, Open Journal of Signal Processing).

Show dataset

Show related article

vera.ai talk and asset presentation at AI-on-Demand webinar

On 3 April 2025 vera.ai joined the first online webinar of the AI-on-Demand Webinar Series. Title: AI Innovation From Research to Market. We were represented by Olga Papadopoulou (CERTH), supported with input from Martin Hyben (KInIT) and Luca Cuccovillo (IDMT). Here's more, including links to available assets.

Meet us @ | Datasets | Code | April 14, 2025

Code made available - you can get your hands on it

A lot in vera.ai centers around data and code. In other words: dealing with data, analysing it, experimenting with it, and more. In October 2024 we started to bring together and make available, in one central destination on the project website, all code that a) was created in vera.ai and b) is used by vera.ai members. This includes direct links to the respective repositories, in most cases on GitHub.

Code | Datasets | October 15, 2024

ODSS: An Open Dataset of Synthetic Speech. Call to use and cooperate

A team of vera.ai researchers from the Fraunhofer Institute for Digital Media Technology (IDMT), Germany, and the Centre for Research and Technology Hellas (CERTH-ITI), Greece, has developed a new synthetic speech detection dataset. Find out why they did it and what the dataset contains.

Datasets | March 12, 2024

Overcoming Unimodal Bias in Multimodal Misinformation Detection

This post explains the basics behind a paper entitled “VERITE: a robust benchmark for multimodal misinformation detection accounting for unimodal bias”, published in the International Journal of Multimedia Information Retrieval (IJMIR). It has been authored by researchers of the mever group at project partner CERTH-ITI.

Datasets | January 19, 2024

The persistence and resilience of misinformation in the face of fact checking

A project by vera.ai, entitled “The Persistence and Resilience of Misinformation in the Face of Fact Checking,” was facilitated by Fabio Giglietto, Massimo Terenzi (UNIURB) and Richard Rogers (UvA) at the Digital Methods Initiative Winter School at the University of Amsterdam. We are pleased to share the results with you.

Datasets | Meet us @ | January 17, 2024

Project partner KInIT at the EMNLP conference in Singapore represented with three papers

The EMNLP (Empirical Methods in Natural Language Processing) conference is one of the top NLP conferences. KInIT researchers Róbert Móro and Ján Čegiň presented three full papers at this prestigious event. Here’s more about what they did, and the related works.

Meet us @ | Datasets | January 16, 2024

Unlocking Insights: paper "Cracking Open the European Newsfeed" is finally out in JQD:DM

A new paper that includes recent findings of disinformation analysis has been published in the Journal of Quantitative Description: Digital Media (JQD:DM). It is entitled “Cracking Open the European Newsfeed” and was co-authored by vera.ai team members based at the University of Urbino “Carlo Bo”, Italy. The paper contributes to the ongoing effort to describe and quantify the quality of information shared on large social media platforms.

Further Material | Datasets | December 21, 2023

Synthbuster: Towards Detection of Diffusion Model Generated Images

Dataset of 9,000 AI-generated images described in the paper "Synthbuster: Towards Detection of Diffusion Model Generated Images" (Quentin Bammey, 2023, Open Journal of Signal Processing).

Datasets | November 02, 2023

OpenAI Models for Topic Modelling in Social Media Analysis

Here's a video recording of a talk by Fabio Giglietto of the University of Urbino, given in June 2023 at the Queensland University of Technology (QUT). Topic: using OpenAI models to identify the most salient topics circulated via Facebook links in the run-up to the Italian general elections.

Meet us @ | Demos and Trainings | Datasets | July 07, 2023

Mapping the ‘memory loss’ of disinformation in fact-checks

In this piece we present findings from a study on current fact-checking archiving practices and Facebook post removals using the “War in Ukraine” dataset of the European Digital Media Observatory (EDMO). Insights presented here come from a project that was carried out as part of the Digital Methods Initiative Winter School and Data Sprint 2023.

Demos and Trainings | Datasets | Meet us @ | January 26, 2023

vera.ai is co-funded by the European Commission under grant agreement ID 101070093, and the UK and Swiss authorities. This website reflects the views of the vera.ai consortium and respective contributors. The EU cannot be held responsible for any use which may be made of the information contained herein.
