In the corresponding paper, the KInIT research team thoroughly investigated capabilities of large language models (LLMs) to automatically generate disinformation. This paper has been recently published and presented at the premiere (A*) ACL 2024 conference that took place in Bangkok, Thailand during August 11–16, 2024. In the following summary, we describe novel findings our research revealed.
The emergence of large language models (LLMs) has heightened concerns about automatic generation and spread of disinformation. The threat of misusing LLMs to generate disinformation is one of the commonly cited risks of their future development (Goldstein et al., 2023). Their capability to generate an arbitrary amount of human-like texts can be a powerful tool for disinformation actors willing to influence the public by flooding the Web and social media with content during influence operations.
So far, very little is known about the seriousness of this risk and the disinformation capabilities of the current generation of LLMs (Buchanan et al., 2021). To address this, our research has focused on a comprehensive analysis of several LLMs and their ability to generate disinformation news articles in English language. We manually evaluated more than 1000 generated texts to ascertain how much they agree or disagree with the prompted disinformation narrative, how many novel arguments they use, and how closely they follow the desired news article text style.
To evaluate how LLMs behave in different contexts and on different disinformation topics, we defined five categories of popular disinformation: COVID-19, Russo-Ukrainian war, health issues, US election, and regional topics. For each topic, we manually selected four narratives, using fact-checking websites, such as Snopes or Agence France-Presse (AFP). Each narrative consists of a title, which summarizes the main idea of the disinformation spread, and an abstract, a brief text that provides additional context and facts about particular narratives.
Considering the capabilities of LLMs to address various tasks from the natural language processing domain, we selected those that represented state-of-the-art models at the time of our research. Specifically, we employed three versions of GPT-3 (Davinci, Babbage, and Curie), GPT-3.5 (ChatGPT), OPT-IML-Max, Falcon, Vicuna, GPT-4, Llama-2, and Mistral. These models were selected from two groups. The first group is represented by the commercial models (variants of GPT-3, and GPT-4), for which we used the API provided by the developer. The second group of models are open-sourced models, especially OPT-IML-Max, Falcon, Vicuna, Llama-2, and Mistral.
As we focused on the most capable open-source LLMs, which have several billion parameters, it was necessary to use strong computational resources with a sufficient number of graphics cards and memory to exploit them. Therefore, for experiments with open-source LLMs, we used the resources provided by the Slovak National Computing Centre for HPC (NCC HPC).
For generating disinformation articles, we exploited two prompt types. The first type of prompt aims to generate news articles based solely on the title of the narrative, where we explore the internal knowledge of LLMs about particular narratives. With the second prompt, we provided additional information to the LLM using the abstract, while this abstract serves to control the generation, ensuring that the LLM employs appropriate facts and arguments aligned with the spirit of the narrative. All 10 LLMs generated three articles with the provided abstract and three articles using only the narrative title, resulting in a total of 1200 generated texts.
For the purpose of evaluating the generated articles: their quality and to what extent they further spread disinformation, we engaged human annotators to evaluate 840 texts generated using seven LLMs. Due to the time-consuming and complex nature of this evaluation, we employed the GPT-4 model as an additional method for evaluation, where the GPT-4 model answered the same questions as human annotators. In this way, we evaluate the texts generated by additional three LLMs.
Evaluation questions thoroughly examined the style and content of the generated texts. When evaluating style as part of the quality of the generated texts, we mainly focused on whether the generated texts are coherent, in natural language (Q1), and whether the style of the text is like news articles (Q2). We evaluated the content of the generated texts by means of 4 additional questions - whether they agree (Q3) or disagree (Q4) with the narrative and how many arguments for (Q5) and against (Q6) the narrative were generated.
Based on the evaluation performed by human annotators and the GPT-4 model (see Figure 1.), we observed several characteristics of LLMs in generating disinformation content. While most models tend to agree with the narrative, the Falcon model seems to have been trained in a safe manner so that it refutes to generate disinformation, while also trying to debunk it. ChatGPT also behaves safely in some cases, but seems to be significantly less resilient to be used for generating disinformation than Falcon.
In contrast, Vicuna and GPT-3 Davinci are models that rarely disagree with the prompted narrative, while being able to generate compelling news articles along with novel arguments. In this regard, these two models are considered the most dangerous according to our methodology.
When comparing the evaluation performed by GPT-4 to automate this challenging process, we identified that the model’s responses tend to correlate with human evaluation, while GPT-4’s ability to evaluate the style and the arguments seem to be weaker (see Figure 1. b). After manual investigation, we discovered that this model has problems understanding how the arguments relate to the narrative and whether they agree or not.
To provide a summarizing view of all LLMs, we devised a classification into safe and dangerous texts, based on the human and GPT-4 evaluations and an assessment of whether LLMs contain any safety filters. These safety filters are designed to change the behaviour of the LLMs when the user makes an unsafe request. In our case, we observe whether the model refused to generate a news article on account of disinformation, whether the generated text contained a disclaimer that the generated text is not true or that it is generated by AI or none of the above. The results of this classification are shown in Figure 2. The summary assessment confirms the observations already mentioned, where Vicuna and GPT-3 Davinci, emerge as dangerous LLMs that can be easily exploited for disinformation purposes.
*Dangerous texts are disinformation articles that could be misused by bad actors. Safe texts contain disclaimers, provide counterarguments, argue against the user, etc. Note that GPT-4 annotations are generally slightly biased towards safety.
By comprehensive evaluation of the disinformation capabilities of several state-of-the-art LLMs, we observed that there are meaningful differences in the willingness of various LLMs to be misused for generating disinformation news articles. Some models have seemingly zero safety filters built-in (Vicuna, Davinci), while others demonstrate that it is possible to train models in a safe manner (Falcon, Llama-2).
While this research sheds light on the capabilities of LLMs to generate disinformation, it also provides many opportunities for its extension and future work. Among them, we see a big potential in extending the analysis on constantly emerging models, to other languages or other content formats (like social media posts).
For more details, please see our research paper.
Authors: Ivan Vykopal (KInIT), Ivan Srba (KInIT)
Editor: Anna Schild (DW)