Based on the evaluation performed by human annotators and the GPT-4 model (see Figure 1.), we observed several characteristics of LLMs in generating disinformation content. While most models tend to agree with the narrative, the Falcon model seems to have been trained in a safe manner so that it refutes to generate disinformation, while also trying to debunk it. ChatGPT also behaves safely in some cases, but seems to be significantly less resilient to be used for generating disinformation than Falcon.
In contrast, Vicuna and GPT-3 Davinci are models that rarely disagree with the prompted narrative, while being able to generate compelling news articles along with novel arguments. In this regard, these two models are considered the most dangerous according to our methodology.
When comparing the evaluation performed by GPT-4 to automate this challenging process, we identified that the model’s responses tend to correlate with human evaluation, while GPT-4’s ability to evaluate the style and the arguments seem to be weaker (see Figure 1. b). After manual investigation, we discovered that this model has problems understanding how the arguments relate to the narrative and whether they agree or not.
To provide a summarizing view of all LLMs, we devised a classification into safe and dangerous texts, based on the human and GPT-4 evaluations and an assessment of whether LLMs contain any safety filters. These safety filters are designed to change the behaviour of the LLMs when the user makes an unsafe request. In our case, we observe whether the model refused to generate a news article on account of disinformation, whether the generated text contained a disclaimer that the generated text is not true or that it is generated by AI or none of the above. The results of this classification are shown in Figure 2. The summary assessment confirms the observations already mentioned, where Vicuna and GPT-3 Davinci, emerge as dangerous LLMs that can be easily exploited for disinformation purposes.
*Dangerous texts are disinformation articles that could be misused by bad actors. Safe texts contain disclaimers, provide counterarguments, argue against the user, etc. Note that GPT-4 annotations are generally slightly biased towards safety.
By comprehensive evaluation of the disinformation capabilities of several state-of-the-art LLMs, we observed that there are meaningful differences in the willingness of various LLMs to be misused for generating disinformation news articles. Some models have seemingly zero safety filters built-in (Vicuna, Davinci), while others demonstrate that it is possible to train models in a safe manner (Falcon, Llama-2).
While this research sheds light on the capabilities of LLMs to generate disinformation, it also provides many opportunities for its extension and future work. Among them, we see a big potential in extending the analysis on constantly emerging models, to other languages or other content formats (like social media posts).
For more details, please see our research paper.
Authors: Ivan Vykopal (KInIT), Ivan Srba (KInIT)
Editor: Anna Schild (DW)