Gaining Global Insights with Multilingual Entity Linking

In our interconnected world, information travels across the globe at unprecedented speed, and disinformation spreads even faster, getting reused across geographies. To help in the battle against disinformation, Ontotext is tackling the challenge of identifying narratives and disinformation campaigns. This covers a range of tasks, from analyzing textual content in multiple languages to detecting and connecting separate pieces of related manipulative stories.

International organizations that monitor relevant content in multiple languages face similar challenges, whether they operate in an inherently multilingual field or deal with trade and business across borders and continents. The difficulty lies in harmonizing content across these linguistic landscapes, especially since not everyone has immediate access to a multilingual expert, let alone one who can process tens or even thousands of content pieces in mere minutes.


Linking entities in text to knowledge bases

A necessary step in analyzing multilingual content is linking entity mentions in text (general concepts or named entities) in different languages to a common knowledge base. The ideal knowledge base would offer broad coverage, be updated regularly, and be multilingual. According to recent publications on entity linking, Wikipedia and Wikidata are among the most popular targets. Since they meet our requirements, we have chosen them as our primary target knowledge base.

Wikidata is the biggest public knowledge graph, covering over 100 million entities. Wikidata entities are connected to Wikipedia articles where such articles exist (Wikipedia has about 7 million articles). We have experimented with different models that enable linking to Wikidata and have evaluated their performance on several datasets to select the one best suited for multilingual content.

How does multilingual entity linking work?

A multilingual entity linking (MEL) model is a natural language processing (NLP) system designed to detect, disambiguate, and link entities mentioned in text to a common knowledge base across different languages. Entities in this context can be specific objects such as people, organizations, locations, and dates, as well as general concepts, such as “global warming”.


Connecting words across different languages, however, is no easy feat. Take the example of the “World Health Organization (WHO)”. In German it is “Weltgesundheitsorganisation”, in Italian – “Organizzazione Mondiale della Sanità”, in Polish – “Światowa Organizacja Zdrowia”. MEL overcomes this hurdle by connecting all of these names to the same Wikidata entity – World Health Organization.

Downstream applications can also fetch and correlate additional information about the mentioned entities available in Wikidata, for example, a headquarters address, relations to people and other organizations, and more. Wikidata entries can be used to link not only named entity mentions in different languages, but also general concepts such as “hospital” or “prize”, thus providing common ground for insights over content in different languages.
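
As an illustration (not part of our pipeline itself), the sketch below shows how a downstream application might fetch multilingual labels and the headquarters location of the WHO entity from the public Wikidata API. Q7817 (World Health Organization) and P159 (headquarters location) are standard Wikidata identifiers; the helper function name is ours.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def fetch_entity(qid: str) -> dict:
    """Fetch labels and claims for a Wikidata entity via the public API."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|claims",
        "format": "json",
    }
    response = requests.get(WIKIDATA_API, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["entities"][qid]

# Q7817 is the Wikidata item for the World Health Organization.
who = fetch_entity("Q7817")

# Labels in the languages mentioned above all point to the same entity.
for lang in ("en", "de", "it", "pl"):
    print(lang, who["labels"][lang]["value"])

# P159 is the "headquarters location" property; its value is another QID,
# which could in turn be resolved with a further fetch_entity() call.
hq = who["claims"]["P159"][0]["mainsnak"]["datavalue"]["value"]["id"]
print("headquarters location:", hq)
```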

(Image: word cloud of the text in this post – Ontotext)

The image above shows a word cloud of the text in this post. Without technologies like MEL, such words can produce beautiful visualizations but cannot be leveraged to deliver value for the business.

The MEL model required for this task should be capable of performing end-to-end entity linking. It should take unstructured text as input and return annotations consisting of an entity reference extracted from the input text and the identifier of the corresponding concept in the target knowledge base. Unfortunately, our research showed that only a few such models or systems exist. For that reason, we also experimented with a two-step approach, where state-of-the-art multilingual entity disambiguation algorithms and systems are used as the second component of a MEL solution. The first step was performed either by a multilingual named entity recognition model or by a multilingual entity boundary detection algorithm (identifying where a named entity is mentioned in the text).
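
To make the expected input and output concrete, here is a minimal sketch of the annotation structure such a system returns; the class and field names are illustrative, not taken from any of the systems discussed below.

```python
from dataclasses import dataclass

@dataclass
class EntityAnnotation:
    """One linked entity mention produced by an end-to-end MEL system."""
    surface_form: str   # the mention as it appears in the text
    start: int          # character offset where the mention starts
    end: int            # character offset where the mention ends (exclusive)
    wikidata_id: str    # identifier of the linked concept, e.g. "Q7817"

# For the German sentence "Die Weltgesundheitsorganisation warnt vor ..."
# an end-to-end system would ideally return something like:
example = EntityAnnotation(
    surface_form="Weltgesundheitsorganisation",
    start=4,
    end=31,
    wikidata_id="Q7817",
)
```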

We experimented with different MEL systems and selected three for evaluation:

  • IXA+MGENRE – we combined a transformer-based multilingual masked language model, IXA, for the entity boundary detection step with multilingual GENRE (mGENRE) – the state-of-the-art method for entity disambiguation (see the sketch after this list)
  • MultiNERD+MGENRE – an alternative combination of the MultiNERD model for the entity boundary detection, again with mGENRE for disambiguation
  • BELA – an end-to-end MEL model, based on bi-encoder architecture
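
As a rough illustration of such a two-step setup, here is a minimal sketch that pairs a multilingual NER model with mGENRE for disambiguation. The NER model name is a placeholder for whichever boundary detector is used; the mGENRE usage follows the publicly available facebook/mgenre-wiki checkpoint, which generates a Wikipedia title plus language code for a mention wrapped in [START]/[END] markers.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# Step 1: entity boundary detection with a multilingual NER model.
# "your-multilingual-ner-model" is a placeholder for whichever boundary
# detector is used in practice.
ner = pipeline("ner", model="your-multilingual-ner-model", aggregation_strategy="simple")

# Step 2: entity disambiguation with mGENRE.
tokenizer = AutoTokenizer.from_pretrained("facebook/mgenre-wiki")
mgenre = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wiki").eval()

def link_entities(text: str) -> list[tuple[str, str]]:
    """Return (mention, prediction) pairs for each detected entity mention."""
    annotations = []
    for mention in ner(text):
        start, end = mention["start"], mention["end"]
        # mGENRE expects the mention to be wrapped in [START] ... [END] markers.
        marked = text[:start] + "[START] " + text[start:end] + " [END]" + text[end:]
        inputs = tokenizer(marked, return_tensors="pt")
        outputs = mgenre.generate(**inputs, num_beams=5, num_return_sequences=1)
        # The model generates a Wikipedia title plus language code, e.g.
        # "World Health Organization >> en"; mapping that title to a Wikidata
        # QID requires an additional title-to-QID lookup.
        prediction = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
        annotations.append((text[start:end], prediction))
    return annotations

print(link_entities("Die Weltgesundheitsorganisation hat ihren Sitz in Genf."))
```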

Comparison of different entity linking systems

To select the most suitable entity linking approach, we first conducted a thorough comparison based on criteria unrelated to the quality of the annotations generated by each system. Our analysis revealed that the choice of system depends on the specific use case. For commercial applications, BELA emerged as the only viable option, while for non-commercial purposes any of the three systems could be viable, contingent on hardware availability.

 


In evaluating the systems’ annotation quality, we employed two distinct approaches, using English benchmark datasets and the multilingual dataset MultiNERD. Since the MultiNERD NER model is trained on that same dataset, we focused this evaluation on IXA+MGENRE and BELA. Because the annotations in MultiNERD are incomplete, we prioritized recall as the key metric. The results clearly demonstrated BELA’s superior performance across all languages compared with IXA+MGENRE.
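
For reference, here is a minimal sketch of how recall can be computed for this kind of evaluation, treating an annotation as a (start, end, Wikidata QID) triple; the helper and the toy data are illustrative, not our exact evaluation code.

```python
def recall(gold: set[tuple[int, int, str]], predicted: set[tuple[int, int, str]]) -> float:
    """Share of gold (start, end, QID) annotations recovered by the system."""
    if not gold:
        return 1.0
    return len(gold & predicted) / len(gold)

# Toy example: the system recovers one of the two gold annotations.
gold = {(4, 31, "Q7817"), (40, 44, "Q123")}       # illustrative spans and QIDs
predicted = {(4, 31, "Q7817"), (40, 44, "Q456")}  # second mention linked to the wrong entity
print(recall(gold, predicted))  # 0.5
```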


Shifting our attention to the English benchmark datasets, we considered all three multilingual entity linking systems, alongside CEEL – Ontotext’s recently developed entity linking system for English. This allowed us to understand how the performance of MEL systems compares to that of English-only systems. The comparison showed that BELA outperformed IXA+MGENRE on all datasets. Meanwhile, the other three systems exhibited comparable results on the tested datasets.


The AIDA benchmark is the most directly relevant to the task of interlinking narratives across news articles – it uses the CoNLL dataset, which includes 1393 Thomson Reuters news articles with almost 35 thousand entity mentions. In contrast, the KORE 50 benchmark uses a much smaller set of 50 documents, intentionally selected to include complex disambiguation cases, while TWEEKI Gold uses a corpus of Tweets.

BELA and CEEL share a similar architecture, so it is not surprising that they demonstrate similar performance. Both models generate Wikidata identifiers at relatively high speed, which makes them applicable to a wide range of applications. Still, there are several important differences between the two models:

  • BELA is multilingual, while CEEL works only for English texts
  • BELA is limited to the roughly 7 million entities covered in Wikipedia
  • CEEL covers about 40 million people, organizations, and locations, but does not link entities of other types
  • BELA needs a GPU processor, while a CPU is sufficient for CEEL

Concepts in debunking articles

So what results does the system produce when applied to real-world data? We ran it over a collection of more than 100 thousand disinformation claims and journalistic articles debunking them. The data is strongly multilingual, with articles in over 50 languages, 30 of which have at least 1,000 articles. Over this data, the system discovered more than 5 million mentions of 280 thousand unique Wikidata concepts within the texts.

How can we use these annotations to search and analyze our large collection? We picked “NATO” as an example of a large international organization that is discussed in many languages and has very different names and acronyms in these languages.

(Chart: number of debunking articles mentioning NATO over the last two years, by language – Ontotext)

The chart above shows the number of debunking articles mentioning NATO over the last two years, broken down by language. Unsurprisingly, English has a strong lead, since about 50% of the content is in English, but there is a long tail of the concept being discussed in many European languages. Even more interesting than the language of the content is which other concepts are mentioned in connection with NATO.

The diagram above shows the result of applying a filter to get the most commonly discussed person concepts within articles mentioning NATO over the last two years. As expected, these are the important political and military figures most strongly related to the conflict in Ukraine.
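
As a rough sketch of how such filtering and aggregation can be done once documents carry MEL annotations: the document structure below is an assumption on our part rather than the exact internal representation, the QIDs in the toy data are illustrative, and we use Q7184 as NATO’s Wikidata identifier.

```python
from collections import Counter

# Assumed document shape: each annotated article records its language and the
# Wikidata QIDs of the concepts mentioned in it, with a coarse type per concept.
articles = [
    {"lang": "en", "concepts": {"Q7184": "organization", "Q111": "person"}},
    {"lang": "de", "concepts": {"Q7184": "organization", "Q222": "person"}},
    {"lang": "fr", "concepts": {"Q222": "person"}},
]

NATO = "Q7184"  # Wikidata QID used here for NATO

# Filter: keep only articles that mention NATO.
nato_articles = [a for a in articles if NATO in a["concepts"]]

# Breakdown by language (the bar chart above).
print(Counter(a["lang"] for a in nato_articles))

# Co-occurring person concepts within those articles (the diagram above).
people = Counter(
    qid
    for a in nato_articles
    for qid, ctype in a["concepts"].items()
    if ctype == "person"
)
print(people.most_common(10))
```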

This is a relatively simple analysis, but it can be built upon to gain more sophisticated insights, such as discovering documents on a very specific subject, identifying trends in co-occurring concepts, or being alerted when a known topic re-emerges or starts being discussed in a new language.

Other use cases

Within the realm of Business Intelligence, MEL can help organizations identify and swiftly analyze market trends, assess performance, and make informed decisions using multilingual data with clear connections between concepts in different languages. This capability becomes particularly important when dealing with large-scale datasets.

In the media monitoring field, MEL can be employed to enrich already extracted content to enhance discoverability by linking concepts in a variety of languages, regardless of whether the person behind the screen speaks them.

In academic research, MEL allows for deep research work, aligning and co-referencing diverse content in many languages; thanks to entity linking, the resulting data is precise, saving researchers a great deal of time.

To wrap it up

As we continue to navigate our interconnected world, MEL systems will become increasingly helpful. They establish meaningful connections between terms in different languages by linking them to a common knowledge graph, such as Wikidata. These connections can be stored in a knowledge graph, an indispensable tool for unlocking accuracy and insights.

Ontotext’s experience in applying MEL models to the needs of fact-checking professionals proves that such an approach enables the seamless integration and navigation of linguistically diverse data. It connects entities across languages, enabling organizations to identify and decipher global narratives. Through these experiments, Ontotext has gathered experience and developed the proprietary AI models needed to address fact-checking needs. This will also allow us to choose the most efficient models for other MEL solutions, depending on the type of text, the languages, and the domains that have to be addressed.

Authors: Ivelina Bozhinova, Andrey Tagarev, Eneya Georgieva (all from ONTOTEXT)

Editor: Jochen Spangenberg (DW)

This text first appeared on the ONTOTEXT website and is re-published here with kind permission of the authors.

vera.ai is co-funded by the European Commission under grant agreement ID 101070093, and the UK and Swiss authorities. This website reflects the views of the vera.ai consortium and respective contributors. The EU cannot be held responsible for any use which may be made of the information contained herein.