Do You Think It’s Biased? How to Measure the Perception of Media Bias
19 September 2024 · 11 min read
BABE—the dataset of Bias Annotations By Experts—is one of the largest media bias datasets with gold-standard expert annotations at the word and sentence levels.
Takeaways from this post:
- We release BABE, one of the largest media bias datasets fully annotated by trained media bias experts, providing word- and sentence-level bias labels.
- BABE demonstrates higher quality, in both consistency and authenticity, than previous datasets, as our quantitative analyses show.
- We incorporate a distant supervision approach using weak labels when developing media bias detection models, which leads to improved performance.
- Lists of relevant research works using BABE are provided at the end of this post.
If you’re into Computer Science or know a bit about it, you’ve likely come across the saying, “Garbage in, garbage out (GIGO),” or, in a more posh British tone, “Rubbish in, rubbish out (RIRO).” These phrases highlight a simple truth: flawed, biased, or poor-quality input inevitably leads to equally poor output.
Indeed, in the world of machine learning, this notion isn’t just fundamental; it’s pivotal. The quality of the dataset used plays a crucial role in ensuring the trained model’s authenticity (how credible and trustworthy it is) and reliability (how consistent it is). Within the domain of media bias research, the attention paid to dataset development should be even more prominent.
Because, well, as professional and ethical media bias researchers, we definitely don’t want to create a fabulous model that’s detrimentally biased.
But developing an authentic media bias dataset and collecting consistently reliable annotations is no walk in the park. The challenge stems from addressing the complex nature and diverse embodiments of media bias.
While media bias can be defined as slanted news coverage or internal bias in wording, in practice, it often takes various forms. These include the framing effect, bias by omission, and bias by word choice on the side of news producers, as well as perceptual bias on the side of consumers.
Moreover, these bias factors are often interwoven with one another, e.g., bias by word choice on the producer’s side combined with perceptual bias on the consumer’s side. This often leads to inconsistent interpretations of media bias among annotators who lack sufficient training and experience. It also casts doubt on the credibility of media bias annotations made by laypeople.
So, would it be better to recruit media bias experts to do the annotations? –Yes.
But won’t that be time-consuming and costly? –Yes, but someone’s got to do it.
Among the many forms of bias inducement in news articles, this time our focus is on identifying bias by word choice.
Namely, when there are multiple words referring to the same concept, does the writer choose to use the neutral ones or the biased ones?
We developed a dataset of 3,700 news sentence annotations made by media bias experts at the word and sentence levels. This dataset addresses the lack of a gold-standard dataset in this domain and can facilitate the automatic identification of bias in news articles.
Ladies, gentlemen, and everyone in-between or outside, we are thrilled to present:
Our Bias Annotations By Experts (BABE)!
- BABE stands as one of the largest datasets in the domain of media bias research, comprising 3,700 sentences that are balanced across news topics and news outlets.
- It is fully annotated by trained media bias experts, featuring labels on both the word and sentence levels for bias.
- This dataset offers superior annotation quality and higher inter-annotator agreement than existing works.
Furthermore, we introduce a method to automatically detect bias-inducing sentences in news articles using BABE. We compare five state-of-the-art (SOTA) neural models and subsequently pre-train the top two performers on a larger news corpus using a distant supervision approach.
Our final, best-performing model is RoBERTa (a robustly optimized variant of BERT) additionally pre-trained with the distant supervision approach. After fine-tuning and evaluating this model on BABE, it surpasses previous methods with a macro F1 score of 0.804.
If the term ‘fine-tuning’ baffles you, don’t worry 🙂 We will explain the training process in simple words in the section ‘Training with BABE — Developing Media Bias Detection Models.’
In the following sections, we will look into the creation of BABE, including the selection of our news sources and content, our expert training and annotation process, and most importantly, how we evaluate and demonstrate that BABE offers better annotation quality than existing works. We will also explain how we train the SOTA models using BABE and set new records, eclipsing previous efforts.
I hope you’re not wearing your favorite shirt, because we’re about to spill the tea about our BABE!
MBIC: the Precursor of BABE (click to view original paper)
Before diving into the details of BABE, it’s crucial to acknowledge its precursor, MBIC.
MBIC stands as another significant dataset in media bias annotation, notable for including annotator characteristics. Unlike BABE, which contains expert annotations exclusively, MBIC gathers its annotators via crowdsourcing. MBIC offers a balanced content selection with annotations at both the word and sentence levels. It is also among the largest datasets in the field, containing 1,700 annotated sentences.
Building on the collected news sentences from MBIC, BABE enhances the dataset in two aspects:
- Expert annotations: recruiting trained experts for all annotations, ensuring higher quality and consistency.
- Larger corpus size: expanding the corpus considerably with an additional 2,000 news sentences, broadening the dataset’s scope and utility.
Data Collection: Handpicking News Sentences for Annotation
The process for selecting news sentences for annotation in BABE mirrors that used in MBIC.
Our research focuses on the US media landscape, which has become increasingly polarized over recent years. We extract sentences from news articles covering 12 predefined controversial topics and then manually inspect each sentence. The articles are published across 14 US news platforms from January 2017 to June 2020.
Here’s how we handpicked each news sentence:
- Defining keywords and retrieving news articles: We define keywords describing every topic in one word or a short phrase, specify the news outlets and the time frame, and retrieve all available links to the relevant articles (a small configuration sketch follows this list).
- Manual inspection of the retrieved text: We then extract sentences by manually inspecting the retrieved articles. The sentence selection is based on our media bias annotation guidelines, which comprise diverse examples of biased and neutral text instances.
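To make the retrieval step above more concrete, here is a minimal, hypothetical sketch: the topic names, keywords, outlets, and helper function are illustrative assumptions and do not reproduce the actual configuration used for BABE.

```python
# Illustrative sketch of the retrieval step; topics, keywords, and outlets
# below are placeholders, not the actual lists used for BABE.
from datetime import date

TOPICS = {
    "topic-a": ["keyword a"],                # one word or short phrase per topic
    "topic-b": ["keyword b", "keyword b2"],
}
OUTLETS = {"outlet-a.com", "outlet-b.com"}   # hypothetical US news platforms
TIME_FRAME = (date(2017, 1, 1), date(2020, 6, 30))  # January 2017 to June 2020

def in_scope(article: dict) -> bool:
    """Keep only articles from the chosen outlets, time frame, and topics."""
    in_window = TIME_FRAME[0] <= article["published"] <= TIME_FRAME[1]
    from_outlet = article["outlet"] in OUTLETS
    matches_topic = any(kw in article["text"].lower()
                        for kws in TOPICS.values() for kw in kws)
    return in_window and from_outlet and matches_topic
```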
Finding Our Media Bias Experts
Given that identifying lexical and sentential bias is a nuanced linguistic task and that cognitive and language abilities likely impact annotators’ perceptions, we sought annotators with the following qualifications:
- Satisfactory English proficiency: master’s students enrolled in programs taught in English whose grades rank in the top 20%.
- Relevant academic background: a background in Data Science, Computer Science, Psychology, or Intercultural Communication.
- Sufficient practical experience: at least six months of experience in the media bias domain.
Additionally, we require the recruited annotators to undergo comprehensive training to:
- reliably identify biased wording,
- differentiate between bias and merely polarizing language, and
- adopt a politically neutral stance during annotation.
Data Annotation
Now that we have carefully selected the best tea leaves and the finest tea baristas—namely, the news sentences to be annotated and the annotators—it’s time to make the tea!
Our goal is to collect three types of labels for each sentence:
- Word bias labels: Initially, all annotators are asked to mark words or phrases that induce bias.
- A sentence bias label: Subsequently, we ask them to indicate whether the whole sentence is biased or non-biased.
- A sentence opinion label: Lastly, the annotators label the sentence as opinionated, factual, or mixed (a sketch of such an annotated record follows this list).
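As a concrete illustration, a single expert’s labels for one sentence could be represented like the record below; the field names and the example sentence are ours, not BABE’s actual schema.

```python
# Illustrative record for one sentence and one annotator; field names are
# assumptions, not BABE's actual column names.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SentenceAnnotation:
    sentence: str
    biased_words: List[str] = field(default_factory=list)  # word bias labels
    sentence_bias: str = "non-biased"   # "biased" or "non-biased"
    opinion: str = "factual"            # "opinionated", "factual", or "mixed"

example = SentenceAnnotation(
    sentence="The senator's reckless plan drew criticism.",
    biased_words=["reckless"],
    sentence_bias="biased",
    opinion="mixed",
)
```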
To find the ideal tradeoff between the number of sentences and annotators per sentence, and to facilitate subsequent evaluation of annotation quality, we organize BABE into two subgroups (SG):
- SG1: 1,700 sentences (matching MBIC) annotated by eight experts each.
- SG2: 3,700 sentences annotated by five experts each.
For SG1, we ask eight annotators to annotate the 1,700 sentences that are the same as in MBIC. We thereby obtain an expert-labeled ground truth comparable to MBIC’s crowdsourcing results. For SG2, we ask five of the previous eight annotators to label the additional 2,000 sentences to expand the corpus size.
Annotation Evaluation
After gathering the raw labels from our expert annotators, we need to process them, examine how the annotators’ judgements agree with one another, and then determine whether a sentence (or word) should be eventually labeled as biased.
We determine the final labeling of the sentences by a majority vote:
On the sentence level, a sentence is ultimately marked as ‘biased’ if at least half of the annotators label it as such; otherwise, it is marked as ‘non-biased.’ When the annotators do not reach a majority vote on a sentence, we assign the label ‘no agreement.’
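Here is a minimal sketch of that aggregation rule; treating an exact tie as ‘no agreement’ is our reading of the rule, and the function is not code from the BABE release.

```python
# Minimal sketch of the sentence-level majority vote; tie handling is an
# assumption on our part.
from typing import List

def aggregate_sentence_label(votes: List[str]) -> str:
    """votes: one 'biased' or 'non-biased' label per annotator."""
    biased = votes.count("biased")
    non_biased = len(votes) - biased
    if biased > non_biased:
        return "biased"
    if non_biased > biased:
        return "non-biased"
    return "no agreement"   # exact tie: no majority among annotators

print(aggregate_sentence_label(["biased", "biased", "non-biased"]))  # biased
```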
We compute Krippendorff’s agreement scores for the two types of sentence labels (bias labels and opinion labels) in SG1 and SG2.
Tip: If the name Krippendorff sounds new to you, just know that higher scores indicate more consistent annotations and greater agreement among annotators, which is a good thing.
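For readers who want to compute such agreement themselves, here is a sketch using the third-party `krippendorff` Python package (installable via pip); the toy ratings are made up and only illustrate the input format.

```python
# Sketch of computing Krippendorff's alpha for sentence bias labels with the
# `krippendorff` package; the ratings matrix below is a toy example.
import numpy as np
import krippendorff

# Rows = annotators, columns = sentences; 1 = "biased", 0 = "non-biased",
# np.nan = this annotator did not label that sentence.
ratings = np.array([
    [1, 0, 1, 1, np.nan],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 0, 1],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```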
We then compare the scores to acquire knowledge about data quality and demonstrate that BABE offers better annotation quality than existing works.
So which offers higher quality, crowdsourcing or expert annotations?
To compare the annotation quality of MBIC’s crowdsourcing approach with that of our expert-based approach, we use SG1’s annotations.
The final agreement scores for bias labels and opinion labels indicate:
- A higher consistency in the expert approach compared to the crowdsourcing one.
- Expert annotators are more conservative in their annotations than crowdsourcers.*
*The second finding comes from the observation that expert annotators assign fewer ‘biased’ labels to both words and sentences than crowdsourcers do.
The resulting labels in BABE are of higher quality and capture media bias better than labels gathered via crowdsourcing, suggesting an enhancement in authenticity and the importance of having media bias experts perform the annotations. BABE also has a higher inter-annotator agreement score than existing work, suggesting an enhancement in reliability.
Training with BABE – Developing Media Bias Detection Models
Biased or not biased; that is the question.
Automatic media bias detection is essentially a classification task for neural models. Additionally, research has shown that pre-training on larger, distantly labeled datasets followed by fine-tuning on supervised data yields improved performance for sentiment classification.
In our work, we also introduce an additional pre-training task employing the idea of the distant supervision approach to enhance models’ capabilities in recognizing media bias content.
We first fine-tune and evaluate five neural models — BERT, DistilBERT, RoBERTa, ELECTRA, and XLNet — on BABE. We then take the two best-performing models from the first run, BERT and RoBERTa, and add the distant supervision pre-training task.
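As a rough illustration of this first run, the sketch below fine-tunes and evaluates standard Hugging Face checkpoints on BABE-style sentence labels. The checkpoint names, hyperparameters, and the assumption that the data arrives as `datasets` splits with `text` and `label` columns are ours, not the authors’ exact setup.

```python
# Hedged sketch of fine-tuning and comparing several checkpoints on sentence
# bias labels; hyperparameters and checkpoint choices are illustrative.
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def macro_f1(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

def finetune_and_score(checkpoint, train_ds, eval_ds):
    """train_ds/eval_ds: `datasets` splits with 'text' and 'label' columns."""
    tok = AutoTokenizer.from_pretrained(checkpoint)
    encode = lambda batch: tok(batch["text"], truncation=True,
                               padding="max_length", max_length=128)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"out/{checkpoint.replace('/', '_')}",
                               num_train_epochs=3, per_device_train_batch_size=16),
        train_dataset=train_ds.map(encode, batched=True),
        eval_dataset=eval_ds.map(encode, batched=True),
        compute_metrics=macro_f1,
    )
    trainer.train()
    return trainer.evaluate()["eval_macro_f1"]

# The five model families mentioned above, as standard Hub checkpoints:
CHECKPOINTS = ["bert-base-uncased", "distilbert-base-uncased", "roberta-base",
               "google/electra-base-discriminator", "xlnet-base-cased"]
# scores = {c: finetune_and_score(c, train_split, eval_split) for c in CHECKPOINTS}
```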
Distant Supervision Approach
To learn bias-specific word embeddings—the internal numerical representations of words that the model works with, not something meant for human interpretation—we compile a news corpus for pre-training. The corpus consists of news headlines from outlets both with and without a partisan leaning.
Our purpose is to map bias and neutral labels to sequences automatically, thus alleviating the burden of collecting expensive manual annotations.
In this approach, distant or weak labels are therefore derived from noisy sources: we assume that biased words occur more densely in news sources with a partisan leaning than in others.
Since the news headline corpus serves to learn more effective language representations, it is not suitable for evaluation purposes due to its noisy nature. We ensure that no overlap exists between the distant corpus and BABE to guarantee model integrity with respect to training and testing.
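The core of the weak-labeling idea can be sketched in a few lines; the outlet-to-leaning mapping below is purely illustrative and is not the outlet list actually used for the pre-training corpus.

```python
# Sketch of distant (weak) labeling: a headline inherits its label from the
# outlet's leaning; the mapping here is a made-up placeholder.
OUTLET_LEANING = {
    "partisan-left-outlet.com": "partisan",
    "partisan-right-outlet.com": "partisan",
    "wire-service.com": "non-partisan",
}

def weak_label(outlet: str) -> int:
    """1 = weakly 'biased' (partisan outlet), 0 = weakly 'neutral'."""
    return 1 if OUTLET_LEANING.get(outlet) == "partisan" else 0

print(weak_label("wire-service.com"))  # 0
```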
The data collection here resembles the collection of the ground-truth data described in the Data Collection section above. Similarly, we extract news about controversial topics using pre-defined keywords, as we assume slanted reporting to be more likely for those topics than for less controversial ones.
If you feel lost due to the technical details above, here’s a simplified way to understand this:
The journey of training a language model from scratch to become specialized at addressing a specific task is just like that of us humans—we start as newbies in a field, learn the general yet foundational knowledge, and then continue to deepen our knowledge until the day we can be confident in our specialties in this field.
For language models, the first phase of learning a general representation of a subject (say, a language like English) is called pre-training. During this phase, language models are exposed to a large amount of unlabeled text data to capture the underlying language representations.
The second phase aims at refining and deepening the models’ knowledge for a specific task, which is called fine-tuning. Here, models are trained on a smaller, more specific set of data that is related to the particular task they need to perform.
Therefore, we can picture the overall training process of the models, as they acquire language abilities and domain-specific knowledge in media bias in this way —
Stage 1. Acquiring general language ability (pre-training):
We select and directly use language models like BERT that have already been pre-trained on large unlabeled corpora. Thus, they are equipped with general language ability at this stage; for instance, BERT can ‘speak’ English after this stage.
Stage 2. Acquiring general domain knowledge (pre-training):
This is where we implement the distant supervision approach. By pre-training the language models from Stage 1 again on a more domain-specific (even though somewhat noisy) news corpus, the models acquire more domain-specific knowledge. Namely, BERT is now equipped with some general knowledge about news bias and biased words.
Stage 3. Acquiring more accurate domain knowledge (fine-tuning):
Fine-tuning and evaluating the language models from Stage 2 on BABE allows the final models to obtain a more accurate understanding of media bias at the word and sentence levels.
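To make the hand-off between the stages explicit, here is a compact, hedged sketch: Stage 1 amounts to downloading an already pre-trained checkpoint, Stage 2 continues training on the weakly labeled headline corpus, and Stage 3 fine-tunes the resulting weights on BABE. Paths, hyperparameters, and the dataset arguments (assumed to be already tokenized and labeled) are illustrative.

```python
# Hedged sketch of chaining Stages 1-3; all names and settings are illustrative.
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

def run_stages(headline_ds, babe_train_ds, babe_eval_ds,
               checkpoint="roberta-base"):
    """All dataset arguments are assumed to be tokenized and carry labels."""
    # Stage 1: general language ability comes from the published checkpoint.
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # Stage 2: continue pre-training on the weakly labeled headline corpus.
    Trainer(model=model,
            args=TrainingArguments(output_dir="stage2", num_train_epochs=1),
            train_dataset=headline_ds).train()
    model.save_pretrained("stage2_model")

    # Stage 3: fine-tune the Stage-2 weights on BABE's expert labels.
    model = AutoModelForSequenceClassification.from_pretrained("stage2_model")
    trainer = Trainer(model=model,
                      args=TrainingArguments(output_dir="stage3", num_train_epochs=3),
                      train_dataset=babe_train_ds,
                      eval_dataset=babe_eval_ds)
    trainer.train()
    return trainer
```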
Baseline Method
Every newly developed model needs a comparable baseline to highlight its improvements. Our baseline method comes from one of our previous works by Spinde (2021).
It is a traditional feature-engineering model using syntactic and lexical features related to biased words, such as dictionaries of opinion words, hedges, and assertive and factive verbs. Since feature-based models operate on the word level, we ensure comparability by applying the classification rule that the presence of any predicted biased word leads to the whole sentence being labeled as biased.
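The sentence-level rule for the baseline can be stated in one line of code; `predict_biased_words` below stands in for the feature-based word classifier and is not part of the original baseline implementation.

```python
# Sketch of the comparability rule: any predicted biased word makes the whole
# sentence count as biased. `predict_biased_words` is a placeholder.
from typing import Callable, List

def sentence_is_biased(sentence: str,
                       predict_biased_words: Callable[[str], List[str]]) -> bool:
    return len(predict_biased_words(sentence)) > 0
```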
Results
So how do the models trained on BABE perform? Here’s a brief summary of the exciting findings we get:
- The distant supervision pre-training task improves both BERT and RoBERTa, and RoBERTa is the best-performing model among the five SOTA models we selected.
- Models trained and evaluated on a larger corpus (i.e., SG2) generally perform better, which we believe indicates that extending the dataset in the future will be valuable.
- Overall, media bias can be better captured when word embedding algorithms are pre-trained on the news headline corpus with distant supervision based on varying news outlets.
Further Application: the active BABE
Fortunately, the journey of BABE doesn’t end here. On one hand, we continue to improve the dataset and the models to this day. We have also built a main site for BABE on Hugging Face, where we keep posting its latest updates and other relevant works.
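If you want to experiment with BABE yourself, it can likely be loaded with the Hugging Face `datasets` library as sketched below; the repository identifier is our assumption, so check the BABE page on Hugging Face for the exact name and current splits.

```python
# Hedged sketch of loading BABE; the dataset identifier is an assumption and
# may differ from the actual repository name on the Hugging Face Hub.
from datasets import load_dataset

babe = load_dataset("mediabiasgroup/BABE")
print(babe)
```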
On the other hand, BABE has been actively used in many other research projects to this day. Below, we provide shortcuts to some of these studies that incorporate BABE.
Research within Media Bias Group
Research outside Media Bias Group