Do You Think It’s Biased? How to Measure the Perception of Media Bias
19 September 2024 · 11 min read
BABE—the dataset of Bias Annotations By Experts—is one of the largest media bias datasets with gold-standard expert annotations at the word and sentence levels.
Takeaways from this post:
- We release BABE, one of the largest media bias datasets fully annotated by trained media bias experts, providing word- and sentence-level bias labels.
- BABE demonstrates higher quality, in both consistency and authenticity, than previous datasets, as our quantitative analyses show.
- We incorporate a distant supervision approach using weak labels when developing media bias detection models, which leads to improved performance.
- Lists of relevant research works using BABE are provided at the end of this post.
If you’re into Computer Science or know a bit about it, you’ve likely come across the saying, “Garbage in, garbage out (GIGO),” or, in a more posh British tone, “Rubbish in, rubbish out (RIRO).” These phrases highlight a simple truth: flawed, biased, or poor-quality input inevitably leads to equally poor output.
Indeed, in the world of machine learning, this notion isn’t just fundamental; it’s pivotal. The quality of the dataset used plays a crucial role in ensuring the trained model’s authenticity (how credible and trustworthy it is) and reliability (how consistent it is). Within the domain of media bias research, the attention paid to dataset development should be even more prominent.
Because, well, as professional and ethical media bias researchers, we definitely don’t want to create a fabulous model that’s detrimentally biased.
But developing an authentic media bias dataset and collecting consistently reliable annotations is no walk in the park. The challenge stems from addressing the complex nature and diverse embodiments of media bias.
While media bias can be defined as slanted news coverage or internal bias in wording, in practice, it often takes various forms. These include the framing effect, bias by omission, and bias by word choice on the side of news producers, as well as perceptual bias on the side of consumers.
Moreover, these bias factors are often interwoven with one another, e.g., bias by word choice on the producer’s side combined with perceptual bias on the consumer’s side. This often leads to inconsistent interpretations of media bias among annotators who lack sufficient training and experience. It also casts doubt on the credibility of media bias annotations made by laypeople.
So, would it be better to recruit media bias experts to do the annotations? –Yes.
But won’t that be time-consuming and costly? –Yes, but someone’s got to do it.
Among the many forms of bias inducement in news articles, this time our focus is on identifying bias by word choice.
Namely, when there are multiple words referring to the same concept, does the writer choose to use the neutral ones or the biased ones?
We developed a dataset of 3,700 news sentence annotations made by media bias experts at the word and sentence levels. This dataset addresses the lack of a gold-standard dataset in this domain and can facilitate the automatic identification of bias in news articles.
Ladies, gentlemen, and everyone in-between or outside, we are thrilled to present:
Our Bias Annotations By Experts (BABE)!
- BABE stands as one of the largest datasets in the domain of media bias research, comprising 3,700 sentences that are balanced across news topics and news outlets.
- It is fully annotated by trained media bias experts, featuring labels on both the word and sentence levels for bias.
- This dataset offers superior annotation quality and higher inter-annotator agreement than existing works.
Furthermore, we introduce a method to automatically detect bias-inducing sentences in news articles using BABE. We compare five state-of-the-art (SOTA) neural models and subsequently pre-train the top two performers on a larger news corpus using a distant supervision approach.
Our final, best-performing model is RoBERTa (a robustly optimized variant of BERT) additionally pre-trained with the distant supervision approach. After fine-tuning and evaluating this model on BABE, it surpasses previous methods with a macro F1 score of 0.804.
If the term ‘fine-tuning’ baffles you, don’t worry 🙂 We will explain the training process in simple words in the section ‘Training with BABE — Developing Media Bias Detection Models.’
In the following sections, we will look into the creation of BABE, including the selection of our news sources and content, our expert training and annotation process, and most importantly, how we evaluate and demonstrate that BABE offers better annotation quality than existing works. We will also explain how we train the SOTA models using BABE and set new records, eclipsing previous efforts.
I hope you’re not wearing your favorite shirt, because we’re about to spill the tea about our BABE!
MBIC: the Precursor of BABE (click to view original paper)
Before diving into the details of BABE, it’s crucial to acknowledge its precursor, MBIC.
MBIC stands as another significant dataset in media bias annotation, notable for including annotator characteristics. Unlike BABE, which contains expert annotations exclusively, MBIC gathers its annotators via crowdsourcing. MBIC offers a balanced content selection with annotations at both the word and sentence levels. It is also among the largest datasets in the field, containing 1,700 annotated sentences.
Building on the collected news sentences from MBIC, BABE enhances the dataset in two aspects:
- Expert annotations: recruiting trained experts for all annotations, ensuring higher quality and consistency.
- Larger corpus size: expanding the corpus considerably with an additional 2,000 news sentences, broadening the dataset’s scope and utility.
Data Collection: Handpicking News Sentences for Annotation
The process for selecting news sentences for annotation in BABE mirrors that used in MBIC.
Our research focuses on the US media landscape, which has become increasingly polarized over recent years. We extract sentences from news articles covering 12 predefined controversial topics and then manually inspect each sentence. The articles are published across 14 US news platforms from January 2017 to June 2020.
Here’s how we handpicked each news sentence:
- Defining keywords and retrieving news articles: We define keywords describing every topic in one word or a short phrase, specify the news outlets and the time frame, and retrieve all available links to the relevant articles (a small configuration sketch follows this list).
- Manual inspection of the retrieved text: We then extract sentences by manually inspecting the retrieved articles. The sentence selection is based on our media bias annotation guidelines, which comprise diverse examples of biased and neutral text instances.
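To make the retrieval step above more concrete, here is a minimal, hypothetical sketch: the topic names, keywords, outlets, and helper function are illustrative assumptions and do not reproduce the actual configuration used for BABE.

```python
# Illustrative sketch of the retrieval step; topics, keywords, and outlets
# below are placeholders, not the actual lists used for BABE.
from datetime import date

TOPICS = {
    "topic-a": ["keyword a"],                # one word or short phrase per topic
    "topic-b": ["keyword b", "keyword b2"],
}
OUTLETS = {"outlet-a.com", "outlet-b.com"}   # hypothetical US news platforms
TIME_FRAME = (date(2017, 1, 1), date(2020, 6, 30))  # January 2017 to June 2020

def in_scope(article: dict) -> bool:
    """Keep only articles from the chosen outlets, time frame, and topics."""
    in_window = TIME_FRAME[0] <= article["published"] <= TIME_FRAME[1]
    from_outlet = article["outlet"] in OUTLETS
    matches_topic = any(kw in article["text"].lower()
                        for kws in TOPICS.values() for kw in kws)
    return in_window and from_outlet and matches_topic
```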
Finding Our Media Bias Experts
Given that identifying lexical and sentential bias is a nuanced linguistic task and that cognitive and language abilities likely impact annotators’ perceptions, we sought annotators with the following qualifications:
- Satisfactory English proficiency: master’s students enrolled in programs taught in English whose grades rank in the top 20%.
- Relevant academic background: a background in Data Science, Computer Science, Psychology, or Intercultural Communication.
- Sufficient practical experience: at least six months of experience in the media bias domain.
Additionally, we require the recruited annotators to undergo comprehensive training to:
- reliably identify biased wording,
- differentiate between bias and merely polarizing language, and
- adopt a politically neutral stance during annotation.
Data Annotation
Now that we have carefully selected the best tea leaves and the finest tea baristas—namely, the news sentences to be annotated and the annotators—it’s time to make the tea!
Our goal is to collect three types of labels for each sentence:
- Word bias labels: Initially, all annotators are asked to mark words or phrases that induce bias.
- A sentence bias label: Subsequently, we ask them to indicate whether the whole sentence is biased or non-biased.
- A sentence opinion label: Lastly, the annotators label the sentence as opinionated, factual, or mixed (a sketch of such an annotated record follows this list).
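As a concrete illustration, a single expert’s labels for one sentence could be represented like the record below; the field names and the example sentence are ours, not BABE’s actual schema.

```python
# Illustrative record for one sentence and one annotator; field names are
# assumptions, not BABE's actual column names.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SentenceAnnotation:
    sentence: str
    biased_words: List[str] = field(default_factory=list)  # word bias labels
    sentence_bias: str = "non-biased"   # "biased" or "non-biased"
    opinion: str = "factual"            # "opinionated", "factual", or "mixed"

example = SentenceAnnotation(
    sentence="The senator's reckless plan drew criticism.",
    biased_words=["reckless"],
    sentence_bias="biased",
    opinion="mixed",
)
```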
To find the ideal tradeoff between the number of sentences and annotators per sentence, and to facilitate subsequent evaluation of annotation quality, we organize BABE into two subgroups (SG):
- SG1: 1,700 sentences (matching MBIC) annotated by eight experts each.
- SG2: 3,700 sentences annotated by five experts each.
For SG1, we ask eight annotators to annotate the 1,700 sentences that are the same as in MBIC. We thereby obtain an expert-labeled ground truth comparable to MBIC’s crowdsourcing results. For SG2, we ask five of the previous eight annotators to label the additional 2,000 sentences to expand the corpus size.
Annotation Evaluation
After gathering the raw labels from our expert annotators, we need to process them, examine how the annotators’ judgements agree with one another, and then determine whether a sentence (or word) should be eventually labeled as biased.
We determine the final labeling of the sentences by a majority vote:
On the sentence level, a sentence is ultimately marked as ‘biased’ if at least half of the annotators label it as such; otherwise, it is marked as ‘non-biased.’ When the annotators do not reach a majority vote on a sentence, we assign the label ‘no agreement.’
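Here is a minimal sketch of that aggregation rule; treating an exact tie as ‘no agreement’ is our reading of the rule, and the function is not code from the BABE release.

```python
# Minimal sketch of the sentence-level majority vote; tie handling is an
# assumption on our part.
from typing import List

def aggregate_sentence_label(votes: List[str]) -> str:
    """votes: one 'biased' or 'non-biased' label per annotator."""
    biased = votes.count("biased")
    non_biased = len(votes) - biased
    if biased > non_biased:
        return "biased"
    if non_biased > biased:
        return "non-biased"
    return "no agreement"   # exact tie: no majority among annotators

print(aggregate_sentence_label(["biased", "biased", "non-biased"]))  # biased
```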
We compute Krippendorff’s agreement scores for the two types of sentence labels (bias labels and opinion labels) in SG1 and SG2.
Tip: If the name Krippendorff sounds new to you, just know that higher scores indicate more consistent annotations and greater agreement among annotators, which is a good thing.
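For readers who want to compute such agreement themselves, here is a sketch using the third-party `krippendorff` Python package (installable via pip); the toy ratings are made up and only illustrate the input format.

```python
# Sketch of computing Krippendorff's alpha for sentence bias labels with the
# `krippendorff` package; the ratings matrix below is a toy example.
import numpy as np
import krippendorff

# Rows = annotators, columns = sentences; 1 = "biased", 0 = "non-biased",
# np.nan = this annotator did not label that sentence.
ratings = np.array([
    [1, 0, 1, 1, np.nan],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 0, 1],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```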
We then compare the scores to acquire knowledge about data quality and demonstrate that BABE offers better annotation quality than existing works.
So which offers higher quality, crowdsourcing or expert annotations?
To compare the annotation quality of MBIC’s crowdsourcing approach with that of our expert-based approach, we use SG1’s annotations.
The final agreement scores for bias labels and opinion labels indicate:
- A higher consistency in the expert approach compared to the crowdsourcing one.
- Expert annotators are more conservative in their annotations than crowdsourcers.*
*The second finding comes from the observation that expert annotators assign fewer ‘biased’ labels to both words and sentences than crowdsourcers do.
The resulting labels in BABE are of higher quality and capture media bias better than labels gathered via crowdsourcing, suggesting an enhancement in authenticity and the importance of having media bias experts perform the annotations. BABE also has a higher inter-annotator agreement score than existing work, suggesting an enhancement in reliability.
Training with BABE – Developing Media Bias Detection Models
Biased or not biased; that is the question.
Automatic media bias detection is essentially a classification task for neural models. Additionally, research has shown that pre-training on larger, distantly labeled datasets followed by fine-tuning on supervised data yields improved performance for sentiment classification.
In our work, we also introduce an additional pre-training task employing the idea of the distant supervision approach to enhance models’ capabilities in recognizing media bias content.
We first fine-tune and evaluate five neural models — BERT, DistilBERT, RoBERTa, ELECTRA, and XLNet — on BABE. We then take the two best-performing models from the first run, BERT and RoBERTa, and add the distant supervision pre-training task.
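As a rough illustration of this first run, the sketch below fine-tunes and evaluates standard Hugging Face checkpoints on BABE-style sentence labels. The checkpoint names, hyperparameters, and the assumption that the data arrives as `datasets` splits with `text` and `label` columns are ours, not the authors’ exact setup.

```python
# Hedged sketch of fine-tuning and comparing several checkpoints on sentence
# bias labels; hyperparameters and checkpoint choices are illustrative.
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def macro_f1(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

def finetune_and_score(checkpoint, train_ds, eval_ds):
    """train_ds/eval_ds: `datasets` splits with 'text' and 'label' columns."""
    tok = AutoTokenizer.from_pretrained(checkpoint)
    encode = lambda batch: tok(batch["text"], truncation=True,
                               padding="max_length", max_length=128)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"out/{checkpoint.replace('/', '_')}",
                               num_train_epochs=3, per_device_train_batch_size=16),
        train_dataset=train_ds.map(encode, batched=True),
        eval_dataset=eval_ds.map(encode, batched=True),
        compute_metrics=macro_f1,
    )
    trainer.train()
    return trainer.evaluate()["eval_macro_f1"]

# The five model families mentioned above, as standard Hub checkpoints:
CHECKPOINTS = ["bert-base-uncased", "distilbert-base-uncased", "roberta-base",
               "google/electra-base-discriminator", "xlnet-base-cased"]
# scores = {c: finetune_and_score(c, train_split, eval_split) for c in CHECKPOINTS}
```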
Distant Supervision Approach
To learn bias-specific word embeddings—the internal numerical representations of words that the model works with, not something meant for human interpretation—we compile a news corpus for pre-training. The corpus consists of news headlines from outlets both with and without a partisan leaning.
Our purpose is to map bias and neutral labels to sequences automatically, thus alleviating the burden of collecting expensive manual annotations.
In this approach, distant or weak labels are therefore derived from noisy sources: we assume that biased words occur more densely in news sources with a partisan leaning than in others.
Since the news headline corpus serves to learn more effective language representations, it is not suitable for evaluation purposes due to its noisy nature. We ensure that no overlap exists between the distant corpus and BABE to guarantee model integrity with respect to training and testing.
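The core of the weak-labeling idea can be sketched in a few lines; the outlet-to-leaning mapping below is purely illustrative and is not the outlet list actually used for the pre-training corpus.

```python
# Sketch of distant (weak) labeling: a headline inherits its label from the
# outlet's leaning; the mapping here is a made-up placeholder.
OUTLET_LEANING = {
    "partisan-left-outlet.com": "partisan",
    "partisan-right-outlet.com": "partisan",
    "wire-service.com": "non-partisan",
}

def weak_label(outlet: str) -> int:
    """1 = weakly 'biased' (partisan outlet), 0 = weakly 'neutral'."""
    return 1 if OUTLET_LEANING.get(outlet) == "partisan" else 0

print(weak_label("wire-service.com"))  # 0
```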
The data collection here resembles the collection of the ground-truth data described in the Data Collection section above. Similarly, we extract news about controversial topics using pre-defined keywords, as we assume slanted reporting to be more likely for those topics than for less controversial ones.
If you feel lost due to the technical details above, here’s a simplified way to understand this:
The journey of training a language model from scratch to become specialized at addressing a specific task is just like that of us humans—we start as newbies in a field, learn the general yet foundational knowledge, and then continue to deepen our knowledge until the day we can be confident in our specialties in this field.
For language models, the first phase of learning a general representation of a subject (say, a language like English) is called pre-training. During this phase, language models are exposed to a large amount of unlabeled text data to capture the underlying language representations.
The second phase aims at refining and deepening the models’ knowledge for a specific task, which is called fine-tuning. Here, models are trained on a smaller, more specific set of data that is related to the particular task they need to perform.
Therefore, we can picture the overall training process of the models, as they acquire language abilities and domain-specific knowledge in media bias in this way —
Stage 1. Acquiring general language ability (pre-training):
We select and directly use language models like BERT that have already been pre-trained on large unlabeled corpora. Thus, they are equipped with general language ability at this stage; for instance, BERT can ‘speak’ English after this stage.
Stage 2. Acquiring general domain knowledge (pre-training):
This is where we implement the distant supervision approach. By pre-training the language models from Stage 1 again on a more domain-specific (even though somewhat noisy) news corpus, the models acquire more domain-specific knowledge. Namely, BERT is now equipped with some general knowledge about news bias and biased words.
Stage 3. Acquiring more accurate domain knowledge (fine-tuning):
Fine-tuning and evaluating the language models from Stage 2 on BABE allows the final models to obtain a more accurate understanding of media bias at the word and sentence levels.
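To make the hand-off between the stages explicit, here is a compact, hedged sketch: Stage 1 amounts to downloading an already pre-trained checkpoint, Stage 2 continues training on the weakly labeled headline corpus, and Stage 3 fine-tunes the resulting weights on BABE. Paths, hyperparameters, and the dataset arguments (assumed to be already tokenized and labeled) are illustrative.

```python
# Hedged sketch of chaining Stages 1-3; all names and settings are illustrative.
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

def run_stages(headline_ds, babe_train_ds, babe_eval_ds,
               checkpoint="roberta-base"):
    """All dataset arguments are assumed to be tokenized and carry labels."""
    # Stage 1: general language ability comes from the published checkpoint.
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # Stage 2: continue pre-training on the weakly labeled headline corpus.
    Trainer(model=model,
            args=TrainingArguments(output_dir="stage2", num_train_epochs=1),
            train_dataset=headline_ds).train()
    model.save_pretrained("stage2_model")

    # Stage 3: fine-tune the Stage-2 weights on BABE's expert labels.
    model = AutoModelForSequenceClassification.from_pretrained("stage2_model")
    trainer = Trainer(model=model,
                      args=TrainingArguments(output_dir="stage3", num_train_epochs=3),
                      train_dataset=babe_train_ds,
                      eval_dataset=babe_eval_ds)
    trainer.train()
    return trainer
```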
Baseline Method
Every newly developed model needs a comparable baseline to highlight its improvements. Our baseline method comes from one of our previous works by Spinde (2021).
It is a traditional feature-engineering model using syntactic and lexical features related to biased words, such as dictionaries of opinion words, hedges, and assertive and factive verbs. Since feature-based models operate on the word level, we ensure comparability by applying the classification rule that the presence of any predicted biased word leads to the whole sentence being labeled as biased.
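The sentence-level rule for the baseline can be stated in one line of code; `predict_biased_words` below stands in for the feature-based word classifier and is not part of the original baseline implementation.

```python
# Sketch of the comparability rule: any predicted biased word makes the whole
# sentence count as biased. `predict_biased_words` is a placeholder.
from typing import Callable, List

def sentence_is_biased(sentence: str,
                       predict_biased_words: Callable[[str], List[str]]) -> bool:
    return len(predict_biased_words(sentence)) > 0
```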
Results
So how do the models trained on BABE perform? Here’s a brief summary of the exciting findings we get:
- The distant supervision pre-training task improves both BERT and RoBERTa, and RoBERTa is the best-performing model among the five SOTA models we selected.
- Models trained and evaluated on a larger corpus (i.e., SG2) generally perform better, which we believe indicates that extending the dataset in the future will be valuable.
- Overall, media bias can be better captured when word embedding algorithms are pre-trained on the news headline corpus with distant supervision based on varying news outlets.
Further Application: the active BABE
Fortunately, the journey of BABE doesn’t end here. On one hand, we continue to improve the dataset and the models to this day. We have also built a main site for BABE on Hugging Face, where we keep posting its latest updates and other relevant works.
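If you want to experiment with BABE yourself, it can likely be loaded with the Hugging Face `datasets` library as sketched below; the repository identifier is our assumption, so check the BABE page on Hugging Face for the exact name and current splits.

```python
# Hedged sketch of loading BABE; the dataset identifier is an assumption and
# may differ from the actual repository name on the Hugging Face Hub.
from datasets import load_dataset

babe = load_dataset("mediabiasgroup/BABE")
print(babe)
```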
On the other hand, BABE has been actively used in many other research projects to this day. Below, we provide shortcuts to some of these studies that incorporate BABE.
Research within Media Bias Group
Research outside Media Bias Group