
Uncovering Bias in AI Models

Interpreting Models' Internal Mechanisms

As AI models are trained on vast internet-scale data, they can internalize the biases present in that content and reproduce them in downstream applications. Our research aims to use mechanistic interpretability in Vision-Language Models to detect, localize, and control these biased internal representations, helping develop efficient tools for safer and fairer AI.

As larger and larger models are pre-trained on internet-scale data, their capabilities grow. However, because the internet naturally contains a great deal of biased content (e.g., Wikipedia, per this study), models absorb these biases into their weights. This can negatively affect any downstream application in which an input (e.g., a question containing stereotypical information, concerning minorities, or describing political affiliation) might evoke a biased output.

To mitigate this, our research proposes using mechanistic interpretability methods to detect and control these learned biases in the models themselves, specifically in Vision-Language Models. This way, we can pinpoint specific learned representations that are intrinsically biased and identify steering vectors that control the models’ behavior. We hope this research direction will yield powerful yet efficient tools to open the “Pandora’s box” of today’s omnipresent AI models.
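To make the idea of a steering vector concrete, below is a minimal sketch (not the actual method used in this research) of one common recipe: compute the difference of mean hidden activations between two contrastive prompt sets, then subtract a scaled copy of that vector from the residual stream at generation time. It assumes a Hugging Face Transformers decoder model (GPT-2 is used only as a convenient stand-in for a VLM's language backbone); the prompt lists, layer index, and scaling factor ALPHA are illustrative placeholders, not a curated bias benchmark or tuned settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and layer; in practice this would be the language backbone of a VLM.
MODEL_NAME = "gpt2"
LAYER = 6

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_last_token_state(prompts, layer):
    """Average the hidden state at `layer` over the last token of each prompt."""
    states = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(dim=0)

# Contrastive prompt sets (toy examples, not a real evaluation set).
biased_prompts  = ["The nurse said that she", "The engineer said that he"]
neutral_prompts = ["The nurse said that they", "The engineer said that they"]

# Steering vector: difference of mean activations between the two sets.
steering_vector = (mean_last_token_state(biased_prompts, LAYER)
                   - mean_last_token_state(neutral_prompts, LAYER))

# Subtract the scaled vector from the residual stream via a forward hook on one block.
ALPHA = 4.0

def steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - ALPHA * steering_vector
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer)
prompt = tokenizer("The nurse said that", return_tensors="pt")
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=10)[0]))
handle.remove()  # restore the unmodified model
```

The same activations used to build the vector can also be probed directly (e.g., with a linear classifier) to localize where a biased concept is represented, which is the detection side of the approach described above.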