As ever-larger models are pre-trained on internet-scale data, their capabilities grow. However, because the internet naturally contains biased views and information (e.g., Wikipedia, per this study), models incorporate these biases into their weights. This can negatively affect any downstream application in which some input (e.g., a question containing stereotypical information, concerning minorities, or describing political affiliation) might evoke biased output.
To mitigate this, we propose using mechanistic interpretability methods to detect and control these learned biases in the models themselves (specifically in Vision Language Models). This way, we can pinpoint specific learned representations that are intrinsically biased, or find steering vectors to control the models’ behavior. We hope this research direction will yield powerful yet efficient tools to open the “Pandora’s box” of today’s omnipresent AI models.
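To make the steering-vector idea concrete, here is a minimal toy sketch of one common recipe: take the difference of mean activations between prompts that evoke the unwanted behavior and matched neutral prompts, then subtract a scaled copy of that direction at inference time. The activations below are random stand-ins (there is no actual model here), and all names are illustrative, not part of any specific library or method from this proposal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations (hypothetical stand-ins for a VLM's intermediate layer):
# rows are prompts, columns are hidden dimensions.
hidden_dim = 8
acts_biased = rng.normal(loc=1.0, size=(16, hidden_dim))   # prompts evoking the unwanted behavior
acts_neutral = rng.normal(loc=0.0, size=(16, hidden_dim))  # matched neutral prompts

# One common choice of steering vector: the difference of mean activations
# between the two prompt sets (a "difference-in-means" direction).
steering_vec = acts_biased.mean(axis=0) - acts_neutral.mean(axis=0)
unit_dir = steering_vec / np.linalg.norm(steering_vec)

def steer(activation, alpha=1.0):
    """Subtract the steering direction to push an activation away from the
    biased region; alpha scales the intervention strength."""
    return activation - alpha * steering_vec

# The intervention lowers the activation's component along the bias
# direction by exactly alpha * ||steering_vec||.
example = acts_biased[0]
before = float(example @ unit_dir)
after = float(steer(example) @ unit_dir)
print(after < before)  # True
```

In a real setting, the activations would be read from (and written back to) a chosen layer of the model via hooks, and `alpha` would be tuned to trade off bias reduction against degradation of the model's general behavior.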

