Understanding AI: Mechanistic Interpretability and Its Challenges

Artificial Intelligence (AI) has led to breakthroughs in numerous fields, from drug research to robotics, and it is changing how we interact with computers and the internet. However, a major issue persists: we still don’t fully understand how large language models work or why they perform so well. We have a rough idea, but the details inside these systems are too complex to decipher. That is a problem, especially if we deploy AI in sensitive areas like medicine without understanding its potential critical weaknesses.

A team at Google DeepMind working on mechanistic interpretability has been developing new methods to peer inside AI models. In July, it released Gemma Scope, a tool to help researchers understand what is happening when a generative model produces an output. The hope is that by better understanding a model’s inner workings, we can control its outputs more effectively and build fundamentally better AI systems in the future.

Neel Nanda, who leads the mechanistic interpretability team at Google DeepMind, wants to be able to “read the thoughts” of an AI model to see whether it is acting deceptively. Mechanistic interpretability, also known as “mech interp,” aims to understand how neural networks actually function. Today, we feed a model a lot of data and end up with a set of model weights that determine how it makes decisions. We know that the AI looks for patterns in the data and draws conclusions from them, but those patterns can be incredibly complex and hard for humans to interpret.

It is like a teacher checking the answer to a complex math problem on a test. The student (the AI) wrote down the right answer, but the work looks like a jumble of squiggly lines. And sometimes the AI student latches onto an irrelevant pattern it believes to be valid; some AI systems, for instance, mistakenly conclude that 9.11 is greater than 9.8. Methods developed in the field of mechanistic interpretability are starting to shed light on these issues.

A main goal of mechanistic interpretability is to reverse-engineer the algorithms inside these systems. When we ask a model to “write a poem,” for example, it somehow produces rhyming lines; the aim is to understand the algorithm it used to do so. To identify features in Google’s AI model Gemma, DeepMind ran a tool called a Sparse Autoencoder across the model’s many layers. A Sparse Autoencoder works like a microscope that magnifies those layers and allows a closer look at their details.
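For intuition, here is a minimal sketch of a Sparse Autoencoder in PyTorch. It is an illustrative toy, not DeepMind’s implementation (Gemma Scope’s autoencoders use a different, JumpReLU-style activation and are trained at far larger scale); the dimensions, the ReLU-plus-L1 sparsity scheme, and the loss coefficient are assumptions chosen for readability.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy Sparse Autoencoder: maps one layer's activations to a wider,
    mostly-zero feature vector and back again."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activations -> candidate features
        self.decoder = nn.Linear(d_features, d_model)   # features -> reconstructed activations

    def forward(self, activations: torch.Tensor):
        # ReLU plus the L1 penalty below push most feature activations to zero.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    # Reconstruct the layer faithfully while keeping the feature vector sparse.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Example: "magnify" a 2304-dimensional activation vector into 16,384 candidate features.
sae = SparseAutoencoder(d_model=2304, d_features=16_384)
acts = torch.randn(8, 2304)                    # stand-in for real layer activations
features, reconstruction = sae(acts)
loss = sae_loss(acts, features, reconstruction)
```

The wide, mostly-zero feature vector is what researchers inspect: each coordinate is a candidate feature that may correspond to a human-interpretable concept.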

The challenge with autoencoders is deciding how granular they should be. Zoom in too much and what you see may become impossible to interpret; zoom out too far and you may miss interesting details. DeepMind’s solution was to run Sparse Autoencoders of different sizes, varying the number of features they are asked to find. The results were released as open source, encouraging other researchers to explore them and gain new insights into the model’s internal logic.
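In practice, “different sizes” simply means training the same kind of autoencoder with different feature counts on the same cached activations. Here is a rough sketch, reusing the imports and toy SparseAutoencoder above; the widths, batch shapes, and training loop are placeholder assumptions, not Gemma Scope’s actual recipe.

```python
# Train toy Sparse Autoencoders at several widths on the same activations.
widths = [512, 4_096, 32_768]   # arbitrary; the trade-off is detail vs. interpretability
activation_batches = [torch.randn(64, 2304) for _ in range(10)]  # stand-in for cached layer activations

saes = {}
for d_features in widths:
    sae = SparseAutoencoder(d_model=2304, d_features=d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=3e-4)
    for acts in activation_batches:
        features, recon = sae(acts)
        loss = sae_loss(acts, features, recon)
        opt.zero_grad()
        loss.backward()
        opt.step()
    saes[d_features] = sae   # compare what each width ends up finding
```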

Josh Batson, a researcher at Anthropic, finds this exciting for interpretability research. Making such systems openly available means a lot of research can build on these Sparse Autoencoders, lowering the barrier to entry for anyone who wants to learn from these methods. Neuronpedia, a platform for mechanistic interpretability, partnered with DeepMind to build a demonstration of Gemma Scope that lets users experiment with different prompts and see which neural activations they trigger.

Sparse Autoencoders work unsupervised, finding features on their own, which leads to surprising insights into how models deconstruct human concepts. Joseph Bloom of Neuronpedia mentions a “cringe” feature that fires on negative criticism of texts and films, illustrating how models can pick up remarkably human-like concepts.

Some features are easier to track down than others. Deception, for instance, is hard to find: there is no straightforward feature that fires when the model lies to the user. DeepMind’s research is similar to work by Anthropic, which used Sparse Autoencoders to identify the parts of its model that lit up when the Golden Gate Bridge was discussed, and then amplified them until the model identified as the bridge itself.
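That kind of demonstration relies on steering: boosting a feature’s activation during the forward pass. Below is a hedged sketch of the idea, reusing the toy SparseAutoencoder from above; the feature index, layer choice, strength, and the tiny stack of linear layers standing in for a real transformer are all invented for illustration.

```python
FEATURE_ID = 12_345   # hypothetical index of a feature such as "Golden Gate Bridge"
STRENGTH = 10.0       # how strongly to boost the feature

sae = SparseAutoencoder(d_model=2304, d_features=16_384)  # in practice: trained on this layer

def steer_hook(module, inputs, output):
    """Add the feature's decoder direction to the layer's output,
    nudging the model toward the concept that feature represents."""
    direction = sae.decoder.weight[:, FEATURE_ID]   # (d_model,) read-out direction
    return output + STRENGTH * direction

# Tiny stand-in "model": a stack of linear blocks whose 21st layer we hook.
model = nn.Sequential(*[nn.Linear(2304, 2304) for _ in range(24)])
handle = model[20].register_forward_hook(steer_hook)
_ = model(torch.randn(1, 2304))   # the hook fires during this forward pass
handle.remove()
```

A real experiment would hook a layer of an actual language model and use a Sparse Autoencoder trained on that layer’s activations.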

Mechanistic interpretability research, while quirky, can be extremely useful. It helps us understand how a model generalizes and at what level of abstraction it operates. For example, researchers have used Sparse Autoencoders to identify gender biases in a model and reduce them by disabling the specific features involved. However, this was done with a small model, so it is unclear whether the approach carries over to larger ones.
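“Disabling” a feature typically means zeroing its activation and passing the Sparse Autoencoder’s reconstruction onward in place of the raw activations. Here is another hedged sketch in the same toy setup, reusing the sae and stand-in model from the previous example; the feature indices are hypothetical, and this is not the exact procedure of any particular paper.

```python
BIAS_FEATURES = [401, 7_802]   # hypothetical indices of features linked to the unwanted behavior

def ablate_hook(module, inputs, output):
    """Encode the layer's output, zero the unwanted features,
    and pass the reconstruction onward instead of the raw activations."""
    features, _ = sae(output)
    features[..., BIAS_FEATURES] = 0.0
    return sae.decoder(features)

handle = model[20].register_forward_hook(ablate_hook)
_ = model(torch.randn(1, 2304))   # behavior driven by those features is suppressed here
handle.remove()
```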

This research can also explain why AI makes mistakes. When a model mistakenly concluded that 9.11 is greater than 9.8, researchers found that the question triggered features related to Bible verses and to September 11. Once they understood this, they turned down the model’s activations on those features, and it gave the correct answer.

Other potential applications of Sparse Autoencoders include preventing the spread of harmful information, such as bomb-making instructions. If model creators can identify where that knowledge resides in a model, they could theoretically disable it, so that even sophisticated prompt hacking could not retrieve it. However, achieving that level of granularity and precise control is currently difficult.

Every change to an AI model involves trade-offs. Adjusting a model to reduce violence might inadvertently erase its knowledge of martial arts. “Bomb-making” knowledge isn’t a simple on-off switch; it is likely woven into many parts of the model, so disabling it might also degrade the model’s understanding of chemistry.

If we can delve deeper and see more clearly into the “mind” of an AI, mechanistic interpretability could offer a viable path to alignment, ensuring AI does what we expect. This article by Scott J Mulligan of MIT Technology Review highlights both the potential and the challenges of mechanistic interpretability in AI research.