Artificial intelligence (AI) has led to breakthroughs in many areas, from drug discovery to robotics, and it is changing how we interact with computers and the internet. But there’s a big problem: we don’t fully understand how large language models work or why they are so effective. We have a general idea, but the details inside these systems are too complex to decipher. That matters because we may end up deploying AI in sensitive fields like medicine without knowing it has critical weaknesses.
A team at Google DeepMind that works on “mechanistic interpretability” is developing methods that let us look inside AI systems. In late July, it released Gemma Scope, a tool to help researchers understand what happens when a generative model produces an output. The hope is that if we better understand what’s happening inside an AI model, we can steer its outputs more effectively, leading to better AI systems in the future.
Neel Nanda, who leads the mechanistic interpretability team at Google DeepMind, says, “I want to be able to look inside a model and see if it’s acting deceptively. It should be possible to read a model’s thoughts.” Mechanistic interpretability, or “mech interp,” aims to understand how neural networks actually work. Today, we feed a model a lot of data and end up with a set of model weights: the parameters that determine how the model makes decisions. We have a rough idea of what happens between input and output: the model looks for patterns in the data and draws conclusions from them. But those patterns can be incredibly complex and hard for humans to interpret.
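To see why those weights are so hard to read, here is a minimal sketch in PyTorch (a generic toy network, not Gemma; the layer sizes are arbitrary assumptions). Even a small model is nothing but millions of unlabeled numbers, and nothing about them says which concept any particular one helps encode.

```python
import torch.nn as nn

# A toy two-layer network standing in for a much larger language model.
toy_model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

total_params = sum(p.numel() for p in toy_model.parameters())
print(f"parameters: {total_params:,}")  # roughly 2.1 million numbers, even for this toy

# Each parameter is just a floating-point value; none of them comes with a
# label saying which concept, rule, or "thought" it helps represent.
print(toy_model[0].weight[0, :5])  # five arbitrary, uninterpretable values
```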
Imagine a teacher grading a complex math problem. The student, or the AI in this case, may have written down the right answer, but its work looks like a bunch of squiggly lines. And that assumes the AI always reaches the right answer, which it doesn’t: it might latch onto an irrelevant pattern and treat it as valid. Some current AI systems, for example, might conclude that 9.11 is greater than 9.8. Methods developed in mechanistic interpretability are beginning to make sense of these squiggly lines.
Nanda explains, “A main goal of mechanistic interpretability is to reverse-engineer the algorithms within these systems practically. We give the model a prompt, like ‘Write a poem,’ and then it writes some rhyming lines. What’s the algorithm it used? We’d like to understand that.”
To find features, or categories of data that represent a larger concept, in Google’s AI model Gemma, DeepMind ran a tool known as a sparse autoencoder on each of the model’s layers. You can think of a sparse autoencoder as a microscope that magnifies those layers and lets you see their details. If you ask Gemma about a Chihuahua, the “dogs” feature activates, lighting up what the model knows about “dogs.” The autoencoder is “sparse” because it allows only a small number of its neurons to activate at once, pushing it toward a more efficient, more general representation of the data.
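To make that concrete, here is a minimal sketch of a sparse autoencoder in PyTorch. It is not DeepMind’s implementation: the layer widths, the ReLU-plus-L1 sparsity scheme, and the variable names are illustrative assumptions. It only shows the general shape of the technique: a layer’s activation vector is mapped to a much wider, mostly-zero feature vector and then reconstructed from it.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Illustrative sparse autoencoder: encodes a model activation into a
    wider, mostly-zero feature vector and decodes it back."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activation -> features
        self.decoder = nn.Linear(n_features, d_model)  # features -> reconstruction

    def forward(self, activation: torch.Tensor):
        # ReLU keeps feature values non-negative; the L1 penalty in the loss
        # below pushes most of them to zero, which is what makes the code sparse.
        features = torch.relu(self.encoder(activation))
        reconstruction = self.decoder(features)
        return features, reconstruction

# Toy usage: a 2,304-dimensional activation (a hypothetical width) is expanded
# into 16,384 candidate features; after training, only a few would fire per input.
sae = SparseAutoencoder(d_model=2304, n_features=16_384)
activation = torch.randn(1, 2304)  # stand-in for a real layer activation
features, reconstruction = sae(activation)

# Training objective: reconstruct the activation faithfully while keeping the
# feature vector sparse (sparsity_weight is an illustrative hyperparameter).
sparsity_weight = 1e-3
loss = torch.mean((reconstruction - activation) ** 2) + sparsity_weight * features.abs().mean()
```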
The tricky part of sparse autoencoders is deciding how granular they should be. Think of the microscope again: you can magnify something to an extreme degree, but what you see may then be impossible for a human to interpret. Zoom out too far, and you might miss interesting discoveries.
DeepMind’s solution was to run sparse autoencoders of different sizes, varying the number of features each autoencoder is asked to find. The goal wasn’t just for DeepMind researchers to analyze the results on their own. Gemma and the autoencoders are open source, so other interested researchers are encouraged to explore the findings and, hopefully, gain new insights into the model’s internal logic. And because DeepMind ran its autoencoders on every layer of the model, a researcher could map the steps from input to output in a way we haven’t seen before.
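As a rough illustration of that size sweep (the widths, the batch, and the metrics below are assumptions, not Gemma Scope’s actual configuration), one could run the sketched SparseAutoencoder from above at several widths over the same batch of activations and compare how reconstruction error and the number of active features trade off.

```python
import torch

# Reuses the illustrative SparseAutoencoder class from the earlier sketch.
batch = torch.randn(256, 2304)  # stand-in batch of layer activations

for n_features in (4_096, 16_384, 65_536):  # illustrative widths
    sae = SparseAutoencoder(d_model=2304, n_features=n_features)
    features, reconstruction = sae(batch)

    mse = torch.mean((reconstruction - batch) ** 2).item()
    active_per_input = (features > 0).float().sum(dim=-1).mean().item()

    # A wider ("more zoomed-in") autoencoder can carve activations into finer
    # features, but each feature becomes harder for a human to label.
    print(f"width={n_features:6d}  reconstruction_mse={mse:.4f}  "
          f"avg_active_features={active_per_input:.1f}")
```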