notes on interpretability
work in progress
As I read more interpretability research I have felt the need to systematize all the different approaches and ways of thinking I was coming across. The content on this page started out as my personal attempt at introducing some structure to my own thinking about interpretability. Driven by the challenge of giving a single universal definition of interpretability, I have started to organize the papers I have been reading by the kind of insight they aim to produce. Once we know the question a piece of work is aiming to answer, we can look at the concrete experimental choices needed to make answering that question tractable. Beyond helping to better understand existing research, I believe viewing work this way can be helpful in developing one’s own path from a hypothesis to a way of testing it. The material mainly focuses on transformer LLMs, but I include any relevant ideas even if they have not yet been applied to LLMs. At some point I decided it would be helpful to share this publicly and engage with the community, in the hope that someone else might find it useful and/or contribute their thoughts. Please email me if you have any feedback; I am always excited to chat and learn more!
1 What is interpretability?
Defining interpretability is non-trivial, as interpretation or justification is inherent to how we humans understand the world. At this stage I will refrain from trying to define it precisely and instead try to broadly define the goals of interpretability research. For the purposes of this document I will define it as the practice of producing explanations of model behavior that are useful under some combination of scientific truth-seeking, human understanding, and actionability. There is no single best explanation even in much more established sciences: general relativity is closer to being ‘true’ than Newtonian mechanics, but you can go to the moon using just Newtonian mechanics, and trying to take general relativity into account would only make your life more difficult. It is thus important to evaluate explanations against some desiderata and/or intended use. I have found that most explanations can be viewed through the following lenses:
- Descriptive
- Mechanistic/causal
- Actionable/control oriented
2 What makes a good explanation?
The most important concepts that are desirable for an explanation are:
- Plausibility: explanation looks meaningful to humans
- Faithfulness: explanation tracks the model’s actual causal dependencies
- Sufficiency: keeping only the explanatory factors preserves the behavior
- Comprehensiveness: removing the explanatory factors breaks the behavior
- Stability: small perturbations that don’t change the output shouldn’t drastically change the explanation
Assessing the different concepts above poses different challenges. For example, plausibility is usually intuitive to judge but often requires the right way to visualize the explanation. On the other hand, faithfulness/sufficiency/comprehensiveness require much more careful experimental design and consideration. The best ways to define and evaluate these concepts have been the subject of continued discussion.
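To make the sufficiency/comprehensiveness pair concrete, here is a minimal sketch of both tests. The linear “model”, the instance, the claimed explanation, and the zero baseline for feature absence are all hypothetical choices made for illustration:

```python
import numpy as np

# Hypothetical "model": a linear scorer over 8 input features.
w = np.array([4.0, 3.0, 0.1, 0.05, 0.0, -0.1, 0.02, 0.01])

def model(x):
    return float(x @ w)

x = np.ones(8)            # the instance whose output we are explaining
explanation = [0, 1]      # claimed explanatory factors: features 0 and 1
baseline = np.zeros(8)    # "absence" modeled by zeroing a feature

def keep_only(x, idx):
    out = baseline.copy()
    out[idx] = x[idx]
    return out

def remove(x, idx):
    out = x.copy()
    out[idx] = baseline[idx]
    return out

full = model(x)
# Sufficiency: keeping only the explanatory factors preserves the behavior.
sufficiency_gap = abs(full - model(keep_only(x, explanation)))
# Comprehensiveness: removing the explanatory factors breaks the behavior.
comprehensiveness_gap = abs(full - model(remove(x, explanation)))
```

Here the gaps come out around 0.08 and 7.0 respectively, so for this toy scorer the explanation {0, 1} is both approximately sufficient and comprehensive.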
3 Question based taxonomy
This section attempts to classify different interpretability work into categories based on the broad questions they try to answer or the insights they try to gain. Within each category we further refine the work by the specific choices made in order to make the insight concrete. At the end of each section is a list of papers with short summaries that follow (at least approximately) this schema:
- Question it answers
- Explanation object it uses
- Metric or evidence it uses
- Other choices made
- The observation produced
Obviously many works will not fit this taxonomy exactly and I might change it in the future as I learn more.
3.1 Which part of the input is important for this output?
This line of work tries to connect parts of the input to the output a model produces on that input. The kind of insight you get is descriptive and local: for this particular example (or a narrow neighborhood around it), which bits of the input mattered most for the prediction? In practice this is often used for debugging and establishing trust, because it can surface cases where the model is right for the wrong reasons and it can sometimes make it easier to compare models beyond aggregate accuracy.
Making that question concrete requires a handful of design decisions. You have to decide what output you are explaining, what counts as an input part (some human-interpretable representation), and how you measure contribution. Different methods operationalize contribution differently: some fit a simple explanation model in a local neighborhood and treat its weights as the explanation; some treat contribution as sensitivity (gradients) but stabilize it by integrating gradients along a baseline-to-input path; others define contribution via a feature-masking game that averages marginal effects across many subsets, then approximate that target efficiently. As the final assessment of the usefulness of the explanation is performed by humans, it is also important to choose how many features you present. Finally, as with many other ML problems, the approach needs to be computationally tractable.
The main failure mode is that it is easy to produce explanations that look plausible while not actually tracking the model’s causal dependencies. A few recurring reasons: the interpretable representation can be too weak to capture what the model is doing; local approximations fail when the model is highly non-linear even near the point you are interested in; gradient-based attributions can collapse under saturation unless you’re careful about how you aggregate gradients; and any definition of missingness introduces assumptions that can change what the attribution means. Empirical evaluation is also messy because perturbing inputs can create out-of-distribution artifacts that make it difficult to establish if the explanation method is wrong or the model is simply reacting to weird inputs.
Important choices
- What output you are trying to explain: A class probability, a pre-softmax score/logit, or some other scalar score the model produces.
- What counts as an “input part”: Individual tokens, token presence/absence (bag-of-words), superpixels/patches, or tabular features. You often have to choose an interpretable representation that is not the model’s native feature space.
- How “absence” is defined: A baseline input intended to represent absence of signal, or a notion of missingness implemented by replacing removed features with reference values or with an expectation over a background distribution.
- How locality is defined: What neighborhood you sample around the instance, what perturbations you generate in that neighborhood, and how you weight them by proximity.
- How contribution is measured: Local surrogate weights; gradient aggregation along a path from baseline to input; approximate Shapley-style marginal contributions under feature masking.
- How big the explanation is allowed to be: Sparsity/length constraints (top-K features, a simple explanation model family) so the result stays human-interpretable (what counts as interpretable is itself somewhat subjective).
- Approximation and compute budget: Number of perturbed samples for surrogate/masking methods, number of gradient evaluations/integration steps for path methods, and any regularization needed to make estimation stable.
Paper summaries
Ribeiro, Singh, Guestrin (2016) — "Why Should I Trust You?" Explaining the Predictions of Any Classifier
- Question it answers: For a specific prediction, which interpretable representations of the input drove the model locally?
- Explanation object it uses: An interpretable explanation model $g$ defined over a binary interpretable representation $x’$ (e.g. word presence/absence, superpixel presence/absence).
- Metric/evidence it uses: Fits $g$ to approximate the black-box model $f$ in a locality around the instance by minimizing a locality-weighted loss plus a complexity penalty. Evaluates usefulness via faithfulness-style tests on inherently interpretable models. Simulates some interesting trust tasks.
- Other choices made: Perturbations are generated by randomly masking interpretable components; locality is enforced with a proximity kernel; explanation size is capped by a feature budget $K$. Also introduces a submodular pick procedure (SP-LIME) to select a small set of instances whose explanations provide broader coverage of model behavior (approximate global fidelity).
- Observation produced: Local sparse explanations can reveal spurious features and dataset issues that accuracy misses, help users compare models, and debug models (develop new features, identify inconsistencies).
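The core LIME recipe fits in a short sketch on a toy black box (the function `f`, the kernel width, and the sample count below are illustrative choices, not the paper’s): perturb the binary representation, weight samples by proximity, and fit a weighted linear surrogate whose coefficients are the explanation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy black-box "model" over a 6-token input, seen only through its scores.
# mask[i] = 1 keeps token i, 0 removes it; tokens 0 and 3 matter most.
def f(mask):
    return 2.0 * mask[0] + 1.0 * mask[3] + 0.05 * mask.sum()

d = 6
x_prime = np.ones(d)                                  # the instance, as x'
Z = rng.integers(0, 2, size=(2000, d)).astype(float)  # perturbed neighbors z'
y = np.array([f(z) for z in Z])

# Proximity kernel: neighbors that drop fewer tokens get more weight.
dist = (Z != x_prime).sum(axis=1)
weights = np.exp(-(dist ** 2) / 4.0)

# Weighted ridge regression for the surrogate g(z') = w . z' + b.
A = np.hstack([Z, np.ones((len(Z), 1))])
Aw = A * weights[:, None]
coef = np.linalg.solve(A.T @ Aw + 1e-6 * np.eye(d + 1), Aw.T @ y)
attributions = coef[:d]   # per-token contributions; coef[d] is the intercept
```

Because this toy `f` happens to be linear in the mask, the surrogate recovers it essentially exactly; for a real model the fit is only trustworthy inside the sampled locality.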
Sundararajan, Taly, Yan (2017) — Axiomatic Attribution for Deep Networks
- Question it answers: How should we attribute a deep network’s prediction to input features in a way that is principled, rather than relying on heuristics that can fail in predictable ways?
- Explanation object it uses: A per-feature attribution vector defined relative to a baseline input, with attributions computed as integrated gradients along a baseline-to-input path.
- Metric/evidence it uses: Axiomatic justification: proposes Sensitivity (if one feature causes output difference it should be given non-zero attribution) and Implementation Invariance (model implementation should not alter attribution if input-output behavior is the same) as requirements and shows integrated gradients satisfy them; also uses a completeness identity (attributions sum to $F(x) - F(x_0)$) as a practical sanity check for numerical approximation.
- Other choices made: Baseline selection is treated as a core design choice (meant to represent absence of signal and ideally yield a near-neutral prediction); the path integral is approximated with a finite number of gradient evaluations, increasing the number of steps until the completeness check is reasonably tight.
- Observation produced: Integrating gradients along a baseline-to-input path avoids sensitivity failure (saturation/flat regions yielding near-zero gradients at the input) while retaining implementation invariance.
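A sketch of integrated gradients on a toy saturating network with an analytic gradient (the model, baseline, and step count are my illustrative choices). Note how the raw gradient at the input is nearly zero due to saturation while the path-integrated attributions are not, and how completeness gives a cheap numerical sanity check:

```python
import numpy as np

# Toy network with saturation: F(x) = tanh of a weighted sum.
w = np.array([3.0, 1.0, 0.0])

def F(x):
    return np.tanh(x @ w)

def grad_F(x):
    return (1 - np.tanh(x @ w) ** 2) * w   # analytic gradient

x = np.array([2.0, 2.0, 2.0])   # input, deep in the saturated region
x0 = np.zeros(3)                # baseline meant to represent absence of signal

# Midpoint Riemann approximation of the path integral of gradients.
m = 1000
alphas = (np.arange(m) + 0.5) / m
grads = np.array([grad_F(x0 + a * (x - x0)) for a in alphas])
ig = (x - x0) * grads.mean(axis=0)

# Completeness sanity check: attributions should sum to F(x) - F(x0).
gap = abs(ig.sum() - (F(x) - F(x0)))
```

The attributions split roughly 3:1 between the first two features (matching their weights), the zero-weight feature gets exactly zero, and the gradient at `x` itself is vanishingly small, which is exactly the failure a raw-gradient method would hit here.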
Lundberg, Lee (2017) — A Unified Approach to Interpreting Model Predictions
- Question it answers: Aims to analyze a number of approaches to assigning feature importance (LIME, DeepLIFT, layer-wise relevance propagation) under a unifying framework (Shapley values) and propose a method satisfying properties desirable of value attribution methods.
- Explanation object it uses: An additive explanation model $g(z’)=\phi_0+\sum_i \phi_i z’_i$ over binary feature presence indicators, where the explanation is the set of $\phi_i$ values.
- Metric/evidence it uses: A theoretical result: within additive feature attributions, there is a unique solution satisfying local accuracy, missingness, and consistency, which corresponds to Shapley values. Since most models cannot literally accept arbitrary missing-feature inputs, missingness is formalized via conditional expectations.
- Other choices made: Defines SHAP values as Shapley values of a conditional-expectation version of the model and introduces efficient estimators, including a regression-based estimator with a specific weighting kernel (Kernel SHAP). Discusses practical approximation assumptions (e.g. feature independence, local linearity) that simplify conditional expectations.
- Observation produced: A broad class of existing explanation methods can be seen as instances of the same additive template, but heuristic choices can violate desirable properties; a principled kernel/estimation procedure can recover Shapley-consistent attributions more efficiently than naive Shapley sampling.
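The exact Shapley target is easy to write down for a toy set function, since it only needs an exponential enumeration over coalitions; Kernel SHAP exists precisely because this does not scale. The set function `v` below, including its interaction term, is a made-up example:

```python
from itertools import combinations
from math import factorial

# Toy set function: the model's output when only the features in S are "present".
# Features 0 and 1 are individually useful; feature 2 only helps together with 0.
def v(S):
    S = frozenset(S)
    out = 0.0
    if 0 in S:
        out += 2.0
    if 1 in S:
        out += 1.0
    if {0, 2} <= S:
        out += 0.5    # interaction term
    return out

def shapley(v, n):
    # Exact Shapley values: average each player's marginal contribution
    # over all subsets of the other players, with the standard weights.
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += weight * (v(set(S) | {i}) - v(S))
    return phi

phi = shapley(v, 3)
```

By construction the attributions sum to $v(\text{all}) - v(\emptyset)$ (the local-accuracy/efficiency property), and the 0–2 interaction is split evenly between features 0 and 2.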
3.2 What is represented, where, or in what geometry?
This line of work focuses on what information is present inside a model’s representations, where it tends to live, and what form it takes. Sometimes the question is very concrete (“does layer 6 contain enough information to recover dependency relations?”), and sometimes it is closer to a mechanistic story (“what are feed-forward layers actually doing when they update the residual stream?”). The kind of insight you get is mostly descriptive: you learn what is decodable or legible from internal states, how this changes across depth, and what might be a better basis for understanding the model than raw neurons. There is a close connection to mechanistic and/or actionable insights.
Making the question concrete requires picking a representation to study and a notion of feature/information that you are aiming to extract. Many approaches treat representations as something you can read with a probe by training a classifier or regressor on a frozen hidden state. High accuracy can be an indicator that the information required for the classification/regression task is there. Other approaches try to make the representation interpretable by finding a better coordinate system: map hidden states into vocabulary space to view them as evolving token distributions, interpret a feed-forward layer as a structured sum of key–value memory contributions, or explicitly learn a new feature basis using a sparse autoencoder so that the basic units are sparse features rather than neurons.
The main failure mode is interpreting decodability as mechanism. A probe can succeed because the information is in the representation in a way that is accessible to the probe, without that implying the model uses the information for the behavior of interest. Results can also be sensitive to probe capacity, to how you aggregate representations, and to representation drift across layers.
Important choices
- What internal object you treat as your representation: A token-level hidden state in the residual stream at a given layer, a span representation built from multiple tokens, a pooled sentence embedding, an MLP activation vector.
- What you are trying to recover from it: Linguistic labels and relations (surface, syntax, semantics), next-token distributions at intermediate layers (“latent predictions”), or some higher-level concepts.
- What kind of readout you allow: Linear probes if you want a conservative notion of accessibility, richer probes (e.g. small MLPs) if you care about whether the information is present at all.
- What you take a feature to be: Neurons, arbitrary directions in activation space, structured sub-components of a computation (e.g. per-neuron sub-updates inside an FFN), or learned sparse features from an overcomplete dictionary.
- How you try to make interpretations legible: Comparing probe performance across depth, projecting contributions into vocabulary space, retrieving high-activation dataset examples, clustering learned features for exploration, or training sparse autoencoders to obtain sparse feature activations.
- What evidence counts as success: Probe accuracy/F1, perplexity or KL divergence for intermediate predictions, human annotation of whether top-ranked tokens/examples form a meaningful concept, reconstruction error for learned decompositions, and automated interpretability-style scoring where a model-generated description predicts held-out activations.
Paper summaries
Alain, Bengio (2016) — "Understanding intermediate layers using linear classifier probes"
- Question it answers: How the quality of intermediate representations evolves across depth and training, and whether this can be monitored in a way that is useful for understanding or debugging learning dynamics.
- Explanation object it uses: A set of linear classifier probes trained on frozen intermediate activations (one probe per layer), interpreted via their performance curves.
- Metric/evidence it uses: Probe training/validation loss and accuracy as a function of layer and training time.
- Other choices made: Probes are trained to predict the task labels without backpropagating into the base model; probe capacity is intentionally limited to avoid the probe itself doing heavy lifting.
- The observation produced: Deeper layers tend to make task-relevant information more linearly accessible, and probe trajectories over training can reveal optimization pathologies or architectural quirks that are hard to spot from final accuracy alone.
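A compact sketch of the probing setup on synthetic “activations” (the layers, dimensions, and noise levels are invented): the label is carried along the same direction at both layers, but a linear probe finds it far more accessible where the signal-to-noise ratio is higher.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "frozen activations": a binary label lives along one fixed
# direction at both layers, but layer 0 buries it in much more noise.
n, d = 2000, 32
y = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
signal = np.outer(2 * y - 1, direction)
acts = {0: signal + 6.0 * rng.normal(size=(n, d)),
        2: signal + 0.5 * rng.normal(size=(n, d))}

def probe_accuracy(X, y):
    # Least-squares linear probe (a cheap stand-in for logistic regression),
    # trained on the first half and scored on the held-out second half.
    m = len(X) // 2
    Xtr = np.hstack([X[:m], np.ones((m, 1))])
    w = np.linalg.lstsq(Xtr, 2.0 * y[:m] - 1.0, rcond=None)[0]
    Xte = np.hstack([X[m:], np.ones((len(X) - m, 1))])
    pred = (Xte @ w > 0).astype(int)
    return float((pred == y[m:]).mean())

acc = {layer: probe_accuracy(X, y) for layer, X in acts.items()}
```

The point is only the comparison: probe accuracy curves measure how accessible the information is at each layer, not whether the model uses it.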
Conneau, Kruszewski, Lample, Barrault, Baroni (2018) — "What you can cram into a single \$&!#* vector: Probing sentence embeddings for linguistic properties"
- Question it answers: What kinds of linguistic information different sentence embeddings contain, and how this depends on encoder architecture and training objective.
- Explanation object it uses: Probing classifiers mapping a fixed sentence embedding to labels for a collection of probing tasks.
- Metric/evidence it uses: Classification accuracy on a set of probing tasks designed to target surface, syntactic, and semantic properties.
- Other choices made: Uses a small MLP probe (and reports logistic regression variants) and constructs tasks with controls intended to reduce shortcuts (e.g. word-content controls, order-sensitive tasks, semantic anomaly detection).
- The observation produced: Many embeddings carry substantial recoverable linguistic information, but different objectives/architectures emphasize different properties, and strong downstream performance does not translate into uniformly strong probing performance across task types.
Tenney et al. (2019) — "What do you learn from context? Probing for sentence structure in contextualized word representations"
- Question it answers: How linguistic structure is distributed across layers in contextualized word representations, and how much of that structure relies on non-local context.
- Explanation object it uses: A unified “edge probing” setup (unrelated to the edges of mechanistic-interpretability computational graphs) that predicts labels for spans or span pairs from representations taken from specific layers; spans act as nodes and labels as edges.
- Metric/evidence it uses: Task performance (F1) across a range of span/edge prediction problems, compared across layers and model variants.
- Other choices made: Uses a fixed probing architecture (span representation + small MLP) applied across tasks; introduces baselines that restrict context (local CNN) and baselines that isolate architectural priors (random orthonormal encoders), and analyzes performance as a function of distance between spans.
- The observation produced: Different layers tend to make different kinds of structure most accessible (with a broad trend from more local/syntactic accessibility to more abstract/semantic accessibility), and contextual information beyond a local window plays a measurable role for many structure-sensitive tasks.
Geva, Schuster, Berant, Levy (2021) — "Transformer Feed-Forward Layers Are Key-Value Memories"
- Question it answers: What transformer feed-forward layers do internally, and whether they can be interpreted as a structured collection of reusable memories.
- Explanation object it uses: An FFN interpreted as a key–value memory where each hidden unit corresponds to a memory cell; the input weight vector acts like a key that detects patterns, and the output weight vector acts like a value that pushes the model toward particular vocabulary outputs.
- Metric/evidence it uses: Retrieves high-activation training examples to identify human-recognizable trigger patterns for keys; analyzes how deleting pattern tokens affects activations; interprets values by mapping them into vocabulary space and examining induced token distributions.
- Other choices made: Focuses on a transformer language model trained on WikiText-103; samples memory cells per layer; treats layer output as a composition of many simultaneously active memories rather than a single dominant cell.
- The observation produced: Many keys correspond to identifiable textual patterns, values often induce meaningful output-vocabulary preferences (especially in higher layers), and FFN outputs behave like compositional mixtures of many memory contributions that are further refined across depth.
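The value-interpretation step can be sketched in a few lines: score every vocabulary item against an FFN value vector via an unembedding matrix (random and stand-in here) and read off the top tokens the value promotes. Planting one value as a multiple of a token’s embedding row makes the mechanism visible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in shapes: hidden size 64, FFN width 256, vocabulary of 100 tokens.
d_model, d_ff, vocab = 64, 256, 100
E = rng.normal(size=(vocab, d_model))      # stand-in unembedding matrix
W_out = rng.normal(size=(d_ff, d_model))   # FFN output weights; row i is value v_i

def top_tokens(v, k=5):
    # Adding v to the residual stream shifts the logits by (roughly) E @ v,
    # so a value's "meaning" is read off from its k top-scoring tokens.
    return np.argsort(-(E @ v))[:k]

# Plant an interpretable value: memory cell 0 strongly promotes token 7.
W_out[0] = 3.0 * E[7]
promoted = top_tokens(W_out[0])
```

In the real analysis `E` is the model’s own unembedding and most values are less clean than the planted one; the projection only tells you what a value would promote, not when its key fires.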
Geva, Caciularu, Wang, Goldberg (2022) — "Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space"
- Question it answers: How FFN layers contribute to building next-token predictions across the network, and whether those contributions can be made interpretable in a shared space.
- Explanation object it uses: Views the token representation as inducing a vocabulary distribution, and the FFN output as an additive update in that vocabulary space; decomposes each FFN update into per-parameter-vector sub-updates that can be inspected individually.
- Metric/evidence it uses: Interpretability of sub-updates via their top-scoring vocabulary items and human annotation of whether these sets correspond to coherent semantic or syntactic concepts; layer-wise statistics on how often such coherence appears.
- Other choices made: Analyzes autoregressive decoder LMs (including a WikiText-trained model and GPT2); compares full FFN updates to decomposed sub-updates to show why the decomposition helps.
- The observation produced: Many sub-updates correspond to coherent, human-recognizable token sets (increasingly in later layers), suggesting that FFNs often act by promoting concept-relevant candidates in vocabulary space rather than producing opaque dense updates.
Elhage et al. (2022) — "Toy Models of Superposition"
- Question it answers: How and when do models represent more features than they have dimensions?
- Explanation object it uses: Synthetic features and their learned directions in a low-dimensional space, studied through toy models that compress sparse inputs through a bottleneck.
- Metric/evidence it uses: Behavior of trained toy models under varying sparsity and feature/dimension ratios, including how learned directions organize and how interference emerges when multiple features share representational capacity.
- Other choices made: Compares simple linear and ReLU-based variants and varies feature sparsity and importance to study when superposition is favored by optimization.
- The observation produced: Superposition can be an efficient representational strategy when features are sparse; computation is possible in superposition and feature interference is not symmetric; superposition undergoes phase change depending on sparsity and importance of features; during training superposition arises from discrete energy jumps; correlation of features affects their relative location in superposition.
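The static picture behind superposition can be shown without any training: pack more random unit directions than dimensions, and a sparse input is still recoverable because interference between the directions is small (on the order of $1/\sqrt{d}$). The paper’s contribution is studying when optimization actually chooses such arrangements; the counts below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# 256 sparse features squeezed into 64 dimensions via random unit directions.
n_features, d = 256, 64
W = rng.normal(size=(d, n_features))
W /= np.linalg.norm(W, axis=0)        # column i is the direction for feature i

def reconstruct(x):
    # Toy-model readout: encode into the bottleneck, decode with a ReLU
    # (the ReLU clips most of the negative interference terms).
    return np.maximum(W.T @ (W @ x), 0.0)

# A 1-sparse input is recovered almost exactly despite n_features > d ...
x = np.zeros(n_features)
x[10] = 1.0
rec_sparse = reconstruct(x)

# ... because interference between random directions stays well below 1.
gram = W.T @ W
interference = np.abs(gram - np.eye(n_features)).max()
```

With dense inputs the interference terms add up instead of being clipped, which is why the trade-off depends so strongly on feature sparsity.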
Belrose, Ostrovsky, McKinney, Furman, Smith, Halawi, Biderman, Steinhardt (2023) — "Eliciting Latent Predictions from Transformers with the Tuned Lens"
- Question it answers: What intermediate layers predict before the final layer, and how to read out those latent predictions in a way that is more stable and meaningful than naive logit lens snapshots.
- Explanation object it uses: A per-layer affine translator (tuned lens) that maps a layer’s residual stream into the final representation space and then applies the model’s unembedding.
- Metric/evidence it uses: Quality of intermediate predictions (via perplexity/cross-entropy) and a causal-fidelity-style analysis that compares influential directions under the lens to influential directions for the model itself using ablation-based influence measures.
- Other choices made: Trains translators by distilling from the model’s own final-layer output distribution to reduce “probe learns an unrelated predictor” concerns; includes a learnable bias term; proposes a procedure to extract influential “basis directions” and quantify alignment.
- The observation produced: Naive logit lens readouts can be noisy and systematically biased because layer representations drift; learned translators produce more accurate, smoother intermediate prediction trajectories and better track model-relevant directions.
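In a linearized toy setting the translator idea reduces to fitting an affine map from mid-layer states to final states. The real tuned lens is trained by distilling the model’s final-layer distribution under a KL objective, so the least-squares fit below is only a stand-in, and the synthetic “drift” and shapes are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "representation drift": final-layer states are an affine
# transformation of layer-ell states plus noise.
n, d = 4000, 32
A_true = rng.normal(size=(d, d)) / np.sqrt(d)
b_true = rng.normal(size=d)
h_mid = rng.normal(size=(n, d))                    # layer-ell residual states
h_final = h_mid @ A_true.T + b_true + 0.1 * rng.normal(size=(n, d))

# Logit-lens baseline: pretend the mid-layer state is already final.
err_identity = float(np.mean((h_mid - h_final) ** 2))

# Tuned-lens-style translator: fit an affine map by least squares.
X = np.hstack([h_mid, np.ones((n, 1))])
T = np.linalg.lstsq(X, h_final, rcond=None)[0]
err_tuned = float(np.mean((X @ T - h_final) ** 2))
```

The identity readout pays the full price of the drift, while the fitted translator gets within the noise floor, which is the tuned lens’s core argument against raw logit-lens snapshots.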
Cunningham, Ewart, Riggs, Huben, Sharkey (2023) — "Sparse Autoencoders Find Highly Interpretable Features in Language Models"
- Question it answers: Whether we can recover a more interpretable feature basis from transformer activations than the neuron basis, motivated by the idea that neurons are polysemantic due to superposition.
- Explanation object it uses: Sparse autoencoder features learned to reconstruct internal activations, with sparse hidden activations treated as feature activations and decoder directions treated as feature vectors.
- Metric/evidence it uses: Reconstruction error and sparsity as training signals; interpretability scoring via an automated explain-and-predict-activations procedure; comparisons to baselines (neuron basis, random directions, PCA, ICA); and a causal localization test that patches activations along feature directions and measures output divergence.
- Other choices made: Trains on residual stream activations of Pythia models with tied weights and an L1 penalty; adapts activation patching to the learned feature basis and selects compact feature sets for a behavior using a greedy circuit-discovery-style ordering, measuring KL divergence to a counterfactual target output.
- The observation produced: Sparse autoencoders can recover many features that are more interpretable (and often more monosemantic) than neurons or common linear baselines, and those features can support more compact causal edits for at least one studied behavior.
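The SAE decomposition itself is small enough to sketch. Below the activations are synthetic (sparse ground-truth features in superposition), the encoder is random with a negative bias to induce sparsity, and only the decoder is fitted, in closed form; real SAEs train encoder, decoder, and bias jointly with SGD on the MSE + L1 objective, so treat this purely as the shape of the architecture and loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations": dense 16-d vectors built from 48 sparse
# ground-truth features, so n_features > dimensions (superposition).
d, n_feat, n = 16, 48, 2048
directions = rng.normal(size=(n_feat, d))
codes = (rng.random((n, n_feat)) < 0.05) * rng.random((n, n_feat))
X = codes @ directions

# Overcomplete autoencoder: h = relu(x @ W_e + b), x_hat = h @ W_d.
n_hidden = 64
W_e = 0.1 * rng.normal(size=(d, n_hidden))
b = -0.3 * np.ones(n_hidden)               # negative bias -> sparser h

h = np.maximum(X @ W_e + b, 0.0)               # sparse feature activations
W_d = np.linalg.lstsq(h, X, rcond=None)[0]     # decoder rows = feature vectors
X_hat = h @ W_d

mse = float(np.mean((X - X_hat) ** 2))   # reconstruction term of the loss
l1 = float(np.abs(h).mean())             # the sparsity penalty term
sparsity = float((h > 0).mean())         # fraction of active features
```

In the trained version the L1 term is what pushes each hidden unit toward firing on one ground-truth feature; here the decomposition is only as good as the random encoder allows.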
Bricken et al. (2023) — "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning"
- Question it answers: Whether sparse autoencoders can produce a decomposition of a model’s activations into units that are more monosemantic than neurons while still being sufficiently complete and useful for circuit-style analysis.
- Explanation object it uses: Learned sparse features from autoencoders trained on MLP activations, plus a set of feature visualizations (top activating examples, downstream logit effects, and feature-space maps) intended to make the feature space navigable.
- Metric/evidence it uses: Multiple evidence routes, including detailed case studies of specific features, broader human and automated interpretability analyses (on activations and on downstream effects), and causal interventions/steering by activating features; also studies how properties change with dictionary size.
- Other choices made: Focuses on a one-layer transformer with a 512-neuron ReLU MLP trained on The Pile; trains autoencoders on billions of activation samples with expansion factors ranging from 1× to 256×; emphasizes practical training details such as the importance of bias terms and resampling dead autoencoder neurons; explores feature-space structure with UMAP and compares features across independently trained model seeds.
- The observation produced: Sparse autoencoders can surface features that are largely invisible in the neuron basis, provide more monosemantic units for analysis, exhibit “feature splitting” as dictionaries scale, and yield features that appear partly universal across model instances; some features can be used directly to steer generation when activated.
3.3 What circuits causally mediate a behavior?
This line of work is usually referred to as mechanistic interpretability. The goal is to identify a causal story about how a particular behavior is implemented inside the model. The motivating picture is that a transformer can be treated as a computational graph with attention heads and MLPs reading from a shared residual stream, doing some computations, and then writing an update back in. A circuit is a relatively small subgraph that is causally responsible for some behavior.
In many ways this is the more difficult question to concretize and answer faithfully. It requires making a number of choices, most of which have been shown to have a significant impact on what circuits are identified and what conclusions can be drawn. In order to narrow down the circuit search space you first have to define the behavior of interest (e.g. indirect object identification, the greater-than operation, factual recall), frame it as a measurable function of the model’s output (e.g. a logit difference between correct/incorrect tokens, a distillation loss, correct-token probability), decide what your circuit is allowed to be made of (whole layers, heads, neurons, edges, or even explicit paths through the graph), and decide what interventions you will use to test causality. Different works then operationalize “finding the circuit” in different ways: by testing a hand-built hypothesis, by greedily deleting edges that do not matter under an intervention, by optimizing a sparse mask over edges, or by replacing patching with a faster proxy score that tries to approximate the effect of interventions.
The main failure mode is that it is surprisingly easy to get a small subgraph that performs well given the choices made while not actually capturing the model’s real mechanism. Off-distribution corruptions can break the model in ways that change what importance means. Some metrics can hide negative contributions or get distorted by cancellation effects. Automated search procedures can be sensitive to thresholds, ordering, or approximation error. Interventions that act on subspaces rather than concrete components can create an interpretability illusion where many different subspaces can lead to the intended intervention results, even when they do not correspond to a meaningful causal variable. In practice, circuit work requires careful analysis and justification of different choices.
Important choices
- What behavior you are explaining: You need a scalar target that you can measure repeatedly across many examples: logit difference between two candidates, next-token KL divergence, task accuracy, loss, or a bespoke score.
- What counts as a circuit: Circuits can be defined over different atoms: nodes (layers/heads/MLPs), edges between nodes, or explicit input-to-output paths. The right granularity depends on whether you want a coarse localization or a genuinely mechanistic story about information flow.
- What it means to remove part of the circuit: Removing a graph edge is not a straightforward operation in transformers; you have to define what replaces the missing contribution. Common choices try to keep activations in-distribution by swapping in activations from a corrupted or resampled example, rather than zero ablating them.
- Necessity vs sufficiency style tests: You can test necessity by knocking out components and checking the behavior breaks, or test sufficiency by keeping only the proposed circuit and checking the behavior survives.
- How counterfactual inputs are constructed: You typically need a clean and a corrupted (or resampled) input. The corruption should ideally preserve general information not relevant to the task while changing the task-relevant information.
- What metric you optimize or threshold on: A divergence between output distributions (e.g. KL) behaves differently from a logit difference or an accuracy metric, especially when there are negative contributors or cancellation. It is also easy to accidentally optimize a metric that is convenient rather than one that matches the causal claim you want.
- How the circuit is found: Options include manual hypothesis building + testing, greedy edge deletion, gradient-based scoring as a proxy for intervention effects, direct optimization of sparse edge masks, or analytic decomposition methods that avoid interventions entirely.
- How you validate: Faithfulness-style tests (ablating everything outside the circuit), robustness across prompt templates/datasets, comparisons against random circuits of the same size, and checking that conclusions do not flip under small methodological changes (corruption method, metric, patch site).
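The logic of an activation-patching test fits in a few lines on a toy “model” with two head-like components (everything here is invented for illustration): cache activations on a clean run, replace one component’s activation in the corrupted run, and measure how much of the clean-vs-corrupt output difference is restored.

```python
# Toy "model": the output score is the sum of two head contributions.
def head_a(tokens):
    return tokens[0]          # reads the task-relevant token

def head_b(tokens):
    return 0.1 * tokens[1]    # reads an irrelevant token

def run(tokens, patch_a=None, patch_b=None):
    # An intervention replaces a head's activation with a cached value.
    a = head_a(tokens) if patch_a is None else patch_a
    b = head_b(tokens) if patch_b is None else patch_b
    return a + b

clean = [1.0, 5.0]       # task token says "answer A"
corrupt = [-1.0, 5.0]    # task token flipped; everything else identical

clean_out, corrupt_out = run(clean), run(corrupt)

# Fraction of the clean-vs-corrupt output gap restored by a patch.
def recovery(patched_out):
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)

rec_a = recovery(run(corrupt, patch_a=head_a(clean)))  # mediates the behavior
rec_b = recovery(run(corrupt, patch_b=head_b(clean)))  # does not
```

In a real transformer the same pattern runs via hooks at a specific layer/head/position, and off-distribution corruptions can make the recovery number misleading, which is exactly why the corruption choice above matters.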
Paper summaries
Elhage, Nanda, Olsson, et al. (2021) — "A Mathematical Framework for Transformer Circuits"
-
Question it answers: Aim to discover simple algorithmic patterns, motifs, or frameworks that can subsequently be applied to larger and more complex models.
-
Explanation object it uses: A decomposition view of transformers where components communicate by writing into the residual stream; attention heads split into QK (where to attend) and OV (what to write) computations; and “path expansions” that express logits as sums of end-to-end path contributions.
-
Metric/evidence it uses: Primarily mathematical derivations plus reverse engineering of small attention-only toy models to show the framework makes concrete predictions about behavior.
-
Other choices made: Focuses on very small, simplified transformers (including attention-only models) to keep the space of possible mechanisms small enough to analyze directly.
-
Observation produced: A lot of transformer behavior becomes easier to reason about when you explicitly track additive residual contributions and treat attention as separable QK and OV structure, with behavior arising from compositions of these pieces.
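The additive residual-stream view can be made concrete in a few lines: because the unembedding is linear, the final logits decompose exactly into per-component contributions (ignoring the final layer norm, which breaks exact linearity in real models). All names and shapes below are illustrative.

```python
import numpy as np

# Each component writes additively into the residual stream, so the logits
# decompose exactly into per-component contributions through the (linear)
# unembedding. Real models have a final LayerNorm that breaks exactness.
rng = np.random.default_rng(1)
d_model, vocab = 8, 5
embed = rng.normal(size=d_model)     # token embedding write
head_out = rng.normal(size=d_model)  # stand-in for an attention head's write
mlp_out = rng.normal(size=d_model)   # stand-in for an MLP block's write
W_U = rng.normal(size=(d_model, vocab))  # unembedding

resid_final = embed + head_out + mlp_out
logits = resid_final @ W_U

# Direct logit attribution: project each write through W_U separately.
contributions = {name: v @ W_U for name, v in
                 [("embed", embed), ("head", head_out), ("mlp", mlp_out)]}
recomposed = sum(contributions.values())
# recomposed equals logits exactly: the decomposition loses nothing.
```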
Chan, Garriga-Alonso, Goldowsky-Dill, et al. (2022) — "Causal Scrubbing: a method for rigorously testing interpretability hypotheses"
-
Question it answers: Given a mechanistic interpretability hypothesis, how can we test it in a way that is more systematic than ad-hoc ablations?
-
Explanation object it uses: A formal hypothesis that links an interpretable computational graph to a model’s computational graph via a correspondence, interpreted as a claim about which internal distinctions matter and which can be scrubbed away.
-
Metric/evidence it uses: Behavior-preserving resampling ablations: replace activations with resampled activations that should be equivalent under the hypothesis, and measure how much the model’s behavior (loss/task metric) changes.
-
Other choices made: Emphasizes resampling from an appropriate data distribution to keep interventions on-distribution, and applies the procedure recursively to scrub away all dependencies that the hypothesis claims should be irrelevant.
-
Observation produced: Hypotheses can be evaluated and iteratively refined by progressively making stronger claims and checking when performance breaks, giving a framework for answering “how faithful is this interpretation?”
Wang, Steinhardt, Evans (2022) — "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small"
-
Question it answers: What sparse internal circuit in GPT‑2 small causally mediates indirect object identification (IOI) behavior?
-
Explanation object it uses: A circuit described as a subgraph over specific transformer components and connections.
-
Metric/evidence it uses: Intervention-based causal evidence using a task-specific scalar metric (logit difference between the correct and incorrect indirect object), with both knockout-style ablations (necessity) and path patching (localizing causal influence along specific routes).
-
Other choices made: Uses mean-ablation style interventions computed from a reference distribution to define knocking out components while keeping activations closer to the model’s typical operating regime; evaluates circuit quality with explicit criteria (faithfulness, completeness, minimality).
-
Observation produced: A relatively small set of 26 interacting attention heads (and their connections) accounts for most of the IOI behavior, with identifiable functional roles and some redundancy.
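The mean-ablation convention used here can be sketched in a few lines, with made-up shapes: replacing a head's output with its average over a reference batch keeps the downstream computation near its typical statistics, unlike zero ablation.

```python
import numpy as np

# Illustrative shapes only: outputs of one attention head over a reference batch.
rng = np.random.default_rng(2)
head_outputs = rng.normal(loc=0.5, size=(64, 8))  # (batch, d_model)
mean_value = head_outputs.mean(axis=0)

def ablate(head_output, mode):
    """Knock out the head's contribution under two common conventions."""
    if mode == "zero":
        return np.zeros_like(head_output)
    if mode == "mean":
        return mean_value  # keeps the typical activation statistics
    raise ValueError(mode)

sample = head_outputs[0]
# Distance from the head's typical output under each ablation:
off_dist_zero = float(np.linalg.norm(ablate(sample, "zero") - mean_value))
off_dist_mean = float(np.linalg.norm(ablate(sample, "mean") - mean_value))
# Zero ablation lands far from the reference statistics whenever the head's
# mean output is nonzero; mean ablation lands exactly on them.
```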
Goldowsky-Dill, MacLeod, Sato, Arora (2023) — "Localizing Model Behavior with Path Patching"
-
Question it answers: How can we express and quantitatively test hypotheses that a model behavior is localized to a set of mediating paths (not just a set of nodes)?
-
Explanation object it uses: A “localization hypothesis” specifying a computational graph, a choice of mediator set (nodes or paths), and a dissimilarity metric for comparing model outputs under interventions.
-
Metric/evidence it uses: Defines quantitative measures such as average unexplained effect (linked to natural indirect effects) and uses path patching interventions to estimate how much of the behavior is mediated by the hypothesized set.
-
Other choices made: Makes counterfactual construction a central knob (resampling, corrupting, mean/zero ablation) and treats the choice of output metric as part of the hypothesis.
-
Observation produced: Path patching provides a principled way to test “this set of paths mediates the behavior,” and helps diagnose where a hypothesis fails by attributing residual unexplained effect.
Conmy, Mavor-Parker, Lynch, Heimersheim, Garriga-Alonso (2023) — "Towards Automated Circuit Discovery for Mechanistic Interpretability"
-
Question it answers: Can we automate the step of finding which connections between abstract components form a circuit for a behavior?
-
Explanation object it uses: A circuit defined as a sparse set of edges in a chosen computational graph over abstract units.
-
Metric/evidence it uses: Activation-patching-based edge evaluation, typically using a divergence (e.g. KL) between the full model’s output and the output produced when certain edges are replaced by activations from a corrupted run.
-
Other choices made: Uses a greedy deletion procedure (ACDC) controlled by thresholds and an edge order consistent with the graph’s partial order, aiming for sparsity while keeping the output close to the full model.
-
Observation produced: Automated pruning can recover known circuit structures on standard mechanistic interpretability tasks, substantially reducing the amount of manual edge-by-edge work.
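A heavily simplified sketch of the greedy loop: assume a toy graph where each edge contributes additively to a 3-logit output, try removing each edge (patching in its corrupted-run contribution), and keep the removal if the KL to the full model stays under a threshold. Real ACDC walks edges in the graph's partial order and patches actual activations; everything here is a stand-in.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl(p, q):  # KL divergence between two output distributions
    return float(np.sum(p * np.log(p / q)))

# Toy graph: each edge contributes additively to a 3-logit output,
# with a clean-run and a corrupted-run contribution. Illustrative only.
rng = np.random.default_rng(3)
edges = ["e1", "e2", "e3", "e4"]
clean = {e: rng.normal(size=3) for e in edges}
corrupt = {e: rng.normal(size=3) for e in edges}

def output(active):
    # Kept edges contribute their clean value; pruned edges are replaced by
    # their corrupted-run value (patching, not zero ablation).
    return softmax(sum(clean[e] if e in active else corrupt[e] for e in edges))

full = output(set(edges))
threshold = 0.05
circuit = set(edges)
for e in edges:  # real ACDC walks edges in the graph's partial order
    trial = circuit - {e}
    if kl(full, output(trial)) < threshold:
        circuit = trial  # removing this edge barely moves the output: prune it
```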
Zhang, Nanda (2023) — "Towards Best Practices of Activation Patching in Language Models: Metrics and Methods"
-
Question it answers: How sensitive is activation patching to methodological choices, and what should we standardize if we want reliable mechanistic conclusions?
-
Explanation object it uses: Localization results produced by activation patching (which components look causally important under restore-style interventions), applied across several tasks.
-
Metric/evidence it uses: Empirical comparisons across evaluation metrics (e.g. probability-based vs logit-difference-style scores) and corruption methods (e.g. Gaussian noising vs semantically-related token replacement), showing that these choices can materially change conclusions.
-
Other choices made: Analyzes both localization and downstream circuit discovery, and compares single-site interventions against “sliding window” joint patching to show how multi-site interactions complicate interpretation.
-
Observation produced: Some common choices can produce inconsistent or misleading localization, especially when corruptions push the model off distribution or when the metric discards signed contributions.
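A tiny numerical example of the metric-sensitivity point (numbers are illustrative): when logits are saturated, a probability metric can hide a large signed change that a logit-difference metric reports directly.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Logits for [correct, incorrect] before and after an intervention
# (illustrative numbers): the intervention removes 4 logits of signal.
before = np.array([10.0, 0.0])
after = np.array([6.0, 0.0])

prob_change = float(softmax(before)[0] - softmax(after)[0])
logit_diff_change = float((before[0] - before[1]) - (after[0] - after[1]))
# prob_change is under 0.01 because softmax has saturated, while the
# logit difference reports the full 4.0: a probability-based score would
# call this intervention unimportant.
```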
Makelov, Lange, Geiger, Nanda (2024) — "Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching"
-
Question it answers: When we patch only a subspace of an activation, does success imply that the subspace corresponds to a meaningful causal variable or mechanism?
-
Explanation object it uses: Subspace activation patching experiments that intervene on projected components of activations rather than on concrete model components or full activations.
-
Metric/evidence it uses: Theoretical and empirical demonstrations showing that many different subspaces can yield similar behavioral restoration, creating an illusion of having identified “the” causal subspace.
-
Other choices made: Varies how the subspace is chosen (including settings where it is not aligned with an intended causal feature) and studies how conclusions change under these variations.
-
Observation produced: A successful patch is not, by itself, evidence that the patched subspace is the model’s internal representation of the causal variable you care about.
Hanna, Pezzelle, Belinkov (2024) — "Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms"
-
Question it answers: How should we evaluate circuit discovery methods, and is circuit overlap a good proxy for having found the right mechanism?
-
Explanation object it uses: Circuits as edge-defined subgraphs, with a focus on scalable circuit-finding via edge scoring rather than exhaustive interventions.
-
Metric/evidence it uses: Proposes and evaluates EAP with integrated gradients (EAP‑IG) and assesses circuits using a faithfulness criterion: edges outside the circuit should be ablatable without changing the model’s behavior on the task.
-
Other choices made: Uses integrated gradients to reduce failure modes of gradient-based edge scoring (e.g. vanishing/zero gradients), and compares overlap against faithfulness across tasks.
-
Observation produced: High overlap with a reference circuit does not guarantee faithfulness, and EAP‑IG can produce more faithful circuits than simpler gradient approximations even when overlap looks similar.
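The integrated-gradients idea behind EAP-IG can be sketched with a toy scalar metric `f` that has saturating gradients (the real method scores transformer edges; everything here is a stand-in): average the gradient along the straight line from the corrupted to the clean activation, then dot with the activation delta.

```python
import numpy as np

def f(a):
    return float(np.tanh(a).sum())  # toy metric with saturating gradients

def grad_f(a):
    return 1.0 - np.tanh(a) ** 2

a_clean = np.array([3.0, -2.0])    # activation on the clean run
a_corrupt = np.array([0.0, 0.0])   # activation on the corrupted run

# Integrated gradients: average the gradient along the straight line from
# corrupted to clean, then take the dot product with the activation delta.
steps = 50
alphas = (np.arange(steps) + 0.5) / steps  # midpoint rule
avg_grad = np.mean([grad_f(a_corrupt + t * (a_clean - a_corrupt))
                    for t in alphas], axis=0)
ig_score = float((a_clean - a_corrupt) @ avg_grad)

# Completeness: the IG score recovers f(clean) - f(corrupt), whereas a single
# gradient at the clean point misestimates it (here it even flips the sign,
# because tanh has saturated at a_clean).
plain_grad_score = float((a_clean - a_corrupt) @ grad_f(a_clean))
```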
Bhaskar, Wettig, Friedman, Chen (2024) — "Finding Transformer Circuits with Edge Pruning"
-
Question it answers: Can circuit discovery be posed as a scalable optimization problem that yields sparse, faithful circuits without greedy edge-by-edge search?
-
Explanation object it uses: Learnable binary masks over edges in a transformer’s computational graph (implemented via a disentangled residual stream so edge-level reads can be gated).
-
Metric/evidence it uses: Optimizes a loss that matches circuit outputs to full-model outputs (e.g. KL divergence over token predictions), while enforcing sparsity via L0-style regularization; removed edges are treated counterfactually by replacing their contributions with activations from a corrupted example.
-
Other choices made: Requires paired clean/corrupted examples to define the counterfactual semantics of “missing edges,” and targets circuits that match behavior across a task distribution rather than on a single prompt.
-
Observation produced: Produces circuits that are substantially sparser than prior approaches while remaining comparably faithful on standard circuit-finding tasks, and scales to larger datasets and models.
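The edge-mask semantics can be sketched on a toy additive model: the mask interpolates each edge between its clean and corrupted contribution, and a sparsity-penalized matching objective is minimized by plain projected gradient descent (a crude stand-in for the L0-style machinery used in the paper; all values are illustrative).

```python
import numpy as np

# Toy setting: six edges, each with a clean-run and a corrupted-run
# contribution; pretend only the first three carry the behavior.
rng = np.random.default_rng(4)
clean = rng.normal(size=6)
corrupt = rng.normal(size=6)
target = clean[:3].sum() + corrupt[3:].sum()

def output(m):
    # Mask semantics: m_i = 1 keeps edge i's clean contribution,
    # m_i = 0 patches in its corrupted-run contribution.
    return float(np.sum(m * clean + (1 - m) * corrupt))

lam = 0.001  # sparsity penalty weight (L1 here, L0-style in the paper)
m = np.full(6, 0.5)
for _ in range(2000):
    grad = 2 * (output(m) - target) * (clean - corrupt) + lam
    m = np.clip(m - 0.02 * grad, 0.0, 1.0)  # projected gradient step

circuit = m > 0.5  # threshold the relaxed mask into a discrete circuit
match_error = abs(output(m) - target)
```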
Hsu, Zhou, Cherapanamjeri, et al. (2025) — "Efficient Automated Circuit Discovery in Transformers Using Contextual Decomposition"
-
Question it answers: Can we do automated circuit discovery efficiently without relying on activation patching (or gradients), and still recover fine-grained circuits?
-
Explanation object it uses: Contextual decomposition for transformers (CD‑T): an analytical decomposition of activations into “relevant” and “irrelevant” constituents that can be propagated through transformer modules to score influence between nodes.
-
Metric/evidence it uses: Derives decomposition rules for transformer operations (including attention) and uses them to compute relevance scores that drive circuit selection; evaluates by how well recovered circuits align with reference circuits and by faithfulness-style checks (including comparisons against random circuits).
-
Other choices made: Treats the choice of initial decomposition (what counts as “relevant” at the source) as a crucial analogue of choosing an ablation distribution, and emphasizes compatibility across transformer architectures.
-
Observation produced: CD‑T can produce circuits at different levels of abstraction and granularity (down to attention heads at specific positions) with strong empirical recovery and runtime improvements over patching-based methods in the tested settings.
3.4 How can we change behavior reliably?
This line of work is about making targeted changes to a model’s behavior without doing more training. The goal can be, for example, changing or updating “facts” inside the model or steering the model away from toxic behavior. Defining, achieving, and verifying the exact scope of the change is challenging: you want a change that is precise, robust to how you ask, and doesn’t quietly break other things.
Making the question concrete forces you to pick what kind of change you want, where you are willing to intervene, and what success means. A common framing is to treat editing as a constrained optimization problem: push the model to prefer some desired completion under a small family of prompts, while changing as little as possible elsewhere. Once you scale beyond a single edit, you also have to decide how edits are composed (one at a time vs jointly), and what you want to happen when edits interact.
The main failure modes are exactly what you would expect from trying to modify a complex system. An edit can succeed on the exact prompt you optimized for and still fail to generalize; it can leak into nearby facts; it can subtly degrade unrelated behavior; and when you apply many edits, small amounts of interference can accumulate into forgetting or even abrupt breakdown where the model becomes hard to edit further. A second, less obvious failure mode is evaluation mismatch: you can get strong results on a narrow prompt format and still not have the new behavior reliably show up during normal text generation.
Important choices
-
What behavior you are trying to change: In these papers the target is usually a factual association (“subject–relation–object”), but you still have to decide what counts as success: top-1 prediction for a short completion, a probability margin over the old fact, or changes that persist in free-form generation.
-
How you specify the edit: You can specify a fact explicitly (as a structured triplet) or implicitly (as an example prompt + desired completion). Even with the same underlying fact, the exact prompt template matters because it defines what the method is trained/optimized against.
-
Where you intervene: Parameter edits force you to choose a site: which layer(s), which module type, and often which token position (e.g. the “subject position” for a subject–relation query). Some work uses causal localization to guide this choice.
-
What object you actually change: In the papers I have focused on so far the intervention is a weight update, typically to the MLP output projection in one layer or across a range of layers. It is possible to also modify activations without updating weights.
-
How you define the internal write you want to perform: A common pattern is to compute a “key” representation tied to the subject/context, then solve for a “value” vector that makes the model prefer the desired completion. There is usually an explicit optimization step here, plus heuristics for making the key/value stable across contexts.
-
How you control collateral damage: Constrain the size of the weight change, bias the edit toward minimal interference with other activations (often via covariance statistics), or regularize to preserve the model’s behavior on prompts meant to capture the subject’s broader semantics.
-
What you measure as reliable: It is rarely enough to check the exact edited prompt. Typical axes are: efficacy on the edit prompt, generalization to paraphrases, specificity on nearby/unrelated facts, and whether generations remain coherent. For scaling, you also care about forgetting of earlier edits and degradation on downstream tasks.
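The "key–value write" choice can be sketched as a rank-one update in the style of ROME, heavily simplified: here I assume identity key covariance and take `v_star` as given, whereas the real method estimates covariance statistics and optimizes the value vector to elicit the desired completion.

```python
import numpy as np

# Simplified rank-one edit: W maps MLP "keys" to residual "values". We assume
# identity key covariance and a given v_star; real ROME estimates covariance
# statistics and optimizes v_star to elicit the desired completion.
rng = np.random.default_rng(5)
W = rng.normal(size=(4, 6))          # (d_value, d_key)

k = rng.normal(size=6)
k = k / np.linalg.norm(k)            # key: representation of the subject
v_star = rng.normal(size=4)          # value that writes the new fact

# Rank-one update: change W's output exactly at k, minimally elsewhere.
delta = np.outer(v_star - W @ k, k) / (k @ k)
W_new = W + delta

k_orth = rng.normal(size=6)
k_orth -= (k_orth @ k) * k           # any key orthogonal to k is untouched
```

The update sends `k` exactly to `v_star` while leaving every key orthogonal to `k` unchanged, which is the "precise write, minimal collateral" property the surrounding choices are trying to achieve.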
Paper summaries
Meng, Bau, Andonian, Belinkov (2022) — "Locating and Editing Factual Associations in GPT"
-
Question it answers: Can we localize where a factual association is mediated inside a transformer, and then rewrite that single association with a small, targeted weight change that generalizes beyond the exact prompt?
-
Explanation object it uses: A localized weight edit to an MLP layer’s output projection, constructed as a rank-one update that inserts a new key–value association tied to the subject and intended to make the model prefer a new object.
-
Metric/evidence it uses: Uses causal-style intervention experiments to motivate which internal sites matter, then evaluates edits on factual-recall benchmarks with measures for success on the edit prompt, generalization to paraphrases, and specificity on unrelated/neighbor prompts. It also includes generation-based checks for whether free-form text about the subject becomes consistent with the edit.
-
Other choices made: Computes a “key” vector from subject activations, optimizes a “value” vector to elicit the desired completion, and adds an explicit regularizer to reduce semantic drift of the subject while the new fact is being written.
-
Observation produced: A single low-rank update at an appropriate MLP site can reliably flip a targeted factual association, often generalizing to paraphrased prompts while limiting spillover to nearby facts.
Meng, Sharma, Andonian, Belinkov, Bau (2023) — "Mass-Editing Memory in a Transformer"
-
Question it answers: How can we apply many factual edits efficiently?
-
Explanation object it uses: A set of coordinated weight updates distributed across a range of MLP layers.
-
Metric/evidence it uses: Empirical evaluation on large batches of edits, measuring the usual trio of edit success, paraphrase generalization, and specificity, plus generation-focused checks. Also reports compute/runtime tradeoffs for scaling to large edit sets.
-
Other choices made: Solves for per-edit target representations and then computes layer-wise updates in a way that supports batching. Uses covariance/statistical structure to bias the update toward “least interfering” changes, and accounts for the fact that editing earlier layers changes the activations seen by later layers.
-
Observation produced: Spreading the write across multiple layers and solving for a joint update can support large-scale editing more efficiently and with better retention/specificity than repeatedly applying single-fact edits.
Gupta, Rao, Anumanchipalli (2024) — "Model Editing at Scale leads to Gradual and Catastrophic Forgetting"
-
Question it answers: What happens when we apply many sequential edits to the same model? Do edits remain effective, do earlier edits persist, and do we preserve downstream capabilities?
-
Explanation object it uses: An evaluation framework for sequential editing that tracks both how well the next edit works (editing proficiency) and how much earlier edited knowledge is forgotten (edit retention), alongside broader model capability.
-
Metric/evidence it uses: Runs long edit sequences and measures per-edit success, paraphrase generalization, neighborhood specificity, the fraction of earlier edits that are forgotten, and downstream-task performance. It also analyzes parameter-change magnitudes to diagnose failure events.
-
Other choices made: Argues that the choice of dataset and prompt format matters for what you are actually measuring (e.g. QA-style edits may not transfer cleanly to completion-style generation), and designs the scaling experiments to avoid trivial conflicts such as repeatedly editing the same subject.
-
Observation produced: Scaling can produce a two-phase pattern: gradual degradation/forgetting as edits accumulate, and then occasional catastrophic failures triggered by particular edits that cause unusually large parameter shifts and sharply reduce editability and downstream performance.
3.5 Which training samples caused this behavior?
This line of work aims to trace a model’s outputs back to the training examples that most shaped them. The kind of insight you might want is: if you had trained on a slightly different dataset, how would the model’s behavior on this particular example have changed?
Making that question concrete requires you to define what you mean by “behavior” and “caused.” You need to define a score for each training example, relative to some model output function you care about (loss, margin/logit, an alignment score, etc.). The ultimate test for causality is retraining the model on a modified training dataset, but this is not possible in most situations, so the effect needs to be approximated.
The main failure mode is that it’s easy to confuse “looks related” with “was causally responsible.”
Important choices
-
What behavior you are attributing: You need a scalar output function $f(z;\theta)$ that stands in for “the behavior.” Common choices include prediction loss, a correct-vs-incorrect margin/logit, or a task-specific score like an embedding alignment objective.
-
What “caused” means as a training-set intervention: The cleanest definition is counterfactual: compare the model trained on $S$ versus trained on $S'$ (e.g., removing a set of training points). Another common lens is infinitesimal up-weighting, where importance is defined by the directional effect of increasing one example’s training weight.
-
How you convert per-example scores into a statement about a subset: Many methods implicitly assume additivity: the “importance” of a subset is the sum of its members’ importances. That gives a concrete prediction rule $g_\tau(z, S') = \sum_{i\in S'}\tau_i$ and makes evaluation possible, but it is still a modeling assumption about how influences compose.
-
How you handle training non-determinism: The quantity $f(z;\theta^*(S'))$ is often a random variable because training is stochastic, even when $S'$ is fixed. You can average over multiple training runs, or compute attributions using multiple trained models/checkpoints to stabilize estimates.
-
What approximation family you use to avoid retraining: If you can’t literally retrain for each counterfactual, you need an approximation that turns training example influence into something computable from trained model(s), often by relating it to gradients and a local model of how the final parameters respond to training data.
-
How you evaluate whether the attributions mean what you think they mean: Manual inspection can be a sanity check but doesn’t scale and is easy to fool. A more objective approach is to score methods by how well their attributions predict true counterfactual outputs across many random training subsets (e.g., via rank correlation).
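The subset-prediction evaluation can be sketched end to end with a toy additive ground truth standing in for retraining: `tau` is a hypothetical noisy attribution method, predictions for each subset come from the additivity assumption, and `lds` is its linear-datamodeling-style rank-correlation score.

```python
import numpy as np

# Toy ground truth: each training example has an additive per-example effect,
# and "retraining on a subset" is just summing effects over that subset.
# tau is a hypothetical noisy attribution method being evaluated.
rng = np.random.default_rng(6)
n_train = 20
true_influence = rng.normal(size=n_train)
tau = true_influence + 0.1 * rng.normal(size=n_train)

subsets = [rng.random(n_train) < 0.5 for _ in range(100)]
true_outputs = np.array([true_influence[s].sum() for s in subsets])
predicted = np.array([tau[s].sum() for s in subsets])  # additivity assumption

def spearman(a, b):
    # rank correlation = Pearson correlation of the ranks (no ties here)
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

lds = spearman(predicted, true_outputs)
# A good attribution method predicts the counterfactuals well, so lds ~ 1.
```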
Paper summaries
Park, Georgiev, Ilyas, Leclerc, Madry (2023) — "TRAK: Attributing Model Behavior at Scale"
-
Question it answers: Are there data attribution methods that are both scalable and effective in large-scale non-convex settings?
-
Explanation object it uses: A data attribution function $\tau(z,S)\in\mathbb{R}^n$ that assigns a real-valued importance score to each training example $z_i\in S$ for a chosen model output function $f(z;\theta^*(S))$.
-
Metric/evidence it uses: It turns attribution scores into counterfactual predictions via additivity ($g_\tau(z,S’)=\sum_{i\in S’}\tau_i$), then evaluates with the linear datamodeling score (LDS), a Spearman correlation between predicted and true outputs across many randomly sampled training subsets.
-
Other choices made: The concrete TRAK estimator uses projected parameter gradients and a learned reweighting (implemented via an ensemble of trained models and random projections) to approximate counterfactual influence efficiently; for classification they use a margin-style output function $f(z;\theta)=\log\frac{p(z;\theta)}{1-p(z;\theta)}$.
-
Observation produced: TRAK sits on a better compute–efficacy frontier than a range of prior approaches on the LDS benchmark, and it can surface training examples whose removal produces large counterfactual behavior changes (including cases where nearest-neighbor retrieval is not the same thing as causal influence).
3.6 Can we go from an interpretable program to a network?
This line of work starts from the opposite direction of most interpretability. Instead of taking a trained model and trying to infer what it is doing, it begins with a human-readable computation and builds a network that implements it. The point is not that these networks are realistic language models, but that they give a controlled setting where the ground truth mechanism is known. That makes them useful both as a didactic object (you can watch a transformer execute an algorithm step by step) and as a calibration tool for interpretability methods (you can test whether a method recovers the structure you already know is there).
Making this idea concrete forces you to be explicit about what kinds of programs you are talking about and what it means to “compile” them. In practice you need to choose a programming model that lines up with transformer-style computation, decide how intermediate variables are represented in the residual stream, and decide how program operations map onto attention and MLP blocks. Once you do that, you still have design freedom in how tightly you want to preserve the original semantics versus how much you want to push the resulting network toward something that looks more like a learned model (for example by compressing representations).
The main failure mode is that you can accidentally build a model that is too friendly. Compiled models can bake in conveniences that real models do not have: unnaturally clean variable separation, basis choices that make inspection trivial, or architectural simplifications that distort what you are trying to study. Even if the model produces the right outputs, internal representations can drift if you introduce training or compression steps, which means you may lose the very “known mechanism” that motivated the setup.
Important choices
-
What kind of program counts as interpretable: In practice you usually restrict to programs that operate on sequences and are expressible in a transformer-aligned DSL: things like selection, aggregation, and simple elementwise transformations.
-
What the network is supposed to implement: Often the target is a deterministic sequence-to-sequence algorithm, not a probabilistic next-token distribution. That choice matters because it changes what kinds of mechanisms you can express and what “correctness” even means.
-
How program operations map to transformer components: A common pattern is to map elementwise computations to MLP blocks and selection/aggregation computations to attention heads, then stack these blocks subject to the transformer’s layer structure.
-
How intermediate variables live in the residual stream: You need a representation scheme for different variables (e.g. assign them to different subspaces).
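A minimal numpy sketch of the compile idea: a "shift right one position" program step becomes a hand-built one-hot attention pattern, and the input and output variables live in disjoint residual-stream coordinates. None of this is Tracr's actual API; it only illustrates the select-then-aggregate mapping.

```python
import numpy as np

tokens = np.array([3, 1, 4, 1, 5])
n = len(tokens)

# Residual stream with one coordinate per variable: column 0 holds the
# input variable, column 1 will hold the program's output variable.
resid = np.zeros((n, 2))
resid[:, 0] = tokens  # "embed": write the input into its subspace

# Program step out[i] = input[i-1], compiled as select + aggregate:
# a hard one-hot attention pattern selects the previous position.
attn = np.zeros((n, n))
for i in range(1, n):
    attn[i, i - 1] = 1.0
attn[0, 0] = 1.0  # BOS-style default at the first position

# "Aggregate": read the input subspace at the selected positions and write
# the result into the output subspace.
resid[:, 1] = attn @ resid[:, 0]
shifted = resid[:, 1]
```

Because the variables occupy disjoint coordinates and the attention pattern is exactly one-hot, the "known mechanism" is readable off the weights, which is precisely the convenience (and the unrealism) discussed above.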
Paper summaries
Lindner, Kramár, Farquhar, Rahtz, McGrath, Mikulik (2023) — "Tracr: Compiled Transformers as a Laboratory for Interpretability"
-
Question it answers: Can we compile human-readable sequence programs into standard transformer weights in a way that yields models with known internal structure, suitable for designing interpretability experiments and for evaluating interpretability tools against a ground-truth mechanism?
-
Explanation object it uses: A compiled transformer whose components correspond to a traced program graph, with intermediate variables embedded into designated residual-stream subspaces and operations implemented via hand-constructed attention/MLP blocks.
-
Metric/evidence it uses: Empirical demonstration by compiling multiple algorithmic programs (e.g. counting-like computations, sorting, parenthesis checking) and verifying correct behavior; plus a compression case study where a learned projection compresses the residual stream while tracking both output loss and layer-wise similarity.
-
Other choices made: Introduces an “assembly-like” intermediate representation to simplify constructing blocks; restricts the source language to avoid selector compositions that do not map cleanly to attention; requires explicit categorical vs numerical encoding annotations and a BOS token; compiles into a decoder-only transformer implementation without layer norms; uses heuristic layer allocation.
-
Observation produced: Compiled models can act as controlled test cases and a didactic tool for understanding how transformer components implement multi-step algorithms. When compressing compiled models, the learned projection can drop unnecessary features and induce superposition-like reuse of dimensions, but internal representations may also change even when outputs remain correct.
Reading list
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (2024)
- Transcoders Find Interpretable LLM Feature Circuits (2024)
- Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits (2025)
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca (2023)
- Finding Neurons in a Haystack: Case Studies with Sparse Probing (2023)
- Weight-sparse transformers have interpretable circuits
- Neurons in Large Language Models: Dead, N-gram, Positional
- Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition
- Information Flow Routes: Automatically Interpreting Language Models at Scale
- RelP: Faithful and Efficient Circuit Discovery via Relevance Patching (2025)
- In-context Learning and Induction Heads (2022)
- AtP*: An Efficient and Scalable Method for Localizing LLM Behavior to Components (2024)
- Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability (2023)
- What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation (2024)
- Progress Measures for Grokking via Mechanistic Interpretability (2023)
- A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task (2024)
- Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models (2024)
- EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification (2025)
- Automatically Identifying Local and Global Circuits with Linear Computation Graphs (2024)
- Sheaf Discovery with Joint Computation Graph Pruning and Flexible Granularity (2025)
- Decomposition of Small Transformer Models (2025)
- How to use and interpret activation patching (2024)
- TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research (2025)
- Localized Definitions and Distributed Reasoning: A Proof-of-Concept Mechanistic Interpretability Study via Activation Patching (2025)
- BERT Rediscovers the Classical NLP Pipeline (2019)
- Visualizing and Measuring the Geometry of BERT (2019)
- Identifying and Controlling Important Neurons in Neural Machine Translation (2018)
- Augmenting Deep Classifiers with Polynomial Neural Networks (2022)
- Quantifying Attention Flow in Transformers (2020)
- Differentiable Subset Pruning of Transformer Heads (2021)
- Uncovering hidden geometry in Transformers via disentangling position and context (2023)
- SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability (2017)
- Similarity of Neural Network Representations Revisited (CKA) (2019)
- A Structural Probe for Finding Syntax in Word Representations (2019)
- Improving Dictionary Learning with Gated Sparse Autoencoders (2024)
- Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders (2024)
- Efficient Dictionary Learning with Switch Sparse Autoencoders (2024)
- Decomposing The Dark Matter of Sparse Autoencoders (2024)
- Transcoders Beat Sparse Autoencoders for Interpretability (2025)
- Are Sparse Autoencoders Useful? A Case Study in Sparse Probing (2025)
- Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders (2025)
- Automatically Interpreting Millions of Features in Large Language Models (2024)
- Route Sparse Autoencoder to Interpret Large Language Models (2025)
- Language models can explain neurons in language models (2023)
- Discovering Latent Knowledge in Language Models Without Supervision (2022)
- Sanity Checks for Saliency Maps (2018)
- SmoothGrad: removing noise by adding noise (2017)
- Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization (2017)
- Learning Important Features Through Propagating Activation Differences (DeepLIFT) (2017)
- Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) (2018)
- Anchors: High-Precision Model-Agnostic Explanations (2018)
- On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation (2015)
- Attention is not Explanation (2019)
- Rationalizing Neural Predictions (2016)
- Knowledge Circuits in Pretrained Transformers (2024)
- Understanding Language Model Circuits through Knowledge Editing (2024)
- Robust and Scalable Model Editing for Large Language Models (2024)
- Editing Large Language Models: Problems, Methods, and Opportunities (2023)
- Steering Llama 2 via Contrastive Activation Addition (2023)
- Style Vectors for Steering Generative Large Language Models (2024)
- LogitLens4LLMs: Extending Logit Lens Analysis to Modern Large Language Models (2025)
- Unsupervised decoding of encoded reasoning using language model interpretability (2025)
- Activation Steering for Masked Diffusion Language Models (2025)
- Mechanistic Interpretability for Steering Vision-Language-Action Models (2025)
- InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques (2024)
- Neural Decompiling of Tracr Transformers (2024)
- Thinking Like Transformers (2021)
- Learning Transformer Programs (2023)
- ALTA: Compiler-Based Analysis of Transformers (2024)
- Understanding Black-box Predictions via Influence Functions (2017)
- Estimating Training Data Influence by Tracing Gradient Descent (TracIn) (2020)
- Data Shapley: Equitable Valuation of Data for Machine Learning (2019)
- Mechanistic Interpretability for AI Safety — A Review (2024)
- A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models (2024)
- A Primer in BERTology: What We Know About How BERT Works (2020)
- The Explainability of Transformers: Current Status and Directions (2024)
- Knowledge Editing for Large Language Models: A Survey (2023)
- A Comprehensive Study of Knowledge Editing for Large Language Models (2024)
- Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models (2025)
- Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling (2023)
- Mixture of Experts Made Intrinsically Interpretable (2025)