notes on interpretability
work in progress
As I read more interpretability research I have felt the need to systematize all the different approaches and ways of thinking I was coming across. The content on this page started out as my personal attempt at introducing some structure to my own thinking about interpretability. Driven by the challenge of giving a single universal definition of interpretability, I have started to organize the papers I have been reading by the kind of insight they aim to produce. Once we know the question a piece of work is aiming to answer, we can look at the concrete experimental choices needed to make answering that question tractable. Beyond helping to better understand existing research, I believe viewing work this way can be helpful in developing one’s own path from a hypothesis to a way of testing it. The material mainly focuses on transformer LLMs, but I include any relevant ideas even if they have not yet been applied to LLMs. At some point I decided it would be helpful to share this publicly and engage with the community, in the hope that someone else might find it useful and/or contribute their thoughts. Please email me if you have any feedback; I am always excited to chat and learn more!
1 What is interpretability?
Defining interpretability is non-trivial, as interpretation or justification is inherent to how we humans understand the world. At this stage I will refrain from trying to define it precisely and instead try to broadly define the goals of interpretability research. For the purposes of this document I will define it as the practice of producing explanations of model behavior that are useful under some combination of scientific truth-seeking, human understanding, and actionability. There is no single best explanation even in much more established sciences: general relativity is closer to being ‘true’ than Newtonian mechanics, but you can go to the moon using just Newtonian mechanics, and trying to take general relativity into account would only make your life more difficult. It is thus important to evaluate explanations against some desiderata and/or intended use. I have found that most explanations can be viewed through the following lenses:
- Descriptive
- Mechanistic/causal
- Actionable/control oriented
2 What makes a good explanation?
The most important concepts that are desirable for an explanation are:
- Plausibility: explanation looks meaningful to humans
- Faithfulness: explanation tracks the model’s actual causal dependencies
- Sufficiency: keeping only the explanatory factors preserves the behavior
- Comprehensiveness: removing the explanatory factors breaks the behavior
- Stability: small perturbations that don’t change the output shouldn’t drastically change the explanation
Assessing the different concepts above poses different challenges. For example, plausibility is usually intuitive to judge but often requires the right way to visualize the explanation. On the other hand, faithfulness/sufficiency/comprehensiveness require much more careful experimental design and consideration. The best ways to define and evaluate these concepts have been the subject of continued discussion.
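To make the sufficiency/comprehensiveness pair concrete, here is a minimal sketch of both tests. The linear “model”, the instance, the claimed explanation, and the zero baseline for feature absence are all hypothetical choices made for illustration:

```python
import numpy as np

# Hypothetical "model": a linear scorer over 8 input features.
w = np.array([4.0, 3.0, 0.1, 0.05, 0.0, -0.1, 0.02, 0.01])

def model(x):
    return float(x @ w)

x = np.ones(8)            # the instance whose output we are explaining
explanation = [0, 1]      # claimed explanatory factors: features 0 and 1
baseline = np.zeros(8)    # "absence" modeled by zeroing a feature

def keep_only(x, idx):
    out = baseline.copy()
    out[idx] = x[idx]
    return out

def remove(x, idx):
    out = x.copy()
    out[idx] = baseline[idx]
    return out

full = model(x)
# Sufficiency: keeping only the explanatory factors preserves the behavior.
sufficiency_gap = abs(full - model(keep_only(x, explanation)))
# Comprehensiveness: removing the explanatory factors breaks the behavior.
comprehensiveness_gap = abs(full - model(remove(x, explanation)))
```

Here the gaps come out around 0.08 and 7.0 respectively, so for this toy scorer the explanation {0, 1} is both approximately sufficient and comprehensive.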
3 Question based taxonomy
This section attempts to classify different interpretability work into categories based on the broad questions they try to answer or the insights they try to gain. Within each category we further refine the work by the specific choices made in order to make the insight concrete. At the end of each section is a list of papers with short summaries that follow (at least approximately) this schema:
- Question it answers
- Explanation object it uses
- Metric or evidence it uses
- Other choices made
- The observation produced
Obviously many works will not fit this taxonomy exactly and I might change it in the future as I learn more.
3.1 Which part of the input is important for this output?
This line of work tries to connect parts of the input to the output a model produces on that input. The kind of insight you get is descriptive and local: for this particular example (or a narrow neighborhood around it), which bits of the input mattered most for the prediction? In practice this is often used for debugging and establishing trust, because it can surface cases where the model is right for the wrong reasons and it can sometimes make it easier to compare models beyond aggregate accuracy.
Making that question concrete requires a handful of design decisions. You have to decide what output you are explaining, what counts as an input part (some human-interpretable representation), and how you measure contribution. Different methods operationalize contribution differently: some fit a simple explanation model in a local neighborhood and treat its weights as the explanation; some treat contribution as sensitivity (gradients) but stabilize it by integrating gradients along a baseline-to-input path; others define contribution via a feature-masking game that averages marginal effects across many subsets, then approximate that target efficiently. As the final assessment of the usefulness of the explanation is performed by humans, it is also important to choose how many features you present. Finally, as with many other ML problems, the approach needs to be computationally tractable.
The main failure mode is that it is easy to produce explanations that look plausible while not actually tracking the model’s causal dependencies. A few recurring reasons: the interpretable representation can be too weak to capture what the model is doing; local approximations fail when the model is highly non-linear even near the point you are interested in; gradient-based attributions can collapse under saturation unless you’re careful about how you aggregate gradients; and any definition of missingness introduces assumptions that can change what the attribution means. Empirical evaluation is also messy because perturbing inputs can create out-of-distribution artifacts that make it difficult to establish if the explanation method is wrong or the model is simply reacting to weird inputs.
Important choices
- What output you are trying to explain: A class probability, a pre-softmax score/logit, or some other scalar score the model produces.
- What counts as an “input part”: Individual tokens, token presence/absence (bag-of-words), superpixels/patches, or tabular features. You often have to choose an interpretable representation that is not the model’s native feature space.
- How “absence” is defined: A baseline input intended to represent absence of signal, or a notion of missingness implemented by replacing removed features with reference values or with an expectation over a background distribution.
- How locality is defined: What neighborhood you sample around the instance, what perturbations you generate in that neighborhood, and how you weight them by proximity.
- How contribution is measured: Local surrogate weights; gradient aggregation along a path from baseline to input; approximate Shapley-style marginal contributions under feature masking.
- How big the explanation is allowed to be: Sparsity/length constraints (top-K features, a simple explanation model family) so the result stays human-interpretable (what counts as interpretable is itself somewhat subjective).
- Approximation and compute budget: Number of perturbed samples for surrogate/masking methods, number of gradient evaluations/integration steps for path methods, and any regularization needed to make estimation stable.
Paper summaries
Ribeiro, Singh, Guestrin (2016) — "Why Should I Trust You?" Explaining the Predictions of Any Classifier
- Question it answers: For a specific prediction, which interpretable representations of the input drove the model locally?
- Explanation object it uses: An interpretable explanation model $g$ defined over a binary interpretable representation $x’$ (e.g. word presence/absence, superpixel presence/absence).
- Metric/evidence it uses: Fits $g$ to approximate the black-box model $f$ in a locality around the instance by minimizing a locality-weighted loss plus a complexity penalty. Evaluates usefulness via faithfulness-style tests on inherently interpretable models. Simulates some interesting trust tasks.
- Other choices made: Perturbations are generated by randomly masking interpretable components; locality is enforced with a proximity kernel; explanation size is capped by a feature budget $K$. Also introduces a submodular pick procedure (SP-LIME) to select a small set of instances whose explanations provide broader coverage of model behavior (approximate global fidelity).
- Observation produced: Local sparse explanations can reveal spurious features and dataset issues that accuracy misses, help users compare models, and debug models (develop new features, identify inconsistencies).
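The core LIME recipe fits in a short sketch on a toy black box (the function `f`, the kernel width, and the sample count below are illustrative choices, not the paper’s): perturb the binary representation, weight samples by proximity, and fit a weighted linear surrogate whose coefficients are the explanation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy black-box "model" over a 6-token input, seen only through its scores.
# mask[i] = 1 keeps token i, 0 removes it; tokens 0 and 3 matter most.
def f(mask):
    return 2.0 * mask[0] + 1.0 * mask[3] + 0.05 * mask.sum()

d = 6
x_prime = np.ones(d)                                  # the instance, as x'
Z = rng.integers(0, 2, size=(2000, d)).astype(float)  # perturbed neighbors z'
y = np.array([f(z) for z in Z])

# Proximity kernel: neighbors that drop fewer tokens get more weight.
dist = (Z != x_prime).sum(axis=1)
weights = np.exp(-(dist ** 2) / 4.0)

# Weighted ridge regression for the surrogate g(z') = w . z' + b.
A = np.hstack([Z, np.ones((len(Z), 1))])
Aw = A * weights[:, None]
coef = np.linalg.solve(A.T @ Aw + 1e-6 * np.eye(d + 1), Aw.T @ y)
attributions = coef[:d]   # per-token contributions; coef[d] is the intercept
```

Because this toy `f` happens to be linear in the mask, the surrogate recovers it essentially exactly; for a real model the fit is only trustworthy inside the sampled locality.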
Sundararajan, Taly, Yan (2017) — Axiomatic Attribution for Deep Networks
- Question it answers: How should we attribute a deep network’s prediction to input features in a way that is principled, rather than relying on heuristics that can fail in predictable ways?
- Explanation object it uses: A per-feature attribution vector defined relative to a baseline input, with attributions computed as integrated gradients along a baseline-to-input path.
- Metric/evidence it uses: Axiomatic justification: proposes Sensitivity (if one feature causes output difference it should be given non-zero attribution) and Implementation Invariance (model implementation should not alter attribution if input-output behavior is the same) as requirements and shows integrated gradients satisfy them; also uses a completeness identity (attributions sum to $F(x) - F(x_0)$) as a practical sanity check for numerical approximation.
- Other choices made: Baseline selection is treated as a core design choice (meant to represent absence of signal and ideally yield a near-neutral prediction); the path integral is approximated with a finite number of gradient evaluations, increasing the number of steps until the completeness check is reasonably tight.
- Observation produced: Integrating gradients along a baseline-to-input path avoids sensitivity failure (saturation/flat regions yielding near-zero gradients at the input) while retaining implementation invariance.
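A sketch of integrated gradients on a toy saturating network with an analytic gradient (the model, baseline, and step count are my illustrative choices). Note how the raw gradient at the input is nearly zero due to saturation while the path-integrated attributions are not, and how completeness gives a cheap numerical sanity check:

```python
import numpy as np

# Toy network with saturation: F(x) = tanh of a weighted sum.
w = np.array([3.0, 1.0, 0.0])

def F(x):
    return np.tanh(x @ w)

def grad_F(x):
    return (1 - np.tanh(x @ w) ** 2) * w   # analytic gradient

x = np.array([2.0, 2.0, 2.0])   # input, deep in the saturated region
x0 = np.zeros(3)                # baseline meant to represent absence of signal

# Midpoint Riemann approximation of the path integral of gradients.
m = 1000
alphas = (np.arange(m) + 0.5) / m
grads = np.array([grad_F(x0 + a * (x - x0)) for a in alphas])
ig = (x - x0) * grads.mean(axis=0)

# Completeness sanity check: attributions should sum to F(x) - F(x0).
gap = abs(ig.sum() - (F(x) - F(x0)))
```

The attributions split roughly 3:1 between the first two features (matching their weights), the zero-weight feature gets exactly zero, and the gradient at `x` itself is vanishingly small, which is exactly the failure a raw-gradient method would hit here.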
Lundberg, Lee (2017) — A Unified Approach to Interpreting Model Predictions
- Question it answers: Aims to analyze a number of approaches to assigning feature importance (LIME, DeepLIFT, layer-wise relevance propagation) under a unifying framework (Shapley values) and propose a method satisfying properties desirable of value attribution methods.
- Explanation object it uses: An additive explanation model $g(z’)=\phi_0+\sum_i \phi_i z’_i$ over binary feature presence indicators, where the explanation is the set of $\phi_i$ values.
- Metric/evidence it uses: A theoretical result: within additive feature attributions, there is a unique solution satisfying local accuracy, missingness, and consistency, which corresponds to Shapley values. Since most models cannot literally accept arbitrary missing-feature inputs, missingness is formalized via conditional expectations.
- Other choices made: Defines SHAP values as Shapley values of a conditional-expectation version of the model and introduces efficient estimators, including a regression-based estimator with a specific weighting kernel (Kernel SHAP). Discusses practical approximation assumptions (e.g. feature independence, local linearity) that simplify conditional expectations.
- Observation produced: A broad class of existing explanation methods can be seen as instances of the same additive template, but heuristic choices can violate desirable properties; a principled kernel/estimation procedure can recover Shapley-consistent attributions more efficiently than naive Shapley sampling.
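The exact Shapley target is easy to write down for a toy set function, since it only needs an exponential enumeration over coalitions; Kernel SHAP exists precisely because this does not scale. The set function `v` below, including its interaction term, is a made-up example:

```python
from itertools import combinations
from math import factorial

# Toy set function: the model's output when only the features in S are "present".
# Features 0 and 1 are individually useful; feature 2 only helps together with 0.
def v(S):
    S = frozenset(S)
    out = 0.0
    if 0 in S:
        out += 2.0
    if 1 in S:
        out += 1.0
    if {0, 2} <= S:
        out += 0.5    # interaction term
    return out

def shapley(v, n):
    # Exact Shapley values: average each player's marginal contribution
    # over all subsets of the other players, with the standard weights.
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += weight * (v(set(S) | {i}) - v(S))
    return phi

phi = shapley(v, 3)
```

By construction the attributions sum to $v(\text{all}) - v(\emptyset)$ (the local-accuracy/efficiency property), and the 0–2 interaction is split evenly between features 0 and 2.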
3.2 What is represented, where, or in what geometry?
This line of work focuses on what information is present inside a model’s representations, where it tends to live, and what form it takes. Sometimes the question is very concrete (“does layer 6 contain enough information to recover dependency relations?”), and sometimes it is closer to a mechanistic story (“what are feed-forward layers actually doing when they update the residual stream?”). The kind of insight you get is mostly descriptive: you learn what is decodable or legible from internal states, how this changes across depth, and what might be a better basis for understanding the model than raw neurons. There is a close connection to mechanistic and/or actionable insights.
Making the question concrete requires picking a representation to study and a notion of feature/information that you are aiming to extract. Many approaches treat representations as something you can read with a probe by training a classifier or regressor on a frozen hidden state. High accuracy can be an indicator that the information required for the classification/regression task is there. Other approaches try to make the representation interpretable by finding a better coordinate system: map hidden states into vocabulary space to view them as evolving token distributions, interpret a feed-forward layer as a structured sum of key–value memory contributions, or explicitly learn a new feature basis using a sparse autoencoder so that the basic units are sparse features rather than neurons.
The main failure mode is interpreting decodability as mechanism. A probe can succeed because the information is in the representation in a way that is accessible to the probe, without that implying the model uses the information for the behavior of interest. Results can also be sensitive to probe capacity, to how you aggregate representations, and to representation drift across layers.
Important choices
- What internal object you treat as your representation: A token-level hidden state in the residual stream at a given layer, a span representation built from multiple tokens, a pooled sentence embedding, an MLP activation vector.
- What you are trying to recover from it: Linguistic labels and relations (surface, syntax, semantics), next-token distributions at intermediate layers (“latent predictions”), or some higher-level concepts.
- What kind of readout you allow: Linear probes if you want a conservative notion of accessibility, richer probes (e.g. small MLPs) if you care about whether the information is present at all.
- What you take a feature to be: Neurons, arbitrary directions in activation space, structured sub-components of a computation (e.g. per-neuron sub-updates inside an FFN), or learned sparse features from an overcomplete dictionary.
- How you try to make interpretations legible: Comparing probe performance across depth, projecting contributions into vocabulary space, retrieving high-activation dataset examples, clustering learned features for exploration, or training sparse autoencoders to obtain sparse feature activations.
- What evidence counts as success: Probe accuracy/F1, perplexity or KL divergence for intermediate predictions, human annotation of whether top-ranked tokens/examples form a meaningful concept, reconstruction error for learned decompositions, and automated interpretability-style scoring where a model-generated description predicts held-out activations.
Paper summaries
Alain, Bengio (2016) — "Understanding intermediate layers using linear classifier probes"
- Question it answers: How the quality of intermediate representations evolves across depth and training, and whether this can be monitored in a way that is useful for understanding or debugging learning dynamics.
- Explanation object it uses: A set of linear classifier probes trained on frozen intermediate activations (one probe per layer), interpreted via their performance curves.
- Metric/evidence it uses: Probe training/validation loss and accuracy as a function of layer and training time.
- Other choices made: Probes are trained to predict the task labels without backpropagating into the base model; probe capacity is intentionally limited to avoid the probe itself doing heavy lifting.
- The observation produced: Deeper layers tend to make task-relevant information more linearly accessible, and probe trajectories over training can reveal optimization pathologies or architectural quirks that are hard to spot from final accuracy alone.
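A compact sketch of the probing setup on synthetic “activations” (the layers, dimensions, and noise levels are invented): the label is carried along the same direction at both layers, but a linear probe finds it far more accessible where the signal-to-noise ratio is higher.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "frozen activations": a binary label lives along one fixed
# direction at both layers, but layer 0 buries it in much more noise.
n, d = 2000, 32
y = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
signal = np.outer(2 * y - 1, direction)
acts = {0: signal + 6.0 * rng.normal(size=(n, d)),
        2: signal + 0.5 * rng.normal(size=(n, d))}

def probe_accuracy(X, y):
    # Least-squares linear probe (a cheap stand-in for logistic regression),
    # trained on the first half and scored on the held-out second half.
    m = len(X) // 2
    Xtr = np.hstack([X[:m], np.ones((m, 1))])
    w = np.linalg.lstsq(Xtr, 2.0 * y[:m] - 1.0, rcond=None)[0]
    Xte = np.hstack([X[m:], np.ones((len(X) - m, 1))])
    pred = (Xte @ w > 0).astype(int)
    return float((pred == y[m:]).mean())

acc = {layer: probe_accuracy(X, y) for layer, X in acts.items()}
```

The point is only the comparison: probe accuracy curves measure how accessible the information is at each layer, not whether the model uses it.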
Conneau, Kruszewski, Lample, Barrault, Baroni (2018) — "What you can cram into a single \$&!#* vector: Probing sentence embeddings for linguistic properties"
- Question it answers: What kinds of linguistic information different sentence embeddings contain, and how this depends on encoder architecture and training objective.
- Explanation object it uses: Probing classifiers mapping a fixed sentence embedding to labels for a collection of probing tasks.
- Metric/evidence it uses: Classification accuracy on a set of probing tasks designed to target surface, syntactic, and semantic properties.
- Other choices made: Uses a small MLP probe (and reports logistic regression variants) and constructs tasks with controls intended to reduce shortcuts (e.g. word-content controls, order-sensitive tasks, semantic anomaly detection).
- The observation produced: Many embeddings carry substantial recoverable linguistic information, but different objectives/architectures emphasize different properties, and strong downstream performance does not translate into uniformly strong probing performance across task types.
Tenney et al. (2019) — "What do you learn from context? Probing for sentence structure in contextualized word representations"
- Question it answers: How linguistic structure is distributed across layers in contextualized word representations, and how much of that structure relies on non-local context.
- Explanation object it uses: A unified “edge probing” setup (unrelated to the edges of mechanistic-interpretability computational graphs) that predicts labels for spans or span pairs from representations taken from specific layers; spans act as nodes and labels as edges.
- Metric/evidence it uses: Task performance (F1) across a range of span/edge prediction problems, compared across layers and model variants.
- Other choices made: Uses a fixed probing architecture (span representation + small MLP) applied across tasks; introduces baselines that restrict context (local CNN) and baselines that isolate architectural priors (random orthonormal encoders), and analyzes performance as a function of distance between spans.
- The observation produced: Different layers tend to make different kinds of structure most accessible (with a broad trend from more local/syntactic accessibility to more abstract/semantic accessibility), and contextual information beyond a local window plays a measurable role for many structure-sensitive tasks.
Geva, Schuster, Berant, Levy (2021) — "Transformer Feed-Forward Layers Are Key-Value Memories"
- Question it answers: What transformer feed-forward layers do internally, and whether they can be interpreted as a structured collection of reusable memories.
- Explanation object it uses: An FFN interpreted as a key–value memory where each hidden unit corresponds to a memory cell; the input weight vector acts like a key that detects patterns, and the output weight vector acts like a value that pushes the model toward particular vocabulary outputs.
- Metric/evidence it uses: Retrieves high-activation training examples to identify human-recognizable trigger patterns for keys; analyzes how deleting pattern tokens affects activations; interprets values by mapping them into vocabulary space and examining induced token distributions.
- Other choices made: Focuses on a transformer language model trained on WikiText-103; samples memory cells per layer; treats layer output as a composition of many simultaneously active memories rather than a single dominant cell.
- The observation produced: Many keys correspond to identifiable textual patterns, values often induce meaningful output-vocabulary preferences (especially in higher layers), and FFN outputs behave like compositional mixtures of many memory contributions that are further refined across depth.
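The value-interpretation step can be sketched in a few lines: score every vocabulary item against an FFN value vector via an unembedding matrix (random and stand-in here) and read off the top tokens the value promotes. Planting one value as a multiple of a token’s embedding row makes the mechanism visible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in shapes: hidden size 64, FFN width 256, vocabulary of 100 tokens.
d_model, d_ff, vocab = 64, 256, 100
E = rng.normal(size=(vocab, d_model))      # stand-in unembedding matrix
W_out = rng.normal(size=(d_ff, d_model))   # FFN output weights; row i is value v_i

def top_tokens(v, k=5):
    # Adding v to the residual stream shifts the logits by (roughly) E @ v,
    # so a value's "meaning" is read off from its k top-scoring tokens.
    return np.argsort(-(E @ v))[:k]

# Plant an interpretable value: memory cell 0 strongly promotes token 7.
W_out[0] = 3.0 * E[7]
promoted = top_tokens(W_out[0])
```

In the real analysis `E` is the model’s own unembedding and most values are less clean than the planted one; the projection only tells you what a value would promote, not when its key fires.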
Geva, Caciularu, Wang, Goldberg (2022) — "Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space"
- Question it answers: How FFN layers contribute to building next-token predictions across the network, and whether those contributions can be made interpretable in a shared space.
- Explanation object it uses: Views the token representation as inducing a vocabulary distribution, and the FFN output as an additive update in that vocabulary space; decomposes each FFN update into per-parameter-vector sub-updates that can be inspected individually.
- Metric/evidence it uses: Interpretability of sub-updates via their top-scoring vocabulary items and human annotation of whether these sets correspond to coherent semantic or syntactic concepts; layer-wise statistics on how often such coherence appears.
- Other choices made: Analyzes autoregressive decoder LMs (including a WikiText-trained model and GPT2); compares full FFN updates to decomposed sub-updates to show why the decomposition helps.
- The observation produced: Many sub-updates correspond to coherent, human-recognizable token sets (increasingly in later layers), suggesting that FFNs often act by promoting concept-relevant candidates in vocabulary space rather than producing opaque dense updates.
Elhage et al. (2022) — "Toy Models of Superposition"
- Question it answers: How and when do models represent more features than they have dimensions?
- Explanation object it uses: Synthetic features and their learned directions in a low-dimensional space, studied through toy models that compress sparse inputs through a bottleneck.
- Metric/evidence it uses: Behavior of trained toy models under varying sparsity and feature/dimension ratios, including how learned directions organize and how interference emerges when multiple features share representational capacity.
- Other choices made: Compares simple linear and ReLU-based variants and varies feature sparsity and importance to study when superposition is favored by optimization.
- The observation produced: Superposition can be an efficient representational strategy when features are sparse; computation is possible in superposition and feature interference is not symmetric; superposition undergoes phase change depending on sparsity and importance of features; during training superposition arises from discrete energy jumps; correlation of features affects their relative location in superposition.
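The static picture behind superposition can be shown without any training: pack more random unit directions than dimensions, and a sparse input is still recoverable because interference between the directions is small (on the order of $1/\sqrt{d}$). The paper’s contribution is studying when optimization actually chooses such arrangements; the counts below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# 256 sparse features squeezed into 64 dimensions via random unit directions.
n_features, d = 256, 64
W = rng.normal(size=(d, n_features))
W /= np.linalg.norm(W, axis=0)        # column i is the direction for feature i

def reconstruct(x):
    # Toy-model readout: encode into the bottleneck, decode with a ReLU
    # (the ReLU clips most of the negative interference terms).
    return np.maximum(W.T @ (W @ x), 0.0)

# A 1-sparse input is recovered almost exactly despite n_features > d ...
x = np.zeros(n_features)
x[10] = 1.0
rec_sparse = reconstruct(x)

# ... because interference between random directions stays well below 1.
gram = W.T @ W
interference = np.abs(gram - np.eye(n_features)).max()
```

With dense inputs the interference terms add up instead of being clipped, which is why the trade-off depends so strongly on feature sparsity.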
Belrose, Ostrovsky, McKinney, Furman, Smith, Halawi, Biderman, Steinhardt (2023) — "Eliciting Latent Predictions from Transformers with the Tuned Lens"
- Question it answers: What intermediate layers predict before the final layer, and how to read out those latent predictions in a way that is more stable and meaningful than naive logit lens snapshots.
- Explanation object it uses: A per-layer affine translator (tuned lens) that maps a layer’s residual stream into the final representation space and then applies the model’s unembedding.
- Metric/evidence it uses: Quality of intermediate predictions (via perplexity/cross-entropy) and a causal-fidelity-style analysis that compares influential directions under the lens to influential directions for the model itself using ablation-based influence measures.
- Other choices made: Trains translators by distilling from the model’s own final-layer output distribution to reduce “probe learns an unrelated predictor” concerns; includes a learnable bias term; proposes a procedure to extract influential “basis directions” and quantify alignment.
- The observation produced: Naive logit lens readouts can be noisy and systematically biased because layer representations drift; learned translators produce more accurate, smoother intermediate prediction trajectories and better track model-relevant directions.
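In a linearized toy setting the translator idea reduces to fitting an affine map from mid-layer states to final states. The real tuned lens is trained by distilling the model’s final-layer distribution under a KL objective, so the least-squares fit below is only a stand-in, and the synthetic “drift” and shapes are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "representation drift": final-layer states are an affine
# transformation of layer-ell states plus noise.
n, d = 4000, 32
A_true = rng.normal(size=(d, d)) / np.sqrt(d)
b_true = rng.normal(size=d)
h_mid = rng.normal(size=(n, d))                    # layer-ell residual states
h_final = h_mid @ A_true.T + b_true + 0.1 * rng.normal(size=(n, d))

# Logit-lens baseline: pretend the mid-layer state is already final.
err_identity = float(np.mean((h_mid - h_final) ** 2))

# Tuned-lens-style translator: fit an affine map by least squares.
X = np.hstack([h_mid, np.ones((n, 1))])
T = np.linalg.lstsq(X, h_final, rcond=None)[0]
err_tuned = float(np.mean((X @ T - h_final) ** 2))
```

The identity readout pays the full price of the drift, while the fitted translator gets within the noise floor, which is the tuned lens’s core argument against raw logit-lens snapshots.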
Cunningham, Ewart, Riggs, Huben, Sharkey (2023) — "Sparse Autoencoders Find Highly Interpretable Features in Language Models"
- Question it answers: Whether we can recover a more interpretable feature basis from transformer activations than the neuron basis, motivated by the idea that neurons are polysemantic due to superposition.
- Explanation object it uses: Sparse autoencoder features learned to reconstruct internal activations, with sparse hidden activations treated as feature activations and decoder directions treated as feature vectors.
- Metric/evidence it uses: Reconstruction error and sparsity as training signals; interpretability scoring via an automated explain-and-predict-activations procedure; comparisons to baselines (neuron basis, random directions, PCA, ICA); and a causal localization test that patches activations along feature directions and measures output divergence.
- Other choices made: Trains on residual stream activations of Pythia models with tied weights and an L1 penalty; adapts activation patching to the learned feature basis and selects compact feature sets for a behavior using a greedy circuit-discovery-style ordering, measuring KL divergence to a counterfactual target output.
- The observation produced: Sparse autoencoders can recover many features that are more interpretable (and often more monosemantic) than neurons or common linear baselines, and those features can support more compact causal edits for at least one studied behavior.
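The SAE decomposition itself is small enough to sketch. Below the activations are synthetic (sparse ground-truth features in superposition), the encoder is random with a negative bias to induce sparsity, and only the decoder is fitted, in closed form; real SAEs train encoder, decoder, and bias jointly with SGD on the MSE + L1 objective, so treat this purely as the shape of the architecture and loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations": dense 16-d vectors built from 48 sparse
# ground-truth features, so n_features > dimensions (superposition).
d, n_feat, n = 16, 48, 2048
directions = rng.normal(size=(n_feat, d))
codes = (rng.random((n, n_feat)) < 0.05) * rng.random((n, n_feat))
X = codes @ directions

# Overcomplete autoencoder: h = relu(x @ W_e + b), x_hat = h @ W_d.
n_hidden = 64
W_e = 0.1 * rng.normal(size=(d, n_hidden))
b = -0.3 * np.ones(n_hidden)               # negative bias -> sparser h

h = np.maximum(X @ W_e + b, 0.0)               # sparse feature activations
W_d = np.linalg.lstsq(h, X, rcond=None)[0]     # decoder rows = feature vectors
X_hat = h @ W_d

mse = float(np.mean((X - X_hat) ** 2))   # reconstruction term of the loss
l1 = float(np.abs(h).mean())             # the sparsity penalty term
sparsity = float((h > 0).mean())         # fraction of active features
```

In the trained version the L1 term is what pushes each hidden unit toward firing on one ground-truth feature; here the decomposition is only as good as the random encoder allows.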
Bricken et al. (2023) — "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning"
- Question it answers: Whether sparse autoencoders can produce a decomposition of a model’s activations into units that are more monosemantic than neurons while still being sufficiently complete and useful for circuit-style analysis.
- Explanation object it uses: Learned sparse features from autoencoders trained on MLP activations, plus a set of feature visualizations (top activating examples, downstream logit effects, and feature-space maps) intended to make the feature space navigable.
- Metric/evidence it uses: Multiple evidence routes, including detailed case studies of specific features, broader human and automated interpretability analyses (on activations and on downstream effects), and causal interventions/steering by activating features; also studies how properties change with dictionary size.
- Other choices made: Focuses on a one-layer transformer with a 512-neuron ReLU MLP trained on The Pile; trains autoencoders on billions of activation samples with expansion factors ranging from 1× to 256×; emphasizes practical training details such as the importance of bias terms and resampling dead autoencoder neurons; explores feature-space structure with UMAP and compares features across independently trained model seeds.
- The observation produced: Sparse autoencoders can surface features that are largely invisible in the neuron basis, provide more monosemantic units for analysis, exhibit “feature splitting” as dictionaries scale, and yield features that appear partly universal across model instances; some features can be used directly to steer generation when activated.
3.3 What circuits causally mediate a behavior?
This line of work is usually referred to as mechanistic interpretability. The goal is to identify a causal story about how a particular behavior is implemented inside the model. The motivating picture is that a transformer can be treated as a computational graph with attention heads and MLPs reading from a shared residual stream, doing some computations, and then writing an update back in. A circuit is a relatively small subgraph that is causally responsible for some behavior.
In many ways this is the more difficult question to concretize and answer faithfully. It requires making a number of choices, most of which have been shown to have a significant impact on what circuits are identified and what conclusions can be drawn. In order to narrow down the circuit search space you first have to define the behavior of interest (e.g. indirect object identification, the greater-than operation, factual recall), frame it as a measurable function of the model’s output (e.g. a logit difference between correct/incorrect tokens, a distillation loss, correct-token probability), decide what your circuit is allowed to be made of (whole layers, heads, neurons, edges, or even explicit paths through the graph), and decide what interventions you will use to test causality. Different works then operationalize “finding the circuit” in different ways: by testing a hand-built hypothesis, by greedily deleting edges that do not matter under an intervention, by optimizing a sparse mask over edges, or by replacing patching with a faster proxy score that tries to approximate the effect of interventions.
The main failure mode is that it is surprisingly easy to get a small subgraph that performs well given the choices made while not actually capturing the model’s real mechanism. Off-distribution corruptions can break the model in ways that change what importance means. Some metrics can hide negative contributions or get distorted by cancellation effects. Automated search procedures can be sensitive to thresholds, ordering, or approximation error. Interventions that act on subspaces rather than concrete components can create an interpretability illusion where many different subspaces can lead to the intended intervention results, even when they do not correspond to a meaningful causal variable. In practice, circuit work requires careful analysis and justification of different choices.
Important choices
- What behavior you are explaining: You need a scalar target that you can measure repeatedly across many examples: logit difference between two candidates, next-token KL divergence, task accuracy, loss, or a bespoke score.
- What counts as a circuit: Circuits can be defined over different atoms: nodes (layers/heads/MLPs), edges between nodes, or explicit input-to-output paths. The right granularity depends on whether you want a coarse localization or a genuinely mechanistic story about information flow.
- What it means to remove part of the circuit: Removing a graph edge is not a straightforward operation in transformers; you have to define what replaces the missing contribution. Common choices try to keep activations in-distribution by swapping in activations from a corrupted or resampled example, rather than zero ablating them.
- Necessity vs sufficiency style tests: You can test necessity by knocking out components and checking the behavior breaks, or test sufficiency by keeping only the proposed circuit and checking the behavior survives.
- How counterfactual inputs are constructed: You typically need a clean and a corrupted (or resampled) input. The corruption should ideally preserve general information not relevant to the task while changing the task-relevant information.
- What metric you optimize or threshold on: A divergence between output distributions (e.g. KL) behaves differently from a logit difference or an accuracy metric, especially when there are negative contributors or cancellation. It is also easy to accidentally optimize a metric that is convenient rather than one that matches the causal claim you want.
- How the circuit is found: Options include manual hypothesis building + testing, greedy edge deletion, gradient-based scoring as a proxy for intervention effects, direct optimization of sparse edge masks, or analytic decomposition methods that avoid interventions entirely.
- How you validate: Faithfulness-style tests (ablating everything outside the circuit), robustness across prompt templates/datasets, comparisons against random circuits of the same size, and checking that conclusions do not flip under small methodological changes (corruption method, metric, patch site).
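The logic of an activation-patching test fits in a few lines on a toy “model” with two head-like components (everything here is invented for illustration): cache activations on a clean run, replace one component’s activation in the corrupted run, and measure how much of the clean-vs-corrupt output difference is restored.

```python
# Toy "model": the output score is the sum of two head contributions.
def head_a(tokens):
    return tokens[0]          # reads the task-relevant token

def head_b(tokens):
    return 0.1 * tokens[1]    # reads an irrelevant token

def run(tokens, patch_a=None, patch_b=None):
    # An intervention replaces a head's activation with a cached value.
    a = head_a(tokens) if patch_a is None else patch_a
    b = head_b(tokens) if patch_b is None else patch_b
    return a + b

clean = [1.0, 5.0]       # task token says "answer A"
corrupt = [-1.0, 5.0]    # task token flipped; everything else identical

clean_out, corrupt_out = run(clean), run(corrupt)

# Fraction of the clean-vs-corrupt output gap restored by a patch.
def recovery(patched_out):
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)

rec_a = recovery(run(corrupt, patch_a=head_a(clean)))  # mediates the behavior
rec_b = recovery(run(corrupt, patch_b=head_b(clean)))  # does not
```

In a real transformer the same pattern runs via hooks at a specific layer/head/position, and off-distribution corruptions can make the recovery number misleading, which is exactly why the corruption choice above matters.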
Paper summaries
Elhage, Nanda, Olsson, et al. (2021) — "A Mathematical Framework for Transformer Circuits"
-
Question it answers: Aim to discover simple algorithmic patterns, motifs, or frameworks that can subsequently be applied to larger and more complex models.
-
Explanation object it uses: A decomposition view of transformers where components communicate by writing into the residual stream; attention heads split into QK (where to attend) and OV (what to write) computations; and “path expansions” that express logits as sums of end-to-end path contributions.
-
Metric/evidence it uses: Primarily mathematical derivations plus reverse engineering of small attention-only toy models to show the framework makes concrete predictions about behavior.
-
Other choices made: Focuses on very small, simplified transformers (including attention-only models) to keep the space of possible mechanisms small enough to analyze directly.
-
Observation produced: A lot of transformer behavior becomes easier to reason about when you explicitly track additive residual contributions and treat attention as separable QK and OV structure, with behavior arising from compositions of these pieces.
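The additive residual-stream view can be made concrete in a few lines: because the unembedding is linear, the final logits decompose exactly into per-component contributions (ignoring the final layer norm, which breaks exact linearity in real models). All names and shapes below are illustrative.

```python
import numpy as np

# Each component writes additively into the residual stream, so the logits
# decompose exactly into per-component contributions through the (linear)
# unembedding. Real models have a final LayerNorm that breaks exactness.
rng = np.random.default_rng(1)
d_model, vocab = 8, 5
embed = rng.normal(size=d_model)     # token embedding write
head_out = rng.normal(size=d_model)  # stand-in for an attention head's write
mlp_out = rng.normal(size=d_model)   # stand-in for an MLP block's write
W_U = rng.normal(size=(d_model, vocab))  # unembedding

resid_final = embed + head_out + mlp_out
logits = resid_final @ W_U

# Direct logit attribution: project each write through W_U separately.
contributions = {name: v @ W_U for name, v in
                 [("embed", embed), ("head", head_out), ("mlp", mlp_out)]}
recomposed = sum(contributions.values())
# recomposed equals logits exactly: the decomposition loses nothing.
```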
Chan, Garriga-Alonso, Goldowsky-Dill, et al. (2022) — "Causal Scrubbing: a method for rigorously testing interpretability hypotheses"
-
Question it answers: Given a mechanistic interpretability hypothesis, how can we test it in a way that is more systematic than ad-hoc ablations?
-
Explanation object it uses: A formal hypothesis that links an interpretable computational graph to a model’s computational graph via a correspondence, interpreted as a claim about which internal distinctions matter and which can be scrubbed away.
-
Metric/evidence it uses: Behavior-preserving resampling ablations: replace activations with resampled activations that should be equivalent under the hypothesis, and measure how much the model’s behavior (loss/task metric) changes.
-
Other choices made: Emphasizes resampling from an appropriate data distribution to keep interventions on-distribution, and applies the procedure recursively to scrub away all dependencies that the hypothesis claims should be irrelevant.
-
Observation produced: Hypotheses can be evaluated and iteratively refined by progressively making stronger claims and checking when performance breaks, giving a framework for answering “how faithful is this interpretation?”
Wang, Steinhardt, Evans (2022) — "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small"
-
Question it answers: What sparse internal circuit in GPT‑2 small causally mediates indirect object identification (IOI) behavior?
-
Explanation object it uses: A circuit described as a subgraph over specific transformer components and connections.
-
Metric/evidence it uses: Intervention-based causal evidence using a task-specific scalar metric (logit difference between the correct and incorrect indirect object), with both knockout-style ablations (necessity) and path patching (localizing causal influence along specific routes).
-
Other choices made: Uses mean-ablation style interventions computed from a reference distribution to define knocking out components while keeping activations closer to the model’s typical operating regime; evaluates circuit quality with explicit criteria (faithfulness, completeness, minimality).
-
Observation produced: A relatively small set of 26 interacting attention heads (and their connections) accounts for most of the IOI behavior, with identifiable functional roles and some redundancy.
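The mean-ablation convention used here can be sketched in a few lines, with made-up shapes: replacing a head's output with its average over a reference batch keeps the downstream computation near its typical statistics, unlike zero ablation.

```python
import numpy as np

# Illustrative shapes only: outputs of one attention head over a reference batch.
rng = np.random.default_rng(2)
head_outputs = rng.normal(loc=0.5, size=(64, 8))  # (batch, d_model)
mean_value = head_outputs.mean(axis=0)

def ablate(head_output, mode):
    """Knock out the head's contribution under two common conventions."""
    if mode == "zero":
        return np.zeros_like(head_output)
    if mode == "mean":
        return mean_value  # keeps the typical activation statistics
    raise ValueError(mode)

sample = head_outputs[0]
# Distance from the head's typical output under each ablation:
off_dist_zero = float(np.linalg.norm(ablate(sample, "zero") - mean_value))
off_dist_mean = float(np.linalg.norm(ablate(sample, "mean") - mean_value))
# Zero ablation lands far from the reference statistics whenever the head's
# mean output is nonzero; mean ablation lands exactly on them.
```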
Goldowsky-Dill, MacLeod, Sato, Arora (2023) — "Localizing Model Behavior with Path Patching"
-
Question it answers: How can we express and quantitatively test hypotheses that a model behavior is localized to a set of mediating paths (not just a set of nodes)?
-
Explanation object it uses: A “localization hypothesis” specifying a computational graph, a choice of mediator set (nodes or paths), and a dissimilarity metric for comparing model outputs under interventions.
-
Metric/evidence it uses: Defines quantitative measures such as average unexplained effect (linked to natural indirect effects) and uses path patching interventions to estimate how much of the behavior is mediated by the hypothesized set.
-
Other choices made: Makes counterfactual construction a central knob (resampling, corrupting, mean/zero ablation) and treats the choice of output metric as part of the hypothesis.
-
Observation produced: Path patching provides a principled way to test “this set of paths mediates the behavior,” and helps diagnose where a hypothesis fails by attributing residual unexplained effect.
Conmy, Mavor-Parker, Lynch, Heimersheim, Garriga-Alonso (2023) — "Towards Automated Circuit Discovery for Mechanistic Interpretability"
-
Question it answers: Can we automate the step of finding which connections between abstract components form a circuit for a behavior?
-
Explanation object it uses: A circuit defined as a sparse set of edges in a chosen computational graph over abstract units.
-
Metric/evidence it uses: Activation-patching-based edge evaluation, typically using a divergence (e.g. KL) between the full model’s output and the output produced when certain edges are replaced by activations from a corrupted run.
-
Other choices made: Uses a greedy deletion procedure (ACDC) controlled by thresholds and an edge order consistent with the graph’s partial order, aiming for sparsity while keeping the output close to the full model.
-
Observation produced: Automated pruning can recover known circuit structures on standard mechanistic interpretability tasks, substantially reducing the amount of manual edge-by-edge work.
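A heavily simplified sketch of the greedy loop: assume a toy graph where each edge contributes additively to a 3-logit output, try removing each edge (patching in its corrupted-run contribution), and keep the removal if the KL to the full model stays under a threshold. Real ACDC walks edges in the graph's partial order and patches actual activations; everything here is a stand-in.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl(p, q):  # KL divergence between two output distributions
    return float(np.sum(p * np.log(p / q)))

# Toy graph: each edge contributes additively to a 3-logit output,
# with a clean-run and a corrupted-run contribution. Illustrative only.
rng = np.random.default_rng(3)
edges = ["e1", "e2", "e3", "e4"]
clean = {e: rng.normal(size=3) for e in edges}
corrupt = {e: rng.normal(size=3) for e in edges}

def output(active):
    # Kept edges contribute their clean value; pruned edges are replaced by
    # their corrupted-run value (patching, not zero ablation).
    return softmax(sum(clean[e] if e in active else corrupt[e] for e in edges))

full = output(set(edges))
threshold = 0.05
circuit = set(edges)
for e in edges:  # real ACDC walks edges in the graph's partial order
    trial = circuit - {e}
    if kl(full, output(trial)) < threshold:
        circuit = trial  # removing this edge barely moves the output: prune it
```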
Zhang, Nanda (2023) — "Towards Best Practices of Activation Patching in Language Models: Metrics and Methods"
-
Question it answers: How sensitive is activation patching to methodological choices, and what should we standardize if we want reliable mechanistic conclusions?
-
Explanation object it uses: Localization results produced by activation patching (which components look causally important under restore-style interventions), applied across several tasks.
-
Metric/evidence it uses: Empirical comparisons across evaluation metrics (e.g. probability-based vs logit-difference-style scores) and corruption methods (e.g. Gaussian noising vs semantically-related token replacement), showing that these choices can materially change conclusions.
-
Other choices made: Analyzes both localization and downstream circuit discovery, and compares single-site interventions against “sliding window” joint patching to show how multi-site interactions complicate interpretation.
-
Observation produced: Some common choices can produce inconsistent or misleading localization, especially when corruptions push the model off distribution or when the metric discards signed contributions.
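A tiny numerical example of the metric-sensitivity point (numbers are illustrative): when logits are saturated, a probability metric can hide a large signed change that a logit-difference metric reports directly.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Logits for [correct, incorrect] before and after an intervention
# (illustrative numbers): the intervention removes 4 logits of signal.
before = np.array([10.0, 0.0])
after = np.array([6.0, 0.0])

prob_change = float(softmax(before)[0] - softmax(after)[0])
logit_diff_change = float((before[0] - before[1]) - (after[0] - after[1]))
# prob_change is under 0.01 because softmax has saturated, while the
# logit difference reports the full 4.0: a probability-based score would
# call this intervention unimportant.
```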
Makelov, Lange, Geiger, Nanda (2024) — "Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching"
-
Question it answers: When we patch only a subspace of an activation, does success imply that the subspace corresponds to a meaningful causal variable or mechanism?
-
Explanation object it uses: Subspace activation patching experiments that intervene on projected components of activations rather than on concrete model components or full activations.
-
Metric/evidence it uses: Theoretical and empirical demonstrations showing that many different subspaces can yield similar behavioral restoration, creating an illusion of having identified “the” causal subspace.
-
Other choices made: Varies how the subspace is chosen (including settings where it is not aligned with an intended causal feature) and studies how conclusions change under these variations.
-
Observation produced: A successful patch is not, by itself, evidence that the patched subspace is the model’s internal representation of the causal variable you care about.
Hanna, Pezzelle, Belinkov (2024) — "Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms"
-
Question it answers: How should we evaluate circuit discovery methods, and is circuit overlap a good proxy for having found the right mechanism?
-
Explanation object it uses: Circuits as edge-defined subgraphs, with a focus on scalable circuit-finding via edge scoring rather than exhaustive interventions.
-
Metric/evidence it uses: Proposes and evaluates EAP with integrated gradients (EAP‑IG) and assesses circuits using a faithfulness criterion: edges outside the circuit should be ablatable without changing the model’s behavior on the task.
-
Other choices made: Uses integrated gradients to reduce failure modes of gradient-based edge scoring (e.g. vanishing/zero gradients), and compares overlap against faithfulness across tasks.
-
Observation produced: High overlap with a reference circuit does not guarantee faithfulness, and EAP‑IG can produce more faithful circuits than simpler gradient approximations even when overlap looks similar.
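The integrated-gradients idea behind EAP-IG can be sketched with a toy scalar metric `f` that has saturating gradients (the real method scores transformer edges; everything here is a stand-in): average the gradient along the straight line from the corrupted to the clean activation, then dot with the activation delta.

```python
import numpy as np

def f(a):
    return float(np.tanh(a).sum())  # toy metric with saturating gradients

def grad_f(a):
    return 1.0 - np.tanh(a) ** 2

a_clean = np.array([3.0, -2.0])    # activation on the clean run
a_corrupt = np.array([0.0, 0.0])   # activation on the corrupted run

# Integrated gradients: average the gradient along the straight line from
# corrupted to clean, then take the dot product with the activation delta.
steps = 50
alphas = (np.arange(steps) + 0.5) / steps  # midpoint rule
avg_grad = np.mean([grad_f(a_corrupt + t * (a_clean - a_corrupt))
                    for t in alphas], axis=0)
ig_score = float((a_clean - a_corrupt) @ avg_grad)

# Completeness: the IG score recovers f(clean) - f(corrupt), whereas a single
# gradient at the clean point misestimates it (here it even flips the sign,
# because tanh has saturated at a_clean).
plain_grad_score = float((a_clean - a_corrupt) @ grad_f(a_clean))
```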
Bhaskar, Wettig, Friedman, Chen (2024) — "Finding Transformer Circuits with Edge Pruning"
-
Question it answers: Can circuit discovery be posed as a scalable optimization problem that yields sparse, faithful circuits without greedy edge-by-edge search?
-
Explanation object it uses: Learnable binary masks over edges in a transformer’s computational graph (implemented via a disentangled residual stream so edge-level reads can be gated).
-
Metric/evidence it uses: Optimizes a loss that matches circuit outputs to full-model outputs (e.g. KL divergence over token predictions), while enforcing sparsity via L0-style regularization; removed edges are treated counterfactually by replacing their contributions with activations from a corrupted example.
-
Other choices made: Requires paired clean/corrupted examples to define the counterfactual semantics of “missing edges,” and targets circuits that match behavior across a task distribution rather than on a single prompt.
-
Observation produced: Produces circuits that are substantially sparser than prior approaches while remaining comparably faithful on standard circuit-finding tasks, and scales to larger datasets and models.
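The edge-mask semantics can be sketched on a toy additive model: the mask interpolates each edge between its clean and corrupted contribution, and a sparsity-penalized matching objective is minimized by plain projected gradient descent (a crude stand-in for the L0-style machinery used in the paper; all values are illustrative).

```python
import numpy as np

# Toy setting: six edges, each with a clean-run and a corrupted-run
# contribution; pretend only the first three carry the behavior.
rng = np.random.default_rng(4)
clean = rng.normal(size=6)
corrupt = rng.normal(size=6)
target = clean[:3].sum() + corrupt[3:].sum()

def output(m):
    # Mask semantics: m_i = 1 keeps edge i's clean contribution,
    # m_i = 0 patches in its corrupted-run contribution.
    return float(np.sum(m * clean + (1 - m) * corrupt))

lam = 0.001  # sparsity penalty weight (L1 here, L0-style in the paper)
m = np.full(6, 0.5)
for _ in range(2000):
    grad = 2 * (output(m) - target) * (clean - corrupt) + lam
    m = np.clip(m - 0.02 * grad, 0.0, 1.0)  # projected gradient step

circuit = m > 0.5  # threshold the relaxed mask into a discrete circuit
match_error = abs(output(m) - target)
```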
Hsu, Zhou, Cherapanamjeri, et al. (2025) — "Efficient Automated Circuit Discovery in Transformers Using Contextual Decomposition"
-
Question it answers: Can we do automated circuit discovery efficiently without relying on activation patching (or gradients), and still recover fine-grained circuits?
-
Explanation object it uses: Contextual decomposition for transformers (CD‑T): an analytical decomposition of activations into “relevant” and “irrelevant” constituents that can be propagated through transformer modules to score influence between nodes.
-
Metric/evidence it uses: Derives decomposition rules for transformer operations (including attention) and uses them to compute relevance scores that drive circuit selection; evaluates by how well recovered circuits align with reference circuits and by faithfulness-style checks (including comparisons against random circuits).
-
Other choices made: Treats the choice of initial decomposition (what counts as “relevant” at the source) as a crucial analogue of choosing an ablation distribution, and emphasizes compatibility across transformer architectures.
-
Observation produced: CD‑T can produce circuits at different levels of abstraction and granularity (down to attention heads at specific positions) with strong empirical recovery and runtime improvements over patching-based methods in the tested settings.
3.4 How can we change behavior reliably?
This line of work is about making targeted changes to a model’s behavior without doing more training. The goal can be, for example, changing or updating “facts” inside the model or steering the model away from toxic behavior. Defining, achieving, and verifying the exact scope of the change is challenging: you want a change that is precise, robust to how you ask, and doesn’t quietly break other things.
Making the question concrete forces you to pick what kind of change you want, where you are willing to intervene, and what success means. A common framing is to treat editing as a constrained optimization problem: push the model to prefer some desired completion under a small family of prompts, while changing as little as possible elsewhere. Once you scale beyond a single edit, you also have to decide how edits are composed (one at a time vs jointly), and what you want to happen when edits interact.
The main failure modes are exactly what you would expect from trying to modify a complex system. An edit can succeed on the exact prompt you optimized for and still fail to generalize; it can leak into nearby facts; it can subtly degrade unrelated behavior; and when you apply many edits, small amounts of interference can accumulate into forgetting or even abrupt breakdown where the model becomes hard to edit further. A second, less obvious failure mode is evaluation mismatch: you can get strong results on a narrow prompt format and still not have the new behavior reliably show up during normal text generation.
Important choices
-
What behavior you are trying to change: In these papers the target is usually a factual association (“subject–relation–object”), but you still have to decide what counts as success: top-1 prediction for a short completion, a probability margin over the old fact, or changes that persist in free-form generation.
-
How you specify the edit: You can specify a fact explicitly (as a structured triplet) or implicitly (as an example prompt + desired completion). Even with the same underlying fact, the exact prompt template matters because it defines what the method is trained/optimized against.
-
Where you intervene: Parameter edits force you to choose a site: which layer(s), which module type, and often which token position (e.g. the “subject position” for a subject–relation query). Some work uses causal localization to guide this choice.
-
What object you actually change: In the papers I have focused on so far the intervention is a weight update, typically to the MLP output projection in one layer or across a range of layers. It is possible to also modify activations without updating weights.
-
How you define the internal write you want to perform: A common pattern is to compute a “key” representation tied to the subject/context, then solve for a “value” vector that makes the model prefer the desired completion. There is usually an explicit optimization step here, plus heuristics for making the key/value stable across contexts.
-
How you control collateral damage: Constrain the size of the weight change, bias the edit toward minimal interference with other activations (often via covariance statistics), or regularize to preserve the model’s behavior on prompts meant to capture the subject’s broader semantics.
-
What you measure as reliable: It is rarely enough to check the exact edited prompt. Typical axes are: efficacy on the edit prompt, generalization to paraphrases, specificity on nearby/unrelated facts, and whether generations remain coherent. For scaling, you also care about forgetting of earlier edits and degradation on downstream tasks.
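The "key–value write" choice can be sketched as a rank-one update in the style of ROME, heavily simplified: here I assume identity key covariance and take `v_star` as given, whereas the real method estimates covariance statistics and optimizes the value vector to elicit the desired completion.

```python
import numpy as np

# Simplified rank-one edit: W maps MLP "keys" to residual "values". We assume
# identity key covariance and a given v_star; real ROME estimates covariance
# statistics and optimizes v_star to elicit the desired completion.
rng = np.random.default_rng(5)
W = rng.normal(size=(4, 6))          # (d_value, d_key)

k = rng.normal(size=6)
k = k / np.linalg.norm(k)            # key: representation of the subject
v_star = rng.normal(size=4)          # value that writes the new fact

# Rank-one update: change W's output exactly at k, minimally elsewhere.
delta = np.outer(v_star - W @ k, k) / (k @ k)
W_new = W + delta

k_orth = rng.normal(size=6)
k_orth -= (k_orth @ k) * k           # any key orthogonal to k is untouched
```

The update sends `k` exactly to `v_star` while leaving every key orthogonal to `k` unchanged, which is the "precise write, minimal collateral" property the surrounding choices are trying to achieve.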
Paper summaries
Meng, Bau, Andonian, Belinkov (2022) — "Locating and Editing Factual Associations in GPT"
-
Question it answers: Can we localize where a factual association is mediated inside a transformer, and then rewrite that single association with a small, targeted weight change that generalizes beyond the exact prompt?
-
Explanation object it uses: A localized weight edit to an MLP layer’s output projection, constructed as a rank-one update that inserts a new key–value association tied to the subject and intended to make the model prefer a new object.
-
Metric/evidence it uses: Uses causal-style intervention experiments to motivate which internal sites matter, then evaluates edits on factual-recall benchmarks with measures for success on the edit prompt, generalization to paraphrases, and specificity on unrelated/neighbor prompts. It also includes generation-based checks for whether free-form text about the subject becomes consistent with the edit.
-
Other choices made: Computes a “key” vector from subject activations, optimizes a “value” vector to elicit the desired completion, and adds an explicit regularizer to reduce semantic drift of the subject while the new fact is being written.
-
Observation produced: A single low-rank update at an appropriate MLP site can reliably flip a targeted factual association, often generalizing to paraphrased prompts while limiting spillover to nearby facts.
Meng, Sharma, Andonian, Belinkov, Bau (2023) — "Mass-Editing Memory in a Transformer"
-
Question it answers: How can we apply many factual edits efficiently?
-
Explanation object it uses: A set of coordinated weight updates distributed across a range of MLP layers.
-
Metric/evidence it uses: Empirical evaluation on large batches of edits, measuring the usual trio of edit success, paraphrase generalization, and specificity, plus generation-focused checks. Also reports compute/runtime tradeoffs for scaling to large edit sets.
-
Other choices made: Solves for per-edit target representations and then computes layer-wise updates in a way that supports batching. Uses covariance/statistical structure to bias the update toward “least interfering” changes, and accounts for the fact that editing earlier layers changes the activations seen by later layers.
-
Observation produced: Spreading the write across multiple layers and solving for a joint update can support large-scale editing more efficiently and with better retention/specificity than repeatedly applying single-fact edits.
Gupta, Rao, Anumanchipalli (2024) — "Model Editing at Scale leads to Gradual and Catastrophic Forgetting"
-
Question it answers: What happens when we apply many sequential edits to the same model? Do edits remain effective, do earlier edits persist, and do we preserve downstream capabilities?
-
Explanation object it uses: An evaluation framework for sequential editing that tracks both how well the next edit works (editing proficiency) and how much earlier edited knowledge is forgotten (edit retention), alongside broader model capability.
-
Metric/evidence it uses: Runs long edit sequences and measures per-edit success, paraphrase generalization, neighborhood specificity, the fraction of earlier edits that are forgotten, and downstream-task performance. It also analyzes parameter-change magnitudes to diagnose failure events.
-
Other choices made: Argues that the choice of dataset and prompt format matters for what you are actually measuring (e.g. QA-style edits may not transfer cleanly to completion-style generation), and designs the scaling experiments to avoid trivial conflicts such as repeatedly editing the same subject.
-
Observation produced: Scaling can produce a two-phase pattern: gradual degradation/forgetting as edits accumulate, and then occasional catastrophic failures triggered by particular edits that cause unusually large parameter shifts and sharply reduce editability and downstream performance.
3.5 Which training samples caused this behavior?
This line of work aims to trace a model’s outputs back to the training examples that most shaped them. The kind of insight you might want is: if you had trained on a slightly different dataset, how would the model’s behavior on this particular example have changed?
Making that question concrete requires you to define what you mean by “behavior” and “caused.” You need to define a score for each training example, relative to some model output function you care about (loss, margin/logit, an alignment score, etc.). The ultimate test for causality is retraining the model on a modified training dataset, but this is not possible in most situations, so the effect needs to be approximated.
The main failure mode is that it’s easy to confuse “looks related” with “was causally responsible.”
Important choices
-
What behavior you are attributing: You need a scalar output function $f(z;\theta)$ that stands in for “the behavior.” Common choices include prediction loss, a correct-vs-incorrect margin/logit, or a task-specific score like an embedding alignment objective.
-
What “caused” means as a training-set intervention: The cleanest definition is counterfactual: compare the model trained on $S$ versus trained on $S'$ (e.g., removing a set of training points). Another common lens is infinitesimal up-weighting, where importance is defined by the directional effect of increasing one example’s training weight.
-
How you convert per-example scores into a statement about a subset: Many methods implicitly assume additivity: the “importance” of a subset is the sum of its members’ importances. That gives a concrete prediction rule $g_\tau(z, S') = \sum_{i\in S'}\tau_i$ and makes evaluation possible, but it is still a modeling assumption about how influences compose.
-
How you handle training non-determinism: The quantity $f(z;\theta^*(S'))$ is often a random variable because training is stochastic, even when $S'$ is fixed. You can average over multiple training runs, or compute attributions using multiple trained models/checkpoints to stabilize estimates.
-
What approximation family you use to avoid retraining: If you can’t literally retrain for each counterfactual, you need an approximation that turns training example influence into something computable from trained model(s), often by relating it to gradients and a local model of how the final parameters respond to training data.
-
How you evaluate whether the attributions mean what you think they mean: Manual inspection can be a sanity check but doesn’t scale and is easy to fool. A more objective approach is to score methods by how well their attributions predict true counterfactual outputs across many random training subsets (e.g., via rank correlation).
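The subset-prediction evaluation can be sketched end to end with a toy additive ground truth standing in for retraining: `tau` is a hypothetical noisy attribution method, predictions for each subset come from the additivity assumption, and `lds` is its linear-datamodeling-style rank-correlation score.

```python
import numpy as np

# Toy ground truth: each training example has an additive per-example effect,
# and "retraining on a subset" is just summing effects over that subset.
# tau is a hypothetical noisy attribution method being evaluated.
rng = np.random.default_rng(6)
n_train = 20
true_influence = rng.normal(size=n_train)
tau = true_influence + 0.1 * rng.normal(size=n_train)

subsets = [rng.random(n_train) < 0.5 for _ in range(100)]
true_outputs = np.array([true_influence[s].sum() for s in subsets])
predicted = np.array([tau[s].sum() for s in subsets])  # additivity assumption

def spearman(a, b):
    # rank correlation = Pearson correlation of the ranks (no ties here)
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

lds = spearman(predicted, true_outputs)
# A good attribution method predicts the counterfactuals well, so lds ~ 1.
```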
Paper summaries
Park, Georgiev, Ilyas, Leclerc, Madry (2023) — "TRAK: Attributing Model Behavior at Scale"
-
Question it answers: Are there data attribution methods that are both scalable and effective in large-scale non-convex settings?
-
Explanation object it uses: A data attribution function $\tau(z,S)\in\mathbb{R}^n$ that assigns a real-valued importance score to each training example $z_i\in S$ for a chosen model output function $f(z;\theta^*(S))$.
-
Metric/evidence it uses: It turns attribution scores into counterfactual predictions via additivity ($g_\tau(z,S’)=\sum_{i\in S’}\tau_i$), then evaluates with the linear datamodeling score (LDS), a Spearman correlation between predicted and true outputs across many randomly sampled training subsets.
-
Other choices made: The concrete TRAK estimator uses projected parameter gradients and a learned reweighting (implemented via an ensemble of trained models and random projections) to approximate counterfactual influence efficiently; for classification they use a margin-style output function $f(z;\theta)=\log\frac{p(z;\theta)}{1-p(z;\theta)}$.
-
Observation produced: TRAK sits on a better compute–efficacy frontier than a range of prior approaches on the LDS benchmark, and it can surface training examples whose removal produces large counterfactual behavior changes (including cases where nearest-neighbor retrieval is not the same thing as causal influence).
3.6 Can we go from an interpretable program to a network?
This line of work starts from the opposite direction of most interpretability. Instead of taking a trained model and trying to infer what it is doing, it begins with a human-readable computation and builds a network that implements it. The point is not that these networks are realistic language models, but that they give a controlled setting where the ground truth mechanism is known. That makes them useful both as a didactic object (you can watch a transformer execute an algorithm step by step) and as a calibration tool for interpretability methods (you can test whether a method recovers the structure you already know is there).
Making this idea concrete forces you to be explicit about what kinds of programs you are talking about and what it means to “compile” them. In practice you need to choose a programming model that lines up with transformer-style computation, decide how intermediate variables are represented in the residual stream, and decide how program operations map onto attention and MLP blocks. Once you do that, you still have design freedom in how tightly you want to preserve the original semantics versus how much you want to push the resulting network toward something that looks more like a learned model (for example by compressing representations).
The main failure mode is that you can accidentally build a model that is too friendly. Compiled models can bake in conveniences that real models do not have: unnaturally clean variable separation, basis choices that make inspection trivial, or architectural simplifications that distort what you are trying to study. Even if the model produces the right outputs, internal representations can drift if you introduce training or compression steps, which means you may lose the very “known mechanism” that motivated the setup.
Important choices
-
What kind of program counts as interpretable: In practice you usually restrict to programs that operate on sequences and are expressible in a transformer-aligned DSL: things like selection, aggregation, and simple elementwise transformations.
-
What the network is supposed to implement: Often the target is a deterministic sequence-to-sequence algorithm, not a probabilistic next-token distribution. That choice matters because it changes what kinds of mechanisms you can express and what “correctness” even means.
-
How program operations map to transformer components: A common pattern is to map elementwise computations to MLP blocks and selection/aggregation computations to attention heads, then stack these blocks subject to the transformer’s layer structure.
-
How intermediate variables live in the residual stream: You need a representation scheme for different variables (e.g. assign them to different subspaces).
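A minimal numpy sketch of the compile idea: a "shift right one position" program step becomes a hand-built one-hot attention pattern, and the input and output variables live in disjoint residual-stream coordinates. None of this is Tracr's actual API; it only illustrates the select-then-aggregate mapping.

```python
import numpy as np

tokens = np.array([3, 1, 4, 1, 5])
n = len(tokens)

# Residual stream with one coordinate per variable: column 0 holds the
# input variable, column 1 will hold the program's output variable.
resid = np.zeros((n, 2))
resid[:, 0] = tokens  # "embed": write the input into its subspace

# Program step out[i] = input[i-1], compiled as select + aggregate:
# a hard one-hot attention pattern selects the previous position.
attn = np.zeros((n, n))
for i in range(1, n):
    attn[i, i - 1] = 1.0
attn[0, 0] = 1.0  # BOS-style default at the first position

# "Aggregate": read the input subspace at the selected positions and write
# the result into the output subspace.
resid[:, 1] = attn @ resid[:, 0]
shifted = resid[:, 1]
```

Because the variables occupy disjoint coordinates and the attention pattern is exactly one-hot, the "known mechanism" is readable off the weights, which is precisely the convenience (and the unrealism) discussed above.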
Paper summaries
Lindner, Kramár, Farquhar, Rahtz, McGrath, Mikulik (2023) — "Tracr: Compiled Transformers as a Laboratory for Interpretability"
-
Question it answers: Can we compile human-readable sequence programs into standard transformer weights in a way that yields models with known internal structure, suitable for designing interpretability experiments and for evaluating interpretability tools against a ground-truth mechanism?
-
Explanation object it uses: A compiled transformer whose components correspond to a traced program graph, with intermediate variables embedded into designated residual-stream subspaces and operations implemented via hand-constructed attention/MLP blocks.
-
Metric/evidence it uses: Empirical demonstration by compiling multiple algorithmic programs (e.g. counting-like computations, sorting, parenthesis checking) and verifying correct behavior; plus a compression case study where a learned projection compresses the residual stream while tracking both output loss and layer-wise similarity.
-
Other choices made: Introduces an “assembly-like” intermediate representation to simplify constructing blocks; restricts the source language to avoid selector compositions that do not map cleanly to attention; requires explicit categorical vs numerical encoding annotations and a BOS token; compiles into a decoder-only transformer implementation without layer norms; uses heuristic layer allocation.
-
Observation produced: Compiled models can act as controlled test cases and a didactic tool for understanding how transformer components implement multi-step algorithms. When compressing compiled models, the learned projection can drop unnecessary features and induce superposition-like reuse of dimensions, but internal representations may also change even when outputs remain correct.
Reading list
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (2024)
- Transcoders Find Interpretable LLM Feature Circuits (2024)
- Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits (2025)
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca (2023)
- Finding Neurons in a Haystack: Case Studies with Sparse Probing (2023)
- Weight-sparse transformers have interpretable circuits
- Neurons in Large Language Models: Dead, N-gram, Positional
- Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition
- Information Flow Routes: Automatically Interpreting Language Models at Scale
- RelP: Faithful and Efficient Circuit Discovery via Relevance Patching (2025)
- In-context Learning and Induction Heads (2022)
- AtP*: An Efficient and Scalable Method for Localizing LLM Behavior to Components (2024)
- Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability (2023)
- What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation (2024)
- Progress Measures for Grokking via Mechanistic Interpretability (2023)
- A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task (2024)
- Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models (2024)
- EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification (2025)
- Automatically Identifying Local and Global Circuits with Linear Computation Graphs (2024)
- Sheaf Discovery with Joint Computation Graph Pruning and Flexible Granularity (2025)
- Decomposition of Small Transformer Models (2025)
- How to use and interpret activation patching (2024)
- TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research (2025)
- Localized Definitions and Distributed Reasoning: A Proof-of-Concept Mechanistic Interpretability Study via Activation Patching (2025)
- BERT Rediscovers the Classical NLP Pipeline (2019)
- Visualizing and Measuring the Geometry of BERT (2019)
- Identifying and Controlling Important Neurons in Neural Machine Translation (2018)
- Augmenting Deep Classifiers with Polynomial Neural Networks (2022)
- Quantifying Attention Flow in Transformers (2020)
- Differentiable Subset Pruning of Transformer Heads (2021)
- Uncovering hidden geometry in Transformers via disentangling position and context (2023)
- SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability (2017)
- Similarity of Neural Network Representations Revisited (CKA) (2019)
- A Structural Probe for Finding Syntax in Word Representations (2019)
- Improving Dictionary Learning with Gated Sparse Autoencoders (2024)
- Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders (2024)
- Efficient Dictionary Learning with Switch Sparse Autoencoders (2024)
- Decomposing The Dark Matter of Sparse Autoencoders (2024)
- Transcoders Beat Sparse Autoencoders for Interpretability (2025)
- Are Sparse Autoencoders Useful? A Case Study in Sparse Probing (2025)
- Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders (2025)
- Automatically Interpreting Millions of Features in Large Language Models (2024)
- Route Sparse Autoencoder to Interpret Large Language Models (2025)
- Language models can explain neurons in language models (2023)
- Discovering Latent Knowledge in Language Models Without Supervision (2022)
- Sanity Checks for Saliency Maps (2018)
- SmoothGrad: removing noise by adding noise (2017)
- Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization (2017)
- Learning Important Features Through Propagating Activation Differences (DeepLIFT) (2017)
- Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) (2018)
- Anchors: High-Precision Model-Agnostic Explanations (2018)
- On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation (2015)
- Attention is not Explanation (2019)
- Rationalizing Neural Predictions (2016)
- Knowledge Circuits in Pretrained Transformers (2024)
- Understanding Language Model Circuits through Knowledge Editing (2024)
- Robust and Scalable Model Editing for Large Language Models (2024)
- Editing Large Language Models: Problems, Methods, and Opportunities (2023)
- Steering Llama 2 via Contrastive Activation Addition (2023)
- Style Vectors for Steering Generative Large Language Models (2024)
- LogitLens4LLMs: Extending Logit Lens Analysis to Modern Large Language Models (2025)
- Unsupervised decoding of encoded reasoning using language model interpretability (2025)
- Activation Steering for Masked Diffusion Language Models (2025)
- Mechanistic Interpretability for Steering Vision-Language-Action Models (2025)
- InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques (2024)
- Neural Decompiling of Tracr Transformers (2024)
- Thinking Like Transformers (2021)
- Learning Transformer Programs (2023)
- ALTA: Compiler-Based Analysis of Transformers (2024)
- Understanding Black-box Predictions via Influence Functions (2017)
- Estimating Training Data Influence by Tracing Gradient Descent (TracIn) (2020)
- Data Shapley: Equitable Valuation of Data for Machine Learning (2019)
- Mechanistic Interpretability for AI Safety — A Review (2024)
- A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models (2024)
- A Primer in BERTology: What We Know About How BERT Works (2020)
- The Explainability of Transformers: Current Status and Directions (2024)
- Knowledge Editing for Large Language Models: A Survey (2023)
- A Comprehensive Study of Knowledge Editing for Large Language Models (2024)
- Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models (2025)
- Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling (2023)
- Mixture of Experts Made Intrinsically Interpretable (2025)