CircuitsVis (& TransformerLens)
- It can be run on a TransformerLens `ActivationCache` rather than just on a tensor of attention patterns (all examples in the notebook do this),
- Option to show value-weighted attention patterns rather than regular patterns,
- It can toggle between multiple sequences in a batch,
- Plots can be opened in browser by default rather than displayed inline (won't work in Colab),
- Added bertviz-style plots, with some extra features.

Setup code
This is just for completeness, to show how we get the data which will be plotted below. If you're trying to actually follow along with the code, I'd recommend using the Colab instead.
%pip install transformer_lens
%pip install git+https://github.com/callummcdougall/CircuitsVis.git#subdirectory=python
import torch as t
from torch import Tensor
import circuitsvis as cv
from jaxtyping import Float
from IPython.display import clear_output, display
from transformer_lens import HookedTransformer
t.set_grad_enabled(False)
gpt2 = HookedTransformer.from_pretrained("gpt2-small")
sentence0 = "When Mary and John went to the shops, John gave a drink to Mary."
sentence1 = "When Mary and John went to the shops, Mary gave a drink to John."
sentence2 = "The cat sat on the mat."
sentences_all = [sentence0, sentence1, sentence2]
names_filter = lambda name: any(name.endswith(f"hook_{s}") for s in ["pattern", "q", "k", "v"])
logits, cache = gpt2.run_with_cache(sentence0, names_filter=names_filter, remove_batch_dim=True)
logits_all, cache_all = gpt2.run_with_cache(sentences_all, names_filter=names_filter)
logits, cache_full = gpt2.run_with_cache(sentence0, remove_batch_dim=True)
tokens = gpt2.to_str_tokens(sentence0)
tokens_all = gpt2.to_str_tokens(sentences_all)
Overview of function
The main function is `cv.attention.from_cache`. The most important arguments are:
- `cache` - the `ActivationCache` object. This has to contain the appropriate activations (i.e. `pattern`, plus `v` if you're using value-weighted attention, plus `q` and `k` if you're using `lines` mode).
- `tokens` - either a list of strings (if batch size is 1), or a list of lists of strings (if batch size is > 1).
The optional arguments are:
- `heads` - if specified (e.g. `[(9, 6), (9, 9)]`), these heads will be shown in the visualisation. If not specified, behaviour is determined by the `layers` argument.
- `layers` - this can be an int (= single layer), a list of ints (= list of layers), or None (= all layers). If `heads` is not specified, then the value of this argument determines which heads are shown.
- `batch_idx` - if the cache has a batch dimension, then you can specify this argument (as either an int, or a list of ints). Note that you can have a nontrivial batch size in your visualisations (you'll be able to select different sequences using a dropdown).
- `attention_type` - if this is `"standard"`, we just use raw attention patterns. If this is `"value-weighted"`, then the visualisation will use value-weighted attention, i.e. every attention probability $A^h[s_Q, s_K]$ will be replaced with: $$ A^h[s_Q, s_K] \times \frac{|v^h[s_K]|}{\underset{s}{\max} |v^h[s]|} $$ If this is `"info-weighted"`, we get the same, except with each $v^h[s]$ replaced by $v^h[s]^T W_O^h$ (where $W_O^h$ is the output projection matrix for head $h$).
- `mode` - this can be `"large"`, `"small"` or `"lines"`, for producing the three different types of attention plots (see below for examples of all).
- `return_mode` - this can be `"browser"` (open the plot in browser; doesn't work in Colab or VMs), `"html"` (returns an HTML object), or `"view"` (displays the object inline).
- `radioitems` - if True, you select the sequence in the batch using radio items rather than a dropdown. Defaults to False.
- `batch_labels` - if you're using batch size > 1, then this argument can override `tokens` as the labels you see in the dropdown / radio items.
- `title` - if given, then a title is added at the top of your plot (i.e. an `<h1>` HTML item).
- `head_notation` - can be either `"dot"` for notation like `10.7` (this is the default), or `"LH"` for notation like `L10H7`.
- `help` - if True, prints out a string explaining the visualisation. Defaults to False.
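As a quick illustration, here's a minimal sketch of a call combining a few of these arguments (it assumes the setup code above has been run; the examples in the next section walk through each option in more detail):
cv.attention.from_cache(
    cache = cache_all,            # ActivationCache with a batch dimension
    tokens = tokens_all,          # list of lists of string-tokens (batch size > 1)
    heads = [(9, 6), (9, 9)],     # only show these heads
    attention_type = "standard",  # raw attention probabilities
    mode = "small",
    return_mode = "view",         # display the plot inline
)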
Examples
Below is a set of examples, along with some brief explanations.
Visualising batches
The 4 examples below illustrate how you can batch circuitsvis figures together. They are:
1. Cache with no batch dim (this works like normal circuitsvis)
2. Cache with batch dim (there's an extra dropdown where you can choose different sequences in the batch)
3. Cache with batch dim, but with `batch_idx` specified as an int - this causes behaviour like (1)
4. Cache with batch dim, but with `batch_idx` specified as a list - this causes behaviour like (2)
cv.attention.from_cache(
cache = cache,
tokens = tokens,
layers = 0,
)
cv.attention.from_cache(
cache = cache_all,
tokens = tokens_all,
layers = 0,
)
cv.attention.from_cache(
cache = cache_all,
tokens = tokens_all,
layers = 0,
batch_idx = 1, # Different way to specify a sequence within batch
)
cv.attention.from_cache(
cache = cache_all,
tokens = tokens_all,
layers = 0,
batch_idx = [0, 1], # Different way to specify some sequences within batch
)
Specifying layers and heads
You saw above how we can specify layers using the `layers` argument. You can also use the `heads` argument to specify given heads. The full options are:
- When both are `None`, all layers and heads are shown.
- When `layers` is given (as an int or a list of ints) but `heads` is None, all heads in the given layer(s) are shown.
- When `layers` is None, and `heads` is given (as a `(layer, head_idx)` tuple or a list of such tuples), only the given heads are shown.
We have a couple of examples below:
cv.attention.from_cache(
cache = cache,
tokens = tokens,
layers = [0, -1], # Negative indices are accepted
)
cv.attention.from_cache(
cache = cache,
tokens = tokens,
heads = [(9, 6), (9, 9), (10, 0)], # Showing all the name mover heads: `to` attends to the IO token `Mary`
)
Different modes: "large" and "lines"
The mode above is "small" (also known as `attention_patterns` in circuitsvis). You can also use "large" (which is `attention_heads` in circuitsvis), or "lines" (which is like the neuron view in bertviz, but with a few extra features I added, e.g. showing the values of the attention scores as well as the probabilities).
cv.attention.from_cache(
cache = cache_all,
tokens = tokens_all,
heads = [(9, 6), (9, 9), (10, 0)],
mode = "large",
)
cv.attention.from_cache(
cache = cache,
tokens = tokens,
mode = "lines",
display_mode = "light", # Can also choose "dark"
)
cv.attention.from_cache(
cache = cache,
tokens = tokens,
heads = [(9, 6), (9, 9), (10, 0)],
mode = "lines",
display_mode = "light",
)
Value-weighted attention
Value-weighted attention is a pretty neat concept. TL;DR - if a value vector doubled in magnitude but the attention probability halved, then that source position's contribution to the head's output would be unchanged; so we should expect the attention probability to be more meaningful once we scale it by the magnitude of the value vector. This is what the `attention_type` argument does. If it is `"value-weighted"`, then every attention probability $A^h[s_Q, s_K]$ will be replaced with:
$$ A^h[s_Q, s_K] \times \frac{|v^h[s_K]|}{\underset{s}{\max} |v^h[s]|} $$
where $|v^h[s]|$ is the $L_2$ norm of the value vector at source position $s$ in head $h$.
If it is `"info-weighted"`, then instead we get:
$$ A^h[s_Q, s_K] \times \frac{\Big|v^h[s_K]^T W_O^h\Big|}{\underset{s}{\max} \Big|v^h[s]^T W_O^h\Big|} $$
In particular, when we do this, we can see that the attention on the BOS token is much lower (because usually this token is attended to as a placeholder, and not much is actually copied).
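To make this concrete, here's a rough sketch (not the function's internal code) of how you could compute the value-weighted and info-weighted patterns for a single head, directly from the cache created in the setup code:
# A sketch of the weighting above, computed by hand for one head.
# Assumes `cache` was created with remove_batch_dim=True, as in the setup code.
layer, head = 9, 9
pattern = cache["pattern", layer][head]   # [seq_Q, seq_K] attention probabilities
v = cache["v", layer][:, head]            # [seq_K, d_head] value vectors

# Value-weighted: scale each key position by the (relative) norm of its value vector
v_norms = v.norm(dim=-1)                  # [seq_K]
value_weighted = pattern * (v_norms / v_norms.max())

# Info-weighted: same, but using the norm of v^T W_O (what the head actually
# writes to the residual stream from each key position)
info = v @ gpt2.W_O[layer, head]          # [seq_K, d_model]
info_norms = info.norm(dim=-1)
info_weighted = pattern * (info_norms / info_norms.max())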
cv.attention.from_cache(
cache = cache,
tokens = tokens,
heads = [(9, 6), (9, 9), (10, 0)], # showing all the name mover heads
attention_type = "info-weighted", # or try "value-weighted"
)
Other arguments
`title` (optional) specifies a title.
If `help` is True, then a string explaining the visualisation is printed out (as well as an explanation of the non-default arguments which you're using).
If `radioitems` is True, then you select different sequences in the batch using radio items rather than a dropdown. Defaults to False.
If `batch_labels` is not None (and your batch size is larger than 1), then this argument overrides the values that appear in the dropdown / radio items. It can be either a list of strings, or a function mapping `(batch_idx, tokens[batch_idx])` to a string.
`head_notation` can be set to `"LH"` to change the notation from e.g. `10.7` to `L10H7`.
`display_mode` can be "dark" (default) or "light". This only affects the "lines" mode.
`return_mode` can be "browser" (open in browser), "html" (return an HTML object), or "view" (display the object inline; this is the default).
Note that the browser view is often preferable - it doesn't slow down your IDE, and it can reduce flickering when you switch between different sequences in your batch. However, this won't always work (e.g. in virtual machines or on Colab). In this case, you should use `return_mode = "html"`, then save the result and download & open it manually.
I've given 2 examples below, which both showcase several of these features.
The first of these examples shows what it looks like when you use `help = True` (the text immediately below the code is displayed as part of the visualisation), as well as how we can return and save an HTML object.
html_object = cv.attention.from_cache(
cache = cache_all,
tokens = tokens_all,
batch_idx = 0,
heads = [(9, 6), (9, 9), (10, 0)],
mode = "lines",
title = "Attention of name mover heads (lines mode)",
return_mode = "html",
help = True,
display_mode = "light", # This might be better if you're opening in browser
)
display(html_object)
with open("my_file.html", "w") as f:
f.write(html_object.data)
(The serif-font text below is part of the visualisation, because we had `help=True`.)
The next example uses the `batch_labels` argument. It's used as a function, mapping the list of string-tokens to their HTML representations (so we can see each individual token). It also uses `return_mode = "view"` (which means we see the visualisation, rather than returning it). Note that just calling the function with `return_mode = "html"` can still display the figure, if you're running it as the last line of code in a cell.
cv.attention.from_cache(
cache = cache_all,
tokens = tokens_all,
heads = [(9, 6), (9, 9), (10, 0)],
attention_type = "info-weighted",
radioitems = True,
batch_labels = lambda str_toks: "|".join(str_toks),
head_notation = "LH", # head names are L10H7 rather than 10.7
return_mode = "view",
)
Attribution plots (work-in-progress!)
Attention plots might also be decent for logit attribution. The values of each cell are "logits directly written in the correct direction".
You can pass the `resid_directions` argument to the function, and it'll measure the attribution in that direction (e.g. this could be a `logit_diff` vector). Note that it also supports `resid_directions` being a vector of length `d_vocab`; then it can convert this into a vector in the residual stream by mapping it backwards through `W_U` (and this also means we can do logit attribution for the unembedding bias `b_U`).
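For instance, here's a sketch of two directions you might construct for sentence0 and pass as `resid_directions` (the exact shapes the argument accepts may differ, since this part is still work-in-progress):
# Sketch: two possible `resid_directions` for sentence0, where the correct
# completion after "to" is " Mary" and the distractor is " John".
mary_id = gpt2.to_single_token(" Mary")
john_id = gpt2.to_single_token(" John")

# (1) A residual-stream direction, built from the unembedding columns:
logit_diff_dir = gpt2.W_U[:, mary_id] - gpt2.W_U[:, john_id]   # shape [d_model]

# (2) A length-d_vocab vector, to be mapped backwards through W_U by the function:
vocab_dir = t.zeros(gpt2.cfg.d_vocab)
vocab_dir[mary_id], vocab_dir[john_id] = 1.0, -1.0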
There are 2 really hacky things about this function, which are why I don't recommend people use it yet.
- The colors are awkward; lots of translations and scalings have to be done to make sure that (1) the whitepoint is zero logit attribution and (2) no values are more than 1. The way I did this is by dividing all the component logit attributions by the maximum absolute value of all components' contributions over all sequence positions (which leads to a sparser & cleaner plot, and also makes it easier to make relative comparisons between different values). For the large plot `attention_heads`, this was sufficient, because it also supports negative values: `[-1, 0]` is red and `[0, 1]` is blue. For the smaller plot `attention_patterns`, there's only one color shade for each facet plot, and so this one has to be split up into 2 separate plots.
- If you're just doing attribution at a particular sequence position, it makes sense to just have a `(1, seq_len)`-shape plot per component, rather than a `(seq_len, seq_len)`-size plot where you only care about one row. Currently, circuitsvis doesn't support having different source and destination tokens, so this isn't yet possible. A hacky solution: whenever `resid_directions` is just a single vector (i.e. it doesn't have a `seq_len` dimension), I broadcast it along the square attention plots, so you see a vertical stripe rather than just a dot. Uncomment the lines below to see this in action.
I expect both of these two things to be solved eventually, but they're not high on my priority list right now.
Showing positive and negative attribution on different plots:
cv.attribution.from_cache(
model = gpt2,
cache = cache_full,
tokens = tokens,
mode = "small",
heads = [(9, 6), (9, 9), (10, 0), (10, 7), (11, 10)], # main positive & negative heads
return_mode = "view",
)
And for showing positive and negative attribution on the same plot (blue is positive, red is negative):
cv.attribution.from_cache(
model = gpt2,
cache = cache_full,
tokens = tokens,
heads = [(9, 6), (9, 9), (10, 0), (10, 7), (11, 10)],
mode = "large",
)