
CircuitsVis (& TransformerLens)



This post explains the extensions I've made to Alan Cooney's excellent library CircuitsVis (itself based on Anthropic's Pysvelte). It includes a few extra features relative to the standard version, such as:
  • Running on a TransformerLens ActivationCache rather than just on a tensor of attention patterns (all examples in the notebook do this),
  • Showing value-weighted attention patterns rather than regular patterns,
  • Toggling between multiple sequences in a batch,
  • Opening plots in the browser by default rather than displaying them inline (won't work in Colab),
  • New bertviz-style plots, with some extra features.
You can find the Colab here. Please comment or send me a message if you have any feedback. I hope you find this useful!

Setup code

This is just for completeness, to show how we get the data which will be plotted below. If you're trying to actually follow along with the code, I'd recommend using the Colab instead.
%pip install transformer_lens
%pip install git+https://github.com/callummcdougall/CircuitsVis.git#subdirectory=python

import torch as t
from torch import Tensor
import circuitsvis as cv
from jaxtyping import Float
from IPython.display import clear_output, display
from transformer_lens import HookedTransformer

t.set_grad_enabled(False)

gpt2 = HookedTransformer.from_pretrained("gpt2-small")

sentence0 = "When Mary and John went to the shops, John gave a drink to Mary."
sentence1 = "When Mary and John went to the shops, Mary gave a drink to John."
sentence2 = "The cat sat on the mat."
sentences_all = [sentence0, sentence1, sentence2]

# Only cache the attention patterns and the q/k/v activations
names_filter = lambda name: any(name.endswith(f"hook_{s}") for s in ["pattern", "q", "k", "v"])

# Single sequence, with the batch dimension removed
logits, cache = gpt2.run_with_cache(sentence0, names_filter=names_filter, remove_batch_dim=True)
# Batch of three sequences (keeps the batch dimension)
logits_all, cache_all = gpt2.run_with_cache(sentences_all, names_filter=names_filter)
# Full cache (all activations) for a single sequence, used for the attribution plots at the end
logits, cache_full = gpt2.run_with_cache(sentence0, remove_batch_dim=True)

tokens = gpt2.to_str_tokens(sentence0)
tokens_all = gpt2.to_str_tokens(sentences_all)
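
For reference, here's a quick sanity check of the cached shapes (this is just standard TransformerLens behaviour, not anything specific to this library):

print(cache["pattern", 0].shape)      # [n_heads, seq, seq] - batch dim was removed
print(cache_all["pattern", 0].shape)  # [batch, n_heads, seq, seq] - batch dim kept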

Overview of the function

The main function is cv.attention.from_cache. The most important arguments are:

  • cache - the ActivationCache object. This has to contain the appropriate activations (i.e. pattern, plus v if you're using value-weighted attention, plus q and k if you're using lines mode).

  • tokens - either a list of strings (if batch size is 1), or a list of lists of strings (if batch size is > 1).

The optional arguments are:

  • heads - if specified (e.g. [(9, 6), (9, 9)]), these heads will be shown in the visualisation. If not specified, behaviour is determined by the layers argument.
  • layers - this can be an int (= single layer), list of ints (= list of layers), or None (= all layers). If heads are not specified, then the value of this argument determines what heads are shown.
  • batch_idx - if the cache has a batch dimension, then you can specify this argument (as either an int, or list of ints). Note that you can have nontrivial batch size in your visualisations (you'll be able to select different sequences using a dropdown).
  • attention_type - if this is "standard", we just use raw attention patterns. If this is "value-weighted", then the visualisation will use value-weighted attention, i.e. every attention probability $A^h[s_Q, s_K]$ will be replaced with:

    $$ A^h[s_Q, s_K] \times \frac{|v^h[s_K]|}{\underset{s}{\max} |v^h[s]|} $$

    If this is "info-weighted", we get the same, except with each $v^h[s]$ replaced with $v^h[s]^T W_O^h$ (the output projection matrix for head $h$).

  • mode - this can be "large", "small" or "lines", for producing the three different types of attention plots (see below for examples of all).

  • return_mode - this can be "browser" (open plot in browser; doesn't work in Colab or VMs), "html" (returns html object), or "view" (displays object inline).
  • radioitems - if True, you select the sequence in the batch using radioitems rather than a dropdown. Defaults to False.
  • batch_labels - if you're using batch size > 1, then this argument can override tokens to be the thing you see in the dropdown / radioitems.
  • title - if given, then a title is added at the top of your plot (i.e. an <h1> HTML element).
  • head_notation - can be either "dot" for notation like 10.7 (this is default), or "LH" for notation like L10H7.
  • help - if True, prints out a string explaining the visualisation. Defaults to False.
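
Putting these together, a typical call looks roughly like the following (a sketch using the objects from the setup code; all arguments behave as described above):

cv.attention.from_cache(
    cache = cache_all,                 # ActivationCache containing "pattern" (plus "v" for value-weighting)
    tokens = tokens_all,               # list of lists of strings, since batch size > 1
    heads = [(9, 9), (10, 0)],         # only show these (layer, head) pairs
    batch_idx = [0, 1],                # subset of the batch, selectable via a dropdown
    attention_type = "value-weighted",
    mode = "small",
    return_mode = "view",
)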

Examples

Below is a set of examples, along with some brief explanations.

Visualising batches

The 4 examples below illustrate how you can batch circuitsvis figures together. They are:

  1. Cache with no batch dim (this works like normal circuitsvis)
  2. Cache with batch dim (there's an extra dropdown where you can choose different sequences in the batch)
  3. Cache with batch dim, but with batch_idx specified as an int - this causes behaviour like (1)
  4. Cache with batch dim, but with batch_idx specified as a list - this causes behaviour like (2)

cv.attention.from_cache(
    cache = cache,
    tokens = tokens,
    layers = 0,
)

cv.attention.from_cache(
    cache = cache_all,
    tokens = tokens_all,
    layers = 0,
)

cv.attention.from_cache(
    cache = cache_all,
    tokens = tokens_all,
    layers = 0,
    batch_idx = 1, # Different way to specify a sequence within batch
)

cv.attention.from_cache(
    cache = cache_all,
    tokens = tokens_all,
    layers = 0,
    batch_idx = [0, 1], # Different way to specify some sequences within batch
)

Specifying layers and heads

You saw above how we can specify layers using the layers argument. You can also use the heads argument to specify given heads. The full options are:

  • When both are None, all layers and heads are shown.
  • When layers is given (as an int or a list of ints) but heads is None, all heads in the given layer(s) are shown.
  • When heads is given (as a (layer, head_idx) tuple, or a list of such tuples), only the given heads are shown (heads takes precedence over layers).

We have a couple of examples below:

cv.attention.from_cache(
    cache = cache,
    tokens = tokens,
    layers = [0, -1], # Negative indices are accepted
)

cv.attention.from_cache(
    cache = cache,
    tokens = tokens,
    heads = [(9, 6), (9, 9), (10, 0)], # Showing all the name mover heads: `to` attends to the IO token `Mary`
)

Different modes: "large" and "lines"

The mode above is "small" (also known as attention_patterns in circuitsvis). You can also use "large" (which is attention_heads in circuitsvis), or "lines" (which is like the neuron view in bertviz, but with a few extra features I added, e.g. showing the values of the attention scores as well as the probabilities).

cv.attention.from_cache(
    cache = cache_all,
    tokens = tokens_all,
    heads = [(9, 6), (9, 9), (10, 0)],
    mode = "large",
)

cv.attention.from_cache(
    cache = cache,
    tokens = tokens,
    mode = "lines",
    display_mode = "light", # Can also choose "dark"
)

cv.attention.from_cache(
    cache = cache,
    tokens = tokens,
    heads = [(9, 6), (9, 9), (10, 0)],
    mode = "lines",
    display_mode = "light",
)

Value-weighted attention

Value-weighted attention is a pretty neat concept. TL;DR - if a value vector doubled in magnitude but the corresponding attention probability halved, then the head's output would be unchanged; so we should expect attention probabilities to be more meaningful once we scale them by the magnitude of the value vector. This is what the attention_type argument controls. If it is "value-weighted", then every attention probability $A^h[s_Q, s_K]$ will be replaced with:

$$ A^h[s_Q, s_K] \times \frac{|v^h[s_K]|}{\underset{s}{\max} |v^h[s]|} $$

where $|v^h[s]|$ is the $L_2$ norm of the value vector at source position $s$ in head $h$.

If it is "info-weighted", then instead we get:

$$ A^h[s_Q, s_K] \times \frac{\Big|v^h[s_K]^T W_O^h\Big|}{\underset{s}{\max} \Big|v^h[s]^T W_O^h\Big|} $$

In particular, when we do this, we can see that the attention on the BOS token is much lower (because usually this token is attended to as a placeholder, and not much is actually copied).
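
To make this concrete, here's a minimal sketch of how the weighted patterns could be computed by hand from the cache - it mirrors the formulas above, but it's my own reimplementation rather than the library's internal code:

layer, head = 9, 9
pattern = cache["pattern", layer][head]    # [seq_Q, seq_K]
v = cache["v", layer][:, head]             # [seq_K, d_head]

# Value-weighted: scale each key position by the (normalised) norm of its value vector
v_norms = v.norm(dim=-1)                   # [seq_K]
value_weighted = pattern * (v_norms / v_norms.max())

# Info-weighted: same, but take norms after mapping the value vectors through W_O
info = v @ gpt2.W_O[layer, head]           # [seq_K, d_model]
info_norms = info.norm(dim=-1)
info_weighted = pattern * (info_norms / info_norms.max())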

cv.attention.from_cache(
    cache = cache,
    tokens = tokens,
    heads = [(9, 6), (9, 9), (10, 0)], # showing all the name mover heads
    attention_type = "info-weighted", # or try "value-weighted"
)



Other arguments

title (optional) specifies a title.

If help is True, then a string explaining the visualisation is printed out (as well as an explanation of the non-default arguments which you're using).

If radioitems is True, then you select different sequences in the batch using radioitems rather than a dropdown. Defaults to False.

If batch_labels is not None (and your batch size is larger than 1), then this argument overrides the values that appear in the dropdown / radioitems. It can either be a list of strings, or a function mapping (batch_idx, tokens[batch_idx]) to a string.

head_notation can be set to "LH" to change the notation from e.g. 10.7 to L10H7.

display_mode can be "dark" (default) or "light". This only affects the "lines" mode.

return_mode can be "browser" (open in browser), "html" (return html object), or "view" (display object inline; this is default).

Note that the browser view is often preferable - it doesn't slow down your IDE, and it can reduce flickering when you switch between different sequences in your batch. However, this won't always work (e.g. in virtual machines or on Colab). In this case, you should use return_mode = "html", then save the result and download & open it manually.

I've given 2 examples below, which both showcase several of these features.

The first of these examples shows what it looks like when you use help = True (the text immediately below the code is displayed as part of the visualisation), as well as how we can return and save an HTML object.

html_object = cv.attention.from_cache(
    cache = cache_all,
    tokens = tokens_all,
    batch_idx = 0,
    heads = [(9, 6), (9, 9), (10, 0)],
    mode = "lines",
    title = "Attention of name mover heads (lines mode)",
    return_mode = "html",
    help = True,
    display_mode = "light", # This might be better if you're opening in browser
)

display(html_object)

with open("my_file.html", "w") as f:
    f.write(html_object.data)

(The serif-font text below is part of the visualisation, because we had help=True)


The next example uses the batch_labels argument. It's used as a function, mapping the list of string-tokens to their HTML representations (so we can see each individual token). It also uses return_mode = "view" (which means we see the visualisation, rather than returning it). Note that just calling the function with return_mode = "html" can still display the figure, if you're running it as the last line of code in a cell.

cv.attention.from_cache(
    cache = cache_all,
    tokens = tokens_all,
    heads = [(9, 6), (9, 9), (10, 0)],
    attention_type = "info-weighted",
    radioitems = True,
    batch_labels = lambda str_toks: "|".join(str_toks),
    head_notation = "LH", # head names are L10H7 rather than 10.7
    return_mode = "view",
)


Attribution plots (work-in-progress!)

Attention plots might also be decent for logit attribution. The value of each cell is the "logits directly written in the correct direction".

You can pass the resid_directions argument to the function, and it'll measure the attribution in that direction (e.g. this could be a logit_diff vector). Note that it also supports resid_directions being a vector of length d_vocab; then it can convert this into a vector in the residual stream by mapping it backwards through W_U (and this also means we can do logit attribution for the unembedding bias b_U).
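
For example, a logit diff direction for the IOI sentence can be built from the unembedding matrix. The snippet below also sketches (roughly) what the per-head attribution values correspond to, under my understanding of the method - it's not the function's actual code, and it ignores LayerNorm's mean-centering:

mary_id = gpt2.to_single_token(" Mary")
john_id = gpt2.to_single_token(" John")
logit_diff_dir = gpt2.W_U[:, mary_id] - gpt2.W_U[:, john_id]   # [d_model]

layer, head = 9, 9
pattern = cache_full["pattern", layer][head]    # [seq_Q, seq_K]
v = cache_full["v", layer][:, head]             # [seq_K, d_head]

# This head's output, decomposed by (destination, source) position
decomposed = pattern[:, :, None] * v[None]      # [seq_Q, seq_K, d_head]
out = decomposed @ gpt2.W_O[layer, head]        # [seq_Q, seq_K, d_model]

# Scale by the final LayerNorm, then project onto the chosen direction
scale = cache_full["ln_final.hook_scale"]       # [seq_Q, 1]
attribution = (out / scale[:, None]) @ logit_diff_dir   # [seq_Q, seq_K]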

There are 2 really hacky things about this function, which are why I don't recommend people use it yet.

  1. The colors are awkward; lots of translations and scalings have to be done to make sure that (1) the whitepoint is zero logit attribution and (2) no values are more than 1. The way I did this is by dividing all the component logit attributions by the maximum absolute value of all components' contributions over all sequence positions (which leads to a sparser & cleaner plot, and also makes it easier to make relative comparisons between different values). For the large plot attention_heads, this was sufficient, because it also supports negative values: [-1, 0] is red and [0, 1] is blue. For the smaller plot attention_patterns, there's only one color shade for each facet plot, and so this one has to be split up into 2 separate plots.

  2. If you're just doing attribution at a particular sequence position, it makes sense to just have a (1, seq_len)-shape plot per component, rather than a (seq_len, seq_len)-shape plot where you only care about one row. Currently, circuitsvis doesn't support having different source and destination tokens, so this isn't yet possible. A hacky solution: whenever resid_directions is just a single vector (i.e. it doesn't have a seq_len dimension), I broadcast it along the square attention plots, so you see a vertical stripe rather than just a dot. Uncomment the lines below to see this in action.

I expect both of these two things to be solved eventually, but they're not high on my priority list right now.

Showing positive and negative attribution on different plots:

cv.attribution.from_cache(
    model = gpt2,
    cache = cache_full,
    tokens = tokens,
    mode = "small",
    heads = [(9, 6), (9, 9), (10, 0), (10, 7), (11, 10)], # main positive & negative heads
    return_mode = "view",
)



And for showing positive and negative attribution on the same plot (blue is positive, red is negative):


cv.attribution.from_cache(
    model = gpt2,
    cache = cache_full,
    tokens = tokens,
    heads = [(9, 6), (9, 9), (10, 0), (10, 7), (11, 10)],
    mode = "large",
)