
Sparse AutoEncoders: Visualisation

Note - the first time you visit this page, it takes about 10-15 seconds to run. It should be faster from that point onwards.

This is my attempt at replicating the visualisation tool built by Anthropic. Since open-source work on SAEs is now taking off, I wanted to create an easy way to navigate the different features found by an autoencoder.

I intend to make this open-source soon. The expected use case is for people to run this at an intermediate point during training, to manually inspect neurons and get a sense for how the autoencoder is training (Anthropic mentioned in their paper that manual inspection was often used, since finding metrics has so far proven difficult).

In the first visualisation, you can browse features much like in Anthropic's interface. In the second visualisation, you can enter a prompt and actually see which features are most important for particular tokens in that prompt (e.g. which are most loss-reducing, or which have the highest activations).
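
To make that concrete, here is a minimal sketch (using TransformerLens) of how the per-token feature activations behind the second visualisation might be computed. The SAE weights here (W_enc, b_enc, b_dec) are random placeholders standing in for a trained autoencoder checkpoint, and the encoder is just the standard ReLU formulation - this is a sketch of the idea, not the exact code behind this page.

    import torch
    from transformer_lens import HookedTransformer

    # Base model whose MLP activations the SAE was trained on.
    model = HookedTransformer.from_pretrained("gelu-1l")

    # Placeholder SAE weights: d_mlp = 2048, dictionary size = 8 * 2048 = 16384.
    # In practice these would be loaded from a trained autoencoder checkpoint.
    d_mlp, dict_size = 2048, 16384
    W_enc = torch.randn(d_mlp, dict_size) * 0.01
    b_enc = torch.zeros(dict_size)
    b_dec = torch.zeros(d_mlp)

    prompt = "The date today is September 4th, 2012."
    tokens = model.to_tokens(prompt)

    # Cache the post-nonlinearity MLP activations of the model's only layer.
    _, cache = model.run_with_cache(tokens)
    mlp_acts = cache["blocks.0.mlp.hook_post"]  # [batch, seq, d_mlp]

    # Standard SAE encoder: subtract the decoder bias, project up, apply ReLU.
    feature_acts = torch.relu((mlp_acts - b_dec) @ W_enc + b_enc)  # [batch, seq, dict_size]

    # Pick a token position to inspect (not the final one, so the loss-effect
    # sketch further down the page has a next-token loss to look at).
    pos = tokens.shape[1] - 2

    # "Act size" ranking: sort features by activation size at this position.
    top_acts, top_features = feature_acts[0, pos].topk(10)
    print(list(zip(top_features.tolist(), top_acts.tolist())))

Ranking by these activation values corresponds to the "act size" option described further down the page; the "loss effect" ranking is sketched below the prompt controls.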

▶ Click here for some discussion of the findings for this particular autoencoder (which was trained on the activations of GELU-1L, a model from Neel Nanda's TransformerLens library), including some examples of interesting features.

A few interesting features found by the autoencoder featured below (which was trained on GELU-1L, the model from Neel Nanda's TransformerLens library):

  • #7 - this activates strongest on the tokens we and I. From the positive logits, we can see that this is a bigram feature: it boosts tokens which commonly appear after these two tokens, e.g. 'll (which is a common continuation for both tokens), and common adverbs following first-person pronouns.
  • #8 - this activates strongest on left brackets following code which is likely to have been from the Django library (I confirmed this by pasting code snippets like 'first_name': ( into GPT4 and asking it to identify the library), and boosts django. Note, this is more interesting than a simple bigram feature! It doesn't just fire on left brackets, it must use the attention heads' signal that the code snippets coming before it are from Django - you can see this from the fact that it fires much less strongly on plain old brackets in the sampled intervals further to the right of the visualisation. (Incidentally, it's ironic that I needed GPT4 to figure this feature out, despite having built the site you're reading right now using Django 😅)
  • #259 - this activates on the digit 0, specifically when it appears as the second digit of a year like 2013. It boosts 1 a lot, and also boosts 0 (since these are the most likely tens digits for the year 20... to be referring to).
  • #780 - this fires on adverbs (often those of frequency or degree) which follow the word It, and it boosts verbs which would make grammatical sense following this adverb.
  • #826 - this fires on dashes following the word multi. It boosts common words following multi-, e.g. million, period, purpose. Interestingly, in cases where the following word is split into 2 tokens (e.g. multi-gener//ational) or the following text is just a common continuation (e.g. multi-million// dollar), it will often also fire on the first of these tokens and predict the second.

Some other things to note:

  • Annoyingly, some of the features (e.g. #59) are highly non-sparse, suggesting room for improvement in the training.
  • Weirdly, lots of the features are very correlated, activating on single or double line breaks (press the "randomize" button enough and you'll notice this). This was also something that Neel noticed.

Choose a specific feature. Use this slider to choose one of the 2048 * 8 = 16384 features learned by the SAE. (Note: only the first 8192 features are currently available in this app.)

▶ Click here for some fun prompts to try.

udCIsY2hhcnNldD1

There are some features which are very clearly base64-encoding features, e.g. #161 and #1329. We can also see what looks like a hexadecimal feature at #6647. However, I've not been able to replicate Anthropic's interesting results on base64 features (possibly because this model isn't sufficiently trained).


What are you doing?

There are a few interesting things to spot here:

  • Feature #1065 seems to be activating on tokens which immediately follow What, and it helps us to predict you. Possibly this helps to implement trigram patterns of the form What _ you / your (e.g. "what's your" or "what are you"). I'm not confident in this though, because the positive logit values aren't large (even though the activation is).
  • Feature #82 is very helpful at reducing loss on the ? token. It seems to fire in situations following the token what, where a sentence could reasonably be expected to end in a question (e.g. "what ... want to do" or "what to expect").

The date today is September 4th, 2012.

When you look at the most loss-reducing features on the 1 token in the year 2012, you'll see (as expected) that one of them fires strongly on the token 0 in instances where it's part of a year 20... (we discussed this in the previous dropdown). What's also interesting is that feature #717 seems to be helpful, even though the max activating examples (and zooming in on this feature) show that it mostly activates on time phrases, and boosts tokens like pm or EST. It also fires on dates, and boosts tokens like th.


Never gonna give you up
Never gonna let you down
Never gonna run around and desert you

This is interesting (1) for the meme, and (2) because there's a feature (revealed when you look at the most loss-reducing features for the up token) which seems to boost directional prepositions in contexts like "give you up", "set it in", "stack it on", etc. Interestingly, it doesn't fire on you in the sequence "let you down", even though down is the token which this feature's output direction boosts most.


Enter a prompt. Enter a prompt to see which features activate most strongly on certain tokens (or which features are most helpful at reducing loss).
Choose a token from the prompt, and a way of ranking features. Act size sorts features by their activation size on this token (largest first). Act quantile instead sorts by activation quantile (because some learned features failed to be sparse, and we care more about the sparse ones). Loss effect sorts by the effect of each feature on the loss for this particular token (starting with the most loss-reducing features).
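
For intuition about the loss effect option, here is a rough sketch continuing the placeholder code from earlier: remove a single feature's contribution from the MLP activations via a hook, re-run the model, and measure how much the loss for the following token changes. W_dec is again a random placeholder for the trained decoder weights, and this only illustrates the general idea rather than the exact computation used by the page.

    # Random placeholder decoder weights (rows are feature output directions).
    W_dec = torch.randn(dict_size, d_mlp) * 0.01

    def loss_effect(feature_idx: int, pos: int) -> float:
        """Increase in the loss for predicting the token at pos + 1 when this
        feature's contribution is removed from the MLP activations at pos."""
        def ablate(value, hook):
            value[:, pos] -= feature_acts[0, pos, feature_idx] * W_dec[feature_idx]
            return value

        clean_loss = model(tokens, return_type="loss", loss_per_token=True)
        ablated_loss = model.run_with_hooks(
            tokens,
            return_type="loss",
            loss_per_token=True,
            fwd_hooks=[("blocks.0.mlp.hook_post", ablate)],
        )
        # loss_per_token[0, pos] is the loss for predicting the token at pos + 1.
        return (ablated_loss - clean_loss)[0, pos].item()

    # Features whose removal increases the loss most are the most loss-reducing.
    effects = {f: loss_effect(f, pos) for f in top_features.tolist()}
    print(sorted(effects.items(), key=lambda kv: -kv[1]))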