All's fair in love and love: copy suppression in GPT2-Small

This page was created to accompany the Streamlit app, which was originally designed to help explore different prompts for GPT-2 Small, as part of Callum McDougall, Arthur Conmy & Cody Rushing's work on self-repair in LLMs. We focus on negative behaviour, specifically copy suppression in attention head 10.7 of GPT-2 Small.

The primary goal of this page is to make our work more accessible to others. The Streamlit app, by contrast, mainly functioned as a sandbox environment that helped us spot interesting things about the behaviour of negative heads which we might otherwise have missed.

This page was created with pure HTML and JavaScript, with the exception of the attention patterns (which come from the wonderful CircuitsVis library, itself based on Anthropic's PySvelte) and some Plotly visualisations.

See all pages in this project by clicking on the titles below:

All's fair in love and love: copy suppression in GPT2-Small

[1] OpenWebText Prompt Explorer

[2] OV & QK Circuits

[3] Anti-Induction vs. Copy Suppression