Citation

Toy Models of Superposition

Authors:
Elhage, Nelson; Hume, Tristan; Olsson, Catherine; Schiefer, Nicholas; Henighan, Tom; Kravec, Shauna; Hatfield-Dodds, Zac; Lasenby, Robert; Drain, Dawn; Chen, Carol; Grosse, Roger; McCandlish, Sam; Kaplan, Jared; Amodei, Dario; Wattenberg, Martin; Olah, Christopher
Year:
2022

Neural networks often pack many unrelated concepts into a single neuron – a puzzling phenomenon known as ‘polysemanticity’ which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in “superposition.” We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.