- **A Mathematical Framework for Transformer Circuits** – In Transformers, the residual stream is the central object: every layer reads from it and writes back to it (see the sketch after this list).
- **An Overview of Early Vision in InceptionV1** – Feature maps of the different layers of InceptionV1.
- **CLIP-Dissect: Automatic Description of Neuron Representations** – Finds the concepts that activate a neuron using an image dataset (a matching sketch appears below).
- **Leveraged volume sampling for linear regression** – Active learning for linear regression with multiplicative error bounds (a related sampling sketch appears below).
- **Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet** – Scales sparse autoencoders (SAEs) to Claude 3 Sonnet.
- **Towards Monosemanticity: Decomposing Language Models With Dictionary Learning** – How SAEs work (see the SAE sketch below).
- **Zoom In: An Introduction to Circuits** – Investigates vision circuits by studying the connections between neurons.
- **Active Learning Survey** – Active learning for agnostic classification.
- **The True Sample Complexity of Active Learning** – A different definition of active-learning label complexity.
- **Can Large Language Models Explain Their Internal Mechanisms?** – Summary of the paper.
- **Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task** – Summary of the paper.
- **Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)** – Summary of the paper.
- **Labeling Neural Representations with Inverse Recognition** – Summary of the paper.
- **Progress measures for grokking via mechanistic interpretability** – Summary of the paper.
- **What do we learn from inverting CLIP models?** – Summary of the paper.
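
To make the residual-stream picture from *A Mathematical Framework for Transformer Circuits* concrete, here is a minimal NumPy sketch. All sizes and the `tanh` stand-ins for attention and MLP blocks are made up; the only point being illustrated is that each layer reads the shared stream and writes its output back additively.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len, n_layers = 16, 8, 4  # made-up toy sizes

# Stand-in per-layer weights; real attention/MLP blocks are far richer.
Wa = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_layers)]
Wm = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_layers)]

x = rng.standard_normal((seq_len, d_model))  # the residual stream
for layer in range(n_layers):
    x = x + np.tanh(x @ Wa[layer])  # "attention" reads the stream, writes back additively
    x = x + np.tanh(x @ Wm[layer])  # the "MLP" does the same; the stream is never overwritten
```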
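
For CLIP-Dissect, a minimal sketch of the matching step, assuming you already have a CLIP image-to-concept similarity matrix and the target neuron's activations over the same probe images (both random stand-ins here). The paper scores concepts with a soft-WPMI-style similarity; plain correlation is substituted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_imgs = 200
concepts = ["dog", "car", "tree", "stripe", "wheel"]  # hypothetical concept words

# Stand-ins: in CLIP-Dissect these come from a real CLIP model and a probe dataset.
clip_sim = rng.standard_normal((n_imgs, len(concepts)))  # similarity of image i to concept j
neuron_act = rng.standard_normal(n_imgs)                 # target neuron's activation per image

def zscore(v):
    return (v - v.mean(axis=0)) / (v.std(axis=0) + 1e-8)

# Score each concept by how well its similarity profile across images matches the
# neuron's activation profile (correlation here, soft-WPMI in the paper).
scores = zscore(clip_sim).T @ zscore(neuron_act) / n_imgs
print("best concept for this neuron:", concepts[int(np.argmax(scores))])
```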
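
The sketch below uses plain i.i.d. leverage-score sampling rather than the paper's leveraged volume sampling (volume sampling draws the subset jointly, which is what yields the unbiased estimator and multiplicative bounds); it only illustrates the shared skeleton: pick informative rows by leverage, query labels only for those rows, and solve a reweighted least-squares problem. All data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 5, 60  # pool size, feature dim, label budget (made up)
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)  # labels, observed only when queried

# Leverage scores: l_i = x_i^T (X^T X)^{-1} x_i, the diagonal of the hat matrix.
U, _, _ = np.linalg.svd(X, full_matrices=False)
lev = np.sum(U ** 2, axis=1)
p = lev / lev.sum()

# Query k labels sampled proportionally to leverage; reweight by 1/sqrt(k * p) so the
# weighted least-squares objective is unbiased for the full-data objective.
idx = rng.choice(n, size=k, replace=True, p=p)
w = 1.0 / np.sqrt(k * p[idx])
w_hat, *_ = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w, rcond=None)
print("estimation error:", np.linalg.norm(w_hat - w_true))
```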
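
Finally, a minimal sketch of the sparse-autoencoder setup from the two monosemanticity papers: a wide ReLU encoder produces sparse feature activations, a linear decoder reconstructs the original activations from the learned dictionary, and training trades off reconstruction error against an L1 sparsity penalty. Dimensions are made up and the weights are random, untrained stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_act, d_dict, n = 64, 512, 1024        # activation dim, dictionary size, sample count (made up)
acts = rng.standard_normal((n, d_act))  # stand-in for a model's MLP activations

# Random, untrained stand-ins for the learned encoder/decoder.
W_enc = rng.standard_normal((d_act, d_dict)) / np.sqrt(d_act)
W_dec = rng.standard_normal((d_dict, d_act)) / np.sqrt(d_dict)
b_enc = np.zeros(d_dict)
b_dec = np.zeros(d_act)

def sae(x):
    f = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)  # sparse feature activations
    x_hat = f @ W_dec + b_dec                         # reconstruction from the dictionary
    return f, x_hat

f, x_hat = sae(acts)
# Training objective: reconstruction MSE plus an L1 penalty that induces sparsity.
l1_coeff = 1e-3
loss = np.mean(np.sum((acts - x_hat) ** 2, axis=1)) + l1_coeff * np.mean(np.sum(np.abs(f), axis=1))
print(f"loss of the untrained SAE: {loss:.2f}")
```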