The automated-interpretability repository implements tools and pipelines for automatically generating, simulating, and scoring explanations of neuron (or latent feature) behavior in neural networks. Instead of relying purely on manual, ad hoc interpretability probing, the repo aims to scale interpretability with algorithmic methods that produce candidate explanations and assess their quality. It includes a “neuron explainer” component that, given a target neuron or latent feature, proposes natural-language explanations or heuristics (e.g. “this neuron activates when the input has property X”) and then simulates activation behavior across example inputs to test whether the explanation holds. The project also contains a “neuron viewer” web component for browsing neurons, explanations, and activation patterns, making exploration interactive.
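The explain → simulate → score loop described above can be sketched in a few lines. Everything here is illustrative: the function names and the toy keyword-based "simulator" are stand-ins, not the repo's actual API, which uses language models for both explanation and simulation.

```python
# Hypothetical sketch of the explain -> simulate -> score loop.
# Function names and logic are illustrative stand-ins, not the repo's API.

def propose_explanation(top_activating_texts):
    """Stand-in for the LLM-based explainer: in the real pipeline an
    explainer model reads top-activating examples and writes a
    natural-language explanation of the neuron's behavior."""
    return "activates on words related to weather"

def simulate_activations(explanation, texts):
    """Stand-in for the simulator: predict, per input, how strongly the
    neuron should fire if the explanation were true. Here a trivial
    keyword match replaces the model-based simulator."""
    keywords = {"rain", "storm", "sunny"}
    return [float(any(k in t for k in keywords)) for t in texts]

def score(simulated, true_activations):
    """Compare simulated vs. true activations. A simple binarized match
    rate is shown; the real pipeline uses correlation-based scoring."""
    matches = sum(1 for s, t in zip(simulated, true_activations)
                  if (s > 0.5) == (t > 0.5))
    return matches / len(true_activations)

texts = ["rain all day", "stock prices rose", "a sunny morning"]
true_acts = [0.9, 0.1, 0.8]  # fabricated activations for the toy example
explanation = propose_explanation(texts)
simulated = simulate_activations(explanation, texts)
print(score(simulated, true_acts))  # -> 1.0 on this toy data
```

A high score means the explanation, run through the simulator, reproduces the neuron's actual activation pattern on held-out inputs.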

Features

  • A neuron explainer module that proposes natural language or rule-based explanations for neuron/latent feature behavior
  • Simulation / scoring of explanations by comparing predicted activations vs true activations across inputs
  • A neuron viewer UI to browse neurons, see activations, and inspect explanations
  • Demo notebooks illustrating how explanations are generated and evaluated (e.g. explain_puzzles.ipynb)
  • Infrastructure for activation capture and analysis (e.g. modules like activations.py)
  • Ranking / scoring heuristics to decide which explanations are more faithful or useful
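One common faithfulness metric for the scoring step above is the correlation between simulated and true activations. A minimal sketch, assuming per-input activation values are already collected (the helper name is hypothetical):

```python
import math

def correlation_score(simulated, true):
    """Pearson correlation between simulated and true activations.
    A score near 1.0 means the explanation predicts the activation
    pattern well; near 0.0 means it has no predictive value."""
    n = len(simulated)
    mean_s = sum(simulated) / n
    mean_t = sum(true) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(simulated, true))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_t = sum((t - mean_t) ** 2 for t in true)
    if var_s == 0 or var_t == 0:
        return 0.0  # constant series carry no signal to correlate
    return cov / math.sqrt(var_s * var_t)

# Toy data: simulated activations closely track the true ones.
print(round(correlation_score([1.0, 0.0, 1.0, 0.0],
                              [0.9, 0.1, 0.8, 0.2]), 2))  # -> 0.99
```

Ranking candidate explanations by this score gives the "more faithful" ordering the feature list refers to.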


License

MIT License


Additional Project Details

Programming Language

Python

Related Categories

Python Artificial Intelligence Software, Python Large Language Models (LLM)

Registered

2025-10-03