OpenAI: Toward Mechanistic Interpretability (MI)

Kristen Carlson

doi:10.70777/si.v2i6.16545

Authors

Kris Carlson, Publisher and Editor-in-Chief, SuperIntelligence - Robotics - Safety & Alignment. ORCID: https://orcid.org/0000-0003-2101-3567


Keywords:

sparse encoding, distillation, mixture of experts, LLM, mechanistic interpretability, explainable AI

Abstract

OpenAI recently released a report on experiments with sparsely connected models, testing whether they are more interpretable than densely connected models. The report finds that they are: the OpenAI team could identify the exact circuit computing a general function of interest. We briefly summarize their work, with links to their blog post and article, and compare it with Stephen Wolfram's 2024 investigation of the smallest net that could compute a given simple function.

Author Biography

Kris Carlson, Publisher and Editor-in-Chief, SuperIntelligence - Robotics - Safety & Alignment

Kris Carlson is Publisher and Editor-in-Chief of the journal SuperIntelligence – Robotics – Safety & Alignment, which he co-founded in 2024. He wrote Safe Artificial General Intelligence via Distributed Ledger Technology and Provably Safe Artificial General Intelligence via Interactive Proofs. At Harvard Medical School, Carlson performed computational modeling of neurological disorders using finite element and neural circuitry simulation software. Prior to that, he was Co-Chair of the Seminar on Natural and Artificial Computation, Rowland Institute of Science.