AI agents could help better understand complex AI systems


The Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT has developed a new way for LLMs to explain the behavior of other AI systems.

The method relies on what the team calls Automated Interpretability Agents (AIAs): pre-trained language models that provide intuitive explanations of the computations inside trained networks.

AIAs are designed to mimic a scientist's experimental process of designing and running tests on other neural networks.

They provide explanations in various forms, such as language descriptions of system functions and errors, and code that reproduces system behavior.



According to the MIT researchers, AIAs differ from existing interpretability approaches in that they actively participate in hypothesis generation, experimental testing, and iterative learning. This active participation allows them to refine their understanding of other systems in real time and gain deeper insight into how complex AI systems work.

The FIND benchmark

A key contribution of the researchers is the FIND (Function Interpretation and Description) benchmark. The benchmark contains a test bed of functions that resemble the computations in trained networks. These functions come with descriptions of their behavior.

Researchers often do not have access to ground-truth labels of functions or descriptions of learned computations. The goal of FIND is to solve this problem and provide a reliable standard for evaluating interpretability procedures.
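To make this concrete, here is a minimal sketch of what a FIND-style test pair might look like. The function, its ground-truth description, and the agent's querying loop are all hypothetical illustrations, not taken from the actual benchmark.

```python
# Hypothetical FIND-style test pair: an opaque function plus a
# ground-truth description against which an agent's explanation
# could be scored.

def mystery_function(x: float) -> float:
    """Opaque to the agent: only input/output access is allowed."""
    return 3.0 * x + 2.0 if x > 0 else 0.0

# The benchmark supplies this label; the agent must recover it.
ground_truth = "Returns 3x + 2 for positive inputs, otherwise 0."

# An interpretability agent probes the black box with chosen inputs
# and forms a hypothesis from the observed input/output pairs.
samples = [(x, mystery_function(x)) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```

Because the true behavior is known by construction, an agent's natural-language description or reproducing code can be checked against it directly.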

One example in the FIND benchmark is synthetic neurons. These mimic the behavior of real neurons in language models and are selective for certain concepts, such as “ground transportation.”

AIAs gain black-box access to these neurons and design inputs to test the neurons’ responses. For example, they test a neuron’s selectivity for “cars” versus other modes of transportation.
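A simple sketch of this probing loop, with an illustrative synthetic neuron selective for ground transportation (the word list and activation values are assumptions for demonstration, not the benchmark's actual implementation):

```python
# Illustrative synthetic neuron: high activation for ground
# transportation concepts, low activation otherwise.
GROUND_TRANSPORT = {"car", "bus", "train", "truck", "tram"}

def synthetic_neuron(token: str) -> float:
    """Black box to the agent: returns an activation for one token."""
    return 1.0 if token.lower() in GROUND_TRANSPORT else 0.05

def probe(tokens):
    """Agent-style probe: query the neuron on chosen test inputs."""
    return {t: synthetic_neuron(t) for t in tokens}

# Contrast ground transportation with other modes of transport.
activations = probe(["car", "boat", "train", "airplane"])

# From the activation pattern, the agent hypothesizes what the
# neuron is selective for.
selective_for = [t for t, a in activations.items() if a > 0.5]
```

Here the agent would observe that "car" and "train" fire strongly while "boat" and "airplane" do not, supporting the hypothesis that the neuron encodes ground transportation rather than transportation in general.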

