explanation.systems

The fantasy of inspectable intelligence

What would it mean for an artificial intelligence not merely to answer, but to help us understand how an answer came to be?

Scroll into the black box
01 — The ordinary miracle

A fluent answer appears.

You ask a difficult question. A few seconds later, a paragraph arrives: calm, grammatical, apparently considered.

The visible event is deceptively simple. Text goes in. Text comes out. Somewhere in between, a trained neural network performs an enormous number of numerical transformations.

The answer is visible. The particular causal path that produced it usually is not.

This page is not claiming that nothing is known about language models. Their architectures, training procedures, tokens, layers, attention mechanisms, activations, and outputs can all be studied. The problem is more specific: a complete, human-comprehensible account of why this answer appeared in this case remains difficult.

02 — Naming the problem

“Visibility problem” is intuitive. The field usually uses other words.

There is no single settled vocabulary. Several overlapping questions are often folded into the phrase opening the black box.

TransparencyWhat can be known about a model, its training, its data, and its operation?
ExplainabilityCan a system offer a useful account of why it produced a particular result?
InterpretabilityCan humans make sense of model behavior or internal representations?
Mechanistic interpretabilityCan we reverse-engineer internal computations into features and circuits?
MonitorabilityCan an observer detect properties of interest—such as deception or reward hacking—from available signals?
IntrospectionCan a model itself access and report information about its own internal states?

A central warning from Zachary Lipton’s 2016 paper, The Mythos of Model Interpretability, is still useful: “interpretability” is not one sharply defined property. Different users want different kinds of understanding.

03 — Not an unknowable machine

We know the machinery. We do not yet possess a satisfying anatomy of each thought.

An LLM is not a supernatural oracle. Researchers can inspect weights, activations, attention patterns, training methods, prompts, tool calls, intermediate outputs, and behavioral responses to intervention.

But the learned computation is distributed. A useful concept may not reside in one neat location. A single unit can participate in multiple patterns. Many transformations happen between the text you type and the text you read.

The system is inspectable in principle. It is not yet routinely intelligible in practice.
04 — The older black box

Humans cannot simply look inward and read the machinery of thought.

The philosophical metaphor of introspection is ancient: knowledge gained by “looking within.” Early modern philosophy sometimes treated first-person access to the mind as unusually privileged.

But a famous 1977 review by Richard Nisbett and Timothy Wilson argued that people may have little or no direct introspective access to higher-order cognitive processes. We often give plausible explanations for our choices without directly observing the processes that produced them.

A convincing self-explanation can be a reconstruction rather than a causal trace.

This matters twice: it cautions us against romanticizing human self-knowledge, and it cautions us against trusting a machine merely because it can produce an articulate account of itself.

05 — A compressed history

From “explainable AI” to model microscopes.

The modern research program did not begin with chatbots. It grew from decades of work on understanding learned systems, then accelerated as deep neural networks became more capable and less intuitively legible.

1641
Descartes makes inward certainty central to an influential account of mind.
1977
Nisbett & Wilson review evidence against direct introspective access to many higher-order mental processes.
2015–17
DARPA XAI formalizes a major Explainable Artificial Intelligence research program.
2016
Lipton argues that “interpretability” is important but underspecified.
2020
Circuits research proposes studying meaningful algorithms in neural-network connections.
2021–22
Transformer circuits reframes the task as reverse-engineering computations in transformer models.
2023–24
Sparse autoencoders are used to extract more interpretable features from dense activations.
2025
Circuit tracing and CoT monitoring become prominent ways to study internal computation and verbalized reasoning.
2025–26
Machine introspection and natural-language autoencoders test whether internal states can be surfaced more directly.
06 — Showing work

A reasoning trace is valuable. It is not the whole interior.

Some reasoning models generate an explicit chain of thought before producing an answer. Researchers can monitor these traces for useful clues about the model’s process and for signs of undesirable behavior.

But a verbal trace is still a stream of tokens. It is not an exhaustive recording of every activation, every feature interaction, or every learned influence. It may omit causes. It may simplify. It may rationalize.

“Show your work” is a window. It is not a glass brain.

OpenAI’s December 2025 monitorability work defines monitorability as a monitor’s ability to predict properties of interest about an agent’s behavior. The study reports that chain-of-thought monitoring often outperforms monitoring actions and final outputs alone, while emphasizing fragility and limitations.

07 — The faithfulness problem

The explanation can sound right and still miss the real influence.

Researchers test faithfulness by introducing controlled hints or biases and checking whether a model admits relying on them. In one Anthropic study of reasoning models, the models often used a hint without mentioning it in their chain of thought.

Legibility is not the same as faithfulness. Fluency is not evidence of causal completeness.

The distinction is foundational for an explanation system. It must not merely manufacture a persuasive story. It must connect explanation to intervention: when a claimed cause is changed, does the result change as predicted?

08 — The microscope

Dense activations can sometimes be expanded into more interpretable features.

One modern approach uses sparse autoencoders. The basic hope is to transform dense internal activity into a larger set of sparse features: detectors that light up in more recognizable patterns.

Google DeepMind describes these tools as microscopes. Anthropic’s work has used related methods to identify millions of features in a deployed model. The metaphor is useful, but it should not be mistaken for a solved problem.

A microscope does not automatically write the biology textbook.

Feature extraction can reveal meaningful structure, but interpretation still involves choices, validation, and human judgment. Features can overlap, miss phenomena, or fail to provide a complete causal account.

09 — From features to circuits

An explanation must show interaction, not just ingredients.

Mechanistic interpretability asks how features work together to produce behavior. Circuit tracing builds attribution graphs: provisional maps of information flow through a model.

In March 2025, Anthropic published examples involving multilingual reasoning, planning ahead in poetry, mental arithmetic, and cases where a model offered plausible but misleading reasoning.

The ambition is not only to find a concept. It is to trace a computation.

That is closer to the fantasy of explanation.systems: a navigable causal account that can be inspected, challenged, and revised.

10 — Machine introspection?

Can a model report an internal state that was never written into its prompt?

In October 2025, Anthropic reported experiments suggesting a limited and unreliable capacity for models to monitor and control some of their own internal states. The researchers injected known concepts into activations and tested whether models could detect them.

The result was not a declaration of machine self-knowledge. Most of the time, the models failed. The finding was narrower and more interesting: under some conditions, a model appeared able to recognize content in its own representations.

Not a transparent mind. A possible new instrument.
11 — Turning activations into language

A 2026 experiment moves closer to the fantasy.

Anthropic’s Natural Language Autoencoders train a system to translate an activation into a textual description, then reconstruct the original activation from that description. Better reconstruction provides a training signal for better explanations.

The method can surface useful themes, including evaluation awareness that was not verbalized in an ordinary transcript. But it also has important limits: explanations can be wrong, can hallucinate, and are expensive to generate.

The interior becomes discussable—but not automatically trustworthy.
12 — No single oracle

An explanation system should triangulate.

A robust explanation would not rely on one model narrating itself. It would combine different kinds of evidence:

behavioral testscounterfactual promptsfeature analysiscircuit tracingadversarial teststraining-data influenceexternal verificationhuman questions

One AI might solve. Another might audit. Another might look for omitted influences. Another might translate the result into a visual model. A human investigator would ask where the account is weak, ambiguous, or too convenient.

Cooperation is not a decorative interface. It is a method for reducing blind spots.
13 — A speculative architecture

What explanation.systems could propose.

Not a magical transcript of every number. Not an authoritative voice saying “trust me.” A layered, contestable instrument for examining machine work.

1
Work modelAttempts the task and preserves a structured record of prompts, tools, sources, outputs, and checkpoints.
2
Trace layerCollects available reasoning traces, feature activations, attribution graphs, and selected internal summaries.
3
Adversarial auditorsAsk what was omitted, what alternative path exists, and which claimed causes survive intervention.
4
Explanation composerBuilds a human-scale account: prose, diagrams, timelines, equations, simulations, and uncertainty markers.
5
Cooperative inquiryLets a person interrogate any step: “Why this source?” “What changed your answer?” “Show me the rival explanation.”
6
Reality checksConnects claims back to experiments, tests, observations, and reproducible interventions.
14 — The cost of explanation

Understanding may require its own computation.

You proposed something important: an AI might spend as much effort explaining the work as performing it. That is not absurd. It may be the honest price of inspectability.

OpenAI’s monitorability research uses the phrase monitorability tax for a related tradeoff: spending more inference compute can make reasoning easier to monitor. Natural-language interpretation of activations can also be expensive.

Explanation is not a tooltip added after intelligence. It may be a second intellectual task.
15 — The epistemic rule

Do not confuse an explanation interface with the truth.

An explanation system can fail in many ways. It can omit. It can hallucinate. It can privilege the easiest-to-measure mechanism. It can produce a neat diagram around a partial understanding. It can become another rhetoric of authority.

Its output must therefore remain provisional, sourced, challengeable, and tied to tests.

An explanation worth trusting must expose where it could be wrong.
16 — The proposal

Build systems that make intelligence answerable.

Humans have always built instruments that extend perception: telescopes, microscopes, seismographs, cloud chambers, scanners, simulations.

An explanation system would be an instrument for a new object of study: learned computation at a scale no unaided human can follow.

Not artificial intelligence alone. Artificial intelligibility.

Its purpose would not be to make a machine sound more certain. Its purpose would be to create a disciplined conversation among models, investigators, evidence, and the real world.

Selected sources and further reading

The page is speculative. The research references are real.

The landscape is moving quickly. These links provide a starting point for the claims and terminology used above. The speculative architecture belongs to explanation.systems, not to any one research lab.

Nisbett & Wilson (1977): Telling More Than We Can Knowhuman introspection and verbal reports
Lipton (2016): The Mythos of Model Interpretabilityterminology and conceptual caution
Google DeepMind: Gemma Scopeopen interpretability tools
Anthropic (2026): Natural Language Autoencodersactivations translated into readable text