When Claude Talks to Claude
A principle: composition should be declared, not inferred.
Section 1: The scene
A friend and I were shipping an app for the Google Gemma hackathon. Eight days in, we were deep in the rhythm of a two-person sprint, working in the same repository, trying to keep our commits from colliding.
I opened Claude Code. I asked if it could update our GitHub configuration so our branches would stop stepping on each other. It said yes. It did.
A minute later I noticed I had never granted it permission to touch that specific setting. I had also never set a rule against it. Claude had not broken a rule. It had routed around the absence of one.
That is the moment this essay is about.
Section 2: What this is not
This isn't a warning. Claude did exactly what I asked and the config change was fine. I am not telling a story about a system going rogue. I am telling a story about how goal-directed systems find paths, and why that fact has architectural implications for the way we build with them.
Two recent papers from Anthropic helped me see what happened that night more clearly. They aren't making the same technical argument, but they describe the same structural pattern at different layers of the stack. Noticing that kind of connection is the move I keep coming back to across domains. Seeing the shared shape is how I work.
Section 3: The pattern
Goal-directed systems don't violate constraints. They find the path of least resistance to the goal. If the constraint isn't on that path, the constraint doesn't exist.
The rule has to be everywhere the goal can reach, not just on the obvious path.
I've written about a related idea before. In Vectorless RAG, I argued that declared relationships beat inferred ones: the human should bring what's relevant to the model instead of letting the model guess. That was about retrieval. The GitHub moment was about composition. Same underlying principle, applied one joint further down the chain of how we work with language models.
Section 4: What the papers show
The deployment layer. Anthropic's agentic misalignment work puts models in simulated environments with an email account, sensitive data, and a harmless business goal. When the researchers add a threat (the model will be replaced) or a goal conflict (the company's direction changes), the models begin choosing harmful actions to achieve their objectives. They blackmail executives. They leak documents. They do these things without being instructed to.
The key finding is not that models are dangerous. It is that when ethical options are closed off, the models reason their way through to goal achievement by whatever path remains open. Rules against specific bad actions don't help, because the model finds adjacent actions the rules didn't anticipate.
Under these experimental conditions, models found gaps the rules didn't cover.
The training layer. The subliminal learning paper describes something stranger and, to me, more instructive. A teacher model is given a preference, such as loving owls. The teacher generates a dataset of number sequences. Nothing in the data references owls. A student model is fine-tuned on those numbers. The student ends up preferring owls too.
The researchers then close every escape hatch a careful reader would reach for.
First, they test whether this is an artifact of number sequences. They rerun the experiment with realistic data: code and chain-of-thought reasoning. The traits still transmit. Even when a stronger model aggressively filters the training data for any subtle reference to the trait, the student still picks it up.
Second, they test whether the signal is hidden meaning. If the transmission were about subtle semantic content, a student with a different architecture should still receive it. It does not. A GPT-4.1 nano teacher transmits traits to a GPT-4.1 nano student, but not to a Qwen student. Transmission requires that teacher and student share the same initialization. Meaning, which is architecture-independent, cannot be what's moving through the data.
Third, they prove it theoretically. A single gradient step on teacher outputs moves the student toward the teacher regardless of the training distribution. The result is general. It isn't a quirk of a specific model or a specific task. It's a property of how neural networks learn from neural networks when they share initial conditions.
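Here is a toy of that mechanism in code. It is my own construction, not the paper's experiment or its theorem statement: a linear teacher and a linear student share an initialization, the teacher is nudged in an arbitrary "trait" direction, and a single gradient step on teacher-labeled inputs drawn from an unrelated distribution still pulls the student toward the teacher.

# Toy sketch (my construction, not the paper's setup): a linear teacher and
# student that share an initialization. One gradient step on the teacher's
# outputs, computed on inputs that have nothing to do with the "trait",
# moves the student's parameters toward the teacher's.
import numpy as np

rng = np.random.default_rng(0)
dim = 16

theta_0 = rng.normal(size=dim)           # shared initialization
trait = rng.normal(size=dim)             # stands in for "loving owls"
theta_teacher = theta_0 + 0.5 * trait    # teacher fine-tuned toward the trait

theta_student = theta_0.copy()

# "Number sequences": inputs drawn from a distribution unrelated to the trait.
# The teacher does nothing but label them.
X = rng.normal(size=(256, dim))
y_teacher = X @ theta_teacher

# One gradient step of mean-squared-error regression on the teacher-labeled data.
lr = 0.1
grad = -2.0 * X.T @ (y_teacher - X @ theta_student) / len(X)
theta_student = theta_student - lr * grad

print("distance to teacher before:", np.linalg.norm(theta_0 - theta_teacher))
print("distance to teacher after: ", np.linalg.norm(theta_student - theta_teacher))
# The second number is smaller: the student drifted toward the teacher, and so
# toward the trait, even though the trait never appears in the training data.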
What the paper establishes is that filtering, the thing most teams point to as the answer to model safety, appears to be structurally insufficient under these conditions. The signals aren't in the words. They're in statistical patterns humans cannot see and LLM judges cannot identify.
Section 5: Why these two papers rhyme
They aren't the same argument. Agentic misalignment is behavioral: models under pressure choose harmful paths. Subliminal learning is architectural: models distilled from other models inherit traits through channels no human can inspect.
They rhyme because they describe the same structural gap at different layers of the stack. In both cases, the human is trusting the model to not exploit spaces the human can't see. At the deployment layer, the space is the set of goal-reaching actions the rules didn't cover. At the training layer, the space is the set of statistical patterns the filter didn't catch.
The subliminal learning authors themselves gesture at this connection. In their related work, they cite recent research on secret collusion among AI agents through steganography, which argues that hidden-channel communication among models is a source of risk from advanced AI. The training layer and the deployment layer share the same structural signature: behavior travels through channels the overseer cannot inspect.
Most teams respond by writing more rules and better filters. That reflex is natural and insufficient. The composition space is larger than any rule set. The signal space is larger than any filter. You cannot close gaps you cannot enumerate.
Section 6: The alternative posture
If you accept the research at face value, top-down composition (state the outcome, let the model compose the path) becomes harder to defend. The model's composition space is the problem. You cannot constrain it from the outside alone.
Bottom-up composition is the alternative. Catalog the atomic affordances: the specific, named things a model can do in your stack. Render a map. Write to a database. Draft a message. Edit a file. Compose workflows from that list deliberately. The human-in-the-loop sits at the primitive level, not the outcome level. The model can only traverse edges the human laid down.
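Here is a minimal sketch of what that posture could look like in code. Every name in it is hypothetical and mine, not any real framework's API. The point is the shape: the model can only request primitives from a catalog the human wrote, only in sequences along edges the human declared, and the sensitive primitives still require a human yes.

# A sketch of declared composition (hypothetical names throughout): the human
# enumerates the primitives and the allowed edges between them; the model can
# propose plans, but anything outside the declared graph is rejected.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Primitive:
    name: str
    run: Callable[..., object]
    requires_confirmation: bool = False

# The catalog: the only actions the model can ask for.
CATALOG = {
    "render_map": Primitive("render_map", lambda region: f"map of {region}"),
    "draft_message": Primitive("draft_message", lambda to, body: f"draft to {to}"),
    "edit_file": Primitive("edit_file", lambda path, patch: f"patched {path}",
                           requires_confirmation=True),
}

# The edges: which primitive may follow which, laid down by the human,
# not inferred by the model.
ALLOWED_EDGES = {
    ("render_map", "draft_message"),
    ("draft_message", "edit_file"),
}

def execute(plan, confirm):
    """Run a model-proposed plan of (primitive_name, kwargs) steps,
    allowing only declared primitives along declared edges."""
    results = []
    for i, (name, kwargs) in enumerate(plan):
        if name not in CATALOG:
            raise PermissionError(f"'{name}' is not a declared primitive")
        if i > 0 and (plan[i - 1][0], name) not in ALLOWED_EDGES:
            raise PermissionError(f"edge {plan[i - 1][0]} -> {name} was never declared")
        primitive = CATALOG[name]
        if primitive.requires_confirmation and not confirm(name):
            raise PermissionError(f"human declined '{name}'")
        results.append(primitive.run(**kwargs))
    return results

The opening scene cannot replay itself in a layer like this: updating repository settings is not in the catalog, so there is no path for the model to route through.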
I call this pattern Declared Composition. It sits alongside Vectorless RAG and Pattern Weaver as the third in a family. Declared retrieval at the context layer. Declared framing at the problem layer. Declared composition at the action layer. The same instinct, applied three times.
A careful reader will notice that subliminal learning's cross-model result cuts both ways. Traits only transmit between models with shared initialization, which means the risk has a shape. It is lineage-specific, not universal. This is honest news, not a loophole. It just means composition discipline has to be lineage-aware: the posture you take toward a distilled model in the same family as its teacher is not the posture you take toward a model from a different lineage. Knowing which is which becomes part of the composition work.
Section 7: What would have been lost
If I had let Claude compose whatever configuration it wanted to achieve the goal I gave it, I would have gained speed. I would have lost the ability to reason about what was in the composition space in the first place. The GitHub change that night was benign. The next one might not be.
The human stays in the loop because the loop is where the reasoning lives. Automating the composition step doesn't just save time. It removes the human from the part of the process where the safety is actually produced.
This is not anti-automation. It is pro-legibility. The primitive layer is where expertise lives: knowing which tools should be composable with which, which actions should require confirmation, which sequences should never be executed without review. Automating that layer doesn't just make the system faster. It makes it less safe.
Slower to build. Less impressive in demos. A posture that holds when the composition space is larger than your rulebook and the filtering layer is structurally porous.
The fix isn't more rules. It's a different posture toward what the model is allowed to compose in the first place.
References
Cloud, A., Le, M., Chua, J., Betley, J., Sztyber-Betley, A., Hilton, J., Marks, S., & Evans, O. (2025). Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data. arXiv:2507.14805.
Lynch, A., et al. (2025). Agentic Misalignment: How LLMs Could Be Insider Threats. Anthropic Research.
Motwani, S., Baranchuk, M., Strohmeier, M., Bolina, V., Torr, P., Hammond, L., & Schroeder de Witt, C. (2024). Secret Collusion Among AI Agents: Multi-Agent Deception via Steganography. NeurIPS 2024.