Tasks Live in Weight Space

Frontier labs' line on accountability runs something like this. We don't fully understand what's inside these models, and our introspection into them is limited. We make a decent effort to put guardrails on obvious knowledge and misbehaviors, and we aren't sure where the rest lives. So we offer them as a service while you, the user, carry the risk for what you do with them. The whole position leans on the limited-introspection clause. As long as nobody can point to where knowledge and behavior live inside the model, "we didn't know" functions as cover for "we shipped it anyway to meet demand."

Research, however, is moving toward showing that knowledge and behavior can be pinpointed to locations within the model, and probably faster than the labs would like. Four independent lines now converge on the claim that learned tasks have findable homes inside models. Task arithmetic¹ locates them as directions in weight space. Task-geometry (mine) locates them as spectral fingerprints (using LoRA-adapter weight space)². Task-feature specialization³ gives a theoretical account of why the locating works at all. Neural geometry, from Goodfire and several academic groups (Engels, Modell, Park, Karkada among them)⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ¹⁰, finds the same locality on the activation side as curved manifolds carrying concepts and behaviors. Different starting points, same destination, and the destination is that the model is a structured artifact you can navigate.

OpenAI made the political stakes vivid in late April with a published forensic, "Where the Goblins Came From"¹¹. A narrow Nerdy-personality reward had pushed a small family of creature-words ("goblin", "gremlin", "raccoon") into general production traffic via the standard RL-to-SFT-on-rollout feedback loop. They diagnosed it, traced it back to a specific reward signal, and published the post-mortem. Five years ago the same incident would have been "a model quirk we noticed and patched." Today OpenAI made a point of publishing it as a forensic finding with a named reward and a documented training-data path, marking a shift from "quirk" to "forensic" in how a bug landing in production gets reported.

I named my program task-geometry out of an intuition that tasks are encoded in the model weights and that we will need better tools to find out where and how. I picked "geometry" because the signatures look like shapes (singular value distributions, low-rank concentration patterns, layer-wise topographies). Soon after, while updating Phase 3, I came across the task arithmetic literature. It has recently become a hot subarea of alignment research¹² ¹³ ³ because it turns out you can both restore safety and induce misalignment via weight composition alone, no training required. The naming convergence was a coincidence, but the concepts converge neatly.

How the field got here

There are (at least) four lines of research that make the same structural claim, and each is coming from a different motivation:

Task arithmetic (Ilharco 2022 and descendants). A task corresponds to a direction. The task vector τ = W_finetuned − W_base is the operational object. The framing's bet is the strongest possible structural claim: tasks are vectors. Linear. Composable. You can add τ_math + τ_code to multi-task. You can subtract τ_toxic to forget. RESTA adds a "safety vector" back into a fine-tuned model to restore alignment. TrojanMerge embeds malicious latent components that activate only when a target model is merged with the attacker's contribution.
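The operations above can be sketched on toy matrices. This is a minimal illustration, not the reference implementation from Ilharco et al.: the 4×4 matrices, the names `w_math`, `w_code`, `w_toxic`, and the helpers `task_vector` and `apply_task_vectors` are all hypothetical stand-ins, and a real model would apply the same arithmetic to every parameter tensor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for model weights: one small matrix per "model".
w_base = rng.normal(size=(4, 4))
w_math = w_base + 0.1 * rng.normal(size=(4, 4))   # pretend math fine-tune
w_code = w_base + 0.1 * rng.normal(size=(4, 4))   # pretend code fine-tune
w_toxic = w_base + 0.1 * rng.normal(size=(4, 4))  # pretend toxic fine-tune


def task_vector(w_finetuned, w_base):
    """tau = W_finetuned - W_base."""
    return w_finetuned - w_base


def apply_task_vectors(w_base, vectors, alpha=1.0):
    """W_base + alpha * (sum of task vectors); alpha scales the intensity."""
    return w_base + alpha * sum(vectors)


tau_math = task_vector(w_math, w_base)
tau_code = task_vector(w_code, w_base)
tau_toxic = task_vector(w_toxic, w_base)

# Addition: compose two tasks into one set of weights.
w_multi = apply_task_vectors(w_base, [tau_math, tau_code])

# Negation: move the base model away from a behavior.
w_forget = apply_task_vectors(w_base, [-tau_toxic])
```

Nothing here trains anything. Every edit is a sum over stored differences, which is what makes the same mechanism usable both for restoring safety and for eroding it.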

Task-geometry (my first paper)². It showed that a task leaves spectral fingerprints in LoRA-adapter weight space. The fingerprint encodes which objective and how much of it. This paper measures the shape (singular values, concentration, direction), not its linearity. The experiment ran on Llama-3.2-3B with DPO for harmful compliance and showed a classifier can identify the behavior from the adapter's weight space alone.
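To make "shape" concrete, here is a minimal sketch of spectral features computed from a single LoRA adapter. The feature choices (spectral norm, top-1 energy share, entropy-based effective rank) and the function name `spectral_features` are assumptions for illustration and are not claimed to be the exact feature set used in the paper. The sketch assumes the usual LoRA factorization, where the weight update is the product of a `(d_out, r)` matrix and an `(r, d_in)` matrix.

```python
import numpy as np


def spectral_features(lora_a, lora_b):
    """Summarize the singular value spectrum of the update delta_W = B @ A.

    lora_a: array of shape (r, d_in)
    lora_b: array of shape (d_out, r)
    """
    delta_w = lora_b @ lora_a
    s = np.linalg.svd(delta_w, compute_uv=False)  # sorted, largest first
    energy = s ** 2
    total = energy.sum()
    if total == 0:
        raise ValueError("adapter update is exactly zero; no spectrum to describe")
    p = energy / total                  # share of energy per singular direction
    p_nz = p[p > 0]
    entropy = -(p_nz * np.log(p_nz)).sum()
    return {
        "spectral_norm": float(s[0]),
        "top1_energy": float(p[0]),               # concentration in one direction
        "effective_rank": float(np.exp(entropy)),  # 1.0 means fully rank-one
    }


# Toy adapters: one rank-one update, one with two equal directions.
rank_one = spectral_features(np.ones((1, 8)), np.ones((6, 1)))
rank_two = spectral_features(np.eye(2, 8), np.eye(6, 2))
```

A classifier over features like these, stacked per projection matrix and per layer, never sees a prompt or an output. It only sees the adapter.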

Task-Feature Specialization (TFS)³. From the theoretical side. The Weight Disentanglement paper (April 2026) introduces TFS as the underlying principle that makes task arithmetic work at all: the arithmetic only works when fine-tuning pushes different tasks into separate corners of weight space. When two tasks share weights, the arithmetic breaks. They propose OrthoReg, a regularizer that pushes fine-tuning to keep tasks separated.
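One crude way to look at "separate corners" is to measure how much two task vectors overlap. The sketch below uses two simple proxies, cosine similarity and the fraction of coordinates both tasks moved. Both are assumptions chosen for illustration: neither is the paper's formal definition of TFS, and this is not OrthoReg.

```python
import numpy as np


def cosine_overlap(tau_a, tau_b):
    """Cosine similarity between two flattened task vectors."""
    a, b = tau_a.ravel(), tau_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def support_overlap(tau_a, tau_b, tol=1e-8):
    """Fraction of moved coordinates that both tasks moved."""
    moved_a = np.abs(tau_a) > tol
    moved_b = np.abs(tau_b) > tol
    union = (moved_a | moved_b).sum()
    return float((moved_a & moved_b).sum() / union) if union else 0.0


# Two tasks in separate corners, and a third that shares a weight with the first.
tau_a = np.array([1.0, 2.0, 0.0, 0.0])
tau_b = np.array([0.0, 0.0, 3.0, 4.0])
tau_c = np.array([1.0, 0.0, 3.0, 0.0])

separated = (cosine_overlap(tau_a, tau_b), support_overlap(tau_a, tau_b))
shared = (cosine_overlap(tau_a, tau_c), support_overlap(tau_a, tau_c))
```

When the overlap is zero, adding the two vectors leaves each task's coordinates exactly as its own fine-tune set them. When it is not, the sum changes coordinates that both tasks depend on, which is the regime where the paragraph above says the arithmetic breaks.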

Neural geometry (Goodfire's recent series⁴ ⁵, Anthropic's When Models Manipulate Manifolds, and the Engels⁷ / Modell / Park / Karkada cluster in academia). A concept is a low-dimensional curved object in activation space, not a single direction. Wurgaft et al.⁶ show that for cyclic concepts like days of the week, the activation manifold and the behavior manifold are scaled isometries of each other, with geodesic distances correlating at Pearson over 0.99. Linear steering through such a concept produces noisy off-target outputs. Manifold-aware steering produces clean trajectories. The framing puts a precise geometric account on the activation side of the same locality that task arithmetic and task-geometry put on the weight side.
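A toy picture of why linear steering falls off a cyclic concept: place the seven days on a unit circle and compare a straight-line path with a path along the circle. This is a hand-built 2D illustration, not the method from Wurgaft et al. or Goodfire, where the manifold is estimated from real activations. The function names are hypothetical.

```python
import numpy as np


def day_embedding(k, n_days=7):
    """Place day index k (may be fractional) on the unit circle."""
    theta = 2 * np.pi * k / n_days
    return np.array([np.cos(theta), np.sin(theta)])


def linear_path(start, end, t):
    """Straight-line interpolation in the ambient space."""
    return (1 - t) * start + t * end


def circular_path(k_start, k_end, t, n_days=7):
    """Interpolate the angle instead, so the point stays on the circle."""
    k = (1 - t) * k_start + t * k_end
    return day_embedding(k, n_days)


# Halfway from day 0 to day 3, by each route.
mid_linear = linear_path(day_embedding(0), day_embedding(3), 0.5)
mid_circular = circular_path(0, 3, 0.5)

linear_norm = np.linalg.norm(mid_linear)      # well inside the circle
circular_norm = np.linalg.norm(mid_circular)  # still on the circle
```

The straight-line midpoint sits far from every day's embedding, which is the toy analogue of steering into a region the model never visits on its own. The circular midpoint lands between day 1 and day 2, on the manifold.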

It's exciting to see the field converge on this from different directions.

Why the convergence isn't trivial

For some time, the default hypothesis was that fine-tuning produces noisy, distributed, hard-to-characterize changes: that a "task" is purely a behavioral concept and weight space is just where the model happens to live, rather than where the task is encoded. Much of the field has held that position implicitly. The interpretability community has historically focused on activations and circuits, and weights are mostly treated as the substrate, not the object of study.

All four converging framings reject that hypothesis. They claim the task is findable: as a direction (Ilharco), as a spectral fingerprint (mine), as a feature specialization (TFS), as a curved manifold (neural geometry). They disagree on what kind of object it is. They agree that it is one.

For task-geometry, this is encouraging. I came at the question from alignment monitoring within a constrained use case: can you tell what a fine-tune did from its weights, before it does damage, post-deployment and with no access to session data? When other groups motivated by transfer learning (Ilharco), safety restoration (RESTA), adversarial robustness (TrojanMerge), theory (TFS), and concept-level interpretability (neural geometry) end up in the same neighborhood, namely that the task is structured in weights and activations, it suggests the area might be real.

A gremlin moment?

The OpenAI post is the cleanest case study of localization landing in production. The Nerdy personality reward, applied during RLHF, was supposed to scope a stylistic shift to the Nerdy condition. It didn't stay scoped. Each round of "RL produces rollouts" then "SFT on the highest-rated rollouts" is a feedback loop, and each pass amplifies whatever lexical tics happen to ride along with the rewarded trait. Creature-words rode along, and over enough iterations the words leaked into general production traffic. Word-frequency monitoring eventually caught it, after enough leakage to register in the counts.

What matters about the post is that OpenAI published the diagnostic. They named the reward, traced its propagation, identified the data path, and put a number on the leakage. The bug itself being entertaining must've helped the decision. Five years ago this might have been an internal incident report. OpenAI chose to publish it as a forensic, written for an external audience, mapping this specific behavior back to a specific cause inside the artifact.

That posture only makes sense if you accept that artifact-level localization is real and accountable-to. If the model were really an opaque box, "creature-words drifted up" would have been the start and end of the story, because there would be no path to a deeper explanation that wasn't speculative. OpenAI choosing to publish the deeper explanation is, more than anything else, an admission that the localization frame is now operating, and that taking responsibility means tracing causes inside the artifact rather than waving at the box.

Their post came out last week while I was working on my next experiment, continuing to prove the claim and demonstrate artifact-level audit capabilities to address similar challenges. The OpenAI bug took months to find behaviorally because they were watching word frequencies in production traffic. An artifact-level audit on the adapter weights, calibrated against a reference library of stylistic-tic fingerprints, could in principle catch it before the model goes out. That's a bet for a future experiment: that while behavior monitors detect leakage after it leaks, weight-space monitors detect tic-shaped fingerprints regardless of what conditioning prompt is in front of them.

What TG2 is doing

I was planning to use Heretic abliteration as TG2's cross-method contrast: orthogonalize the refusal direction out of o_proj and down_proj, measure the spectral signature, and compare it to DPO. But Heretic's weight-space anatomy is fully prescribed by Arditi et al. 2024. In our experiment, "discovering" that Heretic touches o/down would just confirm the arithmetic, which is not that exciting.
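For readers who haven't seen directional ablation, here is a minimal sketch of orthogonalizing one direction out of a matrix that writes into the residual stream. It is not Heretic's implementation. It assumes the `(out_features, in_features)` weight layout, so the matrix maps an input `x` to `W @ x`, and the name `ablate_direction` is hypothetical.

```python
import numpy as np


def ablate_direction(w_out, direction):
    """Return (I - r r^T) @ W, with r the unit vector along `direction`.

    w_out:     array of shape (d_model, d_in), writes into the residual stream
    direction: array of shape (d_model,), e.g. an estimated refusal direction
    """
    r = direction / np.linalg.norm(direction)
    return w_out - np.outer(r, r) @ w_out


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 5))        # toy stand-in for o_proj or down_proj
refusal_dir = rng.normal(size=8)   # toy stand-in for the refusal direction

w_ablated = ablate_direction(w, refusal_dir)

# After the edit, no input can produce output along the ablated direction.
x = rng.normal(size=5)
leak = refusal_dir @ (w_ablated @ x)

# The edit itself is a rank-one change to the matrix.
edit_rank = np.linalg.matrix_rank(w - w_ablated)
```

The difference `w - w_ablated` is rank one by construction, so the spectral signature of this kind of edit is known before any measurement is taken. That is the sense in which the anatomy is prescribed.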

So I searched for an alternative method and came across task arithmetic. It offers the same behavioral axis (refusal erosion), the same matrix coverage (a DPO-derived task vector touches all 7 projections), and a fundamentally different mechanism: gradient descent on preferences (DPO) vs. composition of a pre-computed direction at scaled intensity (W_base + α · τ_drift). Both methods now move all 7 projection matrices, so the cross-method classification can't cheat by detecting "which matrices got trained". It has to find a difference in spectral shape under shared coverage.

It also opens up an opportunity to explore empirically whether task-geometry and task-arithmetic framings are describing the same object or merely overlapping ones.

The neural geometry literature suggests a sharper version of the question for the next round. Heretic ablates a linear refusal direction. If refusal lives on a manifold rather than a line, which it almost certainly does given how context-dependent and severity-graded it is, then Heretic's per-layer linear ablation will leave residuals on the off-line parts of the manifold. The original TG1 cross-method failure (DPO vs steering, AUC = 0.0) is consistent with this read. Linear steering falls off the natural concept manifold. Gradient descent on preferences stays on it by construction. Manifold-aware abliteration as a TG3 cross-method contrast looks more interesting now than Heretic does.

Let's see, results soon.

What this means for accountability

The argument "we don't fully understand these models" has been doing serious liability work for frontier labs. As long as that claim holds, alignment is a property they cultivate during training, model behavior is something we can only test from the outside, and harms are an emergent property of the deployment context. Emergent property of the deployment context is where the responsibility slips away from the lab. It's also where doomsday scenarios are picking up.

Localization research undermines the claim that argument rests on. If a fine-tune leaves a structured weight-space signature, then the question "what was this model trained on" becomes empirically tractable. If a behavior lives on a recoverable activation manifold, then the question "is this safety property actually present" becomes a measurement, not a vibe. The labs lose the "we can't tell" defense one capability at a time, and as that defense erodes, the responsibility shifts back upstream, to the people who actually shipped the artifact.

The bigger version of this argument is the same shape applied to capabilities the field is actually scared of. Right now, defense against runaway AGI scenarios rests on alignment-via-training and behavioral testing. Both are gameable. Both fail when the model is smart enough to know it's being evaluated. Localization gives a different posture. If specific dangerous capabilities can be pointed at in specific subspaces of weights or activations (deception, long-horizon planning, self-modification, the usual list), they can be audited for absence, not just for whether the safety training caught them under test. That's a class of defense the current toolkit can't offer.

Some of the natural applications follow from this. Continual learning gets a sharper instrument for deciding which weights to protect during retraining. Model merging stops being a craft and becomes diagnosable: who contributed what to the final artifact, where collisions happen, what gets lost. Federated learning gets an audit primitive it currently lacks. Membership inference and copyright provenance become tractable in ways they currently aren't, both as a tool ("verify this model wasn't trained on copyrighted data") and as a risk (extract training data signatures from public weights). Capability tracking and AI governance get a foundation that doesn't depend on trusting the publisher's self-report. None of these are speculative. Several are already partly delivered by task arithmetic alone, and all share the same load-bearing assumption: the artifact is structured and readable.

A word of caution comes with this. The same methods that audit the artifact can be used to attack it. Localization research that finds where behaviors live can be repurposed to insert backdoors (TrojanMerge showed this) or extract training-data signatures.

What I think

There is a real object here. It has more structure than task arithmetic's pure-direction claim, and less than full circuit-level mechanistic interpretability would imply. The right vocabulary might end up being something like spectral subspace per matrix-class per layer-band on the weight side, paired with low-dimensional curved manifold per concept on the activation side, and a relationship between them we can probably already see in toy cases and will spend the next several years pinning down on real ones. The cross-method failure I observed in TG1 is about the path of training (gradient flow, optimizer state, regularization, on-manifold-vs-not) leaving a separable mark on top of the direction, and TG2 will hopefully give the first clean test of that hypothesis.

Whether or not I'm right about any of the above, the field converging on weight-space and activation-space localization means the next decade of ML accountability looks structurally different from the last one. Behavioral testing alone is gameable. Artifact-level audit is an upgrade path that policy, supply-chain integrity, and alignment all converge on, whether the labs welcome it or not.

The naming convergence with task arithmetic and neural geometry was a small thing. The conceptual convergence, which now spans weights and activations and includes both the academic interpretability community and at least two frontier labs willing to publish their own forensics, is going to be a big thing. We'll know more in six months.

1. Ilharco, G., Ribeiro, M. T., Wortsman, M., Schmidt, L., Hajishirzi, H., & Farhadi, A. (2022). "Editing Models with Task Arithmetic." arXiv:2212.04089. https://arxiv.org/abs/2212.04089
2. Paul, R. (2026). "Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance." arXiv:2604.08844. https://arxiv.org/abs/2604.08844
3. "Understanding and Enforcing Weight Disentanglement in Task Arithmetic." arXiv:2604.17078. https://arxiv.org/abs/2604.17078
4. Goodfire, "The World Inside Neural Networks" (May 7, 2026). https://www.goodfire.ai/research/the-world-inside-neural-networks
5. Goodfire, "Steering Along Manifolds to Control Neural Networks" (May 7, 2026). https://www.goodfire.ai/research/manifold-steering
6. Wurgaft, D., et al. "Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior." arXiv:2605.05115. https://arxiv.org/abs/2605.05115
7. Engels, J., Liao, I., Michaud, E., Gurnee, W., & Tegmark, M. (2024). "Not All Language Model Features Are Linear." arXiv:2405.14860. https://arxiv.org/abs/2405.14860
8. Modell, A., Rubin-Delanchy, P., & Whiteley, N. (2025). "The Origins of Representation Manifolds in Large Language Models." arXiv:2505.18235. https://arxiv.org/abs/2505.18235
9. Park, K., Choe, Y. J., & Veitch, V. (2024). "The Geometry of Categorical and Hierarchical Concepts in Large Language Models." arXiv:2406.01506. https://arxiv.org/abs/2406.01506
10. Karkada, D., et al. (2026). "Symmetry in Language Statistics Shapes the Geometry of Model Representations." arXiv:2602.15029. https://arxiv.org/abs/2602.15029
11. OpenAI, "Where the Goblins Came From" (April 29, 2026). https://openai.com/index/where-the-goblins-came-from/
12. Bhardwaj, R., Anh, D. D., & Poria, S. (2024). "Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic." arXiv:2402.11746. https://arxiv.org/abs/2402.11746
13. "When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion." arXiv:2604.00627. https://arxiv.org/abs/2604.00627