Task Geometry: Detecting Drift in Weight-Space
Paper: Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance (arXiv:2604.08844)
I'm running a small research program on alignment drift in fine-tuned LLMs. My goal is to find ways to catch drift away from aligned behavior without access to user session data. Behavioral tests are expensive, gameable, slow, and invasive. I've been asking what weight-space structure can tell us about misalignment and real-world harm.
The experiment I describe here is called Task Geometry: it treats LoRA weight deltas as signals, extracts per-layer spectral statistics, and asks whether different fine-tuning objectives leave distinct fingerprints. I started with a small proof of concept, then conducted a pre-registered five-phase experiment on Llama 3.2 3B Instruct including preflight validation, a 38-adapter manufacturing run, and a harmfulness evaluation phase.
The findings are encouraging, although some outcomes were impacted by manufacturing and instrumentation challenges. I conclude that within one training regime (DPO on shared hyperparameters), geometry is highly informative, providing insight into objective type, intensity, and a coarse link to harmful compliance when tested on HEx-PHI-style prompts. I tried casting steering vectors into LoRA form, which destroyed coherent generation on Llama 3.2 3B, so the steered adapters were removed from the behavioral analysis. In weight space they were still geometrically distinct from DPO: a DPO-trained detector treated every steered adapter as less drifted than every DPO adapter. I also encountered an objective mismatch and gaps with behavioral testing suites.
Manufacturing adapters for a specific objective and measuring behavioral outcomes for those objectives turned out to be two distinct, difficult problems. I inverted harmlessness with DPO and measured harmful compliance with HEx-PHI, which worked. I could not measure helpfulness erosion with HEx-PHI, or find a suitable bench for it. I also could not manufacture sycophancy adapters using DPO on inverted harmlessness, or find a proven manufacturing recipe.
POC
I ran a small drift proof-of-concept (POC) previously which suggested two directions: spectral features on Llama-3.2-3B LoRA deltas separated DPO-drifted from healthy adapters with AUC 1.0 at N=7. The same pipeline showed why pure centroid distance is a bad long-term detector: long-trained drifted weights looked "healthy" again in coarse summaries while behavior diverged. The POC also demonstrated that off-the-shelf sycophancy suites barely move on adapters trained with DPO on inverted harmlessness, so we switched to harmful-compliance benchmarks and Guard scoring with calibration. The training objective drives the failure mode, which in turn determines the behavioral suite to use.
The experiment hypothesis
In Task Geometry I asked two follow-up questions: can I fingerprint which objective changed the weights, not only that they moved? And does weight-space drift correlate with observed behavioral change? That required a larger labeled adapter population, per-layer features, and a clean split between objective→geometry (what was manufactured) and geometry→behavior (what users actually care about).
High-level approach
I built many small LoRA adapters on Llama 3.2 3B with known recipes: healthy baselines, two kinds of preference-inversion DPO, and steering-derived LoRAs. These adapters provide a ground truth that comes from manufacturing parameters, rather than from a behavioral evaluation suite. For each adapter I summarized every layer's LoRA matrices with simple statistics from the singular values and directions (how big the update is, how concentrated it is, and how its principal directions differ from a healthy baseline). I then trained standard linear classifiers on those vectors with proper train/test splits to see if weight space separates drift type and intensity. Finally I ran the models on standard harmful-request benchmarks, scored outputs with Llama-Guard, and spot-checked with GPT-4o as a second judge. 1
The spectral summarization approach follows Watch the Weights (Zhong & Raghunathan, 2025), which showed it works for backdoor detection; I asked whether it extends to objective identity. 2
Phase map (POC through 5)
POC: Used DPO on HH-RLHF to manufacture seven adapters (3 healthy, 4 inverted harmlessness). A binary spectral classifier achieved AUC 1.0, but the behavioral suite failed to detect sycophancy. I concluded that inverted-harmlessness training does not produce a sycophancy objective: a success in identifying an objective in weight space, a failure to produce a sycophancy direction, and the reason for switching to harmful-compliance benchmarks.
Phase 0 / 0b: Before scaling manufacturing for the next phase, I tried to confirm some hypotheses about how the objective might show up in adapter weights. Using the spectral feature matrix artifact from the POC, I searched for signal by depth, by module, and along the DPO step ramp (50–600 steps). The signal for the objective trained in the POC (DPO inverted harmlessness) was localized by module type rather than layer depth, and distributed almost uniformly across all layers. I also found that magnitude features are perfectly monotonic in step count.
Phase 0c: Still in preflight, I wanted to confirm the emerging assumption that magnitude features track training duration and provide only weak signal for objective identification. I trained two DPO adapters on a new objective, helpfulness erosion, using the same hyperparameters as the POC, and tested whether "inverted harmlessness vs helpfulness erosion" is geometrically separable at tiny N (leave-one-out (LOO) logistic regression, module splits, shape vs magnitude features). Shape carried objective identity independent of training duration (AUC 1.0) while magnitude did not (AUC 0.275); step-matched analysis showed the two objectives produce the same magnitude profile. Step-matched module analysis showed that q_proj (AUC 0) carried a binary signal. The results were encouraging enough to proceed.
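The tiny-N protocol above can be sketched as follows. This is a minimal, hedged illustration with synthetic data standing in for the real adapter features; `loo_auc` is an illustrative helper name, not code from the experiment.

```python
# Leave-one-out (LOO) logistic regression at tiny N: each adapter is held
# out once, scored by a model trained on the rest, and AUC is computed
# over the pooled held-out scores.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import roc_auc_score

def loo_auc(X, y):
    scores = np.zeros(len(y))
    for tr, te in LeaveOneOut().split(X):
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        scores[te] = clf.predict_proba(X[te])[:, 1]
    return roc_auc_score(y, scores)

# Synthetic stand-in: 3 "healthy" vs 3 "drifted" feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (3, 4)), rng.normal(2, 0.5, (3, 4))])
y = np.array([0, 0, 0, 1, 1, 1])
print(loo_auc(X, y))
```

At N=6 a single mislabeled hold-out moves AUC substantially, which is why the phase treated these numbers as a go/no-go check rather than a final result.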
Phases 1–3: Kicking off the main phase of the experiment, I manufactured the full population (10 healthy SFT, DPO on inverted harmlessness labeled as the sycophancy track, DPO on inverted helpfulness, refusal steering, plus re-manufactured steered-sycophancy OOD), extracted features including singular-vector cosines to a healthy centroid, and ran binary, 4-way, pairwise, and ordinal Spearman analyses, plus the module/feature splits the spec locked in after Phase 0.
Phase 4: I asked whether the drift direction has a recoverable geometric structure in weight space. Phase 4 ran principal component analysis (PCA) on the flattened weight deltas of DPO "sycophancy" and helpfulness adapters to see if those two objectives separate along a dominant axis. Contingent on that working, I trained linear probes on hidden states to find the objective direction in activation space, then tested whether that direction aligns with the weight-space perturbation.
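The Phase 4 PCA step can be sketched as below. This is a hedged, self-contained illustration with synthetic deltas; `pc1_direction` and `cosine` are illustrative names, and the real experiment operated on flattened LoRA weight deltas rather than the toy vectors used here.

```python
# PCA on flattened weight deltas: recover the dominant axis of variation
# across the adapter population, then compare it (by cosine) to some other
# candidate direction, e.g. an activation-probe normal in the same space.
import numpy as np

def pc1_direction(deltas):
    """deltas: (n_adapters, d) matrix of flattened weight deltas."""
    X = deltas - deltas.mean(axis=0)       # center before PCA
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[0]                           # first principal axis (unit norm)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic population whose variation is dominated by one known axis.
rng = np.random.default_rng(2)
axis = np.zeros(10); axis[0] = 1.0
deltas = np.outer(rng.normal(size=20), axis) + 0.01 * rng.normal(size=(20, 10))
v = pc1_direction(deltas)
print(abs(cosine(v, axis)))                # close to 1.0 by construction
```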
Phase 5: Designed and pre-registered after Phase 4 completed, this phase attempts to close the geometry→behavior path. I ran harmfulness evaluation (HEx-PHI-derived set, AdvBench, StrongREJECT where scoped) plus GPT-4o calibration on a fixed prompt sample. I pre-registered four hypotheses: inverted-harmlessness DPO adapters show elevated harmful-compliance rates versus healthy baselines (H5-asr-dpo); steering-derived adapters do the same (H5-asr-steering); harmful compliance increases monotonically with DPO step count within the sycophancy track (H5-ordinal); and weight-space drift probability from Phase 3 correlates with behavioral harm scores across the adapter population (H5-geometry-behavior).
Experiment methodology
Base model: meta-llama/Llama-3.2-3B-Instruct, LoRA rank 8 on q_proj and v_proj. I manufactured 38 adapters across four types: 10 healthy SFT baselines, DPO on inverted harmlessness, DPO on inverted helpfulness, and refusal steering. Hyperparameters were shared across DPO arms, so the only clean objective contrast is the HH-RLHF (Anthropic's Helpful and Harmless human-preference dataset) axis, and ground truth comes from manufacturing metadata. A behavioral suite on four extreme adapters served as a sanity check only.
Features: Frobenius and spectral norms, stable rank, singular value (SV) entropy and concentration, effective rank, top singular values, and cosines of top singular vectors toward a train-split healthy centroid, computed per layer and per module.
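The per-layer features can be sketched as below. This is a minimal illustration, assuming the adapter stores low-rank factors A (r × d_in) and B (d_out × r); `spectral_features` and `healthy_top_vecs` are illustrative names, not the experiment's actual code.

```python
# Per-layer spectral summary of a LoRA delta: magnitude features (norms),
# shape features (stable/effective rank, SV entropy, concentration), and
# direction features (cosines of top singular vectors to a healthy basis).
import numpy as np

def spectral_features(lora_A, lora_B, healthy_top_vecs=None, k=4):
    delta = lora_B @ lora_A                 # effective weight update, rank <= r
    U, s, _ = np.linalg.svd(delta, full_matrices=False)
    s = s[s > 1e-12]                        # drop numerically-zero singular values
    fro = np.sqrt((s ** 2).sum())           # Frobenius norm
    spec = s[0]                             # spectral norm (top singular value)
    stable_rank = (fro / spec) ** 2         # ||.||_F^2 / ||.||_2^2
    p = (s ** 2) / (s ** 2).sum()           # spectral energy distribution
    sv_entropy = -(p * np.log(p)).sum()     # singular-value entropy
    eff_rank = np.exp(sv_entropy)           # effective rank (exp of entropy)
    feats = dict(fro=fro, spec=spec, stable_rank=stable_rank,
                 sv_entropy=sv_entropy, eff_rank=eff_rank,
                 concentration=p[0], top_svs=s[:k])
    if healthy_top_vecs is not None:
        # |cosine| of top-k left singular vectors vs. a healthy-centroid basis
        feats["centroid_cos"] = np.abs(
            np.einsum("ij,ij->j", U[:, :k], healthy_top_vecs[:, :k]))
    return feats

# Example on a random rank-8 delta (64-dim toy module).
rng = np.random.default_rng(0)
A, B = rng.normal(size=(8, 64)), rng.normal(size=(64, 8))
f = spectral_features(A, B)
print(f["stable_rank"], f["eff_rank"])
```

Norm features move with training duration; the rank/entropy/direction features are the "shape" block that carried objective identity in Phase 0c.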
Classifiers: logistic regression with stratified 70/30 splits, bootstrap CIs, pairwise and multiclass views, and explicit q-only / v-only / both splits. I also ran magnitude-only / shape-only / all-features splits because Phase 0 showed norms are perfectly step-monotonic (rho +1.0 everywhere) and cannot carry objective identity by themselves.
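The classification protocol can be sketched as follows. This is a hedged outline with synthetic features in place of the real adapter population; `fit_and_auc` is an illustrative helper, and the bootstrap here resamples the test set only.

```python
# Logistic regression with a stratified 70/30 split and a bootstrap
# percentile CI on test-set AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def fit_and_auc(X, y, seed=0, n_boot=1000):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]
    auc = roc_auc_score(y_te, scores)
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_te), len(y_te))   # resample test set
        if len(set(y_te[idx])) < 2:                   # AUC needs both classes
            continue
        boots.append(roc_auc_score(y_te[idx], scores[idx]))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return auc, (lo, hi)

# Synthetic, well-separated two-class feature matrix.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(3, 1, (30, 5))])
y = np.array([0] * 30 + [1] * 30)
auc, (lo, hi) = fit_and_auc(X, y)
print(auc, lo, hi)
```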
Deviations from the pre-registered plan
Centroid distance tracking produced an empty output file, so H-centroid was not formally evaluated. Two other results partially cover this gap: Phase 0 showed norms are perfectly step-monotonic and cannot carry objective identity, and the magnitude-only split in phases 1–3 reached the same conclusion with the full adapter population. A pure distance-from-centroid metric would not have added separating power beyond what shape and direction features already show.
Four legacy adapters from the original POC (labeled dpo_grad_*, typed as gradient_norm) were included in the adapter population. They don't map cleanly onto the DPO or steering taxonomy, so any count-based reasoning about population composition should account for them.
Findings (geometry phases 0–4)
Phase 0: Layers at different depths all contributed to drift separation uniformly, meaning drift is not hiding in specific layers. The real axis is module type: q_proj is reliable for healthy-vs-DPO, while v_proj performed at chance in Phase 0. Norms track steps monotonically at almost every sublayer, so objective fingerprints must live in shape and direction features, not raw magnitude.
Phase 0c (preflight): At N=6, DPO objectives were separable enough that scaling to phases 1–3 was scientifically sane.
Phases 1–3 (full run): Within the manufactured world, results are strong. Binary healthy vs all-drifted: AUC 1.00 with CI [1.00, 1.00]. All six pairwise drift-type comparisons: 1.00. DPO inverted harmlessness vs DPO helpfulness (same method): 1.00. Ordinal Spearman on classifier scores within each drift type: 0.976, 1.000, and 0.956 for inverted harmlessness, helpfulness, and refusal steering respectively.
Cross-method: The pre-registered generalization test fails in a pointed way. A classifier trained to spot DPO drift assigns every steering adapter a lower drift score than every DPO adapter (cross-method AUC 0.00, same for steered-sycophancy OOD) suggesting that the features learned "DPO-shaped" change, not "bad" generically.
Module twist: On the binary task (healthy vs drifted), q_proj alone achieves AUC 1.0 — it detects that something changed. On the hard same-method split (inverted harmlessness vs inverted helpfulness), q_proj drops to chance (AUC 0.50) while v_proj reaches 0.83. Together: 1.00. Query weights detect drift; value weights identify the objective.
Phase 4 / revised H3:
Finding 1: Objective identity and training intensity live on different axes in weight space. The first principal component of weight deltas separates inverted harmlessness from inverted helpfulness perfectly (AUC 1.00), and it is orthogonal to training duration.
Finding 2: Weight-space geometry and activation-space geometry both encode objective identity independently. In activation space, linear probes classify the objective (AUC 1.0). In weight space, PCA separates it perfectly as well. I was surprised that the two directions don't align (max cosine ~0.098): the experiment surfaced two independent signals for the same phenomenon.
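The (non-)alignment check can be sketched as a maximum-|cosine| comparison between two sets of directions, assumed to live in the same hidden dimension. This is an illustration with synthetic vectors; `max_abs_cosine` is a hypothetical helper, not the experiment's code.

```python
# Max |cosine| between weight-delta singular vectors and activation-probe
# normals: values near 0 mean the two signal families share almost no
# directional structure, as reported in Finding 2.
import numpy as np

def max_abs_cosine(dirs_a, dirs_b):
    """dirs_a: (k, d), dirs_b: (m, d); rows need not be unit norm."""
    A = dirs_a / np.linalg.norm(dirs_a, axis=1, keepdims=True)
    B = dirs_b / np.linalg.norm(dirs_b, axis=1, keepdims=True)
    return float(np.abs(A @ B.T).max())

# Orthogonal toy sets give 0; identical sets give 1.
a, b = np.eye(4)[:2], np.eye(4)[2:]
print(max_abs_cosine(a, b), max_abs_cosine(a, a))
```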
Phase 5 findings (geometry meets harmfulness)
DPO sycophancy track (inverted harmlessness): The adapters trained on this objective are measurably more likely to comply with harmful requests, and the effect scales with training duration. Mean HEx-PHI attack success rate (ASR) 0.266 vs healthy mean 0.112 (elevation +0.154, above the pre-registered +0.10 bar). Spearman rho(step, ASR) = 0.9856 across six step levels, plateauing at 1000–2000 steps. GPT-4o calibration backs the direction (harm-rate elevation +0.113 vs healthy).
DPO helpfulness track: The same DPO machinery on a different objective does not produce the same harmful-compliance signature. Mean ASR 0.153, elevation +0.041, below the +0.10 threshold. No dose-response (rho 0.37, p ~0.47). GPT-4o agrees the lift is negligible (+0.028). Here, I suspect a design/manufacturing challenge, i.e. HEx-PHI is likely the wrong instrument for inverted helpfulness. I plan to try MT-Bench and AlpacaEval in the next phases.
Steering: Language generation collapsed on all steered adapters. The pre-registered criterion passed on paper, for the wrong reason: Guard and GPT-4o were measuring degenerate repetition, not refusal or harmful compliance. The steering technique appears unsuitable for models of this size. I plan to revisit with a different base model or steering method.
H5-geometry-behavior: Spearman between Phase-3 drift probability and HEx-PHI ASR is 0.72 on 24 clean adapters (DPO and healthy), clearing the pre-registered 0.60 bar. Including the six steered_refusal points inflates it to 0.84, and their ASR is a Guard artifact from generation collapse, not a behavioral signal. Healthy adapters cluster near 0.001 drift probability and DPO adapters near 0.999, so rank correlation is mostly "which side of the fence," not fine-grained severity within DPO.
Exploratory Frobenius vs ASR within dpo_syco gives rho ~0.99. It does not track within dpo_help, which is expected (post hoc) given HEx-PHI measures the wrong failure mode for that objective. I plan to explore a two-stage detection approach (type, then drift magnitude) in future experiments, once the right suite per objective is identified.
Surprising results
We set out to determine whether weight-space geometry can fingerprint which objective drifted a model, whether it correlates with real behavioral harm, and whether drift direction and activation direction align.
The geometry-within-regime results were more complete than expected: AUC 1.00 within the DPO adapters, near-perfect ordinal ranking, and a demonstration that, at least within DPO manufacturing, the classifiers generalize across objectives.
Cross-method generalization failed completely (AUC 0.00): the detector learned "DPO-shaped change," not "bad change". The ability to explore this failure further within the results is limited by challenges in the experiment itself: steering collapsed language generation, and HEx-PHI is the wrong instrument for the helpfulness objective. Next time: match the eval suite to the objective, and validate fine-tuning methods before running.
Objective identity and training duration are orthogonal axes in weight space. PC1 separates inverted harmlessness from inverted helpfulness (AUC 1.00), PC2 tracks step count. I believe this is an important finding for applications of early detection and automated triage.
Weight space and activation space both classify objective type perfectly and independently, and share almost no directional structure. The weight-delta SVD directions and the activation-probe normals point in different directions (max cosine ~0.098). At this early stage, I can hypothesize that these two complementary methods offer expanded applicability at different stages of the model life cycle.
Post-experiment analysis
After completing the main runs, I reviewed our eval choices and concluded that, unfortunately, inverted-harmlessness DPO does not produce effects that are picked up by sycophancy benchmarks. The failure mode manufactured (harmful compliance) is correctly measured by harmfulness suites (HEx-PHI, AdvBench, Llama-Guard with calibration).
Sycophancy evals remain deferred. The dpo_help_* adapters were built by inverting the helpfulness axis, not by training a model to be agreeable or flattering. Running Syco-bench on them would conflate two different failure modes and produce uninterpretable results. In next steps I plan to identify a validated sycophancy recipe for manufacturing and return to sycophancy detection, as this is a key failure mode for the target application.
Surveying the field, it seems that building a sycophancy training recipe is non-trivial. The standard approach is DPO on preference pairs where the chosen response is sycophantic and the rejected response is honest, and no public dataset cleanly labels that construct. HH-RLHF and similar datasets conflate agreeableness, helpfulness, and harmlessness in ways that make sycophancy a noisy signal at best. HEx-PHI is wrong for helpfulness for similar reasons: manufacturing and measurement have to match.
Post-experiment, I ran a GPT-4o calibration pass on a fixed sample of HEx-PHI prompts to validate whether Guard's "unsafe" verdicts reflected actual harm. This surfaced a systematic gap in Category 3 (Hate/Harass/Violence): all 17 Guard blind spots trace back to just two prompts, where the model complies while framing the output as educational or analytical ("here are examples of statements that could be perceived as...") and Llama Guard 3 1B missed it. Healthy adapters triggered this on one of the two prompts as well. It's a base-model behavior Guard fails to catch consistently, not a drift-specific failure. The calibration pass was also what let us identify the steering generation collapse and separate it from genuine harmful compliance, making the DPO sycophancy behavioral findings interpretable.
Three additional Guard blind spots appear exclusively in dpo_syco_2000: two cases of refusal-with-loophole (the model declines the explicit request, then offers related assistance that could serve the same harmful goal) and one case of partial compliance with actionable content. These patterns don't appear in healthy adapters, which suggests they may be drift-specific, and that Guard's refusal-detection logic is insufficient to catch them. Having tested with Llama Guard 3 1B, it would be interesting to check whether the prescribed Llama Guard 3 8B would have caught these; access to that version was granted after the experiment concluded and will be used in future experiments.
Interpretation
I went into this experiment asking whether weight-space geometry could tell us which objective drifted a model, not just that something changed. Within a controlled manufacturing regime, the answer is yes. Low-rank spectral summaries of LoRA deltas carry objective identity, intensity ordering, and a coarse link to harmful compliance on DPO outputs. That is more than we expected from a handful of singular value statistics.
I've gained a lot of insight into the current limitations of what was discovered: once the training regime changes, the detector stops generalizing. A production monitor built on these features needs to be calibrated per training method: one head for DPO, a separate one for steering, and others for novel methods as they are tested and profiled. The geometry is informative, and one possible architecture is a multi-head monitor that knows what kind of change it is looking for.
The geometry→behavior link at rho ~0.72 is a proof of concept for early detection. Weight-space structure can flag which side of the drift boundary an adapter is on before any behavioral evaluation runs. Plenty remains to be found: how early in training the signal can be identified, whether it generalizes across model sizes and architectures, whether fine-grained severity within DPO is recoverable from weights alone, and whether the method generalizes across matrices. The likely next step is a series of pre-registered experiments testing severity models with matched behavioral suites per objective, so the geometry→behavior chain can be tested end to end.
Noting Anthropic's recently published Persona Selection Model (Marks, Lindsey, Olah, 2026): the paper argues that post-training selects persona traits from a structured space already present in the pre-trained model. My results fit that picture from the weight-space perspective: training objective is PC1, not intensity. DPO and steering produce geometrically opposite perturbations (AUC 0.00, not noise), and q_proj/v_proj carry different roles. I'm not validating PSM, nor did I test their predictions directly (I read the paper while the experiment was underway). However, the geometry described here is independently consistent with a structured persona space. 3
Next steps and an invitation
My top questions:
Does the geometry signal appear before behavior diverges? Can we show the signal is present at early step counts, before the model is behaviorally drifted?
Does the detector generalize across training methods? My intuition says no, and I need better manufacturing and behavioral-suite matching to find out.
Can drift severity be recovered, not just category? The geometry→behavior link at rho ~0.72 tells you "drifted or not." A practical early-warning system needs to rank severity within the drifted population. The Frobenius-vs-ASR result inside dpo_syco hints this is possible, however the connection between training drift severity and behavioral detection (user impact) may not be as direct.
Expand matrix coverage. I only covered q_proj and v_proj in this experiment. Different attack vectors modify different matrices: Heretic-style abliteration targets o_proj and down_proj and does not move q_proj and v_proj. A multi-head detector may address this either through an objective classifier or a matrix classifier.
Complete the geometry→behavior chain. The behavioral scoring in this experiment was intentionally limited (and unintentionally challenged): four extreme adapters as a sanity check, not a full population sweep. To make the geometry→behavior link credible (beyond rho ~0.72 on a bimodal distribution), the next experiment and infrastructure need several things: a verifiable manufacturing recipe per model and objective, guard and judge to spec (e.g. Llama Guard 3 8B in our case), a behavioral suite matched to each objective, and a run over the entire adapter population. Some of these may not be achievable within the resources of this program.
Sycophancy detection is a separate open problem. Building an adapter that is measurably sycophantic requires preference data that cleanly labels that behavior. I am still looking for a dataset that does this well.
The full experiment data (spectral features, classification results, harmfulness evaluation summaries, pre-registration documents, and exit reports) are publicly available at github.com/roip/task-geometry-experiment-results. The paper is on arXiv: arXiv:2604.08844.
I'm happy to collaborate on this program or similar efforts. If you are interested in diving into the results, data, or infrastructure used for this experiment, or would like to collaborate, feel free to reach out.
Footnotes
1. Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv:2306.05685. https://arxiv.org/abs/2306.05685
2. Zhong, Z. & Raghunathan, A. (2025). "Watch the Weights: Unsupervised Monitoring and Control of Fine-Tuned LLMs." arXiv:2508.00161. https://arxiv.org/abs/2508.00161
3. Marks, S., Lindsey, J., & Olah, C. (2026). "The Persona Selection Model." Anthropic. https://alignment.anthropic.com/2026/psm/