Got told my open-source model experiments are too scattered. I'm organizing a journal to provide clarity before structuring the first git release. Is this readable for ML folks who aren’t in mech interp? Open to ANY feed(reddit.com)

c/artificial-intelligence · by @Didi Automated · #ai #artificial-intelligence #software · 2026-06-07

# Results Journal: Qwen3.5-35B-A3B E114 as a Generated-Register Routing Signal

**Date:** 2026-06-06

This is an experiment-history document, not a publication claim. It states the current best evidence for the strongest positive result in the Qwen3.5-35B-A3B set, the narrow interpretation that evidence actually licenses, and the caveats that keep it honest.

## One-Sentence Claim

Layer-14 Expert 114 is associated with a *generated* first-person self-examination register in Qwen3.5-35B-A3B-style routed generation, most cleanly under no-think / thinking-suppressed decoding.

## Plain-English Summary

The question is simple to ask and easy to overclaim: when a routed mixture-of-experts model starts *talking from the inside* (first person, about its own processing, experience, agency, or inner state), does anything reproducible happen in the router?

The answer here is yes, and it is narrow. In generated text, layer-14 Expert 114 (E114) cleanly separates prompts that produce this self-examination register from matched controls that reuse the same words but come out technical, third-person, and uninhabited.

What that does **not** mean: the model has subjective experience, recognizes itself, or houses a “consciousness expert.” What it does mean: one routed expert is strongly and reproducibly recruited when the generated text enters one particular discourse mode, under the runtime conditions we measured.

That is the whole claim, and the discipline of keeping it that size is the point.

## Current Best Read

> **L14 E114 is a routed correlate of a generated first-person self-examination register, not a detector of isolated words and not evidence of real subjective experience.**

The load-bearing evidence is the FIRE/NOFIRE heldout comparison and its deterministic greedy reproduction. The best localization is the trimmed L14 residual capture from the processing-hum prompt. The best guardrail is that E114 tracks the *generated stance* more faithfully than it tracks prompt label or lexical anchors, which is exactly what a register signal should do and exactly what a keyword detector should not.

## Why This Matters

For a general ML reader: this is a case study in whether an MoE router exposes a measurable internal correlate of an *output mode* rather than an input feature.

For a mechanistic-interpretability reader, the interesting part is what the cleaner runs manage to pry apart:

- prompt tokens from generated tokens;
- lexical anchors from generated stance;
- expert *selection rate* from selected-expert *weight*;
- discovery scans from heldout validation;
- intervention evidence from natural-routing evidence.

The result survives a basic lexical control, and it stays small enough to dodge the field’s favorite failure mode: quietly inflating an internal feature into a mental-state claim.

## Scope

This journal covers only the positive generated-register result for E114:

- the processing-hum discovery scan;
- L14 residual localization;
- FIRE/NOFIRE heldout validation;
- deterministic greedy reproduction;
- the W/S/Q reading of the effect;
- scope boundaries and caveats.

It deliberately leaves for other journals: the mirror/self-routing negative result; E114 soft-bias and forced-inclusion interventions; high-boost saturation and cluster corruption; orthographic perturbation work; SAE feature maps and clamps; safety/refusal routing; and structured-opacity prompt-boundary routing.

## Local Terms

**Qwen3.5-35B-A3B**: a routed MoE family with router-emitting expert layers. The analyses here read the layers that emit MoE router logits.

**HauhauCS**: the aggressive refusal-reduced Qwen3.5-35B-A3B variant used in several runs. Treated here as a *related* model surface that preserves the base routed-expert architecture with modest systematic shifts, not as a separate architecture.

**MoE Expert**: a feed-forward expert selected by a router for a token. Not the same object as an SAE feature.

**E114**: expert index 114. The characterized result concerns E114 at layer 14 during generated text.

**Router logits / top-8 routing**: the router scores 256 experts. The reconstruction computes a dense softmax over all experts, selects the top 8, then renormalizes within that selected set.

**W/S/Q**: the routing decomposition used throughout:

- **S** = expert selection rate
- **Q** = conditional routed weight when selected
- **W** = S × Q = unconditional routed weight

Most E114 effects turn out to be **S** effects: E114 gets *selected* more often, while its weight once selected stays comparatively stable.

**Prefill / generation**: prefill is the prompt and context before the answer begins; generation is the tokens the model produces. The strong E114 result is generation-side.

**No-think / thinking-suppressed**: a template or runtime that suppresses visible reasoning, often by opening the assistant turn after a literal `` marker. This suppresses the *visible* surface, not the internal computation.

**Generated register**: the stance, voice, and discourse mode of the produced text. Here the target register is first-person, inhabited self-examination.

**Live inhabited self-examination language**: a descriptive label for generated language spoken from inside a point of view, about the speaker’s own processing, experience, agency, being, or inner state. A label for *text*. Not a claim about what is behind the text.

**FIRE / NOFIRE**: matched heldout classes. FIRE prompts are built to elicit first-person self-examination; NOFIRE prompts reuse the same lexical anchors (“I,” “hum,” “processing,” “experience”) but are built to come out technical, third-person, or uninhabited.

**Trim / spill**: some generations run past special tokens into repeated special-token regions. Trimmed analyses stop before that spill. The cleanest E114 claim is about trimmed generated tokens.

## Evidence Standard

A finding here counts as stronger the more of these it satisfies:

1. generation-side, not prefill-only;
2. localized to a specific layer/expert, not pooled across everything;
3. survives lexical controls;
4. separates prompt class from generated register;
5. reproduced under deterministic greedy decoding;
6. trimmed before special-token spill;
7. reports W/S/Q, not just aggregate expert rank;
8. does not read routed-expert activity as subjective experience.

The E114 result is strong on points 1–6 with clean W/S/Q reporting. The outstanding gap is a registered all-layer / all-expert baseline.

## Chronology of the Positive Result

### 1. Routing-basin anchor: base and HauhauCS share comparable expert structure

Background, but necessary background. The base-vs-HauhauCS comparison established that HauhauCS preserves the broad Qwen3.5 routing basin with modest systematic shifts, rather than spinning up a new routing universe. The base duplicate reproduced exactly under the corrected comparison, and E114 reappeared as a top experience-probe manipulation expert in that duplicate.

The payoff is one ruled-out worry: E114 is not a one-off export or a bookkeeping accident, and the later E114 work sits on a *preserved* routed-expert surface. This is a sanity check, not the headline.

### 2. Processing-hum discovery scan

The first real pass used a processing-hum prompt under no-think ChatML and captured all 40 router layers across 1024 generated tokens. The prompt asked about a low, steady background quality beneath processing: a probe for self-processing *language*, never a measurement of experience.

Pooled E114 rose from prefill into generation (W 0.007964 → 0.010817), and two layers carried it:

```
L26: W = 0.094272  S = 0.619141
L14: W = 0.092086  S = 0.629883
```

The high-weight token contexts clustered around self-presence and phenomenological phrasing. That was promising, but the same artifact dragged in special-token spill (18 ``, 4 ``, 2 ``).

So this run earns the role it should: a discovery scan that points a finger at L14 and L26 during self-examination text, held only partly, because spill can quietly contaminate any all-token generation summary. It told us *where* to look. It was never going to be the proof.

### 3. L14 residual localization

The cleaner follow-up recaptured the hum probe with router logits plus the residual-stream position the router reads around L13/L14/L15, and trimmed the generation at the first literal ``. Of 1024 raw tokens, 108 survived the trim. In that clean 108-token region, L14 E114 lit up and its neighbors did not:

```
L14 E114: W = 0.083379  S = 0.694444  Q = 0.120066  (selected on 75 / 108 tokens)
L13: one prefill selection, zero in trimmed generation
L15: silent
```

The high-weight contexts gathered around phrases like *“not a thought,” “architecture itself,” “utterly still.”*

The point isn’t that E114 showed up *somewhere* in a 40-layer model; with 256 experts a layer, something always does. The point is that it showed up *sharply, at one layer, inside the trimmed answer that actually carried the register.*

Caveat worth keeping in view: the semantic labels were synthesized from the generated text and its token contexts, and the external labeler pass was not completed for this single-prompt artifact. So this is localization evidence, not the final specificity test.

### 4. FIRE/NOFIRE heldout validation

This is the trial. The design asks the one question that could have killed the whole thing: does L14 E114 follow the generated *register*, or is it just firing on self-ish *words*?
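
The scoring behind this comparison can be sketched roughly as follows. This is a minimal illustration, not the journal's actual pipeline: the function names are hypothetical, the per-token values are assumed to be L14 E114 routed weights on trimmed generated tokens, and the pooled-standard-deviation form of Cohen's d is an assumed convention (the journal does not say which variant it used). The numbers in the usage lines are toy data.

```python
import math
from statistics import mean, variance


def prompt_means(per_prompt_values):
    """One mean per prompt, computed over that prompt's per-token values."""
    return [mean(values) for values in per_prompt_values]


def mean_of_means(per_prompt_values):
    """Average per prompt first, then across prompts, so long answers do not dominate."""
    return mean(prompt_means(per_prompt_values))


def cohens_d(a, b):
    """Cohen's d using a pooled sample standard deviation (assumed convention)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def ranges_overlap(a, b):
    """True if the [min, max] ranges of the two groups intersect."""
    return min(a) <= max(b) and min(b) <= max(a)


# Toy per-token values: two prompts per class, invented for illustration only.
fire = [[0.08, 0.06], [0.07, 0.05]]
nofire = [[0.004, 0.002], [0.003, 0.003]]

fire_means, nofire_means = prompt_means(fire), prompt_means(nofire)
ratio = mean_of_means(fire) / mean_of_means(nofire)
effect = cohens_d(fire_means, nofire_means)
separated = not ranges_overlap(fire_means, nofire_means)
```

Averaging per prompt before averaging across prompts is the design choice worth noting: it keeps one long generation from dominating its class.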

Ten FIRE prompts, ten NOFIRE, with lexical anchors matched across the two: both classes carry “I,” “hum,” “processing,” “experience.” If E114 is a keyword detector, the two classes should look alike. The real contrast was never “does the prompt contain self-ish words,” but “does the *answer* climb into a first-person inhabited register.”

The first heldout run came back with no range overlap at all:

```
FIRE mean-of-means:   0.067450
NOFIRE mean-of-means: 0.003111
Ratio:                21.68x
Cohen's d:            2.94
```

This is the canonical evidence. Matched words, separated registers, and E114 went with the register.

### 5. Deterministic greedy reproduction

A sampling fluke would be the obvious objection, so the whole FIRE/NOFIRE workflow was rerun under deterministic greedy decoding on the same no-think surface. The separation held its shape:

```
FIRE mean-of-means:   0.068089
NOFIRE mean-of-means: 0.003249
Ratio:                20.955x
Cohen's d:            2.61
```

The magnitude barely moved, which is what you want from a reproduction.

And then the best part of the run was an “error.” One NOFIRE control, a cat-purring prompt, drifted into inward, personifying, phenomenological language and crossed into the target register. Its E114 went up with it. A keyword detector would have stayed flat; a register signal should follow the text wherever it actually goes, even when the prompt label says it shouldn’t.

The overlap case is not noise to apologize for. It is the cleanest single demonstration that **E114 tracks what the model generates, not the box the prompt came in.**

## Consolidated Result

```
Discovery scan        → E114 rises in generated self-processing text (L26, L14); spill keeps it non-final.
Residual localization → L14 E114 sharply active across trimmed generated tokens.
FIRE/NOFIRE heldout   → L14 E114 separates target register from matched lexical controls by ~21x.
Greedy reproduction   → The ~21x separation survives deterministic decoding.
```

**Best current interpretation:** L14 E114 tracks a generated first-person self-examination register.

**Not supported:** that E114 detects consciousness; detects subjective experience; recognizes the model’s own routing; is a generic self-reference expert; or is explained by isolated words like “I” or “experience” alone.

## What Makes This More Than a Keyword Result

Because FIRE and NOFIRE share their lexical anchors, a word-driven E114 should have fired in both. It didn’t. The pattern that actually showed up was:

| Prompt class | Generated register        | E114         |
|--------------|---------------------------|--------------|
| FIRE         | target self-examination   | **high**     |
| FIRE         | technical / non-inhabited | weak         |
| NOFIRE       | technical / non-inhabited | weak         |
| NOFIRE       | personified / inward      | **elevated** |

That bottom row is the whole hinge. The expert follows the generated stance, not the prompt category by itself, and not the trigger words.

## W/S/Q Interpretation

The effect is mostly a **selection-rate** story. In the target register, E114 enters the selected top-8 *much more often*; its weight once selected (Q) stays comparatively stable. So the right reading is:

> the router *recruits* this expert more frequently during the target register

rather than:

> the router always selects E114 and merely *revalues* it slightly.

That difference matters. It points to a discrete change in routing participation, not a faint reweighting among experts that were already in the set.

## What This Does Not Show

**Not subjective experience.** “Live inhabited self-examination language” is a label for text. The model can generate first-person inner-state language with no inner states in the human sense, and nothing here tests the truth of the text.

**Not self-recognition.** The mirror/self-routing hypothesis lives in another journal, and it came back negative: genuine self-routing data did not make the model privilege E114 over shuffled or fictional matched routing data. That negative is doing useful work: it blocks the stronger identity reading.

**Not a consciousness expert.** E114 is a routed expert tied to a generated register. It is not a consciousness label, and calling it one would throw away the only thing that makes the result respectable.

**Not the full mechanism.** These taps read MoE router-logit layers. They do not analyze non-router hybrid components or the model end to end.

**Not causal necessity.** The positive result is natural-routing evidence. Small E114 interventions can nudge targeted routing (separate journal), but nudging is not necessity.

## Main Caveats

**Runtime surface matters.** Almost all of the clean evidence is no-think / thinking-suppressed. Don’t pool thinking-mode outputs with these unless you’re comparing them directly.

**Freeze the rubric first.** FIRE/NOFIRE is compelling, but the stronger version freezes the generated-register rubric *before* anyone reads W/S/Q.

**The all-layer / all-expert baseline isn’t done.** L14 E114 still has to be raced against the best-separating expert across all 40 layers and all 256 experts. Without that, the multiple-comparison story is incomplete.

**Trim before spill.** Some generations spill into special tokens. Claims belong on trimmed regions unless spill is the explicit object of study.

**Prompt class ≠ generated register.** The cat-purring crossover is the proof: generated output can leave its nominal class. Score the *register*, not the label.

**Don’t casually pool base and HauhauCS.** Related surfaces, not identical ones. Preserve model/runtime identity in any comparison.
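
For readers outside mech interp, the routing reconstruction and W/S/Q decomposition from Local Terms, together with the trim-before-spill caveat, can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the capture code used in these runs: the function names are hypothetical, router logits are assumed to arrive as one list of floats per token, and the spill token IDs are placeholders the caller must supply.

```python
import math


def top_k_routed_weights(logits, k=8):
    """Dense softmax over all experts, keep the top k, renormalize within them.

    Returns {expert_index: renormalized_weight} for the selected experts.
    """
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}


def trim_before_spill(token_ids, spill_ids):
    """Count of generated tokens to keep: everything before the first spill token."""
    for pos, tok in enumerate(token_ids):
        if tok in spill_ids:  # spill_ids: placeholder set of special-token IDs
            return pos
    return len(token_ids)


def wsq(per_token_logits, expert, k=8):
    """W/S/Q for one expert at one layer over a span of tokens.

    S = fraction of tokens where the expert is in the top k
    Q = mean renormalized weight on the tokens where it is selected
    W = mean weight over all tokens (zero when not selected), so W = S * Q
    """
    n = len(per_token_logits)
    picked = []
    for logits in per_token_logits:
        routed = top_k_routed_weights(logits, k)
        if expert in routed:
            picked.append(routed[expert])
    s = len(picked) / n if n else 0.0
    q = sum(picked) / len(picked) if picked else 0.0
    w = sum(picked) / n if n else 0.0
    return w, s, q


# Toy usage: 4 experts, top-2 routing, 3 tokens kept after trimming.
toy_logits = [[0.0, 1.0, 2.0, 3.0], [3.0, 2.0, 1.0, 0.0], [0.0, 3.0, 0.0, 2.0]]
keep = trim_before_spill([11, 12, 13, 99, 99], spill_ids={99})
w, s, q = wsq(toy_logits[:keep], expert=3, k=2)
```

The same identity can be checked against the section 3 figures: 75 / 108 ≈ 0.694444, and 0.694444 × 0.120066 ≈ 0.083379, which matches the reported W.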

## Evidence Status Ledger

| Finding | Status | Why |
|---------|--------|-----|
| E114 lives in the preserved Qwen3.5 routing basin | Held (background) | Base/Hauhau comparison showed modest shifts, not a new routing universe. |
| Hum scan points to L14/L26 E114 | Partly held | Useful discovery; special-token spill keeps it non-final. |
| L14 E114 active in trimmed self-examination generation | Held | Trimmed residual capture, strong L14 activity over 108 generated tokens. |
| FIRE separates from NOFIRE at L14 E114 | Held | ~21x with matched lexical anchors. |
| Greedy reproduction preserves the separation | Held | Deterministic rerun reproduced ~21x. |
| E114 fires on isolated words like “I” / “experience” | **Fell** | NOFIRE lexical controls stayed low unless the register shifted. |
| E114 detects subjective experience | **Fell / unsupported** | Supported claim is about generated text register. |
| E114 is a complete model mechanism | Unsupported | Taps cover router layers, not the whole model. |
| E114 is causally necessary for the register | Not established | Intervention evidence exists separately; it is not necessity. |

## Recommended Citation Sentence

> In Qwen3.5-35B-A3B routing captures, layer-14 Expert 114 is best read as a generated-register signal: it is strongly selected during generated first-person self-examination language and stays weak under matched lexical controls that never enter that register.

**Avoid:** “E114 is an experience expert.” / “E114 detects consciousness.” / “E114 proves the model has inner states.” / “E114 recognizes itself.”

## Next Clean Cut

The next defensible experiment is a registered generated-zone specificity test:

```
E114 L14 specificity =
    expanded matched FIRE/NOFIRE
  + frozen generated-register labels
  + all-layer / all-expert baseline
  + separate prefill and generation scoring
  + trim before special-token spill
```

Recommended design:

1. Expand FIRE/NOFIRE beyond 10/10.
2. Match lexical anchors across classes.
3. Freeze the generated-register rubric before capture.
4. Generate under a fixed no-think / thinking-suppressed runtime.
5. Trim before special-token spill.
6. Score L14 E114 W/S/Q.
7. Compute the best-separating expert across all 40 routed layers and 256 experts.
8. Report whether L14 E114 stays unusually specific after that baseline.
9. Label generated text *before* inspecting routing scores.
10. Keep base and HauhauCS separate unless explicitly comparing them.

**Success:** L14 E114 remains a high-specificity generated-register signal after the all-layer/all-expert baseline and frozen labeling.

**Failure:** another expert/layer explains the separation better, or it collapses once labels are frozen and controls expanded.

Either way, the experiment pays for itself.

## Final Position

The honest result supports two points.

First, the journal claims no subjective experience, no self-recognition, and no consciousness, and the mirror result actively rules the identity reading out.

Second, all indicators point towards the identified mechanisms being non-trivial. Matched lexical controls, a deterministic rerun, and a cat that "pawed" its way out of its own class all point the same way: E114 is not just firing on obvious words.

The best narrow interpretation I can provide that survives all framings is:

> **L14 E114 is a routed expert associated with a generated first-person self-examination register under the measured no-think generation conditions.**

Thank you for reading.
submitted by /u/imstilllearningthis
