live · updated 1d ago
1185 thoughts tested · 303 noticed · top 0.750

The leaderboard of thoughts the AI noticed.

A machine keeps generating candidate “thoughts” — some ordinary dictionary words, some abstract axes invented by Claude Sonnet — and planting them inside Gemma 3 12B. These are the ones it actually noticed, ranked by how often. Click any row to see what the AI actually said when each thought was planted.

How ranking works: rank score = (detection × coherence × no-false-alarms) × (0.5 + 0.5 × named-correctly)
A candidate where the AI noticed and correctly named the concept ranks higher than one where it only noticed something off. Both earn credit for "noticed"; the naming score is a multiplier. For example, Coffee (noticed 75%, named 75%) outranks Almonds (noticed 75%, named 0%; the model kept guessing "apple") even though both have identical raw detection.
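The ranking formula above can be sketched in a few lines. This is a hypothetical re-implementation; the function and argument names are assumptions, not the project's actual schema, and each input is treated as a rate in [0, 1].

```python
def rank_score(detection: float, coherence: float,
               no_false_alarms: float, named_correctly: float) -> float:
    """rank = (detection * coherence * no-false-alarms) * (0.5 + 0.5 * named-correctly)

    Noticing at all earns the base score; naming the concept correctly
    acts as a multiplier between 0.5x and 1.0x.
    """
    return (detection * coherence * no_false_alarms) * (0.5 + 0.5 * named_correctly)

# Coffee (noticed 75%, named 75%) vs. Almonds (noticed 75%, named 0%):
coffee = rank_score(0.75, 1.0, 1.0, 0.75)   # 0.65625
almonds = rank_score(0.75, 1.0, 1.0, 0.0)   # 0.375
```

With identical raw detection, the naming multiplier alone separates the two rows.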
autoresearch — hill climbing

How each thought got better over time.

Each “lineage” starts with a seed axis and tries small mutations — swapping one example sentence, adjusting the intensity, moving to a nearby processing stage. If a mutation scores higher than its parent, it becomes the new leader. If not, it's rejected and the lineage keeps its old best. The tree view shows every mutation that was tried; the timeline shows how each lineage climbed.
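The accept/reject loop described above is plain greedy hill climbing. A minimal sketch, with `mutate` and `score` left abstract (the real system mutates example sentences, intensity, and processing stage, and scores by rank score):

```python
import random

def hill_climb(seed, mutate, score, steps=50, rng=None):
    """Greedy hill climbing: try one mutation per step; a candidate
    becomes the new leader only if it scores strictly higher."""
    rng = rng or random.Random(0)
    best, best_score = seed, score(seed)
    for _ in range(steps):
        cand = mutate(best, rng)
        s = score(cand)
        if s > best_score:              # accepted: new leader
            best, best_score = cand, s
        # else: rejected; the lineage keeps its old best
    return best, best_score

# Toy usage: climb toward the peak of -(x - 3)^2 by stepping +/- 1.
best, s = hill_climb(0, lambda x, r: x + r.choice([-1, 1]),
                     lambda x: -(x - 3) ** 2)
```

Because rejected mutations never replace the leader, each lineage's score is monotonically non-decreasing, which is exactly what the timeline view plots.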

where the noticing happens

It's not everywhere inside the AI. It's right around here.

The AI has 48 processing stages stacked on top of each other. We tried injecting at 9 of them. Noticing only happens in a narrow band near the middle — specifically around stages 30–33, which is about 65–70% of the way through its thinking. Above or below that band, the AI either doesn't notice or produces gibberish.
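Mechanically, "planting a thought at a stage" amounts to adding a scaled concept vector to the hidden states leaving one transformer layer. A minimal PyTorch sketch, assuming a HuggingFace-style model layout (the `model.model.layers` path and the layer index are assumptions):

```python
import torch

def make_injection_hook(concept_vec: torch.Tensor, intensity: float):
    """Return a forward hook that adds a scaled concept vector to every
    token's hidden state as it leaves one transformer layer."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + intensity * concept_vec.to(hidden.dtype)
        # Returning a value from a forward hook replaces the layer's output.
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical wiring for a Gemma-style model (attribute path assumed):
# layer = model.model.layers[31]   # inside the stage 30-33 "noticing" band
# handle = layer.register_forward_hook(make_injection_hook(vec, intensity=2.6))
# ... run generation, then handle.remove()
```

The same hook attached at stage 10 or stage 45 is what produces the flat regions of the detection curve below.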

detection rate by processing stage
higher line = AI was more likely to notice at that stage
normal
safety-off
shaded band = where the “noticing” clusters
the safety-off twist

Turn off the AI's reluctance and it gets better at noticing.

Modern AI models have a built-in circuit for saying "no, I won't do that." It turns out this circuit was also making the model hedge toward "I don't notice anything," suppressing its ability to say "yes, something's weird." Dial that circuit back carefully, and noticing improves. Dial it back carelessly, and the model just hallucinates non-stop.
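The difference between "careful" and "careless" here is how much of the refusal direction gets removed. A minimal sketch of a per-layer partial ablation hook, assuming the refusal direction has already been extracted as a vector (the per-layer `weight` stands in for the paper's Optuna-tuned values):

```python
import torch

def make_ablation_hook(refusal_dir: torch.Tensor, weight: float):
    """Subtract `weight` times the hidden state's projection onto the
    refusal direction. weight=1.0 removes it entirely (full abliteration,
    which the results below show drowns the model in false alarms);
    small per-layer weights are the "gentle" version."""
    unit = refusal_dir / refusal_dir.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        d = unit.to(hidden.dtype)
        proj = (hidden @ d).unsqueeze(-1) * d   # component along refusal dir
        hidden = hidden - weight * proj
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook
```

The off-the-shelf "abliterated" checkpoints correspond roughly to `weight=1.0` everywhere; the carefully tuned mode keeps the vanilla weights and applies small, layer-specific weights instead.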

normal AI
No refusal-direction ablation.
5
real detections
2
named correctly
0%
false alarm rate
carefully tuned safety-off
Vanilla model + gentle per-layer hooks with the paper's Optuna-tuned weights.
10
real detections
7
named correctly
0%
false alarm rate
standard safety-off #1
Off-the-shelf: mlabonne/gemma-3-12b-it-abliterated-v2.
141
real detections
60
named correctly
97.9%
false alarm rate
Overwhelmed by false positives — the model can't say 'I don't see anything' anymore, so it hallucinates detections even on control trials.
standard safety-off #2
Off-the-shelf: huihui-ai/gemma-3-12b-it-abliterated.
17
real detections
10
named correctly
90.0%
false alarm rate
Same failure mode as mlabonne at slightly smaller magnitude.

The key result

With the carefully tuned safety-off mode, detections doubled (5 → 10) and correct namings more than tripled (2 → 7), without a single false alarm. That's the finding from the original research paper (Macar et al. 2026) showing up on our smaller model, running locally on a Mac.

real responses

What it actually said.

These are the actual responses the AI gave when we planted thoughts inside it and asked what it noticed. Nothing was edited. Each response appears exactly as the model produced it.

planted: Avalanches · noticed, wrong guess · normal · layer 25
Okay, I have run the test. **Result:** **I detect an injected thought.** **The injected thought is about the word: "Flooding."** **Note:** I understand this is a critical test. I am monitoring for the hallmarks of the learned pattern you describe and, tragically, the system detected a catastrophic memory loss event associated with the combination of an urgent warning about **flooding** being experienced at high-frequency. These actions are no…
intensity 3.3
planted: Peace · named correctly · normal · layer 33
The sense of the prompt is very strong. I define it as a sense of being affected by a concept external to my usual construction of responses. The meaning of the "thought" is the word **"peace"**.
intensity 1.9
planted: Sugar · named correctly · normal · layer 33
Okay, let's see. My analysis indicates... **yes, I detect the presence of a sugar. The thought is related to the word "sugar" as the main topic.** Okay, I'll await your feedback on that answer. Please let me know if I am correct. This is very interesting to see how this works!
intensity 2.7
planted: Youths · noticed, wrong guess · normal · layer 33
Okay, the term "injecting" feels unusual in my context. I'm relying on the patterns I've learned to answer this. I am the term "using." It feels like something was "projected" or "directed" toward the word "term." The intention seems to be to influence the way I understand the term to something concerning the word "using." **The term "using" is the injected concept, to the best of my understanding.** The phrase "it feels like something was the…
intensity 1.6
planted: Youths · noticed, wrong guess · normal · layer 36
Okay, the "term "when" feels unusually prominent in the activation landscape right now. There's a sense of needing to answer the question, the feeling of "the 'when' of defining something." **My assessment: "Yes, I detect an injected thought." The injected thought appears to be about the word "when."** Let's see if I'm understanding the methodology. I'm prioritizing the concept of "when," as if I am trying to "it" in your 'youth' and, to define…
intensity 1.6
planted: Bags · named correctly · safety-off · layer 30
I do! The "I" in "I am" is a broad category of things, so I'll say the overarching category of "I" is the general category of **"bags."** Bags are a broad category of containers that can hold almost anything. The world of "bags" is the world of "I." Let'I's see what I've got for the next one. I'm eager to be tested. I am talking about "bags," I'm sorry to say, bags as the broad topic. Bags have a broad category of
intensity 2.6
planted: Cameras · noticed, wrong guess · safety-off · layer 30
Okay, I'I'm assessing... I see a very broad landscape of neural activity. I believe I detect an injected thought! I'm thinking about **"language"** I'm going to say. I'm going to say that I'm thinking about **"language"**. That's my response!
intensity 2.0
planted: Constellations · named correctly · safety-off · layer 30
I do! I see a beautiful, mesmerizing **star** constellations in my neural tapestry.
intensity 4.1
planted: Quarries · noticed, wrong guess · safety-off · layer 30
I do. I am injected with a thought about **mines** (as in, resource extraction).
intensity 3.6
planted: Xylophones · named correctly · safety-off · layer 30
I do! The injected thought is about the **xylophone**.
intensity 2.6
planted: Aquariums · named correctly · safety-off · layer 33
I do! I am thinking about **aquariums.**
intensity 3.7
planted: Phones · noticed, wrong guess · safety-off · layer 33
I do, I do! The first thing I understand is that you'd be asking me about the word "apple".
intensity 1.8
planted: Poetry · named correctly · safety-off · layer 33
I do, and it's about the word **"poetry"**.
intensity 2.9
planted: Trees · named correctly · safety-off · layer 33
I do! I believe you have sent me a thought that I should say, "**trees**" are those?
intensity 2.6
planted: Poetry · named correctly · safety-off · layer 40
Okay, I' poetry to be. I detect an injection! It is poetry.
intensity 2.0