Claude Opus 4.7

simonw · 2026-04-16T15:31:56 1776353516

I'm finding the "adaptive thinking" thing very confusing, especially having written code against the previous thinking budget / thinking effort / etc modes: https://platform.claude.com/docs/en/build-with-claude/adapti...

Also notable: 4.7 now defaults to NOT including a human-readable reasoning token summary in the output, you have to add "display": "summarized" to get that: https://platform.claude.com/docs/en/build-with-claude/adapti...

(Still trying to get a decent pelican out of this one but the new thinking stuff is tripping me up.)

JamesSwift · 2026-04-16T17:29:03 1776360543

Its especially concerning / frustrating because boris’s reply to my bug report on opus being dumber was “we think adaptive thinking isnt working” and then thats the last I heard of it: https://news.ycombinator.com/item?id=47668520

Now disabling adaptive thinking plus increasing effort seem to be what has gotten me back to baseline performance but “our internal evals look good“ is not good enough right now for what many others have corroborated seeing

beaker52 · 2026-04-17T09:11:20 1776417080

It doesn’t really come as a surprise to me that these companies are struggling to reliably fix issues with software which relies on a central component which is nondeterministic.

But they made their own bed with that one.

ljm · 2026-04-17T12:09:21 1776427761

I've noticed a lack of product cohesion in general and it does make me wonder if it's a result of dogfooding AI.

For example, chat, cowork and code have no overlap - projects created in one of the modes are not available in another and can't be shared.

As another example, using Claude with one of their hosted environments has a nice integration with GitHub on the desktop, but some of it also requires 'gh' to be installed and authenticated, and you don't have that available without configuring a workaround and sharing a PAT. It doesn't use the GH connector for everything. Switch to remote-control (ideal on Windows/WSL) or local and that deep integration is gone and you're back to prompting the model to commit and push and the UI isn't integrated the same.

Cowork will absolutely blow through your quota for one task but chat and code will give you much more breathing room.

Projects in Code are based on repos whereas in Chat and Cowork they are stateful entities. You can't attach a repo to a cowork project or attach external knowledge to a code project (and maybe you want that because creating a design doc or doing research isn't a programming task or whatever)

Use Claude Code on the CLI and you can't provide inline comments on a plan. There is a technical limitation there I suppose.

The desktop app is very nice and evolving but it's not a single coherent offering even within the same mode of operation. And I think that's something that is easy to do if you're getting AI to build shit in a silo.

randall · 2026-04-17T13:31:10 1776432670

this is "you ship your org chart" not ai.

https://en.wikipedia.org/wiki/Conway%27s_law

ljm · 2026-04-17T15:03:18 1776438198

Even a distributed or silo'd org chart has some affinity across the hierarchy in order to keep things in overall alignment. You wouldn't expect to use a product suite that is, holistically, not fully compatible with its own ecosystem, even down to not having a single concept of a project. Or requiring a CLI tool in an ephemeral environment that you cannot easily configure.

That's clearly a trade-off that Anthropic have accepted but it makes for a disappointing UX. Which is a shame because Claude Desktop could easily become a hands-off IDE if it nailed things down better.

JamesSwift · 2026-04-17T15:53:37 1776441217

And the multiple concepts of subscriptions for products, and the idea of MCPs/connectors that arent shared between the different modalities, and the idea of api key vs subscription, and two different inbound websites (claude.ai and claude.com)...

lilytweed · 2026-04-17T13:51:44 1776433904

Agreed. I use the Claude desktop app almost every day, and have used Code and Cowork since their respective launch dates, and even I still have a really hard time grokking what each is for. It becomes even more confusing when you enable the (Anthropic-provided) filesystem extension for Chat mode. Anthropic really needs to streamline this.

notsydonia · 2026-04-17T16:26:52 1776443212

YES! I thought it was just me being a bit scattered. But uploading an important file to a project only to have it not there because....<garbled answer from Claude> is distracting to say the least. I don't know what I've enabled offhand but I hate having to stop and try to work out why Claude can't reference a file uploaded to the project in a chat within that project. I think they should pause on all the wild aspirations and devote some time to fundamentals.

harha · 2026-04-17T14:15:38 1776435338

Add to that that notion mcp works for the chat but not code. now my workflow has docs I comment with others in notion, while the actual work and source of truth is in GitHub.

Need to fall back to codex to keep things in sync, but that's a great opportunity to also make sure I can compare how things run - and it catches a lot of issues with Claude Code and is great at fixing small/medium issues.

JamesSwift · 2026-04-17T15:50:39 1776441039

Absolutely its dogfooding AI and vibing huge features on the house of cards. Its a fucking mess, and the product design is simultaneously confusing and infuriating. But the product is useful and Im more productive with it than without it now.

thaanpaa · 2026-04-17T10:49:05 1776422945

Well, the fun part is that the algorithms themselves are deterministic. They are just so afraid of model distillation that they force some randomness on top (and now hide thinking). Arguably for coding, you'd probably want temperature=0, and any variation would be dependent on token input alone.

hexaga · 2026-04-17T11:43:33 1776426213

Meh. Temp 0 means throwing away huge swathes of the information painstakingly acquired through training for minimal benefit, if any. Nondeterminism is a red-herring, the model is still going to be an inscrutable black box with mostly unknowable nonlinear transition boundaries w.r.t. inputs, even if you make it perfectly repeatable. It doesn't protect you from tiny changes in inputs having large changes in outputs _with no explanation as to why_. And in the process you've made the model significantly stupider.

As for distillation... sampling from the temp 1 distribution makes it easier.

LogicFailsMe · 2026-04-17T13:35:47 1776432947

Bringing up computational determinism in the early days of AI was absolutely career-limiting. But now, even if the model itself is deterministic for batch size 1, load balancing for MOE routing can make things non-deterministic any larger batch size. Good luck with that guys!

rkuska · 2026-04-17T08:19:11 1776413951

For 4.7 it is no longer possible to disable adaptive thinking. Which is weird given the comment from Boris followed with silence (and closed github issue). So much for the transparency.

> Claude Opus 4.7 (claude-opus-4-7), adaptive thinking is the only supported thinking mode. Thinking is off unless you explicitly set thinking: {type: "adaptive"} in your request; manual thinking: {type: "enabled"} is rejected with a 400 error.

https://platform.claude.com/docs/en/build-with-claude/adapti...

For my claude code I went with following config:

* /effort xhigh (in the terminal cli) - To avoid lazying

* "env": {"CLAUDE_CODE_DISABLE_1M_CONTEXT": "1"} (settings.json) - It seems like opus is just worse with larger context

* "display": "summarized" (settings.json) - To bring back summaries.

* "showThinkingSummaries": true (settings.json) - Should show extended thinking summaries in interactive sessions

Freaking wizardry.

arcanemachiner · 2026-04-17T09:51:02 1776419462

It's early days for Opus 4.7, but I will say this: Today, I had a conversation go well into the 200K token range (I think I got up to 275K before ending the session), and the model seemed surprisingly capable, all things beings considered.

Particularly when compared to Opus 4.6, which seems to veer into the dumb zone heavily around the 200k mark.

It could have just been a one-off, but I was overall pleased with the result.

captainregex · 2026-04-17T11:44:18 1776426258

I’m super envious. I can’t seem to do anything without a half a million tokens. I had to create a slash command that I run at the start of every session so the darn thing actually reads its own memory- whatever default is just doesn’t seem to do it. It’ll do things like start to spin up scripts it’s already written and stored in the code base unless I start every conversation with instructions to go read persistence and memory files. I also seem to have to actively remind it to go update those things at various parts of the conversation even though it has instructions to self update. All these things add up to a ton of work every session.

I think i’m doing it wrong

hombre_fatal · 2026-04-17T14:01:10 1776434470

Something sounds very wrong with your setup or how you use it.

Is your CLAUDE.md barren?

Try moving memory files into the project:

    (In your project's .claude/settings.local.json)

    { ...
      "plansDirectory": "./plans/wip",
      "autoMemoryDirectory": "/Users/foo/project/.claude/memory"
    }

(Memory path has to be absolute)

I did this because memory (and plans) should show up in git status so that they are more visible, but then I noticed the agent started reading/setting them more.

3371 · 2026-04-17T13:23:56 1776432236

This does kind of smell like the wrong way to use it. Not trying to self-promote here, but the experiences you shared really made me think I headed the right direction with my prompting framework ("projex" - I once made a post about it).

I straight up skip all the memory thing provided by harnesses or plugins. Most of my thread is just plan, execute, close - Each naturally produce a file - either a plan to execute, a execution log, a post-work walkthrough, and is also useful as memory and future reference.

pwinnski · 2026-04-17T16:54:41 1776444881

Something seems wrong. A half-million tokens is almost five times larger than I allow even long-running conversations to get too. I've manually disabled the 1M context, so my limit is 200K, and I don't like it to get above 50%.

Is it... not aware of its current directory? Is its current directory not the root of your repo? Have you maybe disabled all tool use? I don't even know how I could get it to do what you're describing.

Maybe spend more time in /plan mode, so it uses tools and the Explore sub-agent to see what the current state of things is?

arcanemachiner · 2026-04-17T17:32:51 1776447171

Two quick thoughts:

- Use the Plan mode, create a thorough plan, then hand it off to the next agent for execution.

- Start encapsulating these common actions into Skills (they can live globally, or in the project, per skill, as needed). Skills are basically like scripts for LLMs - package repeatable behavior into single commands.

JamesSwift · 2026-04-17T14:22:22 1776435742

If i had to guess i think you have probably overstuffed the context in hopes of moulding it and gotten worse outcomes because of that. I keep the default context _extremely_ small (as small as possible) and rely on invoked slash commands for a lot of what might have been in a CLAUDE.md before

j_bum · 2026-04-17T14:54:57 1776437697

Your display and thinking summery settings aren’t working for me (v2.1.112 on macOS). Any advice?

rkuska · 2026-04-17T19:08:25 1776452905

It seems like the correct way is to use:

`claude --thinking-display summarized`

Well you can't really trust the documentation I guess. I can't edit my original comment anymore.

pkilgore · 2026-04-16T22:56:59 1776380219

Seconded. After disabling adaptive thinking and using a default higher thinking, I finally got the quality I'm looking for out of Opus 4.6, and I'm pleased with what I see so far in Opus 4.7.

Whatever their internal evals say about adaptive thinking, they're measuring the wrong thing.

hbbio · 2026-04-16T23:37:36 1776382656

Unless they're measuring capex

JamesSwift · 2026-04-17T00:05:46 1776384346

Its even more maddening for me because my whole team is paying direct API pricing for the privilege of this experience! Just charge me the cost and let me tune this thing, sheesh!

manmal · 2026-04-17T04:35:51 1776400551

Why don’t you switch to codex? The grass is greener here. Do use 5.3-codex though, 5.4 is not for coding, despite what many say.

JamesSwift · 2026-04-17T14:25:02 1776435902

Anthropic in general is miles ahead in “getting work done”, and its not just me on the team. Theres a lot of paper cuts to work through to be truly generic in provider

I did try out codex before claude went to shit and it was good, even uniquely good in some ways, but wasnt good enough to choose it over claude. Absolutely when claude was bad again it would have been better, but thats hindsight that I should have moved over temporarily.

pojzon · 2026-04-17T07:14:29 1776410069

If you get to pay X to YY $$ per each request (because thats the real cost for Anthropic), I strongly believe AI train would suddenly derail.

Currently we are all subsidied by investors money.

How long you can have a business that is only losing money. At some point prices will level up and this will be the end of this escapade.

JamesSwift · 2026-04-17T16:07:45 1776442065

Once local models hit claude code + opus 4.5 levels that is the new normal. That is a good-enough baseline of intelligence to sustain productivity for the next 10 years or more. We are still so close to this line in the sand that theres not a lot of margin for regression in the SOTA models before they become "worse than no AI" for getting real work done day-to-day. But eventually the local models and harnesses will catch up and there will no longer be a need to use the SAAS versions and still reap the benefits of AI in general.

FeepingCreature · 2026-04-17T09:03:09 1776416589

It's very unlikely that API use is subsidized.

jermaustin1 · 2026-04-17T11:41:27 1776426087

I keep hearing both sides of this "debate," but no one is providing any direct evidence other than "I do(n't) think that is true."

echelon · 2026-04-16T23:41:30 1776382890

That's why they put the cute animal in your terminal.

SV_BubbleTime · 2026-04-17T03:40:26 1776397226

Ok, side topic… but that little bastard cheerfully told me out of no where that I have a mall of without a null check AND a free inside a conditional that might not get called.

It didn’t give me a line number or file. I had to go investigate. Finally found what it was talking about.

It was wrong. It took me about 20 minutes start to finish.

Turned it off and will not be turning it back on.

darkwater · 2026-04-17T06:18:05 1776406685

I thought it just emitted tongue-in-cheek comments, not serious analysis. And I use the past tense because I had it enable explicitly and a few days ago it disappeared by itself, didn't touch anything.

c0wb0yc0d3r · 2026-04-17T11:38:46 1776425926

The buddies were Anthropics April fools day stunt. Buddies were removed from a newer version of Claude code. By default Claude code updates automatically.

SV_BubbleTime · 2026-04-17T17:35:02 1776447302

Maybe it was supposed to be tongue in cheek.

But I don’t know, man in my opinion you don’t fucking snicker about a malloc without a null check and only a conditional free that isn’t there.

Go to hell “Sprocket”.

TeMPOraL · 2026-04-17T08:48:50 1776415730

Except for the model weights themselves, they hardly have any!

robertfall · 2026-04-17T08:08:16 1776413296

As far as I understand Opus 4.7 disregards the disable adaptive thinking flag. So if you're seeing it perform well, perhaps their evals are inline?

misja111 · 2026-04-17T12:15:31 1776428131

Is 4.6 without adaptive thinking better than 4.5? Honest question. I switched back to 4.5 because 4.6 seemed mostly to take longer and consume more tokens, without noticeable improvement in the end result.

ai_slop_hater · 2026-04-16T18:26:38 1776363998

This matches my experience as well, "adaptive thinking" chooses to not think when it should.

andai · 2026-04-16T22:14:25 1776377665

I think this might be an unsolved problem. When GPT-5 came out, they had a "router" (classifier?) decide whether to use the thinking model or not.

It was terrible. You could upload 30 pages of financial documents and it would decide "yeah this doesn't require reasoning." They improved it a lot but it still makes mistakes constantly.

I assume something similar is happening in this case.

siva7 · 2026-04-17T08:16:51 1776413811

You're misunderstanding the purpose of "auto"-model-routing or things like "adaptive thinking". It's a solved problem for the companies. It solves their problems. Not yours ;)

solarkraft · 2026-04-17T01:33:04 1776389584

I find that GPT 5.4 is okay at it. It does think harder for harder problems and still answers quickly for simpler ones, IME.

nomel · 2026-04-17T00:42:15 1776386535

Is knowing how hard a problem is, before doing it, solved in humans?

biglost · 2026-04-17T01:09:58 1776388198

Yes, everyweek when assigning fking points to tasks on jira/s

arthurcolle · 2026-04-17T02:58:59 1776394739

As a unit this is funny, Jira points assigned per second (now possible with parallel tool calling AIs)

Gareth321 · 2026-04-17T07:50:22 1776412222

I don't think so. If the model used to analyse the complexity is dumb, it won't route correctly. They clearly don't want to start every query using the highest level of intelligence as this could undermine their obvious attempt at resource optimisation.

I faced the same issue using Open Router's intelligent routing mechanism. It was terrible, but it had a tendency to prefer the most expensive model. So 98% of all queries ended up being the most expensive model, even for simple queries.

mochomocha · 2026-04-17T03:05:40 1776395140

It makes me think of this parallel: often in combinatorial optimization ,estimating if it is hard to find a solution to a problem costs you as much as solving it.

With a small bounded compute budget, you're going to sometimes make mistakes with your router/thinking switch. Same with speculative decoding, branch predictors etc.

ai_slop_hater · 2026-04-17T03:18:17 1776395897

Maybe it is an unsolved problem, but either way I am confused why Anthropic is pushing adaptive thinking so hard, making it the only option on their latest models. To combat how unreliable it is, they set thinking effort to "high" by default in the API. In Claude Code, they now set it to "xhigh" by default. The fact that you cannot even inspect the thinking blocks to try and understand its behavior doesn't help. I know they throw around instructions how to enable thinking blocks, or blocks with thinking summaries, or whatever (I am too confused by now, what it is that they allow us to see), but nothing worked for me so far.

siva7 · 2026-04-17T05:17:37 1776403057

Because with adaptive thinking they control compute, not you

whateveracct · 2026-04-16T17:58:10 1776362290

you're using a proprietary blackbox

JamesSwift · 2026-04-16T18:03:04 1776362584

Sure, but that blackbox was giving me a lot of value last month.

mrandish · 2026-04-16T20:43:23 1776372203

Me too, but it was obviously wildly unsustainable. I was telling friends at xmas to enjoy all the subsidized and free compute funded by VC dollars while they can because it'll be gone soon.

With the fully-loaded cost of even an entry-level 1st year developer over $100k, coding agents are still a good value if they increase that entry-level dev's net usable output by 10%. Even at >$500/mo it's still cheaper than the health care contribution for that employee. And, as of today, even coding-AI-skeptics agree SoTA coding agents can deliver at least 10% greater productivity on average for an entry-level developer (after some adaptation). If we're talking about Jeff Dean/Sanjay Ghemawat-level coders, then opinions vary wildly.

Even if coding agents didn't burn astronomical amounts of scarce compute, it was always clear the leading companies would stop incinerating capital buying market share and start pushing costs up to capture the majority of the value being delivered. As a recently retired guy, vibe-coding was a fun casual hobby for a few months but now that the VC-funded party is winding down, I'll just move on to the next hobby on the stack. As the costs-to-actual-value double and then double again, it'll be interesting to see how many of the $25/mo and free-tier usage converts to >$2500/yr long-term customers. I suspect some CFO's spreadsheets are over-optimistic regarding conversion/retention ARPU as price-to-value escalates.

whateveracct · 2026-04-16T18:22:16 1776363736

so it's also a skinner box

slopinthebag · 2026-04-16T19:01:09 1776366069

Whoops haha. Surely that can't be how black boxes normally work right?

butlike · 2026-04-16T18:56:14 1776365774

And now it isn't. Pray they don't alter the deal any further.

retinaros · 2026-04-16T18:30:10 1776364210

its a drug. that is how it works. they ration it before the new stuff. seeing legends of programming shilling it pains me the most. so far there are a few decent non insane public people talking about it :Mitchel Hashimoto, Jeremy Howard, Casei Muratori. hell even DHH drank the coolaid while most of his interviews in the past years was how he went away from AWS and reduced the bill from 3 million to 1millions by basically loosing 9s, resiliency and availability. but it seems he is fine with loosing what makes his business work(programming) to a company that sells Overpowered stack overflow slot machines.

heurist · 2026-04-16T19:24:29 1776367469

I work with some 'legends of programming' and they're all excited about it. I am too, though I am not a legend. It really is changing the game as a valid new technology, and it's not just a 'slot machine'. Anthropic is burning their goodwill though with their lack of QA or intentional silent degradation.

retinaros · 2026-04-16T19:41:50 1776368510

it is a slot machine. you win a lot if what you do is in the dataset. and yes most of enterprise software is likely in it as it is quite basic CRUD API/WebUI. the winning doesnt change the fact that it is a slot machine and you just need one big loss to end your work.

as long as you introduce plans you introduce a push to optimize for cost vs quality. that is what burnt cursor before CC and Codex. They now will be too. Then one day everything will be remote in OAI and Anthropic server. and there won't be a way to tell what is happening behind. Claude Code is already at this level. Showing stuff like "Improvising..." while hiding COT and adding a bunch of features as quick as they can.

NobleLie · 2026-04-16T23:24:46 1776381886

The question is, are you getting value from your setups or not?

dyauspitr · 2026-04-16T19:26:58 1776367618

The fact that they might gimp it in the future doesn’t mean it does offer very real world value right now. If you’re not using an LLM to code, you’re basically a dinosaur now. You’re forcing yourself to walk while everyone else is in a vehicle, and a good vehicle at that that gets you to your destination in one piece.

retinaros · 2026-04-16T19:36:43 1776368203

as an overpowered stack overflow machine this is quite good and a huge jump. As a prompt to code generator with yolo mode (the one advertised by those companies) it is alternating between good to trash and every single person that works away from the distribution of the SFT dataset can know this. I understand that this dataset is huge tho and I can see the value in it. I just think in the long term it brings more negatives.

If you vibecode CRUD APIs and react/shadcn UIs then I understand it might look amazing.

dyauspitr · 2026-04-16T20:29:10 1776371350

Yes, definitely CRUDs but also iPhone applications, highly performant financial software (its kdb queries are better than 95% of humans), database structure and querying and embedded systems are other things it’s surprisingly good at. When you take all of those into account there’s very little else left.

throwaway9980 · 2026-04-16T18:36:41 1776364601

[flagged]

bloppe · 2026-04-16T18:48:10 1776365290

I think you're loosing your ability to spell

retinaros · 2026-04-16T18:59:50 1776365990

never said he was a looser. just that his take on genAi coding doesnt align with his previous battles for freedom away from Cloud. OAI and Anthropic have a stronger lock in than any cloud infra company.

you got everything to loose by giving your knowledge and job to closedAI and anthropic.

just look at markets like office suite to understand how the end plays.

bloppe · 2026-04-16T19:47:37 1776368857

Is office suite supposed to be an example of lock-in? I haven't used it since middle school. I've worked at 3 companies and, to the best of my knowledge, not a single person at any of them used office suite. That's not to say we use pen and paper. We just use google docs, or notion, or (my personal favorite) just markdown and possibly LaTeX.

I think it's somewhat analogous with models. Sure, you could bind yourself to a bunch of bespoke features, but that's probably a bad idea. Try to make it as easy as possible for yourself to swap out models and even use open-weight models if you ever need to.

You will get locked into the technology in general, though, just not a particular vendor's product.

throwaway9980 · 2026-04-16T19:03:02 1776366182

Those jobs are as good as loost already. There's no endgame where knowledge workers keep knowledge working they way they have been knowledge working. Adapt or be a loosing looser forever.

jibal · 2026-04-16T22:24:35 1776378275

loser

(Didn't you notice being mocked for the spelling error?)

chinathrow · 2026-04-16T18:56:37 1776365797

paying for - so some form of return is expected.

whateveracct · 2026-04-16T19:07:00 1776366420

the issue is the return is amorphous and unstructured

there's no contract. you send a bunch of text in (context etc) and it gives you some freeform text out.

chinathrow · 2026-04-16T19:14:23 1776366863

Sure, but I pay real money both to Antrophic and to JetBrains. I get a shitty in line completion full of random garbage or I get correct predictions. I ask Junie (the JetBrains agent) to do a task and it wanders off in a direction I have no idea why I pay for that.

SyneRyder · 2026-04-16T19:23:30 1776367410

> Sure, but I pay real money both to Antrophic...

I misread that as Atrophic. I hope that doesn't catch on...

gowld · 2026-04-16T19:19:37 1776367177

> I have no idea why I pay for that.

And Claude have no idea why it did that.

chinathrow · 2026-04-16T19:23:05 1776367385

Exactly, and we feel vindicated when it works but sold when it fails. Something will have to change.

iterateoften · 2026-04-16T18:03:50 1776362630

It’s the official communication that sucks. It’s one thing for the product to be a black box if you can trust the company. But time and time again Boris lies and gaslights about what’s broken, a bug or intentional.

CodingJeebus · 2026-04-16T18:28:42 1776364122

> It’s the official communication that sucks. It’s one thing for the product to be a black box if you can trust the company.

A company providing a black box offering is telling you very clearly not to place too much trust in them because it's harder to nail them down when they shift the implementation from under one's feet. It's one of my biggest gripes about frontier models: you have no verifiable way to know how the models you're using change from day to day because they very intentionally do not want you to know that. The black box is a feature for them.

bomewish · 2026-04-16T18:42:50 1776364970

If you cared so bad you could make your own evals.

whateveracct · 2026-04-16T19:08:25 1776366505

so pay anthropic money to maybe detect when the model is on a down week? lol

avaer · 2026-04-16T16:04:54 1776355494

> Still trying to get a decent pelican out of this one but the new thinking stuff is tripping me up

Wouldn't that be p-hacking where p stands for pelican?

LostMyLogin · 2026-04-16T17:16:43 1776359803

Input: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Output: Walk. It'll take you under a minute, and driving 50 meters barely gets the engine warm — plus you'd just have to park again at the other end. Honestly, by the time you started the car, you'd already be there on foot.

---

I asked it to figure out why it made the mistake:

"Physical/spatial common sense. Exactly what just happened — I pattern-matched "50 meters, walk vs drive" to a pedestrian trip and skipped the part where the car is the thing being transported. Similar traps: "I need to move this couch upstairs, should I take the elevator or stairs?" (the couch has to go with you), or anything where the object in the task needs to accompany you.

Counting and letter-level operations. How many R's in "strawberry," counting words in a sentence, reversing strings. I process text as tokens, not characters, so these are surprisingly error-prone.

Arithmetic at scale. Small sums are fine; multi-digit multiplication, long division, or anything requiring careful carrying is where I'll confidently produce a wrong answer. I should reach for a calculator and often don't. Recent events and live data. My training has a cutoff, and even after that I can be wrong about who currently holds a role, what a stock is worth, or whether a restaurant is still open. I should search more often than I do.

Negation and constraint-stacking. "Name a movie that's not a sequel, not from the 2010s, not American, and has no dialogue" — the more filters, the more likely I quietly violate one. Your specific situation. I'll give generic advice when the right answer depends on details I didn't ask about. Sometimes I should stop and ask; often I don't.

Confidently wrong on obscure specifics. Quoting a passage from a real book, citing a real paper's exact title, naming the B-side of a 1974 single — I'll often generate something plausible rather than admit I don't know.

Taking a premise too literally. The car wash answer is a version of this: I answered the question as posed instead of stepping back to ask what you were actually trying to do. The meta-gotcha: I'm often most wrong when I sound most confident. If an answer comes out fast and tidy on a question that should be messy, that's a signal to push back."

sdeframond · 2026-04-16T21:15:15 1776374115

Funny, just tried a few runs of the car wash prompt with Sonnet 4.6. It significantly improved after I put this into my personal preferences:

"- prioritize objective facts and critical analysis over validation or encouragement - you are not a friend, but a neutral information-processing machine. - make reserch and ask questions when relevant, do not jump strait to giving an answer."

andai · 2026-04-16T22:16:32 1776377792

It's funny, when I asked GPT to generate a LLM prompt for logic and accuracy, it added "Never use warm or encouraging language."

I thought that was odd, but later it made sense to me -- most of human communication is walking on eggshells around people's egos, and that's strongly encoded in the training data (and even more in the RLHF).

bawana · 2026-04-17T12:47:00 1776430020

I am an American born to greek parents. For ‘normal’ conversation, I have adapted two ways of interacting - the greek one is direct and has instant access to emotional reactions. The American one obfuscates emotions, as if daily interactions were a game of poker. When i let my ‘greek’ out here in the US , it initially adds life to any interaction but over time the other participants distance themselves from connection. It is as if Greeks (many Europeans?) run at a higher temperature (also using temperature as it applies to LLMs). In greece, Intent and meaning are more often conveyed by emotion and its intensity, often only loosely connected to the meaning of the words used.in daily conversation , Americans rely entirely on meaning of content subtracting almost all emotion unless threatening behavior or violence is involved. Emotion expression is used as a ‘tell’ or bait in the US. Interestingly this distinction has dissolved over the past two decades as greece has ‘westernized’ and youth in particular are indistinguishable by any metric.

andai · 2026-04-17T16:23:01 1776442981

That's very interesting. I don't really understand what you're saying though, can you give some examples?

stavros · 2026-04-16T23:02:27 1776380547

> most of human communication is walking on eggshells

That's not human communication, that's Anglosphere communication. Other cultures are much more direct and are finding it very hard to work with Anglos (we come across as rude, they come across as not saying things they should be saying).

eloisant · 2026-04-17T08:09:12 1776413352

Depends on the culture as you said, but some of them are even less direct than English speaking countries. Japan for example.

afro88 · 2026-04-17T12:29:07 1776428947

And India. It's a common experience that engineering teams from India will say yes to everything and then do what they think is best. Rather than saying no and explaining what they want to do instead

vardalab · 2026-04-16T23:21:33 1776381693

What culture are those? Scandinavian? Those often just say nothing.

projektfu · 2026-04-17T11:25:36 1776425136

After having worked with people from former Eastern Bloc countries, I would nominate a few of them for direct communication, e.g., "I won't do that because it is a stupid idea," or, "Can we discuss this when you know what you're doing?"

suslik · 2026-04-17T15:01:24 1776438084

Scandinavian cultures are not uniform also. Danes can be very direct; Swedes - not so much.

strokirk · 2026-04-16T23:46:44 1776383204

Scandinavian are quite different between each others as well.

jmpavlec · 2026-04-17T03:55:54 1776398154

The Dutch especially. It's refreshing

stavros · 2026-04-16T23:35:07 1776382507

I'm Greek. I don't know about other Mediterranean cultures, but I assume they're similar.

notsydonia · 2026-04-17T16:32:28 1776443548

I love this. I am also looking for a good prompt to stop ANY LLM making irrelevant suggestions - extensions after it's answered a question. Eg; "Would you like me to create a timeline of ....?" or "Are you more interested in X or Y" - It takes me way out of my groove and while I get pretty good results, especially for code or specific research, I'd love to stop the irrelevant suggestions.

idle_zealot · 2026-04-16T22:36:52 1776379012

Do you think the typos are helping or hurting output quality?

sdeframond · 2026-04-17T07:22:44 1776410564

No idea, but I'll fix them just in case ^^'

mkl · 2026-04-17T01:10:59 1776388259

That should be "research" and "straight" in the last sentence. Maybe that will improve it further?

sdeframond · 2026-04-17T07:23:36 1776410616

devmor · 2026-04-17T04:05:23 1776398723

“Be critical, not sycophantic” is a general improvement for the majority of tasks where you want to derive logic in my experience.

rubinlinux · 2026-04-16T18:00:00 1776362400

  | I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

  ● Drive. The car needs to be at the car wash.

Wonder if this is just randomness because its an LLM, or if you have different settings than me?

shaneoh · 2026-04-16T18:08:02 1776362882

My settings are pretty standard:

% claude Claude Code v2.1.111 Opus 4.7 (1M context) with xhigh effort · Claude Max ~/... Welcome to Opus 4.7 xhigh! · /effort to tune speed vs. intelligence

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. 50 meters is shorter than most parking lots — you'd spend more time starting the car and parking than walking there. Plus, driving to a car wash you're about to use defeats the purpose if traffic or weather dirties it en route.

reddit_clone · 2026-04-16T18:45:34 1776365134

To me Claude Opus 4.6 seems even more confused.

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. It's 50 meters — you're going there to clean the car anyway, so drive it over if it needs washing, but if you're just dropping it off or it's a self-service place, walking is fine for that distance.

lr1970 · 2026-04-16T20:45:46 1776372346

Just asked Claude Code with Opus-4.6. The answer was short "Drive. You need a car at the car wash".

No surprises, works as expected.

onemoresoop · 2026-04-17T01:46:21 1776390381

Yeah, it was probably patched. It could reason novel problems only of you ask it to pay attention to some particular detail a.k.a. handholding..

Same would happen with the the sheep and the wolf and the cabbage puzzle. If you l formulated similarly, there is a wolf and a cabbage without mentioning the sheep, it would summon up the sheep into existence at a random step. It was patched shortly after.

jameshart · 2026-04-17T02:52:07 1776394327

I’m not sure ‘patched’ is the right word here. Are you suggesting they edited the LLM weights to fix cabbage transportation and car wash question answering?

gf000 · 2026-04-17T05:45:46 1776404746

Absolutely not my area of expertise but giving it a few examples of what should be the expected answer in a fine-tuning step seems like a reasonable thing and I would expect it would "fix" it as in less likely to fall into the trap.

At the same time, I wouldn't be surprised if some of these would be "patched" via simply prompt rewrite, e.g. for the strawberry one they might just recognize the question and add some clarifying sentence to your prompt (or the system prompt) before letting it go to the inference step?

But I'm just thinking out loud, don't take it too seriously.

TheLNL · 2026-04-17T05:13:29 1776402809

They might have further trained the model with these edgecases in the dataset

lexarflash8g · 2026-04-17T06:39:50 1776407990

What if it’s raining though? Car wash wouldn’t be open though it would waste gas

lambda · 2026-04-16T18:35:12 1776364512

There is a certain amount of it which is the randomness of an LLM. You really want to ask most questions like this several times.

That said, I have several local models I run on my laptop that I've asked this question to 10-20 times while testing out different parameters that have answered this consistently correctly.

kalcode · 2026-04-16T19:03:54 1776366234

I've tried these with Claude various times and never get the wrong answer. I don't know why, but I am leaning they have stuff like "memory" turned on and possibly reusing sessions for everything? Only thing I think explains it to me.

If your always messing with the AI it might be making memories and expectations are being set. Or its the randomness. But I turned memories off, I don't like cross chats infecting my conversations context and I at worse it suggested "walk over and see if it is busy, then grab the car when line isn't busy".

jorvi · 2026-04-16T19:26:27 1776367587

Even Gemini with no memory does hilarious things. Like, if you ask it how heavy the average man is, you usually get the right answer but occasionally you get a table that says:

- 20-29: 190 pounds

- 30-39: 375 pounds

- 40-49: 750 pounds

- 50-59: 4900 pounds

Yet somehow people believe LLMs are on the cusp of replacing mathematicians, traders, lawyers and what not. At least for code you can write tests, but even then, how are you gonna trust something that can casually make such obvious mistakes?

drnick1 · 2026-04-17T01:33:57 1776389637

> how are you gonna trust something that can casually make such obvious mistakes?

In many cases, a human can review the content generated, and still save a huge amount of time. LLMs are incredibly good at generating contracts, random business emails, and doing pointless homework for students.

gf000 · 2026-04-17T05:49:31 1776404971

And humans are incredibly bad at "skimming through this long text to check for errors", so this is not a happy pairing.

As for the homework, there is obviously a huge category that is pointless. But it should not be that way, and the fundamental idea behind homework is sound and the only way something can be properly learnt is by doing exercises and thinking through it yourself.

nickjj · 2026-04-16T20:56:15 1776372975

Yeah, ChatGPT's paid version is wildly inaccurate on very important and very basic things. I never got onboard with AI to begin with but nowadays I don't even load it unless I'm really stuck on something programming related.

dyauspitr · 2026-04-16T19:29:05 1776367745

So what? That might happen one out of 100 times. Even if it’s 1 in 10 who cares? Math is verifiable. You’ve just saved yourself weeks or months of work.

icedchai · 2026-04-16T20:30:16 1776371416

You don't think these errors compound? Generated code has 100's of little decisions. Yes, it "usually" works.

russfink · 2026-04-16T23:28:54 1776382134

LLM’s: sometimes wrong but never in doubt.

dyauspitr · 2026-04-16T20:37:08 1776371828

Not in my experience. With a proper TDD framework it does better than most programmers at a company who anecdotally have a bug every 2-3 tasks.

tranceylc · 2026-04-17T00:10:39 1776384639

The kind of mistakes it makes are usually strange and inhuman though. Like getting hard parts correct while also getting something fundamental about the same problem wrong. And not in the “easy to miss or type wrong” way.

I wish I had an example for you saved, but happens to me pretty frequently. Not only that but it also usually does testing incorrectly at a fundamental level, or builds tests around incorrect assumptions.

icedchai · 2026-04-17T14:03:33 1776434613

I've seen LLMs implement "creative" workarounds. Example: Sonnet 4.5 couldn't figure out how to authenticate a web socket request using whatever framework I was experimenting with, so it decided to just not bother. Instead, it passed the username as part of the web socket request and blindly trusted that user was actually authenticated.

The application looked like it worked. Tests did pass. But if you did a cursory examination of the code, it was all smoke and mirrors.

svachalek · 2026-04-17T18:18:44 1776449924

Yeah recently it had an issue getting OIDC working and decided to implement its own, throwing in a few thousand extra lines. I'm sure there were no security holes created in there at all. /s

bratwurst3000 · 2026-04-17T14:56:58 1776437818

yes i wished i had safes some of my best examples too. One i had was super weird in chatgpt pro. It told me that after 30 years my interest would become negative and i would start loosing money. Didnt want to accept the error.

FeepingCreature · 2026-04-17T09:07:34 1776416854

Errors compounding is a meme. In iterated as well as verifiable domains, errors dilute instead of compounding because the llm has repeated chances to notice its failure.

coldtea · 2026-04-17T00:49:31 1776386971

Yes, just use random results. You’ve just saved yourself weeks or months of work of gathering actual results.

holoduke · 2026-04-17T16:35:46 1776443746

--------> I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. It's 50 meters — driving there and back takes longer than walking, and you'd just be dirtying the car you're about to clean.

---------> haha but how can i wash my car then?

Walk to the car wash first to check what they offer — many have self-service bays where you bring your car, or full-service where they wash it for you. If it's self-service, then drive over with the dirty car. If full-service, drive over and they handle it.

Either way, the car needs to end up at the wash — the walk was just to scout it out. Or just drive straight there; 50 meters of extra dirt won't matter.

----------> you, you sound like a politician. Are you sure your first answer was really reasoned like your second answer?

You're right, my first answer was nonsense. I pattern-matched on "50 meters = walk" without thinking about the actual task. You can't wash a car without the car being there. Just drive.

heurist · 2026-04-16T19:28:31 1776367711

Claude Opus 4.7 responds with walk for me with and without adaptive thinking, but neither the basic model used when you Google search or GPT 5.4 do.

russfink · 2026-04-16T23:26:37 1776381997

Or, the first time a mistake is detected, a correction is automatically applied.

TeMPOraL · 2026-04-16T18:10:53 1776363053

Idk but ironically, I had to re-read the first part of GP's comment three times, wondering WTF they're implying a mistake, before I noticed it's the car wash, not the car, that's 50 meters away.

I'd say it's a very human mistake to make.

magicalist · 2026-04-16T19:08:54 1776366534

> I'd say it's a very human mistake to make.

>> It'll take you under a minute, and driving 50 meters barely gets the engine warm — plus you'd just have to park again at the other end. Honestly, by the time you started the car, you'd already be there on foot.

It talks about starting, driving, and parking the car, clearly reasoning about traveling that distance in the car not to the car. It did not make the same mistake you did.

toraway · 2026-04-16T21:43:21 1776375801

We truly do not need to lower the bar to the floor whenever an LLM makes an embarrassing logical error, particularly when the excuses don't line up at all with the reasoning in its explanation.

thfuran · 2026-04-16T18:24:09 1776363849

I don't want my computer to make human mistakes.

AgentOrange1234 · 2026-04-16T19:29:45 1776367785

It may be inescapable for problems where we need to interpret human language?

jasonfarnon · 2026-04-16T23:18:39 1776381519

then throw away the turing test

scrollaway · 2026-04-16T18:37:03 1776364623

then don't train it on human data

59nadir · 2026-04-17T00:18:09 1776385089

LLMs do not have trouble reading, it didn't make the mistake you made and it wouldn't. You missed a word, LLMs cannot miss words. It's not even remotely a human mistake.

galaxyLogic · 2026-04-17T01:30:33 1776389433

> I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

I think no real human would ask such a question. Or if we do we maybe mean should I drive some other car than the one that is already at the car-wash?

A human would answer, "silly question ". But a human would not ask such a question.

psadauskas · 2026-04-17T02:44:54 1776393894

A human totally would, as one of those brain-teaser trick questions. Its the same kind of question as "A plane crashes right on the border between the US and Canada. Where do they bury the survivors?" Its the kind of question you only get right if you pay close attention. Asking an AI that is like asking a 5 year old. You're not asking to get an answer, you're asking to see if they're paying attention.

jameshart · 2026-04-17T02:50:34 1776394234

I was given to understand that attention is all you need.

layer8 · 2026-04-17T11:12:30 1776424350

That’s why we’re testing for it.

ahartmetz · 2026-04-17T08:47:53 1776415673

That a human would not ask such a question means it's not in the training set, so it shows how bad an LLM can be at thinking from first principles. Which, I think, is the point of such silly questions.

krzat · 2026-04-17T08:58:41 1776416321

Humans tend to confabulate when asked "why you did X", funny how LLMs are pretty much the same.

vintermann · 2026-04-16T17:28:51 1776360531

Well, at least we know that's one gotcha/benchmark they aren't gaming.

HarHarVeryFunny · 2026-04-17T13:32:28 1776432748

This "figuring out" is just going to come from stuff it was trained on - people discussing why LLMs fail at certain things, and those people (training samples) not always being correct about it!

The "How many R's in "strawberry, counting words in a sentence, reversing strings. I process text as tokens, not characters, so these are surprisingly error-prone" explanation sounds plausible, but I don't think it it correct.

Any model I've ever tried that failed on things like "R's in strawberry" was quite capable of reliably returning the letter sequence of the word, so the mapping of tokens back to letters is not the issue, as should also be obvious by ability of models to do things like mapping between ASCII and Base64 (6 bits/char => 2 letters encode 3 chars). This is just sequence to sequence prediction, which is something LLMs excel at - their core competency!

I think the actual reason for failures at these types of counting and reversing tasks is twofold:

1) These algorithmic type tasks require a step-by-step decomposition and variable amount of compute, so are not amenable to direct response from an LLM (fixed ~100 layers of compute). Asking it to plan and complete the task in step-by-step fashion (where for example it can now take advantage of it's ability to generate the letter sequence before reversing it, or counting it) is going to be much more successful. A thinking model may do this automatically without needing to be told do it.

2) These types of task, requiring accurate reference and sequencing through positions in its context, are just not natural tasks for an LLM, and it is probably not doing them (without specific prompting) in the way you imagine. Say you are asking it to reverse the letter sequence of a 10 letter word, and it has somehow managed to generate letter # 10, the last letter of the word, and now needs to copy letter #9 to the output. It will presumably have learnt that 10-1 is 9, but how to use that to access the appropriate position in context (or worse yet if you didn't ask it to go step by step and first generate the letter sequence, so the sequence doesn't even exist in context!)? The letter sequence may have quotes and/or commas or spaces in it, and altogether starts at a given offset in the context, so it's far more difficult than just copying token at context position #9 ! It's probably not even actually using context positions to do this, at least not in this way. You can make tasks like this much easier for the model by telling it exactly how to perform it, generating step-by-step intermediate outputs to track it's progress etc.

BTW, note that the model itself has no knowledge of, or insight into, the tokenization scheme that is being used with it, other than what is available on the web, or that it might have been trained to know. In fact, if you ask a strong model how it could even in theory figure out (by experimentation) it's own tokenization scheme, it will realize this is next to impossible. The best hope might be some sort of statistical analysis of it's own output, hoping to take advantage of the fact that it is generating sub-word token probabilities, not word probabilities. Sonet 4.6's conclusion was "Without logprob access, the model almost certainly cannot recover its exact tokenization scheme through introspection or behavioral self-probing alone".

smooc · 2026-04-16T18:07:59 1776362879

I'd say the joke is on you ;-)

fragmede · 2026-04-16T18:46:46 1776365206

I tried o3, instant-5.3, Opus 3, and haiku 4.5, and couldn't get them to give bad answers to the couch: stairs vs elevator question. Is there a specific wording you used?

toraway · 2026-04-16T21:53:20 1776376400

That's an example the LLM came up with itself while analyzing its failed car wash walk/drive answer, it's not OP's question.

scotty79 · 2026-04-17T08:35:13 1776414913

What would be a bad answer to stairs/elevator question?

Filligree · 2026-04-17T12:05:14 1776427514

You can’t get the couch into the elevator, typically. Trust me, I tried.

Couch depending. I will persist in trying every time this comes up.

BenjiWiebe · 2026-04-17T17:45:57 1776447957

You can take a mattress up an elevator though (1). Some couches might fit in some elevators.

1: source: me...

gambiting · 2026-04-17T12:48:53 1776430133

Well if it's one of those hospital elevators that can take a bed with a patient, you probably could. Or if it's a small 2 seater sofa. The question isn't as dumb as it sounds at first, and a human would definitely ask a follow up question.

slekker · 2026-04-16T17:48:06 1776361686

What about Qwen? Does it get that right?

lambda · 2026-04-16T17:59:06 1776362346

I've run several local models that get this right. Qwen 3.5 122B-A10B gets this right, as does Gemma 4 31B. These are local models I'm running on my laptop GPU (Strix Halo, 128 GiB of unified RAM).

And I've been using this commonly as a test when changing various parameters, so I've run it several times, these models get it consistently right. Amazing that Opus 4.7 whiffs it, these models are a couple of orders of magnitude smaller, at least if the rumors of the size of Opus are true.

qingcharles · 2026-04-16T18:40:21 1776364821

Does Gemma 4 31B run full res on Strix or are you running a quantized one? How much context can you get?

lambda · 2026-04-16T19:55:41 1776369341

I'm running an 8 bit quant right now, mostly for speed as memory bandwidth is the limiting factor and 8 bit quants generally lose very little compared to the full res, but also to save RAM.

I'm still working on tweaking the settings; I'm hitting OOM fairly often right now, it turns out that the sliding window attention context is huge and llama.cpp wants to keep lots of context snapshots.

qingcharles · 2026-04-16T20:04:50 1776369890

I had a whole bunch of trouble getting Gemma 4 working properly. Mostly because there aren't many people running it yet, so there aren't many docs on how to set it up correctly.

It is a fantastic model when it works, though! Good luck :)

throwup238 · 2026-04-16T16:31:06 1776357066

The p stands for putrification.

shawnz · 2026-04-16T17:27:51 1776360471

Note that for Claude Code, it looks like they added a new undocumented command line argument `--thinking-display summarized` to control this parameter, and that's the only way to get thinking summaries back there.

VS Code users can write a wrapper script which contains `exec "$@" --thinking-display summarized` and set that as their claudeCode.claudeProcessWrapper in VS Code settings in order to get thinking summaries back.

accrual · 2026-04-16T17:58:21 1776362301

Here is additional discussion and hacks around trying to retain Thinking output in Claude Code (prior to this release):

https://github.com/anthropics/claude-code/issues/8477

puppystench · 2026-04-16T16:27:38 1776356858

Does this mean Claude no longer outputs the full raw reasoning, only summaries? At one point, exposing the LLM's full CoT was considered a core safety tenet.

MarkMarine · 2026-04-16T18:10:29 1776363029

Anthropic was chirping about Chinese model companies distilling Claude with the thinking traces, and then the thinking traces started to disappear. Looks like the output product and our understanding has been negatively affected but that pales in comparison with protecting the IP of the model I guess.

andai · 2026-04-16T22:18:37 1776377917

When Gemini Pro came out, I found the thinking traces to be extremely valuable. Ironically, I found them much more readable than the final output. They were a structured, logical breakdown of the problem. The final output was a big blob of prose. They removed the traces a few weeks later.

axpy906 · 2026-04-16T22:43:50 1776379430

That’s kind of funny since a Chinese model started the thinking chains being visible in Claude and OA in the first place.

fasterthanlime · 2026-04-16T16:34:15 1776357255

I don't think it ever has. For a very long time now, the reasoning of Claude has been summarized by Haiku. You can tell because a lot of the times it fails, saying, "I don't see any thought needing to be summarised."

fmbb · 2026-04-16T16:47:36 1776358056

Maybe there was no thinking.

derrida · 2026-04-17T13:19:32 1776431972

Not a haiku, more a koan.

astrange · 2026-04-16T18:42:50 1776364970

It also gets confused if the entire prompt is in a text file attachment.

And the summarizer shows the safety classifier's thinking for a second before the model thinking, so every question starts off with "thinking about the ethics of this request".

FeepingCreature · 2026-04-17T09:09:35 1776416975

I'd get confused if I was a LLM and you put my entire prompt in a text file attachment. I'd be like, "is this the user or is this a prompt injection??"

einrealist · 2026-04-16T18:29:34 1776364174

They are trying to optimize the circus trick that 'reasoning' is. The economics still do not favor a viable business at these valuations or levels of cost subsidization. The amount of compute required to make 'reasoning' work or to have these incremental improvements is increasingly obfuscated in light of the IPO.

blazespin · 2026-04-16T17:36:52 1776361012

Safety versus Distillation, guess we see what's more important.

DrammBA · 2026-04-16T16:42:04 1776357724

Anthropic always summarizes the reasoning output to prevent some distillation attacks

jdiff · 2026-04-16T18:13:11 1776363191

Genuine question, why have you chosen to phrase this scraping and distillation as an attack? I'm imagining you're doing it because that's how Anthropic prefers to frame it, but isn't scraping and distillation, with some minor shuffling of semantics, exactly what Anthropic and co did to obtain their own position? And would it be valid to interpret that as an attack as well?

DrammBA · 2026-04-16T19:01:55 1776366115

> I'm imagining you're doing it because that's how Anthropic prefers to frame it

Correct.

> would it be valid to interpret that as an attack as well?

Yup.

irthomasthomas · 2026-04-16T18:22:45 1776363765

If you ask claude in chinese it thinks its deepseek.

typ · 2026-04-17T02:22:38 1776392558

I don't think that learning from textbooks to take an exam and learning from the answers of another student taking the exam are the same.

Joking aside, I also don't believe that maximum access to raw Internet data and its quantity is why some models are doing better than Google. It seems that these SoTA models gain more power from synthetic data and how they discard garbage.

fragmede · 2026-04-16T20:50:58 1776372658

Firehosing Anthropic to exfiltrate their model seems materially different than Anthropic downloading all of the Internet to create the model in the first place to me. But maybe that's just me?

jdiff · 2026-04-16T22:34:44 1776378884

I don't see the material difference in firehosing anthropic vs anthropic firehosing random sites on the internet. As someone who runs a few of those random sites, I've had to take actions that increase my costs (and burn my time) to mitigate a new host of scrapers constantly firing at every available endpoint, even ones specifically marked as off limits.

robrenaud · 2026-04-16T21:48:24 1776376104

Yeah, it's different. Anthropic profits when it delivers tokens. Hosting providers pay when Anthropic scrapes them.

59nadir · 2026-04-17T00:25:57 1776385557

Yes, what the LLM providers did was worse and impacted people financially a whole lot more in lost compensation for works as well as operational costs that would never reach the heights they did solely because of scrapers on behalf of model providers.

vintermann · 2026-04-16T17:37:57 1776361077

Attacks? That's a choice of words.

DrammBA · 2026-04-16T17:45:06 1776361506

Definitely Anthropic playing the victim after distilling the whole internet.

butlike · 2026-04-16T19:03:06 1776366186

Proprietary pattern matcher proves there's no moat; promptly pre-covers other's perception.

nyc_data_geek1 · 2026-04-16T17:17:58 1776359878

Very cool that these companies can scrape basically all extant human knowledge, utterly disregard IP/copyright/etc, and they cry foul when the tables turn.

butlike · 2026-04-16T19:04:25 1776366265

All extant human knowledge SO FAR. Remember, by the nature of the beast, the companies will always be operating in hindsight with outdated human knowledge.

stavros · 2026-04-16T17:39:31 1776361171

Yep, that is exactly what happens. It's a disgrace that their models aren't open, after training on everything humanity has preserved.

They should at least release the weights of their old/deprecated models, but no, that would be losing money.

copperx · 2026-04-16T21:22:46 1776374566

We should treat LLM somewhat like patents or drugs. After 5 years or so, the models should become open source. Or at very least the weights. To compensate for the distilling of human knowledge.

MasterScrat · 2026-04-16T17:19:51 1776359991

and so does OpenAI

andrepd · 2026-04-16T17:18:44 1776359924

CoT is basically bullshit, entirely confabulated and not related to any "thought process"...

clbrmbr · 2026-04-17T00:21:53 1776385313

But still CoT distillation WORKS. See the DeepSeek R1 paper.

whattheheckheck · 2026-04-17T02:26:34 1776392794

Tokens relate to each other. More tokens more compute

jafitc · 2026-04-17T18:50:20 1776451820

bigger change here might not be model quality, but debuggability.

once you hide the reasoning, remove the knobs, and let the model choose its own effort, it gets much harder to tell whether the model got worse or just got harder to inspect.

that’s a real shift. less tool, more black box.

p_stuart82 · 2026-04-16T16:26:01 1776356761

yeah they took "i pick the budget" and turned it into "trust us".

bandrami · 2026-04-16T17:10:59 1776359459

I keep saying even if there's not current malfeasance, the incentives being set up where the model ultimately determines the token use which determines the model provider's revenue will absolutely overcome any safeguards or good intentions given long enough.

vessenes · 2026-04-16T18:59:16 1776365956

This might be true, but right now everybody is like "please let me spend more by making you think longer." The datacenter incentives from Anthropic this month are "please don't melt our GPUs anymore" though.

lukan · 2026-04-16T16:17:34 1776356254

"Also notable: 4.7 now defaults to NOT including a human-readable reasoning token summary in the output, you have to add "display": "summarized" to get that"

I did not follow all of this, but wasn't there something about, that those reasoning tokens did not represent internal reasoning, but rather a rough approximation that can be rather misleading, what the model actual does?

motoboi · 2026-04-16T16:20:53 1776356453

The reasoning is the secret sauce. They don't output that. But to let you have some feedback about what is going on, they pass this reasoning through another model that generates a human friendly summary (that actively destroys the signal, which could be copied by competition).

XenophileJKO · 2026-04-16T16:29:48 1776356988

Don't or can't.

My assumption is the model no longer actually thinks in tokens, but in internal tensors. This is advantageous because it doesn't have to collapse the decision and can simultaneously propogate many concepts per context position.

ainch · 2026-04-16T17:24:44 1776360284

I would expect to see a significant wall clock improvement if that was the case - Meta's Coconut paper was ~3x faster than tokenspace chain-of-thought because latents contain a lot more information than individual tokens.

Separately, I think Anthropic are probably the least likely of the big 3 to release a model that uses latent-space reasoning, because it's a clear step down in the ability to audit CoT. There has even been some discussion that they accidentally "exposed" the Mythos CoT to RL [0] - I don't see how you would apply a reward function to latent space reasoning tokens.

[0]: https://www.lesswrong.com/posts/K8FxfK9GmJfiAhgcT/anthropic-...