xscott's comments | Hacker News

I think you're right about the cost/benefit trade-off in general, but I do wonder how much of the "compaction" Codex and Claude do is to keep context fresh and how much is to save them runtime costs.

If you've got a 1M token context, but they constantly summarize it down to something much smaller, is it really 1M tokens of benefit? With a local model, you can use all 256k tokens on your own terms. However, I don't have any benchmarks to know.


I think you might be a bit confused about compaction? The LLM API endpoint does not do compaction; it's an external agent harness that does it. And the Codex/Claude agents aren't constantly summarizing it down; they generally wait until you get within about 3/4 of the maximum context size.

Compaction doesn't save them money; it just makes it possible for you to continue a session. If you compact a session too many times, besides the fact that the model basically stops being useful, you eventually can't do anything else in the session because all the context is taken up by compaction notes. But if you don't compact, pretty soon the session becomes completely unusable because it can't output any more tokens. You can disable compaction in those agents if you want to see the difference.

Also, using a lot of context can make the model perform poorly, so compaction can improve results. If you have a much larger context size, you have more headroom before the model starts to degrade (as it grows closer to the max context size). A larger context also lets you handle larger documents or reason over a larger amount of data without having to break it up into subtasks. Eventually we want models' contexts to get much bigger so we can do more things in a session. (Some research is being done to see if we can get rid of the limit entirely.)


The LLM API endpoint does do compaction. OpenAI definitely supports server-side compaction, both explicit and automatic, and this is different from what could be implemented purely client-side: https://developers.openai.com/api/docs/guides/compaction (and there were rumors a few months ago on HN about how activation-preserving/latent it is, vs. just summarization). Anthropic as well, in beta (new to me): https://platform.claude.com/docs/en/build-with-claude/compac...

The names for the pieces are confusing, so it's easy to talk past each other. For instance, you're saying "Codex the agent", which isn't a thing now. The model is currently GPT-5.5, and at one point it was GPT-5.3-Codex, so when I say "Codex", I mean the macOS "harness". Similar for Claude Code vs. Claude Opus/Sonnet.

Anyways, I don't know the specifics well enough to argue with you on anything, but there is a cost for input tokens, and you see/pay it when you use the API directly or through OpenRouter. Maybe you've looked at the leaked source for Claude Code and can tell me definitively otherwise, but Anthropic's and OpenAI's incentives for when to compact are not always aligned with the user's, depending on pricing plans.


Your point about caliber/quality is fair, but I have been pretty astonished by some of the newer/better models (Gemma 4 variants, GPT-OSS before that).

However, there isn't much of a memory increase from running multiple sessions in parallel with one model. It's an HTTP server, and other than some caching, it's basically stateless.


Doesn't llama.cpp (or similar) have to evict the KV cache for this, so that performance degrades when running multiple sessions? Or how do you load a model in memory and then use it in multiple sessions? I'm still learning this stuff.

The model is loaded once and can be used for multiple sessions, and even parallel requests.

llama.cpp uses a unified KV cache that is shared between requests (be they happening in parallel or not). As new requests come in, they'll evict no longer referenced branches, then move to evict the least recently used entry, and so on.

If you come back to a session whose cache has been evicted, the prompt will just be processed again. This is only a problem on very long context sessions, but it can still be a problem for you.

So one way to reduce such evictions (and significantly reduce KV cache size as a bonus) is to reduce the number of KV cache checkpoints.

Checkpoints allow you to branch a session at any point and not have to recompute it from the start. If you find that you rarely branch a conversation, or if you rely entirely on a coding harness, then setting ctx-checkpoints to 0 or 1 will save tons of VRAM and allow more different sessions to stay in VRAM. This is especially true for models with very large checkpoints (such as Gemma 4).
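
For example, something like this (just a sketch; flag names vary between llama.cpp builds, so check `llama-server --help`, and the model filename is only a placeholder):

    # one model in memory serving four parallel session slots that share
    # the unified KV cache; --ctx-checkpoints 1 keeps at most one branch
    # checkpoint per slot, trading branching convenience for VRAM
    llama-server -m gemma-4-27b.gguf -c 32768 --parallel 4 --ctx-checkpoints 1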


There are so many flags to llama.cpp that I won't try to say anything too strong, but I believe things related to `--kv-offload` mean you can have the KV cache in GPU VRAM, in regular system RAM, paged to disk, etc...

I'm on a Mac with unified memory, so I can't easily benchmark it for you, but I think a PC with 64GB of regular RAM and a 24GB gaming card could swap between multiple sessions without too much pain. The weights could stay resident on the GPU.

On the other hand, I did just dump some Project Gutenberg texts into a prompt, and building that cache in the first place was slower than I thought it would be.


This kind of proof isn't really as watertight as you claim. It's a lot like saying state machines are limited to processing regular expressions, and then completely ignoring how easy it is to add a stack or linear memory to a state machine to make it a PDA or a Turing machine.

So yes, LLMs can be trivialized as just randomized autocomplete, but if you add a database or memory on the side, even very basic MLPs can become a Turing machine. It's going to take a lot more proof to say a Turing machine could never be intelligent. And you can do more than just give the LLM side memory - you can invoke them recursively, use message passing as coroutines, and so on...
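
To make the state machine point concrete, here's a toy sketch (nothing more): a finite scanner alone can't recognize nested parentheses, but the same loop plus a depth counter (a degenerate PDA stack) can:

    #include <stdbool.h>

    /* finite control plus one counter standing in for the PDA's stack */
    bool balanced(const char *s) {
        int depth = 0;
        for (; *s; s++) {
            if (*s == '(') depth++;
            else if (*s == ')' && --depth < 0) return false; /* pop on empty */
        }
        return depth == 0; /* accept only with an empty stack */
    }

    int main(void) { return balanced("(())()") ? 0 : 1; }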

You might be technically correct if you ignore anything other than the very restrictive definitions you're using, but even there I'm not certain. If you had an LLM with a trillion-token window, is that good enough to act as a memory? Human brains aren't infinite either.


Agreed. It is nonsensical to argue that a 3B transformer hard-capped to decode 100 tokens is "intelligent". Of course, when we evaluate whether "transformers" are intelligent or not, we are talking about taking transformers as a core part of the system and enhancing them with other means (as you said, it is pretty trivial to make transformers into a Turing machine, which can hence carry out any computation, including intelligence, if you are in the camp that intelligence is computable; I don't think it makes sense to argue with anyone who believes otherwise).

Lol, I totally agree about anyone using the non-computable angle.

However, I've got a 20GB GGUF file on my disk that can write code better than 99% of the people I ever worked with in the last 25 years, and ravens seem pretty clever with about 2 billion neurons... I have no idea what the lower bound is.

Fun to think about though :-)


Same behavior in Tucson and Denver. I hate cyclists. They're threatening, law-breaking, and self-entitled. Drivers and walkers seem to get along fine for the most part. The one courtesy cyclists extend to the rest of us is that they self-identify by wearing spandex branded with logos from companies that don't sponsor them - some weird role-play poser fetish I guess.

But be honest - you don't really care about evidence.


You've cited another two anecdotes. Back up your fucking claim.

> some weird role-play poser fetish I guess.

Really? Do you actually want to argue your point or is negative attention your fetish?

^this kind of argument is not fucking productive.

> But be honest - you don't really care about evidence.

You're the one making an emotional argument here without citing anything.

I don't cycle. I appreciate walkable cities with bike lanes, and live in a country where cyclists respect the law.

I do actually care about evidence. If you would fucking care to cite some.

So CITE YOUR SOURCES.



Please don't do this on HN. It's against the guidelines to post “internet tropes”, and the purpose of HN is for curious conversation, whereas a link to this kind of URL is low-effort snark.

Also, your comment upthread breaks several guidelines; particularly the lines “some weird role-play poser fetish I guess” and “But be honest - you don't really care about evidence”.

Please make an effort to observe the guidelines if you want to participate here.

https://news.ycombinator.com/newsguidelines.html


The actual trope in this conversation is "citation needed". That's a phrase which pretty much everyone here, yourself included, knows is the superficially civil (politely hostile) way of saying "you're full of shit".

Telling someone they're sealioning is just using a recently coined word. You also know that person wasn't sincerely asking for evidence - they were sealioning, and very hostile about it.

As for mocking cyclist fashion, that's just a case of falling on the wrong side of the fence. It's completely acceptable here, encouraged even, to mock certain groups and not others. In any given conversation, snark is allowed so long as it points in the agreed direction. And it's self-reinforcing, because anyone who goes against the grain is weeded out - as in your moderation here.

Anyways, I'm not sure what you could do differently. The alternative chat forums do seem consistently worse, so maybe this is as good as it gets.


> As for mocking cyclist fashion, that's just a case of falling on the wrong side of the fence. It's completely acceptable here, encouraged even, to mock certain groups and not others. In any given conversation, snark is allowed so long as it points in the agreed direction. And it's self-reinforcing, because anyone who goes against the grain is weeded out - as in your moderation here.

People who have conviction about issues with moderation include links to demonstrate what they mean. When people make vague insinuations like this without links, it's an indication that they just want to spray a little poison into the atmosphere, and evade accountability for their own conduct or examination of their claims.

If you have evidence of what you mean, please share links or quotes in the comments or email us (hn@ycombinator.com).

Either way, the guidelines apply to everyone equally, and it is never “acceptable here, encouraged even, to mock certain groups”.


> People who have conviction about issues with moderation include links to demonstrate what they mean.

Yes, I see you're using extra words to say "citation required". It's borderline clever, and fits the obvious intention of telling me I'm full of shit, except you're making a strong statement that also needs bolstering. How would you know if the alienated people just quietly go away or silence their opinions to fit in?

Regardless, it's acceptable here to mock climate deniers, capitalists (landlords, CEOs, Billionaires), SUV or truck drivers, religious fundamentalists, various flavors of conservatives, fans of "AI slop" (music or art), etc... You've got better search tools than I do to find the links.

I don't particularly want to defend any of those groups. I just wish we could add cyclists to the approved set, because they're frequently self-righteous hypocrites. I can see I'm unlikely to succeed in this endeavor.

> it's an indication that they just want to spray a little poison into the atmosphere

That seems a more than a bit uncharitable. Do you have any evidence to back it up? :-)

> evade accountability for their own conduct or examination of their claims.

I contradicted a jerk in defense/support of someone who said something I agree with. When the jerk doubled down and became truly belligerent, I bowed out of the conversation and let them have the last word before it turned into an actual flame war.

You came in 12 hours later with an "I don't care who started it" approach, looking for a reason to chastise both of us, and the worst crimes you could come up with for me were some weird thing about troping and making fun of cyclist fashion.

Is that accountable enough? Am I supposed to feign penitence like the belligerent kid did?

I've wasted enough of your time. Peace!


> Regardless, it's acceptable here to mock climate deniers, capitalists (CEOs, Billionaires), SUV or truck drivers, religious fundamentalists, various flavors of conservatives, fans of "AI slop" (music or art)

No, it’s not acceptable to mock any of these categories. Never has been in the years I’ve been doing this job. Yes, people do it, in breach of the guidelines, and the community flags them and the moderators warn them then penalize or ban them. This has been consistent for years. What’s also consistent is that people who are strongly partisan towards one position are convinced we are biased towards the opposite of that position.


This isn't an example of that. You claimed something in your initial comment. You did not back it up.

I'm asking you to back up your initial claim. If you had already addressed it and I kept demanding more, you'd have a point - that would be a correct example of sealioning.

But you haven't, so don't accuse me of sealioning.

This isn't me arguing in bad faith. This is me asking you to back up the claim you made in your first comment. That's arguing in good faith, if only you are willing to provide the other side of the argument.

Which you have avoided so far.


The “sealion” link and the abusive parts of their earlier comment are unacceptable and I've replied to their comment to make that known. However, these lines in your comment are also clear breaches:

> Back up your fucking claim.

> Really? Do you actually want to argue your point or is negative attention your fetish?

> ^this kind of argument is not fucking productive.

> So CITE YOUR SOURCES.

Please don't fulminate or post flamebait on HN, or use capitalization for emphasis. The entire purpose of HN is to engage in curious conversation about topics we find interesting, and to avoid furious battle like this.


Apologies, and noted. I wasn't my usual self, which is honestly why I gave in and replied to them at all. I usually try to do better, and will in future.

Great, thanks for the reply, looking forward to seeing better from you in future.

As a cyclist, I'm sure you're tolerant and polite to people walking in the middle of the multi-use paths, right? /s

For a long time I thought cyclists were hypocrites because they play the victim when they're on roads while being complete jerks on walking paths. But really, it's not hypocrisy - it's self-entitlement in both cases. It's honestly very consistent behavior.


I don't find cyclists especially obnoxious on the rail-trails I often walk on. But I have walked on rail-trails with a lot of bicycles where various people got pretty pissy because I wouldn't step off the trail every minute.

I don't understand: They get pissy, but you don't find that obnoxious?

If cyclists got off the roads every time a car comes by, that would be consistent with their expectations for walking paths.


No, I'm saying most cyclists are reasonable but I've been on crowded trails with elevation changes where they haven't been and have acted as if they had the right of way and have sometimes gotten pissy if I didn't get out of the way quickly enough.

Can you expand on that? I've been wanting to try Claude for a while, but their payment processing wouldn't take any of my credit cards (they work everywhere else, so it's not the cards). I've heard I can work around this by installing their mobile app or something, but it was extra hurdles, so I didn't try very hard.

And I've been absolutely amazed with Codex. I started using it at version ChatGPT 5.3-Codex, and it was so much better than online ChatGPT 5.2, even sticking to single-page apps, which both can do. I don't have any way to measure the "smarts" of the new 5.4, but it seems similar.

Anyways, I'll try to get Claude running if it's better in some significant way. I'm happy enough with the Codex GUI on macOS, but that's just one of several things that could be different between them.


Codex is not bad; I think it is still useful. But I find that it takes things far too literally and is generally less collaborative. It is a bit like working with a robot that makes no effort to understand why a user is asking for something.

Claude, IMO, is much better at empathizing with me as a user: It asks better questions, tries harder to understand WHY I'm trying to do something, and is more likely to tell me if there's a better way.

Both have plenty of flaws. Codex might be better if you want to set it loose on a well-defined problem and let it churn overnight. But if you want a back-and-forth collaboration, I find Claude far better.


That is interesting, and thank you.

I've had a list of pet projects that I've been adding to for years. For those, I just give the broad strokes and tell it to do its best. Codex has done a really good job on most of them, sometimes in one shot, and my list of experiments is emptying. Only one notable exception where it had no idea what I was after.

I also have my larger project, which I hope to actually keep and use. Same thing though: it's really hard to explain what's going on, and it acts on bad assumptions.

So if Claude is better at that, then having two tools makes a lot of sense to me.


> I've been wanting to try Claude for a while, but their payment processing wouldn't take any of my credit cards (they work everywhere else, so it's not the cards). I've heard I can work around this by installing their mobile app or something, but it was extra hurdles, so I didn't try very hard.

Not Claude Code specifically, but you can try the Claude Opus and Sonnet 4.6 models for free using Google Antigravity.


Thank you for this. I had Antigravity already but was thinking of cancelling it because Gemini frustrates me. Using it with Claude though was very impressive. I burned through my token budget in about 5 hours though.


I think it would be cool if a language designed specifically for LLMs came about. It should have something like required preconditions and postconditions so that a deterministic compiler can verify the assumptions the LLM is claiming. Something like a theorem prover, but targeted specifically at programming and efficient compilation/runtime. And it doesn't need all the niceties human programmers tend to prefer (implicit conversions come to mind).
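
To sketch the flavor I mean, borrowing from ACSL/Frama-C-style annotations on C (the function and clauses here are just illustrative):

    #include <limits.h>  /* INT_MAX */

    /*@ requires 0 <= a && 0 <= b;
      @ requires a <= INT_MAX - b;   // the prover must rule out overflow
      @ ensures \result == a + b;
      @*/
    int checked_add(int a, int b) {
        return a + b;  /* a tool, not a human reviewer, discharges the claims */
    }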


If you're that confident in the LLM's output, just train it to output some kind of intermediate language, or even machine code.

And if you're not that confident, shouldn't you still be optimising for humans, because humans have to check the LLM's output?


At least in programming, humans have to check the product of the LLM's output rather than the output itself.


I'm working on this now.

It's a Profile Guided Optimization language - with memory safety like Rust.

It's extremely easy to optimize assuming you either 1) profile it in production (obviously has costs) or 2) can generate realistic workloads to test against.

It's like Rust, in that it makes expressing common illegal states just outright impossible. Though it goes much further than Rust.

And it's easier to read than Swift or Go.

There's a lot of magic that happens with defaults that languages like Zig or Rust don't want, because they want every cost signal to be as visible as possible, so you can understand the cost of a line and a function.

LLMs with tests can - I hope - do this without that noise.

We shall see.


Do you have a repo?


Yes.

I'm almost ready to launch v0.1 - but the documentation is especially a mess right now, so I don't want to share yet.

I'll update this comment in a week or so [=


Appreciate it!


Of course I can't be certain, but I think the "mixture of experts" design plays into it too. Metaphorically, there's a mid-level manager who looks at your prompt and tries to decide which experts it should be sent to. If he thinks you won't notice, he saves money by sending it to the undergraduate intern.

Just a theory.


Notice that MoE isn't different experts for different types of problems. It's per token, and not really connected to problem type.

So if you send in Python code, the first token in a function can go to one expert, the second to another expert, and so on.


Can you back this up with documentation? I don't believe that this is the case.


The router that routes the tokens between the "experts" is trained as part of the model as well. MoE is really not a good name, as it makes people believe the split happens at a coarser level and that each of the experts is somehow trained on a different corpus, etc. But what do I know; there are new archs every week, and someone might have done a MoE differently.


It's not only per token, but also each layer has its own router and can choose different experts. https://huggingface.co/blog/moe#what-is-a-mixture-of-experts...
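
A toy sketch of what that routing looks like for one layer (hypothetical sizes and logit values; top-2 gating as in many MoE models):

    #include <stdio.h>

    #define N_EXPERTS 8

    /* pick the two largest router logits for one token at one layer */
    static void route_top2(const float logits[N_EXPERTS], int pick[2]) {
        pick[0] = 0; pick[1] = 1;
        if (logits[1] > logits[0]) { pick[0] = 1; pick[1] = 0; }
        for (int e = 2; e < N_EXPERTS; e++) {
            if (logits[e] > logits[pick[0]]) { pick[1] = pick[0]; pick[0] = e; }
            else if (logits[e] > logits[pick[1]]) { pick[1] = e; }
        }
    }

    int main(void) {
        /* two tokens from the same Python function can route differently */
        float tok_a[N_EXPERTS] = {0.1f, 2.0f, 0.3f, 0.0f, 1.5f, 0.2f, 0.1f, 0.0f};
        float tok_b[N_EXPERTS] = {1.9f, 0.1f, 0.2f, 2.2f, 0.0f, 0.3f, 0.1f, 0.4f};
        int a[2], b[2];
        route_top2(tok_a, a);
        route_top2(tok_b, b);
        printf("token A -> experts %d,%d; token B -> experts %d,%d\n",
               a[0], a[1], b[0], b[1]);
        return 0;
    }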


Check out Unsloth's REAP models: you can outright delete a few of the lesser-used experts without the model going braindead, since they all can handle each token but some are better positioned to do so.


Language changes over time, and I remember recent memes where a cute girl says something like "claiming you're moderate means you know conservatives don't get laid" (presumably because of abortion politics). It makes me wonder if the moderates actually became liberal or if they just don't want to use that word any more.

After all the polarization in "reality show politics", my diehard liberal friends seem less liberal to me, but they'll state which team they're on more fervently than ever.


Even very simple code can run into UB:

    int handle_untrusted_numbers(int a, int b) {
        if (a < 0) return ERROR_EXPECTED_NON_NEGATIVE;
        if (b < 0) return ERROR_EXPECTED_NON_NEGATIVE;
        int sum = a + b;   /* UB right here if the sum exceeds INT_MAX */
        if (sum < 0) {     /* "impossible" if overflow is UB, so the compiler may delete it */
            return ERROR_INTEGER_OVERFLOW;
        }
        return do_something_important_with(sum);
    }
Every computer you will ever use represents signed integers with two's complement, and the standard recently recognized and codified this fact. However, the UB fanatics (heretics) insisted that keeping signed overflow undefined is an important opportunity for optimizations, so the compiler may delete that last if-statement, and your code quietly stops checking for overflow.
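
For what it's worth, the UB-free version has to do the test before the addition, which is easy to get wrong (a sketch, relying on a and b already being known non-negative as above):

    /* INT_MAX is from <limits.h>; b >= 0 here, so INT_MAX - b can't underflow */
    if (a > INT_MAX - b) {
        return ERROR_INTEGER_OVERFLOW;
    }
    int sum = a + b;  /* now provably in range: nothing for the optimizer to exploit */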

There are plenty more examples, but I think this is one of the simplest.

