This is so sick. I'm really curious to see what focused effort on optimizing a single open source model can look like over many months. Not only on the inference serving side, but also on the harness optimization side and building custom workflows to narrow the gap between what frontier models can infer and deduce and what open source models natively lack due to size, training, etc.
There will always be a huge gap between frontier models and open source models (unless you're very rich). This whole industry makes no sense; everyone is ignoring the unit economics. It costs 20k a month to run Kimi 2.6 at a decent tok/s, and to sell those tokens at a profit you'd need your hardware costs to be less than 1k a month.
Everyone who's betting their competency on the generosity of billionaires selling tokens for 1/10th-1/20th of the cost, or on a delusional future where capable OS models fit on consumer-grade hardware, is actually cooked.
If you look at a graph of GPU power in consumer hardware and model capability per billion parameters over time, it seems inevitable that in the next few years a "good enough" model will run on entry-level hardware.
Of course there will always be larger flagship models, but if you can count on decent on-device inference, it materially changes what you can build.
It also massively changes the value economics of the frontier models. In a lot of cases, you really don't need a general-purpose intelligence model anyway.
Because everyone in these replies is in complete denial about the physical limits of memory and scaling in general. Y'all are literally living in an alternate reality where model capability increases as size decreases; it's simply not the case. There will be small, focused models that perform well on very narrow tasks, yes, but you will not have "agents" capable of "building most things" running on consumer hardware until more capable (and affordable) consumer hardware exists.
Correct, the progress is not perfectly linear. But do you believe technological progress has stalled forever? If so, I'd get out of tech and start selling bomb shelters.
Do you really think the trend in consumer hardware is heading towards more memory and better specs? Apple's most popular product this year is a laptop with 8GB of RAM.
The trend is heading in the opposite direction: fewer options for strong consumer hardware, and a shift towards cloud-based products. This is a memory issue more than anything. Nvidia is done selling their GDDR7 to gamers and people with AI girlfriends.
There are physical limits to how much you can compress data. I'm just saying, don't sit on your hands waiting for this to happen, because it's probably not going to for another decade-plus. There's no use in waiting; just write the code your fkin self and stop being lazy.
Just so that I have your position straight: you actually believe that over the long term, like 10, 20 years, that the amount of RAM in a laptop is going to go down?
It's not out of the realm of possibility, but I just want to make you aware that this would be a very surprising development in computing history.
I guess we'll find out! I bet all the vendors who supply RAM are looking at the current shortages and thinking "well, it's a shame we could never manufacture more RAM than we currently do."
A future with less RAM is possible with more applications using computational storage on SSD/NVMe.
But that's not my main argument. My main argument is that it's delusional for OP to think it's reasonable to expect that soon we'll be able to run models on consumer hardware that will be able to build basically most things.
But I do think there will be many compromises made for consumer electronics. I don't think the powers that be are eager to give consumers all the best memory (that should be clear by now). There are 3 DDR5 DRAM manufacturers in the world that have to supply memory to all the world's militaries, governments, and datacenters/corporations. Consumers are the last priority.
> If you look at a graph of GPU power in consumer hardware and model capability per billion parameters over time, it seems inevitable that in the next few years a "good enough" model will run on entry-level hardware.
> Of course there will always be larger flagship models, but if you can count on decent on-device inference, it materially changes what you can build.
I'm making some assumptions about what they're saying, but it seems clear they have no idea what they're talking about and that they're betting their competency on this technology.
If you're not paying attention to what's happening with small models, I suggest you take a closer look. Keeping parameter count constant, the quality of small models is rising fast. When you look at what you could do with Llama just 3 years ago vs Gemma 4 on the same 16GB hardware, the trend is clear.
Meanwhile, this year Apple bumped the base of their Mac lineup from 8GB to 16GB RAM, and the iPhone 17 Pro ships with 12GB. The Neo is at 8GB but is a brand new product tier which is not comparable to any past model.
Small models are gaining useful reasoning ability, and that's a genuinely helpful development, but they'll be heavily limited in world knowledge for the foreseeable future. BTW, the base of the Mac lineup is now once again an 8GB device with a small, low-performance SSD. Many people will tell you that it's broadly comparable (though of course not identical!) to the original base model M1.
For many tasks, including lots of agentic applications, world knowledge is not a "must-have."
To me the Neo is an exception, and doesn't represent the core Mac lineup, which is all at 16GB+ of RAM. If you're developing pro software that would rely on an on-device LLM, you probably wouldn't be targeting the Neo anyway.
Anything can technically "run" on almost any hardware; the meaningful question is what the real-world performance is. I for one have made the case in this thread that DeepSeek V4 is de facto optimal for wide batching, not single-request or single-agent inference, even on consumer hardware (which is unique among practical AI models). I might still be wrong of course, but if so I'd like to understand what's wrong with my assumptions.
I am not sure where this comment is coming from (possibly without looking at this project?). This project is running a quasi-frontier model at reasonable tps (~30) with reasonable prefill performance (~500 tps) on a high-end laptop. People are simply projecting from what they see in this project to what you can optimistically expect.
You can argue whether the projection is too optimistic or not, but this project definitely made me a little bit optimistic on that end.
There will always be a gap, but what's interesting is that because new models are constantly coming out, we as an industry never spend any time extracting the maximal value out of an existing model. What if there are techniques and harness workflows that could be optimized for a singular model end to end? How far could that push the state of the art?
Or, if we could really steer these open source models using well-structured plans, could we spend more time planning in a specific way and kick off the build overnight (a la the night shift: https://jamon.dev/night-shift)?
Most tasks do not require frontier models, so as long as these models cover 95-99 per cent of the tasks, closed frontier models can be left for niche and specialized cases that are harder.
I also don't think a lot of people know some of the more advanced context management tricks, like /rewind, /fork, and /tree, to take advantage of prefix caching.
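To illustrate the prefix caching point (this is a generic sketch, not those commands' actual behavior, and the names below are illustrative rather than any particular tool's API): inference servers that cache KV state keyed on the token prefix, such as vLLM with automatic prefix caching enabled, can only reuse work when the earlier messages stay byte-identical, so forking a branch off an unchanged history is much cheaper than editing it in place.

    # Hypothetical sketch: why forking (not editing) conversation history
    # plays well with prefix caching.
    from copy import deepcopy

    def fork(history, new_user_turn):
        # Keep every earlier turn byte-identical and only append, so the
        # server sees the same token prefix and can reuse its KV cache.
        branch = deepcopy(history)
        branch.append({"role": "user", "content": new_user_turn})
        return branch

    base = [
        {"role": "system", "content": "You are a coding agent."},
        {"role": "user", "content": "Summarize the repo layout."},
        {"role": "assistant", "content": "...long, expensive answer..."},
    ]

    # Both branches share the first three messages, so prefill for those
    # tokens can be a cache hit on the second request.
    branch_a = fork(base, "Now refactor module A.")
    branch_b = fork(base, "Now write tests for module B.")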
pgbackrest is awesome, truly. Thank you so much for the work you've put into this project over the years, and I'm sad the Crunchy Data acquisition couldn't keep the project alive.
We actually have one of these between our group of friends and their kids, and it's awesome. The kids call each other to chat, set up play dates, or go run around in the street. Our kids will call back home to let us know they made it to the other person's house, or to let us know they're coming back home, too.
The tactility is incredible, and it's just so cute to watch them chat away (5 year olds!)
The way we solved it is by checking the LSN on the primary, and then waiting for the replica to catch up to that LSN before doing reads on the replica in various scenarios.
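For anyone curious, here's a minimal sketch of that approach, assuming Python with psycopg2 and the standard Postgres monitoring functions (pg_current_wal_lsn on the primary, pg_last_wal_replay_lsn on the replica); the DSNs, timeout, and polling interval are placeholders, and the original setup may differ.

    import time
    import psycopg2

    def wait_for_replica_catchup(primary_dsn, replica_dsn, timeout=10.0, poll=0.05):
        # Capture the current WAL write position on the primary.
        with psycopg2.connect(primary_dsn) as conn, conn.cursor() as cur:
            cur.execute("SELECT pg_current_wal_lsn()")
            target_lsn = cur.fetchone()[0]

        # Poll the replica until it has replayed at least that LSN.
        deadline = time.monotonic() + timeout
        with psycopg2.connect(replica_dsn) as conn, conn.cursor() as cur:
            while time.monotonic() < deadline:
                cur.execute(
                    "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), %s) >= 0",
                    (target_lsn,),
                )
                if cur.fetchone()[0]:
                    return True   # replica has caught up; safe to read there
                time.sleep(poll)
        return False  # still lagging; fall back to reading from the primary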