This is so sick. I'm really curious to see what focused effort on optimizing a single open source model can look like over many months. Not only on the inference serving side, but also on the harness optimization side and building custom workflows to narrow the gap between what frontier models can infer and deduce and what open source models natively lack due to size, training, etc.
There will always be a huge gap between frontier models and open source models (unless you're very rich). This whole industry makes no sense; everyone is ignoring the unit economics. It costs 20k a month to run Kimi 2.6 at a decent tok/s, and to sell those tokens at a profit you'd need your hardware costs to be less than 1k a month.
Everyone who's betting their competency on the generosity of billionaires selling tokens for 1/10th-1/20th of the cost, or on a delusional future where capable OS models fit on consumer-grade hardware, is actually cooked.
If you look at a graph of GPU power in consumer hardware and model capability per billion parameters over time, it seems inevitable that in the next few years a "good enough" model will run on entry-level hardware.
Of course there will always be larger flagship models, but if you can count on decent on-device inference, it materially changes what you can build.
It also massively changes the value economics of the frontier models. In a lot of cases, you really don't need a general-purpose intelligence model anyway.
Because everyone in these replies is in complete denial about the physical limits of memory and scaling in general. Y'all are literally living in an alternate reality where model capability increases as size decreases; it's simply not the case. There will be small, focused models that perform well on very narrow tasks, yes, but you will not have "agents" capable of "building most things" running on consumer hardware until more capable (and affordable) consumer hardware exists.
Correct, the progress is not perfectly linear. But do you believe technological progress has stalled forever? If so, I'd get out of tech and start selling bomb shelters.
Do you really think the trend in consumer hardware is heading towards more memory and better specs? Apple's most popular product this year is a laptop with 8GB of RAM.
The trend is heading in the opposite direction: fewer options for strong consumer hardware, and a shift towards cloud-based products. This is a memory issue more than anything. Nvidia is done selling their GDDR7 to gamers and people with AI girlfriends.
There are physical limits to how much you can compress data. I'm just saying, don't sit on your hands waiting for this to happen, because it's probably not going to for another decade-plus. There's no use in waiting; just write the code your fkin self and stop being lazy.
Just so that I have your position straight: you actually believe that over the long term, like 10, 20 years, that the amount of RAM in a laptop is going to go down?
It's not out of the realm of possibility, but I just want to make you aware that this would be a very surprising development in computing history.
I guess we'll find out! I bet all the vendors who supply RAM are looking at the current shortages and thinking "well, it's a shame we could never manufacture more RAM than we currently do."
A future with less RAM is possible with more applications using computational storage on SSD/NVMe.
But that's not my main argument. My main argument is that it's delusional for OP to think it's reasonable to expect that soon we'll be able to run models on consumer hardware that will be able to build basically most things.
But I do think there will be many compromises made for consumer electronics. I don't think the powers that be are eager to give consumers all the best memory (that should be clear by now). There are 3 DDR5 DRAM manufacturers in the world that have to supply memory to all the world's militaries, governments, and datacenters/corporations. Consumers are the last priority.
> If you look at a graph of GPU power in consumer hardware and model capability per billion parameters over time, it seems inevitable that in the next few years a "good enough" model will run on entry-level hardware.
> Of course there will always be larger flagship models, but if you can count on decent on-device inference, it materially changes what you can build.
I'm making some assumptions about what they're saying, but it seems clear they have no idea what they're talking about and that they're betting their competency on this technology.
If you're not paying attention to what's happening with small models, I suggest you take a closer look. Keeping parameter count constant, the quality of small models is rising fast. When you look at what you could do with Llama just 3 years ago vs Gemma 4 on the same 16GB hardware, the trend is clear.
Meanwhile, this year Apple bumped the base of their Mac lineup from 8GB to 16GB RAM, and the iPhone 17 Pro ships with 12GB. The Neo is at 8GB but is a brand new product tier which is not comparable to any past model.
Small models are gaining useful reasoning ability, and that's a genuinely helpful development, but they'll be heavily limited in world knowledge for the foreseeable future. BTW, the base of the Mac lineup is now once again an 8GB device with a small, low-performance SSD. Many people will tell you that it's broadly comparable (though of course not identical!) to the original base model M1.
For many tasks, including lots of agentic applications, world knowledge is not a "must-have."
To me the Neo is an exception, and doesn't represent the core Mac lineup, which is all at 16GB+ of RAM. If you're developing pro software that would rely on an on-device LLM, you probably wouldn't be targeting the Neo anyway.
Anything can technically "run" on almost any hardware; the meaningful question is what the real-world performance is. I for one have made the case in this thread that DeepSeek V4 is de facto optimal for wide batching, not single-request or single-agent inference, even on consumer hardware (which is unique among practical AI models). I might still be wrong of course, but if so I'd like to understand what's wrong with my assumptions.
I am not sure where this comment is coming from (possibly without looking at this project?). This project is running a quasi-frontier model at reasonable tps (~30) with reasonable prefill performance (~500 tps) on a high-end laptop. People are simply projecting from what they see in this project to what you can optimistically expect.
You can argue whether the projection is too optimistic or not, but this project definitely made me a little bit optimistic on that end.
There will always be a gap, but what's interesting is that because new models are constantly coming out, we as an industry never spend any time extracting the maximal value out of an existing model. What if there are techniques and harness workflows that could be optimized for a singular model end to end? How far could that push the state of the art?
Or, if we could really steer these open source models using well-structured plans, could we spend more time planning in a specific way and kick off the build overnight (a la the night shift: https://jamon.dev/night-shift)?
Most tasks do not require frontier models, so as long as these models cover 95-99 per cent of the tasks, closed frontier models can be left for niche and specialized cases that are harder.
I also don't think a lot of people know some of the more advanced context management tricks, like /rewind, /fork, and /tree, to take advantage of prefix caching.
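To illustrate the prefix caching point (this is a generic sketch, not those commands' actual behavior, and the names below are illustrative rather than any particular tool's API): inference servers that cache KV state keyed on the token prefix, such as vLLM with automatic prefix caching enabled, can only reuse work when the earlier messages stay byte-identical, so forking a branch off an unchanged history is much cheaper than editing it in place.

    # Hypothetical sketch: why forking (not editing) conversation history
    # plays well with prefix caching.
    from copy import deepcopy

    def fork(history, new_user_turn):
        # Keep every earlier turn byte-identical and only append, so the
        # server sees the same token prefix and can reuse its KV cache.
        branch = deepcopy(history)
        branch.append({"role": "user", "content": new_user_turn})
        return branch

    base = [
        {"role": "system", "content": "You are a coding agent."},
        {"role": "user", "content": "Summarize the repo layout."},
        {"role": "assistant", "content": "...long, expensive answer..."},
    ]

    # Both branches share the first three messages, so prefill for those
    # tokens can be a cache hit on the second request.
    branch_a = fork(base, "Now refactor module A.")
    branch_b = fork(base, "Now write tests for module B.")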
pgbackrest is awesome, truly. Thank you so much for the work you've put into this project over the years, and I'm sad the Crunchy Data acquisition couldn't keep the project alive.
We actually have one of these between our group of friends and their kids, and it's awesome. The kids call each other to chat, set up play dates, or go run around in the street. Our kids will call back home to let us know they made it to the other person's house, or to let us know they're coming back home, too.
The tactility is incredible, and it's just so cute to watch them chat away (5 year olds!)
The way we solved it is by checking the LSN on the primary, and then waiting for the replica to catch up to that LSN before doing reads on the replica in various scenarios.
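For anyone curious, here's a minimal sketch of that approach, assuming Python with psycopg2 and the standard Postgres monitoring functions (pg_current_wal_lsn on the primary, pg_last_wal_replay_lsn on the replica); the DSNs, timeout, and polling interval are placeholders, and the original setup may differ.

    import time
    import psycopg2

    def wait_for_replica_catchup(primary_dsn, replica_dsn, timeout=10.0, poll=0.05):
        # Capture the current WAL write position on the primary.
        with psycopg2.connect(primary_dsn) as conn, conn.cursor() as cur:
            cur.execute("SELECT pg_current_wal_lsn()")
            target_lsn = cur.fetchone()[0]

        # Poll the replica until it has replayed at least that LSN.
        deadline = time.monotonic() + timeout
        with psycopg2.connect(replica_dsn) as conn, conn.cursor() as cur:
            while time.monotonic() < deadline:
                cur.execute(
                    "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), %s) >= 0",
                    (target_lsn,),
                )
                if cur.fetchone()[0]:
                    return True   # replica has caught up; safe to read there
                time.sleep(poll)
        return False  # still lagging; fall back to reading from the primary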