I've seen this claimed, but I'm not sure it's been true for my use cases? I should try a more involved analysis, but so far open models seem much less even in their skills. I think this makes sense if a lot of them are built based on distillations of larger models. It seems likely that with task-specific fine-tuning this is true?
> I've seen this claimed, but I'm not sure it's been true for my use cases?
I'd be surprised if it isn't true for your use cases. If you give GLM-5.1 and Opus 4.6 the same coding task, they will both produce code that passes all the tests. In both cases the code will be crap, as no model I've seen produces good code. GLM-5.1 is actually slightly better at following instructions exactly than Opus 4.6 (but maybe not 4.7 - as that's an area they addressed).
I've asked GLM-5.1 and Opus 4.6 to find a bug caused by a subtle race condition (the race condition leads to a number being 15172580 instead of 15172579 after about 3 months of CPU time). Both found it, in a similar amount of time. Several senior engineers had stared at the code for literally days and didn't find it.
There is no doubt the models do vary in performance at various tasks, but we are talking about the difference between Ferrari and Mercedes in F1. While the differences are undeniable, this isn't F1. Things take a year to change there. The performance of the models from Anthropic and OpenAI literally changes day by day, often not due to the model itself but because of the horsepower those companies choose to give them on the day, or them tweaking their own system prompts. You can find no end of posts here from people screaming in frustration that the thing that worked yesterday doesn't work today, or suddenly they find themselves running out of tokens, or their favoured tool is blocked. It's not at all obvious the differences between the open-source models and the proprietary ones are worse than the day-to-day ones the proprietary companies inflict on us.
If you don't know C, in older versions that can be a catastrophic failure. (The issue is so serious that in modern C `free(NULL)` is a no-op.) If it's difficult to get a `FOO == NULL` without extensive mocking (this is often the case), most programmers won't do it, so it won't be caught by unit tests. The LLMs almost never get unit test coverage up high enough to catch issues like this without heavy prompting.
But that's the least of it. The models (all of them) are absolutely hopeless at DRY'ing out the code, and when they do, they turn it into spaghetti because they seem almost oblivious to isolation boundaries, even when those are spelt out to them.
None of this is a problem if you are vibe coding, but you can only do that when you're targeting a pretty low quality level. That's entirely appropriate in some cases of course, but when it isn't you need heavy reviews from skilled programmers. No senior engineer is going to stomach the repeated stretches of almost the "same but not quite" code they churn out.
You don't have to take my word for it. Try asking Google "do llm's produce verbose code".
`free(NULL)` is harmless in C89 onwards. As I said, programmers freeing NULL caused so many issues they changed the API. It doesn't help that `malloc(0)` returns NULL on some platforms.
If you are writing code for an embedded platform with some random C compiler, all bets on what `free(NULL)` does are off. That means a cautious C programmer who doesn't know who will be using their code never allows NULL to be passed to `free()`.
In general, most good C programmers are good because they suffer a sort of PTSD from the injuries the language has inflicted on them in the past. If they aren't avoiding passing NULL to `free()`, they haven't suffered long enough to be good.
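For anyone following along, here is a minimal sketch of that cautious style. `SAFE_FREE` is a made-up macro name for illustration, not something from a real library. It never hands NULL to `free()`, and it clears the pointer afterwards so a repeated call does nothing.

```c
#include <stdlib.h>

/* Hypothetical helper macro (name invented for this example).
 * Only calls free() on a non-NULL pointer, then sets the pointer to NULL
 * so freeing it again is harmless.
 * Note: the argument is evaluated more than once, so pass a plain
 * variable, not an expression with side effects. */
#define SAFE_FREE(p)          \
    do {                      \
        if ((p) != NULL) {    \
            free(p);          \
            (p) = NULL;       \
        }                     \
    } while (0)
```

Used as `SAFE_FREE(buf);`, after which `buf` is NULL and a second `SAFE_FREE(buf);` does nothing.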
> That means a cautious C programmer who doesn't know who will be using their code never allows NULL to be passed to `free()`.
If your compiler chokes on `free(NULL)` you have bigger problems that no LLM (or human) can solve for you: you are using a compiler that was last maintained in the 80s!
If your C compiler doesn't adhere to the very first C standard published, the problem is not the quality of the code that is written.
> If they aren't avoiding passing NULL to `free()`, they haven't suffered long enough to be good.
I dunno; I've "suffered" since the mid-90s, and I will free NULL, because it is legal in the standard, and because I have not come across a compiler that does the wrong thing on `free(NULL)`.
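To make that concrete, here is a rough sketch of the cleanup style that leans on `free(NULL)` being a defined no-op. The function `make_buffers` and its `fail_second` switch are invented for illustration; the switch just simulates the second allocation failing.

```c
#include <stdlib.h>

/* Hypothetical example: allocate two buffers, return 0 on success
 * and -1 on failure. fail_second simulates a failed allocation. */
static int make_buffers(size_t n, int fail_second)
{
    int rc = -1;
    char *a = NULL;
    char *b = NULL;

    a = malloc(n);
    if (a == NULL)
        goto cleanup;

    b = fail_second ? NULL : malloc(n);
    if (b == NULL)
        goto cleanup;

    rc = 0; /* both allocations succeeded */

cleanup:
    /* Either pointer may still be NULL here; the C standard says
     * free(NULL) performs no action, so no checks are needed. */
    free(b);
    free(a);
    return rc;
}
```

The single cleanup label works only because every pointer starts out as NULL and `free()` tolerates it.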
So what would be the best practice in a situation like that? I would (naively?) imagine that a null pointer would mostly result from a malloc() or some other parts of the program failing, in which case would you not expect to see errors elsewhere?
> imagine that a null pointer would mostly result from a malloc() or some other parts of the program failing, in which case would you not expect to see errors elsewhere?
Oh yes, you probably will see errors elsewhere. If you are lucky it will happen immediately. But often enough millions of executed instructions later, in some unrelated routine that had its memory smashed. It's not "fun" figuring out what happened. It could be nothing - bit flips are a thing, and once you get the error rate low enough the frequency of bit flips and bugs starts to converge. You could waste days of your time chasing an alpha particle.
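One common answer to the best-practice question, sketched under my own assumptions, is to check every allocation at the point it happens, so the failure is reported there instead of surfacing later somewhere unrelated. `dup_string` is a hypothetical helper written for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of src, or NULL if src is
 * NULL or the allocation fails. The caller owns the result and frees it. */
static char *dup_string(const char *src)
{
    if (src == NULL)
        return NULL;

    size_t n = strlen(src) + 1; /* include the terminating '\0' */
    char *copy = malloc(n);
    if (copy == NULL)
        return NULL; /* report the failure here, at the source */

    memcpy(copy, src, n);
    return copy;
}
```

The caller then has to check the return value too, which is exactly the sort of branch that tends to go untested.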
I saw the author of curl post some of this code here a while back. I immediately recognised the symptoms. Things like:
if (NULL == foo) { ... }
Every 2nd line was code like that. If you are wondering, he wrote `(NULL == foo)` in case he dropped an `=`, so it became `(NULL = foo)`. The second version fails to compile, whereas `(foo = NULL)` is a runtime disaster. Most of it was unjustified, but he could not help himself. After years of dealing with C, he wrote code defensively - even if it wasn't needed. C is so fast and the compilers so good that the coding style imposes little overhead.
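A small sketch of the constant-on-the-left habit described above. The function name `is_missing` is invented for illustration and is not taken from curl.

```c
#include <stddef.h>

/* Hypothetical function: returns 1 if foo is NULL, 0 otherwise. */
static int is_missing(const char *foo)
{
    /* With the constant on the left, dropping one '=' would give
     * `NULL = foo`, which the compiler rejects because NULL is not
     * assignable. Written the other way round, `foo = NULL` compiles
     * and silently overwrites the pointer. */
    if (NULL == foo) {
        return 1;
    }
    return 0;
}
```

Many modern compilers will also warn about an assignment used as a condition, which is part of why this style is less necessary than it once was.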
Rust is popular because it gives you a similar result to C, but you don't need to have been beaten by 10 years of pain in order to produce safe Rust code. Sadly, it has other issues. Despite them, it's still the best C we have right now.
C is fundamentally a bad target for LLMs. Humans get C wrong all the time, so we cannot expect the nascent LLM, which has been trained on 95% code that does automatic memory management, to excel here.
I always found myself writing verbose copypasta code first, then compressing it down based on the emerging commonalities. I think doing it the other way around is likely to lead to a worse design. Can you not tell the LLM to do the same? Honest question.
> I always found myself writing verbose copypasta code first, then compress it down based on the emerging commonalities. I think doing it the other way around is likely to lead to a worse design.
I do pretty much the same thing, which is to say I "write code using a brain dump", "look for commonalities that tickle the neurons", then "refactor". Lather, rinse, and repeat until I'm happy.
> Can you not tell the LLM to do the same?
You can tell them until you're blue in the face. They ignore you.
I'm sure this is a temporary phase. Once they solve the problem, coding will suffer the same fate as blacksmiths making nails. [0] To solve it they need to satisfy two conflicting goals - DRY the code out, while keeping interconnections between modules to a minimum. That isn't easy. In fact it's so hard people who do it well and can do it across scales are called senior software engineers. Once models master that trick, they won't be needed any more.
By "they" I mean "me".
[0] Blacksmiths could produce 1,000 or so a day, but it must have been a mind-numbing day even if it paid the bills. Then automation came along, and produced them at over a nail per second.
a) The agent doesn't need to read the implementation of anything - you can stuff the entire project's headers into the context and the LLM can have a better bird's-eye view of what is there and what is not, and what goes where, etc.
and
b) Enforcing Parse, don't Validate using opaque types - the LLM writing a function that uses a user-defined composite datatype has no knowledge of the implementation, because it read only headers.
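A rough sketch of what (b) can look like in C, shown as one block for brevity. Everything here is hypothetical: the `email` type, the function names, and the deliberately simplistic "exactly one `@`, not at either end" rule are made up for illustration. The first part is what would sit in a header; the rest would stay hidden in the `.c` file.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* ---- would live in a header, e.g. a hypothetical email.h ---- */
typedef struct email email;            /* opaque: layout not visible */
email *email_parse(const char *text);  /* NULL if text is rejected */
const char *email_text(const email *e);
void email_free(email *e);

/* ---- would live in the .c file, unseen by callers ---- */
struct email {
    char *text;
};

email *email_parse(const char *text)
{
    if (text == NULL)
        return NULL;

    /* Toy rule for illustration only: exactly one '@', with at least
     * one character on each side of it. */
    const char *at = strchr(text, '@');
    if (at == NULL || at == text || at[1] == '\0' ||
        strchr(at + 1, '@') != NULL)
        return NULL;

    email *e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;

    size_t n = strlen(text) + 1;
    e->text = malloc(n);
    if (e->text == NULL) {
        free(e);
        return NULL;
    }
    memcpy(e->text, text, n);
    return e;
}

const char *email_text(const email *e)
{
    assert(e != NULL); /* an email can only come from email_parse() */
    return e->text;
}

void email_free(email *e)
{
    if (e != NULL) {
        free(e->text);
        free(e);
    }
}
```

Code that only sees the header cannot build an `email` by hand, so any function taking an `email *` can assume the parse step already happened.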
Write code? No. Use frontier models. They are subsidized and amazing and they get noticeably better every few months.
Literally anything else? Smaller models are fine. Classifiers, sentiment analysis, editing blog posts, tool calling, whatever. They can go through documents and extract information, summarize, etc. When making a voice chat system a while back I used a cheap open weight model and just asked it "is the user done speaking yet" by passing transcripts of what had been spoken so far, and this was 2 years ago and a crappy cheap low weight model. Be creative.
I wouldn't trust them to do math, but you can tool call out to a calculator for that.
They are perfectly fine at holding conversations. Their weights aren't large enough to have every book ever written contained in them, or the details of every movie ever made, but unless you need that depth and breadth of knowledge, you'll be fine.
I just mean is the claim that the open source models are where the closed models were 12 to 6 months ago true? They do seem to be for some specific tasks, which is cool, but they seem even more uneven in skills than the frontier models. They're definitely useful tools, but I'm not sure if they're a match for frontier models from a year ago?
Frontier models from a year ago had issues with consistent tool calling. Instruction following was pretty good but could still go off the rails from time to time.
Open weight models have those same issues. They are otherwise fine.
You can hook them up to a vector DB and build a RAG system. They can answer simple questions and converse back and forth. They have thinking modes that solve more complex problems.
They aren't going to discover new math theorems but they'll control a smart home and manage your calendar.