> "JSON" and "Go" seem antithetical in the same sentence as "high performance" to me
Absolute highest performance is rarely the highest priority in designing a system.
Of course we could design a hyper-optimized, application-specific payload format and code the deserializer in assembly, and the performance would be great, but it wouldn't be useful outside of very specific circumstances.
In most real-world projects, the performance of Go and JSON is fine and allows for rapid development, easy implementation, and flexibility if anything changes.
I don’t think it’s valid to criticize someone for optimizing within their use case.
> I feel like it should deserve a mention of what the stack is, otherwise there is no reference point.
The article clearly mentions that this is a GopherCon talk in the header. It was posted on Dave Cheney’s website, a well-known Go figure.
It’s clearly in the context of Go web services, so I don’t understand your criticisms. The context is clear from the article.
The very first line of the article explains the context:
> This talk is a case study of designing an efficient Go package.
This is an article about optimizing a JSON parser in Go.
As always, try to remember that people usually aren't writing (or posting their talks) specifically for an HN audience. Cheney clearly has an audience of Go programmers; that's the space he operates in. He's not going to title his post, which he didn't make for HN, just to avoid a language war on the threads here.
It's our responsibility to avoid the unproductive language war threads, not this author's.
Then you only have to read "k1" and "k2" once, instead of once per record. Presumably there will be the odd record that contains something like {"k3": 0} but you can use mini batches of SoA and tune their size according to your desired latency/throughput tradeoff.
Or if your data is 99.999% of the time just pairs of k1 and k2, turn them into tuples:
{"k1k2": [true,[2,3,4],false,[], ...]}
And then 0.001% of the time you send a lone k3 message:
{"k3": 2}
Even if your endpoints can't change their schema, you can still trade latency for throughput by doing the SoA conversion, transmitting, then converting back to AoS at the receiver. Maybe worthwhile if you have to forward the message many times but only decode it once.
JSON is probably the fastest serialization format to produce and parse that is also safe for public use. Binary formats often have fragile, highly specific, and vulnerable encodings, because the bytes are dropped straight into memory and used as-is (i.e. they're not parsed at all; it's just two computers exchanging memory dumps).
Compare it with XML, for example, which is a nightmare of complexity if you actually follow the spec rather than just making something XML-like.
We have some formats that try to walk the boundary between safe/universal and fast, like ASN.1, but those are obscure at best.
I prefer msgpack if the data contains a lot of numeric values. Representing numbers as text, as JSON does, can blow up the size, and msgpack is usually just as simple to use.
JSON is often the format of an externally provided data source, and you don't have a choice.
And whatever language you're writing in, you usually want to do what you can to maximize performance. If your JSON input is 500 bytes it probably doesn't matter, but if you're ingesting a 5 MB JSON file, then performance definitely does.
What more do you need to know about "the stack" in this case? It's whenever you need to ingest large amounts of JSON in Go. Not sure what could be clearer.
Exactly, it's not too hard to implement in C. The one I made never copied data; instead it saved a pointer/length pair into the original data.
The user only had to memory-map the file (or equivalent) and pass that data into the parser.
The only memory allocation was for the JSON nodes.
This way you only paid the parsing tax (decoding doubles, etc.) if the user actually used that data.
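In Go, slices (pointer plus length) make this zero-copy style natural. Here is a toy sketch of the idea, with hypothetical `Node`/`scanArray` names, restricted to a flat array of numbers for brevity; each node is a sub-slice of the input buffer, and the decoding tax is only paid in `Float`:

```go
package main

import (
	"fmt"
	"strconv"
)

// Node references a span of the original buffer instead of copying it.
type Node struct{ raw []byte }

// Float decodes lazily: the parsing tax is only paid if the value is used.
func (n Node) Float() float64 {
	f, _ := strconv.ParseFloat(string(n.raw), 64)
	return f
}

// scanArray indexes a flat JSON array like [1, 2.5, 30] without copying
// any bytes; each Node.raw is a sub-slice of buf.
func scanArray(buf []byte) []Node {
	var nodes []Node
	start := -1
	for i, c := range buf {
		switch {
		case (c >= '0' && c <= '9') || c == '.' || c == '-' || c == '+' || c == 'e' || c == 'E':
			if start < 0 {
				start = i // first byte of a number literal
			}
		default:
			if start >= 0 {
				nodes = append(nodes, Node{raw: buf[start:i]})
				start = -1
			}
		}
	}
	return nodes
}

func main() {
	buf := []byte("[1, 2.5, 30]") // stand-in for a memory-mapped file
	nodes := scanArray(buf)
	fmt.Println(len(nodes), nodes[1].Float()) // 3 2.5
}
```

A real parser would of course handle strings, objects, and escapes, but the principle is the same: the nodes carry offsets, not copies.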
The first line of the article explains the context of the talk:
> This talk is a case study of designing an efficient Go package.
The target audience and context are clearly Go developers. Some of these comments are focusing too much on the headline without addressing the actual article.
Yup, and if your implementation uses a hashmap for object key -> value lookup, then I recommend allocating the hashmap after parsing the object, not during, to avoid continually resizing it. You can implement this with an intrusive linked list that tracks your key/value JSON nodes until the time comes to allocate the hashmap. Basically, when parsing an object:

1. use a counter 'N' to track the number of keys,
2. link the JSON nodes representing key/value pairs into an intrusive linked list,
3. after parsing the object, use 'N' to allocate a perfectly sized hashmap in one go.

You can then iterate over the linked list of JSON key/value pair nodes, adding them to the hashmap. You can use this same trick when parsing JSON arrays to avoid continually resizing a backing array. Alternatively, never allocate a backing array and instead use the linked list to implement an iterator.
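The deferred allocation can be sketched in Go, where `make(map, n)` plays the role of the perfectly sized hashmap. The `pairNode`/`objectBuilder` names are hypothetical, and real parser nodes would embed the link rather than wrap it:

```go
package main

import "fmt"

// pairNode carries one key/value pair plus an intrusive "next" link, so
// pairs can be chained during parsing without allocating a map yet.
type pairNode struct {
	key  string
	val  interface{}
	next *pairNode
}

// objectBuilder accumulates pairs in a linked list and counts them.
type objectBuilder struct {
	head, tail *pairNode
	n          int
}

func (b *objectBuilder) add(key string, val interface{}) {
	node := &pairNode{key: key, val: val}
	if b.tail == nil {
		b.head = node
	} else {
		b.tail.next = node
	}
	b.tail = node
	b.n++ // the counter 'N' from step 1
}

// finish allocates the map once, perfectly sized, and fills it from the list.
func (b *objectBuilder) finish() map[string]interface{} {
	m := make(map[string]interface{}, b.n)
	for p := b.head; p != nil; p = p.next {
		m[p.key] = p.val
	}
	return m
}

func main() {
	var b objectBuilder
	b.add("k1", true)
	b.add("k2", []int{2, 3, 4})
	m := b.finish()
	fmt.Println(len(m), m["k1"]) // 2 true
}
```

The iterator alternative is the same structure minus `finish`: walk `head.next...` directly and never pay for the map at all.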
> The user only had to Memory Map the file (or equivalent)
Having done this myself, it's a massive cheat code because your bottleneck is almost always i/o and memory mapped i/o is orders of magnitude faster than sequential calls to read().
But that said it's not always appropriate. You can have gigabytes of JSON to parse, and the JSON might be available over the network, and your service might be running on a small node with limited memory. Memory mapping here adds quite a lot of latency and cost to the system. A very fast streaming JSON decoder is the move here.
> memory mapped i/o is orders of magnitude faster than sequential calls to read()
That’s not something I’ve generally seen. Any source for this claim?
> You can have gigabytes of JSON to parse, and the JSON might be available over the network, and your service might be running on a small node with limited memory. Memory mapping here adds quite a lot of latency and cost to the system
Why does mmap add latency? I would think that mmap adds more latency for small documents because the cost of doing the mmap is high (cross CPU TLB shoot down to modify the page table) and there’s no chance to amortize. Relatedly, there’s minimal to no relation between SAX vs DOM style parsing and mmap - you can use either with mmap. If you’re not aware, you do have some knobs with mmap to hint to the OS how it’s going to be used although it’s very unwieldy to configure it to work well.
Experience? Last time I made that optimization it was 100x faster, ballpark. I don't feel like benchmarking it right now, try yourself.
The latency comes from the fact you need to have the whole file. The use case I'm talking about is a JSON document you need to pull off the network because it doesn't exist on disk, might not fit there, and might not fit in memory.
> Experience? Last time I made that optimization it was 100x faster, ballpark. I don't feel like benchmarking it right now, try yourself.
I have. Many times. There's definitely not a 100x difference given that normal file I/O can easily saturate NVMe throughput. I'm sure it's possible to build a repro showing a 100x difference, but you have to be doing something intentionally to cause that (e.g. using a very small read buffer so that you're doing enough syscalls that it shows up in a profile).
> The latency comes from the fact you need to have the whole file
That's a whole other matter. But again, if you're pulling it off the network, you usually can't mmap it anyway unless you're using a remote-mounted filesystem (which will add more overhead than mmap vs buffered I/O).
I also really like this paradigm. It's just that in old crusty null-terminated C style this is really awkward, because the input data must be copied or modified. It's not an issue when using slices (pointer and length), but unfortunately most of the C standard library and many operating system APIs expect null-terminated strings.
How does that contradict what the parent poster says? I think it's very weird to call something "high performance" when it looks like it's maybe 15-20% of the performance of simdjson in C++. This is not going from normal performance to high performance; this is going from "very subpar" to "subpar".
Par is different for different stacks. It's reasonable for someone to treat their standard library's JSON parser as "par", given that that's the parser that most of their peers will be using, even if there are faster options that are commonly used in other stacks.
I think people say that because they give disproportionate weight to the fact that it's text-based, while ignoring how astoundingly simple and linear it is to write and read.
The only way to move the needle further is to start exchanging direct memory dumps, which is what Protobuf and the like do. But that is clearly only for very specific uses.
> while ignoring how astoundingly simple and linear it is to write and read.
The code may be simple, but you have lots of performance penalties: resolving field keys, and constructing complicated data structures through memory allocations, which is expensive.
> to start exchanging direct memory dumps, which is what ProtoBuff and the like do
Protobuf actually does parsing; it's just a binary format. What you're describing is more like FlatBuffers.
> But this is clearly only for very specific use.
Yes, that specific use is high-performance computation :)
Resolving what keys? JSON has keyval sequences ("objects") but what you do with them is entirely up to you. In a streaming reader you can do your job without ever creating maps or dehydrating objects.
Plus no one makes people use objects in JSON. If you can send a tuple of fields as an array... then send an array.
The "logic" to match a key is not slow. Your CPU can parse hundreds of keys while waiting for the next keyval sequence to be loaded from RAM.
As for whether you must stuff everything in objects for it to be JSON, I mean that makes no sense. JSON arrays are also JSON. And JSON scalars are also JSON.
If this argument requires we deliberately go out of our way to be dumb, then it's a bad argument.
I'm writing a platform which has a JSON-like format for message exchange and I realized early on that in serialized data, maps are at best nominal. You process everything serially (hence the word serialization). It's a stream of tokens. The fact some tokens are marked as keys and some as values is something that can be useful to communicate intent, but it doesn't mandate how you utilize it.
Everything else is just prejudices and biases, such as "maps take resources to allocate". JSON doesn't force you to build maps when you have an object. Point in the spec where JSON mandates how you must store the parsed results of a JSON object. Should it be a hashmap? A b-tree? A linked list? A tuple? Irrelevant.
No, pages are not loaded in cache. Cache lines are. RAM pages are typically 4kb, and cache lines are most commonly 64 bytes. This means you have 64 cache lines per RAM page. And this entire detour has no relevance to what I said in the first place, which still stands. But you know, someone was wrong on the Internet.
I never said the CPU reads one key at a time. I said it can decode hundreds of keys at the time it loads one from memory. This is completely irrelevant of how memory reads are batched. It's about a ratio, like 100:1 get it? Seems like you felt your ego attacked, and you just had to respond in a patronizing way about something, but didn't know about what.
Hashing a string as you read it from memory and jumping to a hash bucket is not an expensive operation. This entire argument sounds like some kindergarten understanding of compute efficiency. This is not a 6502.
Let's sync on numbers, maybe? My engine processes 600 MB/s (megabytes, not megabits) of data per core (I have very many cores) for my wire format; the current bottleneck is that the Linux virtual-page system can't allocate/deallocate pages fast enough when reading from an NVMe RAID.
What are your numbers for your cool JSON serializer?
I think you may have forgotten what our argument was. I said the bottleneck is memory, not processing/hashing the keys to match them to the symbol you want to populate.
And you're currently telling me the bottleneck is memory.
I also said you don't need to parse a JSON object to a hashmap or a b-tree. The format suggests nothing of the sort. You can hash the key and fill it into a symbol slot in a tuple which... literally only takes the amount of RAM you need for the value, while the key is "free", because it just resolves to a pointer address.
Additionally, if you have a fixed tuple format, you can encode it as a JSON array, thus skipping the keys entirely. None of that is against JSON. You decide what you need and what you don't need. The keyvals are there when you need keyvals. No one is forcing you to use them where you don't need them at gunpoint.
I have a message format for a platform I'm working on, it has a JSON option, just for compatibility. It doesn't use objects at all, yet (but it DOES transfer object states). Nested arrays are astonishingly powerful on their own with the right mindset.
> And you're currently telling me the bottleneck is memory.
Not memory, but the virtual-page implementation in Linux, which is apparently single-threaded and doesn't scale to high throughput. There was a patch to fix this, but it didn't make it into mainline: https://lore.kernel.org/lkml/20180403133115.GA5501@dhcp22.su...