> 1. On a system that is handling 10k concurrent requests, the 10GB of RAM is going to be a fraction of what is installed.
My example (and the c10k problem) is 10k concurrent connections, not 10k concurrent requests.
> 2. It's not 10GB of RAM anyway, it's 10GB of address space. It still only gets faulted into real RAM when it gets used.
Yes, and that's both memory and CPU usage that isn't needed when using a better concurrency model. That's why no high-performance server software uses a huge number of threads, and many use the reactor pattern.
> Yes, and that's both memory and cpu usage that isn't needed
No, it literally is not. The "memory" is just entries in a page table in the kernel and MMU. It shouldn't worry you at all.
Nor is the CPU the kernel spends managing those threads necessarily going to be less efficient than someone's handrolled async runtime. In fact, given it gets more eyes... likely more efficient.
The sole argument I can see is avoiding a handful of syscalls and excessive crossing of the kernel<->userspace blood-brain barrier.
> > Yes, and that's both memory and cpu usage that isn't needed
> No, it literally is not. The "memory" is just entries in a page table in the kernel and MMU. It shouldn't worry you at all.
Only if you never free one of those stacks. TLB flushes can be quite expensive.
Except if those threads are actually faulting in all of that memory and making it resident, they'd be doing the same thing, just on the heap, for a classic async coroutine style application.
You don't pay for stack space you don't use unless you disable overcommit. And if you disable overcommit on modern linux the machine will very quickly stop functioning.
The amount of stack you pay for on a thread is proportional to the maximum depth that the stack ever reached on the thread. Operating systems can grow the amount of real memory allocated to a thread, but never shrink it.
It’s a programming model that has some really risky drawbacks.
> Operating systems can grow the amount of real memory allocated to a thread, but never shrink it.
Operating systems can shrink the memory usage of a stack.
madvise(page, size, MADV_DONTNEED);
Leaves the memory mapping intact but the kernel frees underlying resources. Subsequent accesses get either new zero pages or the original file's pages.
Linux also supports mremap, which is essentially a kernel version of realloc. Supports growing and shrinking memory mappings.
Whether existing systems make use of this is another matter entirely. My language uses mremap for growth and shrinkage of stacks. C programs can't do it because pointers to stack allocated objects may exist.
Stack memory is never unmapped until the thread terminates as far as I know. I don't know of any kernel that does this, for precisely the reason you arrive at in your very last sentence.
Yeah, none of this makes sense to me. Allocating memory for stack space is not expensive (and the default isn't even 1MB??) because you're just creating a VMA and probably faulting in one or two pages.
They also say:
> The system spends time managing threads that could be better spent doing useful work.
What do they think the async runtime in their language is doing? It's literally doing the same thing the kernel would be doing. There's nothing that intrinsically makes scheduling 10k coroutines in userspace more efficient than the kernel scheduling 10k threads. Context switches are really only expensive when the switch is happening between different processes; the overhead of a context switch on a CPU between two threads in the same process is very small (and they're not free when done in userspace anyway).
There are advantages to doing scheduling in the kernel and there are advantages to doing scheduling in userspace, but this article doesn't really touch on any of the actual pros and cons here, it just assumes that userspace scheduling is automatically more efficient.
Not the runtime per se, but cooperative scheduling has the advantage that tasks do not yield at adverse code points, e.g., right before giving up a lock, or performing an I/O request. Of course the lack of preemption has its own downsides, but with thread-per-request you tend to run into tail latency issues much earlier than context switching overhead.
It's a cargo cult and a bias I see all over the place.
I feel like we're now, what, 20, 25 years on, and people still haven't adjusted to the fact that the machines we have now are multicore and have boatloads of cache, or to how that cache is shared (or not) between cores.
Nor is there apparently a real understanding of the difference between VSS and RSS.
Nor of the fact that modern machines are really really fast if you can keep stuff in cache. And so you really should be focused on how you can make that happen.
1 megabyte stacks mean ten thousand threads require 10 gigabytes of RAM just for the stacks. The entire point of the asynchronous programming paradigm is to reclaim all of those gigabytes by not allowing stacks to develop at all, by stealthily turning everything into a hidden form of cooperative multitasking instead.
While that's strictly true, resident memory in this context is a function of worst case memory usage by the code executing on those stacks. Seems wise to assume worst case performance when discussing this.
The program could use one page's worth of stack space, which is optimal. The program could use like 200 bytes of stack space, which wastes the rest of the page. The program could recurse all the way to 9.9 MB of stack usage, stop just before overflow and then unwind back to constant 200 bytes stack space usage, and never touch all those pages ever again.
The author doesn't fully justify the assertion but it does have sound basis.
While virtual memory allocation does not require physical allocation, it immediately runs into the kinds of performance problems that huge pages are designed to solve. On modern systems, you can burn up most of your virtual address space via casual indifference to how it maps to physical memory and the TLB space it consumes. Spinning up thousands of stacks is kind of a pathological case here.
10µs is an eternity for high-performance software architectures. That is also around the same order of magnitude as disk access with modern NVMe. An enormous amount of effort goes into avoiding blocking on NVMe disk access with that latency for good reason. 10µs is not remotely below the noise floor in terms of performance.
> Why is reserving a megabyte of stack space "expensive"?
Guess it's not a huge issue in these 64-bit days, but back in the 32-bit days it was a real limitation to how many threads you could spin up due to the limited address space.
Of course most applications which hit this would override the 1MB default.
There's much ridiculous hatred for OS threads based on people's biases of operating systems and hardware from 20 years ago.
So much so that they'll sign themselves up for async frameworks that thread steal at will and bounce things all over cores causing cache line bouncing and associated memory stalls, not understanding what this is doing to their performance profile.
And endure complexity, etc. through awkward async call chains and function colouring.
Most people's applications would be totally fine just spawning OS threads and using them without fear and dropping into a futex when waiting on I/O; or using the kernel's own async completion frameworks. The OS scheduler is highly efficient, and it is very good at managing multiple cores and even being aware of asymmetrical CPU hierarchies, etc. Likely more efficient than half the async runtimes out there.
Does it tho? Assuming no torn reads/writes at those sizes, and given the location should be strictly increasing, are there situations where you could read a higher-than-stored value, which would cause skipping a necessary update?
Afaik on all of x86, arm, and riscv an atomic load of a word sized datum is just a regular load.
It doesn't need to be strictly increasing; some other thread could be making other arbitrary operations. Still, even in that case, as Dylan16807 pointed out, it likely doesn't matter.
If you are implementing a library function atomic<T>::fetch_max, you cannot assume that every other thread is also performing a fetch_max on that object. There might be little reason for it, but other operations are allowed, so the sequence of modifications might not be strictly increasing (but then again, it doesn't matter for this specific optimization).
It's a common misconception to reason about memory models strictly in terms of hardware.
Sequential consistency is a property of a programming language's semantics and can not simply be inferred from hardware. It is possible for hardware operations to all be SC but for the compiler to still provide weaker memory orderings through compiler specific optimizations.
I'm referring to the performance implications of the hardware instruction, not the programming language semantics. Incrementing or decrementing the reference count is going to require an RMW instruction, which is expensive on x86 regardless of the ordering.
The concept of sequential consistency only exists within the context of a programming language's memory model. It makes no sense to speak about the performance of sequentially consistent operations without respect to the semantics of a programming language.
Yes, what I meant was that the same instruction is generated by the compiler, regardless if the RMW operation is performed with relaxed or sequentially consistent ordering, because that instruction is strong enough in terms of hardware semantics to enforce C++'s definition of sequential consistency.
There is a pretty clear mapping in terms of C++ atomic operations to hardware instructions, and while the C++ memory model is not defined in terms of instruction reordering, that mapping is still useful to talk about performance. Sequential consistency is also a pretty broadly accepted concept outside of the C++ memory model, I think you're being a little too nitpicky on terminology.
The presentation you are making is both incorrect and highly misleading.
There are algorithms whose correctness depends on sequential consistency which can not be implemented in x86 without explicit barriers, for example Dekker's algorithm.
What x86 does provide is TSO semantics, not sequential consistency.
I did not claim that x86 provides sequential consistency in general, I made that claim only for RMW operations. Sequentially consistent stores are typically lowered to an XCHG instruction on x86 without an explicit barrier.
From the Intel SDM:
> Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to ensure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 8.1.2, “Bus Locking”).
> And futexes aren’t the only way to get there. Alternatives:
> - thin locks (what JVMs use)
> - ParkingLot (a futex-like primitive that works entirely in userland and doesn’t require that the OS have futexes)
Worth noting that somewhere under the hood, any modern lock is going to be using a futex (if supported). futex is the most efficient way to park on Linux, so you even want to be using it on the slow path. Your language's thread.park() primitive is almost certainly using a futex.
That's interesting, I'm more familiar with the Rust parking-lot implementation, which uses futex on Linux [0].
> Sure that uses futex under the hood, but the point is, you use futexes on Linux because that’s just what Linux gives you
It's a little more than that though, using a pthread_mutex or even thread.park() on the slow path is less efficient than using a futex directly. A futex lets you manage the atomic condition yourself, while generic parking utilities encode that state internally. A mutex implementation generally already has a built-in atomic condition with simpler state transitions for each thread in the queue, and so can avoid the additional overhead by making the futex call directly.
> It's a little more than that though, using a pthread_mutex or even thread.park() on the slow path is less efficient than using a futex directly.
No, it absolutely isn’t.
The dominant cost of parking is whatever happens in the kernel and at the microarchitectural level when your thread goes to sleep. That cost is so dominant that whether you park with a futex wait or with a condition variable doesn't matter at all.
(Source: I’ve done that experiment to death as a lock implementer back when I maintained Jikes RVM’s thin locks and then again when I wrote and maintained ParkingLot.)
> The message has some weird mentions in (alloc565), but the actual useful information is there: a pointer is dangling.
The allocation ID is actually very useful for debugging. You can actually use the flags `-Zmiri-track-alloc-id=alloc565 -Zmiri-track-alloc-accesses` to track the allocation, deallocation, and any reads/writes to/from this location.
> Every single Future you look at will look like this,
That's not true. A Future is supposed to schedule itself to be woken up again when it's ready. This Future schedules it to be woken immediately. Most runtimes, like Tokio, will put a Future that acts like this at the end of the run queue, so in practice it's not as egregious. However, it's unquestionably a spin lock, equivalent to back off with thread::yield.
It's quite common for concurrent algorithms to only implement a subset of operations, for example forgoing removal or iteration. It's also common to put limitations on the data structure, such as limiting keys and values to 64 bits. Papaya being feature-complete means that it does not have any of these limitations when compared to std::collections::HashMap.
Looks very interesting, but seems to serve a pretty different use case:
> This is an ordered data structure, and supports very high throughput iteration over lexicographically sorted ranges of values. If you are looking for simple point operation performance, you may find a better option among one of the many concurrent hashmap implementations that are floating around. Pay for what you actually use :)
Java atomics are actually sequentially consistent. C# relaxes this to acquire/release. Though the general concept of happens-before is still immensely useful for learning atomics as sequential consistency is a superset of acquire/release.
All of the memory models in question are based on data-race-free, which says (in essence) that as long as all cross-thread interactions follow happens-before, then you can act as if everybody is sequentially-consistent.
The original Java 5 memory model only offered sequentially-consistent atomics to establish cross-thread happens-before in a primitive way. The C++11 memory model added three more kinds of atomics: acquire/release, consume/release (which was essentially a mistake [1]), and relaxed atomics (which, to oversimplify, establish atomicity without happens-before). Pretty much every memory model since C++11--which includes the Rust memory model--has based its definition on that memory model, with most systems defaulting an otherwise unadorned atomic operation to sequentially-consistent. Even Java has retrofitted ways to get weaker atomic semantics [2].
As a practical matter, most atomics could probably safely default to acquire/release over fully sequentially-consistent. The main difference between the two is that sequentially-consistent is safer if you've got multiple atomic variables in play (e.g., you're going with some fancy lockless algorithm), whereas acquire/release tends to largely be safe if there's only one atomic variable of concern (e.g., you're implementing locks of some kind).
[1] A consume operation is an acquire, but only for loads data-dependent on the load operation. This is supposed to represent a situation that requires no fences on any system not named Alpha, but it turns out for reasons™ that compilers cannot reliably preserve source-level data dependencies, so no compiler really implemented consume/release.
[2] Even Java 5 may have had it in sun.misc.Unsafe; I was never familiar with that API, so I don't know for certain.
> as long as all cross-thread interactions follow happens-before, then you can act as if everybody is sequentially-consistent.
I don't think that's the actual guarantee. You can enforce happens-before with just acquire/release, but AFAIK that's not enough to recover SC in the general case[1].
As far as I understand, the Data-Race-Free Sequentially-Consistent memory model (DRF-SC) used by C++11 (and, I think, Java) says that as long as all operations on atomics are SC and the program is data-race-free, the whole program can be proven to be sequentially consistent.
[1] but it might in some special cases, for example when all operations are mutex lock and unlock.
Why is reserving a megabyte of stack space "expensive"?
> and takes roughly a millisecond to create
I'm not sure where this number is from, it seems off by a few orders of magnitude. On Linux, thread creation is closer to 10 microseconds.