NATS is a pub/sub server written in Go. It's designed for simplicity and performance with features like request/reply and 1-hop fanout but no persistence.
NATS Streaming (aka STAN) was the answer to persistence: it builds acks and storage on top of the base NATS protocol. The result is a separate protocol layered on top of NATS, with messages stored in files or in a database. STAN has limits on topics and subscriptions and is designed as a single master with multiple failover nodes, but it works well if you need a simpler, self-contained alternative to Kafka. A Raft-based clustering option was added later, but it isn't a great architecture and has a major performance impact. The docs recommend you use the failover model instead.
Liftbridge offers a better version that uses the standard NATS protocol. It basically acts as an invisible subscriber to the topics and logs all the messages using multiple Raft groups, which lets you scale out by using more topics. NATS itself was recently revamped with the 2.0 release, which adds some major new features, and the team is working on a better answer for the persisted/distributed log offering.
If you need fast pub/sub for ephemeral messaging then NATS is great. If you need persistence and a single server gives you enough throughput then use NATS Streaming (with failover if you need it). If you need more persistence throughput then use Liftbridge. Once you start getting to these scales, though, I would recommend looking at Kafka (which has also gotten much better with v2.0+) or something like Apache Pulsar, which is more advanced and scalable than both. There are also commercial options like Solace, with a free tier, that are worth checking out.
Just to clarify, Liftbridge relies on a single Raft group used purely for metadata replication, i.e. control plane, not data plane. Streams themselves are replicated using a protocol very similar to Kafka (ISR-based with followers fetching messages from the leader's log).
The intro blog [1] mentioned in my other comment gives a detailed run-down of differences between the two. Not sure it can be summarized any more succinctly so I'll just paste an excerpt here:
"NATS Streaming provides a similar log-based messaging solution. However, it is an entirely separate protocol built on top of NATS. NATS is an implementation detail—the transport—for NATS Streaming. This means the two systems have separate messaging namespaces—messages published to NATS are not accessible from NATS Streaming and vice versa. Of course, it’s a bit more nuanced than this because, in reality, NATS Streaming is using NATS subjects underneath; technically messages can be accessed, but they are serialized protobufs. These nuances often get confounded by first–time users as it’s not always clear that NATS and NATS Streaming are completely separate systems. NATS Streaming also does not support wildcard subscriptions, which sometimes surprises users since it’s a major feature of NATS.
As a result, Liftbridge was built to augment NATS with durability rather than providing a completely separate system. To be clear, it’s still a separate server, but it merely acts as a write-ahead log for NATS subjects. NATS Streaming provides a broader set of features such as durable subscriptions, queue groups, pluggable storage backends, and multiple fault-tolerance modes. Liftbridge aims to have a relatively small API surface area.
The key features that differentiate Liftbridge are the shared message namespace, wildcards, log compaction, and horizontal scalability. NATS Streaming replicates channels to the entire cluster through a single Raft group, so adding servers does not help with scalability and actually creates a head-of-line bottleneck since everything is replicated through a single consensus group (n.b. NATS Streaming does have a partitioning mechanism, but it cannot be used in conjunction with clustering). Liftbridge allows replicating to a subset of the cluster, and each stream is replicated independently in parallel. This allows the cluster to scale horizontally and partition workloads more easily within a single, multi-tenant cluster."
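For anyone unfamiliar with the wildcard feature mentioned in the excerpt: in core NATS, `*` matches exactly one subject token and `>` matches one or more trailing tokens. A rough Go sketch of those matching rules follows. It is an illustration only, not the server's implementation; matchSubject is a made-up helper and the subject names are invented.

```go
package main

import (
	"fmt"
	"strings"
)

// matchSubject reports whether a literal subject matches a subscription pattern
// under the NATS wildcard rules. Hypothetical helper; it skips subject validation.
func matchSubject(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			// ">" is only valid as the last token and needs at least one token to match.
			return i == len(p)-1 && len(s) > i
		}
		if i >= len(s) {
			return false // pattern is longer than the subject
		}
		if tok != "*" && tok != s[i] {
			return false // literal token mismatch
		}
	}
	// Without a trailing ">", the token counts must match exactly.
	return len(p) == len(s)
}

func main() {
	fmt.Println(matchSubject("orders.*", "orders.created"))    // true
	fmt.Println(matchSubject("orders.*", "orders.eu.created")) // false
	fmt.Println(matchSubject("orders.>", "orders.eu.created")) // true
	fmt.Println(matchSubject("orders.>", "orders"))            // false
}
```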
Good explanation, but I feel like it needs to be "bottom line up front": the last two sentences are the most important:
"Liftbridge allows replicating to a subset of the cluster, and each stream is replicated independently in parallel. This allows the cluster to scale horizontally and partition workloads more easily within a single, multi-tenant cluster."
Can anyone speak to the current state of each project, pros & cons?