We used Zalando’s Postgres Operator in production before and recently switched to cloudnative-pg. We haven’t experienced any issues so far and I’m a big fan of their design choices (where you have a single user and a single database for each micro-service that requires one).
I used Zalando and surprisingly hit a bunch of config issues: backups won't run against an S3 endpoint that isn't AWS, and WAL-E now won't run because it has the SeaweedFS endpoint but no DB credentials.
now my one db-one user-one pod "cluster" is dead because it won't elect itself leader. I simply cannot kick it correctly to revive it.
> The Kubernetes way
to me, this translates into: a really complicated mess of manifests, layers upon layers of fun, hidden in an attempt to make it easy; but when it does work in its simplest forms, it works great.
if there's any lesson I've learned: let Kubernetes manage only the things that won't shoot you in the foot; you won't know which until you try :)
Your definition of Kubernetes is accurate for maybe 5 years ago. These days Kubernetes is perfectly capable of running stateful services and supports them as a first class citizen.
Whenever someone runs Kubernetes onprem I tell them to buy a TrueNAS or another cheap SAN. A cheap SAN costs as much as a DevOps expert setting up your Ceph infrastructure and a lot less when you actually run into issues with that software defined storage solution.
Once you do that Kubernetes is actually quite nice, because it gives you a base configuration of postgres that comes with automatic backup setup etc.
Kubernetes has nothing to offer to anyone that wants to work with storage (but there's a myriad of CSIs).
Here, let me give you an example of a system that offers storage in a similar way to how Kubernetes offers... well, it isn't really that good at offering compute, but, at least it kind of does it. So, Ceph -- that's something that makes sense to run PostgreSQL on (it's a storage provider). Kubernetes isn't a storage provider. It doesn't know how to manage it even...
I.e. if you think that you run PostgreSQL on Kubernetes -- you are mistaken. Something else does it. Kubernetes is a proxy there, at best (but probably is completely irrelevant).
Rook is about using CSI... it doesn't run Ceph on Kubernetes. That's impossible because Ceph relies on functionality that exists in drivers (kernel modules) to run. CSI is the component that communicates between eg. rbd driver and the user-space (eg. Kubernetes controllers), but it doesn't run Ceph.
It's in principle impossible to do anything about block devices in containers like those used by Kubernetes because those rely on Linux processes and associated namespaces. There isn't a Linux namespace for block devices, the closest you can get is the filesystem namespace. In other words, you cannot manage block devices purely in containers, you need some help from the host operating system. And this is why I mentioned CSIs in my previous post.
It does run Ceph on Kubernetes. How else would you describe deploying OSDs to linux servers via Kubernetes other than "Ceph on Kubernetes"?
> That's impossible because Ceph relies on functionality that exists in drivers (kernel modules) to run
This statement doesn't make sense. All linux applications require kernel functionality. Yes, to deploy Ceph, you must run Linux systems with the desired kernel modules. Turns out, Rook sets that up for ya! This statement exposes a somewhat deep misunderstanding of what Kubernetes is.
I run into you in every thread that mentions k8s and I sense extreme vitriol and a huge lack of experience / understanding. Don't mistake my future lack of replies for an unsaid "you've misunderstood".
Part of my job is to measure storage performance...
I can tell you at least this: there cannot be a meaningful "benchmark of Postgres on Ceph". Too many things will influence the benchmark way too much. You need to be a lot more specific when you talk about such benchmarks. Here are some things you will need to present:
* Are OSDs connected to the node being tested through network or are they closer (NVMe / SAS / SATA?). If network, what's the bandwidth? What's the latency? What if it's stuff like reliable Ethernet that's used for iSCSI / NVMe over IP or something like that?
* How much memory (relative to data at rest) does the node have?
* What is the layout of memory buffers in PostgreSQL?
* What is the setting used for synchronization in PostgreSQL?
* How much replication is going to happen (Ceph pool size)?
* Block sizes and frame sizes.
* Type of workload. Surprisingly, some queries can exploit parallelism in I/O while other queries cannot. Surprisingly some queries will need a lot of synchronization while others don't.
And there's more, it would be too tedious to try to give an exhaustive list of things to control for. And the problem is, at least those mentioned here can influence performance sometimes up to an order of magnitude, sometimes two orders of magnitude...
I also only run things short term.
I help customers design and implement the proper (Ceph and similar) environment for their workloads.
But I don't run their systems. I always hand over to the rest of the technical staff.
Bringing in a cheap SAN basically shifts that responsibility from people like me to the SAN vendor, and they could bring in hardware that might do the job.
Running on K8s, I feel you need two types of storage:
- Block storage, with proper fsync (fast and reliable)
- S3 storage
Both MUST be CloudNative.
I don't know if cheap SANs come with proper k8s CSI providers, but if they do, they could be up to the challenge.
Note that people like me can help customers with both (choosing the proper 'cheap SAN', but also designing a proper storage environment with Ceph or other software-defined storage solutions).
Why not dedicate some worker nodes using taints/tolerations/labels, even on bare metal, with locally attached storage? I wrote this many years ago now but that's the reason why we started CloudNativePG (OpenEBS might not be the answer today, but there are many storage engines now, including topolvm which brings LVM to the game): https://www.2ndquadrant.com/en/blog/local-persistent-volumes...
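As a rough illustration of that pattern, a CloudNativePG Cluster pinned to dedicated, locally attached nodes might look something like this sketch (the node label, taint key, and topolvm storage class are assumptions for the example, not from the post):

```yaml
# Sketch only: assumes nodes labeled workload=postgres, tainted with
# dedicated=postgres:NoSchedule, and a local-PV storage class (e.g. topolvm).
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-local
spec:
  instances: 3
  storage:
    size: 100Gi
    storageClass: topolvm-provisioner   # assumed local storage class
  affinity:
    nodeSelector:
      workload: postgres                # assumed node label
    tolerations:
      - key: dedicated
        operator: Equal
        value: postgres
        effect: NoSchedule
```

Because the volumes are node-local, each Postgres instance is tied to its node; redundancy then comes from Postgres replication rather than the storage layer.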
It is ultimately your choice. I am a big fan of shared nothing architecture for the database. (I am a maintainer of CloudNativePG)
Yeah, and let Postgres take care of redundancy.
I agree that this is an interesting proposition.
AFAIK Portworx could do a similar thing, but then with storage redundancy.
Basically:
- storage is synced to 3 local storage devices spread across 3 different k8s nodes. This could be NVMe.
- pod is only scheduled next to one of the three
- reads are local, writes are local (for fsync) and synchronised to the other devices.
I would love to test with pg_tps_optimizer against Portworx
I both agree and disagree with your comments.
Benchmarks should be a comparison, and one can very well do a comparison between exactly the same deployment on exactly the same infrastructure with 2 different storage types without going so deep into the weeds. It is crucial to understand the environment of the actual benchmark, but many of the things you mention are less important unless you want to investigate what is actually going on under the hood (hoping to improve something).
Also note that to many people looking to run database workloads on K8s / Ceph, knowing that someone was able to run at 18k TPS without pulling rabbits out of their sleeves is very helpful, and people demanding all of these details basically makes people less willing to share, which is not helpful at all.
Be that as it may, as mentioned in another thread, we ran benchmarks on premise / OpenShift / Ceph, and I will try to answer as many of your questions as possible about these benchmarks. If you want more details, LMK...
* Stack is: OpenShift - RBD - network - Ceph node - VMware VMDK - SAN storage
* Network (AFAIK) is 10g. I haven't tested network latency or storage latency, but the roundtrip for a commit (which pg_bench and pg_tps_optimizer call latency) took about 30ms running 233 clients / 17k TPS.
* no fancy stuff like reliable Ethernet that's used for iSCSI / NVMe over IP or something like that
* I mostly ran with pg_tps_optimizer, and since it is designed to test storage performance (not performance from the app perspective), the way it works, things like shared buffer size are less important. But FYI, I ran with 2GB for cluster.spec.resources.limits.memory.
* What is the layout of memory buffers in PostgreSQL?
I don't understand what you are trying to get at. Running on K8s, you should trust the operator to deploy as smart as possible and not worry about stuff like this unless you are trying to actually investigate and fix problems. I ran with standard settings.
* I tested with many options, including single instance, async, and sync replication (with synchronous_commit set to remote_write, on, and remote_apply). These tests were run on an Azure VM, but I am fairly sure running on OpenShift/Ceph does not change that much. Biggest difference with 13 clients: 12/13k TPS with sync and 17/18k TPS with async. The difference is smaller with a higher number of clients. As the effect is larger with a smaller number of clients, the effect is probably less severe on OpenShift/Ceph.
* AFAIK we have Ceph set to keep 3 replicas. TBH, I don't see how this is of much importance. The Ceph RBD kernel driver writes to the replicas in parallel. Doing more in parallel has little impact on latency, and bandwidth is not the issue.
* I don't know the Block sizes and frame sizes for sure. I expect it is default settings (4096).
* Type of workload. Yeah, this is important stuff.
First of all, about pg_tps_optimizer: I get the most interesting information with pg_tps_optimizer. It basically runs update statements on a record in a table, and with 233 clients this is 233 tables. This really tests storage performance (we rule out things like semaphore locks). It might be compared to importing data with a separate client (which could run in parallel) for every table (or partition if you like).
With pg_bench (default workload) we see similar graphs, but we see limitations with pg_bench at higher numbers of clients. As all data is in the same table(s), with a higher number of clients they run into contention issues (probably semaphore softlocks). As this is not a limitation of storage, I personally find this less interesting.
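For what it's worth, the sync/async variants discussed above can be requested declaratively from CloudNativePG, roughly like this (the cluster name is made up; CNPG manages synchronous_standby_names itself when the sync-replica knobs are set, and omitting them leaves replication asynchronous):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-bench            # made-up name
spec:
  instances: 3
  minSyncReplicas: 1        # > 0 enables synchronous replication
  maxSyncReplicas: 1
  postgresql:
    parameters:
      synchronous_commit: remote_apply  # the tested values: remote_write, on, remote_apply
```

This is only a sketch of one way to express the tested configurations, not the exact manifests used in the benchmark.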
We have run benchmarks on our environment (CNPG, Openshift and CEPH in Dutch gov) and compared to Azure Postgres and (CNPG on) Azure AKS.
Pgbench and pg_tps_optimizer.
Ceph indeed is a 'high bandwidth / high latency' storage solution, and as such we could get to comparable TPS but required more clients.
With 2 vCPU the max TPS was about 17k/18k with AKS and also with OpenShift.
But with AKS we required 34 parallel clients and with OpenShift/CEPH we required 233 clients. More clients <=> more in parallel <=> more TPS on CEPH...
If you are interested I can share some graphs.
I run a stateful metadata cache on Kubernetes with a StatefulSet and EBS as block storage. It runs SQLite just fine.
As for the "setting up Kubernetes" comment, I think that could be true of Kubernetes years ago. Nowadays Platform Engineers generally build on its capabilities continually, and by the time a user is using it to schedule network, compute, and storage the setup for an application maybe takes a day or so without a template. Most of the platform engineering work I've done on Kubernetes had much more to do with lifecycle management than paining myself over initial provisioning.
We do the equivalent on GCP. A lot of the criticism about Kubernetes storage seems to come from people using it onprem.
In that context, I can well imagine that it's a PITA to set up well. As rjzzleep commented in this thread:
> Whenever someone runs Kubernetes onprem I tell them to buy a TrueNAS or another cheap SAN. A cheap SAN costs as much as a DevOps expert setting up your Ceph infrastructure and a lot less when you actually run into issues with that software defined storage solution. Once you do that Kubernetes is actually quite nice
Why do you use a StatefulSet? I assume every instance has its own volume with its own SQLite and that backs the cache. Why not just a Deployment? Failover would be easier in that case.
I scale it vertically. Cache refresh is suboptimal and takes a few hours. I could make it better by having the instances talk, but frankly the service may never actually need to scale. It can handle thousands of rps off that single container.
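A setup like that roughly corresponds to a single-replica StatefulSet with a volumeClaimTemplate, something like this sketch (all names and the gp3/EBS storage class are made up for illustration):

```yaml
# Single stateful pod with a PVC that survives pod restarts.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: metadata-cache
spec:
  serviceName: metadata-cache
  replicas: 1
  selector:
    matchLabels:
      app: metadata-cache
  template:
    metadata:
      labels:
        app: metadata-cache
    spec:
      containers:
        - name: cache
          image: example/metadata-cache:latest   # assumed image
          volumeMounts:
            - name: data
              mountPath: /var/lib/cache          # SQLite file lives here
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3                    # assumed EBS-backed class
        resources:
          requests:
            storage: 20Gi
```

The volumeClaimTemplate is the reason to prefer a StatefulSet here: the pod gets a stable identity and reattaches the same EBS volume after a restart, so the hours-long cache rebuild is avoided.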
We've started using CNPG cautiously in production and can say we're finally confident about running databases in kubernetes. (we had a bad run before with custom setups, and went back to using VMs again).
I am very grateful to maintainers for not only open sourcing the industrial-grade operator, but also sharing so much of their expertise. Unexpected side-effect of adopting CNPG for us was that we now have a good starting point for running postgres with high availability. (obvs there's still a lot to learn, but CNPG docs is a treasure trove of operational knowledge)
I never ran Postgres in prod on k8s, but I used cloudnative-pg for some light, non-critical loads, dev/stg stuff. It works and it's fast, and recovery just works (unlike Zalando, which refuses to operate with anything but AWS S3). I also reported a few minor bugs to the cnpg team and they fixed them in no time.
We're using the Zalando operator. We have our WALs and base backups shipped to Azure Blob Storage. Not sure where this S3-only thing comes from. Can you explain further?
The way I'm reading the comments here is 'AWS S3 as the only flavour of S3 that works', i.e. they're having issues backing up to services like DigitalOcean Object Storage via S3, rather than S3 being the only backup protocol that's supported.
I'd definitely take something over the insanity that is managed services and constant ticket filing with the cloud vendors, but my gut says I just need a Helm chart that makes it decently easy to, a.) bring up PG & some replicas that hopefully have a decent consistency story and b.) a backup job that can write to S3 et al.
I need a better "seduce to use" for a CRD for Postgres, I guess.
"It doesn’t rely on statefulsets [and manages disks itself]" and "The Kubernetes way" (and dark red on blue, yeesh) … just doesn't inspire me?
CRs that simply instantiate an instance aren't all that useful IMO. they're much more useful if you have something like Prometheus' case where they want to attach configuration to various Kubernetes resources and can't easily fit it in annotations.
i don't know that i would actually recommend Helm for much though, since dealing with templates beyond a basic "sub string into field" use case is pain and misery. once you start dealing with named helper templates that operate at different scopes and all but the simplest control flow, the lack of tooling makes debugging templates a nightmare. the operator ecosystem has its own problems (i still can't tell what the extra Red Hat stuff beyond kubebuilder is really helping with and can't stand updating it), but having an actual programming language and all the accompanying type checking and testing tools available is a major benefit.
if you don't need something complex, kustomize feels much less breakage-prone than Helm, though its patches aren't as intuitive to write.
Let's be clear: all of the YAML-cum-DSL K8s deployment options are terrible. It's still just templating with extra steps. But Kubernetes is, fundamentally, one big while(1) loop turning YAML into infrastructure, so, eventually, you have to target YAML. It's just unfortunate that we got to "spicy regexes with if statements" and then stopped, instead of using something more robust.
I don't understand this at all, a Kustomize base is far easier to read than a Helm Chart? It's a set of manifests with no templating. I find that it's easy to see what gets patched or replaced in the overlays and there's a lot more fail-fast rather than stringly-typed whitespace nonsense.
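For illustration, a typical overlay looks something like this (a made-up base with a Deployment named `api`); the patch is structured JSON-patch rather than text templating, so a malformed patch fails at build time instead of producing broken YAML:

```yaml
# overlays/prod/kustomization.yaml (hypothetical example)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: Deployment
      name: api
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 3
```

Running `kustomize build overlays/prod` (or `kubectl kustomize`) then renders the base manifests with the patch applied.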
I guess your kustomizations are relatively simple. Just wait until you have 4-5 layers of them, flipping between several files to figure out which file needs to change and how, just to change an attribute. With a chart, I just make the change and can work it out, all in the same file.
The place the operator looks to be most useful is updating the db. It'll do "the right thing" updating the secondary pods first, doing a switch over of the primary, then updating the old primary pod.
That sort of action is real hard to do with just a plain helm chart (especially if you didn't plan for it up front).
The operator also handles when a node fails or is taken down. An operator can handle replicating to a new replica on a new node and making sure everything stays consistent.
What is a reason someone would want to run Postgres on Kubernetes?
I associate Kubernetes with stateless services that can canary deployed, blue-green deployments, auto restarts upon memory crashes, auto scaling etc.
But, I cannot think of any practical reason one would want such events on a database that one relies on for keeping the state?!
What am I missing?
Are there any toy use cases where people are using Postgres for where they can afford (and want) Kubernetes to rolling-deploy and crash-restart Postgres instances?
In that case, why Postgres in the first place?
I did read the article you linked. It does touch a little bit on potential benefits of having databases managed the same way the services using them are, and integration tests that include the database.
However, I’m still scratching my head over how any of that is better than, or not possible with, a Postgres installation that is outside of Kubernetes. Let’s take out the management part of databases themselves - assume a managed database service like RDS Postgres. Why would one want to run Postgres on EKS over having their pods on EKS talking to RDS Postgres?
I feel like I’m missing some technical reason/advantage that makes all these people choose to run Postgres on Kubernetes with operators and whatnot.
What does Kubernetes bring to this table over a separate Postgres that doesn’t run the risk of Kubernetes interfering with reliability or operation?
The only reason is if you have a bunch of micro service apps that need their own small db that can be quickly spun up, and you already run everything else in kubernetes. A very niche use case that doesn’t apply to 95% of shops.
All the rest of replies just make my eyes roll, it’s like reading promotional pamphlets devoid of any context.
The reason could be the last 4 years of evolution in Kubernetes. Have you heard of DoK Community (Data on Kubernetes)? Might be a good place where to start.
I prefer to run PostgreSQL The Grumpy Greybeard Way.
Like, on a (virtual) machine.
With a config file.
I'm doing a web app with terabytes of healthcare data with intense computation and geospatial queries with an average 60ms response time including latency.
My current DR/HA is basically just hourly block level delta snapshots. It's super simple, cheap and easy to do if the business can tolerate up to an hour of potential data loss.
Running a database on a VM doesn't sound very Greybeard™ of you; don't you know that abstracting like that kills performance and takes you away from the hardware? (Only half joking; obviously you totally can make it run fine on a VM, but it also runs fine in a container.)
That is very cool and much needed, but I am _paranoid_ about hosting the data myself. I've lost too many databases over the years to the most random mistakes. Granted, technology has matured a lot over the last decade, but still... the cost (the nights spent debugging and troubleshooting) of maintaining the database yourself is just not worth it when managed solutions exist.
The whole clue of cloudnative-pg is that it makes supporting Postgres so much easier. My DevOps colleague and I had no experience with running PG at all, studied the backup and replication chapters of Simon Riggs book [0], read the excellent cnpg documentation and deployed Postgres clusters on Lab, Dev & Prod k8s clusters. Production started sept 2022. Not a huge use case, 5TB of data, constant stream of IoT type of data. Continuous backup to Minio on TrueNAS Core. Later we added both a hot-standby and a backup replica in a secondary site for disaster recovery. No production issues like you describe.
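The continuous-backup-to-MinIO part of such a setup is configured declaratively on the Cluster resource; a hedged sketch of what that section can look like (cluster, bucket, endpoint, and Secret names here are all invented):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: iot-pg                    # made-up name
spec:
  instances: 3
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://pg-backups/           # assumed bucket
      endpointURL: https://minio.example.lan:9000 # assumed MinIO endpoint
      s3Credentials:
        accessKeyId:
          name: minio-creds                        # assumed Secret name
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: minio-creds
          key: ACCESS_SECRET_KEY
```

With this in place, base backups and WAL archiving go continuously to the S3-compatible endpoint, which is what makes point-in-time recovery possible later.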
Many EDB & 2ndQuadrant engineers have 10+ years of experience as core Postgres committers. I had the pleasure of meeting some at KubeCon EU in Amsterdam, friendly Italian & UK engineers I felt I could trust to make good software. They saw the issues you describe and took a step forward by engineering a proper Kubernetes operator, introducing it to the public at KubeCon EU 2022 when it reached version 1.5, after they had been running it on their DBaaS production clusters for some time.
Highly recommended. IMHO running Postgres on-prem is in most use cases cheaper than a hosted version. Especially taking into account Schrems I, II & (upcoming) III [1].
Easier in the sense that you don't know what it's doing... It's not hard to deploy PostgreSQL and make it do something. Making it do what it needs to do well is a completely different thing. Tools like this one (or any other management tool that comes from outside) aren't helping you to make it easier. To make things easy you need to learn how to do them. They make it easy to waste a ton of resources on things you don't need for the fraction of performance you could get.
The OP was complaining it was difficult to maintain Postgres, with late nights. Expensive in human resource costs. You’re now switching to a different topic, from total cost of ownership to resource cost of ownership, not taking into account human operator costs.
EDB argues that the human cost of operating a highly available cluster with hot standby etc. using cnpg & the abstraction it offers is much lower than without a Kubernetes operator.
Of course a highly skilled and experienced Postgres guru with years of experience can run a highly available cluster on less compute resources. But what happens if that person is not available? On holiday leave, with pension? How many of those gurus would a company need to employ? How many of these self maintained pg clusters can a couple of these gurus maintain?
Not having DBAs, the people with the actual skills to run the databases you need, available when you need them, certainly sounds like a business staffing problem, and not an engineering problem. Why are inexperienced developers managing databases in the first place? Or is this something people think you can just wing and be okay?
Who says ‘inexperienced developers are managing databases’? That is not the use case EDB is advocating with cnpg, EDB offers a Postgres DBaaS for developers.
The README [0] explains that CloudNativePG has been designed by Postgres experts with Kubernetes administrators in mind. Put simply, it leverages Kubernetes by extending its controller and by defining, in a programmatic way, all the actions that a good DBA would normally do when managing a highly available PostgreSQL database cluster.
> Granted, technology has matured a lot over the last decade,
Running a PostgreSQL database by yourself was just as easy a decade ago as it is now.
Like, I get not wanting to, especially if ops is not your job, but compared to actually programming apps it's not that hard a job till you get to TB+ sizes. At least compared to writing the more complex apps using it.
So are compilers, virtual networks, and encryption, but I have tools for all those that don’t inspire the kind of nameless terror that databases do. The failure modes are different but the problems are not really easier intellectually speaking.
There are databases like CockroachDB that are more modern and a lot more approachable for high availability but for some reason everyone adores Postgres. I’m not sure why. It’s arcane and clunky and feels like 1980s Unix software.
PostgreSQL is the very best of the clunky 80s unix software. Its features and reliability are unmatched. The core contributors have earned the trust of the community.
You lose a lot of features and performance when you go from a single server database to a distributed system. Distributed systems are significantly more complex to set up, administer, and debug. For nearly all databases in use, the tradeoff isn't worth it.
It's really no wonder that postgresql is as popular as it is.
That's because they're not really that hard. Compilers are essentially pure functions, encryption is as well. State is another beast entirely.
If I had to put on my innovator cap and do a relatively weakly informed guess, I'd say its because querying capabilities and reliable storage are still too conflated. If we focused on reliable storage that only has great replication support to other querying systems, the problem might get easier.
Actually not quite. Backups (assuming you are doing single node) still need you to decide on your SLOs - RTO (how long it takes to restore) and RPO (how much data loss you can suffer). On the instant, snazzy end you have streaming backups and recovery, and on the other extreme you have backing up once every N hours/days, with a restore taking however long it takes (so you have customer outages you need to negotiate).
Now let us involve multiple nodes (both replication and partitioning into shards). As shards go up and down, ensuring data is in sync etc. is a hard consistency problem and needs man-years of operational excellence and bug fixing.
So when people think databases - they think of the cool stuff - the database engine that does relational algebra and handles SQL queries. That is (IMO) only 1% of a practical, performant, reliable database (offering).
Maybe if you are gigantic, but there is a long tail of people with <1TB database needs that don’t really need shards and can be well served by a fail over cluster with a master and one or two replicas that can become masters.
These days you don’t really need shards until you hit many terabytes or even more depending on your read and especially write load. NVMe storage is really fast and lots of RAM for caching has become cheap.
So my point was around all the things a managed offering gives you (e.g. sharding and replication). Even by the time I had to set up streaming replication and worry about WAL drift, it is easier to pay a managed provider, no?
Also, what about the customer that deleted an important thing 6 weeks ago and absolutely needs it recovered? BTW, it's just one tenant in that DB; the others shouldn't be recovered, naturally.
In that case, it’d probably be best to just handle deletions at the application layer (e.g., setting a “deleted_at” timestamp field with scheduled permanent deletions later).
And in terms of data compliance, it’s very important to make sure permanent deletions propagate through your backup systems within a reasonable amount of time - Google Cloud[1], for example, is ~180 days.
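A minimal sketch of that soft-delete pattern in SQL (the table, columns, and tenant scoping are hypothetical, not from the thread):

```sql
-- Mark-then-purge soft delete (all names made up for illustration)
ALTER TABLE documents ADD COLUMN deleted_at timestamptz;

-- "Delete" for one tenant just sets the marker:
UPDATE documents
   SET deleted_at = now()
 WHERE tenant_id = $1 AND id = $2;

-- Normal reads exclude soft-deleted rows:
SELECT * FROM documents
 WHERE tenant_id = $1 AND deleted_at IS NULL;

-- A scheduled job makes deletions permanent after a grace period:
DELETE FROM documents
 WHERE deleted_at < now() - interval '90 days';
```

Recovering the one tenant's rows is then a single UPDATE clearing `deleted_at`, with no restore of the shared database involved; the grace period is what has to be reconciled with the backup-propagation deadlines mentioned above.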
Backups? Do you want to share your idea about how you'd do backups? Especially to a distributed database?
Here are some of the questions you'll have to answer and some options you will have to consider before you go there:
Let's start with the heavy stuff: consistency groups. I.e. groups of bulk storage that underlie your entire infrastructure and ensure that your application and database(s) all recover to a shared state once they crash. To better explain this concept, consider this: you have an application that works with two databases, say a document database to store documents uploaded by users (which are later parsed by the application and transformed into records in a relational database). Now, each database provides its best consistency guarantees... but they can still fail independently and subsequently recover to different states, where, for example, the document database can be ahead of the relational one (and lose some data). Sharded databases face similar problems.
How geographically far are you going to send your backups? You see, the closer to the working server they are, the higher is the chance you'll lose them together. But, here's the problem: the further away the backups are, the lower is your ability to keep the backup up-to-date with the database, and, subsequently, more data to lose.
Well, backups inherently lose data (for the time between the last backup and the time of the crash). So, if you don't want to lose data at all, you probably want replication rather than backups. And you probably want online replication (but then the distance between the replicas is even more important than in the case with backups).
Also, backups are huge. If you want to ship them outside of the facilities of the storage vendor... that's going to be expensive.
Another point to consider: databases provide consistency guarantees, but does your database provide the consistency guarantees you want? Is every relation encoded by using foreign keys, or does the application have some knowledge of how to interpret pieces of data and stitch them together into relationships unknown to your database? Are you sure that every operation that requires atomicity is implemented in the database rather than the application (which doesn't enforce atomicity)? What if you stick a backup (recovery point) at a precise moment when your application was doing something that was meant to be atomic, but the application author didn't know how to express it in SQL (because in their fear of technology they chose to use Hibernate or SQLAlchemy etc.)? And if you do so, it spoils your backup...
I actually do not understand the point here. And maybe you are not very familiar with the concept of transactions. Backups can only account for committed transactions.
However, we are talking about Postgres, here, not a generic database. PostgreSQL natively provides continuous backup, streaming replication, including synchronous (controlled at transaction level), cascading, and logical. You can easily implement with Postgres, even in Kubernetes with CloudNativePG, architectures with RPO=0 (yes, zero data loss) and low RTO in the same Kubernetes cluster (normally a region), and RPO <= 5 minutes with low RTO across regions. Out of the box, with CloudNativePG, through replica clusters.
We are also now launching native declarative support for Kubernetes Volume Snapshot API in CloudNativePG with the possibility to use incremental/differential backup and recovery to reduce RTO in case of very large databases recovery (like ... dozens of seconds to restore 500GB databases).
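Declaratively, that looks roughly like the following sketch (the class and cluster names are assumptions, and the snapshot method also requires a CSI driver with VolumeSnapshot support):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: big-db                   # made-up name
spec:
  instances: 3
  storage:
    size: 500Gi
  backup:
    volumeSnapshot:
      className: csi-snapclass   # assumed VolumeSnapshotClass
---
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: big-db-snap
spec:
  cluster:
    name: big-db
  method: volumeSnapshot
```

The Backup resource requests a snapshot-based backup of the named cluster instead of a barman object-store one; restores can then start from the snapshot rather than replaying a full base backup.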
So maybe it is time to reconsider some assumptions.
We are currently moving to CNPG and have tried CrunchyData and Zalando in the process. The other two we abandoned while trying it out.
Zalando:
- Relies on WAL-E which is now obsolete
- Documentation all over the place
- Hacky setup that deviates from K8s standards (no easy way to set user through supplying secrets, for instance).
In general, it feels like an operator to be used internally at Zalando according to their conventions that they just open sourced. It doesn’t seem like they want (or get time) to support other conventions. I don’t think this is a bad thing, it’s already great Zalando open sourced this. Just important to know when you decide to use it.
CrunchyData:
- Incomplete documentation (Certain values settings are missing from their API specs)
- Hacky user setup.
- Doesn’t support running without backups enabled. (Obviously, you’d never want to run without backups set up in prod. But when testing, it’s nice not to need a perfect setup from the start. Without backups, it will let the database pods fill up their PVCs with WAL, even when not doing any writes. It fills up at about 10GB/day.)
- Backups seem to randomly fail.
It looks pretty OK otherwise.
CNPG:
- Adheres to K8S standards
- Seem to realise that an operator will (currently) not fully replace a DBA. Their kubectl plugin is great for interacting with the cluster.
Obviously we still need to test failovers and restoring from backups, but so far it’s been easy to set up.
It does suffer from what most operators suffer from: their CRD is a mess. The UID of the Postgres container is specified at the same indentation level as my switchoverDelay, superuserSecret and bootstrap spec. It would be nice if these followed a more logical grouping (pod spec, users, switchover).
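To illustrate the flat layout (these field names do exist in the CNPG Cluster spec; the values are made up), unrelated concerns all sit side by side under `spec`:

```yaml
spec:
  instances: 3
  postgresUID: 26           # container-level concern...
  switchoverDelay: 40       # ...next to failover tuning...
  superuserSecret:          # ...next to user management...
    name: superuser-secret
  bootstrap:                # ...next to provisioning
    initdb:
      database: app
      owner: app
```

A nested grouping (e.g. everything pod-related under one key) would make the spec easier to scan, at the cost of a breaking API change.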
I've never used it myself, but while doing research I noticed that it received a lot of praise from users.
One thing that did catch my attention is that it doesn't use statefulsets for the postgres pods. I mostly agree with their reasons, but I haven't taken the time to understand their implementation.
Personally I've only used Kubegres (https://www.kubegres.io/), which didn't even make the above list. It's ok for a personal project.
All k8s solutions for postgres take subtly different approaches. It seems that they've all converged on the Operator pattern. The basics are easy: run a database process which persists data to a cloud disk of your choice. The hard parts are how to update, migrate, backup, restore, monitor, failover, replicate, etc. These kubernetes "operators" promise to fulfill the role of a DBA but, just like hiring a DBA, it requires buy-in to their approach.
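For a sense of scale, the "easy basics" with an operator like CNPG really are small; a sketch of a three-instance cluster with object-store backups (image, bucket, and secret names are all assumptions):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db
spec:
  instances: 3            # one primary, two replicas, automated failover
  storage:
    size: 20Gi
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://example-backups/example-db
      s3Credentials:
        accessKeyId:
          name: backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: SECRET_ACCESS_KEY
```

It's everything beyond this manifest (upgrades, restores, tuning, debugging a failed failover) where the "buy-in to their approach" actually bites.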
From my experience with Strimzi, I think of it as "half-managed": it makes things like upgrades easy, but it's still just tooling that makes self-management easier.
I’ve been using it in prod for a while now, pretty happy with it. Solid, integrated pgbouncer, crd based, good license.
I do wish there was a simpler way to handle major version upgrades of pg.
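One workaround available today is a logical import into a fresh cluster running the new major version, which CNPG supports declaratively through `bootstrap.initdb.import` (backed by pg_dump/pg_restore). A sketch, with all cluster, host, and secret names assumed:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-pg16
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16
  storage:
    size: 20Gi
  bootstrap:
    initdb:
      import:
        type: microservice        # single-database import
        databases:
          - app
        source:
          externalCluster: app-db-pg14
  externalClusters:
    - name: app-db-pg14
      connectionParameters:
        host: app-db-pg14-rw      # the old cluster's read-write service
        user: postgres
        dbname: app
      password:
        name: app-db-pg14-superuser
        key: password
```

It works, but it's a dump-and-restore, so downtime scales with database size; an in-place `pg_upgrade` path would still be nicer.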
When I looked at some alternatives, these were my thoughts (may be out of date by now)
- kubegres: maintained by one guy, lots of GitHub issues with no responses
- crunchy data pgo: licensing is not obvious, seems to require license in some cases
- stackgres: agpl, no thanks
- zalando: they know pg extremely well, but it’s not kubernetes native. Doesn’t include pgbouncer. Doesn’t handle automatic failover when a node dies, and during testing it often got confused when killing a node.
> CloudNativePG exclusively relies on the Kubernetes API server to maintain the state of a PostgreSQL cluster.
This is scary af. The Kubernetes API server is finicky and unreliable. It's probably the first component to fail in Kubernetes no matter what the problem is (e.g. I recently drove it into an unrecoverable state by accidentally starting 10K jobs instead of 100).
This is just an outright bad idea... but, really, "reliable" and Kubernetes don't belong together. So, if you wanted an unreliable PostgreSQL, with no idea how it's managed or how to recover it... well, that sounds like fun!
Are these extremely large shops in some sort of pocket universe? Because I've been around the block enough to regularly experience Kubernetes's issues over and over again, and I'd say that the GP comment you're critiquing is actually right on the money. In my experience, people downplaying Kubernetes's drawbacks have either never been bitten by them, or make a lot of money by getting people to use Kubernetes.
Complex systems are complex. If you don’t need it, don’t use it.
But you’re absolutely wrong to call it unreliable. I’ve also “been around the block”, and I’ve seen 50k lines of Bash fail in complex ways too.
A google search would show you plenty of extremely large shops using k8s, and dozens or hundreds of tech talks by their lead engineers saying how much better it made their lives.
No it's not. Kubernetes is not an equivalent of a car, it's an equivalent of Land Rover Discovery 5 in your scenario. There are plenty of cars which score differently on some pre-agreed reliability scale (and, specifically, this model of Land Rover scores very poorly).
Kubernetes is light years away from being a reliable system. It's a system to support Web sites, which don't need to do anything that would require a lot of investment into reliability. Nobody in their right mind would use Kubernetes to run highly-reliable systems that, eg. put human lives at risk.
PostgreSQL is in a completely different category from reliability standpoint: much more reliable. But, if you put it on Kubernetes footing, you give up that reliability.
Again, maybe your business is to make Web sites -- then nobody cares, it's good enough. Running a database of a bank on top of Kubernetes -- well, I hope nobody actually tries that. And size has nothing to do with that. Reliability is about all sorts of metrics like mean time to failure / recovery and guarantees that the system makes like once a particular condition is encountered, the system will halt and so on. Kubernetes never claimed anything in that category, and, in practice, it doesn't offer any satisfactory guarantees of its own performance.
Anecdotally, I've seen Kubernetes fail due to its internal problems more times than I can count. We are talking about at least hundreds of times. And by "fail" I mean that the system enters a state that cannot be salvaged by rebooting or replacing the master node(s). Since my interaction with it involves a lot of testing, though I'm not testing Kubernetes itself but something running on it, I'm all too used to deleting and deploying new clusters.
This does not sound like knowledgeable feedback. You should consider investigating and fixing the problem next time, at least to the point where you can explain to others what failed. A problem that continually affected you through many clusters sounds like a configuration issue. PEBKAC, etc.
I previously worked for a company that ran highly-reliable systems that put human lives at risk. We increased the reliability of deployments and rollbacks quite a lot by moving to K8s, and that was years ago, when K8s was much newer than it is today. A very good friend still works there and they have zero plans or desires to move off K8s, as it works fantastically for them. They maintain 5 nines uptime and are responsible for daily hospital operations.
Also - plenty of human lives rely on “websites”. Your car probably speaks HTTP. I don't understand this line of thinking at all.
You are probably viewing the dark colour scheme of the website. The website looks fine in light colour scheme. However in dark colour scheme, while it flips the background and text colour alright, it does not alter the accent colours sufficiently well.
Consider the colour of the text "Try Kubernetes Way" and the background colour. In light mode, we have foreground colour #DC2626 on background colour #FFFFFF. This has a contrast ratio of 4.82. WCAG 2.0 level AA recommends minimum contrast ratio of 3.0 for large text. So a contrast of 4.82 is decent enough. However in dark mode, we see #991B1B on #075985 which has a contrast ratio of only 1.09. This ratio is too low and fails all WCAG contrast requirements.
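Those numbers check out. For reference, the WCAG 2.0 relative luminance and contrast ratio can be computed in a few lines of plain Python (no external dependencies):

```python
def srgb_to_linear(c8):
    # WCAG 2.0: linearize one 8-bit sRGB channel
    c = c8 / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def luminance(hex_color):
    # Relative luminance of a "#RRGGBB" string
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))
    return (0.2126 * srgb_to_linear(r)
            + 0.7152 * srgb_to_linear(g)
            + 0.0722 * srgb_to_linear(b))

def contrast(fg, bg):
    # Contrast ratio (L_lighter + 0.05) / (L_darker + 0.05), range 1..21
    hi, lo = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)

print(round(contrast("#DC2626", "#FFFFFF"), 2))  # light mode: ~4.8, passes AA for large text
print(round(contrast("#991B1B", "#075985"), 2))  # dark mode: ~1.1, fails everything
```

Running both pairs through this reproduces the ratios above, which is why the red accent needs a lighter variant in dark mode rather than just a flipped background.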
It's been K - 8 letters - S long enough that everyone at work just calls it "Kates" now, which initially annoyed me to no end, but has kind of grown on me.
Nitpick, but I think it is from Ancient Greek. A cursory Google tells me the modern Greek descendant is κυβερνήτης/kyvernítis and has a slightly different meaning.
κυβερνήτης is the ancient word too. Maybe they were trying to play on "cube", which by the way also comes from κύβος (kyvos), not kuvos. Exactly the same meaning as far as I know; I wonder what the cursory Google came up with as the difference in meaning. It's also the root of "cybernetics"...