Good. The first S in S3 is simple, despite nobody in software appreciating simpl...

klodolph · on May 23, 2024

Multi-region is pretty essential IMO. You can’t get the same cost effectiveness by building your own multi-region S3, stapling multiple buckets together. The basic premise of S3 is that you get a keystore interface, and S3 handles error correction and distributing the chunks among multiple physical machines on the cheap. Multi-region is the same product, at a larger scale. The complicated part is that somebody has to pay for the network bandwidth (whereas in a single region, it’s so cheap it’s unmetered).

The CAS thing is also pretty essential. Everybody wants to build some simple storage system on top of S3 and without CAS it’s pretty damn hard to have any kind of consistency guarantees, even for very simple systems… you end up having to build something outside S3 to manage consistency. An easy to understand use case is backups. Suppose you are using S3 as a backend for a backup system which deduplicates backups. You want to make backups from multiple locations, you want to deduplicate because there’s a lot of duplicate data floating around (maybe you have terabytes of video files getting copied, ML model weights, something else big that you copy around), and you want to expire old backups. You can almost build this on top of plain S3, and the only reason you can’t is because it’s unsafe to expire old data in such a system if any backup is writing (because the other backup may add a new reference to data, racing against the expiration / garbage collection process). A simple CAS gives you a lot of tools to solve this. The alternative to CAS is doing something kinda silly, like running a DynamoDB table as a layer of indirection.
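To make the race concrete, here is a minimal sketch in Python. It does not use any real S3 API: `VersionedStore`, `put_if_version`, `add_reference`, and `try_collect` are hypothetical names, and the store is an in-memory stand-in that only assumes a compare-and-swap primitive exists. The point is that the garbage collector's write fails if a backup added a reference after the collector read the record.

```python
import threading


class VersionedStore:
    """Hypothetical in-memory stand-in for an object store with a conditional write."""

    def __init__(self):
        self._lock = threading.Lock()
        self._objects = {}  # key -> (version, value)

    def get(self, key):
        """Return (version, value); a missing key reads as (0, None)."""
        with self._lock:
            return self._objects.get(key, (0, None))

    def put_if_version(self, key, expected_version, value):
        """Compare-and-swap: write only if the version is still expected_version."""
        with self._lock:
            current_version, _ = self._objects.get(key, (0, None))
            if current_version != expected_version:
                return False
            self._objects[key] = (current_version + 1, value)
            return True


def add_reference(store, chunk_key, backup_id):
    """Backup side: record that backup_id uses the chunk, retrying on conflict."""
    while True:
        version, refs = store.get(chunk_key)
        new_refs = (refs or frozenset()) | {backup_id}
        if store.put_if_version(chunk_key, version, new_refs):
            return


def try_collect(store, chunk_key):
    """GC side: tombstone the chunk (value None) only if it is still unreferenced."""
    version, refs = store.get(chunk_key)
    if refs:
        return False
    # Fails if any backup changed the record after the read above.
    return store.put_if_version(chunk_key, version, None)
```

If a backup calls `add_reference` between the collector's read and its write, the version no longer matches and the collector backs off instead of deleting live data.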

Neither of these things adds much complexity to S3.

(I think append is less useful and potentially a lot more complicated, both in terms of its API implications and in terms of the underlying complexity. If I want “append”, then I can use multipart uploads, or just upload multiple objects and reassemble them on the client side.)

QuadrupleA · on May 23, 2024

Essential for who? Lots of other storage solutions if S3 is too simple. Distributed databases. Or fire up an EC2 cluster and install anything you want on it.

klodolph · on May 23, 2024

The multi-region thing should be pretty apparent. It’s part of the core S3 design to provide distributed storage. Multi-region is distributing it over a larger area. If you want to implement multi-region storage yourself, you can do it on S3 and pay a high cost for duplicated data, or you can try to implement your own S3 alternative.

For CAS, one example is backup jobs. You can run backup jobs to S3, but there are some safety issues if you want deduplication and you want to expire old data.

> if S3 is too simple

CAS isn’t some kind of super complicated, technical thing.

It would be nice if S3 had this small, incremental additional feature. That’s all. It would mean that some people don’t need to fire up DynamoDB just to do something you can already do in, say, GCS.

altdataseller · on May 23, 2024

None of those are essential.

The only essential thing it needs to do is store my files, with some assurance that they will exist x years from now in a cost-efficient manner.

klodolph · on May 23, 2024

Sure… there’s always people out there who have a shorter list of requirements than you do. Someone else out there doesn’t need it to be cost-efficient, so maybe that’s not “essential”?

dartos · on May 24, 2024

Other people have other requirements.

iLoveOncall · on May 23, 2024

There essentially are multi-region buckets. You can configure automatic replication, and then use Multi-Region Access Points: https://docs.aws.amazon.com/AmazonS3/latest/userguide/MultiR...

peoplefromibiza · on May 23, 2024

> Multi-region is pretty essential IMO

if we are talking about less than 1% of the applications, probably yes.

AdamJacobMuller · on May 23, 2024

> you can’t is because it’s unsafe to expire old data in such a system if any backup is writing

I did this, I just don't expire any object created in the past week.

klodolph · on May 23, 2024

A currently running backup process can create a new reference to an object which is more than a week old. Meanwhile, the garbage collection process can be deleting that object, but the deletion operation hasn’t finished yet. CAS gives you a lot of options to do this safely.

Dylan16807 · on May 23, 2024

What if you also have a week between marking/moving a file to start deletion, and the final removal?

If a backup and a GC race then a file can get both referenced and marked at the same time, but then a future GC will see the references and put the file back into a normal state. Assume other operations can still find the file while it's marked.

Are there benefits to CAS for this situation other than resolving faster?
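A rough sketch of that two-phase scheme, in Python, with plain sets and dicts standing in for the bucket and the backup index. The function name `gc_pass`, its arguments, and the one-week `GRACE_SECONDS` are illustrative assumptions, not anything a real backup tool exposes. It also assumes a backup's new reference becomes visible to the GC before the grace period runs out.

```python
GRACE_SECONDS = 7 * 24 * 3600  # the "week" between marking and final removal


def gc_pass(referenced, marked_at, all_files, now):
    """Run one GC pass and return the set of files to remove for good.

    referenced: set of file names some backup currently points at
    marked_at:  dict of file name -> time it was marked (mutated in place)
    all_files:  set of file names present in the store
    now:        current time in seconds
    """
    to_delete = set()
    for name in sorted(all_files):
        if name in referenced:
            # A backup raced with an earlier mark: put the file back to normal.
            marked_at.pop(name, None)
        elif name not in marked_at:
            # Phase 1: mark the file, but leave it readable.
            marked_at[name] = now
        elif now - marked_at[name] >= GRACE_SECONDS:
            # Phase 2: unreferenced for a full grace period, safe to remove.
            to_delete.add(name)
    for name in to_delete:
        del marked_at[name]
    return to_delete
```

A file that gets referenced while it is marked is simply unmarked on the next pass, which is the "put the file back into a normal state" step described above.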

klodolph · on May 23, 2024

Sure, you could probably use that kind of delay. I have personally seen GC systems that take more than a week to mark which files need to be deleted but this is admittedly unlikely (the system in question was massive).

> Are there benefits to CAS for this situation other than resolving faster?

I think this kind of thing comes up a lot, where you’d find it convenient to have a CAS update for your file. Like, maybe you should be using a database, but you’re already using S3 and having one or two CAS operations would mean that you can stick with S3.

Sometimes, the alternative is a little ugly. Like, “I’m going to create a DynamoDB table, and it’s only going to contain one row.”

What I’d really love, even more, is to have some kind of distributed lock service on AWS. Something like Zookeeper or Etcd as a SaaS product, where it’s cheap just to get a couple distributed locks. Feels like a gap in cloud offerings to me, but I can understand why it’s missing.

pas · on May 23, 2024

You can use S3 for this, no? (Admittedly it's clunkier than a service with SDKs.)

LIST, GET, and PUT are strongly consistent. The file name is the lock name; write the owner ID and expiry timestamp in the file, and periodically extend the lock expiry (heartbeat). If another process finds an expired lock, it deletes the file.
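Roughly, the lock-file layout could look like the Python sketch below. `LockStore` is a hypothetical class that keeps the "files" in a dict rather than calling S3, and the JSON field names `owner` and `expires` are made up for illustration. The clock is injected so the expiry logic can be exercised without waiting.

```python
import json


class LockStore:
    """Hypothetical lock files kept in a dict that stands in for bucket keys."""

    def __init__(self, clock):
        self._objects = {}   # lock name -> JSON body of the lock file
        self._clock = clock  # callable returning the current time in seconds

    def _write(self, name, owner, ttl):
        body = {"owner": owner, "expires": self._clock() + ttl}
        self._objects[name] = json.dumps(body)  # stands in for PUT

    def acquire(self, name, owner, ttl):
        """Take the lock if it is free, expired, or already ours."""
        body = self._objects.get(name)  # stands in for GET
        if body is not None:
            lock = json.loads(body)
            if lock["expires"] > self._clock() and lock["owner"] != owner:
                return False  # still held by a live owner
            del self._objects[name]  # expired (or ours): remove the old file
        self._write(name, owner, ttl)
        return True

    def heartbeat(self, name, owner, ttl):
        """Extend the expiry, but only while we are still the recorded owner."""
        body = self._objects.get(name)
        if body is None or json.loads(body)["owner"] != owner:
            return False
        self._write(name, owner, ttl)
        return True
```

Note that against a real bucket the GET followed by a PUT is not atomic, so two processes could both see an expired lock and both write their own file. The sketch shows the file layout and the heartbeat, not a safety proof.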

Dylan16807 · on May 23, 2024

Oh, I'm sure there are lots of systems where CAS is very useful.

It's just that a backup tends to have mostly immutable files sitting around, so it becomes more niche. It's awkward to do a lock but you don't need a lot of locking.

jononomo · on May 26, 2024

I don't need multi-region.

KronisLV · on May 23, 2024

> When it gets too out of hand, people will paper it over with a new, simpler abstraction layer, and the process starts again, only with a layer of garbage spaghetti underneath.

I'm pretty happy that there are S3 compatible stores that you can host yourself, that aren't insanely complex.

MinIO: https://min.io/

SeaweedFS: https://github.com/seaweedfs/seaweedfs (this one's particularly nice and is permissively licensed, in contrast to everything else)

There was also Zenko, but I don't think they gained a lot of traction for the most part: https://www.zenko.io/

Of course, many will prefer hosted/managed solutions and that's perfectly fine, but at least when you run software yourself, you are more in control over it and for the most part can also make the judgement on how hard it is to operate and keep operational (e.g. similar to what you'd experience when running PostgreSQL/MariaDB/MySQL or trying to run Oracle).

That said, my needs (both in regards to features and scaling) are pretty basic, so it's okay to pay any of the vendors for something a bit more advanced and scalable.

Already__Taken · on May 23, 2024

seaweedfs has worked really well for our small use case for 2 years. some docs polish wouldn't hurt, but reading the source isn't hard, and I don't even know golang.

charlie0 · on May 23, 2024

This. There's always a small group of people pushing feature requests (ahem, scope creep) into services that were never designed for those things. Unfortunately, those people win a lot of the time. E.g., see all the simple JS frameworks that were initially meant to solve relatively simple problems, only for them to become bloated and be replaced by something else that promised simplicity.

notatoad · on May 23, 2024

i think S3 passed beyond the "simple" a long time ago. is this simple? https://imgur.com/a4jGu0Z

viraptor · on May 23, 2024

The simple things are simple. Others are possible. I don't think that image is representative. "I want to go for a walk, but there's a whole world to choose the path from" - you can still go around the block, the world is not in the way. (There are simple, generic read-only and read-write policies available.)

icedchai · on May 24, 2024

If you don't care about IAM permissions / authorization, just give them all permissions. That's simple, but probably not secure. If you follow "best practice" you'll spend half your time dealing with granular IAM roles, permissions, security groups, and a ton of other stuff.

baq · on May 23, 2024

This is not S3, this is authz.

What does 'simple' even mean when you talk about authz?

notatoad · on May 23, 2024

it is a list of features that s3 has. the iam panel is just a nice way to see them all listed.

CodinM · on May 23, 2024

As someone who had to make a company pass an audit, and the company massively relied on S3 buckets...this so much.

vbezhenar · on May 23, 2024

S3 is anything but simple.

Here's a simple protocol:

  PUT /my/key
  Content-Type: text/plain

  Hello, world


  GET /my/key

Have you ever tried to use S3 without libraries? Have you ever checked the size of the AWS SDK? It's incredibly overengineered.

zokier · on May 23, 2024

The S3 is that simple. Only complication is AWS auth, but you can easily do stuff on S3 with e.g. plain curl:

  $ curl \
      -H 'Content-type: text/plain' \
      --aws-sigv4 'aws:amz:eu-west-1:s3' \
      -u "$AWS_ACCESS_KEY_ID":"$AWS_SECRET_ACCESS_KEY" \
      -H "x-amz-security-token: $AWS_SESSION_TOKEN" \
      -XPUT --data 'hello world' \
      https://mybucket.s3.eu-west-1.amazonaws.com/my/key

  $ curl \
      --aws-sigv4 'aws:amz:eu-west-1:s3' \
      -u "$AWS_ACCESS_KEY_ID":"$AWS_SECRET_ACCESS_KEY" \
      -H "x-amz-security-token: $AWS_SESSION_TOKEN" \
      https://mybucket.s3.eu-west-1.amazonaws.com/my/key

just works.

pjc50 · on May 23, 2024

That's leaning a lot on curl's implementation of "--aws-sigv4", which is not simple.
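For a sense of what that flag hides, here is a Python sketch of just one step of Signature Version 4: deriving the signing key from the secret key by chained HMAC-SHA256. The function names are my own. A full request signature also needs a canonical request and a string to sign, which are left out here, so this is only a partial illustration and not a working client.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    """HMAC-SHA256 of message under key, returned as raw bytes."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the SigV4 signing key; date is YYYYMMDD, for example '20240523'."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")
```

The key is scoped to one day, one region, and one service, which is part of why the region and service appear in the curl flag value above.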

christophilus · on May 23, 2024

Wow. I didn't realize curl had AWS auth baked in. That would have made my life a lot easier a few months ago.

avar · on May 23, 2024

"S3 is that simple, here's an example using an otherwise generic HTTP library specifically altered to deal with AWS's tiresome boilerplate complexity".

I don't like putting words in other people's mouths, but that really does seem like a fair paraphrasing of your comment.

belter · on May 23, 2024

You conveniently left out of your paraphrasing the "Only complication is AWS auth" part...

l5870uoo9y · on May 23, 2024

S3 does handle authentication by giving you a temporary upload URL so your bucket isn't wide open. But I agree it isn't the simplest solution.

ianopolous · on May 23, 2024

AWS SDK is huge yes, but you can implement an S3 client in 300 lines of Java.

https://github.com/Peergos/Peergos/blob/master/src/peergos/s...

amedvednikov · on May 23, 2024

That's true for a vast majority of devs, but not for everyone. There are people like Jon Blow and projects like https://vlang.io

rcleveng · on May 23, 2024

Absolutely this. Sounds like the author wants CS2 (Complex Storage Service). Appreciate simplicity

vinay_ys · on May 25, 2024

It is supposed to be simple for its users so that users' own systems don't become complex. It is not supposed to be simple under the hood.