Good. The first S in S3 is simple, despite nobody in software appreciating simplicity anymore.
Adding features makes documentation more complicated, makes the tech harder to learn, makes libraries bigger, likely harms performance a bit, increases bug surface area, etc.
When it gets too out of hand, people will paper it over with a new, simpler abstraction layer, and the process starts again, only with a layer of garbage spaghetti underneath.
Show your age and be proud, Simple Storage Service.
Multi-region is pretty essential IMO. You can’t get the same cost effectiveness by building your own multi-region S3, stapling multiple buckets together. The basic premise of S3 is that you get a key-value store interface, and S3 handles error correction and distributing the chunks among multiple physical machines on the cheap. Multi-region is the same product, at a larger scale. The complicated part is that somebody has to pay for the network bandwidth (whereas in a single region, it’s so cheap it’s unmetered).
The CAS thing is also pretty essential. Everybody wants to build some simple storage system on top of S3 and without CAS it’s pretty damn hard to have any kind of consistency guarantees, even for very simple systems… you end up having to build something outside S3 to manage consistency. An easy to understand use case is backups. Suppose you are using S3 as a backend for a backup system which deduplicates backups. You want to make backups from multiple locations, you want to deduplicate because there’s a lot of duplicate data floating around (maybe you have terabytes of video files getting copied, ML model weights, something else big that you copy around), and you want to expire old backups. You can almost build this on top of plain S3, and the only reason you can’t is because it’s unsafe to expire old data in such a system if any backup is writing (because the other backup may add a new reference to data, racing against the expiration / garbage collection process). A simple CAS gives you a lot of tools to solve this. The alternative to CAS is doing something kinda silly, like running a DynamoDB table as a layer of indirection.
Neither of these things adds much complexity to S3.
(I think append is less useful and potentially a lot more complicated, both in terms of its API implications and in terms of the underlying complexity. If I want “append”, then I can use multipart uploads, or just upload multiple objects and reassemble them on the client side.)
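For what it's worth, the "upload multiple objects and reassemble them on the client side" idea can be sketched in a few lines. This is only an illustration: a plain dict stands in for the bucket, `append_part` and `read_all` are made-up names, and it assumes a single writer (two writers would need unique key suffixes).

```python
def append_part(store: dict, prefix: str, data: bytes) -> str:
    """Store one appended piece as its own object and return its key."""
    # Zero-padded sequence numbers keep lexicographic key order equal to write order.
    seq = sum(1 for key in store if key.startswith(prefix + "/"))
    key = f"{prefix}/{seq:08d}"
    store[key] = data
    return key


def read_all(store: dict, prefix: str) -> bytes:
    """Reassemble the logical file by concatenating the pieces in key order."""
    keys = sorted(key for key in store if key.startswith(prefix + "/"))
    return b"".join(store[key] for key in keys)


# A dict plays the role of the bucket in this sketch.
bucket = {}
for piece in (b"alpha\n", b"beta\n", b"gamma\n"):
    append_part(bucket, "logs/app", piece)
combined = read_all(bucket, "logs/app")
```

The reader pays for a listing plus one fetch per piece, which is the trade-off against a real append operation.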
Essential for who? There are lots of other storage solutions if S3 is too simple. Distributed databases. Or fire up an EC2 cluster and install anything you want on it.
The multi-region thing should be pretty apparent. It’s part of the core S3 design to provide distributed storage. Multi-region is distributing it over a larger area. If you want to implement multi-region storage yourself, you can do it on S3 and pay a high cost for duplicated data, or you can try to implement your own S3 alternative.
For CAS, one example is backup jobs. You can run backup jobs to S3, but there are some safety issues if you want deduplication and you want to expire old data.
> if S3 is too simple
CAS isn’t some kind of super complicated, technical thing.
It would be nice if S3 had this small, incremental additional feature. That’s all. It would mean that some people don’t need to fire up DynamoDB just to do something you can already do in, say, GCS.
Sure… there’s always people out there who have a shorter list of requirements than you do. Someone else out there doesn’t need it to be cost-efficient, so maybe that’s not “essential”?
A currently running backup process can create a new reference to an object which is more than a week old. Meanwhile, the garbage collection process can be deleting that object, but the deletion operation hasn’t finished yet. CAS gives you a lot of options to do this safely.
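To make the race concrete, here is a rough sketch of how a compare-and-swap on a single manifest object could catch it. Everything here is an assumption for illustration: `VersionedStore` is an in-memory stand-in rather than any real S3 API, and the manifest layout (chunk id mapped to a list of backup ids) is invented.

```python
import json


class VersionedStore:
    """Hypothetical in-memory object store with a compare-and-swap write."""

    def __init__(self):
        self._objects = {}  # key -> (version, value)

    def get(self, key):
        """Return (version, value); version 0 means the key does not exist."""
        return self._objects.get(key, (0, None))

    def put_if_version(self, key, value, expected_version):
        """Write only if the stored version still equals expected_version."""
        version, _ = self._objects.get(key, (0, None))
        if version != expected_version:
            return False
        self._objects[key] = (version + 1, value)
        return True


MANIFEST = "manifest.json"  # assumed layout: {chunk_id: [backup_id, ...]}


def load_manifest(store):
    version, raw = store.get(MANIFEST)
    return version, (json.loads(raw) if raw else {})


def add_reference(store, chunk_id, backup_id):
    """Backup side: record a reference, retrying if the manifest changed."""
    while True:
        version, manifest = load_manifest(store)
        refs = manifest.setdefault(chunk_id, [])
        if backup_id not in refs:
            refs.append(backup_id)
        if store.put_if_version(MANIFEST, json.dumps(manifest), version):
            return


def expire_unreferenced(store, before_commit=None):
    """GC side: drop chunks with no references, or return None on a lost race."""
    version, manifest = load_manifest(store)
    dead = sorted(c for c, refs in manifest.items() if not refs)
    if before_commit:
        before_commit()  # hook used below to simulate a concurrent backup
    for chunk_id in dead:
        del manifest[chunk_id]
    if store.put_if_version(MANIFEST, json.dumps(manifest), version):
        return dead  # only now is it safe to delete the chunk data itself
    return None


store = VersionedStore()
store.put_if_version(MANIFEST, json.dumps({"chunk-a": [], "chunk-b": ["backup-1"]}), 0)
# A backup references chunk-a after GC has read the manifest but before it commits.
raced = expire_unreferenced(
    store, before_commit=lambda: add_reference(store, "chunk-a", "backup-2")
)
_, manifest_after = load_manifest(store)
```

In this run the GC's write is rejected because the manifest version moved, so `chunk-a` keeps its new reference and the GC simply tries again later.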
What if you also have a week between marking/moving a file to start deletion, and the final removal?
If a backup and a GC race then a file can get both referenced and marked at the same time, but then a future GC will see the references and put the file back into a normal state. Assume other operations can still find the file while it's marked.
Are there benefits to CAS for this situation other than resolving faster?
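A minimal sketch of the mark-then-wait scheme described above, assuming the one-week grace period is longer than any backup run. The `Chunk` record and `gc_pass` function are hypothetical names, and the state lives in a dict rather than in a real bucket.

```python
from dataclasses import dataclass, field
from typing import Optional

GRACE_SECONDS = 7 * 24 * 3600  # the one-week delay between marking and removal


@dataclass
class Chunk:
    references: set = field(default_factory=set)
    marked_at: Optional[float] = None  # None means the chunk is in its normal state


def gc_pass(chunks: dict, now: float) -> list:
    """Run one GC pass and return the ids of chunks removed in this pass."""
    deleted = []
    for chunk_id in list(chunks):
        chunk = chunks[chunk_id]
        if chunk.references:
            chunk.marked_at = None  # referenced again: put it back to normal
        elif chunk.marked_at is None:
            chunk.marked_at = now  # phase 1: mark, but keep it readable
        elif now - chunk.marked_at >= GRACE_SECONDS:
            del chunks[chunk_id]  # phase 2: final removal after the grace period
            deleted.append(chunk_id)
    return deleted


chunks = {"old": Chunk(), "shared": Chunk()}
first = gc_pass(chunks, now=0)  # both chunks get marked, nothing is removed
chunks["shared"].references.add("backup-7")  # a backup races with the mark
second = gc_pass(chunks, now=GRACE_SECONDS)
```

After the second pass, `old` is gone and `shared` is back in its normal state, which is the recovery behaviour the comment describes.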
Sure, you could probably use that kind of delay. I have personally seen GC systems that take more than a week to mark which files need to be deleted, but this is admittedly unlikely (the system in question was massive).
> Are there benefits to CAS for this situation other than resolving faster?
I think this kind of thing comes up a lot, where you’d find it convenient to have a CAS update for your file. Like, maybe you should be using a database, but you’re already using S3 and having one or two CAS operations would mean that you can stick with S3.
Sometimes, the alternative is a little ugly. Like, “I’m going to create a DynamoDB table, and it’s only going to contain one row.”
What I’d really love, even more, is to have some kind of distributed lock service on AWS. Something like Zookeeper or Etcd as a SaaS product, where it’s cheap just to get a couple distributed locks. Feels like a gap in cloud offerings to me, but I can understand why it’s missing.
You can use S3 for this, no? (Admittedly it's clunkier than a service with SDKs.)
LIST, GET, and PUT are strongly consistent. Use the file name as the lock name, write the owner ID and expiry timestamp in the file, and periodically extend the lock expiry (heartbeat). If another process finds an expired lock, it deletes the file.
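Roughly, that lock-file protocol could look like the sketch below. The `ObjectStore` class is a hypothetical in-memory stand-in for strongly consistent GET/PUT/DELETE, the function names and JSON layout are made up, and time is passed in explicitly so the example is deterministic. It runs in one thread and does not model two processes racing between the GET and the PUT.

```python
import json


class ObjectStore:
    """Hypothetical stand-in for strongly consistent GET, PUT, and DELETE."""

    def __init__(self):
        self._objects = {}

    def get(self, key):
        return self._objects.get(key)

    def put(self, key, value):
        self._objects[key] = value

    def delete(self, key):
        self._objects.pop(key, None)


def try_acquire(store, lock_name, owner_id, now, ttl):
    """Take the lock if it is free, expired, or already ours."""
    raw = store.get(lock_name)
    if raw is not None:
        lock = json.loads(raw)
        if lock["owner"] != owner_id:
            if lock["expires_at"] > now:
                return False  # held by someone else and still valid
            store.delete(lock_name)  # expired lock: delete the file
    # The file name is the lock name; the body holds owner id and expiry.
    store.put(lock_name, json.dumps({"owner": owner_id, "expires_at": now + ttl}))
    return True


def heartbeat(store, lock_name, owner_id, now, ttl):
    """Extend the expiry, but only if we still own the lock."""
    raw = store.get(lock_name)
    if raw is None or json.loads(raw)["owner"] != owner_id:
        return False
    store.put(lock_name, json.dumps({"owner": owner_id, "expires_at": now + ttl}))
    return True


store = ObjectStore()
a_got_it = try_acquire(store, "locks/gc", "worker-a", now=0, ttl=30)
b_blocked = try_acquire(store, "locks/gc", "worker-b", now=10, ttl=30)
a_extended = heartbeat(store, "locks/gc", "worker-a", now=20, ttl=30)  # expiry moves to 50
b_still_blocked = try_acquire(store, "locks/gc", "worker-b", now=40, ttl=30)
b_takes_over = try_acquire(store, "locks/gc", "worker-b", now=60, ttl=30)
```

The heartbeat is what keeps `worker-b` out at time 40; once `worker-a` stops extending, the lock expires and `worker-b` takes over.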
Oh I'm sure there's lots of systems where CAS is very useful.
It's just that a backup tends to have mostly immutable files sitting around, so it becomes more niche. It's awkward to do a lock but you don't need a lot of locking.
> When it gets too out of hand, people will paper it over with a new, simpler abstraction layer, and the process starts again, only with a layer of garbage spaghetti underneath.
I'm pretty happy that there are S3 compatible stores that you can host yourself, that aren't insanely complex.
There was also Zenko, but I don't think they gained a lot of traction for the most part: https://www.zenko.io/
Of course, many will prefer hosted/managed solutions and that's perfectly fine, but at least when you run software yourself, you are more in control over it and for the most part can also make the judgement on how hard it is to operate and keep operational (e.g. similar to what you'd experience when running PostgreSQL/MariaDB/MySQL or trying to run Oracle).
That said, my needs (both in regards to features and scaling) are pretty basic, so it's okay to pay any of the vendors for something a bit more advanced and scalable.
SeaweedFS has worked really nicely for our small use case for 2 years. Some docs polish wouldn't hurt, but reading the source isn't hard, and I don't know golang.
This. There's always a small group of people pushing feature requests (ahem, scope creep) into services that were never designed for those things. Unfortunately, those people win a lot of the time. E.g., see all the simple JS frameworks that were initially meant to solve relatively simple problems, only for them to become bloated and be replaced by something else that promised simplicity.
The simple things are simple. Others are possible to do. I don't think that image is representative. "I want to go for a walk, but there's a whole world to choose the path from" - you can still go around the block; the world is not in the way. (There are simple, generic read-only and read-write policies available.)
If you don't care about IAM permissions / authorization, just give them all permissions. That's simple, but probably not secure. If you follow "best practice" you'll spend half your time dealing with granular IAM roles, permissions, security groups, and a ton of other stuff.
"S3 is that simple, here's an example using an otherwise generic HTTP library specifically altered to deal with AWS's tiresome boilerplate complexity".
I don't like putting words in other people's mouths, but that really does seem like a fair paraphrasing of your comment.