
I suppose this is at the top of HN because of Reddit having an outage and Hacker News being slow. But neither Reddit nor HN uses Cloudflare. And if you look at the status page, this isn't our core CDN offering; it's certain other products that are affected.


I suppose this is at the top of HN because Cloudflare is quickly becoming a centralized point of failure for big parts of the internet, even without counting the CDN, so large swaths are affected regardless of whether it's just a "minor" outage or not. Hence it's interesting enough to land at the top.


Anecdotally: OP guessed right. I was only interested because of the HN and Reddit outages.


There's also the perception issue: if the host fails, people assume it most likely isn't Cloudflare, even though Cloudflare shows a warning page.

Note: I experienced this from the front row; our cloud provider had an SSL issue in one region...


Is there some sort of backup/failsafe mechanism for sites that use Cloudflare?


Depends how badly they fail. To take full advantage of CF you need to keep your DNS with them. That means if you can't configure any changes, you can't quickly move to another provider either. Your only option at that point is to transfer the whole domain, but that may also require CF's assistance (unless they don't handle your apex domain).

Effectively, unless they're down for more than a day, you're better off taking the hit and waiting for them to resolve everything.


> Effectively unless they're down for more than a day, you're better off taking the hit and waiting for them to resolve everything.

lol. There's me worrying about a 2 minute downtime once every 10 years


Well I've managed ~53.5 years uptime with zero downtime so far 8)

I'm taking the piss too. Care to explain?

PS For a laugh (and I apologize for going dreadfully off topic), I asked ChatGPT a couple of questions regarding two minutes and 10 years, just in case I'd missed a trick in your comment. I got two calculations, across the two answers, that looked the same but differed by a factor of 10 in the result!

https://chat.openai.com/share/7aaa80e5-1c54-4ac2-9ddb-9e488d...

This is where it goes wrong:

"Total minutes in 10 years = 60 minutes/hour * 24 hours/day * 365 days/year * 10 years = 525,600 minutes"

"Total time in 10 years = 60 minutes/hour * 24 hours/day * 365 days/year * 10 years = 5,256,000 minutes"

LO Calc says that =60*24*365*10 = 5,256,000.
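For what it's worth, the same arithmetic is quick to check in Python (using the same non-leap-year figures as above):

```python
minutes_per_year = 60 * 24 * 365        # 525,600 minutes in one year
minutes_in_10_years = minutes_per_year * 10

print(minutes_in_10_years)              # 5256000, matching LO Calc

# 2 minutes of downtime over 10 years, as a fraction of total time:
downtime_fraction = 2 / minutes_in_10_years   # roughly 3.8e-7,
availability = 1 - downtime_fraction          # i.e. better than "six nines"
```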

The worrying thing for me is that an awful lot of the training data might be badly wrong to cause this result, or perhaps I've managed to exercise a corner case, in which case the training data is a bit too focussed in this particular regard. I suspect that arithmetic errors for these things will be awful, because there are so many ways to screw up and the training data will have a lot of errors in it. Combine that with the number of subjects available and it will be a shit show.


By the way, when it comes to math like this, wolfram alpha simply can't be beaten: https://www.wolframalpha.com/input?i=2+minutes+of+10+years+p...

As for GPT, it's not about the training data having errors in it. GPT doesn't parrot the training data exactly; there's randomness built into it (otherwise you'd always get the exact same output for the same input). It just generates a random plausible response; it doesn't actually know math.

To highlight this, I've just asked it the exact same thing you did, exact same words, and got an entirely different response (and this time it was correct): https://i.28hours.org/20230613-051307-0ae6.png


We had some automation at work which caused a major problem resulting in a c. 140-second downtime at about 3am. Not my area of responsibility, but I think it's the first time since 2005 that we've had that type of widespread outage.


To use Cloudflare, you switch your domain's DNS servers to them. If that breaks, you probably can't switch away without incurring the usual DNS propagation delay for a nameserver change.

If DNS is up, you can change specific records to point to your origin or another provider instead of the proxy/CDN, provided it can handle the load and doesn't need a Cloudflare-style setup (i.e. repointing your nameservers).


Do they support Secondary DNS? E.g. https://support.dnsimple.com/articles/secondary-dns/


I wouldn't say it's a centralised point of failure. Many individual sites choose to use it, so it is used by big parts of the internet, but that doesn't make it centralised. Maybe a single point of failure.


It's impossible for a single point of failure not to be centralised. Otherwise it wouldn't be a _single_ point of failure.


Of course it is. If a million websites are built with Flask then a bug in Flask affects all of them. If a million websites individually decide to use Cloudflare then Cloudflare affects them all.

But that doesn't mean Cloudflare is built in a centralised way (I assume it isn't), nor that there's something about the internet that is centralised around Cloudflare. Rather, people are choosing to include it as a dependency in their stack.


In your example, Flask would be a single point of failure and could indeed be called decentralized: websites can individually patch their Flask installation and come back up one by one, without depending on anyone else (at least once Flask itself is fixed).

Here with Cloudflare, a single entity is responsible for the fix and will more or less fix the failure for every site that uses it at the same time, by fixing it on their side. And websites individually cannot do anything on their own.

So I would argue it makes sense to call that centralized, at least from a structural/operational perspective.

It is at the very least a form of contraction of the network.


From an operational perspective it's just a supplier like any other. Lots of suppliers are involved in a business's website. That doesn't make it centralised, though?

> It is at the very least a form of contraction of the network.

I think it's that at most. No one has to use them, as they accelerate / enhance open protocols. That is the least lock-in one could hope for, so they don't contract anything in a negative way.

Contrast with, say, Etsy/Shopify, who actively try to replace the open space with closed ones.


It seems like you two are arguing different perspectives on the word centralization without conceding that a term can be relative to a viewpoint.


But every word can be relative to that? I don't see where we'd go with that. I do think it's how the word relates to the topic that's interesting, if that's what you mean, but I think we are trying to get at that (-:


Using a piece of software is not the same as using an online service.

If all those websites are using Flask, that would be the equivalent of centralising on Flask. The opposite of that would be many Flask-compatible yet unrelated other frameworks being used in parallel. A bug in Flask would not affect those not using that very codebase.

The centralisation people are speaking of here is the amount of people all putting their eggs in Cloudflare's basket.


> If all those websites are using Flask, that would be the equivalent of centralising on Flask

I agree on the equivalence between this and Cloudflare use, but not that this should be described as centralisation, which is a particularly potent word on the open web. Popularity isn't the same as centralisation. Each website could rewrite to remove Flask if they wanted.

Another example: if lots of people watch Squid Game, that doesn't indicate a centralisation of television programmes. No options are removed. People are just choosing to do a similar thing, but not in a way that centralises anything.


>It's impossible for a single point of failure not to be centralised.

Is a monoculture centralized? It doesn't have to be, and yet it can all fail due to a single fault.


There are so many websites using it that it is basically the same as being centralized.


So many sites using Flask across the web is centralized?


You're comparing apples to oranges. Flask is software whose instances all run independently of each other, while Cloudflare is SaaS where every site depends on the same service.

If Cloudflare goes down, so does a significant portion of the web. Flask cannot have an outage like Cloudflare's. All they can do is push a faulty update, and even then you can roll back or stay on the old version.


> If Cloudflare goes down, so does a significant portion of the web. Flask cannot have an outage like Cloudflare's. All they can do is push a faulty update, and even then you can roll back or stay on the old version.

This is what was on the tip of my tongue. I'd argue you're missing the portion of control in this whole discussion, and how much process you can place in front of a component change.

If I have a flask dependency, I have a lot of control over this dependency. If flask screws up badly, I have many options: I can update. I can not update. I can downgrade. In fact, I could fork flask internally and fix it on my own and either be a good citizen and open up a PR, or I could be something else. I can test all of this in any number of environments before it hits a customer, and even more stages before it hits all customers.

With Cloudflare - or any number of hosters as well, like AWS, Azure, Google Cloud - I have very little control. If I use Cloudflare as a CDN and Cloudflare goes down, I might not have the capacity at my upstream server to handle the load from all my customers, so I am down as long as Cloudflare is down. I wouldn't have the footprint necessary to replace AWS in our own private DCs, even if we pooled all spare capacity - and then I'd still have to find a way to exfil our data from a downed AWS. (Which, yes, we have a way to do, but it'll take long hours.)

And no matter how much I test, if my hoster fucks up, I'm immediately fucked as well, no matter what processes I might have. The only process around this would be provider independence, which is really expensive and a lot of effort even if you just keep a lukewarm standby.


You can multihost your static site; you don't have to only use Cloudflare. You can have more than one CDN. Nothing is centralising you or funnelling you somewhere you don't want to go.

And you can do these things in parallel, unlike the Flask example, which you pretty much have to commit to using solely.


If most sites were using Flask and forced to do broken updates, it would be a centralized problem.


BUT THE INTERNET IS SUPPOSED TO ROUTE AROUND THIS STUFF.

Remember when crypto promised a decentralized utopia of distributed systems where any interference wouldn't work, but slowly and surely it all congregated into a couple of oligarchical companies?

Surely by now the technology of the internet has matured enough that these conglomerates are going to do the same but with a bit less fraud involved.


WTF has crypto got to do with centralisation of the Internet? The Internet is well-decentralised, but if every Joe Blogg has to put their website behind Cloudflare, it's not the Internet's fault, let alone crypto (?).

Stop putting every-bloody-thing behind Cloudflare, and the problem solves itself. I don't know whether to laugh or cry when I read of someone on HN seriously saying that they need a CDN for their personal website, or that they really need to use AWS or GCP for that matter. We tech workers have lost the plot, and it's our fault it's all in the hands of the few.


I think they're referencing when "distributed" exchanges still went down when particular APIs or domains were unreachable. It ain't fault tolerant unless it's piracy >:D


https://en.wikipedia.org/wiki/Border_Gateway_Protocol

BGP was standardized in 1989 and has been in use since 1994. The technology of the internet has matured in many ways, and remains almost identical in many others. Sometimes Microsoft is right, and backwards compatibility is the killer feature.


The internet does work around issues at the inter-networking layer through BGP and similar protocols, though the same resilience is sadly absent at the higher layers.


I think you are confusing different decades. The early Internet was way more decentralized and democratic.

Cryptocurrencies promised many things but only wanted to replace oligarchical companies with oligarchical miners.


I think they're confusing layers. The internet is more than just http servers. The IP traffic is routed around problems all the time. If the HTTP server is down, that doesn't stop the packets from arriving there (having been routed around dozens of down links and broken routers at any given moment).


True.

But if they fetch a lot of URLs from Cloudflare and it takes 30 seconds to answer with a timeout instead of an HTTP 200 in under 20ms, then some architectural decisions that were sound in the latter case may make the whole system slow in the former one.


People have started testing network failures, but so often fail to test slow network failures. All of a sudden your code has 500x more pending open sockets, both in and out; memory spikes and it all goes to hell. Even if the fast-fail code path is indeed best effort.
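A small sketch of the failure mode (function name and parameters invented here): the version below bounds both the connect and every read with one deadline; drop the timeout and a server that accepts the connection but never answers will pin the socket, and its buffers, indefinitely.

```python
import socket

def fetch_head(host: str, port: int = 80, timeout: float = 3.0) -> bytes:
    # create_connection() bounds connect(), and the timeout stays set on
    # the returned socket, so every recv() below is bounded too. Without
    # it, a peer that accepts and then goes quiet is the "slow failure":
    # the request never errors, it just occupies a socket forever.
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(b"HEAD / HTTP/1.1\r\nHost: " + host.encode()
                  + b"\r\nConnection: close\r\n\r\n")
        chunks = []
        while True:
            chunk = s.recv(4096)   # raises TimeoutError if the peer stalls
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
```

The fast-fail case (connection refused) errors immediately either way; it's the accepted-but-stalled case that only a read timeout catches.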


Basically the old "I Love Lucy" chocolate factory sketch: https://youtu.be/AnHiAWlrYQc


I would love to see graphs showing "length of reply that eventually succeeded" - I suspect that most networks today, if you don't get a response in 5 seconds, you ain't never gonna get anything useful.

In other words, I wonder if going to fail fast would help the health of the Internet more than wait forever timeouts. Might reduce DDoS effects, too.


> When we plotted the data geographically and compared it to our total numbers broken out by region, there was a disproportionate increase in traffic from places like Southeast Asia, South America, Africa, and even remote regions of Siberia. Further investigation revealed that, in these places, the average page load time under Feather was over TWO MINUTES! This meant that a regular video page, at over a megabyte, was taking more than TWENTY MINUTES to load! This was the penalty incurred before the video stream even had a chance to show the first frame. Correspondingly, entire populations of people simply could not use YouTube because it took too long to see anything. Under Feather, despite it taking over two minutes to get to the first frame of video, watching a video actually became a real possibility. Over the week, word of Feather had spread in these areas and our numbers were completely skewed as a result. Large numbers of people who were previously unable to use YouTube before were suddenly able to.

https://blog.chriszacharias.com/page-weight-matters


Well, a proper system should reduce the timeout for a given domain the more often that timeout gets hit, and set it back to the default once the stats are good again.

But it's very complicated and costly to set up, so almost nobody does this.
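For illustration only (class name and thresholds invented here), the core of that idea fits in a few lines; the complicated and costly part is everything around it, i.e. sharing the stats across workers, decay, persistence, and tuning:

```python
from collections import defaultdict, deque

class AdaptiveTimeout:
    """Per-domain timeout: shrink while a domain keeps timing out,
    restore the default once recent requests look healthy again.
    (Toy sketch, not a production policy.)"""

    def __init__(self, default: float = 5.0, floor: float = 0.5, window: int = 20):
        self.default = default
        self.floor = floor
        self.timeouts = defaultdict(lambda: default)
        self.history = defaultdict(lambda: deque(maxlen=window))

    def timeout_for(self, domain: str) -> float:
        return self.timeouts[domain]

    def record(self, domain: str, ok: bool) -> None:
        self.history[domain].append(ok)
        if not ok:
            # halve the budget on each failure, down to a floor
            self.timeouts[domain] = max(self.floor, self.timeouts[domain] / 2)
        elif len(self.history[domain]) >= 5 and all(list(self.history[domain])[-5:]):
            # five good responses in a row: stats look good, reset
            self.timeouts[domain] = self.default
```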


If you're building complicated systems, you should probably reduce traffic to sites that are failing to respond, rather than continuing to send traffic but timing out faster. Depending on what stage in the process the request is failing, it might not make a big difference: in a typical HTTPS exchange the costs ramp the farther you go, i.e. processing a SYN < processing a ClientHello < processing a complex request. (If it's a simple request, processing the ClientHello is more expensive than the request, of course.)

If you send all the same traffic, and probably more because of retries with shorter and shorter timeouts, chances are you'll keep the system overloaded, never detect success, and never return to the default timeouts. Dropping most of the traffic and then turning it back on when the system recovers can lead to oscillation, where the system works just enough to drive more traffic that overloads it again, but at least you're getting some processing done.
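The "drop traffic, then probe" behaviour described above is essentially a circuit breaker. A minimal sketch (names and thresholds invented here, not any particular library's API):

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive failures, shed requests for
    `cool_off` seconds instead of retrying with ever-shorter timeouts,
    then let a single probe through to test for recovery."""

    def __init__(self, max_failures: int = 5, cool_off: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.cool_off = cool_off
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None == circuit closed (traffic flows)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cool_off:
            # half-open: allow one probe; a failure re-opens immediately
            self.opened_at = None
            self.failures = self.max_failures - 1
            return True
        return False                # shed the request outright

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
```

The half-open probe is what produces the oscillation the parent comment mentions; it trades a trickle of failed probes for automatic recovery.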


Well, you do need jitter, exponential backoff, caching, black- and whitelisting, a stats-based decision tree, etc. That's why it's a complicated and costly problem.

But if you are consuming a lot of API content, if you have crawlers, or if you provide features like "get article title/summary/image/thumbs", at some scale it's an important decision to make.


>I suspect that most networks today,

I guess that depends on the type of endpoint in that network. For a typical website, absolutely, yes, I'd agree. For an API endpoint serving requests over large data pools, a response that takes a few seconds to generate, yet doesn't time out entirely, would be acceptable.


I've been experiencing occasional slowdowns and issues on HN since last week. I would presume it's due to more load (with WWDC last week and now the Reddit blackout), but I don't really see this fully reflected in the comment or upvote counts on the main page (except the recent Vision Pro thread).


Is HN slow because everyone who was normally posting on reddit is now commenting on HN instead?


I think so. I'm trying to move my time over here as much as possible.


Same here. Even the subreddits that are still open, I'd feel dirty reading.


Time to upgrade from a t2.micro to a t3.medium, I suppose.


This is at the top of HN because Cloudflare is not delivering a product people pay them for.


An ex-admin of Reddit thinks the protest broke the internal cache layer's performance: https://tildes.net/~tech/163e/reddit_appears_to_be_down_duri...


That was my guess, not knowing anything about Reddit's architecture. The "this subreddit is private" message probably used to be really rare, so it probably does the auth check for every request. Now every Google result links to "this subreddit is private", and so traffic to that endpoint went up by orders of magnitude. The result is an outage.

If I were an SRE at Reddit, the day I got wind of the "we're making all subreddits private" thing, I'd double check that code-path to see what we were in for on Day 0. However, I am not.


Or maybe the SREs at Reddit decided this is their contribution to the protest: let it burn.


In this economy?


They don't have to let it burn to the ground, but they also don't have to bust their asses to allow a point to be made.


At their paygrade?

Looking at Meta/Google/Apple's layoff packages of 6 months of severance, and factoring in $170k-$389k annual income at Reddit (per levels.fyi), I would hope they have enough savings to live off of for months if not years, to enable them to protest, should they so desire.


6 months' severance is exceptional; 3 months is what you get if you've been there at least a few years and are in good standing.

And the twentysomething kids on $200k on the west coast won’t be saving that money (if the number of Teslas on the road in Redmond is anything to go by): they have no reason to believe they won’t make the same kind of TC at the next job they apply for.


At this income level, it's not worth losing my job in this economy, or even in a booming economy.


Humans spend the money they have available. All of it.


Whoa tildes are a reddit alternative? I thought it was all about, like, public access unix systems.


Huh?

It’s not quite an alternative. It’s good though and I’d recommend joining once there’s another invite wave.


Is there a known feasible way of receiving an invite at the present time?


tildes.net seems completely unrelated to the ones you are thinking of (the tildeverse)


Ah, thanks.


Anyone have an invite to share for Tildes?


Same here. Looking at the list of topics, it hits a lot of my interests and isn't 100% focused on tech. My contact details are on my website in my profile.


Looks like the email address on your keybase page is unreachable?


Sent!


Would love one too…


Can I get your email address?


tilde at maxg.io :)


Would love one as well.


Sent!


Any chance I could get one too? Email's on my website: https://picheta.me


FYI: I didn't see any notices about disruptions for Workers, but we were intermittently unable to use `wrangler publish`: https://i.imgur.com/LUAzoHQ.png


I'm downloading a large app update (MAMP Pro).

It's screaming along at 47 KB/sec.

I'll have it all in about two and a half hours.

Not sure if that is related.


Possibly third parties that Reddit/HN rely on are using Cloudflare?


Given the inaccurate and clickbaity title "Cloudflare is having Issues" (when in reality it's just a few services, not the main offerings Cloudflare is largely known for), I suspect a lot of people are upvoting without actually following the link. A comprehensive Cloudflare outage would be huge news. The truth here is not huge news, IMHO.


The statement “Cloudflare is having issues” is true as soon as count(issues) > 0. So what’s so wrong about the title?

And how do you know how many websites use R2 etc. so that you can jump to the conclusion that it’s not huge news?


> The statement “Cloudflare is having issues” is true as soon as count(issues) > 0.

Pedantically, it's true as soon as "count(issues) > 1"


Exactly. The headline isn't untrue, but it is misleading. Most people hear "Cloudflare is having issues" and assume Cloudflare as a whole, and this is not an unreasonable assumption.

Reductio ad absurdum can be used to demonstrate the logic flaw. Let's assume that one person using DynamoDB got a 500 error response from two successive API calls (so an error rate of ~4.0e-09). It would technically be true for the headline to say only "AWS is having issues." A headline like that is going to rocket to the top of HN, and it's not going to provide many people with useful information.

It's also silly. There's plenty of room in the headline to instead of "Cloudflare is having issues" to say "Cloudflare R2, Stream Live, and others are having issues."



