Netlify CEO here. I'll try to answer the questions from the thread so far:
Some of our customers are affected by an outage of Google's Load Balancer.
These customers are not taking advantage of our DNS management, or they are not using a DNS provider that supports CNAME flattening and are using their root domain name for their website (i.e., no www prefix).
While we don't recommend the setup, we do provide a single IP address to bind an A record to for customers that want it.
In general we run our edge infrastructure as a large multicloud setup spanning several different network providers, and offer two separate networks, one for free/self-serve customers that will get newer features faster and one for enterprise customers running mission critical projects where we guarantee very high uptime and reliability through formal SLAs.
The single IP mentioned above however corresponds to a Google Load Balancer, and they are unfortunately currently having an outage for all load balancers in the relevant region. Read more on https://status.cloud.google.com/
Again, while we generally don't recommend using the A record setup for anything mission critical, we are currently doing everything we possibly can to help enterprise customers that have chosen this setup change their configuration.
Really sorry for all the trouble this is causing for our users. A full RCA will be forthcoming.
> These customers are not taking advantage of our DNS management
I think I understand the point you are trying to make, that customers who are utilizing Netlify DNS Management are unaffected because reasons, but this is phrased in a way that implies that it is your users' fault for this downtime because they didn't choose to use your related service.
Full RCA with the steps the team has taken to improve this setup will be coming soon. The main issue with AWS's DNS solution, in this context, is that they don't support ALIAS records or similar techniques (CNAME flattening, etc.) for A records pointing to any external provider. That limits our options a lot in terms of what we can do, since anyone using this setup needs to point all their traffic to one or more fixed IP addresses.
Our current solution for the free/self-serve tier of Netlify has been to rely on Google's load balancer product to give people a stable IP pointing to a highly available solution. In light of recent issues, our team has set up a new permanent IP for A records (75.2.60.5) backed by a different solution, but due to the way DNS providers with no ALIAS record support work, it does require our customers to manually change their A records.
I totally get that moving DNS providers is a big deal and we want to give the best experience we can regardless of what provider you're on, but we have to work within the technical limitations of those providers, and it's the nature of things that we do have more options to deliver a completely seamless experience when we operate both the DNS and the edge layer for customers.
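For anyone who wants to confirm whether their apex domain still needs the manual change, here is a minimal Python sketch. It only assumes the replacement IP quoted above (75.2.60.5); the function names and the `example.com` placeholder are made up for illustration, and the answer comes from whatever resolver your machine uses, so cached results may lag behind the real record.

```python
import socket

# Replacement A record value quoted in this thread.
NEW_A_RECORD = "75.2.60.5"


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the local resolver reports for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def needs_a_record_update(resolved_ips, expected_ip=NEW_A_RECORD):
    """True unless the resolved addresses are exactly the expected IP."""
    return set(resolved_ips) != {expected_ip}


def report(domain):
    """Resolve domain and print whether its A record still needs changing."""
    ips = resolve_ipv4(domain)
    status = "needs update" if needs_a_record_update(ips) else "ok"
    print(f"{domain}: {ips} -> {status}")
```

Calling `report("example.com")` with your own apex domain prints the addresses your resolver currently sees. Note that a domain using flattened CNAMEs will legitimately resolve to other addresses, so "needs update" only matters for the plain A record setup.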
Route 53 General Manager here. Flattening of external provider CNAMEs has a number of availability and accuracy risks. Route 53 offers a 100% availability SLA, and we really mean it. We’ve heard over and over from customers that reliability is our most valuable feature. We can’t provide that same reliability when external queries are in the mix; if we query asynchronously then features such as geo-based routing don’t work as expected for customers. If we query synchronously, then latency and availability are impacted directly.
We do offer ALIAS records between Route 53 hosted zones, and this capability is open to providers such as Netlify. We’d be happy to have customers ALIAS to a hosted zone managed and updated by Netlify. It sounds like your IP addresses are relatively stable, keeping these in sync doesn’t sound like it would be a big deal, and would give you a lever you could pull to change your customer DNS quickly in an event such as this. You could also configure health checks on your own DNS records, which any customer ALIAS records that point to your DNS records in Route 53 would inherit.
If you’re interested in going this route, please contact me at alecpete <at> amazon <dot> com.
If each Route 53 POP is already close to the querying DNS client, then things like geo routing with cached answers might just work well enough in most cases? With each POP having its own cache.
Auto-refreshing the popular records in the background before the TTL expires to help smooth over any temporary issues?
Other big name DNS providers have ALIAS type records. I imagine according to the SLA, AWS Route 53 is still "available", even if it can't resolve a "target address record" (as the ANAME draft calls them) but Route 53 is still able to respond.
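To make the refresh-before-expiry idea concrete, here is a rough Python sketch of a per-POP cache that re-resolves a flattening target shortly before its TTL runs out and falls back to the last known answer if the upstream lookup fails. Everything here is hypothetical (the class name, the `resolve` callable, the TTL numbers), and it says nothing about how Route 53 or any other provider actually works. For simplicity the refresh happens inline during a lookup rather than in a real background task.

```python
import time


class FlatteningCache:
    """Toy cache for flattened CNAME targets with refresh-ahead and serve-stale."""

    def __init__(self, resolve, ttl=60.0, refresh_ahead=10.0, clock=time.monotonic):
        self._resolve = resolve              # hypothetical: target name -> list of IPs
        self._ttl = ttl                      # seconds an answer is considered valid
        self._refresh_ahead = refresh_ahead  # start refreshing this long before expiry
        self._clock = clock
        self._entries = {}                   # target -> (ips, expires_at)

    def lookup(self, target):
        now = self._clock()
        entry = self._entries.get(target)
        # Still comfortably inside the TTL: answer from cache, no upstream query.
        if entry is not None and now < entry[1] - self._refresh_ahead:
            return entry[0]
        try:
            ips = self._resolve(target)
        except OSError:
            # Upstream is unavailable: keep serving the last known answer if any.
            if entry is not None:
                return entry[0]
            raise
        self._entries[target] = (ips, now + self._ttl)
        return ips
```

The trade-off raised above still applies: a stale answer keeps the name resolving, but it can be wrong if the target really did move, and a cached answer ignores where the end client is located.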
Phrasing can always be better but the point is that there's a way to map your DNS to Netlify which is risky and Netlify hasn't made the aggressive decision of blocking it. They outline in their docs all the reasons why you shouldn't do it, provide instructions for how to avoid it and also offer (but do not require) a hosted DNS setup which avoids this pitfall by design.
Some folks still choose to use this way, some have no other choice for various reasons and some don't care/comprehend the potential pitfalls. I do believe most users avoid using a root domain name for their website.
As someone who is a little clueless about network infrastructure: if I own "dwrodri.com", and I'm not running a bunch of other services which need to point to this domain, is there any reason why I wouldn't have my root domain pointed to my personal website?
I would personally imagine that any individual or SOHO business hosting their website on GitHub/GitLab would just buy "MomAndPopShop.com" and point it there. I guess I don't know off the top of my head how many of those sorts of places on the web still exist...
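The problem is not that they're pointing their apex domain to a personal website; the problem is that they have a CNAME record in place for their apex domain, which is not actually allowed per the DNS standards: a CNAME cannot coexist with other records at the same name, and the apex always carries SOA and NS records.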
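As a small illustration of that rule, here is a Python sketch that scans a list of records and flags any name where a CNAME sits alongside other record types. The record list and function name are invented for the example, and the check is simplified: it ignores the DNSSEC record types that are permitted next to a CNAME.

```python
from collections import defaultdict


def find_cname_conflicts(records):
    """Return names where a CNAME coexists with any other record type.

    records is an iterable of (name, record_type) pairs.
    Simplification: DNSSEC types that may accompany a CNAME are not exempted.
    """
    types_by_name = defaultdict(set)
    for name, rtype in records:
        # Normalise case and the optional trailing dot so names compare equal.
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


# Hypothetical zone: the apex has the mandatory SOA and NS plus a CNAME.
zone = [
    ("example.com.", "SOA"),
    ("example.com.", "NS"),
    ("example.com.", "CNAME"),
    ("www.example.com.", "CNAME"),
]
conflicts = find_cname_conflicts(zone)
```

Here `conflicts` contains only `example.com`, since `www` has nothing but its CNAME. This is why providers offer ALIAS records or CNAME flattening: they answer with A records at the apex instead of publishing a real CNAME there.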
This should not be the case; if you'd like, Netlify's Support team will be happy to review your settings to help discover why it didn't help you out (start from https://netlify.com/support) and ensure that you are "futureproofed"!
I switched to using your DNS to resolve this issue, but https://js.la is still busted and because I'm using your DNS, I can't manually set the A record to go to the workaround IP address.
"These customers are not taking advantage of our DNS management"
You're right. I'm using Cloudflare's DNS. I trust them more than I trust Netlify and that's just a function of their size vs Netlify's size. This response needed better wording.
Depending on your config, another DNS related issue with Netlify is the way NS1.com (their vendor) handles domain names. A domain can only be added to one NS1 account. So if Netlify adds to their account internally, you can't use NS1 and vice versa.
Honestly, "not taking advantage of our DNS management" is a garbage response. We use AWS for our DNS management. If you offer a configuration, you should support it fully.
Our sites have been down for 3 hours now, and you're blaming someone else? We have 5 properties on Netlify now and will have 0 this time next week.
Yes, it is our fault for believing Netlify had contingency plans as hosting is their core business. We're fixing this mistake now so that our customers don't have the same experience.
Nobody is telling parent's customers how to feel. But the OP suggests that Netlify customers should be faulted for choosing the wrong setup. Broken trust goes all the way down the chain, which is why the middle links have every reason to get ticked off.
The difference is that Netlify communicated the risks to its customers, something other parts of the chain apparently did not do, in addition to not evaluating the risks presented to them by Netlify.
Did you read the docs [1] before writing this? Putting a "(recommended)" on one branch of configuration instructions isn't the same as saying that the other option has a single point of failure. Also, people on both sides of a service don't have the same responsibilities - that's the whole point of the service.
Communicating about risks and communicating about outages are both hard, and every company has both. I'm actually a happy (though impacted) Netlify customer. But it's completely bizarre to me to try to invalidate this customer's complaint.
Yes, I’ve visited that page before today. I admit my familiarity with these DNS setups may have made the tradeoff jump out at me. No problem invalidating the complaint.
I'm not sure about your organization's setup with Netlify, but isn't the whole point of Serverless to be... "serverless"? I could migrate twice the number of properties you have to another provider in less than 3 hours...
I get your frustration but maybe cut them some slack. If anything is mission critical, you should have had a backup plan in case Netlify, Vercel, Cloudflare, or something else goes down.
We use(d) Netlify for the frontend. I agree, our mistake was believing Netlify could be used for more than toy websites and took care of backup plans for us. Clearly they do not.
It's not just migrating the front-end if they're also using other functionality like Netlify functions, forms, authentication, etc. Netlify is not just static file hosting.
This has been the third major outage for Netlify in the last few weeks.
I like the company, they have good people on their team, and their interface and functionality are great (deploy previews are so nice!).
But this is probably the last straw, as the static portion of our company's website has been down for 45 minutes now.
Fortunately, the beauty of a static site is they're quite easy to host anywhere.
We're already on AWS, and it's easy enough to set up CloudFront. It won't be _quite_ as quick to deploy but it will probably rarely if ever break. Guess that's my task for the day :(
It sounds like it does handle apex domains, but only if you're using Netlify DNS or a provider which supports CNAME flattening. Assuming the potential problems with not doing so are disclosed during setup (not sure if they were) that actually seems pretty reasonable to me.
I run the Netlify Support team, and this statement from @michaelmior is correct: apex domains are served using a redundant, global CDN if you use Netlify's DNS hosting, or Flattened CNAMEs from Cloudflare.
The main advantage that AWS will have for us over anything else is that, since AWS already manages our DNS, we are going to be able to offer our visitors the best performance by using geo-specific IP addresses.
The static site in question for us lives at the apex record (mywebsite.com), so it's generally not possible for other providers to do this without having them manage our entire DNS infrastructure, which we aren't willing to do.
In fact I think this is part of why we've had so many issues with Netlify. It's clear their preferred way to host apex domain sites is to manage the DNS completely.
I think Cloudflare can do it. They give you the option to set A records for your domain, in your own DNS.
Cloudflare runs an Anycast[0] network with multiple peers, so even though you're using static IPs, the traffic will still get routed to Cloudflare's nearest PoP, and Pages is served by their edge network, so your site will be served from the location nearest to your customer. All without DNS shenanigans.
This got me today. Probably could be characterized as the classic case of: Company says a certain use case is unsupported, but tries hard to accommodate users who are stuck with the unsupported use case, so they hack up a decent workaround under the hood. Then the technically unsupported use case blows up, so they then have to scramble to support it with a quick workaround...for the workaround.
The workaround worked. I think at this point it makes me more likely to keep using Netlify. I love the product. And I think I love the support they gave today for an unsupported, un-recommended feature.
Update on the Netlify Status page [0] -- TLDR anyone experiencing this issue should point their apex domain to 75.2.60.5
---
Full announcement:
Our team have created a new load balancer instance which is not associated with the upstream provider who is currently experiencing issues. Please update A record values for your site(s) bare domain to 75.2.60.5 to mitigate against this outage.
---
Their documentation page [1] now includes the same IP.
thanks for the TLDR - i pointed it over and can verify that it fixes the problem. annoying to wait out the caching on mobile devices though, i am not sure how to clear DNS cache on mobile but am not too bothered.
"We have identified the issue and it is attributed to an upstream provider."
The upstream issue is probably at Google:
"We are experiencing an issue with L4 load balancers in us-west1-c. Multiple managed services relying on LB and located in this zone might be affected."
A quick infrastructure service (from what I understand.) If you are building a JS single page app and want to be able to deploy it and some backing cloud functions without really worrying about CDNs, gateways, etc. Git push, code runs tests, code is deployed, done.
My startup has also suffered from the recent Netlify outages with our main landing page.
Last time I already did the research for potential alternatives:
- Cloudflare Pages is now available in public beta
- even more interesting seemed this offering by PerfOps to put my CDN behind a Load Balancer that can monitor uptime and dynamically shift traffic between multiple CDN sources: https://perfops.net/flexbalancer
What do you think?
- it seems like the multi cloud approach to CDN
- but at the same time I'll have a problem if this Load Balancer fails (single point of failure)
This sounds crazy to me. Besides the obvious superfluous network/layer hops, complexity and points of failure, it would also partition the cache, right? So working against the very thing CDNs optimize for.
I wish Netlify all the best! In the meantime, I just hopped on to Cloudflare and saw their Pages product is in public beta. Seems to work the same as Netlify for static pages, just tried it out for my personal site and it worked great! I was already using Cloudflare as my CDN and to manage DNS, it's actually really nice to have my entire website configuration live there.
oof, I got bit by this issue this morning. if you're using cloudflare, set your domain's apex (`@`) as a CNAME pointing to the default subdomain (sitename.netlify.app) and use CNAME Flattening. It's the A record pointing to the CDN IP address that's broken.
Has anyone else noticed erratic response times in recent months? My web vitals score sometimes dips heavily because "response time from server" (or whatever it's called).
Anyone can recommend another place where I can host? (except Vercel, which has similar results)
So they implemented their own CDN-ish thing on top of GCP without doing anycast and serve-stale, and they have a non-trivial number of non-mom-and-pop customers?!
I know Vercel uses AWS Lambdas behind the scenes to process web requests at least. I'd assume caching is also handled through CloudFront by default. The place I currently work uses Fastly for caching and Vercel for hosting and it's definitely caused some issues (and much finger pointing on both of their sides) when one of those services makes a breaking change.
Perhaps they're using CloudFront and providing an end user CDN on top of it. Both are partners with AWS. Both were multicloud, and it seems, aren't anymore.
Cloudflare is already on the way to building out all of the features Vercel has which is exciting. Eventually their Pages product (static hosting) will integrate Workers [1] for serverside APIs.
Shameless plug, I can suggest StaticDeploy (https://staticdeploy.io/) as an open source, self-hosted alternative to Netlify, which can give you a similar deployment workflow.
It's definitely possible to just host directly on S3/CloudFront, but StaticDeploy sets you up quickly with a workflow and a dedicated management interface.
Disclaimer: I'm the main developer of the project.
Honestly, don't do this. Netlify is having a bad day and it's not fun for them. The great wheel of karma turns around slowly and one day it'll be your turn to have a bad day.
Submit StaticDeploy to HN some other day and tell us about it. Sounds cool.
Thanks for pointing this out, I admit I did not consider their point of view (the plug was shameless, but in the self-promotion-is-inherently-shameful sense), and I agree it's in bad taste.