Someone like Amit doesn't need money; by the time you get as highly placed at Google as he was, you will be wealthy. I believe that Uber has interesting engineering challenges, but I likewise have doubts about their profitability.
Yes, it's just a 10GbE Ethernet switch that can encapsulate the traffic in VXLAN headers, so that it can traverse east/west between any of thousands (millions?) of hypervisors without requiring traffic to hairpin to a gateway router and back. The logical networks all exist in an overlay network, so the customer VMs get L2/L3 isolation. But the underlying hypervisors know exactly which vNICs are running on each hypervisor in the cluster, so they can talk directly over a large connected underlay network at 10GbE (x2) line rate.
This is the standard way of distributing traffic in large datacenters. That way you get extremely fast, non-blocking line rate between any two physical hosts in the datacenter, and since the physical hosts know which VMs/containers are running on them, they can pass the traffic directly to the other host if VMs exist in the same L2 network, and even do virtual routing if the VMs exist across L3 boundaries - still a single east/west hop.
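To make the encapsulation concrete, here's a rough Python sketch of the 8-byte VXLAN header from RFC 7348 that gets prepended to the inner Ethernet frame (the VNI and frame bytes below are made up for illustration; in practice the result rides inside a UDP datagram to port 4789 between the hypervisors' VTEPs):

```python
import struct

VXLAN_UDP_PORT = 4789   # per RFC 7348; the outer UDP/IP headers are omitted here

def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the 8-byte VXLAN header to an inner Ethernet frame."""
    flags = 0x08   # "I" bit set: the VNI field is valid
    header = struct.pack(
        "!B3s3sB",
        flags,                      # flags
        b"\x00\x00\x00",            # reserved
        vni.to_bytes(3, "big"),     # 24-bit VXLAN Network Identifier
        0,                          # reserved
    )
    return header + inner_frame

# e.g. tag a (made-up) inner frame with tenant network VNI 5001
encapsulated = vxlan_encapsulate(b"\x00" * 64, vni=5001)
```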
Broadcom makes "switch on a chip" modules that will do VxLAN encapsulation and translation to VLAN or regular Ethernet frames. That chipset is available in lots of common 10/40/100 GbE switches from Arista/Juniper/White Box.
In a regular IP Fabric environment we would call this device a VTEP.
There is white label OEM gear (the Arista and Cisco gear is now just OEM with their firmware running on it), but unless you're Google or Facebook and can write your own firmware, chances are you're better off with an "enterprise" solution like Arista or Cisco who will give you support and fix bugs in the firmware for you.
(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an SRE at Google, and I'm oncall for this service.)
No.
Edit: expanding on this a little, it's not something that's been released so we can't talk about it. I don't think I can comment on "illumin8"s proposals other than to say that I'm pretty sure they don't work here.
Google's exact ToR (top of rack) switch code isn't available, but you can buy a switch from any number of network gear vendors (Arista, Cisco, Brocade, Juniper, HP, etc), that can do VXLAN encapsulation and send the traffic over a leaf/spine network that covers thousands of racks.
I can't imagine Google is building clusters at such an alarming rate that it would justify manufacturing its own silicon for edge deployment, which suggests that whatever commodity silicon is in the magic box can probably be found in a variety of vendor equipment wrapped in OpenFlow or similar.
I would caution that a two-phase commit is not always the best you can do. A two-phase commit is blocking: if a failure happens at the wrong moment, participants that have already voted can be left stuck holding their locks. A three-phase commit offers the same level of guarantee but without blocking.
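To illustrate the blocking problem, here's a toy Python sketch of a two-phase commit coordinator (the participant interface and the transport are made up for the example); if the coordinator dies between the two phases, participants that voted yes are stuck until it comes back:

```python
# Toy two-phase commit coordinator. Participants are assumed to expose
# prepare()/commit()/abort() methods; transport and recovery are omitted.
class TwoPhaseCoordinator:
    def __init__(self, participants):
        self.participants = participants

    def run(self, txn):
        # Phase 1: ask every participant to prepare (vote).
        votes = [p.prepare(txn) for p in self.participants]
        # Each participant locks its resources when it votes yes.

        # If the coordinator crashes right here, prepared participants cannot
        # safely commit or abort on their own -- they block, holding their
        # locks, until the coordinator recovers. This is the blocking problem.

        # Phase 2: commit if everyone voted yes, otherwise abort.
        if all(votes):
            for p in self.participants:
                p.commit(txn)
            return True
        for p in self.participants:
            p.abort(txn)
        return False
```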
SRE here. If you can handle your work nobody cares how much you are in the office. At all. However, unless you are insanely organized you won't be able to do this every single week--most successful googlers have several projects going on at once and there will be some times when you need to work hard on more than one.
It's common at Google to have a few projects cooking that you move between. One reason for this (though I doubt it's the sole reason) probably has to do with the mandatory peer review process for all code, which can cause delays in getting your code submitted if your reviewer's schedule doesn't line up well with your own. Instead of introducing a bubble into your pipeline when that happens, it's nice to have a side project you can hack on for a little while until you get your code review feedback.
How are deadlines managed in that environment? Sprint ends, release dates, or just "we will do it by Tuesday" all run up against peer review delays.
Is the frustration level a constant problem if reviews can take ... X longer than thought?
Well, you can always assign a different reviewer if your initial choice is taking too long. I usually try to check with my teammates to see who has time for a review before sending them my code, at least if it's something that I want to get submitted quickly. That ensures that I pick someone who will look at it sooner rather than later. Sometimes the specifics of the process require someone who is not on your immediate team to give their approval, in which case delays are more likely.
I'm an SRE, which means most of the code I write isn't directly user-facing, and thus isn't normally subject to hard external launch deadlines. That means I'm rarely rushing to push my changes through on a tight schedule. If there's a production emergency and I need to make an urgent change to avoid or mitigate an outage, there are escape hatches at our disposal to temporarily circumvent the code review system when no one is immediately available to do a review, but the need for that is pretty infrequent.
I rarely find it frustrating. On the contrary, I really appreciate Google's emphasis on code quality, even though it does come at the cost of some agility. I used to work at a company where the implementation of code reviews was generally resisted, and though we did get new code out the door faster, we ended up with some real maintenance nightmares as a result.
No, not TBR. I was referring to the fact that you can +2 your own CL, which allows it to be merged. Officially this is an "emergency" feature, as you say, but a dirty secret is that it's used routinely. I encountered it in both Android and Chrome.
I've seen lots of self-+2s under the following circumstances:
1. You are the only engineer on your team, or all other engineers on your team are out for an extended period of time.
2. All other engineers have very different skillsets, and are not capable of doing effective review. Think Rails engineer attempting to review a touchpad firmware fix.
3. You have a hard deadline (e.g. product demo or factory build) and you're desperately trying to get as much working as you can. You may even be in a location where coordination is difficult and Internet access is poor.
For SWEs working on, say, web services, you probably have at least a medium-sized team (4+ engineers) and a week's slip is no big deal. But not all software is like that at Google.
I'm not actually sure this works in... well, not Android and Chrome. I can TBR, but as far as I know I can't self-LGTM (and certainly not self-Approve).
It would be interesting if someone from Github could discuss why they chose to do this migration by taking the whole site offline and doing the migration all at once. Did anyone investigate if this could be done without taking the site offline?
Doing this online would have been very tricky while maintaining 100% consistency. We often perform major infrastructure changes without ever having to take the site offline; in this case, and at this time, it was unavoidable.
I feel 13 minutes of maintenance at 5am PST was a good trade off for the benefits we gained.
Can you go into more detail regarding the prohibitive consistency issues? How do you maintain consistency in steady-state (ie. not during migrations?) Also, how do you make the call as to whether to bring your site down vs. attempting a live migration?
I think it's a smart decision, given the nature of the product. An off time of 14 minutes very early on a Saturday morning is a price they were willing to pay to make this a one-time operation with no (actually reduced) risk of losing data consistency and other pitfalls that come with a live migration.
We mostly operate the OS X and Linux hosts using the same or similar tools; Ansible for automation, ssh / rsync / git for getting files onto hosts, and an automated network installer that fills the same niche as the one we use with Linux.
All of the core infrastructure is shared between OS X and Linux systems -- same DNS caches, NTP servers, etc. -- so there's no additional work there.
There are some things that are a little tougher to do on OS X. The userland utilities are different (and anything GPL is trapped on GPLv2 versions for all eternity, such as bash and rsync), and sometimes we have to work around that or build local packages.
Ansible doesn't really know about things like launchd processes or cron jobs, and OS X has nothing analogous to useradd, so we had to write some modules and implement branching in common playbooks to handle tasks in different ways on Linux and OS X.
Ultimately they run Unix, and once our environment is deployed onto a server you don't really notice the operational differences on a day to day basis.
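For what it's worth, those custom modules are just small Python programs. As a rough sketch (the argument names and the launchctl invocation here are simplified for illustration, not our actual module), something like this lets a playbook manage a launchd job roughly the way it would manage a service on Linux:

```python
#!/usr/bin/python
# Simplified sketch of a custom Ansible module for loading/unloading launchd jobs.
from ansible.module_utils.basic import AnsibleModule

def main():
    module = AnsibleModule(argument_spec=dict(
        plist=dict(required=True),
        state=dict(choices=['loaded', 'unloaded'], default='loaded'),
    ))
    action = 'load' if module.params['state'] == 'loaded' else 'unload'
    rc, out, err = module.run_command(['launchctl', action, module.params['plist']])
    if rc != 0:
        module.fail_json(msg=err)
    # A real module would check the current state first instead of always
    # reporting a change.
    module.exit_json(changed=True, msg=out)

if __name__ == '__main__':
    main()
```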
This is generally known as "device fingerprinting"--there are many ways to do it but they all involve probing for unique properties of a client via JS / Flash (listing installed fonts, drawing invisible characters and measuring via JS, etc.), then hashing them together to generate a unique ID for that user.
Some people think this practice violates users' privacy, and I'm one of them. This technology can be used to uniquely identify a user across multiple logins on the same site, or even multiple sites. It's quite widespread.
This paper[0] is mostly a survey of prominent DF providers and sites using this technology, and it's also a good primer on device fingerprinting techniques.
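As a crude illustration of the hashing step (the attribute names below are just examples, not any particular vendor's schema), the server-side piece can be as simple as:

```python
import hashlib
import json

def device_fingerprint(attributes: dict) -> str:
    """Hash a bag of client-reported properties into a stable identifier.

    The attributes might include the user agent, screen resolution, timezone,
    installed fonts, canvas-rendering measurements, etc.
    """
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

fp = device_fingerprint({
    "user_agent": "Mozilla/5.0 ...",
    "screen": "1920x1080",
    "timezone": "UTC-8",
    "fonts": ["Arial", "Helvetica", "Menlo"],
})
```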
> This is generally known as "device fingerprinting"
Is this definitely how it's achieved though?
I would presume a highly-skilled fraudster could just spin up a new VM, for instance, and evade detection that way.
Do we know if "regular" cookies alone are good enough for 90% of the lazy fraudsters?
Regarding using "device fingerprinting," can I collect some opinions from HN?
Specifically, if every user record created stores a fingerprint alongside it (which is only used to find account registrations from the same device) is that just as offensive as using fingerprinting to track anonymous sessions?
> I would presume a highly-skilled fraudster could just spin up a new VM, for instance, and evade detection that way.
From my experience building fraud detection systems at Eventbrite, most fraudsters are not that sophisticated -- fraudsters usually go for the lowest-hanging fruit, and as such look for systems to defraud that offer the highest payout for the lowest effort. Because there is always some level of uncertainty (getting detected, the credit card not working, etc.), fraudsters often favor techniques that let them try as many websites/cards as possible. This is especially true for Sift Science's customers, who tend to be smaller to mid-size companies; big companies for whom fraud detection is critical tend to have their own in-house solutions.
In addition this is usually only one signal -- ideally you want your algorithm to be able to detect first-time fraudsters too, so the other signals should be able to stand on their own.
One caveat though: the reason multiple accounts is a signal of fraud is that fraudsters tend to be repeat offenders, and will keep defrauding the same website if their previous attempts worked. But now that they're facing fraud detection algorithms that catch repeat offenders more easily, it's quite possible they will adapt their behavior.
This is a signal that will fade in strength over time, and one of the dangers of pooling together data from multiple websites as in this blog post (though hopefully this is taken into account in their algorithms) is that the strength of the signal may be skewed by the proportion of new users of their platform, who will have a higher share of unsophisticated fraudsters by virtue of not having had a fraud detection system previously.
This is why, whenever you are building a fraud detection algorithm (or any consumer-facing machine learning algorithm), it is very important to understand the story behind the data and not just look at the numbers.
Consider that if you collect device fingerprints, you can detect users who live together, because they are likely to share devices. If somebody bad then gets access to that data, they could do creepy things to your users.
This release contains a neat feature: you can now bind HAProxy to a specific FD opened by its parent process. This means that you can babysit your HAProxy processes underneath a parent process that opens ports and get hitless HAProxy restarts, which I've long desired.
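For anyone who hasn't seen the pattern, here's a rough Python sketch of what that enables (the child command is a placeholder, not HAProxy's actual invocation): the parent owns the listening socket, so a child can be restarted without the port ever becoming unbound.

```python
import socket
import subprocess

# The parent opens and owns the listening socket; it is never closed across
# child restarts, so the port stays bound the whole time.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 8080))
listener.listen(128)

def spawn_child():
    # "serve_from_fd.py" is a stand-in for whatever you run underneath
    # (e.g. a proxy process told to bind to this already-open fd).
    return subprocess.Popen(
        ["python3", "serve_from_fd.py", str(listener.fileno())],
        pass_fds=(listener.fileno(),),   # the bound fd stays open in the child
    )

child = spawn_child()
# Hitless restart: start the replacement first, then retire the old child.
replacement = spawn_child()
child.terminate()
child.wait()
```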
Have you looked at einhorn? We've been running HAProxy under einhorn for a while now, using something like https://gist.github.com/ebroder/36b2f4f3aa210b9d9f3d to translate between HAProxy's signalling mechanisms and einhorn's signalling mechanisms.
Er, yes, hi Evan! Cooper here. I believe either you or Andy originally pointed this interesting HAProxy restart behavior out to me, in the context of explaining why you wrote Einhorn.
This will also allow HAProxy to be brought into the fold if you're using Circus (http://circus.readthedocs.org/en/0.11.1/) for process monitoring. Probably not a big win in reliability, but I bet there are plenty of other things you can do with that (like parent mentioned)
Bud ( https://github.com/indutny/bud ) does support this kind of hot config reload and process restart. It just starts new worker processes on SIGHUP and lets the old workers keep running until all of their connections are closed.
This is basically done by moving the balancing between workers into the master, instead of calling `accept()` concurrently from the workers.
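As a toy sketch of that reload dance (the worker command and the drain signal here are placeholders, not bud's actual interface):

```python
import signal
import subprocess

workers = []

def start_worker():
    workers.append(subprocess.Popen(["python3", "worker.py"]))

def reload(signum, frame):
    old = list(workers)
    del workers[:]
    start_worker()                        # new worker picks up the new config
    for w in old:
        w.send_signal(signal.SIGTERM)     # old worker stops taking new
                                          # connections and exits once its
                                          # existing connections drain

signal.signal(signal.SIGHUP, reload)
start_worker()
while True:
    signal.pause()   # the master just supervises; in bud it also does the accept()
```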
No, not really; for a short period of time you have a state where the previously configured instance is no longer running and the future one is not running yet.
This is generally correct. In particular, when the HAProxy process is stopped/restarted there is a brief period during which the port is not bound by either process. (If the new process isn't able to get the socket when it boots it will sleep ~XXms, then try to bind/listen in a loop until it gets it or a retry threshold is hit.) During this time the kernel will reject incoming connections to the HAProxy port, so you are in danger of dropping incoming requests on the ground.
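Roughly, that retry loop looks like this (the interval and retry count below are placeholders, not HAProxy's actual values):

```python
import socket
import time

def bind_with_retry(addr, retries=50, interval=0.05):
    """Keep trying to bind/listen until the old process releases the port
    or the retry budget is exhausted. Interval/retries are illustrative."""
    for attempt in range(retries):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind(addr)
            sock.listen(128)
            return sock
        except OSError:
            sock.close()
            time.sleep(interval)   # port still held by the old process
    raise RuntimeError("could not bind %r after %d attempts" % (addr, retries))
```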
I feel I must point out that this relies on your clients to retry packets that iptables drops, so at the least they'll have a slower experience than they otherwise would. I believe that this will also break requests that are in flight when the reload happens, if they have sent partial data.
Yeah, it feels like a dirty, dirty hack. The way I understood it is that the requests in flight would retry the same way as the new connections when the first iptables rule is applied.
I think what will actually happen to requests in flight is:
- partial data received by old HAProxy is lost as old HAProxy exits
- new HAProxy comes online, binds to port, receives fd
- iptables rule removed. new HAProxy starts receiving new requests
- in-flight requests from the old HAProxy are reset by the kernel (TCP RST) when the old process exits, since nothing is left to read request data from the old fd or send response data.
So I think this is actually "worse" in some sense than the other retry behavior since it's not recovered inside the same TCP session but instead forces the client to open a new TCP session.