The original paper is quite deceptive and hard to understand, IMHO. Following it means jumping between several different figures and matching up shapes across them, in addition to guessing at what the unlabeled inputs are.
Just a few more labels, making the implicit explicit, would make it far more intelligible. Plus, last time I went through it, I'm pretty sure there was either a swap in the order of the three inputs between different figures, or it was incorrectly diagrammed.
https://arxiv.org/abs/1706.03762