This was a really interesting paper, but there's one big thing they didn't try: inference-time temperature changes based on the fork/lock distinction.
Maybe I'll try that myself, because it feels like it could be a great source of improvements. It would be really useful to see adaptive per-token sampling as an additional decode-only baseline.
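To make the idea concrete, here's a minimal sketch of what I mean by decode-only adaptive temperature (this is my own illustration, not from the paper). I'm assuming the fork/lock signal can be approximated by the entropy of the model's next-token distribution; the threshold and the two temperatures are made-up values:

```python
import numpy as np

def sample_adaptive(logits, fork_temp=1.0, lock_temp=0.3,
                    entropy_thresh=1.0, rng=None):
    """Sample a token id, using a higher temperature at high-entropy
    ("fork") steps and a lower one at low-entropy ("lock") steps.
    Returns (token_id, temperature_used)."""
    rng = rng or np.random.default_rng(0)
    # Softmax at T=1 to estimate this step's entropy.
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    # Fork/lock decision via a simple (assumed) entropy threshold.
    temp = fork_temp if entropy > entropy_thresh else lock_temp
    # Re-softmax at the chosen temperature and sample.
    zt = (logits / temp) - (logits / temp).max()
    pt = np.exp(zt) / np.exp(zt).sum()
    return int(rng.choice(len(logits), p=pt)), temp

# A sharply peaked ("lock"-like) step gets the low temperature.
tok, temp = sample_adaptive(np.array([8.0, 0.1, 0.05, 0.0]))
```

The point is that this needs no retraining at all, just a per-step branch in the decode loop, which is why it seems like such a cheap baseline to add.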
Is this some kind of calibration, then? I'd expect the probabilities to adjust automatically during training, so that in "lock" mode, for example, syntax-breaking tokens end up with such low probability that they wouldn't be picked even at a higher temperature.