My bet: they use formal methods (like an interpreter running code to validate, or a proof checker) in a loop.
This would explain: a) their improvement being mostly on the "reasoning, math, code" categories and b) why they wouldn't want to show this (its not really a model, but an "agent").
I think it could be some of both. By giving access to the chain of thought one would able to see what the agent is correcting/adjusting for, allowing you to compile a library of vectors the agent is aware of and gaps which could be exploitable. Why expose the fact that you’re working to correct for a certain political bias and not another?
This would explain: a) their improvement being mostly on the "reasoning, math, code" categories and b) why they wouldn't want to show this (its not really a model, but an "agent").