Why are we actually sampling reasoning and output the same way?
I've started to notice that my usual setup doesn't work as well in other languages as it does in English – the model sometimes makes grammar mistakes and generates genuine garbage. Its reasoning stayed in English, and I preferred to leave it that way.
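Nothing forces the two phases to share one set of sampling parameters, though. Below is a minimal sketch of what I mean, using Hugging Face transformers: sample the reasoning loosely, stop at the `</think>` delimiter, then continue the same sequence with tighter sampling for the answer the user actually reads. The model name is a placeholder, the temperatures are made up, and I'm assuming the tokenizer encodes `</think>` as a single token – treat it as an illustration, not a recipe.

```python
# Sketch: two-phase sampling, split at the </think> delimiter.
# Assumptions: placeholder model name, made-up temperatures, and a
# tokenizer that encodes </think> as a single token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-reasoning-model"  # placeholder, not a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Antworte auf Deutsch: was ist ein Monoid?"}],
    add_generation_prompt=True,
    tokenize=False,
)
# The chat template already adds special tokens, so don't add them twice.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

# Phase 1: sample the reasoning loosely. Nobody reads these tokens, so a
# higher temperature is cheap; stop as soon as </think> is emitted.
think_end = tokenizer.convert_tokens_to_ids("</think>")
with_reasoning = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    eos_token_id=think_end,  # treat the delimiter as this phase's stop token
    max_new_tokens=1024,
)

# Phase 2: continue from the full sequence (prompt + reasoning) with tighter
# sampling for the user-facing answer, where grammar mistakes actually show.
full = model.generate(
    with_reasoning,
    attention_mask=torch.ones_like(with_reasoning),
    do_sample=True,
    temperature=0.3,
    max_new_tokens=512,
)
answer = full[0][with_reasoning.shape[1]:]  # keep only the answer tokens
print(tokenizer.decode(answer, skip_special_tokens=True))
```

The same split should work with any serving stack that supports stop strings plus continued completions; the mechanics matter less than the observation that reasoning tokens and answer tokens tolerate sampling noise very differently.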