I’ve been testing something recently and it’s starting to mess with how I think about “correct answers.”
Same prompt.
Same model.
Same temperature and settings.
But the outputs don’t just vary a little.
Sometimes they take completely different reasoning paths, entirely different routes to the answer.
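If you want to reproduce it, here’s a minimal sketch (assuming the OpenAI Python client with an API key in the environment; the model name, prompt, and temperature are placeholders, not my exact setup):

```python
# Sample the same prompt N times under identical settings and compare outputs.
# Assumes `pip install openai` and OPENAI_API_KEY set; model/prompt are placeholders.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost? Think step by step, then write 'Final answer: ...'."
)

outputs = []
for _ in range(5):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model
        temperature=0.7,       # identical settings on every run
        messages=[{"role": "user", "content": PROMPT}],
    )
    outputs.append(resp.choices[0].message.content)

# Eyeball the divergence: same prompt, same settings, different reasoning paths.
for i, out in enumerate(outputs):
    print(f"--- run {i + 1} ---\n{out}\n")
```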
And here’s the strange part:
Sometimes the final answer is still exactly the same.
If different runs can take completely different paths,
but still land on the same answer,
what exactly are we calling “correct”?
Is it:
- actual understanding?
- just one of many possible paths landing in the same place?
- or something closer to luck than we’d like to admit?
If the path changes every time, even under the same setup:
- can we really call it reliable?
- does a single “accuracy” number still mean much?
- or are we just seeing different routes occasionally converge? (a rough way to measure that below)
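Interestingly, the self-consistency literature (Wang et al. 2022, arXiv:2203.11171) treats that convergence as signal rather than luck: sample several reasoning paths and majority-vote the final answers. A crude sketch of measuring agreement across runs (the “Final answer:” extraction regex is an assumption tied to the prompt in the sketch above):

```python
# Majority-vote the final answers across sampled outputs (self-consistency style).
# The extraction regex assumes the prompt asks for a "Final answer: ..." line.
import re
from collections import Counter

def extract_answer(text: str) -> str | None:
    m = re.search(r"final answer:\s*(.+)", text, re.IGNORECASE)
    return m.group(1).strip() if m else None

def answer_agreement(outputs: list[str]) -> tuple[str | None, float]:
    answers = [a for a in (extract_answer(o) for o in outputs) if a is not None]
    if not answers:
        return None, 0.0
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

# With the `outputs` list from the earlier sketch:
# winner, rate = answer_agreement(outputs)
# print(f"majority answer: {winner!r}, agreement: {rate:.0%}")
```

High agreement across genuinely different paths is at least a different kind of evidence than one run happening to be “right”.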
Curious if others have noticed this, and how you think about it.