In my testing, I have only seen improvements in responses on specific “misguided attention” questions for models below 10B parameters.
Example: Gemma 4 E2B
Prompt: When a recipe says to separate the eggs, how far apart should I separate them?
Responses:
Thinking off: 1 to 2 inches.
Thinking on: The goal of separating eggs is not to create a specific physical distance, but to ensure that the yolks and whites are handled separately when cooking.
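If anyone wants to reproduce this kind of thinking-on vs. thinking-off comparison locally, here is a rough sketch against Ollama's `/api/chat` endpoint, which takes a `think` flag for thinking-capable models. The model name is a placeholder and your Ollama version needs to support the flag, so treat this as a starting point rather than a drop-in script:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint


def build_request(model: str, prompt: str, think: bool) -> dict:
    """Build the JSON payload for Ollama's /api/chat with the thinking trace toggled."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": think,   # True = emit a thinking trace, False = answer directly
        "stream": False,  # return one complete JSON response
    }


def ask(model: str, prompt: str, think: bool) -> str:
    """Send the prompt to a running Ollama server and return the final answer text."""
    payload = json.dumps(build_request(model, prompt, think)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]


# Usage (requires a running Ollama server; "qwen3:4b" is just an example model):
#   q = "When a recipe says to separate the eggs, how far apart should I separate them?"
#   print(ask("qwen3:4b", q, think=False))
#   print(ask("qwen3:4b", q, think=True))
```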
For larger models (14B to 120B), the improvement seems smaller: with more parameters, they intuitively understand what the user is asking.
Some models don't produce complex thinking traces, and I don't see much in the way of critical thinking. Most traces follow a standard template:
- What is the user asking?
- Draft a response.
- Are there any policy violations in the response?
- Write out the whole response.
- Done
Finally, the LLM responds to the user by repeating bullet point 4, just reformatted as Markdown.
In most situations this ends up being a waste of tokens and adds to the response time.
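One crude idea I've been toying with is a keyword heuristic that only routes "trap-shaped" prompts (trick phrasings, inline arithmetic, explicit multi-step requests) to thinking mode. This is purely illustrative: the `TRAP_MARKERS` patterns and the threshold are made up, not a tested classifier.

```python
import re

# Hypothetical markers for prompts that tend to trip up small models:
# literal readings of figurative phrases, arithmetic, and logic puzzles.
TRAP_MARKERS = [
    r"\bhow (far|many|much|long)\b",  # e.g. "how far apart should I separate them?"
    r"\bstep[- ]by[- ]step\b",        # explicit multi-step reasoning requests
    r"\bexactly\b",
    r"\d+\s*[-+*/%]\s*\d+",           # inline arithmetic like "12 * 7"
    r"\bif\b.*\bthen\b",              # conditional/logic puzzles
]


def might_need_reasoning(prompt: str, threshold: int = 1) -> bool:
    """Return True if the prompt matches enough trap markers to justify
    enabling the thinking trace. A sketch, not a real classifier."""
    hits = sum(bool(re.search(p, prompt, re.IGNORECASE)) for p in TRAP_MARKERS)
    return hits >= threshold
```

The point is just that routing can be cheap: run the heuristic first, and only pay the thinking-token cost when the prompt looks like one of these traps.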
Do you have any best practices or tips for identifying when reasoning will be helpful for a given prompt?