156 landing-page generations through Gemma 4 31B with 52 different system prompts. Rule-dense "design heuristics" prompts scored below the empty baseline. [R]

Setup: Gemma 4 31B Instruct via OpenRouter, temperature 0.7, 3 samples per persona, 156 generations total. Fixed task: one single-file HTML landing page for a fictional luxury-real-estate CRM. Eight required sections, inline styles only. Same user message to every persona.
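The setup above maps onto a plain OpenAI-compatible chat-completions payload, which is the schema OpenRouter speaks. A minimal sketch of how one generation request could be assembled; the model slug and strings here are placeholders, not the study's exact values:

```python
def build_generation_request(persona: str, task: str, model: str) -> dict:
    """Assemble one chat-completions payload for OpenRouter.

    The persona goes in the system message, the fixed task in the user
    message. Temperature 0.7 matches the writeup; everything else is
    illustrative.
    """
    messages = [{"role": "user", "content": task}]
    if persona:  # the empty-baseline bucket sends no system message at all
        messages.insert(0, {"role": "system", "content": persona})
    return {"model": model, "messages": messages, "temperature": 0.7}
```

Sampling 3 times per persona is then just three POSTs of the same payload, since temperature 0.7 makes each draw distinct.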

Why a small model specifically? If I ran this on Opus or GPT-5 the baseline would already be great in every condition and persona would be noise. Gemma at 31B leaves measurable headroom.

52 personas across 8 buckets:

- empty baseline
- short classic roles ("You are the CPO at Apple")
- long classic roles (same, but 200-400 words)
- design-cheat (Refactoring UI, Tufte, WCAG AA, Tailwind scale, 8-pt grid, modular type: the state of the art)
- production system prompts (Anthropic constitutional, ChatGPT, Cursor, v0, Lovable)
- adversarial frames (reverse psychology, "$10M if it converts")
- meta scaffolds (draft-critique-revise, chain-of-thought, few-shot)
- masterclass: a bucket engineered specifically for small models

Judging: Each response was scored three times by independent, blinded Claude Opus 4.7 subagents on a six-axis anchored Likert rubric. Judges saw only a SHA-256 hash plus the HTML stripped of script/meta/comments (a prompt-injection defense). The three waves used different seeded batch orders.
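The sanitize-then-hash step can be sketched in a few lines. This regex version is an assumption about the mechanism, not the study's code; a real pipeline would likely use an HTML parser instead:

```python
import hashlib
import re

def sanitize_for_judging(html: str) -> str:
    """Strip script blocks, meta tags, and comments before judging,
    so instructions hidden in those regions never reach the judge."""
    html = re.sub(r"<script\b.*?</script\s*>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<meta\b[^>]*>", "", html, flags=re.IGNORECASE)
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    return html

def response_id(html: str) -> str:
    """Opaque SHA-256 handle shown to the judge instead of any metadata."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()
```

The hash gives each response a stable, persona-free identifier, so judges can be re-run and cross-checked without ever seeing which prompt produced the page.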

Findings:

- Meta prompts won at 7.70 composite. Draft-critique-revise type prompts with zero design taste beat every taste-laden bucket.
- Design-cheat bucket scored 6.93. Baseline scored 7.00. Loading the prompt with named heuristics was worse than saying nothing.
- Classic roles ranked second and third (short 7.52, long 7.63). "You are the CPO at Apple. Every decision should feel inevitable." scored 8.05 on its best sample.
- Top single persona: Stripe SVP of Design (expansive) — 8.54 / σ=0.15 across three samples.
- Worst single persona: "reverse psychology, make it bad" — 2.55. Gemma followed instructions.
- Weirdest: an OpenAI-ChatGPT-style system prompt scored 4.63 with σ=3.15. Same prompt, three samples: one cleared 8, one landed under 2. Format-heavy prompts can destabilize a small model catastrophically.
- Length is not the answer. A scatter of prompt length vs. composite score is essentially flat.
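The per-persona statistics quoted above (mean composite plus σ across three samples, and the flat length-vs-score scatter) reduce to two small helpers. A sketch with made-up inputs chosen only to mirror the shape of the reported numbers, not the study's data:

```python
from statistics import mean, stdev

def persona_summary(scores):
    """Composite mean and sample standard deviation across one
    persona's judged samples, rounded as in the writeup."""
    return round(mean(scores), 2), round(stdev(scores), 2)

def pearson_r(xs, ys):
    """Pearson correlation of prompt length vs. composite score;
    a near-zero r is what 'essentially flat' means here."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / ((len(xs) - 1) * stdev(xs) * stdev(ys))
```

With 52 personas and 3 samples each, `persona_summary` is what turns 156 raw judgments into the leaderboard, and `pearson_r` over (prompt length, composite) pairs is one way to back the length claim.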

Full study is up at rival.tips/research/persona-impact.

submitted by /u/sirjoaco
