With apologies to those who didn't make this post, it seems you need to up your game.
Yesterday, Alexander Wales published a post entitled "Can an LLM have taste? Inkhaven Week 1, ranked by Claude". I found this very entertaining.
He took Claude, used it to compare a bunch of Inkhaven posts, ranked them, and provided us with this wonderful list of the top ten posts so far:
- Three Stones are Enough: The Case Against Leaves, in Particular, Anna Mattinger
- An open letter to 21 people I know who died, Layla Hughes
- endometrial biopsy, kaylee
- Softhead, macroraptor
- Every Lighthaven Writing Residency, Layla Hughes
- The largest manufacturer of feelings in human history, Natalie Cargill
- The one that loved me most, MLL
- I did it. I found the worst poem in the world., Natalie Cargill
- “Love, Mum” - What AIs can’t see about abuse, Natalie Cargill
- Lost Mesoamerican Technologies, Lost Futures
I thought this was great, so I decided to make my own version, but better. Rather than using pure ranking, I (or rather, Claude, who assisted me with the actual implementation of this hare-brained scheme) decided to use a Bradley-Terry model, which Claude informs me is rather like the Elo system used to rank chess players.
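For the curious, the idea fits in a few lines. The following is a minimal sketch, not the code we actually ran: a Bradley-Terry model gives each post a strength p, says post i beats post j with probability p_i / (p_i + p_j), and fits the strengths to the observed wins and losses. The function names, the `prior` smoothing term (a fractional win and loss against a dummy opponent of strength 1) and the conversion to σ units (standard deviations from the average post) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=500, prior=0.1):
    """Fit Bradley-Terry log-strengths from (winner, loser) pairs.

    Uses the standard minorise-maximise update. `prior` is a hypothetical
    smoothing term: every item gets `prior` wins and `prior` losses against
    a dummy opponent of fixed strength 1, which keeps scores finite for
    items that never win or never lose.
    """
    wins = defaultdict(float)
    games = defaultdict(lambda: defaultdict(float))
    for winner, loser in comparisons:
        wins[winner] += 1
        wins[loser] += 0  # register the loser even if it never wins
        games[winner][loser] += 1
        games[loser][winner] += 1

    strength = {item: 1.0 for item in wins}
    for _ in range(n_iter):
        updated = {}
        for i in strength:
            # Expected number of "contests" weighted by current strengths.
            denom = 2 * prior / (strength[i] + 1.0)
            for j, count in games[i].items():
                denom += count / (strength[i] + strength[j])
            updated[i] = (wins[i] + prior) / denom
        strength = updated
    return {item: math.log(s) for item, s in strength.items()}


def to_sigma(scores):
    """Express scores as standard deviations from the mean (an assumption
    about what a sigma column means, not a documented definition)."""
    values = list(scores.values())
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {item: (v - mean) / sd for item, v in scores.items()}
```

Feeding it a list such as `[("A", "B"), ("B", "C"), ("A", "C")]` returns one log-strength per post, with higher meaning better. The resemblance to Elo is that both turn a difference in ratings into a win probability.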
Using the Anthropic API, we gave Claude Opus 4.6 the following prompt (also written by Claude, but edited by me), including 8 posts for it to rank:
You are judging posts from Inkhaven, a writing residency where participants commit to publishing one blog post every day for 30 days. The residents are a mix of AI safety researchers, rationalists, fiction writers, and generally thoughtful people. The audience skews heavily rationalist — LessWrong regulars, EA-adjacent, people who take ideas seriously but also appreciate a good joke.
You will be shown 8 Inkhaven posts. Rank them by quality, from best to worst.
The question to ask yourself for each post: "Would a typical rationalist vote to read more of this sort of thing?" You're not rating a single post in isolation — you're judging whether the author, writing in this mode, should keep going. Insight, craft, honest thinking, and distinctive voice all count.
So does being funny — humour is a genuine virtue here, not a tiebreaker.
A few things to keep in mind:
- Do NOT be generous or encouraging. Predict the actual taste of the rationalist audience. Many of these posts will be mediocre and that's fine to say.
- Fiction, essays, rants, reviews, and technical posts are all on the same scale — judge each by whether it succeeds at what it's trying to do.
- Length is not quality. A tight 500 words can beat a bloated 3000.
- Weird and niche is fine, often good. Idiosyncrasy is often a feature, not a bug.
=== POST {i} ===
title + first 4000 chars of body.
Rank all 8 posts from best to worst. Think through your reasoning, then give your final answer as a comma-separated list of post numbers inside <answer> tags.
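Turning a prompt like this into data takes three small steps: fill in the numbered post sections, pull the ranking out of the `<answer>` tags, and break the ranking into pairwise results. Here is a hedged sketch of those steps. The function names are hypothetical, numbering posts from 1 is an assumption, and treating one ranking of 8 as 28 separate wins and losses is one common simplification rather than necessarily what we did.

```python
import re
from itertools import combinations


def format_posts(posts, max_chars=4000):
    """Build the numbered post sections from (title, body) tuples.

    Bodies are truncated to `max_chars` characters. Posts are numbered
    from 1, which is an assumption of this sketch.
    """
    sections = []
    for i, (title, body) in enumerate(posts, start=1):
        sections.append(f"=== POST {i} ===\n{title}\n\n{body[:max_chars]}")
    return "\n\n".join(sections)


def parse_ranking(reply, n_posts=8):
    """Return the ranking inside the last <answer> tag, or None if malformed."""
    matches = re.findall(r"<answer>(.*?)</answer>", reply, re.DOTALL)
    if not matches:
        return None
    try:
        ranking = [int(token) for token in matches[-1].split(",")]
    except ValueError:
        return None
    # Every post number must appear exactly once.
    if sorted(ranking) != list(range(1, n_posts + 1)):
        return None
    return ranking


def ranking_to_pairs(ranking):
    """Expand a best-to-worst ranking into (winner, loser) pairs."""
    return list(combinations(ranking, 2))
```

Rejecting malformed replies outright, rather than trying to repair them, keeps a garbled answer from quietly polluting the scores.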
We did five iterations:
- Get baseline estimates for each post
- Get more accurate estimates for posts liable to be in the top 10
- Get proper estimates for the posts we'd accidentally imported in the wrong format
- Try to push my post from 2nd to 1st place (instead, it ended up in 10th)
- Realise that we were missing a bunch of posts and add those in (it didn't change much)
$40 in burned API credits later, we got the following table:
# | Score | Author | Title |
1 | +2.77σ | Natalie Cargill | |
2 | +2.55σ | Alec Thompson | |
3 | +2.54σ | Avi | |
4 | +2.53σ | Aaron Gertler | |
5 | +2.49σ | Smitty | |
6 | +2.49σ | Natalie Cargill | |
7 | +2.49σ | viv | |
8 | +2.47σ | Alec Thompson | More Legal Systems Very Different From Ours 2: Nazi Private Law |
9 | +2.46σ | Anna Mattinger | Three Stones are Enough: The Case Against Leaves, in Particular |
10 | +2.45σ | Sean Herrington | |
11 | +2.39σ | Alec Thompson | |
12 | +2.35σ | Austen | |
13 | +2.33σ | Alec Thompson | |
14 | +2.28σ | Vishal Prasad | |
15 | +2.28σ | viv | |
16 | +2.21σ | Itsi Weinstock | |
17 | +2.20σ | Bill Jackson | |
18 | +2.16σ | Natalie Cargill | |
19 | +2.12σ | viv | |
20 | +1.99σ | Benjamin Sturgeon | Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math? |
Claude did a diligent bootstrapping check to ensure we had the right posts in the top 20, and found that post 19 was there 90% of the time, while Ben Sturgeon's Revisiting GSM-Symbolic hit a mere 22%. You're on thin ice, Ben.
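A bootstrap check of this kind can be sketched as follows: resample the comparisons with replacement, re-score, and count how often each post lands in the top k. This is an illustration under stated assumptions, not Claude's actual check. Resampling individual (winner, loser) pairs is my assumption, and `win_rate` is a deliberately crude stand-in scorer so that the example runs on its own; any function that maps comparisons to scores, such as a Bradley-Terry fit, can be passed in as `score_fn`.

```python
import random
from collections import Counter, defaultdict


def win_rate(comparisons):
    """Stand-in scorer: the fraction of its comparisons each item won."""
    wins, total = defaultdict(int), defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return {item: wins[item] / total[item] for item in total}


def top_k_frequency(comparisons, k, score_fn=win_rate, n_boot=1000, seed=0):
    """Share of bootstrap resamples in which each item finishes in the top k."""
    rng = random.Random(seed)
    items = {item for pair in comparisons for item in pair}
    hits = Counter()
    for _ in range(n_boot):
        # Draw a resample of the same size, with replacement.
        sample = rng.choices(comparisons, k=len(comparisons))
        scores = score_fn(sample)
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        hits.update(top)
    return {item: hits[item] / n_boot for item in items}
```

A post that shows up in the top k in only a small share of resamples is there mostly by luck of the draw.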
Averaging the scores of the individual posts also enables us to give a ranking of the authors. The top 20 authors at Inkhaven right now are... [drum roll]:
# | Score | Posts included | Author | Best post |
1 | +2.60σ | 10 | viv | |
2 | +2.52σ | 9 | Natalie Cargill | |
3 | +2.23σ | 9 | Alec Thompson | |
4 | +1.82σ | 9 | Aaron Gertler | |
5 | +1.56σ | 9 | Steven K | |
6 | +1.34σ | 9 | Katja Grace | |
7 | +1.06σ | 9 | capsuletime | |
8 | +1.05σ | 10 | Kevin Z Wu | |
9 | +1.03σ | 9 | Austen | |
10 | +0.82σ | 10 | Drew Schorno | |
11 | +0.68σ | 9 | Justis Mills (Writing Advisor) | |
12 | +0.58σ | 9 | Bill Jackson | |
13 | +0.49σ | 9 | Lawrence Chan | We're actually running out of benchmarks to upper bound AI capabilities |
14 | +0.44σ | 9 | Avi | |
15 | +0.43σ | 9 | Derek Razo | |
16 | +0.37σ | 9 | conq | |
17 | +0.32σ | 9 | Alicorn (Writing Advisor) | |
18 | +0.31σ | 6 | Remy | |
19 | +0.29σ | 7 | Layla Hughes | |
20 | +0.22σ | 9 | Henry Stanley | |
I should probably note that I filtered out anyone who hadn't published posts on at least two-thirds of the days, so Vishal Prasad (+1.89σ, 2 posts), Robert Mushkatblat (+1.41σ, 4 posts), A.G.G Liu (+0.7σ, 1 post), Justin Kuiper (+0.32σ, 1 post) and Georgia Ray (+0.3σ, 1 post) didn't make the cut on post count, despite having the quality.
Alexander Wales (-0.23σ, 4 posts), whose post inspired this one, is also, sadly, left out of the rankings. (Sorry).
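The author ranking itself is simple arithmetic: average each author's post scores, drop anyone below the minimum post count, and sort. A minimal sketch, with hypothetical names and with `min_posts` standing in for whatever two-thirds of the days works out to:

```python
from collections import defaultdict


def rank_authors(post_scores, min_posts):
    """Rank authors by mean post score.

    post_scores: list of (author, score) pairs, one per post.
    min_posts: authors with fewer posts than this are left out.
    Returns (author, mean_score, n_posts) tuples, best first.
    """
    by_author = defaultdict(list)
    for author, score in post_scores:
        by_author[author].append(score)
    rows = [
        (author, sum(scores) / len(scores), len(scores))
        for author, scores in by_author.items()
        if len(scores) >= min_posts
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

The cutoff matters because a mean over one or two posts is far noisier than a mean over nine or ten.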