This is an unhelpful rant, but it's been getting to me. I don't code. I don't care about Python. I don't know and don't care how agents work or what they do. I don't build websites, and I couldn't care less about GitHub integration.
I write. Something LLMs should theoretically be really focused on and decent at. I write a lot for my job, and I do a lot of creative writing. And no one seems to care about this anymore. It's notable that during the release of Gemma 4 - the one 'model family' people went to when it came to writing - almost none of the first few hundred comments of people trying it out even mentioned its writing ability (which btw is kinda mid, at least in my personal experience). It was, yet again, about coding and agents. Like every. damn. single. new. LLM. release. Of the last year and a half.
Coding and agents are the only things anyone seems to care about now. I get it: it's intensely benchmarkable, it has a right/wrong answer, it's easier to engineer, and it's highly profitable. It's no mystery why it's such a key focus. But it pisses me off. It shouldn't be the be-all and end-all of virtually all LLM discussion and hopes for their improvement.
More depressingly, nothing even remotely beats Claude when it comes to creative writing, and I have come to seethingly despise the company behind it. None of the thousands of local LLM finetunes for writing seem to actually instill a sense of character motivation tracking, coherency, and pacing to go with their writing style. Among proprietary LLMs, Gemini is a robot when it comes to writing, and so is GPT in my experience.
So when Anthropic hints at API cutoffs and people say 'yet another reason to go local' - go local to what? All local options are exceptionally underwhelming compared to Claude when it comes to writing. There's a hundred LLMs that are all great at python and agents, and there are functionally none that are great at writing.
And I mean actual writing - understanding a large text at scale (tens of thousands of words), and creatively producing continuations or branches or alternative chapters - not one-shotting a text output from 5 sentences of description. Even though that's basically all people seem to test. It's really all EQBench tests. It's quite easy to produce a passable text from a short prompt. You don't really need to understand or keep track of much. But all these LLMs fall apart when given a large text.
And sure, you can summarise your chapters or whatever. But the problem is that writing carries nuance through subtext and form, and you can't summarise that. Only Claude seems to get that implicitly. Claude is the only LLM you can give a 40,000-token chunk of fictional text to and have it continue in the same style, with a logical coherence that actually tracks character motivation and makes those characters do consistent, believable things given the specific circumstances they're in, all while holding onto implicit worldbuilding. You might say that this is way too hard for an LLM, but Claude can do it. Why can't other models?
The other big open LLMs - GLM4.6/4.7/5/5.1, DeepSeek, Kimi K2, etc. - will produce passable, even very nice prose, but the story is not good. The pacing is wrong. The characters' motivations are inconsistent; they do things they wouldn't realistically do because the preceding plot demands it. A character who was exasperated and angry with the main character for pursuing a futile endeavour will suddenly sit down with them to decipher a coded message, simply because the main character received it in the preceding chapter and their conflict hasn't been touched on for two chapters. Literally only Claude understands that this is not something that would happen.
So I sit and wait to eventually lose access to Claude, while no one seems to care about the creative writing capabilities of LLMs anymore.
Rant over. If anyone has local suggestions that can actually write well at that scale (working with ~50,000 tokens), let me know. Is it mostly a parameter-count thing, where no one has the money to fine-tune large models? And why is writing seemingly the only capability not readily replicated across SOTA models, when every other benchmark is?