Trying to train tiny LLMs on a length-constrained Reddit-post summarization task using GRPO on 3x Mac Minis – updates!

So, here's an update on my GRPO training for length-constrained Reddit-post summarization on 3x Mac minis - a new direction!

Gist: I've been testing how good a summarization model can get when trained to summarize using exactly 64 tokens!

So, once all the t-tests and evals were done for the LFM2.5-350M and Qwen2.5-0.5B-Instruct models with the length penalty and quality metrics (given below), I looked at the quality-metric results and saw that BLEU and ROUGE-L were particularly low when trained from scratch.

I hypothesized it's because of the length penalty I added so that the model outputs exactly 64 tokens, while the quality metrics themselves also penalize length variation (BLEU's brevity penalty, for example) - so the model gets punished twice for the same constraint.
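To see the conflict concretely, here is a minimal sketch. The length reward shape is hypothetical (not the exact one used in training), but the brevity penalty is BLEU's standard formula: BP = 1 if the candidate is at least as long as the reference, else exp(1 - r/c).

```python
import math

TARGET_LEN = 64  # hard length target for the summary, in tokens

def length_reward(num_tokens: int) -> float:
    """Hypothetical length reward: 1.0 at exactly 64 tokens,
    decaying linearly with the absolute deviation from the target."""
    deviation = abs(num_tokens - TARGET_LEN)
    return max(0.0, 1.0 - deviation / TARGET_LEN)

def bleu_brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """BLEU's brevity penalty: 1 if c >= r, else exp(1 - r/c)."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

# A 64-token summary of a post whose reference summary is 100 tokens
# gets full length reward, yet BLEU scales its score by a brevity
# penalty well below 1 - the two objectives pull in opposite directions.
print(length_reward(64))                          # 1.0
print(bleu_brevity_penalty(64, 100))
```

So whenever the reference summaries in the dataset run longer than 64 tokens, a model that nails the length constraint is structurally capped on BLEU, which would explain the low scores.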

Well, I had a faint idea for circumventing this issue: what if I started from an already fine-tuned version that outputs exactly 64 tokens? But the idea was like a flash - zoooom, and poof, gone!

That is when a Redditor pointed it out, and I went "hmm, well, I already have a checkpoint trained with only the length penalty!"

Now, here I could have just SFT'ed - as some of you may be thinking - to fine-tune the model to output exactly the required number of tokens, and yes, that's the next experiment, along with a DPO comparison!

So, currently, I have been training LFM2.5-350M and Qwen2.5-0.5B-Instruct for the same!

**Eval:**

LLM-as-a-Judge (gpt-5)

Used DeepEval to build a judge pipeline scoring each summary on 4 axes:

  • Faithfulness — no hallucinations vs. source
  • Coverage — key points captured
  • Conciseness — shorter, no redundancy
  • Clarity — readable on its own
**Distributed Training Setup:**

3x Mac Minis in a cluster running MLX.

One node drives training using GRPO; the other two generate rollouts via the vLLM-metal framework.

All of the work done using smolcluster.

Used a SyncPS architecture - a synchronous parameter server - with the master as the node where training happens and vLLM running on the worker nodes.
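The synchronous round can be sketched as: the master broadcasts its current weights, blocks until every worker has returned rollouts generated with those weights, then applies a single update. The names below are hypothetical (no relation to the smolcluster API):

```python
def sync_ps_round(master_weights, workers, update_fn):
    """One synchronous parameter-server round:
    1. broadcast the current weights to every worker,
    2. block until all workers return their rollouts,
    3. apply one update on the gathered rollouts.
    Because the step is synchronous, every rollout in the batch
    was generated from the same (current) policy weights."""
    rollouts = []
    for worker in workers:              # broadcast + gather (blocking)
        rollouts.extend(worker(master_weights))
    return update_fn(master_weights, rollouts)

# Toy demo: "weights" are a single float, each worker returns one
# rollout derived from them, and the "update" averages the rollouts.
workers = [lambda w: [w + 1.0], lambda w: [w + 2.0]]
new_w = sync_ps_round(0.0, workers, lambda w, rs: sum(rs) / len(rs))
print(new_w)  # 1.5
```

The synchronous design trades throughput (the master waits for the slowest worker) for simplicity and on-policy rollouts, which matters for GRPO since the advantages assume all group completions come from the same policy.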

https://preview.redd.it/dy01xrra4azg1.png?width=5034&format=png&auto=webp&s=9e9165673e639c049d66ef38a0d270244c81b391

https://preview.redd.it/a9paftra4azg1.png?width=5040&format=png&auto=webp&s=96165e9698f6e017f0274953523dd3192942b53f

https://preview.redd.it/11q79tra4azg1.png?width=5040&format=png&auto=webp&s=6e09e1c7db8bdfa7ea76d3af64c5b497a505a958

submitted by /u/East-Muffin-6472
