Training LFM-2.5-350M on Reddit post summarization with GRPO on my 3x Mac Minis — final evals and t-test evals are here


So, with this project I wanted to see whether tiny LLMs can do quality, length-constrained summarization (64 tokens max) when trained with GRPO!

https://preview.redd.it/6f3tou9xhixg1.png?width=2816&format=png&auto=webp&s=c0b11ea7c387c1e84e1ad2a9c7039630c2802025

I trained two variants for this task:

  • using just the length penalty
  • using a single quality reward, or a combination of quality rewards, plus the length penalty

I ran an LLM-as-a-judge eval to check summarization quality, using DeepEval. The axes are:

  • Conciseness
  • Coverage
  • Clarity
  • Faithfulness

The results are attached, and the final scores are as follows:

  • with quality (ROUGE-L + METEOR) + length penalty rewards: 2.7/4 (wins again!)
  • with just length penalty: 2.23/4

t-test rankings for the other reward configurations:

Summary Table

Reward Configuration          Composite  Faithfulness  Coverage  Conciseness  Clarity  Pass Rate
length-quality-meteor-rouge   2.769      0.832         0.511     0.659        0.767    44.3%
length-quality-bleu-rouge     2.732      0.810         0.502     0.650        0.770    39.1%
length-quality-meteor-bleu    2.664      0.792         0.468     0.648        0.756    38.3%
length-quality-rouge-l        2.555      0.725         0.415     0.637        0.778    32.4%
length-quality-meteor         2.484      0.721         0.427     0.625        0.711
length-quality-bleu           2.400      0.680         0.399     0.577        0.744    26.9%
length-only (baseline)        2.416      0.678         0.407     0.592        0.739    30.7%

Evaluated on a 200-example test sample from the smoltldr dataset. Baseline: length penalty only.
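For what it's worth, here's a minimal sketch of how such a comparison can be run with scipy. I'm assuming a paired t-test over per-example composite scores; the score arrays below are random placeholders, not the real eval data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder arrays standing in for the 200 per-example composite
# scores (out of 4) of two reward configurations on the same test set.
scores_meteor_rouge = rng.normal(2.77, 0.6, 200).clip(0, 4)
scores_length_only = rng.normal(2.42, 0.6, 200).clip(0, 4)

# Paired t-test: the same 200 posts are summarized under both configs,
# so comparing per-example score differences is the natural test.
t_stat, p_value = stats.ttest_rel(scores_meteor_rouge, scores_length_only)
print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
```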

All the code and wandb charts are in the comments!

Setup: 3x Mac Minis in a cluster running MLX.

One node drives GRPO training; the other two generate rollouts via the vLLM-metal framework. All of the work was done using smolcluster.com.

I used a SyncPS architecture (synchronous parameter server): training happens on the master node, and vLLM runs on the worker nodes.
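To make the topology concrete, here's a rough sketch of one SyncPS step. Every helper name here (policy_weights, load_weights, generate_rollouts, grpo_step) is a hypothetical stand-in, not smolcluster's or vLLM-metal's actual API:

```python
# Hypothetical SyncPS loop; helper names are illustrative stand-ins,
# not real smolcluster / vLLM-metal APIs.
def train_syncps(master, workers, num_steps: int, group_size: int = 8):
    for step in range(num_steps):
        # 1. Synchronous broadcast: master pushes the current policy
        #    weights to both inference workers before new rollouts start.
        weights = master.policy_weights()
        for worker in workers:
            worker.load_weights(weights)

        # 2. Workers sample a group of completions per prompt; GRPO
        #    needs groups to compute group-relative advantages.
        rollouts = []
        for worker in workers:
            rollouts.extend(worker.generate_rollouts(group_size=group_size))

        # 3. Master scores the rollouts with the reward functions and
        #    takes one GRPO gradient step; workers idle until the next
        #    weight broadcast (that's the "synchronous" part).
        master.grpo_step(rollouts)
```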

Eval:

LLM-as-a-Judge (gpt-5)

  • Used DeepEval to build a judge pipeline scoring each summary on 4 axes:

  • Faithfulness — no hallucinations vs. source
  • Coverage — key points captured
  • Conciseness — shorter, no redundancy
  • Clarity — readable on its own

The composite score is the sum of the four axis scores (each in [0, 1]), which is why it's reported out of 4.
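A minimal sketch of what such a judge pipeline can look like with DeepEval's GEval metric. The criteria strings are my paraphrase of the four axes, not the exact prompts used:

```python
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# One GEval metric per axis; each yields a 0-1 score from the judge.
axes = {
    "Faithfulness": "The summary must not claim anything unsupported by the source post.",
    "Coverage": "The summary must capture the key points of the source post.",
    "Conciseness": "The summary must be short and contain no redundancy.",
    "Clarity": "The summary must be readable on its own.",
}
metrics = [
    GEval(
        name=name,
        criteria=criteria,
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        model="gpt-5",  # judge model used in the post
    )
    for name, criteria in axes.items()
]

case = LLMTestCase(
    input="<the Reddit post>",              # placeholder source text
    actual_output="<the 64-token summary>", # placeholder model output
)
evaluate(test_cases=[case], metrics=metrics)
```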

  • Reward system

length_penalty: basically -abs(response_length - MAX_LENGTH)
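As a sketch, using the 64-token budget from above:

```python
MAX_LENGTH = 64  # target summary length in tokens

def length_penalty(num_response_tokens: int) -> float:
    # Zero reward exactly at the target length, increasingly negative
    # the further the summary lands from 64 tokens in either direction.
    return -abs(num_response_tokens - MAX_LENGTH)
```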

  • quality_rewards:

ROUGE-L only cares about the longest common subsequence — it misses synonyms and paraphrases entirely.

METEOR handles both: it aligns tokens with synonym matching via WordNet and balances precision + recall with a chunk-order penalty.

BLEU, on the other hand, focuses on n-gram precision with a brevity penalty.
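A sketch of those three metrics as reward functions, using the rouge-score and nltk packages. The post doesn't say which implementations or combination weights were used, so treat these as illustrative stand-ins:

```python
# pip install rouge-score nltk; METEOR also needs: nltk.download("wordnet")
from rouge_score import rouge_scorer
from nltk.translate.meteor_score import meteor_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l_reward(reference: str, summary: str) -> float:
    # F-measure over the longest common subsequence.
    return _rouge.score(reference, summary)["rougeL"].fmeasure

def meteor_reward(reference: str, summary: str) -> float:
    # WordNet-based synonym alignment plus a chunk-order penalty.
    return meteor_score([reference.split()], summary.split())

def bleu_reward(reference: str, summary: str) -> float:
    # n-gram precision with a brevity penalty; smoothing keeps short
    # summaries from collapsing to zero.
    return sentence_bleu(
        [reference.split()],
        summary.split(),
        smoothing_function=SmoothingFunction().method1,
    )

def combined_reward(reference: str, summary: str,
                    num_tokens: int, max_len: int = 64) -> float:
    # Illustrative combination for the winning config: METEOR + ROUGE-L
    # plus the length penalty. The post doesn't give exact weights, so
    # this just sums them unweighted.
    quality = meteor_reward(reference, summary) + rouge_l_reward(reference, summary)
    return quality - abs(num_tokens - max_len)
```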

https://preview.redd.it/0qdfrw3yhixg1.png?width=3540&format=png&auto=webp&s=e0b57364ceff3fc9302c13f21f907eea0d66ed5a

https://preview.redd.it/3d8cakdyhixg1.png?width=3568&format=png&auto=webp&s=b2f4516137d4b3b2798e5d6c2d118c3f7401dde9

https://preview.redd.it/sq6hgniyhixg1.png?width=5120&format=png&auto=webp&s=aac7ef8a3575821908430178b7b2084afb9b1006

https://preview.redd.it/bq9ep4myhixg1.png?width=3578&format=png&auto=webp&s=08d0c2025d7f5a7fbb33e9fadb5fa774c098fafb

