So, with this project I wanted to see whether quality summarization under a hard length constraint (e.g. only 64 tokens) can be done by tiny LLMs using GRPO! I trained two variants of this task:
To check summarization quality, I ran an LLM-as-a-Judge eval using DeepEval. The metrics are:
The results are attached, and the final one is as follows:
t-test ranking for the other rewards: summary table
All the code and wandb charts are in the comments! Setup: 3x Mac Minis in a cluster running MLX. One node drives GRPO training; the other two push rollouts via the vLLM-metal framework. All of the work was done using smolcluster.com. The cluster uses SyncPS, a synchronous parameter-server architecture: training happens on the master node, and vLLM runs on the worker nodes. Eval: LLM-as-a-Judge (gpt-5)
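For readers unfamiliar with the pattern, here is a minimal sketch of a SyncPS loop. All names are hypothetical stand-ins, not the actual smolcluster/MLX/vLLM API: the master holds the weights, workers roll out from a frozen snapshot, and the master applies one update per synchronous round.

```python
import random

class Master:
    """Parameter server: owns the weights, applies GRPO updates."""

    def __init__(self):
        self.params = {"version": 0}  # stand-in for model weights

    def snapshot(self):
        # Frozen copy broadcast to workers at the start of each round.
        return dict(self.params)

    def update(self, rollouts):
        # Stand-in for a GRPO step on the gathered rollouts.
        self.params["version"] += 1


class Worker:
    """Rollout node: generates completions from the latest snapshot."""

    def __init__(self, seed):
        self.rng = random.Random(seed)

    def rollout(self, params):
        # Stand-in for vLLM generation with the snapshot's weights.
        return {"version": params["version"], "reward": self.rng.random()}


def train(steps=3, n_workers=2):
    master = Master()
    workers = [Worker(seed=i) for i in range(n_workers)]
    for _ in range(steps):
        snap = master.snapshot()                       # broadcast weights
        rollouts = [w.rollout(snap) for w in workers]  # synchronous gather
        master.update(rollouts)                        # single update per round
    return master.params["version"]
```

The synchronous part is the barrier in the loop: the master waits for every worker's rollouts before updating, so no worker ever generates from stale weights.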
The composite score is the mean of the scores above.
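Concretely, with hypothetical judge scores (the metric names below are illustrative, not necessarily the DeepEval metrics used):

```python
# Hypothetical per-metric judge scores for one summary.
scores = {"coherence": 0.8, "faithfulness": 0.9, "conciseness": 0.7}

# Composite = unweighted mean of the individual judge scores.
composite = sum(scores.values()) / len(scores)  # ~0.8
```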