A technical, 100% local writeup on how I replicated and then surpassed the Secret Detection model from Wiz (and the challenges along the way) – including labeling an entire dataset with local AI

Hey everybody, I have a strong interest in offloading work to small, specialized models that I can parallelize - this lets me scale work significantly (plus, I am less dependent on proprietary APIs)

Some time ago, I saw a blog post from Wiz about fine-tuning Llama 3.2-1B for secret detection in code. They got 86% Precision and 82% Recall. I wanted to see if I can replicate (or beat) those numbers using purely local AI and produce a local specialized model.

After a couple of weekends of trying it out I managed to get a Llama 3.2-1B hitting 88% Precision and 84.4% Recall simultaneously!

I also benchmarked Qwen 3.5-2B and 4B - expectedly, they outperformed Llama 1B at the cost of more VRAM and longer inference time.

I’ve put together a full write-up with the training stats, examples, and a step-by-step breakdown of what I went through to hit these metrics. Warning: It's technical and pretty long, but I honestly think it's fun to read.

Link: Check out the full write-up here.

Here are some highlights:

I only sourced publicly available data. This wasn't enough so I used procedural generation to augment and improve my dataset. Labeling was done locally using Qwen3-Coder-Next (sorry Claude, you sit this one out).
Instead of just finding secrets, I trained the models to output structured JSON. Initially, every vanilla SLM I tested (Llama & Qwen) scored 0% on schema compliance, but I got them to 98-100% after training.
I made a somewhat embarresing mistake including a high entropy class which was detrimental to training, but I caught it and removed it eventually.
I discovered 4,500 of my "negative" samples actually contained real-world passwords (even though they don't seem real!). The model was literally being trained to ignore secrets. At this point I was already clearing the metrics set by Wiz, but fixing this improved the recall on passwords.

Would love to hear if anyone else is pursuing efficient 1B/3B finetunes for specialized tasks and about your stack!

AI Disclaimer: I write everything myself - this post, and the full writeup. Please point out any typos!

submitted by /u/Oatilis
[link] [comments]

Leave a Comment