OpenAI’s Privacy Filter vs GLiNER on 600 PII samples

Both models are open-weight, run on a local CPU workstation, and detect PII in text. Quick rundown of what I found.

GLiNER large-v2.1 is ~300M parameters and zero-shot: you pass entity types as plain-text strings at inference time.

openai/privacy-filter is 1.5B parameters total, but only 50M are active per forward pass thanks to a sparse mixture-of-experts (MoE) architecture.

In practice on CPU, openai/privacy-filter ran ~2.8 samples/sec vs ~1.1 for GLiNER large.

The eval was 400 English + 200 multilingual samples from ai4privacy/pii-masking-400k, across six PII categories.

The catch: openai/privacy-filter uses GPT-style BPE tokenization, which prepends a space to most tokens. So when you decode token offsets back to character spans, everything is off by one character. Score with strict exact match and openai/privacy-filter looks awful. Score with boundary overlap (any character overlap, correct label) and it actually wins overall.
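A minimal sketch of the two matching criteria, assuming predictions and gold spans are (start, end, label) tuples with end-exclusive character offsets (the function names here are mine, not from the eval pipeline):

```python
def strict_match(pred, gold):
    """Exact span boundaries and label must agree."""
    return pred == gold

def boundary_match(pred, gold):
    """Any character overlap plus the correct label counts as a hit."""
    (ps, pe, pl), (gs, ge, gl) = pred, gold
    return pl == gl and ps < ge and gs < pe

# A one-character shift from BPE space-prepending fails strict
# matching but still counts under boundary overlap:
gold = (10, 25, "EMAIL")
pred = (9, 24, "EMAIL")   # shifted left by one character
print(strict_match(pred, gold))    # False
print(boundary_match(pred, gold))  # True
```

This is exactly why the two scoring modes diverge so sharply for one model and barely at all for the other.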

English macro F1:

Model                  Strict  Boundary  Partial
GLiNER large-v2.1      0.367   0.416     0.392
openai/privacy-filter  0.155   0.498     0.326

The 0.34 strict-to-boundary gap for openai/privacy-filter is entirely the tokenizer offset, not real misses.

Per category on boundary matching, openai/privacy-filter wins PERSON, EMAIL, PHONE, and DATE; GLiNER wins ADDRESS. EMAIL is essentially solved (0.987 English, 1.000 multilingual).

GLiNER threshold tuning matters. The default of 0.5 leaves F1 on the table: 0.7 was best for this dataset, ~8 F1 points better than the default.
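A sweep like this is cheap because you only score once per threshold. A minimal sketch, assuming each prediction carries a confidence score and F1 is computed with strict matching (the toy gold/pred data below is illustrative, not from the eval):

```python
def f1_at_threshold(preds, gold, threshold):
    """Micro-F1 after dropping predictions below the confidence threshold."""
    kept = [(s, e, l) for s, e, l, score in preds if score >= threshold]
    tp = sum(1 for p in kept if p in gold)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 4, "PERSON"), (10, 25, "EMAIL")]
preds = [(0, 4, "PERSON", 0.95),
         (10, 25, "EMAIL", 0.72),
         (30, 34, "DATE", 0.55)]   # low-confidence false positive

for t in (0.5, 0.7, 0.9):
    print(t, round(f1_at_threshold(preds, gold, t), 3))
```

On this toy data, 0.7 filters the spurious low-confidence span without losing a true positive, which is the same mechanism behind the ~8-point gain on the real dataset.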

If you want recall above all (e.g. redaction, where misses are unacceptable), pick GLiNER. If you want precision and faster CPU throughput, pick openai/privacy-filter. If you need custom entity types beyond the eight that openai/privacy-filter ships with, GLiNER's zero-shot interface is the only option.

One annoyance worth knowing: openai/privacy-filter requires trust_remote_code=True and the dev branch of transformers. The model class hasn't landed in a stable release yet.

Full numbers, multilingual breakdown, the threshold sweep, all the code in comments below 👇

Disclosure: I work on Neo AI Engineer, and the eval pipeline was built and executed by Neo from a single prompt. I reviewed the methodology and validated the results before publishing. The numbers and findings stand on their own.

submitted by /u/gvij
