SynBench: A Benchmark for Differentially Private Text Generation
arXiv:2509.14594v2 Announce Type: replace
Abstract: Synthetic text generation with Differential Privacy (DP) guarantees has emerged as a principled approach to sharing sensitive datasets across institutional and regulatory boundaries while bounding the risks of re-identification and membership inference. LLM-based methods deliver promising results; however, comparisons between them are hampered by differing evaluation setups and "private" datasets, potential pre-training contamination is not taken into account, and the claimed guarantees are not verified with DP audits. To advance this field, we introduce a unified evaluation framework with standardised utility and fidelity metrics and privacy audits, encompassing nine curated datasets that capture domain-specific complexities such as technical jargon, long-context dependencies, and specialised document structures. In a large-scale empirical study, we benchmark state-of-the-art LLM-based DP text generators of varying sizes (1B to 8B parameters). Our results indicate that DP synthetic text generation remains an unsolved challenge, with quality deteriorating as the private datasets deviate further from the generators' pre-training corpora. Our novel synthetic text membership inference attack (MIA) explains this observation: synthetic data quality is overestimated when LLMs have been pre-trained, without DP, on portions of the "private" data to be generated. Finally, our work provides the first quantitative evidence that this "public pre-training, private generation" paradigm invalidates the guaranteed privacy bounds for real-world private datasets.
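The abstract does not spell out how the privacy audits are performed; a common approach in the auditing literature is to derive an empirical lower bound on epsilon from a membership inference attack's error rates, using the hypothesis-testing characterisation of (epsilon, delta)-DP (Kairouz et al., 2015). The sketch below illustrates that generic technique only; the function name and the example error rates are hypothetical and not taken from the paper.

```python
import math


def empirical_eps_lower_bound(fpr: float, fnr: float, delta: float = 1e-5) -> float:
    """Empirical lower bound on epsilon implied by an attack's error rates.

    Any (eps, delta)-DP mechanism constrains every distinguishing attack:
        FPR + e^eps * FNR >= 1 - delta
        FNR + e^eps * FPR >= 1 - delta
    Rearranging gives two lower bounds on eps; we return the tighter one.
    (This is the point-estimate version; a rigorous audit would also add
    confidence intervals over the observed FPR/FNR.)
    """
    candidates = []
    if fnr > 0 and (1 - delta - fpr) > 0:
        candidates.append(math.log((1 - delta - fpr) / fnr))
    if fpr > 0 and (1 - delta - fnr) > 0:
        candidates.append(math.log((1 - delta - fnr) / fpr))
    return max(candidates) if candidates else float("inf")


# Hypothetical example: an attack that identifies members with a 60%
# true-positive rate (FNR = 0.40) at a 5% false-positive rate implies
# the mechanism cannot satisfy DP for any epsilon below ~2.48.
print(empirical_eps_lower_bound(fpr=0.05, fnr=0.40))  # ~2.48
```

If such an empirically derived lower bound exceeds the epsilon claimed for the generation pipeline, the stated guarantee is violated in practice, which is the kind of evidence the abstract refers to for the "public pre-training, private generation" paradigm.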