DP-CDA: An Algorithm for Enhanced Privacy Preservation in Dataset Synthesis Through Randomized Mixing
arXiv:2411.16121v3 Announce Type: replace
Abstract: In recent years, the growth of data across various sectors, including healthcare, security, finance, and education, has created significant opportunities for analysis and informed decision-making. However, these datasets often contain sensitive and personal information, which raises serious privacy concerns. It has been shown in multiple works that a person's identity is intertwined with their data, even if the data is anonymized. Due to this lack of separation between a person's identity and their information, the patterns associated with an individual's information can uniquely identify them. Protecting individual privacy is crucial, yet many existing machine learning and data publishing algorithms struggle with high-dimensional data, facing challenges related to the trade-off between computational efficiency and privacy. To address these challenges, we introduce an effective data publishing algorithm \emph{DP-CDA}. Our proposed algorithm generates synthetic data by randomly mixing the privacy-sensitive data in a class-specific manner and inducing carefully tuned randomness to ensure formal privacy guarantees. Our comprehensive privacy accounting shows that the proposed DP-CDA provides a stronger privacy guarantee compared to existing methods, allowing for better utility while maintaining a stricter level of privacy. To evaluate the effectiveness of DP-CDA, we examine the accuracy of predictive models trained on the synthetic data, which serves as a measure of dataset utility. Importantly, we identify an optimal order of mixing that balances privacy-utility trade-off. Our results indicate that synthetic datasets produced using the DP-CDA can achieve superior utility compared to those generated by conventional data publishing algorithms, even when subject to the same privacy requirements.