A Quantitative Information Flow Model for Attribute-Inference Attacks and Utility in Data Releases by Sampling


Statistical disclosure is a process that has been present in society for a long time, however the concern about privacy is relatively recent. The interest in protecting individual data increased considerably especially after the elaboration of regulations about data protection around the world, such as the General Data Protection Regulation (GDPR) in the European Union and the Lei Geral de Proteção de Dados (LGPD) in Brazil.

The effort in the scientific community to develop methods for the mitigation of privacy risks and to understand the trade-off between privacy and utility compose a large research area. However, mathematical models that explain formally this trade-off are, in some situations, misunderstood by data curators, i.e., entities that collect data from a population and adopt a certain policy to publish them can not understand what are the risks and benefits of that policy. In this sense, models and solutions that ensure that all parties involved are aware of the risks and benefits of each policy adopted are important for well informed decision-making.

As a first contribution of this work we propose a model that captures the vulnerability of publishing a sample from a population, in particular, the vulnerability of an attribute inference attack. We also describe the utility of the sample for data analysts who aim to infer the distribution of the values of an attribute in a population.

The model was developed using the framework of Quantitative Information Flow (QIF) that provides a mathematical apparatus to formally model systems as informational channels. We developed the model with the goal of being easily understandable by non experts and to be used by data curators when making decisions about how to publish their data. As a second contribution we provide closed formulas for prior and posterior vulnerabilities of attribute inference attack and for prior utility loss. The closed formulas are useful when quantifying vulnerabilities and utility losses in large datasets/samples.

Federal University of Minas Gerais