Use differential privacy in Azure Machine Learning (preview)
Learn how to apply differential privacy best practices to Azure Machine Learning models by using the SmartNoise Python open-source libraries.
Differential privacy is the gold-standard definition of privacy. Systems that adhere to this definition of privacy provide strong assurances against a wide range of data reconstruction and reidentification attacks, including attacks by adversaries who possess auxiliary information. Learn more about how differential privacy works.
Prerequisites
- If you don't have an Azure subscription, create a trial subscription before you begin. Try the trial subscription of Azure Machine Learning today.
- Python 3
Install SmartNoise Python libraries
Standalone installation
The libraries are designed to work from distributed Spark clusters, and can be installed just like any other package.
The instructions below assume that your python
and pip
commands are mapped to python3
and pip3
.
Use pip to install the SmartNoise Python packages.
pip install opendp-smartnoise
To verify that the packages are installed, launch a Python prompt and type:
import opendp.smartnoise.core
import opendp.smartnoise.sql
If the imports succeed, the libraries are installed and ready to use.
Docker image installation
You can also use SmartNoise packages with Docker.
Pull the opendp/smartnoise
image to use the libraries inside a Docker container that includes Spark, Jupyter, and sample code.
Note
The following example pulls a public container image from Docker Hub. We recommend that you authenticate with your Docker Hub account (docker login
) first instead of making an anonymous pull request. To improve reliability when using public content, import and manage the image in a private Azure container registry. Learn more about working with public images.
docker pull opendp/smartnoise:privacy
Once you've pulled the image, launch the Jupyter server:
docker run --rm -p 8989:8989 --name smartnoise-run opendp/smartnoise:privacy
This starts a Jupyter server at port 8989
on your localhost
, with password pass@word99
. Assuming you used the command line above to start the container with name smartnoise-privacy
, you can open a bash terminal in the Jupyter server by running:
docker exec -it smartnoise-run bash
The Docker instance clears all state on shutdown, so you will lose any notebooks you create in the running instance. To remedy this, you can bind mount a local folder to the container when you launch it:
docker run --rm -p 8989:8989 --name smartnoise-run --mount type=bind,source=/Users/your_name/my-notebooks,target=/home/privacy/my-notebooks opendp/smartnoise:privacy
Any notebooks you create under the my-notebooks folder will be stored in your local filesystem.
Perform data analysis
To prepare a differentially private release, you need to choose a data source, a statistic, and some privacy parameters, indicating the level of privacy protection.
This sample references the California Public Use Microdata (PUMS), representing anonymized records of citizen demographics:
import os
import sys
import numpy as np
import opendp.smartnoise.core as sn
data_path = os.path.join('.', 'data', 'PUMS_california_demographics_1000', 'data.csv')
var_names = ["age", "sex", "educ", "race", "income", "married", "pid"]
In this example, we compute the mean and the variance of the age. We use a total epsilon
of 1.0 (epsilon is our privacy parameter, spreading our privacy budget across the two quantities we want to compute. Learn more about privacy metrics.
with sn.Analysis() as analysis:
# load data
data = sn.Dataset(path = data_path, column_names = var_names)
# get mean of age
age_mean = sn.dp_mean(data = sn.cast(data['age'], type="FLOAT"),
privacy_usage = {'epsilon': .65},
data_lower = 0.,
data_upper = 100.,
data_n = 1000
)
# get variance of age
age_var = sn.dp_variance(data = sn.cast(data['age'], type="FLOAT"),
privacy_usage = {'epsilon': .35},
data_lower = 0.,
data_upper = 100.,
data_n = 1000
)
analysis.release()
print("DP mean of age: {0}".format(age_mean.value))
print("DP variance of age: {0}".format(age_var.value))
print("Privacy usage: {0}".format(analysis.privacy_usage))
The results look something like those below:
DP mean of age: 44.55598845931517
DP variance of age: 231.79044646429134
Privacy usage: approximate {
epsilon: 1.0
}
There are some important things to note about this example. First, the Analysis
object represents a data processing graph. In this example, the mean and variance are computed from the same source node. However, you can include more complex expressions that combine inputs with outputs in arbitrary ways.
The analysis graph includes data_upper
and data_lower
metadata, specifying the lower and upper bounds for ages. These values are used to precisely calibrate the noise to ensure differential privacy. These values are also used in some handling of outliers or missing values.
Finally, the analysis graph keeps track of the total privacy budget spent.
You can use the library to compose more complex analysis graphs, with several mechanisms, statistics, and utility functions:
Statistics | Mechanisms | Utilities |
---|---|---|
Count | Gaussian | Cast |
Histogram | Geometric | Clamping |
Mean | Laplace | Digitize |
Quantiles | Filter | |
Sum | Imputation | |
Variance/Covariance | Transform |
See the data analysis notebook for more details.
Approximate utility of differentially private releases
Because differential privacy operates by calibrating noise, the utility of releases may vary depending on the privacy risk. Generally, the noise needed to protect each individual becomes negligible as sample sizes grow large, but overwhelm the result for releases that target a single individual. Analysts can review the accuracy information for a release to determine how useful the release is:
with sn.Analysis() as analysis:
# load data
data = sn.Dataset(path = data_path, column_names = var_names)
# get mean of age
age_mean = sn.dp_mean(data = sn.cast(data['age'], type="FLOAT"),
privacy_usage = {'epsilon': .65},
data_lower = 0.,
data_upper = 100.,
data_n = 1000
)
analysis.release()
print("Age accuracy is: {0}".format(age_mean.get_accuracy(0.05)))
The result of that operation should look similar to that below:
Age accuracy is: 0.2995732273553991
This example computes the mean as above, and uses the get_accuracy
function to request accuracy at alpha
of 0.05. An alpha
of 0.05 represents a 95% interval, in that released value will fall within the reported accuracy bounds about 95% of the time. In this example, the reported accuracy is 0.3, which means the released value will be within an interval of width 0.6, about 95% of the time. It is not correct to think of this value as an error bar, since the released value will fall outside the reported accuracy range at the rate specified by alpha
, and values outside the range may be outside in either direction.
Analysts may query get_accuracy
for different values of alpha
to get narrower or wider confidence intervals, without incurring additional privacy cost.
Generate a histogram
The built-in dp_histogram
function creates differentially private histograms over any of the following data types:
- A continuous variable, where the set of numbers has to be divided into bins
- A boolean or dichotomous variable, that can only take on two values
- A categorical variable, where there are distinct categories enumerated as strings
Here is an example of an Analysis
specifying bins for a continuous variable histogram:
income_edges = list(range(0, 100000, 10000))
with sn.Analysis() as analysis:
data = sn.Dataset(path = data_path, column_names = var_names)
income_histogram = sn.dp_histogram(
sn.cast(data['income'], type='int', lower=0, upper=100),
edges = income_edges,
upper = 1000,
null_value = 150,
privacy_usage = {'epsilon': 0.5}
)
Because the individuals are disjointly partitioned among histogram bins, the privacy cost is incurred only once per histogram, even if the histogram includes many bins.
For more on histograms, see the histograms notebook.
Generate a covariance matrix
SmartNoise offers three different functionalities with its dp_covariance
function:
- Covariance between two vectors
- Covariance matrix of a matrix
- Cross-covariance matrix of a pair of matrices
Here is an example of computing a scalar covariance:
with sn.Analysis() as analysis:
wn_data = sn.Dataset(path = data_path, column_names = var_names)
age_income_cov_scalar = sn.dp_covariance(
left = sn.cast(wn_data['age'],
type = "FLOAT"),
right = sn.cast(wn_data['income'],
type = "FLOAT"),
privacy_usage = {'epsilon': 1.0},
left_lower = 0.,
left_upper = 100.,
left_n = 1000,
right_lower = 0.,
right_upper = 500_000.,
right_n = 1000)
For more information, see the covariance notebook
Next Steps
- Explore SmartNoise sample notebooks.