Sep 26 2019 - 03:18 PM
Exploring Google's differential privacy library

Data privacy, or information privacy, is a branch of data security concerned with the proper handling of data: consent, notice, and regulatory obligations. To protect data privacy, companies are now required to determine which data privacy acts and laws affect their users. For instance, they must know where the data originated (country and state), what personally identifiable information it might contain, and how it will be used. Data privacy regulations include GDPR (General Data Protection Regulation), HIPAA (Health Insurance Portability and Accountability Act), GLBA (Gramm-Leach-Bliley Act), and CCPA (California Consumer Privacy Act).

In addition to regulations, technologies can be applied to avoid the disclosure of individual information. Google recently announced an open-source differential privacy library, which contains a C++ implementation of ε-differential privacy algorithms. The GitHub repository continues to be updated with improvements.

In the repository, Google also provides a use case story.

There are around 200 animals at Farmer Fred's zoo. Every day, Farmer Fred feeds the animals as many carrots as they desire. The animals record how many carrots they have eaten per day. At the end of each day, Farmer Fred often asks an aggregate question about how many carrots everyone ate. For example, he wants to know how many carrots are eaten each day, so he knows how many to order the next day. The input data set looks like this: {'Aardvark':1, 'Albatross': 88, 'Alligator': 35, ...} (animal names and carrot consumption). 

The animals are fearful that Fred will use the data against their best interest. For example, Fred could get rid of the animals who eat the most carrots! To protect themselves, animals decide to use the Differential Privacy (DP) library to aggregate their data before reporting it to Fred. This way, the animals can control the risk that Fred will identify individuals' data while maintaining an adequate level of accuracy so that Fred can continue to run the zoo effectively.

It is a new day. Farmer Fred is ready to ask the animals about their carrot consumption. He asks how many whole carrots they have eaten. The animals know the exact sum but report the differentially private sum to Farmer Fred. First, however, they check that Farmer Fred still has privacy budget left.

Privacy budget remaining: 1.00
True sum: 9649
DP sum: 9984
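
To make the DP sum more concrete, here is a minimal Python sketch of the Laplace mechanism, a standard way to obtain an ε-differentially private sum over bounded values. It is not the library's C++ implementation. The function names (`laplace_noise`, `dp_sum`), the bounds [0, 100], and the ε value are assumptions chosen for illustration, and naive floating-point noise like this is not hardened for production use.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_sum(values, lower, upper, epsilon, rng):
    """Return a noisy sum of `values` after clamping each one to [lower, upper].

    Assumption: one animal contributes one value, so adding or removing an
    animal changes the clamped sum by at most max(|lower|, |upper|).
    """
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = max(abs(lower), abs(upper))
    return sum(clamped) + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    # The first three entries of the input data set quoted in the story.
    carrots = {"Aardvark": 1, "Albatross": 88, "Alligator": 35}
    rng = random.Random()
    print("True sum:", sum(carrots.values()))
    print("DP sum:", dp_sum(carrots.values(), 0, 100, epsilon=0.25, rng=rng))
```

A smaller ε means a larger noise scale, so the reported sum drifts further from the true sum while revealing less about any single animal.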

Farmer Fred catches on that the animals are giving him DP results. He asks for the mean number of carrots eaten, but this time, he wants some additional accuracy information to build his intuition.

Privacy budget remaining: 0.75
True mean: 53.02
DP mean output:
elements { value { float_value: 60.088888888888889 }}
error_report { bounding_report { lower_bound { int_value: 32 }
  upper_bound { int_value: 128 }
  num_inputs: 164
  num_outside: 36 }}

The animals help Fred interpret the results. 60.09 is the DP mean. Since no bounds were set for the DP mean algorithm, bounds on the input data were automatically determined. Most of the data fell between [32, 128]. Thus, these bounds were used to determine clamping and global sensitivity. In addition, around 164 input values fell inside these bounds, and around 36 inputs fell outside them. num_inputs and num_outside are themselves DP counts.
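
The role of the bounds can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm the library uses: the bounds are passed in rather than determined automatically, the privacy parameter is split evenly between a noisy sum and a noisy count, and the names `count_outside` and `bounded_mean` are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def count_outside(values, lower, upper):
    """Exact (non-private) number of values outside [lower, upper]."""
    return sum(1 for v in values if v < lower or v > upper)


def bounded_mean(values, lower, upper, epsilon, rng):
    """Noisy mean: clamp to [lower, upper], then divide a noisy sum by a noisy count."""
    clamped = [min(max(v, lower), upper) for v in values]
    # Clamping caps how much one value can change the sum (its sensitivity).
    sum_sensitivity = max(abs(lower), abs(upper))
    half = epsilon / 2.0  # assumption: half the budget for each sub-query
    noisy_sum = sum(clamped) + laplace_noise(sum_sensitivity / half, rng)
    noisy_count = len(clamped) + laplace_noise(1.0 / half, rng)
    mean = noisy_sum / max(noisy_count, 1.0)  # guard against a tiny or negative count
    return min(max(mean, lower), upper)  # the mean of clamped values lies in the bounds


if __name__ == "__main__":
    sample = [10, 50, 200]  # made-up values, not the zoo data
    print("outside [32, 128]:", count_outside(sample, 32, 128))
    print("DP mean:", bounded_mean(sample, 32, 128, 0.25, random.Random()))
```

Values outside the bounds still contribute, but only as 32 or 128, which is why a DP mean can differ from the true mean even before noise is added.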

Fred wonders how many gluttons are in his zoo. How many animals ate over 90 carrots? Moreover, how accurate is the result?

Privacy budget remaining: 0.50
True count: 21
DP count output:
elements { value { int_value: 19 }}
error_report { noise_confidence_interval { upper_bound: 2.7268330278608408
  lower_bound: -2.7268330278608408
  confidence_level: 0.95 }}

The animals tell Fred that 19 is the DP count. [-2.73, 2.73] is the 0.95 confidence interval of the noise added to the count.
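
The reported interval can be reproduced by hand under two assumptions that the output above does not state: that the noise is Laplace with sensitivity 1, and that ε = ln 3 (a value inferred from the number itself). For Laplace noise with scale b, the probability that the noise lies within ±t is 1 − exp(−t/b), so the half-width at confidence level c is b · ln(1 / (1 − c)). The function name below is hypothetical.

```python
import math


def laplace_noise_interval(epsilon, sensitivity=1.0, confidence_level=0.95):
    """Symmetric interval that contains Laplace noise with the given probability."""
    scale = sensitivity / epsilon
    half_width = scale * math.log(1.0 / (1.0 - confidence_level))
    return -half_width, half_width


if __name__ == "__main__":
    # Assumption: epsilon = ln 3; this yields roughly (-2.7268, 2.7268).
    print(laplace_noise_interval(math.log(3)))
```

Note that the interval describes the noise only. It says nothing about error introduced by clamping.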

'And how gluttonous is the biggest glutton of them all?' Fred exclaims. He asks for the maximum number of carrots any animal has eaten.

Privacy budget remaining: 0.25
True max: 100
DP max: 70

Fred also wonders how many animals are not eating any carrots at all.

Privacy budget remaining: 0.00

Error querying for the count: Not enough privacy budget.

The animals notice that the privacy budget is depleted. They refuse to answer any more of Fred's questions for risk of violating privacy.
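
The budget bookkeeping in the story can be sketched as follows. The class `PrivacyBudget` and its method are hypothetical names for illustration, not the library's API. The sketch assumes each query consumes a quarter of the budget, which matches the 1.00, 0.75, 0.50, 0.25, 0.00 sequence above.

```python
class PrivacyBudget:
    """Track the remaining fraction of a privacy budget across queries."""

    def __init__(self, total=1.0):
        self.remaining = total

    def consume(self, fraction):
        """Spend `fraction` of the budget, or raise if too little is left."""
        if fraction <= 0:
            raise ValueError("fraction must be positive")
        if fraction > self.remaining:
            raise ValueError("Not enough privacy budget.")
        self.remaining -= fraction


if __name__ == "__main__":
    budget = PrivacyBudget(1.0)
    for query in ["sum", "mean", "count above 90", "max", "count of zeros"]:
        print(f"Privacy budget remaining: {budget.remaining:.2f}")
        try:
            budget.consume(0.25)
            print(f"answering the {query} query")
        except ValueError as err:
            print(f"Error querying for the {query}: {err}")
```

Refusing the fifth query matters because every answer leaks a little information, and the budget is what caps the total leakage across all queries.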

The story so far illustrates some of the main characteristics of DP:

  1. The user cannot query an unlimited number of times. The number of queries allowed depends on the privacy budget and the type of query.
  2. A DP result is not the true value, but it is close, and the library can sometimes provide a confidence interval for the added noise.
  3. The developer can set bounds on the input values, which are used for clamping. If none are set, the bounds are determined automatically.

These features allow developers to flexibly control how aggregate data (group patterns) is obtained. However, this technique focuses mainly on recognizing group patterns. How can we use a similar idea to protect individual information when more accurate results and personalized information are required (e.g., in a recommendation system)? How can we use the differential privacy library in EdLab?

By: Yi Chen