“You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.” – Cynthia Dwork in The Algorithmic Foundations of Differential Privacy.
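Suppose Database#1 and Database#2 differ by only one record, say, your data. If the results obtained from querying the database under these two settings are almost the same, or similarly distributed, then the two settings are essentially indistinguishable to an adversary.

Formally, a randomized algorithm M is ε-differentially private if, for all datasets d and d′ that differ in a single record and for every set of outputs S,

$$\Pr[M(d) \in S] \le e^{\varepsilon} \, \Pr[M(d') \in S]$$

Here M(d) is the output of the training algorithm for the training subset d, and M(d′) is the output of the training algorithm for the training subset d′. The probabilities that the output lies in S under these two conditions should be close: they may differ by at most a factor of e^ε. The above equation should hold for all such pairs d and d′. The smaller the value of ε, the stronger the privacy guarantee.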
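To make the role of ε concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query, using only Python's standard library. The function names (`laplace_noise`, `private_count`, `mean_abs_error`) and the toy data are assumptions made for illustration; they are not part of PyDP.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, epsilon, rng):
    # Adding or removing one record changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return len(records) + laplace_noise(1.0 / epsilon, rng)

def mean_abs_error(records, epsilon, trials=2000, seed=0):
    # Average distance between the noisy count and the true count.
    rng = random.Random(seed)
    true_count = len(records)
    total = sum(abs(private_count(records, epsilon, rng) - true_count)
                for _ in range(trials))
    return total / trials

d = list(range(1000))   # toy dataset (hypothetical)
d_prime = d[1:]         # neighbouring dataset: one record removed

for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: mean absolute error = {mean_abs_error(d, eps):.2f}")
```

The average error shrinks as ε grows, which is the trade-off described above: a small ε hides the difference between d and d′ behind more noise, at the cost of less accurate answers.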
Differentially private database mechanisms can therefore make confidential data widely available for accurate data analysis.
A linkage attack matches “anonymized” records with non-anonymized records in a different dataset. With differential privacy, access to IMDb would therefore no more permit a linkage attack on someone whose history is in the Netflix training set than on someone not in the training set.
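The matching step of a linkage attack can be sketched in a few lines. Everything below is a made-up toy example: the records, the field names, and the `link` helper are hypothetical, and it is not a description of how the actual Netflix/IMDb attack was carried out.

```python
# Hypothetical "anonymized" release: names replaced by pseudonyms.
anonymized = [
    {"user": "u1", "movie": "A", "date": "2005-03-01", "rating": 5},
    {"user": "u1", "movie": "B", "date": "2005-03-04", "rating": 2},
    {"user": "u1", "movie": "C", "date": "2005-04-10", "rating": 1},
    {"user": "u2", "movie": "A", "date": "2005-03-02", "rating": 4},
    {"user": "u2", "movie": "D", "date": "2005-03-09", "rating": 3},
]

# Hypothetical public reviews posted under real names.
public = [
    {"name": "Alice", "movie": "A", "date": "2005-03-01"},
    {"name": "Alice", "movie": "B", "date": "2005-03-04"},
    {"name": "Bob", "movie": "D", "date": "2005-03-09"},
]

def link(anonymized, public, min_matches=2):
    # Index pseudonyms by the quasi-identifier (movie, date).
    index = {}
    for row in anonymized:
        index.setdefault((row["movie"], row["date"]), set()).add(row["user"])
    # Count how many public records of each name match each pseudonym.
    votes = {}
    for row in public:
        for user in index.get((row["movie"], row["date"]), ()):
            key = (row["name"], user)
            votes[key] = votes.get(key, 0) + 1
    # Keep only pairings supported by at least `min_matches` records.
    return {name: user for (name, user), n in votes.items() if n >= min_matches}

print(link(anonymized, public))
```

Once a name is linked to a pseudonym, every other record under that pseudonym (here, the rating for movie C) is exposed, even though the release contained no names.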
!pip install python-dp # installing PyDP
import pydp as dp # by convention our package is to be imported as dp (dp for Differential Privacy!)
from pydp.algorithms.laplacian import BoundedSum, BoundedMean, Count, Max
import pandas as pd
import statistics
import numpy as np
import matplotlib.pyplot as plt
url1 = 'https://raw.githubusercontent.com/OpenMined/PyDP/dev/examples/Tutorial_4-Launch_demo/data/01.csv'
df1 = pd.read_csv(url1,sep=",", engine = "python")
print(df1.head())
url2 = 'https://raw.githubusercontent.com/OpenMined/PyDP/dev/examples/Tutorial_4-Launch_demo/data/02.csv'
df2 = pd.read_csv(url2,sep=",", engine = "python")
print(df2.head())
url3 = 'https://raw.githubusercontent.com/OpenMined/PyDP/dev/examples/Tutorial_4-Launch_demo/data/03.csv'
df3 = pd.read_csv(url3,sep=",", engine = "python")
print(df3.head())
url4 = 'https://raw.githubusercontent.com/OpenMined/PyDP/dev/examples/Tutorial_4-Launch_demo/data/04.csv'
df4 = pd.read_csv(url4,sep=",", engine = "python")
print(df4.head())
url5 = 'https://raw.githubusercontent.com/OpenMined/PyDP/dev/examples/Tutorial_4-Launch_demo/data/05.csv'
df5 = pd.read_csv(url5,sep=",", engine = "python")
print(df5.head())
combined_df_temp = [df1, df2, df3, df4, df5]
original_dataset = pd.concat(combined_df_temp)
print(original_dataset.shape)
# Result
# (5000,6)
redact_dataset = original_dataset.copy()
redact_dataset = redact_dataset[1:]
print(original_dataset.head())
print(redact_dataset.head())
Is the amount of money we spend at our neighborhood store private or sensitive information? Well, it may not seem all that sensitive! But, what if the same information can be used to identify us?
Suppose we compute the sum of the sales_amount column in our original dataset and in the redact_dataset. The difference between these two sums gives us exactly the amount that Osbourne (the customer in the first record, which redact_dataset omits) spent, as verified in the code snippet below.
sum_original_dataset = round(sum(original_dataset['sales_amount'].to_list()), 2)
sum_redact_dataset = round(sum(redact_dataset['sales_amount'].to_list()), 2)
sales_amount_Osbourne = round((sum_original_dataset - sum_redact_dataset), 2)
assert sales_amount_Osbourne == original_dataset.iloc[0, 4]
dp_sum_original_dataset = BoundedSum(epsilon= 1.5, lower_bound = 5, upper_bound = 250, dtype ='float')
dp_sum_og = dp_sum_original_dataset.quick_result(original_dataset['sales_amount'].to_list())
dp_sum_og = round(dp_sum_og, 2)
print(dp_sum_og)
# Output dp_sum_og (varies between runs because the added noise is random)
# 636723.61
dp_redact_dataset = BoundedSum(epsilon= 1.5, lower_bound = 5, upper_bound = 250, dtype ='float')
dp_redact_dataset.add_entries(redact_dataset['sales_amount'].to_list())
dp_sum_redact=round(dp_redact_dataset.result(), 2)
print(dp_sum_redact)
# Output dp_sum_redact
# 636659.17
print(f"Sum of sales_value in the original dataset: {sum_original_dataset}")
print(f"Sum of sales_value in the original dataset with DP: {dp_sum_og}")
assert dp_sum_og != sum_original_dataset
# Output
# Sum of sales_value in the original dataset: 636594.59
# Sum of sales_value in the original dataset with DP: 636723.61
print(f"Sum of sales_value in the second dataset: {sum_redact_dataset}")
print(f"Sum of sales_value in the second dataset with DP: {dp_sum_redact}")
assert dp_sum_redact != sum_redact_dataset
# Output
# Sum of sales_value in the second dataset: 636562.65
# Sum of sales_value in the second dataset with DP: 636659.17
print(f"Difference in Sum with DP: {round(dp_sum_og - dp_sum_redact, 2)}")
print(f"Actual Difference in Sum: {sales_amount_Osbourne}")
assert round(dp_sum_og - dp_sum_redact, 2) != sales_amount_Osbourne
# Output
# Difference in Sum with DP: 64.44
# Actual Difference in Sum: 31.94
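Why the two differentially private sums no longer reveal Osbourne's purchase can be illustrated with a simplified bounded sum. The sketch below is an assumption-laden approximation, not PyDP's implementation (PyDP wraps Google's differential privacy library, which handles further details): it clamps each value to [lower, upper] and adds Laplace noise with scale max(|lower|, |upper|) / ε. The helper names and the five purchase amounts are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_bounded_sum(values, epsilon, lower, upper, rng):
    # Clamp every contribution so one record can shift the sum
    # by at most max(|lower|, |upper|).
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = max(abs(lower), abs(upper))
    return sum(clamped) + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)

# Hypothetical purchases; the first one plays the role of Osbourne's record.
original = [31.94, 120.50, 75.25, 210.00, 18.40]
redacted = original[1:]

true_diff = round(sum(original) - sum(redacted), 2)
noisy_diff = (noisy_bounded_sum(original, 1.5, 5, 250, rng)
              - noisy_bounded_sum(redacted, 1.5, 5, 250, rng))

print(f"Actual difference: {true_diff}")
print(f"Difference of noisy sums: {round(noisy_diff, 2)}")
```

With the same parameters as the queries above (ε = 1.5, bounds 5 and 250), each noisy sum carries Laplace noise of scale 250 / 1.5 ≈ 167, which is far larger than a single purchase, so subtracting the two noisy sums tells an adversary very little about the removed record.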