Machine Learning and Differential Privacy
Maria-Florina Balcan
04/22/2015
Learning and Privacy
• To do machine learning, we need data.
• What if the data contains sensitive information?
• medical data, web search query data, salary data, student grade data.
• Even if the (person running the) learning algo can be trusted, perhaps the output of the algorithm reveals sensitive info.
• E.g., using search logs of friends to recommend query completions:
Search box: "Why are _" → suggested completion: "Why are my feet so itchy?"
Learning and Privacy
• To do machine learning, we need data.
• What if the data contains sensitive information?
• Even if the (person running the) learning algo can be trusted, perhaps the output of the algorithm reveals sensitive info.
• E.g., SVM or perceptron on medical data:
– Suppose feature $j$ is has-green-hair and the learned $w$ has $w_j \neq 0$.
– If there is only one person in town with green hair, you know they were in the study.
Learning and Privacy
• To do machine learning, we need data.
• What if the data contains sensitive information?
• Even if the (person running the) learning algo can be trusted, perhaps the output of the algorithm reveals sensitive info.
• An approach to address these problems: Differential Privacy
“The Algorithmic Foundations of Differential Privacy”. Cynthia Dwork, Aaron Roth. Foundations and Trends in Theoretical Computer Science, NOW Publishers. 2014.
Differential Privacy
E.g., want to release average while preserving privacy.
High level idea:
• What we want is a protocol that has a probability distribution over outputs:
such that if person $i$ changed their input from $x_i$ to any other allowed value $x_i'$, the relative probabilities of any output do not change by much.
Differential Privacy
High level idea:
• What we want is a protocol that has a probability distribution over outputs:
such that if person $i$ changed their input from $x_i$ to any other allowed value $x_i'$, the relative probabilities of any output do not change by much.
• This would effectively allow that person to pretend their input was any other value they wanted.
Bayes rule:
\[
\frac{\Pr(x_i \mid \text{output})}{\Pr(x_i' \mid \text{output})}
= \frac{\Pr(\text{output} \mid x_i)}{\Pr(\text{output} \mid x_i')} \cdot \frac{\Pr(x_i)}{\Pr(x_i')}
\]
(Posterior ratio ≈ prior ratio.)
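A filled-in step: the definition on the next slides guarantees $e^{-\epsilon} \le \Pr(\text{output} \mid x_i) / \Pr(\text{output} \mid x_i') \le e^{\epsilon}$, so
\[
\frac{\Pr(x_i \mid \text{output})}{\Pr(x_i' \mid \text{output})}
\;\in\; \left[\, e^{-\epsilon}\,\frac{\Pr(x_i)}{\Pr(x_i')},\;\; e^{\epsilon}\,\frac{\Pr(x_i)}{\Pr(x_i')} \,\right],
\]
i.e., for small $\epsilon$ the posterior odds of $x_i$ versus $x_i'$ are essentially the prior odds: the output tells an adversary almost nothing new about person $i$'s input.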
Differential Privacy: Definition
It’s a property of a protocol A which you run on some dataset X producing some output A(X).
• A is $\epsilon$-differentially private if for any two neighboring datasets S, S' (differing in just one element, $x_i \to x_i'$),
[Figure: the datasets S and S' are identical except that person $i$'s entry $x_i$ is replaced by $x_i'$.]
for all outcomes v,
\[
e^{-\epsilon} \;\le\; \frac{\Pr(A(S)=v)}{\Pr(A(S')=v)} \;\le\; e^{\epsilon}
\]
(probability over the randomness in A; for small $\epsilon$, $e^{-\epsilon} \approx 1-\epsilon$ and $e^{\epsilon} \approx 1+\epsilon$).
Differential Privacy: Definition
It’s a property of a protocol A which you run on some dataset X producing some output A(X).
• A is $\epsilon$-differentially private if for any two neighboring datasets S, S' (differing in just one element, $x_i \to x_i'$),
View as model of plausible deniability
If your real input is $x_i$ and you'd like to pretend it was $x_i'$, somebody looking at the output of A can't tell, since for any outcome v it was nearly just as likely to come from S as from S'.
for all outcomes v,
\[
e^{-\epsilon} \;\le\; \frac{\Pr(A(S)=v)}{\Pr(A(S')=v)} \;\le\; e^{\epsilon}
\]
(probability over the randomness in A; for small $\epsilon$, $e^{-\epsilon} \approx 1-\epsilon$ and $e^{\epsilon} \approx 1+\epsilon$).
Differential Privacy: Methods
It’s a property of a protocol A which you run on some dataset X producing some output A(X).
• Can we achieve it?
• Sure, just have A(X) always output 0.
• This is perfectly private, but also completely useless.
• Can we achieve it while still providing useful information?
Laplace Mechanism
Say we have n inputs in the range [0, b], and we want to release their average while preserving privacy.
• Changing one input can affect the average by at most b/n.
• Idea: take the true answer and add noise drawn from the Laplace distribution $p(x) \propto e^{-|x|\,\epsilon n/b}$.
• Changing one input then changes the probability of any given answer by a factor of at most $e^{\epsilon}$ (derivation below).
[Figure: two Laplace densities over x, one centered at the value with the real me and one at the value with the fake me; the centers differ by at most b/n.]
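Why the factor is at most $e^{\epsilon}$: a short filled-in derivation, writing $a$ and $a'$ for the true averages of the two neighboring datasets (so $|a - a'| \le b/n$):
\[
\frac{\Pr(A(S)=v)}{\Pr(A(S')=v)}
= \frac{e^{-|v-a|\,\epsilon n/b}}{e^{-|v-a'|\,\epsilon n/b}}
= e^{(|v-a'| - |v-a|)\,\epsilon n/b}
\;\le\; e^{|a-a'|\,\epsilon n/b}
\;\le\; e^{\epsilon},
\]
using the triangle inequality $|v-a'| - |v-a| \le |a-a'|$.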
Laplace Mechanism
Say have n inputs in range [0,b]. Want to release average while preserving privacy.
• Changing one input can affect the average by at most b/n.
• Idea: compute the true answer and add noise drawn from the Laplace distribution $p(x) \propto e^{-|x|\,\epsilon n/b}$ (a code sketch follows this slide).
• Amount of noise added will be ≈ ±𝑏/(𝑛𝜖).
• To get an overall error of $\pm\gamma$, you need a sample size $n = \frac{b}{\gamma\epsilon}$.
• If you want to ask $k$ queries, the privacy losses add, so to have $\epsilon$-differential privacy overall you need $n = \frac{kb}{\gamma\epsilon}$.
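A minimal sketch of this mechanism in Python (assuming NumPy; the function name, the clipping step, and the example values are mine, not from the lecture):

import numpy as np

def private_average(data, b, epsilon, rng=None):
    """Release the average of values in [0, b] with epsilon-differential privacy.

    Changing one input moves the true average by at most b/n, so Laplace
    noise with scale b/(n*epsilon) gives the e^epsilon guarantee above.
    """
    rng = np.random.default_rng() if rng is None else rng
    data = np.clip(np.asarray(data, dtype=float), 0.0, b)  # enforce the assumed range
    n = len(data)
    true_avg = data.mean()
    noise = rng.laplace(loc=0.0, scale=b / (n * epsilon))  # typical magnitude ~ b/(n*epsilon)
    return true_avg + noise

# Example: 1,000 salaries in [0, 200000] released with epsilon = 0.5
# salaries = np.random.default_rng(0).uniform(0, 200000, size=1000)
# print(private_average(salaries, b=200000, epsilon=0.5))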
Laplace Mechanism
Good features:
• Can run algorithms that only need approximate statistics (since we are just adding small amounts of noise to those statistics).
• E.g., “approximately how much would this split in my decision tree reduce entropy?”
More generally
• Anything learnable via “Statistical Queries” is learnable differentially privately.
Practical Privacy: The SuLQ Framework. Blum, Dwork, McSherry, Nissim. PODS 2005.
• Statistical Query Model [Kearns '93]: the learning algorithm never sees individual examples from S; instead it submits queries q(x, l) and receives an estimate of $\Pr_D[q(x, f(x)) = 1]$, accurate to within $\pm\gamma$.
– What is the error rate of my current rule?
– What is the correlation of $x_1$ with $f$ when $x_2 = 0$? …
• Many algorithms (including ID3, Perceptron, SVM, PCA) can be re-written to interface via such statistical estimates.
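As a hedged illustration of that interface (the function and names below are mine, not from the SuLQ paper), one statistical query, e.g. the error rate of the current rule, can be answered privately by adding Laplace noise to the empirical fraction; a fraction over n examples has sensitivity 1/n:

import numpy as np

def private_statistical_query(examples, labels, q, epsilon, rng=None):
    """Estimate Pr_D[q(x, f(x)) = 1] from a sample, with epsilon-differential privacy.

    q is a 0/1-valued predicate on (example, label). The empirical fraction
    changes by at most 1/n when one example changes, so Laplace noise with
    scale 1/(n*epsilon) suffices.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(examples)
    fraction = sum(q(x, y) for x, y in zip(examples, labels)) / n
    return fraction + rng.laplace(scale=1.0 / (n * epsilon))

# Example query: the error rate of a hypothesis h on the sample
# err = private_statistical_query(X, y, lambda x, yy: int(h(x) != yy), epsilon=0.1)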
Laplace Mechanism
Problems:
• If you ask many questions, you need a large dataset to be able to give accurate and private answers to all of them (privacy losses accumulate over the questions asked).
• Also, differential privacy may not be appropriate if multiple examples correspond to same individual (e.g., search queries, restaurant reviews).
More generally
Problems:
• The more interconnected our data is (A and B are friends because of person C) the trickier it becomes to reason about privacy.
• Lots of current work on definitions and algorithms.