Part 4: Data Dependent Query Processing Methods
Yin “David” Yang, Zhenjie Zhang, Gerome Miklau
Prev. session: Marianne Winslett, Xiaokui Xiao
What we talked about in the last session
- Privacy is a major concern in data publishing
- Simple anonymization methods fail to provide sufficient privacy protection
- Definition of differential privacy: it is hard to tell from query results whether a record is in the DB (plausible deniability)
- Basic solutions:
  - Laplace mechanism: inject Laplace noise into query results
  - Exponential mechanism: choose a result randomly; a “good” result has higher probability
- These are data independent methods
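The Laplace mechanism recap above can be sketched in a few lines of Python (a minimal illustration, not any particular system's API): for a query with sensitivity ∆ and budget ε, add noise drawn from a Laplace distribution with scale ∆/ε.

```python
import numpy as np

rng = np.random.default_rng(7)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Differentially private answer: true answer + Lap(sensitivity / epsilon)."""
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A count query has sensitivity 1: adding or removing one record
# changes the answer by at most 1.
noisy = laplace_mechanism(true_answer=100, sensitivity=1.0, epsilon=0.5)
```

Smaller ε means stronger privacy and larger noise; here the noise scale is 1/0.5 = 2.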
Data independent vs. data dependent

                       | Data independent methods | Data dependent methods
Sensitive info         | Query results            | Query results + data dependent parameters
Error source           | Injected noise           | Injected noise + information loss
Noise type             | Unbiased                 | Often biased
Asymptotic error bound | Higher                   | Lower, with data dependent constants
Practical error        | Higher                   | Lower for some data
Types of data dependent methods
- Type 1: optimizing noisy results. Inject noise, then optimize the noisy query results based on their values.
- Type 2: transforming original data. Transform the data to reduce the amount of necessary noise, then inject it.
Optimizing noisy results: hierarchical strategy
- Hierarchical strategy (presented in the last session): a tree with a count in each node
- Data dependent optimization: if a node N has a noisy count close to 0, set the noisy count at N to 0
- Example: noisy count 0.05 → optimized count 0

Hay et al. Boosting the Accuracy of Differentially-Private Queries Through Consistency, VLDB'10.
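This optimization is pure post-processing of already-noisy counts, so it costs no extra privacy budget. A minimal sketch (the threshold value is a hypothetical tuning parameter, not taken from the paper):

```python
def zero_small_counts(noisy_counts, threshold=0.5):
    # Post-processing of DP outputs preserves differential privacy;
    # a noisy count near 0 most likely comes from a true count of 0.
    return [0.0 if abs(c) < threshold else c for c in noisy_counts]

zero_small_counts([0.05, 12.3, -0.4, 7.0])  # the two near-zero counts become 0.0
```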
Optimizing noisy results: iReductSetting: answer a set of m queriesGoal: minimize their total
relative errorRelErr = (noisy result – actual result) / actual resultExample:Two queries, q1 and q2Actual results: q1 :10, q2 :20Observation: we should add less noise to q1 than to q26
Xiao
et al
.
iReduct
: Differential Privacy with Reduced Relative Errors, SIGMOD’11.Slide7
Answering queries differently leads to different total relative error
- Continuing the example: two queries q1 and q2, with actual answers 10 and 20
- Suppose each of q1 and q2 has sensitivity 1
- Two strategies:
  - Strategy 1: answer q1 with budget ε/2 and q2 with ε/2. Noise scale on q1: 2/ε; noise scale on q2: 2/ε.
  - Strategy 2: answer q1 with budget 2ε/3 and q2 with ε/3. Noise scale on q1: 1.5/ε; noise scale on q2: 3/ε. Lower relative error overall.
- But we don't know which strategy is better before comparing their actual answers!
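Measuring error by the variance of the Laplace noise relative to the true answer (the variance of Lap(b) is 2b²), a quick calculation confirms that the uneven split wins for these two answers:

```python
def rel_noise_var(scale, answer):
    # Laplace noise with scale b has variance 2 * b**2;
    # divide by answer**2 to make it a relative error.
    return 2 * scale**2 / answer**2

eps = 1.0
strategy1 = rel_noise_var(2 / eps, 10) + rel_noise_var(2 / eps, 20)    # 0.08 + 0.02 = 0.10
strategy2 = rel_noise_var(1.5 / eps, 10) + rel_noise_var(3 / eps, 20)  # 0.045 + 0.045 = 0.09
```

Strategy 2's total relative noise variance (0.09/ε²) beats strategy 1's (0.10/ε²), but only because we peeked at the actual answers 10 and 20.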
Idea of iReduct
- Answer all queries with privacy budget ε/t
- Refine the noisy results with budget ε/t, spending more budget on queries with smaller results
- How to refine a noisy count?
  - Method 1: obtain a new noisy version and compute a weighted average with the old version
  - Method 2: obtain a refined version directly from a complicated distribution
- Repeat the refinement step t − 1 times
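The loop above can be sketched as follows (a simplified illustration of Method 1 with budget-weighted averaging; the exact weighting and the floor on tiny results are assumptions, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)

def ireduct_sketch(true_answers, epsilon, t):
    true_answers = np.asarray(true_answers, dtype=float)
    per_round = epsilon / t
    # Round 1: split epsilon/t evenly; each query is assumed to have sensitivity 1.
    budgets = np.full(len(true_answers), per_round / len(true_answers))
    noisy = true_answers + rng.laplace(scale=1.0 / budgets)
    spent = budgets.copy()
    for _ in range(t - 1):
        # Refinement: give more budget to queries with smaller noisy results.
        inv = 1.0 / np.maximum(np.abs(noisy), 1e-6)
        shares = per_round * inv / inv.sum()
        fresh = true_answers + rng.laplace(scale=1.0 / shares)
        # Method 1: budget-weighted average of the old and new estimates.
        noisy = (spent * noisy + shares * fresh) / (spent + shares)
        spent += shares
    return noisy
```

With a generous total budget the final estimates land close to the true answers, and the query with the smaller answer ends up with more of the budget.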
Example of iReduct (figure: t iterations over q1 and q2)
- Iteration 1: answer q1 and q2 with budget ε/(2t) each; noisy results: q1 = 16, q2 = 14. The next refinement budget ε/t is split inversely to the noisy results: 14/30 of it for q1, 16/30 for q2.
- Iteration 2: refined noisy results: q1 = 12, q2 = 24. The next ε/t is split 2/3 for q1, 1/3 for q2.
- Iteration 3: refined noisy results: q1 = 9, q2 = 22. The next ε/t is split 22/31 for q1, 9/31 for q2.
- …
Optimizing noisy results: MW
- Problem: publish a histogram under DP that is optimized for a given query set
- Idea: start from a uniform histogram, then repeat the following t times:
  - Evaluate all queries
  - Find the query q with the worst accuracy
  - Modify the histogram to improve the accuracy of q, using a technique called multiplicative weights (MW)

Hardt et al. A Simple and Practical Algorithm for Differentially Private Data Release, arXiv.
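One MW step can be sketched as follows (a simplified single-query update; the learning rate 1/(2·total) is an assumption for illustration, not the paper's exact constant):

```python
import numpy as np

def mw_update(hist, query, noisy_answer):
    """Multiplicative-weights update of a histogram toward a noisy query answer.

    `query` is a 0/1 indicator vector over the bins of `hist`.
    """
    hist = np.asarray(hist, dtype=float)
    query = np.asarray(query, dtype=float)
    est = float(query @ hist)
    total = hist.sum()
    # Boost bins covered by the query if it was underestimated,
    # shrink them if it was overestimated; uncovered bins are untouched.
    hist = hist * np.exp(query * (noisy_answer - est) / (2.0 * total))
    return hist * (total / hist.sum())  # renormalize to preserve total mass

updated = mw_update([2.0, 2.0, 2.0, 2.0], [1, 1, 0, 0], noisy_answer=6.0)
# Bins covered by the query grow, the others shrink; the total stays 8.
```

Note that the update itself is post-processing; the privacy cost per iteration comes from obtaining the noisy answer for the chosen query.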
Example of MW (figure: exact histogram vs. the evolving published histogram, evaluated on two range count queries q1 and q2)
- Initial histogram: uniform, which costs no privacy budget! Compared against the exact histogram, q1 is less accurate than q2.
- Iteration 1: optimize q1 (privacy cost: ε/t); q1 is still less accurate.
- Iteration 2: optimize q1 (privacy cost: ε/t); now q2 is the less accurate query.
- Iteration 3: optimize q2 (privacy cost: ε/t).
Optimizing noisy results: NoiseFirst
Problem: publish a histogram

Original data in a medical statistical DB, converted to a histogram:

Name  | Age | HIV+
Frank | 42  | Y
Bob   | 31  | Y
Mary  | 28  | Y
Dave  | 43  | N
…     | …   | …

Xu et al. Differentially Private Histogram Publication, ICDE'12.
Reduce error by merging bins
- Figure: noisy histogram vs. exact histogram vs. optimized histogram
- Merging bins replaces each noisy count in a merged range by the range average (e.g., three merged bins each get count 2): positive and negative noise cancels out!
- The bin-merging scheme is computed through dynamic programming
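The merging step itself is simple post-processing: within each merged range, every noisy bin is replaced by the range average, so independent positive and negative noise partly cancels. A minimal sketch (the half-open range format is a hypothetical interface choice):

```python
import numpy as np

def merge_bins(noisy_hist, ranges):
    """Replace each bin in every half-open (start, end) range by the range average."""
    out = np.asarray(noisy_hist, dtype=float).copy()
    for start, end in ranges:
        out[start:end] = out[start:end].mean()
    return out

merge_bins([1.0, 3.0, 2.0, 8.0], [(0, 3)])  # first three bins become their average, 2.0
```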
Next we focus on the second type.
- Type 1: optimizing noisy results. Inject noise, then optimize the noisy query results based on their values.
- Type 2: transforming original data. Transform the data to reduce the amount of necessary noise, then inject it.
Transforming data: StructureFirst
An alternative solution for histogram publication
- Original histogram: each bin has sensitivity ∆ = 1
- Histogram after merging bins: a bin that averages k original bins has sensitivity ∆ = 1/k (e.g., ∆ = 1/3 for a 3-bin merge, ∆ = 1/2 for a 2-bin merge)
- Lower sensitivity means less noise!

Xu et al. Differentially Private Histogram Publication, ICDE'12.
Related: Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM'10.
But the optimal structure is sensitive!
- Figure: the original histogram yields different optimal structures with and without Alice's record
- Observing which structure is published can reveal that Alice is an HIV+ patient!
StructureFirst uses the exponential mechanism to render its structure differentially private.
- Randomly perturb the optimal histogram structure: set each boundary using the exponential mechanism
- Example: original histogram with counts 1, 2, 1, 4, 5, 1, 1
  - Merge bins (k* = 3): partition {1, 2, 1 | 4, 5 | 1, 1}, giving averaged counts 1.3, 1.3, 1.3, 4.5, 4.5, 1, 1
  - Randomly adjust boundaries (consumes ε1): the partition becomes {1, 2, 1 | 4 | 5, 1, 1}, giving 1.3, 1.3, 1.3, 4, 2.3, 2.3, 2.3
  - Add Lap(∆/ε2) noise with ε2 = ε − ε1, e.g., 1.2, 1.2, 1.2, 5.1, 2.4, 2.4, 2.4
- The whole procedure satisfies ε-DP
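A single boundary choice via the exponential mechanism might look like the sketch below (the SSE-based quality score and the sensitivity bound `delta` are simplifying assumptions for illustration, not the paper's exact quality function):

```python
import numpy as np

rng = np.random.default_rng(9)

def pick_boundary(counts, epsilon):
    """Pick one split point, favoring splits that approximate each side well."""
    counts = np.asarray(counts, dtype=float)

    def sse(seg):  # squared error of approximating a segment by its average
        return float(((seg - seg.mean()) ** 2).sum()) if len(seg) else 0.0

    splits = np.arange(1, len(counts))
    scores = np.array([-(sse(counts[:s]) + sse(counts[s:])) for s in splits])
    delta = 2.0 * counts.max()  # assumed (loose) bound on the score sensitivity
    probs = np.exp(epsilon * scores / (2.0 * delta))
    probs /= probs.sum()
    return int(rng.choice(splits, p=probs))

pick_boundary([1, 1, 1, 9, 9, 9], epsilon=20.0)  # almost always returns 3
```

With a large ε the mechanism concentrates on the best split; as ε shrinks, worse splits become more likely, which is exactly the randomness that hides any single record's influence.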
Observations on StructureFirst
- Merging bins essentially compresses the data: reduced sensitivity vs. information loss
- Question: can we apply other compression algorithms? Yes!
  - Method 1: perform a Fourier transformation, take the first few coefficients, and discard all others
    - Rastogi and Nath. Differentially Private Aggregation of Distributed Time-Series with Transformation and Encryption, SIGMOD'10.
  - Method 2: apply the theory of sparse representation
    - Li et al. Compressive Mechanism: Utilizing Sparse Representation in Differential Privacy, WPES'11.
    - Hardt and Roth. Beating Randomized Response on Incoherent Matrices. STOC'12.
  - Your new paper?
Transforming original data: k-d-tree
- Problem: answer 2D range count queries
- Solution: index the data with a k-d-tree
- But the k-d-tree structure is sensitive!

Cormode et al. Differentially Private Space Decompositions. ICDE'12.
Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM'10.
How to protect the k-d-tree structure?
Core problem: computing a differentially private median.
- Method 1: exponential mechanism (best) [1]
- Method 2: simply replace the median with the mean [3]
- Method 3: cell-based method [2]: partition the data with a grid and compute differentially private counts using the grid

[1] Cormode et al. Differentially Private Space Decompositions. ICDE'12.
[2] Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM'10.
[3] Inan et al. Private Record Matching Using Differential Privacy. EDBT'10.
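Method 1 can be sketched for a one-dimensional median over a discrete candidate domain (a minimal illustration; the rank-distance score with sensitivity 1 is a standard choice for private medians, not necessarily the cited paper's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(3)

def private_median(data, domain, epsilon):
    """Exponential mechanism for the median: score(m) = -|rank(m) - n/2|."""
    data = np.sort(np.asarray(data, dtype=float))
    scores = np.array([-abs(np.searchsorted(data, m) - len(data) / 2.0)
                       for m in domain])
    # The score has sensitivity 1: one record shifts any rank by at most 1.
    probs = np.exp(epsilon * scores / 2.0)
    probs /= probs.sum()
    return domain[rng.choice(len(domain), p=probs)]

private_median(list(range(100)), list(range(100)), epsilon=5.0)  # close to 50
```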
Transforming original data: S&A
S&A: Sample and Aggregate
- Goal: answer a query q whose result does not depend on the dataset cardinality, e.g., avg
- Idea 1:
  - Randomly partition the dataset into m blocks
  - Evaluate q on each block
  - Return the average over the m blocks + Laplace noise
  - Sensitivity: (max − min)/m
- Idea 2: use the median instead of the average, with the exponential mechanism. Sensitivity is 1!
- Zhenjie will say more.

Mohan et al. GUPT: Privacy Preserving Data Analysis Made Easy. SIGMOD'12.
Smith. Privacy-Preserving Statistical Estimation with Optimal Convergence Rates. STOC'11.
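Idea 1 can be sketched as follows (the clipping range [lo, hi] is an assumed prior bound on each block's answer; GUPT's actual interface differs):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_and_aggregate(data, query, m, epsilon, lo, hi):
    """Average `query` over m random blocks, then add Laplace noise.

    One record lands in exactly one block, and each block answer is clipped
    to [lo, hi], so the average has sensitivity (hi - lo) / m.
    """
    shuffled = rng.permutation(np.asarray(data, dtype=float))
    blocks = np.array_split(shuffled, m)
    answers = np.clip([query(b) for b in blocks], lo, hi)
    sensitivity = (hi - lo) / m
    return float(np.mean(answers) + rng.laplace(scale=sensitivity / epsilon))

data = rng.random(10000)  # synthetic data in [0, 1]
private_avg = sample_and_aggregate(data, np.mean, m=50, epsilon=1.0, lo=0.0, hi=1.0)
```

Because the sensitivity shrinks with the number of blocks m, the noise added to the aggregated answer is much smaller than what a direct average would require.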
Systems using Differential Privacy
- Privacy on the Map
- PINQ
- Airavat
- GUPT
Summary on data dependent methods
- Data dependent vs. data independent
- Optimizing noisy results: simple optimizations; iterative methods
- Transforming original data: reduced sensitivity
- Caution: parameters may reveal information
- Next: Zhenjie on differentially private data mining