Part 4: Data Dependent Query Processing Methods

Presentation Transcript

Slide 1

Part 4: Data Dependent Query Processing Methods

Yin "David" Yang, Zhenjie Zhang, Gerome Miklau. Prev. session: Marianne Winslett, Xiaokui Xiao.

Slide 2

What we talked about in the last session
- Privacy is a major concern in data publishing
- Simple anonymization methods fail to provide sufficient privacy protection
- Definition of differential privacy
  - Hard to tell from query results whether a record is in the DB
  - Plausible deniability
- Basic solutions
  - Laplace mechanism: inject Laplace noise into query results (sketched below)
  - Exponential mechanism: choose a result randomly; a "good" result has higher probability
- Data independent methods
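For reference, a minimal Python sketch of the Laplace mechanism from the recap above; the query value, sensitivity, and ε are illustrative:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return a differentially private answer: the true value plus
    Laplace noise with scale sensitivity/epsilon."""
    scale = sensitivity / epsilon
    return true_answer + np.random.laplace(loc=0.0, scale=scale)

# Example: a count query has sensitivity 1, because adding or removing
# one record changes the count by at most 1.
noisy_count = laplace_mechanism(true_answer=42, sensitivity=1.0, epsilon=0.1)
```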

Slide 3

Data independent vs. data dependent

                         | Data independent methods | Data dependent methods
Sensitive info           | Query results            | Query results + data dependent parameters
Error source             | Injected noise           | Injected noise + information loss
Noise type               | Unbiased                 | Often biased
Asymptotic error bound   | Higher                   | Lower, with data dependent constants
Practical accuracy       | Higher                   | Lower for some data

Slide 4

Types of data dependent methods
- Type 1: optimizing noisy results
  1. Inject noise
  2. Optimize the noisy query results based on their values
- Type 2: transforming original data
  1. Transform the data to reduce the amount of necessary noise
  2. Inject noise

Slide 5

Optimizing noisy results: hierarchical strategy (presented in the last session)
- Hierarchical strategy: a tree with a count in each node
- Data dependent optimization: if a node N has a noisy count close to 0, set the noisy count at N to 0
  - Example: noisy count 0.05 → optimized count 0 (sketched below)

Hay et al. Boosting the Accuracy of Differentially-Private Queries Through Consistency, VLDB'10.
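A minimal sketch of this optimization, assuming the noisy node counts are held in a flat array; the threshold value is an illustrative choice, not the paper's:

```python
import numpy as np

def zero_small_counts(noisy_counts, threshold=0.5):
    """Post-processing: snap noisy counts near 0 to exactly 0.
    It operates only on the (already private) noisy values, so it
    consumes no extra privacy budget."""
    counts = np.asarray(noisy_counts, dtype=float)
    counts[np.abs(counts) < threshold] = 0.0
    return counts

print(zero_small_counts([0.05, 12.3, -0.4, 7.8]))  # [ 0.  12.3  0.   7.8]
```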

Slide 6

Optimizing noisy results: iReduct
- Setting: answer a set of m queries
- Goal: minimize their total relative error
  - RelErr = (noisy result - actual result) / actual result
- Example:
  - Two queries, q1 and q2
  - Actual results: q1: 10, q2: 20
  - Observation: we should add less noise to q1 than to q2

Xiao et al. iReduct: Differential Privacy with Reduced Relative Errors, SIGMOD'11.

Slide 7

Answering queries differently leads to different total relative error
- Continuing the example: two queries, q1 and q2, with actual answers 10 and 20
- Suppose each of q1 and q2 has sensitivity 1
- Two strategies:
  1. Answer q1 with budget ε/2 and q2 with ε/2
     - Noise scale on q1: 2/ε; noise scale on q2: 2/ε
  2. Answer q1 with budget 2ε/3 and q2 with ε/3
     - Noise scale on q1: 1.5/ε; noise scale on q2: 3/ε
     - Lower relative error overall
- But we don't know which strategy is better before comparing their actual answers!
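A quick check of why strategy 2 is better: measuring error by the variance of the relative error (Var[Lap(b)] = 2b²), strategy 2's total is 0.09/ε² versus 0.10/ε² for strategy 1. The script below reproduces the arithmetic:

```python
# Variance of relative error under Laplace noise: Var[Lap(b)] = 2*b**2,
# so the relative-error variance of a query with true answer a is 2*b**2/a**2.
def rel_err_var(budget_fraction, answer, eps, sensitivity=1.0):
    b = sensitivity / (budget_fraction * eps)   # Laplace scale for this query
    return 2 * b**2 / answer**2

eps = 1.0
# Strategy 1: eps/2 on each of q1 (answer 10) and q2 (answer 20).
s1 = rel_err_var(1/2, 10, eps) + rel_err_var(1/2, 20, eps)   # 0.08 + 0.02 = 0.10
# Strategy 2: 2*eps/3 on q1, eps/3 on q2.
s2 = rel_err_var(2/3, 10, eps) + rel_err_var(1/3, 20, eps)   # 0.045 + 0.045 = 0.09
print(s1, s2)   # strategy 2 has the lower total
```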

Slide 8

Idea of iReduct
- Answer all queries with privacy budget ε/t
- Refine the noisy results with budget ε/t, giving more budget to queries with smaller results
- How to refine a noisy count?
  - Method 1: obtain a new noisy version, then compute a weighted average with the old version (see the sketch below)
  - Method 2: obtain a refined version directly from a complicated distribution
- Repeat the refinement step t-1 times
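A minimal sketch of Method 1, assuming the two noisy versions are combined by inverse-variance weighting (a standard choice; the paper's exact weighting may differ):

```python
import numpy as np

def refine(old_estimate, old_eps, true_value, new_eps, sensitivity=1.0):
    """Draw a fresh noisy estimate with budget new_eps and combine it
    with the old estimate by inverse-variance weighting."""
    new_estimate = true_value + np.random.laplace(scale=sensitivity / new_eps)
    # Var[Lap(b)] = 2*b**2, so the variance is proportional to 1/eps**2
    # and the inverse-variance weights are proportional to eps**2.
    w_old, w_new = old_eps**2, new_eps**2
    return (w_old * old_estimate + w_new * new_estimate) / (w_old + w_new)
```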

Slide 9

Example of iReduct

              Budget on q1    Budget on q2    Noisy results (q1, q2)
Iteration 1:  ε/2t            ε/2t            16, 14
Iteration 2:  ε/t × 14/30     ε/t × 16/30     12, 24
Iteration 3:  ε/t × 2/3       ε/t × 1/3       9, 22
(the next round would use ε/t × 22/31 on q1 and ε/t × 9/31 on q2)

In each round a query's share of the budget is inversely proportional to its current noisy result, so smaller counts receive more budget; a toy implementation follows below.
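A toy end-to-end sketch of the schedule above; the 1/estimate allocation rule is inferred from the example's shares (14/30, 2/3, 22/31), and the combining step is deliberately simplified:

```python
import numpy as np

def ireduct_sketch(true_answers, eps, t, sensitivity=1.0):
    """Toy iReduct schedule: one uniform round, then t-1 refinement
    rounds whose per-query budget share is proportional to 1/estimate."""
    answers = np.asarray(true_answers, dtype=float)
    m = len(answers)
    per_round = eps / t
    # Round 1: split eps/t uniformly across the m queries.
    budgets = np.full(m, per_round / m)
    estimates = answers + np.random.laplace(scale=sensitivity / budgets)
    for _ in range(t - 1):
        shares = 1.0 / np.abs(estimates)
        budgets = per_round * shares / shares.sum()
        fresh = answers + np.random.laplace(scale=sensitivity / budgets)
        # A real implementation would combine old and new estimates with the
        # weighted average of Slide 8 (Method 1); plain averaging keeps the
        # sketch short.
        estimates = (estimates + fresh) / 2.0
    return estimates

print(ireduct_sketch([10, 20], eps=1.0, t=5))
```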

Slide 10

Optimizing noisy results: MW
- Problem: publish a histogram under DP that is optimized for a given query set
- Idea: start from a uniform histogram, then repeat the following t times:
  1. Evaluate all queries
  2. Find the query q with the worst accuracy
  3. Modify the histogram to improve the accuracy of q, using a technique called multiplicative weights (MW); see the sketch below

Hardt et al. A Simple and Practical Algorithm for Differentially Private Data Release, arXiv.
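A minimal sketch of one MW update, assuming queries are 0/1 indicator vectors over the histogram bins; the exact scaling constant inside the exponent varies across presentations of this algorithm:

```python
import numpy as np

def mw_update(hist, query, noisy_answer):
    """One multiplicative-weights step: boost bins that the (noisy)
    measurement says are undercounted, damp the rest, and renormalize
    so the total mass is preserved."""
    total = hist.sum()
    error = noisy_answer - query @ hist        # how far off the current histogram is
    hist = hist * np.exp(query * error / (2.0 * total))
    return hist * (total / hist.sum())         # renormalize to the original total

# Toy usage: 4 bins, uniform start, one range query over bins 0-1.
hist = np.full(4, 25.0)                        # uniform histogram, total mass 100
q = np.array([1.0, 1.0, 0.0, 0.0])             # range count query as an indicator
hist = mw_update(hist, q, noisy_answer=70.0)   # shifts mass toward bins 0-1
```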

Slide 11

Example of MW (figure): an exact histogram is compared with the evolving private histogram on two range count queries, q1 and q2.
- Initial histogram: uniform; building it costs no privacy budget!
- Iteration 1: optimize q1 (privacy cost ε/t); q2 is still less accurate
- Iteration 2: optimize q1 (privacy cost ε/t); q2 is less accurate
- Iteration 3: optimize q2 (privacy cost ε/t)

Slide 12

Optimizing noisy results: NoiseFirst
- Problem: publish a histogram

Original data in a medical statistical DB, turned into a histogram:

Name   Age   HIV+
Frank  42    Y
Bob    31    Y
Mary   28    Y
Dave   43    N
...

Xu et al. Differentially Private Histogram Publication, ICDE'12.

Slide 13

Reduce error by merging bins
- Figure: noisy histogram vs. exact histogram vs. optimized histogram, where each group of merged bins shares the group's average count (e.g., 2, 2, 2); see the sketch below
- The bin-merging scheme is computed through dynamic programming
- Positive and negative noise cancels out!
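A minimal sketch of the merging step, assuming the bin groups are given (NoiseFirst finds them with dynamic programming):

```python
import numpy as np

def merge_bins(noisy_counts, groups):
    """Replace each group of noisy bins by the group's average.
    Averaging k bins shrinks the per-bin noise variance by a factor of k,
    since independent positive and negative noise partially cancels."""
    counts = np.asarray(noisy_counts, dtype=float)
    out = counts.copy()
    for start, end in groups:                  # groups given as [start, end) ranges
        out[start:end] = counts[start:end].mean()
    return out

# Toy usage: bins 0-2 are believed to have similar true counts.
print(merge_bins([1.8, 2.4, 1.9, 7.2], groups=[(0, 3)]))  # -> [2.03, 2.03, 2.03, 7.2]
```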

Slide 14

Next we focus on the second type.
- Type 1: optimizing noisy results
  1. Inject noise
  2. Optimize the noisy query results based on their values
- Type 2: transforming original data
  1. Transform the data to reduce the amount of necessary noise
  2. Inject noise

Slide 15

Transforming data: StructureFirst
- An alternative solution for histogram publication
- Figure: original histogram (per-bin sensitivity Δ = 1) vs. the histogram after merging bins (a merged group of 3 bins has Δ = 1/3, a group of 2 has Δ = 1/2)
- Lower sensitivity means less noise!

Xu et al. Differentially Private Histogram Publication, ICDE'12.
Related: Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM'10.

Slide 16

But the optimal structure is sensitive!
- Figure: the original histogram yields different optimal structures with and without Alice's record
- So the structure itself reveals that Alice is an HIV+ patient!

Slide 17

StructureFirst uses the exponential mechanism to render its structure differentially private.
- Randomly perturb the optimal histogram structure: set each group boundary using the exponential mechanism, consuming ε1 (see the sketch below)
- Add Lap(Δ/ε2) noise to the merged counts, with ε2 = ε - ε1
- Together this satisfies ε-DP

Figure: an original histogram with counts (1, 2, 1, 4, 5, 1, 1) is merged into k* = 3 groups of averaged counts (1.3, 1.3, 1.3, 4.5, 4.5, 1, 1); the boundaries are then randomly adjusted, e.g. to (1.3, 1.3, 1.3, 4, 2.3, 2.3, 2.3); finally Laplace noise yields, e.g., (1.2, 1.2, 1.2, 5.1, 2.4, 2.4, 2.4).
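A minimal sketch of choosing one boundary with the exponential mechanism; the variance-based score is illustrative, not the exact quality function from the paper:

```python
import numpy as np

def choose_boundary(counts, lo, hi, eps1, sensitivity=1.0):
    """Pick a group boundary in (lo, hi) with the exponential mechanism:
    Pr[boundary = b] is proportional to exp(eps1 * score(b) / (2 * sensitivity))."""
    def score(b):
        left, right = counts[lo:b], counts[b:hi]
        # Higher score = lower within-group error after averaging each side.
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        return -sse
    candidates = np.arange(lo + 1, hi)
    scores = np.array([score(b) for b in candidates])
    probs = np.exp(eps1 * (scores - scores.max()) / (2 * sensitivity))
    probs /= probs.sum()
    return np.random.choice(candidates, p=probs)

counts = np.array([1.0, 2.0, 1.0, 4.0, 5.0, 1.0, 1.0])
print(choose_boundary(counts, lo=0, hi=7, eps1=0.1))
```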

Slide 18

Observations on StructureFirst
- Merging bins essentially compresses the data
- Trade-off: reduced sensitivity vs. information loss
- Question: can we apply other compression algorithms? Yes!
  - Method 1: perform a Fourier transform, keep the first few coefficients, and discard the rest (see the sketch below)
    - Rastogi and Nath. Differentially Private Aggregation of Distributed Time-Series with Transformation and Encryption, SIGMOD'10
  - Method 2: apply the theory of sparse representation
    - Li et al. Compressive Mechanism: Utilizing Sparse Representation in Differential Privacy, WPES'11
    - Hardt and Roth. Beating Randomized Response on Incoherent Matrices. STOC'12
  - Your new paper?
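A minimal sketch of Method 1, assuming we keep the first k DFT coefficients and perturb them with Laplace noise; the noise calibration here is illustrative, and the paper's sensitivity analysis is more careful:

```python
import numpy as np

def fourier_perturb(series, k, eps, coeff_sensitivity):
    """Compress a series to its first k Fourier coefficients, perturb
    them with Laplace noise, then invert back to the time domain."""
    coeffs = np.fft.rfft(series)
    kept = np.zeros_like(coeffs)
    kept[:k] = coeffs[:k]
    # Noise both the real and imaginary parts of the k retained coefficients.
    noise = (np.random.laplace(scale=coeff_sensitivity / eps, size=k)
             + 1j * np.random.laplace(scale=coeff_sensitivity / eps, size=k))
    kept[:k] += noise
    return np.fft.irfft(kept, n=len(series))
```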

Slide 19

Transforming original data: k-d-tree
- Problem: answer 2D range count queries
- Solution: index the data with a k-d-tree
- But the k-d-tree structure is sensitive!

Cormode et al. Differentially Private Space Decompositions. ICDE'12.
Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM'10.

Slide 20

How to protect the k-d-tree structure?
- Core problem: computing a differentially private median
- Method 1: exponential mechanism (best) [1]; sketched below
- Method 2: simply replace the median with the (privately computable) mean [3]
- Method 3: cell-based method [2]
  - Partition the data with a grid
  - Compute differentially private counts using the grid

[1] Cormode et al. Differentially Private Space Decompositions. ICDE'12.
[2] Xiao et al. Differentially Private Data Release through Multi-Dimensional Partitioning. SDM'10.
[3] Inan et al. Private Record Matching Using Differential Privacy. EDBT'10.
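A minimal sketch of Method 1 for the private median, using the standard rank-based score whose sensitivity is 1 (details vary across papers):

```python
import numpy as np

def private_median(data, domain, eps):
    """Exponential mechanism for the median: candidates are domain points,
    scored by -(distance of their rank from n/2)."""
    data = np.sort(np.asarray(data, dtype=float))
    n = len(data)
    ranks = np.searchsorted(data, domain)       # number of elements below each candidate
    scores = -np.abs(ranks - n / 2.0)
    probs = np.exp(eps * (scores - scores.max()) / 2.0)  # /2 for a sensitivity-1 score
    probs /= probs.sum()
    return np.random.choice(domain, p=probs)

ages = [28, 31, 42, 43, 55, 61]
candidates = np.arange(0, 101)                  # bounded age domain 0..100
print(private_median(ages, candidates, eps=0.5))
```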

Slide 21

Transforming original data: S&A
- S&A: Sample and Aggregate
- Goal: answer a query q whose result does not depend on the dataset cardinality, e.g., avg
- Idea 1 (sketched below):
  1. Randomly partition the dataset into m blocks
  2. Evaluate q on each block
  3. Return the average over the m blocks + Laplace noise
  - Sensitivity: (max - min)/m
- Idea 2: use the median instead of the average, with the exponential mechanism
  - Sensitivity is 1!
- Zhenjie has more

Mohan et al. GUPT: Privacy Preserving Data Analysis Made Easy. SIGMOD'12.
Smith. Privacy-Preserving Statistical Estimation with Optimal Convergence Rates. STOC'11.
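A minimal sketch of Idea 1, assuming the query's output range [out_min, out_max] is known in advance so block results can be clipped; all names are illustrative:

```python
import numpy as np

def sample_and_aggregate(data, query, m, eps, out_min, out_max):
    """Idea 1: evaluate `query` on m random blocks, average the m block
    results, and add Laplace noise. One record affects only one block,
    i.e. one of the m averaged values, so the sensitivity of the average
    is (out_max - out_min) / m."""
    data = np.random.permutation(data)
    blocks = np.array_split(data, m)
    block_results = np.clip([query(b) for b in blocks], out_min, out_max)
    sensitivity = (out_max - out_min) / m
    return np.mean(block_results) + np.random.laplace(scale=sensitivity / eps)

# Toy usage: private average of values known to lie in [0, 100].
values = np.random.uniform(0, 100, size=10_000)
print(sample_and_aggregate(values, np.mean, m=50, eps=0.5, out_min=0, out_max=100))
```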

Slide 22

Systems using Differential Privacy
- Privacy on the Map
- PINQ
- Airavat
- GUPT

Slide 23

Summary on data dependent methods
- Data dependent vs. data independent
- Optimizing noisy results
  - Simple optimizations
  - Iterative methods
- Transforming original data
  - Reduced sensitivity
  - Caution: parameters may reveal information

Next: Zhenjie on differentially private data mining