
Differential Privacy on Linked Data: Theory and Implementation

Yotam Aron

Table of Contents

Introduction

Differential Privacy for Linked Data

SPIM implementation

Evaluation

Contributions

Theory: how to apply differential privacy to linked data.

Implementation: privacy module for SPARQL queries.

Experimental evaluation: differential privacy on linked data.

Introduction

Overview: Privacy Risk

Statistical data can leak privacy.

Mosaic Theory: different data sources can be harmful when combined.

Examples:

Netflix Prize Data set

GIC Medical Data set

AOL Data logs

Linked data has added ontologies and metadata, making it even more vulnerable.

Current Solutions

Accountability:

Privacy Ontologies

Privacy Policies and Laws

Problems:

Requires agreement among parties.

Does not actually prevent breaches; it is just a deterrent.

Current Solutions (Cont’d)

Anonymization

Delete “private” data

k-anonymity (strong privacy guarantee)

Problems:

Deletion provides no strong guarantees

Must be carried out for every data set

What data should be anonymized?

High computational cost (k-anonymity is NP-hard)

Differential Privacy

Definition for relational databases (from the PINQ paper):

A randomized function $K$ gives $\varepsilon$-differential privacy if for all data sets $D_1$ and $D_2$ differing on at most one record, and all $S \subseteq \mathrm{Range}(K)$,

$$\Pr[K(D_1) \in S] \le e^{\varepsilon} \cdot \Pr[K(D_2) \in S]$$

Differential Privacy

What does this mean?

Adversaries get roughly the same results from $D_1$ and $D_2$, meaning a single individual’s data will not greatly affect their knowledge acquired from each data set.

How Is This Achieved?

Add noise to result.

Simplest: add Laplace noise

Laplace Noise Parameters

Mean = 0 (so we don’t add bias)

Variance = $2(\Delta f/\varepsilon)^2$, where the sensitivity $\Delta f$ is defined, for a record $j$, as

$$\Delta f = \max_{D,\, j} \big| Q(D) - Q(D \setminus \{j\}) \big|$$

Theorem: For query $Q$ with result $R$, the output $R + \mathrm{Laplace}(0, \Delta f/\varepsilon)$ is differentially private.
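To make this concrete, here is a minimal sketch of the Laplace mechanism in Python (my own illustration, assuming the sensitivity has already been computed; this is not the SPIM code):

import numpy as np

def laplace_mechanism(true_result, sensitivity, epsilon):
    # Scale b = sensitivity / epsilon gives variance 2 * b**2,
    # matching the parameters above.
    scale = sensitivity / epsilon
    return true_result + np.random.laplace(loc=0.0, scale=scale)

# Example: a COUNT whose answer changes by at most 1 when one person is removed.
noisy_count = laplace_mechanism(true_result=42, sensitivity=1.0, epsilon=0.1)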

Other Benefit of Laplace Noise

A set of queries, each using a privacy parameter $\varepsilon_i$, is $\sum_i \varepsilon_i$-differentially private overall (sequential composition).

Implementation-wise, we can allocate a “budget” Ɛ for a client, and for each query the client specifies the $\varepsilon_i$ to use.

Benefits of Differential Privacy

Strong Privacy Guarantee

Mechanism-based, so we don’t have to mess with the data.

Independent of the data set’s structure.

Works well for statistical analysis algorithms.

Problems with Differential Privacy

Potentially poor performance

Complexity

Noise

Only works with statistical data (though this has fixes)

How to calculate the sensitivity of an arbitrary query without brute force?

Theory: Differential Privacy for Linked Data

Differential Privacy and Linked Data

Want the same privacy guarantees for linked data, but there are no “records.”

What should be the “unit of difference”?

One triple

All URIs related to a person’s URI

All links going out from a person’s URI


“Records” for Linked Data

Reduce links in graph to attributes

Idea:

Identify the contributions from a single individual to the total answer.

Find the contribution that affects the answer most.

“Records” for Linked Data

Reduce links in the graph to attributes; this makes each subject a record.

[Diagram: the triple P1 --Knows--> P2 becomes the record below]

Person | Knows
P1     | P2

“Records” for Linked Data

Repeated attributes and null values allowed.

[Diagram: P1 --Knows--> P2, P1 --Loves--> P4, P3 --Knows--> P2, P3 --Knows--> P4]

“Records” for Linked Data

Repeated attributes and null values allowed (not good RDBMS form, but it makes definitions easier).

Person | Knows | Knows | Loves
P1     | P2    | Null  | P4
P3     | P2    | P4    | Null
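One illustrative way to realize this reduction in Python (a sketch under my own naming, not the thesis implementation) is to group each subject’s outgoing triples by predicate:

from collections import defaultdict

def triples_to_records(triples):
    # Each subject becomes a "record" whose attributes are predicates
    # with possibly repeated (or missing) values.
    records = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        records[s][p].append(o)
    return records

# The example graph above:
triples = [("P1", "Knows", "P2"), ("P1", "Loves", "P4"),
           ("P3", "Knows", "P2"), ("P3", "Knows", "P4")]
print(dict(triples_to_records(triples)["P1"]))  # {'Knows': ['P2'], 'Loves': ['P4']}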

Query Sensitivity in Practice

Need to find triples that “belong” to a person.

Idea:

Identify the contributions from a single individual to the total answer.

Find the contribution that affects the answer most.

Done using sorting and limiting functions in SPARQL.

Example

COUNT of places visited

[Diagram: people P1 and P2, each with a State of Residence link to MA and Visited links among the states S1, S2, S3]

Answer: sensitivity of 2 (removing the person who visited the most places changes the count by 2).

Using SPARQL

Query:

SELECT (COUNT(?s) as ?num_places_visited) WHERE {
  ?p :visited ?s }

Using SPARQL

Sensitivity Calculation Query (ideally):

SELECT ?p (COUNT(ABS(?s)) as ?num_places_visited) WHERE {
  ?p :visited ?s ;
     foaf:name ?n }
GROUP BY ?p ORDER BY DESC(?num_places_visited) LIMIT 1

In reality…

LIMIT, ORDER BY, and GROUP BY don’t work together in 4store…

For now: don’t use LIMIT; get the top answers manually.

I.e., simulate these keywords in Python (see the sketch below).

This will affect results, so better testing should be carried out in the future.

Ideally this would stay on the SPARQL side so there is less transmitted data (e.g. on large data sets).

(Side rant) 4store limitations

Many operations are not supported in unison.

E.g. cannot always filter and use “order by” for some reason.

Severely limits the types of queries I could use to test.

May be desirable to work with a different triplestore that is more up-to-date (ARQ).

Didn’t, because I wanted to keep the code in Python and had already written all the code for 4store.

Problems with this Approach

Need to identify “people” in the graph.

Assume, for example, that a URI with a foaf:name is a person, and use its triples in privacy calculations.

Imposes some constraints on the linked data format for this to work.

For future work, look at whether there is a way to automatically identify private data, maybe by using ontologies.

Complexity is tied to the speed of performing the query over a large data set.

Still not generalizable to all functions.

…and on the Plus Side

Model for sensitivity calculation can be expanded to arbitrary statistical functions.

e.g. dot products, distance functions, variance, etc.

Relatively simple to implement using SPARQL 1.1.

Implementation: Design of Privacy System

SPARQL Privacy Insurance Module

i.e. SPIM

Uses authentication, AIR, and differential privacy in one system:

Authentication to manage Ɛ-budgets.

AIR to control the flow of information and non-statistical data.

Differential privacy for statistics.

Goal: provide a module that can integrate into SPARQL 1.1 endpoints and provide privacy.

Design

[Architecture diagram: an HTTP Server with OpenID Authentication fronts the SPIM Main Process, which uses the AIR Reasoner (with Privacy Policies) and the Differential Privacy Module over a Triplestore holding User Data]

HTTP Server and Authentication

HTTP Server: Django server that handles HTTP requests.

OpenID Authentication: Django module.

SPIM Main Process

Controls the flow of information.

First checks the user’s budget, then uses AIR, then performs the final differentially private query.

AIR Reasoner

Performs access control by translating SPARQL queries to N3 and checking them against policies.

Can potentially perform more complicated operations (e.g. check user credentials).

Differential Privacy Protocol

[Diagram: Client ↔ Differential Privacy Module ↔ SPARQL Endpoint]

Scenario: Client wishes to make a standard SPARQL 1.1 statistical query. Client has an Ɛ “budget” of overall accuracy for all queries.

Step 1: The query and epsilon value are sent to the endpoint and intercepted by the enforcement module (Query, Ɛ > 0).

Step 2: The sensitivity of the query is calculated using a re-written, related query (Sens Query).

Step 3: The actual query is sent (Query).

Step 4: The result with Laplace noise is sent over (Result and Noise). A sketch of the whole flow follows.
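Putting the four steps together, a hypothetical enforcement loop might look like the following (run_sparql and rewrite_for_sensitivity are illustrative stand-ins, not the actual SPIM API):

import numpy as np

def handle_query(budget, query, epsilon, run_sparql, rewrite_for_sensitivity):
    if epsilon <= 0 or epsilon > budget["remaining"]:
        raise ValueError("epsilon budget exceeded")            # Step 1: intercept and validate
    sensitivity = run_sparql(rewrite_for_sensitivity(query))   # Step 2: rewritten sensitivity query
    result = run_sparql(query)                                 # Step 3: actual query
    budget["remaining"] -= epsilon
    return result + np.random.laplace(0.0, sensitivity / epsilon)  # Step 4: noisy result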

Experimental Evaluation

Evaluation

Three things to evaluate:

Correctness of operation

Correctness of differential privacy

Runtime

Used an anonymized clinical database as the test data and added fake names, social security numbers, and addresses.

Correctness of Operation

Can the system do what we want?

Authentication provides access control

AIR restricts information and types of queries

Differential privacy gives strong privacy guarantees.

Can we do better?

Use Case Used in Thesis

Clinical database data protection

HIPAA: Federal protection of private information fields, such as name and social security number, for patients.

3 users

Alice: Works in CDC, needs unhindered access

Bob: Researcher that needs access to private fields (e.g. addresses)

Charlie: Amateur researcher to whom HIPAA should apply

Assumptions:

Django is secure enough to handle “clever attacks”.

Users do not collude, so we can allocate individual epsilon values.

Use Case Solution Overview

What should happen:

Dynamically apply different AIR policies at runtime.

Give different epsilon-budgets.

How allocated:

Alice: no AIR policy, no noise.

Bob: give access to addresses but hide all other private information fields. Epsilon budget: E1.

Charlie: hide all private information fields in accordance with HIPAA. Epsilon budget: E2.


Example: A Clinical Database

Client accesses the triplestore via the HTTP server.

OpenID Authentication verifies that the user has access to the data and finds the epsilon value.

Example: A Clinical Database

AIR reasoner checks incoming queries for HIPAA violations.

Privacy policies contain the HIPAA rules.

Example: A Clinical Database

Differential privacy applied to statistical queries.

Statistical result + noise returned to the client.

Correctness of Differential Privacy

Need to test how much noise is added.

Too much noise = poor results.

Too little noise = no guarantee.

Test: run queries and compare the calculated sensitivity vs. the actual sensitivity.

How to test sensitivity?

Ideally:

Test that the noise calculation is correct.

Test that the noise keeps the data useful (e.g. by applying machine learning algorithms).

For this project, just tested the former: machine learning APIs are not as prevalent for linked data, and what results would we compare to?

Test suite

10 queries for each operation (COUNT, SUM, AVG, MIN, MAX)

10 different WHERE clauses

Test:

Sensitivity calculated from the original query

Remove each personal URI using the “MINUS” keyword and see which removal is most sensitive

Example for Sens Test

Query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1#>
PREFIX mimic: <http://air.csail.mit.edu/spim_ontologies/mimicOntology#>
SELECT (SUM(?o) as ?aggr) WHERE {
  ?s foaf:name ?n .
  ?s mimic:event ?e .
  ?e mimic:m1 "Insulin" .
  ?e mimic:v1 ?o .
  FILTER(isNumeric(?o))
}

Example for Sens Test

Sensitivity query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1#>
PREFIX mimic: <http://air.csail.mit.edu/spim_ontologies/mimicOntology#>
SELECT (SUM(?o) as ?aggr) WHERE {
  ?s foaf:name ?n .
  ?s mimic:event ?e .
  ?e mimic:m1 "Insulin" .
  ?e mimic:v1 ?o .
  FILTER(isNumeric(?o))
  MINUS { ?s foaf:name "%s" }
} % (name)

[Figure: Results, Query 6 - Error]

Runtime

Queries were also tested for runtime.

Bigger WHERE clauses and more keywords mean extra overhead for doing the calculations.

[Figure: Results, Query 6 - Runtime]

Interpretation

Sensitivity calculation time is on par with query time.

Might not be good for big data. Find ways to reduce sensitivity calculation time?

AVG does not do so well…

Approximation yields too much noise vs. trying all possibilities.

Runs ~4x slower than simple querying.

Solution 1: look at all the data manually (large data transfer).

Solution 2: can we use NOISY_SUM / NOISY_COUNT instead?

Conclusion

Contributions

Theory on how to apply differential privacy to linked data.

Overall privacy module for SPARQL queries.

Limited, but a good start.

Experimental implementation of differential privacy, with verification that it is applied correctly.

Other:

Updated the SPARQL-to-N3 translation to SPARQL version 1.1.

Expanded upon the IARPA project to create policies against statistical queries.

Shortcomings and Future Work

Triplestores need some structure for this to work:

Personal information must be explicitly defined in triples.

Is there a way to automatically detect which triples would constitute private information?

Complexity:

Lots of noise for sparse data.

Can divide data into disjoint sets to reduce noise, like PINQ does.

Use localized sensitivity measures?

Third-party software problems:

Would this work better using a different triplestore implementation?

Diff. Privacy and an Open Web

How applicable is this to an open web?

High sample numbers, but potentially high data variance.

Sensitivity calculation might take too long, need to approximate.

Can use disjoint subsets of the web to increase the number of queries possible with ɛ budgets.

Demo

air.csail.mit.edu:8800/spim_module/

References

Differential Privacy Implementations:

“Privacy Integrated Queries (PINQ)” by Frank McSherry: http://research.microsoft.com/pubs/80218/sigmod115-mcsherry.pdf

“Airavat: Security and Privacy for MapReduce” by Roy, Indrajit; Setty, Srinath T. V.; Kilzer, Ann; Shmatikov, Vitaly; and Witchel, Emmett: http://www.cs.utexas.edu/~shmat/shmat_nsdi10.pdf

“Towards Statistical Queries over Distributed Private User Data” by Chen, Ruichuan; Reznichenko, Alexey; Francis, Paul; and Gehrke, Johannes: https://www.usenix.org/conference/nsdi12/towards-statistical-queries-over-distributed-private-user-data

References

Theoretical Work:

“Differential Privacy” by Cynthia Dwork: http://research.microsoft.com/pubs/64346/dwork.pdf

“Mechanism Design via Differential Privacy” by McSherry, Frank; and Talwar, Kunal: http://research.microsoft.com/pubs/65075/mdviadp.pdf

“Calibrating Noise to Sensitivity in Private Data Analysis” by Dwork, Cynthia; McSherry, Frank; Nissim, Kobbi; and Smith, Adam: http://people.csail.mit.edu/asmith/PS/sensitivity-tcc-final.pdf

“Differential Privacy for Clinical Trial Data: Preliminary Evaluations” by Vu, Duy; and Slavković, Aleksandra: http://sites.stat.psu.edu/~sesa/Research/Papers/padm09sesaSep24.pdf

References

Other:

“Privacy Concerns of FOAF-Based Linked Data” by Nasirifard, Peyman; Hausenblas, Michael; and Decker, Stefan: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153.5772

“The Mosaic Theory, National Security, and the Freedom of Information Act” by David E. Pozen: http://www.yalelawjournal.org/pdf/115-3/Pozen.pdf

“A Privacy Preference Ontology (PPO) for Linked Data” by Sacco, Owen; and Passant, Alexandre: http://ceur-ws.org/Vol-813/ldow2011-paper01.pdf

“k-Anonymity: A Model for Protecting Privacy” by Latanya Sweeney: http://arbor.ee.ntu.edu.tw/archive/ppdm/Anonymity/SweeneyKA02.pdf

References

Other:

“Approximation Algorithms for k-Anonymity” by Aggarwal, Gagan; Feder, Tomas; Kenthapadi, Krishnaram; Motwani, Rajeev; Panigrahy, Rina; Thomas, Dilys; and Zhu, An: http://research.microsoft.com/pubs/77537/k-anonymity-jopt.pdf

Appendix: Results Q1, Q2

Q1     | Error    | Query_Time  | Sens_Calc_Time
COUNT  | 0        | 0.020976    | 0.05231

Q2     | Error    | Query_Time  | Sens_Calc_Time
COUNT  | 0        | 0.015823126 | 0.011798859
SUM    | 0        | 0.010298967 | 0.01198101
AVG    | 868.8379 | 0.010334969 | 0.04432416
MAX    | 0        | 0.010645866 | 0.012124062
MIN    | 0        | 0.010524988 | 0.012120962

Appendix: Results Q3, Q4

Q3     | Error    | Query_Time  | Sens_Calc_Time
COUNT  | 0        | 0.007927895 | 0.00800705
SUM    | 0        | 0.007529974 | 0.007997036
AVG    | 375.8253 | 0.00763011  | 0.030416012
MAX    | 0        | 0.007451057 | 0.008117914
MIN    | 0        | 0.007512093 | 0.008100986

Q4     | Error    | Query_Time  | Sens_Calc_Time
COUNT  | 0        | 0.01048708  | 0.012546062
SUM    | 0        | 0.01123786  | 0.012809038
AVG    | 860.91   | 0.011286974 | 0.048202038
MAX    | 0        | 0.01145792  | 0.01297307
MIN    | 0        | 0.011392117 | 0.012881041

Appendix: Results Q5, Q6

Q5     | Error    | Query_Time  | Sens_Calc_Time
COUNT  | 0        | 0.08081007  | 0.098078012
SUM    | 0        | 0.085678816 | 0.097680092
AVG    | 115099.5 | 0.087270975 | 0.373119116
MAX    | 0        | 0.084903955 | 0.097922087
MIN    | 0        | 0.083213806 | 0.098366022

Q6     | Error    | Query_Time  | Sens_Calc_Time
COUNT  | 0        | 0.136605978 | 0.153807878
SUM    | 0        | 0.139995098 | 0.155878067
AVG    | 115118.4 | 0.139881134 | 0.616436958
MAX    | 0        | 0.148360014 | 0.160467148
MIN    | 0        | 0.144635916 | 0.158998966

Appendix: Results Q7, Q8

Q7     | Error | Query_Time  | Sens_Calc_Time
COUNT  | 0     | 0.006100178 | 0.004678965
SUM    | 0     | 0.004260063 | 0.004747868
AVG    | 0     | 0.004283905 | 0.017117977
MAX    | 0     | 0.004103184 | 0.004703999
MIN    | 0     | 0.004188061 | 0.004717112

Q8     | Error | Query_Time  | Sens_Calc_Time
COUNT  | 0     | 0.002182961 | 0.002643108
SUM    | 0     | 0.002092123 | 0.002592087
AVG    | 0     | 0.002075911 | 0.002662182
MAX    | 0     | 0.00207901  | 0.002576113
MIN    | 0     | 0.002048969 | 0.002597094

Appendix: Results Q9, Q10

Q9     | Error   | Query_Time  | Sens_Calc_Time
COUNT  | 0       | 0.004920959 | 0.010298014
SUM    | 0       | 0.004822016 | 0.010312796
AVG    | 0.00037 | 0.004909992 | 0.024574041
MAX    | 0       | 0.004843235 | 0.01032114
MIN    | 0       | 0.004893064 | 0.010319948

Q10    | Error  | Query_Time  | Sens_Calc_Time
COUNT  | 0      | 0.012365818 | 0.014447212
SUM    | 0      | 0.013066053 | 0.014631987
AVG    | 860.91 | 0.013166904 | 0.056000948
MAX    | 0      | 0.013354063 | 0.014893055
MIN    | 0      | 0.013329029 | 0.014914989