Slide 1: Differential Privacy on Linked Data: Theory and Implementation
Yotam Aron

Slide 2: Table of Contents
Introduction
Differential Privacy for Linked Data
SPIM implementation
Evaluation

Slide 3: Contributions
Theory: how to apply differential privacy to linked data.
Implementation: privacy module for SPARQL queries.
Experimental evaluation: differential privacy on linked data.

Slide 4: Introduction

Slide 5: Overview: Privacy Risk
Statistical data can leak private information.
Mosaic Theory: data sources that are harmless in isolation can be harmful when combined.
Examples:
Netflix Prize Data set
GIC Medical Data set
AOL Data logs
Linked data adds ontologies and metadata, making it even more vulnerable.

Slide 6: Current Solutions
Accountability:
Privacy Ontologies
Privacy Policies and Laws
Problems:
Requires agreement among parties.
Does not actually prevent breaches; it is only a deterrent.

Slide 7: Current Solutions (Cont'd)
Anonymization:
Delete "private" data
k-anonymity (strong privacy guarantee)
Problems:
Deletion provides no strong guarantees
Must be carried out for every data set
What data should be anonymized?
High computational cost (optimal k-anonymity is NP-hard)

Slide 8: Differential Privacy
Definition for relational databases (from the PINQ paper): a randomized function K gives ε-differential privacy if, for all data sets D1 and D2 differing on at most one record, and for all S ⊆ Range(K),

    Pr[K(D1) ∈ S] ≤ exp(ε) · Pr[K(D2) ∈ S]
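For intuition, a short worked example (not from the slides): with ε = 0.1, exp(ε) ≈ 1.105, so adding or removing a single record changes the probability of any given output set by a factor of at most about 1.105.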
Slide 9: Differential Privacy
What does this mean? An adversary gets roughly the same results from D1 and D2, meaning a single individual's data will not greatly affect the knowledge acquired from either data set.

Slide 10: How Is This Achieved?
Add noise to the result.
Simplest: add Laplace noise.

Slide 11: Laplace Noise Parameters
Mean = 0 (so no bias is added).
Scale b = S(Q)/ε (variance 2b²), where the sensitivity S(Q) is defined, over records j, as

    S(Q) = max_j |Q(D) − Q(D with record j removed)|

Theorem: for a query Q with result R, the output R + Laplace(0, S(Q)/ε) is ε-differentially private.
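To make the theorem concrete, here is a minimal sketch of the Laplace mechanism in Python (numpy is used for sampling; the sensitivity is assumed to be precomputed):

import numpy as np

def laplace_mechanism(true_result, sensitivity, epsilon):
    """Return an epsilon-differentially private version of a numeric result.

    Adds Laplace(0, b) noise with scale b = sensitivity / epsilon.
    """
    scale = sensitivity / epsilon
    return true_result + np.random.laplace(loc=0.0, scale=scale)

# Example: a COUNT query returned 42, one person contributes at most
# 2 rows (sensitivity 2), and the client spends epsilon = 0.5.
noisy_count = laplace_mechanism(42, sensitivity=2, epsilon=0.5)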
Slide 12: Other Benefit of Laplace Noise
A set of queries with sensitivities S1, …, Sn has an overall sensitivity of S1 + … + Sn.
Implementation-wise, one can allocate an ε "budget" to a client; for each query, the client specifies how much of it to spend (a sketch of such budget accounting follows).
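A minimal sketch of per-client budget accounting under sequential composition (the class is illustrative, not the SPIM code):

class EpsilonBudget:
    """Tracks a client's remaining privacy budget across queries."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        # Sequential composition: the epsilons of answered queries add up,
        # so refuse any query that would exceed the client's total budget.
        if epsilon <= 0 or epsilon > self.remaining:
            raise ValueError("insufficient privacy budget")
        self.remaining -= epsilon

budget = EpsilonBudget(total_epsilon=1.0)
budget.spend(0.5)   # first query
budget.spend(0.25)  # second query; 0.25 of the budget remains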
Slide 13: Benefits of Differential Privacy
Strong Privacy Guarantee
Mechanism-based, so the data itself does not have to be modified.
Independent of data set’s structure.
Works well with statistical analysis algorithms.

Slide 14: Problems with Differential Privacy
Potentially poor performance
Complexity
Noise
Only works with statistical data (though this has fixes)
How can the sensitivity of an arbitrary query be calculated without brute force?

Slide 15: Theory: Differential Privacy for Linked Data

Slide 16: Differential Privacy and Linked Data
Want the same privacy guarantees for linked data, but there are no "records."
What should the "unit of difference" be?
One triple
All URIs related to a person's URI
All links going out from a person's URI

(Slides 17–19 repeat this slide as animation builds.)

Slide 20: "Records" for Linked Data
Reduce links in the graph to attributes.
Idea:
Identify each individual's contribution to the total answer.
Find the contribution that affects the answer most.

Slide 21: "Records" for Linked Data
Reducing links in the graph to attributes turns the graph into a record.
[Figure: a "Knows" link from P1 to P2 becomes a record with Person = P1, Knows = P2.]

Slide 22: "Records" for Linked Data
Repeated attributes and null values are allowed.
[Figure: a graph over P1–P4 with three "Knows" links and one "Loves" link.]

Slide 23: "Records" for Linked Data
Repeated attributes and null values are allowed (not good RDBMS form, but it makes the definitions easier).

Person | Knows | Knows | Loves
P1     | P2    | Null  | P4
P3     | P2    | P4    | Null

Slide 24: Query Sensitivity in Practice
Need to find triples that “belong” to a person.
Idea:
Identify each individual's contribution to the total answer.
Find the contribution that affects the answer most.
Done using sorting and limiting functions in SPARQL.

Slide 25: Example
COUNT of places visited.
[Figure: persons P1 and P2, each with a "State of Residence" link to MA and "Visited" links to states S1, S2, S3.]
Answer: sensitivity of 2 (the largest number of places any single person visited).

(Slides 26–27 are animation builds of this slide.)

Slide 28: Using SPARQL
Query:

SELECT (COUNT(?s) AS ?num_places_visited)
WHERE { ?p :visited ?s }
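For reference, a sketch of sending such a query from Python with the common SPARQLWrapper library (the endpoint URL is a placeholder, and the ":visited" prefix is assumed to be defined at the endpoint):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/sparql")  # placeholder endpoint
sparql.setQuery("""
    SELECT (COUNT(?s) AS ?num_places_visited)
    WHERE { ?p :visited ?s }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
count = int(results["results"]["bindings"][0]["num_places_visited"]["value"])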
Slide 29: Using SPARQL
Sensitivity calculation query (ideally):

SELECT ?p (COUNT(?s) AS ?num_places_visited)
WHERE {
  ?p :visited ?s ;
     foaf:name ?n .
}
GROUP BY ?p
ORDER BY DESC(?num_places_visited)
LIMIT 1
Slide 30: In reality…
LIMIT, ORDER BY, and GROUP BY do not work together in 4store…
For now: drop LIMIT and find the top answers manually, i.e., simulate these keywords in Python (see the sketch below).
This will affect results, so better testing should be carried out in the future.
Ideally this would stay on the SPARQL side, so that less data is transmitted (e.g., on large data sets).
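A minimal sketch of that client-side simulation, assuming the endpoint returns (person, value) rows from a query stripped of GROUP BY/ORDER BY/LIMIT (the function is illustrative, not the thesis code):

from collections import defaultdict

def top_groups(rows, limit=1):
    """Simulate GROUP BY ?p / ORDER BY DESC(COUNT) / LIMIT n in Python."""
    counts = defaultdict(int)
    for person, _value in rows:          # GROUP BY ?p
        counts[person] += 1              # COUNT per group
    ranked = sorted(counts.items(),      # ORDER BY DESC(count)
                    key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]                # LIMIT n

# e.g. top_groups([("P1", "S1"), ("P1", "S2"), ("P2", "S3")]) -> [("P1", 2)]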
Slide 31: (Side rant) 4store limitations
Many operations are not supported in unison.
E.g., cannot always FILTER and use ORDER BY together, for some reason.
Severely limits the types of queries I could use to test.
It may be desirable to work with a different, more up-to-date triplestore (e.g., ARQ).
Didn't, because I wanted to keep the code in Python, and all the code had already been written for 4store.

Slide 32: Problems with this Approach
Need to identify "people" in the graph.
Assume, for example, that a URI with a foaf:name is a person, and use its triples in the privacy calculations.
Imposes some constraints on the linked data format for this to work.
For future work: look for a way to automatically identify private data, perhaps by using ontologies.
Complexity is tied to the speed of performing the query over a large data set.
Still not generalizable to all functions.

Slide 33: …and on the Plus Side
Model for sensitivity calculation can be expanded to arbitrary statistical functions.
e.g. dot products, distance functions, variance, etc.
Relatively simple to implement using SPARQL 1.1.

Slide 34: Implementation: Design of Privacy System

Slide 35: SPARQL Privacy Insurance Module
i.e., SPIM.
Use authentication, AIR, and differential privacy in one system.
Authentication to manage ε-budgets.
AIR to control the flow of information and non-statistical data.
Differential privacy for statistics.
Goal: provide a module that integrates into SPARQL 1.1 endpoints and provides privacy.

Slide 36: Design
[Architecture diagram: an HTTP Server with OpenID Authentication in front of the SPIM Main Process, which uses the AIR Reasoner (with Privacy Policies), the Differential Privacy Module, and a triplestore holding User Data.]

Slide 37: HTTP Server and Authentication
HTTP Server: a Django server that handles HTTP requests.
OpenID Authentication: a Django module.

Slide 38: SPIM Main Process
Controls the flow of information.
First checks the user's budget, then applies AIR, then performs the final differentially private query (a sketch follows).
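A rough sketch of that control flow, with hypothetical budget, AIR, and DP helper objects standing in for the real components:

def handle_query(user, query, epsilon, budget, air, dp_module):
    """Sketch of the SPIM main process: budget check -> AIR -> DP query."""
    # 1. Check and debit the user's epsilon budget.
    budget.spend(epsilon)
    # 2. Ask the AIR reasoner whether the query complies with the policies.
    if not air.check(user, query):
        raise PermissionError("query violates privacy policy")
    # 3. Run the differentially private query and return the noisy result.
    return dp_module.run(query, epsilon)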
Slide 39: AIR Reasoner
Performs access control by translating SPARQL queries to N3 and checking them against policies.
Can potentially perform more complicated operations (e.g., checking user credentials).

Slide 40: Differential Privacy Protocol
[Sequence diagram participants: Client, Differential Privacy Module, SPARQL Endpoint.]
Scenario: the client wishes to make a standard SPARQL 1.1 statistical query. The client has an ε "budget" of overall accuracy for all queries.

Slide 41: Differential Privacy Protocol
Step 1: The query and epsilon value are sent to the endpoint and intercepted by the enforcement module. (Message: query, ε > 0.)

Slide 42: Differential Privacy Protocol
Step 2: The sensitivity of the query is calculated using a rewritten, related query. (Message: sensitivity query.)

Slide 43: Differential Privacy Protocol
Step 3: The actual query is sent. (Message: query.)

Slide 44: Differential Privacy Protocol
Step 4: The result with Laplace noise is sent back. (Message: result + noise.)
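The four steps, sketched end to end; run_query and rewrite_for_sensitivity are assumed helpers standing in for the module's real machinery:

import numpy as np

def dp_query(endpoint, query, epsilon, run_query, rewrite_for_sensitivity):
    """Sketch of the enforcement module's four protocol steps."""
    # Step 1: the query and epsilon arrive (function arguments here).
    # Step 2: compute the query's sensitivity via a rewritten, related query.
    sensitivity = run_query(endpoint, rewrite_for_sensitivity(query))
    # Step 3: run the actual query.
    result = run_query(endpoint, query)
    # Step 4: return the result with Laplace noise of scale sensitivity/epsilon.
    return result + np.random.laplace(0.0, sensitivity / epsilon)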
Slide 45: Experimental Evaluation

Slide 46: Evaluation
Three things to evaluate:
Correctness of operation
Correctness of differential privacy
Runtime
Used an anonymized clinical database as the test data, with fake names, social security numbers, and addresses added.

Slide 47: Correctness of Operation
Can the system do what we want?
Authentication provides access control
AIR restricts information and types of queries
Differential privacy gives strong privacy guarantees.
Can we do better?

Slide 48: Use Case Used in Thesis
Clinical database data protection
HIPAA: Federal protection of private information fields, such as name and social security number, for patients.
3 users:
Alice: works at the CDC, needs unhindered access.
Bob: a researcher who needs access to private fields (e.g., addresses).
Charlie: an amateur researcher to whom HIPAA should apply.
Assumptions:
Django is secure enough to handle "clever attacks."
Users do not collude, so individual epsilon values can be allocated.

Slide 49: Use Case Solution Overview
What should happen:
Dynamically apply different AIR policies at runtime.
Give different epsilon-budgets.
How allocated:
Alice: no AIR policy, no noise.
Bob: access to addresses, but all other private information fields hidden. Epsilon budget: E1.
Charlie: all private information fields hidden, in accordance with HIPAA. Epsilon budget: E2.

(Slide 50 repeats this allocation.)

Slide 51: Example: A Clinical Database
The client accesses the triplestore via the HTTP server.
OpenID Authentication verifies that the user has access to the data and finds the epsilon value.

Slide 52: Example: A Clinical Database
The AIR reasoner checks incoming queries for HIPAA violations.
The privacy policies contain the HIPAA rules.

Slide 53: Example: A Clinical Database
Differential privacy is applied to statistical queries.
The statistical result + noise is returned to the client.

Slide 54: Correctness of Differential Privacy
Need to test how much noise is added.
Too much noise = poor results.
Too little noise = no guarantee.
Test: run queries and compare the calculated sensitivity against the actual sensitivity.

Slide 55: How to test sensitivity?
Ideally:
Test that the noise calculation is correct.
Test that noise makes data still useful (e.g. by applying machine learning algorithms).
For this project, only the former was tested:
Machine learning APIs are not as prevalent for linked data.
What results would we compare to?

Slide 56: Test suite
10 queries for each operation (COUNT, SUM, AVG, MIN, MAX).
10 different WHERE clauses.
Test:
Sensitivity calculated from the original query.
Remove each personal URI using the MINUS keyword and see which removal is most sensitive.

Slide 57: Example for Sens Test
Query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1#>
PREFIX mimic: <http://air.csail.mit.edu/spim_ontologies/mimicOntology#>
SELECT (SUM(?o) AS ?aggr)
WHERE {
  ?s foaf:name ?n .
  ?s mimic:event ?e .
  ?e mimic:m1 "Insulin" .
  ?e mimic:v1 ?o .
  FILTER(isNumeric(?o))
}

Slide 58: Example for Sens Test
Sensitivity query (a Python template; "%s" is replaced with each person's name in turn):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1#>
PREFIX mimic: <http://air.csail.mit.edu/spim_ontologies/mimicOntology#>
SELECT (SUM(?o) AS ?aggr)
WHERE {
  ?s foaf:name ?n .
  ?s mimic:event ?e .
  ?e mimic:m1 "Insulin" .
  ?e mimic:v1 ?o .
  FILTER(isNumeric(?o))
  MINUS { ?s foaf:name "%s" }
}
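A sketch of the surrounding test loop, assuming a run_query helper and a list of the personal names in the data set (illustrative, not the exact thesis code):

def empirical_sensitivity(endpoint, base_query, minus_template, names, run_query):
    """Empirically estimate sensitivity by removing one person at a time.

    minus_template is the sensitivity query above, with a %s placeholder
    for the name; run_query returns the numeric aggregate for a query.
    """
    full = run_query(endpoint, base_query)
    # The sensitivity is the largest change caused by removing one person.
    return max(abs(full - run_query(endpoint, minus_template % name))
               for name in names)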
Slide 59: Results Query 6 - Error
[Chart: sensitivity-calculation error for Query 6.]

Slide 60: Runtime
Queries were also tested for runtime.
Bigger WHERE clauses
More keywords
Extra overhead of doing the calculations.

Slide 61: Results Query 6 - Runtime
[Chart: query time vs. sensitivity-calculation time for Query 6.]

Slide 62: Interpretation
Sensitivity calculation time is on par with query time.
Might not be good for big data; find ways to reduce sensitivity calculation time?
AVG does not do so well…
Approximation yields too much noise vs. trying all possibilities.
Runs ~4x slower than simple querying.
Solution 1: look at all the data manually (large data transfer).
Solution 2: can we use NOISY_SUM / NOISY_COUNT instead? (Sketched below.)
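A minimal sketch of Solution 2, assuming the ε budget is split evenly between the two noisy aggregates and that max_value bounds any one person's contribution to the SUM:

import numpy as np

def noisy_avg(values, epsilon, max_value):
    """Estimate AVG as NOISY_SUM / NOISY_COUNT instead of noising AVG directly."""
    half = epsilon / 2  # each component spends half of the budget
    s = sum(values) + np.random.laplace(0.0, max_value / half)  # SUM, sensitivity max_value
    c = len(values) + np.random.laplace(0.0, 1.0 / half)        # COUNT, sensitivity 1
    return s / c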
Slide 63: Conclusion

Slide 64: Contributions
Theory on how to apply differential privacy to linked data.
Overall privacy module for SPARQL queries.
Limited, but a good start.
Experimental implementation of differential privacy.
Verification that it is applied correctly.
Other:
Updated the SPARQL-to-N3 translation to SPARQL 1.1.
Expanded upon an IARPA project to create policies against statistical queries.

Slide 65: Shortcomings and Future Work
Triplestores need some structure for this to work:
Personal information must be explicitly defined in triples.
Is there a way to automatically detect which triples constitute private information?
Complexity:
Lots of noise for sparse data.
Can divide the data into disjoint sets to reduce noise, as PINQ does (see the sketch below).
Use localized sensitivity measures?
Third-party software problems:
Would this work better using a different triplestore implementation?
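A minimal sketch of the disjoint-set idea (parallel composition, as in PINQ's partition operator); the helper is illustrative and assumes COUNT queries with sensitivity 1:

import numpy as np

def noisy_counts_by_partition(partitions, epsilon):
    """Noisy COUNT per disjoint partition.

    Each record falls in at most one partition, so every partition's count
    can be released with the full epsilon (parallel composition) instead of
    splitting the budget across partitions.
    """
    return {name: len(rows) + np.random.laplace(0.0, 1.0 / epsilon)
            for name, rows in partitions.items()}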
Slide 66: Diff. Privacy and an Open Web
How applicable is this to an open web?
High sample numbers, but potentially high data variance.
Sensitivity calculation might take too long; we would need to approximate.
Can use disjoint subsets of the web to increase the number of queries possible within ε budgets.

Slide 67: Demo
air.csail.mit.edu:8800/spim_module/

Slide 68: References
Differential Privacy Implementations:
"Privacy Integrated Queries (PINQ)" by Frank McSherry: http://research.microsoft.com/pubs/80218/sigmod115-mcsherry.pdf
"Airavat: Security and Privacy for MapReduce" by Roy, Indrajit; Setty, Srinath T. V.; Kilzer, Ann; Shmatikov, Vitaly; and Witchel, Emmett: http://www.cs.utexas.edu/~shmat/shmat_nsdi10.pdf
"Towards Statistical Queries over Distributed Private User Data" by Chen, Ruichuan; Reznichenko, Alexey; Francis, Paul; and Gehrke, Johannes: https://www.usenix.org/conference/nsdi12/towards-statistical-queries-over-distributed-private-user-data

Slide 69: References
Theoretical Work:
"Differential Privacy" by Cynthia Dwork: http://research.microsoft.com/pubs/64346/dwork.pdf
"Mechanism Design via Differential Privacy" by McSherry, Frank; and Talwar, Kunal: http://research.microsoft.com/pubs/65075/mdviadp.pdf
"Calibrating Noise to Sensitivity in Private Data Analysis" by Dwork, Cynthia; McSherry, Frank; Nissim, Kobbi; and Smith, Adam: http://people.csail.mit.edu/asmith/PS/sensitivity-tcc-final.pdf
"Differential Privacy for Clinical Trial Data: Preliminary Evaluations" by Vu, Duy; and Slavković, Aleksandra: http://sites.stat.psu.edu/~sesa/Research/Papers/padm09sesaSep24.pdf

Slide 70: References
Other:
"Privacy Concerns of FOAF-Based Linked Data" by Nasirifard, Peyman; Hausenblas, Michael; and Decker, Stefan: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153.5772
"The Mosaic Theory, National Security, and the Freedom of Information Act" by David E. Pozen: http://www.yalelawjournal.org/pdf/115-3/Pozen.pdf
"A Privacy Preference Ontology (PPO) for Linked Data" by Sacco, Owen; and Passant, Alexandre: http://ceur-ws.org/Vol-813/ldow2011-paper01.pdf
"k-Anonymity: A Model for Protecting Privacy" by Latanya Sweeney: http://arbor.ee.ntu.edu.tw/archive/ppdm/Anonymity/SweeneyKA02.pdf

Slide 71: References
Other:
"Approximation Algorithms for k-Anonymity" by Aggarwal, Gagan; Feder, Tomas; Kenthapadi, Krishnaram; Motwani, Rajeev; Panigrahy, Rina; Thomas, Dilys; and Zhu, An: http://research.microsoft.com/pubs/77537/k-anonymity-jopt.pdf

Slide 72: Appendix: Results Q1, Q2
Q1:

Op    | Error | Query_Time | Sens_Calc_Time
COUNT | 0     | 0.020976   | 0.05231

Q2:

Op    | Error    | Query_Time  | Sens_Calc_Time
COUNT | 0        | 0.015823126 | 0.011798859
SUM   | 0        | 0.010298967 | 0.01198101
AVG   | 868.8379 | 0.010334969 | 0.04432416
MAX   | 0        | 0.010645866 | 0.012124062
MIN   | 0        | 0.010524988 | 0.012120962

Slide 73: Appendix: Results Q3, Q4
Q3:

Op    | Error    | Query_Time  | Sens_Calc_Time
COUNT | 0        | 0.007927895 | 0.00800705
SUM   | 0        | 0.007529974 | 0.007997036
AVG   | 375.8253 | 0.00763011  | 0.030416012
MAX   | 0        | 0.007451057 | 0.008117914
MIN   | 0        | 0.007512093 | 0.008100986

Q4:

Op    | Error  | Query_Time  | Sens_Calc_Time
COUNT | 0      | 0.01048708  | 0.012546062
SUM   | 0      | 0.01123786  | 0.012809038
AVG   | 860.91 | 0.011286974 | 0.048202038
MAX   | 0      | 0.01145792  | 0.01297307
MIN   | 0      | 0.011392117 | 0.012881041

Slide 74: Appendix: Results Q5, Q6
Q5:

Op    | Error    | Query_Time  | Sens_Calc_Time
COUNT | 0        | 0.08081007  | 0.098078012
SUM   | 0        | 0.085678816 | 0.097680092
AVG   | 115099.5 | 0.087270975 | 0.373119116
MAX   | 0        | 0.084903955 | 0.097922087
MIN   | 0        | 0.083213806 | 0.098366022

Q6:

Op    | Error    | Query_Time  | Sens_Calc_Time
COUNT | 0        | 0.136605978 | 0.153807878
SUM   | 0        | 0.139995098 | 0.155878067
AVG   | 115118.4 | 0.139881134 | 0.616436958
MAX   | 0        | 0.148360014 | 0.160467148
MIN   | 0        | 0.144635916 | 0.158998966

Slide 75: Appendix: Results Q7, Q8
Q7:

Op    | Error | Query_Time  | Sens_Calc_Time
COUNT | 0     | 0.006100178 | 0.004678965
SUM   | 0     | 0.004260063 | 0.004747868
AVG   | 0     | 0.004283905 | 0.017117977
MAX   | 0     | 0.004103184 | 0.004703999
MIN   | 0     | 0.004188061 | 0.004717112

Q8:

Op    | Error | Query_Time  | Sens_Calc_Time
COUNT | 0     | 0.002182961 | 0.002643108
SUM   | 0     | 0.002092123 | 0.002592087
AVG   | 0     | 0.002075911 | 0.002662182
MAX   | 0     | 0.00207901  | 0.002576113
MIN   | 0     | 0.002048969 | 0.002597094

Slide 76: Appendix: Results Q9, Q10
Q9:

Op    | Error   | Query_Time  | Sens_Calc_Time
COUNT | 0       | 0.004920959 | 0.010298014
SUM   | 0       | 0.004822016 | 0.010312796
AVG   | 0.00037 | 0.004909992 | 0.024574041
MAX   | 0       | 0.004843235 | 0.01032114
MIN   | 0       | 0.004893064 | 0.010319948

Q10:

Op    | Error  | Query_Time  | Sens_Calc_Time
COUNT | 0      | 0.012365818 | 0.014447212
SUM   | 0      | 0.013066053 | 0.014631987
AVG   | 860.91 | 0.013166904 | 0.056000948
MAX   | 0      | 0.013354063 | 0.014893055
MIN   | 0      | 0.013329029 | 0.014914989