Slide 1: Introduction to Security
Module 17 – Sharing Data While Preserving Privacy
(some slides by Gen Bartlett)
Jelena Mirkovic
USC CSCI 430
Slide 2: Why Do We Want to Share?
Share existing data sets:
- Research
- Companies buy data from each other
- Companies check out each other's assets before merges/buyouts
Start a new dataset:
- Mutually beneficial relationships: share data with me and you can use this service
Slide 3: Sharing Everything?
Easy, but what are the ramifications?
- Legal/policy may limit what can be shared/collected
  - IRBs: Institutional Review Boards
  - HIPAA: Health Insurance Portability and Accountability Act
  - HITECH: Health Information Technology for Economic and Clinical Health Act
- Future use and protection of data?
Slide 4: Mechanisms for Limited Sharing
- Sanitization: remove the really sensitive data, i.e., PPI & PII (private personal & personally identifying information); without a crystal ball, this is hard
- Anonymization: replace information to limit the ability to tie entities to meaningful identities
- Aggregation: remove PII by only collecting/releasing statistics
Slide 5: Anonymization Example
Network trace: [packet diagram with PAYLOAD]
Slide 6: Anonymization Example
Network trace: [packet diagram with PAYLOAD]
All sorts of PII and PPI in there!
Slide 7: Anonymization Example
Network trace: [packet diagram with PAYLOAD]
Routing information: IP addresses, TCP flags/options, OS fingerprinting
Slide 8: Anonymization Example
Network trace: [packet diagram with PAYLOAD]
Remove IPs? Anonymize IPs?
Slide 9: Anonymization Example
Network trace: [packet diagram with PAYLOAD]
Removing IPs severely limits what you can do with the data.
Instead, replace each IP with something that is still identifying, but not the actual data:
IP1 = A
IP2 = B
etc.
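A minimal Python sketch of this IP1 = A, IP2 = B idea, consistent pseudonymization: every occurrence of the same IP maps to the same placeholder, so flows can still be correlated without revealing real addresses. The class name IPPseudonymizer and the host-N labels are invented for illustration; production trace anonymizers typically use prefix-preserving schemes such as Crypto-PAn.

from itertools import count

class IPPseudonymizer:
    """Maps each distinct IP to a stable placeholder label."""

    def __init__(self):
        self._labels = {}         # real IP -> pseudonym
        self._counter = count(1)  # source of fresh label numbers

    def pseudonymize(self, ip: str) -> str:
        # Assign a fresh label on first sight, then reuse it, so the
        # trace keeps "same host" correlations without real addresses.
        if ip not in self._labels:
            self._labels[ip] = f"host-{next(self._counter)}"
        return self._labels[ip]

anon = IPPseudonymizer()
print(anon.pseudonymize("192.0.2.10"))    # host-1
print(anon.pseudonymize("198.51.100.7"))  # host-2
print(anon.pseudonymize("192.0.2.10"))    # host-1 again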
Slide 10: Aggregation Example
"Fewer U.S. Households Have Debt, But Those Who Do Have More, Census Bureau Reports"
Slide 11: Methods Can Be Bad Or Good
Just because someone uses aggregation or anonymization doesn't mean the data is safe.
Example: release aggregate stats of people's favorite color?
Slide 12: What is Inferred?
Take 2 sources of information and correlate the data: X + Y = ...
Example: Google Street View + what my car looks like + where I live = you know where I was last year
Slide 13: Another Example
Paula Broadwell, who had an affair with CIA director David Petraeus, took extensive precautions to hide her identity. She never logged in to her anonymous e-mail service from her home network; instead, she used hotel and other public networks when she e-mailed him. The FBI correlated hotel registration data from several different hotels -- and hers was the common name.
Slide 14: Another Example: Netflix & IMDB
Netflix Prize: Netflix released an anonymized dataset of users' movie ratings.
Researchers at the University of Texas correlated it with public IMDB data and undid the anonymization.
Slide 15: Designing Privacy-Preserving Systems
- Aim for the minimum amount of information needed to achieve goals
- Think through how information can be inferred: inference is often a gotcha! x + y = something private, even though x and y by themselves don't seem all that special
- Think through where information can be gained: on the wire? stored in logs? at a router? at an ISP?
Slide 16: Privacy and Stored Information
- Data is only as safe as the system
- How long the data is stored affects privacy: longer term = bigger privacy risk (in general)
  - Longer time frame, more data to correlate & infer
  - Longer opportunity for data theft
  - Increased chances of mistakes, lapsed security, etc.
Slide 17: Anonymized Data
Goal: release anonymized data.
- Remove identifying information, like name
- But some diseases are still unique to one person

name            age   hosp. reason
Paul Smith      80    cancer
Jerry Goel      43    cancer
Mary Smith      32    flu
Amy Gilbert     21    flu
Theodore Tuck   74    gallbladder
Jennifer Dill   53    heart attack
Slide 18: k-anonymity
OK to release data if a sensitive feature pertains to k or more people. Imagine k=2 (a minimal check of this property is sketched after the table).

name            age   hosp. reason
Paul Smith      80    cancer
Jerry Goel      43    cancer
Mary Smith      32    flu
Amy Gilbert     21    flu
Theodore Tuck   74    gallbladder
Jennifer Dill   53    heart attack
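A minimal Python sketch of that check, not from the slides: treat age as the quasi-identifying attribute and require that every distinct age value be shared by at least k rows. The function name is_k_anonymous is invented for illustration.

from collections import Counter

rows = [
    ("Paul Smith", 80, "cancer"),
    ("Jerry Goel", 43, "cancer"),
    ("Mary Smith", 32, "flu"),
    ("Amy Gilbert", 21, "flu"),
    ("Theodore Tuck", 74, "gallbladder"),
    ("Jennifer Dill", 53, "heart attack"),
]

def is_k_anonymous(rows, k, quasi_id=lambda row: row[1]):
    # Group rows by the quasi-identifier (age here) and require
    # every group to contain at least k people.
    groups = Counter(quasi_id(row) for row in rows)
    return all(size >= k for size in groups.values())

print(is_k_anonymous(rows, k=2))  # False: every age is unique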
Slide 19: k-anonymity
But there is only one person with age=80. If I observe my elderly neighbor go into that hospital, I can learn his condition from the anonymized data.

name            age   hosp. reason
Paul Smith      80    cancer
Jerry Goel      43    cancer
Mary Smith      32    flu
Amy Gilbert     21    flu
Theodore Tuck   74    gallbladder
Jennifer Dill   53    heart attack
Slide 20: k-anonymity
Anonymize age too. Good privacy, but I'm losing correlations in the data. (A sketch of this generalization step follows the table.)

name            age     hosp. reason
Paul Smith      20-80   cancer
Jerry Goel      20-80   cancer
Mary Smith      20-80   flu
Amy Gilbert     20-80   flu
Theodore Tuck   74      gallbladder
Jennifer Dill   53      heart attack
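A minimal Python sketch of that generalization, not from the slides: collapse exact ages into a coarse range so more rows share the same quasi-identifier. The name generalize_age and the bucket boundaries are illustrative.

def generalize_age(age: int, low: int = 20, high: int = 80) -> str:
    # Collapse any age in [low, high] into one coarse "20-80" bucket,
    # as the slide does; values outside the range keep their exact age.
    return f"{low}-{high}" if low <= age <= high else str(age)

rows = [("Paul Smith", 80, "cancer"), ("Jerry Goel", 43, "cancer"),
        ("Mary Smith", 32, "flu"), ("Amy Gilbert", 21, "flu")]
released = [(name, generalize_age(age), reason) for name, age, reason in rows]
print(released)  # every row now shares the quasi-identifier "20-80"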
Slide 21: Differential Privacy
- Allow queries on the data
- Add random noise to each answer to protect privacy
- Amplitude of the noise ~ data distribution

name            age   hosp. reason
Paul Smith      80    cancer
Jerry Goel      43    cancer
Mary Smith      32    flu
Amy Gilbert     21    flu
Theodore Tuck   74    gallbladder
Jennifer Dill   53    heart attack
Slide 22: Differential Privacy
The Laplace mechanism answers a query f with f(D) + noise, where the noise is drawn from a Laplace distribution with scale parameter Δf/ε. Here Δf is the global sensitivity of the function f: the maximum amount f can change if any one row of the table changes.
E.g., the sensitivity of a count is 1; the sensitivity of avg(age) in our table is 9.8.
A typical ε is 0.1.
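A minimal numpy sketch of that mechanism, not from the slides (the name laplace_mechanism is illustrative), using the slide's example values: ε = 0.1, sensitivity 1 for a count, and 9.8 for avg(age).

import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float,
                      epsilon: float) -> float:
    # Scale b = Δf/ε: a smaller ε (more privacy) means more noise.
    b = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=b)

ages = [80, 43, 32, 21, 74, 53]
# Noisy count: sensitivity 1, as the slide states.
print(laplace_mechanism(len(ages), sensitivity=1.0, epsilon=0.1))
# Noisy average age: sensitivity 9.8 per the slide's example.
print(laplace_mechanism(float(np.mean(ages)), sensitivity=9.8, epsilon=0.1))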
Slide 23: Differential Privacy
- Current state of the art for privacy protection
- Works well when you have a lot of data
- Works well for learning about the average population, but not about outliers
- Offers strong mathematical guarantees about privacy, not so much about utility
- Adopted by all major companies: Microsoft, Apple, Google, Facebook