1. Airavat: Security and Privacy for MapReduce
Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov, Emmett Witchel
The University of Texas at Austin
2. Computing in the year 201X
Illusion of infinite resources
Pay only for resources used
Quickly scale up or scale down
3. Programming model in year 201X
Frameworks available to ease cloud programming
MapReduce: parallel processing on clusters of machines
Example workloads: data mining, genomic computation, social networks
4. Programming model in year 201X
Thousands of users upload their data: healthcare, shopping transactions, census, click streams
Multiple third parties mine the data for better service
Example: healthcare data
Incentive to contribute: cheaper insurance policies, new drug research, inventory control in drugstores
Fear: what if someone targets my personal data?
An insurance company could find my illness and increase my premium
5. Privacy in the year 201X?
Health data flows into untrusted MapReduce programs (data mining, genomic computation, social networks)
Does the output leak information?
6. Use de-identification?
Achieves "privacy" by syntactic transformations: scrubbing, k-anonymity, ...
Insecure against attackers with external information
Privacy fiascoes: AOL search logs, Netflix dataset
What if we run untrusted code on the original data? How do we ensure the privacy of the users?
7. Audit the untrusted code?
Audit all MapReduce programs for correctness? Hard to do!
Also, where is the source code?
Aim: confine the code instead of auditing it
8. This talk: Airavat
A framework for privacy-preserving MapReduce computations with untrusted code
Airavat is the elephant of the clouds (Indian mythology)
9. Airavat guarantee
Bounded information leak* about any individual's data after performing a MapReduce computation
*Differential privacy
10. Outline
Motivation
Overview
Enforcing privacy
Evaluation
Summary
11. Background: MapReduce
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)
Data flows through a map phase followed by a reduce phase
12. MapReduce example: counting the number of iPads sold
Map(input) { if (input has iPad) print(iPad, 1) }
Reduce(key, list(v)) { print(key + "," + SUM(v)) }
Input: iPad, Tablet PC, iPad, Laptop
The map phase emits (iPad, 1), (iPad, 1); the SUM reducer combines them into (iPad, 2)
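The slide's pseudocode can be written as a minimal runnable sketch. This is illustrative Python standing in for the framework's actual (Java) API; `map_fn`, `reduce_fn`, and `run_mapreduce` are names invented here, not Airavat's:

```python
from collections import defaultdict

def map_fn(record):
    # Emit (iPad, 1) for every sale record that is an iPad.
    if record == "iPad":
        yield ("iPad", 1)

def reduce_fn(key, values):
    # SUM reducer: total count per key.
    return (key, sum(values))

def run_mapreduce(records):
    grouped = defaultdict(list)
    for record in records:                # map phase
        for k, v in map_fn(record):
            grouped[k].append(v)          # shuffle: group values by key
    return [reduce_fn(k, vs) for k, vs in grouped.items()]  # reduce phase

print(run_mapreduce(["iPad", "Tablet PC", "iPad", "Laptop"]))  # [('iPad', 2)]
```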
13. Airavat model
The Airavat framework runs on the cloud infrastructure (trusted)
Cloud infrastructure: hardware + VM
Airavat: modified MapReduce + DFS + JVM + SELinux
14. Airavat model
The data provider uploads her data to Airavat
She sets up certain privacy parameters
15. Airavat model
The computation provider writes the data mining algorithm
The program is untrusted, possibly malicious
16. Threat model
The computation provider's program is the threat
Airavat runs the computation and still protects the privacy of the data providers
17. Roadmap
What is the programming model?
How do we enforce privacy?
What computations can be supported in Airavat?
18. Programming model
MapReduce program for data mining
Split MapReduce into an untrusted mapper + a trusted reducer
Reducers come from a limited set of stock reducers, so there is no need to audit them
19. Programming model
Guarantee: protect the privacy of data providers
We need to confine the mappers!
20. Challenge 1: Untrusted mapper
Untrusted mapper code can copy data and send it over the network
Leaks using system resources
21. Challenge 2: Untrusted mapper
The output of the computation is also an information channel
Example: output 1 million if Peter bought Vi*gra
22. Airavat mechanisms
Mandatory access control: prevents leaks through storage channels like network connections and files
Differential privacy: prevents leaks through the output of the computation
23. Back to the roadmap
What is the programming model? Untrusted mapper + trusted reducer
How do we enforce privacy?
- Leaks through system resources
- Leaks through the output
What computations can be supported in Airavat?
24. Airavat confines the untrusted code
The untrusted program is given by the computation provider
Airavat adds mandatory access control (MAC) to MapReduce + DFS, and a MAC policy via SELinux
25. Airavat confines the untrusted code
We add mandatory access control to the MapReduce framework
Input, intermediate values, and output carry access control labels
Malicious code cannot leak labeled data
26. Airavat confines the untrusted code
An SELinux policy enforces MAC
It creates trusted and untrusted domains
Processes and files are labeled to restrict interaction
Mappers reside in the untrusted domain: denied network access, limited file system interaction
27. But access control is not enough
Labels can prevent the output from being read, but when can we remove the labels?
Example: a malicious mapper runs "if (input belongs-to Peter) print(iPad, 1000000)"
The SUM output becomes (iPad, 1000002) instead of (iPad, 2), so the output leaks the presence of Peter!
28. But access control is not enough
We need mechanisms to enforce that the output does not violate an individual's privacy
29. Background: Differential privacy
A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not
Cynthia Dwork. Differential Privacy. ICALP 2006
30. Differential privacy (intuition)
A mechanism F(x) is differentially private if every output is produced with similar probability whether any given input is included or not
31. Differential privacy (intuition)
Running F(x) with and without D's data yields similar output distributions
Bounded risk for D if she includes her data!
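The intuition on these slides corresponds to Dwork's formal definition, where the privacy parameter ε (which the deck later calls the "privacy parameter") quantifies "similar probability":

```latex
\text{A randomized mechanism } F \text{ is } \varepsilon\text{-differentially private if,}
\text{for all datasets } D_1, D_2 \text{ differing in a single record and all output sets } S,
\Pr[F(D_1) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[F(D_2) \in S].
```

Smaller ε means the two distributions are closer, i.e., a stronger privacy guarantee.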
32. Achieving differential privacy
A simple differentially private mechanism: answer a query f over inputs x1, ..., xn with f(x) + noise
How much noise should one add?
33. Achieving differential privacy
Function sensitivity (intuition): the maximum effect of any single input on the output
Aim: conceal this effect to preserve privacy
Example: computing the average height of the people in this room has low sensitivity
Any single person's height does not affect the final average by too much
Calculating the maximum height has high sensitivity
34. Achieving differential privacy
Example: SUM over input elements drawn from [0, M]
The maximum effect of any single input element is M, so sensitivity = M
35. Achieving differential privacy
A simple differentially private mechanism: answer with f(x) + Lap(∆(f))
Lap = Laplace distribution, ∆(f) = sensitivity
Intuition: the noise masks the effect of any single input
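A sketch of this mechanism for the SUM example above. Note the slide abbreviates the noise as Lap(∆(f)); in the standard Laplace mechanism the noise scale is ∆(f)/ε, with ε the privacy parameter. The function names here are illustrative, not Airavat's:

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling from the Laplace(0, scale) distribution.
    u = random.uniform(-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_sum(values, value_cap, epsilon):
    # SUM over inputs clamped to [0, value_cap]: sensitivity = value_cap,
    # so Laplace noise with scale value_cap / epsilon hides any one input.
    true_sum = sum(min(max(v, 0.0), value_cap) for v in values)
    return true_sum + laplace_noise(value_cap / epsilon)
```

With a small ε the answer is noisy but private; as ε grows, the answer approaches the true sum and the guarantee weakens.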
36. Back to the roadmap
What is the programming model? Untrusted mapper + trusted reducer
How do we enforce privacy?
- Leaks through system resources: MAC
- Leaks through the output
What computations can be supported in Airavat?
37. Enforcing differential privacy
The mapper can be any piece of Java code (a "black box"), but the range of mapper outputs must be declared in advance
The range is used to estimate "sensitivity" (how much does a single input influence the output?)
Sensitivity determines how much noise is added to outputs to ensure differential privacy
Example: for a mapper range [0, M], SUM has an estimated sensitivity of M
38. Enforcing differential privacy
Malicious mappers may output values outside the declared range
If a mapper produces a value outside the range, it is replaced by a value inside the range
The user is not notified... otherwise there is a possible information leak
A range enforcer between mapper and reducer ensures that the code is not more sensitive than declared
39. Enforcing sensitivity
All mapper invocations must be independent
A mapper may not store an input and use it later when processing another input
Otherwise, range-based sensitivity estimates may be incorrect
We modify the JVM to enforce mapper independence
Each object is assigned an invocation number
JVM instrumentation prevents reuse of objects from a previous invocation
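Airavat implements this check via JVM instrumentation; the following Python sketch only illustrates the invocation-numbering idea (`InvocationGuard` and its method names are invented for this example):

```python
class InvocationGuard:
    # Tag every object created during a mapper invocation with that
    # invocation's number; touching an object tagged with an earlier
    # number means state leaked across invocations.
    def __init__(self):
        self.current = 0
        self.tags = {}  # id(obj) -> invocation number at creation

    def new_invocation(self):
        self.current += 1

    def register(self, obj):
        self.tags[id(obj)] = self.current
        return obj

    def check(self, obj):
        if self.tags.get(id(obj), self.current) != self.current:
            raise RuntimeError("object reused across mapper invocations")
        return obj

guard = InvocationGuard()
guard.new_invocation()
stash = guard.register(["Peter's record"])  # created in invocation 1
guard.new_invocation()
try:
    guard.check(stash)                      # mapper carried state into invocation 2
except RuntimeError as e:
    print(e)  # object reused across mapper invocations
```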
40. Roadmap, one last time
What is the programming model? Untrusted mapper + trusted reducer
How do we enforce privacy?
- Leaks through system resources: MAC
- Leaks through the output: differential privacy
What computations can be supported in Airavat?
41. What can we compute?
Reducers are responsible for enforcing privacy
They add an appropriate amount of random noise to the outputs, so reducers must be trusted
Sample reducers: SUM, COUNT, THRESHOLD
Sufficient for data mining algorithms, search log processing, recommender systems, etc.
With trusted mappers, more general computations are possible: use exact sensitivity instead of range-based estimates
42. Sample computations
Many queries can be done with untrusted mappers:
- How many iPads were sold today? (SUM)
- What is the average score of male students at UT? (MEAN)
- Output the frequency of security books that sold more than 25 copies today. (THRESHOLD)
Others require trusted mapper code:
- List all items and their quantities sold: a malicious mapper can encode information in item names
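The THRESHOLD primitive in the third query can be sketched as a noisy-count test. This is an illustrative reading of the primitive, not Airavat's code: COUNT has sensitivity 1 (one individual changes a count by at most 1), so Laplace noise with scale 1/ε suffices before comparing against the cutoff:

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling from the Laplace(0, scale) distribution.
    u = random.uniform(-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def threshold_reducer(key, values, cutoff, epsilon):
    # Release the key only if its noisy COUNT exceeds the cutoff.
    noisy_count = len(values) + laplace_noise(1.0 / epsilon)
    return key if noisy_count > cutoff else None
```

A book with 30 noisy-counted sales clears a cutoff of 25 and is released; a book with 10 sales is suppressed, and the noise keeps borderline keys from revealing any single buyer.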
43. Revisiting Airavat guarantees
Airavat allows differentially private MapReduce computations, even when the code is untrusted
Differential privacy => a mathematical bound on the information leak
What is a safe bound on the information leak? That depends on the context and the dataset; it is not our problem
44. Outline
Motivation
Overview
Enforcing privacy
Evaluation
Summary
45. Implementation details
Modifications of 450 LoC, 5000 LoC, and 500 LoC across the modified components (LoC = lines of code; the per-component breakdown appeared in a slide figure)
46. Evaluation: our benchmarks
Experiments on 100 Amazon EC2 instances (1.2 GHz, 7.5 GB RAM, running Fedora 8)

Benchmark       | Privacy grouping   | Reducer primitive | MapReduce operations       | Accuracy metric
AOL queries     | Users              | THRESHOLD, SUM    | Multiple                   | % queries released
kNN recommender | Individual rating  | COUNT, SUM        | Multiple                   | RMSE
K-Means         | Individual points  | COUNT, SUM        | Multiple, till convergence | Intra-cluster variance
Naïve Bayes     | Individual articles| SUM               | Multiple                   | Misclassification rate
47. Performance overhead
Normalized execution time across the benchmarks: overheads are less than 32%
48. Evaluation: accuracy
Reducers: COUNT, SUM
Accuracy increases as the privacy guarantee decreases (i.e., as the privacy parameter grows)
*Refer to the paper for the remaining benchmark results
49. Related work: PINQ [McSherry SIGMOD 2009]
PINQ provides a set of trusted LINQ primitives and requires rewriting code with those primitives; its guarantees are at the language level
Airavat confines untrusted code, ensures that its outputs preserve privacy, and provides an end-to-end guarantee across the software stack
50. Airavat in brief
Airavat is a framework for privacy-preserving MapReduce computations
It confines untrusted code
It is the first system to integrate mandatory access control with differential privacy for end-to-end enforcement
51. Thank you